Running Headless Chrome with Puppeteer and Docker
Google’s Chrome team has published Headless Chrome, causing a lot of excitement across the board and immediately triggering major shifts in the tooling landscape, such as a complete stop in the development of PhantomJS.
Puppeteer is the Node API for Headless Chrome. It can be used to control Headless Chrome over the DevTools protocol.
Most things that you can do manually in the browser can be done using Puppeteer!
You can use it for crawling, scraping, automation, generating PDFs, automated testing and much, much more. It’s an amazing tool, and it will probably become the de facto standard for most of the above tasks.
In this post, I’d like to introduce my take on a quick dockerized Puppeteer development environment, to work with the Headless Chrome Node API locally.
To run Puppeteer on your machine, you’ll need to install all kinds of libraries and packages. This may lead to a bit of fumbling around, especially if you are on a less-well-supported OS and have to find out the names of all packages by yourself, just to discover that you need to apply all kinds of fixes to make it work properly.
That’s time you can save to start working with Puppeteer and Headless Chrome right away instead! Don’t clutter your system; just type a single command and skip the boring setup part.
Docker makes it possible. The following is the result of creating a very quick prototyping environment. You can find the final code in this GitHub repository. It’s meant for development only, and should be treated with caution :)
You’ll need to have Docker installed, as well as Docker Compose.
After cloning the GitHub repository, issue a
$ make rebuild
to build the images and bring up the container. By default, the containers run in the foreground, so you’ll need to switch to a new terminal tab. The image is based on the official Node image, and contains additional libraries and tools which are needed by Puppeteer and Headless Chrome.
Once the container is up, you can enter it with:
$ make enter
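Both make targets are thin wrappers around Docker Compose. The repository’s actual Makefile may differ in detail, but targets like these typically look something like this sketch:

```makefile
# Hypothetical sketch of the make targets used above

rebuild:
	docker-compose build
	docker-compose up

enter:
	docker-compose exec puppeteer bash
```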
The only thing left to do is to install the Node packages with
$ yarn install
and you can run the example app with
$ node index.js
Understanding the Setup
The Dockerfile simply defines which additional libraries and tools, apart from Node itself, are needed.
FROM node:8

RUN apt-get update

# for https
RUN apt-get install -yyq ca-certificates

# install libraries
RUN apt-get install -yyq libappindicator1 libasound2 libatk1.0-0 libc6 libcairo2 libcups2 \
    libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 \
    libgtk-3-0 libnspr4 libnss3 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 \
    libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 \
    libxrandr2 libxrender1 libxss1 libxtst6

# tools
RUN apt-get install -yyq gconf-service lsb-release wget xdg-utils

# and fonts
RUN apt-get install -yyq fonts-liberation
What glues everything together and configures the containers is the docker-compose.yml file:
version: '3'

services:
  puppeteer:
    build: .
    volumes:
      # mount code and output directories into the container
      - ./output:/app/output
      - ./code:/app/code
    working_dir: /app/code
    shm_size: 1gb
    # just run the container doing nothing
    entrypoint: ["sh", "-c", "sleep infinity"]
Among other things, it makes sure that the local code directory is mounted into the container, so you can edit everything as usual with your editor of choice while the project is immediately available inside the container.
The Node dependencies reside in the code folder as well and are ignored by Git. This means that restarting containers does not require reinstalling everything from scratch, keeping startup times low and iteration cycles as quick as possible.
To make sure that results don’t mess with the code, an extra folder for outputs is created and mounted, so you can view the output of your code on the local machine without copying it out of the container.
Right now, docker-compose.yml only defines a single service. We might add a database or message queue - any supporting services which our code needs.
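Adding such a service would be a small change to docker-compose.yml. The sketch below adds a Redis container as an example; it is a hypothetical extension, not part of the repository:

```yaml
version: '3'

services:
  puppeteer:
    build: .
    # ... volumes, working_dir, shm_size, entrypoint as above ...
  redis:
    image: redis:4-alpine
    # reachable from the puppeteer container via the hostname "redis"
```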
The current Node code does not require many additional packages or libraries, but you can install anything you like to create a really cool project. If you need system libraries or tools, simply add them to the Dockerfile and execute a
$ make rebuild
followed by a new
$ make enter
Of course, the current setup is not meant for anything but development environments. There’s an emphasis on quickly iterating on the project and making changes immediately. For a production setup, you’d need to put in a bit more work to make sure that you’re running everything according to security requirements, best practices and in a practical fashion which makes maintenance possible.
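One such hardening step is creating an unprivileged user in the Dockerfile instead of running as root. The sketch below is a hypothetical addition (the user name and directories are placeholders, not part of the repository):

```dockerfile
# Hypothetical hardening step: run as an unprivileged user instead of root
RUN groupadd -r pptruser && useradd -r -g pptruser -G audio,video pptruser \
    && mkdir -p /home/pptruser /app \
    && chown -R pptruser:pptruser /home/pptruser /app
USER pptruser
```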
This post introduced my take on a quick dockerized Puppeteer development environment for working with the Headless Chrome Node API locally. Clone the project and try Headless Chrome and Puppeteer yourself!
Keep in mind that this setup is meant solely for development: Headless Chrome is started without regard for possible security issues, and your code runs as root within the container. Use it with caution!