Puppeteer Deployed with Docker
Web scraping is a foundational aspect of the internet. It powers a number of common internet services including search, data aggregation, and analytics. Therefore, having a fundamental understanding of how the process works is a valuable skill.
This tutorial walks through how to build a simple web scraping service inside a Docker container. It provides the basic components to quickly build and deploy the container for development and use. The web scraping logic uses Puppeteer, an open-source package built by Google.
Puppeteer is an API that automates the programmatic manipulation of a web browser.
The web scraping script provides some important boilerplate for crawling a web page with Puppeteer. In addition, it demonstrates a few common data extraction tasks. The script is minimal, but highly extensible. Use it as a starting point and customize it for your own web scraping project.
Setup and Assumptions
- An understanding of using Docker as a development environment
- A basic understanding of Puppeteer
You can find the code referenced in this post in the following GitHub repo. The specific code for this tutorial is located in /node/puppet.
$ git clone https://github.com/meanderio/docker-dev.git
Puppeteer Container
We are going to use the official Puppeteer image. Information about this Docker image can be found in the Puppeteer Docs. While this gives us the least control over our environment, it does allow us to build and deploy with minimal friction. The focus of this tutorial is on using Puppeteer for web scraping, so we will accept this trade-off. The Puppeteer package is highly extensible. Future posts will dig into the details of configuration, usage, and deployment considerations at scale.
Dockerfile
# use the official Puppeteer image as the base
FROM ghcr.io/puppeteer/puppeteer:latest
# create a directory for the scraper source code
RUN mkdir -p ./src
# copy the scraping script into the image
COPY ./src/puppet.js ./src
# start an interactive shell when the container runs
ENTRYPOINT [ "/bin/bash" ]
The Dockerfile for our deployment is very simple. The first thing we want to do is build our image.
$ docker build -t dckr-puppet .
If you are on Apple Silicon you will need to change the first line of the Dockerfile to:
FROM --platform=linux/arm64 ghcr.io/puppeteer/puppeteer:latest
The build process will log a warning about the platform, but it will not be a problem for this tutorial.
Next we can run the container in interactive mode. This will allow us to edit, run, and test our code.
$ docker run --detach -it --rm --name dckr-puppet dckr-puppet
To streamline this entire workflow we can combine the build and run steps with a compose.yaml file.
services:
  app:
    container_name: dckr-puppet
    build: .
    volumes:
      - type: bind
        source: ./src
        target: /home/pptruser/src
    stdin_open: true
    tty: true
    entrypoint: /bin/bash
Now one command will build and run the container.
$ docker compose up --build -d
Note the volume bind from ./src on the host to /home/pptruser/src in the container. This means all the code we write while in the container is reflected back to our local machine on save. This is a useful way to persist our code base so we can keep it under version control.
After you confirm your container is running, attach to it using the VS Code Dev Containers extension. With this setup complete, we are ready to implement the core logic for our web scraper.
Puppeteer Web Scraper
This is the final Puppeteer script. Take a moment to read through it line by line.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    dumpio: false,
  });

  const url = 'https://example.com';
  const page = await browser.newPage();
  const resp = await page.goto(url, {
    timeout: 5000, waitUntil: 'load'
  });
  console.log("status: ", resp.status());

  // extract the full page html
  const content = await page.content();
  console.log(content);

  // take a screenshot of the page
  const scrnsht = await page.screenshot({
    path: 'page.jpg',
    fullPage: true
  });
  console.log("screenshot saved.");

  // extract text from the page
  const words = await page.$eval(
    '*', (el) => el.innerText
  );
  console.log(words);

  await browser.close();
})();
Let’s break this script down into logical chunks and step through them one by one. This will help you better understand what the script does and reason about how Puppeteer works.
const browser = await puppeteer.launch({
  headless: true,
  dumpio: false,
});
The above chunk creates a new browser instance. This is done with the launch method. The browser is our entry point. The launch method can take a number of optional arguments. We will stick to a minimal set for this tutorial. The headless: true option prevents a GUI from rendering, and dumpio: false suppresses diagnostic output on the command line.
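If you need more control, launch() accepts additional options. The snippet below is a minimal sketch of a drop-in replacement for the launch() call above; the specific flags and viewport values are illustrative and should be adjusted for your environment.
// a minimal sketch of extra launch options (adjust for your environment)
const browser = await puppeteer.launch({
  headless: true,
  dumpio: false,
  // pass flags directly to the underlying browser process
  args: ['--disable-gpu'],
  // set the default viewport size for new pages
  defaultViewport: { width: 1280, height: 800 },
});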
const url = 'https://example.com';
const page = await browser.newPage();
const resp = await page.goto(url, {
  timeout: 5000, waitUntil: 'load'
});
console.log("status: ", resp.status());
This section instructs the browser to create a new blank tab and visit the page defined by url. In this case the url is example.com. The visit is executed with the goto() method. This method takes an optional options object: timeout caps how long the navigation may take, and waitUntil defines which lifecycle event (here, the load event) must fire before control returns to Puppeteer. The goto() method returns a standard response object. We store this as resp and log the status code to stdout in the console.
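Navigation can fail or time out, so in practice you may want to guard the goto() call and check the response before extracting data. The snippet below is one possible sketch of an alternative to the goto() call above, used inside the same async function; waitUntil: 'networkidle2' is another valid setting that waits until the network is mostly idle.
// a sketch of defensive navigation (an alternative to the goto() call above)
let resp;
try {
  resp = await page.goto(url, {
    timeout: 5000,
    waitUntil: 'networkidle2', // resolve once the network is mostly idle
  });
} catch (err) {
  console.error("navigation failed: ", err.message);
  await browser.close();
  return;
}
if (!resp || !resp.ok()) {
  console.log("non-success status: ", resp ? resp.status() : "no response");
}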
// extract the full page html
const content = await page.content();
console.log(content);
The above snippet grabs and logs the page html to stdout.
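If you want to keep the raw HTML rather than only print it, you can write it to a file with Node's built-in fs module. A small sketch; the output path content.html is an arbitrary choice.
// a sketch: persist the raw html to a file instead of only printing it
const fs = require('fs');
fs.writeFileSync('content.html', content);
console.log("html saved.");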
// take a screenshot of the page
const scrnsht = await page.screenshot({
  path: 'page.jpg',
  fullPage: true
});
console.log("screenshot saved.");
The above snippet takes and saves a screenshot of the page.
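Puppeteer can also render the page to a PDF instead of (or in addition to) an image, which can be handy for archiving. A minimal sketch, noting that page.pdf() only works in headless mode; the path and format here are arbitrary choices.
// a sketch: render the page to a PDF (headless mode only)
await page.pdf({
  path: 'page.pdf',
  format: 'A4',
});
console.log("pdf saved.");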
// extract text from the page
const words = await page.$eval(
  '*', (el) => el.innerText
);
console.log(words);
The above snippet extracts and logs all text on the page to stdout.
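The same pattern extends to more targeted extraction. For example, page.$$eval() runs a function over all elements matching a selector; the sketch below collects the destination of every link on the page.
// a sketch: extract the href of every link on the page
const links = await page.$$eval(
  'a', (els) => els.map((el) => el.href)
);
console.log(links);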
await browser.close();
Finally we close the browser when the web crawl and data extraction process is complete.
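If any of the earlier steps throw, browser.close() is never reached and the browser process can linger. One way to harden the script is to wrap the body in a try/finally, as in this sketch:
// a sketch: ensure the browser always shuts down, even if a step throws
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto('https://example.com', { timeout: 5000, waitUntil: 'load' });
    // ... extraction steps from the script above go here ...
  } finally {
    await browser.close();
  }
})();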
The puppet.js script can be run inside the container. Open a terminal in the src directory and use the following command.
$ node puppet.js
You should see the output from the console.log calls used in the script.
Summary
We now have a powerful, lightweight web crawler. In addition, the web crawler is wrapped in a Docker container. This means we can use this template as a starting point to modify, extend, and orchestrate deployment.
From here you can focus on extending the web crawler code to extract specific information from websites of interest. You can also focus on orchestrating and automating deployment of the web crawler to scale to large workloads across many web pages and sites, adding support for saving the extracted data to a database or other persistent storage location.
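As a first step toward persistence, you could collect the extracted fields into an object and write it out as JSON before closing the browser. A minimal sketch, reusing the url, resp, and words variables from the script above; the output file results.json is an arbitrary choice.
// a sketch: collect the extracted fields and persist them as JSON
const fs = require('fs');

const result = {
  url: url,                            // page that was crawled
  status: resp.status(),               // HTTP status from goto()
  text: words,                         // text extracted with $eval
  scrapedAt: new Date().toISOString(), // timestamp of the crawl
};
fs.writeFileSync('results.json', JSON.stringify(result, null, 2));
console.log("results saved.");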
In future posts we will dive into building a feature-rich web crawler with Puppeteer. We will focus on the details of data extraction at scale in heterogeneous environments.