
Tue, Jul 23, 2024

Puppeteer Deployed with Docker

Web scraping is a foundational part of the internet. It powers a number of common services, including search, data aggregation, and analytics. A working understanding of how the process operates is therefore a valuable skill.

This tutorial walks through how to build a simple web scraping service inside a Docker container. It provides the basic components to quickly build and deploy the container for development and use. The web scraping logic uses Puppeteer, an open source package built by Google. Puppeteer provides a high-level API for programmatically controlling a web browser. The web scraping script provides some important boilerplate for crawling a web page with Puppeteer and demonstrates a few common data extraction tasks. The script is minimal, but highly extensible. Use it as a starting point and customize it for your own web scraping project.

Setup and Assumptions

  1. An understanding of using Docker as a development environment
  2. A basic understanding of Puppeteer

You can find the code referenced in this post in the following GitHub repo. The specific code for this tutorial is located in /node/puppet.

$ git clone https://github.com/meanderio/docker-dev.git

Puppeteer Container

We are going to use the official Puppeteer image. Information about this Docker image can be found in the Puppeteer Docs. While this gives us the least control over our environment, it does allow us to build and deploy with minimal friction. The focus of this tutorial is on using Puppeteer for web scraping, so we will accept this trade-off. The Puppeteer package is highly extensible. Future posts will dig into the details of configuration, usage, and deployment considerations at scale.

Dockerfile

FROM ghcr.io/puppeteer/puppeteer:latest

RUN mkdir -p ./src 

COPY ./src/puppet.js ./src

ENTRYPOINT [ "/bin/bash" ]

The Dockerfile for our deployment is very simple. The first thing we want to do is build our image.

$ docker build -t dckr-puppet .

If you are on Apple Silicon you will need to change the first line of the Dockerfile to:

FROM --platform=linux/arm64 ghcr.io/puppeteer/puppeteer:latest

The build process will log a warning about the platform, but it will not be a problem for this tutorial.

Next we can run the container in an interactive state. This will allow us to edit, run, and test our code.

$ docker run --detach -it --rm --name dckr-puppet dckr-puppet

To streamline this entire workflow we can combine the build and run steps with a compose.yaml file.

services:
  app:
    container_name: dckr-puppet
    build: .
    volumes:
      - type: bind
        source: ./src
        target: /home/pptruser/src
    stdin_open: true
    tty: true
    entrypoint: /bin/bash

Now one command will build and run the container.

$ docker compose up --build -d

Note the bind mount from ./src on the host to /home/pptruser/src in the container. This means all the code we write while in the container is reflected back to our local machine on save. This is a useful way to persist our code base so we can keep it under version control.

After you confirm your container is running, attach to it using the VS Code Dev Containers extension. With this setup complete, we are ready to implement the core logic for our web scraper.

Puppeteer Web Scraper

This is the final Puppeteer script. Take a moment to read through it line by line.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    dumpio: false,
  });

  const url = 'https://example.com';

  const page = await browser.newPage();
  const resp = await page.goto(url, { 
      timeout: 5000, waitUntil: 'load' 
  });
  console.log("status: ", resp.status());

  // extract the full page html
  const content = await page.content();
  console.log(content)

  // take a screenshot of the page
  const scrnsht = await page.screenshot({ 
    path: 'page.jpg', 
    fullPage: true });
  console.log("screenshot saved.")

  // extract text from the page
  const words = await page.$eval(
    '*', (el) => el.innerText
  );
  console.log(words);

  await browser.close();
})();

Let’s break this script down into logical chunks and step through them one by one. This will make it easier to understand what the script does and how Puppeteer works.

const browser = await puppeteer.launch({
    headless: true,
    dumpio: false,
  });

The above chunk creates a new browser instance with the launch() method; the browser object is our entry point into Puppeteer. launch() accepts a number of optional arguments, and we will stick to a minimal set for this tutorial. The headless: true option prevents a GUI from rendering, and dumpio: false suppresses browser process output on the command line.
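
launch() accepts other options as well. The snippet below is a minimal sketch of a slightly expanded configuration; the viewport size and the Chromium flag shown here are illustrative examples, not requirements for this tutorial.

const browser = await puppeteer.launch({
  headless: true,                                 // run without a visible UI
  dumpio: false,                                  // keep browser process output quiet
  defaultViewport: { width: 1280, height: 800 },  // viewport used when rendering pages
  args: ['--disable-gpu'],                        // extra Chromium flags, passed through as-is
});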

const url = 'https://example.com';

  const page = await browser.newPage();
  const resp = await page.goto(url, { 
      timeout: 5000, waitUntil: 'load' 
  });
  console.log("status: ", resp.status());

This section instructs the browser to open a new blank tab and visit the page defined by url, in this case example.com. The navigation is executed with goto(). This method takes an optional options object that sets a navigation timeout and specifies which lifecycle event (waitUntil) must fire before control is returned to the script. goto() returns a response object; we store it in resp and log the HTTP status code to stdout.
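
Navigation can fail or time out, in which case goto() throws. One common pattern, sketched below, is to wrap the call in try/catch; the 'networkidle2' wait condition is an alternative to 'load' and is shown only as an illustration.

try {
  const resp = await page.goto(url, {
    timeout: 5000,             // abort the navigation after 5 seconds
    waitUntil: 'networkidle2', // wait until the network is mostly idle
  });
  console.log("status: ", resp.status());
} catch (err) {
  console.error("navigation failed: ", err.message);
}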

// extract the full page html
  const content = await page.content();
  console.log(content)

The above snippet grabs the full page HTML with page.content() and logs it to stdout.
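
Logging raw HTML is fine for a quick check, but in a real crawl you will usually want to persist it. Below is a minimal sketch using Node's built-in fs module; the page.html filename is just an example.

const fs = require('fs/promises');         // Node's promise-based file system API
await fs.writeFile('page.html', content);  // save the extracted HTML to disk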

  // take a screenshot of the page
  const scrnsht = await page.screenshot({ 
    path: 'page.jpg', 
    fullPage: true });
  console.log("screenshot saved.")

The above snippet takes a full-page screenshot and saves it to page.jpg.
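
page.screenshot() accepts a few other useful options. The variant below is a sketch that explicitly requests a compressed JPEG instead of the default PNG; the quality value is illustrative.

await page.screenshot({
  path: 'page.jpg',  // output path; format is inferred from the extension
  type: 'jpeg',      // or set the format explicitly
  quality: 80,       // JPEG quality, 0-100 (not valid for PNG)
  fullPage: true,    // capture the full scrollable page
});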

  // extract text from the page
  const words = await page.$eval(
    '*', (el) => el.innerText
  );
  console.log(words);

The above snippet extracts all visible text on the page with $eval() and logs it to stdout.
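
The wildcard selector pulls text from the whole document; more often you will target specific elements. As an illustration, the sketch below collects the href of every link on the page using $$eval(), the multi-element counterpart of $eval().

const links = await page.$$eval(
  'a', (els) => els.map((el) => el.href)  // map each anchor element to its href
);
console.log(links);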

await browser.close();

Finally, we close the browser when the web crawl and data extraction process is complete.
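
If any step above throws, browser.close() is never reached and the browser process is left running. A common safeguard, sketched below, is to wrap the crawl in try/finally so the browser is always shut down.

const browser = await puppeteer.launch({ headless: true });
try {
  const page = await browser.newPage();
  await page.goto('https://example.com');
  // ... data extraction steps ...
} finally {
  await browser.close();  // always release the browser process
}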

This script can be run in the container. Open a terminal, change into the src directory, and use the following command.

$ node puppet.js

You should see the output from the console.log calls used in the script.

Summary

We now have a powerful, lightweight web crawler wrapped in a Docker container. This means we can use this template as a starting point to modify, extend, and orchestrate deployment.

From here you can focus on extending the web crawler code to extract specific information from websites of interest. You can also focus on orchestrating and automating deployment of the crawler to scale to large workloads across many web pages and sites, adding support for saving the extracted data to a database or other persistent storage location.
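
As a simple first step toward persistence, the extracted data can be written to a JSON file from within the script. Below is a minimal sketch, assuming the url and words values from the scraper above; results.json is just an example filename.

const fs = require('fs/promises');
const record = {
  url,                                  // the page that was crawled
  scrapedAt: new Date().toISOString(),  // when it was crawled
  text: words,                          // text extracted by the scraper
};
await fs.writeFile('results.json', JSON.stringify(record, null, 2));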

In future posts we will dive into building a feature-rich web crawler with Puppeteer. We will focus on the details of data extraction at scale in heterogeneous environments.