View the Code
See the code for this example on GitHub.
Defining our Scraping Function
We will start by defining our scraping function. This is the Beam function that will be invoked remotely. We use the Image class from the Beam SDK to install the required packages in the container running your code.
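As a rough sketch, the function might look like the following. The specific packages (requests, beautifulsoup4) and the fields returned are assumptions for illustration:

```python
from urllib.parse import urljoin

from beam import Image, function

# Assumed package list -- swap in whatever your scraper actually needs.
image = Image(python_packages=["requests", "beautifulsoup4"])

@function(image=image)
def scrape_page(url: str) -> dict:
    # These imports resolve inside the remote container, where the image installed them.
    import requests
    from bs4 import BeautifulSoup

    # Fetch the page and parse its HTML.
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    # Resolve relative hrefs to absolute URLs so the crawler can filter them later.
    links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

    return {
        "url": url,
        "title": soup.title.string if soup.title else "",
        "links": links,
    }
```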
Building a Batch Crawler with Beam’s Function Map
Next, we'll build a crawler that uses Beam's map method to invoke our scrape_page function on a list of URLs. Below is the __init__ method for the crawler.
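A sketch of that constructor; the parameter names and defaults below are illustrative assumptions:

```python
class WikipediaCrawler:
    def __init__(self, start_url: str, limit: int = 50, batch_size: int = 10):
        self.limit = limit                 # stop after this many pages (assumed default)
        self.batch_size = batch_size       # URLs passed to each map() call (assumed)
        self.visited = set()               # URLs we have already scraped
        self.pages_to_visit = [start_url]  # frontier of URLs still to scrape
        self.scraped_data = []             # results returned by scrape_page
```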
We'll then define the crawl method, along with a helper method that determines whether a URL is a valid Wikipedia URL.
Inside crawl, we pass each batch of URLs to the scrape_page function's map method, which lets us scrape multiple pages in parallel. After the pages are scraped, we collect any new links that we want to visit and add them to the pages_to_visit list.
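Continuing the class above, here is a sketch of both methods; the batching logic is an assumption, but the overall flow follows the description:

```python
    def is_valid_wikipedia_url(self, url: str) -> bool:
        # Assumption: we only follow English Wikipedia article links.
        return url.startswith("https://en.wikipedia.org/wiki/")

    def crawl(self):
        while self.pages_to_visit and len(self.visited) < self.limit:
            # Pull the next batch of unvisited URLs off the frontier.
            batch = [u for u in self.pages_to_visit[: self.batch_size] if u not in self.visited]
            self.pages_to_visit = self.pages_to_visit[self.batch_size :]
            if not batch:
                continue

            # map() fans scrape_page out across the batch and yields each result.
            for result in scrape_page.map(batch):
                self.visited.add(result["url"])
                self.scraped_data.append(result)

                # Collect any new, valid Wikipedia links for later batches.
                for link in result["links"]:
                    if self.is_valid_wikipedia_url(link) and link not in self.visited:
                        self.pages_to_visit.append(link)
```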
Running the Batch Crawler
Finally, we can run our crawler. Below is the code for our main function, which initializes the crawler and runs the crawl method.
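A sketch of that entry point; the seed URL and page limit here are placeholders:

```python
import json

def main():
    crawler = WikipediaCrawler(
        start_url="https://en.wikipedia.org/wiki/Web_scraping",  # placeholder seed URL
        limit=50,
    )
    crawler.crawl()

    # Persist everything we scraped to disk.
    with open("scraped_data.json", "w") as f:
        json.dump(crawler.scraped_data, f, indent=2)

if __name__ == "__main__":
    main()
```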
Once the crawl completes, the scraped data is saved to a scraped_data.json file. It will look something like this:
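The exact contents depend on your seed URL; using the field names from the sketches above, each record would be shaped roughly like this (all values are placeholders):

```json
[
  {
    "url": "https://en.wikipedia.org/wiki/Web_scraping",
    "title": "Web scraping - Wikipedia",
    "links": [
      "https://en.wikipedia.org/wiki/Data_scraping",
      "..."
    ]
  }
]
```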
Building a Continuous Crawler with Beam Functions and Threads
The batched web crawler is a good starting point, but it waits for a full batch to finish before starting any new jobs. If we want to keep our crawler's concurrency limit continuously saturated, we can use Beam functions in conjunction with Python threads. To do this, we will use the same scrape_page function, but instead of calling its map method, we will use a thread pool to invoke the function in parallel. Below is the code for our WikipediaCrawler class with a continuous crawl method.
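Here is a sketch of that class. The thread-pool size, attribute names, and the use of scrape_page.remote inside each worker thread are assumptions; the wait-and-refill pattern is the part explained below:

```python
import concurrent.futures

class WikipediaCrawler:
    def __init__(self, start_url: str, limit: int = 50, max_workers: int = 10):
        self.limit = limit                 # stop after this many pages (assumed default)
        self.max_workers = max_workers     # size of the thread pool (assumed)
        self.visited = set()
        self.pages_to_visit = [start_url]
        self.scraped_data = []

    def is_valid_wikipedia_url(self, url: str) -> bool:
        # Assumption: we only follow English Wikipedia article links.
        return url.startswith("https://en.wikipedia.org/wiki/")

    def crawl(self):
        with concurrent.futures.ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            futures = set()
            while (self.pages_to_visit or futures) and len(self.visited) < self.limit:
                # Keep the pool saturated: submit new work whenever there is capacity.
                while self.pages_to_visit and len(futures) < self.max_workers:
                    url = self.pages_to_visit.pop(0)
                    if url in self.visited:
                        continue
                    self.visited.add(url)
                    # Each thread makes one remote invocation of the Beam function.
                    futures.add(executor.submit(scrape_page.remote, url))

                if not futures:
                    break

                # Block only until the first future finishes, then refill the pool.
                done, futures = concurrent.futures.wait(
                    futures, return_when=concurrent.futures.FIRST_COMPLETED
                )
                for future in done:
                    result = future.result()
                    self.scraped_data.append(result)
                    for link in result["links"]:
                        if self.is_valid_wikipedia_url(link) and link not in self.visited:
                            self.pages_to_visit.append(link)
```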
We submit jobs to the thread pool with the executor.submit method and wait for any of them to complete using the concurrent.futures.wait method, passing the concurrent.futures.FIRST_COMPLETED constant so that we only wait until one of the futures finishes. This means that as soon as any future completes, we process its result and add new work to the pool.
Running the Continuous Crawler
To run the continuous crawler, you can use the same main function as before. When you run this code, the output will be similar to the batch crawler's, with results streaming in as each page completes rather than batch by batch.