We use the Image class from the Beam SDK to install these packages in the container running your code.
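As a rough sketch (not the article's exact code): assuming the packages in question are requests and beautifulsoup4, and that the Beam SDK's function decorator and Image(python_packages=...) argument are used, the scraping function might be defined like this:

```python
# A sketch only: check the decorator and Image arguments against the Beam docs
# for your SDK version.
from beam import Image, function


@function(
    image=Image(python_packages=["requests", "beautifulsoup4"]),
)
def scrape_page(url: str) -> dict:
    # Third-party imports live inside the function because these packages are
    # installed in the remote container image, not necessarily locally.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    return {
        "url": url,
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "text": soup.get_text(separator=" ", strip=True),
        "links": [a["href"] for a in soup.find_all("a", href=True)],
    }
```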
We'll use the map method to invoke our scrape_page function on a list of URLs. Below is our __init__ method for the crawler.
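Since the original listing isn't reproduced here, here is a minimal sketch of what that __init__ might hold. The attribute names are assumptions, apart from pages_to_visit, which is referred to later:

```python
class WikipediaCrawler:
    def __init__(self, start_url: str, max_pages: int = 100):
        # Hypothetical attributes; the original implementation may differ.
        self.start_url = start_url
        self.max_pages = max_pages
        self.pages_to_visit = [start_url]  # frontier of URLs still to scrape
        self.visited = set()               # URLs already scraped
        self.scraped_data = []             # results returned by scrape_page
```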
Next, we add the crawl method, along with a helper method to determine if a URL is a valid Wikipedia URL.
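A minimal sketch of those two methods, under the same assumptions as above (the batch size, the helper's filtering rules, and the exact return shape of scrape_page.map are all guesses):

```python
from urllib.parse import urljoin, urlparse


class WikipediaCrawler:
    # (continuing the class sketched above; __init__ omitted here)

    def is_valid_wikipedia_url(self, url: str) -> bool:
        # Only follow regular article pages on en.wikipedia.org.
        parsed = urlparse(url)
        if parsed.netloc != "en.wikipedia.org" or not parsed.path.startswith("/wiki/"):
            return False
        # Skip namespaced pages such as Special:, File:, Category:.
        return ":" not in parsed.path[len("/wiki/"):]

    def crawl(self):
        while self.pages_to_visit and len(self.visited) < self.max_pages:
            # Take the next batch of URLs and fan them out in parallel.
            batch = [u for u in self.pages_to_visit[:10] if u not in self.visited]
            self.pages_to_visit = self.pages_to_visit[10:]
            self.visited.update(batch)
            if not batch:
                continue

            for result in scrape_page.map(batch):
                self.scraped_data.append(result)
                # Queue any new, valid Wikipedia links for a later batch.
                for link in result["links"]:
                    absolute = urljoin(result["url"], link)
                    if self.is_valid_wikipedia_url(absolute) and absolute not in self.visited:
                        self.pages_to_visit.append(absolute)
```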
Notice that we call the scrape_page function's map method, which allows us to scrape multiple pages in parallel. After the pages are scraped, we collect any new links that we want to visit and add them to the pages_to_visit list.
Finally, we define a main function, which initializes the crawler and runs the crawl method.
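A sketch of that main function, with a hypothetical seed URL and page limit, writing the results to the JSON file mentioned below:

```python
import json


def main():
    # Hypothetical seed URL and limit; substitute your own.
    crawler = WikipediaCrawler(
        start_url="https://en.wikipedia.org/wiki/Web_crawler",
        max_pages=50,
    )
    crawler.crawl()

    # Persist everything we scraped to a JSON file.
    with open("scraped_data.json", "w") as f:
        json.dump(crawler.scraped_data, f, indent=2)


if __name__ == "__main__":
    main()
```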
The scraped results are written to a scraped_data.json file. It will look something like this:
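The article's actual output isn't reproduced here; with the sketches above, each entry would be one record per scraped page, roughly along these lines (illustrative values only):

```json
[
  {
    "url": "https://en.wikipedia.org/wiki/Web_crawler",
    "title": "Web crawler - Wikipedia",
    "text": "A Web crawler, sometimes called a spider or spiderbot, ...",
    "links": ["/wiki/Internet_bot", "/wiki/Search_engine", "..."]
  }
]
```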
To crawl continuously, we can reuse the same scrape_page function, but instead of using the map method, we will use a thread pool to invoke the function in parallel. Below is the code for our WikipediaCrawler class with a continuous crawl method.
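The original listing isn't reproduced here; a minimal sketch of that continuous version, built on Python's concurrent.futures (and assuming scrape_page can be called directly from each worker thread, rather than through Beam), might look like this:

```python
import concurrent.futures
from urllib.parse import urljoin, urlparse


class WikipediaCrawler:
    def __init__(self, start_url: str, max_workers: int = 10):
        self.pages_to_visit = [start_url]
        self.visited = set()
        self.scraped_data = []
        self.max_workers = max_workers

    def is_valid_wikipedia_url(self, url: str) -> bool:
        # Same idea as before: only regular en.wikipedia.org article pages.
        parsed = urlparse(url)
        return parsed.netloc == "en.wikipedia.org" and parsed.path.startswith("/wiki/")

    def crawl(self):
        with concurrent.futures.ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            in_flight = {}  # future -> url
            while self.pages_to_visit or in_flight:
                # Keep the pool full: submit work until every worker is busy
                # or the frontier is empty.
                while self.pages_to_visit and len(in_flight) < self.max_workers:
                    url = self.pages_to_visit.pop(0)
                    if url in self.visited:
                        continue
                    self.visited.add(url)
                    # Each worker thread invokes scrape_page directly here; the
                    # original may route the call through Beam instead.
                    in_flight[executor.submit(scrape_page, url)] = url

                if not in_flight:
                    break

                # Block only until at least one scrape finishes, then refill the pool.
                done, _ = concurrent.futures.wait(
                    in_flight, return_when=concurrent.futures.FIRST_COMPLETED
                )
                for future in done:
                    in_flight.pop(future)
                    result = future.result()
                    self.scraped_data.append(result)
                    for link in result["links"]:
                        absolute = urljoin(result["url"], link)
                        if self.is_valid_wikipedia_url(absolute) and absolute not in self.visited:
                            self.pages_to_visit.append(absolute)
```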
We submit scraping tasks with the executor.submit method and wait for any of them to complete using the concurrent.futures.wait method. We specify that we only want to wait for one of the futures to complete by passing the concurrent.futures.FIRST_COMPLETED constant. This means that as soon as any future completes, we process its result and add new work to the pool.
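The pattern stands on its own, independent of the crawler; here is a tiny self-contained illustration, with a made-up fake_task standing in for scrape_page:

```python
import concurrent.futures
import random
import time


def fake_task(n: int) -> int:
    # Stand-in for a scrape: sleep a random amount, then return a value.
    time.sleep(random.uniform(0.1, 0.5))
    return n * n


with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    pending = {executor.submit(fake_task, n) for n in range(8)}
    while pending:
        # Returns as soon as at least one future finishes, rather than all of them.
        done, pending = concurrent.futures.wait(
            pending, return_when=concurrent.futures.FIRST_COMPLETED
        )
        for future in done:
            print("finished:", future.result())
            # In the crawler, this is where new work gets submitted to the pool.
```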
We use the same main function as before. When you run this code, you should see output that looks like the following: