Running a Web Scraper

Define the environment

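First, we’ll define the environment the scraper runs in: the CPU and memory allocation, plus a container image with Python 3.8 and the packages the scraper needs: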

app.py

from beam import App, Runtime, Image


app = App(
    name="web-scraper",
    runtime=Runtime(
        cpu=1,
        memory="8Gi",
        image=Image(
            python_version="python3.8",
            python_packages=["requests", "bs4", "transformers", "torch"],
        ),
    ),
)

Write scraping logic

Next, we’ll write the logic that scrapes the headlines from The New York Times homepage and scores the sentiment of each one. To run the function on Beam, add the @app.run() decorator to it:

@app.run()
def scrape_nyt():
    ...
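
Before wiring in the model, the headline-selection step can be tried on a static snippet. The sketch below is an illustration only: it uses the standard library's html.parser rather than BeautifulSoup so that it runs with no extra packages, the sample HTML is made up, and HeadlineParser is a hypothetical helper that is not part of the scraper. It collects the text of h3 elements carrying the indicate-hover class, the same elements the full script targets.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text of <h3> elements that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.headlines = []
        self._buffer = None  # holds text fragments while inside a matching <h3>

    def handle_starttag(self, tag, attrs):
        # The class attribute may list several classes separated by spaces
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h3" and self.css_class in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.headlines.append(text)
            self._buffer = None


# Made-up sample markup, not real page content
SAMPLE_HTML = """
<section>
  <h3 class="indicate-hover css-1">First sample headline</h3>
  <h3 class="other">Not a headline</h3>
  <h3 class="indicate-hover">Second sample headline</h3>
</section>
"""

parser = HeadlineParser("indicate-hover")
parser.feed(SAMPLE_HTML)
headlines = parser.headlines
print(headlines)
```

Checking the selector against a fixed snippet like this makes it easier to notice when a change in the page's markup, rather than the model, is the reason no headlines come back.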


scraper.py

from beam import App, Runtime, Image

import requests
from bs4 import BeautifulSoup
from transformers import pipeline

app = App(
    name="web-scraper",
    runtime=Runtime(
        cpu=1,
        memory="8Gi",
        image=Image(
            python_version="python3.8",
            python_packages=["requests", "bs4", "transformers", "torch"],
        ),
    ),
)

@app.run()
def scrape_nyt():
    # Fetch the homepage; fail fast on network or HTTP errors
    res = requests.get("https://www.nytimes.com", timeout=30)
    res.raise_for_status()
    soup = BeautifulSoup(res.content, "html.parser")
    # Grab all headlines (string=True replaces the deprecated text=True)
    headlines = soup.find_all("h3", class_="indicate-hover", string=True)

    total_headlines = len(headlines)
    negative_headlines = 0

    # Iterate through each headline
    for h in headlines:
        title = h.get_text()
        print(title)
        sentiment = predict_sentiment(title)

        print(sentiment)

        # Default to 0.0 so a missing label cannot raise a TypeError
        if sentiment.get("NEGATIVE", 0.0) > sentiment.get("POSITIVE", 0.0):
            negative_headlines += 1

    print(f"{negative_headlines} negative headlines / {total_headlines} total")


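_sentiment_model = None


def predict_sentiment(title):
    # Build the pipeline once and reuse it; rebuilding it per headline is slow
    global _sentiment_model
    if _sentiment_model is None:
        _sentiment_model = pipeline(
            "sentiment-analysis", model="siebert/sentiment-roberta-large-english"
        )
    result = _sentiment_model(title, truncation=True, top_k=2)
    # Map each label ("POSITIVE" / "NEGATIVE") to its score
    prediction = {i["label"]: i["score"] for i in result}

    return prediction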
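
The tallying step can be checked without downloading the model. The following is a minimal sketch, assuming predictions shaped like the dictionaries predict_sentiment returns; to_scores and count_negative are hypothetical helpers introduced for illustration, and the scores are hand-written stand-ins, not real model output.

```python
def to_scores(result):
    """Turn pipeline-style output into a {label: score} dictionary."""
    return {item["label"]: item["score"] for item in result}


def count_negative(predictions):
    """Count predictions whose NEGATIVE score exceeds the POSITIVE score."""
    return sum(
        1
        for p in predictions
        if p.get("NEGATIVE", 0.0) > p.get("POSITIVE", 0.0)
    )


# Hand-written scores standing in for model output (not real results)
fake_results = [
    [{"label": "POSITIVE", "score": 0.9}, {"label": "NEGATIVE", "score": 0.1}],
    [{"label": "NEGATIVE", "score": 0.8}, {"label": "POSITIVE", "score": 0.2}],
    [{"label": "NEGATIVE", "score": 0.6}, {"label": "POSITIVE", "score": 0.4}],
]

predictions = [to_scores(r) for r in fake_results]
negative = count_negative(predictions)
print(f"{negative} negative headlines / {len(predictions)} total")
```

Keeping the comparison in a small function of its own means the counting rule can be verified locally before spending container time on a full run.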

Running the scraper

Now we’re ready to run the code on Beam. In your terminal, run the command below, replacing your_file.py with the name of the file that defines scrape_nyt:

beam run your_file.py:scrape_nyt

You should see each headline followed by its sentiment scores. In the sample run below, the script was saved as app.py:

(.venv) beta9@MacBook-Air-2 web-scraping % beam-stage run app.py:scrape_nyt
 i  Using cached image.
 ✓  App initialized.
 ✓  Container scheduled, logs will appear below.
Starting app...
Loading handler in 'app.py:scrape_nyt'...
Running task: c021040d-aea7-4406-9b5e-79d898f7592a
This Hummus Holds Up After 800 Years
{'POSITIVE': 0.9985199570655823, 'NEGATIVE': 0.0014800893841311336}
Task complete: c021040d-aea7-4406-9b5e-79d898f7592a, duration: 177.36207103729248s
