Running a Web Scraper
Let’s build a simple web scraper that extracts headlines from The New York Times and uses a RoBERTa-based sentiment model from Hugging Face to detect the sentiment of each one.
Define the environment
First, we’ll define our environment:
app.py
from beam import App, Runtime, Image

app = App(
    name="web-scraper",
    runtime=Runtime(
        cpu=1,
        memory="8Gi",
        image=Image(
            python_version="python3.8",
            python_packages=["bs4", "transformers", "torch"],
        ),
    ),
)
Write scraping logic
Now, we’ll write logic to scrape the headlines from The New York Times.
In order to run this on Beam, we add an @app.run() decorator to the function:
@app.run()
def scrape_nyt():
    ...
scraper.py
from beam import App, Runtime, Image
import requests
from bs4 import BeautifulSoup
from transformers import pipeline

app = App(
    name="web-scraper",
    runtime=Runtime(
        cpu=1,
        memory="8Gi",
        image=Image(
            python_version="python3.8",
            python_packages=["bs4", "transformers", "torch"],
        ),
    ),
)


@app.run()
def scrape_nyt():
    res = requests.get("https://www.nytimes.com")
    soup = BeautifulSoup(res.content, "html.parser")
    # Grab all headlines (string=True keeps only tags that directly contain text)
    headlines = soup.find_all("h3", class_="indicate-hover", string=True)
    total_headlines = len(headlines)
    negative_headlines = 0
    # Iterate through each headline
    for h in headlines:
        title = h.get_text()
        print(title)
        sentiment = predict_sentiment(title)
        print(sentiment)
        # Default to 0.0 so a missing label cannot trigger a None comparison
        if sentiment.get("NEGATIVE", 0.0) > sentiment.get("POSITIVE", 0.0):
            negative_headlines += 1
    print(f"{negative_headlines} negative headlines / {total_headlines} total")


def predict_sentiment(title):
    model = pipeline(
        "sentiment-analysis", model="siebert/sentiment-roberta-large-english"
    )
    # top_k=2 returns scores for both labels
    result = model(title, truncation=True, top_k=2)
    # Map each label to its score, e.g. {"POSITIVE": ..., "NEGATIVE": ...}
    prediction = {i["label"]: i["score"] for i in result}
    return prediction
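Note that predict_sentiment builds the pipeline on every call, so the model is loaded again for each headline. One possible way to avoid this is to cache the loaded model, sketched below with functools.lru_cache. To keep the sketch runnable without downloading weights, load_model is a hypothetical stand-in that returns a fake classifier with made-up scores; in the scraper it would instead return the pipeline(...) call shown above.

```python
from functools import lru_cache

load_count = 0  # counts how many times the (stand-in) model is built


@lru_cache(maxsize=1)
def load_model():
    """Build the classifier once; later calls reuse the cached object.

    Hypothetical stand-in: in the scraper this would return
    pipeline("sentiment-analysis", model="siebert/sentiment-roberta-large-english").
    """
    global load_count
    load_count += 1

    def fake_classifier(title, truncation=True, top_k=2):
        # Mimics the list-of-dicts shape the scraper reads; scores are made up.
        return [
            {"label": "POSITIVE", "score": 0.75},
            {"label": "NEGATIVE", "score": 0.25},
        ]

    return fake_classifier


def predict_sentiment(title):
    model = load_model()  # cached after the first call
    result = model(title, truncation=True, top_k=2)
    return {i["label"]: i["score"] for i in result}


for headline in ["First headline", "Second headline", "Third headline"]:
    predict_sentiment(headline)

print(load_count)  # prints 1: the model was built only once
```

With this arrangement the cost of loading the model is paid on the first headline only, which may matter when a page has many headlines.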
Running the scraper
Now, we’re ready to run our code using Beam. In your terminal, run the following, replacing your_file.py with the name of the file that defines scrape_nyt:
beam run your_file.py:scrape_nyt
You should see the headlines and the detected sentiment of each:
$ beam run app.py:scrape_nyt
i Using cached image.
✓ App initialized.
✓ Container scheduled, logs will appear below.
Starting app...
Loading handler in 'app.py:scrape_nyt'...
Running task: c021040d-aea7-4406-9b5e-79d898f7592a
This Hummus Holds Up After 800 Years
{'POSITIVE': 0.9985199570655823, 'NEGATIVE': 0.0014800893841311336}
Task complete: c021040d-aea7-4406-9b5e-79d898f7592a, duration: 177.36207103729248s
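The comparison that scrape_nyt applies to each prediction can be checked in isolation. The sketch below assumes predictions shaped like the dictionary printed above. count_negative is a hypothetical helper, and the second prediction uses invented values for illustration only.

```python
def count_negative(predictions):
    """Count predictions whose NEGATIVE score exceeds the POSITIVE score."""
    negative = 0
    for scores in predictions:
        # Default to 0.0 so a missing label does not raise a TypeError.
        if scores.get("NEGATIVE", 0.0) > scores.get("POSITIVE", 0.0):
            negative += 1
    return negative


predictions = [
    # Taken from the sample run above.
    {"POSITIVE": 0.9985199570655823, "NEGATIVE": 0.0014800893841311336},
    # Invented example values, not real model output.
    {"POSITIVE": 0.1, "NEGATIVE": 0.9},
]

negative_headlines = count_negative(predictions)
print(f"{negative_headlines} negative headlines / {len(predictions)} total")
```

Keeping the tally in a small pure function makes it easy to test without fetching the page or loading the model.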