Python Web Scraping Tutorial: BeautifulSoup and Requests
Web scraping is the automated process of extracting data from websites. In the Python ecosystem, Requests and BeautifulSoup are two of the most widely used libraries for collecting data from static HTML pages: Requests downloads the page, and BeautifulSoup parses the HTML.
This guide walks you through sending HTTP requests, parsing response content, and filtering target elements cleanly.
1. Installation and Requirements
Before writing scraping scripts, install the beautifulsoup4 package (the HTML parsing library) and the requests HTTP client using pip:
pip install beautifulsoup4 requests
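To confirm that both packages installed correctly, a quick check like the following can help. It is only a sketch: it imports the two libraries and prints their version strings, which will vary by environment.

```python
import bs4
import requests

# Both packages expose a __version__ string; the values depend on your install.
print("beautifulsoup4 version:", bs4.__version__)
print("requests version:", requests.__version__)
```

If either import fails with ModuleNotFoundError, pip most likely installed into a different Python environment than the one running the script.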
2. Fetching and Parsing HTML Data
Use the requests library to send a GET request to a target webpage, then load the content into BeautifulSoup:
import requests
from bs4 import BeautifulSoup

url = "https://quotes.toscrape.com/"
# A timeout keeps the script from hanging if the server does not respond
response = requests.get(url, timeout=10)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, "html.parser")
    # Extract the title tag of the page
    print("Page Title:", soup.title.text)
else:
    print("Failed to retrieve target webpage. Status Code:", response.status_code)
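The status-code check above can also be wrapped in small reusable helpers. The following is a minimal sketch, not a definitive implementation: the function names fetch_html and parse_title, the 10-second default timeout, and the User-Agent string are illustrative assumptions that do not appear in the original example. It uses raise_for_status() so that HTTP error codes surface as exceptions, and it guards against pages that have no title tag.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10.0):
    """Return the body of `url` as text, raising on network or HTTP errors."""
    # Hypothetical User-Agent; replace it with one that identifies your script.
    headers = {"User-Agent": "tutorial-scraper/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text


def parse_title(html):
    """Return the page title as a stripped string, or None if there is no <title>."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True)
```

A typical call would be parse_title(fetch_html("https://quotes.toscrape.com/")), wrapped in a try/except for requests.RequestException so that connection failures and HTTP errors are handled in one place.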
3. Finding Specific Elements
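BeautifulSoup provides find() and find_all() to search the parsed tree by tag name and by attributes such as class. find() returns the first match (or None if there is none), while find_all() returns a list of every match: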
# Find all quote blocks on the page
quotes = soup.find_all("div", class_="quote")
for quote in quotes:
    text = quote.find("span", class_="text").text
    author = quote.find("small", class_="author").text
    print(f'"{text}" - by {author}')
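The loop above assumes every quote block contains both a text span and an author tag. Because find() returns None when nothing matches, a missing element would raise AttributeError. A more defensive version might look like the sketch below. The function name extract_quotes and the inline SAMPLE_HTML (a hand-written fragment that imitates the markup targeted above) are assumptions, chosen so the example runs without network access.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates the structure targeted above (an assumption).
SAMPLE_HTML = """
<div class="quote">
  <span class="text">First quote</span>
  <small class="author">Author One</small>
</div>
<div class="quote">
  <span class="text">Quote without an author</span>
</div>
<div class="quote">
  <span class="text">Second quote</span>
  <small class="author">Author Two</small>
</div>
"""


def extract_quotes(html):
    """Return a list of {"text", "author"} dicts, skipping incomplete blocks."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for block in soup.find_all("div", class_="quote"):
        text_tag = block.find("span", class_="text")
        author_tag = block.find("small", class_="author")
        # find() returns None when there is no match, so check before using it.
        if text_tag is None or author_tag is None:
            continue
        results.append({
            "text": text_tag.get_text(strip=True),
            "author": author_tag.get_text(strip=True),
        })
    return results


for item in extract_quotes(SAMPLE_HTML):
    print(f'{item["text"]} - by {item["author"]}')
```

Returning a list of dictionaries instead of printing inside the loop keeps the parsing step separate from the output step, which makes it easier to write the results to a file later.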
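BeautifulSoup also accepts CSS selectors through its select() and select_one() methods. Knowing how to target elements by tag, class, and attribute, whether with find_all() or with selectors, is essential for handling real-world scraping tasks efficiently.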
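As an illustration of CSS selectors, BeautifulSoup's select() and select_one() methods accept selector strings. The sketch below runs on a hand-written fragment rather than the live site: SAMPLE_HTML, including its tag links, is an assumption that imitates the quote markup used earlier.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption, not fetched from the live site).
SAMPLE_HTML = """
<div class="quote">
  <span class="text">First quote</span>
  <small class="author">Author One</small>
  <a class="tag" href="/tag/life/">life</a>
  <a class="tag" href="/tag/books/">books</a>
</div>
<div class="quote">
  <span class="text">Second quote</span>
  <small class="author">Author Two</small>
  <a class="tag" href="/tag/humor/">humor</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Descendant selector: every span.text inside a div.quote
texts = [tag.get_text(strip=True) for tag in soup.select("div.quote span.text")]

# select_one() returns the first match, or None if nothing matches
first_author = soup.select_one("div.quote small.author").get_text(strip=True)

# Attribute selector: links whose href starts with "/tag/"
tag_links = [a["href"] for a in soup.select('a.tag[href^="/tag/"]')]

print(texts)
print(first_author)
print(tag_links)
```

Selectors are convenient when the target is defined by nesting or attribute patterns, while find() and find_all() are often enough for a simple tag-and-class lookup.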
Codecrown