Parsing HTML with BeautifulSoup
In this interactive exercise, you'll learn how to use the BeautifulSoup package to parse, prettify and extract information from HTML. You'll scrape the data from the webpage of Guido van Rossum, Python's very own Benevolent Dictator for Life. In the following exercises, you'll prettify the HTML and then extract the text and the hyperlinks.
The URL of interest is url = 'https://www.python.org/~guido/'.
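To make the three operations (parse, prettify, extract) concrete before touching the live page, here is a minimal offline sketch. The inline HTML string is made up for illustration, and passing 'html.parser' (Python's built-in parser) is an assumption; the exercise itself does not name a parser.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document (illustrative only, not the real page)
html_doc = (
    "<html><head><title>Demo page</title></head>"
    "<body><p>Hello, <a href='https://www.python.org/'>Python</a>!</p></body></html>"
)

# Parse the HTML string into a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'html.parser')

# Prettify: returns the same markup as a string, one tag per line, indented
pretty_soup = soup.prettify()
print(pretty_soup)

# Extract the page title and all of the text in the document
print(soup.title.string)
text = soup.get_text()
print(text)

# Extract the hyperlinks: collect the href attribute of every <a> tag
links = [a.get('href') for a in soup.find_all('a')]
print(links)
```

The same calls apply unchanged once html_doc holds the HTML downloaded from a real URL.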
This exercise is part of the course
Intermediate Importing Data in Python
Exercise instructions
- Import the function BeautifulSoup from the package bs4.
- Assign the URL of interest to the variable url.
- Package the request to the URL, send the request and catch the response with a single function, requests.get(), assigning the response to the variable r.
- Use the text attribute of the object r to return the HTML of the webpage as a string; store the result in a variable html_doc.
- Create a BeautifulSoup object soup from the resulting HTML using the function BeautifulSoup().
- Use the method prettify() on soup and assign the result to pretty_soup.
- Hit submit to print the prettified HTML to your shell!
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import packages
import requests
from ____ import ____
# Specify url: url
# Package the request, send the request and catch the response: r
# Extract the response as HTML: html_doc
# Create a BeautifulSoup object from the HTML: soup
# Prettify the BeautifulSoup object: pretty_soup
# Print the prettified HTML
print(pretty_soup)
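A filled-in version of the template might look like the sketch below. It is one possible solution, not necessarily the course's reference answer: the function name fetch_pretty_html, the timeout value, the raise_for_status() check and the explicit 'html.parser' argument are all assumptions added here.

```python
import requests
from bs4 import BeautifulSoup


def fetch_pretty_html(url):
    """Download a page and return its HTML, prettified by BeautifulSoup."""
    # Package the request, send the request and catch the response: r
    r = requests.get(url, timeout=10)  # timeout is an added assumption
    r.raise_for_status()  # added check: fail loudly on HTTP errors

    # Extract the response as HTML: html_doc
    html_doc = r.text

    # Create a BeautifulSoup object from the HTML: soup
    soup = BeautifulSoup(html_doc, 'html.parser')

    # Prettify the BeautifulSoup object: pretty_soup
    pretty_soup = soup.prettify()
    return pretty_soup


if __name__ == '__main__':
    # Specify url: url
    url = 'https://www.python.org/~guido/'
    try:
        print(fetch_pretty_html(url))
    except requests.RequestException as exc:
        # Network problems or HTTP errors end up here instead of a traceback
        print(f'Request failed: {exc}')
```

Wrapping the steps in a function keeps the network call out of import time, so the same logic can be reused for other URLs or exercised with a stubbed response.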