Today, you will learn how to do web scraping with BeautifulSoup. You will use the requests library to fetch web pages and the BeautifulSoup library to parse the HTML in Python.
In this tutorial, you will learn:
- The basics of web scraping and parsing
- How to install BeautifulSoup
- How to parse a local HTML file
- How to use Requests and BeautifulSoup to scrape and parse web pages
- The methods you will need to find elements inside the HTML
- The filters you can use to improve matching inside the HTML
What is Web Scraping
Web scraping is the process of using a bot to extract data from a website and export it into a digestible format. A web scraper extracts the HTML code from a web page, which is then parsed to extract valuable information.
What is BeautifulSoup
BeautifulSoup is a parsing library in Python that is used to scrape information from HTML or XML. The BeautifulSoup parser provides Python idioms to search and modify the parse tree.
Simply put, BeautifulSoup is the library that allows you to format the HTML in a usable way and extract elements from it.
What is Parsing in Web Scraping?
Parsing in web scraping is the process of transforming unstructured data into a structured format (e.g. parse tree) that is easier to read, use and extract data from.
Basically, parsing means splitting a document into usable chunks.
Getting Started with BeautifulSoup
How to Install BeautifulSoup
Use pip to install BeautifulSoup in Python.
```
$ pip install beautifulsoup4
```
Simple Parsing with BeautifulSoup
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I am learning <span>BeautifulSoup</span></p>", 'html.parser')
soup.find('span')
```

```
<span>BeautifulSoup</span>
```
Web Scraping Best Practices
- Use toy websites to practice
- Use APIs instead of web scraping
- Investigate if the data is available elsewhere first (e.g. Common Crawl)
- Respect robots.txt
- Slow down your scraping speed
- Cache your requests (see the sketch after this list)
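To make the last three practices concrete, here is a minimal sketch of a "polite" fetch helper. The helper name polite_get and the one-second delay are illustrative, not part of any library:

```python
# A minimal "polite" scraping sketch: respect robots.txt, wait between
# requests, and cache responses. The delay value is illustrative.
import time
import requests
from urllib.robotparser import RobotFileParser

robots = RobotFileParser('https://crawler-test.com/robots.txt')
robots.read()

cache = {}

def polite_get(url, delay=1.0):
    if url in cache:                    # serve repeated requests from the cache
        return cache[url]
    if not robots.can_fetch('*', url):  # respect robots.txt
        return None
    time.sleep(delay)                   # slow down the scraping speed
    response = requests.get(url)
    cache[url] = response
    return response

print(polite_get('https://crawler-test.com/'))
```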
How to Parse HTML with BeautifulSoup
Follow these steps to parse HTML in BeautifulSoup:
- Install BeautifulSoup. Use pip to install BeautifulSoup:

```
$ pip install beautifulsoup4
```

- Import the BeautifulSoup library in Python. To import BeautifulSoup, import the BeautifulSoup class from the bs4 library:

```python
from bs4 import BeautifulSoup
```

- Parse the HTML. To parse the HTML, create a BeautifulSoup object and add the HTML to be parsed as a required argument. The soup object will be a parsed version of the HTML:

```python
soup = BeautifulSoup("<p>your HTML</p>", 'html.parser')
```

- Use BeautifulSoup's object methods to pull information from the HTML. The BeautifulSoup library has many built-in methods to extract data from the HTML. Use methods like soup.find() or soup.find_all() to extract specific elements from the parsed HTML.
Parsing Your First HTML with BeautifulSoup
The first step of this BeautifulSoup tutorial is to parse this simple HTML.
Example HTML to be Parsed
```python
html = '''
<html>
<head>
    <title>Simple SEO Title</title>
    <meta name="description" content="Meta Description with less than 300 characters.">
    <meta name="robots" content="noindex, nofollow">
    <link rel="alternate" href="https://www.example.com/en" hreflang="en-ca">
    <link rel="canonical" href="https://www.example.com/fr">
</head>
<body>
    <header>
        <div class="nav">
            <ul>
                <li class="home"><a href="#">Home</a></li>
                <li class="blog"><a class="active" href="#">Blog</a></li>
                <li class="about"><a href="#">About</a></li>
                <li class="contact"><a href="#">Contact</a></li>
            </ul>
        </div>
    </header>
    <div class="body">
        <h1>Blog</h1>
        <p>Lorem ipsum dolor <a href="#">Anchor Text Link</a> sit amet consectetur adipisicing elit. Ipsum vel laudantium a voluptas labore. Dolorum modi doloremque, dolore molestias quos nam a laboriosam neque asperiores fugit sed aut optio earum!</p>
        <h2>Subtitle</h2>
        <p>Lorem ipsum dolor sit amet consectetur adipisicing elit. Ipsum vel laudantium a voluptas labore. Dolorum modi doloremque, dolore molestias quos nam a <a href="#" rel="nofollow">Nofollow link</a> laboriosam neque asperiores fugit sed aut optio earum!</p>
    </div>
</body>
</html>'''
```
The HTML variable that we just created is similar to the output that we would get when scraping a web page. This is HTML, but stored as text.
This is not very useful as it is hard to search within it.
You could use regular expressions to parse the text content, but a better way is available: parsing with BeautifulSoup.
Parsing the HTML
To parse HTML with BeautifulSoup, call the BeautifulSoup constructor, passing the HTML to be parsed as a required argument and the name of the parser as an optional argument.
```python
# Parsing an HTML string
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
```
This returns the parsed HTML into a BeautifulSoup object.
We then retrieve any HTML element (the title tag in this case) from the BeautifulSoup object with the soup.find() method.
```python
print(soup.find('title'))
```

```
<title>Simple SEO Title</title>
```
Parse a Local HTML File with BeautifulSoup
If you have an HTML file saved somewhere on your computer, you can also parse a local HTML File with BeautifulSoup.
```python
# Parsing a local HTML file
from bs4 import BeautifulSoup

with open('/path/to/file.html') as f:
    soup = BeautifulSoup(f, 'html.parser')

print(soup)
```
Parsing a Web page with BeautifulSoup
To parse a web page with BeautifulSoup, first fetch the HTML of the page, for example with the Python requests library. You can access the HTML of a page in many different ways: HTTP requests, browser-based applications, or manually downloading it from the web browser.
So far, we have seen how to parse a local HTML file which is not really web scraping… yet.
Now we will fetch a web page to extract its HTML and then parse the content with BeautifulSoup.
Practice Web Scraping on Crawler-test.com
For this tutorial, we will practice web scraping with BeautifulSoup on a toy website that was created for that purpose: crawler-test.com.
Extract Content From a Web Page with Requests
To extract content from a web page, make an HTTP request to a URL using the Python requests library.
```python
# Making an HTTP request
import requests

url = 'https://crawler-test.com/'
response = requests.get(url)

print('Status code:', response.status_code)
print('Text:', response.text[:50])
```
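Before parsing, it can be worth verifying that the request actually succeeded. As a minimal sketch (not part of the original example), requests can raise an exception on error responses:

```python
# Stop early on 4xx/5xx responses instead of parsing an error page
response = requests.get('https://crawler-test.com/')
response.raise_for_status()
html_text = response.text
```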
Parsing the Response with BeautifulSoup
To parse the response with BeautifulSoup, add the retrieved HTML as an argument of the BeautifulSoup constructor.
When fetching a web page with requests, a response object is returned. From that response, you can retrieve the HTML of the page, either as Unicode text (response.text) or as bytes (response.content).
We have already seen how to parse this textual representation of the HTML with BeautifulSoup. To do so, all we need is to pass response.text to the BeautifulSoup class.
```python
from bs4 import BeautifulSoup

# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Extract any HTML tag
soup.find('title')
```

```
<title>Crawler Test Site</title>
```
How to Extract HTML Tags with BeautifulSoup
Use BeautifulSoup's find() and find_all() methods to extract HTML tags from the parsed HTML.
Some of the very common HTML tags that you will want to scrape are the title, the h1 and the links.
Find Elements by Tag Name
To find an HTML element by its tag name in BeautifulSoup, pass the tag name as the first argument of the find() or find_all() method.
Here is an example of how you can extract these tags with bs4.
```python
# Extracting HTML tags
title = soup.find('title')
h1 = soup.find('h1')
links = soup.find_all('a', href=True)

# Print the outputs
print('Title:', title)
print('h1:', h1)
print('Example link:', links[1]['href'])
```

```
Title: <title>Crawler Test Site</title>
h1: <h1>Crawler Test Site</h1>
Example link: /mobile/separate_desktop
```
Find Elements by ID
To find an HTML element by its ID in BeautifulSoup, pass the id name to the id parameter of the soup object's find() method.
```python
soup.find(id="header")
```

```
<div id="header">
    <a href="/" id="logo">Crawler Test <span class="neon-effect">two point oh!</span></a>
    <div style="position:absolute;right:520px;top:-4px;"></div>
</div>
```
Find Elements by HTML Class Name
To find an HTML element by its class in BeautifulSoup, pass a dictionary as an argument of the soup object's find_all() method.
```python
soup.find_all('div', {'class': 'panel-header'})[0]
```

```
<div class="panel-header">
    <h3>Mobile</h3>
</div>
```
You can then loop through the returned elements to do whatever you want.
```python
# Find elements by class
boxes = soup.find_all('div', {'class': 'panel'})

box_names = []
for box in boxes:
    title = box.find('h3')
    box_names.append(title.text)

box_names[:5]
```

```
['Mobile', 'Description Tags', 'Encoding', 'Titles', 'Robots Protocol']
```
How to Extract Text From HTML Elements
To extract text from an HTML element using BeautifulSoup, use the .text attribute on the soup object. If the object is a list (e.g. found using find_all()), use a for loop to iterate over the elements and read the .text attribute on each one.
```python
logo = soup.find('a', {'id': 'logo'})
logo.text
```

```
'Crawler Test two point oh!'
```
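To illustrate the list case mentioned above, here is a minimal sketch that loops over the h3 headings of the same page and reads the text of each one:

```python
# Extract the text of every h3 element returned by find_all()
for h3 in soup.find_all('h3')[:5]:
    print(h3.text.strip())
```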
How to Extract HTML Tags with Attributes in Bs4
To extract HTML tags using their attributes, pass a dictionary to the attrs parameter of the find() method.
Example attributes are the name attribute used in the meta description, or the href attribute used in a hyperlink.
Some HTML elements require you to select them by their attributes. Below, we will parse the meta description and meta robots tags using their name attributes.
```python
# Parsing using HTML tag attributes
description = soup.find('meta', attrs={'name': 'description'})
meta_robots = soup.find('meta', attrs={'name': 'robots'})

print('description:', description)
print('meta robots:', meta_robots)
```

```
description: <meta content="Default description XIbwNE7SSUJciq0/Jyty" name="description"/>
meta robots: None
```
Here, the meta robots element was not available, and thus returned None.
Let’s see how to take care of unavailable tags.
Parsing Unavailable Tags
Below, we will scrape the description, the meta robots and the canonical tags from the web page. If they are not available, we will return an empty string.
```python
# Conditional parsing
canonical = soup.find('link', {'rel': 'canonical'})

# Extract the attribute if the element was found
description = description['content'] if description else ''
meta_robots = meta_robots['content'] if meta_robots else ''
canonical = canonical['href'] if canonical else ''

# Print output
print('description:', description)
print('meta_robots:', meta_robots)
print('canonical:', canonical)
```

```
description: Default description XIbwNE7SSUJciq0/Jyty
meta_robots: 
canonical: 
```
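If you repeat this pattern often, you can wrap it in a small helper. Here is a minimal sketch; the helper name get_attr is hypothetical, not part of BeautifulSoup:

```python
# Hypothetical helper: return an attribute value, or '' when the
# element or the attribute is missing
def get_attr(soup, tag, match, attr):
    element = soup.find(tag, match)
    if element and element.has_attr(attr):
        return element[attr]
    return ''

print(get_attr(soup, 'meta', {'name': 'description'}, 'content'))
print(get_attr(soup, 'link', {'rel': 'canonical'}, 'href'))
```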
How to Extract all Links on a Web Page
To extract all the links of a web page with BeautifulSoup, use the soup.find_all() method with the "a" tag as its argument. Set the href parameter to True.
```python
soup.find_all('a', href=True)
```
Extracting links is one of the most important tasks in web scraping, so now we will learn how to do it.
In addition, we will learn how to use the urljoin function from urllib.parse to overcome one of the most common challenges of building a web crawler: taking care of relative and absolute URLs.
```python
# Extract all links on the page
from urllib.parse import urlparse, urljoin

url = 'https://crawler-test.com/'

# Parse URL
parsed_url = urlparse(url)
domain = parsed_url.scheme + '://' + parsed_url.netloc
print('Domain root is:', domain)

# Get href from all links
links = []
for link in soup.find_all('a', href=True):
    # Join domain to path
    full_url = urljoin(domain, link['href'])
    links.append(full_url)

# Print output
print('Top 5 links are:\n', links[:5])
```

```
Domain root is: https://crawler-test.com
Top 5 links are:
 ['https://crawler-test.com/', 'https://crawler-test.com/mobile/separate_desktop', 'https://crawler-test.com/mobile/desktop_with_AMP_as_mobile', 'https://crawler-test.com/mobile/separate_desktop_with_different_h1', 'https://crawler-test.com/mobile/separate_desktop_with_different_title']
```
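As an illustrative follow-up (not part of the original example), a crawler usually also deduplicates the links and keeps only internal ones. This sketch reuses the links list and parsed_url from the code above:

```python
# Keep only unique links that belong to the same domain
internal_links = {link for link in links
                  if urlparse(link).netloc == parsed_url.netloc}
print(len(internal_links), 'unique internal links')
```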
Extract Element on a Specific Part of the Page
To extract an element from a specific part of a page (e.g. using its id), assign the result of a soup.find() to another variable and use one of bs4's built-in methods on that object.
```python
# Get div that contains a specific ID
subset = soup.find('div', {'id': 'idname'})

# Find all p inside that div
subset.find_all('p')
```
Example:
```python
# Extract all links inside the header div
from urllib.parse import urljoin

domain = 'https://crawler-test.com/'

# Get div that contains a specific ID
menu = soup.find('div', {'id': 'header'})

# Find all links within the div
menu_links = menu.find_all('a', href=True)

# Print output
for link in menu_links[:5]:
    print(link['href'])
    print(urljoin(domain, link['href']))
```

```
/
https://crawler-test.com/
```
Remove Specific HTML Tags with BeautifulSoup
To remove a specific HTML tag from an HTML document with BeautifulSoup, use the decompose() method.
```python
soup.tag_name.decompose()
```
Example:
```python
# Example paragraph from Wikipedia
wikipedia_text = '''
<p>In the United States, website owners can use three major <a href="/wiki/Cause_of_action" title="Cause of action">legal claims</a> to prevent undesired web scraping: (1) copyright infringement (compilation), (2) violation of the <a href="/wiki/Computer_Fraud_and_Abuse_Act" title="Computer Fraud and Abuse Act">Computer Fraud and Abuse Act</a> ("CFAA"), and (3) <a href="/wiki/Trespass_to_chattels" title="Trespass to chattels">trespass to chattel</a>.<sup id="cite_ref-6" class="reference"><a href="#cite_note-6">[6]</a></sup> However, the effectiveness of these claims relies upon meeting various criteria...</p>'''

# Parse HTML
wiki_soup = BeautifulSoup(wikipedia_text, 'html.parser')

# Get first paragraph
par = wiki_soup.find_all('p')[0]

# Get all links
par.find_all('a')
```
```python
# Remove the reference tag from the Wikipedia paragraph
par.sup.decompose()
par.find_all('a')
```
Find Elements by Text Content
To find elements in the HTML using textual content, add the text to be matched as the value of the string parameter inside the find_all() method.
```python
# Getting elements using text
soup = BeautifulSoup(response.text, 'html.parser')
soup.find_all('a', string="Description Tag Too Long")
```

```
[<a href="/description_tags/description_over_max">Description Tag Too Long</a>]
```
The problem with this is that the string has to be an exact match. Thus, matching with a partial string would return nothing in BeautifulSoup:
```python
# Getting elements using a partial string
soup.find_all('a', string="Description")
```

```
[]
```
The workaround to this string matching issue is to apply a function or to use regular expressions with the string parameter. We will cover these two situations next.
Apply a Function to a BeautifulSoup Method
To apply a function inside of the BeautifulSoup method, add the function to the string parameter of the find_all() method.
```python
# Apply a function to BeautifulSoup
def find_a_string(value):
    return lambda text: value in text

soup.find_all(string=find_a_string('Description Tag'))
```

```
['Description Tags',
 'Description Tag With Whitespace',
 'Description Tag Missing',
 'Description Tag Missing With Meta Nosnippet',
 'Description Tag Duplicate',
 'Description Tag Duplicate',
 'Noindex and Description Tag Duplicate',
 'Noindex and Description Tag Duplicate',
 'Description Tag Too Long']
```
Parse HTML Page with Regular Expressions in BeautifulSoup
To use regular expressions to parse an HTML page with BeautifulSoup, import the re module and assign a re.compile() object to the string parameter of the find_all() method.
```python
from bs4 import BeautifulSoup
import re

# Parse using a regular expression
soup = BeautifulSoup(response.text, 'html.parser')
soup.find_all(string=re.compile('Description Tag'))
```

```
['Description Tags',
 'Description Tag With Whitespace',
 'Description Tag Missing',
 'Description Tag Missing With Meta Nosnippet',
 'Description Tag Duplicate',
 'Description Tag Duplicate',
 'Noindex and Description Tag Duplicate',
 'Noindex and Description Tag Duplicate',
 'Description Tag Too Long']
```
How to Use XPath with BeautifulSoup (lxml)
To use XPath to extract elements from an HTML document, you need to install the lxml Python library, as BeautifulSoup itself does not support XPath expressions.
```python
from lxml import html

# Parse HTML with XPath
content = html.fromstring(response.content)
panels = content.xpath('//*[@class="panel-header"]')

# Get text from tags
[panel.find('h3').text for panel in panels][:5]
```
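As a small variation on the example above, XPath can also return the text nodes directly with a text() selector, which skips the extra find() call:

```python
# Extract the h3 text directly with an XPath text() selector
content.xpath('//*[@class="panel-header"]/h3/text()')[:5]
```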
Find Parents, Children and Siblings of an HTML element
BeautifulSoup returns a parse tree that you can move through, from parent to child to sibling, to find the elements that you want.
Find Parent(s) of an HTML Element
To find a single parent of an HTML element, use the find_parent() method, which will show you the element that you are looking for as well as its parent in the HTML tree.
```python
a_child = soup.find_all('a')[0]
a_child.find_parent()
```

```
<div id="header">
    <a href="/" id="logo">Crawler Test <span class="neon-effect">two point oh!</span></a>
    <div style="position:absolute;right:520px;top:-4px;"></div>
</div>
```
To find all the parents of an HTML element, use the find_parents() method on the element.
```python
a_child.find_parents()
```
Find Child(ren) of an HTML Element
To find a single child of an HTML element, use the findChild() method, which will show you the child of the element in the HTML tree.
```python
a_child = soup.find_all('a')[0]
a_child.findChild()
```

```
<span class="neon-effect">two point oh!</span>
```
To find all the children of an HTML element, use the findChildren() method on the element.
```python
a_child.findChildren()

# or
list(a_child.children)
```
To find all the descendants of the HTML element, use the descendants attribute:
```python
list(a_child.descendants)
```
Find Sibling(s) of an HTML Element
To find the next sibling after an HTML element, use the find_next_sibling() method, which will show you the next sibling in the HTML tree.
```python
a_child = soup.find_all('a')[0]
a_child.find_next_sibling()
a_child.find_next_siblings()
```

```
<div style="position:absolute;right:520px;top:-4px;"></div>
```
To see where this result comes from, take the parent element:
```python
a_parent = soup.find('div', {'id': 'header'})
a_parent
```
You get the HTML of the parent element, and you can see that the next sibling of the <a> tag is the second div inside it, which matches the result shown before.

```
<div id="header">
    <a href="/" id="logo">Crawler Test <span class="neon-effect">two point oh!</span></a>
    <div style="position:absolute;right:520px;top:-4px;"></div>
</div>
```
Similarly, you can find the previous sibling(s) too.
```python
a_child.find_previous_sibling()
a_child.find_previous_siblings()
```
Fix Broken HTML with BeautifulSoup
With BeautifulSoup, you can take broken HTML and complete the missing parts using the prettify() method on the BeautifulSoup object.
```python
# Fix broken HTML with BeautifulSoup
from bs4 import BeautifulSoup

# The output below matches the lxml parser, which adds the
# missing html and body tags (requires: pip install lxml)
soup = BeautifulSoup("<p>Incomplete Broken <span>HTML<</p>", 'lxml')
print(soup.prettify())
```

```
<html>
 <body>
  <p>
   Incomplete Broken
   <span>
    HTML
   </span>
  </p>
 </body>
</html>
```
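How the missing parts get filled in depends on the parser. Here is a minimal sketch, assuming lxml and html5lib are installed (pip install lxml html5lib), that compares the three common parsers on the same broken HTML:

```python
# Compare how each parser repairs the same broken HTML
from bs4 import BeautifulSoup

broken = "<p>Incomplete Broken <span>HTML<</p>"

for parser in ['html.parser', 'lxml', 'html5lib']:
    print('---', parser, '---')
    print(BeautifulSoup(broken, parser).prettify())
```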
Articles Related to Web Scraping
Web Scraping with BeautifulSoup (in Python)
What is Web Scraping and How to Do it (with Examples)?
Scrapy: Web Scraping in Python (With Examples)
Python Requests Library (Examples and Video)
Web Scraping With Python and Requests-HTML
Random User-Agent With Python and BeautifulSoup (by JR Oakes)
Web Scraping and Automation With Selenium and Python
How to Scrape Google Using XPath
Conclusion
We have covered everything that you need to know about web scraping with BeautifulSoup. Good luck!