Web scraping with BeautifulSoup

Today, you will learn how to do web scraping with BeautifulSoup: how to use the requests library to fetch web pages, and the BeautifulSoup library to parse their HTML in Python.

In this tutorial, you will learn:

  • The basics of web scraping and parsing
  • How to install BeautifulSoup
  • How to parse a local HTML file
  • How to use Requests and BeautifulSoup to scrape and parse web pages
  • The methods you will need to find elements inside the HTML
  • The filters you can use to improve matching inside the HTML


1 What is Web Scraping

2 What is BeautifulSoup

3 What is Parsing in Web Scraping?

4 Getting Started with BeautifulSoup

4.1 How to Install BeautifulSoup

4.2 Simple Parsing with BeautifulSoup

5 Web Scraping Best Practices

6 How to Parse HTML with BeautifulSoup

7 Parsing Your First HTML with BeautifulSoup

7.1 Example HTML to be Parsed

7.2 Parsing the HTML

8 Parse a Local HTML File with BeautifulSoup

9 Parsing a Web page with BeautifulSoup

9.1 Practice Web Scraping on Crawler-test.com

9.2 Extract Content From a Web Page with Requests

9.3 Parsing the Response with BeautifulSoup

10 How to Extract HTML Tags with BeautifulSoup

10.1 Find Elements by Tag Name

10.2 Find Elements by ID

10.3 Find Elements by HTML Class Name

11 How to Extract Text From HTML Elements

12 How to Extract HTML Tags with Attributes in Bs4

13 Parsing Unavailable Tags

14 How to Extract all Links on a Web Page

15 Extract Element on a Specific Part of the Page

16 Remove Specific HTML Tags with BeautifulSoup

17 Find Elements by Text Content

18 Apply a Function to a BeautifulSoup Method

19 Parse HTML Page with Regular Expressions in BeautifulSoup

20 How to Use XPath with BeautifulSoup (lxml)

21 Find Parents, Children and Siblings of an HTML element

21.1 Find Parent(s) of an HTML Element

21.2 Find Child(ren) of an HTML Element

21.3 Find Sibling(s) of an HTML Element

22 Fix Broken HTML with BeautifulSoup

23 Articles Related to Web Scraping

23.1 Web Scraping with BeautifulSoup (in Python)

23.2 What is Web Scraping and How to Do it (with Examples)?

23.3 Scrapy: Web Scraping in Python (With Examples)

23.4 Python Requests Library (Examples and Video)

23.5 Web Scraping With Python and Requests-HTML

23.6 Random User-Agent With Python and BeautifulSoup (by JR Oakes)

23.7 Web Scraping and Automation With Selenium and Python

23.8 How to Scrape Google Using XPath

24 Conclusion

What is Web Scraping

Web scraping is the process of using a bot to extract data from a website and export it into a digestible format. A web scraper extracts the HTML code from a web page, which is then parsed to extract valuable information.


What is BeautifulSoup

BeautifulSoup is a parsing library in Python that is used to extract information from HTML or XML documents. The BeautifulSoup parser provides Python idioms to search and modify the parse tree.

Simply put, BeautifulSoup is the library that allows you to format the HTML in a usable way and extract elements from it.

What is Parsing in Web Scraping?

Parsing in web scraping is the process of transforming unstructured data into a structured format (e.g. parse tree) that is easier to read, use and extract data from.

Basically, parsing means splitting a document into usable chunks.

Getting Started with BeautifulSoup

How to Install BeautifulSoup

Use pip to install BeautifulSoup in Python.

$ pip install beautifulsoup4
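To confirm that the installation worked, you can print the installed version of the package (bs4 exposes a __version__ attribute):

$ python -c "import bs4; print(bs4.__version__)"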

Simple Parsing with BeautifulSoup

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I am learning <span>BeautifulSoup</span></p>", 'html.parser')
soup.find('span')

<span>BeautifulSoup</span>

Web Scraping Best Practices

  1. Use toy websites to practice
  2. Use APIs instead of web scraping when they are available
  3. Check whether the data is available elsewhere first (e.g. Common Crawl)
  4. Respect robots.txt
  5. Slow down your scraping speed
  6. Cache your requests (points 4 to 6 are illustrated in the sketch after this list)
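Here is a minimal sketch of what points 4 to 6 can look like in Python. It uses urllib.robotparser from the standard library; the target URL, the one-second delay and the requests-cache suggestion are illustrative assumptions, not requirements.

# Politeness sketch: check robots.txt, throttle requests, and (optionally) cache
import time
import requests
from urllib.robotparser import RobotFileParser

url = 'https://crawler-test.com/'

# 4. Respect robots.txt before fetching
robots = RobotFileParser()
robots.set_url(url + 'robots.txt')
robots.read()

if robots.can_fetch('*', url):
    response = requests.get(url)
    # 5. Slow down: pause between consecutive requests
    time.sleep(1)

# 6. Cache your requests, e.g. with the third-party requests-cache package:
# import requests_cache
# requests_cache.install_cache('scraping_cache')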

How to Parse HTML with BeautifulSoup

Follow these steps to parse HTML in BeautifulSoup:

  1. Install BeautifulSoup

     Use pip to install BeautifulSoup:

     $ pip install beautifulsoup4

  2. Import the BeautifulSoup library in Python

     To import BeautifulSoup in Python, import the BeautifulSoup class from the bs4 library:

     from bs4 import BeautifulSoup

  3. Parse the HTML

     To parse the HTML, create a BeautifulSoup object and add the HTML to be parsed as a required argument. The soup object will be a parsed version of the HTML:

     soup = BeautifulSoup("<p>your HTML</p>")

  4. Use BeautifulSoup’s object methods to pull information from the HTML

     The BeautifulSoup library has many built-in methods to extract data from the HTML. Use methods like soup.find() or soup.find_all() to extract specific elements from the parsed HTML.

Parsing Your First HTML with BeautifulSoup

Let's start this BeautifulSoup tutorial by parsing the simple HTML below.

Example HTML to be Parsed

html = '''
<html>
    <head>
        <title>Simple SEO Title</title>
        <meta name="description" content="Meta Description with less than 300 characters.">
        <meta name="robots" content="noindex, nofollow">
        <link rel="alternate" href="https://www.example.com/en" hreflang="en-ca">
        <link rel="canonical" href="https://www.example.com/fr">
    </head>
    <body>
        <header>
            <div class="nav">
                <ul>
                    <li class="home"><a href="#">Home</a></li>
                    <li class="blog"><a class="active" href="#">Blog</a></li>
                    <li class="about"><a href="#">About</a></li>
                    <li class="contact"><a href="#">Contact</a></li>
                </ul>
            </div>
        </header>
        <div class="body">
            <h1>Blog</h1>
            <p>Lorem ipsum dolor <a href="#">Anchor Text Link</a> sit amet consectetur adipisicing elit. Ipsum vel laudantium a voluptas labore. Dolorum modi doloremque, dolore molestias quos nam a laboriosam neque asperiores fugit sed aut optio earum!</p>
            <h2>Subtitle</h2>
            <p>Lorem ipsum dolor sit amet consectetur adipisicing elit. Ipsum vel laudantium a voluptas labore. Dolorum modi doloremque, dolore molestias quos nam a <a href="#" rel="nofollow">Nofollow link</a> laboriosam neque asperiores fugit sed aut optio earum!</p>
        </div>
    </body>
</html>
'''

The html variable that we just created is similar to the output we would get when scraping a web page: it is HTML, but stored as text.

Stored as text, it is not very useful, because it is hard to search within it.

You could use regular expressions to parse the text content, but a better way is available: parsing with BeautifulSoup.

Parsing the HTML

To parse HTML with BeautifulSoup, call the BeautifulSoup constructor with the HTML to be parsed as a required argument, and the name of the parser as an optional argument.

# Parsing the HTML
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

This returns a BeautifulSoup object containing the parsed HTML.

We then retrieve any HTML element (the title tag in this case) from the BeautifulSoup object with the soup.find() method.

print(soup.find('title'))

<title>Simple SEO Title</title>

Parse a Local HTML File with BeautifulSoup

If you have an HTML file saved somewhere on your computer, you can also parse a local HTML file with BeautifulSoup.

# Parsing a local HTML file
from bs4 import BeautifulSoup

with open('/path/to/file.html') as f:
    soup = BeautifulSoup(f, 'html.parser')

print(soup)

Parsing a Web page with BeautifulSoup

To parse a web page with BeautifulSoup, first fetch the HTML of the page, typically using the Python requests library. You can get the HTML of a page in several ways: an HTTP request, a browser-based application, or a manual download from the web browser.

So far, we have seen how to parse a local HTML file, which is not really web scraping… yet.

Now we will fetch a web page to extract its HTML and then parse the content with BeautifulSoup.

Practice Web Scraping on Crawler-test.com

For this tutorial, we will practice web scraping with BeautifulSoup on a toy website that was created for that purpose: crawler-test.com.

Extract Content From a Web Page with Requests

To extract content from a web page, make an HTTP request to a URL using the Python requests library.

# Making an HTTP request
import requests

url = 'https://crawler-test.com/'
response = requests.get(url)

print('Status code: ', response.status_code)
print('Text: ', response.text[:50])
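As a side note, if you want your script to fail fast on HTTP errors (4xx or 5xx status codes) instead of silently parsing an error page, the requests library provides raise_for_status():

# Raise requests.exceptions.HTTPError on 4xx/5xx responses
response.raise_for_status()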

Parsing the Response with BeautifulSoup

To parse the response with BeautifulSoup, add the retrieved HTML as an argument of the BeautifulSoup constructor.

When fetching a web page with requests, a response object is returned. From that response, you can retrieve the HTML of the page, either as Unicode text (response.text) or as bytes (response.content).

We have already seen how to parse this textual representation of the HTML with BeautifulSoup. All we need to do is pass response.text to the BeautifulSoup constructor.

from bs4 import BeautifulSoup

# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Extract any HTML tag
soup.find('title')

<title>Crawler Test Site</title>

How to Extract HTML Tags with BeautifulSoup

Use BeautifulSoup’s find() and find_all() methods to extract HTML tags from the parsed HTML.

Some of the very common HTML tags that you will want to scrape are the title, the h1 and the links.

Find Elements by Tag Name

To find an HTML element by its tag name in BeautifulSoup, pass the tag name as the first argument of the find() or find_all() method.

Here is an example of how you can extract these tags with bs4.

# Extracting HTML tags
title = soup.find('title')
h1 = soup.find('h1')
links = soup.find_all('a', href=True)

# Print the outputs
print('Title: ', title)
print('h1: ', h1)
print('Example link: ', links[1]['href'])

Title:  <title>Crawler Test Site</title>
h1:  <h1>Crawler Test Site</h1>
Example link:  /mobile/separate_desktop

Find Elements by ID

To find an HTML element by its ID in BeautifulSoup, pass the id name to the id parameter of the soup object’s find() method.

soup.find(id="header")

<div id="header">
<a href="/" id="logo">Crawler Test <span class="neon-effect">two point oh!</span></a>
<div style="position:absolute;right:520px;top:-4px;"></div>
</div>

Find Elements by HTML Class Name

To find an HTML element by its class in BeautifulSoup, pass a dictionary as an argument of the soup object’s find_all() method.

soup.find_all('div', {'class': 'panel-header'})[0]

<div class="panel-header">
<h3>Mobile</h3>
</div>

You can then loop over the matched elements to extract whatever you need.

# Find elements by class
boxes = soup.find_all('div', {'class': 'panel'})

box_names = []
for box in boxes:
    title = box.find('h3')
    box_names.append(title.text)

box_names[:5]

['Mobile', 'Description Tags', 'Encoding', 'Titles', 'Robots Protocol']

How to Extract Text From HTML Elements

To extract text from an HTML element using BeautifulSoup, use the .text attribute on the soup object. If the result is a list (e.g. returned by find_all()), loop over the elements and use the .text attribute on each of them.

logo = soup.find('a', {'id': 'logo'})
logo.text

'Crawler Test two point oh!'
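When find_all() returns a list of elements, loop over it and read the .text attribute of each element. A minimal sketch, reusing the soup object parsed earlier:

# Extract text from a list of elements by looping over it
for link in soup.find_all('a')[:5]:
    print(link.text)

Each element also has a get_text() method, which accepts optional separator and strip arguments for more control over the extracted text.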

How to Extract HTML Tags with Attributes in Bs4

To extract HTML tags using their attributes, pass a dictionary to the attrs parameter in the find() method.

Example attributes are the name attribute used in the meta description, or the href attribute used in a hyperlink.

Some HTML elements can only be selected reliably using their attributes.

Below, we will parse the meta description and meta robots tags using their name attributes.

# Parsing using HTML tag attributes
description = soup.find('meta', attrs={'name': 'description'})
meta_robots = soup.find('meta', attrs={'name': 'robots'})

print('description: ', description)
print('meta robots: ', meta_robots)

description:  <meta content="Default description XIbwNE7SSUJciq0/Jyty" name="description"/>
meta robots:  None

Here, the meta robots element was not available, and thus returned None.

Let’s see how to take care of unavailable tags.

Parsing Unavailable Tags

Below, we will scrape the description, the meta robots and the canonical tags from the web page. If they are not available, we will return an empty string.

# Conditional parsing
canonical = soup.find('link', {'rel': 'canonical'})

# Extract the attribute's content if the element was found
description = description['content'] if description else ''
meta_robots = meta_robots['content'] if meta_robots else ''
canonical = canonical['href'] if canonical else ''

# Print output
print('description: ', description)
print('meta_robots: ', meta_robots)
print('canonical: ', canonical)
description:  Default description XIbwNE7SSUJciq0/Jyty
meta_robots:  
canonical:  

How to Extract all Links on a Web Page

To extract all the links of a web page with BeautifulSoup, use the soup.find_all() method with the "a" tag as its argument. Set the href parameter to True.

soup.find_all('a', href=True)

Extracting links from a web page is one of the most important skills in web scraping.

Now, we will learn how to do that.

In addition, we will use the urljoin function from urllib.parse to overcome one of the most common challenges of building a web crawler: handling relative and absolute URLs.

# Extract all links on the page
from urllib.parse import urlparse, urljoin

url = 'https://crawler-test.com/'

# Parse the URL
parsed_url = urlparse(url)
domain = parsed_url.scheme + '://' + parsed_url.netloc
print('Domain root is: ', domain)

# Get the href from all links
links = []
for link in soup.find_all('a', href=True):
    # Join the domain to the path
    full_url = urljoin(domain, link['href'])
    links.append(full_url)

# Print output
print('Top 5 links are :\n', links[:5])

Domain root is:  https://crawler-test.com
Top 5 links are :
 ['https://crawler-test.com/', 'https://crawler-test.com/mobile/separate_desktop', 'https://crawler-test.com/mobile/desktop_with_AMP_as_mobile', 'https://crawler-test.com/mobile/separate_desktop_with_different_h1', 'https://crawler-test.com/mobile/separate_desktop_with_different_title']

Extract Element on a Specific Part of the Page

To extract an element from a specific part of a page (e.g. using its id), assign the result of a soup.find() to another variable and use one of bs4’s built-in methods on that object.

# Get the div that contains a specific ID
subset = soup.find('div', {'id': 'idname'})

# Find all p inside that div
subset.find_all('p')

Example:

# Extract all links in the header menu
from urllib.parse import urljoin

domain = 'https://crawler-test.com/'

# Get the div that contains a specific ID
menu = soup.find('div', {'id': 'header'})

# Find all links within the div
menu_links = menu.find_all('a', href=True)

# Print output
for link in menu_links[:5]:
    print(link['href'])
    print(urljoin(domain, link['href']))

/
https://crawler-test.com/

Remove Specific HTML Tags with BeautifulSoup

To remove a specific HTML tag from an HTML document with BeautifulSoup, use the decompose() method.

soup.tag_name.decompose()

Example:

# Example text (from Wikipedia)
wikipedia_text = '''<p>In the United States, website owners can use three major <a href="/wiki/Cause_of_action" title="Cause of action">legal claims</a> to prevent undesired web scraping: (1) copyright infringement (compilation), (2) violation of the <a href="/wiki/Computer_Fraud_and_Abuse_Act" title="Computer Fraud and Abuse Act">Computer Fraud and Abuse Act</a> ("CFAA"), and (3) <a href="/wiki/Trespass_to_chattels" title="Trespass to chattels">trespass to chattel</a>.<sup id="cite_ref-6" class="reference"><a href="#cite_note-6">[6]</a></sup> However, the effectiveness of these claims relies upon meeting various criteria...</p>'''

# Parse HTML
wiki_soup = BeautifulSoup(wikipedia_text, 'html.parser')

# Get the first paragraph
par = wiki_soup.find_all('p')[0]

# Get all links
par.find_all('a')

# Remove the reference tags from the Wikipedia paragraph
par.sup.decompose()
par.find_all('a')
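Since decompose() removed the <sup> element and the reference link it contained, only the three Wikipedia article links remain in the output.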

Find Elements by Text Content

To find elements in the HTML using textual content, add the text to be matched as the value of the string parameter inside the find_all() method.

# Getting elements using text
soup = BeautifulSoup(response.text, 'html.parser')
soup.find_all('a', string="Description Tag Too Long")

[<a href="/description_tags/description_over_max">Description Tag Too Long</a>]

The problem with this approach is that the string has to be an exact match. Using a partial string will return nothing:

# Getting elements using a partial string
soup.find_all('a', string="Description")

[]

The workaround to this exact-match issue is to apply a function or use a regular expression with the string parameter. We will cover these two situations next.

Apply a Function to a BeautifulSoup Method

To apply a function inside of the BeautifulSoup method, add the function to the string parameter of the find_all() method.

# Apply a function to BeautifulSoup
def find_a_string(value):
    return lambda text: value in text

soup.find_all(string=find_a_string('Description Tag'))

['Description Tags',
 'Description Tag With Whitespace',
 'Description Tag Missing',
 'Description Tag Missing With Meta Nosnippet',
 'Description Tag Duplicate',
 'Description Tag Duplicate',
 'Noindex and Description Tag Duplicate',
 'Noindex and Description Tag Duplicate',
 'Description Tag Too Long']

Parse HTML Page with Regular Expressions in BeautifulSoup

To use regular expressions to parse an HTML page with BeautifulSoup, import the re module, and assign a re.compile() object to the string parameter of the find_all() method.

from bs4 import BeautifulSoup
import re

# Parse using a regular expression
soup = BeautifulSoup(response.text, 'html.parser')
soup.find_all(string=re.compile('Description Tag'))

['Description Tags',
 'Description Tag With Whitespace',
 'Description Tag Missing',
 'Description Tag Missing With Meta Nosnippet',
 'Description Tag Duplicate',
 'Description Tag Duplicate',
 'Noindex and Description Tag Duplicate',
 'Noindex and Description Tag Duplicate',
 'Description Tag Too Long']

How to Use XPath with BeautifulSoup (lxml)

To use XPath to extract elements from an HTML document with BeautifulSoup, you need to install the lxml Python library, as BeautifulSoup itself does not support XPath expressions.

from lxml import html

# Parse HTML with XPath
content = html.fromstring(response.content)
panels = content.xpath('//*[@class="panel-header"]')

# Get the text from the tags
[panel.find('h3').text for panel in panels][:5]

Find Parents, Children and Siblings of an HTML element

BeautifulSoup returns a parse tree that you can traverse, moving through each parent, child and sibling to find the elements that you want.

Find Parent(s) of an HTML Element

To find a single parent of an HTML element, use the find_parent() method, which will show you the element that you are looking for as well as its parent in the HTML tree.

a_child = soup.find_all('a')[0]
a_child.find_parent()

<div id="header">
<a href="/" id="logo">Crawler Test <span class="neon-effect">two point oh!</span></a>
<div style="position:absolute;right:520px;top:-4px;"></div>
</div>

To find all the parents of an HTML element, use the find_parents() method on the BeautifulSoup object.

a_child.find_parents()

Find Child(ren) of an HTML Element

To find a single child of an HTML element, use the findChild() method which will show you the child of the element in the HTML tree.

a_child = soup.find_all('a')[0]
a_child.findChild()

<span class="neon-effect">two point oh!</span>

To find all the children of an HTML element, use the findChildren() method on the BeautifulSoup object.

a_child.findChildren()
# or
list(a_child.children)

To find all the descendants of the HTML element, use the descendants attribute.

list(a_child.descendants)

Find Sibling(s) of an HTML Element

To find the next sibling after an HTML element, use the find_next_sibling() method which will show you the next sibling in the HTML tree.

a_child = soup.find_all('a')[0]
a_child.find_next_sibling()
a_child.find_next_siblings()

<div style="position:absolute;right:520px;top:-4px;"></div>

To see where this result comes from, take the parent element:

a_parent = soup.find('div', {'id': 'header'})
a_parent

You get the HTML of the parent element, and you can see that the next sibling of the <a> tag is the second div, which is the result shown before.

<div id="header">
<a href="/" id="logo">Crawler Test <span class="neon-effect">two point oh!</span></a>
<div style="position:absolute;right:520px;top:-4px;"></div>
</div>

Similarly, you can find the previous sibling(s) too.

a_child.find_previous_sibling()
a_child.find_previous_siblings()

Fix Broken HTML with BeautifulSoup

With BeautifulSoup, you can take broken HTML and complete the missing parts using the prettify() method on the BeautifulSoup object.

# Fix broken HTML with BeautifulSoup
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Incomplete Broken <span>HTML<</p>")
print(soup.prettify())

<html>
 <body>
  <p>
   Incomplete Broken
   <span>
    HTML
   </span>
  </p>
 </body>
</html>
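Note that how the missing parts get completed depends on the parser. As a small illustration (assuming the optional lxml and html5lib parsers are installed), each parser repairs the same broken snippet differently:

# Compare how different parsers repair the same broken HTML
from bs4 import BeautifulSoup

broken = "<p>Incomplete Broken <span>HTML<</p>"
for parser in ['html.parser', 'lxml', 'html5lib']:
    print(parser, '->', BeautifulSoup(broken, parser))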

Articles Related to Web Scraping

Web Scraping with BeautifulSoup (in Python)
What is Web Scraping and How to Do it (with Examples)?
Scrapy: Web Scraping in Python (With Examples)
Python Requests Library (Examples and Video)
Web Scraping With Python and Requests-HTML
Random User-Agent With Python and BeautifulSoup (by JR Oakes)
Web Scraping and Automation With Selenium and Python
How to Scrape Google Using XPath

Conclusion

We have covered everything you need to get started with web scraping using BeautifulSoup. Good luck!