HTML Parser - How to use Python BS4 to work less

Hello Coders,

This article presents a few code snippets written with the BeautifulSoup library for Python, along with open-source web apps that use the components extracted and processed by the parsing code. Thanks for reading!


What is an HTML Parser

According to Wikipedia, parsing or syntactic analysis is the process of analyzing a string of symbols, either in natural language or in computer languages, according to the rules of a formal grammar. Applied here, HTML parsing means loading the HTML, extracting and processing the relevant information (head title, page assets, main sections) and, later on, saving the processed file.


Why build an HTML Parser in 2020

The automated workflow used by the AppSeed platform to generate web apps relies on HTML parsing to handle a list of tasks that usually require manual work (a small sketch of the master-page detection idea follows the list):

  • Components extraction from flat HTML
  • Master pages detection (by comparing the DOM trees of the pages)
  • Hard-coded texts removal
  • Sometimes, assets tuning (CSS compression, JS minification)
  • Export for various template engines: Jinja2, Django native, Blade, Mustache
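
For illustration, here is a minimal sketch of the master-page detection idea: two pages probably share a master page when their top-level DOM structures match. The dom_signature helper and the sample pages are simplified assumptions, not the production code used by the platform.

from bs4 import BeautifulSoup

def dom_signature(html):
    # return the tag names of the direct children of <body>
    soup = BeautifulSoup(html, 'html.parser')
    return [tag.name for tag in soup.body.find_all(recursive=False)]

page_one = '<html><body><nav></nav><main>About</main><footer></footer></body></html>'
page_two = '<html><body><nav></nav><main>Contact</main><footer></footer></body></html>'

# identical signatures -> the pages are likely built on the same master page
print(dom_signature(page_one) == dom_signature(page_two))  # True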

After this phase is completed, the production-ready UI and components are injected into boilerplate code, producing usable web apps coded in many languages and patterns. Using this process, in 2019 more than 170 web apps have been released as open-source projects (on Github).

For more free (and commercial) apps, please access the AppSeed platform. Thank you!


Set up the HTML parsing environment

In order to execute the code snippets presented in this article, we need Python3 installed on the workstation. Please note that Python2 reached end-of-life in January 2020, and it is recommended to move your legacy projects to the latest version.

As mentioned, the code will use the BeautifulSoup parsing library, written in Python. Let's open a terminal and install the magic:

$ # The magic library BeautifulSoup
$ pip install beautifulsoup4 # the real magic is here
$
$ # requests - library to pull HTML from a live website 
$ pip install requests # a library to pull the entire HTML page
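
To verify the setup, we can print the installed versions from Python (a quick sanity check, not part of the parsing workflow itself):

import bs4, requests

# both imports succeed only if the install commands above worked
print('BeautifulSoup version: ' + bs4.__version__)
print('Requests version: ' + requests.__version__)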

The first step in any parsing process is to load the HTML content (aka the DOM tree) and to initialize a BS (BeautifulSoup) object using the string representation of the HTML content. The source of the content might be a physical file on the disk or a LIVE website.

Load the HTML from a file

from bs4 import BeautifulSoup as bs

# Load the HTML content
html_file = open('index.html', 'r')
html_content = html_file.read()
html_file.close() # clean up

# Initialize the BS object
soup = bs(html_content, 'html.parser')
# At this point, we can interact with the HTML 
# elements stored in memory using all helpers offered by BS library
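
The same loading step can also be written with a context manager, which closes the file automatically even if reading fails (an equivalent, slightly more idiomatic variant):

# equivalent variant - the with block closes the file for us
with open('index.html', 'r') as html_file:
    soup = bs(html_file.read(), 'html.parser')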

Load the HTML from a LIVE website

# import libraries
import requests
from bs4 import BeautifulSoup as bs

# define the URL to crawl & parse
# feel free to change this URL with your own app
app_url = 'https://flask-bulma-css.appseed.us/'

# crawling the page. This might take a few seconds
page = requests.get( app_url )

# to check the crawl status, just print the response object:
print(page)  # <Response [200]> means all good

# to print the page contents, type:
print(page.content)

# Initialize the BS object using the crawled content
soup = bs(page.content, 'html.parser')
# At this point, we can interact with the HTML 
# elements stored in memory using all helpers offered by BS library 
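
On real websites the request can fail, so a defensive version might check the response before parsing. A minimal sketch using raise_for_status() from the requests API, reusing app_url and the imports from the snippet above:

# defensive variant - abort on network or HTTP errors
try:
    page = requests.get(app_url, timeout=10)
    page.raise_for_status()  # raises an exception for 4xx / 5xx status codes
except requests.exceptions.RequestException as err:
    print('Crawl failed: ' + str(err))
else:
    soup = bs(page.content, 'html.parser')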

At this moment, the HTML content can be accessed and manipulated through BS objects and helpers. For instance, we can print the page head using one line:

# print the entire page head
print(soup.head)

# print only the title
print(soup.head.title)

# the output
<title>Flask Bulma CSS - BulmaPlay Open-Source App </title>

In a similar way, we can print the page footer:

print(soup.footer)

# to have a nice print of elements, we can use BS prettify() helper
# using prettify(), the output is nicely indented 

print(soup.footer.prettify())

# the output
<footer class="footer footer-dark">
 <div class="container">
  <div class="columns">
   <div class="column">
    <div class="footer-logo">
     <img alt="Footer Logo for BulmaPlay - JAMStack Bulma CSS Web App." src="/static/assets/images/logos/bulmaplay-logo.png"/>
    </div>
....
    </div>
   </div>
  </div>
 </div>
</footer>

An important step in the parsing process is to have access to all assets used in the HTML files (CSS, JS, images). Let's type a few commands and print the JavaScript files, for instance. The HTML code for JS files looks as below:

...
<script type='text/javascript' src='js/bootstrap.js'></script>
<script type='text/javascript' src='js/custom.js'></script>
...

To extract the relevant information from the script nodes, we need just a few lines of parsing code:

# the code - scan the direct children of <body> for script tags
for script in soup.body.find_all('script', recursive=False):
    if script.has_attr('src'):  # skip inline scripts, which have no src
        print(' Js = ' + script['src'])

# the output
 Js = /static/assets/js/jquery.min.js
 Js = /static/assets/js/jquery.lazy.min.js
 Js = /static/assets/js/slick.min.js 

Parsing and printing the CSS files is another easy task to be done using BS. The HTML code to be parsed:

...
<link rel="stylesheet" href="css/bootstrap.min.css">
<link rel="stylesheet" href="css/app.css">
...

And the BS code:

for link in soup.find_all('link'):

   # Print the src attribute
   print(' CSS file = ' + script['href'])
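
The two snippets above scan specific parts of the tree; for a complete asset inventory we can scan the whole document in one pass. A minimal sketch, reusing the soup object built earlier (the src=True and href=True filters skip inline scripts and link tags without an href):

# collect all assets referenced by the page, grouped by type
assets = {
    'js' : [s['src']  for s in soup.find_all('script', src=True)],
    'css': [l['href'] for l in soup.find_all('link', rel='stylesheet')],
    'img': [i['src']  for i in soup.find_all('img', src=True)],
}

for kind, files in assets.items():
    print(kind + ' -> ' + str(files))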

Locate an element by ID

Sometimes we need to locate elements based on id, and this can be easily done with a single line of code. Sample HTML:

...
<div id="1234" class="handsome">
Some text
</div>
...

The parsing code:

mydiv = soup.find("div", {"id": "1234"})

print(mydiv) 

# Useless element? 
# We can remove the element from the DOM with a single line of code
mydiv.decompose()
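
Once the DOM is edited, the last step mentioned in the introduction is to save the processed file. A minimal sketch (the output filename is an assumption):

# persist the processed DOM to disk
with open('index-processed.html', 'w') as output_file:
    output_file.write(soup.prettify())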

Need to print all the links? No problem, just run the following snippet:

# the code
for elem in soup.body.footer.find_all('a'):
    print(' footer href = ' + elem['href'])

# the output
 footer href = https://bulma.io
 footer href = https://github.com/app-generator/flask-bulma-css
 footer href = https://appseed.us/apps/bulma-css?flask-bulma-css
 footer href = https://blog.appseed.us/tag/bulma-css
 footer href = https://absurd.design/
 footer href = https://github.com/cssninjaStudio/fresh

HTML parsing, implemented in the right way, can help us cut the manual work involved in many tasks (a small example follows the list):

  • Flat HTML processing: components extraction, conversion to production-ready templates (Jinja, Blade, Mustache, Nunjucks)
  • Hard-coded text removal, form processing (add, remove or edit fields)
  • Extract information from LIVE websites (links and other information)
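
As a small example for the first two bullets, the sketch below replaces a hard-coded heading with a Jinja2 placeholder (the h1 tag and the page_title variable are illustrative assumptions):

# replace a hard-coded text node with a Jinja2 placeholder
title_tag = soup.find('h1')
if title_tag:
    title_tag.string = '{{ page_title }}'  # rendered later by the template engine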

