Developer Tools - Open-Source HTML Parser

Thank you for landing on this page. This article presents a few practical code snippets for processing and manipulating HTML loaded from a file or crawled from a live website.

What is an HTML Parser

According to Wikipedia, parsing (or syntactic analysis) is the process of analyzing a string of symbols, either in natural language or in computer languages, according to the rules of a formal grammar. In this article, HTML parsing means loading the HTML, extracting and processing the relevant information (head title, page assets, main sections) and, later on, saving the processed file.
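
As a rough illustration of that load, extract, process, save cycle, here is a minimal sketch. The inline HTML string and the output file name `processed.html` are made up for the example; they are not taken from the repository.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical input, used only for illustration
html = "<html><head><title>Old title</title></head><body><p>Hi</p></body></html>"

# 1. load the HTML into a BeautifulSoup object
soup = BeautifulSoup(html, "html.parser")

# 2. extract a piece of information (the page title)
title = str(soup.title.string)

# 3. process: replace the title text
soup.title.string = "New title"

# 4. save the processed document (into a temporary folder here)
out_path = os.path.join(tempfile.mkdtemp(), "processed.html")
with open(out_path, "w", encoding="utf-8") as out_file:
    out_file.write(str(soup))
```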


Problems this tool solves

  • update the HTML files to be production-ready: check for missing images, uncompressed CSS.
  • extract components from HTML pages
  • export components for various template engines: Jinja, Blade, PUG
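
The first item, checking for missing images, could look roughly like the sketch below. The helper name `find_missing_images` and the idea of resolving each `src` against a local root folder are assumptions made for this example, not the repository's actual implementation.

```python
import os
import tempfile

from bs4 import BeautifulSoup


def find_missing_images(html, root_dir):
    """Return the src values of <img> tags whose files are absent under root_dir.

    Hypothetical helper: remote URLs and data URIs are skipped.
    """
    soup = BeautifulSoup(html, "html.parser")
    missing = []
    for img in soup.find_all("img", src=True):
        src = img["src"]
        if src.startswith(("http://", "https://", "//", "data:")):
            continue
        local_path = os.path.join(root_dir, src.lstrip("/"))
        if not os.path.isfile(local_path):
            missing.append(src)
    return missing


# Small self-contained demo using a temporary folder
demo_root = tempfile.mkdtemp()
open(os.path.join(demo_root, "logo.png"), "wb").close()  # create an empty file

demo_html = '<body><img src="logo.png"><img src="/img/gone.png"></body>'
missing_images = find_missing_images(demo_html, demo_root)
print(missing_images)
```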

HTML Parser - code snippets

All the source code can be found on the HTML parser repository (MIT license)

Parser Environment

The code uses the BeautifulSoup library, the well-known parsing library written in Python. To start coding, we need a few modules installed on our system.

$ # BeautifulSoup - the parsing library (the PyPI package is named beautifulsoup4)
$ pip install beautifulsoup4 # the real magic is here

$ # requests - library to pull HTML from a live website
$ pip install requests

$ # ipython - optional but useful Python terminal
$ pip install ipython # the console where we execute the code

Load the HTML content

To start parsing, we need to load the HTML from somewhere and initialize a BeautifulSoup object with that content.

Loading the HTML data from a file

from bs4 import BeautifulSoup as bs

# Load the HTML content (assuming the file is UTF-8 encoded);
# the with block closes the file for us
with open('index.html', 'r', encoding='utf-8') as html_file:
    html_content = html_file.read()

# Initialize the BS object
soup  = bs(html_content,'html.parser') 
# At this point, we can interact with the HTML 
# elements stored in memory using all helpers offered by BS library

Loading the HTML from a LIVE website

# import libraries
import requests
from bs4 import BeautifulSoup as bs

# define the URL to crawl & parse
# feel free to change this URL with your own app
app_url = 'https://flask-bulma-css.appseed.us/'

# crawling the page. This might take a few seconds
page = requests.get( app_url )

# to check the crawl status, just type:
page
<Response [200]> # all good

# to print the page contents type:
page.content

# Initialize the BS object with the downloaded content
soup = bs(page.content, 'html.parser')
# At this point, we can interact with the HTML 
# elements stored in memory using all helpers offered by BS library  
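
For anything beyond a quick console session, it may be worth failing early on HTTP errors and slow servers. The sketch below wraps the download and parse steps in a hypothetical helper, `load_soup_from_url`; the 10-second timeout is an arbitrary choice. The `timeout` argument and `raise_for_status()` are part of the `requests` API.

```python
import requests
from bs4 import BeautifulSoup


def load_soup_from_url(url, timeout=10, http_get=requests.get):
    """Download a page and return it parsed (hypothetical helper).

    http_get is injectable so the function can be exercised without network access.
    """
    response = http_get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return BeautifulSoup(response.content, "html.parser")
```

Calling `load_soup_from_url('https://flask-bulma-css.appseed.us/')` would then return the same kind of `soup` object as the snippet above.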

If all goes well, the `soup` object holds the DOM tree and we can interact with the information.

To do that, one line of code is enough:

# print the entire page head
soup.head

# print only the title
soup.head.title
<title>Flask Bulma CSS - BulmaPlay Open-Source App </title>

# print the footer
soup.footer

# to have a nice print of elements, we can use BS prettify() helper
# using prettify(), the output is nicely indented 

print(soup.footer.prettify())

# the output
<footer class="footer footer-dark">
 <div class="container">
  <div class="columns">
   <div class="column">
    <div class="footer-logo">
     <img alt="Footer Logo for BulmaPlay - JAMStack Bulma CSS Web App." src="/static/assets/images/logos/bulmaplay-logo.png"/>
    </div>
....
    </div>
   </div>
  </div>
 </div>
</footer>
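
Note that `soup.head.title` returns the whole tag. To get only the text, BeautifulSoup offers `.string` and `.get_text()`. A small sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up markup, used only for illustration
html = """
<html>
  <head><title> Demo page </title></head>
  <body><footer><p>Built with <a href="https://bulma.io">Bulma</a></p></footer></body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.head.title                           # the <title> tag itself
title_text = soup.head.title.string.strip()           # only the text, whitespace trimmed
footer_text = soup.footer.get_text(" ", strip=True)   # all text found inside the footer

print(title_tag)
print(title_text)
print(footer_text)
```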

List the page assets

Once the `soup` object is initialized, we can easily select elements of a certain type.

To print the JavaScript files loaded by the page, we read the `src` attribute of the `script` nodes:

The HTML code:

...
<script type='text/javascript' src='js/bootstrap.js'></script>
<script type='text/javascript' src='js/custom.js'></script>
...

And the parser code:

# the code
for script in soup.body.find_all('script', recursive=False):
    print(' Js = ' + script['src'])

# the output
 Js = /static/assets/js/jquery.min.js
 Js = /static/assets/js/jquery.lazy.min.js
 Js = /static/assets/js/slick.min.js 
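
One caveat: `script['src']` raises a `KeyError` for inline scripts, which have no `src` attribute. A slightly more defensive sketch, on made-up markup, uses `.get()` instead:

```python
from bs4 import BeautifulSoup

# Made-up markup: two external scripts and one inline script
html = """
<body>
  <script src="js/bootstrap.js"></script>
  <script>console.log("inline");</script>
  <script src="js/custom.js"></script>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

js_files = []
for script in soup.body.find_all("script", recursive=False):
    src = script.get("src")  # None when the attribute is missing
    if src:
        js_files.append(src)
        print(" Js = " + src)
```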

To print the CSS files, we can use a similar code snippet, but for `link` nodes:

...
<link rel="stylesheet" href="css/bootstrap.min.css">
<link rel="stylesheet" href="css/app.css">
...

and the HTML parser code:

for link in soup.find_all('link'):

   # Print the href attribute
   print(' CSS file = ' + link['href'])
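
`link` nodes are also used for favicons, fonts and other resources, so iterating over all of them may print more than CSS files. One possible refinement is to keep only the nodes whose `rel` contains `stylesheet`. BeautifulSoup exposes `rel` as a list of values. The markup below is made up for the example.

```python
from bs4 import BeautifulSoup

html = """
<head>
  <link rel="icon" href="favicon.ico">
  <link rel="stylesheet" href="css/bootstrap.min.css">
  <link rel="stylesheet" href="css/app.css">
</head>
"""
soup = BeautifulSoup(html, "html.parser")

css_files = []
for link in soup.find_all("link", href=True):
    rel_values = link.get("rel") or []  # e.g. ['stylesheet']
    if "stylesheet" in rel_values:
        css_files.append(link["href"])
        print(" CSS file = " + link["href"])
```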

How to list images? Check out this short code snippet:

for img in soup.body.find_all('img'):

   print(' IMG src = ' + img['src'])

   # we have the full path here
   img_path = img['src']

   # let's extract the image name
   img_file = img_path.split('/')[-1]

   # let's mutate the path, why not, we are hackers
   img['src'] = '/assets/img/' + img_file
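
Splitting on `/` keeps any query string in the file name (for example `logo.png?v=2`). If that matters, a variation using the standard library's `urllib.parse` and `posixpath` is sketched below. The target folder `/assets/img/` comes from the snippet above; the sample markup is made up.

```python
import posixpath
from urllib.parse import urlsplit

from bs4 import BeautifulSoup

html = '<body><img src="/static/images/logos/logo.png?v=2"><img alt="no src"></body>'
soup = BeautifulSoup(html, "html.parser")

for img in soup.body.find_all("img", src=True):  # skip <img> tags without src
    img_path = urlsplit(img["src"]).path          # drop query string and fragment
    img_file = posixpath.basename(img_path)       # keep only the file name
    img["src"] = "/assets/img/" + img_file

new_sources = [img.get("src") for img in soup.find_all("img")]
print(new_sources)
```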

Iterate on Elements

# the code
for elem in soup.body.children:
   if elem.name: # we need this check, some elements don't have name
      print( ' -> elem ' + elem.name )

# the output
 -> elem div
 -> elem section
 -> elem section
 -> elem footer
 -> elem div
 -> elem div
 -> elem div
 -> elem script
 -> elem script
 -> elem script
 -> elem script
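
The check is needed because `.children` also yields the text nodes (usually whitespace) that sit between tags. Two alternative ways to get only the tags, sketched on made-up markup: an explicit `isinstance` test against `bs4.element.Tag`, or `find_all(True, recursive=False)`, which returns direct child tags only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<body>\n<div>one</div>\n<section>two</section>\n<footer>three</footer>\n</body>"
soup = BeautifulSoup(html, "html.parser")

# Option 1: keep only Tag instances, skipping text nodes
names_isinstance = [elem.name for elem in soup.body.children if isinstance(elem, Tag)]

# Option 2: find_all(True, ...) matches every tag and never returns text nodes
names_find_all = [elem.name for elem in soup.body.find_all(True, recursive=False)]

print(names_isinstance)
```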

We can easily print attributes using the syntax `elem['attr_name']` for different kinds of elements:

...
<script type='text/javascript' src='js/bootstrap.js'></script>
<link rel="stylesheet" href="css/app.css">
...
<img src="images/pic01.jpg" alt="Bred Pitt">
...

And the BS parsing code:

# for Script nodes (Javascript definitions)
print( 'Script JS ' + script['type'] + ' ' + script['src'] )

# for Link nodes (CSS definition)
# note: BS returns `rel` as a list, so we join it before concatenating
print( 'CSS file ' + ' '.join(link['rel']) + ' ' + link['href'] )

# for images
print( 'IMG file ' + img['src'])
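
Indexing with `elem['attr_name']` raises a `KeyError` when the attribute is absent. When that can happen, `.get()` with a default value and the `.attrs` dictionary are handy. A sketch on a made-up variation of the markup (the image has no `alt` on purpose):

```python
from bs4 import BeautifulSoup

html = """
<script type='text/javascript' src='js/bootstrap.js'></script>
<link rel="stylesheet" href="css/app.css">
<img src="images/pic01.jpg">
"""
soup = BeautifulSoup(html, "html.parser")
script, link, img = soup.script, soup.link, soup.img

script_attrs = dict(script.attrs)          # all attributes as a plain dictionary
link_rel = list(link["rel"])               # multi-valued attribute, returned as a list
img_alt = img.get("alt", "(no alt text)")  # default value instead of a KeyError

print(script_attrs, link_rel, img_alt)
```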


Locate an element by ID

This can be achieved by a single line of code. Let's imagine that we have an element (div or span) with the id 1234:

...
<div id="1234" class="handsome">
Some text
</div>
...

and the corresponding code to select the object:

mydiv = soup.find("div", {"id": "1234"})

print(mydiv) 

# Useless element? 
# We can remove the element from the DOM with a single line of code
mydiv.decompose()
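
The same lookup can also be written with the `id` keyword argument or with a CSS selector. Note that `#1234` is not a valid CSS selector because the id starts with a digit, so the sketch below uses the attribute form `[id="1234"]`. It also checks that the element is gone after `decompose()`. The extra `<p>` element is made up for the example.

```python
from bs4 import BeautifulSoup

html = '<body><div id="1234" class="handsome">Some text</div><p>Keep me</p></body>'
soup = BeautifulSoup(html, "html.parser")

by_dict = soup.find("div", {"id": "1234"})        # the form used above
by_keyword = soup.find(id="1234")                 # keyword argument form
by_selector = soup.select_one('div[id="1234"]')   # CSS selector form

# all three lookups return the same element
same_element = (by_dict is by_keyword) and (by_keyword is by_selector)

by_dict.decompose()                  # remove the element from the tree
still_there = soup.find(id="1234")   # None: nothing matches any more
remaining = str(soup.body)
```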

# another example: list all the links found in the page footer
for elem in soup.body.footer.find_all('a'):
    print(' footer href = ' + elem['href'])

# the output
 footer href = https://bulma.io
 footer href = https://github.com/app-generator/flask-bulma-css
 footer href = https://appseed.us/apps/bulma-css?flask-bulma-css
 footer href = https://blog.appseed.us/tag/bulma-css
 footer href = https://absurd.design/
 footer href = https://github.com/cssninjaStudio/fresh
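
Two things can go wrong with this loop on other pages: the page may have no `footer` (then `soup.body.footer` is `None`), and some anchors may lack an `href`. A more defensive sketch, on made-up markup:

```python
from bs4 import BeautifulSoup

html = """
<body>
  <footer>
    <a href="https://bulma.io">Bulma</a>
    <a name="top">anchor without href</a>
    <a href="https://absurd.design/">Illustrations</a>
  </footer>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

footer = soup.body.footer
footer_links = []
if footer is not None:
    # href=True matches only the <a> tags that have an href attribute
    footer_links = [a["href"] for a in footer.find_all("a", href=True)]

for href in footer_links:
    print(" footer href = " + href)
```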

Resources