HTML Parser - Flat HTML to PUG, Jinja2 and Blade templates

Hello Coder,

If you plan to automate the integration of new layouts into legacy web apps, this article might help you. The code snippets listed below are used by the AppSeed R&D team to turn flat HTML files into production-ready templates and components for JavaScript, Python and PHP apps.

The reason & motivation

Upgrading the design of a legacy product can be time-consuming, especially when the app is coded in Flask, for instance, and the client sends a link to a theme delivered as flat HTML. If the application is coded in React or Vue, the workload is even heavier. We know this because we did it by hand many times, until we decided to build a tool for the translation. In this article we share what we've learned with other developers.

HTML parser features

To be really useful, the parser tool should provide a minimum set of features:

  • load and manipulate the HTML DOM elements
  • edit elements and properties (anchor hrefs, span texts, etc.)
  • replace the hard-coded strings with real variables specific to the underlying template engine
  • export production-ready components for PUG, Jinja2, Blade or any other format used by the app
  • save the processed HTML for future processing and editing
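
As a rough illustration of the third feature, the sketch below maps a variable name to the output syntax of a few template engines. The `placeholder` function and its format table are hypothetical names introduced here for illustration; they are not part of the AppSeed tool.

```python
def placeholder(name, engine):
    """Return the expression that prints `name` in the given template engine.

    Hypothetical helper: the engine keys below are our own labels.
    """
    formats = {
        "jinja2": "{{ %s }}",            # Jinja2 expression
        "blade": "{{ $%s }}",            # Blade echo statement
        "pug": "#{%s}",                  # Pug string interpolation
        "php": "<?php echo $%s; ?>",     # plain PHP
    }
    if engine not in formats:
        raise ValueError("unsupported template engine: " + engine)
    return formats[engine] % name


for engine in ("jinja2", "blade", "pug", "php"):
    print(engine, "->", placeholder("page_title", engine))
```

Keeping the syntax in one table means the parsing code can stay the same while the export target changes.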

With this minimum set of features in mind, let's write some code. Our choice for writing a powerful HTML parser was Python and the BeautifulSoup library.

HTML parser - load file

First things first. To bootstrap the HTML parsing, we need at least two things:

  • BeautifulSoup library - properly installed on our system
  • An HTML file to play with and test the code
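# install the BS library using pip
$ pip install beautifulsoup4

# import the BS class under the short alias used below
from bs4 import BeautifulSoup as bs

# load the HTML file (assumed to be UTF-8 encoded);
# the context manager closes the file for us
with open('index.html', 'r', encoding='utf-8') as html_file:
    html_content = html_file.read()

# initialize the BS object
# using the built-in html.parser
soup = bs(html_content, 'html.parser')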

At this point, we can interact with the HTML tree using the BS helpers. Please note that the BS library supports more than one parser (e.g. lxml, xml, html5lib), and the differences between them become clear on documents that are not well-formed. For instance, lxml will add missing closing tags for all elements. For more information, please see the section on parser differences in the BeautifulSoup documentation.
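
To see the parser differences on a concrete case, the sketch below feeds the same broken fragment to several parsers and prints what each one builds. `parse_with` is a hypothetical helper name; lxml and html5lib are optional packages, so the helper returns `None` when one of them is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with(markup, parser):
    """Serialize the tree built by `parser`, or return None if it is missing."""
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed
        return None


broken_markup = "<a></p>"  # stray closing tag, no matching opening tag

for parser in ("html.parser", "lxml", "html5lib"):
    result = parse_with(broken_markup, parser)
    print(parser, "->", result if result is not None else "(not installed)")
```

Running the same input through each parser before processing a whole theme is a cheap way to decide which one fits a given set of HTML files.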

HTML parser - parse HEAD node

To select the whole HEAD node and interact with its elements, we need just a few lines of code:

header = soup.find('head')

# If we want to change the title
header.title.string.replace_with('My new title') 
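
The one-liner above assumes the document already has a title tag. A slightly more defensive variant might look like the sketch below; `set_title` is a hypothetical helper name, and creating a missing title tag is our own assumption about the desired behavior.

```python
from bs4 import BeautifulSoup


def set_title(soup, new_title):
    """Set the page title, creating the <title> tag when it is missing."""
    head = soup.find("head")
    if head is None:
        raise ValueError("document has no <head> element")
    if head.title is None:
        # new_tag() builds a tag that belongs to this soup object
        head.append(soup.new_tag("title"))
    head.title.string = new_title
    return soup


demo = BeautifulSoup("<html><head></head><body></body></html>", "html.parser")
set_title(demo, "My new title")
print(demo.head)
```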

HTML parser - JS files

JavaScript files are referenced in the HTML using script nodes:

...
<script type='text/javascript' src='js/bootstrap.js'></script>
<script type='text/javascript' src='js/custom.js'></script>
...

To scan the HTML soup for script tags, we can use the find_all BS helper:

# direct children of <body> only; src=True skips inline scripts
for script in soup.body.find_all('script', src=True, recursive=False):

   # Print the path
   print(' JS source = ' + script['src'])

   # Update (normalize) the path
   js_path = script['src']
   js_file = js_path.split('/')[-1] # extract the file name
   script['src'] = '/assets/js/' + js_file
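
Splitting on '/' also rewrites CDN links and keeps any query string glued to the file name. If that matters for your theme, the path logic could be isolated in a helper like the sketch below. `normalize_asset_path` is a hypothetical name, and leaving external URLs untouched while dropping the query string is an assumption, not something the original tool is known to do.

```python
import posixpath
from urllib.parse import urlsplit


def normalize_asset_path(src, prefix):
    """Map a local asset path to `prefix` + file name."""
    parts = urlsplit(src)
    if parts.scheme or parts.netloc:
        # http(s)://, //cdn.example.com/... and data: URIs are left alone
        return src
    file_name = posixpath.basename(parts.path)
    if not file_name:
        # nothing usable (empty src or a path ending in '/')
        return src
    return prefix.rstrip("/") + "/" + file_name


for src in ("js/bootstrap.js", "../vendor/js/custom.js?v=2", "//cdn.example.com/lib.js"):
    print(src, "->", normalize_asset_path(src, "/assets/js/"))
```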

Parse HTML for Images

# src=True skips <img> tags that have no src attribute
for img in soup.body.find_all('img', src=True):

   # Print the path
   print(' IMG src = ' + img['src'])

   img_path = img['src']
   img_file = img_path.split('/')[-1] # extract the file name
   img['src'] = '/assets/img/' + img_file
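
Since the script and image loops share the same shape, they could be folded into one function that also reports which files were moved, which is handy when the assets themselves must be copied to the new folder. This is only a sketch: `rewrite_sources` is a hypothetical name, and the list of prefixes treated as external is an assumption.

```python
from bs4 import BeautifulSoup

# Assumption: sources starting with these prefixes are external and kept as-is
EXTERNAL_PREFIXES = ("http://", "https://", "//", "data:")


def rewrite_sources(soup, tag_name, prefix):
    """Point local `src` attributes of `tag_name` tags at `prefix`.

    Returns a dict {old path: new path} for the tags that were changed.
    """
    moved = {}
    for tag in soup.find_all(tag_name, src=True):
        old_path = tag["src"]
        if old_path.startswith(EXTERNAL_PREFIXES):
            continue
        new_path = prefix + old_path.split("/")[-1]
        tag["src"] = new_path
        moved[old_path] = new_path
    return moved


html = (
    '<body><img src="img/logo.png">'
    '<img src="https://cdn.example.com/a.png">'
    '<img alt="no source"></body>'
)
page = BeautifulSoup(html, "html.parser")
print(rewrite_sources(page, "img", "/assets/img/"))
```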

Remove hard coded strings


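from bs4 import NavigableString

# my_div: the container being processed, e.g. my_div = soup.find('div')
# var_name: the variable name chosen for the current string
for tag in list(my_div.descendants): # copy the list, the loop edits the tree
  # ignore NavigableString nodes and tags without a single string
  if not isinstance(tag, NavigableString) and tag.string is not None:
     # replace the tag text with a variable name; keep ONE of the forms below

     # plain HTML replacement
     tag.string.replace_with( var_name )

     # Jinja2 translation
     # tag.string.replace_with( "{{ var_name }}" )

     # PHP translation (export with formatter=None, otherwise < and > get escaped)
     # tag.string.replace_with( "<?php echo $var_name ?>" )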
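
In practice each string needs its own variable name, and the original text has to be kept somewhere so the template can be rendered later. One possible way to do both is sketched below; `extract_strings` and the `text_N` naming scheme are hypothetical, and only tags that hold a single piece of text are processed.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def extract_strings(container, prefix="text"):
    """Replace leaf texts with Jinja2 variables and return {variable: text}."""
    context = {}
    for tag in container.find_all(True):  # True matches every tag
        # Only handle tags whose single child is a text node
        if len(tag.contents) != 1:
            continue
        child = tag.contents[0]
        if not isinstance(child, NavigableString):
            continue
        text = child.strip()
        if not text:
            continue
        var_name = "%s_%d" % (prefix, len(context) + 1)
        context[var_name] = text
        tag.string = "{{ " + var_name + " }}"
    return context


html = '<div id="hero"><h1>Welcome</h1><p>Build <b>fast</b></p></div>'
page = BeautifulSoup(html, "html.parser")
context = extract_strings(page.find(id="hero"))
print(context)
print(page)
```

The returned dictionary can then be passed to the template engine as the render context, so the generated template reproduces the original page until real data is plugged in.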

Save our work

All our changes are made in memory. To make these changes permanent we need to extract the string representation of our processed HTML from BS, and dump it into a file for later usage:

processed_html = soup.prettify(formatter="html")
# the context manager flushes and closes the file
with open('index_bs.html', 'w', encoding='utf-8') as f:
    f.write(processed_html)
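
If the saved file is going to be loaded again for another processing pass, it may be worth checking that it survives the round trip. The sketch below writes a small document to a temporary folder and parses it back; `save_soup` and its `pretty` flag are hypothetical names introduced for this example.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def save_soup(soup, path, pretty=True):
    """Write the soup to `path` as UTF-8 and return the HTML that was written."""
    html = soup.prettify() if pretty else str(soup)
    Path(path).write_text(html, encoding="utf-8")
    return html


page = BeautifulSoup(
    "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>",
    "html.parser",
)

with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "index_bs.html"
    save_soup(page, target, pretty=False)
    reloaded = BeautifulSoup(target.read_text(encoding="utf-8"), "html.parser")

print(reloaded.title.string)
```

Note that prettify() puts each tag and string on its own line, so text nodes pick up extra whitespace when the file is parsed again. The compact form from str(soup) avoids that when the output is meant for further processing rather than for reading.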

Where to go from here

Using an HTML parser to automate your workflow can speed up the development process. With the HTML parser tool, we are able to integrate a new design into legacy products much faster than with manual translation. Some real-life samples that use templates and components provided by our HTML parser:


Need assistance to integrate a new design into a legacy app? Please access the support page for more information.

Thank you!    