Friday, October 4, 2013

Parsing HTML pages using Mojo::DOM

After my previous post, I successfully wrote a Perl script to parse an XML file that I had retrieved (from an RSS feed) using Perl and cURL. XML parsing and manipulation are simple enough if you use the XML::Simple module. But what if the page you are retrieving with Perl and cURL is an HTML page, and you have to change the HTML code before you render it in your browser? That requires a module that lets you access individual DOM elements and modify them. That is how I learned about the Mojo::DOM module, which is part of a distribution called Mojolicious that you can get on CPAN. Mojolicious calls itself "a next generation web framework for the Perl programming language". You can learn more on their website. If you use ActivePerl, installing it on your machine is simple enough, using the ppm command:

ppm install Mojolicious

Meanwhile, I decided to retrieve a particular web site's home page, make a couple of changes in the code, and render the same page in my browser. The changes were: (1) I would remove their Google Analytics code, which was enclosed in a <script> tag pair, and (2) I would ensure that all references to stylesheets and images are prefixed with the URL of that particular web site, so that the web page in my browser appears identical to the original. Here are some of the interesting things that I learned about Mojo::DOM while completing this script, explained using code snippets:

my $domref = Mojo::DOM->new->xml(0)->parse($response_body);

* Here I have created a new Mojo::DOM object by parsing the response from my cURL call. The xml(0) call tells Mojo::DOM to use HTML mode instead of XML mode
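As a minimal sketch of this step, the snippet below parses a small HTML string and reads one element back out. The sample markup stands in for the real cURL response and is only an assumption for illustration; at() is the Mojo::DOM method that returns the first element matching a CSS selector.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical stand-in for the body returned by the cURL call.
my $response_body =
    '<html><head><title>Demo</title></head><body><p>Hello</p></body></html>';

# Select HTML mode explicitly, then parse the markup.
my $domref = Mojo::DOM->new->xml(0)->parse($response_body);

# at() returns the first element matching the selector.
my $title = $domref->at('title')->text;
print "$title\n";
```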

my @links = $domref->find('[href]')->each;

* Here I am telling Mojo::DOM to return all HTML elements that have an 'href' attribute. I can then go through the @links array using foreach()
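A small sketch of that loop, assuming some made-up markup in place of the downloaded page. It collects every element with an href attribute and prints the attribute value.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample markup; in the real script this is the cURL response.
my $domref = Mojo::DOM->new(
    '<div><link rel="stylesheet" href="/css/site.css">'
  . '<a href="/about">About</a><p>No link here</p></div>'
);

# each() with no arguments returns the matched elements as a list.
my @links = $domref->find('[href]')->each;

foreach my $link (@links) {
    # attr() with a single argument reads the attribute value.
    print $link->attr('href'), "\n";
}
```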

$_->attr(href => $linkhref);

* Here I am replacing the value of the 'href' attribute of a particular tag with the string $linkhref, which is actually a modified version of the original value
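One possible way to build $linkhref is sketched below. The base URL and the sample markup are hypothetical, and the rule used here (prefix only paths that start with "/", leave absolute URLs alone) is an assumption about what "prefixing" should mean, not necessarily what my script did.

```perl
use strict;
use warnings;
use Mojo::DOM;

my $base_url = 'http://www.example.com';    # hypothetical site URL
my $domref   = Mojo::DOM->new(
    '<div><a href="/about">About</a>'
  . '<a href="http://other.example/x">Elsewhere</a></div>'
);

for ($domref->find('[href]')->each) {
    my $linkhref = $_->attr('href');

    # Skip values that already carry a scheme or are protocol-relative.
    next if $linkhref =~ m{^(?:[a-z][a-z0-9+.-]*:|//)}i;

    # Prefix root-relative paths with the site URL.
    $linkhref = $base_url . $linkhref if $linkhref =~ m{^/};

    # attr() with a key/value pair writes the attribute.
    $_->attr(href => $linkhref);
}

print "$domref\n";
```

The same loop with the selector '[src]' and the attribute name 'src' should cover images.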

$_->replace_content($styletext);

* Here I am replacing the entire contents of a particular <style> tag pair with a modified version of the original contents. Interestingly, in my testing the replace_content function showed what looks like a bug: it replaced all double quotes in the new content with the HTML entity '&quot;', which may cause problems in the HTML output (for example, inside CSS). So you need to change the entity back to double quotes before the script prints the output
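The workaround can be sketched roughly as follows. Two things here are assumptions: as far as I know, later Mojolicious releases renamed replace_content to content, so the sketch picks whichever method exists; and it restores the quotes only inside <style> blocks, because a blanket replacement would also touch '&quot;' entities that legitimately belong in attribute values. The base URL and markup are made up.

```perl
use strict;
use warnings;
use Mojo::DOM;

my $base_url = 'http://www.example.com';    # hypothetical site URL
my $domref   = Mojo::DOM->new(
    '<style>body { background: url("/img/bg.png") }</style>'
  . '<p title="a &quot;b&quot;">Hi</p>'
);

my $style = $domref->at('style');

# Build the modified stylesheet text by prefixing root-relative url() paths.
(my $styletext = $style->text) =~ s{url\("/}{url("$base_url/}g;

# Assumption: use replace_content where it exists, otherwise content.
my $setter = $style->can('replace_content') ? 'replace_content' : 'content';
$style->$setter($styletext);

# Restore double quotes, but only inside <style> blocks of the output.
my $html = "$domref";
$html =~ s{(<style\b[^>]*>)(.*?)(</style>)}{
    my ($open, $body, $close) = ($1, $2, $3);
    $body =~ s/&quot;/"/g;
    "$open$body$close";
}gsie;

print "$html\n";
```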

say "$domref";

* Here I am outputting the modified HTML code to the browser. For those who are not aware, say is the Perl equivalent of Java's println. It saves you the trouble of appending a "\n" at the end of every print statement. Using say requires Perl 5.10 or later and the line "use feature qw(say);" along with your module declarations
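To round things off, here is a rough sketch of the other change I mentioned, removing the analytics <script> block, followed by the final say. The sample page and the pattern used to recognise the analytics code are assumptions; a real page may need a different test.

```perl
use strict;
use warnings;
use feature qw(say);
use Mojo::DOM;

# Hypothetical page with one analytics script and one unrelated script.
my $response_body =
    '<html><head>'
  . '<script>var _gaq = _gaq || [];</script>'
  . '<script>var keep = 1;</script>'
  . '</head><body><p>Hello</p></body></html>';

my $domref = Mojo::DOM->new->xml(0)->parse($response_body);

for my $script ($domref->find('script')->each) {
    # Assumed marker strings for Google Analytics code.
    $script->remove if $script->text =~ /google-analytics\.com|_gaq/;
}

# Stringifying the object renders the modified document.
say "$domref";
```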
