Dependency Management: HtmlUnit

Posted by Uncle Bob on 02/11/2007

If you are planning on building an API, please, please, think about dependency management. Don’t make me know more about your world view than necessary. Consider what happened to me as I explored HtmlUnit…

I’m using HtmlUnit to parse and interpret HTML web pages. I’ve been very impressed with this library so far. And I appreciate the hard work and dedication of people who give their software away for free. So, although this blog is a complaint, it should not be misconstrued as anything more than constructive criticism. Besides, what I am complaining about here is so universal that it really wouldn’t matter whose software I chose to scrutinize. The HtmlUnit authors just got lucky in this case.

What I want to do with HtmlUnit is quite simple. Given a string containing HTML, I’d like to query that HTML for certain tags and attributes. For example, I’d like to do this:

   HtmlPage page = HTMLParser.parse(htmlString);
   HtmlElement html = page.getDocumentElement();
   HtmlElement listForm = html.getHtmlElementById("list_form");
   assertEquals("/Library/books/manage.do", listForm.getAttributeValue("Action"));

Sweet, simple, uncomplicated. Just create the DOM from an HTML String, and then query that DOM. Unfortunately, HtmlUnit does not appear to be that simple. What you have to do instead looks like this:

   StringWebResponse stringWebResponse = new StringWebResponse(htmlString);
   WebClient webClient = new WebClient();
   webClient.setJavaScriptEnabled(false);
   HtmlPage page = HTMLParser.parse(stringWebResponse, new TopLevelWindow("", webClient));
   HtmlElement html = page.getDocumentElement();
   HtmlElement listForm = html.getHtmlElementById("list_form");
   assertEquals("/Library/books/manage.do", listForm.getAttributeValue("Action"));

The extra stuff in here is apparently there because the authors wanted to be able to simulate browsers, frames, and JavaScript. I think their goal was laudable. However, I wish they had done this without forcing those frames, browsers, and script engines down my throat.

Given my simple needs, why do I care about WebClient and Window? Why do I have to turn off the JavaScript engine? It may seem a small thing, but it bothers me nonetheless. It’s the principle of the matter that gets under my skin. The pragmatic programmers called it The Principle of Least Surprise. I call it, simply, dependency management. Don’t make people depend on more than they need.
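
To picture the kind of narrow interface I am asking for, here is a minimal sketch. It is not anything HtmlUnit provides: the class and method names (MarkupQuery, parse, attributeOf) are hypothetical, and it leans on the JDK's own XML parser, so it only copes with well-formed markup. The point is simply that the factories and readers stay on the inside, and the caller depends on nothing but a string going in and attribute values coming out.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical facade: the only type a caller needs to know about. */
final class MarkupQuery {
    private final Document document;

    private MarkupQuery(Document document) {
        this.document = document;
    }

    /** Parses well-formed markup. The parser plumbing never leaks out of this method. */
    static MarkupQuery parse(String markup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            return new MarkupQuery(doc);
        } catch (Exception e) {
            // Assumption for this sketch: one unchecked exception is enough for callers.
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    /**
     * Returns the named attribute of the first element whose id matches,
     * or null if no element has that id. An element that exists but lacks
     * the attribute yields an empty string, as the DOM API specifies.
     */
    String attributeOf(String id, String attributeName) {
        NodeList all = document.getElementsByTagName("*"); // "*" matches every element
        for (int i = 0; i < all.getLength(); i++) {
            Element element = (Element) all.item(i);
            if (id.equals(element.getAttribute("id"))) {
                return element.getAttribute(attributeName);
            }
        }
        return null;
    }
}
```

With something along these lines, the calling code shrinks to MarkupQuery.parse(markup).attributeOf("list_form", "action"), and nobody has to know what a window or a script engine is unless they actually need one.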

The cost, to me, was an hour of rooting around in the documentation, example code, and my own trial-and-error experiments. (The benefit to me was another blog topic ;-) That cost may not seem great, but it must be paid again and again by everyone who wants to use the package in a way that doesn’t quite fit the authors’ world view.

There may, in fact, be a simpler way to do what I want to do with HtmlUnit. If there is, I haven’t been able to find it, and I’d be grateful if anyone out there, including the authors, could guide me in the right direction.
