HtmlUnit as Java Screen Scraping Library

January 23rd, 2009

If you need behavior ‘as though a real browser was scraping and using the page’, HtmlUnit is definitely the best option available. It was designed for testing websites, but it works great for screen scraping and for navigating through multiple pages. It takes care of cookies and other session-related state and can execute (if you want it to) the Javascript in the page.

I’ve personally tried several other tools from this list (Jericho, Web Harvest), but neither of them is as good as this library. For example, writing a screen scraper with Web Harvest is an easy task, but badly formatted pages cause its XML parser to break, and this happened to me many times. Jericho is OK, but it took me much more code to achieve the same result as with HtmlUnit.

Take a look at HtmlUnit getting started and start scraping in no time.

Advanced javascript tutorial

November 4th, 2008 – A really good advanced javascript tutorial.
If you want to be able to understand the code below, then this tutorial is the right thing.

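// The .bind method from Prototype.js
Function.prototype.bind = function(){
  // fn is the function being bound; the first argument becomes `this`
  var fn = this,
  args = Array.prototype.slice.call(arguments),
  object = args.shift();
  return function(){
    // Call fn on object, passing the preset arguments and then the new ones
    return fn.apply(object,
      args.concat(Array.prototype.slice.call(arguments)));
  };
};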
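
To make the idea behind a bind helper concrete, here is a minimal sketch written as a standalone function, so it does not touch Function.prototype. The names bindTo, counter and addFive are hypothetical and exist only for this illustration; they are not part of Prototype.js or of the tutorial.

```javascript
// Hypothetical standalone helper (not Prototype.js itself): returns a new
// function that calls `fn` with `this` set to `object`, passing any preset
// arguments first and the call-time arguments after them.
function bindTo(fn, object) {
  // Everything after the first two parameters is a preset argument.
  var preset = Array.prototype.slice.call(arguments, 2);
  return function () {
    var later = Array.prototype.slice.call(arguments);
    return fn.apply(object, preset.concat(later));
  };
}

// Illustrative object whose method depends on `this`.
var counter = {
  count: 10,
  add: function (a, b) {
    this.count += a + b;
    return this.count;
  }
};

// Detached from `counter`, `add` would lose its `this`; binding restores it
// and pre-fills the first argument with 5.
var addFive = bindTo(counter.add, counter, 5);
var result = addFive(2); // adds 5 + 2 to counter.count
console.log(result);
```

The returned function closes over fn, object and the preset arguments, which is why it can be handed around as a callback and still operate on the right object.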

Javascript Environment Tips

October 10th, 2008

One might think that working “professionally” with Ajax and Javascript requires a comprehensive IDE with “fancy” features.

In practice, you can do everything well and efficiently if you stick to Firebug and its built-in Javascript debugging tools. Of course, you should at least have a text editor with syntax highlighting (we recommend Notepad++).

The instructions on how to use Firebug for debugging Javascript can be found here.

