Archive for January, 2009

HtmlUnit as Java Screen Scraping Library

January 23rd, 2009

If you need behavior ‘as though a real browser was scraping and using the page’, HtmlUnit is the best option I have found. It was designed for testing websites, but it works great for screen scraping and for navigating through multiple pages. It takes care of cookies and other session-related details and can, if you want it to, execute the JavaScript in the page.

I’ve personally tried several other tools from this list (Jericho, Web Harvest), but none of them is as good as this library. For example, writing a screen scraper with Web Harvest is easy, but badly formatted pages cause the XML parser to break, and this happened to me many times. Jericho is OK, but it took me much more code to achieve the same result as with HtmlUnit.
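To illustrate why badly formatted pages are a problem for XML-based tools, here is a minimal sketch using the XML parser that ships with the JDK. It is not Web Harvest's own code, only a general demonstration of how a strict XML parser reacts to typical real-world HTML; the class name `StrictXmlDemo` and the sample markup are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class StrictXmlDemo {

    // Returns true if the markup is well-formed XML, false if the parser rejects it.
    static boolean parsesAsXml(String markup) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Well-formed markup: every tag is closed.
        String clean = "<html><body><p>ok</p></body></html>";
        // Typical sloppy HTML: <p> and <br> are never closed.
        String sloppy = "<html><body><p>unclosed<br></body></html>";

        System.out.println("clean parses:  " + parsesAsXml(clean));
        System.out.println("sloppy parses: " + parsesAsXml(sloppy));
    }
}
```

A browser renders both snippets without complaint, which is why a tool that behaves like a browser tends to be more forgiving than one built on strict XML parsing.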

Take a look at the HtmlUnit Getting Started guide and start scraping in no time.
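For a rough idea of what a scraper looks like, here is a minimal sketch. It assumes a later HtmlUnit 2.x release on the classpath (the `com.gargoylesoftware.htmlunit` package, with `getOptions()` and `close()` on `WebClient`); older and newer releases name some of these differently, so check the documentation for your version. The class name `ScrapeExample` and the URL are placeholders.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class ScrapeExample {

    public static void main(String[] args) throws Exception {
        // The WebClient plays the role of the browser; it keeps cookies between requests.
        WebClient webClient = new WebClient();
        try {
            // JavaScript execution is optional; switch it off when the page does not need it.
            webClient.getOptions().setJavaScriptEnabled(false);

            // Placeholder URL (assumption): replace with the page you want to scrape.
            HtmlPage page = webClient.getPage("http://example.com/");

            System.out.println("Title: " + page.getTitleText());

            // Print the target of every link on the page.
            for (HtmlAnchor anchor : page.getAnchors()) {
                System.out.println("Link: " + anchor.getHrefAttribute());
            }
        } finally {
            // Release the windows and connections held by the client.
            webClient.close();
        }
    }
}
```

Because the same `WebClient` instance is reused, requesting a second page after this one would send back any cookies set by the first, which is what makes multi-page navigation straightforward.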

Java

‘Awesome’ administrator

January 20th, 2009

I am not sure whether this clip is fake, but it’s hilarious either way. This guy is the administrator you don’t want to employ at your company (maybe you can recommend him to your competitors :).

Watch it from here:

fun