HtmlUnit as a Java Screen Scraping Library

If you need behavior ‘as though a real browser were loading and using the page’, HtmlUnit is the best option I have come across. It was designed for testing websites, but it works well for screen scraping and for navigating through multiple pages. It takes care of cookies and other session state, and it can execute the JavaScript in the page if you want it to.
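To give a feel for what that looks like in code, here is a minimal sketch rather than a definitive recipe. It assumes HtmlUnit 3.x or later, where the classes live in the `org.htmlunit` package (older 2.x releases use `com.gargoylesoftware.htmlunit` instead). The class name `ScrapeSketch`, the helper `linkTargets`, and the `https://example.com/` URL are placeholders made up for this illustration.

```java
import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlAnchor;
import org.htmlunit.html.HtmlPage;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ScrapeSketch {

    /**
     * Loads the page at the given URL and returns the raw href attribute
     * of every link on it. (Hypothetical helper, for illustration only.)
     */
    static List<String> linkTargets(String url, boolean runJavaScript) throws IOException {
        // WebClient plays the role of the browser; it keeps cookies for its lifetime.
        try (WebClient webClient = new WebClient()) {
            // JavaScript execution is optional and can be switched per client.
            webClient.getOptions().setJavaScriptEnabled(runJavaScript);
            // CSS is rarely needed for scraping, so skip it.
            webClient.getOptions().setCssEnabled(false);
            // Real-world pages often contain broken scripts; do not abort on them.
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            HtmlPage page = webClient.getPage(url);

            List<String> hrefs = new ArrayList<>();
            for (HtmlAnchor anchor : page.getAnchors()) {
                hrefs.add(anchor.getHrefAttribute());
            }
            return hrefs;
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; replace it with the page you want to scrape.
        for (String href : linkTargets("https://example.com/", false)) {
            System.out.println(href);
        }
    }
}
```

To move on to another page within the same session, calling `click()` on an `HtmlAnchor` returns the page it navigates to, as long as the `WebClient` is still open.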

I’ve personally tried several other tools from this list (Jericho, Web Harvest), but none of them is as good as this library. For example, writing a screen scraper with Web Harvest is easy, but badly formed pages cause its XML parser to break, and this happened to me many times. Jericho is OK, but it took me much more code to achieve the same result as with HtmlUnit.

Take a look at the HtmlUnit getting started guide and start scraping in no time.

