[Gllug] Screen-scraping tools (for HTML)?

J F jnns at linuxmail.org
Fri Nov 26 12:49:13 UTC 2004


> Our current project involves "automating" Google Adwords.  I'm doing
> this by screen-scraping the HTML (LWP + HTML::TreeBuilder + lots of
> OCaml glue code).  Google's HTML is hideous - so hideous in fact that
> HTML::TreeBuilder misparses a lot of it, resulting in nasty
> workarounds all over the place.

> I'm thinking there must be an easier way ...  Does anyone know of any
> tools to help with automating / screen-scraping pages?

Use HTML Tidy (http://tidy.sourceforge.net) to clean the messy HTML before you try parsing it.
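
Something along these lines might do as a pre-processing step. It is only a rough sketch in Python (not the Perl/OCaml setup you described), and it assumes the tidy command-line binary is installed and on your PATH. The function names tidy_command and clean_html are made up for illustration, and the option set is just one plausible choice.

```python
import shutil
import subprocess

# Options passed to the tidy command-line tool (an illustrative choice):
#   -q                  suppress non-essential output
#   -asxml              emit well-formed XHTML
#   -numeric            write entities as numeric references
#   -utf8               treat input and output as UTF-8
#   --force-output yes  still emit markup when tidy reports errors
TIDY_OPTIONS = ["-q", "-asxml", "-numeric", "-utf8", "--force-output", "yes"]


def tidy_command(executable="tidy"):
    """Return the argument list for tidy (it reads stdin and writes stdout)."""
    return [executable] + TIDY_OPTIONS


def clean_html(html, executable="tidy"):
    """Pipe messy HTML through tidy and return the cleaned XHTML as a string."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable!r} not found on PATH")
    result = subprocess.run(
        tidy_command(executable),
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=False,
    )
    # tidy exits with 1 for warnings and 2 for errors; neither is treated as fatal
    if result.returncode not in (0, 1, 2):
        raise RuntimeError(result.stderr.decode("utf-8", "replace"))
    return result.stdout.decode("utf-8", "replace")


if __name__ == "__main__":
    # Hypothetical sample of badly nested, unclosed markup
    messy = "<table><tr><td>unclosed cell<tr><td><b>bold<i>nested</b></i>"
    if shutil.which("tidy"):
        print(clean_html(messy))
```

The cleaned output can then be handed to whatever parser you already use, which should have less to trip over once the tags are closed and properly nested.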

Cheers
-- 
Gllug mailing list  -  Gllug at gllug.org.uk
http://lists.gllug.org.uk/mailman/listinfo/gllug



