[Nottingham] Web page scraping

Richard Morris richard at tannery.co.uk
Tue Jul 31 15:37:31 BST 2007


> -----Original Message-----
> From: nottingham-bounces at mailman.lug.org.uk
> [mailto:nottingham-bounces at mailman.lug.org.uk] On Behalf Of Martin
> Sent: 31 July 2007 15:15
> To: NLUG
> Subject: [Nottingham] Web page scraping
>
> "Web page scraping":
>
> Anyone recommend any software for extracting info/tables from html and
> web pages?

Martin,

I use wget -O - to write the content of a page to standard output, then pipe
it through an appropriate mix of grep, sed, awk, head, sort and tail to pull
out the content I need. For example, I use the following crontab entry to
identify the URL of the downloadable edition of the Telegraph, download the
PDF and print it:

0 18 * * 1-5 wget -O - http://www.telegraph.co.uk`wget -O - http://www.telegraph.co.uk/pm | grep -o -E '/portal/graphics/[0-9]{4}/[0-9]{2}/[0-9]{2}/telegraphPM_[0-9]_[0-9]{4}\.pdf' | sort -r | head -n 1` | lpr -P tp1 -o outputorder=reverse #Telegraph
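
Spelled out, the same logic looks like this (an untested sketch of the
one-liner above; the printer name tp1 and the URL pattern are obviously
specific to my setup):

#!/bin/sh
# Scrape the PM edition index page and keep the newest-looking PDF path:
# grep -o prints only the part of each line that matches the pattern, and
# sort -r | head -n 1 picks the lexically greatest (latest-dated) match.
url_path=$(wget -q -O - http://www.telegraph.co.uk/pm \
  | grep -o -E '/portal/graphics/[0-9]{4}/[0-9]{2}/[0-9]{2}/telegraphPM_[0-9]_[0-9]{4}\.pdf' \
  | sort -r | head -n 1)
# Download the PDF itself and send it straight to the printer.
wget -q -O - "http://www.telegraph.co.uk$url_path" \
  | lpr -P tp1 -o outputorder=reverse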
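
For extracting tables specifically, the same fetch-and-filter approach works
as long as the markup is regular. A minimal sketch (the URL is a placeholder,
and it assumes each cell's text contains no nested tags, which real-world
HTML often breaks):

# Print the text of each <td>/<th> cell, one per line:
# grep -o isolates the cells, sed strips the surrounding tags.
wget -q -O - http://www.example.com/table.html \
  | grep -o -E '<t[dh][^>]*>[^<]*</t[dh]>' \
  | sed -e 's/<[^>]*>//g'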

Regards

Richard




