[Nottingham] Web page scraping
Richard Morris
richard at tannery.co.uk
Tue Jul 31 15:37:31 BST 2007
> -----Original Message-----
> From: nottingham-bounces at mailman.lug.org.uk
> [mailto:nottingham-bounces at mailman.lug.org.uk] On Behalf Of Martin
> Sent: 31 July 2007 15:15
> To: NLUG
> Subject: [Nottingham] Web page scraping
>
> "Web page scraping":
>
> Anyone recommend any software for extracting info/tables from html and
> web pages?
Martin,

I use wget -O - to write the content of the page to standard output, then
an appropriate mix of grep, sed, awk, head, sort and tail to pick out the
content I need. For example, I use the following crontab entry to identify
the URL of the downloadable version of the Telegraph, download the PDF and
print it:
0 18 * * 1-5 wget -O - http://www.telegraph.co.uk`wget -O - http://www.telegraph.co.uk/pm | grep -o -E '/portal/graphics/[0-9]{4}/[0-9]{2}/[0-9]{2}/telegraphPM_[0-9]_[0-9]{4}\.pdf' | sort -r | head -n 1` | lpr -P tp1 -o outputorder=reverse #Telegraph
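
Since the nested command substitution is a bit dense, here is the same job
unrolled into a short script. This is just a sketch of the entry above (same
URLs, same printer queue tp1), not something I run separately:

#!/bin/sh
# Grab the Telegraph "pm" page and pull out the PDF links it contains.
# sort -r puts the most recent date first; head -n 1 keeps only that link.
url=`wget -q -O - http://www.telegraph.co.uk/pm \
  | grep -o -E '/portal/graphics/[0-9]{4}/[0-9]{2}/[0-9]{2}/telegraphPM_[0-9]_[0-9]{4}\.pdf' \
  | sort -r | head -n 1`
# Prepend the site root and pipe the PDF straight to the printer.
wget -q -O - "http://www.telegraph.co.uk$url" | lpr -P tp1 -o outputorder=reverse

The -O - flag is what makes wget write to standard output rather than a file,
so everything chains together with pipes.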
Regards
Richard