[Gllug] extracting data from specific sites

Progga proggaprogga at gmail.com
Mon Dec 17 11:33:33 UTC 2007


On Mon, Dec 17, 2007 at 03:23:58AM +0000, Nevzat Hami wrote:

> if there is no worry about copyright, :-) any ideas how to do it?

The definitive guide for PHP+CURL screen scrapers is Michael Schrenk's
"Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents".
Find out more at: http://books.google.co.uk/books?id=5hpCAAAACAAJ

This book devotes a whole chapter to legal issues, and here is what
he advises:

0. Read the terms & conditions of the website.  If they prohibit screen
scraping, you are out of luck.

1. Respect the robots.txt file of the website.  If this file is absent and
there are no terms & conditions, you are free to scrape anything.
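
A rough sketch of the robots.txt check, in case it helps.  The book works
in PHP; this uses Python's standard urllib.robotparser module instead,
purely as an illustration.  The helper name, the "mybot" user agent and the
example.com URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url.

    Hypothetical helper: it takes the robots.txt body as a string so it
    can be tried offline.  Against a live site you would instead call
    parser.set_url(".../robots.txt") followed by parser.read().
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up robots.txt: everything is allowed except /private/.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for path in ("/index.html", "/private/data.html"):
        url = "http://example.com" + path
        print(path, allowed_by_robots(EXAMPLE_ROBOTS, "mybot", url))
```

This only covers robots.txt.  The terms & conditions still have to be read
by a human.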


As for the technical side, it's quite straightforward:

0. Fetch the page with curl.

1. Sanitize it via tidy (you'll need PHP's tidy extension).

2. Find the data using XPath and then do whatever you want with it.
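
The three steps above can be sketched roughly as follows.  This is not the
book's PHP code: it is a Python stand-in that assumes the HTML Tidy
command-line program ("tidy") is installed and on your PATH, and it uses
the standard library's ElementTree, which understands only a subset of
XPath.  The function names are made up for the example.

```python
import subprocess
import urllib.request
import xml.etree.ElementTree as ET

# tidy -asxml emits XHTML, so every element lives in this namespace.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}


def fetch(url):
    """Step 0: download the raw page as bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def tidy_to_xhtml(html):
    """Step 1: repair the markup so an XML parser will accept it.

    Assumes a `tidy` binary on PATH.  tidy exits with 1 when it only
    has warnings (normal for real-world pages) and 2 on errors.
    """
    result = subprocess.run(
        ["tidy", "-q", "-asxml", "-numeric", "-utf8"],
        input=html,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode >= 2:
        raise RuntimeError("tidy could not repair the page")
    return result.stdout


def extract_links(xhtml):
    """Step 2: pick out data with an XPath-style query.

    Here the query is simply "every <a> element that has an href".
    """
    root = ET.fromstring(xhtml)
    return [a.get("href") for a in root.findall(".//x:a[@href]", XHTML_NS)]


def scrape_links(url):
    """Chain the three steps for a single page."""
    return extract_links(tidy_to_xhtml(fetch(url)))
```

Calling scrape_links() with a page URL should give you a list of the link
targets on that page.  For anything fancier than simple paths and attribute
tests you would want a full XPath engine rather than ElementTree.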


I have to tell you that the legal issues people are raising in this
thread are NOT worthless.  If you are screen scraping a site which
prohibits scraping, it won't take long for the site admins to find out.
If you live anywhere near the jurisdiction of UK/US courts, you'll
be in serious trouble.  This is the parting advice of the book.  It is
theoretically possible to dodge the site admins by using a proxy farm, but
these are expensive and clearly state that they'll reveal your identity
if they are served with a court order.



-- 
Gllug mailing list  -  Gllug at gllug.org.uk
http://lists.gllug.org.uk/mailman/listinfo/gllug

