[Liverpool] Software patent data

Julian Todd julian at goatchurch.org.uk
Sun Apr 2 11:27:49 BST 2006


tony burrows wrote:
> After the last meeting I started playing with the idea of getting data 
> from the patent office.  Gave me an excuse to learn Python as well.
>
> Now currently I can take a downloaded page and grab the relevant data 
> into an xml file.  I know how to use python to stuff it into MySQL, 
> but I have hit a couple of problems.
>
> First, I'm not sure how to navigate around pages automatically so that 
> I can grab stuff without having to do it all manually through a browser.

All you need is urllib.urlopen(), read(), urlparse.urljoin() and some 
regexp knowledge to get whatever you want from the internet, spider 
around it, and capture the data.

    http://docs.python.org/lib/module-urllib.html
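
As a rough illustration of how those pieces fit together, here is a minimal spider sketch.  It assumes Python 3, where urllib.urlopen() and urlparse.urljoin() live at urllib.request.urlopen() and urllib.parse.urljoin().  The function names, the href regexp, and the rule of only following links under the starting URL are illustrative assumptions, not how publicwhip does it.

```python
import re
from collections import deque
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen

# Deliberately simple pattern (an assumption): only matches quoted href values.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def fetch(url):
    """Download one page and return its text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(html, base_url):
    """Return absolute, fragment-free URLs for every href in the page."""
    return [urldefrag(urljoin(base_url, href)).url
            for href in HREF_RE.findall(html)]


def spider(start_url, fetch_page=fetch, max_pages=10):
    """Breadth-first crawl that only follows links under start_url."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch_page(url)
        pages[url] = html  # capture the raw page for later parsing
        for link in extract_links(html, url):
            if link.startswith(start_url) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling spider() with a starting URL returns a dict mapping each visited URL to its page text, which can then be picked apart with further regexps.  The fetch_page parameter is only there so the crawl can be tried against canned pages without touching the network.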


That's how I've done it for the whole of publicwhip.  Arrange a time 
with me if you want to know how to get started.  The technical term for 
what you are trying to do is making the data accessible.  So 
downloading all the data, adding a proper search engine, and reposting 
it in a usable form is not violating the copyright; it's making it 
accessible to people who can't handle their interface.  Or so goes the 
argument.  It hasn't been tested in court, but the moral defense is 
that if the patent office is willing to provide these improved 
capabilities itself, which people need, then you will take your 
website down.  It should be as legal as caching the webpages for 
quicker access.


Julian T.




