[Liverpool] Software patent data

tony burrows tony at tonyburrows.uklinux.net
Fri Apr 7 14:51:34 BST 2006


Aidan McGuire wrote:
> If it's useful, I will offer to do the web site.
>
> Aidan
>
> On 2 Apr 2006, at 11:27, Julian Todd wrote:
>
>> tony burrows wrote:
>>> After the last meeting I started playing with the idea of getting 
>>> data from the patent office.  Gave me an excuse to learn Python as 
>>> well.
>>>
>>> Currently I can take a downloaded page and grab the relevant 
>>> data into an XML file.  I know how to use Python to stuff it into 
>>> MySQL, but I have hit a couple of problems.
>>>
>>> First, I'm not sure how to navigate around pages automatically so 
>>> that I can grab stuff without having to do it all manually through a 
>>> browser.
>>
>> All you need is urllib.urlopen(), read(), urlparse.urljoin() and some 
>> regexp knowledge to get whatever you want from the internet, spider 
>> around it, and capture the data.
>>
>>    http://docs.python.org/lib/module-urllib.html
>>
>>
>> That's how I've done it for the whole of publicwhip.  Arrange a date 
>> with me if you want to know how to get started.  The technical term 
>> for what you are trying to do is making the data accessible.  So 
>> downloading all the data, adding a proper search engine, and 
>> reposting it in a usable form is not violating the copyright; it's 
>> making it accessible for people who can't handle their interface.  Or 
>> so goes the argument.  It hasn't been tested in court, but the moral 
>> defense is: if the patent office is willing to take on these improved 
>> capabilities which people need, then you will take your website 
>> down.  It should be as legal as caching the webpages for quicker access.
>>
>> Julian T.
>>
>>
>>
>> _______________________________________________
>> Liverpool mailing list
>> Liverpool at mailman.lug.org.uk
>> https://mailman.lug.org.uk/mailman/listinfo/liverpool
>
Thanks for the offer Aidan, I might well take you up on that.

I thought I had it cracked: take the URL, grab the data, strip it and 
convert it to XML, then load it into MySQL.  It all worked well on a test 
page I'd downloaded.  When I ran it for real it failed, because of how the 
URLs for the patent docs are treated and because some data was missing.  
I've solved the missing data and now have to fight the URL refs.  With 
luck, next week should do it.  Designing the pages will be the next job; 
grabbing the data with PHP should be easy.
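
For the URL refs, here is a rough sketch of the approach Julian 
described: pull the href values out of a page with a regexp and resolve 
each one against the address of the page it came from.  It is only a 
sketch, not what I am actually running.  It uses the Python 3 spellings 
(urllib.request.urlopen and urllib.parse.urljoin) of the calls Julian 
named, and the function names, the regexp and the example.org addresses 
are all made up for illustration.  I don't know yet whether the patent 
office pages are this well behaved.

```python
# Sketch only: Python 3 names for urllib.urlopen() and urlparse.urljoin().
import re
from urllib.parse import urljoin
from urllib.request import urlopen

# Crude href matcher (hypothetical); fine for a sketch, fragile on messy HTML.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(html, base_url):
    """Return absolute URLs for each href in html, in page order, no repeats."""
    links = []
    for ref in HREF_RE.findall(html):
        # urljoin turns a relative ref into a full URL using the page address.
        absolute = urljoin(base_url, ref)
        if absolute not in links:
            links.append(absolute)
    return links


def fetch(url):
    """Download one page as text (UTF-8 is an assumption about the site)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

The point of keeping extract_links() separate from fetch() is that it 
can be tried on a saved page first, with the page's original address 
passed in as base_url, before letting it loose on the live site.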

Tony


