In terms of spidering round a site, have you tried wget? wget --help should give you all you need to know, and it's got a pretty flexible set of options as to how far to go.<br><br>I don't know about the PDF save issues, though.
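<br><br>Something along these lines is the sort of invocation I mean — the URL is just a placeholder, and the flags are only a starting point to tune for the actual site:<br><br>

```shell
# Recursive crawl, two levels deep, never climbing above the start page,
# keeping only HTML pages and PDFs, with a polite pause between requests.
# http://example.org/patents/ is a placeholder, not the real site.
wget --recursive --level=2 --no-parent \
     --accept html,pdf \
     --wait=1 \
     http://example.org/patents/
```

<br>The "Recursive Retrieval" section of the wget manual covers the rest (--level controls depth, --accept filters by suffix).<br>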
<br><br>hth,<br>Dave.<br><br><div><span class="gmail_quote">On 31/03/06, <b class="gmail_sendername">tony burrows</b> <<a href="mailto:tony@tonyburrows.uklinux.net">tony@tonyburrows.uklinux.net</a>> wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
After the last meeting I started playing with the idea of getting data<br>from the patent office. It gave me an excuse to learn Python as well.<br><br>Currently I can take a downloaded page and grab the relevant data<br>
into an XML file. I know how to use Python to stuff it into MySQL, but<br>I have hit a few problems.<br><br>First, I'm not sure how to navigate around pages automatically, so that I<br>can grab stuff without having to do it all manually through a browser.
<br>Second, the search terms I'm using are vague at best - software,<br>programming, computer.<br>Third, I wanted to grab the actual patent doc, then do a word count.<br>The website provides this with some odd sort of extension that Firefox
<br>doesn't seem to handle (it's actually PDF, and neither Konqueror nor Opera<br>has problems with it). Worst of all, all you get is a single page at a<br>time, which doesn't seem to want to save (when I tried and reopened it,
<br>there was nothing there).<br>Any suggestions?<br><br>Tony<br><br>_______________________________________________<br>Liverpool mailing list<br><a href="mailto:Liverpool@mailman.lug.org.uk">Liverpool@mailman.lug.org.uk</a>
<br><a href="https://mailman.lug.org.uk/mailman/listinfo/liverpool">https://mailman.lug.org.uk/mailman/listinfo/liverpool</a><br><br></blockquote></div><br>
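PS: on your first question — following links from a script rather than by hand — here's a minimal sketch using Python 3's html.parser. The sample HTML is a stand-in; in a real crawl you'd fetch each page with urllib.request.urlopen() and feed the markup to the parser, then fetch whatever links it collected.<br><br>

```python
# Minimal link extractor using only the standard library.
# A sample string stands in for a fetched search-results page.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen in the document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

sample = '<p><a href="result1.html">1</a> <a href="result2.html">2</a></p>'
collector = LinkCollector()
collector.feed(sample)
print(collector.links)  # ['result1.html', 'result2.html']
```

<br>For the word count once you've got the PDFs saved, pdftotext (from the xpdf tools) will dump plain text you can split and count from Python.<br>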