In terms of spidering round a site, have you tried wget? wget --help should give you all you need to know, and it's got a pretty flexible set of options as to how far to go.<br><br>I don't know about the PDF save issues, though.
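<br><br>Something along these lines is the sort of invocation I mean — the URL is just a placeholder, and the flags are only a starting point to tune for the actual site:<br><br>

```shell
# Recursive crawl, two levels deep, never climbing above the start page,
# keeping only HTML pages and PDFs, with a polite pause between requests.
# http://example.org/patents/ is a placeholder, not the real site.
wget --recursive --level=2 --no-parent \
     --accept html,pdf \
     --wait=1 \
     http://example.org/patents/
```

<br>The "Recursive Retrieval" section of the wget manual covers the rest (--level controls depth, --accept filters by suffix).<br>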
<br><br>hth,<br>Dave.<br><br><div><span class="gmail_quote">On 31/03/06, <b class="gmail_sendername">tony burrows</b> <<a href="mailto:tony@tonyburrows.uklinux.net">tony@tonyburrows.uklinux.net</a>> wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
After the last meeting I started playing with the idea of getting data<br>from the patent office. It gave me an excuse to learn Python as well.<br><br>Currently I can take a downloaded page and grab the relevant data<br>
into an XML file. I know how to use Python to stuff it into MySQL, but<br>I have hit a few problems.<br><br>First, I'm not sure how to navigate around pages automatically, so that I<br>can grab stuff without having to do it all manually through a browser.
<br>Second, the search terms I'm using are vague at best - software,<br>programming, computer.<br>Third, I wanted to grab the actual patent doc, then do a word count.<br>The website provides this with some odd sort of extension that Firefox
<br>doesn't seem to handle (it's actually PDF, and neither Konqueror nor Opera<br>has problems with it). Worst of all, all you get is a single page at a<br>time, which doesn't seem to want to save (when I tried and reopened it,
<br>there was nothing there).<br>Any suggestions?<br><br>Tony<br><br>_______________________________________________<br>Liverpool mailing list<br><a href="mailto:Liverpool@mailman.lug.org.uk">Liverpool@mailman.lug.org.uk</a>
<br><a href="https://mailman.lug.org.uk/mailman/listinfo/liverpool">https://mailman.lug.org.uk/mailman/listinfo/liverpool</a><br><br></blockquote></div><br>
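PS: on your first question — following links from a script rather than by hand — here's a minimal sketch using Python 3's html.parser. The sample HTML is a stand-in; in a real crawl you'd fetch each page with urllib.request.urlopen() and feed the markup to the parser, then fetch whatever links it collected.<br><br>

```python
# Minimal link extractor using only the standard library.
# A sample string stands in for a fetched search-results page.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen in the document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

sample = '<p><a href="result1.html">1</a> <a href="result2.html">2</a></p>'
collector = LinkCollector()
collector.feed(sample)
print(collector.links)  # ['result1.html', 'result2.html']
```

<br>For the word count once you've got the PDFs saved, pdftotext (from the xpdf tools) will dump plain text you can split and count from Python.<br>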