[Phpwm] Full-text content searching
Ian Munday
ian.munday at illumen.co.uk
Fri Dec 12 16:57:00 UTC 2008
On 8 Dec 2008, at 17:02, Tim Williams wrote:
> On Mon, 8 Dec 2008, Ian Munday wrote:
>
>> Does anyone have any experience of performing full-text content
>> searches within Microsoft Office documents (in particular Word and
>> Excel) and PDF files using PHP? If so, could they recommend a
>> solution? (The files themselves are currently stored within a MySQL
>> database, although they could be moved out if need be.)
>
> The global search function in Moodle uses pdf2text and antiword to get a
> text dump of such files and then uses this in a lucene search index. If
> you want to keep everything in the database, you could use a similar
> approach to create a text version of the file which is then inserted into
> the mysql database as an indexable text field.
>
> I've also used openoffice as a server process to automatically convert
> uploaded documents between formats. That was written using java
> though, not sure if there are any PHP classes for the OpenOffice api.
>
> Tim W
Thanks for your replies and pointers on this, guys. I've looked a bit
further into the problem and am going to try to specify more tightly
which documents will and won't be indexed for search. I'd like to
make use of Zend_Search_Lucene
(http://framework.zend.com/manual/en/zend.search.lucene.html) and its
support for Word, PowerPoint and Excel 2007 documents. I've read that
it doesn't perform brilliantly for a large number of documents (where
large means a million+), but I hope it will perform adequately for the
20,000 or so I may need to index.
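
For anyone following along, the extract-then-index approach Tim describes
above could be sketched roughly as follows. This is only an illustration
under assumptions, written in Python rather than PHP for brevity: the
extension-to-tool mapping, the function names and the TextIndex class are
all hypothetical, and the small in-memory index merely stands in for a
real Lucene index or a MySQL full-text column. It assumes the pdftotext
and antiword command-line tools are installed.

```python
import re
import subprocess
from collections import defaultdict
from pathlib import Path

# Hypothetical mapping from file extension to an external text extractor.
# pdftotext writes to stdout when the output file is given as "-";
# antiword prints the text of an old-style .doc file to stdout.
EXTRACTORS = {
    ".pdf": lambda path: ["pdftotext", path, "-"],
    ".doc": lambda path: ["antiword", path],
}


def extractor_command(path):
    """Return the command line that would dump `path` as plain text."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"no extractor configured for {suffix!r}")
    return EXTRACTORS[suffix](path)


def extract_text(path):
    """Run the external tool and return its output as a string."""
    result = subprocess.run(
        extractor_command(path), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")


class TextIndex:
    """Toy inverted index: maps each word to the ids of documents using it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        for word in re.findall(r"\w+", text.lower()):
            self._postings[word].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every word in the query."""
        words = re.findall(r"\w+", query.lower())
        if not words:
            return set()
        matches = [self._postings.get(word, set()) for word in words]
        return set.intersection(*matches)
```

In use, each stored file would be passed through extract_text() once at
upload time and the result handed to add() along with the file's database
id, so that searches never have to touch the original binary documents.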
Regards,
Ian