[Phpwm] Full-text content searching

Ian Munday ian.munday at illumen.co.uk
Fri Dec 12 16:57:00 UTC 2008


On 8 Dec 2008, at 17:02, Tim Williams wrote:

> On Mon, 8 Dec 2008, Ian Munday wrote:
>
>> Does anyone have any experience of performing full-text content
>> searches within Microsoft Office documents (in particular Word and
>> Excel) and PDF files using PHP?  If so, could they recommend a
>> solution?  (The files themselves are currently stored within a MySQL
>> database, although they could be moved out if need be.)
>
> The global search function in Moodle uses pdf2text and antiword to get a
> text dump of such files and then uses this in a lucene search index. If
> you want to keep everything in the database, you could use a similar
> approach to create a text version of the file which is then inserted into
> the mysql database as an indexable text field.
>
> I've also used OpenOffice as a server process to automatically convert
> uploaded documents between formats. That was written using Java
> though, so I'm not sure if there are any PHP classes for the OpenOffice API.
>
> Tim W
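
For reference, the text-dump step described above can be sketched roughly
as follows. This is only an illustration, not Moodle's actual code: it
assumes the command-line tools `pdftotext` (the quoted message calls it
pdf2text) and `antiword` are installed, and the names CONVERTERS,
build_command and extract_text are made up for the example.

```python
import subprocess
from pathlib import Path

# Assumed mapping from file extension to a text-dump command.
# "pdftotext FILE -" and "antiword FILE" both write plain text to stdout.
CONVERTERS = {
    ".pdf": lambda path: ["pdftotext", path, "-"],
    ".doc": lambda path: ["antiword", path],
}


def build_command(path):
    """Return the argv list that dumps `path` to text, or None if unsupported."""
    make = CONVERTERS.get(Path(path).suffix.lower())
    return make(path) if make else None


def extract_text(path):
    """Run the external converter and return its stdout (hypothetical helper)."""
    command = build_command(path)
    if command is None:
        raise ValueError(f"no converter configured for {path}")
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

The string returned by extract_text could then be stored alongside the
original file, for example in a MySQL text column with a FULLTEXT index,
or handed to a Lucene-style indexer.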

Thanks for your replies and pointers on this, guys.  I've looked a bit
further into the problem and am going to try to specify more tightly
which documents will and won't be indexed for search.  I'd like to
make use of Zend_Search_Lucene
(http://framework.zend.com/manual/en/zend.search.lucene.html) and its
support for Word, PowerPoint and Excel 2007 documents.  I've read that
it doesn't perform brilliantly for a large number of documents (where
large means a million+), but hope that for the 20,000 or so I may need
it will perform adequately.

Regards,

Ian




