[Wolves] Is it possible...

Constantin Orăsan c.orasan at gmail.com
Tue Jun 23 10:33:32 UTC 2009


Hi Octavio,

> OK, I use Python to process text, especially because there is this sweet
> package called NLTK, which runs on Python. With it I can do a lot of
> things, like parsing, tagging, generating grammars, etc., but it takes
> about 0.5 seconds to load; even on a quad-core processor it would take
> about 0.25 seconds. That is a lot when you try to run the process over
> 10000 files or so. So I came up with the idea of doing some of this stuff
> directly in bash, but first I wanted to know if it is possible. Now that
> you tell me that it could be, I think I will try to translate it to bash.
> Thank you a lot.
>

It depends on what you want to achieve. You can do simple things, such as
creating frequency lists, with bash commands (e.g. sort | uniq -c | sort -nr),
but if you want to do more advanced language processing you will need a
higher-level programming language. NLTK is great, and in most situations you
will not be able to replace it with a pipeline of shell commands (or at
least not with one that runs faster).

I can suggest two options:
1. If you need some information that you obtain from NLTK, you can annotate
the files with it and save the annotated files on disk. After that you may be
able to achieve your task using shell commands (awk, grep, sed, etc.).
Obviously this method makes sense only if you need to process the files again
and again, as the initial annotation will still take time.

2. If the processing itself is not too time consuming in Python and your
problems are caused by initialising NLTK, redesign your program so you do
the initialisation only once and then you process all the files within your
Python program (I mean you do not pass a file as a parameter, but a
directory where all your files are).
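
To make the two options concrete, here is a minimal sketch that combines
them: the slow start-up is paid once, every file in a directory is processed
in the same Python run, and the annotations are saved to disk. The function
names, the ".tagged" suffix and the word/TAG output format are my own
assumptions for illustration, and load_tagger() assumes NLTK and its
tokenizer and tagger data are already installed.

```python
import os


def load_tagger():
    """Do the expensive start-up work once and return a tagging function.

    Assumption: NLTK plus its tokenizer and tagger data are installed.
    """
    import nltk  # the slow import happens here, a single time

    def tag(text):
        # Returns a list of (word, part-of-speech tag) pairs.
        return nltk.pos_tag(nltk.word_tokenize(text))

    return tag


def annotate_directory(src_dir, dst_dir, tag):
    """Tag every file in src_dir and write word/TAG tokens to dst_dir.

    `tag` is any function mapping a string to (word, tag) pairs.
    Returns the number of files processed.
    """
    os.makedirs(dst_dir, exist_ok=True)
    done = 0
    for name in sorted(os.listdir(src_dir)):
        src_path = os.path.join(src_dir, name)
        if not os.path.isfile(src_path):
            continue  # skip sub-directories
        with open(src_path, encoding="utf-8") as handle:
            text = handle.read()
        line = " ".join(f"{word}/{pos}" for word, pos in tag(text))
        dst_path = os.path.join(dst_dir, name + ".tagged")
        with open(dst_path, "w", encoding="utf-8") as out:
            out.write(line + "\n")
        done += 1
    return done


# Hypothetical usage (directory names are placeholders):
#   annotate_directory("texts", "tagged", load_tagger())
```

Once the annotated files exist, something along the lines of
grep -oh '[^ ]*/NN[^ ]*' tagged/* | sort | uniq -c | sort -nr
(GNU grep assumed) should give a frequency list of the tokens whose tag
starts with NN, without loading NLTK again.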

Processing 100,000 files is not too much, but if you want a scalable
application you may need to start thinking about map-reduce approaches
(and distributed processing).
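
As a rough illustration of the map-reduce idea (not a distributed
implementation), the sketch below counts words per document in a "map" step
and merges the partial counts in a "reduce" step. All names are my own, and
the whitespace tokenisation is a deliberate simplification.

```python
from collections import Counter
from functools import reduce


def map_count(text):
    """Map step: turn one document into its own word-frequency table."""
    return Counter(text.lower().split())


def reduce_counts(left, right):
    """Reduce step: merge two partial tables into one."""
    return left + right


def word_frequencies(documents, mapper=map):
    """Combine per-document counts into a single frequency table.

    `mapper` defaults to the built-in map; a parallel map with the same
    (function, iterable) calling convention could be passed instead.
    """
    return reduce(reduce_counts, mapper(map_count, documents), Counter())
```

Because each document is counted independently, the map step is the part
that could be spread over several processes or machines; only the merged
tables need to come back together.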

See you in autumn :)

Constantin