Hi Octavio,<br><br><div class="gmail_quote"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Ok, I use Python to process text, especially because there is this sweet<br>


package called NLTK, which runs on Python. With it I can do a lot of<br>

things, like parsing, tagging, generating grammars, etc., but it takes<br>

about 0.5 seconds to load; even on a quad-core processor it would take<br>

about 0.25 seconds. That is a lot when you try to run the process over<br>

10,000 files or so. So I came up with the idea of doing some of this<br>

directly in bash, but first I wanted to know whether it is possible. Now that<br>

you tell me it could be, I think I will try to translate it to bash...<br>

Thank you a lot.<br>

<div><div></div><div></div></div></blockquote><div><br>It depends on what you want to achieve. You can do simple things, such as creating frequency lists, with shell commands (e.g. sort | uniq -c | sort -nr), but for more advanced language processing you will need a higher-level programming language. NLTK is great, and in most situations you will not be able to replace it with a pipeline of shell commands (or at least not with one that runs faster). <br>


<br>I can suggest two options:<br>1. If you need some information that you obtain from NLTK, you can annotate the files with it once and save the annotated files on disk. After that you may be able to achieve your task using shell commands (awk, grep, sed, etc.). Obviously this method makes sense only if you need to process the files again and again, as the initial annotation will still take time.<br>


<br>2. If the processing itself is not too time-consuming in Python and your problems are caused by initialising NLTK, redesign your program so that the initialisation is done only once and all the files are then processed within the same Python program (that is, pass it the directory that contains all your files rather than a single file as a parameter).<br>
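<br>To illustrate option 2, here is a minimal sketch of the "initialise once, then loop over a directory" structure. It is only an outline under assumptions: the names load_resources, process_file and process_directory are hypothetical, the *.txt pattern is a guess about your corpus, and plain whitespace splitting stands in for whatever you actually load from NLTK.<br>

```python
import sys
from pathlib import Path


def load_resources():
    """Stand-in for the slow, one-time setup (hypothetical).

    In the real program this is where the NLTK import and any tagger or
    grammar loading would go, so that the cost is paid once per run.
    """
    return str.split  # placeholder "tokenizer": whitespace splitting


def process_file(path, tokenize):
    """Process one file with the already-initialised resources."""
    text = path.read_text(encoding="utf-8", errors="replace")
    return len(tokenize(text))  # here: just count the tokens


def process_directory(directory, pattern="*.txt"):
    """Initialise once, then handle every matching file in `directory`."""
    tokenize = load_resources()  # paid once, not once per file
    results = {}
    for path in sorted(Path(directory).glob(pattern)):
        results[path.name] = process_file(path, tokenize)
    return results


if __name__ == "__main__" and len(sys.argv) == 2:
    # The script takes a directory, not a single file, as its parameter.
    for name, count in process_directory(sys.argv[1]).items():
        print(name, count)
```

<br>The point of the design is that the start-up cost no longer grows with the number of files: the interpreter and the resources are loaded once, and only process_file runs per file.<br>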


<br>Processing 100,000 files is not too much, but if you want a scalable application you may need to start thinking about map-reduce approaches (and distributed processing).<br><br>See you in autumn :)<br>


<br>Constantin<br></div></div><br>