[dundee] Noob Programming Question

Rick Moynihan rick.moynihan at gmail.com
Tue Mar 16 00:49:02 UTC 2010


Hi Gary,

I'm not entirely sure what it is you're trying to do, as you haven't
said how the curl script is told what to process (I'm assuming it's
working through a list of URLs rather than continually polling a
single URL).

You've implied that it's a persistent process (in need of monitoring),
but depending on the amount of data you have to churn through, you
might want to consider making the script stateless and having it
terminate when it's done.  Then just use cron to fire off the script
every minute (with a small check to see that it's not still running).
The benefit here is simplicity and less baggage (less need for process
monitors etc...); the downside is that you won't pull data down as
quickly as possible, because the script won't always be running and
there will be a delay before it starts another cycle.
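
Something like this is roughly what I mean -- completely untested, and
the script name, urls.txt and the results path are just examples:

    #!/bin/sh
    LOCKDIR=/tmp/crawl.lock

    # If a previous run is still going, quietly exit; cron will try
    # again in a minute.  (A stale lock after a kill -9 would need
    # cleaning up by hand.)
    mkdir "$LOCKDIR" 2>/dev/null || exit 0
    trap 'rmdir "$LOCKDIR"' EXIT

    OUTDIR=/usr/home/gary/results/$(date +%Y-%m-%d)
    mkdir -p "$OUTDIR"

    # Work through the URL list once, then terminate.
    n=1
    while read -r url; do
        curl -s -o "$OUTDIR/file$n" "$url"
        n=$((n + 1))
    done < /usr/home/gary/urls.txt

Then a one-line crontab entry ("* * * * * /usr/home/gary/crawl.sh")
fires it every minute.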

The other thing I'd be very careful with is how you handle errors...
i.e. what happens if curl is piping into a file but crashes half way
through?  If this is a potential problem then you can either give the
files curl is writing a special temporary name or put them in a
downloading/ directory, and then use mv/rename to atomically swap them
into the location the other script picks them up from.  Then have your
script clean out all the files from downloading/ every time it starts.
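
Carrying on from the loop in the sketch above (so $url and $n are as
before, and the directory names are again made up), the body would
become something like:

    DOWNLOADING=/usr/home/gary/results/downloading
    READY=/usr/home/gary/results/ready
    mkdir -p "$DOWNLOADING" "$READY"

    # Anything left in downloading/ is from a crashed run, so bin it.
    rm -f "$DOWNLOADING"/*

    name=file$n
    if curl -s -o "$DOWNLOADING/$name" "$url"; then
        # mv within the same filesystem is an atomic rename, so the
        # processing script only ever sees complete files in ready/.
        mv "$DOWNLOADING/$name" "$READY/$name"
    fi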

As for whether you should be using bash, Python or something else for
this, well, it depends on complexity.  For example, if you can get away
with the cron solution and you don't need anything fancy (like
dynamically informing the script of the crawl list), then I'd consider
bash with curl/wget, as it's probably little more than 10 lines (if
that).  If, however, you need the script to integrate and communicate
cleanly with other processes in the system (beyond just dumping the
crawled files in a directory), then I'd strongly consider using a real
language in preference to shell duct-tape.

Be cautious of over-engineering things too...  A client of ours once
used a specialist message queue to manage a small cluster of web
crawlers, and IMHO it was overkill, leading to additional, unneeded
complexity.  For all but the largest systems I'd be tempted to see how
far a simple database table of URLs and associated metadata (last
crawled times etc...) can take you.  Then just use a few SQL
SELECT/UPDATE statements to have workers crawl in the manner you
require.
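
For instance, with sqlite3 (just one option, and the table and column
names here are invented):

    # One-off setup: a table of URLs and when each was last fetched.
    sqlite3 crawl.db 'CREATE TABLE IF NOT EXISTS urls (
        url TEXT PRIMARY KEY,
        last_crawled INTEGER NOT NULL DEFAULT 0);'

    # Each worker grabs the most stale URL...
    url=$(sqlite3 crawl.db \
        'SELECT url FROM urls ORDER BY last_crawled LIMIT 1;')

    # ...fetches it, and records the crawl time on success.
    # (For anything real you'd want proper escaping of $url rather
    # than splicing it straight into the SQL like this.)
    if curl -s -o "downloads/$(echo "$url" | md5sum | cut -d' ' -f1)" "$url"; then
        sqlite3 crawl.db \
            "UPDATE urls SET last_crawled = strftime('%s','now')
             WHERE url = '$url';"
    fi

If you later need several workers, an UPDATE that marks a row as "in
progress" before fetching is probably all the coordination you'd need.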

R.

On 15 March 2010 15:37, Gary Short <gary at garyshort.org> wrote:
> Hi Guys,
>
> I was wondering if anyone here can give me some help. What I'm trying to do
> is to run curl and pipe the output to file1, then after a time t, stop curl
> and restart it piping its output to file2 and so on to fileN, so that I can
> use another script to process the files, something like this...
>
> On system_startup:
>   If args[0]:
>      Ctr = args[0]
>   Else:
>      Ctr = 1
>   Start curl > /usr/home/gary/results/date_today/file + ctr
>   While true:
>      When time_up:
>         Stop curl
>         If date_today changes:
>            Ctr = 1
>         Else:
>            Ctr = ctr + 1
>         Start curl > /usr/home/gary/results/date_today/file + ctr
>
> I also need a script to monitor this first one so that if it dies it can be
> restarted, something like...
>
> If script_is_not_running:
>   Arg = digits_at_end_of_newest_file
>   Run script arg
>
> Can anyone point me in the right direction on how to do this on Ubuntu
> 8.04.3 LTS?
>
> Thanks in advance,
> Gary
> http://www.garyshort.org
> http://www.twitter.com/garyshort
>
> _______________________________________________
> dundee GNU/Linux Users Group mailing list
> dundee at lists.lug.org.uk  http://dundeelug.org.uk
> https://mailman.lug.org.uk/mailman/listinfo/dundee
> Chat on IRC, #tlug on irc.lug.org.uk


