[Nottingham] shell script guru rqd.

Mr Alan Carter nottingham at mailman.lug.org.uk
Wed Jan 15 02:06:01 2003


> I have a large list of publications. The problem is that the order they
> are in is the inverse to what I want (the oldest are first). The other
> major problem is the date is not in the same place for each article e.g.
>
> Bloggs Joe, The Art Picking Your Nose, Journal of Useless Things, 1975
>
> Blair Tony, How to Make Friends and Influence People, 1997 Labour
> Manifestation pp 69 - 79.

The moving position of the dates in Matt's problem is not best 
addressed by a shell script, because a) shell is not good at string 
handling, b) the input data may contain characters shell is sensitive 
to, like ", c) he may believe he has a rigorous scheme for delimiting 
fields in his references, but you can bet there will be all sorts of 
doubles, tabs, spaces and even weird unprintables in there, and d) 
there are a couple of messy aspects of the necessary loops that 
significantly increase the size and complexity of the script.

In ye olde UNIX model the solution is to scan through the file with a 
tiny C filter program, find the year fields and output each line with 
the year as the first field. Then shell can be used to sort the lines 
by year, and strip off the first field. The C program looks like this:

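#include <stdio.h>
#include <stdlib.h>   /* atoi() */
#include <string.h>   /* strcpy(), strtok() */

#define MAXLINE 1024
#define START 1950
#define END 2003

int main(void)
{
    int Year;
    int Value;
    char *Pointer;
    char Buffer[MAXLINE + 1];
    char Scratch[MAXLINE + 1];

    while(fgets(Buffer, sizeof Buffer, stdin))
    {
        /* Lines with no recognisable year get 1900 */
        Year = 1900;

        /* strtok() modifies its argument, so tokenise a copy */
        strcpy(Scratch, Buffer);
        Pointer = strtok(Scratch, " \t\n");

        while(Pointer)
        {
            /* The last token in the range START..END wins */
            Value = atoi(Pointer);
            if(Value >= START && Value <= END)
                Year = Value;
            Pointer = strtok(NULL, " \t\n");
        }

        /* Buffer still ends with its own newline */
        printf("%d %s", Year, Buffer);
    }

    return 0;
}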

Just copy and paste it into a file called years.c, and compile by 
saying:

$ make years

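Then just pipe the references through years, pipe the years output 
through sort(1) and pipe the results of that through a little command 
line loop that reads each line into two variables - one to be thrown 
away and the other that contains the original line. Matt wants the 
newest first, so sort gets -rn (reverse, numeric); read -r and the 
quoted printf stop shell chewing on backslashes and runs of spaces in 
the data. Like this:

$ cat refs | ./years | sort -rn | while read -r junk data
> do
> printf '%s\n' "$data"
> done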

You could do this with awk or even perl just on the command line, both 
of which can be smart with string handling, delimiters and control flow,
but in both cases the funny characters problem would remain, and if you 
used perl you'd have to spend several hours debugging and ritually 
washing afterwards ;-)

Alan

--