[Gllug] simple file matching

Wed Aug 24 10:54:03 UTC 2005

Neil <neil at cozyspace.net> asked:

> I recently had to match a list of items in one file with that of 
> another. The lists were unordered and using diff or comm didn't work
> for me.... How have you unix gurus defeated file/list matching using
> bash?

Any reason for not sorting the lists first? I guess it depends on 
factors like your definition of 'very large', your system parameters for 
CPU speed, storage space and RAM, and whether you're going to need to 
repeat the operation.
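
As a rough sketch of the sort-first approach: comm only gives sensible
results on sorted input, so sorting both lists before comparing them is
usually enough. The function name compare_lists and the output file
names below are made up for illustration, and dropping duplicate
entries with sort -u is an assumption about what you want.

```shell
#!/usr/bin/env bash
# compare_lists X Y OUTDIR  (hypothetical helper, not from the thread)
# Sorts two unordered lists and splits them into three result files.
compare_lists() {
    local x=$1 y=$2 outdir=$3

    # comm needs both inputs sorted in the same collation order, so the
    # locale is pinned to C for sort and comm alike.
    # -u drops duplicate lines (an assumption; remove it to keep them).
    LC_ALL=C sort -u "$x" > "$outdir/x.sorted"
    LC_ALL=C sort -u "$y" > "$outdir/y.sorted"

    # -23 keeps column 1 (lines only in X), -13 keeps column 2 (lines
    # only in Y), -12 keeps column 3 (lines in both).
    LC_ALL=C comm -23 "$outdir/x.sorted" "$outdir/y.sorted" > "$outdir/only_in_x"
    LC_ALL=C comm -13 "$outdir/x.sorted" "$outdir/y.sorted" > "$outdir/only_in_y"
    LC_ALL=C comm -12 "$outdir/x.sorted" "$outdir/y.sorted" > "$outdir/in_both"
}
```

Called as compare_lists list1.txt list2.txt /some/dir, it leaves the
sorted copies on disk alongside the three result files, so the sort
does not have to be repeated if the comparison is run again.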

I recently had to compare some data in a couple of large text files 
(several hundred MB each, with nearly a million lines). Stage one was 
to filter down to just the data I needed to compare (hospital numbers 
in this case). I used a perl script for that, which gives me the option 
of easily pulling out other fields when running similar operations in 
future, but passing the files through cut and tr would have been enough 
to get just the hospital number and trim white space from the end of 
the shorter ones.

I passed the results through sort before saving them to disk as 
intermediary files. Those were then passed through diff (saved as 
another file), and the diff output was split into two further files 
(numbers only in X and numbers only in Y), using grep to pick up the 
'<' or '>' at the start of each line and cut to leave only the numbers 
in the file.
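
The cut/tr/sort/diff/grep variant of that process might look roughly
like the sketch below. The field layout is an assumption (comma
separated, with the number in the first field), as are the function and
file names; the original files and the perl script are not shown in
this message.

```shell
#!/usr/bin/env bash
# extract_ids FILE  (hypothetical helper)
# Assumes comma-separated records with the hospital number in field 1.
extract_ids() {
    # cut keeps field 1, tr strips padding spaces, tabs and carriage
    # returns, sort -u orders the numbers and drops repeats.
    cut -d, -f1 "$1" | tr -d ' \t\r' | LC_ALL=C sort -u
}

# diff_ids X Y OUTDIR  (hypothetical helper)
# Keeps every intermediate stage as a file in OUTDIR.
diff_ids() {
    local x=$1 y=$2 outdir=$3

    extract_ids "$x" > "$outdir/x.ids"
    extract_ids "$y" > "$outdir/y.ids"

    # diff exits with status 1 when the files differ; only a status of
    # 2 or more is a real error.
    diff "$outdir/x.ids" "$outdir/y.ids" > "$outdir/xy.diff" || [ $? -eq 1 ]

    # In diff's default output, '< ' marks lines only in the first file
    # and '> ' lines only in the second; cut -c3- removes the marker.
    grep '^<' "$outdir/xy.diff" | cut -c3- > "$outdir/only_in_x.ids"
    grep '^>' "$outdir/xy.diff" | cut -c3- > "$outdir/only_in_y.ids"
}
```

Because x.ids, y.ids and xy.diff all stay on disk, any stage can be
inspected or re-run on its own.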

The reason for keeping intermediary stages saved as files was so that I 
could track back to a particular point in the process without having to 
repeat all the processing. It also meant that I could perform checks on 
the data from any stage to ensure that my 'logic' wasn't making 
erroneous assumptions and dropping data. Since I've got ample storage 
space, the extra files caused no problems for the system and were 
significantly smaller (i.e. much faster to process) when stripped of 
extraneous data.
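
One way to check saved stages for dropped data is sketched below. It is
only an illustration of the kind of check meant here, not the checks
actually run; the function name check_split and its four file
arguments are hypothetical, and it assumes each list of numbers is
sorted with no duplicates.

```shell
#!/usr/bin/env bash
# check_split X_IDS Y_IDS ONLY_X ONLY_Y  (hypothetical helper)
# Returns 0 if the split files are consistent with the full lists.
check_split() {
    local x_ids=$1 y_ids=$2 only_x=$3 only_y=$4
    local nx ny ox oy
    nx=$(wc -l < "$x_ids"); ny=$(wc -l < "$y_ids")
    ox=$(wc -l < "$only_x"); oy=$(wc -l < "$only_y")

    # The count of shared numbers must be the same from either side.
    if [ $((nx - ox)) -ne $((ny - oy)) ]; then
        echo "count mismatch: numbers were dropped or duplicated" >&2
        return 1
    fi

    # Nothing listed as "only in X" may appear in Y, and vice versa.
    # -F: fixed strings, -x: whole-line match, -q: quiet, -f: patterns
    # from file. The -s test skips empty pattern files.
    if [ -s "$only_x" ] && grep -Fxq -f "$only_x" "$y_ids"; then
        echo "a number marked only-in-X also appears in Y" >&2
        return 1
    fi
    if [ -s "$only_y" ] && grep -Fxq -f "$only_y" "$x_ids"; then
        echo "a number marked only-in-Y also appears in X" >&2
        return 1
    fi
    return 0
}
```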

Wulf

-- 
Wulf Forrester-Barker
Webmaster
http://www.lewisham.nhs.uk/

-- 
Gllug mailing list  -  Gllug at gllug.org.uk
http://lists.gllug.org.uk/mailman/listinfo/gllug