[Gllug] simple file matching
Wulf Forrester-Barker
wulf.forrester-barker at uhl.nhs.uk
Wed Aug 24 10:54:03 UTC 2005
Neil <neil at cozyspace.net> asked:
> I recently had to match a list of items in one file with that of
> another. The lists were unordered and using diff or comm didn't work
> for me.... How have you unix gurus defeated file/list matching using
> bash?
Any reason for not sorting the lists first? I guess it depends on
factors like your definition of 'very large', your system parameters for
CPU speed, storage space and RAM, and whether you're going to need to
repeat the operation.
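
For small to medium lists, sorting first is usually enough to make comm usable. A minimal sketch, assuming one item per line; the file names and sample items here are made up for illustration:

```shell
# Work in a scratch directory so nothing existing gets overwritten.
workdir=$(mktemp -d) && cd "$workdir"

# Use one collation order throughout so sort and comm agree.
export LC_ALL=C

# Hypothetical input: two unordered lists, one item per line.
printf '%s\n' pear apple fig > list_a.txt
printf '%s\n' fig kiwi apple > list_b.txt

# comm expects sorted input; -u also drops duplicate lines.
sort -u list_a.txt > a.sorted
sort -u list_b.txt > b.sorted

comm -12 a.sorted b.sorted > in_both.txt    # lines present in both lists
comm -23 a.sorted b.sorted > only_in_a.txt  # lines only in list A
comm -13 a.sorted b.sorted > only_in_b.txt  # lines only in list B
```

Whether this is fast enough depends on the factors above; sort spills to temporary files when the input does not fit in memory, so it tends to cope with fairly big lists.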
I recently had to compare some data in a couple of large text files
(several hundred MB each, with close to a million lines). Stage
one was to filter down to just the data I needed to compare (hospital
numbers in this case). I used a perl script for that, which gives me the
option of easily pulling out other fields when running similar
operations in future, but passing the files through cut and tr would
have been enough to get just the hospital number and trim white space
from the end of the shorter ones. I also passed the results through sort
before saving them to disk as intermediary files. Stage two was to run
those sorted files through diff, saving the output to another file.
Stage three was to split that output into two further files (numbers
only in X and numbers only in Y), using grep to pick up the '<' or '>'
at the start of each line and cut to leave only the numbers.
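
Roughly, the cut and tr variant of that process looks like the sketch below. It is not the perl script I actually used, and the pipe-delimited layout, the position of the hospital number in field 1, the file names and the sample records are all assumptions for illustration:

```shell
# Work in a scratch directory so nothing existing gets overwritten.
workdir=$(mktemp -d) && cd "$workdir"
export LC_ALL=C

# Hypothetical sample data: pipe-delimited, hospital number in field 1,
# shorter numbers padded with trailing spaces.
printf '%s\n' 'H1003   |Smith' 'H1001   |Jones' 'H1000002|Patel' > x.txt
printf '%s\n' 'H1001   |Jones' 'H1004   |Brown' 'H1000002|Patel' > y.txt

# Stage 1: keep only the hospital number, strip the padding, sort, save.
cut -d'|' -f1 x.txt | tr -d ' ' | sort > x.nums
cut -d'|' -f1 y.txt | tr -d ' ' | sort > y.nums

# Stage 2: diff the sorted lists and keep the output. diff exits with
# status 1 when the files differ, so '|| true' stops that being treated
# as a failure.
diff x.nums y.nums > xy.diff || true

# Stage 3: split on the leading '<' or '>' and cut off the two-character
# marker, leaving only the numbers.
grep '^<' xy.diff | cut -c3- > only_in_x.txt
grep '^>' xy.diff | cut -c3- > only_in_y.txt
```

With this sample data, only_in_x.txt ends up holding H1003 and only_in_y.txt holds H1004. Note that tr -d ' ' removes every space, not just trailing ones, which is only safe if the numbers themselves never contain spaces.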
The reason for keeping intermediary stages saved as files was so that I
could track back to a particular point in the process without having to
repeat all the processing. It also meant that I could perform checks on
the data from any stage to ensure that my 'logic' wasn't making
erroneous assumptions and dropping data. Since I've got ample storage
space, the extra files caused no problems for the system and were
significantly smaller (i.e. much faster to process) when stripped of
extraneous data.
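
The kind of checks I mean can be very simple. A sketch, again with made-up file names and sample records rather than the real data:

```shell
# Work in a scratch directory so nothing existing gets overwritten.
workdir=$(mktemp -d) && cd "$workdir"
export LC_ALL=C

# Hypothetical raw file and the intermediary file derived from it.
printf '%s\n' 'H1003|Smith' 'H1001|Jones' > x.txt
cut -d'|' -f1 x.txt | sort > x.nums

# Check 1: the filter should not have dropped or added any lines.
# (tr strips the padding some wc implementations put before the count.)
raw_count=$(wc -l < x.txt | tr -d ' ')
num_count=$(wc -l < x.nums | tr -d ' ')
[ "$raw_count" -eq "$num_count" ] && echo "line counts match: $num_count"

# Check 2: the saved file really is sorted (sort -c exits non-zero if not).
sort -c x.nums && echo "x.nums is sorted"

# Check 3: no record came out with an empty number.
! grep -q '^$' x.nums && echo "no empty numbers"
```

Because each stage is a file on disk, checks like these can be rerun at any point without redoing the earlier processing.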
Wulf
--
Wulf Forrester-Barker
Webmaster
http://www.lewisham.nhs.uk/