[YLUG] Manipulation of text files... anyone with lots of experience?

Dan Bishop dan at viciouslime.co.uk
Mon Jul 28 20:49:59 BST 2008


I'm sending this to everyone in my address book who has any sort of
technical know-how and also to the ylug mailing list. Basically, I have
read as much as I ever care to read about sed, awk and regular
expressions and have hit the point where I don't care how it happens, I
just want it to happen.

I have a text file, an extract of which is below, that I want
rearranging to look like the second example. Quite how to do this is
now completely beyond me; I think I've burnt out that part of my
brain...

Source text file:

Aalbutt {m} :: plaice
Aalspieß {m} | Aalspieße {pl} :: eel spear | eel spears
Aalsuppe {f} | Aalsuppen {pl} :: eel soup | eel soups
Aar {m} :: eagle
Aas {n}; Aasfleisch {n} :: carrion
Abenteuerreise {f} | Abenteuerreisen {pl} :: adventure journey; adventure trip | adventure journeys; adventure trips
heilsam; gesund; zuträglich {adj} :: salutiferous

How it wants to look:

Aalbutt {m} :: plaice
Aalspieß {m} :: eel spear
Aalspieße {pl} :: eel spears
Aalsuppe {f} :: eel soup
Aalsuppen {pl} :: eel soups
Aar {m} :: eagle
Aas {n} :: carrion
Aasfleisch {n} :: carrion
Abenteuerreise {f} :: adventure journey
Abenteuerreise {f} :: adventure trip
Abenteuerreisen {pl} :: adventure journeys
Abenteuerreisen {pl} :: adventure trips
heilsam {adj} :: salutiferous
gesund {adj} :: salutiferous
zuträglich {adj} :: salutiferous

I think the above example probably best illustrates what I'm trying to
achieve, though I will attempt to explain it in words now (if you
already understand, you might want to skip this next paragraph). As
you've possibly guessed, anything left of a double colon (::) is German
and anything to the right of it is English. As a single word in one
language can be equivalent to several in another, some lines have two
or more words separated by semicolons on one or both sides of the
double colon. These need to be separated so that there is never more
than one word on each side of the double colon, but none of the
"linking information" is lost. For example, the line "Aas {n};
Aasfleisch {n} :: carrion" in the original means that both Aas and
Aasfleisch are equal to carrion, so this line would become two lines
like so:
Aas {n} :: carrion
Aasfleisch {n} :: carrion
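
One possible way to do this semicolon split in Python, as a rough
sketch rather than a finished tool. It assumes, going by the heilsam
example, that a {tag} at the end of a semicolon list applies to every
untagged word before it. The names split_terms and expand_line are made
up for illustration, and lines containing | are not handled here.

```python
import re
from itertools import product

# A trailing marker such as {m}, {pl} or {adj}.
TAG = re.compile(r"\{[^}]*\}$")

def split_terms(side):
    """Split one side of '::' on ';' into individual terms."""
    terms = [t.strip() for t in side.split(";") if t.strip()]
    last_tag = TAG.search(terms[-1]) if terms else None
    if last_tag:
        # Assumption: "heilsam; gesund; zuträglich {adj}" tags all three
        # words, so untagged terms inherit the final term's tag.
        tag = last_tag.group()
        terms = [t if "{" in t else f"{t} {tag}" for t in terms]
    return terms

def expand_line(line):
    """Return one 'german :: english' line per pairing of alternatives."""
    german, english = line.split("::", 1)
    return [
        f"{g} :: {e}"
        for g, e in product(split_terms(german), split_terms(english))
    ]

for out in expand_line("Aas {n}; Aasfleisch {n} :: carrion"):
    print(out)
```

Every term on the left is paired with every term on the right, so no
linking information is dropped.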

Just to add to the complication, some of the lines have another
separator, a |. Whenever there is a |, there is one on each side of the
double colon. These lines provide no useful linking information
whatsoever for my purposes; however, they can't simply be ignored.
Instead they must be treated as though two lines have been merged into
one. The part between the first | and the ::, together with the part
after the second |, should really be on a new line of its own, with a ::
between the two parts. To illustrate, the following line:

1 2 3 | 4 5 6 :: 1 2 3 | 4 5 6

should be interpreted as:
1 2 3 :: 1 2 3
4 5 6 :: 4 5 6
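
A minimal Python sketch of this un-merging step, assuming each line
contains exactly one :: and the same number of | separators on each
side. The function name split_merged is hypothetical.

```python
def split_merged(line):
    """Split 'a | b :: x | y' into ['a :: x', 'b :: y']."""
    german, english = line.split("::", 1)
    left = [part.strip() for part in german.split("|")]
    right = [part.strip() for part in english.split("|")]
    if len(left) != len(right):
        # The | is expected on both sides; flag anything else
        # instead of silently guessing how the parts pair up.
        raise ValueError(f"unbalanced '|' in line: {line!r}")
    return [f"{g} :: {e}" for g, e in zip(left, right)]

for out in split_merged("1 2 3 | 4 5 6 :: 1 2 3 | 4 5 6"):
    print(out)
```

Lines without a | come back unchanged as a single-item list, so the
same function can be run over every line of the file.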

This is an intermediate step, however, and can be bypassed completely so
long as the end result is:
1 :: 1
2 :: 2
3 :: 3
4 :: 4
5 :: 5
6 :: 6
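
Doing both steps in one pass might look something like the Python
sketch below. It follows the pairing used in the dictionary example
(every German alternative against every English alternative within the
same | segment), and it assumes a {tag} at the end of a semicolon list
belongs to all the untagged words in that list. The names alternatives
and convert are invented for this sketch.

```python
import re
from itertools import product

def alternatives(segment):
    """Split one |-segment on ';', sharing a trailing {tag} if present."""
    terms = [t.strip() for t in segment.split(";") if t.strip()]
    tag = re.search(r"\{[^}]*\}$", terms[-1]) if terms else None
    if tag:
        # Assumption: untagged terms inherit the last term's tag.
        terms = [t if "{" in t else f"{t} {tag.group()}" for t in terms]
    return terms

def convert(lines):
    """Yield one 'german :: english' pair per output line."""
    for line in lines:
        if "::" not in line:
            continue  # skip blank lines and anything that is not an entry
        german, english = line.split("::", 1)
        # zip pairs the first German segment with the first English one,
        # and so on; it stops at the shorter side if the | counts differ.
        for g_seg, e_seg in zip(german.split("|"), english.split("|")):
            for g, e in product(alternatives(g_seg), alternatives(e_seg)):
                yield f"{g} :: {e}"

sample = [
    "Aalspieß {m} | Aalspieße {pl} :: eel spear | eel spears",
    "Aas {n}; Aasfleisch {n} :: carrion",
]
for out in convert(sample):
    print(out)
```

convert accepts any iterable of lines, so an open file object could be
passed in place of the sample list.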

I think that just about covers everything. If anyone would like a copy
of the full text-file, please just e-mail me and one will be provided.
As to why on earth I want this, I am writing a program that will
translate words on the fly from any application on the Linux desktop.
The result will pop up in a little bubble. That side of things is
working very nicely; I just need to populate my database of words now.
Whilst I can't offer much more than my sincerest gratitude and a pint or
two to anyone who manages to help me with this, you will certainly get a
mention in the credits and you'll get that warm fuzzy feeling knowing
you contributed to a free, open source project.

If anyone has actually read up to this point in the e-mail then I think
you deserve my thanks already! I really hope to get some sort of
response from this. Whilst I'm not in York at the moment, I will be in
September, and I promise the pints will come then! If any
clarifications are required, please ask :)



Dan Bishop