[SC.LUG] Converting text files
John Southern
john at sinoda.demon.co.uk
Sun Oct 23 20:03:38 BST 2005
On Friday 21 October 2005 12:03, Dave Briggs wrote:
> Does anyone know of a quick way to make Project Gutenberg text files
> actually readable?
>
> They always come with hard line breaks, meaning that if you want to
> decrease the font size and whatnot to make printing a little cheaper, the
> whole thing ends up looking utterly rubbish.
Downloading the HTML files would probably be the quickest way to achieve what
you want, as already stated.
However, let's try to write the Perl line.
We could remove all the newline characters from a file with
perl -p -e "s/\n/ /" <test.txt >testb.txt
This line uses test.txt as the input file, and the
perl -e
part tells it to evaluate a command rather than read in a Perl script, although
we could easily have written it as a script.
The -p makes perl print each line to the output after we have modified it. In
this case the output is the testb.txt file.
The actual code is a simple substitution regex to switch \n (the newline
character) for a space character.
In practice this is a little too simple. Where we have blank lines between
paragraphs we now just end up with double spaces.
OK, so a
perl -p -e "s/  /\n/g" <testb.txt >testc.txt
would add back these newline characters at the paragraph breaks (the g is
needed because the whole file is now one long line), although it will also
break the text anywhere it already contained two spaces. We are still left
with other problems such as words hyphenated across a line end.
I have tried to think of an easy way to handle this, but simply removing all
hyphens at the end of a line is not correct. Words split at a line end should
be hyphenated between syllables unless they are already hyphenated. Most
hyphenated words have been joined these days, and so this remains the realm of
double-barrelled names. Thus Jones-Smith, if split over a line end, should not
end up
Jones--
Smith
or even
Jones-
Smith
but more correctly
Jones
Smith
How you would correct this from a text file, I am not sure. Perhaps a search
through a dictionary file to look for hyphenated words would help.
A quick check gives me cross-sectional as still hyphenated, but crossroad as
now a merged word.
OK, is there a better way to do this type of filtering?
John