[SC.LUG] Converting text files

John Southern john at sinoda.demon.co.uk
Sun Oct 23 20:03:38 BST 2005

On Friday 21 October 2005 12:03, Dave Briggs wrote:
> Does anyone know of a quick way to make Project Gutenberg text files
> actually readable?
> They always come with hard line breaks, meaning that if you want to
> decrease the font size and whatnot to make printing a little cheaper, the
> whole thing ends up looking utterly rubbish.

As already stated, downloading the HTML files would probably be the quickest 
way to achieve what you want.

However, let's try to write the Perl one-liner.
We could remove all the newline characters from a file with

perl -p -e "s/\n/ /" <test.txt >testb.txt

This line uses test.txt as the input file. The

perl -e

part tells Perl to evaluate the code given on the command line rather than 
read in a Perl script, although we could just as easily have written it as a 
script.
The -p makes Perl loop over the input and print each line, after we have 
modified it, to the output: in this case the testb.txt file.
The actual code is a simple substitution regex to switch \n (the newline 
character) for a space character.
In practice this is a little too simple. Where we have blank lines between 
paragraphs we now just end up with double spaces.
OK, so a

perl -p -e "s/  /\n/g" <testb.txt >testc.txt

would add back these newline characters at the paragraph breaks (the g is 
needed because testb.txt is now one long line with several double spaces in 
it), but we are still left with other problems such as hyphenated words.
I have tried to think of an easy way to handle this, but simply removing all 
hyphens at the end of a line is not correct. Words split at a line end should 
be hyphenated between syllables unless they are already hyphenated. Most 
hyphenated words have been joined these days, and so this remains the realm of 
double-barrelled names. Thus Jones-Smith, if split over a line end, should not 
end up


or even


but more correctly


How you would correct this from a text file, I am not sure. Perhaps a search 
through a dictionary file to look for hyphenated words would help.
A quick check gives me cross-sectional as still hyphenated, but crossroad as 
now a merged word.

OK, is there a better way to do this type of filtering?

