[SC.LUG] Converting text files

Tue Oct 25 12:43:49 BST 2005

John Southern wrote:

>On Friday 21 October 2005 12:03, Dave Briggs wrote:
>  
>
>>Does anyone know of a quick way to make Project Gutenberg text files
>>actually readable?
>>
>>They always come with hard line breaks, meaning that if you want to
>>decrease the font size and whatnot to make printing a little cheaper, the
>>whole thing ends up looking utterly rubbish.
>>    
>>
>
>As already stated, downloading the HTML files would probably be the quickest
>way to achieve what you want.
>
>However, let's try to write the Perl one-liner.
>We could remove all the newline characters from a file with
>
>perl -p -e "s/\n/ /" <test.txt >testb.txt
>
>This line uses test.txt as the input file, and the
>
>perl -e
>
>part tells perl to evaluate the given code rather than read in a script file,
>although we could just as easily have written it as a script.
>The -p switch loops over the input and prints each line after we have
>modified it. Here the output is redirected to the testb.txt file.
>The actual code is a simple regex substitution that swaps \n (the newline
>character) for a space character.
>In practice this is a little too simple. Where we have blank lines between
>paragraphs, we now just end up with double spaces.
>OK, so a perl -p -e "s/  /\n/g" <testb.txt >testc.txt
>would add back the newline characters at the paragraph breaks (the /g is
>needed because the whole text is now one long line), but we are
>still left with other problems such as hyphenated words.
>I have tried to think of an easy way to handle this, but simply removing all
>hyphens at the end of a line is not correct. Words split at a line end should
>be hyphenated between syllables unless they are already hyphenated. Most
>hyphenated words have been joined these days, so this remains the realm of
>double-barrelled names. Thus Jones-Smith, if split over a line end, should not
>end up
>
>Jones--
>Smith
>
>or even
>
>Jones-
>Smith
>
>but more correctly
>
>Jones
>Smith
>
>How you would correct this from a text file, I am not sure. Perhaps a search 
>through a dictionary file to look for hyphenated words would help.
>A quick check gives me cross-sectional as still hyphenated, but crossroad as 
>now a merged word.
>
>OK, is there a better way to do this type of filtering?
>
>John
>
I got as far as trying to check whether the previous character was also a
line break and, if not, removing the newline. I then got so confused that I
did not get any further.
I was attempting to do:
perl -pi -e 's/(?<!\n)\n/ /g' <rel.txt  >new.rel.txt

Geoff