[Sussex] Updated Grep, Sed and RegExp links from August moot

Dominic Humphries linux at oneandoneis2.org
Thu Sep 15 09:25:46 UTC 2011


Wow.. that's some serious grep & sed usage!

I'm impressed that you've managed to get so much functionality out of
sed, in particular.

One thing I would say, though, is that if this is the kind of thing
you're doing often, you would probably find Perl both easy to use (if
you can do regexes, you're more than halfway to being good at Perl
anyway) and immensely helpful - Perl was basically created for this kind
of text processing.

For example, your first sed example,
  sed -e '/^[[:space:]]*\- *[0-9]\{1,3\} *\-$/d' test_file_00.txt > test_file_01.txt

You could replace that with the Perl script below, which you would run as, e.g.:
  ./de-page-number filename.txt
And it would create filename.txt.out with the de-paginated output.

I'll grant you it would be more typing, but it's a *lot* more readable &
re-usable :)

I've been thinking about giving a talk to SLUG about Perl at some point;
let me know if it'd be of enough interest that I should stop thinking
and start planning :o)

-------------------------------------------------------------------
#!/usr/bin/perl
use strict;
use warnings;

# The first argument passed at execution is the filename
my $filename = $ARGV[0];
die "Usage: $0 filename\n" unless defined $filename;

# Only continue if we can open the specified file for reading
open (my $file, '<', $filename) or die "Cannot read $filename: $!";
# Also don't continue unless we can write to an output file
open (my $output, '>', "$filename.out") or die "Cannot write to $filename.out: $!";

# Go through each line & only write it to output if
# it doesn't look like a page number:
# A number one to three digits long, with hyphens either side
# and any amount of spacing
while (my $line = <$file>){
    print $output $line unless ($line =~ m/^\s*\-\s*[0-9]{1,3}\s*\-\s*$/);
}

# Close the input & output files - not really needed here
# but it's a good habit to be in :)
close $file;
close $output;
-------------------------------------------------------------------
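
To see which lines the pattern in the script actually catches, here is a
minimal sketch that wraps the same regex in a subroutine and runs it over
a few sample lines. The name is_page_number and the sample lines are made
up for illustration; they are not taken from the real text or from the
analysis file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the line looks like a page number
# (one to three digits with a hyphen either side, any amount of spacing),
# otherwise 0. Same pattern as the script above, minus the unneeded
# backslashes before the hyphens.
sub is_page_number {
    my ($line) = @_;
    return $line =~ m/^\s*-\s*[0-9]{1,3}\s*-\s*$/ ? 1 : 0;
}

# Invented sample lines, purely for illustration
my @samples = (
    "   - 12 -\n",                              # indented page number
    "-7-\n",                                    # no spacing at all
    "- 1234 -\n",                               # four digits: too long
    "A line of text - 12 - with more text\n",   # number inside a sentence
);

for my $line (@samples) {
    my $verdict = is_page_number($line) ? 'drop' : 'keep';
    print "$verdict: $line";
}
```

The first two samples should be reported as "drop" and the last two as
"keep". Note that \s also matches tabs and trailing whitespace, so this
pattern is slightly more forgiving than the sed original, which only
allows spaces between the hyphens and the digits.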

On Wed, 2011-09-14 at 23:28 +0100, Fay Zee wrote:
> Hi All, this is aimed at those who were at the August moot.
> 
> 
> I've updated my analysis file quite a bit and written up the
> instructions with notes so you can step through it again whenever you
> decide.
> 
> I've put it up on the East Grinstead site for download. All the same
> tutorial and cheat sheet links are there at the bottom of the page
> with notes on how I prepared the practice file, and I've retested the
> three commands we focused on.
> 
> The text I chose is particularly challenging so provides for more
> interesting experimentation. That text, in 1928, would have been typed
> up by hand (more than 1000 pages) and then the printing plates for
> each of the 16 volumes within it would have been hand set. This would
> partly explain why the paragraph indents vary from section to section,
> ranging from 2 to 6 characters. There are also a great many bordered
> quotes interspersed, which end up as spaced out character strings
> floating around haphazardly in the plain text version.
> 
> Among other manipulations remaining to be done are the elimination of
> extra white space within paragraphs, and here again, the varying
> indents limit the accuracy of any one expression - unless you take the
> text one section at a time. There are options which let you define
> line numbers but that will have to be left as a further challenge. We
> didn't even touch on stored scripts.
> 
> Anyway, here's the link:
> http://www.eglug.org.uk/bash_and_regexp_example_analysis.html
> 
> I enjoyed the exercise and I still refer back to the analysis when
> crafting additional expressions. Writing it all up prior to the moot
> was painstaking as was revisiting it afterwards, but well worth it as
> it cemented my understanding. Running (successful) sed commands one
> after the other and seeing the results is like magic :-)
> 
> Let me know if you give it a go, and please post your feedback.
> 
> 
> 
> Best Regards,
> Fay
> East Grinstead Linux User Group
> www.eglug.org.uk
> 
> --
> Sussex mailing list
> Sussex at mailman.lug.org.uk
> E-mail Address: sussex at mailman.lug.org.uk
> Sussex LUG Website: http://www.sussex.lug.org.uk/
> https://mailman.lug.org.uk/mailman/listinfo/sussex




