[Sussex] Parsing a Logfile with Perl....

Wed Mar 12 11:30:13 UTC 2008

Richie

On Wed, 2008-03-12 at 10:32 +0000, Richie Jarvis wrote:
> I have written a little perl script to read a logfile, and parse certain 
> values for matching lines into a csv file.  It works great - until I 
> tried it on one of our systems and discovered that a colleague had put a 
> '-' character into one of the usernames I am parsing.  After lots of 
> cursing, I am stuck on this one, and wonder if anyone can see how to 
> adjust my regex to span the situation where usernames with and without 
> funny characters can be encompassed?
> 
> Here is an example line from a well-formatted line:
> 
> 2007-05-31 15:21:13 Sent SMS [SMSC:mbloxpsmsca] [SVC:fusion] [ACT:] 
> [BINF:] [from:62569] [to:16474075000] [flags:-1:1:-1:-1:-1] 
> [msg:100:01062F1F2DB69181923945413141363634383631323734414246333536363635343442423438464444353732303745433300030B6A00C54601C60001550187360603773700018707060354454D502D7B31363437343037353030307D0001873806034375] 
> [udh:12:0B05040B8423F00003210401]
> 
> Here is one from the badly-formatted line:
> 
> 2008-03-07 05:09:54 Sent SMS [SMSC:mbloxpsmsca] [SVC:hpit-ems] [ACT:] 
> [BINF:] [from:62569] [to:+16475882516] [flags:-1:0:-1:-1:-1] 
> [msg:143://SS Please download mProveDM 
> https://fusiondm-itg.houston.hp.com:443/fusiondl/EMA.cab?D=a5619ITGITG13A0E4B216685DBD31C50B0C9E6F91F2N8AB384A870] 
> [udh:0:]
> 
> My script spits out the following output for these:
> 
> Good: 2007-05-31,15:21:12,fusion,62569,216.154.251.59,16474075000
> Bad: 2008-03-07,05:09:53,hpit,ems,62569> (15.243.169,to
> 
> Currently, I have the rather ungainly regex as follows:
> 
> $_ =~ 
> /^(\d+-\d+-\d+)\D+(\d+\D+\d+\D+\d+)\D+\w+\W+\w+\W+\w+\W+\w+\W+\w+\W+(\w+)\W+(\w+)\W+(\w+\W+\w+\W+\w+\W+\w+)\W+\w+\W+(\w+)\W+/;
> 
> I am sure there is a better way to do this - i.e. search for the string 
> [SVC: and gobble everything up to the ], but being a bit of a newbie to 
> regex, I am googling wildly, and not getting much inspiration. 
> 
> Does anyone have any pointers?

I'm not a perl coder by any means, so be warned that my knowledge of perl
regexps is zero.
Having said that, I've used regexps for years and know a trick or two.

For the lines you've shown, the easy way to "gobble up" all characters until
you hit a *unique* pattern is to prefix the pattern with the unlimited
wildcard '.*' ('.' matches any character, '*' repeats it any number of
times).  For example, to match up to the string "to:":

	.*to:
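
To see what that prefix does on its own, here is a minimal sed sketch.  The
function name strip_to_marker is made up for this example, and the input is
a cut-down piece of your log line rather than the real thing.

```shell
# strip_to_marker: delete everything up to and including "to:".
# (Hypothetical helper name, for illustration only.)
strip_to_marker() {
    sed -e 's/.*to://'
}

echo "[from:62569] [to:16474075000]" | strip_to_marker
# prints: 16474075000]
```

The trailing ']' is still there, which is why the match needs a way to stop
at the delimiter.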

Then use an inverted range to match up to the delimiter.  A range is given
inside square brackets, and if the first character is a '^' the range is
inverted.  So to match any character except the closing ']' would be:

	[^]]

Follow it with a '*', as in '[^]]*', to match everything up to the ']'.

Note: To match a ']' in a range, the ']' must be the first character
in the range (after the '^', if there is one).
(IIRC - it works in sed, so I must have.)
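
As a quick check of the inverted range by itself, something like the
following should do.  It is only a sketch: it assumes a grep that supports
the -o option (GNU and BSD grep do, but it is not in POSIX), and the name
svc_field is just invented here.

```shell
# svc_field: print the "SVC:..." text found in each line on stdin.
# Hypothetical helper name; relies on grep -o (a GNU/BSD extension).
svc_field() {
    # [^]]* = any run of characters that are not ']'
    grep -o 'SVC:[^]]*'
}

echo "Sent SMS [SMSC:mbloxpsmsca] [SVC:hpit-ems] [ACT:]" | svc_field
# prints: SVC:hpit-ems
```

The '-' in the username is no problem because the range accepts anything
that is not a ']'.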

Using sed rather than perl, I tested your data thus:

	$ echo "2007-05-31 15:21:13 Sent SMS [SMSC:mbloxpsmsca] "\
	> "[SVC:hpit-ems] [ACT:] [BINF:] [from:62569] "\
	> "[to:+16474075000] [flags:-1:1:-1:-1:-1] "\
	> "[msg:100:01062F1F2DB691819239] "\
	> "[udh:12:0B05040B8423F00003210401]" | \
	> sed -e 's/\([0-9\-]*\) \([0-9:]*\).*SVC:'\
	> '\([^]]*\).*from:\([^]]*\).*to:'\
	> '\([^]]*\).*/\1,\2,\3,\4,\5/'
	2007-05-31,15:21:13,hpit-ems,62569,+16474075000

I assume that the IP addr is embedded in the data somewhere and that you
work it out in a part of your perl script that you didn't publish.
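
If you want to run the same expression over a whole logfile, a sketch along
these lines might be a starting point.  The function name sms_to_csv is my
invention, the sample lines are shortened versions of yours, and the -n
option with the trailing 'p' is there so that lines which do not match are
dropped rather than printed unchanged.

```shell
# sms_to_csv: read log lines on stdin, write date,time,service,from,to.
# Hypothetical helper; -n plus the 'p' flag prints only lines that matched.
sms_to_csv() {
    sed -n -e 's/^\([0-9-]*\) \([0-9:]*\).*SVC:\([^]]*\).*from:\([^]]*\).*to:\([^]]*\).*/\1,\2,\3,\4,\5/p'
}

printf '%s\n' \
    '2007-05-31 15:21:13 Sent SMS [SMSC:mbloxpsmsca] [SVC:fusion] [ACT:] [BINF:] [from:62569] [to:16474075000] [flags:-1:1:-1:-1:-1]' \
    'some unrelated line' \
    '2008-03-07 05:09:54 Sent SMS [SMSC:mbloxpsmsca] [SVC:hpit-ems] [ACT:] [BINF:] [from:62569] [to:+16475882516] [flags:-1:0:-1:-1:-1]' |
    sms_to_csv
# prints:
# 2007-05-31,15:21:13,fusion,62569,16474075000
# 2008-03-07,05:09:54,hpit-ems,62569,+16475882516
```

One caveat: '.*' is greedy, so if the message text itself ever contained
"from:" or "to:" the match would latch on to the last occurrence.  That is
why the marker needs to be unique in the line.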

Note: In sed, the sequence '\' '(' <pattern> '\' ')' stores the text that
is matched by <pattern> in a numbered buffer that can be extracted by the
sequence '\' <buf-no>, where <buf-no> is 1, 2, 3, ... up to 9, counting
the groups from left to right.
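
A tiny illustration of the numbered buffers, again only as a sketch with a
made-up function name (swap_fields) and a simplified input:

```shell
# swap_fields: capture the two numbers and print them in reverse order.
# (Hypothetical helper name, for illustration only.)
swap_fields() {
    # \1 holds the digits after "from:", \2 the digits after "to:"
    sed -e 's/from:\([0-9]*\) to:\([0-9]*\)/\2,\1/'
}

echo "from:62569 to:16474075000" | swap_fields
# prints: 16474075000,62569
```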

Hope this helps.  If not, you know my number, so call.
Steve
-- 
Steve Dobson

Wait ... is this a FUN THING or the END of LIFE in Petticoat Junction??
