[Gllug] Apache log files

general_email at technicalbloke.com
Wed Apr 8 03:53:48 UTC 2009


John Hearns wrote:
> 2009/4/7 william pink <will.pink at gmail.com>:
>   
>> Hi,
>>
>> I have the rather horrible task of splitting up lots (40GB's worth) of
>> Apache log files by date. The last time I did this I found the line
>> number, then tailed the file from there and wrote the output to a new
>> file, which was a long, arduous task. I imagine this can be done in a few
>> minutes with some Regex/Sed/Awk/Bash trickery, but I wouldn't know where
>> to start - can anyone give me any pointers to get started?
>>     
>
> I would think of Perl for this task - that's what the language is good at.
>
> However, I do sometimes have the problem of a certain application
> producing huge output files.
> I deal with this using the 'csplit' utility.
> Man csplit, and think hard about the regexp PATTERN which will match
> your date ranges.
>   
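For what it's worth, a minimal csplit sketch along those lines might look
like this (the sample log and filenames are invented; you'd adapt the date
pattern to your real boundaries):

```shell
# Invented three-line sample spanning two dates (access.log is made up).
cat > access.log <<'EOF'
1.2.3.4 - - [07/Apr/2009:10:00:00 +0000] "GET / HTTP/1.1" 200 123
1.2.3.4 - - [07/Apr/2009:11:00:00 +0000] "GET /a HTTP/1.1" 200 45
1.2.3.4 - - [08/Apr/2009:09:00:00 +0000] "GET /b HTTP/1.1" 200 67
EOF

# Split at the first line carrying the new date: xx00 gets the 07/Apr
# lines, xx01 gets everything from 08/Apr onwards.  -s suppresses the
# byte counts csplit normally prints.
csplit -s access.log '/08\/Apr\/2009/'
```

csplit names the pieces xx00, xx01, and so on; one /PATTERN/ per date
boundary (or a repeat count like '{*}') controls how many splits are made.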

This Python regex should match a line from an Apache log file and split
it into its constituents...

import re
import sys

# IP address, then two fields of up to 20 characters (ident and user),
# then [date:time zone] and the opening quote of the request.  The raw
# string is split across source lines with implicit concatenation so the
# pattern itself is unbroken.
rexp = re.compile(r'(\d{1,3}[.]\d{1,3}[.]\d{1,3}[.]\d{1,3}) '
                  r'.{0,20} .{0,20} \[(\d{2}/.../\d{4}):'
                  r'(\d{2}:\d{2}:\d{2}) (.....)\] "')

for line in open(sys.argv[1], 'r'):
    m = rexp.match(line.rstrip())
    if m:
        print m.group(2), m.group(3), m.group(4)


Group 2 is the date, group 3 the time, and group 4 the timezone.

Maybe this could serve as a useful base for your filtering/splitting.
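If it helps, one way to grow that into an actual splitter is to open one
output file per day as new dates appear, so the whole 40GB streams through
in a single pass (the access-*.log naming here is just an illustration):

```python
import re

# Same pattern as above, with the date captured in group 2.
rexp = re.compile(r'(\d{1,3}[.]\d{1,3}[.]\d{1,3}[.]\d{1,3}) '
                  r'.{0,20} .{0,20} \[(\d{2}/.../\d{4}):'
                  r'(\d{2}:\d{2}:\d{2}) (.....)\] "')

def split_by_date(infile):
    # One handle per date seen, opened lazily, so the log is read once.
    handles = {}
    for line in open(infile):
        m = rexp.match(line)
        if m:
            date = m.group(2).replace('/', '-')   # e.g. 08-Apr-2009
            if date not in handles:
                handles[date] = open('access-%s.log' % date, 'w')
            handles[date].write(line)
    for h in handles.values():
        h.close()
```

Lines that don't match the pattern are silently dropped here; you might
prefer to shunt them into a leftovers file instead so nothing is lost.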

Regards,


Roger.
-- 
Gllug mailing list  -  Gllug at gllug.org.uk
http://lists.gllug.org.uk/mailman/listinfo/gllug



