Gossamer Forum

Parsing Access Log

I have a script that parses the standard Apache access log file, looking for 401 errors and for users who have logged in correctly. However, my server uses the combined log format. How can I change this code to parse that log format? Regexps are still my weak point:

Code:
while (<>) {
chomp;
# Break apart the Apache/NCSA-style access log entry
# regexp assumes there is no "[" in the username.
next unless s/^(\S*) - ([^\[]+) \[.*?\] ".*" (\d+) \S*$//;

Thanks in advance.

P.S., an explanation of the regexp would also be appreciated.

[This message has been edited by Bobsie (edited April 14, 1999).]
Re: Parsing Access Log
I'm not sure why this is an s///; you probably just want an m//. It looks like the code only needs to match, so there's no reason to do a substitution.
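As a rough illustration of that difference, here is a minimal sketch. The sample log line is made up, and the pattern is the one from the question with the timestamp part written as \[.*?\]. With m// the string is left alone; with s/// and an empty replacement the matched text is also deleted.

```perl
use strict;
use warnings;

# Hypothetical common-format log line, for illustration only.
my $line = '192.0.2.1 - bobsie [14/Apr/1999:10:00:00 -0500] "GET /private/ HTTP/1.0" 401 345';

# The pattern from the question, compiled once so both operators can share it.
my $re = qr/^(\S*) - ([^\[]+) \[.*?\] ".*" (\d+) \S*$/;

# m// only tests the string; in list context it returns the captured groups.
my $copy = $line;
my ($host, $user, $status) = $copy =~ $re;
print "m//: host=$host user=$user status=$status\n";
print "m//: string unchanged\n" if $copy eq $line;

# s/// with an empty replacement captures the same groups,
# but also removes the matched text from the string.
(my $stripped = $line) =~ s/$re//;
print "s///: remaining text is '$stripped'\n";
```

Because the pattern is anchored at both ends, the substitution leaves nothing behind here, which is why it can look as if it only matches.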

The only difference between the standard and combined formats is that combined has two extra fields at the end (the referer and the browser), each enclosed in quotes and separated by a space. So an easy fix is to just remove the end-of-line marker ($) from the regexp. You might also have to change ".*" to ".*?". See comments below:

Code:
s/^         start at beginning of line
(\S*)       match 0 or more non-space characters and store in $1
-           match a dash (the spaces around it in the pattern match literally)
([^\[]+)    match everything up to an opening [ and store in $2
\[.*?\]     match [something]. The .*? means zero or more of any character,
            non-greedy (i.e. stop as soon as you have a match)
".*"        match "something". You should probably change this to ".*?" with
            combined log format, as otherwise you might match the referer/browser.
(\d+)       match one or more digits and store in $3
\S*         match 0 or more non-space characters
$           match end of line
//;         replace the total match with an empty string
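Putting those pieces together, a combined-format version might look something like the sketch below. It is only an illustration: the sub name parse_combined and the sample line are invented here, and it uses m// with the /x flag so the pattern can carry its own comments. Under /x a literal space has to be written as "\ ".

```perl
use strict;
use warnings;

# Sketch only: parse_combined() is a made-up helper, not part of the original script.
sub parse_combined {
    my ($line) = @_;
    return unless $line =~ m{
        ^(\S*)          # capture 1: remote host
        \ -\            # literal space, dash, space
        ([^\[]+)        # capture 2: user name, up to the space before the bracket
        \ \[.*?\]       # bracketed timestamp, non-greedy
        \ ".*?"         # quoted request line, non-greedy
        \ (\d+)         # capture 3: HTTP status code
        \ \S*           # bytes sent
                        # no end-of-line anchor, so referer and browser may follow
    }x;
    return ($1, $2, $3);
}

# Hypothetical combined-format line, for illustration only.
my $sample = '192.0.2.1 - bobsie [14/Apr/1999:10:00:00 -0500] '
           . '"GET /private/ HTTP/1.0" 401 345 '
           . '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I)"';

if (my ($host, $user, $status) = parse_combined($sample)) {
    print "$user from $host got status $status\n";
}
```

The sub returns an empty list when a line does not match, so the caller can skip such lines much as the original "next unless" does.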

Hope that helps,

Alex
Re: Parsing Access Log
Thanks Alex. It worked beautifully and I really do appreciate the explanation of the regexp... slowly I am learning how to interpret those things and your explanation really helps me understand what I am looking at.