chl - colour HTTP logfiles
Webserver logfiles are another logfile format. There is something like the "common logfile format" which is the original format of NCSA's webserver (from more than 10 years ago).
Common logfile format
The common logfile format is
| Pos | Data | Comment |
| 1 | client | the client's IP number. |
| 2 | ident-user | the username from an ident lookup (this is usually not used). |
| 3 | username | if the request was authenticated, the username is shown here. |
| 4 | date and time | these values are given in angle bracket. |
| 5 | request method | the HTTP request method. |
| 6 | request | the client's request. |
| 7 | HTTP protocol | the HTTP protocol for the request. |
| 8 | HTTP status | the HTTP status code sent from the server to the client |
| 9 | response size | number of bytes the server sent to the client |
and that's it. Sounds simple but is actually a little bit difficult. There are angle bracket around the date, but a space inside, the triple request method, request string and protocol is enclosed in double quotes.
But before we go into the details there are some more fields in the logfile I use. The logfile format is configurable under apache and the format I use is known as "combined" format. Now the additional fields are
| Pos | Data | Comment |
| 10 | referer | the client's referer URL, enclosed in double quotes. |
| 11 | user-agent | the client's user agent, enclosed in double quotes. |
A short extract from my server's logfile is
1 192.168.1.xx - - [01/Mar/2007:19:55:57 +0100] "GET /cgi-bin/wiki.cgi/syslog/00-WikiIndex?op=info HTTP/1.0" 200 11304 "http://192.168.1.12/cgi-bin/wiki.cgi/syslog/00-WikiIndex?op=edit&target=left" "Opera9.xx (Windows NT 2.7; U; de)" 2 192.168.1.xx - - [01/Mar/2007:19:56:18 +0100] "GET /cgi-bin/wiki.cgi/?op=open&ph=1&pa=syslog%2F&s=ColouringWeblog&q1=Open&fs=-&m=- HTTP/1.0" 302 123 "http://192.168.1.12/cgi-bin/wiki.cgi/syslog/00-WikiIndex?op=info" "Opera9.xx (Windows NT 2.7; U; de)" 3 192.168.1.xx - - [01/Mar/2007:19:56:18 +0100] "GET /cgi-bin/wiki.cgi/syslog/ColouringWeblog HTTP/1.0" - 175 "http://192.168.1.12/cgi-bin/wiki.cgi/syslog/00-WikiIndex?op=info" "Opera9.xx (Windows NT 2.7; U; de)" 4 192.168.1.xx - - [01/Mar/2007:19:56:29 +0100] "GET /cgi-bin/wiki.cgi/?op=open&ph=1&pa=syslog%2F&s=ColouringWeblog&q1=Open&edit=yes&fs=-&m=- HTTP/1.0" 302 139 "http://192.168.1.12/cgi-bin/wiki.cgi/syslog/00-WikiIndex?op=info" "Opera9.xx (Windows NT 2.7; U; de)" 5 192.168.1.xx - - [01/Mar/2007:19:56:29 +0100] "GET /cgi-bin/wiki.cgi/syslog/ColouringWeblog?op=edit HTTP/1.0" 200 7209 "http://192.168.1.12/cgi-bin/wiki.cgi/syslog/00-WikiIndex?op=info" "Opera9.xx (Windows NT 2.7; U; de)" 6 192.168.1.xx - - [01/Mar/2007:19:56:36 +0100] "GET /cgi-bin/wiki.cgi/syslog/csl HTTP/1.0" 200 3000 "http://192.168.1.12/cgi-bin/wiki.cgi/syslog/00-WikiIndex" "Opera9.xx (Windows NT 2.7; U; de)" 7 192.168.1.xx - - [01/Mar/2007:20:07:13 +0100] "POST /cgi-bin/wiki.cgi/syslog/ColouringWeblog HTTP/1.0" 401 479 "http://192.168.1.12/cgi-bin/wiki.cgi/syslog/ColouringWeblog?op=edit" "Opera9.xx (Windows NT 2.7; U; de)" 8 192.168.1.xx - dag [01/Mar/2007:20:07:13 +0100] "POST /cgi-bin/wiki.cgi/syslog/ColouringWeblog HTTP/1.0" 302 123 "http://192.168.1.12/cgi-bin/wiki.cgi/syslog/ColouringWeblog?op=edit" "Opera9.xx (Windows NT 2.7; U; de)" 9 192.168.1.xx - - [01/Mar/2007:20:07:13 +0100] "GET /cgi-bin/wiki.cgi/syslog/ColouringWeblog HTTP/1.0" 200 7689 "http://192.168.1.12/cgi-bin/wiki.cgi/syslog/ColouringWeblog?op=edit" "Opera9.xx (Windows NT 2.7; U; de)" 10 192.168.1.xx - - [01/Mar/2007:20:08:11 +0100] "GET /cgi-bin/wiki.cgi/syslog/ColouringWeblog?op=edit&target=left HTTP/1.0" 200 7669 "http://192.168.1.12/cgi-bin/wiki.cgi/syslog/ColouringWeblog" "Opera9.xx (Windows NT 2.7; U; de)"
As you can see, request #7 and #9 are unauthenticated, but #8 is. It is also a POST request. Since POST is used to modify the server's content (the wiki) it requires authentication. Request #7, the first attempt to POST some data, was rejected by the sevrer with a 401 response. See http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html for the definition of the various HTTP status codes.
The script's main part
Let's look at the chl script. As csl it has one block that processes each logfile line.
1
2 /./ {
3 # Read the data fields ...
4
5 client = $1;
6 ident = $2;
7 user = $3;
8 date = $4 " " $5;
9
10 method = substr($6, 2);
11 request = $7;
12 proto = substr($8, 1, length($8) - 1);
13 status = $9;
14 bytes = $10
15 referer = $11;
16
17 gsub(/(^"|"$)/, "", referer);
18 if (match($0, /"([^"]*)"$/, x) == 0)
19 agent = "";
20 else
21 agent = x[1];
22
23
24 refhost = (match(referer, /^[a-z]+:\/\/([^/]+)\//, x) > 0)? x[1]: "";
25
26
27 # ... color the interesting parts ...
28
29 if (status ~ /^5/)
30 cstatus = __red status __normal;
31 else if (status ~ /^4/)
32 cstatus = __red __bold status __normal;
33 else
34 cstatus = status;
35
36 cref = match(refhost, refmarker)? __c6 bold referer __normal: referer;
37 cuser = (user != "-")? __blue user __normal: user;
38
39 crequest = parserequest(request, q);
40
41 # ... and print the line.
42 if (issub == 0 && expr == "")
43 printline();
44 }
45
Since the scheme is basically the same as before in short:
- lines 5 - 15 extract the data fields from the line,
- lines 17 - 21 remove the double quotes from the referer and user agent,
- line 24 extracts the hostname from the referer,
- line 29 - 34 color the status code depending on the status group
- lines 36 and 37 colour the referer if the referer host matches the
refmarkerpattern and the username if it is set, - line 39 parses the client request, and
- lines 41 - 43 print the coloured logline.
The parserequest() and printline() functions are tailored for the HTTP logfile format of course.
1
2 function parserequest(req, arr, p, r, var, val, x) {
3 delete arr;
4 if (match(req, /^(.*)\?(.*)$/, x) == 0) {
5 p = __yellow req __normal;
6 return (p); # no QUERY_STRING
7 }
8
9 p = __yellow x[1] __normal "?";
10 req = x[2];
11 while (match(req, /^([^=&]*)=([^&]*)(&)?/, x) > 0) {
12 p = p __grey x[1] __normal "=" __yellow __bold x[2]__normal x[3];
13 arr[x[1]] = x[2];
14 req = substr(req, RSTART+RLENGTH);
15 }
16
17 p = p req;
18 return (p);
19 }
20
21 function printline() {
22 print client, ident, cuser, date, "\"" method, crequest, proto "\"", cstatus, bytes, \
23 "\"" cref "\"", "\"" agent "\"" __normal;
24 }
25
When you have verified the explanation, you'll notice the conditional printline() call. Before we look at this call chl:
# chl </path/to/http/logfile | head -10
Make sure that you feed the logfile with IO redirection, not as command line argument.
You should see the first 10 lines from your webserver logfile, coloured depending on their contents. Now when you tail -f the webserver logfile to watch it and you have a lot of icons and pictures on you server, the tail will just rush across the screen. Every request creates one entry in the logfile.
Now try
# tail -f /path/to/http/logfile | chl 'status >= 400'
this should reduce the output lines to only those that have a HTTP status code above 400 (request errors and server errors).
sh: line 1: chl: command not found", call chl (for some reason we'll find below) the following way
# tail -f /path/to/http/logfile |
PATH=.:$PATH chl 'status >= 400'
So obviously chl is able to take an expression from the command line. How is this working?
The BEGIN block
When chl is called the BEGIN block is executed first. It begins with the usual stuff and colour value initialisation (lines 3 - 16). Line 18 sets the refmarker pattern we already met above in the script's main block, line 36. Hosts in the referer field that match the refmarker pattern give the referer field a special colour in the output.
1
2 BEGIN {
3 program = "chl";
4
5 STDERR = "/dev/stderr";
6 IGNORECASE = 1;
7
8 __bold = "\033[01m";
9 __red = "\033[31m";
10 __green = "\033[32m";
11 __yellow = "\033[33m";
12 __blue = "\033[34m";
13 __c5 = "\033[35m";
14 __c6 = "\033[36m";
15 __grey = "\033[37m";
16 __normal = "\033[0m";
17
18 refmarker = "(freshmeat.net|\\.google\\.[a-z]+)$";
19 argi = 1;
20 issub = debug = 0;
21 while (argi < ARGC && substr(ARGV[argi], 1, 1) == "-") {
22 options = nextarg("option");
23 if (options == "--")
24 break;
25
26 for (i = 2; i<=length(options); i++) {
27 c = substr(options, i, 1);
28 if (c == "0")
29 issub = 1;
30 else if (c == "d")
31 debug = 1;
32 else if (c == "q")
33 noprint = 1;
34 else {
35 printf ("%s: unkown option: -%s\n", program, c)
36 exit (1);
37 }
38 }
39 }
40
41 if (issub == 0) {
42 if (argi < ARGC)
43 expr = nextarg("expression");
44
45 if (expr == ".")
46 expr = "";
47 }
48
49 if (issub == 0 && expr != "") {
50 cmd = program;
51
52 p = namesubst(expr, "q");
53 if (noprint == 0)
54 p = p " { printline(); }";
55
56 cmd = cmd " --source " quote(p);
57
58 # Append flag that to prevent endless recursion.
59 cmd = cmd " -- -0";
60
61 # Append any command line arguments we have.
62 while (argi < ARGC)
63 cmd = cmd " " nextarg("command line argument");
64
65 if (debug != 0) {
66 printf ("cmd= %s\n", cmd);
67 exit (0);
68 }
69
70 rc = system(cmd);
71 exit (rc);
72 }
73 }
74
Lines 19 - 39 process any command line options, lines 41 - 47 read (if "issub == 0") the logline select expression.
In line 49 the interesting part begins. If a select expression is given, the script construct a shell command line in lines 50 - 56:
- the program to run is the script itself (line 50),
- calling namesubst() replaces patterns of the form
$namewithq[name], where theqarray is constructed by the parserequest() function; it holds the values from the request's QUERY_STRING part, - appends a call to printline() - if the given expression matches (line 54), and
- and adds the result to the command line as awk source code (line 56).
Line 59 is a safety option to prevent that the chl script we are going to invoke, starts another script, lines 62 and 63 append all arguments from the current command line to the new and line 70 finally calls the script. This script reinvocation is also the reason for the PATH=.:$PATH command prefix above. If you have chl in a directory which is not in your PATH, the script can not be found when it is called the second time.
If you want to see the constructed command without actually running it, put the -d option on the command line:
# tail -f /path/to/http/logfile |
PATH=.:$PATH chl -d 'status >= 400'
The interesting (and yet missing) functions are
1
2 function quote(string) {
3 gsub(/'/, "'\\''", string);
4 string = "'" string "'";
5 return (string);
6 }
7
8 function namesubst(string, arr, p, x) {
9 while (match(string, /\$([a-z]+)/, x) > 0) {
10 p = p substr(string, 1, RSTART-1) arr "[\"" x[1] "\"]";
11 string = substr(string, RSTART + RLENGTH);
12 }
13
14 p = p string;
15 return (p);
16 }
17
Invocation
chl [options] expression [logfile ...]
colour HTTP server logfile lines in apache's combined logfile format. expression can be any valid awk expression to select the desired lines.
Command line options
- -d
-
debug option: print the constructed chl command but do not execute.
- -q
-
quiet option: do not append the call to printline() automatically. A
printstatement is then expected to be part of the expression.
Notice that if expression is empty, chl will not invoke a second script but process the line direct. Since expression is mandantory you may put a single dot as expression which is taken as empty expression.