21 Tips for Parsing Log Files

 

At the dawn of human history, the first sysadmins wandered the primeval African savanna. Theirs was a nomadic life, dominated by the seasonal migrations of wild servers. By day these proud superusers stalked the vast herds of cloud services with keen spears tipped with scripts. Swift of leg and keen of eye they were, descended from ancestors who stole the secret of remote shell access from the sun itself.

As night fell these noble savages would retreat to primitive coffee shops, where they lit fires, sipped artisan coffees and sang the songs of log analysis to hold back the encroaching night. Shamans beat their drums there beneath the bright stars, and chanted of AWK and sed, of perl and grep and regex.

Little has changed since then.

In this article I give some basic (and not so basic) tips to help you parse the Apache access.log file.

Apache can log every visitor and request in the access.log file. The location and name of the log file vary by *nix platform. The access.log format I use in this article is:

LogFormat "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" combined

1. %h – IP address of the client.

2. %l – Remote logname. Disabled unless IdentityCheck On is set; a dash when disabled.

3. %u – Remote username. As above.

4. %t – Time the request was received, in [day/month/year:hour:minute:second zone] format.

5. %r – First line of the request: the method (GET, POST, etc.), the requested resource, and the protocol.

6. %>s – Final HTTP status code.

7. %O – Number of bytes sent, including headers.

8. %{Referer}i – The referring URL: the page that linked to, or embedded, the requested resource.

9. %{User-Agent}i – The visitor's user agent string.
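
To keep these straight, here is a made-up combined-format entry with each field mapped back to its directive (the IP, path and user agent are purely illustrative):

# %h          %l %u %t                           "%r"                        %>s %O    "%{Referer}i" "%{User-Agent}i"
203.0.113.5   -  -  [07/Jul/2015:20:00:00 +0000] "GET /archives/98 HTTP/1.1" 200 13906 "-"           "ExampleBot/1.0 (+http://example.com/bot)"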

1. See the most recent entries


The access.log file is far too big to read at a glance. My own website doesn't receive more than 100 hits on a given day, and yet:

$ du -hs access.log
409M    access.log
$ wc -l access.log
2304883 access.log

 

head and tail are two great commands to start with: head outputs the first n lines of a text file, and tail the last n:

 

$ tail -n 50 access.log
23.20.9.23 - - [06/Jul/2015:11:21:48 +0000] "GET /archives/2374 HTTP/1.0" 200 4613 "-" "CCBot/2.0 (http://commoncrawl.org/faq/)"

188.40.97.23 - - [06/Jul/2015:11:21:53 +0000] "GET /page/64?showimage=125 HTTP/1.0" 301 649 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 - James BOT - WebCrawler http://cognitiveseo.com/bot.html"
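
head works the same way from the other end of the file; to check, say, the five oldest entries still in the log:

$ head -n 5 access.log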

 

2. View specific lines with sed


 

sed, the stream editor, is a utility that parses, formats and replaces text in an input stream using regular expressions and its own programming syntax. sed reads from STDIN or from text files. Print line 15:

 

$ sed -n 15p access.log

 

Print lines 3 to 100 (inclusive):

 

$ sed -n 3,100p access.log

 

3. More sed


 

sed can print lines based on almost any criterion: every nth line, lines where a given character position holds an "f", or lines matching a regex. Here are a few examples. This one (a GNU sed extension) prints every fourth line:

 

$ sed -n '0~4p' access.log

 

Or find lines with the Facebook user agent:

 

$ tail -n 300 access.log | sed -n '/facebook/p'
2a03:2880:3010:9ff6:face:b00c:0:1 - - [07/Jul/2015:20:39:01 +0000] "GET /archives/13544804758 HTTP/1.1" 206 9857 "-" "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
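
And, as promised, a positional match: this prints only the lines whose third character is "f":

$ tail -n 300 access.log | sed -n '/^..f/p'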

 

4. Find antipatterns


 

To repeat: sed can print on almost any criterion at all, even lines that match the inverse of a pattern. A fun use is to find interesting user agents: almost every browser user agent contains "Mozilla", so anything without it tends to be a bot, crawler or feed reader:

 

$ tail -n 30 access.log | sed -n '/Mozilla/!p'
199.16.186.106 - - [07/Jul/2015:20:18:53 +0000] "GET /feed HTTP/1.1" 301 652 "-" "BoardReader Blog Indexer(http://boardreader.com)"
208.79.208.172 - - [07/Jul/2015:20:18:54 +0000] "GET /feed HTTP/1.1" 200 13218 "-" "BoardReader Blog Indexer(http://boardreader.com)"
2a01:7e00::f03c:91ff:fe56:2846 - - [07/Jul/2015:20:30:44 +0000] "POST /wp-cron.php?doing_wp_cron=1436301044.4617938995361328125000 HTTP/1.0" 200 6179 "-" "WordPress/4.2.2; https://www.bhalash.com"
2a03:2880:3010:9ff6:face:b00c:0:1 - - [07/Jul/2015:20:39:01 +0000] "GET /archives/13544804758 HTTP/1.1" 206 9857 "-" "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
65.19.138.33 - - [07/Jul/2015:20:42:42 +0000] "GET /feed HTTP/1.1" 200 13526 "-" "Feedly/1.0 (+http://www.feedly.com/fetcher.html; like FeedFetcher-Google)"

 

'/Mozilla/!p' will match any line that does not contain "Mozilla".
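
grep has the same inverse match built in as the -v flag, if you find that more readable:

$ tail -n 30 access.log | grep -v Mozilla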

 

5. Split output with tee


 

In several examples I use the tee command to show output before and after processing. tee writes the STDOUT of a command to the specified text file while also passing the output along to the next command in the pipeline.
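
In its plain form, tee lets you keep a copy of an extract while continuing to work on it; sample.log here is just a hypothetical scratch file:

$ tail -n 1000 access.log | tee sample.log | wc -l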

 

In my examples I use process substitution to pipe the output into other commands instead of into a file:

 

$ tail -n 1 access.log | tee >(wc -w) >(wc -c) >(awk '{ print $1 }')
66.249.67.95 - - [07/Jul/2015:20:41:29 +0000] "GET /archives/98 HTTP/1.1" 200 13906 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.67.95
163
15

 

The order of tee's output can be unpredictable, because each command's results are printed as soon as it completes.

 

6. Display surrounding lines with grep


 

It can be useful to have context around a given line, for example to see whether there are any related errors. The grep command can output contextual lines before and after each match:

 

$ tail -n 300 access.log | grep -C 3 -m 1 jquery
180.76.15.143 - - [06/Jul/2015:20:07:27 +0000] "GET /archives/173 HTTP/1.1" 301 668 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
180.76.15.155 - - [06/Jul/2015:20:07:28 +0000] "GET /archives/173 HTTP/1.1" 200 10489 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
180.76.15.147 - - [06/Jul/2015:20:07:35 +0000] "GET /our-wp-content/themes/sheepie/assets/js/min/functions.min.js?ver=2 HTTP/1.1" 200 7492 "https://www.bhalash.com/archives/173" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
180.76.15.142 - - [06/Jul/2015:20:07:35 +0000] "GET /wp-includes/js/jquery/jquery-migrate.min.js?ver=1.2.1 HTTP/1.1" 200 9993 "https://www.bhalash.com/archives/173" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
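
-C prints context on both sides of the match; -B and -A restrict it to the lines before or after it, respectively:

$ tail -n 300 access.log | grep -B 3 -m 1 jquery
$ tail -n 300 access.log | grep -A 3 -m 1 jquery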

 

7. Format output with AWK


 

AWK is a Turing-complete language for processing and formatting text streams. Although AWK programs can be kept in script files, the language is designed for one-liners at the command line.

 

Print the first column (the IP address) from the top of access.log:

 

$ awk '{ print $1 }' access.log
66.249.64.76
66.249.64.76
66.249.64.71
14.207.190.119
14.207.190.119
14.207.190.119

 

$ head -n 1 access.log | awk '{ print "IP: " $1 }'
IP: 66.249.64.76

 

8. More formatting with AWK


 

One of AWK's real strengths is that it can reshape raw input into labelled, readable output without you having to edit the source text by hand. This example prints the first, fourth and fifth columns of a line after the square brackets are removed:

 

$ tail -n 1 access.log | tee >(echo; awk '{ print "IP: " $1; print "Date: " $4$5 }' | sed -e 's/\[//g;s/\]//g')

 

66.249.78.54 - - [07/Jul/2015:20:23:23 +0000] "GET /archives/13544803360 HTTP/1.1" 200 10786 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

 

IP: 66.249.78.54
Date: 07/Jul/2015:20:23:23+0000

 

9. B-AWK-ward output


 

AWK can print text from the last column with the $NF variable:

 

$ tail -n 1 access.log | tee >(awk '{ print $NF }')
107.182.143.46 - - [07/Jul/2015:20:45:30 +0000] "GET /feed/atom HTTP/1.1" 200 7945 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36"
Safari/537.36"

 

10. Watch visitors in realtime


 

tail -f will monitor the specified file and display new lines as they are added. This is a fantastic development tool.

 

$ tail -n 1 -f access.log
67.196.50.63 - - [06/Jul/2015:16:16:09 +0000] "GET /archives/13544804736 HTTP/1.1" 200 11361 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.66 Safari/537.36"
<refresh site>
62.210.215.119 - - [06/Jul/2015:16:16:08 +0000] "GET /feed HTTP/1.1" 200 15568 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.66 Safari/537.36"
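
Combined with grep this becomes a live filter; --line-buffered stops grep from holding back output while the stream stays open. A rough but effective way to watch 404s as they happen:

$ tail -f access.log | grep --line-buffered ' 404 '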

 

11. Sort and filter


 

There will be times, usually when things break, that you need to filter the log file. It might be to support a usage report, to pinpoint an error, or to identify unique users. First extract a subset of the data with head, tail or sed, and from there pipe it through other commands to parse it. For instance, I want to see failed login attempts on my blog:

 

$ tail -n 50000 access.log | grep 'wp-login\|regist' | grep GET
91.188.72.170 - - [02/Jul/2015:20:25:07 +0000] "GET /wp-login.php HTTP/1.0" 301 631 "-" "-"
37.54.90.68 - - [02/Jul/2015:21:45:33 +0000] "GET /wp-login.php HTTP/1.1" 301 663 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
5.58.31.82 - - [02/Jul/2015:21:45:47 +0000] "GET /wp-login.php HTTP/1.1" 301 667 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
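
To identify unique users, as mentioned above, sort -u handles the deduplication:

$ tail -n 50000 access.log | awk '{ print $1 }' | sort -u | wc -l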

 

12. Dissect DNS and IP information


 

The handy dig and nslookup commands can return useful information from IP addresses and domains alike:

 

$ tail -n 1 access.log | awk '{ print $1 }' | tee >(xargs nslookup) >(xargs dig -x)

 

; <<>> DiG 9.9.5-3ubuntu0.2-Ubuntu <<>>
;; global options: +cmd
;; Got answer:

;; Query time: 1 msec
;; SERVER: 109.74.192.20#53(109.74.192.20)
;; WHEN: Mon Jul 06 21:45:46 UTC 2015

 

I have trimmed the output, as it is quite verbose. dig returns public information on the nameservers, if any, tied to a domain, while a reverse lookup (dig -x) shows whether a hostname is attached to an IP address.
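
For a single address a direct reverse lookup is the simplest approach; +short keeps the answer to one line:

$ dig -x 66.249.78.54 +short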

 

13. Read archived logfiles


 

Cron jobs compress, prune and archive access.log files on a regular basis before their size can become a problem. My personal site receives only moderate traffic, yet:

 

$ du -hs access.log && wc -l access.log
410M    access.log
2307681 access.log

 

zcat dumps the content of an archived log file to STDOUT:

 

$ zcat access.log.gz | head -n 1
79.97.253.56 - - [03/Nov/2014:14:38:26 +0000] "GET /icons/blank.gif HTTP/1.1" 200 430 "http://www.bhalash.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.19 Safari/537.36"
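
zgrep and zless work on the compressed files directly, so there is no need to extract an archive just to search it:

$ zgrep -c wp-login access.log.gz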

 

14. Speed up grep


 

You can make grep run much faster on large files, sometimes by an order of magnitude or more, if you reset the locale before your search. Modern systems default to a UTF-8 locale, and Unicode defines well over 100,000 characters, so grep has to do multi-byte character handling on every line it reads. Reset the locale to C and grep only has to deal with 128 single-byte ASCII characters.

 

#!/bin/zsh
# Run grep under the C (ASCII) locale for the speed boost described above.
LC_ALL=C grep "$@"

 

$ chmod +x ./lc_grep.sh
$ ./lc_grep.sh -e 'wp-login.php\|regist' access.log
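
You can measure the difference yourself with time; for a one-off search no wrapper script is needed at all:

$ time grep -c wp-login access.log
$ time LC_ALL=C grep -c wp-login access.log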

 

15. Geolocate visitors


 

The geoiplookup utility and its databases ship with several distributions. It looks up a supplied IP address and, depending on the quality of the database, returns information on the country and region:

 

#!/bin/zsh
# Resolve the supplied hostname to an IP address, then look that IP up.
ip=$(dig "$1" | grep -A 1 'ANSWER SECTION' | tail -n 1 | awk '{ print $NF }')
geoiplookup "$ip" | sed -e 's/^.*: //g'

 

$ ./get_country serversaustralia.com.au
AU, Australia
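
If you already have a bare IP address from the log, you can call geoiplookup on it directly:

$ geoiplookup 66.249.78.54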

 

16. Master regular expressions


 

Regular expressions are the most powerful tool to have when parsing log files. Take the time to learn how to put together a detailed, efficient regular expression for awk, sed and grep:

 

$ tail -n 1 access.log | tee >(echo -ne '\n'; sed -e 's/^.*-" "//g;s/"//g')
66.249.78.54 - - [07/Jul/2015:20:48:45 +0000] "GET /archives/13544793529 HTTP/1.1" 200 10388 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

 

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

 

17. Visitor lookup


 

This is a typical script you might use to investigate a visitor to your site. In this case I look up a Google crawler:

 

#!/bin/zsh
# Usage: ./visitors.sh <pattern> <logfile>
matches=$(grep "$@")
last_visit=$(echo "$matches" | tail -n 1)
echo "IP: $(echo "$last_visit" | awk '{ print $1 }')"
echo "Total visits: $(echo "$matches" | wc -l)"
echo "User Agent: $(echo "$last_visit" | sed -e 's/^.*-" "//g' | tr -d '"')"
echo "Last visit: $(echo "$last_visit" | tr -d '[]' | awk '{ print $4$5 }')"

 

$ ./visitors.sh 66.249.78.54 access.log
IP: 66.249.78.54
Total visits: 3856
User Agent: Mozilla/5.0 (compatible; Googlebot/2.1;+http://www.google.com/bot.html)
Last visit: 07/Jul/2015:21:51:35+0000

 

18. Colourize output with multitail


 

Straight white-on-black output can blend one line into the next when you look at enough of them. multitail is a tail-like utility that lets you colourise output, while grc, the generic colouriser, will add colour to the output of almost any command. Call multitail or grc with:

 

$ multitail -f access.log
$ grc tail -f -n 50 access.log

 

19. See missing files


 

AWK can act like grep on input data by taking a column or line and performing a regex match on it:

 

$ tail -n 300 access.log | awk '($9 ~ /404/)' | awk '{ print $9 ": " $7 }'
404: /apple-touch-icon.png
404: /archives/11107275524
404: /our-wp-content/plugins/revslider/js/admin.js

 

In the above command, ($9 ~ /404/) prints only those lines whose ninth column, the status code, matches 404. That output is then piped to a second AWK command which formats each line.
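
The two invocations can also be collapsed into a single AWK pattern-action pair:

$ tail -n 300 access.log | awk '$9 ~ /404/ { print $9 ": " $7 }'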

 

20. Report on access codes


 

This handy one-liner extracts the status code from every line, counts the occurrences of each code and sorts the counts in descending order:

 

$ awk '{ print $9 }' access.log | sort | uniq -c | sort -rn
1999022 200
188287 301
65430 302
36262 404
15242 304
6299 500
2743 204
618 429
600 206
599 405
174 409
123 400
90 403
72 503

 

21. Report on missing files


 

This final one-liner combines the last two to show the top 10 missing files encountered by visitors.

 

$ awk '($9 ~ /404/)' access.log | awk '{ print $7 }' | sort | uniq -c | sort -nr | head -n 10
5343 /our-wp-content/themes/rmwb/assets/images/fallback.jpg
3199 /our-wp-content/themes/rmwb/images/favicon.png
1471 /our-wp-content/themes/sheepie/assets/images/fallback.jpg
406 /apple-touch-icon.png
345 /contact
301 /archives/751303118
266 /13544793483/m/1.jpg
250 /our-wp-content/themes/rmwb/img/icon.png
200 /archives/tag/killer/page/our-wp-content/plugins/reflex-gallery/admin/scripts/FileUploader/php.php

 

 
