19 Oct

21 Awesome SSH Log File Analysis Commands

Log file are one of our greatest diagnostic tools. Whenever something goes wrong — or maybe when it goes right — you want to find out why in precise detail, but to do that you need to parse, reduce and extract from multi-gigabyte log files.  In this article I give out tips and commands to help you parse log files through the example of the Apache access.log file.

The access.log file by itself is a powerful analytic resource. Through it you can see referrals, detect hotlinks to deep content and patterns in traffic.

Apache’s can log every visitor and request made in the access.log file. The location and name of the Apache log file will vary by *nix platform, so here is the access.log format that I use for this article:

LogFormat "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" combined

1. %h – IP address.

2. %l – Remote logname. Disabled by default unless IdentityCheck On is enabled. Will be dash if disabled.

3. %u – Remote username. As above.

4. %t – Timestamp, English format.

5. %r – First line of request. Whether request is POST or GET, the target file on the domain, and the protocol.

6. %>s – HTTP response code number.

7. %0 – Number of bytes sent.

8. %{Referer]i – The referring URL. A link was clicked or content was hosted on this page.

9. %{User-Agent]i – Visitor’s user agent string.

There are frequent outputs of log file entries in this article, but for clarity I have truncated each example to one file each.

1. See the most recent entries

The access.log file is too big to read in one glance. I’m not joking, either; my own website doesn’t receive more than 100 hits on a given day, and yet:

$ du -hs access.log
409M    access.log
$ wc -l access.log
2304883 access.log

409 MB on disk and with 2,304,883 lines of hits. I keep everything because I often go through the log for curiosity.

du

returns the size of the file on the hard disk in whatever format you tell it, while

wc

counts the number of lines, words, or characters in input.

head

and

tail

are two great commands to start extracting data from the log file.
‘head’
and
‘tail’
dump the top or bottom X lines of a file to output. In this case I ask tail to pick the bottom line of
access.log
:

$ tail -n 1 access.log
23.20.9.23 - - [06/Jul/2015:11:21:48 +0000] "GET /archives/2374 HTTP/1.0" 200 4613 "-" "CCBot/2.0 (http://commoncrawl.org/faq/)"

2. View specific lines with sed

The commands above all generate a stream of text: ‘tail’ and ‘head’ output things, while ‘du’ and ‘wc’ count things. sed (to recurse) is the program that does the things with the things.

sed sits at the middle or the end of the pipeline, where it can modify the stream using regular expressions. One thing I do all the time is substitute spaces in filenames with friendlier underscores:

$ a=”this has spaces”
$ sed ‘s/\s/_/g’ <<< $a
this_has_spaces

Throughout the rest of this article I use sed for both substitution, and to print specific lines from a file.  Print line 15:

span>$ sed -n 15p access.log
Print lines 3 to 100 (inclusive):
$ sed -n 3,100p access.log

3. More sed

sed can print lines based on almost any criteria: you can ask for every other line; or specify lines where the ninth character is “f”.

Print every fourth line, starting at 0:

span>$ sed -n 0-4p access.log

Or find lines with the Facebook user agent. Facebook have a crawler that they use to add sites to an index like Google’s, so you share a link to a page, information on the page will be ready:

$ tail -n 300 access.log | sed -n  '/facebook/'
2a03:2880:3010:9ff6:face:b00c:0:1 - - [07/Jul/2015:20:39:01 +0000] "GET /archives/13544804758 HTTP/1.1" 206 9857 "-" "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"

4. Find antipatterns

sed can print a line on almost any criteria at all. You can even print lines that don’t match a pattern. A fun use is to find interesting user agents, since every browser is Mozilla:

$ tail -n 30 access.log | sed -n '/Mozilla/!p' 
199.16.186.106 - - [07/Jul/2015:20:18:53 +0000] "GET /feed HTTP/1.1" 301 652 "-" "BoardReader Blog Indexer(http://boardreader.com)"
'/Mozilla/!p'

matches any line that does
not
contain “Mozilla”. This is an unconventional use for
sed
, but it serves as a great example of the strength of the program.

5. Split output with tee

tee is one method to split output from a command:

  1. One stream is redirected to a file.
  2. The other stream goes to STDOUT and continues the pipe.

Here, I pipe the final line of the file into tee. I redirect the output into process substitution instead of to a file. In this manner I can perform several tasks in parallel without having to iterate. The order of output of tee can be unpredictable because results are output as soon as completed. To explain the order:

  1. STDOUT prints the entire line.
  2. awk prints the first column of the line.
  3. wc prints the character count.
  4. wc prints the word count.
$ tail -n 1 access.log | tee >(wc -w) >(wc -c) >(awk '{ print $1 }')
66.249.67.95 - - [07/Jul/2015:20:41:29 +0000] "GET /archives/98 HTTP/1.1" 200 13906 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.67.95
163
15

6. Display surrounding lines with grep

grep — which I cover at length in my article on grep and regex — is one of the first and still powerful search utilities. grep uses regular expressions to find matching text. Here I searched for instances of a Baidu crawler that accessed an obscure part of the WordPress filesystem:

grep -C 1 -m 1 ‘jquery’ access.log
79.97.253.56 - - [03/Nov/2014:14:42:22 +0000] "GET /style.css HTTP/1.1" 200 1704 "http://www.bhalash.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.19 Safari/537.36"
79.97.253.56 - - [03/Nov/2014:14:42:22 +0000] "GET /jquery-2.1.0.min.js HTTP/1.1" 200 29653 "http://www.bhalash.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.19 Safari/537.36"
79.97.253.56 - - [03/Nov/2014:14:42:23 +0000] "GET /functions.js HTTP/1.1" 200 607 "http://www.bhalash.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.19 Safari/537.36"usi

grep -C 1 provides a context of one line on either side of the match, and -m 1 restricts grep to the first match found.

7. Format output with AWK

grep finds the things.

sed does things with the things.

awk makes the things pretty.

AWK is a utility that forms the third leg of our tripod of tools. grep finds the lines we want to use; sed performs substitutions to make it more accessible; and AWK formats the output however we specify Although AWK can be scripted– AWK is an entire programming language by itself! — it is supposed to take input from STDIN.

The commonest use is to print a given column (words separated by spaces) from input. Here I want the IP address in the first column of access.log:

$ awk '{ print $1 '} access.log
66.249.64.76

Here I insert “IP: “ before each address as a way to add more order:

$ awk '{ print "IP: " $1 }’ < access.log
IP: 66.49.64.76

8. More formatting with AWK

AWK’s is powerful enough to any parse raw input into a table without any extra effort on my part. This example prints the IP and timestamp of a visit after I remove extra characters [“like these”] with sed:

$ awk 'END{ print "IP: " $1; print "Date: " $4$5 }' < access.log | sed 's/\[//g;s/\]//g' 
IP: 157.55.39.30
Date: 30/Jul/2015:19:24:10+0000

END prints the final line only of input. There is an implicit newline between print commands.

9. B-AWK-ward output

AWK can print text from the last column with the

$NF

variable:

$ awk 'END{print $NF}' < access.log
Safari/537.36"

10. Watch visitors in realtime

tail -f
tells
tail
to monitor the specified file for appended changes, and display new lines in STDOUT. This is a fantastic development tool. When I work on a site I will keep several debug logs open in
tail
and keep an eye on them as I make changes.

$ tail -n 1 -f access.log
67.196.50.63 - - [06/Jul/2015:16:16:09 +0000] "GET /archives/13544804736 HTTP/1.1" 200 11361 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.66 Safari/537.36"
<I refresh site and the new visit is added to output>
62.210.215.119 - - [06/Jul/2015:16:16:08 +0000] "GET /feed HTTP/1.1" 200 15568 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.66 Safari/537.36"

11. Sort and filter

grep supports powerful regex which allows. In this case I wanted to examine the waves of bots who attack my own website’s WordPress login:

$ grep -P '(wp-login|regist)' access.log | grep GET
91.188.72.170 - - [02/Jul/2015:20:25:07 +0000] "GET /wp-login.php HTTP/1.0" 301 631 "-" "-"

I tackle this in great depth in my article on regex, but (foo|bar) will find either foo OR bar on a line. I pipe this into a second command that looks for the GET command, that is, when the bot requested the login pages, because each request generated at least two lines:

AWK can print text from the last column with the $NF variable:

$ awk 'END{print $NF}' < access.log
Safari/537.36"
  1. The request for the login page.
  2. The redirect to the “wrong password” screen.

12. Dissect DNS and IP information

The handy dig and nslookup commands search the Internet for information about domains and IP addresses alike:

Take dig: it is interrogates nameservers about either an IP address or domain, and will return anything it finds. Depending on the configuration of the server, you might see information about related domains. I have truncated the full output from the command as it runs to more than 30 lines.

$ dig $(awk ‘END{ print $1 }’ access.log)
; <<>> DiG 9.9.5-3ubuntu0.2-Ubuntu <<>>
;; global options: +cmd
;; Got answer:
...
;; Query time: 1 msec
;; SERVER: 109.74.192.20#53(109.74.192.20)
;; WHEN: Mon Jul 06 21:45:46 UTC 2015
…

nslookup will query any domains that connect to the IP.

$ nslookup $(awk 'END{print $1 }' access.log)
[INSERT]
Server:  109.74.192.20
Address: 109.74.192.20#53
** server can't find 251.251.185.146.in-addr.arpa: NXDOMAIN

nslookup and dig do their best work when you start with a domain and trace it back to the IP address, but sometimes you will uncover their domain.

13. Read archived logfiles

Cron jobs compress, prune and archive access.log files on a regular basis before their size can become a problem. One regular cron job creates gzipped copy that day’s access.log file and then empties the original file.

zcat reads the content of an gzipped file into STDOUT:

$ zcat access.log.tar.gz | head -n 1
79.97.253.56 - - [03/Nov/2014:14:38:26 +0000] "GET /icons/blank.gif HTTP/1.1" 200 430 "http://www.bhalash.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.19 Safari/537.36"

You can join this archived output with grep’d or cat’d output from the original access.log file:

awk { print “IP: “ $1 }’ < <({zcat access.log*gz & cat access.log}
IP: 66.49.64.76

14. Speed up grep

You can force grep to run up to hundreds of times faster if you reset the system locale before your search. Modern terminal emulators use the 110,000-character Unicode charset. When grep searches through Unicode, it includes all those thousands of extra glyphs. If you reset the system locale to use ASCII, grep only has to include 128.

Here I have a small script named lc_grep that resets the terminal locale to use the ASCII charset. The computer’s locale represents:

  • The charset (UTF8 or ASCII or whatever).
  • Timestamp (UTC + or – hours).
  • Language it uses.

The downside is that grep will work in a fuzzier manner. A good example is the German ü versus the English u. grep will interpret the former as the latter,  so a search for non-English characters may return odd results.

#!/bin/zsh
OLD_LC=$LC_TIME
LC_TIME=C
grep "$@"
LC_TIME=$OLD_LC

After the locale is reset, grep’s arguments past as normal through $@, which is a list of all arguments given to the script.

$ lc_grep -P '(wp-login.php|regist)' access.log

15. Geolocate visitors

The geoiplookup utility and its databases ship as part of several distributions. The utility consults supplied IP addresses and, depending on the quality of the database, returns information on the country and region. Here is a small wrapper script that takes any domain or IP and strips out a geoiplookup’s edition and country code:

#!/bin/zsh
ip=$(dig $1 | grep -A 1 'ANSWER SECTION') | awk 'END{ print $NF }')
echo geoiplookup $ip | sed -e 's/^.*: //g'

$ ./get_country serversaustralia.com.au

AU, Australia

16.  Master regular expressions

I’ve already mentioned regular expressions are my most powerful tool when I parse files. It’s well worth your time to check out my other article and introduce yourself to regex in grep. Although AWK can print any number of columns, each user agent string is different and unique in length. It is easier to remove the preceding line with regex:

$ tail -n 1 access.log | sed 's/^.*-" "//g;s/"//g'
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

The line before the user agent string ends with literally -” “, so first I remove that. There is a final quote mark after the user agent which I then remove

17. Visitor lookup

This is a typical script I use to investigate a visitors to my  site. In this case I use lookup a Google scrape bot:

#!/bin/zsh
matches=$(grep $@)
last_visit=$(echo $matches | tail -n 1)
echo "IP: $(echo $last_visit | awk '{print $1 }')"
echo "Total visits: $(echo $matches | wc -l)"
echo "User Agent: $(echo $last_visit | sed -e 's/^.*-" "//g' | tr -d '"')"
echo "Last visit: $(echo $last_visit | tr -d '[]' | awk '{ print $4$5 }')"

You pass it the script an IP address and the target log file, and from there it returns the total visits and date and time of the last visit.

$ ./visitors.sh 66.249.78.54 access.log
IP: 66.249.78.54
Total visits: 3856
User Agent: Mozilla/5.0 (compatible; Googlebot/2.1;+http://www.google.com/bot.html)
Last visit: 07/Jul/2015:21:51:35+0000

18. Colourize output with multitail

Straight white-on-black lines  blend one line into another when you look enough of them for four hours at a time. multitail is an awesome wrapper for tail that colourised output. Separate to this, grc, “the generic colouriser”, will add colour to any command.

multitail is a straight-up replacement for tail:

$ multitail -f access.log

You first call grc and then pass through the rest of your command as as arguments:

$ grc tail -f -n 50 access.log

19. See missing files

awk has abilities like awk and sed, in that it can match patterns and perform substitutions:

awk 'END{($9 ~ /404/)}' < access.log | awk '{print $9": " $7}'
404: /apple-touch-icon.png

In the above command ($9 ~ /404/) only prints lines where the returned status code is  “404”, which It forwards to another AWK command to format.

20. Report on access codes

This handy one-liner sorts the log, counts occurrences of a string and sorts the output in descending order:

$ awk '{ print $9 }' access.log | sort | uniq -c | sort -rn
1999022 200
 188287 301
  65430 302

The position of the return code on the line varies by distro. Check your own access.log file and find the return code before you start!

21. Report on missing files

This final one-liner combines the last two to show the top 10 missing files encountered by visitors. sort and  uniq should be self-explanatory:

  • sort rearranges the file, either by letter or number;
  • while uniq removes adjacent, identical lines.

This pipeline aggregates duplicate entries, sort occurrences in numerical order, and outputs the  top missing files. In this case the biggest culprit was a file that I hadn’t yet uploaded while testing the site during development.

$ awk '($9 ~ /404/)' access.log | awk '{ print $7 }' | sort | uniq -c | sort -nr
   5343 /wp-content/themes/rmwb/assets/images/fallback.jpg
   3199 /wp-content/themes/rmwb/images/favicon.png
 1471 /wp-content/themes/sheepie/assets/images/fallback.jpg

Written by Servers Australia

Helping you understand the best Cloud Infrastructure options for your business is our business. For performance, reliability and solutions designed to meet your needs, we won’t be beaten.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

*

SEE MORE

Recent DDoS Attacks to the SAU Gaming Network26 Jun

Recent DDoS Attacks to the SAU Gaming Network

The past few weeks have seen fairly large scale...

Read more
The Move to Store the World11 Sep

The Move to Store the World

Last month, a revolutionary method of storing information was...

Read more
Public, Private, or Hybrid? Choosing the right cloud solution10 Aug

Public, Private, or Hybrid? Choosing the right cloud solution

It’s no big secret that cloud computing is slowly...

Read more