Due to a clerical error at my ISP, my internet connection is down for the weekend. So what's an activity that doesn't require a working internet connection? Blogging. Earlier this year I was interviewing and on 3 different occasions I was presented with a technical challenge of parsing Apache access logs. I know from experience that the goal of the challenge is to gauge how well I can parse text files with the standard Unix command line tools (mainly awk, but also grep, wc, sed and some plain shell scripting).
Now, awk is a really great tool and its performance can beat a Hadoop cluster (as long as the data fits on a single machine), and the challenge is usually limited enough to be doable with awk. However, a solution written in Python may be easier to read and debug, and because Python is a much more flexible tool, it can handle pretty much any problem (even a real-life one) that involves parsing Apache's access logs.
For this I'm going to use Parse (the parse package). With Parse we can write specifications (sort of the reverse of f-strings) and use them to parse log lines and get back nicely structured data. Also, for bonus points, I'm going to use generators, which process the log one line at a time instead of loading it all into memory (that should also help performance a bit).
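To make the generator point concrete, here is a minimal sketch that uses only the standard library. The names `read_records` and `sample` are made up for illustration and the splitting is deliberately naive; the only point is that each line is handled on demand instead of the whole file being read up front.

```python
import io
from itertools import islice


def read_records(fh):
    """Lazily yield one (ip, rest) tuple per non-empty line."""
    for line in fh:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        ip, _, rest = line.partition(" ")
        yield ip, rest


# Hypothetical sample data standing in for a real log file.
sample = io.StringIO("10.0.0.1 first\n\n10.0.0.2 second\n10.0.0.3 third\n")

records = read_records(sample)        # nothing has been read yet
first_two = list(islice(records, 2))  # consumes only as many lines as needed
print(first_two)
```

Because `records` is a generator, the remaining lines stay unread until something asks for them, which is what keeps memory use flat on large logs.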
    from parse import compile


    def parse_log(fh):
        """Read the file handle line by line and yield a parse result
        (indexable by field name) for every line that matches.
        """
        parser = compile(
            '''{ip} {logname} {user} [{date:th}] "{request}" {status:d} {bytes} "{referer}" "{user_agent}"'''
        )
        for line in fh:
            # Drop the trailing newline so the pattern can match the whole line.
            result = parser.parse(line.rstrip("\r\n"))
            if result is not None:
                yield result
This function will work with any file opened with open (in text mode) or with sys.stdin. Let's grab an example log file from Elastic's examples repo and print the 10 most frequent client IP addresses.
    import urllib.request

    ipaddresses = {}

    with urllib.request.urlopen(
        "https://github.com/elastic/examples/blob/master/Common%20Data%20Formats/apache_logs/apache_logs?raw=true",
    ) as fh:
        # urlopen yields bytes, so decode each line to text before parsing.
        lines = (line.decode("utf-8", errors="replace") for line in fh)
        for record in parse_log(lines):
            ip = record["ip"]
            if ip in ipaddresses:
                ipaddresses[ip] = ipaddresses[ip] + 1
            else:
                ipaddresses[ip] = 1

    sorted_addresses = sorted(
        ipaddresses.items(),
        key=lambda x: x[1],
        reverse=True,
    )

    # Print the 10 most frequent addresses (fewer if the log has fewer).
    for ip, count in sorted_addresses[:10]:
        print(f"{ip}: {count}")
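The counting and sorting can also be sketched with collections.Counter from the standard library, which removes the manual dictionary bookkeeping and copes with logs that contain fewer than 10 distinct addresses. The `records` list below is made-up stand-in data, not real log output; in practice it would be the generator returned by parse_log.

```python
from collections import Counter

# Hypothetical stand-in for the records yielded by parse_log().
records = [
    {"ip": "10.0.0.1"},
    {"ip": "10.0.0.2"},
    {"ip": "10.0.0.1"},
    {"ip": "10.0.0.3"},
    {"ip": "10.0.0.1"},
    {"ip": "10.0.0.2"},
]

# Count how many times each client IP appears.
counts = Counter(record["ip"] for record in records)

# most_common(10) returns at most 10 (ip, count) pairs, highest count first.
top = counts.most_common(10)
for ip, count in top:
    print(f"{ip}: {count}")
```

Counter accepts any iterable, so it consumes a generator just as happily as a list.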
Obviously this is a simple example, but the method isn't limited to it. There's no messing around with delimiters, no worrying about long strings inside quotation marks, and no checking the awk man page for functions you've never used before. The resulting code is pretty clear, and the performance should be on par with a shell script you can whip together.
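To illustrate the delimiter problem: splitting a log line on whitespace also splits the quoted request and user agent fields. The line below is a made-up example in Apache's combined log format, and the regular expression is just one possible standard-library way to keep quoted fields together; it is not a description of what Parse does internally.

```python
import re

# Made-up log line in Apache's combined format.
line = (
    '203.0.113.7 - - [17/May/2015:10:05:03 +0000] '
    '"GET /index.html HTTP/1.1" 200 1234 '
    '"http://example.com/" "Mozilla/5.0 (X11; Linux x86_64)"'
)

# Naive whitespace splitting breaks the quoted and bracketed fields apart.
naive = line.split()
print(len(naive))

# Treat "...", [...] and bare words as single tokens instead.
tokens = re.findall(r'"[^"]*"|\[[^\]]*\]|\S+', line)
print(len(tokens))
print(tokens[-1])
```

With a Parse specification, none of this tokenizing has to be written by hand, which is the convenience the post is arguing for.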