Self-Hosted Web Traffic Analytics

29 June 2022 #apache #ruby

I like looking at visitor statistics on my personal blog, not because it really matters or affects anything… I just think it’s pretty interesting. I looked at some of the popular options like Google Analytics, but they didn’t quite fit what I was looking for. I came up with a few requirements:

  1. Must be free
  2. Should respect visitors’ privacy (no sharing data with advertisers)
  3. Minimal performance impact

Items #1 and #2 were usually mutually exclusive, except for some open source projects, and those generally seemed like more hassle to set up than they were worth. A lot of options also required including extra JavaScript files, which I wanted to avoid too.

In the end, I settled on just using the logs generated by the Apache web server and wrote a small Ruby script to parse those log files.

Apache Logs

The default Apache logs include a good amount of information, but you can also configure Apache to log additional information that’s useful for site analytics, such as the Referer¹ and User-Agent header fields. You can find the Apache log configuration documentation here.

This is a snippet of my personal blog’s Apache config file with a custom log format:

# /etc/apache2/sites-available/caleb.software.conf

<VirtualHost *:80>
        ServerName caleb.software

        ErrorLog ${APACHE_LOG_DIR}/error.log

        LogFormat "%t %h %U %>s %{Referer}i %{User-Agent}i" blog
        CustomLog /var/log/site/custom.log blog
</VirtualHost>

First, we define a custom LogFormat named “blog” and then set the CustomLog file path and format. Here’s what each of the custom log flags means:

  - %t: the time the request was received
  - %h: the client’s IP address (or hostname, if hostname lookups are enabled)
  - %U: the URL path requested, not including any query string
  - %>s: the final HTTP status code of the response
  - %{Referer}i: the contents of the Referer request header
  - %{User-Agent}i: the contents of the User-Agent request header

Make sure Apache has permission to write to the custom log file’s destination. I just created an empty file with touch custom.log and then chown‘d it.
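For example, on a Debian-based system, that setup might look something like this (assuming Apache runs as the www-data user, which is the Debian/Ubuntu default):

# assumes Apache runs as www-data (the Debian/Ubuntu default)
sudo touch /var/log/site/custom.log
sudo chown www-data:www-data /var/log/site/custom.log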

Ruby Script

I wrote an accompanying Ruby script to process these log files and pull some insights out of them. Since my blog is a static site built with Jekyll, the script just spits out a markdown file which gets built into the final site. The script is only about fifty lines long, and you can find it here on GitHub Gists.

The resulting markdown file is also pretty simple, with just a couple of tables listing the most popular pages (including total views and number of unique viewers) and the most frequent referrers. Unique viewers are determined by combining the User-Agent string with the requester’s IP address… definitely not a perfect method, but good enough to give us a rough estimate.
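The actual script lives in the gist linked above, but a stripped-down sketch of the counting logic looks something like this (the regex, log path, and table columns here are illustrative stand-ins, assuming the “blog” LogFormat from earlier):

# analysis_sketch.rb - a rough sketch, not the full script from the gist
# Assumes lines match the "blog" LogFormat: %t %h %U %>s %{Referer}i %{User-Agent}i
require "set"

LINE = /\A\[(?<time>[^\]]+)\] (?<ip>\S+) (?<path>\S+) (?<status>\d+) (?<referer>\S+) (?<agent>.*)\z/

views   = Hash.new(0)                        # page path => total views
uniques = Hash.new { |h, k| h[k] = Set.new } # page path => set of "ip|user-agent" keys

File.foreach("/var/log/site/custom.log") do |line|
  m = LINE.match(line.chomp) or next
  next unless m[:status] == "200"            # skip redirects, 404s, etc.
  views[m[:path]] += 1
  uniques[m[:path]] << "#{m[:ip]}|#{m[:agent]}" # the rough "unique viewer" key
end

# Emit a markdown table of the most popular pages.
puts "| Page | Views | Unique Viewers |"
puts "| --- | --- | --- |"
views.sort_by { |_, count| -count }.first(10).each do |path, count|
  puts "| #{path} | #{count} | #{uniques[path].size} |"
end

Reading the log line-by-line with File.foreach keeps memory usage flat even as the file grows.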

Cron Job

Finally, the Ruby script is run every hour by a simple crontab entry:

0 * * * * cd /var/www/site && ruby analysis.rb && jekyll build

Roadmap

There are a few extra features I’d like to add to the script in the future. It would be nice to see a chart of the number of viewers of each page over time. It would also be cool to see which posts result in the most email signups and engagement.

  1. The English word is actually spelled “referrer”, but because of a funny typo back in the ’90s the HTTP header is named “referer”. You can read more about it on Wikipedia.