Capsule Stats


I was curious what the general traffic of my capsule was. For my webserver I had tinkered with the idea of setting up some sort of ELK (Elasticsearch, Logstash, Kibana) stack to get some monitoring and metrics on my actual server. But for gemini, where I actually HAVE traffic, I decided to just have a live look at it.


Logfile


I run my own server, which writes its access log in this format:


2021-04-15T02:41:04,899Z	IN	/67.86.nnn.nnn:33378	gemini://senders.io/feed/atom.xml	33
2021-04-15T02:41:04,907Z	OUT	20	application/xml; lang=en;	3452
2021-04-15T02:41:04,950Z	IN	/67.86.nnn.nnn:33380	gemini://senders.io/gemlog/feed/atom.xml	40
2021-04-15T02:41:04,951Z	OUT	20	application/xml; lang=en;	3467

These are tab-separated lines that fall into two categories: IN and OUT.


IN logline


IN logs are requests:


[timestamp] [tab] IN [tab] [IP] [tab] [URI] [tab] [SIZE]

OUT logline


OUT logs are responses:


[timestamp] [tab] OUT [tab] [STATUS] [tab] [META] [tab] [SIZE]
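
As a rough sketch of how these tab-separated fields can be pulled apart, the helper below reads log lines on stdin and prints the URI of each IN line. The function name `in_uris` is made up for illustration, and it assumes the five-field IN layout seen in the sample log lines (timestamp, IN, address, URI, size).

```bash
#!/usr/bin/env bash

# in_uris: hypothetical helper (not part of the server or calc.sh).
# Reads access-log lines on stdin and prints the URI (field 4)
# of every IN line; OUT lines are ignored.
in_uris() {
  # -F'\t' splits each line on tabs; field 2 is the direction (IN or OUT).
  awk -F'\t' '$2 == "IN" { print $4 }'
}

# Example:
#   in_uris < logs/access.log | sort | uniq -c | sort -rn | head -n5
```

Matching on the whole second field, rather than searching the line for "IN", avoids counting a line just because those letters show up somewhere in a URI.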

Generating stats


Because the lines are tab-separated, it is pretty easy to calculate some basic stats on incoming and outgoing messages using the wonderful world of bash scripting.


calc.sh


#!/usr/bin/env bash

# Check the arguments first, and stop if they are missing
if [ $# -lt 2 ]; then
  echo "Usage:
    ./calc.sh logs/access.log gemini/stats.gmi
  "
  exit 1
fi

LOGFILE=$1
OUTFILE=$2

# Stats for today
# Note: date -Id gives the local date, while the log timestamps are UTC
TODAY=$(date -Id)
echo -e "Stats for day:\t$TODAY" > "$OUTFILE"
# $'\tOUT\t' matches the direction field only; "^${TODAY}" matches the
# timestamp only, so dated gemlog filenames in URIs are not counted
echo -e "   Total Reqs:\t$(grep $'\tOUT\t' "$LOGFILE" | grep "^${TODAY}" | wc -l)" >> "$OUTFILE"
echo -e " Gemlog Reads:\t$(grep $'\tIN\t' "$LOGFILE" | grep "^${TODAY}" | grep "gemlog" | grep '\.gmi' | wc -l)" >> "$OUTFILE"
echo "Top 5 Gemlogs" >> "$OUTFILE"
echo "--------------" >> "$OUTFILE"
grep $'\tIN\t' "$LOGFILE" | grep "^${TODAY}" | cut -f4 | grep "gemlog" | grep '\.gmi' | sort | uniq -c | sort -rn | head -n5 >> "$OUTFILE"

# Stats total
# The first line of the log carries the earliest timestamp
EARLIEST=$(head -n1 "$LOGFILE" | cut -f1)
echo "" >> "$OUTFILE"
echo -e "  Stats since:\t$EARLIEST" >> "$OUTFILE"
echo -e "   Total Reqs:\t$(grep $'\tOUT\t' "$LOGFILE" | wc -l)" >> "$OUTFILE"
echo -e " Gemlog Reads:\t$(grep $'\tIN\t' "$LOGFILE" | grep "gemlog" | grep '\.gmi' | wc -l)" >> "$OUTFILE"
echo "Top 5 Gemlogs" >> "$OUTFILE"
echo "--------------" >> "$OUTFILE"
grep $'\tIN\t' "$LOGFILE" | cut -f4 | grep "gemlog" | grep '\.gmi' | sort | uniq -c | sort -rn | head -n5 >> "$OUTFILE"

# print generating timestamp
echo -e "\n// generated $(date -u -Is)" >> "$OUTFILE"

This bash script is basically a combination of grep, cut, sort, and uniq. I know I could optimize it much further, but I wrote it to filter down in steps, which helps me understand what I am filtering and why.
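
For comparison, here is a minimal sketch of what a single-pass version of the counting might look like with awk. This is not what calc.sh does: `count_stats` is a hypothetical name, and the sketch assumes the tab-separated layout from the Logfile section, with one OUT line per request.

```bash
#!/usr/bin/env bash

# count_stats: hypothetical single-pass alternative to the grep chains.
# Usage: count_stats LOGFILE [DATE_PREFIX]
# Prints "<total responses><TAB><gemlog reads>" for lines whose timestamp
# starts with DATE_PREFIX (an empty prefix means the whole file).
count_stats() {
  awk -F'\t' -v day="${2:-}" '
    # Skip lines whose timestamp does not start with the requested day.
    substr($1, 1, length(day)) != day { next }
    # Every OUT line is one response, so it counts as one request served.
    $2 == "OUT" { total++ }
    # A gemlog read is an IN line whose URI contains "gemlog" and ".gmi".
    $2 == "IN" && $4 ~ /gemlog/ && $4 ~ /\.gmi/ { reads++ }
    END { printf "%d\t%d\n", total, reads }
  ' "$1"
}

# Example:
#   count_stats logs/access.log "$(date -Id)"   # today only
#   count_stats logs/access.log                 # whole file
```

The trade-off is the one described above: this reads the file once instead of several times, but the step-by-step grep pipeline is easier to follow and to debug by hand.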


The script takes the input and output files as arguments, but that is a relic of when I ran it locally rather than writing to a fixed location on my server.


What I filter for


I decided to break the information into two numbers: total requests, where I basically count every request, and "gemlog reads", since the homepage and atom.xml are things I don't really care about. It's nice to see what percent of the requests are page reads. I also decided to show the "from the beginning of the file" stats (originally I was just calculating the stats for the day).


The output


Stats for day:	2021-04-14
   Total Reqs:	301
 Gemlog Reads:	155
Top 5 Gemlogs
--------------
     53 gemini://senders.io/gemlog/2021-04-13-digital-hygiene-one-week-in.gmi
     14 gemini://senders.io/gemlog/2021-04-09-humans-first-words.gmi
     13 gemini://senders.io/gemlog/2021-04-12-girl-2020-land-before-time.gmi
      7 gemini://senders.io/gemlog/2021-04-10-floc.gmi
      7 gemini://senders.io/gemlog/2021-04-03-digital-hygiene.gmi

  Stats since:	2021-04-07T00:53:38,811Z
   Total Reqs:	3500
 Gemlog Reads:	1852
Top 5 Gemlogs
--------------
    239 gemini://senders.io/gemlog/2021-04-10-floc.gmi
    207 gemini://senders.io/gemlog/2021-04-13-digital-hygiene-one-week-in.gmi
    186 gemini://senders.io/gemlog/2021-04-07-devlog-4-deployed-in-production.gmi
    173 gemini://senders.io/gemlog/2021-04-09-humans-first-words.gmi
    138 gemini://senders.io/gemlog/2021-04-12-girl-2020-land-before-time.gmi

// generated 2021-04-15T02:56:01+00:00
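
The script prints the raw counts but not the percentage of requests that are gemlog reads. A small sketch of how that could be worked out with bash integer arithmetic follows; `percent_reads` is a hypothetical helper, not part of calc.sh.

```bash
#!/usr/bin/env bash

# percent_reads: hypothetical helper, prints gemlog reads as a whole
# percentage of total requests (integer division, so it rounds down).
# Usage: percent_reads READS TOTAL
percent_reads() {
  local reads=$1 total=$2
  if [ "$total" -eq 0 ]; then
    echo "n/a"   # avoid dividing by zero on an empty log
  else
    echo "$(( reads * 100 / total ))%"
  fi
}

# With the numbers from the output above:
#   percent_reads 155 301     # the day's stats
#   percent_reads 1852 3500   # since the start of the log
```

For the sample output this works out to 51% for the day and 52% overall, so roughly half of all requests are gemlog reads.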

Generating this file


I run this via a cronjob because my server doesn't have any CGI support to load these stats on demand. The cronjob runs the calc script every minute and writes the result to a file on my capsule:


/stats.txt
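
One thing to watch with a once-a-minute cronjob: calc.sh empties the output file and then appends to it, so a request that lands mid-run might see partial stats. Below is a hedged sketch of a wrapper that writes to a temp file and renames it into place. `publish_atomically`, `publish.sh`, and the paths in the comments are placeholders, not my actual setup.

```bash
#!/usr/bin/env bash

# publish_atomically: hypothetical wrapper. Runs a generator command that
# takes its output path as the last argument, points it at a temp file,
# then renames the temp file over the real target in one step.
# Usage: publish_atomically TARGET COMMAND [ARGS...]
publish_atomically() {
  local target=$1; shift
  local tmp
  # Create the temp file next to the target so the rename stays on one filesystem.
  tmp=$(mktemp "${target}.XXXXXX") || return 1
  if "$@" "$tmp"; then
    # mktemp creates the file as 0600; assume the server needs read access.
    chmod 644 "$tmp"
    mv -f "$tmp" "$target"
  else
    rm -f "$tmp"   # keep the previous stats if generation fails
    return 1
  fi
}

# Example crontab line (placeholder path), running every minute:
#   * * * * * /path/to/publish.sh
# where publish.sh sources this function and calls:
#   publish_atomically gemini/stats.txt ./calc.sh logs/access.log
```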


Conclusion


I found this a fun exercise to see how well a particular gemlog "was doing" - were people clicking into it? It's also interesting to see some traffic numbers on days (like today) where I haven't posted.


P.S - GDPR


I realized while writing this calc process that I should probably do something about the fact that I am logging IPs on a server outside the EU, when I know some of you are IN the EU. I have a retention setup via cron that wipes my logs every month, which, if I recall, should be compliant. But I might just remove the actual IP from the log and add a UUID to the IN and OUT lines so I can still match them up properly. I really don't need your IP, nor would I want my IP sitting on some random server somewhere (though I am not subject to GDPR, so I probably have no recourse to ask you to remove it). I know some of you run larger sites/capsules: what do you do about access logs? If this were HTTP it would probably make sense to keep IP logs to monitor for potentially malicious traffic and ban offenders, etc. I'm just curious, so as not to reinvent the wheel here...
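
As a sketch of the "remove the actual IP" half of that idea, the filter below blanks the address field on existing log lines. It is only an illustration: `scrub_ips` is a hypothetical name, and it assumes the address is the third tab-separated field of IN lines. The UUID part would have to happen in the server's own logging, so it is not shown.

```bash
#!/usr/bin/env bash

# scrub_ips: hypothetical post-processing filter. Replaces the address
# field (field 3) of IN lines with "-" and passes OUT lines through unchanged.
scrub_ips() {
  awk -F'\t' 'BEGIN { OFS = "\t" } $2 == "IN" { $3 = "-" } { print }'
}

# Example (writes a scrubbed copy rather than editing the log in place):
#   scrub_ips < logs/access.log > logs/access.scrubbed.log
```

Since calc.sh only reads the timestamp, direction, and URI fields, the stats would come out the same on a scrubbed log.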


I thought it was neat to take a look at the general traffic on my server and share the script :)



