
2022-09-17

Musings on SmolZINE, continued

tags: gemini


I wrote:

> And what is in this list now? Ok, I can look at the 203 entries. But I can try to extract the domain name of the capsules and see whether some are mentioned a lot more than others

/en/2022/20220915-musings-on-smolzine.gmi


kelbot was quick to point out that my shell one-liner seemed a bit off, since pollux.casa and flounder were indeed referenced in SmolZINE multiple times. He was right, of course. I had not "normalized" the domains to their last two fields. So, try it again, Sam:


So let's explain the resulting one-liner in pieces. From all issues of SmolZINE, which are files in the current directory, grep the gemini links. Be sure to suppress the file name (-h). Also be sure to anchor the search at the beginning of the line with '^':

> grep -h '^=> gemini://' smolzine-issue-*.gmi |


The resulting lines start with the marker for links and a blank (optional in gemtext, but apparently always present in this output). So I ask good ol' awk to just print the second field:

> awk '{print $2;}' |
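To see these first two stages in action on a made-up link line (the host name here is hypothetical, but the line has the shape SmolZINE uses):

```shell
# hypothetical sample line in the shape SmolZINE uses for links
printf '=> gemini://example.smol.pub/some/page A nice capsule\n' |
  grep -h '^=> gemini://' |
  awk '{print $2;}'
# prints gemini://example.smol.pub/some/page
```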


Now I have a list of all the gemini:// entries. In order to extract the last two labels of each domain, I ask good ol' sed to massage the strings.

remove everything before and including the protocol marker (optional, now that I think of it)

remove everything after the domain including the leading '/'

now the important bit: match three parts:

1. from the beginning, match everything (greedily) up to and including a period '\.' --- be sure to escape the periods to strip them of their wildcard meaning. On its own this could match too much, so I need to explicitly match the two fields I'm interested in

2. a string of at least one character containing no periods: '[^\.][^\.]*'. Wrap this regex in '\( \)' because we want to remember it. The expression ends at another explicitly requested period.

3. same as 2 but anchored to the end of the examined line with a dollar '$'

after matching three parts and remembering two of them, we replace the input line with new output consisting of just the two remembered matches joined by a period: '\1\.\2'

Now, that was easy, wasn't it?

Turns out that a small number of links feature an explicit port number. This is removed with the last substitution:

> sed -e 's|^.*gemini://||' \

> -e 's|/.*$||' \

> -e 's|^.*\.\([^\.][^\.]*\)\.\([^\.][^\.]*\)$|\1\.\2|' \

> -e 's|:[0-9][0-9]*$||' |
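Fed two hypothetical hosts (one of them with an explicit port), the four substitutions boil everything down to the last two domain labels:

```shell
# hypothetical inputs; only the last two domain labels survive
printf 'gemini://gemini.circumlunar.space/some/page\ngemini://foo.example.online:1965/\n' |
  sed -e 's|^.*gemini://||' \
      -e 's|/.*$||' \
      -e 's|^.*\.\([^\.][^\.]*\)\.\([^\.][^\.]*\)$|\1\.\2|' \
      -e 's|:[0-9][0-9]*$||'
# prints circumlunar.space and example.online, one per line
```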


I changed the domain names by removing the subdomains, so sorting is meaningful only after this point:

> LANG=C sort |


And then we count adjacent identical lines and sort the resulting list numerically.

> uniq -c | sort -nr
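On a toy input (made-up names), the counting stage works like this; note that uniq only collapses adjacent duplicates, which is why the sort has to come first:

```shell
# three made-up lines, two of them identical
printf 'b.example\na.example\nb.example\n' |
  LANG=C sort | uniq -c | sort -nr
# the duplicated b.example ends up on top with count 2
```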


Now repeat after me (I broke the one-liner across lines for readability):

grep -h '^=> gemini://' smolzine-issue-*.gmi |
  awk '{print $2;}' |
  sed -e 's|^.*gemini://||' -e 's|/.*$||' \
  -e 's|^.*\.\([^\.][^\.]*\)\.\([^\.][^\.]*\)$|\1\.\2|' \
  -e 's|:[0-9][0-9]*$||' |
  LANG=C sort | uniq -c | sort -nr
      9 circumlunar.space
      8 flounder.online
      6 warmedal.se
      6 tilde.team
      5 transjovian.org
      5 smol.pub
      4 yesterweb.org
      4 midnight.pub
      4 locrian.zone
      3 tilde.pink
      3 thegonz.net
      3 skylarhill.me
      3 skyjake.fi
      3 rawtext.club
      3 pollux.casa
      3 gemi.dev
      3 breadpunk.club
      2 usebox.net
      2 tilde.club
      2 thurk.org
      2 susa.net
      2 noulin.net
      2 nightfall.city
      2 mozz.us
      2 gemlog.blue
      2 dimakrasner.com
      2 bacardi55.io
      2 antipod.de
      2 alchemi.dev
      1 yysu.xyz
      1 yujiri.xyz
      1 ynh.fr
...

And there they are: circumlunar.space, flounder.online, tilde.team, smol.pub ... and so on. Great.


Now, while we are at it, can we not generate an index pointing to the issues in which each domain is referenced? Sure we can, and it is still just a one-liner:


grep -h '^=> gemini://' smolzine-issue-*.gmi |
  awk '{print $2;}' |
  sed -e 's|^.*gemini://||' -e 's|/.*$||' \
  -e 's|^.*\.\([^\.][^\.]*\)\.\([^\.][^\.]*\)$|\1\.\2|' \
  -e 's|:[0-9][0-9]*$||' |
  LANG=C sort | uniq |
  while read D; do \
    X=$(grep -l "^=> .*$D" smolzine-issue-*.gmi | sed -e 's|^.*-||' -e 's|\.gmi$||' | sort -n | tr '\n' ' ') ; \
    echo -e "$D:\t$X" ; \
  done
0x80.org:       5
1436.ninja:     11 13
725.be: 17
7irb.tk:        9
adele.work:     32
adventuregameclub.com:  19
ainent.xyz:     18
ajul.io:        22
alchemi.dev:    6
alkali.me:      31
antipod.de:     9 12
atyh.net:       25
bacardi55.io:   7 24
beyondneolithic.life:   27
bjornwestergard.com:    21
bortzmeyer.org: 32
breadpunk.club: 12
bunburya.eu:    6
cadadr.space:   5
cadence.moe:    11
calmwaters.xyz: 1
campaignwiki.org:       5
chriswere.uk:   4
circumlunar.space:      2 9 11 14 16 19 32
...
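The interesting part of the loop body is how the issue numbers are recovered from the matching file names. On two hypothetical names (the shape grep -l would emit) it behaves like this:

```shell
# hypothetical file names, as grep -l would emit them
printf 'smolzine-issue-11.gmi\nsmolzine-issue-2.gmi\n' |
  sed -e 's|^.*-||' -e 's|\.gmi$||' | sort -n | tr '\n' ' '
# prints "2 11 "
```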

However, is this thing useful? Not so much, because here the subdomains should maybe stay. And the protocol, too, so we would have the gopher and http links in the list as well. And can we make each number a link to the issue? Sure we can, one line per link. And wouldn't it be nice if we could somehow scrape the preceding description into the result as well? It surely would.


And where do we stop? I'll stop right here.


Have the appropriate amount of fun!
