Logbook of the running Lupa crawler at gemini.bortzmeyer.org


12 january 2022


Completely new handling of the exclusion files "robots.txt". We now use the code in the Python standard library instead of custom code. It should work better with some complicated robots.txt files (those using both the Allow: and Disallow: directives, for instance).
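
As a rough illustration of what the standard library offers (a minimal sketch, not Lupa's actual code), the module urllib.robotparser can parse a robots.txt that mixes Allow and Disallow. The file content, the user-agent name "example-crawler" and the example.org URLs are assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt mixing Allow and Disallow for every user agent.
ROBOTS_TXT = """\
User-agent: *
Allow: /public/
Disallow: /
"""


def make_parser(text):
    """Build a parser from the text of an already-retrieved robots.txt."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)

# Allow is listed before Disallow and is also the longer match, so the
# outcome is the same under first-match and longest-match rule selection.
allowed = parser.can_fetch("example-crawler", "gemini://example.org/public/page.gmi")
blocked = parser.can_fetch("example-crawler", "gemini://example.org/private/page.gmi")
print(allowed, blocked)
```

Only the path of the URL is compared with the rules, so the gemini:// scheme does not matter here. Retrieving the robots.txt over Gemini is left out of the sketch.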


The issue

The Python package


There is currently no proper standard for robots.txt. The Internet-Draft is still under evaluation.


State of the Internet-Draft


Note that many robots.txt files in the wild are wrong (for instance, having several user agents on one line), so they will be ignored.


Example of a broken file


14 december 2021


One year of Lupa! We now have 334,000 working URLs, 1,500 working capsules (in 1,000 registered domains), using 1,000 different IP addresses.


28 november 2021


We no longer record and display the fact that there was no proper TLS shutdown (close_notify). This is because it does not seem that Agunua returned reliable information.


The Agunua issue


10 october 2021


We now have more than one thousand (1,000) registered domains (the capsules foo.flounder.online and bar.flounder.online are in the same registered domain, so it is two capsules but one registered domain).
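
To make the distinction between capsules and registered domains concrete, here is a simplified sketch (not Lupa's actual code). The set EXAMPLE_SUFFIXES is a tiny hypothetical stand-in for a real public suffix list, and the function name registered_domain is made up for the example.

```python
# Tiny, hypothetical stand-in for a public suffix list. A real list is much
# larger and also has wildcard and exception rules, which are ignored here.
EXAMPLE_SUFFIXES = {"online", "org", "co.uk"}


def registered_domain(host, suffixes=EXAMPLE_SUFFIXES):
    """Return the public suffix plus one label, or None if there is none."""
    labels = host.lower().rstrip(".").split(".")
    # Candidates are tried from longest to shortest, so "co.uk" wins over "uk".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            if i == 0:
                return None  # the host is itself a public suffix
            return ".".join(labels[i - 1:])
    return None  # no known suffix


capsules = ["foo.flounder.online", "bar.flounder.online", "gemini.bortzmeyer.org"]
domains = {registered_domain(host) for host in capsules}
print(len(capsules), "capsules,", len(domains), "registered domains")
```

With these three capsules, the two flounder.online hosts collapse into one registered domain, so the sketch counts three capsules but two registered domains.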


19 may 2021


We now have more than one thousand (1,000) working capsules.


(This is partly because we now keep the capsules whose robots.txt prevented any crawling; before that, they were regarded as non-working.)


The bug report


8 may 2021


The list of known capsules is now published.


As a text file

As a gemtext, with links


31 march 2021


URLs whose status code is 31 ("Permanent redirect") are now purged.


The issue


29 march 2021


Lupa now displays the language statistics separately for the language only and for the full language tag.
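
The difference between the two counts can be sketched as follows (an illustration under assumptions, not Lupa's actual code). The function name language_counters and the sample tags are invented, and the sketch assumes that the language is the first subtag of the tag and that tags can be compared case-insensitively.

```python
from collections import Counter


def language_counters(tags):
    """Count tags twice: by language only and by full language tag."""
    by_language = Counter()
    by_full_tag = Counter()
    for tag in tags:
        normalized = tag.strip().lower()
        if not normalized:
            continue  # no language declared
        by_full_tag[normalized] += 1
        # Assumption: the language is the first subtag ("en" in "en-US").
        by_language[normalized.split("-")[0]] += 1
    return by_language, by_full_tag


by_language, by_full_tag = language_counters(["en", "en-US", "fr", "en-GB", "fr-FR"])
print(by_language.most_common())
print(by_full_tag.most_common())
```

In this sample, "en" appears three times in the per-language count, while each of the five full tags appears once.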


Remember: tag wisely


26 march 2021


Lupa now connects to .onion capsules (capsules reachable only through the Tor network). Currently, there are only two.


The Tor project

This capsule, on .onion, to see if your Gemini browser can do it

How to set up a .onion capsule


24 march 2021


The number of URLs decreased because Lupa automatically deleted URLs that returned an error for too long. Remember that the "geminispace" is small so just one big capsule changing its content/policies can seriously impact the figures.


14 march 2021


We now have 800 working capsules and 180,000 working URLs, although I believe the latter number is less important (any capsule can generate a lot of dynamic URLs).


10 march 2021


We now display the TLS versions used by capsules. (A majority use TLS 1.3.) We also display the percentage of capsules that use an expired certificate (more than 2%). And we also report the URLs without a proper TLS shutdown.


9 march 2021


We now display the maximum and average number of links pointing to URLs in our database. We do not display a list of URLs with most links towards them, to avoid popularity contests.
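
One possible way to compute such figures is sketched below (this is not Lupa's actual code or database schema). The function name inbound_link_stats, the (source, target) pair representation and the example URLs are assumptions. Counting URLs with zero inbound links in the average is also an assumption.

```python
from collections import Counter
from statistics import mean


def inbound_link_stats(links, known_urls):
    """Return (maximum, average) number of inbound links per known URL.

    `links` is an iterable of (source_url, target_url) pairs.
    """
    inbound = Counter(target for _source, target in links if target in known_urls)
    # URLs that nobody links to count as zero in the average.
    counts = [inbound[url] for url in known_urls]
    if not counts:
        return 0, 0.0
    return max(counts), mean(counts)


known = {"gemini://a.example/", "gemini://b.example/", "gemini://c.example/"}
links = [
    ("gemini://a.example/", "gemini://b.example/"),
    ("gemini://c.example/", "gemini://b.example/"),
    ("gemini://b.example/", "gemini://a.example/"),
]
maximum, average = inbound_link_stats(links, known)
print(maximum, average)
```

Only the two aggregate numbers are returned, which matches the choice of not publishing a ranking of the most linked URLs.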


The issue


12 february 2021


We now display TLDs (Top-Level Domains) also per number of registered domains, not just per number of capsules. We use Mozilla's Public Suffix List (not perfect but there is no better resource).


The Public Suffix List


26 january 2021


We start to purge old and stale data from the database. Therefore, several numbers will decrease.


The original issue


20 january 2021


A bug in the counting of Let's Encrypt certificates has been fixed. Therefore, the percentage of Let's Encrypt will increase.


The patch


19 january 2021


The statistics page is now much stricter about the freshness of the data. We ignore, for instance, capsules that were not contacted recently (currently 31 days). As a result, several numbers decreased.
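
The freshness rule can be sketched like this (an illustration, not Lupa's actual code). The 31 days come from the entry above; the function name fresh_capsules, the dictionary of last contact times and the choice to treat exactly 31 days as still fresh are assumptions.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=31)  # "currently 31 days"


def fresh_capsules(last_contact, now, max_age=MAX_AGE):
    """Keep only the capsules contacted within `max_age` before `now`."""
    return {
        capsule
        for capsule, when in last_contact.items()
        if now - when <= max_age
    }


now = datetime(2021, 1, 19, tzinfo=timezone.utc)
last_contact = {
    "a.example": datetime(2021, 1, 10, tzinfo=timezone.utc),  # 9 days ago
    "b.example": datetime(2020, 12, 1, tzinfo=timezone.utc),  # 49 days ago
}
fresh = fresh_capsules(last_contact, now)
print(sorted(fresh))
```

Passing `now` explicitly instead of reading the clock inside the function keeps the filter easy to test.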


The stats

The ticket #12


4 january 2021


A bug prevented robots.txt from being retrieved from capsules with an invalid certificate. Now that it is fixed, it will probably lead to a decrease in the number of retrieved URLs.


The bug


21 december 2020


The crawler now uses the Agunua library instead of its own internal Gemini library.


Agunua


16 december 2020


The database now contains 31 145 URIs (16 273 successfully retrieved) and 484 capsules (270 successfully contacted).


16 december 2020


Stupid bug when updating the state of the capsules after a successful connection.


The bug


14 december 2020


The crawler entered production.


All about the crawler


-- Page fetched on Sat Jun 1 23:35:36 2024