Things I learned from writing a crawler 🕷️

See previous posts about #hashtags


Geminispace is smaller than I thought.

Other crawlers report hundreds of thousands of URLs. Mine's looking like it might stay in the tens of thousands. Estimating this is hard.


Geminispace is larger too

There are all sorts of things a crawler finds that I don't see in everyday usage. There are message boards, mirrors of web content, blog entries from decades ago, and many, many things that make no sense to me at all. You won't see those just by following Antenna.


Some URLs just don't want to load

Sometimes the crawler just waits forever for a URL. I thought it was broken, but other clients behave the same way. So occasionally I have to kill a stuck request by hand. Odd.
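
One way to avoid killing stuck requests by hand is to give every request an overall time budget. What follows is only a rough sketch, not how my crawler actually works: the names read_with_deadline and fetch, the 30 second budget, and the size cap are all made up for illustration, and turning off certificate verification is an assumption to cope with self-signed capsules (a real client would probably pin certificates instead).

```python
import socket
import ssl
import time
from urllib.parse import urlsplit


def read_with_deadline(sock, total_seconds, max_bytes=5_000_000):
    """Read until EOF, giving up once total_seconds have passed overall.

    A plain per-recv timeout is not enough: a server that trickles one
    byte every few seconds would never trip it.
    """
    deadline = time.monotonic() + total_seconds
    chunks = []
    size = 0
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("overall deadline exceeded")
        sock.settimeout(remaining)
        try:
            chunk = sock.recv(4096)
        except socket.timeout:
            raise TimeoutError("overall deadline exceeded") from None
        if not chunk:  # server closed the connection: response complete
            return b"".join(chunks)
        chunks.append(chunk)
        size += len(chunk)
        if size > max_bytes:
            raise ValueError("response too large")


def fetch(url, total_seconds=30):
    """Hypothetical helper: fetch one gemini:// URL within a time budget."""
    parts = urlsplit(url)
    host, port = parts.hostname, parts.port or 1965
    ctx = ssl.create_default_context()
    # Assumption: accept self-signed certificates without checking them.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    # The connect step gets its own timeout; the read gets the full budget.
    with socket.create_connection((host, port), timeout=total_seconds) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            tls.sendall(url.encode("utf-8") + b"\r\n")
            return read_with_deadline(tls, total_seconds)
```

With something like this, a URL that never answers raises TimeoutError and the crawler can log it and move on.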


Being a good citizen is hard

The sequence of URLs is random, and I thought that would be enough to avoid hammering anyone's capsule. But I still got some "44 slow down" responses. Apologies to those people. I noticed that in all cases, the requested wait before another request was many days. So I just stopped crawling those hosts.
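
Random ordering alone doesn't guarantee a gap between two requests to the same host, so some per-host bookkeeping may help. Here is a minimal sketch of the idea, not my actual code: the class name HostThrottle, the one second minimum gap, and the one day give-up threshold are all assumptions. It relies on the meta field of a 44 response being the number of seconds to wait.

```python
import time


class HostThrottle:
    """Track, per host, the earliest time the next request is allowed."""

    def __init__(self, min_gap=1.0, give_up_after=86400):
        self.min_gap = min_gap              # seconds between requests to one host
        self.give_up_after = give_up_after  # waits this long mean: drop the host
        self.not_before = {}
        self.dropped = set()

    def may_fetch(self, host, now=None):
        now = time.monotonic() if now is None else now
        if host in self.dropped:
            return False
        return now >= self.not_before.get(host, float("-inf"))

    def record(self, host, header, now=None):
        """Update the host's state from a response header line."""
        now = time.monotonic() if now is None else now
        status, _, meta = header.strip().partition(" ")
        wait = self.min_gap
        if status == "44":
            try:
                wait = max(self.min_gap, int(meta))
            except ValueError:
                # Unparseable wait: be conservative and treat it as too long.
                wait = self.give_up_after
            if wait >= self.give_up_after:
                self.dropped.add(host)
                return
        self.not_before[host] = now + wait
```

The crawler would call may_fetch before picking a URL and record after every response. A host that asks for a wait of many days ends up in dropped, which matches what I did by hand.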


Psychology is weird

I find it hard to just let the crawler run. I want to know how it's doing. All the time. I keep running stats scripts and checking for this and that. I should let it be. Particularly as there is no point at which it will stop.


OK, that's all the things.


#crawler


-- Page fetched on Fri May 3 23:03:08 2024