
🔭 Notes on Crawling and Indexing




Kennedy builds its search index by crawling content within Geminispace only. It does not crawl or index content served over other protocols, such as Gopher or HTTP.


Crawler details


Kennedy crawls Geminispace using the following IP addresses:

IPv4: 64.149.155.184

IPv6: 2600:1700:1731:d0f:35a7:42d4:c71f:a02b
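
Capsule operators who want to recognise Kennedy's requests in their logs could compare the remote address against the two addresses above. The following is a minimal sketch using Python's standard `ipaddress` module; the function name `is_kennedy_crawler` is hypothetical, and the addresses are only as current as this page.

```python
import ipaddress

# Addresses copied from the list above; they may change over time.
KENNEDY_ADDRESSES = {
    ipaddress.ip_address("64.149.155.184"),
    ipaddress.ip_address("2600:1700:1731:d0f:35a7:42d4:c71f:a02b"),
}


def is_kennedy_crawler(remote_addr: str) -> bool:
    """Return True if remote_addr is one of Kennedy's published addresses."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Not a valid IP address at all.
        return False
    # Dual-stack servers may report IPv4 clients as ::ffff:a.b.c.d.
    if addr.version == 6 and addr.ipv4_mapped is not None:
        addr = addr.ipv4_mapped
    return addr in KENNEDY_ADDRESSES
```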


Crawler speed


Kennedy throttles itself, waiting 1.5 seconds between requests to the same IP address. As a result, it takes longer to crawl multiple capsules hosted at the same IP address, such as those on Flounder.online.
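
A per-IP throttle of this kind can be sketched as follows. This is an illustration, not Kennedy's actual code: the class name `PerHostThrottle` is hypothetical, and it assumes the 1.5 seconds is measured from the start of one request to the start of the next, which the text above does not specify.

```python
import time
from typing import Callable, Dict


class PerHostThrottle:
    """Enforce a minimum interval between requests to the same IP address."""

    def __init__(
        self,
        min_interval: float = 1.5,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        # Maps an IP address to the time its most recent request started.
        self._last: Dict[str, float] = {}

    def wait(self, ip: str) -> float:
        """Block until a request to ip is allowed; return seconds slept."""
        now = self._clock()
        last = self._last.get(ip)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        self._last[ip] = now + delay
        return delay
```

Because the throttle is keyed by IP address rather than hostname, two capsules resolving to the same address share one delay, which is why shared hosts take longer to crawl.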


Robots.txt Support


Kennedy will respect sites that are using the simplified robots.txt protocol defined for Gemini.


Robots.txt subset for Gemini


Specifically, Kennedy will follow the Deny rules defined for the following user-agents:

*

indexer


Note: There are a number of robots.txt files in Geminispace which use rules outside of the simplified standard above. These include:

Allow Rules

Deny Rules with wildcard characters in the middle

Crawl-Delay directives


Kennedy does not currently respect these rules.
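
The behaviour described in this section can be sketched as a small parser. This is an illustration under stated assumptions, not Kennedy's implementation: the function names are hypothetical, Deny rules are taken to be `Disallow:` lines matched as path prefixes, and any `Disallow` value containing `*` is skipped outright, although the text above only mentions wildcards in the middle.

```python
from typing import List

# The user-agents whose rules are followed, per the list above.
FOLLOWED_AGENTS = {"*", "indexer"}


def parse_disallowed_prefixes(robots_txt: str) -> List[str]:
    """Collect Disallow path prefixes that apply to the followed agents."""
    prefixes: List[str] = []
    current_agents = set()
    in_rules = False  # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        field, value = line.split(":", 1)
        field = field.strip().lower()
        value = value.strip()
        if field == "user-agent":
            if in_rules:
                # A User-agent line after rules starts a new group.
                current_agents = set()
                in_rules = False
            current_agents.add(value.lower())
        elif field == "disallow":
            in_rules = True
            # An empty value disallows nothing; wildcard rules are skipped
            # (assumption, see lead-in).
            if not value or "*" in value:
                continue
            if current_agents & FOLLOWED_AGENTS:
                prefixes.append(value)
        else:
            # Allow, Crawl-delay, and anything else is ignored.
            in_rules = True
    return prefixes


def is_allowed(path: str, prefixes: List[str]) -> bool:
    """Return True if path matches none of the disallowed prefixes."""
    return not any(path.startswith(p) for p in prefixes)
```

Note that with this approach an `Allow` line cannot carve an exception out of a broader `Disallow`, so a path under a disallowed prefix stays excluded.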


Crawler Limits


Kennedy has the following limits:

Will not download responses larger than 10 MB.

Closes a connection if a URL takes more than 45 seconds to fully respond.
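
Limits like these can be enforced while reading a response body. The sketch below is an illustration only: `read_with_limits` and `LimitExceeded` are hypothetical names, "10 MB" is read as 10 MiB, and the stream is assumed to be any object with a `read(n)` method, such as a file made from a socket.

```python
import time
from typing import BinaryIO, Callable

MAX_BYTES = 10 * 1024 * 1024  # "10 MB" interpreted as 10 MiB (assumption)
MAX_SECONDS = 45.0


class LimitExceeded(Exception):
    """Raised when a response is too large or too slow."""


def read_with_limits(
    stream: BinaryIO,
    max_bytes: int = MAX_BYTES,
    max_seconds: float = MAX_SECONDS,
    clock: Callable[[], float] = time.monotonic,
    chunk_size: int = 65536,
) -> bytes:
    """Read stream to the end, enforcing a size cap and an overall deadline."""
    deadline = clock() + max_seconds
    buf = bytearray()
    while True:
        if clock() > deadline:
            raise LimitExceeded("response took too long")
        chunk = stream.read(chunk_size)
        if not chunk:
            return bytes(buf)  # end of response
        buf += chunk
        if len(buf) > max_bytes:
            raise LimitExceeded("response too large")
```

The deadline is only checked between reads, so with a real socket a read timeout would also be needed to keep a single stalled read from blocking past the limit.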


-- Page fetched on Fri May 17 08:04:09 2024