
Lupa, a Gemini crawler


Lupa is a Gemini crawler. Starting from a few given URLs, it retrieves them, analyzes the links in gemtext (Gemini format) files and adds them to the database of URLs to crawl. It is not a search engine: it does not store the content of the resources, only the metadata, for research and statistics.


The instance of the crawler that I manage currently operates from `2001:41d0:302:2200::180` (and `193.70.85.11` on the legacy IPv4 network).


See the current statistics

Logbook of the production crawler

Previous statistics


If you want the list of capsules known to Lupa:


As a text file

As a gemtext, with links


If you notice a missing capsule, write to me (address at the end of this page).


If you want the entire content of the database, you'll have to write to me (address at the end of this page) and explain why. I tend to be liberal with such requests since, after all, it is public data and anyone could gather it.


Lupa is written in:


Python


There is no real installation procedure: you get the sources, put them where you want, and set up PYTHONPATH and PATH. Prerequisites (all of them on PyPI): psycopg2, pyopenssl, scfg, public_suffix_list and agunua.


(On a Debian machine, the packaged prerequisites are python3-pip, python3-psycopg2 and python3-openssl; agunua, public_suffix_list and scfg have to be installed with pip or manually.)
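
As a sketch, assuming the sources live in ~/lupa and that the PyPI package names match the list above:

# Install the prerequisites and point PYTHONPATH and PATH at the source directory (~/lupa is just an example).
pip install psycopg2 pyopenssl scfg public_suffix_list agunua
export PYTHONPATH=~/lupa
export PATH=$PATH:~/lupa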


Usage requires a PostgreSQL database to store the URLs and the results of crawling. Once you've created the database, prepare it with the `create.sql` file:


createdb lupa
psql -f ./admin-scripts/create.sql lupa
export PYTHONPATH=$(pwd)
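# Seed the database with one or more starting URLs: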
./admin-scripts/lupa-insert-url gemini://start.url.example/
./admin-scripts/lupa-insert-url gemini://second-start.url.example/

PostgreSQL


At the present time, you need a separate script to retrieve robots.txt exclusion files. It is *not* done by the crawler. This script must be run from time to time, for instance from cron, every two hours:


./admin-scripts/lupa-add-robotstxt
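
A possible crontab entry, as a sketch (the installation path /home/lupa/lupa is just an example):

# Fetch robots.txt files every two hours; adjust the path to your installation.
0 */2 * * * cd /home/lupa/lupa && PYTHONPATH=. ./admin-scripts/lupa-add-robotstxt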

You run the crawler with `./lupa-crawler`. The crawler does not run forever; you need to start it from cron. Locking is done by the database, so it is not an issue if two instances run at the same time.
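
Here too, a cron entry can do the job; a sketch, with an arbitrary hourly schedule and an example path:

# Start a crawling run every hour; overlapping runs are harmless since locking is done by the database.
0 * * * * cd /home/lupa/lupa && PYTHONPATH=. ./lupa-crawler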


You can get a list of options with `--help` but, at this time, you need to read the source to understand them. Some interesting options (an example invocation follows the list):


`--num`: maximum number of URLs to test. It is very low by default, to allow testing, so you may want to set it to a more reasonable value such as 1000.

`--among`: number of URLs among which the "num" URLs above are chosen at random. You typically set it to the size of the database, but it can be smaller.

`--sleep`: by default, the crawler goes as fast as possible, but you can slow it down with this parameter. Between two URLs, the crawler will sleep for a time randomly chosen between 0 and this number of seconds.

`--old`: the crawler retrieves the URLs that have never been retrieved, or were last retrieved more than this number of days ago. The default is 14 days.

`--maximum`: maximum running time of the crawler, to make sure it is not stuck forever on a blocking operation. It is one hour by default.

`--debug`: makes the log more talkative.
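
For example, a typical invocation could look like this (the values are purely illustrative):

# Test up to 1000 URLs chosen among 50000, sleeping up to 2 seconds between URLs.
./lupa-crawler --num 1000 --among 50000 --sleep 2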


Also, you can use a configuration file, using the scfg syntax. An example is in the sources, `sample-lupa.conf`.


scfg
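
As a rough illustration of the scfg syntax (one directive per line, a name followed by its parameters); the directive names below are hypothetical, see `sample-lupa.conf` for the real ones:

# Hypothetical directives, for syntax illustration only.
num 1000
sleep 2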


A log file is created in `/var/tmp/Lupa.log`. It is up to you to ensure it is rotated or cleaned up from time to time.
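
One way to do that is a logrotate rule; a minimal sketch (the weekly schedule and the number of kept archives are arbitrary choices):

# For instance in /etc/logrotate.d/lupa: rotate the Lupa log weekly, keeping four compressed archives.
/var/tmp/Lupa.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
}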


Name


Lupa means she-wolf in Latin. It refers to the wolf who took care of the twins Romulus and Remus. (Many Gemini programs have names related to twins, "gemini" in Latin.)


Reference site


On Gemini


On the Web, at FramaGit


Author


Stéphane Bortzmeyer stephane+gemini@bortzmeyer.org
