gemini://auragem.letz.dev/devlog/20220719.gmi

The AuraGem Search Engine has been brought back up and is now crawling again. The database was cleared out and rebuilt so that I could update how data is stored. Below are some of the changes to crawling and to how data is stored:

Domains are now being stored, as well as mentions and tags. Tags are parsed from tag lines that begin with "Tags: " or "🏷 "
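
A minimal sketch of how tag lines with these two prefixes could be parsed. The function name parse_tags, the comma separator, and the lowercasing are assumptions made for illustration; the crawler itself may split and normalize tags differently.

```python
TAG_PREFIXES = ("Tags: ", "🏷 ")

def parse_tags(gemtext: str) -> list:
    """Collect tags from lines that start with a known tag-line prefix."""
    tags = []
    for line in gemtext.splitlines():
        for prefix in TAG_PREFIXES:
            if line.startswith(prefix):
                # Assumption: tags on one line are separated by commas.
                for tag in line[len(prefix):].split(","):
                    tag = tag.strip().lower()  # assumption: case-insensitive tags
                    if tag and tag not in tags:
                        tags.append(tag)
                break
    return tags
```

Calling parse_tags on a page's text returns the tags in the order they first appear, without duplicates.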

Keyword extraction is no longer used for the moment. Its results were poor and not useful, and they wasted a lot of space.

The replacements for keyword extraction are tags and mentions. Tags and mentions are much more useful than random keywords extracted from a page.

The crawler will now extract metadata from PDF and DjVu files, like it does with audio files.

The speed of the search engine should be improved (no more 1 minute searches!). More work on this will be done in the future.

I also wanted to warn all gemini capsule owners that some mimetypes are incorrect given their filetype. Some servers are sending audio and other binary files as text/gemini or text/plain.
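
One rough way a crawler could notice this kind of mismatch is sketched below. It is only a heuristic under stated assumptions: the function names are hypothetical, and "valid UTF-8 with no NUL bytes" is used as a stand-in for "looks like text", which would misjudge text served in other encodings.

```python
def looks_like_text(body: bytes) -> bool:
    """Heuristic: text should decode as UTF-8 and contain no NUL bytes."""
    if b"\x00" in body:
        return False
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

def mimetype_seems_wrong(mimetype: str, body: bytes) -> bool:
    """Flag responses declared as text/* whose body does not look like text."""
    return mimetype.startswith("text/") and not looks_like_text(body)
```

For example, an MP3 file served as text/gemini would be flagged, while a normal gemtext page would not.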

Another issue that I've found is that the first level-1 heading on the root index of many capsules is generic or doesn't contain the name of the capsule. The downside is that the crawler uses whatever this level-1 heading is as the title of the capsule, which leads to search results that don't have good titles. The same goes for the titles of pages, including capsules that use the same level-1 heading on all or many of their pages. Some major capsules can be worked around (e.g. Station), but working around every single capsule would be impossible.
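
To show why the first level-1 heading matters so much, here is a minimal sketch of picking a title from a gemtext page. The function name first_level1_heading is hypothetical and this is not the crawler's actual code; it only assumes standard gemtext rules (a level-1 heading starts with a single "#", and lines inside preformatted blocks are ignored).

```python
from typing import Optional

def first_level1_heading(gemtext: str) -> Optional[str]:
    """Return the text of the first level-1 heading, or None if there is none."""
    in_preformatted = False
    for line in gemtext.splitlines():
        if line.startswith("```"):
            # Preformatted toggle line: headings inside the block don't count.
            in_preformatted = not in_preformatted
            continue
        if in_preformatted:
            continue
        # A single "#" is level 1; "##" and "###" are lower-level headings.
        if line.startswith("#") and not line.startswith("##"):
            title = line[1:].strip()
            if title:
                return title
    return None
```

If that first heading is something like "Welcome" or "Home", it is all a title-based search result has to show.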

AuraGem Search does not store the contents of any pages, and therefore cannot do Full-Text Searching on the contents of pages. I don't like the idea of storing all of geminispace on my hard drive, and I don't have the space to do so either.

Additionally, Full-Text Searching can provide decent results, but I think we can provide even better results with other methods. These methods include searching by keywords, tags, and mentions. Full-Text Search often searches for the whole phrase that is put into the query. However, AuraGem Search instead searches for each word within the search query, and results that have more of the words from the query will be moved up in the search results - often called Sorting by Relevance.
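
The per-word matching described above can be sketched roughly as follows. Everything here is an assumption for illustration: the function name rank_by_relevance, the page dictionaries with "title" and "tags" keys, and the simple count of matched words as the score. The real engine runs against its database and may weight things differently.

```python
def rank_by_relevance(query: str, pages: list) -> list:
    """Order pages by how many distinct query words they match."""
    words = set(query.lower().split())
    scored = []
    for page in pages:
        # Assumption: a page is matched on its title words and its tags.
        searchable = set(page["title"].lower().split())
        searchable |= {tag.lower() for tag in page["tags"]}
        score = len(words & searchable)
        if score > 0:
            scored.append((score, page))
    # Pages matching more query words come first; ties keep their input order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [page for _, page in scored]
```

A page does not need to contain the whole phrase to appear; it only needs at least one of the query's words, and it rises as it matches more of them.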

However, AuraGem does search through page titles. This can be slow, so FTS or something like it is being considered for title searches.

One other idea that I have that could improve results is a different method of keyword extraction. Previously, keyword extraction produced less-than-desirable results because it was extracting too many useless words (or strings that looked like words but weren't). However, AuraGem now stores tags from tag lines in pages. This means that AuraGem can use these tags as keywords to extract from pages that don't have tag lines. This same process could also apply to mentions.
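
A minimal sketch of this idea, assuming the set of tags already seen across geminispace is available as known_tags. The function name extract_known_tags is hypothetical, and the sketch only handles single-word tags; multi-word tags would need phrase matching.

```python
import re

def extract_known_tags(text: str, known_tags: set) -> list:
    """Return the known tags that occur as whole words in the text."""
    words = set(re.findall(r"\w+", text.lower()))
    # Only tags already seen elsewhere count as keywords, which filters
    # out the random words that plain keyword extraction picked up.
    return sorted(tag for tag in known_tags if tag.lower() in words)
```

The same function could be run with a set of known mentions in place of known_tags.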

Another problem is that the above process would not catch names and proper nouns. These are often very important words that people would want to search for (e.g. Mathematics, C++, Celine Dion, FTS). I do not have an easy method for this atm.

The final problem with Keyword Extraction is that it needs to work for multiple languages. The first method above, using tags and mentions, would work with other languages, but the second method would be much harder to work with.

Tag lines could also be abused. I have not seen such abuse yet, but I will deal with it when I find it out in the wild.

2022-07-19 AuraGem Search Engine Update

Why Not use Full-Text Search on Page Contents?

Improving Keyword Extraction