AuraGem Search Features

Current State of Features

Full Text Search of page and file metadata, with Stemming, because apparently other search engines think it's important and unique to advertise one of the most common features in searching systems, lol.

Complex search queries using AND, OR, and NOT operators, as well as grouping with parentheses and quotes for multiword search terms. By default, if you do not use any of these operators, search terms are combined using OR, much like you would expect from web search engines. However, results that match all of the provided terms will still be ranked higher than results that match only one or some of them.

+ and - operators. + is for a required term, - is for a search term that must not be matched.
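To make the OR-by-default behaviour and the + and - operators concrete, here is a toy sketch in Python. It is not AuraGem's actual code: the real engine relies on its database's full text search, while this sketch assumes whitespace tokenization, no stemming, and a score that is just the number of matched terms. The names parse_query and search are made up for illustration.

```python
def parse_query(query):
    """Split a query into required (+), excluded (-), and optional terms."""
    required, excluded, optional = set(), set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.add(token[1:])
        else:
            optional.add(token)
    return required, excluded, optional


def search(query, docs):
    """Return (score, doc_id) pairs, best match first.

    docs maps a document id to its text. The score is the number of query
    terms found in the document, so a document containing every term
    outranks one that contains only some of them.
    """
    required, excluded, optional = parse_query(query)
    results = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        if not required <= words:
            continue  # a +term is missing
        if excluded & words:
            continue  # a -term is present
        matched = len((required | optional) & words)
        if matched == 0:
            continue  # nothing matched at all
        results.append((matched, doc_id))
    # Highest score first; ties are ordered by id to keep output stable.
    results.sort(key=lambda pair: (-pair[0], pair[1]))
    return results
```

With plain terms, every document that contains at least one term is returned, and the ones containing all terms come first. Adding + or - narrows the set before any scoring happens.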

Title extraction using first apparent heading, regardless of its level.
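As a rough illustration of the idea, the sketch below takes the first gemtext heading line it finds, whatever its level, and skips preformatted blocks. It assumes a heading is any line starting with "#" outside of a preformatted block; the actual heuristics behind "first apparent heading" may be broader, and the function name extract_title is hypothetical.

```python
def extract_title(gemtext, fallback=""):
    """Return the text of the first heading line, regardless of its level."""
    in_preformatted = False
    for line in gemtext.splitlines():
        if line.startswith("```"):
            # Toggle lines open and close preformatted blocks in gemtext.
            in_preformatted = not in_preformatted
            continue
        if in_preformatted:
            continue
        if line.startswith("#"):
            title = line.lstrip("#").strip()
            if title:
                return title
    return fallback
```

Because the level is ignored, a page that opens with a second-level heading still gets a title, instead of being skipped until a top-level heading shows up.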

Can detect gemsub feeds.

Line Counts of text files, and publication dates indexed based on dates in filenames.
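One way to pull a publication date out of a filename is sketched below. The accepted patterns (YYYY-MM-DD, YYYY_MM_DD, and YYYYMMDD) are an assumption made for this example, not a list of what the crawler really recognizes, and date_from_filename is a made-up name.

```python
import re
from datetime import date

# Four-digit year, two-digit month, two-digit day, optionally separated by
# "-" or "_", and not embedded in a longer run of digits.
_DATE_IN_NAME = re.compile(r"(?<!\d)(\d{4})[-_]?(\d{2})[-_]?(\d{2})(?!\d)")


def date_from_filename(filename):
    """Return the first valid calendar date found in filename, or None."""
    for match in _DATE_IN_NAME.finditer(filename):
        year, month, day = (int(part) for part in match.groups())
        try:
            return date(year, month, day)
        except ValueError:
            continue  # looked like a date but is not one, e.g. month 13
    return None
```

Letting the date constructor validate the numbers avoids indexing things like "2022-13-45" as a publication date.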

File size information

Mp3, Ogg, and Flac file metadata (ID3, MP4, and Ogg/Flac) is indexed.

A feed of Posts from Past Year organized based on publication date, from most recent to least recent.

Filters include "TITLE", "URL", "ALBUM", "ARTIST", "ALBUMARTIST", "COPYRIGHT", "CONTENTTYPE", "LANGUAGE", and "PUBLISHDATE", as well as others that are untested. The syntax is "field: term". You can also use groups for filters. Field names must be in all capital letters.

Wildcards * and ?

Fuzzy Searching by placing ~ after a search term

Proximity Searching: if you want to search for two words that are within a distance of 10 words of each other, then query with "term_one term_two"~10

Range Searching: For searching in ranges of numbers or dates. Can be used with filters, like the PUBLISHDATE filter. An example of filtering based on a publication date range would be, PUBLISHDATE:[20220101 to 20231201]
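The reason a range over PUBLISHDATE works with plain numbers is that dates written as YYYYMMDD sort in chronological order. The sketch below parses the example expression and checks a date against it. It assumes both ends of a square-bracket range are inclusive, which is the usual Lucene convention but is not stated above; parse_range_filter and in_range are hypothetical helpers, not part of the search engine.

```python
import re

# FIELD:[low to high] with eight-digit YYYYMMDD bounds.
_RANGE = re.compile(r"^(\w+):\[(\d{8})\s+to\s+(\d{8})\]$", re.IGNORECASE)


def parse_range_filter(expr):
    """Parse 'FIELD:[YYYYMMDD to YYYYMMDD]' into (field, low, high)."""
    match = _RANGE.match(expr.strip())
    if match is None:
        raise ValueError("not a range filter: " + repr(expr))
    field, low, high = match.groups()
    return field, int(low), int(high)


def in_range(value, low, high):
    """True if value lies between low and high (both ends assumed inclusive)."""
    return low <= value <= high
```

For example, a post dated 20220719 falls inside the range from the example above, while one dated 20211205 does not.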

Crawler: Robots.txt is followed, including "Allow", "Disallow", and "Crawl-Delay" directives. The Gemini Slow Down status code (44) is also respected.

Crawler: 2 second delay between crawling of pages on the same domain.
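The two crawler rules above can be combined into one per-domain throttle. The sketch below is only an illustration under stated assumptions, not the crawler's real code: it assumes the longest of the 2 second default, a robots.txt Crawl-Delay, and the wait given in the meta field of a Slow Down (status 44) response is the one that applies. DomainThrottle and its methods are invented names, and the current time is passed in explicitly so the logic stays easy to test.

```python
from urllib.parse import urlsplit

DEFAULT_DELAY = 2.0  # seconds between requests to one domain
SLOW_DOWN = 44       # Gemini "SLOW DOWN" status; its meta is a wait in seconds


class DomainThrottle:
    """Track, per domain, the earliest time the next request may be sent."""

    def __init__(self, default_delay=DEFAULT_DELAY):
        self.default_delay = default_delay
        self.next_allowed = {}  # hostname -> timestamp in seconds

    def wait_time(self, url, now):
        """Seconds to wait before url may be fetched at time now."""
        host = urlsplit(url).hostname
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_response(self, url, now, status, meta="", crawl_delay=None):
        """Update the domain's next allowed time after a response."""
        host = urlsplit(url).hostname
        delay = max(self.default_delay, crawl_delay or 0.0)
        if status == SLOW_DOWN and meta.strip().isdigit():
            # The server asked for a specific pause; never wait less than that.
            delay = max(delay, float(meta.strip()))
        self.next_allowed[host] = now + delay
```

In a real crawl loop, now would come from something like time.monotonic(), and the loop would sleep for wait_time(url, now) seconds before each request. Requests to different domains do not delay each other.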

Features Coming Soon

PDF and Djvu file metadata indexed

Image file metadata indexed

Plain text file full contents indexed

Backlinks and searching of link text

Page Metadata Lookup

Full Markdown, Tinylog, and Twtxt parsing to get links, titles, and heading information.

Audio Transcript Search

History

AuraGem Search is a search engine that I started about 2 years ago under its original name, Ponix Search. It was originally designed as an experiment in how I could make search results better. The search engine was officially announced on 2021-07-01:

2021-07-01 Search Engine & Ponix Capsule Now Open Source (MIT)

2021-12-05 AuraGem Search Begins Crawling Again

Note that some of the information in the above posts has recently been updated to match the current URL and IP address of the crawler and gemini capsule.

One of the first priorities with AuraGem Search was to have extraction of file metadata for as many files as possible. Audio files were one of the first to get this feature. PDFs and Djvu files were supposed to be next, and support was added for them on 2022-07-19, but the feature was buggy and never worked, unfortunately. As you can see in the below post, I chose to go with Keyword Extraction (which was later removed and replaced with simple mentions and tags extraction) instead of Full Text Searching on page contents. Part of this was to save space, and part of it was to respect copyright. However, I am rethinking this approach now that the Stats page can determine how large the text-only portion of geminispace is (no more than 5GB total).

2022-07-19 AuraGem Search Engine Update

Stats Page

In the above article, you can see that I start to play with the notion of different types of searches. I think this idea remains important today:

> Another problem that the above process would not catch are names and proper nouns. These are often very important words that people would want to search for (e.g. Mathematics, C++, Celine Dion, FTS). I do not have an easy method for this atm.

The next update on 2022-07-21 added Full Text Searching of link and file metadata, which drastically improved the speed of searches. Yes, this came with stemming because my database's FTS uses Lucene++.

2022-07-21 AuraGem Search Update

Not long after, I wrote an article about FTS, ranking systems, and some of the problems that search engines have to handle:

2022-07-22 Search Engine Ranking Systems Are Being Left Unquestioned

The most important portion of this article, however, is recognizing how people do searches:

> This also introduces the argument that the ranking systems are really only important for underspecified queries (broad queries), so the emphasis on the problems with ranking algorithms is unwarranted. This argument hardly makes sense when the majority of searches that people make are broad. I would also argue that broad searches are most used for *discovering* pages, not for getting to a specific page. However, ranking based on popularity prioritizes what it thinks people would want, which is more suited for specific searches using broad queries, at the expense of discovery of broad topics. Broad discovery using broad topic queries and specific searches using proper-noun queries or very specific queries are both much better ways of dealing with searches without relying on popularity.

When making a search engine, one must balance the search results between discovery (broadness) and exact matches (exactness). Relevancy applies to both of these, but is more important for discovery. I continue to think that link analysis assumes that people want exact matches of pages while using broad queries. For example, if someone types in "search engine", a PageRank system would put the most popular search engine at the top along with popular articles about search engines, assuming that the person wanted that specific search engine, when it's more likely they wanted a collection of search engines. Rather, my approach is to return broad relevant discovery-based results with broad queries, and exact pages with exact queries.

Exact queries include words from titles, domain names, capsule names, and service names: mostly proper nouns, or a specific combination of words that matches the page information. Broad queries, however, use category names and common nouns.

When I type "Station", I want an exact match for Station itself. However, when I type "social network", I want search results that give a very broad set of capsules that are social networks. I believe that this is how most people would use search engines, especially if they do not rely much on filtering, and this is the exact methodology that I use for my article analyzing gemini's search engines:

2022-08-07 Gemini Search Results Study, Part 1