Building and evolving a crawler


[Geocities-esque "UNDER CONSTRUCTION" GIF here]


> Despite the apparent simplicity of this basic algorithm, web crawling has many inherent challenges

> - Marc Najork


I enjoy building crawlers. I've done so in both a hobbyist and a professional context for over 15 years. Crawlers are so interesting to me because there is such an enormous range of usable capabilities. The simplest crawler can be only a dozen lines in a high-level language and will work. However, crawlers can scale to incredible complexity, allowing them to process hundreds of pages a second per crawling node. Let's start with a simple crawler design and add capabilities to it.


Note: These techniques apply equally to writing any crawler, regardless of protocol. In fact, early web crawlers handled HTTP, FTP, and Gopher. I may use Gemini in my examples, but nothing is specific to any one protocol.


A basic crawler


The most basic crawler looks something like this:


Queue = []
Queue.enqueue(startUrl)
while(Queue.Length > 0) {
    Url = Queue.dequeue()
    Response = Request(Url)
    NewLinks = ExtractLinks(Response)
    Queue.Enqueue(NewLinks)
}

It has:

A way to store the URLs that need to be visited (in this case a queue).

A set of starting URLs that are used to seed the queue.

A way to request the URL and parse the response.

A way to extract fully qualified links from a response.


Even with such a simple design, there are minor modifications that have big implications for the behavior of the crawler. In the example, we use a queue to store our set of URLs to visit. A queue is a First-In-First-Out data structure. We could have used a stack instead. A stack is a Last-In-First-Out data structure. Choosing a queue or a stack changes the behavior of our crawler:


A queue makes our crawler breadth-first. When it visits a new page and finds new links to visit, those links are enqueued at the end of the queue. They will not be visited until all the other pending URLs have been visited.

A stack makes our crawler depth-first. When it visits a new page and finds new links to visit, those links are pushed onto the top of the stack. They will be visited immediately.


Crawlers typically use a breadth-first approach, because this allows them to visit a wide variety of URLs from different sources, whereas depth-first can result in the crawler plunging into a specific area and consuming all of its links. If a site has hundreds or thousands of URLs, or even dynamically generates URLs so there is an infinite number of them, this can result in the crawler getting stuck and never visiting any other sites. So our simplest crawler will use a queue.
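Here is a minimal, runnable sketch of this breadth-first loop in Python. The request() and extract_links() helpers are hypothetical placeholders for whatever protocol handling and link extraction you use; nothing here is tied to a specific library.

from collections import deque

def request(url):
    """Placeholder: fetch the URL and return the response."""
    return ""

def extract_links(response):
    """Placeholder: return fully qualified URLs found in the response."""
    return []

def crawl(start_url):
    queue = deque([start_url])      # FIFO, so the crawl is breadth-first
    while queue:
        url = queue.popleft()       # swap for queue.pop() to make it depth-first
        response = request(url)
        queue.extend(extract_links(response))

Changing the single popleft() call to pop() is all it takes to turn this breadth-first crawler into a depth-first one.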


Initial Improvements


Our basic design has a lot of room for improvement.


First, our crawler should do something valuable with the responses it gets, such as index them for a search engine. So let's add a placeholder DoSomething() function for that.


Queue = []
Queue.enqueue(startUrls)
while(Queue.Length > 0) {
    Url = Queue.dequeue()
    Response = Request(Url)
    DoSomething(Response)
    NewLinks = ExtractLinks(Response)
    Queue.Enqueue(NewLinks)
}

Next, we don't want our crawler to visit the same URL multiple times (at least for now). So we need to keep track of not only the URLs the crawler has already visited, but also the URLs that the crawler already has in its queue. Let's pass our list of extracted URLs through a filter function to remove links we have already seen.


Now our crawler code looks like this:


Queue = []
Queue.enqueue(startUrls)
while(Queue.Length > 0) {
    Url = Queue.dequeue()
    Response = Request(Url)
    DoSomething(Response)
    NewLinks = ExtractLinks(Response)
    NewLinks = RemoveAlreadySeen(NewLinks)
    Queue.Enqueue(NewLinks)
}
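Here is what that RemoveAlreadySeen filter could look like in Python. This is a minimal sketch, assuming URLs are compared as plain strings; the set deliberately records URLs when they are queued, not just when they are visited, and seen_urls / remove_already_seen are placeholder names of mine.

seen_urls = set()  # every URL that has ever been queued, not just visited

def remove_already_seen(urls):
    """Return only the URLs we have not queued before, and remember them."""
    new_urls = []
    for url in urls:
        if url not in seen_urls:
            seen_urls.add(url)
            new_urls.append(url)
    return new_urls

Comparing URLs as raw strings only works if they have been normalized first, which is exactly what the next aside is about.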

Aside: URL Normalization


If we are adding code that tries to track whether we have seen a URL before, we need a way to tell if two URLs are the same. This is more complex than a simple string comparison. URLs have a lot of complexities:

Parts of the URL are case sensitive (e.g. the path) and parts are not (e.g. the scheme, the hostname)

Parts of the URL are optional (e.g. the port number is implied by the scheme)

The order of query string parameters may not matter (e.g. "a=1&b=2" is equivalent to "b=2&a=1")

Almost any character can be URL encoded (e.g. "/foo+bar/", "/foo%20bar/", and "/%66%6f%6f%20%62%61%72/" are all equivalent)


This means the root of any crawler is a robust set of URL normalization functions.
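As a sketch of what such a function can look like, here is a normalizer built on Python's standard urllib.parse that handles just the cases listed above (letter case, implied ports, query parameter order, and percent-encoding). A production normalizer grows considerably from here, and the default-port table is my own assumption.

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode, quote, unquote

DEFAULT_PORTS = {"gemini": 1965, "http": 80, "https": 443}

def normalize_url(url):
    """Return a canonical form of a URL so equivalent URLs compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Drop the port when it is just the default implied by the scheme.
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = "%s:%d" % (host, parts.port)

    # Decode then re-encode the path so "%66oo" and "foo" compare equal.
    path = quote(unquote(parts.path or "/"), safe="/")

    # Sort query parameters so "a=1&b=2" and "b=2&a=1" compare equal.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))

    return urlunsplit((scheme, netloc, path, query, ""))

With this in place, the earlier seen-URL filter can compare normalize_url(a) == normalize_url(b) instead of raw strings.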


Avoiding duplicate work


Consider a resource gemini://example.com/bar.gmi which contains a relative hyperlink to "foo/a.gmi".


page: gemini://example.com/bar.gmi

=> foo/a.gmi a link

Due to a site misconfiguration, when you visit "foo/a.gmi" you get the same page content, again with a relative URL:


page: gemini://example.com/foo/a.gmi

=> foo/a.gmi a link

This creates a crawler trap where the crawler keeps finding new URLs:

gemini://example.com/foo/a.gmi
gemini://example.com/foo/foo/a.gmi
gemini://example.com/foo/foo/foo/a.gmi
gemini://example.com/foo/foo/foo/foo/a.gmi
gemini://example.com/foo/foo/foo/foo/foo/a.gmi
...

While eventually you will hit a limit on the size of the URL, ideally you want to avoid something like this. This is an example of how processing duplicate content can trap you. But duplicate content has other problems too. These URLs could point to the same content, like this:

gemini://example.com/foo/
gemini://example.com/foo/index.gmi

You probably don't want to process both of them. This is especially true as your processing gets smarter and thus takes longer. You want to avoid duplicate content. This means modifying our crawler so that it detects when it has seen a response before. Depending on the protocol, this may involve just hashing the response body (e.g. Gemini), or potentially normalizing some response headers, removing others, and including them in the hash (e.g. HTTP).


This is usually best accomplished by hashing a response and keeping a table of all the hashes you have already seen. So now our crawler looks like this:


Queue = []
Queue.enqueue(startUrl)
while(Queue.Length > 0) {
    Url = Queue.dequeue()
    Response = Request(Url)

    if(!SeenBefore(Response)) {
        DoSomething(Response)
        NewLinks = ExtractLinks(Response)
        NewLinks = RemoveAlreadySeen(NewLinks)
        Queue.Enqueue(NewLinks)
    }
}
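Here is a sketch of SeenBefore() in Python for the simple Gemini case, where hashing the response body alone is enough; seen_hashes and the function name are placeholders of mine, and an HTTP variant would hash a normalized subset of headers along with the body.

import hashlib

seen_hashes = set()  # SHA-256 digests of every response body already processed

def seen_before(body: bytes) -> bool:
    """Return True if an identical response body has already been processed."""
    digest = hashlib.sha256(body).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False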

Balancing High Performance with Politeness


Crawlers always have a fundamental tension. They have potentially hundreds of thousands (in the case of Gemini) to millions (early web) to billions (modern web) of URLs to visit. So they must work very, very quickly.


However, despite the need for speed, anyone running a crawler that overloads servers soon learns that not only is such behavior considered unacceptable, it also, counterintuitively, limits your effectiveness.


In other words, going too fast will actually make you perform worse than going more slowly. Why?

Site owners will throttle you or outright ban you.

Sites that are overwhelmed will send errors, refuse connections, or, worst of all, time out, leaving your crawler waiting needlessly.

Sites will do this inconsistently, so the results you get back will vary from crawl to crawl, making it difficult to get usable results.


At the very least, a crawler should not attempt to download multiple URLs from the same server simultaneously. Better still, it should impose a limit on the portion of a web server's resources it consumes. Thus, responsible crawlers include a delay between requests.


That looks something like this:


Queue = []
Timer
Queue.enqueue(startUrl)
while(Queue.Length > 0) {
    Url = Queue.dequeue()
    Timer = new Timer()
    Timer.start()
    Response = Request(Url)
    if(!SeenBefore(Response)) {
        DoSomething(Response)
        NewLinks = ExtractLinks(Response)
        NewLinks = RemoveAlreadySeen(NewLinks)
        Queue.Enqueue(NewLinks)
    }
    while(Timer.ElapsedSec < 1.5) {
        sleep()
    }
}

Here we are using a timer so that, regardless of how quickly the site responds, the crawler will not request URLs faster than once every 1.5 seconds.
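The same idea can be sketched in Python with a monotonic clock rather than a busy sleep loop: measure how long the request and processing took, then sleep away whatever remains of the delay. The 1.5 second figure is just the example value used above.

import time

CRAWL_DELAY_SEC = 1.5  # minimum time between successive requests, as above

def polite_pause(started_at):
    """Sleep for whatever remains of the crawl delay after a request finishes."""
    elapsed = time.monotonic() - started_at
    if elapsed < CRAWL_DELAY_SEC:
        time.sleep(CRAWL_DELAY_SEC - elapsed)

# Inside the crawl loop:
#   started_at = time.monotonic()
#   response = request(url)
#   ...process the response...
#   polite_pause(started_at)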


Doing Work in Parallel


So far, our crawler has operated like this:

                       +-------------+
              +------->+    Queue    | +------+
              |        +-------------+        |
Add New URLs  |                               |
to Queue      |                               |  Pull off URL to visit
              |        +-------------+        |
              +--------+  Crawler    | <------+
                       +----+--+-----+
                            |  ^
                            |  | Requests/Responses
                            |  |
                            v  +
                       Gemini Capsules

While we have made our crawler smarter, we haven't increased its throughput. It only requests a single URL at a time. In fact, we just added a timer to make it polite, so its speed is capped. Most of the time spent requesting and downloading a URL is spent waiting, so this is a perfect opportunity to download multiple URLs in parallel.


    +-------------------+
    |    URL Queue      |
    +-------------------+
         X    X     X
        X     X      X
       X      X       XX
      X       X         X
     X        X          X
    X         X           X
+----+     +----+       +----+
| A  |     |  B |       | C  |
+----+     +----+       +----+

There are some immediate and obvious challenges with having multiple threads or workers like this:

Access to the URL Queue needs to be thread safe.

Knowing when the crawl is "done" is more complex

All the data structures that keep track of whether a URL has been seen or not, and whether a response hash has been seen or not, need to be thread safe. The choice of data structures also matters a great deal, as they can quickly become a bottleneck throttling your crawler.


When dealing with a single thread, knowing if the crawler was "done" was pretty easy: if the queue was empty, there was no more work to do. This is not the case with a multi-threaded crawler. Suppose threads A and B are not doing any work; they may check the queue and see that it is empty. However, if C is still making a request or processing a response, it may yet find more URLs to add to the queue. The crawler is only "done" if no threads are actively doing work AND the queue is empty.


while(queue.Length > 0 || threadsDoingWork > 0) {

}

For simplicity, let's define an abstract function IsCrawlerComplete(). This will handle checking all of this for us.
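Here is a sketch of what that completion check might look like in Python, assuming a threading.Lock-protected counter of busy workers and a thread-safe queue; every name here is a placeholder of mine.

import threading
import queue

url_queue = queue.Queue()      # stand-in for the shared, thread-safe URL queue
lock = threading.Lock()
busy_workers = 0               # workers currently fetching or processing a response

def worker_started():
    global busy_workers
    with lock:
        busy_workers += 1

def worker_finished():
    global busy_workers
    with lock:
        busy_workers -= 1

def is_crawler_complete():
    """Done only when the queue is empty AND no worker is mid-request."""
    with lock:
        return url_queue.empty() and busy_workers == 0

Each worker would call worker_started() right after it dequeues a URL and worker_finished() once it has queued any new links it found.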


However there are other, more subtle challenges when building a multi-threaded crawler.


Imagine that one of the crawler threads has visited example.com and added a number of new links to the queue. So inside our queue, there is a run of consecutive URLs, all pointing to example.com. In a multi-threaded design with a single crawl queue, multiple different threads will pull work off that queue. They are now all sending requests for different URLs on example.com. This circumvents the throttling mechanism we put into our crawler before. In fact, it's worse than our original design. In our original design, we could only send a new request after a previous request had finished. In our new design, we could have 5, 10, or even 20 requests to the same site, all in flight at the same time!


We need a way to have multiple workers, while ensuring that URLs for a given site cannot be processed by more than one worker at a time.


Najork's work on the Mercator crawler produced several papers about this. A simplified version of their approach is to add URLs into N different queues, where each queue has a dedicated worker. A URL is sorted into a specific queue by hashing its hostname modulo N. This ensures that URLs for a specific host will always end up in the same queue, and thus multiple workers cannot make requests to the same site simultaneously. This allows our politeness policy to continue to work, while dramatically increasing throughput.


               Incoming URL
                   +
                   |
                   |
                   |
                   v
               +---+-----+
               |  Hash   |
              X+---------+XXXX
           XXXX   X    XXX   XXXXXX
       XXXXX     XX      XXX      XXXXXXXXXX
+----XX-+    +---X---+    +-X------+    +--X----+
|Queue  |    |Queue  |    |Queue   |    |Queue  |
|  A    |    | B     |    | C      |... | N     |
|       |    |       |    |        |    |       |
+--+----+    +---+---+    +----+---+    +---+---+
   |             |             |            |
   |             |             |            |
   v             v             v            v
+--+----+    +---+---+    +----+---+    +---+---+
|Worker |    |Worker |    |Worker  |    |Worker |
+-------+    +-------+    +--------+    +-------+


To do this, we replace our Queue with a "URL Frontier". The URL frontier represents the URLs that need to be visited. For now, we implement the URL frontier as a set of N queues. We need new functions to add URLs to the frontier, and to get the next URL from it. Our GetNext() function requires that the worker provide an ID for the queue it is associated with. Remember, in this model we have N workers for N queues in a 1:1 mapping. Here is what our URL Frontier looks like:

//URL Frontier
Frontier.Queues = [N] //array of N queues
Frontier.AddUrl(url) {
    var ip = resolveIp(url)
    var num = hash(ip) % N
    Frontier.Queues[num].enqueue(url)
}

Frontier.GetNext(queueNum) {
    return Frontier.Queues[queueNum].dequeue()
}
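Here is a sketch of that frontier in Python, using one thread-safe queue.Queue per worker. It partitions by hostname for simplicity (the pseudocode above already resolves to an IP, which the virtual hosting section below justifies), and it uses a stable hash from hashlib rather than Python's built-in hash(), which varies between runs.

import hashlib
import queue
from urllib.parse import urlsplit

class UrlFrontier:
    def __init__(self, num_queues):
        # One queue per worker, in a 1:1 mapping.
        self.queues = [queue.Queue() for _ in range(num_queues)]

    def _queue_for(self, url):
        host = (urlsplit(url).hostname or "").lower()
        digest = hashlib.sha256(host.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % len(self.queues)

    def add_url(self, url):
        """URLs for the same host always land in the same queue."""
        self.queues[self._queue_for(url)].put(url)

    def get_next(self, queue_num):
        """Return the next URL for this worker, or None if its queue is empty."""
        try:
            return self.queues[queue_num].get_nowait()
        except queue.Empty:
            return None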

Under this new system, a worker could check the frontier and find that there is no URL for it, so we need to account for that. Now our crawler looks like this.

//Function for each worker thread. ID for thread, 0...N is passed
Worker(id) {
    while(!IsCrawlerComplete()) {
        Url = Frontier.GetNext(id)
        Timer = new Timer()
        Timer.start()
        if(Url != null) {
            Response = Request(Url)
            if(!SeenBefore(Response)) {
                DoSomething(Response)
                NewLinks = ExtractLinks(Response)
                NewLinks = RemoveAlreadySeen(NewLinks)
                Frontier.AddUrl(NewLinks)
            }
        }
        while(Timer.ElapsedSec < 1.5) {
            sleep()
        }
    }
}

How much can this increase throughput? A tremendous amount. You could potentially have N workers, where N is the total number of sites you want to visit. This would be wasteful, since the vast majority of sites don't have that many pages, while a few sites have a great many. In a well-designed crawler, you can usually increase N until you run into one of a few different bottlenecks:


Available Network Bandwidth

Available CPU to run all the threads

DNS resolution (which is often blocking, regardless of the number of workers)

Diminishing returns of context switching between threads


Handling Virtual hosting

Actually, simply hashing the hostname is not good enough, because of virtual hosting. Virtual hosting allows multiple sites to be served from a single IP address. Since network connections are made to IP addresses, not hostnames, when we talk about crawlers needing to be polite and not make too many connections or requests in parallel, our atomic unit is an IP address, not a hostname.


A good example of this in Gemini space is Flounder. Flounder serves more than a hundred capsules, all as subdomains of the flounder.online hostname (e.g. foo.flounder.online, bar.flounder.online). All of these, however, come from the same IP address. A crawler with multiple workers and its worker queues partitioned by hostname will quickly overwhelm flounder.online sites, and you will get errors and timeouts trying to access their content.


To resolve this issue, we should use the IP address that a hostname resolves to when determining which queue to place a URL in.
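As a sketch of that change in Python: resolve the hostname to an IP and hash that instead. socket.gethostbyname is used here only for simplicity, and the small cache matters because, as noted below, DNS lookups block and quickly become a bottleneck.

import hashlib
import socket
from urllib.parse import urlsplit

dns_cache = {}  # hostname -> IP, so each host is only resolved once

def resolve_ip(hostname):
    if hostname not in dns_cache:
        dns_cache[hostname] = socket.gethostbyname(hostname)
    return dns_cache[hostname]

def queue_for_url(url, num_queues):
    """Partition by IP, so foo.flounder.online and bar.flounder.online share a queue."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha256(resolve_ip(host).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_queues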


This is another example of how DNS resolution is critical for crawlers. In fact, as we will see, DNS resolution is often a bottleneck for crawlers.


Robots.txt and avoiding URLs

Robots.txt provides a way for a site owner to tell crawlers which areas of the site they should avoid. You will want to respect this.


You will also find sites or URLs that don't make sense to crawl or index:

pages that don't close the connection

pages that are extremely slow

pages with low value content

pages with errors or other issues that cause your crawler to behave strangely.


Maintaining a list like this by hand is difficult to scale. Ideally you want your crawler to be able to detect things like this and automatically start skipping them (don't request URLs whose content changes rapidly, etc.). However, even early web crawlers, through the mid-2000s, maintained lists of known bad URLs or patterns they would skip.


You can combine your robots.txt filter and your "URLs to skip" feature into a single module, FilterAllowedLinks(), since they are basically doing the same thing: checking a URL against a list of rules and skipping it if it matches one. So now our crawler looks like this:


//Function for each worker thread. ID for thread, 0...N is passed
Worker(id) {
    while(!IsCrawlerComplete()) {
        Url = Frontier.GetNext(id)
        Timer = new Timer()
        Timer.start()
        if(Url != null) {
            Response = Request(Url)
            if(!SeenBefore(Response)) {
                DoSomething(Response)
                NewLinks = ExtractLinks(Response)
                NewLinks = RemoveAlreadySeen(NewLinks)
                NewLinks = FilterAllowedLinks(NewLinks)
                Frontier.AddUrl(NewLinks)
            }
        }
        while(Timer.ElapsedSec < 1.5) {
            sleep()
        }
    }
}
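Here is a sketch of FilterAllowedLinks() in Python, treating robots.txt Disallow rules and a hand-maintained skip list the same way: as rules that exclude a URL. How each capsule's robots.txt is fetched and parsed is left out, and the rule data shown is purely illustrative.

import re
from urllib.parse import urlsplit

# Path prefixes gathered from each host's robots.txt (illustrative data only).
robots_disallow = {
    "example.com": ["/cgi-bin/", "/private/"],
}

# Hand-maintained patterns for URLs known to misbehave or have little value.
skip_patterns = [
    re.compile(r"/infinite-calendar/"),
]

def filter_allowed_links(urls):
    """Drop URLs that match a robots.txt rule or a known-bad pattern."""
    allowed = []
    for url in urls:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        path = parts.path or "/"
        if any(path.startswith(prefix) for prefix in robots_disallow.get(host, [])):
            continue
        if any(pattern.search(url) for pattern in skip_patterns):
            continue
        allowed.append(url)
    return allowed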
