How not to get caught in a network


> Notes on writing a network crawler.


A network crawler typically works as follows:


get resource

find links to other resources

do the same for all the resources not seen before


That's typical depth-first recursion; here's some pseudo code:


  having empty list of seen_urls

  function crawl(url)
    add url to seen_urls
    fetch and extract new_urls from page at url

    for new_url in (new_urls not in seen_urls)
      crawl(new_url)

  for url in seed_urls
    crawl(url)

Note: we need to keep a list of URLs we've "seen" to avoid going in loops; A -> B -> A etc.
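

For a more concrete picture, here's a minimal Python sketch of the same idea; fetch_and_extract and seed_urls are placeholders for the real fetching/link-extraction code and the initial URL list, not part of the pseudo code above:


  # minimal depth-first sketch; fetch_and_extract is assumed to fetch
  # the page at `url` and return the URLs it links to
  seen_urls = set()

  def crawl(url):
      seen_urls.add(url)
      new_urls = fetch_and_extract(url)

      for new_url in new_urls:
          if new_url not in seen_urls:
              crawl(new_url)

  for url in seed_urls:
      crawl(url)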


This program will eventually touch all of the reachable nodes, but what if the network has some really deep chains which might even go on forever? For instance, imagine a resource which renders all possible moves on a chess board; every page shows a board and has links to all possible moves. It's a nice programming exercise, not too hard. But the number of positions is huge (look up the Shannon number); when our crawler enters this site, it will still be going through chess moves when our sun dies.
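

To get a feel for the scale: Shannon estimated roughly 10^43 possible chess positions (his famous 10^120 is for the number of possible games, which is even worse). A quick back-of-the-envelope calculation, assuming a very optimistic crawl rate of a thousand pages per second:


  # rough estimate; both numbers are assumptions, not measurements
  positions = 10 ** 43
  pages_per_second = 1_000
  seconds_per_year = 60 * 60 * 24 * 365

  years_needed = positions / (pages_per_second * seconds_per_year)
  print(f"{years_needed:.0e} years")   # ~3e+32 years; the sun has ~5e9 left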


Maximum depth


Our crawler is not interested in all chess moves and not smart enough to recognize it took a wrong turn. A basic rule we could apply is to go no deeper than some generously chosen number of hops: we'll define a maximum depth. Let's revisit our program and put that in:


  having empty list of seen_urls

  function crawl(url, depth)
    skip if depth is not below limit

    fetch and extract new_urls from page at url
    add url to seen_urls

    for new_url in (new_urls not in seen_urls)
      crawl(new_url, depth + 1)

  for url in seed_urls
    crawl(url, 0)

Note: the generosity allowed for the maximum depth depends on how deep our programming language allows us to recurse; since this is not a tail call, it depends on the stack size. We could rewrite it to avoid this problem, but more on that later.
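

To make the stack concern concrete, here's the depth-limited version as a Python sketch; MAX_DEPTH is an arbitrary pick and fetch_and_extract and seed_urls are placeholders as before. Python's default recursion limit is around a thousand frames, so a "generous" maximum depth has to stay well below that.


  # depth-limited depth-first sketch
  MAX_DEPTH = 5
  seen_urls = set()

  def crawl(url, depth):
      if depth >= MAX_DEPTH:
          return

      new_urls = fetch_and_extract(url)
      seen_urls.add(url)

      for new_url in new_urls:
          if new_url not in seen_urls:
              crawl(new_url, depth + 1)

  for url in seed_urls:
      crawl(url, 0)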


Selecting a value for the maximum depth depends on: the pace of the crawler, the estimated number of pages which will be crawled, the average depth, and in what amount of time it's reasonable for our crawler to finish. That's a lot of parameters... We could try and figure out how to automatically adjust it after every run, but that doesn't seem like a trivial problem.


A downside to limiting depth in this manner is that all exploration of new territory (not part of our seed URLs) will be relatively shallow. How do we fix that?


Reset depth counter when entering a new domain?


We could reset the depth counter every time we wander into another domain. This seems like an obvious solution, but consider the chess problem again and imagine implementing it using two domains, one for white moves, the other for black: our crawler is back on its eternal quest. We could try to do some pattern detection, but staying out of traps like this will take serious AI.


Discover new seeds automatically?


We could try and automatically promote some of the new territory to our list of start URLs. This will allow the crawler to traverse the new territory deeper in subsequent runs.


But on what basis should we select URLs for promotion? Root URLs could be good candidates when encountered, but what about all those tilde sites? What's "root" for them, and should we be able to recognize those automatically? There are probably plenty of other kinds of islands on domains which are hard to recognize.


Another issue: will we actually find these "roots"? A lot of pages deep-link to some other page on some new domain. How big are the chances that such a page links to its "root"?


This maximum depth approach seems to have a lot of issues. I feel we should leave it and look for other solutions.


Max pages per domain (per run)


So what if we, instead of putting a limit on depth, put a limit on the number of pages hit per domain per run? That way all domains are treated equally.


Our recursive code isn't very suitable for this (and suffers from the stack issue), so let's flatten it and fix that first:


  having remaining_urls list with seed urls
  having empty list of seen_urls

  while any remaining_urls
    take url from start of remaining_urls

    fetch and extract new_urls from page at url
    add url to seen_urls

    remove all urls also in seen_urls from new_urls
    remove all urls also in remaining_urls from new_urls
    append new_urls to remaining_urls

The recursion is replaced by a loop and instead of going depth first, we're going breadth first. An interesting property of crawling breadth first is that we're first hitting all our seed URLs before going deeper and, possibly, getting lost in some wormhole. This seems like a big improvement on our recursive version.
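

In Python this loop might look something like the sketch below, again with fetch_and_extract and seed_urls as stand-ins:


  # breadth-first sketch using a queue
  from collections import deque

  remaining_urls = deque(seed_urls)
  seen_urls = set()

  while remaining_urls:
      url = remaining_urls.popleft()

      new_urls = fetch_and_extract(url)
      seen_urls.add(url)

      for new_url in new_urls:
          # membership test on a deque is O(n); fine for a sketch
          if new_url not in seen_urls and new_url not in remaining_urls:
              remaining_urls.append(new_url)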


  depth first                 breadth first

             1                           1
             |                           |
     ----------------            ----------------
     |       |      |            |       |      |
     2       5      9            2       3      4
     |       |                   |       |
   -----   -----               -----   -----
   |   |   |   |               |   |   |   |
   3   4   6   7               5   6   7   8
               |                           |
               8                           9

This is still, potentially, an endless loop. Next, we'll keep track of the domains we've seen and put a limit on the number of times we'll hit them.


  having remaining_urls list with seed urls
  having empty list of seen_urls
  having domain hit counters

  reset domain hit counters
  while any remaining_urls having domain hit counter below limit
    take first url of remaining_urls having domain hit counter below limit

    fetch and extract new_urls from page at url
    add url to seen_urls
    increment hit counter for url domain

    remove all urls also in seen_urls from new_urls
    remove all urls also in remaining_urls from new_urls
    append new_urls to remaining_urls

Note: we'll keep collecting URLs on domains for which the domain counter has already reached the limit, but we won't visit them.
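

A Python sketch of this version; the limit is a made-up number and fetch_and_extract and seed_urls are placeholders as before:


  # breadth-first sketch with a per-domain hit limit
  from collections import Counter
  from urllib.parse import urlparse

  MAX_HITS_PER_DOMAIN = 100

  remaining_urls = list(seed_urls)
  seen_urls = set()
  domain_hits = Counter()

  def domain(url):
      return urlparse(url).hostname

  while True:
      # first URL whose domain is still below the limit, if any
      url = next((u for u in remaining_urls
                  if domain_hits[domain(u)] < MAX_HITS_PER_DOMAIN), None)
      if url is None:
          break
      remaining_urls.remove(url)

      new_urls = fetch_and_extract(url)
      seen_urls.add(url)
      domain_hits[domain(url)] += 1

      for new_url in new_urls:
          if new_url not in seen_urls and new_url not in remaining_urls:
              remaining_urls.append(new_url)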


Now our program will hit all domains no more than a specified number of times. The URLs remaining are from the domains hosting a lot of pages. But what should we do with those? A couple of extra churns through the remaining URLs after resetting the domain hit counters, until there's only "a couple" of different domains left?


Subsequent runs


In our first breadth-first version we kept taking from the remaining URLs list until it was depleted and called it a day. In our second version we only take URLs when they can still be traversed (domain hit counter still below limit), and we end up with a list of URLs which still need attention.


When we'd like to do a subsequent run to refresh our knowledge of the network we'd need to run through our seed URLs again, but what about the remaining URLs from the previous run? If we start with the seed URLs we'll probably end up with the same remaining URLs, so those will never be visited. Appending the seed URLs to the remaining URLs works a lot better, but we can't be 100% sure all of our seed URLs will be visited and refreshed in this run.


Aging


Let's add aging to the mix. We'll record the necessary timestamps for all the pages we're hitting and use that to determine what's stale and what's not.


  having seeded list of known_urls
  having domain hit counters

  eligible:
    having domain hit counter below limit and not stale

  reset domain hit counters
  while any eligible known_urls
    borrow first eligible known_urls

    fetch and extract new_urls from page at url
    record timestamp for url
    increment hit counter for url domain

    remove all urls also in known_urls from new_urls
    append new_urls to known_urls

Note: instead of taking (and thus deleting) from our list we now "borrow" (that's the best verb I can come up with). Also note: when the crawler does "record timestamp for url", that URL is no longer stale and thus no longer eligible for a new visit during this run.


On the first run we'll start with the seed URLs. In subsequent runs we'll feed it whatever we've recorded earlier. Every run starts by resetting the domain hit counters. New seeds will be added to the beginning of the list to ensure they'll be picked up as soon as possible.
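

A Python sketch of this aging version, with known_urls as a dict from URL to the timestamp of the last visit (None for "never visited"); the limits and the week-long staleness window are made-up values and fetch_and_extract is still a placeholder:


  import time
  from collections import Counter
  from urllib.parse import urlparse

  MAX_HITS_PER_DOMAIN = 100
  MAX_AGE = 7 * 24 * 3600               # consider a page stale after a week

  known_urls = {url: None for url in seed_urls}
  domain_hits = Counter()               # reset at the start of every run

  def domain(url):
      return urlparse(url).hostname

  def stale(timestamp):
      return timestamp is None or time.time() - timestamp > MAX_AGE

  def eligible(url):
      return (stale(known_urls[url])
              and domain_hits[domain(url)] < MAX_HITS_PER_DOMAIN)

  while True:
      url = next((u for u in known_urls if eligible(u)), None)
      if url is None:
          break

      new_urls = fetch_and_extract(url)
      known_urls[url] = time.time()     # "borrow": no longer stale this run
      domain_hits[domain(url)] += 1

      for new_url in new_urls:
          if new_url not in known_urls:
              known_urls[new_url] = None

This sketch simply appends newly discovered URLs at the end; to put new seeds at the beginning of the list, as described above, a real implementation would need to order the structure differently.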


Keeping domain counters seems like the most obvious thing to partition on to ensure the crawler finishes at some point, but we can make that a bit smarter. Maybe we'll want to give the tilde sites a bit more space by extending "domain" to include the ~username bit, for instance.
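

One possible way to do that, sketched in Python (the exact rule for what counts as a tilde "domain" is a guess):


  from urllib.parse import urlparse

  def domain_key(url):
      parts = urlparse(url)
      path = parts.path or ""
      if path.startswith("/~"):
          # treat gemini://example.org/~alice/... as its own "domain"
          return parts.hostname + "/" + path.split("/")[1]
      return parts.hostname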


The definition of "stale" allows for some logic like: different rules for different response types, recognizing news pages which change often, increasing "wait time" when a resource doesn't seem to change, etc.
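

For example, the rule could be something along these lines; everything here is illustrative and looks_like_news_page is an imaginary helper:


  def max_age(url, seconds_since_last_change):
      if looks_like_news_page(url):            # placeholder heuristic
          return 24 * 3600                     # refresh news pages daily
      # back off: wait roughly as long as the page has been unchanged,
      # but at least a week and at most a month
      return min(max(seconds_since_last_change, 7 * 24 * 3600),
                 30 * 24 * 3600)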


The known URL list needs to be cleaned up periodically to delete all the pages which have become unconnected and are yielding errors when fetched (DNS errors, connection refused, not found, etc.). By keeping some stats on pages which disappeared but are still linked to, we can even provide some kind of dead-link detection service.
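

A cleanup pass could look roughly like this, assuming each known URL carries a bit more bookkeeping (an error counter and a list of referrers) than the timestamp-only sketch above; report_dead_link is an imaginary hook:


  MAX_ERRORS = 3

  def cleanup(known_urls):
      for url, info in list(known_urls.items()):
          if info.error_count >= MAX_ERRORS:
              if info.linked_from:
                  report_dead_link(url, info.linked_from)
              del known_urls[url]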


In conclusion?


Are we there yet? No. But I think we made some progress. Going breadth first is a big win. Knowing when to stop is still an issue, but the domain hit counters seem like a nice start to me.


--

📅 2020-11-09

📧 hello@rwv.io

CC BY-NC-SA 4.0
