
Why I Blocked Bing, Yandex, and Other High-Profile Crawlers


A couple of days ago I blocked Bing, Yandex, and some others from crawling and indexing my website.


My rationale for blocking bots and crawlers is simple. If I can use the crawler’s resulting product for free, they can crawl my site for free. If I cannot use it for whatever reason, they cannot take my data. SEO bots are the worst, and will be blocked immediately.


And I block all misbehaving or malicious crawlers, of course. Misbehaving could mean not requesting `robots.txt` beforehand to see if they are allowed to crawl anything. Malicious could mean probing for security holes or other shenanigans which disturb and/or threaten my site.
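One way to spot the robots.txt offenders is to scan the access log for clients that crawled pages without ever fetching `robots.txt` first. A minimal sketch, assuming a simplified log format of `client path` per line (real logs have more fields, so the parsing would need adapting):

```python
from collections import defaultdict

def find_misbehaving(log_lines):
    """Return clients that requested pages but never robots.txt."""
    fetched_robots = set()
    requested_pages = defaultdict(int)
    for line in log_lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        client, path = parts[0], parts[1]
        if path == "/robots.txt":
            fetched_robots.add(client)
        else:
            requested_pages[client] += 1
    # Anyone who crawled pages but skipped robots.txt is a candidate for blocking.
    return {c for c in requested_pages if c not in fetched_robots}

log = [
    "198.51.100.7 /robots.txt",
    "198.51.100.7 /posts/1",
    "203.0.113.9 /posts/1",
    "203.0.113.9 /posts/2",
]
print(find_misbehaving(log))  # {'203.0.113.9'}
```

The IPs and function name are made up for illustration; the point is only the heuristic: page requests with no prior `robots.txt` fetch.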


If crawlers don’t use a custom user-agent but a generic one, I block them as well. A generic user-agent could be a fake browser string or the default provided by the programming language or library they use, e.g. `Java/1.8.0` or `python-requests/2.28.1`. Bots need to disclose themselves properly.
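Detecting these is simple pattern matching. A sketch using the two user-agents above plus a couple of other common library defaults (the exact pattern list is an assumption; extend it to taste):

```python
import re

# Default user-agents of common HTTP libraries. Java/... and
# python-requests/... are from the post; the others are typical
# library defaults and may need adjusting.
GENERIC_UA_PATTERNS = [
    re.compile(r"^Java/\d"),
    re.compile(r"^python-requests/\d"),
    re.compile(r"^Go-http-client/"),
    re.compile(r"^curl/\d"),
]

def is_generic(user_agent: str) -> bool:
    """True if the user-agent looks like an undisclosed library default."""
    return any(p.search(user_agent) for p in GENERIC_UA_PATTERNS)

print(is_generic("python-requests/2.28.1"))            # True
print(is_generic("MyBot/1.0 (+https://example.com)"))  # False
```

A bot with a proper self-disclosing user-agent (name plus a URL explaining itself) passes; bare library defaults get blocked.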


While Bing and Yandex didn’t do anything wrong while crawling, they don’t let me use their products properly. Bing effectively de-indexed my site more than two years ago, and Yandex is unusable because of a very aggressive CAPTCHA, which triggers on 90 percent of page loads. While I understand that services want to make sure only real users and no bots are using them, I’m not going to relax any anti-tracking measures I use.


Bing and DuckDuckGo Remove Indexed Websites


Because of all that, some people only allow known crawlers like Googlebot, since they don’t want to play whack-a-mole identifying and blocking unwanted ones. I think this is a mistake, because it leaves no room for newcomers to create compelling alternatives to the well-known big-tech behemoths.


“AI” bots are blocked as well, obviously, because while their products currently might be free to use, the crawled content becomes part of their models and cannot be removed anymore. This unjustly impacts the rights of those who created the content.
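For the subset of “AI” crawlers that honor the robots exclusion protocol, this can be expressed in `robots.txt`. A sketch using a few published crawler tokens (the list is incomplete and changes over time, and misbehaving bots ignore it anyway, which is why server-side blocking is still needed):

```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

`GPTBot` is OpenAI’s crawler, `CCBot` is Common Crawl (whose dumps feed many model training sets), and `Google-Extended` opts out of Google’s AI training without affecting regular Googlebot indexing.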


Update 2024-04-14


I unblocked Yandex, because I didn’t encounter any CAPTCHAs when I tried it again last week and today.
