The Problem With Full-Content Web Feeds Today


Web feeds (Atom, JSON, or RSS) are much nicer if they include the full post content. It allows users to read the post in their feed readers without any distractions that might be on the post’s page, be it advertisements, animations, sidebars, or creative choices which may look nice but make reading less efficient. It also means all content has been downloaded after syncing, which allows for reading offline.


The problems arise if you either want to have all your posts included or want to have control over who downloads your content.


Large Sizes and Many Downloads


I include all my posts in my feed, because I, as a user, want to be able to read all potentially interesting posts in my feed reader after I have discovered an interesting website. It’s feasible, because I only have a little more than a hundred posts in total, but I deleted about the same number over the years to keep the feed somewhat relevant.


Deleting Posts Again


The resulting size is currently about 0.5 MiB, which is quite large for a potential download every 30 minutes. If feed readers support compression and the web server is set up properly, the size drops to 0.13 MiB¹. The same goes for caching, which means the download should only happen if the feed has changed. Caveats apply, because static site generators recreate the feed on every build, which might be triggered for unrelated reasons.
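

For illustration, a well-behaved client asks for a compressed response and revalidates its cached copy instead of downloading an unchanged feed again. A rough curl sketch, assuming the server supports both; the URL and the ETag value are placeholders:


# First fetch: ask for a compressed response and keep the ETag/Last-Modified headers around.
curl --compressed -sD headers.txt -o feed.xml "https://example.com/feed.xml"

# Later fetches: send the validators back; a 304 status means the cached copy is still current.
curl --compressed -s -o feed.new.xml -w "%{http_code}\n" \
  -H 'If-None-Match: "abc123"' \
  -H 'If-Modified-Since: Mon, 29 Apr 2024 10:00:00 GMT' \
  "https://example.com/feed.xml"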


For some, those sizes and requests seem negligible, but others have a more sustainable future in mind, in which unnecessary resource consumption is avoided. Feed subscribers can help by setting the refresh interval to manual.


Feed Subscriptions on the Small and Not So Small Net


Limiting the number of posts in a feed to e.g. ten can be problematic: if I don’t get around to reading the last ten posts, even though I want to, a newly published post removes the oldest of those ten from the feed and, depending on the feed reader, from the feed reader as well.


Fighting Grifters, aka AI Bots, and Others


But the bigger problem nowadays is that many people or companies want to take advantage of seemingly free content. If you don’t want so-called AI companies to enrich themselves with your content while effectively making it invisible in the process, then full-content feeds are harmful. They make it particularly easy to get all content in one place with very little effort.


But limiting the content only makes sense if the server can detect ordinary requests for those linked posts as bot requests and block them.
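

A minimal sketch of such blocking, assuming nginx and matching on the User-Agent header; the crawler list is incomplete, and any bot can lie about its User-Agent:


# Inside the server block: refuse requests from known AI crawlers.
# Only catches honest bots; dishonest ones need IP- or behavior-based heuristics.
if ($http_user_agent ~* "(GPTBot|CCBot|ClaudeBot|Bytespider)") {
    return 403;
}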


I’m not sure if well-known structural copyright and license tags in the HTML page and feed would help, but it would be a start. While you can add copyright information to feeds (Atom has `rights`, JSON feed and RSS 2.0 have `copyright`), you cannot add licensing information. For this to be effective, people have to use licenses which prohibit replication and commercial use, like Creative Commons BY-NC-ND.


Creative Commons BY-NC-ND
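

For illustration, this is what the copyright field looks like in Atom and RSS 2.0; stuffing a license name into it, as below, is only a convention, not machine-readable licensing, and the names and wording are made up:


<!-- Atom -->
<rights>© 2024 Example Author. Licensed under CC BY-NC-ND 4.0.</rights>

<!-- RSS 2.0 -->
<copyright>© 2024 Example Author. Licensed under CC BY-NC-ND 4.0.</copyright>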


Alternative Feed Reading Patterns


Even though I wrote above that I like to read the whole post in my feed reader, I’m warming to the idea of just getting a reasonable abstract in my subscribed feeds. Not just a cut-off after N characters, but an abstract which sums up the content of the post. Selecting the URL would open the whole post in a browser in its “design mode” rather than reading mode. You could also choose the latter by default, if your browser supports it, or only for some sites.
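

In Atom terms, that would be an entry which ships a hand-written summary and omits the full content element, something like this; URLs, dates, and text are made up:


<entry>
  <title>The Problem With Full-Content Web Feeds Today</title>
  <link rel="alternate" href="https://example.com/posts/full-content-feeds"/>
  <id>https://example.com/posts/full-content-feeds</id>
  <updated>2024-05-05T12:00:00Z</updated>
  <summary>Why shipping every post in full makes feeds large, and what abstracts would change.</summary>
</entry>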


Not only would feed downloads be much smaller, you would also only download what you actually read. Excessive downloads are even worse for podcast subscriptions, where some clients automatically download every newly published episode by default.


Content creators would see the actual number of readers or listeners if people downloaded only the parts they are actually interested in, which, combined with advanced bot heuristics, would double as non-intrusive statistics.


I’m aware this would mean increased traffic while reading on the go, but I prefer resource usage in small increments when actually needed over large just-in-case usage.


Ads or tracking are not much of a problem here, because sites relying on those will never have full-content feeds.


Simple Solutions


The problem of having all posts in a feed could be mitigated by offering more than one feed URL, differing in the number of posts or the amount of content included.


<link href="https://example.com/feed/full" title="All posts with full content" type="application/atom+xml" rel="alternate">
<link href="https://example.com/feed/full?10" title="Last 10 posts with full content" type="application/atom+xml" rel="alternate">
<link href="https://example.com/feed/full?50" title="Last 50 posts with full content" type="application/atom+xml" rel="alternate">
<link href="https://example.com/feed/abstract" title="All posts" type="application/atom+xml" rel="alternate">
<link href="https://example.com/feed/abstract?10" title="Last 10 posts" type="application/atom+xml" rel="alternate">
<link href="https://example.com/feed/abstract?50" title="Last 50 posts" type="application/atom+xml" rel="alternate">

Better feed readers will offer these options to choose from when just the domain name is added to the subscription list.


I’m not sure what I’m going to do, because if I can detect robots confidently, the detection can be applied to both the feed and the posts. Unfortunately I only have a chance to do this on the web, but not on Gemini, where I only have the requesting IP address. I have to decide what’s more important to me: avoiding exploitation or publishing on the smolnet.



¹ Minifying the feed by removing empty lines, leading spaces on lines, the space before the end of self-closing tags, and line breaks doesn’t help much. `sed -E "/^$/d" feed.xml | sed -E "s/^[ ]+(.+)/\1/" | sed "s| />|/>|g" | tr -d "\n" > feed.min.xml` takes it from 491 KiB to 474 KiB, and 134 KiB to 133 KiB gzip’ed. Compression takes care of redundancies.
