gemini://gemi.dev/gemlog/2022-07-10-newswaffle.gmi

2022-07-10 | #cgi #news | @Acidus

Today I'm releasing NewsWaffle, a Gemini gateway to (almost) any news website. It lets you get lists of current news articles, and read those articles, all from inside Gemini.

There are already a few news gateways in Gemini, like @sloum's great Geminews, or Jon's mirror of The Guardian.

While awesome, these only work with a few specific "text only" or "lite" versions of news sites.

NewsWaffle is different. It works with nearly any news site:

In fact, you can supply the URL of any news site you want as well:

Read (almost) any news site.

Automatically builds a list of news stories, separate from the navigational hyperlinks.

Detects RSS/Atom feeds to provide a more accurate list of news stories.

Uses Readability to show only article content on article pages.

Uses meta data like OpenGraph or Twitter cards to provide richer formatting, and to determine page type.

Uses a modified version of Gemipedia's HTML-to-gemtext library, so it supports images, tables, lists, block quotes, etc.

I like to read news. Specifically technical content. However, news websites are increasingly user-hostile, even with an ad-blocker installed:

Cookie consent boxes with dark patterns.

Annoying popups to subscribe to newsletters.

Nag-walls limiting access.

Sticky headers or footers that cover the content.

I wanted to read news sites via Gemini. But I didn't want to have to hard code support for each site I liked. So I needed to write code that could convert arbitrary news websites.

News sites primarily consist of 2 kinds of pages:

Link Pages: These are pages like the home page or a topic page. Their primary purpose is to present a bunch of headlines, snippets of text, and photos, all linking to more in-depth articles. A useful way to think of such a page is as a list of headlines, with links to articles.

Article Pages: These are the pages with the actual content. A series of paragraphs, images, block quotes, lists, and tables.

To be able to access any news website via Gemini I need to:

Detect if a page is a link page or an article page.

For a link page, extract the links to news articles, but ignore the navigational links.

For an article page, extract only the article content and meta data, and format it nicely.

Helping this process is embedded meta data like OpenGraph and Twitter Cards. News websites want to make their content look good when shared on social media, so they tend to use these meta data standards (though not always correctly):

Pages with an "og:type" of "website" tend to be link pages.

Pages with an "og:type" of "article" tend to be article pages.

Meta data is used to determine the proper title, site owner, copyright info to display, feature image, and more.
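As a rough illustration of the "og:type" check (not NewsWaffle's actual code), here is a minimal Python sketch using only the standard library. The names `MetaCollector` and `guess_page_type` are made up for this example, and it assumes "og:type" is the only signal consulted; a real gateway would fall back to other heuristics when the tag is missing or wrong.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <meta property=...> and <meta name=...> values into a dict."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # OpenGraph uses "property"; Twitter cards and old-school tags use "name".
        key = attrs.get("property") or attrs.get("name")
        content = attrs.get("content")
        if key and content:
            # Keep the first value seen for each key.
            self.meta.setdefault(key.lower(), content.strip())


def guess_page_type(html):
    """Return 'article', 'links', or 'unknown' based on og:type alone."""
    collector = MetaCollector()
    collector.feed(html)
    og_type = collector.meta.get("og:type", "").lower()
    if og_type == "article":
        return "article"
    if og_type == "website":
        return "links"
    return "unknown"
```

The same `meta` dictionary could also supply values such as "og:title" or "og:image" for the title and feature image.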

I won't lie. Part of me smiles that I am able to use social media nonsense against them.

Once I know the page type, I can move forward. As I discussed in my last post, converting HTML to Gemtext has a lot of challenges, mainly stemming from the structure of modern HTML.

Rendering article pages is pretty easy. I use Readability to extract out the article's content. I parse any meta data like OpenGraph, Twitter cards, and old-school <meta> tags, so I can gather semantic information about the content, and run the HTML all through a modified version of Gemipedia's HTML-to-gemtext converter.

Link pages are a little trickier:

I fetch all the hyperlinks on the page.

I discard any link that has no link text, points to an external site, or is just an anchor.

I deduplicate the links, and if 2 anchors point to the same URL, I use the one with the longer link text.
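The three steps above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not NewsWaffle's implementation: `LinkCollector` and `all_links` are hypothetical names, "external" is taken to mean "different host than the page", and "just an anchor" is taken to mean an href starting with "#".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the link text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def all_links(html, page_url):
    """Return a dict of URL -> link text for same-site links."""
    collector = LinkCollector()
    collector.feed(html)
    site = urlparse(page_url).netloc
    best = {}
    for href, text in collector.links:
        if not text or href.startswith("#"):
            continue  # no link text, or just an in-page anchor
        url = urljoin(page_url, href)
        if urlparse(url).netloc != site:
            continue  # points to an external site
        # Deduplicate, keeping the longer link text for each URL.
        if len(text) > len(best.get(url, "")):
            best[url] = text
    return best
```

Calling `all_links(html, "https://news.example/")` (a placeholder URL) returns one entry per same-site URL, in the order the URLs first appear on the page.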

That gives me my "All Links" list. Now I want just the links that seem like they point to articles.

Remove from the DOM things that look like navigation: <header> tags, <nav> tags, <footer> tags, funny names in classes, etc.

Fetch all the hyperlinks again, with the same criteria as above. I also discard any where the link text is fewer than 4 words. This helps filter out any links to categories, or authors, or tags, etc.

Now I have a "Content Links" list, which points to likely news articles, and a list of everything. If a link appears in the "Content Links" list, I remove it from the "All Links" list. What remains is a list of links that are probably just navigation links.
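A minimal Python sketch of this split might look like the following. It is an assumption-laden illustration rather than the real code: `ContentLinkCollector`, `split_links`, `NAV_TAGS`, and `MIN_WORDS` are hypothetical names, only the <header>, <nav>, and <footer> tags are treated as navigation (the class-name heuristics are left out), and instead of editing a DOM it simply flags links found inside those tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

NAV_TAGS = {"header", "nav", "footer"}  # class-name heuristics omitted here
MIN_WORDS = 4  # link text with fewer words is treated as non-article


class ContentLinkCollector(HTMLParser):
    """Collects (href, text, in_navigation) for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._nav_depth = 0
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in NAV_TAGS:
            self._nav_depth += 1
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in NAV_TAGS and self._nav_depth > 0:
            self._nav_depth -= 1
        elif tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text, self._nav_depth > 0))
            self._href = None


def split_links(html, page_url):
    """Return (content_links, navigation_links) as dicts of URL -> text."""
    collector = ContentLinkCollector()
    collector.feed(html)
    site = urlparse(page_url).netloc
    everything, content = {}, {}
    for href, text, in_nav in collector.links:
        if not text or href.startswith("#"):
            continue  # no link text, or just an in-page anchor
        url = urljoin(page_url, href)
        if urlparse(url).netloc != site:
            continue  # external site
        if len(text) > len(everything.get(url, "")):
            everything[url] = text
        # Content links: outside navigation and at least MIN_WORDS words.
        if not in_nav and len(text.split()) >= MIN_WORDS:
            if len(text) > len(content.get(url, "")):
                content[url] = text
    # Whatever is not a content link is probably navigation.
    navigation = {u: t for u, t in everything.items() if u not in content}
    return content, navigation
```

Flagging links by position keeps the sketch to a single pass over the HTML; removing the navigation elements from a parsed DOM and re-querying, as described above, gives the same two lists.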

How do I know the page type, and whether to render a page as a "Link View" or as an "Article View"?

I use some other fuzzy logic to try and guess the page type. Whenever I render a webpage as a "Link View," I also give the user an option to switch it to "Article View" and back again. So even if I'm wrong, the user can quickly get to the right content.

Sure. You can access it via Gemini or HTTP.

Because waffles are delicious.

🧇 NewsWaffle: Read any news website, all via Gemini

Features:

Why build NewsWaffle?

Structure of a News Website.

How NewsWaffle works

Let me see the code!

Why did you name it NewsWaffle?