2024-03-09
Did I mention recently that I love RSS? That it brings me great joy? That I start and finish almost every day in my feed reader? Probably.
I used to have a single minor niggle with the BBC News RSS feed: that it included sports news, which I didn't care about. So I wrote a script that downloaded it, stripped sports news, and re-exported the feed for me to subscribe to. Magic.
But lately - presumably as a result of technical changes at the Beeb's side - this feed has found two fresh ways to annoy me:
The feed now re-publishes a story if it gets re-promoted to the front page... but with a different <guid> (it appears to get a #0 after it when first published, a #1 the second time, and so on). In a typical day the feed reader might scoop up new stories about once an hour, and by the time I get to reading them the same exact story might appear in my reader multiple times. Ugh.
They've started adding iPlayer and BBC Sounds content to the BBC News feed. I don't follow BBC News in my feed reader because I want to watch or listen to things. If you do, that's fine, but I don't, and I'd rather filter this content out.
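To make the first annoyance concrete, here's a minimal sketch of why a changing <guid> produces duplicates. It assumes a feed reader that decides what's "new" purely by comparing <guid> strings; the story URL is invented for illustration, and only the #0/#1/#2 suffix pattern comes from the feed itself.

```ruby
require 'set'

# Hypothetical GUIDs for one story, as seen across three hourly fetches.
# The path is made up; the #0/#1/#2 suffixes mimic what the feed does.
fetched = [
  'https://www.bbc.co.uk/news/example-story-12345678#0',
  'https://www.bbc.co.uk/news/example-story-12345678#1',
  'https://www.bbc.co.uk/news/example-story-12345678#2',
]

# A naive reader remembers raw GUIDs, so every re-promotion looks new.
# Set#add? returns nil when the value is already present.
seen_raw = Set.new
new_raw = fetched.select { |guid| seen_raw.add?(guid) }

# Dropping everything from the '#' onwards collapses them into one story.
seen_stripped = Set.new
new_stripped = fetched.select { |guid| seen_stripped.add?(guid.sub(/#.*\z/, '')) }

puts "Raw GUIDs: #{new_raw.size} items treated as new"
puts "Stripped GUIDs: #{new_stripped.size} item treated as new"
```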
Luckily, I already have a recipe for improving this feed, thanks to my prior work. Let's look at my newly-revised script (also available on GitHub):
#!/usr/bin/env ruby
require 'bundler/inline'

# Sample crontab:
#   # Every 20 minutes, run the script and log the results
#   */20 * * * * ~/bbc-news-rss-filter-sport-out.rb > ~/bbc-news-rss-filter-sport-out.log 2>&1

# Dependencies:
# * open-uri - load remote URL content easily
# * nokogiri - parse/filter XML
gemfile do
  source 'https://rubygems.org'
  gem 'nokogiri'
end
require 'open-uri'

# Regular expression describing the GUIDs to reject from the resulting RSS feed
# We want to drop everything from the "sport" section of the website, also any iPlayer/Sounds links
REJECT_GUIDS_MATCHING = /^https:\/\/www\.bbc\.co\.uk\/(sport|iplayer|sounds)\//

# Load and filter the original RSS
# (URI.open rather than a bare open: Kernel#open no longer fetches URLs on Ruby 3.0+)
rss = Nokogiri::XML(URI.open('https://feeds.bbci.co.uk/news/rss.xml?edition=uk'))
rss.css('item').select { |item| item.css('guid').text =~ REJECT_GUIDS_MATCHING }.each(&:unlink)

# Strip the anchors off the <guid>s: BBC News "republishes" stories by using guids with #0, #1, #2 etc,
# which results in duplicates in feed readers
rss.css('guid').each { |g| g.content = g.content.gsub(/#.*$/, '') }

# Write the filtered feed out for the web server to serve
File.open('/www/bbc-news-no-sport.xml', 'w') { |f| f.puts(rss.to_s) }
It's amazing what you can do with Nokogiri and a half dozen lines of Ruby.
That revised script removes from the feed anything whose <guid> suggests it's sports news or from BBC Sounds or iPlayer, and also strips any "anchor" part of the <guid> before re-exporting the feed. Much better.
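If you'd like to check the two rules without hitting the network, they can be tried on plain strings. This is only a sketch: the GUIDs below are invented, and REJECT is my shorthand for the same pattern as REJECT_GUIDS_MATCHING, not a name from the script.

```ruby
# Same pattern as the script, written with %r{} to avoid escaping slashes.
REJECT = %r{\Ahttps://www\.bbc\.co\.uk/(sport|iplayer|sounds)/}

# Invented GUIDs for illustration only.
guids = [
  'https://www.bbc.co.uk/news/example-11111111#0',
  'https://www.bbc.co.uk/sport/example/22222222#0',
  'https://www.bbc.co.uk/iplayer/episode/example#0',
  'https://www.bbc.co.uk/sounds/play/example#1',
  'https://www.bbc.co.uk/news/example-33333333#2',
]

# Step 1: drop anything from the sport, iPlayer or Sounds sections.
# Step 2: strip the "#n" anchor from whatever is left.
kept = guids.reject { |guid| guid.match?(REJECT) }
            .map { |guid| guid.sub(/#.*\z/, '') }

puts kept
```

Only the two /news/ GUIDs survive, each without its anchor.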
You're free to take and adapt the script to your own needs, or - if you don't mind being tied to my opinions about what should be in BBC News' RSS feed - just subscribe to my copy: link below -