It’s no surprise to anyone that I prefer reading people’s long-form thoughts to tweets or pithy social media posts. Microblogging is interesting for quick, in-the-now status updates, but I find myself craving more nuance and depth.
Luckily, blogging is enjoying a resurgence off the back of movements like the IndieWeb (at one end of the spectrum) and platforms like Substack (at the other), and far more people are writing in public on their own sites than they were ten years ago. Hooray! This is great for me, but how do I find all those sites to read?
I figured that the people I’m connected to on Mastodon would probably be the most likely to be writing on their own sites, so I wondered if it was possible to subscribe to all the blogs of the people I followed.
I had a few criteria:
- I only wanted to subscribe to blogs. (No feeds of updates from GitHub, for example, or posts in forums.)
- I didn’t want to have to authenticate with the Mastodon API to get this done. This felt like a job for a scraper — and Mastodon’s API is designed in such a way that you need to make several API calls to figure out each user’s profile links, which I didn’t want to do.
- I wanted to write it in an hour or two on Sunday morning. This wasn’t going to be a sophisticated project. I was going to take my son to the children’s museum in the afternoon, which was a far more important task.
On Mastodon, people can list a small number of external links as part of their profile, with any label they choose. Some people are kind enough to use the label “blog”, which is fairly determinative, but lots don’t. So I decided to take a look at every link that the people I follow on Mastodon have added to their profiles, figure out whether each one is a blog I can subscribe to, and then add the reasonably-bloggy sites to an OPML file that I could then add to an RSS reader.
Here’s the very quick-and-dirty command line tool I wrote yesterday.
Mastodon helpfully produces a CSV file that lists all the accounts you follow. I decided to use that as an index rather than crawling my instance.
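Reading that export takes just a couple of lines of Python. Here’s a minimal sketch (not the actual script), which assumes the export names its column “Account address”, as recent Mastodon versions do:

```python
import csv

def read_follows(path: str) -> list[str]:
    """Return the account addresses (user@instance) from Mastodon's follows export."""
    with open(path, newline="", encoding="utf-8") as f:
        # "Account address" is the column name in recent Mastodon exports;
        # adjust it if your instance's export differs.
        return [row["Account address"] for row in csv.DictReader(f)]
```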
Then it converts those account usernames to URLs and downloads the HTML for each profile. While Mastodon has latterly started using JavaScript to render its UI (which means the actual profile links aren’t there in the HTML to parse), it turns out that it includes profile links as rel="me" tags in the page header, so my script finds and extracts those using the indieweb link-rel parser to create the list of websites to crawl.
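Here’s a rough equivalent of those two steps, using requests and BeautifulSoup in place of the indieweb link-rel parser — so a sketch of the idea, not the script itself:

```python
import requests
from bs4 import BeautifulSoup

def profile_links(account: str) -> list[str]:
    """Fetch a Mastodon profile and collect the rel="me" links from its head."""
    user, _, instance = account.partition("@")
    url = f"https://{instance}/@{user}"  # user@instance.tld -> https://instance.tld/@user
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Mastodon mirrors each profile link as a <link rel="me"> tag in the
    # page head, even though the visible UI is rendered in JavaScript.
    return [tag["href"] for tag in soup.find_all("link", rel="me") if tag.get("href")]
```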
Once it has the list of websites, it excludes any that don’t look like they’re probably blogs, using some imperfect-but-probably-good-enough heuristics (sketched in code after this list) that include:
- Known silo URLs (Facebook, SoundCloud, etc.) are excluded.
- If the URL contains `/article`, `/product`, and so on, it’s probably a link to an individual page rather than a blog.
- Long links are probably articles or resources, not blogs.
- Pages with long URL query strings are probably search results, not blogs.
- Links to other Mastodon profiles (or Pixelfed, Firefish, and so on) disappear.
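Pulled together, the filter might look something like this. The silo list, length thresholds, and substring checks below are illustrative guesses, not the script’s actual values:

```python
from urllib.parse import urlparse

SILOS = {"facebook.com", "soundcloud.com", "twitter.com", "instagram.com"}
FEDIVERSE_HINTS = ("mastodon", "pixelfed", "firefish")
PAGE_HINTS = ("/article", "/product")

def probably_a_blog(url: str) -> bool:
    parts = urlparse(url)
    host = parts.netloc.lower().removeprefix("www.")
    if host in SILOS:
        return False  # known silo, not a personal site
    if any(hint in parts.path.lower() for hint in PAGE_HINTS):
        return False  # looks like an individual page, not a blog
    if len(url) > 80:
        return False  # long links are usually articles or resources
    if len(parts.query) > 20:
        return False  # long query strings look like search results
    if any(hint in host for hint in FEDIVERSE_HINTS):
        return False  # another fediverse profile, not a blog
    return True
```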
The script goes through the remaining list and attempts to find the feed for each page. If it doesn’t find a feed I can subscribe to, it just moves on. Any feeds that look like feeds of comments are discarded. Then, because the first feed listed is usually the best one, the script chooses the first remaining feed in the list for the page.
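Feed discovery boils down to reading the `<link rel="alternate">` tags a page advertises. Another sketch, with a naive “comment” substring check standing in for whatever the script actually uses to spot comment feeds:

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

FEED_TYPES = {"application/rss+xml", "application/atom+xml", "application/feed+json"}

def find_feed(url: str) -> str | None:
    """Return the first non-comment feed a page advertises, or None."""
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        return None  # unreachable pages are simply skipped
    soup = BeautifulSoup(html, "html.parser")
    candidates = [
        tag.get("href")
        for tag in soup.find_all("link", rel="alternate")
        if tag.get("type") in FEED_TYPES and tag.get("href")
    ]
    # Discard comment feeds; the first remaining feed is usually the best one.
    candidates = [href for href in candidates if "comment" not in href.lower()]
    return urljoin(url, candidates[0]) if candidates else None
```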
Once it’s gone through every website, it spits out a CSV and an OPML file.
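The OPML side needs nothing beyond the standard library. A minimal version, assuming the script has ended up with a flat site-to-feed mapping:

```python
import xml.etree.ElementTree as ET

def write_opml(feeds: dict[str, str], path: str) -> None:
    """Write {site_url: feed_url} pairs as a flat OPML subscription list."""
    opml = ET.Element("opml", version="2.0")
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = "Mastodon follows"
    body = ET.SubElement(opml, "body")
    for site, feed in feeds.items():
        ET.SubElement(body, "outline", type="rss", text=site, xmlUrl=feed, htmlUrl=site)
    ET.ElementTree(opml).write(path, encoding="utf-8", xml_declaration=True)
```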
After a few runs, I pushed the OPML file into NewsBlur, my feed reader of choice. It was able to subscribe to a little over a thousand new feeds. Given that I’d written the script in a little over an hour and that it was using some questionable tactics, I wasn’t sure how high-quality the sites would be, so I organized them all into a new “Mastodon follows” folder that I could unsubscribe from quickly if I needed to.
But actually, it was pretty great! A few erroneous feeds did make it through: a few regional newspapers (I follow a lot of journalists), some updates to self-hosted Git repositories, and some Lemmy feeds. I learned quickly that I don’t care for most Tumblr content — which is usually reposted images — and I found myself wishing I’d excluded it. Finally, I removed some non-English feeds that I simply couldn’t read (although I wish my feed reader had an auto-translate function so that I could).
The upshot is that I’ve got a lot more blogs to read from people I’ve already expressed interest in. Is the script anything close to perfect? Absolutely not. Is it shippable? Not really. But it did what I needed it to, and I’m perfectly happy.