Daniel van Strien, a machine learning librarian at Hugging Face, took a million Bluesky posts and turned them into a dataset expressly for training AI models:
“This dataset could be used for ‘training and testing language models on social media content, analyzing social media posting patterns, studying conversation structures and reply networks, research on social media content moderation, [and] natural language processing tasks using social media data,’ the project page says. ‘Out of scope use’ includes ‘building automated posting systems for Bluesky, creating fake or impersonated content, extracting personal information about users, [and] any purpose that violates Bluesky's Terms of Service.’”
There was an outcry among users, who felt they hadn’t consented to such use. The idea that a generative AI model could be used to build new content from their work without their participation, consent, or awareness was appalling.
Van Strien eventually came to see his act as a violation and removed the dataset, posting an apology on Bluesky:
I've removed the Bluesky data from the repo. While I wanted to support tool development for the platform, I recognize this approach violated principles of transparency and consent in data collection. I apologize for this mistake.
Which is true! Just because something can be done, that doesn’t mean it should be. It was a violation of community norms even if it wasn’t a legal violation.
Bluesky subsequently shared a statement with 404 Media and The Verge about its future intentions:
“Bluesky is an open and public social network, much like websites on the Internet itself. Just as robots.txt files don't always prevent outside companies from crawling those sites, the same applies here. We'd like to find a way for Bluesky users to communicate to outside orgs/developers whether they consent to this and that outside orgs respect user consent, and we're actively discussing how to achieve this.”
It turns out a significant number of users moved away from X not because of the far-right rhetoric that’s become prevalent on the platform, but because they objected to the company using their content to train AI models. Many of them were aghast to discover that building a training dataset from Bluesky posts was even possible. This event has illustrated, in a very accessible way, the downside of an open, public, permissionless platform: the data is available to anyone.
There is a big difference in approach here: on X, the platform owner trains models on platform data for its own profit, whereas Bluesky does not itself train models and is trying to figure out how to surface user consent. But the outcome in both cases may be similar: a generative model trained on user data, from which someone other than the people who wrote the underlying posts may profit.
The same is true on Mastodon, although gathering a central dataset of every Mastodon post is much harder because of the decentralized nature of the network. (There is one central Bluesky interface and API endpoint; Mastodon has thousands of interoperating community instances with no central access point or easy way to search the whole network.) And, of course, it’s true of the web itself. Despite being made of billions of independent websites, the web has been crawled for datasets many times, by Common Crawl as well as the likes of Google and Microsoft, which have well-established crawler infrastructure for their search engines. Because website owners generally want their content to be found, they’ve allowed search engine bots to crawl it; repurposing those crawls to gather training data for generative models was a bait and switch that wiped away decades of built-up trust.
So the problem Bluesky is dealing with is not so much a problem with Bluesky itself or its architecture as one that’s inherent to the web and to building training datasets from publicly available data. Van Strien’s original act laid bare the difference in culture between the AI and open social web communities: in the former, it’s commonplace to grab data if it can be read publicly (or sometimes even if it can’t), regardless of licensing or author consent, while on open social networks, consent and authors’ rights are central community norms.
There are a few ways websites and web services can help prevent content they host from being swept up into training data for generative models. All of them require active participation from AI vendors: effectively they must opt in to doing the right thing.
- Block AI crawlers using robots.txt. A robots.txt file has long been used to direct web crawlers (a sketch follows this list). It’s a handshake agreement at best: there’s no legal enforcement, and we know that AI developers and vendors have sometimes ignored it.
- Use Do Not Train. Spawning, a company led in part by Mat Dryhurst and the artist Holly Herndon, has established a Do Not Train registry that already contains more than 1.5 billion entries. The name was inspired by Do Not Track, a standard for opting out of user tracking that was established in 2009 but never widely adopted by advertisers (who had no incentive to do so). Do Not Train has fared better so far: it has been respected by several newer models, including Stable Diffusion.
- Use ai.txt to dictate how data can be used. Spawning has also established ai.txt, an AI-specific version of robots.txt that dictates how content can be used in training data.
- Establish a new per-user standard for consent. All of the above work best on a per-site basis, but it’s hard for a platform to let a crawler know that some users consent to having their content used as training data while others do not. Bluesky is likely evaluating how this might work on its platform; whatever is established there will almost certainly also work on other decentralized platforms like Mastodon. I imagine it might include on-page metadata and tags incorporated into the underlying AT Protocol data for each user and post (a purely speculative sketch of what such a record might look like follows the robots.txt example below).
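To make the robots.txt option concrete, here is a minimal sketch of what a site could publish today. The user-agent strings shown (GPTBot for OpenAI, CCBot for Common Crawl, Google-Extended for Google’s AI training) are tokens those crawlers are known to honor; other crawlers would need their own entries, and compliance remains entirely voluntary on the crawler’s side.

```
# robots.txt sketch: ask known AI training crawlers to stay away,
# while leaving ordinary search indexing untouched.

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Everything else, including normal search engine bots, may still crawl.
User-agent: *
Allow: /
```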
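And as a thought experiment for the per-user standard: one can imagine a small consent record attached to each account’s data, which crawlers would be expected to check before ingesting that user’s posts. Nothing like this exists in the AT Protocol today; the record type and fields below are invented purely for illustration.

```json
{
  "$type": "com.example.consent.aiTraining",
  "allowTraining": false,
  "allowedUses": [],
  "updatedAt": "2024-11-27T00:00:00Z"
}
```

The point of the sketch is simply that consent could travel with the user’s own data rather than living in a site-wide file, which is what a per-user standard would require.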
I’m in favor of legislation to make these measures binding instead of opt-in. Without binding measures, vendors are free to prioritize profit over user rights, perpetuating a cycle of exploitation. The key here is user consent: I should be able to say whether my writing, photos, art, and other work can be used to train an AI model. If my content is valuable enough, I should have the right to sell a license to it for this (or any) purpose. Today, that is impossible, and vendors are arguing that broad collection of training data is acceptable under fair use rules.
This won’t stifle innovation: plenty of content is available, and many authors do consent to their work being used in training data. It doesn’t ban AI or prevent its underlying mechanisms from working. It simply gives authors a say in how their work is used.
By prioritizing user consent and accountability, we can create a web where innovation and respect for creators coexist, without disallowing entire classes of technology. That’s the fundamental vision of an open social web: one where everyone has real authorial control over their content, but where new tools can be built without having to ask for permission or go through gatekeepers. We’re very close to realizing it, and these conversations are an important way to get there.
I’m writing about the intersection of the internet, media, and society. Sign up to my newsletter to receive every post and a weekly digest of the most important stories from around the web.