Over the weekend, I started to notice a bunch of artists moving to Cara, a social network for artists founded by Jingna Zhang, herself an accomplished photographer.
The fediverse is a decentralized cooperative of social networks that can interact with each other: a user on one network can follow, reply, like, and re-share content from a user on another network. The whole thing depends on an open standard called ActivityPub, shared community norms, and a cooperative culture.
Of course, my first reaction was that Cara should be compatible with the fediverse so that its content could be more easily discoverable by users on social networks like Threads, Flipboard, and Mastodon. Cara is explicitly set up to be a network for human artists, with no AI-generated content, which will become increasingly valuable as the web is flooded with machine-made art. The fediverse would allow artists to publish on sites like Cara that are set up to support their needs, while finding a broad audience across the entire web.
From its About page:
> With the widespread use of generative AI, we decided to build a place that filters out generative AI images so that people who want to find authentic creatives and artwork can do so easily.
>
> […] We do not agree with generative AI tools in their current unethical form, and we won’t host AI-generated portfolios unless the rampant ethical and data privacy issues around datasets are resolved via regulation.
I’d love to follow artists on Cara from my Mastodon or Threads accounts. But how does Cara’s AI stance square with the fediverse? How might artists on Cara find a broad audience for their work across the web without risking that art being used as training data without permission?
The first thing a site can do to prevent its content from being used as training data is to add exclusion rules to its robots.txt file. These theoretically prevent crawlers owned by model vendors like OpenAI from directly accessing art on the site. But nothing legally binds crawlers to obey robots.txt; it’s less enforceable than a handshake agreement. Still, most vendors claim that they voluntarily comply.
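As a concrete sketch, rules like the following ask the publicly documented crawlers from OpenAI, Common Crawl, and Google’s AI training pipeline to stay away; the list of user agents is illustrative rather than exhaustive, and a real deployment would need to keep it up to date:

```
# Illustrative robots.txt rules asking known AI crawlers not to fetch the site.
# Compliance is voluntary on the crawler's part.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```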
But even if robots.txt were an ironclad agreement, content published to the fediverse doesn’t solely live on its originating server. If Cara were connected to the fediverse, images posted there would still be found on its servers, but they would also be syndicated to the home servers of anyone who followed its users. If a user on Threads followed a Cara user, the Cara user’s images would be copied to Threads; if a user on a Mastodon instance followed that user, the images would be copied to that Mastodon instance. The images are copied across the web as soon as they are published; even if Cara protects its own servers from AI crawlers, those downstream fediverse servers are not guaranteed to be protected.
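To make that concrete, here is a simplified sketch of the kind of ActivityPub `Create` activity a Cara-like server might deliver to its followers’ inboxes; the domain, account, and media URL are hypothetical. Once a remote server receives a payload like this, it typically fetches and re-hosts the attached image locally (Mastodon, for example, caches remote media), at which point Cara’s own robots.txt no longer governs access to that copy:

```json
{
  "@context": "https://www.w3.org/ns/activitystreams",
  "type": "Create",
  "actor": "https://cara.example/users/artist",
  "to": ["https://cara.example/users/artist/followers"],
  "object": {
    "type": "Note",
    "content": "New painting: oils on canvas.",
    "attachment": [
      {
        "type": "Image",
        "mediaType": "image/jpeg",
        "url": "https://cara.example/media/painting.jpg"
      }
    ]
  }
}
```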
One might argue that by connecting to the fediverse, servers implicitly license their content to be reused across different services. This is markedly different from RSS, where that is explicitly not the case: there is legal precedent that says my RSS feed cannot be used to republish my content elsewhere without my permission (although you can, of course, read its content in a private feed reader; that’s the point). But on the fediverse, the ability to reshare across platforms is core functionality.
The following things are all true:
- Content published to the fediverse may be copied to and served from other people’s servers
- Those servers may have different policies regarding content use
- In the absence of a robots.txt directive, AI crawlers will scrape a website’s data, even if they don’t have the legal right to
- Some servers may themselves be owned by AI vendors and may use federated content to train generative models even without the use of a scraper
As a result, there is no way for an author to protect their work from being used in an AI training set. The owners of a fediverse site wouldn’t have the right to sell the content they host to an AI vendor, because they don’t hold the copyright to all of that content in the first place. But because AI crawlers greedily scrape content without asking for permission unless a site explicitly opts out with robots.txt, that hardly matters.
This leads me to a few conclusions:
- Every fediverse site has a moral obligation to prevent crawling of federated content, at a minimum by setting robust robots.txt directives
- Discussions about adding content licensing support to the fediverse are even more important than they appear (see the sketch after this list)
- Someone needs to legally prevent AI vendors from using all available data as training fodder
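On the licensing point, there is no adopted standard yet; as one hypothetical shape such support could take, a federated image object could carry an explicit license statement via a JSON-LD extension (here borrowing schema.org’s `license` term), which downstream servers and, ideally, crawlers could honor. The domain and URLs are illustrative:

```json
{
  "@context": [
    "https://www.w3.org/ns/activitystreams",
    { "schema": "http://schema.org/", "license": "schema:license" }
  ],
  "type": "Image",
  "mediaType": "image/jpeg",
  "url": "https://cara.example/media/painting.jpg",
  "license": "https://creativecommons.org/licenses/by-nc-nd/4.0/"
}
```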
A fediverse (and a web!) that Cara can safely join while adhering to its principles is a more functional, safer network. To build it, we’ll need to support explicit licensing on the fediverse, create a stronger standard for user protections across fediverse sites, and seek more robust legal protections against AI crawler activity. These are ambitious goals, but I believe they’re achievable, and necessary to support the artists and content creators who make the web their home.