The exact types of data from each platform going to each company are not spelled out in documentation we’ve reviewed, but internal communications reviewed by 404 Media make clear that deals between Automattic, the platforms’ parent company, and OpenAI and Midjourney are imminent.
Various arms of Automattic made subsequent clarifications. Specifically, it seems like premium versions of WordPress’s online platform, like the WordPress VIP service that powers sites for major newsrooms, will not sell user data to AI platforms.
This feels like a direct example of my point about how the relationship between platforms and users has been redefined. It appears that free versions of hosted Automattic platforms will sell user data by default, while premium versions will not.
Reddit announced a similar deal last week, and in total has made deals worth $203M for its content. WordPress powers over 40% of the web, which, given these numbers, could lead to a significant payday for the company. Much of that is on the self-hosted open source project rather than sites powered by Automattic, but that number gets fuzzier once you consider the Jetpack and Akismet plugins.
From a platform’s perspective, AI companies might look like a godsend. Platforms hold an open license to tens or hundreds of millions of users’ content, often going back years, and suddenly, thanks to AI vendors’ need for legal, structured content to train on, the real market value of that content has shot up. It wouldn’t surprise me to see new social platforms emerge with underlying data models designed specifically to sell to AI vendors. Finally, “selling data” is the business model it was always purported to be.
It’s probably no surprise that publishers are a little less keen, although there have been well-publicized deals with Axel Springer and the Associated Press. The deals OpenAI is offering to news companies for their content tend to top out at $5M each, for one thing. But social platforms don’t trade on the content itself: they’re scalable businesses because they’re building conduits for other people’s posts. Their core value is the software and an enormous, engaged user base. In contrast, publishers’ core value really is the articles, art, audio, images, and video they produce; the hard-reported journalism, the unscalable art, and the slow-burning communities that emerge around those things. Publishing doesn’t scale. The rights to that work should not be given away easily. The incentives between platforms and AI vendors are more or less aligned; the incentives between publishers and AI vendors are not.
I don’t think bloggers and social video producers should give those rights away easily either. They might not be publishing companies with large bodies of work, but the integrity of what they produce still matters.
For WordPress users, it’s kind of a bait and switch.
While writers may be using the free, hosted version of a publishing platform like WordPress, they retain the moral right of authorship:
As defined by the Berne Convention for the Protection of Literary and Artistic Works, an international agreement governing copyright law, moral rights are the rights “to claim authorship of the work and to object to any distortion, mutilation or other modification of, or other derogatory action in relation to, the said work, which would be prejudicial to his honor or reputation.”
The hosted version of WordPress contains this sentence about ownership in its TOS:
We don’t own your content, and you retain all ownership rights you have in the content you post to your website.
A reasonable person could therefore infer that their content would not be licensed to an AI vendor. And yet, that seems to be on the cards.
So now what?
If every platform is more and more likely to sell user data to AI platforms over time, the only way to object is to start to use self-hosted indieweb platforms.
But every public website can also be scraped directly by AI vendors, in some cases even if the site uses the Robots Exclusion Protocol, which has been used for decades to keep search engine bots from indexing content the site owner hasn’t authorized. A large platform can sue for violation of content licenses, but individual publishers are unlikely to have the means, unless they gather together and form a collective organization that can fight on their behalf.
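To make the Robots Exclusion Protocol point concrete, here is a minimal sketch in Python of how a well-behaved crawler would consult a site's robots.txt. The rules, the example.com URL, and the `is_allowed` helper are all hypothetical, chosen for illustration; `GPTBot` is used as an example of an AI crawler's user-agent token. The check happens entirely on the crawler's side, so a scraper that never runs it is not stopped by anything here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ask one AI crawler to stay out, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given rules permit user_agent to fetch url.

    This is the voluntary check a polite crawler performs before fetching.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/2024/my-post/"  # hypothetical URL
    print("GPTBot:", is_allowed("GPTBot", page))
    print("SomeSearchBot:", is_allowed("SomeSearchBot", page))
```

The protocol is a request, not an enforcement mechanism: the file only has an effect when the party doing the scraping chooses to read and honor it.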
If every public website is more and more likely to be scraped by AI vendors over time, the only way to object is to thwart the scrapers. That can be done electronically, but that’s an arms race between open source platforms and well-funded AI vendors. Joining together and organizing collectively is perhaps more effective; organizing for regulations that can actually hold vendors to account would be more effective still.
It’s time for publishers, writers, artists, musicians, and everyone who publishes cultural work for a living (or for themselves) to start working together and pushing back. The rights of the indie website are every bit as important as the rights of organizations like the New York Times that do have the funds to sue. And really, truly, it’s time for legislators to take notice of the untrustworthy, exploitative actions of these vendors and their platform accomplices.