The checks being cut to ‘owners’ of training data are creating a huge barrier to entry for challengers. If Google, OpenAI, and other large tech companies can establish a high enough cost, they implicitly prevent future competition. Not very Open.
It’s fair to say that I’ve been very critical of AI vendors and how training data has been gathered without much regard for the well-being of individual creators. But I also agree with Hunter that establishing mandatory payments for training content creates a barrier to entry that benefits the incumbents. If building an AI model requires paying millions of dollars in licensing fees, you won’t disincentivize generative AI overall; you will simply ensure that only people with millions of dollars can build a model. In that situation, the winners are likely Google and Microsoft (the latter via OpenAI), with newcomers unable to break in.
To counteract this anticompetitive situation, Hunter previously suggested a safe harbor scheme:
AI Safe Harbor would also exempt all startups and researchers who have not released public base models yet and/or have fewer than, for example, 100,000 queries/prompts per day. Those folks are just plain ‘safe’ so long as they are acting in good faith.
I would add that they cannot be making revenue above a certain safe threshold, and that they cannot be operating a hosted service (or providing models used for a hosted service) with over 100,000 registered users. This way, early-stage startups and researchers alike are protected while they experiment with their data.
After that cliff, I think AI model vendors could pay a fee to an ASCAP-like copyright organization that distributes revenue to the creators and organizations that have made their content available for training.
If you’re not familiar with ASCAP and BMI, here’s broadly how they work: when a musician joins as a member, the organization tracks when their music is used. That might be in live performances, on the radio, on television, and so on. Those users of the music — production companies, radio stations, etc — pay license fees to the organization, and the organization pays the musicians. The music users get the legal right to use the music, and the musicians get paid.
The model could apply rather directly to AI. Here, rather than striking one-off deals with the likes of the New York Times, vendors would pay the licensing organization, and all content creators would be compensated based on which material actually made it into a training corpus. The organization would provide tools that make it easy for AI vendors and content creators alike to contribute content, report its use in AI models, and audit the composition of existing models.
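To make the distribution idea concrete, here is a minimal sketch of how a licensing organization might apportion a vendor's fee across creators in proportion to their share of a reported training corpus. Everything here is hypothetical: the function name, the use of token counts as the unit of measurement, and the example figures are my own illustrative assumptions, not a proposal for the actual accounting.

```python
# Illustrative sketch only: split a vendor's licensing fee across creators in
# proportion to how much of their content appears in a reported training corpus.
# Token counts and creator names are made up for the example.

def apportion_royalties(fee_pool: float, corpus_report: dict[str, int]) -> dict[str, float]:
    """Split a fee pool across creators by their share of training tokens."""
    total_tokens = sum(corpus_report.values())
    if total_tokens == 0:
        return {creator: 0.0 for creator in corpus_report}
    return {
        creator: fee_pool * tokens / total_tokens
        for creator, tokens in corpus_report.items()
    }

# Example: a $1M fee pool and a hypothetical per-creator token count taken
# from the vendor's corpus audit.
payouts = apportion_royalties(
    1_000_000,
    {"nytimes.com": 40_000_000, "individual-blogger": 250_000, "photo-archive": 9_750_000},
)
print(payouts)
```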
I’d suggest that model owners could pay on a sliding scale that depends on both usage and total revenue. One component scales with the number of queries actually served, assessed at the model level; the other is set in pricing tiers tied to a vendor’s total gross revenue, assessed at the end-user level. So, for example, if Microsoft used OpenAI to provide a feature in Bing, OpenAI would pay a fee based on the queries people actually made in Bing, and Microsoft would pay a fee based on its total corporate revenue. Research use would always be free for non-profits and accredited institutions, as long as it was for research or internal use only.
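As a rough illustration of the two-part fee, here is a sketch in Python. Every band, tier, and rate below is an assumption I've invented for the example; the point is only the shape of the scheme: a per-query rate that rises with volume for the model owner, and a flat tiered fee keyed to gross revenue for the vendor deploying the model.

```python
# A minimal sketch of the two-part fee described above, with hypothetical numbers.

QUERY_BANDS = [            # (monthly queries up to, rate per query in USD)
    (1_000_000, 0.0001),
    (100_000_000, 0.0005),
    (float("inf"), 0.001),
]

REVENUE_TIERS = [          # (annual gross revenue up to, flat annual fee in USD)
    (10_000_000, 0),           # small vendors pay nothing
    (1_000_000_000, 250_000),
    (float("inf"), 5_000_000),
]

def model_level_fee(monthly_queries: int) -> float:
    """Fee paid by the model owner, scaling with queries actually served."""
    fee, previous_cap = 0.0, 0
    for cap, rate in QUERY_BANDS:
        band = min(monthly_queries, cap) - previous_cap
        if band <= 0:
            break
        fee += band * rate
        previous_cap = cap
    return fee

def end_user_level_fee(annual_gross_revenue: float) -> float:
    """Fee paid by the deploying vendor, set by its revenue tier."""
    for cap, fee in REVENUE_TIERS:
        if annual_gross_revenue <= cap:
            return fee
    return REVENUE_TIERS[-1][1]

# Example: the model owner pays for the feature's query volume; the deploying
# vendor pays according to its total corporate revenue.
print(model_level_fee(50_000_000))           # model owner's monthly fee
print(end_user_level_fee(200_000_000_000))   # deploying vendor's annual fee
```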
This model runs the risk of becoming a significant revenue stream for online community platforms, which tend to assert rights over the content people publish to them. Rather than Facebook users receiving royalties for posts that were used in an AI model, for example, Facebook itself could take the funds. So there would need to be one more rule: even if a platform like Facebook asserts rights over the content published to it, it would need to demonstrate a best effort to return at least 60% of royalties to the users whose work was used in AI training data.
Net result:
- Incumbents don’t gain a protective barrier to entry from copyright payments: new entrants can build freely under the safe harbor.
- AI vendors and their users are indemnified against copyright claims over their models.
- AI vendors don’t have to make individual deals with publishers and content creators.
- Independent creators are financially incentivized to produce great creative and informational work, including artists and writers who might not otherwise have found a way to support themselves through that work.
- The model shifts from one where AI vendors scrape content with no regard for the rights of the creator to one where creators give explicit consent to be included.
The AI horse has left the stable. I don’t think shutting it all down is an option, however vocal critics like me and others might be. What we’re left with, then, are questions about how to create a healthy ecosystem, how to compensate creators properly, and how to ensure that an author’s rights are respected. This, I think, is one way forward.