Robots.txt needs an update for the 2020s. Instead of just saying what content can be indexed, it should also grant rights.
Like crawl my site only to provide search results not train your LLM.
Call it license.txt.
The robots.txt standard allows a website owner to specify which crawlers (by user agent) can access which parts of a website. It’s been useful, as far as it goes: for example, a website owner can theoretically tell Google or Bing to not index certain pages. But there’s no way of restricting use in the way Dare suggests.
Imagine extending the standard so that further use of the content can be specified in the same way as user agent. Content uses might include:
- archive (eg, archival by a service like the Internet Archive)
- search (eg, use in a search engine like Google)
- model (eg, scraping by LLM providers like OpenAI)
- republish (eg, Creative Commons or open source)
Each one might have parameters. For example:
- republish:BY-NC-SA would specify the Creative Commons attributes “attribution, non-commercial, share-alike”
- republish:(license URL) would specify use of a license with a specific URI, including commercial licenses whose terms would be available at that URL.
In addition to the opt-out “disallow” wording employed by robots.txt, license.txt would default to opt-in and use an “allow” keyword.
A simple license.txt file might then look like:
Use: archive, search Allow: / Disallow: /admin Use: model Disallow: / Use: republish:BY-NC Allow: /free Use: republish:Allow: /
This would allow all site content to be archived and indexed for search, aside from an /admin path. The site would not allow crawling for model use. Pages under the /free path would allow republishing for non-commercial purposes while the whole site would be licensable for commercial purposes using a license available at the specified URL.
One of the disadvantages of robots.txt is that it’s a “secret” URL that you have to know about to discover. To that end, the specification also allows for meta tags in the header of each page. Likewise, license tags in the header of an HTML page should also be supported, specifying uses on a given page:
<meta name="license" content="archive, search, republish:BY-NC">
These tag values would specify which rights are allowed under the license. For example, the tag above allows the page to be archived and indexed for search, as well as republished under BY-NC. Because “model” isn’t specified, it’s understood that licensing for use in LLM models is not granted.