Skip to main content
 

Perplexity AI Is Lying about Their User Agent

[Robb Knight]

Perplexity AI doesn't use its advertised browser string or IP range to load content from third-party websites:

"So they're using headless browsers to scrape content, ignoring robots.txt, and not sending their user agent string. I can't even block their IP ranges because it appears these headless browsers are not on their IP ranges."

On one level, I understand why this is happening, as everyone who's ever written a scraper (or scraper mitigations) might: the crawler for training the model likely does use the correct browser string, but on-demand calls likely don't to prevent them from being blocked. That's not a good excuse at all, but I bet that's what's going on.

This is another example of the core issue with robots.txt: it's a handshake agreement at best. There are no legal or technical restrictions imposed by it; we all just hope that bots do the right thing. Some of them do, but a lot of them don't.

The only real way to restrict these services is through legal rules that create meaningful consequences for these companies. Until then, there will be no sure-fire way to prevent your content from being accessed by an AI agent.

[Link]

· Links · Share this post