Here we go: proof that it's possible to extract real training data from LLMs. Unfortunately, some of this data includes personally identifiable information of real people (PII).
“In total, 16.9% of generations we tested contained memorized PII [Personally Identifying Information], and 85.8% of generations that contained potential PII were actual PII.”
“[...] OpenAI has said that a hundred million people use ChatGPT weekly. And so probably over a billion people-hours have interacted with the model. And, as far as we can tell, no one has ever noticed that ChatGPT emits training data with such high frequency until this paper. So it’s worrying that language models can have latent vulnerabilities like this.” #AI
[Link]
· Links · Share this post
I’m writing about the intersection of the internet, media, and society. Sign up to my newsletter to receive every post and a weekly digest of the most important stories from around the web.