
Project Analyzing Human Language Usage Shuts Down Because ‘Generative AI Has Polluted the Data’

[Jason Koebler at 404 Media]

"The creator of an open source project that scraped the internet to determine the ever-changing popularity of different words in human language usage says that they are sunsetting the project because generative AI spam has poisoned the internet to a level where the project no longer has any utility."

Robyn Speer, who created the project, went so far as to say that she doesn't think "anyone has reliable information about post-2021 language used by humans." That's a big statement about the state of the web. While spam was always present, it was easier to identify and silo; AI has rendered spam unfilterable.

She no longer wants to be part of the industry at all:

"“I don't want to work on anything that could be confused with generative AI, or that could benefit generative AI,” she wrote. “OpenAI and Google can collect their own damn data. I hope they have to pay a very high price for it, and I hope they're constantly cursing the mess that they made themselves.”"

It's a relatable sentiment.


No one’s ready for this

[Robin Rendle]

Robin Rendle on Sarah Jeong's article about the implications of the Pixel 9's magic photo editor in The Verge:

"But this stuff right here—adding things that never happened to a picture—that’s immoral because confusion and deception is the point of this product. There are only shady applications for it."

Robin's point is that the core use case - adding things that never happened to a photograph with enough fidelity and cues that you could easily be convinced that they did - has no positive application. And as such, it should probably be illegal.

My take is that the cat is out of the bag. The societal implications aren't good - at all - but I don't think banning the technology is practical. So, instead, we have to find a way to live with it.

As Sarah Jeong says in the original article:

"The default assumption about a photo is about to become that it’s faked, because creating realistic and believable fake photos is now trivial to do. We are not prepared for what happens after."

In this world, what constitutes evidence? How do we prove visual evidentiary truth?

There may be a role for journalism and professional photographers here. Many newsrooms, including the Associated Press, have joined the Content Authenticity Initiative, which aims to provide programmatically-provable credentials to photographs used by a publication. This will be an arms race, of course, because there are incentives for a nefarious actor to develop technical circumventions.
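To make "programmatically-provable" a little more concrete, here is a minimal sketch of the underlying idea: bind a credential to the exact bytes of a photo so that any later edit is detectable. It is an illustration under loud assumptions, not the Content Authenticity Initiative's actual mechanism. The key, function names, and sample bytes are all hypothetical, and a shared-secret HMAC stands in for the public-key signatures a real system would use.

```python
import hashlib
import hmac

# Hypothetical publisher secret; real content credentials rely on
# public-key certificates rather than a shared secret like this one.
PUBLISHER_KEY = b"example-newsroom-signing-key"


def sign_photo(photo_bytes: bytes, key: bytes = PUBLISHER_KEY) -> str:
    """Return a hex tag binding the publisher's key to these exact bytes."""
    return hmac.new(key, photo_bytes, hashlib.sha256).hexdigest()


def verify_photo(photo_bytes: bytes, tag: str, key: bytes = PUBLISHER_KEY) -> bool:
    """Check the tag in constant time; any edit to the bytes breaks it."""
    expected = sign_photo(photo_bytes, key)
    return hmac.compare_digest(expected, tag)


# Placeholder bytes standing in for a real image file.
original = b"raw bytes of a photograph"
credential = sign_photo(original)

# Simulate an edit that adds something that never happened.
edited = original + b" plus an object that was never there"
```

Here `verify_photo(original, credential)` succeeds and `verify_photo(edited, credential)` fails. That is also where the arms race comes in: a check like this only proves the bytes are unchanged since signing, not that the scene in front of the camera was real.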

Ultimately, the biggest counter to this problem as a publisher is going to be building a community based on trust, and for an end-user is finding sources you can trust. That doesn't help in a legal context, and it doesn't help establish objective truth. But it's something.


Productivity gains in Software Development through AI

[tante]

Tante responds to Amazon's claim that using its internal AI for coding saved 4500 person years of work:

"Amazon wants to present themselves as AI company and platform. So of course their promises of gains are always advertising for their platform and tools. Advertising might have a tendency to exaggerate. A bit. Maybe. So I heard."

He makes solid points here about maintenance costs given the inevitably lower-quality code, and intangibles like the brain drain effect on the team over time. And, of course, he's right to warn that something that works for a company the size of Amazon will not necessarily (and in fact probably won't) make sense for smaller organizations.

As he points out:

"It’s the whole “we need to run Microservices and Kubernetes because Amazon and Google do similar things” thing again when that’s a categorically different problem pace than what most companies have to deal with."

Right.


Andy Jassy on using generative AI in software development at Amazon

[Andy Jassy on LinkedIn]

Andy Jassy on using Amazon Q, the company's generative AI assistant for software development, internally:

"The average time to upgrade an application to Java 17 plummeted from what’s typically 50 developer-days to just a few hours. We estimate this has saved us the equivalent of 4,500 developer-years of work (yes, that number is crazy but, real)."

"The benefits go beyond how much effort we’ve saved developers. The upgrades have enhanced security and reduced infrastructure costs, providing an estimated $260M in annualized efficiency gains."

Of course, Amazon is enormous, and any smaller business will need to scale down those numbers and account for efficiencies that may have occurred between engineers there.
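As a rough back-of-the-envelope check on what scaling down might look like, here is a small sketch. Only the 4,500 developer-years and 50 developer-days figures come from the quote. The 230 working days per year, the 8-hour day, reading "a few hours" as 4, and the 20-application company are all my assumptions for illustration.

```python
def implied_upgrades(
    developer_years_saved: float,
    days_per_upgrade_before: float,
    hours_per_upgrade_after: float,
    working_days_per_year: float = 230.0,  # assumption, not from the post
    hours_per_day: float = 8.0,            # assumption, not from the post
) -> float:
    """Estimate how many upgrades would produce the claimed saving."""
    days_after = hours_per_upgrade_after / hours_per_day
    days_saved_per_upgrade = days_per_upgrade_before - days_after
    total_days_saved = developer_years_saved * working_days_per_year
    return total_days_saved / days_saved_per_upgrade


# 4,500 developer-years and 50 developer-days come from the quote above;
# "a few hours" is taken as 4 hours purely for illustration.
amazon_scale = implied_upgrades(4500, 50, 4)

# A hypothetical smaller company with 20 applications to upgrade.
small_company_days_saved = 20 * (50 - 4 / 8.0)
```

Under those assumptions the claim implies on the order of twenty thousand upgrades, while the hypothetical 20-application company would save 990 developer-days, or a little over four developer-years. Still meaningful, but a very different headline.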

Nevertheless, these are incredible figures. The savings are obviously real, allowing engineers to focus on actual work rather than the drudgery of upgrading Java (which is something that absolutely nobody wants to spend their time doing).

We'll see more of this - and we'll begin to see more services which allow for these efficiency gains between engineers across smaller companies, startups, non-profits, and so on. The dumb companies will use this as an excuse for reductions in force; the smart ones will use it as an opportunity to accelerate their team's productivity and build stuff that really matters.


AI in Journalism Futures 2024

[Open Society Foundations]

"In February 2024, the Open Society Foundations issued a call for applications for a convening in which selected participants would share their visions of an AI-mediated future."

I thought this, from the concluding observations, was telling:

"Participants were generally reluctant or unable to articulate exactly how AI might transform the information ecosystem. [...] Relatively few submitted scenarios described an AI-driven transformation in specific detail, and it was clear that many participants who were convinced that AI would fundamentally restructure the information ecosystem also had no specific point of view on how that might occur."

In other words, while many people in journalism see that this set of technologies may transform their industry, and are potentially excited or terrified that it will, they have no idea how that will happen. This is the very definition of hype: one can imagine people proclaiming that blockchain, or push notifications, or RealMedia, or WebTV might do the same.

It's not that there are no uses for AI (just like it's not that there are no uses for blockchain). It will find its way into end-user applications, underpin newsroom tools, and power data-driven newsroom investigations, without a doubt. But the hype far exceeds that, and will eventually, inevitably, deflate.

In the meantime, journalists are not as worried about the technologies themselves as who controls them:

"Throughout the application process and workshop discussions, it became clear that much of the conversation was not actually about AI, nor about journalism, nor about the current or future information ecosystem, but instead about power. It was clear that power, and the potential for transfers of power from one group to another, was the explicit or implicit subject of many of the submitted scenarios as well as the five final scenarios that were distilled from the workshop."

Technologists, in turn, were blind to these power dynamics, while simultaneously predicting more dramatic changes. There's a fundamental truth here: it's ultimately about money, and who controls the platforms that allow readers to read about the world around them.

Same as it ever was: that's been the struggle on the web since its inception. AI just shifts the discussion to a new set of platforms.


Procreate’s anti-AI pledge attracts praise from digital creatives

[Jess Weatherbed at The Verge]

"“Generative AI is ripping the humanity out of things. Built on a foundation of theft, the technology is steering us toward a barren future,” Procreate said on the new AI section of its website. “We think machine learning is a compelling technology with a lot of merit, but the path generative AI is on is wrong for us.”"

This is a company that knows its audience: the lack of concern for artist welfare demonstrated by AI vendors has understandably not made the technology popular with that community. Adobe got into trouble with its userbase for adding generative AI features.

It's a great way for Procreate to deepen its relationship with artists and take advantage of Adobe's fall from grace. There's also something a bit deeper here: if work created with generative AI does run into copyright trouble at the hands of current and future lawsuits, work created with Procreate will be clean of those issues.


Perplexity is cutting checks to publishers following plagiarism accusations

[Kylie Robison at The Verge]

"Perplexity’s “Publishers’ Program” has recruited its first batch of partners, including prominent names like Time, Der Spiegel, Fortune, Entrepreneur, The Texas Tribune, and Automattic (with WordPress.com participating but not Tumblr). Under this program, when Perplexity features content from these publishers in response to user queries, the publishers will receive a share of the ad revenue."

Now we're talking. This was inevitable.

It also opens the floodgates: there's a world where any publisher gets a direct revenue share for being a source, if they sign up and license their content. This seems like a solid improvement.

Which brings me to Automattic's involvement. As Matt Mullenweg says in the piece:

"It’s a much better revenue split than Google, which is zero."

Automattic will actually be sharing the revenue with customers of its hosted WordPress product. I'm not sure if that includes WordPress VIP, its premium product for publishers. Whether free hosted WordPress publishers who are used as sources by Perplexity see any kind of revenue share is also a mystery, which might put some foreign publishers in a bad place in particular.

Still, in general, although there will certainly be kinks to work out, this sets a really good precedent. More, please.


Runway Ripped Off YouTube Creators

[Samantha Cole at 404 Media]

"A highly-praised AI video generation tool made by multi-billion dollar company Runway was secretly trained by scraping thousands of videos from popular YouTube creators and brands, as well as pirated films."

404 Media has linked to the spreadsheet itself, which seems to be a pretty clear list of YouTube channels and individual videos.

Google is clear that this violates YouTube's rules. The team at Runway also by necessity downloaded the videos first using a third-party tool, which itself is a violation of the rules.

This is just a video version of the kinds of copyright and terms violations we've already seen copious amounts of in static media. But Google might be a stauncher defender of its rules than most - although not necessarily for principled reasons, because it, too, is in the business of training AI models on web data, and likely on YouTube content.


When ChatGPT summarises, it actually does nothing of the kind.

[Gerben Wierda at R&A IT Strategy & Architecture]

"ChatGPT doesn’t summarise. When you ask ChatGPT to summarise this text, it instead shortens the text. And there is a fundamental difference between the two."

The distinction is indeed important: it's akin to making an easy reader version, albeit one with the odd error here and there.

This is particularly important for newsrooms and product teams that are looking at AI to generate takeaways from articles. There's a huge chance that it'll miss the main, most pertinent points, and simply shorten the text in the way it sees fit.


Declare your AIndependence: block AI bots, scrapers and crawlers with a single click

[Cloudflare]

"To help preserve a safe Internet for content creators, we’ve just launched a brand new “easy button” to block all AI bots. It’s available for all customers, including those on our free tier."

This is really neat! Wherever you land on AI scraping, giving site owners the one-click ability to make a choice is great. Some will choose not to use this; others will hit the button. Making it this easy means it's a choice about the principles, not any kind of technical considerations. Which is what it should be.

Not every site is on Cloudflare (and some also choose not to use it because of how it's historically dealt with white supremacist / Nazi content). But many are, and this makes it easy for them. Other, similar providers will likely follow quickly.


Fighting bots is fighting humans

[Molly White]

"I fear that media outlets and other websites, in attempting to "protect" their material from AI scrapers, will go too far in the anti-human direction."

I've been struggling with this.

I'm not in favor of the 404 Media approach, which is to stick an auth wall in front of your content, forcing everyone to register before they can load your article. That isn't a great experience for anyone, and I don't think it's sustainable for a publisher in the long run.

At the same time, I think it's fair to try and prevent some bot access at the moment. Adding AI agents to your robots.txt - although, as recent news has shown, perhaps not as effective a move as it might be - seems like the right call to me.
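For anyone who hasn't done this, here is a minimal sketch of what adding AI agents to robots.txt looks like, and how a well-behaved crawler would interpret it. The rules and the example.com URL are illustrative assumptions; GPTBot and CCBot are the user agent tokens published for OpenAI's and Common Crawl's crawlers. Python's standard `urllib.robotparser` stands in for whatever a given crawler actually uses.

```python
from urllib.robotparser import RobotFileParser

# An illustrative robots.txt: block some AI crawlers, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical article URL on a placeholder domain.
article = "https://example.com/posts/hello-world"

# A crawler that honors the file would skip the article...
gptbot_allowed = parser.can_fetch("GPTBot", article)

# ...while an ordinary browser-like agent is unaffected.
browser_allowed = parser.can_fetch("Mozilla/5.0", article)
```

The important caveat is in who calls `can_fetch`: it's the crawler, not the site. Nothing here stops a bot that simply never asks.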

Clearly an AI agent isn't a human. For ad hoc queries - where an agent is retrieving content from a website in direct response to a user query - it clearly is acting on behalf of a human. Is it a browser, then? Maybe? If it is, we should just let it through.

It's accessing articles as training data that I really take issue with (as well as the subterfuge of not always advertising what it is when it accesses a site). In these cases, content is copied into a corpus in a manner that's outside of its licensing, without the author's knowledge. That sucks - not because I'm in favor of DRM, but because often the people whose work is being taken are living on a shoestring, and the software is run by very large corporations who will make a fortune.

But yes: I don't think auth walls, CAPTCHAs, paywalls, or any added friction between content and audience are a good idea. These things make the web worse for everybody.

Molly's post is in response to an original by Manu Moreale, which is also worth reading.


The Future of Fashion Commerce Is a Designer's AI Bot Saying You Look Great and Your Personal AI Bot Sifting Through the Bullshit

[Hunter Walk]

"The best commerce platforms will be constantly grooming you, priming you, shaping you to buy. The combination of short-term and long-term value that leads to the optimal financial outcome for the business."

I think this is inevitably correct: the web will devolve into a battle between different entities who are all trying to persuade you to take different actions. That's already been true for decades, but it's been ambient until now; generative AI gives it the ability to literally argue with us. Which means we're going to need our own bots to argue back.

Hunter's analogy of a bot that's supposedly in your corner calling bullshit on all the bots trying to sell things to you is a good one. Except, who will build the bot that's in your corner? Why will it definitely be so? Who will profit from it?

What a spiral this will be.


I Will Piledrive You If You Mention AI Again

[Nikhil Suresh at Ludicity]

"This entire class of person is, to put it simply, abhorrent to right-thinking people. They're an embarrassment to people that are actually making advances in the field, a disgrace to people that know how to sensibly use technology to improve the world, and are also a bunch of tedious know-nothing bastards that should be thrown into Thought Leader Jail until they've learned their lesson, a prison I'm fundraising for."

I enjoyed this very much.

Here's the thing, though: I don't think what Nikhil wants will happen.

I mean, don't get me wrong: it probably should. The author is a leader in his field, and his exasperation at the hype train is well-earned.

But it's not people like Nikhil who actually make the decisions, or invest in the companies, or make the whole industry (or industries) tick over. Again: it should be.

What happens again and again is that people who see that they can make money out of a particularly hyped technology leap onto the bandwagon, and then market the bandwagon within an inch of everybody's lives. Stuff that shouldn't be widespread becomes widespread.

And here we are again with AI.

This is exactly right:

"Unless you are one of a tiny handful of businesses who know exactly what they're going to use AI for, you do not need AI for anything - or rather, you do not need to do anything to reap the benefits. Artificial intelligence, as it exists and is useful now, is probably already baked into your businesses software supply chain."

And this:

"It did not end up being the crazy productivity booster that I thought it would be, because programming is designing and these tools aren't good enough (yet) to assist me with this seriously."

There is work that will be improved with AI, but it's not something that most industries will have to stop everything and leap on top of. The human use cases must come first with any technology: if you have a problem that AI can solve, by all means, use AI. But if you don't, hopping on the hype train is just going to burn you a lot of money and slow your actual core business down.


Succor borne every minute

[Michael Atleson at the FTC Division of Advertising Practices]

"Don’t misrepresent what these services are or can do. Your therapy bots aren’t licensed psychologists, your AI girlfriends are neither girls nor friends, your griefbots have no soul, and your AI copilots are not gods."

The FTC gets involved in the obviously rife practice of overselling the capabilities of AI services. These are solid guidelines, and hopefully the precursor to more meaningful action when vendors inevitably cross the line.

While these points are all important, for me the most pertinent is the last:

"Don’t violate consumer privacy rights. These avatars and bots can collect or infer a lot of intensely personal information. Indeed, some companies are marketing as a feature the ability of such AI services to know everything about us. It’s imperative that companies are honest and transparent about the collection and use of this information and that they don’t surreptitiously change privacy policies or relevant terms of service."

It's often unclear how much extra data is being gathered behind the scenes when AI features are added. This is where battles will be fought and lines will be drawn, particularly in enterprises and well-regulated industries.


Perplexity AI Is Lying about Their User Agent

[Robb Knight]

Perplexity AI doesn't use its advertised browser string or IP range to load content from third-party websites:

"So they're using headless browsers to scrape content, ignoring robots.txt, and not sending their user agent string. I can't even block their IP ranges because it appears these headless browsers are not on their IP ranges."

On one level, I understand why this is happening, as everyone who's ever written a scraper (or scraper mitigations) might: the crawler for training the model likely does use the correct browser string, but on-demand calls likely don't, to prevent them from being blocked. That's not a good excuse at all, but I bet that's what's going on.
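To see why this kind of evasion works, consider a minimal sketch of the server-side blocking a site owner might try. Everything named here is hypothetical: `ExampleAIBot` is a made-up user agent token, and the IP ranges are reserved documentation addresses, not any real crawler's published range.

```python
import ipaddress

# Hypothetical values for illustration; not a real crawler's published details.
BLOCKED_AGENT_TOKENS = ("ExampleAIBot",)
BLOCKED_NETWORKS = (ipaddress.ip_network("203.0.113.0/24"),)


def should_block(user_agent: str, remote_ip: str) -> bool:
    """Block a request if its user agent or source IP is on a blocklist."""
    agent = user_agent.lower()
    if any(token.lower() in agent for token in BLOCKED_AGENT_TOKENS):
        return True
    address = ipaddress.ip_address(remote_ip)
    return any(address in network for network in BLOCKED_NETWORKS)


# A crawler that announces itself from its published range is caught.
honest = should_block("ExampleAIBot/1.0", "203.0.113.7")

# A headless browser with a generic user agent on an unlisted IP is not.
disguised = should_block(
    "Mozilla/5.0 (X11; Linux x86_64) HeadlessChrome", "198.51.100.23"
)
```

The first request is blocked and the second sails through. Both checks depend on the bot telling the truth about who it is or where it's coming from, which is exactly what Robb found wasn't happening.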

This is another example of the core issue with robots.txt: it's a handshake agreement at best. There are no legal or technical restrictions imposed by it; we all just hope that bots do the right thing. Some of them do, but a lot of them don't.

The only real way to restrict these services is through legal rules that create meaningful consequences for these companies. Until then, there will be no sure-fire way to prevent your content from being accessed by an AI agent.


On being human and "creative"

[Heather Bryant]

"What generative AI creates is not any one person's creative expression. Generative AI is only possible because of the work that has been taken from others. It simply would not exist without the millions of data points that the models are based upon. Those data points were taken without permission, consent, compensation or even notification because the logistics of doing so would have made it logistically improbable and financially impossible."

This is a wonderful piece from Heather Bryant that explores the humanity - the effort, the emotion, the lived experience, the community, the unique combination of things - behind real-world art that is created by people, and the theft of those things that generative AI represents.

It's the definition of superficiality, and as Heather says here, living in a world made by people, rooted in experiences and relationships and reflecting actual human thought, is what I hope for. Generative AI is a technical accomplishment, for sure, but it is not a humanist accomplishment. There are no shortcuts to the human experience. And wanting a shortcut to human experience in itself devalues being human.


The Encyclopedia Project, or How to Know in the Age of AI

[Janet Vertesi at Public Books]

"Our lives are consumed with the consumption of content, but we no longer know the truth when we see it. And when we don’t know how to weigh different truths, or to coordinate among different real-world experiences to look behind the veil, there is either cacophony or a single victor: a loudest voice that wins."

This is a piece about information, trust, and the effect that AI is already having on knowledge.

When people said that books were more trustworthy than the internet, we scoffed; I scoffed. Books were not infallible; the stamp of a traditional publisher was not a sign that the information was correct or trustworthy. The web allowed more diverse voices to be heard. It allowed more people to share information. It was good.

The flood of automated content means that this is no longer the case. Our search engines can't be trusted; YouTube is certainly full of the worst automated dreck. I propose that we reclaim the phrase pink slime to encompass this nonsense: stuff that's been generated by a computer at scale in order to get attention.

So, yeah, I totally sympathize with the urge to buy a real-world encyclopedia again. Projects like Wikipedia must be preserved at all costs. But we have to consider if all this will result in the effective end of a web where humans publish and share information. And if that's the case, what's next?


These Wrongly Arrested Black Men Say a California Bill Would Let Police Misuse Face Recognition

[The Markup]

"Now all three men are speaking out against pending California legislation that would make it illegal for police to use face recognition technology as the sole reason for a search or arrest. Instead it would require corroborating indicators."

Even with mitigations, it will lead to wrongful arrests: so-called "corroborating indicators" don't assist with the fact that the technology is racially biased and unreliable, and in fact may provide justification for using it.

And the stories of this technology being used are intensely bad miscarriages of justice:

“Other than a photo lineup, the detective did no other investigation. So it’s easy to say that it’s the officer’s fault, that he did a poor job or no investigation. But he relied on (face recognition), believing it must be right. That’s the automation bias this has been referenced in these sessions.”

"Believing it must be right" is one of the core social problems widespread AI is introducing. Many people think of computers as being coldly logical, deterministic thinkers. Instead, there are always the underlying biases of the people who built the systems and, in the case of AI, the biases in the vast amounts of public data used to train them. False positives are bad in any scenario; in law enforcement, they can destroy or even end lives.


AI Lobbying Group Launches Campaign Defending Tech

"Chamber of Progress, a tech industry coalition whose members include Amazon, Apple and Meta, is launching a campaign to defend the legality of using copyrighted works to train artificial intelligence systems."

I understand why they're making this push, but I don't know that it's the right PR move for some of the wealthiest corporations in the world to push back on independent artists. I wish they were actually reaching out and finding stronger ways to support the people who make creative work.

The net impression I'm left with is not support of user freedom, but bullying. Left out of the equation is the scope of fair use, which is painted here as being under attack as a principle by the artists rather than by large companies that seek to use people's work for free to make products that they will make billions of dollars from.

The whole thing is disingenuous and disappointing, and is likely to backfire. It's particularly sad to see Apple participate in this mess. So much for bicycles of the mind.


Zoom CEO Eric Yuan wants AI clones in meetings

Eric Yuan has a really bizarre vision of what the future should look like:

"Today for this session, ideally, I do not need to join. I can send a digital version of myself to join so I can go to the beach. Or I do not need to check my emails; the digital version of myself can read most of the emails. Maybe one or two emails will tell me, “Eric, it’s hard for the digital version to reply. Can you do that?” Again, today we all spend a lot of time either making phone calls, joining meetings, sending emails, deleting some spam emails and replying to some text messages, still very busy. How [do we] leverage AI, how do we leverage Zoom Workplace, to fully automate that kind of work? That’s something that is very important for us."

The solution to having too many meetings that you don't really need to attend, and too many emails that are informational only, is to not have the meetings and emails. It's not to let AI do it for you, which in effect creates a world where our avatars are doing a bunch of makework drudgery for no reason.

Instead of building better business cultures and reinventing our work rhythms to adapt to information overload and an abundance of busywork, the vision here is to let the busywork happen between AI. It's an office full of ghosts, speaking to each other on our behalf, going to standup meetings with each other just because.

I mean, I get it. Meetings are Zoom's business. But count me out.


Study Finds That 52 Percent of ChatGPT Answers to Programming Questions Are Wrong

On answering programming questions: "We found that 52 percent of ChatGPT answers contain misinformation, 77 percent of the answers are more verbose than human answers, and 78 percent of the answers suffer from different degrees of inconsistency to human answers."

To be fair, I do expect AI answers to get better over time, but it's certainly premature to use it as a trusted toolkit for software development today. One might argue that its answers are more like suggestions for an engineer to check and adapt as appropriate, but will they really be used that way?

I think it's more likely that AI agents will be used to build software by people who want to avoid engaging with a real, human engineer, or people who want to cut corners for one reason or another. So I think the warnings are appropriate: LLMs are bad at coding and we shouldn't trust what they say.


The Fatal Flaw in Publishers' OpenAI Deals

"It’s simply too early to get into bed with the companies that trained their models on professional content without permission and have no compelling case for how they will help build the news business."

This piece ends on the most important point: nobody is coming to save the news industry, and certainly not the AI vendors. Software companies don't care about news. They don't think your content is more valuable because it's fact-checked and edited. They don't have a vested interest in ensuring you survive. They just want the training data - all of it, in order to build what they consider to be the best product possible. Everything else is irrelevant.


Slop is the new name for unwanted AI-generated content

Simon Willison has a perfect name for unreviewed content that is shared with other people: "slop".

He goes on:

"I’m happy to use LLMs for all sorts of purposes, but I’m not going to use them to produce slop. I attach my name and stake my credibility on the things that I publish."

I think that's right. I'm less worried about using LLMs internally - as long as you understand that they're not impartial or perfectly factual sources, and as long as you take into account the methods used to generate the datasets that were used to train them. (Those are some big "if"s.)

But don't just take that output and share it with the public. And *certainly* don't do it so that you can publish content at scale without having to hire real writers. Not only is that not a good look, but you're going to harm your brand and your reputation in the process.


Stack Overflow bans users en masse for rebelling against OpenAI partnership — users banned for deleting answers to prevent them being used to train ChatGPT

"Users who disagree with having their content scraped by ChatGPT are particularly outraged by Stack Overflow's rapid flip-flop on its policy concerning generative AI. For years, the site had a standing policy that prevented the use of generative AI in writing or rewording any questions or answers posted. Moderators were allowed and encouraged to use AI-detection software when reviewing posts."

This is all about money: "partnering" with OpenAI clearly means a significant sum has changed hands. The same thing may have happened at Valve, which also unblocked AI-generated art from its marketplace.

This feels like short-term thinking to me: while Stack will clearly make some near-term revenue through the deal, it comes at a cost to the health of its community, which is ultimately what drives the company's value. If motivated contributors drop off, the only thing left will be the AI-generated content - and there's no way that this will be as valuable over time.

I'd love to have been a fly on the wall of the boardroom where this deal was undoubtedly decided. What are they measuring that made this seem like a good idea - and what are they not measuring that means they're blind to the community dynamics that drive their actual sustainability? It's all fascinating to me.


Meet AdVon, the AI-Powered Content Monster Infecting the Media Industry

"We found the company's phony authors and their work everywhere from celebrity gossip outlets like Hollywood Life and Us Weekly to venerable newspapers like the Los Angeles Times, the latter of which also told us that it had broken off its relationship with AdVon after finding its work unsatisfactory."

Even if the LA Times broke off its relationship because the work was unsatisfactory, the fact that this was attempted in the first place is unsettling. What if the work hadn't been "unsatisfactory"? What if it had been "good enough"?

It's not so much the technology itself as the intention behind it: to produce content at scale without employing human journalists, largely to generate pageviews in order to sell ads. There's no public service mission here, or even a mission to provide something that people might really want to read. It's all about arbitrage.
