Evaluating AI

How to think about vendors, technology, and power

AI is everywhere. The conversations are ubiquitous and the technology is evolving rapidly. It dominates the conferences I attend and the strategy blogs I read across journalism and tech. This week, even Mark Zuckerberg, seeing the writing on the wall, published an essay about Meta building “a personal superintelligence that serves everyone”.

So I thought this might be a good time to talk about how I’m thinking about AI.

This is my personal mental model (not, to be clear, a company policy at my employer). A few people have said it’s helped them make decisions about AI in their own teams, workplaces, and lives, so I thought I’d share.

My approach to evaluating AI is through two main lenses: the technology itself and the vendors who make it.

Let’s start with the vendors, since they shape how we access the technology.

AI vendors

Vendors, in turn, are best evaluated through two questions: what do they want? And how do they work?

Understanding these dynamics helps you predict how they’ll evolve and whether their interests align with yours.

What do they want?

Bluntly: they want to make money. AI vendors see a rare chance to piggyback on a major shift in technology trends and become a generational tech institution. Tech companies tend towards monopoly, not implicitly but as a declared intention: former Meta board member and Palantir chairman Peter Thiel has argued that monopolies are good for society and that “competition is for losers”.

Although OpenAI began as a nonprofit, it shares this motive, and will convert to a for-profit business. As anyone who’s read Karen Hao’s Empire of AI will tell you, this is not a break from strategy, but in line with the founders’ intentions. Anthropic, whose founders broke away from OpenAI, doesn’t fall far from the tree, and Meta and Google are, well, Meta and Google.

Microsoft and Amazon, in turn, want to be the infrastructure provider (and have made major investments in OpenAI and Anthropic respectively). Palantir wants to be the service provider for government and law enforcement. And so on.

To get there, it serves them to build models that are as generally applicable as possible, so that they can capture as many markets as possible. To do that, they need to train these models on as much data, from as many disciplines, as they possibly can; and for the models to become more useful in our work, the vendors need more access to data from inside our businesses. In turn, they either need us to provide that data willingly, or for that data collection to be impossible to avoid.

There are two main arguments being made to help us buy these products:

  • You’ll be left behind if you don’t: everyone else is using AI.
  • These will become incredible superintelligences that will remake human society.

These are pure marketing.

As I’ll mention later on, we should evaluate these technologies based on what they can do today, rather than on claims about the future. The superintelligence pitch is science fiction storytelling that we should treat with the same skepticism we gave Elon Musk when he told us we’d be going to Mars in 2024.

But the “fear of being left behind” argument is particularly dangerous: it suggests that you should pick a technology first and then find problems for it to solve. That’s always backwards. You should always start with the human problems you need to solve in order to serve your organizational strategy or personal goals, and then figure out what potential tools are in your toolbox to address them. AI might well be one of them, but we shouldn’t go looking for places to use it for its own sake.

A lot of money has been spent to encourage businesses to adopt AI, which in practice means deeply embedding these vendors’ services into their processes. The intention is to make those services integral as quickly as possible. That’s why there’s heavy sponsorship at conferences across industries, programs to sponsor adoption, and so on. Managers and board members see all this messaging and start asking, “what are we doing with AI?”, precisely because this FOMO message has reached them.

I want to be clear: I believe there are uses for AI, and it should be included as part of a modern toolbox. But never in a vacuum, never because of FOMO, and always while considering the wider context of the tool and the vendor.

How do they work?

Technically, most AI services are SaaS businesses. You usually either pay a monthly fee to use a web interface, or you pay per API call or token to access models programmatically. Enterprise plans make costs more predictable for larger organizations. Downstream software vendors often embed AI capabilities and charge for them on a per-feature basis; those vendors usually pay API fees upstream to the model vendors.

In each of these cases, the prompt makes its way to servers run by the vendor, is processed, and the response is returned. When you’re using AI features provided by downstream vendors, you don’t have a direct relationship with the upstream API provider; it may not even be clear which provider is being used. For instance, when you use AI features in Slack or Notion, your data may be processed by OpenAI or Anthropic, even though you never signed up with them directly.
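To make that data flow concrete, here’s a minimal sketch of what a hosted model call looks like, using OpenAI’s Python SDK as one example; the model name and prompt are purely illustrative. A downstream product with “AI features” is typically making a call much like this on your behalf:

```python
from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment. Everything passed in
# `messages` is transmitted to the vendor's servers before a response comes back.
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "user", "content": "Summarize this internal sales report: ..."}
    ],
)

print(response.choices[0].message.content)
```

Whatever you paste into that prompt, including any business data, leaves your machine and is processed on infrastructure you don’t control.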

This has clear data privacy implications. I think it’s completely reasonable to use such a service with public data. But as soon as you’re using sensitive data of any kind, which includes both business data and detailed information about your life, you should consider the chain of custody. At certain subscription levels, vendors will often attest that they won’t train models on your data, but that’s only half the problem: the data is still hitting their servers. Sometimes, depending on the service and subscription level, queries and responses may also be analyzed by engineers in order to improve the service.

For some sensitive data, you need to decide whether you trust a service’s security in order to use it. For others, it might not be appropriate or even legally permissible to use a service to process it. It’s also worth considering: if the service was compromised, or had weak security controls that meant a bad actor or service employee could read your data, how much would it matter? Would you be upholding your commitments, agreements, and responsibilities with your community?

It’s worth saying that local models exist: these allow you to run models on your own infrastructure, or even on your own laptop. That removes much of the service privacy risk, while not removing risks associated with the models themselves.
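As one example of what “local” can look like in practice, here’s a minimal sketch that sends a prompt to a model running on your own machine, assuming you have an Ollama server installed locally and a model already pulled (the model name and prompt are illustrative):

```python
import requests

# Assumes a local Ollama server (default port 11434) with a model pulled,
# e.g. `ollama pull llama3.1`. The prompt is processed on this machine and
# never reaches a third-party server.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": "Summarize these meeting notes: ...",
        "stream": False,
    },
    timeout=300,
)

print(resp.json()["response"])
```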

AI tech

Underlying principles

When we talk about AI in today’s technology discourse, we’re mostly talking about Large Language Models (LLMs). These are trained on massive amounts of text from the internet (as well as pirated books and other materials) to predict what word should come next in a sentence, over and over again. When you ask one a question, it isn’t actually reasoning; it’s using the patterns it learned to generate text that sounds like a plausible human response. That text can include written language, but also structured data, programming languages, and so on. These days, AI services based on LLMs can usually fetch websites in real time and include their content as source material in their answers.
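You can see that next-word prediction directly with a small open model. This is a minimal sketch using GPT-2 via the Hugging Face transformers library; the prompt is illustrative, and the exact tokens you get back will vary:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# A small, older open model is enough to illustrate the principle.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    # Scores over the whole vocabulary for the next token only.
    next_token_logits = model(**inputs).logits[0, -1]

top_ids = torch.topk(next_token_logits, 5).indices.tolist()
print([tokenizer.decode([i]) for i in top_ids])  # the five most likely next words
```

Chat products layer instructions, sampling, and tool use on top, but underneath it’s this same prediction step, repeated one token at a time.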

In addition to LLMs, generative AI (models that can produce content in response to prompts) also includes image generation, music generation, and so on. In each case, the same high-level principle applies: the model makes predictions about what the content should be, based on what it’s learned from huge amounts of training data. Services with user-generated content can be incredibly valuable for this: for example, YouTube became a giant training set for Google’s video generation, and other services may have used that dataset less legally. Similarly, xAI’s models are trained in part on X/Twitter. The open, indie blog-publishing world became fodder for any model to be trained on.

A black box for answers

Although the training data behind a model is sometimes made open, it’s very rare for a model itself to be fully open. Models are almost impossible to audit, and in fact we don’t fully know how they work! As such, they’re black boxes that take in prompts and produce answers.

Because they’re so deeply bound to their training data, they reflect any biases inherent in that data. And because vendors shape these datasets through their choices about what to include and how to weight different sources, models also reflect the biases of the companies that produce them. (You might remember that xAI’s Grok started making antisemitic comments after vendor tweaks for tone and viewpoint.)

Because they’re predictive models rather than brains that reason, they’re also often terribly wrong in ways that go far beyond bias. Hallucinations are an inherent by-product of the technology itself and not something that can easily be fixed.

Some vendor assumptions are particularly troubling. Timnit Gebru and Émile P. Torres coined the acronym TESCREAL, standing for Transhumanism, Extropianism, Singularitarianism, (modern) Cosmism, Rationalist ideology, Effective Altruism, and Longtermism, to describe an ideology that deprioritizes current risks like climate change and systemic oppression in favor of working towards the colonization of the stars and the science-fiction superintelligences that its adherents think could form the basis of a future human civilization. To get there, many of them believe we need to accelerate technological progress, even at the cost of justice and human well-being; in execution, that is indistinguishable from technofascism. It’s an absurd ideology, and it would be tempting to dismiss it were there not plenty of people in the industry who adhere to it.

It’s not that models are wildly wrong all the time, but because they can be wildly wrong, this needs to be a part of your mental model for assessing them. Similarly, it’s not that models are going to declare themselves to be MechaHitler and parrot white supremacist talking points, but the fact that one did should give everyone pause and help us consider what other biases, big and small, are being returned in their responses. It’s not that everyone in AI believes we need to ignore current problems in favor of colonizing Mars, but the fact that some of their founders do should feature prominently in how we evaluate them.

But because of the way they work, you don’t see any of those dynamics. Even though they don’t reason, are often terribly wrong, and are susceptible to bias, models simply return a confident response that looks like fact. In a world where a model vendor tends towards monopoly, its training data and corporate biases have the potential to affect how many people learn about the world, with very little transparency or auditability. That potentially gives them enormous power.

Power dynamics

We’ve already discussed how models can encode the goals and biases of the vendors who build them. But there’s more to consider about how AI bakes in certain kinds of power dynamics.

The first is that black box model. The Trump Administration seeks to remove “woke AI” from the federal government, an intent that will affect the priorities of every AI vendor. By dictating how models answer questions and make recommendations that touch on societal imbalances and issues like climate change, they will have a significant impact on how people learn about the world. It’s effectively saying that software systems need to stay in line with a hard-right ideology.

Workers’ rights are at risk too. The hallucinations and bias inherent to the technology should make it clear that AI should never be used to replace a human employee. It can be used to augment their work, much as a spreadsheet, a grammar checker, or a web browser does. But there always needs to be a human not just “in the loop” but in control of the process. The AI must always be in the hands of a real person. I also tend to think that AI output should never be seen by an end user (a reader at a publication, a consumer of a report, and so on): they’re potentially useful tools to speed up someone’s work, but the end result still needs to be human.

Still, some people very much want to use AI to trim their workforces and increase their profit margins, regardless of the drawbacks. Inevitably, that mostly affects people at the lower end of the ladder, although it can occur everywhere below the strategic management layer. Notably, this appears to have been the guiding philosophy behind DOGE, the repurposed government department that fired huge swathes of government employees, installing AI models in their absence (and then sometimes re-hiring them as it became clear that this approach didn’t actually work).

Another power dynamic is content ownership. I’ve mentioned that not all of the data used in training sets was legally obtained. Many large publishers are suing AI vendors over this; Anthropic was found to have potentially broken the law when it pirated books for this purpose, and is now subject to a class action suit by authors whose books were stolen. Larger publishers and authors with representation can afford to conduct these suits; independent artists cannot. The result is that people who already have power and a platform will see any benefit from a legal win, while independents (who are more likely to come from vulnerable communities) are far less likely to.

And the impact on vulnerable communities is intense. People in developing nations help to train the models by correcting, filtering, and labeling data, often for low compensation and with high potential exposure to upsetting or traumatic material. AI datacenters are draining local water supplies and spewing toxic emissions, often in poorer communities. And a great deal of investment in AI is for military use cases, where models are sometimes used to select targets for assassination in places like Gaza.

Too often, in other words, AI systems allow value to be extracted from poorer communities for the benefit of richer ones. As these systems become more entrenched in everyday life, these dynamics become locked in.

So how can I use AI?

The answer to “how can I use AI?” is the same as the answer for any other technology: carefully, and with a strong handle on your needs and values.

Make sure you start with real, human problems. What are you trying to solve for yourself or your organization? Every technical solution must be in response to a human problem.

Evaluate services not just through cost/benefit, but through the lenses of values and liability. Who are the vendors and what do they believe in? How do they work? How might their ethical stances create financial, technical, or reputational risk in the future?

Be clear-eyed about what the products can do today. Ignore hyperbolic claims about the future and phrases like “superintelligence” or “artificial general intelligence”. What are they capable of now? Where do they excel and what are their shortcomings? How will you deal with bias and incorrect answers? For creative work specifically, consider whether you’re undermining human creativity and livelihoods for marginally useful output.

Follow the data. What data are you handing to whom? Many third-party AI-based services are thin veneers over OpenAI or Anthropic. What are you being encouraged to hand over? Who has custody of it? What are their commitments? What are they capable of doing to it, regardless of their commitments?

Maintain human control. Keep a human not just in the loop but fully in control of AI processes. Remember that, because of bias and hallucinations, fully autonomous AI may be risky.
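One way to make that concrete in your own tooling is a simple approval gate: the model can draft, but nothing happens until a person explicitly signs off. This is a minimal sketch; draft_with_model and send_reply are hypothetical placeholders for whatever model and delivery code you actually use.

```python
def draft_with_model(prompt: str) -> str:
    # Placeholder: call whatever model you use here (hosted or local).
    return "Dear customer, thanks for your patience. Here is what we found..."

def send_reply(text: str) -> None:
    # Placeholder: your email or CRM integration.
    print("Sent:", text)

def human_approval_gate(draft: str) -> str | None:
    """Show an AI-generated draft to a person; proceed only if they approve it."""
    print("--- AI draft (review before anything is sent) ---")
    print(draft)
    decision = input("Approve, edit, or discard? [a/e/d]: ").strip().lower()
    if decision == "a":
        return draft
    if decision == "e":
        return input("Your edited version: ")
    return None  # discarded: nothing goes out without a human's sign-off

draft = draft_with_model("Reply to this customer complaint: ...")
approved = human_approval_gate(draft)
if approved is not None:
    send_reply(approved)
```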

Maintain optionality. Are you locking an AI vendor into your critical processes? Consider what will happen to your business if the vendor radically changes the functionality of their service or its pricing model.

I think there’s a lot of scope for personal and local models to be useful — particularly with tightly-scoped tasks that aren’t trying to replicate human creativity.

A lot of engineers now use agentic AI to build software; I see that as less problematic than many use cases, and it will be particularly powerful when those models can be run internally within an organization and tailored to their coding preferences and history.

Similarly, pipelines for streamlining and classifying data are proving to be really interesting. Models can take fairly vague instructions and reach out to web services, sources, and databases to create structured datasets. Those sets still need human oversight, but they can save a ton of time. AI is pretty decent at finding patterns in data.
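As a sketch of what such a pipeline can look like, here’s a minimal example that asks a hosted model to turn free-text support tickets into structured records, with a flag for human review; the model name, categories, and tickets are all illustrative, and a local model could be swapped in instead.

```python
import json
from openai import OpenAI

client = OpenAI()  # illustrative; a local model server could be used instead
CATEGORIES = ["billing", "technical issue", "feature request", "other"]

def classify_ticket(text: str) -> dict:
    """Turn one free-text support ticket into a structured record."""
    prompt = (
        f"Classify this support ticket into one of {CATEGORIES} and extract the "
        f"product name if one is mentioned. Reply with JSON only, with the keys "
        f'"category" and "product".\n\nTicket: {text}'
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for machine-readable output
    )
    record = json.loads(resp.choices[0].message.content)
    record["needs_review"] = record.get("category") not in CATEGORIES  # flag for a human
    return record

tickets = ["My March invoice is wrong", "The export button crashes the app"]
print([classify_ticket(t) for t in tickets])
```

The model does the tedious first pass; a person still reviews the output, and anything that falls outside the expected categories gets flagged explicitly.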

Local models that connect to your calendar and other local productivity tools also have the potential to be useful, as do models provided by your calendar or cloud productivity host itself, under the same terms and conditions, if you use one. For example, you can ask a model to give you an overview of what to prepare for a meeting. On-device tools like Apple Intelligence work this way, and although its summaries are famously awful, it does make useful suggestions. Google’s built-in Gemini tools can similarly provide helpful nudges. I wouldn’t ask them to write an email, but they can be useful for checking what you’ve missed or summarizing things you need to understand. If you’re using Google’s tools, you’ve already bought into its cloud, and these AI services are bound by the same agreement.

Given the privacy and power centralization concerns, my suspicion is that we’ll see more local, on-device AI in the future. This will also alleviate the need for huge datacenters.

Ultimately: no, you won’t be left behind if you don’t use AI. There’s a lot to be gained by resisting the hype cycle and staying true to your own needs and values. But it’s also not true that there’s no utility in AI: many of these tools really can speed you up, as long as you’re mindful of their realities and understand their shortcomings.

Further reading

Some books, blogs, and websites to consider reading in order to deepen your knowledge: