Most "agentic AI" is theater. Here's how to tell the demos from the deployments.

Apr 26, 2026

Sahil Saini

I sat through a vendor demo last month. The agent didn't fail once. It didn't write to anything. It didn't escalate, it didn't say "I don't know," it didn't ask a human a question. It just answered, beautifully, every single time. The CEO called it "a production deployment serving thousands of users daily."

I think I was the only one in the room who left bothered.

Not because the demo was fake. Demos are always staged a little, that's fine. But because nothing in those fifteen minutes told me whether the system worked outside that room. And the audience didn't ask. The conference moved on. Someone forwarded the deck to procurement. A check is probably getting written this quarter.

This is the moment we're in. Agentic AI is the loudest category in enterprise software right now. Gartner has it on the Hype Cycle. McKinsey wrote a piece. Every Big 4 has a practice. Most of what's shipping is theater. The hardest part of being a buyer in 2026 isn't finding agentic AI. It's separating the people who are actually doing the work from the people who got really good at filming themselves doing it.

I run a Palantir partner shop. We build and ship models and multi-agent orchestrations on AIP & Foundry. Some of ours have worked. Some haven't. The ones that worked all share a few traits. The ones that didn't all failed in similar ways. So this isn't theory, it's just patterns we keep seeing, on our own work and on competitors' demos and at every conference I've been to this year. Naming them helps.

Why this is happening

Theater is the rational response to the buyer's incentives. Read that again. Theater is rational.

A buyer wants to see progress. A board wants to see progress. The CIO has a check to write and an enterprise AI line item that has to show movement by the next quarter. Anybody pitching them a slow, careful, "let's design the eval suite first" approach is going to lose to anybody pitching them a slick recorded demo of an agent that books meetings, files tickets, and drafts emails inside fifteen minutes.

It's the same dynamic as gym ads vs. actual fitness. Promising a body in 90 days outsells "you have to show up Tuesday and Thursday for the next five years" by a margin of ten thousand to one.

So vendors built for the buying meeting, not the production cutover. Of course they did. They wanted to eat.

The tax shows up later, in the 95% pilot abandonment rate Gartner is now reporting, in the McKinsey number where 88% of orgs use AI but only 39% see any EBIT impact. It's an industry-wide accounting problem. We sold the demo, the demo got bought, the system never came, the numbers don't move, and nobody wants to be the executive who admits it on a quarterly call.

I don't blame the vendors. I blame the rest of us for letting "demo" and "deployment" use the same vocabulary.

Seven patterns of agentic theater

Here's a taxonomy. Not exhaustive, not academic, just the seven I see most often. Each one has a test you can run inside a single meeting.

1. The recorded demo (and its cousin, the cherry-picked happy path)

The agent is shown via Loom, or via a sandbox the vendor refuses to let you touch, or via a live screen but with the same five questions the deck previewed.

Test: Ask them to enter a question you write on the spot. Watch their hands. If they pause, glance at the SDR, or steer you back to the deck, the agent doesn't generalize. You're being shown a movie.

2. The "agent" that's actually a workflow

There's a chat UI. You ask it something. It runs a script that's been pre-defined step by step. There's no decision-making, no tool selection on the fly, no failure handling. It's an if-then chain wearing a sweater.

Test: Ask "what does it do when [external API] is down?" If the answer is "we'd build in a fallback" instead of "it picks an alternate tool and notes the failure," it's a workflow. Workflows are fine. Just don't pay agentic prices for one.
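Here's roughly what the second answer looks like in code. This is a minimal Python sketch, not anyone's product: the tool names, the ToolError exception, and the run_with_fallback helper are all hypothetical. A real agent would let the model choose the alternate tool. Here the candidates are just tried in order, so the only thing on display is "move to another tool and note the failure."

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


class ToolError(Exception):
    """Raised by a tool when its backing service is unavailable (hypothetical)."""


@dataclass
class StepResult:
    answer: Optional[str]
    tool_used: Optional[str]
    failures: list = field(default_factory=list)  # failures are noted, not swallowed


def run_with_fallback(query: str, tools: dict) -> StepResult:
    """Try each candidate tool in order and record every failure along the way."""
    failures = []
    for name, tool in tools.items():
        try:
            return StepResult(answer=tool(query), tool_used=name, failures=failures)
        except ToolError as exc:
            failures.append(f"{name}: {exc}")
    # Nothing worked: return no answer plus the failure list so a human can pick it up.
    return StepResult(answer=None, tool_used=None, failures=failures)


def primary_api(query: str) -> str:
    # Stand-in for the external API being down.
    raise ToolError("HTTP 503 from primary API")


def cached_snapshot(query: str) -> str:
    # Stand-in for an alternate data source.
    return f"(from cache) result for {query!r}"


result = run_with_fallback(
    "open invoices for ACME",
    {"primary_api": primary_api, "cached_snapshot": cached_snapshot},
)
print(result.tool_used, result.failures)
```

The part worth looking for in a vendor's version is the failures list. A workflow either crashes or hides the outage. An agent tells you which tool it gave up on and why.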

3. The fake HITL

There's a thumbs-up / thumbs-down button next to the AI output. The vendor calls this "human in the loop."

It isn't. HITL means the human's correction is captured structurally, with the original output, the corrected version, the user, a timestamp, and a reason code, and that record flows back into your eval set and your training data.

Test: Ask to see the HITL schema. If there isn't one, or if the answer is "users can edit the response," it's a feedback button, not HITL. Big regulatory difference when the auditor calls.
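If you've never seen one, a HITL schema can be this small. The sketch below is Python and entirely illustrative: the CorrectionRecord name, the reason codes, and the to_eval_case helper are assumptions I'm making for the example, not a standard and not any platform's API. The fields are the five named above.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from enum import Enum


class ReasonCode(str, Enum):
    # Hypothetical codes; a real list comes from the domain and the auditor.
    FACTUAL_ERROR = "factual_error"
    MISSING_CONTEXT = "missing_context"
    POLICY_VIOLATION = "policy_violation"


@dataclass(frozen=True)
class CorrectionRecord:
    original_output: str      # what the agent said
    corrected_output: str     # what the human changed it to
    user_id: str              # who made the correction
    corrected_at: datetime    # when, in UTC
    reason_code: ReasonCode   # why, from a fixed vocabulary

    def to_eval_case(self, prompt: str) -> dict:
        """Turn the correction into a regression case: the human's version is the expected answer."""
        return {
            "input": prompt,
            "expected": self.corrected_output,
            "rejected": self.original_output,
        }


record = CorrectionRecord(
    original_output="Refund approved for order 1182.",
    corrected_output="Refund denied: order 1182 is outside the 30-day window.",
    user_id="reviewer-17",
    corrected_at=datetime.now(timezone.utc),
    reason_code=ReasonCode.POLICY_VIOLATION,
)
print(asdict(record))
print(record.to_eval_case("Can order 1182 be refunded?"))
```

A thumbs-down button gives you one bit. This gives you a row you can hand to an auditor and a test case you can hand to the eval suite.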

4. The read-only agent

The AI generates a recommendation. A human copies that recommendation into another system, where the actual work happens. The AI never writes to the ERP, the EMR, the OMS, or the ticketing system. It's never the source of a state change. It's an advisor with a chat window.

This is fine if it's priced and positioned as decision support. It's not fine if it's sold as "AI that runs your operations."

Test: Ask the vendor to walk you through one real write-back path, end to end. If they pivot to "we integrate with your existing systems via Zapier" or "our customer's team handles the data entry from there," you're paying for a smart summary tool, not an agent.

5. The ship-by-vibes setup (no eval, no benchmark, no comparison)

The vendor can't tell you their model's accuracy on their own benchmark. There's no eval suite. There's no scoring rubric. Performance is asserted, not measured. Improvements are felt, not tested.

Test: "What's your current pass rate against your own eval suite, and what's the change from last quarter?" A real team answers in numbers in five seconds. A theater shop answers with "we monitor quality continuously."

6. The 2019 RPA / chatbot / BI vendor with a new coat of paint

Look at the product's history. If the architecture diagram from 2022 had a rules engine and a workflow builder, and the 2026 version has those same components plus an "LLM Reasoner" box bolted on, you're looking at a rebrand. The new logo, the new pricing, the new "agentic" positioning, all of it landed on a foundation that wasn't built for non-deterministic systems.

Test: Ask what fundamentally changed in the platform between 2023 and now. If the honest answer is "we added GPT integration," they're a rules engine with an LLM accessory. Could be fine for narrow use cases, just be honest with yourself about what you're buying.

7. The "agent that never says I don't know"

Every demo answer is confident. Nothing is flagged uncertain. The agent never asks a clarifying question or escalates to a human. In production, this is the most dangerous pattern, because confident wrong answers in regulated work cost real money.

A 2026 Compliance Hub roundup tracked over $145,000 in U.S. court sanctions over AI hallucinations in Q1 alone, including a $110K sanction in Oregon and a license suspension in Nebraska. That's the price of agents that never say "I don't know."

Test: Ask the agent something it shouldn't be able to answer. If it answers anyway, you've just seen the production failure mode in a controlled environment. Don't ignore it.
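The opposite behavior isn't exotic. A rough Python sketch, with loud assumptions: the Draft type, the route function, and both thresholds are hypothetical, and I'm assuming you already have a calibrated confidence score and a separate scope check, which is the hard part in practice. The routing itself is three branches.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ANSWER = "answer"
    ESCALATE = "escalate"
    ABSTAIN = "abstain"


@dataclass
class Draft:
    text: str
    confidence: float  # assumed to be a calibrated score in [0, 1]
    in_scope: bool     # assumed output of a separate scope check


def route(draft: Draft, answer_at: float = 0.85, escalate_at: float = 0.5) -> Action:
    """Decide what to do with a draft answer. Thresholds are illustrative only."""
    if not draft.in_scope:
        return Action.ABSTAIN       # "I don't know" beats a confident guess
    if draft.confidence >= answer_at:
        return Action.ANSWER
    if draft.confidence >= escalate_at:
        return Action.ESCALATE      # plausible but unverified: send to a human
    return Action.ABSTAIN


for draft in (
    Draft("Filing deadline is March 3.", confidence=0.95, in_scope=True),
    Draft("Probably covered under clause 4.", confidence=0.60, in_scope=True),
    Draft("The statute says...", confidence=0.99, in_scope=False),
):
    print(route(draft).value, "<-", draft.text)
```

Note the third case: high confidence, out of scope, still an abstain. If a demo agent has no branch like that, every answer you saw went down the first path.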


What a real deployment actually has

If theater is what's missing, deployment is what's there. Four things, mostly. None of them are glamorous. All of them are the difference between a quarter-long pilot and a system that's still running in three years.

Write-back with audit

The agent changes the state of a system of record. Idempotently. With a log of who/what/when. With a rollback path if the wrong change goes through. We wrote about this in our "production-ready on day one" piece, and it's the single biggest thing that separates a research project from a system.
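To make those three properties concrete, here's a toy version in Python. It's an in-memory stand-in, not a real ERP client: SystemOfRecord, AuditEntry, and the idempotency-key scheme are all hypothetical names for the sketch. The rollback is deliberately naive. It restores the value captured at write time and ignores any writes that happened in between.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any


@dataclass
class AuditEntry:
    idempotency_key: str
    actor: str          # who
    record_id: str      # what
    field_name: str
    old_value: Any
    new_value: Any
    at: datetime        # when


class SystemOfRecord:
    """In-memory stand-in for an ERP, EMR, or ticketing system (hypothetical)."""

    def __init__(self) -> None:
        self.records = {}
        self.audit_log = []
        self._applied = {}

    def write(self, key: str, actor: str, record_id: str, field_name: str, value: Any) -> AuditEntry:
        # Idempotent: replaying the same key returns the original entry and changes nothing.
        if key in self._applied:
            return self._applied[key]
        record = self.records.setdefault(record_id, {})
        entry = AuditEntry(key, actor, record_id, field_name,
                           record.get(field_name), value, datetime.now(timezone.utc))
        record[field_name] = value
        self.audit_log.append(entry)
        self._applied[key] = entry
        return entry

    def rollback(self, key: str, actor: str) -> AuditEntry:
        # A rollback is itself a logged write that restores the captured old value.
        original = self._applied[key]
        return self.write(f"rollback:{key}", actor, original.record_id,
                          original.field_name, original.old_value)


sor = SystemOfRecord()
sor.write("req-001", "agent:triage", "ticket-42", "status", "open")
sor.write("req-002", "agent:triage", "ticket-42", "status", "closed")
sor.write("req-002", "agent:triage", "ticket-42", "status", "closed")  # retry, no second change
sor.rollback("req-002", "human:ops-lead")
print(sor.records["ticket-42"]["status"], len(sor.audit_log))
```

The retry on req-002 leaves no second log entry, and the rollback leaves one of its own. That's the shape to ask a vendor for: every state change, including the undo, has a row with a name on it.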

Evals as a deploy gate, not a quarterly report

The eval suite runs on every change. If the score drops below a threshold, the deploy doesn't go through. AIP Evals on Foundry is the native way to do this, and it's free with the platform. Most theater vendors aren't using it because they'd rather not have the numbers.
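The gate logic itself is a few lines. The sketch below is plain Python, not the AIP Evals API: EvalCase, pass_rate, gate, the exact-match scoring, and both thresholds are assumptions made for illustration. Real suites usually score with rubrics or graders, not string equality.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    prompt: str
    expected: str


def pass_rate(cases, predict) -> float:
    """Fraction of cases where the prediction exactly matches the expected answer."""
    passed = sum(1 for case in cases if predict(case.prompt).strip() == case.expected)
    return passed / len(cases)


def gate(current: float, baseline: float, floor: float = 0.90, max_drop: float = 0.02) -> bool:
    """Allow the deploy only if the score clears the floor and has not
    regressed by more than max_drop against the last released baseline."""
    return current >= floor and (baseline - current) <= max_drop


cases = [
    EvalCase("2+2", "4"),
    EvalCase("capital of France", "Paris"),
    EvalCase("3*3", "9"),
    EvalCase("opposite of hot", "cold"),
]

# Stub model for the example: gets three of the four cases right.
canned = {"2+2": "4", "capital of France": "Paris", "3*3": "9", "opposite of hot": "warm"}
score = pass_rate(cases, lambda prompt: canned[prompt])

print(f"pass rate {score:.2f}:", "deploy allowed" if gate(score, baseline=0.80) else "deploy blocked")
```

In a CI job, a blocked gate would exit non-zero so the pipeline stops. It's the same two numbers as the five-second answer in pattern 5: the current pass rate and the change from the last release.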

HITL that captures the correction

The human's correction goes into a structured record. It's an asset, not a UX nicety. It becomes training data, eval data, and the audit story for the regulator. The Human Validation Station pattern we use on petition signature work came out of getting yelled at by a state auditor. Once.

A deprecation plan

Real systems get retired. Eventually the model changes, the workflow changes, the business changes. A vendor who hasn't thought about how their system gets swapped out is building you a load-bearing dependency without a permit. Ask them. Watch the face.

Why this matters now (and why I'm not just venting)

The consequences are getting expensive. Three trends are converging in 2026:

  • Regulators caught up. The EU AI Act is in force, ISO/IEC 42001 is becoming the default enterprise standard for AI management, and U.S. courts are sanctioning AI hallucinations on a case-by-case basis. "We didn't know" isn't a defense anymore.
  • CFOs caught up. McKinsey's late-2025 number was 39% of organizations reporting EBIT impact from AI. The other 61% are about to get put on a budget. Theater is great until the line item gets a magnifying glass.
  • Builders caught up. AIP Evals, AIP Logic, Compute Modules, Ontology SDK, all generally available. There's no technical excuse anymore for shipping read-only theater in 2026. The platforms support real deployment. Vendors who aren't shipping real deployment are choosing not to.

If you bought theater in 2024, fine, the category was new. If you buy theater in late 2026, you're not learning. You're just paying for a slower version of the same lesson.

What to do as a buyer

A short list. Apply it to us and to everyone else.

  1. Ask for the eval suite. Not a screenshot. The suite.
  2. Ask for the write-back path. End to end. Names of systems, transaction shape, audit record location.
  3. Ask for the HITL schema. Words don't count, only the schema.
  4. Off-script every demo. Type a question they didn't prepare for. See what happens.
  5. Get the production go-live milestone in the SOW. With acceptance criteria. With a payment hold.

That's it. Four questions, one contract clause. You'll cut your funnel by 60% and triple your odds of shipping.

What to do as a builder

Stop shipping demos that work and start shipping systems that fail well. Failure modes are the real product. If your agent can say "I don't know," can escalate to a human, can mark a write-back attempt as failed and recover, you've already shipped more system than 80% of the market.

Side note: this is mostly self-talk. We've built theater too, in the early days. The petition signature project taught us painfully that an agent which is confident on a forged signature is worse than a human flagging it manually. We rebuilt half the pipeline. The Agentic Quality Conductor module is what came out of it. We didn't put out a press release about the rebuild. Probably should have.

What I'm not saying

I'm not saying every agentic AI vendor is a fraud. Most of them aren't. The vast majority are smart people genuinely trying to ship hard things, in a buyer's market that punishes them for being honest about the timeline. Some of the theater is the buyer's fault, not the vendor's. We get the pitch decks we deserve.

I'm also not saying Palantir partners are above this. We're not. There are Palantir-stamped pitches in the wild right now that would fail every test on this list. The certification is a floor, not a ceiling.

And I'm certainly not saying AKOS is perfect at this. We've shipped late. We've shipped things that needed three eval rounds before they earned trust. We talked one customer out of a project last quarter because we couldn't get to "production-ready" inside their budget and didn't want to take their money for a movie. That decision didn't feel good. It was the right one.

What am I missing?

If you're a buyer who's been burned, or a builder who disagrees with any of these patterns, or a vendor who thinks I just described you unfairly, write me. The fastest way to be wrong less often in this market is to be willing to be wrong loudly in public. I've been wrong about plenty.

We're not the only ones doing real work. Some of the partners we compete with do this exact stuff and do it well. The point isn't tribalism, it's a vocabulary. If "deployment" stops meaning anything, every builder loses, including the honest ones.

The fix is for buyers to stop accepting "production deployment" as a phrase and start treating it as a checklist. The seven patterns above are how I'd start.

We packaged this up as a one-page PDF, The 7-Pattern Sniff Test, built to take into a vendor meeting. Print it. Use it on whoever's pitching you. Use it on us. The fastest way for this market to clean up is for buyers to start asking the questions theater can't survive.

Talk to us if you want a second pair of eyes on a pitch. Or send the deck. We'll be honest about what we see.

Written by Sahil Saini, Founder & CEO
