What “production-ready” actually means on day one (and how to interview a partner for it)

The stat that should ruin your week

…

RAND looked at 800+ enterprise AI projects and found that 80.3% fail to deliver business value.

Gartner’s number for GenAI specifically is uglier: 95% of pilots get abandoned after PoC. McKinsey’s late-2025 survey says 88% of organizations now use AI somewhere, but only 39% report any EBIT impact. In other words, roughly six in ten organizations surveyed are paying for AI and reporting no measurable effect on earnings.

Pause on that. We’re four years past ChatGPT. The tooling is mature. The models are good. Foundry has been GA’d, expanded, and partnered out to every Big 4. And the failure rate is still 80%+.

The failure usually isn’t a model failure. It’s almost always a production failure. The pilot ran. The demo wowed the steering committee. Then someone asked, “OK, can we put it in front of a real claims adjuster?” and the wheels came off, because nobody had designed for what happens on the 1,000th transaction, the 10th model regression, the third audit, or the first wrong answer that costs the company money.

That gap, between “the demo worked” and “the system holds,” is what “production-ready” is supposed to mean. The phrase has been so abused on vendor websites that it now means nothing. Time to give it a definition you can actually hold a partner to.

What “production-ready on day one” actually means

This is the part that should be boring and isn’t. The reason 80% of projects fail is that the buyer never wrote down what production-ready means before signing the SOW. Here’s our list. Ten things. If all ten exist before go-live, you have a system. If even one is missing, you have a glorified pilot.

1. An eval harness, not a demo script

You need a test suite that scores model output against a reference set, runs on every change, and blocks deploys when scores drop. In Foundry, AIP Evals is the native tool for this. If your partner has never used it, that’s a tell. If they have, ask to see the eval suite itself, not a screenshot of it. Look for variety in test cases, edge cases, regression coverage, and a pass/fail threshold tied to deployment gates.

A demo script is “here are five questions that work.” An eval harness is “here are 200 questions, of which 12 currently fail, here’s why, here’s our plan, and here’s when it runs next.”
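To make the deploy gate concrete, here is a minimal sketch in plain Python. It is not AIP Evals and calls no Foundry API. The `EvalCase` shape, the exact-match scorer, the `fake_model` stand-in, and the 0.9 threshold are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One reference question with the answer we expect (hypothetical shape)."""
    case_id: str
    prompt: str
    expected: str


def run_evals(cases: list[EvalCase], model: Callable[[str], str]) -> dict:
    """Score every case by normalized exact match and list the failures."""
    failed = [
        c.case_id
        for c in cases
        if model(c.prompt).strip().lower() != c.expected.strip().lower()
    ]
    pass_rate = (len(cases) - len(failed)) / len(cases) if cases else 0.0
    return {"pass_rate": pass_rate, "failed": failed}


def deploy_allowed(report: dict, threshold: float = 0.9) -> bool:
    """The gate: block the deploy when the pass rate is below the threshold."""
    return report["pass_rate"] >= threshold


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs on its own."""
    return {"2+2?": "4", "Capital of France?": "Paris"}.get(prompt, "unknown")


cases = [
    EvalCase("math-1", "2+2?", "4"),
    EvalCase("geo-1", "Capital of France?", "paris"),
    EvalCase("edge-1", "Claim status for an empty claim id?", "invalid claim id"),
]
report = run_evals(cases, fake_model)
print(report["failed"], deploy_allowed(report))
```

A real suite would swap exact match for a scorer suited to the task, but the point survives: the report names the failing cases, and the gate is a function of the report, not of anyone’s mood on release day.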

2. Write-back with audit, or the system is read-only theatre

Most “AI deployments” never write to anything. They read, summarize, and display. That’s a research tool, not a system. The hard problem starts when an agent has to write a row to the ERP, defer a PO, update a clinical record, or trigger a payment.

Real write-back has: idempotency keys (so retries don’t double-write), two-phase or compensating-transaction semantics (so partial failures unwind), and an audit trail that records who/what/when at the row level. Without those three, you’re one network blip away from a finance disaster and an unanswerable audit question.

This is the single most underestimated cost in enterprise AI work. We built the Write-Back Orchestrator because we got tired of bolting these patterns on every project.
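The three properties are easier to argue about with something runnable in front of you. The sketch below is an in-memory stand-in, not the Write-Back Orchestrator and not an ERP client. `WriteBackStore`, its method names, and the sample key are hypothetical. It uses the compensating-action flavor rather than two-phase commit.

```python
from datetime import datetime, timezone


class WriteBackStore:
    """In-memory stand-in for a target system such as an ERP (hypothetical)."""

    def __init__(self) -> None:
        self.rows = {}            # row id -> current values
        self.applied = {}         # idempotency key -> (row id, prior values)
        self.compensated = set()  # keys whose write has been undone
        self.audit = []           # row-level audit trail

    def _log(self, actor: str, action: str, row_id: str) -> None:
        # Who, what, when, at the row level.
        self.audit.append({
            "actor": actor,
            "action": action,
            "row_id": row_id,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def write(self, key: str, row_id: str, values: dict, actor: str) -> bool:
        """Apply the write once per idempotency key; a retry is a no-op."""
        if key in self.applied:
            return False
        self.applied[key] = (row_id, self.rows.get(row_id))
        self.rows[row_id] = dict(values)
        self._log(actor, "write", row_id)
        return True

    def compensate(self, key: str, actor: str) -> bool:
        """Undo a committed write by restoring the prior values, and log it."""
        if key not in self.applied or key in self.compensated:
            return False
        row_id, prior = self.applied[key]
        if prior is None:
            self.rows.pop(row_id, None)
        else:
            self.rows[row_id] = prior
        self.compensated.add(key)
        self._log(actor, "compensate", row_id)
        return True


store = WriteBackStore()
store.rows["PO-1001"] = {"status": "open"}
applied = store.write("req-7f3a", "PO-1001", {"status": "deferred"}, actor="agent")
retried = store.write("req-7f3a", "PO-1001", {"status": "deferred"}, actor="agent")
undone = store.compensate("req-7f3a", actor="oncall")
print(applied, retried, undone, store.rows["PO-1001"])
```

Note that the idempotency key stays on record after compensation, so a late retry cannot quietly resurrect a write that someone deliberately unwound.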

3. Human-in-the-loop that captures the correction, not just the override

“We have HITL” is a vendor phrase that usually means “there’s a button the user can click to ignore the AI.” That’s not HITL. That’s a kill switch.

Real HITL captures the correction structurally: the original AI output, the human’s edit, the reason for the edit, and a link back to the underlying record. That signal becomes training data, eval data, and the audit story for the regulator. Our Human Validation Station does this because petition signature work taught us that the auditor will ask, and “we let the user override it” is not an answer.
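As a rough illustration of what “structurally” means, here is one way the capture record could look. This is a sketch, not the Human Validation Station schema; every field name, the reason code, and the rule that an edit requires a reason are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CorrectionRecord:
    """One human review of one AI output (hypothetical schema)."""
    record_id: str      # link back to the underlying source record
    ai_output: str      # what the model originally produced
    human_output: str   # what the reviewer approved or changed it to
    reason_code: str    # why it was changed; empty when accepted as-is
    reviewer: str
    reviewed_at: str

    @property
    def was_changed(self) -> bool:
        return self.ai_output != self.human_output


def capture_correction(record_id: str, ai_output: str, human_output: str,
                       reason_code: str, reviewer: str) -> CorrectionRecord:
    """Refuse an edit that arrives without a reason, then build the record."""
    if ai_output != human_output and not reason_code:
        raise ValueError("an edited output needs a reason code")
    return CorrectionRecord(
        record_id=record_id,
        ai_output=ai_output,
        human_output=human_output,
        reason_code=reason_code,
        reviewer=reviewer,
        reviewed_at=datetime.now(timezone.utc).isoformat(),
    )


correction = capture_correction(
    record_id="signature-00042",
    ai_output="match",
    human_output="no match",
    reason_code="ADDRESS_MISMATCH",
    reviewer="validator-17",
)
print(correction.was_changed, correction.reason_code)
```

The same record serves three readers: the eval suite (a new failing case), the training pipeline (a labeled pair), and the auditor (who changed what, and why).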

4. Rollback paths for models, prompts, ontology, and write-back

Four different things can break in production. The LLM. The prompt. The ontology change someone made on Tuesday. The write-back logic. You need a rollback story for each one, separately. If your partner’s answer is “we’ll redeploy the previous version,” ask them how long that takes, who approves it, and whether the in-flight transactions get rolled back too.

The right answer is usually: model and prompt are versioned with promotion gates between sandbox and prod, ontology changes go through a review branch, write-back has a “compensating action” pattern for transactions already committed. If you don’t hear all four, you have a hope, not a plan.
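For the model and prompt half of that answer, the mechanics can be sketched in a few lines. `VersionedArtifact` is a hypothetical class, not a Foundry feature, and `gate_passed` stands in for whatever approval or eval result your promotion gate actually checks. Ontology branches and compensating write-back actions are out of scope here.

```python
class VersionedArtifact:
    """Tracks versions of one artifact (a prompt, a model pin) per environment."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.versions = []                           # version n is versions[n - 1]
        self.history = {"sandbox": [], "prod": []}   # versions made live, in order

    def publish(self, content: str) -> int:
        """Store a new version and make it live in sandbox only."""
        self.versions.append(content)
        version = len(self.versions)
        self.history["sandbox"].append(version)
        return version

    def promote(self, version: int, gate_passed: bool) -> None:
        """Move a version to prod, but only through the promotion gate."""
        if not gate_passed:
            raise PermissionError(f"{self.name} v{version} did not pass the gate")
        if not 1 <= version <= len(self.versions):
            raise ValueError(f"unknown version {version}")
        self.history["prod"].append(version)

    def rollback(self, env: str = "prod") -> int:
        """Drop the live version and return the one that is live again."""
        if len(self.history[env]) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self.history[env].pop()
        return self.history[env][-1]

    def live(self, env: str = "prod") -> str:
        return self.versions[self.history[env][-1] - 1]


prompt = VersionedArtifact("claims-summary-prompt")
v1 = prompt.publish("Summarize the claim in three sentences.")
prompt.promote(v1, gate_passed=True)
v2 = prompt.publish("Summarize the claim in three sentences and cite the policy.")
prompt.promote(v2, gate_passed=True)
live_before = prompt.live()
restored = prompt.rollback()
print(restored, prompt.live())
```

Rollback here is a pointer move, which is why it can take seconds. It says nothing about transactions that ran while v2 was live, and that is exactly the follow-up question to ask.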

5. Model and prompt versioning that an outsider can read

Every change to a model, prompt, or agent definition gets a version, a diff, and a reason. Stored somewhere a compliance officer who doesn’t code can read.

This is so basic it’s embarrassing to write down. Half of partner engagements still don’t do it. They edit prompts inline in Agent Studio, deploy, and pray nobody asks what changed last Thursday. When the wrong answer happens (and it will), the post-mortem is a nightmare.
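A change record that a non-coder can read does not need special tooling. As a small sketch, Python’s standard `difflib` can produce one; the `change_entry` helper, its field layout, and the sample prompt text are our own assumptions, not a Foundry format.

```python
import difflib


def change_entry(version: int, old: str, new: str, author: str, reason: str) -> str:
    """Render one prompt change as plain text: who, why, and a line diff."""
    diff = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=f"v{version - 1}",
        tofile=f"v{version}",
        lineterm="",
    )
    header = [f"Version {version} by {author}", f"Reason: {reason}"]
    return "\n".join(header + list(diff))


entry = change_entry(
    version=2,
    old="Answer briefly.\nCite the policy.",
    new="Answer briefly.\nCite the policy section number.",
    author="j.doe",
    reason="Auditors asked for section-level citations",
)
print(entry)
```

Lines starting with `-` were removed and lines starting with `+` were added. That is about all the training a compliance officer needs to answer “what changed last Thursday.”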

6. RBAC tied to the ontology, not just the app

In Foundry, this is mostly free if you do the ontology work right. Access controls live at the object and link level, not just the UI level. That means a user who shouldn’t see customer SSNs in the analyst dashboard also can’t see them through the AI agent that queries the same ontology.

Ask the partner to draw the access model from data → ontology → AIP Logic → agent → UI. If they can’t, the AI agent is going to leak something the dashboard wouldn’t have.
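The idea behind that drawing is a single enforcement point that every consumer passes through. The sketch below shows the shape of it in plain Python. It is not how Foundry implements access control; `SENSITIVE`, the role names, and both view functions are hypothetical.

```python
# Property name -> role required to see it (hypothetical policy).
SENSITIVE = {"ssn": "pii_reader"}


def visible_properties(obj: dict, roles: set) -> dict:
    """The one enforcement point: drop properties the caller's roles cannot see."""
    return {
        name: value
        for name, value in obj.items()
        if name not in SENSITIVE or SENSITIVE[name] in roles
    }


def dashboard_view(obj: dict, roles: set) -> dict:
    """What the analyst dashboard renders."""
    return visible_properties(obj, roles)


def agent_context(obj: dict, roles: set) -> str:
    """What gets placed in the agent's prompt on behalf of the same user."""
    allowed = visible_properties(obj, roles)
    return ", ".join(f"{name}={value}" for name, value in allowed.items())


customer = {"name": "A. Rivera", "tier": "gold", "ssn": "000-00-0000"}
analyst_roles = {"analyst"}
print(dashboard_view(customer, analyst_roles))
print(agent_context(customer, analyst_roles))
```

If the agent path builds its context from anything other than the filtered object, the leak in the paragraph above is the result.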

7. Observability on the LLM layer, not just the app layer

Foundry has tamper-proof audit logs for every LLM call and data access. That doesn’t help if nobody’s looking at it. Production-ready means there’s a dashboard, an on-call rotation, and a threshold-based alert when token cost spikes, hallucination rate climbs, or write-back failure rate moves.

If the partner can’t show you the dashboard during the pitch, it doesn’t exist yet.
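The alerting half of this is not complicated, which is part of why its absence is telling. Here is a minimal sketch of a threshold check; the metric names and limits are invented for illustration, and a missing metric is treated as an alert because silence is also a failure mode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float


def breached(metrics: dict, thresholds: list) -> list:
    """Return one alert message per metric that is missing or over its limit."""
    alerts = []
    for t in thresholds:
        value = metrics.get(t.metric)
        if value is None:
            alerts.append(f"{t.metric}: no data")
        elif value > t.limit:
            alerts.append(f"{t.metric}: {value} > {t.limit}")
    return alerts


# Hypothetical limits; real ones come out of the SLA conversation.
thresholds = [
    Threshold("token_cost_usd_per_hour", 40.0),
    Threshold("hallucination_rate", 0.02),
    Threshold("writeback_failure_rate", 0.01),
]
last_hour = {"token_cost_usd_per_hour": 55.0, "hallucination_rate": 0.01}
alerts = breached(last_hour, thresholds)
print(alerts)
```

What the sketch cannot supply is the person who gets paged and the runbook they open. Those are the parts to ask about.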

8. Runbooks that someone on your team can execute at 2am

Six months in, your partner’s lead engineer is on a beach in Portugal. The agent stops responding. The on-call SRE on your side has never touched Foundry. What do they do?

If the answer is “call the partner,” congratulations, you’ve signed a managed service disguised as a build engagement. That’s fine if you priced it that way. It’s a problem if you thought you were buying a system that your team would own.

Real runbooks have: how to restart, how to roll back, how to escalate, who owns what, and the four most likely failure modes with their fixes. PDF or Confluence, doesn’t matter, just has to exist before launch.

9. SLAs measured in operator outcomes, not system uptime

“99.9% uptime” is the lowest possible SLA bar and it’s the only one most partners commit to. It’s table stakes. The SLAs that matter are: how fast does the AI respond under load, what’s the acceptable hallucination rate on the eval suite, what’s the maximum write-back latency, what’s the resolution SLA when a model regression is detected.

Get those four into the contract or you have no leverage when things drift.
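An SLA only gives you leverage if both sides compute it the same way. As one example, “response time under load” is usually stated as a percentile, and the sketch below checks a 95th-percentile latency against a contracted limit. The nearest-rank method, the sample values, and the 1,500 ms limit are assumptions; write whichever definition you choose into the contract.

```python
def p95(samples_ms: list) -> float:
    """Nearest-rank 95th percentile: the smallest sample with at least
    95% of all samples at or below it."""
    if not samples_ms:
        raise ValueError("no latency samples")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def sla_report(samples_ms: list, limit_ms: float) -> dict:
    """Compare observed p95 latency with the contracted limit."""
    observed = p95(samples_ms)
    return {"p95_ms": observed, "limit_ms": limit_ms, "met": observed <= limit_ms}


samples = [100.0 * i for i in range(1, 21)]  # 100, 200, ..., 2000 ms
latency_report = sla_report(samples, limit_ms=1500.0)
print(latency_report)
```

An average over the same samples would be 1,050 ms and would pass, which is why the contract should name the percentile and not just “response time.”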

10. A deprecation plan

This one will get me hate mail. A production-ready system has a plan for how it ends.

Maybe the model gets replaced when Claude 5.5 ships. Maybe the workflow gets absorbed back into the ERP in 18 months. Maybe the agent gets retired when the underlying process changes. None of these are bad outcomes. What’s bad is buying a system without thinking about how you’d swap it out, then discovering five years later that it’s load-bearing and the partner who built it has been acquired twice.

If a partner’s never been asked this question, watch how they answer. The good ones will smile and walk you through it because they think about it constantly. The bad ones will say “why would you want to deprecate it?”

The 7-question partner interview

Ten attributes is a lot. Here’s the short form. Seven questions, what a good answer sounds like, what a bad answer sounds like. Use this on any Palantir vendor. Use it on us. We’ve answered worse.

Q1: “Show me the eval suite for the closest project you’ve shipped to ours.”

Good answer: They share a screen, walk you through 50-200 test cases, name the model versions tested, point to the pass/fail thresholds, and tell you which cases currently fail.
Bad answer: “We use AIP Evals, here’s a tutorial link.” (They’ve never used it on a customer project.)
Worse answer: “Evals are part of the next phase.” (There is no next phase. There’s just this phase, forever.)

Q2: “Walk me through the worst write-back failure you’ve had in production and how it got fixed.”

Good answer: A specific story with a date, a system (ERP/EMR/OMS), a root cause (idempotency missed, schema mismatch, etc.), and what changed afterward. Bonus points if they admit the fix took longer than they’re proud of.
Bad answer: “We haven’t had any failures.” (Then they haven’t run write-back at scale.)

Q3: “What’s your HITL capture model? Show me the schema.”

Good answer: They draw or pull up a data model showing original output, corrected output, user, timestamp, reason code, and link back to the source record.
Bad answer: “Users can edit the output.” (That’s a text field, not a HITL pattern.)

Q4: “If your lead engineer disappears for two weeks, can my team operate the system?”

Good answer: Yes, here are the runbooks, here’s the training plan, here’s the read-only admin tooling we build for your ops team.
Bad answer: “We provide 24/7 support during the contract.” (Translation: you don’t own this, we do, and we will keep billing you.)

Q5: “How do you version a prompt change, and can an auditor read the change log?”

Good answer: Branch + PR + reviewer + commit message, stored in Foundry or a connected repo. A compliance officer can read it without help.
Bad answer: “Our engineers track changes.” (No they don’t.)

Q6: “Show me the production observability dashboard from a current client.”

Good answer: Live dashboard (or a redacted screenshot) showing token cost, response latency, eval pass rate, write-back success rate, alert history.
Bad answer: A pitch slide showing what they “could” build. The dashboard either exists or it doesn’t.

Q7: “What’s the deprecation plan for this system in three years?”

Good answer: A real consideration of swap-out paths, vendor lock-in concerns, data portability. They’ll tell you what’s hard to undo and what’s easy.
Bad answer: “Why would you want to deprecate it?”

If a partner aces six of seven, you have a real shortlist candidate. If they ace fewer than four, they’re selling something that isn’t yet built.

The anti-checklist: what looks production-ready but isn’t

A few traps that have burned us, our customers, and people we know.

A polished demo with hardcoded answers. If the demo never fails, it’s been rehearsed. Ask them to enter a new question in front of you, off-script. Watch their hands.

“It works in Foundry sandbox.” Sandbox is a museum. Production is a kitchen at dinner rush. Different physics. Ask to see something running against live data with real users.

A team of generalists with one Palantir certification. Palantir certifications are real, but a single Foundry App Dev cert across the team isn’t depth. Ask about the team makeup. The ratio of Palantir-certified engineers to subcontractors is the number that matters.

“AI-first” branding with no actual model deployments. This is the easiest tell. Ask what models they’ve deployed inside Foundry beyond LLM calls. Real answers include things like YOLO, PaddleOCR table extractors, and custom OCR. If the answer is “we orchestrate GPT,” they’re a prompt engineering shop, not an AI partner.

Quote-per-phase pricing with no production milestone. Read the SOW. If there’s no “production go-live” milestone with acceptance criteria, the project will never end. Add one. Tie payment to it.

Where AKOS fits (and where we don’t)

To save you time on the pitch: we built the Arsenal because we were tired of solving the same ten problems on every Foundry engagement. Idempotent write-back. HITL with audit. Eval scaffolding. Context filters. Quality conductors that route low-confidence agent output back for regeneration. Real CV models running inside Foundry against media sets. Handwriting OCR for the 40-year-old paper records nobody else wants to touch.

We are most useful when:

You have Palantir already, or you’re seriously evaluating it, and you need a partner who treats it as the operating system, not a BI tool.
Your problem has hard edges. Regulated, audited, write-back-heavy, real-time, or all four.
You want to own the system after we leave. We build runbooks. We document. We don’t subscribe you to ourselves.

We are not the right fit when:

You want a 6-month exploratory phase to “see what’s possible with AI.” We are bad at this. We will get bored, push you toward an outcome, and you will feel rushed. Hire a strategy firm for that phase. Come back to us when you have a problem to solve.
You want pure managed services with no internal team involvement. We can do it. We’d rather not. Forward-deployed means we work with your people, not around them.

Use this on us

Honestly. Use the 7 questions on AKOS. Send us a Loom of a competitor’s pitch and we’ll point out what’s missing. Ask us about the worst write-back bug we’ve shipped (it’s in the petition signature project, and the fix is in the Arsenal because of it).

Most of the 80% of AI projects that fail die quietly. The buyer doesn’t tell anyone. The vendor doesn’t tell anyone. Both sides move on. The next buyer has no idea the same trap was set six months ago.

If we can take even a small bite out of that, by helping you write better RFPs, ask sharper questions, and define “production-ready” before signing anything, this post did its job. Even if you don’t pick us.

Talk to us. Or take the 7-Question Partner Interview PDF and run it on whoever’s in your funnel.

What am I missing? If you’ve shipped Foundry to production and disagree with any of the ten attributes, hit reply. We’ll either fold it in or argue back. Either is fine.


