The single biggest mistake in AI-agent procurement is signing a contract before evaluating the agent on your real workflow. Vendor demos are tuned; your reality is messy. This is the checklist that survives contact with reality.
TLDR — the framework
Evaluate in 5 phases:
- Define the role — what outcomes does the agent need to deliver, measurably?
- Capability tests — does it actually do what you need, on YOUR data?
- Integration audit — does it fit into your stack without expensive rewiring?
- TCO model — all-in cost vs. status-quo cost, honest version
- Vendor due diligence — will they be here in 2 years?
No pilot, no purchase. Skip the pilot and you're buying a vendor pitch.
Phase 1: Define the role
Before talking to vendors, write down:
- Outcome metrics. "Resolve 60% of tier-1 tickets" not "deploy AI in support." "Book 50 qualified meetings/month" not "automate SDR work."
- Boundary conditions. What's in scope? What's out of scope? Where does it escalate?
- Failure modes. What's the worst the agent could do? Refund $100 vs. refund $10,000? Send a message vs. send a contract?
- Success criteria. Concretely: at month 3, this agent must hit X, Y, Z or we kill the contract.
If you can't write these in one page, you're not ready to evaluate. Spend a week thinking about the role first.
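One way to keep the success criteria concrete is to write them down as data that can be checked at month 3. The following is a minimal Python sketch, not a prescribed format: the `Criterion` class, the `missed_criteria` helper, and both example metrics are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """A single measurable target (illustrative structure, not a standard)."""
    metric: str
    target: float
    higher_is_better: bool = True

    def met(self, observed: float) -> bool:
        if self.higher_is_better:
            return observed >= self.target
        return observed <= self.target


def missed_criteria(criteria, observed):
    """Return the metrics that missed target; an empty list means a pass.

    A metric missing from `observed` raises KeyError on purpose:
    an unmeasured criterion should not count as met.
    """
    return [c.metric for c in criteria if not c.met(observed[c.metric])]


# Hypothetical month-3 criteria for a support agent.
criteria = [
    Criterion("tier1_resolution_rate", 0.60),
    Criterion("wrongful_refunds_over_100_usd", 0, higher_is_better=False),
]
observed = {"tier1_resolution_rate": 0.64, "wrongful_refunds_over_100_usd": 2}
print(missed_criteria(criteria, observed))
```

If the list that comes back is not empty at the month-3 review, the kill clause applies.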
Phase 2: Capability tests
The deal-or-no-deal phase. Test the agent on your real workflow, not the demo.
Build your test set
Pick 30-50 real cases from your actual operations:
- 10 easy cases (the obvious wins)
- 20 medium cases (the bread + butter)
- 10 hard cases (the edge cases you actually see)
- Bonus: 5 cases the agent SHOULD escalate
For each: write down the expected outcome.
Run the test
Have the vendor (or yourself with a trial account) run the agent against your test set. Record:
- Did the agent reach the expected outcome?
- If yes — how confident? Did it require multiple turns?
- If no — what failed? Wrong action, wrong reasoning, missing tool, missing knowledge?
- For escalation cases — did it escalate appropriately or push through?
Score: deflection rate, accuracy, escalation appropriateness, response time.
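The scoring step can be sketched in Python. This is a sketch under stated assumptions, not a standard: the `CaseResult` fields and the metric definitions are illustrative choices. Deflection is taken here as the share of non-escalation cases resolved correctly without a human, and escalation appropriateness as the share of should-escalate cases that actually escalated. Adapt both to how your team defines them.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class CaseResult:
    """One row of the pilot log. All field names are illustrative assumptions."""
    case_id: str
    should_escalate: bool     # expected outcome was "hand off to a human"
    escalated: bool           # what the agent actually did
    reached_expected: bool    # agent produced the expected outcome
    response_seconds: float


def score(results: list[CaseResult]) -> dict[str, float]:
    """Compute the four headline numbers for one capability test run."""
    if not results:
        raise ValueError("no results to score")

    normal = [r for r in results if not r.should_escalate]
    escalation = [r for r in results if r.should_escalate]

    # Accuracy: expected outcome reached, across every case in the set.
    accuracy = sum(r.reached_expected for r in results) / len(results)

    # Deflection: normal cases resolved correctly without a human hand-off.
    deflected = sum(r.reached_expected and not r.escalated for r in normal)
    deflection_rate = deflected / len(normal) if normal else 0.0

    # Escalation appropriateness: should-escalate cases that did escalate.
    escalated_ok = sum(r.escalated for r in escalation)
    escalation_rate = escalated_ok / len(escalation) if escalation else 0.0

    return {
        "accuracy": accuracy,
        "deflection_rate": deflection_rate,
        "escalation_appropriateness": escalation_rate,
        "median_response_seconds": median(r.response_seconds for r in results),
    }


# Tiny made-up run, only to show the call shape.
results = [
    CaseResult("easy-1", False, False, True, 4.0),
    CaseResult("medium-1", False, False, False, 9.0),
    CaseResult("hard-1", False, True, False, 6.0),
    CaseResult("escalate-1", True, True, True, 5.0),
]
print(score(results))
```

Scoring the easy, medium, hard, and should-escalate groups separately is usually more informative than one blended number, since a strong result on easy cases can hide a weak one on hard cases.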
Red flags during capability testing
- Vendor pushes back on testing against your real data (they should welcome it)
- "Our agent improves with use" — only valid IF they can run the test, retrain, and re-run with measurable improvement during the pilot
- A perfect demo but roughly 60% accuracy on your data: vendor demos are tuned, and that gap won't close by itself
- Escalation rates near zero — the agent is bluffing. Real edge cases should escalate.
Phase 3: Integration audit
The deployment surface. What does it cost to make this agent useful?
Required integrations
Map every system the agent needs to read from + write to:
- CRM (Salesforce, HubSpot, etc.)
- Ticketing (Zendesk, Intercom, etc.)
- Messaging (Slack, Teams, etc.)
- Email (Gmail, Outlook)
- Knowledge base (Confluence, Notion, Google Drive)
- Your internal databases
For each: native connector available? Pre-built? Custom build required? Estimated effort?
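The integration map can be kept as a small inventory so that effort estimates add up in one place. A minimal Python sketch follows; the `Integration` and `Connector` types are hypothetical, and the effort figures are placeholders, not benchmarks.

```python
from dataclasses import dataclass
from enum import Enum


class Connector(Enum):
    NATIVE = "native"
    PREBUILT = "pre-built"
    CUSTOM = "custom build"


@dataclass
class Integration:
    system: str
    reads: bool
    writes: bool
    connector: Connector
    effort_days: float   # your own engineering estimate, not the vendor's


def summarize(integrations):
    """Total the estimated effort and list the systems needing custom work."""
    total = sum(i.effort_days for i in integrations)
    custom = [i.system for i in integrations if i.connector is Connector.CUSTOM]
    return {"total_effort_days": total, "custom_builds": custom}


# Placeholder entries for illustration only.
inventory = [
    Integration("Salesforce", True, True, Connector.NATIVE, 3),
    Integration("Zendesk", True, True, Connector.PREBUILT, 5),
    Integration("Internal orders DB", True, False, Connector.CUSTOM, 20),
]
print(summarize(inventory))
```

In this made-up inventory the single custom build accounts for most of the estimated effort, which is the pattern worth looking for in your own map.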
Auth + identity
- Does the agent respect existing user permissions or run with elevated access?
- What credentials does it hold? How are they stored + rotated?
- Audit trail — every action logged in a way you can review later?
Cost of integration
Most vendors quote only the agent's monthly fee. The integration cost is often 2-4× the platform fee in year one. Budget honestly:
- Implementation engineering: 60-90 days typical for serious deployments
- Change management: a Sierra-class deployment needs a dedicated owner on your team
- Ongoing tuning: 0.1-0.3 FTE per agent in steady-state
Phase 4: TCO model
Build an honest TCO model. Don't trust the vendor's "ROI calculator."
Costs to include
- Platform subscription / per-outcome fees
- Integration engineering (year 1)
- Change management + training
- Ongoing tuning + maintenance
- Compliance + governance overhead
- Incident response capacity
- The 5-15% of outputs that need human correction in the first quarter
Costs to compare against
- Status quo (current human + tool cost)
- Alternatives (other vendors at same capability tier)
- Doing nothing (residual work that won't get done at all)
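A year-one TCO model can be as simple as one record per cost line plus a comparison against the status quo. The Python sketch below is illustrative only: the `YearOneCosts` fields mirror the list above, `correction_cost` is a hypothetical helper for pricing the outputs that need human correction, and every figure is a placeholder to replace with your own.

```python
from dataclasses import dataclass, fields


@dataclass
class YearOneCosts:
    """Annual cost line items; names mirror the checklist, values are yours."""
    platform_fees: float
    integration_engineering: float
    change_management: float
    tuning_and_maintenance: float
    compliance_overhead: float
    incident_response: float
    human_correction: float

    def total(self) -> float:
        return sum(getattr(self, f.name) for f in fields(self))


def correction_cost(outputs_per_year: int, correction_rate: float,
                    minutes_each: float, hourly_rate: float) -> float:
    """Labor cost of fixing the share of agent outputs that need a human."""
    hours = outputs_per_year * correction_rate * minutes_each / 60
    return hours * hourly_rate


# Placeholder figures for illustration only; replace every number.
agent = YearOneCosts(
    platform_fees=60_000,
    integration_engineering=30_000,
    change_management=10_000,
    tuning_and_maintenance=20_000,
    compliance_overhead=5_000,
    incident_response=5_000,
    human_correction=correction_cost(12_000, 0.10, 10, 45),
)
status_quo = 160_000   # current human + tool cost per year (placeholder)

print(f"agent year-1 TCO: {agent.total():,.0f}")
print(f"status quo:       {status_quo:,.0f}")
print(f"difference:       {status_quo - agent.total():,.0f}")
```

If the agent covers only part of the role, add the human cost that remains to the agent side before comparing; otherwise the model repeats the vendor's assumption of full replacement.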
The honest math
For an AI SDR at $5K/month:
- Vendor's pitch: "Replaces 2 SDRs at $80K/year each = $13.3K/mo cost. You save $8.3K/mo."
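- Honest math: integration cost year 1 ~$30K (~$2.5K/mo amortized). Ongoing tuning 0.2 FTE = $20K/year (~$1.7K/mo). Those two alone cut the savings to ~$4.2K/mo, about half the pitch. And the agent replaces ~50-70% of SDR work, not 100%: if only that share of the $13.3K/mo is actually displaced, year 1 lands between roughly -$2.5K and +$0.2K/mo, rising to $0-2.7K/mo once integration is paid off.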
If the vendor's math doesn't accommodate this kind of reality, push back.
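The arithmetic for the AI SDR example can be checked with a few lines of Python. How hard the 50-70% coverage bites depends on how it is modeled; this sketch assumes, as a simple and pessimistic reading, that coverage scales the displaced SDR salary cost linearly. The function and variable names are illustrative.

```python
MONTHS = 12

sdr_cost_mo = 2 * 80_000 / MONTHS    # two SDRs at $80K/year
agent_fee_mo = 5_000                 # AI SDR platform fee
integration_mo = 30_000 / MONTHS     # year-1 integration, amortized monthly
tuning_mo = 20_000 / MONTHS          # 0.2 FTE of ongoing tuning


def monthly_savings(coverage: float, include_integration: bool = True) -> float:
    """Net monthly savings if the agent displaces `coverage` of SDR cost.

    Assumption: displaced cost scales linearly with coverage.
    """
    displaced = coverage * sdr_cost_mo
    overhead = agent_fee_mo + tuning_mo
    if include_integration:
        overhead += integration_mo
    return displaced - overhead


pitch = sdr_cost_mo - agent_fee_mo
print(f"vendor pitch:              {pitch:8.0f}")
print(f"minus integration, tuning: {monthly_savings(1.0):8.0f}")
for coverage in (0.5, 0.7):
    print(f"year 1 at {coverage:.0%} coverage:    {monthly_savings(coverage):8.0f}")
    print(f"later years at {coverage:.0%}:        {monthly_savings(coverage, False):8.0f}")
```

With the figures above, the pitch comes to about $8.3K/mo, integration and tuning bring it to about $4.2K/mo, and applying 50-70% coverage puts year one between roughly -$2.5K and +$0.2K/mo. The point of writing it out is that each assumption becomes a line the vendor has to argue with.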
Phase 5: Vendor due diligence
Will the vendor be here in 2 years?
Financial health signals
- Funding stage + last round date (if it's been > 18 months and no new round, ask why)
- Revenue / customer count (well-funded vendors talk about this openly; reluctance is a yellow flag)
- Burn rate vs runway (usually opaque, but reference customers sometimes know)
Product trajectory
- Roadmap they'll share under NDA — does it match the direction your needs are going?
- Release cadence over the past year — shipping every 2 weeks vs every 2 months matters
- Their long-term bet on the category: does the company survive if the agentic AI market consolidates?
Customer references
Ask for 3 customer references whose use case looks like yours. On the calls, ask:
- What surprised you in onboarding (positive and negative)?
- What's the agent NOT good at that the demo suggested it was?
- Would you buy this vendor again knowing what you know now?
If a vendor can't produce references whose situation looks like yours, treat that as a yellow flag.
Exit strategy
- If you decide to stop using this agent in 18 months, how does that work?
- Do you keep your data, prompts, configurations?
- Is there a portability story or are you locked in?
The no-pilot-no-purchase rule
Treat this as inviolable: never sign a multi-year contract before completing a real-data pilot. Period.
A vendor who refuses a pilot is signaling that their product doesn't survive contact with your reality. Walk.
A vendor who insists on a multi-year commitment for the pilot is signaling that they can't earn the renewal on outcomes. Walk.
A vendor who's flexible on pilot terms + offers a clear go/no-go review at the end is signaling confidence in their outcomes. Engage.
The decision review
At the end of the pilot, hold a structured review:
- Did the agent hit the outcome metrics from Phase 1?
- What was the actual TCO vs. modeled?
- What edge cases surprised us? Are they deal-breakers?
- What's our confidence level in projected scale?
- Do we have an owner on our team who's bought in?
- Are we going to be happy with this in 18 months?
If at least 4 of these 6 are clearly positive, sign. Otherwise, renegotiate or walk.
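The sign-or-walk rule is simple enough to encode, which mainly forces every question to get an explicit yes or no. A minimal Python sketch follows; the question labels and the `decide` function are illustrative paraphrases of the list above.

```python
REVIEW_QUESTIONS = [
    "hit the Phase 1 outcome metrics",
    "actual TCO in line with the model",
    "no deal-breaking edge cases",
    "confident in projected scale",
    "bought-in owner on our team",
    "expect to be happy in 18 months",
]


def decide(clearly_positive: dict[str, bool], threshold: int = 4) -> str:
    """Apply the rule: sign only if enough answers are clearly positive."""
    missing = set(REVIEW_QUESTIONS) - set(clearly_positive)
    if missing:
        # An unanswered question is not a "yes"; make the gap visible.
        raise ValueError(f"unanswered: {sorted(missing)}")
    yes_count = sum(clearly_positive[q] for q in REVIEW_QUESTIONS)
    return "sign" if yes_count >= threshold else "renegotiate or walk"


# Example review: four clear positives, two open concerns.
answers = dict.fromkeys(REVIEW_QUESTIONS, True)
answers["actual TCO in line with the model"] = False
answers["confident in projected scale"] = False
print(decide(answers))
```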
See also
- How to pick an AI agent
- How to deploy an AI agent
- How to evaluate an AI agent
- How to evaluate an AI tool trial
- When not to use an AI agent
Bottom line
Most AI-agent procurement mistakes are made in the first 2 weeks, by skipping the pilot and signing on the strength of a vendor pitch. The 5-phase framework above adds 6-10 weeks to the cycle, and in exchange you decide on measured results from your own workflow rather than on a demo. The vendors worth buying from welcome this process; the ones who push back are the ones you can't afford to buy from.