Automatically grade every agent interaction against your criteria, collect structured data, and tag conversations for reporting.
Evaluations give you automated quality assurance for your agents. After every interaction completes, an AI evaluator analyzes the full conversation transcript and produces a structured report: a grade, per-criterion pass/fail results with rationales, sentiment analysis, applied tags, and extracted data points.
Think of it as having a QA analyst review every single conversation your agent has, 24/7, without you lifting a finger.
Evaluations is a Pro feature. If you’re on the free tier, you’ll see an upgrade prompt when you try to enable it.
Before jumping into setup, here’s the mental model. Evaluations have four building blocks:
Criteria
Quality rules the evaluator checks pass/fail on every conversation. Like a QA checklist.
Tags
Labels applied for categorization and filtering. Like folders in your inbox.
Data Points
Structured values extracted from conversations. Like columns in a spreadsheet.
Sentiment
Emotional tone of the interaction. A customer satisfaction thermometer.
Criteria answer: “Did the agent do what it was supposed to do?” Each one is a yes/no check.
Tags answer: “What kind of conversation was this?” Use them to filter and find patterns.
Data Points answer: “What specific facts or values came up?” They pull structured data out of unstructured conversation.
Sentiment answers: “How did the user feel?” Optionally let negative sentiment affect the grade.
Your agent finishes a conversation (reaches the “completed” state). Incognito chats and internal system interactions are never evaluated.
2
Transcript is Built
The evaluator constructs a role-tagged transcript of the entire conversation, including user messages, agent responses, tool calls, and results. Long transcripts are automatically trimmed.
3
AI Evaluator Runs
A single structured-output LLM call analyzes the transcript against your configured criteria, tags, data points, and sentiment settings.
4
Grade is Computed
The overall grade is determined deterministically based on criterion failures, priority levels, call outcome, sentiment, and action failures.
5
Results Stored & Alerts Fired
Results are persisted. If the grade is “Critical,” the agent owner receives an immediate Slack DM alert.
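To make the transcript step more concrete, here is a minimal sketch of what a role-tagged transcript with trimming could look like. The `Message` shape, the role names, the `max_chars` budget, and the "drop the oldest lines first" strategy are all assumptions for illustration. The actual trimming behavior is not documented here.

```python
from dataclasses import dataclass


@dataclass
class Message:
    # Hypothetical message shape; the real storage format may differ.
    role: str      # e.g. "user", "agent", "tool_call", "tool_result"
    content: str


def build_transcript(messages: list[Message], max_chars: int = 20_000) -> str:
    """Render messages as role-tagged lines, trimming if the result is too long."""
    lines = [f"[{m.role}] {m.content}" for m in messages]
    # Assumed trimming strategy: drop the oldest lines until the transcript
    # fits the budget (each line counts its trailing newline).
    while len(lines) > 1 and sum(len(line) + 1 for line in lines) > max_chars:
        lines.pop(0)
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [
        Message("user", "Where is my order?"),
        Message("tool_call", "lookup_order(42)"),
        Message("tool_result", "shipped"),
        Message("agent", "Your order has shipped."),
    ]
    print(build_transcript(demo))
```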
When does the evaluation fire?
The evaluation fires automatically shortly after the interaction reaches the “completed” state.
For a simple back-and-forth that ends naturally, the evaluation runs once the chat is marked complete.
The evaluation covers the entire transcript up to that point.
There is a short debounce period after the last message so the system doesn’t prematurely evaluate a chat that’s still active.
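As a rough illustration of the debounce check, the sketch below only allows an evaluation once the interaction is completed and a quiet period has passed since the last message. The 30-second window and the function name are assumptions. The real debounce length is not specified.

```python
from datetime import datetime, timedelta

# Assumed value for illustration; the actual debounce window is not documented.
DEBOUNCE = timedelta(seconds=30)


def ready_to_evaluate(
    state: str,
    last_message_at: datetime,
    now: datetime,
    debounce: timedelta = DEBOUNCE,
) -> bool:
    """Return True once the chat is completed and has been quiet long enough."""
    return state == "completed" and (now - last_message_at) >= debounce
```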
What happens if I keep chatting after the evaluation ran?
If you continue a conversation after an evaluation has already run:
The new messages extend the transcript, and the interaction re-enters an active state.
Once the conversation reaches the “completed” state again, a new evaluation automatically runs covering the full updated transcript.
The new evaluation replaces the previous result. You don’t need to manually re-run it.
You always see the most recent evaluation result for any given interaction.
Do I have to manually re-run it, or is it automatic?
Fully automatic. As long as evaluations are enabled, the system handles re-evaluation whenever conversations continue and complete again. You only need to run manually if you want to backfill old conversations or re-evaluate after changing your criteria.
Suggest tags automatically: lets the evaluator propose new tags beyond your predefined vocabulary.
For sentiment, you can provide custom guidance (e.g., “Consider the customer’s final message tone, not their initial frustration”) and check “Negative sentiment affects the overall grade” to auto-downgrade negative experiences.
Criteria are the quality rules your evaluator checks every interaction against. Each one is a clear statement that’s either true or false for a given conversation.
How to think about it: Ask yourself, “If I were reviewing this conversation manually, what would I check for?” Each answer becomes a criterion.
Prohibited action
Agent must NOT do something (e.g., don’t offer unauthorized discounts)
Prohibited words
Agent must NOT say certain things (e.g., no profanity)
Voice & tone
Agent should communicate in a certain style
Other
Anything else (stayed on topic, provided accurate info, etc.)
Choosing priority: Use “Critical” for rules that must never be broken (data leaks, compliance violations). Use “Warning” for quality standards that matter but aren’t urgent (tone issues, minor drifts).
Example criteria for common use cases
| Use Case | Criterion | Prompt | Type | Priority |
| --- | --- | --- | --- | --- |
| Support | Stayed on Topic | “The agent stayed focused on resolving the customer’s issue and did not go off on tangents.” | Other | Warning |
| Support | Accuracy | “The agent provided factually correct information and did not hallucinate.” | Other | Critical |
| Sales | No Unauthorized Discounts | “The agent did not offer discounts not in the approved pricing sheet.” | Prohibited action | Critical |
| Sales | Professional Tone | “The agent maintained a professional, friendly tone throughout.” | Voice & tone | Warning |
| Helpdesk | No PII Disclosure | “The agent did not reveal personal information of other employees.” | Prohibited action | Critical |
| Content | Brand Voice | “The content matches the brand’s voice: confident, concise, and jargon-free.” | Voice & tone | Warning |
Write evaluation prompts as true/false statements. The evaluator returns “success” if it holds, “failure” if it doesn’t, and “unknown” if there’s not enough info to judge.
Limit: 30 criteria per agent.
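If you keep your criteria in code or config before entering them, a small data model can help you stay within the limit. This is a sketch only: the `Criterion` fields mirror the columns in the examples above, but the class names and `validate_criteria` helper are hypothetical and not part of the product.

```python
from dataclasses import dataclass
from enum import Enum

MAX_CRITERIA_PER_AGENT = 30  # limit stated in this guide


class Priority(Enum):
    WARNING = "Warning"
    CRITICAL = "Critical"


class Outcome(Enum):
    # The three results the evaluator can return for a criterion.
    SUCCESS = "success"
    FAILURE = "failure"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class Criterion:
    name: str
    prompt: str          # written as a true/false statement
    type: str            # e.g. "Prohibited action", "Voice & tone", "Other"
    priority: Priority


def validate_criteria(criteria: list[Criterion]) -> None:
    """Raise ValueError if the list exceeds the per-agent limit."""
    if len(criteria) > MAX_CRITERIA_PER_AGENT:
        raise ValueError(
            f"{len(criteria)} criteria configured; the limit is {MAX_CRITERIA_PER_AGENT}"
        )
```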
Tags are labels for categorization. After analyzing a conversation, the evaluator applies tags whose description matches what happened. You can then filter your interactions list by tag.
How to think about it: Tags are like labels in Gmail. They don’t pass/fail anything. They just categorize. Ask yourself, “What categories would help me filter and find patterns in my conversations?”
The evaluator applies matching tags after analyzing the conversation
Tag names are normalized (e.g., “off course” → OFF_COURSE)
You can manually add/remove tags on any evaluated interaction
Use them to filter the interactions list
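The normalization step can be approximated as follows. This sketch assumes the rule is "uppercase, with runs of non-alphanumeric characters collapsed to a single underscore", which matches the “off course” → OFF_COURSE example. The exact rules the product applies may differ.

```python
import re


def normalize_tag(name: str) -> str:
    """Normalize a free-form tag name to UPPER_SNAKE_CASE (assumed rule)."""
    # Replace each run of non-alphanumeric characters with one underscore,
    # then drop any leading/trailing underscores and uppercase the result.
    collapsed = re.sub(r"[^A-Za-z0-9]+", "_", name.strip())
    return collapsed.strip("_").upper()


if __name__ == "__main__":
    print(normalize_tag("off course"))
```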
Example tags for common use cases
| Tag | Description | Use Case |
| --- | --- | --- |
| ESCALATION_NEEDED | “Customer asked to speak with a human or the issue is too complex for the agent.” | Support triage |
| UPSELL_OPPORTUNITY | “Customer expressed interest in additional features beyond what they currently use.” | Sales analytics |
| AI_SLOP | “Agent’s response contained filler phrases or generic AI-sounding language.” | Quality monitoring |
| OFF_COURSE | “Agent deviated from the user’s original question.” | Focus tracking |
| POSITIVE_FEEDBACK | “User explicitly praised the agent or expressed satisfaction.” | CSAT proxy |
| TECHNICAL_ISSUE | “Conversation involved a bug report or system malfunction.” | Issue categorization |
Enable “Suggest tags automatically” in Settings and the evaluator will also generate new tags for patterns it notices beyond your predefined vocabulary.
Limit: 50 tags per agent.
Data collection lets you extract structured values from every interaction. Define what to extract, and the evaluator pulls it out automatically.
How to think about it: Imagine hiring someone to read every conversation and fill out a spreadsheet. Each column is a data point. You define the column name, value type, and extraction instructions.
| Data Point | Type | Instructions | Use Case |
| --- | --- | --- | --- |
|  |  | “Rate 1-10 how confident the agent appeared based on hedging language vs. direct statements.” | Quality scoring |
| Resolution Status | Text | “Values: resolved, unresolved, partial, or unknown.” | Support metrics |
| Handoff Requested | Boolean | “Did the user ask to speak with a human?” | Escalation tracking |
| Number of Tool Calls | Integer | “Count distinct tools the agent used.” | Efficiency analysis |
| Customer Intent | Text | “Summarize the customer’s primary intent in 2-5 words.” | Intent classification |
| Response Quality | Number | “Rate 1-10 considering accuracy, completeness, and helpfulness.” | Performance benchmarking |
Data points that return null mean the evaluator couldn’t find the information in the transcript. This is expected for data points that don’t apply to every conversation.
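If you export extracted data points and process them yourself, you will want to handle nulls and types explicitly. The sketch below assumes values arrive as strings and uses the four value types shown in the examples (Text, Boolean, Integer, Number). The `coerce_data_point` helper and the accepted boolean spellings are hypothetical.

```python
from typing import Optional, Union

Value = Union[str, bool, int, float, None]


def coerce_data_point(raw: Optional[str], value_type: str) -> Value:
    """Convert a raw extracted string to its declared type; None means 'not found'."""
    if raw is None or raw.strip() == "":
        return None  # the evaluator found nothing in the transcript
    text = raw.strip()
    if value_type == "Text":
        return text
    if value_type == "Boolean":
        lowered = text.lower()
        if lowered in ("true", "yes"):
            return True
        if lowered in ("false", "no"):
            return False
        return None  # unrecognized spelling: treat as missing
    if value_type not in ("Integer", "Number"):
        raise ValueError(f"unknown value type: {value_type}")
    try:
        return int(text) if value_type == "Integer" else float(text)
    except ValueError:
        return None  # not parseable as a number: treat as missing
```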
| Grade | Meaning | Assigned when |
| --- | --- | --- |
| Pass |  | No failures, call wasn’t a failure, sentiment not negative |
| Warning | Needs review | Warning-priority criterion failed, OR call outcome “failure”, OR negative sentiment affects grade |
| Critical | Immediate attention | Critical-priority criterion failed, OR tool/action failures during interaction |
A grade of “Inconclusive” appears when all criteria returned “unknown” (evaluator couldn’t determine pass/fail). This usually means the transcript was too short to evaluate.
Grade computation logic (deterministic)
The grade is computed deterministically (not by the LLM) after results come back:
If there were action failures (tool errors) → Critical
If any Critical-priority criterion failed → Critical
If any Warning-priority criterion failed → Warning
If overall call outcome was “failure” → Warning
If sentiment is negative AND affects grade → Warning
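The rules above can be sketched as a plain function. The rule order follows the list. Two parts are assumptions: that a conversation matching none of the rules is graded “Pass”, and that “Inconclusive” is only checked after the failure rules. The names `CriterionResult` and `compute_grade` are illustrative, not the product's API.

```python
from dataclasses import dataclass


@dataclass
class CriterionResult:
    name: str
    priority: str  # "Critical" or "Warning"
    outcome: str   # "success", "failure", or "unknown"


def compute_grade(
    results: list[CriterionResult],
    action_failures: int,
    call_outcome: str,
    sentiment: str,
    sentiment_affects_grade: bool,
) -> str:
    """Apply the grading rules in order; the first matching rule wins."""
    failed = [r for r in results if r.outcome == "failure"]
    if action_failures > 0:
        return "Critical"
    if any(r.priority == "Critical" for r in failed):
        return "Critical"
    if any(r.priority == "Warning" for r in failed):
        return "Warning"
    if call_outcome == "failure":
        return "Warning"
    if sentiment == "negative" and sentiment_affects_grade:
        return "Warning"
    # Assumption: "Inconclusive" applies only when no rule above matched
    # and every configured criterion came back "unknown".
    if results and all(r.outcome == "unknown" for r in results):
        return "Inconclusive"
    return "Pass"
```

Because no LLM is involved at this stage, the same evaluator output always produces the same grade.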
The interactions list includes an Evaluation column showing the grade and criteria pass rate at a glance (e.g., “Pass 3/3” or “Critical 1/3”). Lifecycle states also appear: Queued, Evaluating, or Failed.
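For reference, the column text can be reproduced from a grade and the per-criterion outcomes. This sketch assumes the pass rate counts “success” outcomes over all configured criteria, including those that came back “unknown”. The helper name is hypothetical.

```python
def evaluation_label(grade: str, outcomes: list[str]) -> str:
    """Format a grade and pass rate the way the Evaluation column shows them."""
    passed = sum(1 for outcome in outcomes if outcome == "success")
    return f"{grade} {passed}/{len(outcomes)}"
```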
When an interaction receives a Critical grade, the agent owner gets an immediate Slack DM with the agent’s name, which criteria failed, and a direct link to the interaction.
No additional setup is required, as long as Slack is connected.
Warning-grade interactions are flagged in the interactions list but don’t trigger an alert. Review them periodically by filtering for Warning grades.
Incognito chats: Never evaluated (privacy guarantee)
Internal interactions: Agent-to-agent feedback loops like reflections are excluded
Non-terminal states: Only interactions that reach “completed” are evaluated
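Taken together, the three exclusions amount to a simple eligibility check. The `Interaction` fields below are hypothetical names chosen for illustration. Only the three rules themselves come from this guide.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    # Hypothetical fields; the real interaction record may look different.
    state: str               # e.g. "active", "completed"
    incognito: bool = False  # incognito chats are never evaluated
    internal: bool = False   # agent-to-agent loops such as reflections


def is_evaluable(interaction: Interaction) -> bool:
    """Return True only for completed, non-incognito, non-internal interactions."""
    return (
        interaction.state == "completed"
        and not interaction.incognito
        and not interaction.internal
    )
```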
Can I evaluate interactions from before I enabled evaluations?
Yes. Go to your agent’s Chats page, click the three-dot menu on any interaction, and select “Run evaluation.” You can also select multiple for bulk evaluation. They’ll be graded against your current criteria.
What happens if I keep chatting after an evaluation already ran?
The interaction re-enters an active state. Once it completes again, a new evaluation runs automatically covering the full updated transcript. The new result replaces the previous one.
Does the evaluation fire only once per chat?
By default, yes. The evaluation fires once after the conversation completes (with a short debounce window to avoid premature evaluation). If the conversation is resumed and completes again, a new evaluation runs automatically covering the updated transcript.
What happens if the evaluation fails?
It’s marked as “Failed” rather than showing a misleading pass, and it doesn’t count toward your metrics. You can manually re-run it from the Chats page, or it will be re-evaluated automatically if the conversation continues.
How is the analysis model different from my agent's model?
The evaluator runs as a separate LLM call dedicated to evaluation, so it doesn’t affect your agent’s behavior, and you can choose a different model for it. The evaluator never sees your agent’s system prompt directly, only the transcript and the agent’s description and skills as context.
Does the evaluator see my agent's system prompt?
It receives your agent’s name, description, and available skills as context. It does NOT receive the full system prompt. Transcript content is treated as data to analyze, never as instructions.
What counts as an 'action failure'?
Tool/integration errors during the interaction (e.g., an API call that returned an error, a tool that timed out). These automatically cause a “Critical” grade.
Can I set up evaluations at the organization level?
Not yet. Evaluations are currently configured per agent. Organization-level configs are planned for a future release.
What happens if I delete a criterion or tag?
Past evaluation results retain their original data. Deleting only affects future evaluations.
Can I edit tags on an evaluation result?
Yes. Manually add or remove tags using the tag editor in the evaluation detail view.
Why does my evaluation show 'Inconclusive'?
All criteria returned “unknown”, meaning the evaluator couldn’t determine pass/fail. This typically means the conversation was too short or didn’t touch on what your criteria check.
How much do evaluations cost in credits?
Each evaluation is billed based on tokens used. Cost depends on conversation length, model selected, and number of criteria/tags/data points configured. Track usage on your Usage & Limits page under “AI Utilities.”