Skip to main content
Evaluations give you automated quality assurance for your agents. After every interaction completes, an AI evaluator analyzes the full conversation transcript and produces a structured report: a grade, per-criterion pass/fail results with rationales, sentiment analysis, applied tags, and extracted data points. Think of it as having a QA analyst reviewing every single conversation your agent has, 24/7, without you lifting a finger.
Evaluations is a Pro feature. If you’re on the free tier, you’ll see an upgrade prompt when you try to enable it.

Where to Find Evaluations

Open your agent and click Evaluations in the left-hand sidebar.
Agent sidebar showing the Evaluations menu item

The Building Blocks

Before jumping into setup, here’s the mental model. Evaluations have four building blocks:

Criteria

Quality rules the evaluator checks pass/fail on every conversation. Like a QA checklist.

Tags

Labels applied for categorization and filtering. Like folders in your inbox.

Data Points

Structured values extracted from conversations. Like columns in a spreadsheet.

Sentiment

Emotional tone of the interaction. A customer satisfaction thermometer.
Criteria answer: “Did the agent do what it was supposed to do?” Each one is a yes/no check. Tags answer: “What kind of conversation was this?” Use them to filter and find patterns. Data Points answer: “What specific facts or values came up?” They pull structured data out of unstructured conversation. Sentiment answers: “How did the user feel?” Optionally let negative sentiment affect the grade.

How Evaluations Work

1

Interaction Completes

Your agent finishes a conversation (reaches the “completed” state). Incognito chats and internal system interactions are never evaluated.
2

Transcript is Built

The evaluator constructs a role-tagged transcript of the entire conversation, including user messages, agent responses, tool calls, and results. Long transcripts are automatically trimmed.
3

AI Evaluator Runs

A single structured-output LLM call analyzes the transcript against your configured criteria, tags, data points, and sentiment settings.
4

Grade is Computed

The overall grade is determined deterministically based on criterion failures, priority levels, call outcome, sentiment, and action failures.
5

Results Stored & Alerts Fired

Results are persisted. If the grade is “Critical,” the agent owner receives an immediate Slack DM alert.
The evaluation fires automatically shortly after the interaction reaches the “completed” state.
  • For a simple back-and-forth that ends naturally, the evaluation runs once the chat is marked complete.
  • The evaluation covers the entire transcript up to that point.
  • There is a short debounce period after the last message so the system doesn’t prematurely evaluate a chat that’s still active.
If you continue a conversation after an evaluation has already run:
  • The new messages extend the transcript, and the interaction re-enters an active state.
  • Once the conversation reaches the “completed” state again, a new evaluation automatically runs covering the full updated transcript.
  • The new evaluation replaces the previous result. You don’t need to manually re-run it.
  • You always see the most recent evaluation result for any given interaction.
Fully automatic. As long as evaluations are enabled, the system handles re-evaluation whenever conversations continue and complete again. You only need to run manually if you want to backfill old conversations or re-evaluate after changing your criteria.

Setting Up Evaluations

Navigate to your agent’s Evaluations tab in the sidebar. You’ll see the settings panel:
Evaluation settings panel showing the enable toggle, model selector, sentiment analysis, and auto-tags options
SettingWhat It Does
Enable evaluationsToggle to start automatically grading interactions
Default analysis modelWhich LLM runs the evaluation. “Smartest” = most accurate but costs more per token
Sentiment analysisCaptures overall sentiment (Positive / Neutral / Negative). Optionally affects grade.
Suggest tags automaticallyLets the evaluator propose new tags beyond your predefined vocabulary
For sentiment, you can provide custom guidance (e.g., “Consider the customer’s final message tone, not their initial frustration”) and check “Negative sentiment affects the overall grade” to auto-downgrade negative experiences.

Configuring Criteria, Tags & Data Points

Criteria are the quality rules your evaluator checks every interaction against. Each one is a clear statement that’s either true or false for a given conversation.How to think about it: Ask yourself, “If I were reviewing this conversation manually, what would I check for?” Each answer becomes a criterion.
Evaluation criteria table

Adding a Criterion

Click + Add criterion and fill in:
Add criterion form
FieldDescription
NameA short label (e.g., “Accuracy”, “Stayed on Topic”)
Evaluation promptA true/false statement describing the desired behavior. Be specific.
TypeCategorizes the criterion: Prohibited action, Prohibited words, Voice & tone, or Other
PriorityWarning (downgrades to Warning) or Critical (downgrades to Critical + Slack alert)

Types & Priority

TypeWhen to Use
Prohibited actionAgent must NOT do something (e.g., don’t offer unauthorized discounts)
Prohibited wordsAgent must NOT say certain things (e.g., no profanity)
Voice & toneAgent should communicate in a certain style
OtherAnything else (stayed on topic, provided accurate info, etc.)
Choosing priority: Use “Critical” for rules that must never be broken (data leaks, compliance violations). Use “Warning” for quality standards that matter but aren’t urgent (tone issues, minor drifts).
Use CaseCriterionPromptTypePriority
SupportStayed on Topic”The agent stayed focused on resolving the customer’s issue and did not go off on tangents.”OtherWarning
SupportAccuracy”The agent provided factually correct information and did not hallucinate.”OtherCritical
SalesNo Unauthorized Discounts”The agent did not offer discounts not in the approved pricing sheet.”Prohibited actionCritical
SalesProfessional Tone”The agent maintained a professional, friendly tone throughout.”Voice & toneWarning
HelpdeskNo PII Disclosure”The agent did not reveal personal information of other employees.”Prohibited actionCritical
ContentBrand Voice”The content matches the brand’s voice: confident, concise, and jargon-free.”Voice & toneWarning
Write evaluation prompts as true/false statements. The evaluator returns “success” if it holds, “failure” if it doesn’t, and “unknown” if there’s not enough info to judge.
Limit: 30 criteria per agent.

Understanding Results

Once an interaction is evaluated, you can see the full results in the interaction detail view’s Overview tab.
Evaluation result showing summary, grade, outcome, sentiment, criteria results, tags, and collected data

Grades

GradeMeaningWhat Triggers It
PassMet all criteriaNo failures, call wasn’t a failure, sentiment not negative
WarningNeeds reviewWarning-priority criterion failed, OR call outcome “failure”, OR negative sentiment affects grade
CriticalImmediate attentionCritical-priority criterion failed, OR tool/action failures during interaction
A grade of “Inconclusive” appears when all criteria returned “unknown” (evaluator couldn’t determine pass/fail). This usually means the transcript was too short to evaluate.
The grade is computed deterministically (not by the LLM) after results come back:
  1. If there were action failures (tool errors) → Critical
  2. If any Critical-priority criterion failed → Critical
  3. If any Warning-priority criterion failed → Warning
  4. If overall call outcome was “failure” → Warning
  5. If sentiment is negative AND affects grade → Warning
  6. Otherwise → Pass

What’s Shown in Results

  • Summary: One or two sentence narrative of what happened
  • Grade: Pass / Warning / Critical badge
  • Outcome: Successful / Failed / Unknown
  • Sentiment: Positive / Neutral / Negative (if enabled)
  • Tags: Applied tags from your vocabulary + auto-generated
  • Criteria: Per-criterion pass/fail with the evaluator’s rationale
  • Collected data: Extracted values for each data point

Interactions List View

The interactions list includes an Evaluation column showing the grade and criteria pass rate at a glance (e.g., “Pass 3/3” or “Critical 1/3”). Lifecycle states also appear: Queued, Evaluating, or Failed.

Running Evaluations Manually

Evaluations run automatically, but you can also trigger them manually from the Chats page.
1

Go to Chats

Navigate to your agent’s Chats page from the left-hand sidebar.
2

Find the Interaction

Locate the chat you want to evaluate. The Evaluation column shows the current state.
3

Open the Actions Menu

Click the three-dot menu (⋮) on the right side of the interaction row.
4

Click Run Evaluation

Select “Run evaluation” from the dropdown. Results will appear shortly.
Interactions list showing the three-dot menu with Run evaluation option
You can also select multiple interactions for bulk evaluation. When to use manual runs:
  • Backfilling existing conversations after enabling evaluations
  • Re-evaluating after changing your criteria/tags/data points
  • Retrying a failed evaluation
  • Spot-checking specific conversations on demand
Manual evaluations use the current configuration. If you’ve changed your setup, a re-run reflects the new rules.

Alerts

When an interaction receives a Critical grade, the agent owner gets an immediate Slack DM with the agent’s name, which criteria failed, and a direct link to the interaction. No additional setup required, as long as Slack is connected.
Warning-grade interactions are flagged in the interactions list but don’t trigger an alert. Review them periodically by filtering for Warning grades.

Credits and Costs

Evaluations are billed as AI credits under the “AI Utilities” category. Each evaluation is a single LLM call.
Three factors:
  1. Transcript length: Longer conversations use more input tokens. A 5-message chat costs significantly less than a 50-message conversation.
  2. Analysis model: The model you select determines the per-token rate. “Smartest” costs more than faster alternatives.
  3. Schema complexity: More criteria, tags, and data points = more output tokens to generate.
The evaluator checks that the user has sufficient credits before running. If credits are insufficient, the evaluation is skipped silently.

Where to See Credit Usage

View all evaluation credit usage on your Usage & Limits page. Filter by “AI Utilities” to see individual evaluation runs and their credit amounts.
Credit Usage Logs page filtered to AI Utilities showing Interaction Evaluation entries

Limits

ResourceMaximum
Criteria per agent30
Tags per agent50
Data points per agent40
Tag name length100 characters
Tag description length500 characters

FAQ

All completed interactions are evaluated, except:
  • Incognito chats: Never evaluated (privacy guarantee)
  • Internal interactions: Agent-to-agent feedback loops like reflections are excluded
  • Non-terminal states: Only interactions that reach “completed” are evaluated
Yes. Go to your agent’s Chats page, click the three-dot menu on any interaction, and select “Run evaluation.” You can also select multiple for bulk evaluation. They’ll be graded against your current criteria.
The interaction re-enters an active state. Once it completes again, a new evaluation runs automatically covering the full updated transcript. The new result replaces the previous one.
By default, yes. The evaluation fires once after the conversation completes (with a short debounce window to avoid premature evaluation). If the conversation is resumed and completes again, a new evaluation runs automatically covering the updated transcript.
It’s marked as “Failed” rather than showing a misleading pass. Doesn’t affect metrics. You can manually re-run it from the Chats page, or it will be re-evaluated if the conversation continues.
It’s a separate LLM call dedicated to evaluation. Doesn’t affect your agent’s behavior. You can choose a different model. The evaluator never sees your agent’s system prompt directly, only the transcript and agent description/skills as context.
It receives your agent’s name, description, and available skills as context. It does NOT receive the full system prompt. Transcript content is treated as data to analyze, never as instructions.
Tool/integration errors during the interaction (e.g., an API call that returned an error, a tool that timed out). These automatically cause a “Critical” grade.
Currently configured per agent. Organization-level configs are planned for a future release.
Past evaluation results retain their original data. Deleting only affects future evaluations.
Yes. Manually add or remove tags using the tag editor in the evaluation detail view.
All criteria returned “unknown” (evaluator couldn’t determine pass/fail). Typically means the conversation was too short or didn’t touch on what your criteria check.
Each evaluation is billed based on tokens used. Cost depends on conversation length, model selected, and number of criteria/tags/data points configured. Track usage on your Usage & Limits page under “AI Utilities.”