Evaluations Overview
Quality scoring with human ratings, LLM judges, and programmatic checks.
Evaluations in Pipelines score field outputs against defined quality dimensions. Scores can come from a human filling out a rating in the task form, an LLM acting as a judge, or an automated programmatic check (regex, keyword, JSON validity, etc.).
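To make the programmatic option concrete, here is a minimal sketch of what regex, keyword, and JSON-validity checks could look like. The function names and signatures are hypothetical and are not part of the Pipelines API; they only illustrate the kind of pass/fail logic such a check performs.

```python
import json
import re


def regex_check(output: str, pattern: str) -> bool:
    """Pass when the pattern matches anywhere in the output."""
    return re.search(pattern, output) is not None


def keyword_check(output: str, keywords: list[str]) -> bool:
    """Pass when every keyword appears in the output, ignoring case."""
    lowered = output.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


def json_validity_check(output: str) -> bool:
    """Pass when the output parses as JSON."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return False
    return True
```

Each function returns a boolean, which maps naturally onto a pass/fail criterion such as "Response is valid JSON".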
The building blocks
- Criterion: a single, reusable quality metric (e.g. "Factual accuracy, 1–5" or "Response is valid JSON"). Criteria live in the Evaluations → Criteria library and are the atomic unit: every evaluation result in the system is tied to exactly one criterion at a specific version.
- Evaluator: a criterion that has been attached to a specific pipeline node. This is what actually produces a result for a task. Evaluators are configured in one of two places:
  - Inline, via the Form Builder: surfaced as a form field visible to contributors.
  - Hidden, via the Evaluators panel: run server-side, invisible to contributors.
- Evaluation: a named bundle of criteria you can curate in the library for reuse. Evaluations are currently a library-level grouping primitive and are not yet used when attaching evaluators to pipeline nodes.
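The relationships between these building blocks can be sketched as a small data model. This is an illustration only: every class and field name below is an assumption made for the example, not the product's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class CriterionType(Enum):
    HUMAN_RATING = "human_rating"
    LLM_JUDGE = "llm_judge"
    PROGRAMMATIC = "programmatic"


@dataclass(frozen=True)
class Criterion:
    """A single reusable quality metric at a specific version."""
    criterion_id: str
    name: str
    type: CriterionType
    version: int


@dataclass(frozen=True)
class Evaluator:
    """A criterion attached to one pipeline node."""
    node_id: str
    criterion: Criterion
    hidden: bool  # False = inline (Form Builder), True = Evaluators panel


@dataclass(frozen=True)
class Evaluation:
    """A named, library-level bundle of criteria."""
    name: str
    criteria: tuple[Criterion, ...]


@dataclass(frozen=True)
class EvaluationResult:
    """A result tied to exactly one criterion at a specific version."""
    task_id: str
    criterion_id: str
    criterion_version: int
    value: object


def record_result(evaluator: Evaluator, task_id: str, value: object) -> EvaluationResult:
    # Copy the criterion id and version onto the result so the link
    # survives later edits to the criterion.
    criterion = evaluator.criterion
    return EvaluationResult(task_id, criterion.criterion_id, criterion.version, value)
```

Storing the criterion version on each result, rather than a reference to the latest criterion, is what lets a result stay tied to the version that produced it.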
Inline vs. Hidden evaluators
Both kinds produce evaluation results attached to tasks; the difference is who sees the evaluator during tasking.
| | Inline evaluator | Hidden evaluator |
|---|---|---|
| Where configured | Form Builder (as a field) | Evaluators panel |
| Visible to contributors | Yes (part of the form) | No (runs in the background) |
| Supported criterion types | Human Rating, LLM Judge, Programmatic | LLM Judge, Programmatic only (not Human Rating) |
| When to use | You want a contributor to fill it in or see the score, or the field under evaluation is also something the contributor should rate or re-run | You want quality monitoring that contributors should not see or be influenced by |
You can mix both on the same node — for example, contributors fill in an inline human rating while a hidden LLM judge scores the same response in the background.
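The "Supported criterion types" row of the table amounts to a simple attachment rule. A minimal sketch, assuming hypothetical string identifiers for the three criterion types:

```python
# Hypothetical type identifiers; the product may name these differently.
INLINE_ALLOWED = {"human_rating", "llm_judge", "programmatic"}
HIDDEN_ALLOWED = {"llm_judge", "programmatic"}


def can_attach(criterion_type: str, hidden: bool) -> bool:
    """Return True when a criterion of this type may be attached in the given mode."""
    allowed = HIDDEN_ALLOWED if hidden else INLINE_ALLOWED
    return criterion_type in allowed
```

Under this rule, `can_attach("human_rating", hidden=True)` is `False`, since a hidden evaluator has no contributor to fill in the rating.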
Where things live in the UI
| Thing | How to get there |
|---|---|
| Criteria library | Sidebar → Evaluations → Criteria tab |
| Evaluations library | Sidebar → Evaluations → Evaluations tab |
| Attach an inline evaluator | Pipeline Builder → select a node → Form Builder → add a field (see Running evaluations) |
| Attach a hidden evaluator | Pipeline Builder → select a node → Evaluators panel → Add |
| Trigger a manual evaluator on existing tasks | Pipeline → Data Explorer → select rows → Evaluate button |
| View results per task | Data Explorer table, or click View on a row to open the Task Detail panel |
| View aggregate analytics | Pipeline → Data Explorer → Evaluation Analytics tab |
End-to-end flow
- Create the criteria you need in the Criteria tab, or reuse existing ones.
- (Optional) Bundle criteria into an Evaluation for organizational reuse.
- Open the Pipeline Builder, select a subtask or review node, and attach evaluators, either inline via the Form Builder or hidden via the Evaluators panel. See Running evaluations for the three ways to make a field evaluative.
- For hidden evaluators only, choose a Trigger:
  - On Submit: runs automatically when the node is submitted.
  - Manual: only runs when triggered from the Data Explorer.

  Inline evaluative fields (Criteria fields and toggled-evaluative form fields) do not have a trigger selector; they always execute inline as part of the node's lifecycle.
- Submit tasks. Inline and On-Submit evaluators produce results as the node is submitted (inline fields populate immediately; hidden On-Submit evaluators queue as a background job).
- For Manual hidden evaluators, open the Data Explorer, select tasks, click Evaluate, and confirm in the dialog.
- View results in the Data Explorer table (per-task) and the Evaluation Analytics tab (aggregate charts and breakdowns).
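The trigger behavior in the flow above can be sketched as a small planning function that sorts a node's evaluators by what happens at submission. The `EvaluatorConfig` type, the trigger strings, and the bucket names are all assumptions made for this example, not product identifiers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvaluatorConfig:
    name: str
    hidden: bool
    trigger: Optional[str] = None  # "on_submit" or "manual"; hidden evaluators only


def plan_submission(evaluators: list[EvaluatorConfig]) -> dict[str, list[str]]:
    """Group evaluator names by what happens when the node is submitted."""
    plan: dict[str, list[str]] = {"inline": [], "queued": [], "awaiting_manual": []}
    for evaluator in evaluators:
        if not evaluator.hidden:
            # Inline fields have no trigger; they run as part of the node's lifecycle.
            plan["inline"].append(evaluator.name)
        elif evaluator.trigger == "on_submit":
            # Hidden On Submit evaluators are queued as a background job.
            plan["queued"].append(evaluator.name)
        elif evaluator.trigger == "manual":
            # Manual evaluators wait until triggered from the Data Explorer.
            plan["awaiting_manual"].append(evaluator.name)
        else:
            raise ValueError(f"hidden evaluator {evaluator.name!r} needs a trigger")
    return plan
```

A node that mixes an inline rating with one hidden evaluator of each trigger type would produce one entry per bucket.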
Versioning
Criteria and Evaluations are versioned. A new criterion version is created when its display label, config, or output schema changes on save; name- or description-only edits do not bump the version. When a criterion's version bumps, any Evaluations that include it automatically bump their own version and update their pinned criterion reference to the new version. Pipelines that reference the criterion do not auto-update — they keep running against the version they were pinned to until the pipeline is edited.
Existing evaluation results stay attached to the version that produced them, so historical data is never silently rewritten.
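The version rules above can be sketched in a few lines. This is an illustration under stated assumptions: the field names, the `save_criterion` and `cascade` helpers, and the integer version counter are hypothetical and do not describe the product's internals.

```python
from dataclasses import dataclass, replace

# Fields whose changes create a new criterion version on save.
VERSIONED_FIELDS = ("display_label", "config", "output_schema")


@dataclass(frozen=True)
class Criterion:
    criterion_id: str
    name: str
    description: str
    display_label: str
    config: dict
    output_schema: dict
    version: int = 1


@dataclass(frozen=True)
class Evaluation:
    name: str
    pinned: dict  # criterion_id -> pinned criterion version
    version: int = 1


def save_criterion(current: Criterion, **changes) -> Criterion:
    """Apply edits; bump the version only if a versioned field actually changed."""
    bump = any(
        field in changes and changes[field] != getattr(current, field)
        for field in VERSIONED_FIELDS
    )
    return replace(current, **changes, version=current.version + int(bump))


def cascade(evaluation: Evaluation, updated: Criterion) -> Evaluation:
    """Re-pin an Evaluation to the new criterion version and bump its own version."""
    pinned_version = evaluation.pinned.get(updated.criterion_id)
    if pinned_version is None or pinned_version == updated.version:
        return evaluation
    new_pins = {**evaluation.pinned, updated.criterion_id: updated.version}
    return replace(evaluation, pinned=new_pins, version=evaluation.version + 1)
```

In this sketch a description-only edit leaves both the criterion and any containing Evaluation unchanged, while a config edit bumps both. A pipeline would simply hold on to the version number it was pinned to and never call `cascade`.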