Evaluations Overview

Evaluations in Pipelines score field outputs against defined quality dimensions. Scores can come from a human filling out a rating in the task form, an LLM acting as a judge, or an automated programmatic check (regex, keyword, JSON validity, etc.).

The building blocks

Criterion — a single, reusable quality metric (e.g. Factual accuracy, 1–5 or Response is valid JSON). Criteria live in the Evaluations → Criteria library and are the atomic unit: every evaluation result in the system is tied to exactly one criterion at a specific version.
Evaluator — a criterion that has been attached to a specific pipeline node. This is what actually produces a result for a task. Evaluators are configured in one of two places:
- Inline, via the Form Builder — surfaced as a form field visible to contributors.
- Hidden, via the Evaluators panel — run server-side, invisible to contributors.
Evaluation — a named bundle of criteria you can curate in the library for reuse. Evaluations are a library-level grouping primitive today and are not yet used when attaching evaluators to pipeline nodes.
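As a rough mental model, the relationship between these building blocks can be sketched as a few Python dataclasses. This is only an illustration: the class and field names (`Criterion`, `Evaluator`, `Evaluation`, `EvaluationResult`, `node_id`, `placement`) are hypothetical and do not describe the product's actual schema or API.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class CriterionType(Enum):
    HUMAN_RATING = "human_rating"
    LLM_JUDGE = "llm_judge"
    PROGRAMMATIC = "programmatic"


class Placement(Enum):
    INLINE = "inline"  # configured in the Form Builder, visible to contributors
    HIDDEN = "hidden"  # configured in the Evaluators panel, runs server-side


@dataclass(frozen=True)
class Criterion:
    """A single reusable quality metric (hypothetical shape)."""
    name: str
    type: CriterionType
    version: int = 1


@dataclass(frozen=True)
class Evaluator:
    """A criterion attached to one pipeline node."""
    criterion: Criterion
    node_id: str
    placement: Placement


@dataclass(frozen=True)
class Evaluation:
    """A named, library-level bundle of criteria."""
    name: str
    criteria: Tuple[Criterion, ...]


@dataclass(frozen=True)
class EvaluationResult:
    """Every result is tied to exactly one criterion at a specific version."""
    task_id: str
    criterion_name: str
    criterion_version: int
    score: object


valid_json = Criterion("Response is valid JSON", CriterionType.PROGRAMMATIC)
accuracy = Criterion("Factual accuracy", CriterionType.HUMAN_RATING)
evaluator = Evaluator(valid_json, node_id="review-1", placement=Placement.HIDDEN)
bundle = Evaluation("Response quality", (accuracy, valid_json))
result = EvaluationResult("task-42", valid_json.name, valid_json.version, True)
```

Note that the result records the criterion name and version rather than the evaluator, mirroring the rule that results are tied to a criterion at a specific version.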

Inline vs. Hidden evaluators

Both kinds produce evaluation results attached to tasks; the difference is who sees the evaluator during tasking.

	Inline evaluator	Hidden evaluator
Where configured	Form Builder (as a field)	Evaluators panel
Visible to contributors	Yes (part of the form)	No (runs in the background)
Supported criterion types	Human Rating, LLM Judge, Programmatic	LLM Judge, Programmatic only (not Human Rating)
When to use	You want a contributor to fill it in or see the score, or the field under evaluation is also something the contributor should rate or re-run	You want quality monitoring that contributors should not see or be influenced by

You can mix both on the same node — for example, contributors fill in an inline human rating while a hidden LLM judge scores the same response in the background.
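The "Supported criterion types" row of the table can be expressed as a small lookup. The sketch below assumes hypothetical string identifiers for placements and criterion types; it is not the product's validation code.

```python
# Which criterion types each kind of evaluator accepts, per the table above.
# The string identifiers are illustrative assumptions.
SUPPORTED_TYPES = {
    "inline": {"human_rating", "llm_judge", "programmatic"},
    "hidden": {"llm_judge", "programmatic"},  # no Human Rating
}


def can_attach(placement: str, criterion_type: str) -> bool:
    """Return True if a criterion of this type may be attached with this placement."""
    if placement not in SUPPORTED_TYPES:
        raise ValueError(f"unknown placement: {placement!r}")
    return criterion_type in SUPPORTED_TYPES[placement]


# Mixing both on one node: an inline human rating plus a hidden LLM judge.
node_evaluators = [("inline", "human_rating"), ("hidden", "llm_judge")]
all_valid = all(can_attach(p, t) for p, t in node_evaluators)
```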

Where things live in the UI

Thing	How to get there
Criteria library	Sidebar → Evaluations → Criteria tab
Evaluations library	Sidebar → Evaluations → Evaluations tab
Attach an inline evaluator	Pipeline Builder → select a node → open the Form Builder, add a field (see Running evaluations)
Attach a hidden evaluator	Pipeline Builder → select a node → Evaluators panel → Add
Trigger a manual evaluator on existing tasks	Pipeline → Data Explorer → select rows → Evaluate button
View results per task	Data Explorer table, or click View on a row to open the Task Detail panel
View aggregate analytics	Pipeline → Data Explorer → Evaluation Analytics tab

End-to-end flow

Create the criteria you need in the Criteria tab, or reuse existing ones.
(Optional) Bundle criteria into an Evaluation for organizational reuse.
Open the Pipeline Builder, select a subtask or review node, and attach evaluators — either inline via the Form Builder or hidden via the Evaluators panel. See Running evaluations for the three ways to make a field evaluative.
For hidden evaluators only, choose a Trigger:
- On Submit — runs automatically when the node is submitted.
- Manual — only runs when triggered from the Data Explorer.
Inline evaluative fields (Criteria fields and toggled-evaluative form fields) do not have a trigger selector — they always execute inline as part of the node's lifecycle.
Submit tasks. Inline and On Submit evaluators are triggered when the node is submitted: inline fields populate immediately, while hidden On Submit evaluators are queued as a background job.
For Manual hidden evaluators, open the Data Explorer, select tasks, click Evaluate, and confirm in the dialog.
View results in the Data Explorer table (per-task) and the Evaluation Analytics tab (aggregate charts and breakdowns).
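The trigger rules in the flow above can be summarized in a short sketch. It assumes a hypothetical `Evaluator` record with `placement` and `trigger` fields and hypothetical function names; the real system's scheduling is not described here beyond what the steps state.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Evaluator:
    name: str
    placement: str                 # "inline" or "hidden"
    trigger: Optional[str] = None  # "on_submit" or "manual"; hidden only

    def __post_init__(self) -> None:
        if self.placement == "inline":
            # Inline evaluative fields have no trigger selector.
            if self.trigger is not None:
                raise ValueError("inline evaluators have no trigger selector")
        elif self.placement == "hidden":
            if self.trigger not in ("on_submit", "manual"):
                raise ValueError("hidden evaluators need on_submit or manual")
        else:
            raise ValueError(f"unknown placement: {self.placement!r}")


def plan_on_submit(evaluators: List[Evaluator]) -> Tuple[List[str], List[str]]:
    """Return (populated immediately, queued as background job) for a node submission."""
    immediate = [e.name for e in evaluators if e.placement == "inline"]
    queued = [
        e.name
        for e in evaluators
        if e.placement == "hidden" and e.trigger == "on_submit"
    ]
    return immediate, queued


def plan_manual(evaluators: List[Evaluator]) -> List[str]:
    """Return the evaluators that only run when triggered from the Data Explorer."""
    return [
        e.name
        for e in evaluators
        if e.placement == "hidden" and e.trigger == "manual"
    ]
```

For a node with one inline rating, one hidden On Submit judge, and one hidden Manual check, `plan_on_submit` returns the first two and `plan_manual` returns only the third.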

Versioning

Criteria and Evaluations are versioned. A new criterion version is created when its display label, config, or output schema changes on save; name- or description-only edits do not bump the version. When a criterion's version bumps, any Evaluations that include it automatically bump their own version and update their pinned criterion reference to the new version. Pipelines that reference the criterion do not auto-update — they keep running against the version they were pinned to until the pipeline is edited.

Existing evaluation results stay attached to the version that produced them, so historical data is never silently rewritten.
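The versioning rules can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the dictionary shapes, the field keys (`display_label`, `config`, `output_schema`, `pinned`), and the function names are hypothetical, not the product's data model.

```python
# Fields whose change on save creates a new criterion version (per the rules above).
VERSIONED_FIELDS = frozenset({"display_label", "config", "output_schema"})


def save_criterion(criterion: dict, changes: dict) -> dict:
    """Apply changes; bump the version only if a versioned field actually changed."""
    bump = any(
        field in VERSIONED_FIELDS and criterion.get(field) != value
        for field, value in changes.items()
    )
    updated = {**criterion, **changes}
    if bump:
        updated["version"] = criterion["version"] + 1
    return updated


def refresh_evaluation(evaluation: dict, criterion: dict) -> dict:
    """Bump an Evaluation and re-pin it when an included criterion has a new version."""
    pinned = evaluation["pinned"]
    cid = criterion["id"]
    if cid not in pinned or pinned[cid] == criterion["version"]:
        return evaluation
    return {
        **evaluation,
        "version": evaluation["version"] + 1,
        "pinned": {**pinned, cid: criterion["version"]},
    }


criterion = {"id": "c1", "name": "Factual accuracy",
             "display_label": "Accuracy", "version": 1}
evaluation = {"name": "Response quality", "version": 1, "pinned": {"c1": 1}}
pipeline_pin = {"c1": 1}  # pipelines keep their pinned version until edited

renamed = save_criterion(criterion, {"name": "Accuracy (factual)"})
relabeled = save_criterion(criterion, {"display_label": "Factual accuracy"})
refreshed = refresh_evaluation(evaluation, relabeled)
```

In this sketch, `renamed` stays at version 1, `relabeled` moves to version 2, `refreshed` follows it to its own version 2, and `pipeline_pin` is deliberately left untouched.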
