Running Evaluations

Evaluations run on pipeline nodes. You attach a criterion to a node as an evaluator, then choose how and when it runs.

Three ways to create an evaluative field

An evaluative field is any field on a node that produces an evaluation result. The Pipeline Builder supports three ways to create one (all in the node's Form Builder or Evaluators panel):

1. Add a Criteria field (Form Builder)

Add a field of type Criteria in the Form Builder and pick a criterion from the library. This works for Human Rating, LLM Judge, and Programmatic criteria — and is the only way to surface a Human Rating criterion on a node.

Open the node's Form Builder, click + Add field, choose Criteria.
Pick a criterion from the library (or define one inline and optionally save it to the library with Save to Library).
Click the Targets button on the field row to open the picker and choose which field(s) this criterion evaluates.
For Programmatic Exact Match, the popover additionally requires a Reference (compared to) field.

The evaluator is inline: it shows up as a form field in the task form. Contributors fill in Human Rating criteria themselves; for LLM Judge and Programmatic criteria, they see the result and can re-run it.

2. Toggle a compatible field to evaluative (Form Builder)

Some native field types can become evaluative directly, without wrapping them in a Criteria field. There are three variants depending on the field type and input mode:

2a. Human input fields — assign targets via the Evaluates popover

Human input fields (numeric, rating, select, boolean) can be made evaluative by assigning evaluation targets:

In the Form Builder, click the Evaluates button on the field's toolbar.
Pick one or more target fields from the evaluation target picker — you can choose fields on the same node or on any upstream node (grouped and labeled by distance: "this node", "1 node upstream", etc.).
With 2+ targets selected, you can optionally turn on the Independent switch (see below).

Use this when a contributor directly scores or rates another field's output (e.g. a human rating field that scores an upstream LLM response).

2b. LLM response fields — enable evaluation in the prompt modal

For LLM response fields (numeric, rating, select, boolean), evaluation is configured inside the field's prompt modal:

Open the field's prompt modal.
In the Evaluation section, toggle evaluation on using the switch.
Pick one or more target fields from the evaluation target picker.
With 2+ targets selected, you can optionally turn on the Independent switch (see below).

Use this when an LLM produces the score (e.g. an LLM-generated numeric field that scores an upstream long-text field) and you don't want to create a separate criterion in the library.

2c. Pairwise / ranking — set the item source to From other fields

Pairwise and ranking fields don't use a separate evaluation toggle. Instead, their Item source setting decides whether they are evaluative:

Define inline — static items typed into the field config; the field is not evaluative.
From other fields — the items are references to other fields' values. The referenced fields are automatically the evaluation targets. No target picker is needed.

This works in both human input mode (a contributor picks the winner or ranks the items) and LLM response mode (an LLM picks the winner or ranks the items). In both modes, the field produces evaluation results — the difference is who does the evaluating.

See Auto-evaluative pairwise / ranking with field references below for details.

In 2a, 2b, and 2c, the evaluator is inline — it renders as part of the task form.

3. Add a Hidden Evaluator (Evaluators panel)

Open the node's Evaluators panel and click Add under Hidden Evaluators. Hidden evaluators run server-side and are not shown to contributors.

Only LLM Judge and Programmatic criteria can be hidden; an attempt to attach a Human Rating criterion is rejected.
Pick a criterion from the library (or define one inline; you can later click Save to Library to promote it).
Pick one or more Targets — same picker semantics as inline fields.
Programmatic Exact Match also requires a Reference field.

Use hidden evaluators for background quality monitoring that contributors should not see or be influenced by.
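
The constraints above can be summarized as a small validation routine. This is an illustrative sketch only: the `HiddenEvaluator` class, its field names, and the criterion type strings are assumptions made for the example, not the product's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical type identifiers; the real product may name these differently.
HIDDEN_ALLOWED_TYPES = {"llm_judge", "programmatic"}


@dataclass
class HiddenEvaluator:
    criterion_type: str                       # "human_rating", "llm_judge", or "programmatic"
    targets: List[str] = field(default_factory=list)
    programmatic_kind: Optional[str] = None   # e.g. "exact_match"
    reference: Optional[str] = None           # field the target is compared to


def validate_hidden_evaluator(ev: HiddenEvaluator) -> List[str]:
    """Return a list of validation errors (empty when the config is valid)."""
    errors = []
    if ev.criterion_type not in HIDDEN_ALLOWED_TYPES:
        errors.append("Only LLM Judge and Programmatic criteria can be hidden.")
    if not ev.targets:
        errors.append("Pick at least one target.")
    if ev.programmatic_kind == "exact_match" and not ev.reference:
        errors.append("Exact Match requires a Reference field.")
    return errors
```

For example, `validate_hidden_evaluator(HiddenEvaluator("human_rating", ["answer"]))` returns a single error, mirroring the rejection described above.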

The Independent toggle

When an evaluator has 2+ targets, the Evaluators panel (and the equivalent prompt modal / form-field popover) shows an Independent switch:

Off (grouped) — one evaluator record. The criterion sees all targets at once and produces a single combined result.
On (independent) — the evaluator is split into one evaluation per target. Each runs independently against its own target and produces its own result, so you get N independent results instead of one combined one.

Use Independent when you want per-target scores (e.g. evaluate the same "tone" criterion on three different LLM responses and compare them). Use grouped when the criterion needs to see all targets together to produce a single holistic score (e.g. "rate the overall coherence across these three sections").
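
To make the difference concrete, the sketch below shows how one evaluator config could expand into evaluation runs. The function name and the dictionary shape are hypothetical, chosen only for illustration; they do not describe how the product stores evaluators.

```python
from typing import Dict, List


def plan_evaluations(criterion: str, targets: List[str], independent: bool) -> List[Dict]:
    """Return the evaluation runs implied by the Independent switch."""
    if independent and len(targets) >= 2:
        # Independent: one run per target, each seeing only its own target.
        return [{"criterion": criterion, "targets": [t]} for t in targets]
    # Grouped: a single run that sees every target at once.
    return [{"criterion": criterion, "targets": list(targets)}]
```

With three targets, `independent=True` yields three runs of one target each, while `independent=False` yields one run containing all three.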

Auto-evaluative pairwise / ranking with field references

Pairwise and ranking are native field types with rich contributor UIs (side-by-side cards and drag-to-rank, respectively). Each supports two item sources, selected in the field's config under Item source:

Define inline — items are static text you type into the field config (items[0] / items[1] for pairwise; a list of items for ranking). The contributor compares/ranks those fixed strings. Not evaluative.
From other fields — items are live references to other fields in the pipeline. At runtime, each item's content is replaced with the referenced field's value. The field becomes automatically evaluative — the referenced fields are the evaluation targets, and the evaluator produces standard evaluation results regardless of whether a human or an LLM does the comparison.

Property	Pairwise	Ranking
Field references	Exactly 2	2 or more
Referenced fields can live on	The current node or any upstream node	Same
LLM output shape	The label of the winning item (derived from item labels configured on the field)	An ordered array of all item labels, ranked from best to worst
Human equivalent	Contributor clicks the winning card	Contributor drags items into order
Display label mode	Field titles (default), custom, or both	Same, with per-item custom labels
Minimum	Both references required	At least 2 references required

How targets are resolved

Targets for pairwise and ranking fields are determined automatically from the referenced fields — you don't need to pick targets separately. There is no "Targets" button for these fields; the Evaluates popover instead exposes the Execution Context selector (see below). The evaluator produces standard result columns in the Data Explorer like any other evaluator.
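
As a rough mental model, reference resolution and target derivation could look like the sketch below. The function, its arguments, and the returned structure are assumptions for illustration; only the reference-count rules and the fact that references double as targets come from this page.

```python
from typing import Dict, List


def resolve_items(field_type: str, references: List[str], values: Dict[str, str]) -> Dict:
    """Resolve referenced fields into items and derive the evaluation targets."""
    if field_type == "pairwise" and len(references) != 2:
        raise ValueError("pairwise needs exactly 2 field references")
    if field_type == "ranking" and len(references) < 2:
        raise ValueError("ranking needs at least 2 field references")
    # Each item's content is the referenced field's value at runtime.
    items = [{"ref": ref, "content": values[ref]} for ref in references]
    # The referenced fields are the evaluation targets; no separate picker.
    return {"items": items, "targets": list(references)}
```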

Display labels

When items come from field references, the labels shown in the contributor UI or sent to the model are controlled by the Label mode setting:

Field titles (default) — each item displays its referenced field's display title.
Custom — you supply per-item labels that replace the titles.
Both — your custom label is primary; the field title is shown as secondary context.

Labels affect display only. Analytics and exports always identify items by the underlying reference, so renaming a label doesn't break historical results.
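
The three label modes can be sketched as a single lookup. The mode strings, the `display_label` helper, and the "custom (title)" rendering for Both are assumptions for illustration; the actual UI may lay out the secondary title differently.

```python
from typing import Optional


def display_label(mode: str, field_title: str, custom: Optional[str] = None) -> str:
    """Resolve the label shown for one referenced item."""
    if mode == "custom" and custom:
        return custom
    if mode == "both" and custom:
        # Custom label is primary; the field title is secondary context.
        return f"{custom} ({field_title})"
    # "field_titles" mode, or (an assumption) no custom label was supplied.
    return field_title
```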

Analytics

Pairwise and ranking fields get bespoke aggregate summaries in the Evaluation Analytics tab:

Pairwise — winner (majority A vs. B, with ties), win percentages, and a distribution keyed by the resolved item labels. When items come from field references, analytics also attributes outcomes to the contributors of each referenced field, so you can see who wrote the response that won most often.
Ranking — average rank per item, top-ranked item, and a ranking item list. Reference-sourced items similarly get contributor attribution.
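
If you export results and want to reproduce similar aggregates yourself, a minimal sketch follows. The input shapes (a list of winning labels, and a list of best-to-worst orderings) and the tie rule are assumptions; the Evaluation Analytics tab may compute its summaries differently.

```python
from collections import Counter
from typing import Dict, List


def pairwise_summary(winners: List[str]) -> Dict:
    """Summarize pairwise results; `winners` holds the winning label per task."""
    counts = Counter(winners)
    total = len(winners)
    if total == 0:
        return {"winner": None, "win_pct": {}, "distribution": {}}
    ranked = counts.most_common()
    # Assumed tie rule: the two most frequent labels have equal counts.
    tie = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
    return {
        "winner": None if tie else ranked[0][0],
        "win_pct": {label: 100.0 * n / total for label, n in counts.items()},
        "distribution": dict(counts),
    }


def ranking_summary(rankings: List[List[str]]) -> Dict:
    """Summarize rankings; each inner list is ordered from best to worst."""
    positions: Dict[str, List[int]] = {}
    for order in rankings:
        for rank, label in enumerate(order, start=1):
            positions.setdefault(label, []).append(rank)
    avg = {label: sum(r) / len(r) for label, r in positions.items()}
    # The top-ranked item is the one with the lowest average rank.
    top = min(avg, key=avg.get) if avg else None
    return {"average_rank": avg, "top_ranked": top}
```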

Execution Context for these fields

When the referenced items are LLM-generated fields, the Execution Context selector in the Form Builder is especially useful: turn on tokens, cost, latency, model, tool_calls, or thinking to surface that metadata next to each compared or ranked item at runtime. On pairwise and ranking fields in human input mode (inputMode='input'), contributors see a line such as "this response was generated by gpt-4o at 1.2s, 1240 tokens" alongside each card or item.

Triggers

Only hidden evaluators expose a Trigger selector. Inline evaluative fields (Criteria fields and toggled-evaluative form fields) have no trigger control in the Form Builder — they always run as part of the node's lifecycle:

Human Rating criteria fields wait for the contributor to fill them in; the value is written when the node submits.
LLM Judge / Programmatic criteria fields auto-run inline as soon as their target values are populated, and contributors can manually re-run them from the field's Run button. Their results are written on node submission.
Native-UI evaluative fields (a numeric, rating, select, or boolean field toggled evaluative; a pairwise or ranking field with field references) similarly run inline on submit.

Hidden evaluators have two options:

On Submit

Runs automatically when the node is submitted.

Results are produced asynchronously in a background job; there's typically a short delay before a result appears in the Data Explorer. The Data Explorer auto-refreshes while evaluations are in progress.
On-Submit evaluators skip themselves when the target field values are unchanged since the last successful run.
If the node has no target values yet (all targets are null/empty), the evaluator is skipped until there's real content.
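
The two skip rules can be pictured as a simple guard. The sketch below is not the product's implementation: comparing a hash of the target values is one assumed way to detect "unchanged since the last successful run", and both function names are hypothetical.

```python
import hashlib
import json
from typing import Dict, Optional


def targets_fingerprint(values: Dict[str, Optional[str]]) -> str:
    """Stable hash of the target values, used to detect changes between runs."""
    payload = json.dumps(values, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def should_run_on_submit(values: Dict[str, Optional[str]],
                         last_success_fingerprint: Optional[str]) -> bool:
    """Decide whether an On Submit evaluator should run for this submission."""
    # Skip when every target is null or empty: there is nothing to evaluate yet.
    if all(v is None or v == "" for v in values.values()):
        return False
    # Skip when nothing changed since the last successful run.
    return targets_fingerprint(values) != last_success_fingerprint
```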

Manual

Does not run automatically. Results only appear when triggered from the Data Explorer.

Open the pipeline's Data Explorer.
Select the task rows you want to evaluate.
Click Evaluate in the toolbar.
The dialog lists every Manual-triggered evaluator on this pipeline. For each, it shows:
- Criterion name and type
- Which node it's attached to
- Number of selected tasks that will run
- Number of selected tasks that are already done (cached)
Toggle Force re-run all to recompute results for tasks that are already cached (useful after a criterion change).
Select which evaluators to run (or use the Select all / Deselect all link), then click Run Selected.
Jobs queue in the background. Results populate the Data Explorer as they complete.
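
The counts shown in the dialog, and the effect of Force re-run all, follow a simple rule that can be sketched as below. The `plan_manual_run` helper and its return shape are hypothetical and exist only to illustrate the logic.

```python
from typing import Dict, List, Set


def plan_manual_run(selected_task_ids: List[str],
                    cached_task_ids: Set[str],
                    force_rerun: bool = False) -> Dict:
    """Work out which selected tasks a manual evaluator would run."""
    cached = [t for t in selected_task_ids if t in cached_task_ids]
    pending = [t for t in selected_task_ids if t not in cached_task_ids]
    # Force re-run all recomputes cached tasks too; otherwise only pending ones run.
    to_run = list(selected_task_ids) if force_rerun else pending
    return {"will_run": len(to_run), "already_done": len(cached), "task_ids": to_run}
```

With three selected tasks of which one is cached, the default plan runs two tasks; with `force_rerun=True` it runs all three.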

Execution Context (LLM metadata)

When evaluation targets are LLM-generated fields, you can include execution metadata — tokens, cost, latency, model, etc. — alongside the target content. The available metadata keys are grouped as:

Performance — tokens, cost, latency, model, temperature, generation info
Tool Trace — tool calls
Internals — extended thinking

How you select metadata depends on how the evaluator is configured:

Auto-append mode (no placeholders) — use the Execution Context selector on the evaluator. The selected metadata is appended under an --- Execution Metadata --- header after each target in the --- CONTENT TO EVALUATE --- block.
Placeholder mapping mode — when mapping a target placeholder, a metadata picker appears on each target mapping row (labeled "Include with value"). Select which metadata keys to include; they are rendered inline alongside the target's value under an --- Execution Metadata --- header.
On a human-facing evaluative field (including pairwise/ranking) — the selected metadata is shown to the contributor as context next to the referenced field (e.g. "Response generated by gpt-4o at 1.2s, 1240 tokens"). Use this to give a human rater visibility into how each option was generated.

In all cases, metadata is also stored on the result so the Task Detail panel can display it next to the evaluator's output.

Only target fields produced by an LLM in the pipeline contribute metadata; it's a no-op for plain text/form-filled targets.
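
For auto-append mode, the resulting prompt section could be assembled roughly as follows. Only the two header strings come from this page; the function name, the target dictionary shape, and the "key: value" line layout are assumptions for illustration.

```python
from typing import Dict, List


def build_content_block(targets: List[Dict], selected_keys: List[str]) -> str:
    """Render targets, appending the selected metadata after each LLM-generated one."""
    lines = ["--- CONTENT TO EVALUATE ---"]
    for t in targets:
        lines.append(f"{t['name']}: {t['value']}")
        meta = t.get("metadata") or {}  # absent for plain text / form-filled targets
        chosen = {k: meta[k] for k in selected_keys if k in meta}
        if chosen:
            lines.append("--- Execution Metadata ---")
            lines.extend(f"{k}: {v}" for k, v in chosen.items())
    return "\n".join(lines)
```

A target without LLM metadata simply renders its value, which matches the no-op behavior described above.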

What happens after a run

Results are written to the pipeline's result store and appear as columns in the Data Explorer for that pipeline.
Inline evaluator results are also stored on the form submission itself (as the field's value, plus a _reasoning field when reasoning is enabled).
Aggregate statistics are recomputed on the Evaluation Analytics tab (see Viewing results).
Errors and skipped runs are preserved; you can retry manual evaluators by re-running them with Force re-run all.
Each LLM Judge evaluation incurs an LLM API call. Before running batch manual evaluations on many tasks, consider the token and cost implications (visible in the LLM Analytics tab after the run).
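
A back-of-the-envelope call count can help before a large batch. The helper below is a hypothetical sketch: it assumes one API call per evaluation, counts an Independent evaluator with N targets as N evaluations, and ignores cached or skipped runs.

```python
from typing import List, Tuple


def estimate_llm_judge_calls(num_tasks: int, evaluators: List[Tuple[int, bool]]) -> int:
    """Estimate LLM API calls; `evaluators` holds (num_targets, independent) pairs."""
    per_task = sum(n if independent and n >= 2 else 1 for n, independent in evaluators)
    return num_tasks * per_task
```

For 300 tasks with one Independent evaluator on three targets and one grouped evaluator, this gives 300 × (3 + 1) = 1200 calls.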

Common mistakes

Attaching Human Rating as a Hidden Evaluator — not supported. Use a Criteria field in the Form Builder.
Forgetting the Reference field on Exact Match — the evaluator fails validation on save.
Expecting Manual evaluators to run on submit — they won't. Trigger them from the Data Explorer.
Editing a criterion and expecting existing results to update — they won't. Use Force re-run all to recompute with the new criterion config.
Expecting composition-level binding — Evaluations (compositions) can't be attached to nodes today. Add the individual criteria.
Large batch LLM Judge runs — each LLM Judge evaluation is a separate LLM API call. Running across hundreds of tasks can accumulate significant cost. Check the LLM Analytics tab to monitor spend.
