Viewing Results

Evaluation results are stored per task, per evaluator, per criterion version. There are three main places to view them: the Data Explorer table (per-task values), the Evaluation Analytics tab (aggregate charts and breakdowns), and the LLM Analytics tab (LLM usage metrics).

Result shape

A single evaluation result contains, at minimum:

Score — numeric for numeric/rating, 0/1 for boolean, a category string for categorical.
Passed — boolean indicator (derived from score + threshold/categories when applicable).
Reasoning — short explanation string, present when the criterion was configured with Include Reasoning on.
Output type — numeric | rating | boolean | categorical.
Evaluator metadata — criterion name and version, which node and target(s) produced it, LLM tokens/cost/latency (for LLM Judge), error details (for failures).
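
As an illustration of the result shape listed above, the sketch below models one result as a Python dataclass. The class name, attribute names, and validation rules are assumptions made for this example and are not the platform's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional, Union

ALLOWED_OUTPUT_TYPES = {"numeric", "rating", "boolean", "categorical"}


@dataclass
class EvaluationResult:
    """Hypothetical in-memory mirror of a single evaluation result."""

    score: Union[float, int, str]    # number, 0/1, or a category string
    output_type: str                 # numeric | rating | boolean | categorical
    passed: Optional[bool] = None    # derived from score + threshold/categories
    reasoning: Optional[str] = None  # only set when Include Reasoning is on
    metadata: dict = field(default_factory=dict)  # criterion, node, tokens, errors

    def __post_init__(self):
        # Check that the score matches the declared output type.
        if self.output_type not in ALLOWED_OUTPUT_TYPES:
            raise ValueError(f"unknown output type: {self.output_type!r}")
        if self.output_type == "boolean" and self.score not in (0, 1):
            raise ValueError("boolean results must score 0 or 1")
        if self.output_type == "categorical" and not isinstance(self.score, str):
            raise ValueError("categorical results must carry a category string")


result = EvaluationResult(score=1, output_type="boolean", passed=True)
```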

Per-task view: Data Explorer table

Open the pipeline's Data Explorer. Each evaluator appears as its own column.

How a cell renders:

Numeric / rating — the score as a number.
Boolean — 1 ✓ (green) or 0 ✗ (red).
Categorical — the selected category label.
With reasoning enabled, hovering the cell shows the reasoning in a tooltip. For the full, formatted reasoning (and the per-claim breakdown for Programmatic PII Detection), click View on the row to open the Task Detail panel.
Missing value — renders as - (e.g. evaluator hasn't run yet, or was skipped).
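
The rendering rules above can be summarized as a small function. This is only a sketch of the described behavior: the function name and the use of None for a missing value are assumptions, and the green/red coloring is omitted.

```python
def render_cell(score, output_type):
    """Return the text a Data Explorer cell would show for one result."""
    if score is None:
        # Evaluator has not run yet, or was skipped.
        return "-"
    if output_type == "boolean":
        return "1 ✓" if score == 1 else "0 ✗"
    # Numeric and rating show the number; categorical shows the label.
    return str(score)


for value, kind in [(4.5, "numeric"), (1, "boolean"), (0, "boolean"),
                    ("toxic", "categorical"), (None, "rating")]:
    print(kind, "->", render_cell(value, kind))
```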

If a task has evaluators still processing, the Data Explorer auto-refreshes in the background and results appear as the jobs complete.

Aggregate view: Evaluation Analytics tab

The pipeline's Data Explorer has an Evaluation Analytics tab. It is automatically populated based on the pipeline's evaluative fields (numeric, rating, select, boolean, pairwise, ranking, criteria).

Top-level controls:

Latest Values vs All Attempts — switches the dataset being aggregated. Latest Values takes the most recent value per evaluator per task; All Attempts includes re-runs.
Scorecard strip — top-level summary tiles (totals, averages, pass rates) across the pipeline.
Comparison Insights — grouped side-by-side views. A group forms automatically whenever the pipeline layout expresses a comparison dimension between eval fields. The platform recognizes four patterns:
1. Multi-model comparison — the same logical target is generated by multiple LLM models side-by-side. The evaluator scores are compared across models.
2. Same target, multiple evaluators — distinct evaluators pointing at the same target field (e.g. a Human Rating and an LLM Judge both scoring the same response). The scores are compared across evaluators.
3. Eval variant comparison — siblings produced by turning on the Independent toggle. The scores are compared across the split targets.
4. Multi-model × multi-target — when a group spans both model variants and multiple targets, the two dimensions are merged into a single 2D comparison view.
Per-field analytics — for each evaluative field: mean, median, distribution, histograms, and (where applicable) breakdowns by evaluator and by evaluated target.
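
To make the Latest Values vs All Attempts switch concrete, the sketch below aggregates the same rows both ways. The row layout and the attempt counter are hypothetical; the platform's internal representation of re-runs is not documented here.

```python
from statistics import mean


def latest_values(rows):
    """Keep only the most recent row per (task, evaluator) pair."""
    latest = {}
    for row in rows:
        key = (row["task_id"], row["evaluator"])
        if key not in latest or row["attempt"] > latest[key]["attempt"]:
            latest[key] = row
    return list(latest.values())


rows = [
    {"task_id": "t1", "evaluator": "judge", "attempt": 1, "score": 2.0},
    {"task_id": "t1", "evaluator": "judge", "attempt": 2, "score": 4.0},  # re-run
    {"task_id": "t2", "evaluator": "judge", "attempt": 1, "score": 3.0},
]

all_attempts_mean = mean(r["score"] for r in rows)                # includes re-runs
latest_mean = mean(r["score"] for r in latest_values(rows))       # one row per task
print(all_attempts_mean, latest_mean)
```

With these rows, the re-run of t1 replaces its first attempt in the Latest Values aggregate, so the two means differ.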

Dataset-level analytics

Every dataset detail page has a dedicated Analytics tab (alongside Overview, Dataset Content, and Studio). That tab renders the same scorecards, comparison insights, and per-field analytics as the pipeline Data Explorer — plus inter-rater agreement and a contributors breakdown that are only available at the dataset level. The Dataset Overview tab shows only high-level dataset metadata (tasks / completed / nodes / fields / date range and the list of models, criteria, prompts, contributors) — not the analytics charts.

Inter-rater agreement

The Data Explorer Evaluation Analytics tab does not compute inter-rater agreement — it's not included in the pipeline-level analytics response.

Inter-rater agreement is computed only on the Dataset → Analytics tab. When a dataset has at least 20 shared items rated by multiple evaluators on the same field, the analytics panel renders an Evaluator Agreement card with Cohen's Kappa (2 raters) or Krippendorff's Alpha (3+ raters), interpreted on the standard scale (below chance / slight / fair / moderate / substantial / almost perfect). Agreement is only reported for field types where it's meaningful: boolean, select, pairwise, ranking, numeric, and rating. For ordered scales (numeric, rating), weighted Cohen's Kappa (quadratic weighting) is used, so that a larger gap between two scores counts as a larger disagreement.
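
For readers who want to reproduce the two-rater statistic offline, here is a minimal sketch of Cohen's Kappa with optional quadratic weighting. It follows the textbook definition and is not the platform's implementation; the function names are hypothetical, and the interpretation cutoffs are the commonly cited Landis and Koch bands, assumed here to match the labels on the card.

```python
from collections import Counter


def cohens_kappa(a, b, categories, quadratic=False):
    """Cohen's Kappa for two raters; `categories` must be in scale order."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must score the same non-empty item list")
    k = len(categories)
    if k < 2:
        raise ValueError("need at least two categories")
    idx = {c: i for i, c in enumerate(categories)}
    n = len(a)
    observed = Counter((idx[x], idx[y]) for x, y in zip(a, b))
    row = Counter(idx[x] for x in a)  # rater A marginals
    col = Counter(idx[y] for y in b)  # rater B marginals

    def weight(i, j):
        # Disagreement weight: 0 on the diagonal, growing with distance.
        if quadratic:
            return (i - j) ** 2 / (k - 1) ** 2
        return 0.0 if i == j else 1.0

    cells = [(i, j) for i in range(k) for j in range(k)]
    obs = sum(weight(i, j) * observed[(i, j)] / n for i, j in cells)
    exp = sum(weight(i, j) * row[i] * col[j] / n ** 2 for i, j in cells)
    if exp == 0:
        # Kappa is undefined when both raters used one identical category;
        # this sketch reports 1.0 because observed disagreement is also 0.
        return 1.0
    return 1.0 - obs / exp


def interpret(kappa):
    """Map a kappa value to a label (assumed Landis and Koch cutoffs)."""
    if kappa < 0:
        return "below chance"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


unweighted = cohens_kappa(["y", "y", "n", "n"], ["y", "n", "n", "n"], ["n", "y"])
weighted = cohens_kappa([1, 2, 3, 3], [1, 2, 3, 2], [1, 2, 3], quadratic=True)
print(unweighted, interpret(unweighted))
print(weighted, interpret(weighted))
```

The toy inputs are far below the 20 shared items the card requires; they only keep the arithmetic easy to follow by hand.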

LLM Analytics tab

When a pipeline contains any LLM-generated fields (response fields or LLM Judge evaluators), the Data Explorer also shows an LLM Analytics tab with:

Summary stats: total calls, total tokens (prompt / completion / total), total cost, average latency, success rate.
Usage over time charts.
Breakdowns by model and by field, so you can see where your LLM spend and latency are concentrated.
Thinking / tool-trace availability where applicable.

This tab is hidden on pipelines that don't use LLMs.
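
The summary stats and breakdowns listed above amount to simple aggregations over per-call records. The sketch below shows one way to compute them, assuming a hypothetical record layout (model, field, token counts, cost, latency, success flag); the platform's actual log format may differ.

```python
from collections import defaultdict


def summarize_llm_calls(calls):
    """Aggregate per-call records into the tab's summary stats."""
    total = len(calls)
    prompt = sum(c["prompt_tokens"] for c in calls)
    completion = sum(c["completion_tokens"] for c in calls)
    return {
        "total_calls": total,
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "total_tokens": prompt + completion,
        "total_cost": sum(c["cost"] for c in calls),
        "avg_latency_ms": sum(c["latency_ms"] for c in calls) / total if total else 0.0,
        "success_rate": sum(1 for c in calls if c["success"]) / total if total else 0.0,
    }


def cost_by(calls, key):
    """Break total cost down by a record key such as 'model' or 'field'."""
    totals = defaultdict(float)
    for c in calls:
        totals[c[key]] += c["cost"]
    return dict(totals)


calls = [
    {"model": "model-a", "field": "response", "prompt_tokens": 100,
     "completion_tokens": 50, "cost": 0.5, "latency_ms": 200, "success": True},
    {"model": "model-b", "field": "judge", "prompt_tokens": 300,
     "completion_tokens": 20, "cost": 0.25, "latency_ms": 400, "success": False},
]
summary = summarize_llm_calls(calls)
print(summary)
print(cost_by(calls, "model"))
```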

Task Detail

Click View on any Data Explorer row to open the Task Detail panel. This shows the full value of every field and evaluator for that task, including:

The full reasoning string for each evaluator (not just the tooltip preview).
Error messages for failed evaluators.
For Programmatic PII Detection, the per-claim verdict breakdown and the measured score vs. configured threshold.
The resolved prompt sent to LLM Judge evaluators (useful for verifying placeholder substitution worked).

Some information is only visible here — if a cell's tooltip seems to truncate reasoning, open the row.

Exports

Evaluation results are included in the pipeline's CSV and JSON exports:

Each evaluator contributes a column named after its target (and a variant label, when the evaluator was split into independent siblings).
When Include Reasoning is enabled, a parallel *_reasoning column is emitted alongside the result column.
File and tag fields are flattened as they are in the Data Explorer view.

Evaluation results are not queryable via the external API today — they ship out in exports but are not available as a filterable query endpoint.
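
Because results cannot be filtered through an API, filtering has to happen on an export. The sketch below reads a CSV export and lists low-scoring tasks with their reasoning. The column names (task_id, response_quality) and the empty cell for a missing value are assumptions for illustration; check a real export for the exact headers.

```python
import csv
import io

# Hypothetical export: one result column plus its parallel *_reasoning column.
SAMPLE_EXPORT = """task_id,response_quality,response_quality_reasoning
t1,4,Clear and accurate
t2,2,Misses the second question
t3,,
"""


def reasoning_columns(fieldnames):
    """Pair each result column with its *_reasoning column, when present."""
    names = set(fieldnames)
    return {
        name: f"{name}_reasoning"
        for name in fieldnames
        if not name.endswith("_reasoning") and f"{name}_reasoning" in names
    }


def low_scores(csv_text, column, threshold):
    """Return (task_id, score, reasoning) for rows scoring below threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    pairs = reasoning_columns(reader.fieldnames)
    hits = []
    for row in reader:
        raw = row[column]
        if raw == "":
            continue  # assumed encoding of a missing result
        score = float(raw)
        if score < threshold:
            hits.append((row["task_id"], score, row.get(pairs.get(column, ""), "")))
    return hits


print(low_scores(SAMPLE_EXPORT, "response_quality", 3))
```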

Verifying a run worked

Open the Data Explorer and look for the evaluator's column on the task's row.
A value (or 1 ✓ / 0 ✗) means the run completed.
A - while the table is still refreshing means the job is still running.
A - once nothing is pending means the evaluator was either skipped (empty targets, cached duplicate) or errored; open the Task Detail panel to see the error string.
To force a recompute, click Evaluate in the Data Explorer and turn on Force re-run all for the evaluator in question.
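
The checklist above reduces to a small decision rule. The sketch below encodes it, assuming you can observe the cell text and whether the table is still refreshing; the function name and status labels are made up for this example.

```python
def run_status(cell_text, table_refreshing):
    """Classify an evaluator cell following the verification checklist."""
    if cell_text.strip() != "-":
        return "completed"            # a value, 1 ✓, or 0 ✗
    if table_refreshing:
        return "running"              # job still in progress
    return "skipped_or_errored"       # open Task Detail for the error string


print(run_status("1 ✓", False))
print(run_status("-", True))
print(run_status("-", False))
```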
