
Criteria

Atomic, reusable evaluation metrics.

A criterion is a single, reusable metric — for example Factual accuracy, Response is valid JSON, or PII not leaked. Criteria are the atomic unit of evaluation; every evaluation result the platform produces is tied to exactly one criterion (at a specific version).

Criteria are managed in Evaluations → Criteria.

Criterion types

The type is set at creation and cannot be changed afterwards.

Human Rating

A human reviewer assigns a value in the task form. Human Rating criteria can only be surfaced inline (as a form field) — they cannot be hidden evaluators.

| Setting | Description | Required |
| --- | --- | --- |
| Name | Internal identifier used across the library. | Yes |
| Display Label | What the contributor sees as the field label. | Yes |
| Description | Plain-language explanation of what this criterion measures. | No |
| Output Type | How the value is collected (see below). | Yes |

Output types for Human Rating:

  • Numeric — a number within a configurable range; optional step. Enable Render as rating to show it as stars instead of a number input.
  • Rating — integer scale from min to max with a configurable label per value (e.g. 1 = "Poor", 5 = "Excellent"). Intended for short, well-defined scales.
  • Boolean — Yes / No.
  • Categorical — pick from a configured list. Supports Allow multiple and Allow other (free text).
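
As a rough illustration of how these output types constrain a submitted value, here is a minimal Python sketch. The class and field names (`NumericOutput`, `CategoricalOutput`, `accepts`) are hypothetical and not part of the platform, and measuring the step from the range minimum is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NumericOutput:
    """Hypothetical model of a Numeric output type."""
    minimum: float
    maximum: float
    step: Optional[float] = None  # the step is optional

    def accepts(self, value: float) -> bool:
        if not (self.minimum <= value <= self.maximum):
            return False
        if self.step is None:
            return True
        # Assumption: valid values sit on step boundaries counted from the minimum.
        steps = (value - self.minimum) / self.step
        return abs(steps - round(steps)) < 1e-9


@dataclass
class CategoricalOutput:
    """Hypothetical model of a Categorical output type."""
    options: list[str]
    allow_multiple: bool = False
    allow_other: bool = False

    def accepts(self, selected: list[str]) -> bool:
        if not selected:
            return False
        if len(selected) > 1 and not self.allow_multiple:
            return False
        # With Allow other enabled, free-text values outside the list are accepted.
        return self.allow_other or all(s in self.options for s in selected)
```

For example, `NumericOutput(1, 5, step=0.5).accepts(2.5)` is accepted, while a value outside the range or off the step grid is rejected.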

LLM Judge

An LLM scores the submission and returns a structured result. LLM Judge criteria can be used as either inline or hidden evaluators.

| Setting | Description | Required |
| --- | --- | --- |
| Name / Display Label / Description | Same as Human Rating. Name and Display Label required; Description optional. | See above |
| Model | Which LLM provider/model runs the judge. Choices come from your org's configured models. If not set on the criterion, the evaluator inherits the node's default model setting. | No (falls back to node default) |
| Temperature | 0–2 (clamped), step 0.01. Lower = more deterministic. | No (defaults to 0.7) |
| Prompt Template | The instructions given to the judge. Plain Markdown editor with no variable-insertion UI, but the backend substitutes any {{…}} placeholders you type by hand against the evaluator's targets at run time. See Prompt templates below. | Yes |
| Output Type | One of numeric, rating, boolean, categorical. The judge is instructed to return a value in this shape, and the response is validated against it (scores are clamped to the configured range). | Yes |
| Include Reasoning | When on, the judge is asked to return a short explanation alongside the score; the explanation is stored with the result and shown in the Data Explorer as a tooltip on the cell. | No (default off) |
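
The clamping and validation described in the settings above can be pictured with a short Python sketch. This is an illustration only: the function names and the dictionary keys (`type`, `min`, `max`, `categories`, `include_reasoning`, `score`, `reasoning`) are hypothetical, and how the platform treats non-integer rating scores is not specified here.

```python
def clamp(value, low, high):
    """Pull a value back into the closed range [low, high]."""
    return max(low, min(high, value))


def normalize_temperature(temperature=None):
    # Default of 0.7 and the 0-2 range come from the Temperature setting.
    if temperature is None:
        return 0.7
    return clamp(float(temperature), 0.0, 2.0)


def validate_judge_result(raw, output):
    """Check a parsed judge response (`raw`) against a hypothetical output config."""
    kind = output["type"]
    score = raw["score"]
    if kind in ("numeric", "rating"):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            raise ValueError("expected a number")
        score = clamp(score, output["min"], output["max"])
    elif kind == "boolean":
        if not isinstance(score, bool):
            raise ValueError("expected true or false")
    elif kind == "categorical":
        if score not in output["categories"]:
            raise ValueError("unknown category")
    else:
        raise ValueError(f"unsupported output type: {kind}")

    result = {"score": score}
    # Reasoning is kept only when Include Reasoning is on.
    if output.get("include_reasoning") and "reasoning" in raw:
        result["reasoning"] = raw["reasoning"]
    return result
```

In this sketch, a judge that returns 7 on a 1 to 5 numeric scale ends up with a stored score of 5 rather than a validation error.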

Programmatic

An automated, deterministic check that always returns a boolean (pass/fail) result. PII Detection additionally produces a numeric confidence score, which is compared against a threshold to decide pass or fail. Programmatic criteria can be used as either inline or hidden evaluators.

| Setting | Description | Required |
| --- | --- | --- |
| Name / Display Label / Description | Same as above. Name and Display Label required; Description optional. | See above |
| Subtype | One of the built-in checks below. | Yes |
| Include Reasoning | When on, failure details (e.g. the unmatched keyword, the schema error, the PII breakdown) are stored with the result and shown on hover in the Data Explorer. | No (default off) |

Subtypes:

  • Exact Match — compares the evaluated target against a reference field. When you attach this criterion to a node you must pick both a target and a reference. Config options:
    • case_sensitive (default true)
    • strip_whitespace (default false)
    • normalize_unicode (default false)
  • Regex Match — matches the target against a regex.
    • pattern (required)
    • flags — any of IGNORECASE, MULTILINE, DOTALL
    • invert — pass when the regex does not match
  • Contains Keywords — checks whether the target contains a list of keywords.
    • keywords (required list of strings)
    • mode: any (pass if any keyword appears), all (pass only if all appear), or none (pass only if none appear)
    • case_sensitive (default true)
  • JSON Validity — parses the target as JSON. If a JSON Schema is provided, the parsed value is additionally validated against the schema. Without a schema, only parse-ability is checked.
  • PII Detection — LLM-based PII leakage detection (not regex). Uses DeepEval's PII leakage metric under the hood.
    • model (optional; defaults to the platform PII detection model)
    • threshold: 0–1, default 0.5. The score follows DeepEval's PII leakage scale, where higher means less PII leaked (1.0 = no PII found, 0.0 = all PII leaked). The check passes when score ≥ threshold, so with the default of 0.5 a response passes when its score is 0.5 or higher. Raise the threshold for stricter privacy requirements.
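
To make the options above concrete, here is a minimal Python sketch of the non-LLM subtypes and the PII threshold comparison. It is not the platform's implementation. The function names are hypothetical, and several details are assumptions: Unicode normalization uses NFC, whitespace stripping trims only the ends, the regex is searched anywhere in the target rather than matched against the whole string, and `mode` defaults to `any`. Schema validation is left out because it needs a third-party JSON Schema library.

```python
import json
import re
import unicodedata


def exact_match(target, reference, case_sensitive=True,
                strip_whitespace=False, normalize_unicode=False):
    def prepare(text):
        if normalize_unicode:
            text = unicodedata.normalize("NFC", text)  # normal form is an assumption
        if strip_whitespace:
            text = text.strip()  # assumption: leading/trailing whitespace only
        if not case_sensitive:
            text = text.casefold()
        return text
    return prepare(target) == prepare(reference)


_FLAGS = {"IGNORECASE": re.IGNORECASE, "MULTILINE": re.MULTILINE, "DOTALL": re.DOTALL}


def regex_match(target, pattern, flags=(), invert=False):
    combined = 0
    for name in flags:
        combined |= _FLAGS[name]
    matched = re.search(pattern, target, combined) is not None
    # With invert on, the check passes when the regex does NOT match.
    return matched != invert


def contains_keywords(target, keywords, mode="any", case_sensitive=True):
    haystack = target if case_sensitive else target.casefold()
    hits = [(k if case_sensitive else k.casefold()) in haystack for k in keywords]
    if mode == "any":
        return any(hits)
    if mode == "all":
        return all(hits)
    if mode == "none":
        return not any(hits)
    raise ValueError(f"unknown mode: {mode}")


def json_valid(target):
    # Parse-ability only; no schema validation in this sketch.
    try:
        json.loads(target)
    except (json.JSONDecodeError, TypeError):
        return False
    return True


def pii_passes(score, threshold=0.5):
    # Higher score means less PII leaked; pass when score >= threshold.
    return score >= threshold
```

Note how the defaults matter: `contains_keywords("Refund issued", ["refund"])` fails because `case_sensitive` defaults to true, and passes once `case_sensitive=False` is set.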

Prompt templates (LLM Judge)

An LLM Judge criterion's Prompt Template is a Markdown text area. Because criteria are reusable across pipelines, you do not map specific field targets when you write the template. Instead, you write general judging instructions and, optionally, type {{…}} placeholders by hand. The placeholders are mapped to actual fields later, when the criterion is added to a pipeline, and are substituted at run time.
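
A minimal Python sketch of this kind of placeholder substitution follows. The placeholder names (`response`, `reference`), the `render_prompt` function, and the choice to leave unmapped placeholders untouched are all assumptions for illustration; the platform's actual placeholder syntax rules and error handling are not documented here.

```python
import re

# Matches {{ name }} with optional inner whitespace; names may contain dots.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render_prompt(template, targets):
    """Replace each {{name}} with the mapped target value (hypothetical helper)."""
    def replace(match):
        name = match.group(1)
        # Assumption: a placeholder with no mapped target is left as typed.
        return str(targets.get(name, match.group(0)))
    return PLACEHOLDER.sub(replace, template)


template = "Rate how well {{ response }} agrees with {{reference}}."
prompt = render_prompt(template, {"response": "Paris", "reference": "Paris, France"})
```

Because the template holds only placeholder names, the same criterion can be reused in pipelines whose fields are named differently; only the mapping changes.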

Creating a criterion

  1. Go to Evaluations → Criteria.
  2. Click New Criterion.
  3. Fill in Name, Display Label, and Type. Description is optional but recommended.
  4. Configure the type-specific settings in the Configuration panel.
  5. Click Save.

The criterion immediately becomes available in the library at version 1.

Editing and versioning

Open a criterion from the library and click Edit.

A new version is created when any of these change on save:

  • Display label
  • Config (prompt template, model, temperature, programmatic subtype settings, etc.)
  • Output schema (output type, range, categories, etc.)

These do not bump the version when edited on their own:

  • Name
  • Description
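
The versioning rule above can be summarized in a few lines of Python. This is a sketch under assumptions: the field names (`display_label`, `config`, `output_schema`, `name`, `description`) and the `next_version` function are hypothetical stand-ins for however the platform stores a criterion.

```python
# Fields whose change creates a new version on save.
VERSIONED_FIELDS = {"display_label", "config", "output_schema"}


def next_version(current_version, before, after):
    """Return the version number after saving `after` over `before`."""
    changed = {
        key for key in before.keys() | after.keys()
        if before.get(key) != after.get(key)
    }
    # Name and Description edits on their own leave the version unchanged.
    if changed & VERSIONED_FIELDS:
        return current_version + 1
    return current_version
```

Under this rule, renaming a criterion keeps it at its current version, while changing the temperature inside its config moves it to the next one.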

The current version number is shown on the criterion detail page. Past versions are listed in the Versions card, where you can inspect what each looked like at the time.

When a criterion's version is bumped, evaluations and pipelines are treated differently:

  • Evaluations are updated automatically. The platform bumps the version of every Evaluation that includes this criterion and repoints the evaluation's internal reference to the new criterion version.
  • Pipelines are not updated automatically. They continue running against the version they were pinned to when the pipeline was last edited; re-open and re-save the pipeline to pick up the newer version.

Existing evaluation results remain attached to the criterion version that produced them.
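
The difference between evaluations and pipelines after a version bump can be modeled with a small Python sketch. The classes and functions here (`Evaluation`, `Pipeline`, `propagate_bump`, `resave_pipeline`) are hypothetical and only mirror the behavior described above; they are not platform APIs.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    name: str
    version: int
    criterion_version: int


@dataclass
class Pipeline:
    name: str
    pinned_criterion_version: int


def propagate_bump(new_criterion_version, evaluations):
    """Evaluations follow the criterion automatically; pipelines are not passed in."""
    for evaluation in evaluations:
        evaluation.version += 1
        evaluation.criterion_version = new_criterion_version


def resave_pipeline(pipeline, latest_criterion_version):
    """Models re-opening and re-saving a pipeline to pick up the newer version."""
    pipeline.pinned_criterion_version = latest_criterion_version


evaluation = Evaluation("Quality checks", version=2, criterion_version=1)
pipeline = Pipeline("Support triage", pinned_criterion_version=1)

propagate_bump(2, [evaluation])
# The pipeline still points at criterion version 1 until it is re-saved.
```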

Deleting and archiving

  • Clicking Delete on a criterion that is not used anywhere removes it permanently.
  • Clicking Delete on a criterion that is in use (referenced by any evaluation or pipeline) archives it instead. A confirmation dialog shows exactly how many evaluations and pipelines reference it before you confirm.
  • Archived criteria do not appear in the default library list. Click the archive icon in the sidebar header to show or hide archived items.
  • Archiving does not change or invalidate existing results.
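
The delete-or-archive decision can be sketched in Python as follows. The dictionary shape, the `archived` key, and both function names are hypothetical; the sketch only mirrors the rules listed above.

```python
def delete_criterion(criterion, evaluation_refs, pipeline_refs):
    """Return "deleted" for unused criteria, otherwise archive and return "archived"."""
    if evaluation_refs == 0 and pipeline_refs == 0:
        return "deleted"  # not used anywhere: removed permanently
    criterion["archived"] = True  # in use: archived, results stay untouched
    return "archived"


def visible_criteria(criteria, show_archived=False):
    """Archived criteria are hidden from the default library list."""
    return [c for c in criteria if show_archived or not c.get("archived")]


unused = {"name": "Tone"}
in_use = {"name": "Factual accuracy"}
outcomes = (
    delete_criterion(unused, evaluation_refs=0, pipeline_refs=0),
    delete_criterion(in_use, evaluation_refs=2, pipeline_refs=1),
)
```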

Usage tracking

The criterion detail page shows where the criterion is used, broken down by version: which evaluations include it and which pipelines have it as an evaluator.