Criteria
Atomic, reusable evaluation metrics.
A criterion is a single, reusable metric — for example Factual accuracy, Response is valid JSON, or PII not leaked. Criteria are the atomic unit of evaluation; every evaluation result the platform produces is tied to exactly one criterion (at a specific version).
Criteria are managed in Evaluations → Criteria.
Criterion types
The type is set at creation and cannot be changed afterwards.
Human Rating
A human reviewer assigns a value in the task form. Human Rating criteria can only be surfaced inline (as a form field) — they cannot be hidden evaluators.
| Setting | Description | Required |
|---|---|---|
| Name | Internal identifier used across the library. | Yes |
| Display Label | What the contributor sees as the field label. | Yes |
| Description | Plain-language explanation of what this criterion measures. | No |
| Output Type | How the value is collected (see below). | Yes |
Output types for Human Rating:
- Numeric: a number within a configurable range; optional step. Enable Render as rating to show it as stars instead of a number input.
- Rating: integer scale from `min` to `max` with a configurable label per value (e.g. 1 = "Poor", 5 = "Excellent"). Intended for short, well-defined scales.
- Boolean: Yes / No.
- Categorical: pick from a configured list. Supports Allow multiple and Allow other (free text).
LLM Judge
An LLM scores the submission and returns a structured result. LLM Judge criteria can be used as either inline or hidden evaluators.
| Setting | Description | Required |
|---|---|---|
| Name / Display Label / Description | Same as Human Rating. Name and Display Label required; Description optional. | See above |
| Model | Which LLM provider/model runs the judge. Choices come from your org's configured models. If not set on the criterion, the evaluator inherits the node's default model setting. | No (falls back to node default) |
| Temperature | 0–2 (clamped), step 0.01. Lower = more deterministic. | No (defaults to 0.7) |
| Prompt Template | The instructions given to the judge. Plain Markdown editor — no variable-insertion UI, but the backend substitutes any {{…}} placeholders you type by hand against the evaluator's targets at run time. See Prompt templates below. | Yes |
| Output Type | One of numeric, rating, boolean, categorical. The judge is instructed to return a value in this shape, and the response is validated against it (scores are clamped to the configured range). | Yes |
| Include Reasoning | When on, the judge is asked to return a short explanation alongside the score; the explanation is stored with the result and shown in the Data Explorer as a tooltip on the cell. | No (default off) |
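The temperature clamp and the score validation described in the table can be sketched as follows. This is a minimal illustration, not the platform's implementation: the function names and the JSON response shape (`score` and `reasoning` keys) are assumptions made for the example.

```python
import json
from typing import Optional


def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def clamp_temperature(temperature: Optional[float]) -> float:
    """Apply the documented default (0.7) and the documented 0-2 clamp."""
    if temperature is None:
        return 0.7
    return clamp(temperature, 0.0, 2.0)


def validate_numeric_score(raw_response: str, minimum: float, maximum: float) -> dict:
    """Parse a hypothetical judge response and clamp its score to the configured range."""
    payload = json.loads(raw_response)  # raises ValueError on malformed JSON
    score = float(payload["score"])  # "score" is an assumed key name
    result = {"score": clamp(score, minimum, maximum)}
    if "reasoning" in payload:  # only present when Include Reasoning is on
        result["reasoning"] = str(payload["reasoning"])
    return result
```

For example, a judge that returns 12 on a 1 to 10 scale would be stored as 10 under this sketch, with any reasoning text carried along unchanged.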
Programmatic
An automated, deterministic check. Always returns a boolean (pass/fail) result. PII Detection additionally produces a numeric score, which is compared against a threshold to decide pass or fail. Programmatic criteria can be used as either inline or hidden evaluators.
| Setting | Description | Required |
|---|---|---|
| Name / Display Label / Description | Same as above. Name and Display Label required; Description optional. | See above |
| Subtype | One of the built-in checks below. | Yes |
| Include Reasoning | When on, failure details (e.g. the unmatched keyword, the schema error, the PII breakdown) are stored with the result and shown on hover in the Data Explorer. | No (default off) |
Subtypes:
- Exact Match: compares the evaluated target against a reference field. When you attach this criterion to a node you must pick both a target and a reference. Config options:
  - `case_sensitive` (default true)
  - `strip_whitespace` (default false)
  - `normalize_unicode` (default false)
- Regex Match: matches the target against a regex. Config options:
  - `pattern` (required)
  - `flags`: any of `IGNORECASE`, `MULTILINE`, `DOTALL`
  - `invert`: pass when the regex does not match
- Contains Keywords: checks whether the target contains a list of keywords. Config options:
  - `keywords` (required list of strings)
  - `mode`: `any` (pass if any keyword appears), `all` (pass only if all appear), `none` (pass only if none appear)
  - `case_sensitive` (default true)
- JSON Validity: parses the target as JSON. If a JSON Schema is provided, the parsed value is additionally validated against the schema. Without a schema, only parse-ability is checked.
- PII Detection: LLM-based PII leakage detection (not regex). Uses DeepEval's PII leakage metric under the hood. Config options:
  - `model` (optional; defaults to the platform PII detection model)
  - `threshold`: 0–1, default 0.5. On DeepEval's PII leakage score scale, higher means less PII leaked (1.0 = no PII found, 0.0 = all PII leaked). The check passes when score ≥ threshold, so with the default of 0.5 any response scoring 0.5 or higher passes. Raise the threshold for stricter privacy requirements.
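To make the option semantics concrete, here is a rough Python sketch of how the subtypes could behave. It is not the platform's code. The function names are hypothetical, and several details are assumptions: NFC as the Unicode normal form, `re.search` (rather than a full match) for Regex Match, and substring matching for keywords. Schema validation and the LLM-based PII scoring itself are omitted; only the threshold comparison is shown.

```python
import json
import re
import unicodedata

_FLAGS = {"IGNORECASE": re.IGNORECASE, "MULTILINE": re.MULTILINE, "DOTALL": re.DOTALL}


def exact_match(target, reference, case_sensitive=True,
                strip_whitespace=False, normalize_unicode=False):
    def prepare(text):
        if normalize_unicode:
            text = unicodedata.normalize("NFC", text)  # assumed normal form
        if strip_whitespace:
            text = text.strip()
        if not case_sensitive:
            text = text.casefold()
        return text
    return prepare(target) == prepare(reference)


def regex_match(target, pattern, flags=(), invert=False):
    combined = 0
    for name in flags:
        combined |= _FLAGS[name]
    matched = re.search(pattern, target, combined) is not None  # assumed: search, not fullmatch
    return not matched if invert else matched


def contains_keywords(target, keywords, mode="any", case_sensitive=True):
    haystack = target if case_sensitive else target.casefold()
    hits = [(kw if case_sensitive else kw.casefold()) in haystack for kw in keywords]
    if mode == "any":
        return any(hits)
    if mode == "all":
        return all(hits)
    if mode == "none":
        return not any(hits)
    raise ValueError(f"unknown mode: {mode!r}")


def json_validity(target):
    """Parse-only check, i.e. the behavior when no schema is configured."""
    try:
        json.loads(target)
    except (TypeError, ValueError):
        return False
    return True


def pii_check_passes(score, threshold=0.5):
    """Higher score means less PII leaked; pass when score >= threshold."""
    return score >= threshold
```

Under these assumptions, `contains_keywords("The quick fox", ["dog"], mode="none")` passes, and a PII score of 0.49 fails at the default threshold while 0.5 passes.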
Prompt templates (LLM Judge)
An LLM Judge criterion's Prompt Template is a Markdown text area. Because criteria are reusable across pipelines, you don't map specific field targets at template creation time. Instead, you write general judging instructions and optionally use {{…}} placeholder variables in the template — these are mapped to actual fields later when the criterion is added as a field in a pipeline.
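As a rough illustration of placeholder substitution, the sketch below fills `{{name}}` markers from a dictionary of resolved values. It is an assumption-laden example rather than the backend's actual logic: the placeholder names (`question`, `response`), the `render_template` function, and the choice to leave unknown placeholders untouched are all hypothetical.

```python
import re

# Matches {{ name }} with optional surrounding whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*([^{}]+?)\s*\}\}")


def render_template(template: str, values: dict) -> str:
    """Replace each known placeholder with its value; keep unknown ones as written."""
    def substitute(match):
        key = match.group(1)
        # Assumption: unresolved placeholders are left in place rather than erased.
        return str(values[key]) if key in values else match.group(0)
    return PLACEHOLDER.sub(substitute, template)


template = (
    "Rate the answer for factual accuracy.\n\n"
    "Question: {{question}}\n"
    "Answer: {{ response }}"
)
prompt = render_template(template, {"question": "What is 2+2?", "response": "4"})
```

Because the same template text is reused across pipelines, the dictionary of values stands in for whatever fields each pipeline maps to the placeholders.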
Creating a criterion
- Go to Evaluations → Criteria.
- Click New Criterion.
- Fill in Name, Display Label, and Type. Description is optional but recommended.
- Configure the type-specific settings in the Configuration panel.
- Click Save.
The criterion immediately becomes available in the library at version 1.
Editing and versioning
Open a criterion from the library and click Edit.
A new version is created when any of these change on save:
- Display label
- Config (prompt template, model, temperature, programmatic subtype settings, etc.)
- Output schema (output type, range, categories, etc.)
These do not bump the version when edited on their own:
- Name
- Description
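The bump rule above can be expressed as a small comparison over the fields that count toward versioning. This is a sketch under assumed names: the `Criterion` dataclass, its field names, and `save_edit` are hypothetical and only mirror the lists in this section.

```python
from dataclasses import dataclass, field, replace

# Fields whose changes create a new version, per the lists above.
VERSIONED_FIELDS = ("display_label", "config", "output_schema")


@dataclass(frozen=True)
class Criterion:
    name: str
    display_label: str
    description: str = ""
    config: dict = field(default_factory=dict)
    output_schema: dict = field(default_factory=dict)
    version: int = 1


def save_edit(current: Criterion, **changes) -> Criterion:
    """Apply changes and bump the version only if a versioned field differs."""
    edited = replace(current, **changes)
    bump = any(getattr(edited, f) != getattr(current, f) for f in VERSIONED_FIELDS)
    return replace(edited, version=current.version + 1) if bump else edited
```

With this sketch, editing only `name` or `description` keeps the version at 1, while changing `display_label` or any `config` value yields version 2.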
The current version number is shown on the criterion detail page. Past versions are listed in the Versions card, where you can inspect what each looked like at the time.
When a criterion version bumps:
- Every Evaluation that includes this criterion is automatically bumped to a new version, and its internal reference is repointed to the new criterion version.
- Pipelines that use the criterion are not bumped automatically. They continue running against the version they were pinned to when the pipeline was last edited; re-open and re-save the pipeline to pick up the newer version.
Existing evaluation results remain attached to the criterion version that produced them.
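The difference between evaluations and pipelines when a criterion version bumps can be sketched as below. The `Evaluation` and `Pipeline` classes, their fields, and `propagate_bump` are hypothetical stand-ins, not the platform's data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Evaluation:
    name: str
    version: int
    criterion_version: int


@dataclass
class Pipeline:
    name: str
    pinned_criterion_version: int


def propagate_bump(new_criterion_version: int,
                   evaluations: List[Evaluation],
                   pipelines: List[Pipeline]) -> None:
    """Bump and repoint evaluations; leave pipelines pinned where they are."""
    for evaluation in evaluations:
        evaluation.version += 1
        evaluation.criterion_version = new_criterion_version
    # Pipelines are intentionally untouched: they pick up the new version
    # only when someone re-opens and re-saves them.


evaluations = [Evaluation("quality", version=3, criterion_version=1)]
pipelines = [Pipeline("support-review", pinned_criterion_version=1)]
propagate_bump(2, evaluations, pipelines)
```

After the call in this sketch, the evaluation is at version 4 and references criterion version 2, while the pipeline still points at criterion version 1.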
Deleting and archiving
- Clicking Delete on a criterion that is not used anywhere removes it permanently.
- Clicking Delete on a criterion that is in use (referenced by any evaluation or pipeline) archives it instead. A confirmation dialog shows exactly how many evaluations and pipelines reference it before you confirm.
- Archived criteria do not appear in the default library list. Click the archive icon in the sidebar header to show or hide archived items.
- Archiving does not change or invalidate existing results.
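The delete-versus-archive decision reduces to a usage check, sketched here with a hypothetical `delete_action` function and assumed reference counts as inputs.

```python
def delete_action(evaluation_refs: int, pipeline_refs: int) -> str:
    """Return what clicking Delete does for the given reference counts."""
    in_use = evaluation_refs > 0 or pipeline_refs > 0
    # In-use criteria are archived so existing results stay valid;
    # unused criteria are removed permanently.
    return "archive" if in_use else "delete"
```

The two counts correspond to the numbers shown in the confirmation dialog before an in-use criterion is archived.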
Usage tracking
The criterion detail page shows where the criterion is used, broken down by version: which evaluations include it and which pipelines have it as an evaluator.