What is Pipelines?

Pipelines is an AI research and development platform for building datasets and running reproducible evaluations, with the end goal of improving the performance of your model or agent.

Think of each pipeline as an evaluation harness: a configurable test suite where you define the inputs, prompts, tools, scoring criteria, and human review steps, then run experiments across any combination of variables. Every result is stored, versioned, and comparable, so you can trace a change in performance back to the variables that changed.
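To make "any combination of variables" concrete, here is a minimal sketch of an experiment grid in plain Python. It does not use the Pipelines API. The variable names (model, prompt_version, temperature) and their values are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical experiment variables; these are not Pipelines configuration keys.
variables = {
    "model": ["model-a", "model-b"],
    "prompt_version": ["v1", "v2"],
    "temperature": [0.0, 0.7],
}


def experiment_grid(variables):
    """Return one dict per combination of variable values."""
    names = list(variables)
    return [dict(zip(names, combo)) for combo in product(*variables.values())]


runs = experiment_grid(variables)
print(len(runs))  # 8 combinations (2 x 2 x 2)
print(runs[0])
```

Each dict in runs describes one experiment. Changing a single value between two runs is what makes the results directly comparable.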

Depending on your workflow, a pipeline might produce annotated training data, synthetic datasets, fine-tuning examples, structured human feedback, evidence of model failures, or benchmark results that track quality over time. The platform supports everything from pure data annotation to full model evaluation — and any combination in between.

Who is Pipelines for?

AI research teams — building test suites for models, running structured evaluations across prompt and parameter variations, and iterating on quality with human and AI scoring.
AI application teams — setting up evaluation harnesses for agents and LLM features, curating ground-truth datasets, and maintaining a system of record for every experiment across the development lifecycle.
Data teams — curating, comparing, and versioning datasets from multiple sources to build benchmarks that track how model performance evolves over time.

What can you build?

Build evaluation harnesses

Design configurable pipelines that act as test suites for your models or agents. Each node represents a step — data preparation, generation, human review, scoring — and you control the variables at every stage. Add review steps for quality control and logic gates to route work automatically based on results.
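As a rough illustration of what a logic gate does, the sketch below routes a task according to its scores. It is plain Python, not the Pipelines implementation. The function name route, the step names accept and human_review, and the 0.8 threshold are all assumptions made for the example.

```python
def route(scores, threshold=0.8):
    """Return the name of the next step for a task, given its criterion scores.

    `scores` maps a criterion name to a value in [0, 1]. A task with no
    scores, or with any score below the threshold, goes to human review.
    """
    if not scores:
        return "human_review"
    if min(scores.values()) >= threshold:
        return "accept"
    return "human_review"


print(route({"accuracy": 0.9, "format": 0.85}))
print(route({"accuracy": 0.9, "format": 0.5}))
```

Gating on the lowest score is a conservative choice: a single failed check is enough to send the task to a reviewer.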

Define and reuse test criteria

Define evaluation criteria — human ratings, LLM-as-judge scoring, or programmatic checks (regex, exact match, PII detection, and more) — and reuse them as the scoring rubric across experiments. Swap models, prompts, or tools and re-run the same criteria to compare results. Every score is versioned and tied to the criteria that produced it, so experiments are reproducible.
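The programmatic checks mentioned above can be pictured as small functions that share one signature. This lets the same set be re-applied to any model's output. The sketch below is plain Python, not the platform's criteria format. The names exact_match, no_email_leak, and CRITERIA are hypothetical, and the email pattern is a deliberately crude stand-in for real PII detection.

```python
import re
from typing import Callable, Dict

# A criterion maps (output, reference) to a score in [0, 1].
Criterion = Callable[[str, str], float]

# Crude email matcher used as a stand-in for a PII check (an assumption).
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def exact_match(output: str, reference: str) -> float:
    """Score 1.0 when output equals the reference, ignoring outer whitespace."""
    return 1.0 if output.strip() == reference.strip() else 0.0


def no_email_leak(output: str, reference: str) -> float:
    """Score 0.0 when the output contains something that looks like an email."""
    return 0.0 if EMAIL_PATTERN.search(output) else 1.0


# Hypothetical reusable rubric: the same dict is applied to every experiment.
CRITERIA: Dict[str, Criterion] = {
    "exact_match": exact_match,
    "no_email_leak": no_email_leak,
}


def score(output: str, reference: str, criteria: Dict[str, Criterion] = CRITERIA):
    """Apply every criterion and return a name -> score mapping."""
    return {name: fn(output, reference) for name, fn in criteria.items()}


print(score("Paris", "Paris "))
print(score("contact me at a@b.com", "Paris"))
```

Because the rubric is data, swapping the model or prompt leaves CRITERIA untouched, which is the property that keeps scores comparable across runs.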

Combine human and AI work

Any step in a pipeline can blend human and AI contributions. LLM-generated fields pre-fill responses for contributors to review and refine. Human reviewers act as part of the evaluation loop, providing ground-truth scores that calibrate automated criteria. Connect external tools via MCP servers or HTTP APIs for tool-calling during generation, and manage prompts in a versioned library so your team stays aligned.

Track what works

Every experiment produces a versioned dataset, so you can compare runs, identify regressions, and track improvement over time. Import additional data from CSV/JSON files, Hugging Face Hub, or exports from other AI/ML platforms. Use the Analytics Studio to chart, filter, and aggregate across datasets.
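Comparing two runs can be sketched as a per-criterion difference of mean scores. The example is plain Python and does not reflect how the Analytics Studio computes anything. The function name compare_runs, the 0.01 tolerance, and the sample scores are assumptions for illustration.

```python
from statistics import mean


def compare_runs(baseline, candidate, tolerance=0.01):
    """Compare mean scores per criterion between two runs.

    Both arguments map a criterion name to a list of per-task scores.
    Only criteria present in both runs are compared.
    """
    report = {}
    for name in sorted(baseline.keys() & candidate.keys()):
        delta = mean(candidate[name]) - mean(baseline[name])
        if delta < -tolerance:
            status = "regression"
        elif delta > tolerance:
            status = "improvement"
        else:
            status = "unchanged"
        report[name] = (round(delta, 4), status)
    return report


# Made-up scores for two hypothetical runs.
baseline = {"exact_match": [1, 1, 0, 1], "helpfulness": [0.5, 0.5]}
candidate = {"exact_match": [1, 0, 0, 1], "helpfulness": [0.75, 0.75]}
print(compare_runs(baseline, candidate))
```

The tolerance keeps small fluctuations from being reported as regressions. A real comparison would likely also account for sample size.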

Platform architecture

Pipelines is organized into a hierarchy:

Organization
  └── Project
        └── Pipeline
              ├── Nodes (Subtask, Review, Logic Gate)
              └── Tasks (units of work flowing through the pipeline)

Each organization contains projects, which contain pipelines. Pipelines are visual graphs of nodes that define the steps in your data workflow. Tasks are individual units of work that move through the pipeline from start to finish.
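The hierarchy can be modeled with a few nested data classes. This is a simplified sketch, not the platform's data model. The class and field names are assumptions, and the pipeline is flattened to an ordered list of nodes even though a real pipeline is a graph.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    kind: str  # "subtask", "review", or "logic_gate"


@dataclass
class Task:
    task_id: str
    position: int = 0  # index of the node the task currently sits at


@dataclass
class Pipeline:
    name: str
    nodes: List[Node] = field(default_factory=list)
    tasks: List[Task] = field(default_factory=list)

    def advance(self, task: Task) -> bool:
        """Move a task to the next node; return False if it is at the last node."""
        if task.position < len(self.nodes) - 1:
            task.position += 1
            return True
        return False


@dataclass
class Project:
    name: str
    pipelines: List[Pipeline] = field(default_factory=list)


@dataclass
class Organization:
    name: str
    projects: List[Project] = field(default_factory=list)


pipeline = Pipeline(
    name="summarization-eval",
    nodes=[Node("generate", "subtask"), Node("check", "review")],
)
task = Task("task-1")
pipeline.tasks.append(task)
org = Organization("acme", [Project("research", [pipeline])])

print(pipeline.advance(task), pipeline.nodes[task.position].name)
print(pipeline.advance(task))
```

Tasks carry their own position, so many tasks can sit at different nodes of the same pipeline at once.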

Key capabilities

Capability	Description
Pipeline Builder	Visual drag-and-drop editor for designing multi-step pipelines
Task Management	Claim-based work queues with automatic assignment and routing
Evaluations	Human, LLM, and programmatic quality scoring
Datasets	Import, export, and analyze data from any source
LLM Integration	AI-generated fields with tool-calling, prompt versioning, and bring-your-own-key (BYOK) models
Analytics	Real-time dashboards, interactive charting, and time tracking
RBAC	Role-based access control with Org Admin, Project Admin, and Contributor roles
API	Full REST API for programmatic access to platform features

Next steps

Core Concepts

Quickstart Guide
