What is Pipelines?
An overview of the Pipelines platform and what you can build with it.
Pipelines is an AI research and development platform for building datasets and running reproducible evaluations — with the end goal of optimizing your model or agent performance.
Think of each pipeline as an evaluation harness: a configurable test suite where you define the inputs, prompts, tools, scoring criteria, and human review steps, then run experiments across any combination of those variables. Every result is stored and versioned, so you can compare runs side by side and see exactly what changed between them and how performance moved.
Depending on your workflow, a pipeline might produce annotated training data, synthetic datasets, fine-tuning examples, structured human feedback, evidence of model failures, or benchmark results that track quality over time. The platform supports everything from pure data annotation to full model evaluation — and any combination in between.
Who is Pipelines for?
- AI research teams — building test suites for models, running structured evaluations across prompt and parameter variations, and iterating on quality with human and AI scoring.
- AI application teams — setting up evaluation harnesses for agents and LLM features, curating ground-truth datasets, and maintaining a system of record for every experiment across the development lifecycle.
- Data teams — curating, comparing, and versioning datasets from multiple sources to build benchmarks that track how model performance evolves over time.
What can you build?
Build evaluation harnesses
Design configurable pipelines that act as test suites for your models or agents. Each node represents a step — data preparation, generation, human review, scoring — and you control the variables at every stage. Add review steps for quality control and logic gates to route work automatically based on results.
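As a rough illustration of how a logic gate might route work based on results, here is a minimal Python sketch. It is not the platform's API: the `Task` class, the `logic_gate` function, the `score` field, and the 0.8 threshold are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Task:
    """Hypothetical unit of work; not the platform's actual schema."""
    task_id: str
    data: Dict[str, float] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)


def logic_gate(task: Task, threshold: float = 0.8) -> str:
    """Send low-scoring tasks to human review; let the rest finish."""
    return "review" if task.data.get("score", 0.0) < threshold else "done"


def run(task: Task) -> List[str]:
    """Walk one task through a three-step pipeline and record each step."""
    task.history.append("subtask")      # generation or annotation step
    next_step = logic_gate(task)        # routing decision
    task.history.append(next_step)
    if next_step == "review":
        task.history.append("done")     # review completes, then the task finishes
    return task.history
```

In this sketch a task scoring 0.9 skips review, while a task scoring 0.5 passes through the review step before finishing.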
Define and reuse test criteria
Define evaluation criteria — human ratings, LLM-as-judge scoring, or programmatic checks (regex, exact match, PII detection, and more) — and reuse them as the scoring rubric across experiments. Swap models, prompts, or tools and re-run the same criteria to compare results. Every score is versioned and tied to the criteria that produced it, so experiments are reproducible.
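To make the idea of reusable programmatic checks concrete, the sketch below defines two criteria once and applies the same set to any output. It uses only the Python standard library; the `Criterion` type, the `CRITERIA` registry, and the criterion names are assumptions made for illustration, not the platform's interface.

```python
import re
from typing import Callable, Dict

# Hypothetical criterion signature: (model output, reference) -> score in [0, 1].
Criterion = Callable[[str, str], float]


def exact_match(output: str, reference: str) -> float:
    """Score 1.0 when the output equals the reference, ignoring outer whitespace."""
    return 1.0 if output.strip() == reference.strip() else 0.0


def regex_check(pattern: str) -> Criterion:
    """Build a criterion that scores 1.0 when the pattern occurs in the output."""
    compiled = re.compile(pattern)

    def check(output: str, reference: str) -> float:
        return 1.0 if compiled.search(output) else 0.0

    return check


# The same rubric can be re-run after swapping models, prompts, or tools.
CRITERIA: Dict[str, Criterion] = {
    "exact_match": exact_match,
    "has_iso_date": regex_check(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def score(output: str, reference: str) -> Dict[str, float]:
    """Apply every registered criterion and return one score per criterion."""
    return {name: fn(output, reference) for name, fn in CRITERIA.items()}
```

Keeping the criteria in one registry means two experiments are scored by identical logic, which is what makes their results comparable.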
Combine human and AI work
Any step in a pipeline can blend human and AI contributions. LLM-generated fields pre-fill responses for contributors to review and refine. Human reviewers act as part of the evaluation loop, providing ground-truth scores that calibrate automated criteria. Connect external tools via MCP servers or HTTP APIs for tool-calling during generation, and manage prompts in a versioned library so your team stays aligned.
Track what works
Every experiment produces a versioned dataset, so you can compare runs, identify regressions, and track improvement over time. Import additional data from CSV/JSON files, Hugging Face Hub, or exports from other AI/ML platforms. Use the Analytics Studio to chart, filter, and aggregate across datasets.
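Comparing two runs to spot regressions can be sketched as a simple diff over per-criterion scores. The function name `find_regressions`, the score dictionaries, and the tolerance value are hypothetical; the platform's own comparison tooling may work differently.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.01,
) -> List[str]:
    """Return criteria whose candidate score fell below baseline by more than tolerance."""
    return sorted(
        name
        for name, base in baseline.items()
        if name in candidate and candidate[name] < base - tolerance
    )
```

For example, with a baseline of `{"accuracy": 0.90, "safety": 0.95}` and a candidate of `{"accuracy": 0.85, "safety": 0.95}`, only `accuracy` is flagged.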
Platform architecture
Pipelines is organized into a hierarchy:
Organization
└── Project
    └── Pipeline
        ├── Nodes (Subtask, Review, Logic Gate)
        └── Tasks (units of work flowing through the pipeline)

Each organization contains projects, which contain pipelines. Pipelines are visual graphs of nodes that define the steps in your data workflow. Tasks are individual units of work that move through the pipeline from start to finish.
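One way to picture this hierarchy is as nested data classes. The sketch below mirrors the containment described above; the class and field names are assumptions for illustration and do not reflect the platform's actual schema or API objects.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class NodeType(Enum):
    """The three node types named in the hierarchy."""
    SUBTASK = "Subtask"
    REVIEW = "Review"
    LOGIC_GATE = "Logic Gate"


@dataclass
class Node:
    name: str
    type: NodeType


@dataclass
class Task:
    task_id: str


@dataclass
class Pipeline:
    name: str
    nodes: List[Node] = field(default_factory=list)   # steps in the graph
    tasks: List[Task] = field(default_factory=list)   # work flowing through it


@dataclass
class Project:
    name: str
    pipelines: List[Pipeline] = field(default_factory=list)


@dataclass
class Organization:
    name: str
    projects: List[Project] = field(default_factory=list)


# Hypothetical example: one organization, one project, one two-node pipeline.
org = Organization(
    name="Acme",
    projects=[
        Project(
            name="Support bot",
            pipelines=[
                Pipeline(
                    name="Answer quality",
                    nodes=[
                        Node("Draft answer", NodeType.SUBTASK),
                        Node("Human check", NodeType.REVIEW),
                    ],
                    tasks=[Task("task-1")],
                )
            ],
        )
    ],
)
```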
Key capabilities
| Capability | Description |
|---|---|
| Pipeline Builder | Visual drag-and-drop editor for designing multi-step pipelines |
| Task Management | Claim-based work queues with automatic assignment and routing |
| Evaluations | Human, LLM, and programmatic quality scoring |
| Datasets | Import, export, and analyze data from any source |
| LLM Integration | AI-generated fields with tool-calling, prompt versioning, and bring-your-own-key (BYOK) models |
| Analytics | Real-time dashboards, interactive charting, and time tracking |
| RBAC | Role-based access control with Org Admin, Project Admin, and Contributor roles |
| API | Full REST API for programmatic access to platform features |