HuggingFace Hub

Import datasets directly from the HuggingFace Hub into your Data Vault.

Both the thousands of public datasets on the Hub and your own private datasets are supported.

Importing a dataset

  1. Navigate to Data Vault and click Import Dataset.
  2. Select the HuggingFace tab.
  3. Search for a dataset or enter the dataset identifier directly:
    • Public datasets: squad, imdb, gsm8k
    • User/org datasets: username/my-dataset
  4. Select the config (if the dataset has multiple configurations).
  5. Select the split to import (train, test, validation, or a custom split).
  6. Set the row limit — the maximum number of rows to import (1–100,000, default: 10,000).
  7. Choose a sample mode:
    • First N — imports the first N rows in order.
    • Random — randomly samples N rows from the dataset.
  8. Review the preview table showing a sample of rows. You can override column types using the dropdown in each column header.
  9. Click Import.
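
The choices made in steps 3 through 7 can be pictured as a small settings record plus a row-selection rule. The sketch below is illustrative only: `ImportSettings`, `select_rows`, and the mode names `"first_n"` and `"random"` are hypothetical and are not part of any Pipelines API. It also assumes that Random means sampling without replacement, which the steps above do not spell out.

```python
import random
from dataclasses import dataclass
from itertools import islice
from typing import Iterable, List, Optional

MAX_ROWS = 100_000      # upper bound stated in step 6
DEFAULT_ROWS = 10_000   # default stated in step 6


@dataclass(frozen=True)
class ImportSettings:
    """Hypothetical container for the choices made in the import dialog."""
    dataset_id: str                 # e.g. "squad" or "username/my-dataset"
    split: str                      # e.g. "train", "test", "validation"
    config: Optional[str] = None    # only needed for multi-config datasets
    row_limit: int = DEFAULT_ROWS
    sample_mode: str = "first_n"    # "first_n" or "random" (assumed names)

    def __post_init__(self) -> None:
        if not 1 <= self.row_limit <= MAX_ROWS:
            raise ValueError(f"row_limit must be between 1 and {MAX_ROWS}")
        if self.sample_mode not in ("first_n", "random"):
            raise ValueError("sample_mode must be 'first_n' or 'random'")


def select_rows(rows: Iterable[dict], settings: ImportSettings,
                seed: Optional[int] = None) -> List[dict]:
    """Pick at most row_limit rows according to the sample mode."""
    if settings.sample_mode == "first_n":
        # First N: keep source order and stop after row_limit rows.
        return list(islice(rows, settings.row_limit))
    # Random: sample without replacement (an assumption); this needs
    # the rows materialized in memory first.
    pool = list(rows)
    rng = random.Random(seed)
    return rng.sample(pool, min(settings.row_limit, len(pool)))
```

With First N, a row limit of 5 on a 20-row split always yields the first five rows in source order. With Random, it yields five distinct rows whose order and identity depend on the seed.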

The import runs as a background job. You can cancel the import while it is in progress. A notification appears when the import completes or fails.

Authentication

Public datasets

No authentication is required for public datasets.

Private and gated datasets

For private or gated datasets, you need a HuggingFace access token:

  1. Generate a token at huggingface.co/settings/tokens.
  2. In Pipelines, go to Settings > Service Credentials and add your HuggingFace token.

The token is used for all HuggingFace imports across your organization.
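
Before saving a token in Pipelines, it can be useful to confirm that the token can actually see the dataset. The sketch below builds a metadata request against the HuggingFace Hub's public HTTP API, which accepts access tokens as a Bearer header. It shows how the Hub itself handles tokens, not how Pipelines calls the Hub internally. The function name `build_dataset_info_request` is hypothetical, and `hf_example` is a placeholder, not a real token.

```python
import urllib.request
from typing import Optional
from urllib.parse import quote

HUB_API = "https://huggingface.co/api/datasets"


def build_dataset_info_request(dataset_id: str,
                               token: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a metadata request for a Hub dataset."""
    # Keep the "/" in "username/my-dataset"; escape anything else unusual.
    url = f"{HUB_API}/{quote(dataset_id, safe='/')}"
    headers = {}
    if token:
        # Private and gated datasets need the access token as a Bearer header.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)


# Placeholder token; replace with one generated in your HuggingFace settings.
req = build_dataset_info_request("username/my-dataset", token="hf_example")
```

Passing `req` to `urllib.request.urlopen` sends the request. An HTTP error response at that point usually suggests the token is missing, invalid, or lacks access to the dataset, in which case an import of the same dataset from Pipelines is likely to fail as well.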

Data type mapping

HuggingFace data types are automatically mapped to Pipelines column types:

HuggingFace Type                      | Pipelines Type
string, large_string                  | String
int8 to int64, float16 to float64     | Number
bool                                  | Boolean
timestamp, date32, date64             | Date
Image, Audio, Video features          | File (media)
ClassLabel                            | String
Sequence (of media types)             | File (media)
Sequence (non-media), dict            | JSON
binary                                | File
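
The mapping above can be read as a simple lookup. The sketch below is one way to express it and is not the Pipelines implementation: the function `pipelines_type` and its string-based inputs are hypothetical. It assumes the integer and float ranges cover the 8, 16, 32, and 64-bit integer widths and the 16, 32, and 64-bit float widths, and it raises an error for unlisted types because the documentation does not say how those are handled.

```python
import re
from typing import Optional

# Scalar dtype names -> Pipelines column type, following the mapping above.
_SCALAR_TYPES = {
    "string": "String",
    "large_string": "String",
    "bool": "Boolean",
    "date32": "Date",
    "date64": "Date",
    "binary": "File",
}
# Assumed reading of the integer and float ranges in the mapping.
_NUMBER = re.compile(r"^(int(8|16|32|64)|float(16|32|64))$")
_MEDIA_FEATURES = {"Image", "Audio", "Video"}


def pipelines_type(kind: str, inner: Optional[str] = None) -> str:
    """Map a HuggingFace type name to a Pipelines column type.

    kind  -- a dtype such as "int32" or a feature name such as "Image"
    inner -- for "Sequence", the element type (e.g. "Image" or "int64")
    """
    if kind in _MEDIA_FEATURES:
        return "File (media)"
    if kind == "ClassLabel":
        return "String"
    if kind == "Sequence":
        # Sequences of media become media files; all others become JSON.
        return "File (media)" if inner in _MEDIA_FEATURES else "JSON"
    if kind == "dict":
        return "JSON"
    if kind.startswith("timestamp"):    # e.g. "timestamp[s]"
        return "Date"
    if _NUMBER.match(kind):
        return "Number"
    if kind in _SCALAR_TYPES:
        return _SCALAR_TYPES[kind]
    # Behavior for unlisted types is not documented; fail loudly here.
    raise ValueError(f"no mapping listed for {kind!r}")
```

For example, `pipelines_type("Sequence", "Image")` falls in the File (media) row, while `pipelines_type("Sequence", "int64")` falls in the JSON row.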

You can override any of these mappings in the preview step using the dropdown in each column header. Available types: String, Number, Boolean, Date, URL, JSON, File (media).

Media columns (images, audio, video) are fully imported and viewable directly in Pipelines.

Limitations

  • Each import is limited to 100,000 rows.
  • Imports with media columns can take significantly longer because the media files must be downloaded.