HuggingFace Hub
Import datasets directly from the HuggingFace Hub into your Data Vault.
Pipelines can import datasets directly from the HuggingFace Hub, giving you access to thousands of public datasets and your private datasets.
Importing a dataset
- Navigate to Data Vault and click Import Dataset.
- Select the HuggingFace tab.
- Search for a dataset or enter the dataset identifier directly:
  - Public datasets: squad, imdb, gsm8k
  - User/org datasets: username/my-dataset
- Select the config (if the dataset has multiple configurations).
- Select the split to import (train, test, validation, or a custom split).
- Set the row limit, which is the maximum number of rows to import (1–100,000; default: 10,000).
- Choose a sample mode:
  - First N: imports the first N rows in order.
  - Random: randomly samples N rows from the dataset.
- Review the preview table showing a sample of rows. You can override column types using the dropdown in each column header.
- Click Import.
The import runs as a background job. You can cancel the import while it is in progress. A notification appears when the import completes or fails.
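The row limit and the two sample modes can be illustrated with a minimal Python sketch. It is not how Pipelines implements the import; the function `select_rows`, its parameters, and the optional `seed` are hypothetical names chosen for illustration. Only the bounds (1–100,000) and the default (10,000) come from the steps above.

```python
import random

# Bounds and default taken from the import steps; everything else is illustrative.
MIN_ROWS, MAX_ROWS, DEFAULT_ROWS = 1, 100_000, 10_000


def select_rows(rows, limit=DEFAULT_ROWS, mode="first", seed=None):
    """Return at most `limit` rows using the chosen sample mode (hypothetical helper)."""
    if not MIN_ROWS <= limit <= MAX_ROWS:
        raise ValueError(f"row limit must be between {MIN_ROWS} and {MAX_ROWS}")
    # A dataset smaller than the limit is returned in full.
    n = min(limit, len(rows))
    if mode == "first":
        # First N: keep the first n rows in their original order.
        return rows[:n]
    if mode == "random":
        # Random: draw n distinct rows; `seed` only makes the sketch repeatable.
        return random.Random(seed).sample(rows, n)
    raise ValueError(f"unknown sample mode: {mode!r}")


if __name__ == "__main__":
    rows = [{"id": i} for i in range(50)]
    print(select_rows(rows, limit=3))
    print(select_rows(rows, limit=3, mode="random", seed=0))
```

Under these assumptions, First N is deterministic, while Random returns a different subset on each run unless a seed is fixed.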
Authentication
Public datasets
No authentication is required for public datasets.
Private and gated datasets
For private or gated datasets, you need a HuggingFace access token:
- Generate a token at huggingface.co/settings/tokens.
- In Pipelines, go to Settings > Service Credentials and add your HuggingFace token.
The token is used for all HuggingFace imports across your organization.
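Before adding a token to Service Credentials, it can help to confirm that HuggingFace accepts it. The sketch below uses only the Python standard library and assumes the Hub's `whoami-v2` endpoint with a Bearer authorization header. The function names are hypothetical, and this check is independent of how Pipelines itself uses the token.

```python
import urllib.error
import urllib.request

# Assumption: the Hub endpoint that identifies the owner of a token.
WHOAMI_URL = "https://huggingface.co/api/whoami-v2"


def build_whoami_request(token: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated request for the given token."""
    token = token.strip()
    if not token:
        raise ValueError("token is empty")
    return urllib.request.Request(
        WHOAMI_URL,
        headers={"Authorization": f"Bearer {token}"},
    )


def token_is_accepted(token: str, timeout: float = 10.0) -> bool:
    """Send the request; True on HTTP 200, False if the Hub rejects the token."""
    try:
        with urllib.request.urlopen(build_whoami_request(token), timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        # Error statuses such as 401 are raised as HTTPError.
        return False
```

A successful check only shows that the token is valid. Access to a specific gated dataset may still require accepting that dataset's terms on HuggingFace.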
Data type mapping
HuggingFace data types are automatically mapped to Pipelines column types:
| HuggingFace Type | Pipelines Type |
|---|---|
| string, large_string | String |
| int8–int64, float16–float64 | Number |
| bool | Boolean |
| timestamp, date32, date64 | Date |
| Image, Audio, Video features | File (media) |
| ClassLabel | String |
| Sequence (of media types) | File (media) |
| Sequence (non-media), dict | JSON |
| binary | File |
You can override column types in the preview step using the dropdown in each column header. Available types: String, Number, Boolean, Date, URL, JSON, File (media).
Media columns (images, audio, video) are fully imported and viewable directly in Pipelines.
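The mapping table above can be read as a simple lookup. The following Python sketch restates it for illustration only; `map_type` and its `inner` parameter are hypothetical, and the type names are passed as plain strings rather than real HuggingFace feature objects. Types not listed in the table raise an error here because the table does not say how Pipelines treats them.

```python
import re

# Direct one-to-one entries from the mapping table.
SCALAR_TYPES = {
    "string": "String",
    "large_string": "String",
    "bool": "Boolean",
    "date32": "Date",
    "date64": "Date",
    "binary": "File",
}
MEDIA_FEATURES = {"Image", "Audio", "Video"}
# int8 through int64 and float16 through float64, as listed in the table.
NUMBER_RE = re.compile(r"^(int(8|16|32|64)|float(16|32|64))$")


def map_type(hf_type, inner=None):
    """Map a HuggingFace type name to a Pipelines column type (illustrative)."""
    if hf_type in SCALAR_TYPES:
        return SCALAR_TYPES[hf_type]
    if NUMBER_RE.match(hf_type):
        return "Number"
    # Timestamp types often carry a unit suffix, for example "timestamp[s]".
    if hf_type.startswith("timestamp"):
        return "Date"
    if hf_type in MEDIA_FEATURES:
        return "File (media)"
    if hf_type == "ClassLabel":
        return "String"
    if hf_type == "Sequence":
        # `inner` names the element type of the sequence.
        return "File (media)" if inner in MEDIA_FEATURES else "JSON"
    if hf_type == "dict":
        return "JSON"
    raise ValueError(f"type not covered by the mapping table: {hf_type!r}")


if __name__ == "__main__":
    for name, inner in [("int32", None), ("Sequence", "Image"), ("Sequence", "string")]:
        print(name, inner, "->", map_type(name, inner))
```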
Limitations
- Maximum import size is 100,000 rows per import.
- Imports with media columns may take significantly longer due to file downloads.