
Hugging Face Datasets

1. Hugging Face Datasets

Welcome back! Now that we've covered the models available through the Hugging Face Hub, let’s become acquainted with the datasets.

2. Datasets in Hugging Face

The Hugging Face Hub provides a collection of community-curated datasets across a variety of tasks and domains. Similar to models, all datasets can be found in the Hub under "datasets".

3. Searching for datasets

Likewise, there are multiple filtering options to help find the appropriate dataset, whether by task, language, size, or license, such as the MIT license for permissive free software use.

4. Dataset cards

Each dataset has a dataset card which provides more metadata and information about it.

5. Dataset cards

Specifically, it includes the dataset path, which uniquely identifies the dataset, a description, information about the dataset structure, an example, and field metadata.

6. Dataset cards

The viewer section shows the first twenty or so rows of the dataset. There may also be different subsets of the data, like a subset for English-only rows.

7. Installing Datasets Package

As with transformers, Hugging Face developed a Python package specifically for interacting with datasets, conveniently called datasets. The datasets library allows us to access, download, mutate, use, and share datasets with minimal lines of code.

8. Inspecting a dataset

Most datasets are quite large - often gigabytes in size - so it's useful to check their metadata first before downloading them. The load_dataset_builder() function allows us to inspect key dataset metadata programmatically, similar to what's shown on the dataset card. For instance, the .info.dataset_size attribute provides the dataset size in bytes, which we can convert to megabytes for clarity. This quick check ensures the dataset meets our needs before downloading it.

9. Downloading a dataset

Once we're happy that the dataset suits our needs, we use the load_dataset() function and provide the dataset path. Additional parameters like split let us specify which partitions to download, such as train, test, or validation, which are crucial for developing and evaluating ML models. Check the dataset card to see which partitions are available.

10. Apache Arrow dataset formats

It's important to note that most datasets on Hugging Face use Apache Arrow, a data format that stores data in columns rather than the more traditional row-based layout.

11. Data manipulation

With the Arrow dataset format, mutating a dataset is slightly different from other data structures, like pandas DataFrames. To filter, use the .filter() method with a lambda function that applies the defined criteria to each row. For example, filtering for rows with a label of zero returns a dataset containing only those rows.

12. Data manipulation

To select rows based on indices, use the .select() method. For instance, select(range(2)) retrieves the first two rows, leaving a dataset with two entries. To review a column in the sliced dataset, pass the index, in this case zero, and the column, in this case text.

13. Benefits of datasets

Hugging Face datasets offer several key benefits. They are highly accessible and shareable within the ecosystem, making them easy to use across projects. With community curation, they are tailored to everyday ML tasks. Additionally, the Arrow format ensures efficient processing and faster querying, even for large datasets.

14. Let's practice!

We've covered a lot about datasets - now it's time to put this knowledge into practice!