What if my dataset isn't on the Hub?

Loading a custom dataset

  • We will explore how the datasets library can be used to load datasets that are not available on the Hugging Face Hub.
  • Hugging Face Datasets provides loading scripts to handle local and remote datasets. It supports several common data formats, such as:
    • load_dataset("csv", data_files="my_file.csv")
    • load_dataset("text", data_files="my_file.txt")
    • load_dataset("json", data_files="my_file.jsonl")
    • load_dataset("pandas", data_files="my_dataframe.pkl")
  • To load a dataset in one of the above formats, you just need to pass the name of the format to the load_dataset() function, along with the data_files argument that points to one or more file paths or URLs.
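  • The data_files argument is quite flexible; as a minimal sketch (the file names below are hypothetical), it also accepts a list of files or a glob pattern:
from datasets import load_dataset

# Combine several CSV files into a single split (hypothetical file names)
multi_file_dataset = load_dataset("csv", data_files=["part1.csv", "part2.csv"])
# Glob patterns work too
glob_dataset = load_dataset("csv", data_files="data/*.csv")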

Loading local datasets

  • Here's how we can load a local CSV dataset.
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv

from datasets import load_dataset

# The file uses ";" as a delimiter, so sep is passed through just like in pandas.read_csv()
local_csv_dataset = load_dataset("csv", data_files="winequality-white.csv", sep=";")
local_csv_dataset["train"]
  • Here the dataset is loaded automatically as a DatasetDict object (with a single train split by default), with each column in the CSV file represented as a feature.
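  • As a quick sanity check, you can inspect the features and the first row of the dataset we just loaded:
# Each CSV column is mapped to a feature type
local_csv_dataset["train"].features
# The first row, returned as a dictionary of column -> value
local_csv_dataset["train"][0]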

Loading remote datasets

  • Remote datasets can be loaded by passing URLs to the data_files argument.

Loading CSV files

# Load the dataset from the URL directly
dataset_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
remote_csv_dataset = load_dataset("csv", data_files=dataset_url, sep=";")  # sep works just like in pandas.read_csv()
remote_csv_dataset
  • Here, the data_files argument points to a URL instead of a local file path.
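  • Hugging Face Datasets can also decompress common formats such as gzip on the fly, so compressed files can be passed directly; a minimal sketch (the URL below is hypothetical):
# A gzipped CSV needs no manual decompression (hypothetical URL)
compressed_url = "https://example.com/data/winequality.csv.gz"
compressed_dataset = load_dataset("csv", data_files=compressed_url, sep=";")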

Loading raw text files

  • Raw text files are read line by line to build the dataset.
dataset_url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
text_dataset = load_dataset("text", data_files=dataset_url)
text_dataset["train"][:5]

Loading JSON files

  • JSON files can be loaded in two main ways:
    • line by line, known as JSON Lines, where every row in the file is a separate JSON object
      • for these files you can load the dataset by selecting the json loading script and pointing the data_files argument to the file or URL
dataset_url = "https://raw.githubusercontent.com/hirupert/sede/main/data/sede/train.jsonl"
json_lines_dataset = load_dataset("json", data_files=dataset_url)
json_lines_dataset["train"][:2]
    • the other way is nested JSON, where you specify a field
      • these files basically look like one huge dictionary, so load_dataset() lets you pick which key to load via the field argument
dataset_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json"
json_dataset = load_dataset("json", data_files=dataset_url, field="data")
json_dataset
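  • To make the field argument concrete, a SQuAD-style file roughly has this shape (simplified, not the exact schema):
# {
#   "version": "v2.0",
#   "data": [ {"title": ..., "paragraphs": [...]}, ... ]
# }
# field="data" tells load_dataset() to build the dataset from the list under the "data" key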
  • You can also specify which splits to return with the data_files argument.
  • If you have more than one split, you can load them by passing data_files as a dictionary that maps each split name to its corresponding file.
  • Everything else stays completely unchanged.
url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/"
data_files = {"train": f"{url}train-v2.0.json", "validation": f"{url}dev-v2.0.json"}
json_dataset = load_dataset("json", data_files=data_files, field="data")
json_dataset
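  • The returned DatasetDict now contains both splits; a quick sketch of how to access them, plus the split argument if you only want one back as a plain Dataset:
# Access the dev set under the "validation" name we chose above
json_dataset["validation"]
# Or request a single split up front, returning a Dataset instead of a DatasetDict
train_only = load_dataset("json", data_files=data_files, field="data", split="train")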