Cannot use cached dataset without Internet connection (or when servers are down) #6837

Open
DionisMuzenitov opened this issue Apr 25, 2024 · 6 comments

@DionisMuzenitov

Describe the bug

I want to be able to use a cached dataset from HuggingFace even when I have no Internet connection (or when the HuggingFace servers are down, or my company has network issues).
The reason I can't:
The data_files argument of datasets.load_dataset() is updated from the server before the hash used for caching is calculated. As a result, when I run the same code with and without Internet access, I get a different dataset configuration directory name.
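
To make the mismatch concrete, here is a toy illustration. It is not the library's actual hashing code: config_hash is a hypothetical stand-in, and the resolved URL form is copied from the workaround later in this thread. It only shows that hashing the argument as typed and hashing its server-resolved form cannot produce the same directory name.

```python
import hashlib
import json


def config_hash(data_files: dict) -> str:
    """Hypothetical stand-in for a cache-directory hash: a digest of data_files."""
    payload = json.dumps(data_files, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


# The argument exactly as the user passes it to load_dataset().
requested = {"train": ["en/c4-train.00000-of-01024.json.gz"]}

# The same argument after it has been resolved against the Hub
# (URL form and commit hash taken from the workaround in this thread).
resolved = {
    "train": [
        "hf://datasets/allenai/c4@1588ec454efa1a09f29cd18ddd04fe05fc8653a2"
        "/en/c4-train.00000-of-01024.json.gz"
    ]
}

online_hash = config_hash(resolved)    # what an online run would hash
offline_hash = config_hash(requested)  # all an offline run has available

# Different inputs give different digests, so the cached directory is not found.
print(online_hash == offline_hash)
```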

Steps to reproduce the bug

import datasets

c4_dataset = datasets.load_dataset(
    path="allenai/c4",
    data_files={"train": "en/c4-train.00000-of-01024.json.gz"},
    split="train",
    cache_dir="/datesets/cache",
    download_mode="reuse_cache_if_exists",
    token=False,
)
  1. Run this code with an Internet connection.
  2. Run the same code without an Internet connection.

Expected behavior

When running without an Internet connection, the loader should be able to load the dataset from the cache.

Environment info

  • datasets version: 2.19.0
  • Platform: Windows-10-10.0.19044-SP0
  • Python version: 3.10.13
  • huggingface_hub version: 0.22.2
  • PyArrow version: 16.0.0
  • Pandas version: 1.5.3
  • fsspec version: 2023.12.2
@DionisMuzenitov
Author

There are two workarounds, though:

  1. Download the datasets from the web and load them locally
  2. Use the metadata directly (a temporary solution, since the metadata can change)
import datasets
from datasets.data_files import DataFilesDict, DataFilesList

data_files_list = DataFilesList(
    [
        "hf://datasets/allenai/c4@1588ec454efa1a09f29cd18ddd04fe05fc8653a2/en/c4-train.00000-of-01024.json.gz"
    ],
    [("allenai/c4", "1588ec454efa1a09f29cd18ddd04fe05fc8653a2")],
)
data_files = DataFilesDict({"train": data_files_list})
c4_dataset = datasets.load_dataset(
    path="allenai/c4",
    data_files=data_files,
    split="train",
    cache_dir="/datesets/cache",
    download_mode="reuse_cache_if_exists",
    token=False,
)

The second workaround also shows where the bug is. I suggest that the hashing functions always use only the original data_files parameter, not the one obtained after connecting to the server and creating a DataFilesDict.
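
The pinned entry in the second workaround follows a visible pattern: repository, commit hash, then the file path. A minimal sketch of building it, assuming that pattern holds; pinned_hub_url is a hypothetical helper, not part of datasets:

```python
def pinned_hub_url(repo_id: str, revision: str, filename: str) -> str:
    """Build an hf:// dataset URL pinned to one commit (hypothetical helper).

    The format is inferred from the example in this thread, not from a spec.
    """
    return f"hf://datasets/{repo_id}@{revision}/{filename}"


url = pinned_hub_url(
    repo_id="allenai/c4",
    revision="1588ec454efa1a09f29cd18ddd04fe05fc8653a2",
    filename="en/c4-train.00000-of-01024.json.gz",
)
print(url)
```

Because the revision is fixed, the string is identical on every run, which is presumably why this form yields the same cache directory online and offline.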

@mariosasko
Collaborator

Hi! You need to set the HF_DATASETS_OFFLINE env variable to 1 to load cached datasets offline, as explained in the docs here.
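
For reference, one way to set the variable from Python rather than from the shell. This is a sketch under the assumption that datasets reads the flag when it is imported, so the variable has to be set first; enable_hf_offline_mode is a hypothetical helper name. The replies in this thread discuss whether offline mode alone resolves the cache mismatch.

```python
import os


def enable_hf_offline_mode() -> None:
    """Set the offline flag for this process (hypothetical helper)."""
    os.environ["HF_DATASETS_OFFLINE"] = "1"


enable_hf_offline_mode()

# Assumption: the flag is read at import time, so import the library
# only after the variable has been set.
# import datasets
```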

@DionisMuzenitov
Author

DionisMuzenitov commented Apr 26, 2024

Just tested. It doesn't work, because of the exact problem I described above: the hash of the dataset config is different.
The only difference in the error is the stated reason it cannot connect to HuggingFace (now it's 'offline mode is enabled').
[image: screenshot of the error]

@ErikaaWang

I ran into a very similar issue: I manually placed the dataset in ~/.cache and tried to let load_dataset detect it automatically, but it always tries to reach the Hub, even when I set HF_DATASETS_OFFLINE to 1. Have you solved it?

@zjwu0522

same here!

@ZichengDuan

Same issue here. In my case, I need to download the dataset on the login node and run the jobs on a compute node that has no Internet access. However, load_dataset() always sends requests and tries to connect, even though I previously downloaded the dataset with the same load_dataset() call. By contrast, model.from_pretrained() handles this well: passing local_files_only = True forces it to reuse the pre-downloaded weights. load_dataset() has no such local_files_only option.
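
For the login-node / compute-node setup described above, one possible way around the cache lookup is to export the loaded dataset to a plain directory and read it back with load_from_disk(). This is a sketch, not a fix for the bug: export_for_offline_use and load_offline are hypothetical helper names, and it assumes both nodes can see the same filesystem path.

```python
from pathlib import Path


def export_for_offline_use(dataset, target_dir):
    """Login node: write an already-loaded dataset to a plain directory.

    `dataset` is expected to be a datasets.Dataset; only its
    save_to_disk() method is used here.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    dataset.save_to_disk(str(target))
    return target


def load_offline(target_dir):
    """Compute node: read the exported directory back from disk."""
    import datasets  # imported lazily so this module loads without the library

    return datasets.load_from_disk(str(target_dir))
```

On the login node this would be called as export_for_offline_use(c4_dataset, "/shared/c4_train") right after load_dataset(), and on the compute node as load_offline("/shared/c4_train"); the path is a placeholder. Since load_from_disk() reads the directory directly, it should not depend on the configuration hash at all.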


5 participants