This repository has been archived by the owner on Jul 4, 2023. It is now read-only.

wmt_dataset download failed #119

Open
chloejiwon opened this issue Oct 24, 2021 · 2 comments
@chloejiwon

Expected Behavior

Actual Behavior

  • Calling wmt_dataset fails with `ValueError: [DOWNLOAD FAILED]`.

Steps to Reproduce the Problem

  1. Install pytorch-nlp 0.5.0
  2. from torchnlp.datasets import wmt_dataset
  3. train=wmt_dataset(train=True)
>>> train = wmt_dataset(train=True)
tar: Error opening archive: Unrecognized archive format
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/torchnlp/datasets/wmt.py", line 63, in wmt_dataset
    download_file_maybe_extract(
  File "/usr/local/lib/python3.9/site-packages/torchnlp/download.py", line 170, in download_file_maybe_extract
    raise ValueError('[DOWNLOAD FAILED] `*check_files` not found')
ValueError: [DOWNLOAD FAILED] `*check_files` not found
@ro-ko

ro-ko commented Dec 6, 2021

In torchnlp/download.py

def _download_file_from_drive(filename, url):  # pragma: no cover
    """ Download filename from google drive unless it's already in directory.

    Args:
        filename (str): Name of the file to download to (do nothing if it already exists).
        url (str): URL to download from.
    """
    confirm_token = None

    # Since the file is big, drive will scan it for virus and take it to a
    # warning page. We find the confirm token on this page and append it to the
    # URL to start the download process.
    confirm_token = None
    session = requests.Session()
    response = session.get(url, stream=True)
    for k, v in response.cookies.items():
        if k.startswith("download_warning"):
            confirm_token = v

    if confirm_token:
        url = url + "&confirm=" + confirm_token

    logger.info("Downloading %s to %s" % (url, filename))

    response = session.get(url, stream=True)
    # Now begin the download.
    chunk_size = 16 * 1024
    with open(filename, "wb") as f:
        for chunk in response.iter_content(chunk_size):
            if chunk:
                f.write(chunk)

    # Print newline to clear the carriage return from the download progress
    statinfo = os.stat(filename)
    logger.info("Successfully downloaded %s, %s bytes." % (filename, statinfo.st_size))
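Note that the function above writes whatever the server returns to disk, so an error page gets saved under the archive's name. A minimal sketch of a guard that could run before the write loop follows. `validate_download_response` is a hypothetical helper, not part of torchnlp, and the assumption that error pages arrive with a `text/html` content type is mine.

```python
def validate_download_response(status_code, content_type):
    """Raise ValueError if an HTTP response is clearly not an archive.

    Hypothetical helper for illustration; it is not part of torchnlp.
    """
    if status_code != 200:
        raise ValueError("[DOWNLOAD FAILED] HTTP status %d" % status_code)
    # Assumption: error and warning pages are served as HTML, archives are not.
    if content_type and content_type.lower().startswith("text/html"):
        raise ValueError("[DOWNLOAD FAILED] got an HTML page instead of an archive")
```

With a `requests` response it could be called as `validate_download_response(response.status_code, response.headers.get("Content-Type", ""))`.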

I checked which of the `*check_files` was not found.

Result

data/wmt16_en_de/train.tok.clean.bpe.32000.en
Extracting data/wmt16_en_de/wmt16_en_de.tar.gz
tar: Error opening archive: Unrecognized archive format
data/wmt16_en_de/train.tok.clean.bpe.32000.en
The 'data/wmt16_en_de/wmt16_en_de.tar.gz' file is in fact an HTML document (HTML document text, ASCII text), not a tar archive.

I opened the URL 'https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8' that the documentation gives for the wmt dataset.
It returned a 404 Not Found page.

So this bug is caused by the wmt data URL in the documentation.
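
To confirm the same diagnosis locally, the first two bytes of the downloaded file can be compared against the gzip magic number (0x1f 0x8b). This is a minimal sketch using only the standard library; `looks_like_gzip` is a hypothetical name, and the path in the usage note is the one from the output above.

```python
GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def looks_like_gzip(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC
```

For example, `looks_like_gzip("data/wmt16_en_de/wmt16_en_de.tar.gz")` should be False when the saved file is an HTML error page.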

@maximus12793

Any update on this?
