Improved experience when linking non-downloadable content #93

Open
seanmacavaney opened this issue Jul 15, 2021 · 6 comments
Labels
enhancement New feature or request

Comments

@seanmacavaney
Collaborator

Describe the solution you'd like
Have a separate file structure for non-downloadable files. Improve linking experience by providing a command line utility to link, or by giving the command to link to the user directly.

Will require a migration of existing files and (potentially) a plan for backward compatibility.

Additional context
As suggested here: #89 (comment)

@seanmacavaney seanmacavaney added the enhancement New feature or request label Jul 15, 2021
@seanmacavaney
Collaborator Author

Partially addressing this in #103. Will give a message like this one:

[INFO] If you have a local copy of https://ai2-s2-research-public.s3-us-west-2.amazonaws.com/ir-datasets/c4/en.noclean.checkpoints.tar.gz, you can symlink it to avoid downloading it again, e.g.:
ln -s /path/to/en.noclean.checkpoints.tar.gz /home/sean/.ir_datasets/downloads/eab00c3b5202564da998466198a01298
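For anyone who would rather script the linking step, here is a minimal Python sketch of the same idea as the `ln -s` command above. The helper name `link_local_copy` and its parameters are hypothetical and not part of ir_datasets; the downloads directory and the hash-like file name should be copied from the [INFO] message itself.

```python
import os
from pathlib import Path


def link_local_copy(local_file, downloads_dir, link_name):
    """Symlink an existing local file into the downloads directory.

    Hypothetical helper, not part of ir_datasets. `link_name` is the
    hash-like file name printed in the [INFO] message.
    """
    local_file = Path(local_file).expanduser().resolve()
    if not local_file.is_file():
        raise FileNotFoundError(f"no local copy at {local_file}")

    downloads_dir = Path(downloads_dir).expanduser()
    downloads_dir.mkdir(parents=True, exist_ok=True)  # create it if missing

    link_path = downloads_dir / link_name
    if link_path.exists() or link_path.is_symlink():
        raise FileExistsError(f"{link_path} already exists")

    os.symlink(local_file, link_path)  # same effect as `ln -s local_file link_path`
    return link_path
```

For the message above, the call would be `link_local_copy("/path/to/en.noclean.checkpoints.tar.gz", "~/.ir_datasets/downloads", "eab00c3b5202564da998466198a01298")`. On Windows, `os.symlink` may require administrator rights or Developer Mode.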

@yuenherny

yuenherny commented Sep 3, 2022

Hi @seanmacavaney, may I know how this works on Windows? There's no downloads folder in \.ir_datasets.

How do I symlink the dataset I downloaded myself? I keep having issues with PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\USER\\.ir_datasets\\msmarco-passage\\dev\\ms.run.tmp0' - the downloaded file got deleted and I need to spend another hour to download it again 😅

@seanmacavaney
Collaborator Author

I'm not very experienced with Windows, but I think you can create the missing directory with:

mkdir C:\Users\USER\.ir_datasets\downloads

And for the download, I think you can use curl:

curl.exe --output C:\Users\USER\.ir_datasets\downloads\XXX --url URL

(where XXX is the hash provided in the message and URL is the target URL)
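The same two steps (create the directory, then fetch the file into it) can also be sketched in Python using only the standard library. The function name `download_to` is hypothetical; the target path and URL are the same XXX and URL placeholders as in the curl command.

```python
import shutil
import urllib.request
from pathlib import Path


def download_to(url, target_path, chunk_size=1024 * 1024):
    """Stream `url` to `target_path`, creating parent directories first."""
    target_path = Path(target_path)
    # Equivalent of the mkdir step: create the downloads folder if missing.
    target_path.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(target_path, "wb") as out:
        # Copy in chunks so a large archive is never held fully in memory.
        shutil.copyfileobj(response, out, chunk_size)
    return target_path
```

Unlike the downloader whose log output appears in this thread, this sketch does not retry or resume on a read timeout, so it is only a starting point for large files.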

Hope this helps!

@yuenherny

yuenherny commented Sep 5, 2022

Hi @seanmacavaney , thanks for the prompt response.

By using curl, it seems like I need to download the file again, which is what I am trying to avoid, since I already have a local copy of the file.

I tried creating the symbolic link by following the instructions here.

  1. In CMD (opened with admin rights):
mklink C:\Users\<username>\.ir_datasets\downloads\8c140662bdf123a98fbfe3bb174c5831 C:\Users\<username>\.ir_datasets\msmarco-passage\top1000.dev.tar.gz
  2. I get this as response:
symbolic link created for C:\Users\USER\.ir_datasets\downloads\8c140662bdf123a98fbfe3bb174c5831 <<===>> C:\Users\USER\.ir_datasets\msmarco-passage\top1000.dev.tar.gz

Then I reran the .scoreddocs_iter() cell, but it seems to be downloading the file again? (This time without the symlink instructions, though.)

Right now I am letting the process finish to see what errors I encounter with this symlink method.
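One thing that can be checked independently of ir_datasets is whether the link created by mklink actually resolves to the archive. Below is a small diagnostic sketch; `describe_link` is a hypothetical helper, and it only inspects the file system. It does not explain why the library starts a new download.

```python
import os
from pathlib import Path


def describe_link(link_path):
    """Report what a file in the downloads directory actually points to."""
    link_path = Path(link_path)
    info = {
        "is_symlink": link_path.is_symlink(),
        "target_exists": link_path.exists(),  # False for a dangling symlink
        "resolved": None,
        "size_bytes": None,
    }
    if info["target_exists"]:
        info["resolved"] = os.path.realpath(link_path)
        info["size_bytes"] = link_path.stat().st_size  # stat() follows the link
    return info
```

If `target_exists` is false, or `size_bytes` is much smaller than the size reported in the download log, the link points at a missing or partial file.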

@yuenherny

Apparently the software downloads the dataset (again), and this time it kind of shoots itself in the foot:

  • ms.run.tmp5 is created by the software while the process is running
  • but when the process is about to wrap up, it fails with PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\USER\\.ir_datasets\\msmarco-passage\\dev\\ms.run.tmp5'

Full error message:

[INFO] [starting] https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz
[INFO] download error: HTTPSConnectionPool(host='msmarco.blob.core.windows.net', port=443): Read timed out.. Retrying range "121044992-" [2 attempts left]
[INFO] download error: HTTPSConnectionPool(host='msmarco.blob.core.windows.net', port=443): Read timed out.. Retrying range "245940224-" [2 attempts left]
[INFO] [finished] https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz: [44:01] [687MB] [260kB/s]
[WARNING] Download failed: [WinError 5] Access is denied: 'C:\\Users\\USER\\AppData\\Local\\Temp\\ir_datasets\\tmppzhcp6nu.tmp' -> 'C:\\Users\\USER\\AppData\\Local\\Temp\\ir_datasets\\tmppzhcp6nu'
Output exceeds the size limit. Open the full output data in a text editor
---------------------------------------------------------------------------
PermissionError                           Traceback (most recent call last)
File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\util\fileio.py:69, in Cache.verify(self)
     68 with self._streamer.stream() as stream:
---> 69     shutil.copyfileobj(stream, f)
     70 f.close() # close file before move... Needed because of Windows

File ~\AppData\Local\Programs\Python\Python310\lib\shutil.py:195, in copyfileobj(fsrc, fdst, length)
    194 while True:
--> 195     buf = fsrc_read(length)
    196     if not buf:

File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\util\fileio.py:35, in IterStream.readinto(self, b)
     34 l = len(b) - pos  # We're supposed to return at most this much
---> 35 chunk = self.leftover or next(self.it)
     36 output, self.leftover = chunk[:l], chunk[l:]

File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\datasets\msmarco_passage.py:52, in ExtractQidPid.__iter__(self)
     51 def __iter__(self):
---> 52     with self._streamer.stream() as stream:
     53         for line in _logger.pbar(stream, desc='extracting QID/PID pairs', unit='pair'):

File ~\AppData\Local\Programs\Python\Python310\lib\contextlib.py:135, in _GeneratorContextManager.__enter__(self)
    134 try:
--> 135     return next(self.gen)
...
-> 1206     self._accessor.unlink(self)
   1207 except FileNotFoundError:
   1208     if not missing_ok:

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\USER\\.ir_datasets\\msmarco-passage\\dev\\ms.run.tmp5'
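The comment visible in the traceback ("close file before move... Needed because of Windows") points at a general Windows constraint: a file opened with Python's default `open()` usually cannot be renamed or deleted while the handle is still open. The following is a generic sketch of the close-then-move pattern, not the ir_datasets implementation; `write_then_move` is a hypothetical name, and whether this is the actual cause of the error above is an assumption.

```python
import os
import tempfile
from pathlib import Path


def write_then_move(chunks, final_path):
    """Write chunks to a temporary file, close it, then move it into place."""
    final_path = Path(final_path)
    final_path.parent.mkdir(parents=True, exist_ok=True)

    # Create the temp file next to the destination so the move stays on one volume.
    fd, tmp_name = tempfile.mkstemp(dir=final_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            for chunk in chunks:
                tmp_file.write(chunk)
        # The `with` block has closed the handle here, so the move below
        # is not blocked by our own open handle.
        os.replace(tmp_name, final_path)
    except BaseException:
        # Clean up the partial file; by now our handle is closed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return final_path
```

The key design choice is that every rename or unlink happens only after the `with` block has exited, which is the ordering the traceback comment describes.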

I am guessing that this is a legitimate bug - which is what you mentioned here

Right now I am using a soft link on Windows. I will try to see whether things are better with a hard link.

Screenshot:
image

@yuenherny

Tried hard link in Windows via mklink /H C:\Users\USER\.ir_datasets\downloads\8c140662bdf123a98fbfe3bb174c5831 C:\Users\USER\.ir_datasets\msmarco-passage\top1000.dev.tar.gz and got Hardlink created for C:\Users\USER\.ir_datasets\downloads\8c140662bdf123a98fbfe3bb174c5831 <<===>> C:\Users\USER\.ir_datasets\msmarco-passage\top1000.dev.tar.gz
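For reference, a hard link can also be created and verified from Python. This is a minimal sketch with hypothetical helper names; note that `os.link` takes the existing file first and the new name second, the reverse of mklink's argument order.

```python
import os


def make_hard_link(existing_file, new_name):
    """Create `new_name` as a hard link to `existing_file` (same volume only)."""
    os.link(existing_file, new_name)  # source first, then the new link name
    return os.stat(new_name).st_nlink  # number of names now pointing at the data


def is_same_file_on_disk(path_a, path_b):
    """Return True if both paths refer to the same underlying file."""
    return os.path.samefile(path_a, path_b)
```

A link count of 2 or more, together with `is_same_file_on_disk` returning True, confirms that the entry in the downloads folder and the original archive share the same data.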

Restarted the kernel and reran the notebook from the top, but it still tries to download from the URL again 😅:

[INFO] [starting] https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz
[INFO] [error] https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz: [00:26] [7.32MB] [280kB/s] 

Screenshot:
image
