mtdata downloader #10

Merged
merged 18 commits into from
Oct 13, 2022

XapaJIaMnu
Collaborator

@XapaJIaMnu XapaJIaMnu commented Jul 14, 2022

This is just a draft, as I am not exactly sure how to hook it up to the GUI, but I have coded the backend bits necessary for:

  1. Deduplicating datasets (only taking the latest version of a dataset)
  2. Downloading datasets (in parallel) using the mtdata cli.
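
A minimal sketch of the "latest version only" deduplication in step 1, not the code from this pull request. It assumes each dataset can be described by a group, name, version, and language pair; the `DatasetRef` type, the `version_key` helper, and the example names are hypothetical, and the version comparison is deliberately naive (it only looks at the digits in the version string).

```python
import re
from typing import Iterable, List, NamedTuple, Tuple


class DatasetRef(NamedTuple):
    """Hypothetical stand-in for a dataset entry; not mtdata's own type."""
    group: str
    name: str
    version: str
    langs: str


def version_key(version: str) -> Tuple[int, ...]:
    # Naive ordering: compare only the integers found in the version string,
    # so "v10" sorts after "v9" and "2.1" after "2".
    return tuple(int(number) for number in re.findall(r"\d+", version))


def latest_only(datasets: Iterable[DatasetRef]) -> List[DatasetRef]:
    # Keep one entry per (group, name, langs), preferring the highest version.
    latest = {}
    for ds in datasets:
        key = (ds.group, ds.name, ds.langs)
        current = latest.get(key)
        if current is None or version_key(ds.version) > version_key(current.version):
            latest[key] = ds
    return list(latest.values())
```

Versions that carry no digits at all compare as equal here, so the first one seen wins; a real implementation would need a rule for those.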

I imagine it as a tab, something like "Discover datasets", where we get a list of datasets and can manually exclude some or label them clean/medium/dirty. The downloader automatically splits train/test/dev based on the dataset id provided by mtdata.

This should fix #6 eventually.

mtdata-stuff.py Outdated
@@ -53,3 +94,10 @@ def read_dataset(did: str):
@app.get("/datasets/{did}/sample")
Collaborator Author

This duplicates the function above. I assume it is part of the JavaScript interface. @jelmervdl, is that how it's supposed to work?

@XapaJIaMnu XapaJIaMnu marked this pull request as ready for review September 12, 2022 12:08
@XapaJIaMnu
Collaborator Author

I propose we merge this for now so that it's in the tree and it can be picked up when we have a downloader interface.

@jelmervdl
Collaborator

On its own, this doesn't add much actual functionality. I'll take over this pull request and add at least a minimal interface for it before merging.

@jelmervdl jelmervdl marked this pull request as draft October 4, 2022 16:19
@jelmervdl
Collaborator

Current route: [screenshot]

I'm seriously considering adding Bootstrap to the project for a bit more UI, as well as a filter toggle that shows only the latest version of each dataset.

@jelmervdl jelmervdl marked this pull request as ready for review October 5, 2022 14:46
@XapaJIaMnu
Collaborator Author

XapaJIaMnu commented Oct 5, 2022

Looks good to me, pending the answer to thammegowda/mtdata#129. Ideally I'd like to see how much I am downloading before starting to download. We also had the issue with the Mozilla pipeline that downloads would fail because we were being throttled, so we should be able to limit the number of parallel downloads.

@XapaJIaMnu
Collaborator Author

Also, crash on exit:

^CINFO:     Shutting down
2022-10-05 17:30:27 server.shutdown:252 INFO:: Shutting down
INFO:     Finished server process [14370]
2022-10-05 17:30:27 server.serve:85 INFO:: Finished server process [14370]
ERROR:    Traceback (most recent call last):
  File "/home/dheart/uni_stuff/postdoc/empty-train/.env/lib/python3.10/site-packages/starlette/routing.py", line 638, in lifespan
    await receive()
  File "/home/dheart/uni_stuff/postdoc/empty-train/.env/lib/python3.10/site-packages/uvicorn/lifespan/on.py", line 135, in receive
    return await self.receive_queue.get()
  File "/usr/lib/python3.10/asyncio/queues.py", line 159, in get
    await getter
asyncio.exceptions.CancelledError

2022-10-05 17:30:27 on.send:132 ERROR:: Traceback (most recent call last):
  File "/home/dheart/uni_stuff/postdoc/empty-train/.env/lib/python3.10/site-packages/starlette/routing.py", line 638, in lifespan
    await receive()
  File "/home/dheart/uni_stuff/postdoc/empty-train/.env/lib/python3.10/site-packages/uvicorn/lifespan/on.py", line 135, in receive
    return await self.receive_queue.get()
  File "/usr/lib/python3.10/asyncio/queues.py", line 159, in get
    await getter
asyncio.exceptions.CancelledError

@jelmervdl
Collaborator

Downloads are currently limited to two concurrent downloads.
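
For illustration, one common way to cap concurrency in an asyncio application is a semaphore. This is a sketch under that assumption, not necessarily how the limit is implemented in this pull request; `run_limited` and the shape of `jobs` (zero-argument callables that return coroutines) are hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable, List


async def run_limited(
    jobs: Iterable[Callable[[], Awaitable[Any]]], limit: int = 2
) -> List[Any]:
    # At most `limit` jobs hold the semaphore, and therefore run, at once.
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job: Callable[[], Awaitable[Any]]) -> Any:
        async with semaphore:
            return await job()

    # gather() returns results in the same order as the jobs were given.
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

Making `limit` a parameter would let a user lower it when a host starts throttling.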

Maybe I can get it to do a HEAD request to get the size (or Content-Length really) of the download. It would be infeasible to do this for all datasets that are listed, but should be doable for the ones in your "shopping list" at least.
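
A rough sketch of the HEAD request idea using only the standard library. The `remote_size` function and its `opener` parameter are hypothetical (the parameter exists only so the network call can be swapped out), and the URLs would have to come from wherever the downloader resolves them.

```python
import urllib.request
from typing import Callable, Optional


def remote_size(url: str, opener: Callable = urllib.request.urlopen) -> Optional[int]:
    # A HEAD request asks for the headers only, so nothing is downloaded.
    request = urllib.request.Request(url, method="HEAD")
    with opener(request) as response:
        value = response.headers.get("Content-Length")
    # Servers may omit Content-Length (e.g. chunked responses): report unknown.
    if value is None or not value.isdigit():
        return None
    return int(value)
```

Summing the known sizes over the selected datasets would give a lower bound on the total, since any entry that returns `None` is of unknown size.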

@jelmervdl jelmervdl merged commit c9b0a48 into main Oct 13, 2022
@jelmervdl jelmervdl deleted the importers branch October 13, 2022 11:29
Successfully merging this pull request may close these issues.

Corpus finder and downloader
2 participants