mtdata downloader #10

XapaJIaMnu · 2022-07-14T15:44:06Z

This is just a draft, as I am not exactly sure how to hook it up to the GUI, but I have coded the backend bits necessary for:

- Deduplicating datasets (only taking the latest version of a dataset)
- Downloading datasets (in parallel) using the mtdata CLI
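For illustration, the "latest version only" deduplication could look roughly like the sketch below. It is not the code from this PR. It assumes dataset IDs follow the `group-name-version-lang1-lang2` pattern (for example `Statmt-europarl-10-deu-eng`) and that versions can be compared with a natural sort; `version_key` and `latest_versions` are hypothetical names.

```python
import re


def version_key(version: str):
    """Natural-sort key so that '10' > '9' and 'v16' > 'v9' (an assumed ordering)."""
    chunks = re.findall(r"[0-9]+|[^0-9]+", version)
    # Numeric chunks sort after text chunks and compare as integers.
    return [(1, int(c)) if c[0] in "0123456789" else (0, c) for c in chunks]


def latest_versions(dataset_ids):
    """Keep one ID per (group, name, language pair): the one with the highest version."""
    latest = {}
    for did in dataset_ids:
        # Assumes exactly five '-'-separated fields; raises ValueError otherwise.
        group, name, version, lang1, lang2 = did.split("-")
        key = (group, name, lang1, lang2)
        if key not in latest or version_key(version) > version_key(latest[key][0]):
            latest[key] = (version, did)
    return [did for _, did in latest.values()]
```

With this sketch, `Statmt-europarl-9-deu-eng` and `Statmt-europarl-10-deu-eng` collapse to the latter, while datasets with a different name or language pair are left untouched.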

I imagine it as a tab, something like "discover datasets", where we get a list of datasets and can manually exclude some of them or label them clean/medium/dirty. The downloader automatically splits train/test/dev based on the dataset ID provided by mtdata.
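One way such an ID-based split could work is a simple substring check on the dataset name. This is only a guess at the heuristic, not the PR's actual logic; `guess_split` and `split_datasets` are hypothetical helpers, and the `group-name-version-lang1-lang2` ID layout is an assumption.

```python
def guess_split(dataset_id: str) -> str:
    """Guess train/dev/test from the name field of the dataset ID."""
    # Assumed layout: group-name-version-lang1-lang2; the name is the second field.
    name = dataset_id.split("-")[1].lower()
    if "dev" in name:
        return "dev"
    if "test" in name:
        return "test"
    return "train"


def split_datasets(dataset_ids):
    """Group dataset IDs into train/dev/test buckets using guess_split."""
    buckets = {"train": [], "dev": [], "test": []}
    for did in dataset_ids:
        buckets[guess_split(did)].append(did)
    return buckets
```

Note that checking "dev" first means a name such as `devtest` lands in the dev bucket; a real implementation would need to decide how to treat such cases.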

This should fix #6 eventually.

This reverts commit 2be4bc4. Turns out it was working beforehand.

XapaJIaMnu · 2022-09-12T12:08:23Z

mtdata-stuff.py

@@ -53,3 +94,10 @@ def read_dataset(did: str):
 @app.get("/datasets/{did}/sample")


This duplicates the function above. I assume it is part of the JavaScript interface. @jelmervdl, is that how it's supposed to work?

XapaJIaMnu · 2022-09-12T12:09:12Z

I propose we merge this for now so that it's in the tree and it can be picked up when we have a downloader interface.

jelmervdl · 2022-09-13T20:14:37Z

On its own, this doesn't add much actual functionality. I'll take over this pull request and add at least some sort of minimal interface for it before merging.

jelmervdl · 2022-10-04T16:32:00Z

Current route:

I'm really considering adding Bootstrap to the project for a bit more UI. Also a filter toggle for only the latest version of each dataset.

XapaJIaMnu · 2022-10-05T16:27:48Z

Looks good to me, pending the answer to thammegowda/mtdata#129. Ideally I'd like to see how much I am downloading before starting the download. Also, we had an issue with the Mozilla pipeline where downloads would fail because we were throttled, so we should be able to limit the number of parallel downloads.

XapaJIaMnu · 2022-10-05T16:29:49Z

Also, crash on exit:

^CINFO:     Shutting down
2022-10-05 17:30:27 server.shutdown:252 INFO:: Shutting down
INFO:     Finished server process [14370]
2022-10-05 17:30:27 server.serve:85 INFO:: Finished server process [14370]
ERROR:    Traceback (most recent call last):
  File "/home/dheart/uni_stuff/postdoc/empty-train/.env/lib/python3.10/site-packages/starlette/routing.py", line 638, in lifespan
    await receive()
  File "/home/dheart/uni_stuff/postdoc/empty-train/.env/lib/python3.10/site-packages/uvicorn/lifespan/on.py", line 135, in receive
    return await self.receive_queue.get()
  File "/usr/lib/python3.10/asyncio/queues.py", line 159, in get
    await getter
asyncio.exceptions.CancelledError

2022-10-05 17:30:27 on.send:132 ERROR:: Traceback (most recent call last):
  File "/home/dheart/uni_stuff/postdoc/empty-train/.env/lib/python3.10/site-packages/starlette/routing.py", line 638, in lifespan
    await receive()
  File "/home/dheart/uni_stuff/postdoc/empty-train/.env/lib/python3.10/site-packages/uvicorn/lifespan/on.py", line 135, in receive
    return await self.receive_queue.get()
  File "/usr/lib/python3.10/asyncio/queues.py", line 159, in get
    await getter
asyncio.exceptions.CancelledError

jelmervdl · 2022-10-05T18:12:39Z

Downloads are currently limited to two concurrent downloads.
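A cap like this can be expressed with an `asyncio.Semaphore`. The sketch below is not the implementation in this PR; `download_all` is a hypothetical name, and `fetch` stands for whatever coroutine actually downloads a single dataset.

```python
import asyncio


async def download_all(dataset_ids, fetch, limit=2):
    """Run fetch(dataset_id) for every ID, with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(did):
        # Wait for a free slot before starting this download.
        async with semaphore:
            return await fetch(did)

    # Results come back in the same order as dataset_ids.
    return await asyncio.gather(*(guarded(did) for did in dataset_ids))
```

Because everything runs inside one `gather`, cancelling the task that awaits `download_all` also cancels the downloads that have not finished yet.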

Maybe I can get it to do a HEAD request to get the size (or Content-Length, really) of the download. It would be infeasible to do this for all listed datasets, but it should be doable for the ones in your "shopping list" at least.
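A rough sketch of that idea using only the standard library is shown below. It assumes each dataset in the shopping list has a plain HTTP(S) URL and that the server answers HEAD requests with a `Content-Length` header, which not every server does. `content_length` and `total_size` are hypothetical helpers, and the `open_url` parameter exists only so the opener can be swapped out.

```python
import urllib.request


def content_length(url, open_url=urllib.request.urlopen, timeout=10):
    """Return the Content-Length reported for a HEAD request, or None if absent."""
    request = urllib.request.Request(url, method="HEAD")
    with open_url(request, timeout=timeout) as response:
        value = response.headers.get("Content-Length")
    return int(value) if value is not None else None


def total_size(urls, **kwargs):
    """Sum the known sizes; also report how many URLs gave no size."""
    sizes = [content_length(url, **kwargs) for url in urls]
    known = [size for size in sizes if size is not None]
    return sum(known), len(sizes) - len(known)
```

Reporting the number of unknown sizes alongside the total keeps the estimate honest when some servers omit the header.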

XapaJIaMnu added 8 commits July 14, 2022 15:04

mtdata downloader

b0219c2

typing

1ca500c

Threadpool -> processPool

566be3f

Use a map for the executor

2b948ec

try again with submit

db9925a

Try again

038c3ed

Get concurrency to work later

2be4bc4

Revert "Get concurrency to work later"

ee7502d

This reverts commit 2be4bc4. Turns out it was working beforehand.

XapaJIaMnu requested a review from jelmervdl July 14, 2022 15:44

Merge with main

950bf5f

XapaJIaMnu commented Sep 12, 2022


XapaJIaMnu marked this pull request as ready for review September 12, 2022 12:08

jelmervdl added 3 commits September 26, 2022 11:24

In progress integration of the mtdata-stuff.py (WIP commit)

be63efb

Merge branch 'main' into importers

4b20d2f

Something that lists datasets filtered by language

c5bce5a

jelmervdl marked this pull request as draft October 4, 2022 16:19

Shopping cart view

34b8869

jelmervdl added 3 commits October 5, 2022 14:14

Missing overflow:hidden?

58c84a6

Add API endpoints for downloading

34317ab

Adding download functionality to frontend

af90471

jelmervdl marked this pull request as ready for review October 5, 2022 14:46

jelmervdl added 2 commits October 12, 2022 18:45

Replace download code with something that's cancelable

b35ae56

Mark downloads that have files on disk already as downloaded

176d9de

jelmervdl merged commit c9b0a48 into main Oct 13, 2022

jelmervdl deleted the importers branch October 13, 2022 11:29
