[FEATURE] ingest fiftyone datasets #1957

Open
1 of 2 tasks
nmichlo opened this issue Oct 23, 2022 · 7 comments
Labels
enhancement New feature or request

Comments

@nmichlo

nmichlo commented Oct 23, 2022

🚨🚨 Feature Request

  • Related to an existing Issue
  • A new implementation (Improvement, Extension)

Is your feature request related to a problem?

I need to be able to ingest fiftyone datasets into deeplake

  • exporting would also be an interesting addition

If your feature will improve HUB

Fiftyone is a common dataset import and export tool. Integration with deeplake would make such operations easy and would mean that we do not have to implement them from scratch.

Description of the possible solution

import deeplake
import fiftyone

dataset = fiftyone.load_dataset('my_dataset')

# Proposed API, does not exist yet: ideally this would detect the different
# field types and labels in the dataset and import them accordingly.
deeplake.ingest_51('deeplake_data/my_dataset', dataset)

An alternative solution to the problem can look like

Ingest steps could be written manually. (Fiftyone doesn't enforce much structure on datasets, so I am not sure the proposed ingest function even has a single well-defined solution; maybe some basic structure would have to be required.)
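To make the manual route a bit more concrete, here is a minimal sketch of the kind of reshaping such a script tends to do: turning per-sample records into parallel columns (file paths, class indices, boxes), one entry per sample. Plain dictionaries stand in for fiftyone samples, and the field names (`filepath`, `detections`, `label`, `bounding_box`) plus the `flatten_samples` helper are assumptions for illustration, not part of either library.

```python
from typing import Any


def flatten_samples(samples: list[dict[str, Any]]) -> dict[str, list]:
    """Turn per-sample records into parallel columns, one entry per sample."""
    # Build a stable class index from the sorted unique label names.
    class_names = sorted({d["label"] for s in samples for d in s["detections"]})
    class_index = {name: i for i, name in enumerate(class_names)}

    columns: dict[str, list] = {
        "images": [],
        "labels": [],
        "boxes": [],
        "class_names": class_names,
    }
    for sample in samples:
        detections = sample["detections"]
        columns["images"].append(sample["filepath"])
        columns["labels"].append([class_index[d["label"]] for d in detections])
        columns["boxes"].append([d["bounding_box"] for d in detections])
    return columns


# Hypothetical stand-in data; a real script would read these from the dataset.
samples = [
    {
        "filepath": "a.jpg",
        "detections": [{"label": "dog", "bounding_box": [0.1, 0.2, 0.3, 0.4]}],
    },
    {
        "filepath": "b.jpg",
        "detections": [
            {"label": "cat", "bounding_box": [0.5, 0.5, 0.2, 0.2]},
            {"label": "dog", "bounding_box": [0.0, 0.0, 1.0, 1.0]},
        ],
    },
]
columns = flatten_samples(samples)
```

The "basic structure" question shows up right away: this only works because every sample is assumed to carry a `detections` list with the same two keys.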

Teachability, Documentation, Adoption, Migration Strategy
Needs discussion first

@nmichlo nmichlo added the enhancement New feature or request label Oct 23, 2022
@mikayelh
Collaborator

hey @nmichlo, thanks a lot for the feature request! I'm tagging @istranic for visibility and follow-up here. If you feel like it, we would welcome a contribution to this enhancement!

@davidbuniat
Member

davidbuniat commented Oct 23, 2022

@nmichlo thanks a lot for opening the issue. Curious, can you give us more context on the use case: why would you like to import FiftyOne datasets? (What do you like and dislike about FiftyOne?)

@nmichlo
Author

nmichlo commented Oct 23, 2022

Use Case:

As part of my day-to-day work I often need to find, download, import and pre-process many different existing datasets, which are usually in common formats like COCO or YOLOv5. Occasionally I need to write a script to import a custom format, but I generally try to avoid that. These datasets are then often merged together or added to existing datasets that are used for re-training. Because the models are improved by iterating on the data, version control would be great here, which is why deeplake is so appealing.

  • Eventually I would love to transition from exporting to custom formats to using the deeplake dataloaders themselves. However, for experimentation it might often still be necessary to export to these common formats (e.g. YOLOv5, COCO) to avoid code changes to external libs.
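For context, most of the work in converting between these two formats is box arithmetic. COCO stores a box as `[x_min, y_min, width, height]` in pixels, while a YOLO label line is `class cx cy w h` with the box centre and size normalised by the image dimensions. A small sketch of that conversion (the function names are made up for this example):

```python
def coco_box_to_yolo(box, img_w, img_h):
    """Convert a COCO pixel box [x_min, y_min, w, h] to normalised (cx, cy, w, h)."""
    x_min, y_min, w, h = box
    cx = (x_min + w / 2) / img_w
    cy = (y_min + h / 2) / img_h
    return (cx, cy, w / img_w, h / img_h)


def yolo_line(class_id, box, img_w, img_h):
    """Format one YOLO label line for a COCO-style box."""
    cx, cy, w, h = coco_box_to_yolo(box, img_w, img_h)
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# A 200x100 px box whose top-left corner is at (100, 50) in a 400x200 image.
line = yolo_line(3, [100, 50, 200, 100], img_w=400, img_h=200)
# line == "3 0.500000 0.500000 0.500000 0.500000"
```

Class ids are passed through unchanged here; a real export would also need to remap COCO category ids to the contiguous zero-based ids YOLO expects.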

Fiftyone, the good and bad:

Disclaimer: my overall experience with fiftyone is still fairly limited. My main use case is the import/export functionality, combined with the local preview of datasets, occasional dataset filtering, and renaming/removing labels. Ultimately I would love to replace fiftyone entirely with deeplake and store datasets in our own cloud buckets.

What is good about fiftyone:

  • the built-in import/export functionality of datasets, from/to many different common dataset formats.
    • Often enables converting datasets for use with existing SOTA projects without modification to their source code.
  • local dataset previews in the browser without jupyter notebook and external connections
  • CVAT/LabelStudio integration for re-labelling of data
    • extremely useful for iterative refinement (this would be an amazing feature for deeplake; done correctly, it could really set it apart)
  • tagging dataset items directly from the UI for use further down in scripts, eg. removal of problematic images.

What is bad about fiftyone:

  • extremely slow start times, making it painful to use in scripts. This is due to the MongoDB backend, which is heavily integrated and cannot be removed.
  • type hints are not great across the project, so IDE support is also not great, making the project difficult to work with.
  • structuring of items in datasets is much less intuitive, bordering on unstructured.
  • no versioning
  • not intended for use as a dataloader, export of datasets is required.
  • Some of the import/export formats are brittle and don't support the dataset standards entirely.

EDIT: overall, deeplake has been extremely refreshing to work with. Really good work on the project so far!

EDIT2: might be worth adding fiftyone to the README section on "Comparisons to Familiar Tools"?

EDIT3: I can provide examples of my own fiftyone -> deeplake import script, but it is definitely not general in any sense. It was tailored to a specific format, purely as a test.

@nmichlo
Author

nmichlo commented Oct 23, 2022

Based on my clarified use case, I might even argue against my own issue: fiftyone ingest would be a nice-to-have, and ultimately a better solution might be built-in support for ingesting and exporting common dataset formats.

EDIT: this could also serve as a good way of documenting / providing examples of real-world use cases, that can be adapted.
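A rough sketch of what format-based ingest could look like: one small reader per format, registered by name, each producing the same record shape (a class id plus a `[left, top, width, height]` box in fractions of the image size). Everything here (`register_reader`, `ingest`, the record layout) is hypothetical and only meant to illustrate the idea, not an existing deeplake API.

```python
import json
from typing import Callable

# Common record: {"class_id": int, "box": [left, top, width, height]} (fractions).
Reader = Callable[[str], list[dict]]
READERS: dict[str, Reader] = {}


def register_reader(name: str):
    """Decorator that registers a reader function under a format name."""
    def wrap(fn: Reader) -> Reader:
        READERS[name] = fn
        return fn
    return wrap


@register_reader("yolo")
def read_yolo(text: str) -> list[dict]:
    # One "class cx cy w h" line per object, already normalised.
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        class_id, *values = line.split()
        cx, cy, w, h = map(float, values)
        records.append({"class_id": int(class_id), "box": [cx - w / 2, cy - h / 2, w, h]})
    return records


@register_reader("coco")
def read_coco(text: str) -> list[dict]:
    # Pixel boxes [x_min, y_min, w, h]; normalise by the owning image's size.
    data = json.loads(text)
    sizes = {img["id"]: (img["width"], img["height"]) for img in data["images"]}
    records = []
    for ann in data["annotations"]:
        img_w, img_h = sizes[ann["image_id"]]
        x, y, w, h = ann["bbox"]
        records.append({
            "class_id": ann["category_id"],  # passed through unchanged
            "box": [x / img_w, y / img_h, w / img_w, h / img_h],
        })
    return records


def ingest(text: str, fmt: str) -> list[dict]:
    """Dispatch to the reader registered for `fmt`."""
    if fmt not in READERS:
        raise ValueError(f"unknown format {fmt!r}; known: {sorted(READERS)}")
    return READERS[fmt](text)


coco_text = json.dumps({
    "images": [{"id": 1, "width": 400, "height": 200}],
    "annotations": [{"image_id": 1, "category_id": 0, "bbox": [100, 50, 200, 100]}],
})
from_coco = ingest(coco_text, "coco")
from_yolo = ingest("0 0.5 0.5 0.5 0.5", "yolo")
```

Both calls describe the same box, so they end up as identical records, which is the point of normalising to one internal shape before writing anything to a dataset.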

@istranic
Contributor

Hey @nmichlo Thank you for the feedback. This is extremely useful for our product development.

As I was reading your comments, I had the same thought as your last note:

  • "Based on my clarified use case, I might even argue against my own issue: fiftyone ingest would be a nice-to-have, and ultimately a better solution might be built-in support for ingesting and exporting common dataset formats."

Just want to confirm that I understand correctly, because it appears aligned with our roadmap. Would you rather have a function to ingest from 51, or a set of functions to ingest from dataset formats such as YOLO, COCO, CVAT, LabelStudio, and others?

@nmichlo
Author

nmichlo commented Oct 24, 2022

@istranic no problem, glad to help!

Ideally in the long run I personally would prefer not to use fiftyone, and ingest/export datasets directly.

However, I think there might be merit for both?

  • ingesting directly from fiftyone might keep additional information that would otherwise be discarded if there is first an export and then import step.
  • this would also allow easier migration to deeplake
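To illustrate the first point, here is a toy model of the information loss: each export format is reduced to the set of fields it can represent, and anything outside that set is dropped on the way through. The field sets are simplified assumptions for this example, not a full description of either format.

```python
# Simplified assumption: fields an annotation can carry in each export format.
FORMAT_FIELDS = {
    "yolo": {"label", "bounding_box"},
    "coco": {"label", "bounding_box", "iscrowd"},
}


def round_trip(detection: dict, fmt: str) -> dict:
    """Export then re-import: only fields the format supports survive."""
    supported = FORMAT_FIELDS[fmt]
    return {key: value for key, value in detection.items() if key in supported}


def lost_fields(detection: dict, fmt: str) -> list[str]:
    """Names of the fields that do not survive a round trip through `fmt`."""
    return sorted(set(detection) - set(round_trip(detection, fmt)))


# Hypothetical detection carrying extra information added during review.
detection = {
    "label": "dog",
    "bounding_box": [0.25, 0.25, 0.5, 0.5],
    "confidence": 0.91,
    "iscrowd": 0,
    "tags": ["needs_review"],
}
lost_via_yolo = lost_fields(detection, "yolo")
lost_via_coco = lost_fields(detection, "coco")
```

A direct ingest would copy the whole record, so things like review tags and confidences would not need a side channel.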

@istranic
Contributor

Got it. Thanks @nmichlo!


4 participants