Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WebDataset Reader #230

Open
wants to merge 109 commits into
base: develop
Choose a base branch
from
Open

Conversation

swetha097
Copy link
Contributor

Contains support for

  • WebDataset Meta-Data reader
  • WebDataset source reader
  • Utility functions for parsing tar files

LakshmiKumar23
LakshmiKumar23 previously approved these changes Jan 6, 2025
Copy link
Contributor

@rrawther rrawther left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added few more comments. please update as soon as possible for us to merge

rocAL/include/pipeline/master_graph.h Outdated Show resolved Hide resolved
@kiritigowda
Copy link
Collaborator

@swetha097 - please resolve conflicts

Copy link
Contributor

@rrawther rrawther left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some minor comments

Copy link
Contributor

@LakshmiKumar23 LakshmiKumar23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The breaks the current flow of build if libtar is not present. We should not have it as a hard dependency. If it's missing libtar, we should disable web dataset reader. Azure CI is not failing due to this

Copy link
Contributor

@rrawther rrawther left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added few more comments
This PR is huge. We need to make it into smaller PRs in future.

AsciiValues samples_ascii;
AsciiComponent ascii_component;
for (auto &elem : _map_content) {
std::cerr << "Name :\t " << elem.first;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this function it is better to accumulate all prints in an std::string and issue one print statement at the end. Probably we can do it as another PR in all such instances

if (_meta_data_reader)
THROW("A metadata reader has already been created")
if (_augmented_meta_data)
THROW("Metadata can only have a single output")

bool generate_index = (index_path[0] == '\0') ? true : false;
if (generate_index)
std::cerr << "Index file is not provided, it may take some time to infer it from the tar file";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you want to continue the rest of the code after this error? why are we not using ERRLOG macro

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This statement is printed if the user doesn't pass the idx file for the webdataset tar file. We have added support to infer the file contents from the tar file if its not present.
We cannot use ERR macro here as this is not an error but a warning that it will take some time for inferring file contents. There are INFO and WRN macros which can be used here but we are not enabling them in CMakelists. Only ERR and THROW macros are available for use

int img_size = rocalGetImageNameLen(handle, image_name_length);
std::vector<char> img_name(img_size);
rocalGetImageName(handle, img_name.data());
std::cerr << "\n Image name: " << img_name.data() << "\n \n";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we using std::cerr here and cout in other places

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the classification and detection pipelines, the image names and labels are printed using cerr while other metadata were printed using cout so we used cerr for the image name here. We'll change cout for all the prints for the webdataset pipeline

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci:precheckin enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants