Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revamp the datasets/dataset class #267

Open
chrisiacovella opened this issue Oct 8, 2024 · 1 comment
Open

Revamp the datasets/dataset class #267

chrisiacovella opened this issue Oct 8, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@chrisiacovella
Copy link
Member

chrisiacovella commented Oct 8, 2024

During our revamping of the post-processing, we classified properties (and operations) as per_atom and per_molecule. At the time it was discussed that we should probably adopt the same terminology within the hdf5 datasets, rather than prior scheme we developed (series_mol, series_atom, single_mol, single_atom, single_rec).

The original scheme was meant to provide a lot of flexibility and reduce file size for properties which do not depend on the conformation (e.g., atomic number, total_charge, etc.). The labels we defined basically tell us what the dimensions in the various arrays mean and how to handle them. I think at this point we can decide on basically 3 categories "per_atom", "per_molecule" and then everything else (which doesn't really need a label). We can just enforce that for all data, dim=0 is length n_configs, thus only require to define if a per_molecule or per_atom property. This of course is slightly less data efficient for something like total_charge, but at this point, I can't think of too many other properties where there would not be a conformer dependence (and changing from a single value to an array of values in the spice datasets resulted in a negligible change in file size).

By defining per_atom and per_molecule, we could define our own types that would potentially make it clearer what dimensions a given tensor in the code should have (at least in the first 2 dimensions).

This should simplify the hdf5 loader and make it easier to validate the inputs as well.

@chrisiacovella chrisiacovella added the enhancement New feature or request label Oct 8, 2024
@wiederm
Copy link
Member

wiederm commented Oct 8, 2024

I agree with the suffix per_atom, but I think instead of per_molecule it should be per_system, at least for things like energy and total charge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants