Revamp the datasets/dataset class #267

chrisiacovella · 2024-10-08T06:49:17Z

During our revamping of the post-processing, we classified properties (and operations) as per_atom and per_molecule. At the time it was discussed that we should probably adopt the same terminology within the hdf5 datasets, rather than prior scheme we developed (series_mol, series_atom, single_mol, single_atom, single_rec).

The original scheme was meant to provide a lot of flexibility and reduce file size for properties which do not depend on the conformation (e.g., atomic number, total_charge, etc.). The labels we defined basically tell us what the dimensions in the various arrays mean and how to handle them. I think at this point we can decide on basically 3 categories "per_atom", "per_molecule" and then everything else (which doesn't really need a label). We can just enforce that for all data, dim=0 is length n_configs, thus only require to define if a per_molecule or per_atom property. This of course is slightly less data efficient for something like total_charge, but at this point, I can't think of too many other properties where there would not be a conformer dependence (and changing from a single value to an array of values in the spice datasets resulted in a negligible change in file size).

By defining per_atom and per_molecule, we could define our own types that would potentially make it clearer what dimensions a given tensor in the code should have (at least in the first 2 dimensions).

This should simplify the hdf5 loader and make it easier to validate the inputs as well.

wiederm · 2024-10-08T17:57:12Z

I agree with the suffix per_atom, but I think instead of per_molecule it should be per_system, at least for things like energy and total charge.

chrisiacovella added the enhancement New feature or request label Oct 8, 2024

chrisiacovella mentioned this issue Oct 25, 2024

Force unit in datasets #298

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revamp the datasets/dataset class #267

Revamp the datasets/dataset class #267

chrisiacovella commented Oct 8, 2024 •

edited

Loading

wiederm commented Oct 8, 2024

Revamp the datasets/dataset class #267

Revamp the datasets/dataset class #267

Comments

chrisiacovella commented Oct 8, 2024 • edited Loading

wiederm commented Oct 8, 2024

chrisiacovella commented Oct 8, 2024 •

edited

Loading