Deriving new particles #16

Closed
gordonwatts opened this issue Jan 28, 2021 · 17 comments

@gordonwatts
Member

from an email from @ingomueller-net:

I just realized that a potentially useful feature is not covered by the benchmark: deriving new particles (in order to store them/reuse them later). All of the queries currently produce a plot, so even if they derive some particles as part of the analysis, these particles are consumed by the histogramming and not materialized. As I understand, this is something that can be done with the ROOT framework and part of a typical workflow.

From the perspective of SQL, this is actually interesting because some systems are not able to create arrays. We can work around that limitation for the queries of the benchmark, but if the task were to materialize the derived particles, these systems would not be usable. Note that adding new columns to an existing data set or defining views with additional columns based on a given data set are typical capabilities of database systems, so they generally seem like a suitable technology for the task.

@gordonwatts
Member Author

If the new particle is a calculation made from the original input data - nothing extra is added - is there any difference between this and a "computed" column?

@ingomueller-net
Contributor

What exactly is a computed column?

I can see this use case making sense both for "writing the new data to storage" (such as in this use case described for ROOT, where a branch is added to an existing tree) and for "store code that can recreate the data on the fly as if it were a column" (which is probably your "computed column" and which is close to a database "view").

@gordonwatts
Member Author

I mean it in the most generic sense. If the input data is x, then a computed column is y=f(x). y doesn't need to be a single column, it could be a collection of columns/numbers, it could be calculated against a nested quantity, but not be nested itself - the most important thing is that it comes only from the source data. You don't need anything external.

From a database or system point-of-view, it looks like any other data. However, it doesn't actually exist written to disk (or it doesn't have to - the underlying system may decide it needs a cache).

The database "view" and also the creation of a new column are exactly what I'm talking about here.
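As a minimal sketch of the y = f(x) idea described above, the following Python derives an event-level quantity from a nested list of jets. The event layout (a `jet_pt` list per event) and the function name `scalar_sum_pt` are assumptions made purely for illustration; they are not part of the benchmark.

```python
from typing import Dict, List

# Hypothetical event records: each event holds a nested list of jet pT values (GeV).
events: List[Dict[str, List[float]]] = [
    {"jet_pt": [50.0, 30.0, 20.0]},
    {"jet_pt": []},
    {"jet_pt": [100.0]},
]


def scalar_sum_pt(event: Dict[str, List[float]]) -> float:
    """Computed column y = f(x): depends only on the source data of one event."""
    return sum(event["jet_pt"])


# The computed column is evaluated on the fly; nothing is written back to `events`.
computed = [scalar_sum_pt(e) for e in events]
```

Here the input is nested (a list of jets per event) while the output is a single number per event, which is one of the shapes a computed column can take.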

Most, but not all, computed columns are very cheap to calculate - much cheaper than the disk space they represent. The most important reasons that people write out new ROOT datasets are (in my experience, @masonproffitt may want to add to this):

  • The data they want to look at is only a small fraction of the original data - so the skimmed files can be read much faster than reading the original large dataset and applying the filter on the fly. This is especially true when the filter operation requires lots of data columns which the calculated columns do not.
  • A team is trying to standardize the data view they are looking at to make sure when someone says they are looking at "MHT", they are all looking at exactly the same column.
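The first point in the list above can be sketched roughly as follows: the filter reads several columns, while the skimmed output keeps only the columns needed later. All column names and thresholds here are hypothetical.

```python
# Hypothetical flat events; the filter needs three columns, the later analysis only one.
events = [
    {"n_muons": 2, "met": 35.0, "trigger": True, "mht": 120.0},
    {"n_muons": 0, "met": 80.0, "trigger": True, "mht": 40.0},
    {"n_muons": 2, "met": 10.0, "trigger": False, "mht": 95.0},
    {"n_muons": 3, "met": 55.0, "trigger": True, "mht": 210.0},
]


def passes_selection(event):
    """Filter that touches several input columns."""
    return event["trigger"] and event["n_muons"] >= 2 and event["met"] > 30.0


def skim(events, keep):
    """Keep only the selected events and only the columns listed in `keep`."""
    return [{name: e[name] for name in keep} for e in events if passes_selection(e)]


skimmed = skim(events, keep=["mht"])
```

Reading `skimmed` later requires neither the filter columns nor the rejected events, which is where the speed-up described above would come from.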

Another thing is that no one I know of adds branches to existing trees. The only supported mechanism to do this that might work at scale is adding a friend tree in separate files. Since how we store files isn't integrated into our system (as you note in your paper!), code to track two sets of files stored in multiple places was never developed. Thanks for that, btw, it is really nice to have a more "formal" way to point to a well-known failing in how we manage data in particle physics.

Having the ability to write out new data is interesting. I'm just not sure how it fits into an actual physics analysis workflow. If it is under the control of the user, where is the data written? How can the user decide what data to write without knowing the global situation (e.g. available disk space, priorities, etc.)? I'm a huge fan of caching - a cache can be deleted at any time without losing anything other than time - because it can be used to get around many of the user-level issues.

@gordonwatts
Member Author

@ingomueller-net - did this make sense?

@ingomueller-net
Contributor

Yes, it absolutely makes sense -- sorry for not replying earlier. Also, to answer your initial question: yes, I am thinking about "computed columns."

I do think that what you mention would be an interesting and potentially impactful direction to explore: a system that knows which columns are computed by which function and can then cache that data or recompute it. This can be completely transparent to the user, can depend on the (current) access pattern, and allows trading disk space for computation. I am not sure which production-ready data management systems do that, but this is something that DB people have done in the past. (Dan is working on something similar at the moment, actually.)
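A toy sketch of such a system, under the assumption that a derived column is just a registered function over the source columns, might look like this. The class `ColumnStore` and its methods are hypothetical names invented for this illustration and do not refer to any existing library.

```python
import math
from typing import Callable, Dict, List

Columns = Dict[str, List[float]]


class ColumnStore:
    """Toy store that knows which function computes each derived column."""

    def __init__(self, source: Columns) -> None:
        self._source = source
        self._functions: Dict[str, Callable[[Columns], List[float]]] = {}
        self._cache: Columns = {}
        self.compute_count = 0  # only here to make recomputation observable

    def define(self, name: str, func: Callable[[Columns], List[float]]) -> None:
        """Register the function that produces the derived column `name`."""
        self._functions[name] = func

    def column(self, name: str) -> List[float]:
        """Same interface for stored and computed columns."""
        if name in self._source:
            return self._source[name]
        if name not in self._cache:
            self.compute_count += 1
            self._cache[name] = self._functions[name](self._source)
        return self._cache[name]

    def drop_cache(self) -> None:
        """Free the space; derived columns can always be recomputed."""
        self._cache.clear()


store = ColumnStore({"px": [3.0, 6.0], "py": [4.0, 8.0]})
store.define("pt", lambda src: [math.hypot(x, y) for x, y in zip(src["px"], src["py"])])

first = store.column("pt")   # computed and cached
second = store.column("pt")  # served from the cache
store.drop_cache()
third = store.column("pt")   # recomputed after the cache was dropped
```

Dropping the cache costs only computation time, and the caller cannot tell from `column()` whether a value was stored, cached, or recomputed.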

I am also not sure what is the best way to fit this (or a more explicit way of writing out data) into current workflows, but if that is a question worth exploring, listing it as a use case may be helpful.

One question: You say "it could be calculated against a nested quantity, but not be nested itself." Why is it not nested? I thought I had understood that sometimes, you may want to derive the presence of particles (or jets) from the presence of other particles. Isn't that something you may want to share with other people, i.e., isn't that something that an f could do?

@ingomueller-net
Contributor

This is really out of my area of expertise but my understanding is that reco_higgs_to_4mu in the Higgs boson tutorial of ROOT computes (an array of) Higgs bosons from some existing particles in each event, no? Isn't that a potentially interesting "computed column"?

@masonproffitt
Member

> If the new particle is a calculation made from the original input data - nothing extra is added - is there any difference between this and a "computed" column?

It seems misleading to say nothing extra is added. There is new information encoded in the calculation itself that is not in the original dataset (for example, the weights of a neural network). I would say there is no extra data.

@masonproffitt
Member

> This is really out of my area of expertise but my understanding is that reco_higgs_to_4mu in the Higgs boson tutorial of ROOT computes (an array of) Higgs bosons from some existing particles in each event, no? Isn't that a potentially interesting "computed column"?

Yes, this is definitely a use case. I know of analyses that add branches to existing trees with information on candidate combinations (like top candidates from jets and leptons).
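To make the idea of a nested computed column concrete, here is a small sketch that builds dimuon candidates from opposite-charge muon pairs. It is not the implementation of `reco_higgs_to_4mu` from the ROOT tutorial; the field names are hypothetical, and the invariant mass neglects the muon rest mass.

```python
import math
from itertools import combinations

# Hypothetical event: muons as dicts with pt (GeV), eta, phi, and charge.
event = {
    "muons": [
        {"pt": 40.0, "eta": 0.0, "phi": 0.0, "charge": 1},
        {"pt": 40.0, "eta": 0.0, "phi": math.pi, "charge": -1},
        {"pt": 25.0, "eta": 1.0, "phi": 1.0, "charge": 1},
    ]
}


def invariant_mass(a, b):
    """Invariant mass of two particles, neglecting their rest masses."""
    m2 = 2.0 * a["pt"] * b["pt"] * (
        math.cosh(a["eta"] - b["eta"]) - math.cos(a["phi"] - b["phi"])
    )
    return math.sqrt(max(m2, 0.0))


def dimuon_candidates(event):
    """Nested computed column: one candidate per opposite-charge muon pair."""
    return [
        {"mass": invariant_mass(a, b), "charge": a["charge"] + b["charge"]}
        for a, b in combinations(event["muons"], 2)
        if a["charge"] * b["charge"] < 0
    ]


candidates = dimuon_candidates(event)
```

The result is itself a per-event list, so materializing it requires a system that can store arrays, which is the limitation raised at the top of this issue.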

@ingomueller-net
Contributor

@masonproffitt: I understand, the presence of a "computed column" does not need to imply that "data is added," but the computed column itself may be counted as "something added." Whether or not data is added seems indeed orthogonal: With DB views, no data is added; if we add branches to an existing tree, then it is (right?).
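The distinction between a view and a written column can be illustrated with SQLite through Python's standard `sqlite3` module. The table and column names are hypothetical, and this is only a sketch of the two options, not a statement about how any benchmark implementation handles them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, px REAL, py REAL)")
conn.executemany("INSERT INTO events (px, py) VALUES (?, ?)", [(3.0, 4.0), (6.0, 8.0)])

# View: only the defining query is stored; pt2 is recomputed on every read.
conn.execute(
    "CREATE VIEW events_with_pt2 AS "
    "SELECT id, px, py, px * px + py * py AS pt2 FROM events"
)

# Materialized alternative: a real column whose values are written to storage.
conn.execute("ALTER TABLE events ADD COLUMN pt2_stored REAL")
conn.execute("UPDATE events SET pt2_stored = px * px + py * py")

from_view = [row[0] for row in conn.execute("SELECT pt2 FROM events_with_pt2 ORDER BY id")]
from_table = [row[0] for row in conn.execute("SELECT pt2_stored FROM events ORDER BY id")]
```

Both queries return the same values; the difference is whether the values occupy storage or are recomputed from the function on each read.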

@masonproffitt
Member

I guess I'm not sure exactly how a DB view is defined. What I'm saying is that a computed column (regardless of whether it's only transiently in memory or actually written to the DB) does add new information (in the form of a function) not present in the DB itself, but it does not add new data (in the experiment sense), since the computation does not add events or extra detector information.

@gordonwatts
Member Author

> One question: You say "it could be calculated against a nested quantity, but not be nested itself." Why is it not nested? I thought I had understood that sometimes, you may want to derive the presence of particles (or jets) from the presence of other particles. Isn't that something you may want to share with other people, i.e., isn't that something that an f could do?

I meant that it might be the case, not that it has to be: for example, you could aggregate data from your list of jets to derive an event-level quantity. I was attempting to make it clear that the function didn't have to operate on single jets only; it could operate on multiple jets, or on jets and an event-level quantity... But my prose was not clear, sorry!

And you can certainly build new particles as you describe!

@ingomueller-net
Contributor

@gordonwatts, @masonproffitt: Thanks for the clarifications, I think I understand now.

@masonproffitt
Member

Just to point out: this is a special case of #6.

@gordonwatts
Member Author

@masonproffitt - is this a special case of #6? Or something like #6 plus something - if you added a new particle, you might want to add several branches and have them somehow associated together? Or, perhaps that isn't part of this repo, actually.

Ok - should we close this, or turn it into something that needs to be done?

@masonproffitt
Member

I think the part of this that is missing from the current benchmarks is just #6. The "plus something" that you're referring to seems to me like just doing operations that are already covered by the other benchmarks, but with the new columns. So it does require that the underlying system can handle added columns using exactly the same interface as the existing ones. I guess I would normally take that for granted, but maybe I shouldn't?

@gordonwatts
Member Author

Yeah - I would normally take that for granted as well. Perhaps we should just mention that somewhere, and that will be good enough. Then if we discover that it isn't working, we can point to that note or flesh out a new example. At least this way we'll have documented it for ourselves. Go ahead and add a note, and then I propose you close this issue.

@masonproffitt
Member

Closing as a duplicate of #6.
