Deriving new particles #16

gordonwatts · 2021-01-28T19:52:37Z

I just realized that a potentially useful feature is not covered by the benchmark: deriving new particles (in order to store them/reuse them later). All of the queries currently produce a plot, so even if they derive some particles as part of the analysis, these particles are consumed by the histogramming and not materialized. As I understand, this is something that can be done with the ROOT framework and part of a typical workflow.

From the perspective of SQL, this is actually interesting because some systems are not able to create arrays. We can work around that limitation for the queries of the benchmark, but if the task were to materialize the derived particles, these systems would not be usable. Note that adding new columns to an existing data set or defining views with additional columns based on a given data set are typical capabilities of database systems, so they generally seem like a suitable technology for the task.

gordonwatts · 2021-01-28T19:53:17Z

If the new particle is a calculation made from the original input data - nothing extra is added - is there any difference between this and a "computed" column?

ingomueller-net · 2021-01-28T20:39:45Z

What exactly is a computed column?

I can see this use case making sense both for "writing the new data to storage" (such as in this use case described for ROOT, where a branch is added to an existing tree) and for "store code that can recreate the data on the fly as if it were a column" (which is probably your "computed column" and which is close to a database "view").
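The two options above can be contrasted in a small sketch. It uses Python's built-in `sqlite3` module purely for illustration; the table `muons`, the view `muons_with_pt2`, the table `muons_pt2_stored`, and the derived quantity `pt2` are all made-up names, not part of the benchmark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE muons (event INTEGER, px REAL, py REAL)")
conn.executemany(
    "INSERT INTO muons VALUES (?, ?, ?)",
    [(1, 3.0, 4.0), (1, 6.0, 8.0), (2, 5.0, 12.0)],
)

# Option A ("store code"): a view keeps only the defining query,
# so pt2 is recomputed from the base table on every read.
conn.execute(
    "CREATE VIEW muons_with_pt2 AS "
    "SELECT event, px, py, px * px + py * py AS pt2 FROM muons"
)

# Option B ("write the new data to storage"): the derived values are
# materialized once into a separate table.
conn.execute("CREATE TABLE muons_pt2_stored AS SELECT * FROM muons_with_pt2")

query = "SELECT event, pt2 FROM {} ORDER BY event, pt2"
view_rows = conn.execute(query.format("muons_with_pt2")).fetchall()
stored_rows = conn.execute(query.format("muons_pt2_stored")).fetchall()

# After the base data changes, the view follows it; the stored copy does not.
conn.execute("UPDATE muons SET px = 0.0 WHERE event = 2")
view_after = conn.execute(query.format("muons_with_pt2")).fetchall()
stored_after = conn.execute(query.format("muons_pt2_stored")).fetchall()
```

Both options expose `pt2` through the same query interface; they differ only in whether the values occupy storage and whether they track later changes to the source data.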

gordonwatts · 2021-01-29T15:13:59Z

I mean it in the most generic sense. If the input data is x, then a computed column is y=f(x). y doesn't need to be a single column, it could be a collection of columns/numbers, it could be calculated against a nested quantity, but not be nested itself - the most important thing is that it comes only from the source data. You don't need anything external.

From a database or system point-of-view, it looks like any other data. However, it doesn't actually exist written to disk (or it doesn't have to - the underlying system may decide it needs a cache).

The database "view" and also the creation of a new column are exactly what I'm talking about here.
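As a minimal sketch of y=f(x) in this sense, the following plain-Python example derives an event-level quantity from a nested list of jets. The event layout, the field names, and the helper `computed_column` are assumptions made for illustration only.

```python
from typing import Callable, Dict, List

# Hypothetical source data x: one record per event, with a nested jet list.
events: List[Dict] = [
    {"met": 12.0, "jets": [{"pt": 40.0}, {"pt": 25.0}]},
    {"met": 30.0, "jets": []},
    {"met": 5.0, "jets": [{"pt": 100.0}]},
]


def ht(event: Dict) -> float:
    """Scalar sum of jet pt: computed against a nested quantity, but flat itself."""
    return sum((jet["pt"] for jet in event["jets"]), 0.0)


def computed_column(f: Callable[[Dict], float], data: List[Dict]) -> List[float]:
    """Evaluate y = f(x) per event; nothing outside the source data is needed."""
    return [f(x) for x in data]


ht_column = computed_column(ht, events)
```

The resulting `ht_column` has one value per event and never has to be written to disk, since it can be recreated from `events` and `ht` at any time.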

Most, but not all, computed columns are very cheap to calculate - much cheaper than the disk space they represent. The most important reasons that people write out new ROOT datasets are (in my experience, @masonproffitt may want to add to this):

The data they want to look at is only a small fraction of the original data - so the skimmed files can be read much faster than reading the original large dataset and applying the filter on the fly. This is especially true when the filter operation requires lots of data columns which the calculated columns do not.
A team is trying to standardize the data view they are looking at to make sure when someone says they are looking at "MHT", they are all looking at exactly the same column.

Another thing is that no one I know of adds branches to existing trees. The only supported mechanism to do this that might work at scale is by adding a friend tree in separate files. Since how we store files isn't integrated into our system (as you note in your paper!), code to track two sets of files stored in multiple places was never developed. Thanks for that, btw, it is really nice to have a more "formal" way to point to a well-known failing in how we managed data in particle physics.

Having the ability to write out new data is interesting. I'm just not sure how it fits into an actual physics analysis workflow. If it is under the control of the user, where is the data written? How can the user make a decision on what data to write without knowing the global situation (e.g. available disk space, priorities, etc.)? I'm a huge fan of caching - that can be deleted whenever without losing anything other than time - because it can be used to get around many of the user-level issues.

gordonwatts · 2021-02-10T15:27:32Z

@ingomueller-net - did this make sense?

ingomueller-net · 2021-02-11T12:04:33Z

Yes, it absolutely makes sense -- sorry for not replying earlier. Also, to answer your initial question: yes, I am thinking about "computed columns."

I do think that what you mention would be an interesting and potentially impactful direction to explore: a system that knows which columns are computed by which function and can then cache that data or recompute it. This can be completely transparent to the user, can depend on the (current) access pattern, and allows trading disk space for computation. I am not sure which production-ready data management systems do that, but this is something that DB people have done in the past. (Dan is working on something similar at the moment, actually.)
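A rough sketch of such a system, under heavy simplifying assumptions: the class `ColumnStore` and its methods are hypothetical, columns are plain Python lists held in memory, and the "cache" is a dictionary that can be dropped at any time without losing anything other than time.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


class ColumnStore:
    """Toy store that knows which column is computed by which function."""

    def __init__(self, base: Dict[str, List[float]]) -> None:
        self._base = dict(base)  # columns that really exist
        self._functions: Dict[str, Tuple[List[str], Callable[..., float]]] = {}
        self._cache: Dict[str, List[float]] = {}  # safe to delete at any time
        self.evaluations = 0  # counts how often a column was (re)computed

    def define(self, name: str, inputs: Sequence[str], f: Callable[..., float]) -> None:
        """Register a computed column as a function of other columns."""
        self._functions[name] = (list(inputs), f)

    def get(self, name: str) -> List[float]:
        """Same interface for stored and computed columns."""
        if name in self._base:
            return self._base[name]
        if name not in self._cache:
            inputs, f = self._functions[name]
            args = [self.get(i) for i in inputs]
            self._cache[name] = [f(*row) for row in zip(*args)]
            self.evaluations += 1
        return self._cache[name]

    def evict(self) -> None:
        """Trade computation for space: drop all cached values."""
        self._cache.clear()


store = ColumnStore({"px": [3.0, 5.0], "py": [4.0, 12.0]})
store.define("pt", ["px", "py"], lambda px, py: math.hypot(px, py))

first = store.get("pt")    # computed on first access
second = store.get("pt")   # served from the cache
store.evict()
third = store.get("pt")    # recomputed after eviction
```

A real system would decide on its own when to cache or evict, based on access patterns and available space; here eviction is triggered by hand only to show that the caller cannot tell the difference apart from the time spent.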

I am also not sure what is the best way to fit this (or a more explicit way of writing out data) into current workflows, but if that is a question worth exploring, listing it as a use case may be helpful.

One question: You say "it could be calculated against a nested quantity, but not be nested itself." Why is it not nested? I thought I had understood that sometimes, you may want to derive the presence of particles (or jets) from the presence of other particles. Isn't that something you may want to share with other people, i.e., isn't that something that an f could do?

ingomueller-net · 2021-02-11T12:10:04Z

This is really out of my area of expertise but my understanding is that reco_higgs_to_4mu in the Higgs boson tutorial of ROOT computes (an array of) Higgs bosons from some existing particles in each event, no? Isn't that a potentially interesting "computed column"?
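To illustrate what a nested "computed column" of new particles could look like, here is a small plain-Python sketch that builds opposite-charge dimuon candidates per event. It is not the code of the ROOT tutorial; the helpers `mu` and `dimuon_candidates`, the dictionary layout, and the massless-muon approximation are assumptions chosen for brevity.

```python
import math
from itertools import combinations
from typing import Dict, List

Particle = Dict[str, float]


def mu(px: float, py: float, pz: float, charge: int) -> Particle:
    """Hypothetical muon record; energy uses a massless approximation."""
    e = math.sqrt(px * px + py * py + pz * pz)
    return {"px": px, "py": py, "pz": pz, "e": e, "charge": charge}


def dimuon_candidates(muons: List[Particle]) -> List[Particle]:
    """Derive new particles from all opposite-charge muon pairs in one event."""
    candidates = []
    for a, b in combinations(muons, 2):
        if a["charge"] + b["charge"] != 0:
            continue
        # Add the four-vectors component by component.
        e = a["e"] + b["e"]
        px = a["px"] + b["px"]
        py = a["py"] + b["py"]
        pz = a["pz"] + b["pz"]
        mass2 = e * e - (px * px + py * py + pz * pz)
        candidates.append(
            {"px": px, "py": py, "pz": pz, "e": e, "mass": math.sqrt(max(mass2, 0.0))}
        )
    return candidates


events = [
    [mu(0, 0, 45, +1), mu(0, 0, -45, -1)],
    [mu(10, 0, 0, +1)],
    [mu(0, 0, 45, +1), mu(0, 0, -45, -1), mu(0, 30, 0, -1)],
]

# The derived column is itself nested: a variable-length list per event.
candidates_per_event = [dimuon_candidates(muons) for muons in events]
```

The output has a different multiplicity per event than the input, which is exactly the case that requires a system to be able to create arrays rather than only scalar columns.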

masonproffitt · 2021-02-11T12:15:57Z

> If the new particle is a calculation made from the original input data - nothing extra is added - is there any difference between this and a "computed" column?

It seems misleading to say nothing extra is added. There is new information encoded in the calculation itself that is not in the original dataset (for example, the weights of a neural network). I would say there is no extra data.

masonproffitt · 2021-02-11T12:19:40Z

> This is really out of my area of expertise but my understanding is that reco_higgs_to_4mu in the Higgs boson tutorial of ROOT computes (an array of) Higgs bosons from some existing particles in each event, no? Isn't that a potentially interesting "computed column"?

Yes, this is definitely a use case. I know of analyses that add branches to existing trees with information on candidate combinations (like top candidates from jets and leptons).

ingomueller-net · 2021-02-11T12:29:08Z

@masonproffitt: I understand, the presence of a "computed column" does not need to imply that "data is added," but the computed column itself may be counted as "something added." Whether or not data is added seems indeed orthogonal: With DB views, no data is added; if we add branches to an existing tree, then it is (right?).

masonproffitt · 2021-02-11T12:59:33Z

I guess I'm not sure exactly how a DB view is defined. What I'm saying is that a computed column (regardless of whether it's only transiently in memory or actually written to the DB) does add new information (in the form of a function) not present in the DB itself, but it does not add new data (in the experiment sense), since the computation does not add events or extra detector information.

gordonwatts · 2021-02-12T06:57:55Z

> One question: You say "it could be calculated against a nested quantity, but not be nested itself." Why is it not nested? I thought I had understood that sometimes, you may want to derive the presence of particles (or jets) from the presence of other particles. Isn't that something you may want to share with other people, i.e., isn't that something that an f could do?

I meant it might be the case - so it could be the case that you aggregate data from your list of jets to derive an event-level quantity. I was attempting to make it clear that the function didn't have to just operate on single jets, it could operate on multiple jets, or jets and an event level quantity... But my prose was not clear, sorry!

And you can certainly build new particles as you describe!

ingomueller-net · 2021-02-13T14:45:37Z

@gordonwatts, @masonproffitt: Thanks for the clarifications, I think I understand now.

masonproffitt · 2021-03-17T17:31:11Z

Just to point out: this is a special case of #6.

gordonwatts · 2021-03-26T00:10:27Z

@masonproffitt - is this a special case of #6? Or something like #6 plus something - if you added a new particle, you might want to add several branches and have them somehow associated together? Or, perhaps that isn't part of this repo, actually.

Ok - should we close this, or turn it into something that needs to be done?

masonproffitt · 2021-03-26T21:15:49Z

I think the part of this that is missing from the current benchmarks is just #6. The "plus something" that you're referring to seems to me like just doing operations that are already covered by the other benchmarks, but with the new columns. So it does require that the underlying system can handle added columns using exactly the same interface as the existing ones. I guess I would normally take that for granted, but maybe I shouldn't?

gordonwatts · 2021-03-26T23:19:03Z

Yeah - I would normally take that for granted as well. Perhaps we should just mention that somewhere, and that will be good enough. Then if we discover that isn't working, we can point to that or flesh out a new example. But at least this way we'll have documented it for ourselves. Go ahead and add a note, and then I propose you close this issue.

masonproffitt · 2021-03-29T11:05:27Z

Closing as a duplicate of #6.

masonproffitt mentioned this issue Mar 29, 2021

Example of adding additional branches #6

Open

masonproffitt closed this as completed Mar 29, 2021
