Deriving new particles #16
If the new particle is a calculation made from the original input data - nothing extra is added - is there any difference between this and a "computed" column?
What exactly is a computed column? I can see this use case making sense both for "writing the new data to storage" (such as in this use case described for ROOT, where a branch is added to an existing tree) and for "store code that can recreate the data on the fly as if it were a column" (which is probably your "computed column" and which is close to a database "view").
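To make the notion concrete, here is a minimal sketch of a "computed column" in plain Python (all names here are hypothetical, not any real database or ROOT API): a stored function evaluated on first access and then cached, much like a materialized view.

```python
# Sketch of a "computed column": a stored function evaluated on demand.
# Dataset and add_computed are illustrative names, not a real library API.

class Dataset:
    def __init__(self, columns):
        self._stored = dict(columns)   # materialized data on "disk"
        self._computed = {}            # name -> function of the dataset
        self._cache = {}               # optional materialization

    def add_computed(self, name, func):
        self._computed[name] = func

    def __getitem__(self, name):
        if name in self._stored:
            return self._stored[name]
        if name in self._cache:        # reuse if already materialized
            return self._cache[name]
        values = self._computed[name](self)
        self._cache[name] = values     # trade disk/memory for CPU
        return values

ds = Dataset({"px": [1.0, 3.0], "py": [2.0, 4.0]})
ds.add_computed("pt", lambda d: [(x**2 + y**2) ** 0.5
                                 for x, y in zip(d["px"], d["py"])])
print(ds["pt"])  # computed on first access, cached afterwards
```

The cache here is exactly the kind that can be dropped at any time: rerunning the stored function recreates the column, so only time is lost.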
I mean it in the most generic sense: from a database or system point of view, the computed column looks like any other data. However, it doesn't actually exist written to disk (or it doesn't have to - the underlying system may decide it needs a cache). The database "view" and also the creation of a new column are exactly what I'm talking about here. Most, but not all, computed columns are very cheap to calculate - much cheaper than the disk space they represent. The most important reasons that people write out new ROOT datasets are (in my experience; @masonproffitt may want to add to this):
Another thing is that no one I know of adds branches to existing trees. The only supported mechanism that might work at scale is adding a friend tree in separate files. Since how we store files isn't integrated into our system (as you note in your paper!), code to track two sets of files stored in multiple places was never developed. Thanks for that, btw - it is really nice to have a more "formal" way to point to a well-known failing in how we managed data in particle physics.

Having the ability to write out new data is interesting. I'm just not sure how it fits into an actual physics analysis workflow. If it is under the control of the user, where is the data written, and how can the user decide what data to write without knowing the global situation (e.g. available disk space, priorities, etc.)? I'm a huge fan of caching - which can be deleted at any time without losing anything other than time - because it can be used to get around many of these user-level issues.
@ingomueller-net - did this make sense?
Yes, it absolutely makes sense -- sorry for not replying earlier. Also, to answer your initial question: yes, I am thinking about "computed columns." I do think that what you mention would be an interesting and potentially impactful direction to explore: a system that knows which columns are computed by which function and can then cache that data or recompute it. This can be completely transparent to the user, can depend on the (current) access pattern, and allows trading disk space for computation. I am not sure which production-ready data management systems do that, but this is something that DB people have done in the past. (Dan is working on something similar at the moment, actually.) I am also not sure what the best way is to fit this (or a more explicit way of writing out data) into current workflows, but if that is a question worth exploring, listing it as a use case may be helpful. One question: You say "it could be calculated against a nested quantity, but not be nested itself." Why is it not nested? I thought I had understood that sometimes, you may want to derive the presence of particles (or jets) from the presence of other particles. Isn't that something you may want to share with other people, i.e., isn't that something that an
This is really out of my area of expertise but my understanding is that |
It seems misleading to say nothing extra is added. There is new information encoded in the calculation itself that is not in the original dataset (for example, the weights of a neural network). I would say there is no extra data. |
Yes, this is definitely a use case. I know of analyses that add branches to existing trees with information on candidate combinations (like top candidates from jets and leptons). |
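As an illustration of such candidate-combination branches, here is a hedged sketch in pure Python (four-vectors as plain `(E, px, py, pz)` tuples; all names are illustrative, not any analysis framework's API):

```python
# Sketch: deriving "candidate" quantities from existing jet columns,
# of the kind one might write out as a new branch. Illustrative only.
from itertools import combinations

def invariant_mass(vectors):
    # invariant mass of the summed four-vector, m^2 = E^2 - |p|^2
    E  = sum(v[0] for v in vectors)
    px = sum(v[1] for v in vectors)
    py = sum(v[2] for v in vectors)
    pz = sum(v[3] for v in vectors)
    m2 = E**2 - px**2 - py**2 - pz**2
    return max(m2, 0.0) ** 0.5  # clamp small negative m^2 from rounding

def top_candidate_masses(jets):
    # treat every 3-jet combination as one hadronic top candidate
    return [invariant_mass(trio) for trio in combinations(jets, 3)]

event_jets = [(50.0, 10.0, 20.0, 30.0),
              (80.0, -15.0, 25.0, 40.0),
              (60.0, 5.0, -30.0, 20.0),
              (40.0, 0.0, 10.0, -25.0)]
masses = top_candidate_masses(event_jets)  # 4 choose 3 = 4 candidates
```

Note that the derived branch here is itself nested per event (a list of candidates), which is the point raised above about derived quantities that are lists of new "particles."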
@masonproffitt: I understand, the presence of a "computed column" does not need to imply that "data is added," but the computed column itself may be counted as "something added." Whether or not data is added seems indeed orthogonal: With DB views, no data is added; if we add branches to an existing tree, then it is (right?). |
I guess I'm not sure exactly how a DB view is defined. What I'm saying is that a computed column (regardless of whether it's only transiently in memory or actually written to the DB) does add new information (in the form of a function) not present in the DB itself, but it does not add new data (in the experiment sense), since the computation does not add events or extra detector information. |
I meant it might be the case - so it could be the case that you aggregate data from your list of jets to derive an event-level quantity. I was attempting to make it clear that the function didn't have to operate just on single jets; it could operate on multiple jets, or on jets and an event-level quantity... But my prose was not clear, sorry! And you can certainly build new particles as you describe!
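A small sketch of that point (plain Python, hypothetical names): a derived quantity can consume the whole nested jet list of an event and produce a flat, event-level value, which is why the result need not be nested even though its inputs are.

```python
# Sketch: aggregating a nested per-jet column into a flat event-level
# quantity (here a stand-in for HT, the scalar sum of jet pT).

def jet_pt(px, py):
    return (px**2 + py**2) ** 0.5

def event_ht(events):
    # events: list of events, each a list of (px, py) jet pairs;
    # the output has one scalar per event, so it is flat, not nested
    return [sum(jet_pt(px, py) for px, py in jets) for jets in events]

events = [[(3.0, 4.0), (6.0, 8.0)],   # HT = 5 + 10 = 15
          [(5.0, 12.0)]]              # HT = 13
print(event_ht(events))  # -> [15.0, 13.0]
```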
@gordonwatts, @masonproffitt: Thanks for the clarifications, I think I understand now. |
Just to point out: this is a special case of #6. |
@masonproffitt - is this a special case of #6? Or something like #6 plus something - if you added a new particle, you might want to add several branches and have them somehow associated together? Or, perhaps that isn't part of this repo, actually. Ok - should we close this, or turn it into something that needs to be done? |
I think the part of this that is missing from the current benchmarks is just #6. The "plus something" that you're referring to seems to me like just doing operations that are already covered by the other benchmarks, but with the new columns. So it does require that the underlying system can handle added columns using exactly the same interface as the existing ones. I guess I would normally take that for granted, but maybe I shouldn't? |
Yeah - I would normally take that for granted as well. Perhaps we should just mention that somewhere, and that will be good enough. Then if we discover that isn't working, we can point to that or flesh out a new example. But at least this way we'll have documented for ourselves. Go ahead and add a note, and then I propose you close this issue. |
Closing as a duplicate of #6. |
from an email from @ingomueller-net: