remove coupling between fields with the same cardinality #128
Comments
First of all, I'd like to be sure we share the same vocabulary :) The cardinality of a field refers to the number of distinct values that it can have. I guess that having "cardinality" defining two different things in the context of the corpus generator is what led to your confusion when making the example of two Is it clearer what to expect given the clarification above? The implicit coupling of fields with the same value for their Another step is to identify the specific behaviour of "coupling" between fields. In more detail, "coupling" fields through Is the "coupling" behaviour you have in mind any different from the above? If the behaviour is the same, is it correct that the issue is more about expressing this behaviour in a different, more explicit way? Thanks :) |
let me add some more context :)
we are asking for multiple things at once:
From my point of view only 2. is "not negotiable": meaning that no matter how much we have to stretch the cover, in case all three cannot be achieved at the same time, 2. must remain. I'm quite confident we could totally give up on 3.: meaning that it's something that we could just not support. I'm not quite sure on 1.: at the beginning the question was slightly different and more dynamic, something like "the percentage of events with a single different value for namespace and service". I have the impression that this initial solution suits better what we want to achieve, especially in terms of how to express it, simply because we reduce 1. and 2. to the same concept: a ratio. 1. being a ratio between the total number of events and the distinct values of a field, and 2. a ratio between the distinct values of multiple fields. what are your thoughts on that? cc @ruflin , @gizas : I'd like to have your opinion as well, especially @gizas's, since you have a very concrete case with the k8s container and pod dataset. if we all agree with the above, I'd say the best way is to remove the |
last note on the above: this implies that we have a known total number of events. that's not always the case, see for example the |
Thanks @aspacca for the details. It explains why the current behaviour exists / works the way it does.
That was the surprising part for me, as I had (falsely) assumed that with a cardinality of 5, a value is picked randomly each time. For 1. / 2. / 3. above, there is always the option to add more config options to define some coupling and ordering of fields. |
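For reference, the "picked randomly each time" expectation can be sketched as below. This is only an illustration of the idea, not the corpus generator's code: the function name `generate_independent` and the value pools are hypothetical, and the pools are simply assumed to hold 5 distinct values per field. Note that pure random selection caps the number of distinct values at the pool size but does not guarantee that every value shows up in a short run.

```python
import random


def generate_independent(total_events, pools, seed=None):
    """Draw every field's value independently and uniformly from its own pool."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    return [
        {name: rng.choice(pool) for name, pool in pools.items()}
        for _ in range(total_events)
    ]


# Hypothetical pools: 5 distinct values per field (assumption, not tool output).
pools = {"a": [3, 17, 24, 38, 45], "b": [8, 29, 51, 76, 93]}

events = generate_independent(1000, pools, seed=0)
distinct_pairs = {(e["a"], e["b"]) for e in events}
print(len(distinct_pairs))  # number of distinct (a, b) combinations observed
```

Because the two draws are independent, the combinations of a and b are no longer limited to 5 fixed pairs, which is the decoupling discussed in this thread.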
Until now the only need we had was to generate a fixed number of resources, like 10 namespaces and 50 services. But I see your point for the other examples. I guess that what @aspacca says : Some ideas when reading the above: Additionally the object can be a solution to group fields with a specific ratio, like in the example below. What do you think?:
- name: aws.dimensions.*
  object_keys:
    - TableName
    - Operation
- name: aws.dimensions.TableName
  enum: ["table1", "table2"]
- name: aws.dimensions.Operation
  cardinality: 2
you can use again, we are asking for multiple things at once:
both for this scenario and for the one mentioned before, each 3. are basically just a consequence of the current implementation. for both scenarios, the number of different values actually generated (1.) is what "coupling" must be expressed in a different, independent way, without need for still, @tommyers-elastic , @gizas , please give me your definition of "coupling", or better: what do you need exactly? :) @gizas from what I know, in your case it is being able to define how many different namespaces must be generated, and for each namespace how many different services and/or pods must be generated, and that the same service/pod must not be generated in different namespaces. correct? does it boil down to the same for you, @tommyers-elastic ? |
not sure I get what you mean :) |
i think we should not overload the meaning of 'cardinality' from what an elastic developer would assume it means, i.e. the number of unique elements in a set. if we are having to make distinctions about 'cardinality of a field' vs 'cardinality configuration' to the people in this thread, let alone the rest of the company, then we need to simplify. i think implementing random selection to remove repetition from fields with the same cardinality is a good idea. as far as I can tell, the only other concrete use-case mentioned here relating to 'coupling' is: "generate events with a total of 10 different k8s namespaces and 50 different services, so that every single different namespace has always the same 5 different services, keeping an even distribution" in order to solve this, how about an additional configuration option, which specifies a 'parent' field. in this case, the cardinality config would apply to each 'instance' (i.e. value) of the parent field. so here, each time a
WDYT? |
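To make the 'parent' idea concrete, here is a rough sketch for the quoted use-case (10 namespaces, 5 services per namespace, 50 services in total, no service shared between namespaces). Everything in it is an assumption for illustration: the helper names, the `namespace` / `service` field names and the value naming scheme are hypothetical, and no such option exists in the tool today as far as this thread states.

```python
def build_child_pools(parents, per_parent, prefix="service"):
    """Give every parent value its own, non-overlapping pool of child values."""
    pools = {}
    counter = 0
    for parent in parents:
        pools[parent] = [f"{prefix}-{counter + k}" for k in range(per_parent)]
        counter += per_parent
    return pools


def generate_with_parent(total_events, parents, child_pools):
    """Cycle over parents, and over each parent's children, for an even spread."""
    events = []
    for i in range(total_events):
        parent = parents[i % len(parents)]
        children = child_pools[parent]
        # Advance the child index once per full pass over the parents.
        child = children[(i // len(parents)) % len(children)]
        events.append({"namespace": parent, "service": child})
    return events


namespaces = [f"namespace-{n}" for n in range(10)]
child_pools = build_child_pools(namespaces, per_parent=5)
events = generate_with_parent(50, namespaces, child_pools)

services = {e["service"] for e in events}
print(len(services))  # 50
```

Here the child cardinality (5) applies per parent value, so the total number of distinct services is the product 10 x 5, and the coupling is stated explicitly rather than implied by two fields sharing the same cardinality.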
the distinction is required not because the meaning of cardinality is overloaded, but because the sets we refer to with "the number of unique elements in a set" are different :)
oki on that
please, see https://github.com/elastic/integrations/blob/main/packages/aws/_dev/benchmark/rally/ec2metrics-benchmark/config.yml#L4-L12 for a different concrete use-case:
do you think that a different use-case should have a different configuration? so for the use-case you mentioned, we go for
I would say the meaning of "cardinality" stays the same, as well as the sets it refers to. it's just that assuring that there's no overlap across all sets is a hidden property that's not evident from the configuration itself. still, it is often required to have this property; how do you suggest expressing it so that it does not cause confusion? |
in this second use case, do the two fields always require the same cardinality? |
To illustrate this problem, consider generating 15 events with the following configuration:
a is a number between 0-50 and in the generated events there are 5 unique values of a.
b is a number between 0-100 and in the generated events there are 5 unique values of b.
In this configuration there is no explicit coupling between fields a and b. However when this is run, the output is as follows:
Notice how there are 5 unique documents here, repeated 3 times. The fact that the fields have the same cardinality causes them to be coupled.
This behaviour is confusing, and can cause unwanted repetition in the generated data.
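One way such a pattern can arise is sketched below, purely as an assumption and not as a description of the tool's actual implementation: if each field cycles through its own pool of values using the event index modulo the pool size, two fields with the same pool size always advance in lockstep. The function name and the pool values are hypothetical.

```python
from collections import Counter


def generate_cyclic(total_events, pools):
    """Pick each field's value by event index modulo its pool size (assumed mechanism)."""
    return [
        {name: pool[i % len(pool)] for name, pool in pools.items()}
        for i in range(total_events)
    ]


# Hypothetical pools of 5 distinct values each, within the stated ranges.
pools = {
    "a": [3, 17, 24, 38, 45],   # 5 distinct values in 0-50
    "b": [8, 29, 51, 76, 93],   # 5 distinct values in 0-100
}

events = generate_cyclic(15, pools)
doc_counts = Counter((e["a"], e["b"]) for e in events)
print(len(doc_counts))  # 5 unique documents
print(set(doc_counts.values()))  # {3}: each document appears 3 times
```

With pools of different sizes (say 5 and 3) the same loop would produce 15 distinct combinations over 15 events, which is why equal cardinality is what triggers the repetition in this sketch.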
The correct behaviour can be observed with enum types, which also have well-defined cardinality (the number of enum values).
Note that in this configuration both fields have a cardinality of 3. In the generated data there is no coupling. Here are 9 generated data points:
Another strange behaviour is that if I explicitly write the cardinality values in for these fields (3 and 3 respectively), one would expect it to have no effect, since the cardinality is already 3, but doing this causes only 3 unique values of sales-team in the output, repeated over and over.
This implicit coupling of values with the same cardinality should be removed, and replaced with a more explicit way to enable coupling between values (which is often required).