Fix filtering and uneven parameter distribution #10
base: master
Conversation
20: ['Brand X', '2000', 'Internal', 'Salaried', 6]
21: ['Brand Y', 'NT', 'Modem', 'Part-Time', 60]
22: ['Brand X', '98', 'Internal', 'Hourly', 60]
23: ['Brand Y', 'XP', 'Modem', 'Salaried', 30]
Taking the filtering function into account, this output was lacking the following pairs:
98 - Hourly
98 - Contr.
98 - 10
98 - 30
98 - 60
NT - 15
2000 - 6
XP - 10
Salaried - 30
Hourly - 30
Hourly - 60
Part-Time - 60
Now they are all present in the resulting set.
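Coverage like this can be checked mechanically by collecting every cross-column value pair that occurs in the generated rows. A minimal sketch, with illustrative rows rather than the PR's actual output:

```python
from itertools import combinations

# Illustrative generated rows (in the style of the example above,
# not the PR's actual output)
rows = [
    ["Brand X", "98", "Internal", "Hourly", 60],
    ["Brand Y", "XP", "Modem", "Salaried", 30],
]

# Collect every value pair that occurs across any two columns of any row
covered = set()
for row in rows:
    covered.update(combinations(row, 2))

# Pairs like these were missing before the fix; a correct result set
# must contain each of them in some row
assert ("98", "Hourly") in covered
assert ("XP", 30) in covered
```

Running the same check over the full generated set, for every pair allowed by the filter, is how one can confirm the claim that all pairs are now present.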
20: ['Brand X', '98', 'Internal', 'Hourly', 10]
21: ['Brand Y', 'XP', 'Modem', 'Contr.', 6]
22: ['Brand Y', 'XP', 'Modem', 'Salaried', 60]
23: ['Brand X', 'NT', 'Internal', 'Salaried', 15]
Non-filtered generations now also have different outputs, due to the fix for the uneven distribution of parameters in the resulting set.
In this example, 'Modem' was used more often than 'Internal'. Now they are distributed evenly.
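The imbalance is easy to observe by counting how often each value appears in a column. A minimal sketch with hypothetical column data (not the PR's actual output):

```python
from collections import Counter

# Hypothetical connection-type column from a generated set of 8 rows
before_fix = ["Modem", "Modem", "Modem", "Modem", "Modem", "Internal", "Modem", "Internal"]
after_fix = ["Modem", "Internal", "Modem", "Internal", "Internal", "Modem", "Internal", "Modem"]

def spread(column):
    """Difference between the most- and least-used value in a column."""
    counts = Counter(column)
    return max(counts.values()) - min(counts.values())

assert spread(before_fix) == 4  # 6 'Modem' vs 2 'Internal'
assert spread(after_fix) == 0   # 4 of each
```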
7: Pairs(brand='Brand X', os='NT', minute=30)
8: Pairs(brand='Brand X', os='NT', minute=60)
9: Pairs(brand='Brand Y', os='98', minute=30)
10: Pairs(brand='Brand Y', os='XP', minute=15)
11: Pairs(brand='Brand X', os='2000', minute=15)
Resulting set changed due to uneven parameter distribution fix.
weights = [-len(new_combs[-1])]

# less used outbound connections most likely to produce more new
# pairs while search continues
weights.extend(
    [len(data_node.out)]
    + [len(x) for x in reversed(new_combs[:-1])]
-   + [-data_node.counter]  # less used node is better
+   + [data_node.counter]  # less used node is better
This fixes the uneven distribution of parameters.
If you want less used nodes to come first after resorting, you shouldn't negate the node's usage number. I believe it was just a typo here.
--
You can also consider swapping this weight (`data_node.counter`) with the last one (`-len(data_node.in_)`). Sometimes I was getting better results after doing this, especially for inputs like these. Let me know what you think; I can update it in the next commit.
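The effect of the sign can be illustrated with plain ascending sorts over made-up usage counters (not the library's actual data structures):

```python
# Candidate nodes with hypothetical usage counters
nodes = [("a", 5), ("b", 1), ("c", 3)]

# Negating the counter puts MORE-used nodes first in an ascending sort:
by_negated = sorted(nodes, key=lambda n: -n[1])
assert [name for name, _ in by_negated] == ["a", "c", "b"]

# Using the counter as-is puts the LEAST-used node first, which is the
# intended "less used node is better" behaviour:
by_plain = sorted(nodes, key=lambda n: n[1])
assert [name for name, _ in by_plain] == ["b", "c", "a"]
```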
As you mentioned, `-len(data_node.in_)` would get better results.
Could you update it to use `-len(data_node.in_)`?
Sure, but I'll be able to do it in a week from now.
if len(self.__pairs) == previous_unique_pairs_count:
    # could not find new unique pairs - go back to scanning
    direction = -1
i += direction
This fixes insufficient combinations when filtering is applied.
If `chosen_item_list` doesn't produce new pairs, we should go back to scanning until we have tried all items. That way we either get new pairs, or brute-force all items and raise `StopIteration()` when we get back to `i == 0`.
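Roughly, the fallback behaves like the following simplified sketch; `next_item` and `produces_new_pairs` are illustrative placeholders, not the library's real API:

```python
def next_item(items, produces_new_pairs):
    """Try candidate items one by one; return the first that contributes
    a new pair, or raise StopIteration once every item has been tried
    without success (mirroring the return to i == 0)."""
    for item in items:
        if produces_new_pairs(item):
            return item
    raise StopIteration()

# Hypothetical usage: only "b" still contributes an uncovered pair
found = next_item(["a", "b", "c"], lambda x: x == "b")
assert found == "b"
```

The important property is the second branch: when no item produces new pairs, the scan has effectively brute-forced all candidates, and stopping is the only correct outcome.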
["Brand Y", "XP", "Internal", "Hourly", 30],
["Brand X", "2000", "Internal", "Part-Time", 15],
["Brand X", "2000", "Internal", "Hourly", 30],
["Brand X", "2000", "Internal", "Contr.", 10],
]
Major increase in the number of combinations when n > 2.
Currently, this test has the following input and expected result:
parameters = [
["Brand X", "Brand Y"],
["98", "NT", "2000", "XP"],
["Internal", "Modem"],
["Salaried", "Hourly", "Part-Time", "Contr."],
[6, 10, 15, 30, 60],
]
assert list(AllPairs(parameters, n=3)) == [
["Brand X", "98", "Internal", "Salaried", 6],
["Brand Y", "NT", "Modem", "Hourly", 6],
["Brand Y", "2000", "Modem", "Part-Time", 10],
["Brand X", "XP", "Internal", "Contr.", 10],
["Brand X", "XP", "Modem", "Part-Time", 6],
["Brand Y", "2000", "Internal", "Hourly", 15],
["Brand Y", "NT", "Internal", "Salaried", 10],
["Brand X", "98", "Modem", "Contr.", 15],
["Brand X", "98", "Modem", "Hourly", 10],
["Brand Y", "NT", "Modem", "Contr.", 30],
["Brand X", "XP", "Internal", "Hourly", 30],
["Brand X", "2000", "Modem", "Salaried", 30],
["Brand Y", "2000", "Internal", "Contr.", 6],
["Brand Y", "NT", "Internal", "Part-Time", 60],
["Brand Y", "XP", "Modem", "Salaried", 15],
["Brand X", "98", "Modem", "Part-Time", 60],
["Brand X", "XP", "Modem", "Salaried", 60],
["Brand X", "2000", "Internal", "Part-Time", 15],
["Brand X", "2000", "Modem", "Contr.", 60],
["Brand X", "98", "Modem", "Salaried", 10],
["Brand X", "98", "Modem", "Part-Time", 30],
["Brand X", "NT", "Modem", "Part-Time", 10],
["Brand Y", "NT", "Modem", "Salaried", 60],
["Brand Y", "NT", "Modem", "Hourly", 15],
["Brand Y", "NT", "Modem", "Hourly", 30],
["Brand Y", "NT", "Modem", "Hourly", 60],
["Brand Y", "NT", "Modem", "Hourly", 10],
]
Correct me if I'm wrong, but as I understand it, when we say `n = 3` we want all triple-wise combinations to be tested. So in the result we want to see, for example, the following `Brand X - NT - ...` combinations:
Brand X - NT - Internal
Brand X - NT - Modem [+]
Brand X - NT - Salaried
Brand X - NT - Hourly
Brand X - NT - Part-Time [+]
Brand X - NT - Contr.
Brand X - NT - 6
Brand X - NT - 10 [+]
Brand X - NT - 15
Brand X - NT - 30
Brand X - NT - 60
But currently only 3 of them are present (marked with `[+]`).
The filtering fix also resolved this issue.
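Triple-wise coverage can be verified the same way as pairwise coverage: collect every value triple occurring across any three columns and compare against the set of required triples. A tiny illustrative sketch (scaled down, not the PR's test data):

```python
from itertools import combinations, product

parameters = [
    ["Brand X", "Brand Y"],
    ["98", "NT"],
    ["Internal", "Modem"],
    [6, 10],
]

# Every triple that must appear somewhere when n == 3: pick any 3 of the
# 4 columns, then one value from each of those columns
required = set()
for cols in combinations(parameters, 3):
    required.update(product(*cols))

# For illustration, use the full cartesian product (16 rows), which is
# certain to cover everything; an n-wise generator would cover the same
# triples with far fewer rows on wider inputs
rows = [list(t) for t in product(*parameters)]

covered = set()
for row in rows:
    covered.update(combinations(row, 3))

assert required <= covered
```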
Quick update: I found one more issue. |
So I swapped the last 2 weights for better results.

--

Unfortunately, I wasn't able to find a way to fix an issue I discovered after applying all my changes. Let me try to explain it.

As I said in this comment, in order for the algorithm to correctly cover all pairs when filtering is applied, we need to go back to scanning if the currently chosen combination doesn't produce new pairs. This way we either get new pairs, or brute-force all items and raise StopIteration() when we get back to i == 0.

The problem comes when there are too many parameters. In my testing, starting from 8 parameters with 8 values each (or more) AND a filter which restricts 1 pair, the brute force takes minutes to finish. I never waited for it to complete, but it certainly takes more than 10 minutes, which I don't think is acceptable. For example:

parameters = [
    ("1", ["1-1", "1-2", "1-3", "1-4", "1-5", "1-6", "1-7", "1-8"]),
    ("2", ["2-1", "2-2", "2-3", "2-4", "2-5", "2-6", "2-7", "2-8"]),
    ("3", ["3-1", "3-2", "3-3", "3-4", "3-5", "3-6", "3-7", "3-8"]),
    ("4", ["4-1", "4-2", "4-3", "4-4", "4-5", "4-6", "4-7", "4-8"]),
    ("5", ["5-1", "5-2", "5-3", "5-4", "5-5", "5-6", "5-7", "5-8"]),
    ("6", ["6-1", "6-2", "6-3", "6-4", "6-5", "6-6", "6-7", "6-8"]),
    ("7", ["7-1", "7-2", "7-3", "7-4", "7-5", "7-6", "7-7", "7-8"]),
    ("8", ["8-1", "8-2", "8-3", "8-4", "8-5", "8-6", "8-7", "8-8"]),
]
rules = [
    lambda d: "1-1" == d["1"] and "2-1" == d["2"],
]

Run this with only 6 parameters and it finishes in less than a second.

--

I still believe the fix to the current AllPairs algorithm is needed, since filtering currently works incorrectly (it removes valid pairs from the resulting set). This PR fixes that, but introduces the issue of the algorithm running for too long on large input data. So you can either ship it with this flaw, or wait until someone finds a way to make the described case faster.

If anything is unclear, feel free to ask! I'm still very open to finishing this PR, especially if there are any ideas on how to fix the issue.
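For a rough sense of the scale involved: the full combination space grows exponentially with the number of parameters, and the backtracking scan can end up revisiting a large fraction of it when a filter blocks a pair. A quick check of the sizes mentioned above:

```python
# Sizes of the full cartesian combination spaces for the inputs above
assert 8 ** 6 == 262_144      # 6 parameters of 8 values each
assert 8 ** 8 == 16_777_216   # 8 parameters of 8 values each, 64x larger
```

The wall-clock gap (under a second versus 10+ minutes) is larger than the 64x size ratio, which suggests the backtracking cost grows worse than linearly in the space size.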
Hi @thombashi, @pavelicii, I recently stumbled across the same issue. For my matrix (5x26x4) it still shows overrepresentation of some values, but at least all the pairs are now covered. Thank you very much for providing the fix. Any idea when you will be able to update PyPI? Cheers,
Using AllPairs 2.5.0 some combinations are missing from the result once filtering is used. This refers to thombashi/allpairspy#2 thombashi/allpairspy#10
@pavelicii One possible approach is to add an option to …
@thombashi However, I also came up with another solution which could potentially be better, but it would require a major version change since it introduces breaking changes. Let me explain. I will refer to the problematic input which I provided earlier.

It has 1792 unique pairs. This number is stored in …

Here's what we could do to tell the algorithm not to do so. Before generating cases, we can create a list of dictionaries of all possible pairs. This can be done easily in Python; check this SO answer. Then we could run our …

Why would it be a breaking change? To be able to run …

Let me know what you think.
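The proposed precomputation can be sketched with itertools: enumerate every cross-column value pair as a small dictionary, apply the filter to each, and count the survivors. The parameter values and rule below are illustrative, scaled down from the 8x8 input (and the rule uses `.get()` so it can be applied to partial dictionaries):

```python
from itertools import combinations, product

parameters = {
    "1": ["1-1", "1-2"],
    "2": ["2-1", "2-2"],
    "3": ["3-1", "3-2"],
}

# Filter in the style used above: forbid the single pair ("1-1", "2-1")
def is_valid(partial):
    return not ("1-1" == partial.get("1") and "2-1" == partial.get("2"))

# Enumerate every possible pair across any two parameters, as dictionaries
all_pairs = [
    {name_a: va, name_b: vb}
    for (name_a, vals_a), (name_b, vals_b) in combinations(parameters.items(), 2)
    for va, vb in product(vals_a, vals_b)
]

valid_pairs = [p for p in all_pairs if is_valid(p)]

# 3 column pairs x 2 x 2 values = 12 pairs total, 1 of them filtered out
assert len(all_pairs) == 12
assert len(valid_pairs) == 11
```

For the 8-parameter, 8-value input above, the same enumeration yields C(8,2) x 8 x 8 = 28 x 64 = 1792 pairs, matching the number quoted, and the filtered count would give the algorithm a correct stopping target instead of an unreachable one.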
@pavelicii In my opinion, breaking changes would be allowed if they are needed.
It is a good fix and works well. When can it be merged in?
@pavelicii
@thombashi Yes, I stumbled upon this once again, and I plan to implement what I suggested in this comment, hopefully in the next 1-2 weeks. I implemented this solution in my Java port more than a year ago and it has been working fine, with a few remarks (I will try to cover everything in the readme). I will also try to avoid changing the major version, but we'll see; I don't remember why I assumed it would be needed.

About your comment on '1 billion possible combinations': here by 'combinations' we mean …

UPD: Sorry, I haven't had much time recently, but it is still in my TODOs.
@pavelicii @JonatanAntoni @Huang-Jack
Current problems:
I think I was able to fix them by tuning the scanning algorithm a bit. It could probably be improved even more, so feel free to ask questions or request additional changes.
I will write detailed comments in the Changes dialog.