
Fix filtering and uneven parameter distribution #10

Open

pavelicii wants to merge 3 commits into thombashi:master from pavelicii:fix/filtering-and-distribution

Conversation

pavelicii

Current problems:

  1. Filtering removes valid pairs. More info in this issue.
  2. Uneven distribution of parameters. More info in this issue.
  3. When n > 2, resulting combinations are not complete.

I think I was able to fix them by tuning the scanning algorithm a bit. I can see that it could probably be improved even further, so feel free to ask any questions or request any additional changes.

I will write detailed comments in the Changes dialog.

20: ['Brand X', '2000', 'Internal', 'Salaried', 6]
21: ['Brand Y', 'NT', 'Modem', 'Part-Time', 60]
22: ['Brand X', '98', 'Internal', 'Hourly', 60]
23: ['Brand Y', 'XP', 'Modem', 'Salaried', 30]
Author

Taking the filtering function into account, this output was lacking the following pairs:

98 - Hourly
98 - Contr.
98 - 10
98 - 30
98 - 60
NT - 15
2000 - 6
XP - 10
Salaried - 30
Hourly - 30
Hourly - 60
Part-Time - 60

Now they are all present in the resulting set.

20: ['Brand X', '98', 'Internal', 'Hourly', 10]
21: ['Brand Y', 'XP', 'Modem', 'Contr.', 6]
22: ['Brand Y', 'XP', 'Modem', 'Salaried', 60]
23: ['Brand X', 'NT', 'Internal', 'Salaried', 15]
Author

Non-filtered generations now also have different outputs due to the fix for the uneven distribution of parameters in the resulting set.

In this example, 'Modem' was used more often than 'Internal'. Now they are distributed evenly.

7: Pairs(brand='Brand X', os='NT', minute=30)
8: Pairs(brand='Brand X', os='NT', minute=60)
9: Pairs(brand='Brand Y', os='98', minute=30)
10: Pairs(brand='Brand Y', os='XP', minute=15)
11: Pairs(brand='Brand X', os='2000', minute=15)
@pavelicii (Author) Dec 10, 2021

Resulting set changed due to uneven parameter distribution fix.

  weights = [-len(new_combs[-1])]

  # less used outbound connections most likely to produce more new
  # pairs while search continues
  weights.extend(
      [len(data_node.out)]
      + [len(x) for x in reversed(new_combs[:-1])]
-     + [-data_node.counter]  # less used node is better
+     + [data_node.counter]  # less used node is better
@pavelicii (Author) Dec 10, 2021

This fixes uneven distribution of parameters.

If you want less used nodes to come first after re-sorting, you shouldn't negate the node's usage counter. I believe it was just a typo here.

--

You could also consider swapping this weight (data_node.counter) with the last one (-len(data_node.in_)). Sometimes I was getting better results after doing this, especially for inputs like these. Let me know what you think; I can update it in the next commit.
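A toy illustration of the sign issue (plain Python, not allpairspy code), assuming candidates are re-sorted by their weight values in ascending order before the next pick:

usage = {"a": 3, "b": 1}  # node -> usage counter

# Negated counter: the most used node sorts first (the typo).
sorted(usage, key=lambda node: -usage[node])  # -> ['a', 'b']

# Plain counter: the least used node sorts first (the fix).
sorted(usage, key=lambda node: usage[node])   # -> ['b', 'a']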

Owner

As you mentioned, -len(data_node.in_) would get better results.
Could you update the code to use -len(data_node.in_)?

Author

Sure, but I'll only be able to do it a week from now.

if len(self.__pairs) == previous_unique_pairs_count:
    # could not find new unique pairs - go back to scanning
    direction = -1

i += direction
@pavelicii (Author) Dec 10, 2021

This fixes insufficient combinations when filtering is applied.

If chosen_item_list doesn't produce new pairs, we should go back to scanning until we have tried all items. This way we either find new pairs, or brute-force all items and raise StopIteration() once we get back to i == 0.
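Schematically, the control flow looks roughly like this (hypothetical helper names, not the actual allpairspy source):

i = start_position
direction = 1
while i < len(items):
    previous_unique_pairs_count = pairs_count()
    choose_item(items[i])  # hypothetical: apply the candidate at position i
    if pairs_count() == previous_unique_pairs_count:
        # could not find new unique pairs - go back to scanning
        direction = -1
    else:
        direction = 1
    i += direction
    if i == 0 and direction == -1:
        raise StopIteration()  # got back to the start: all items brute-forced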

["Brand Y", "XP", "Internal", "Hourly", 30],
["Brand X", "2000", "Internal", "Part-Time", 15],
["Brand X", "2000", "Internal", "Hourly", 30],
["Brand X", "2000", "Internal", "Contr.", 10],
]
@pavelicii (Author) Dec 10, 2021

Major increase in number of combinations when n > 2.

Currently, this test has the following input and expected result:

parameters = [
	["Brand X", "Brand Y"],
	["98", "NT", "2000", "XP"],
	["Internal", "Modem"],
	["Salaried", "Hourly", "Part-Time", "Contr."],
	[6, 10, 15, 30, 60],
]

assert list(AllPairs(parameters, n=3)) == [
	["Brand X", "98", "Internal", "Salaried", 6],
	["Brand Y", "NT", "Modem", "Hourly", 6],
	["Brand Y", "2000", "Modem", "Part-Time", 10],
	["Brand X", "XP", "Internal", "Contr.", 10],
	["Brand X", "XP", "Modem", "Part-Time", 6],
	["Brand Y", "2000", "Internal", "Hourly", 15],
	["Brand Y", "NT", "Internal", "Salaried", 10],
	["Brand X", "98", "Modem", "Contr.", 15],
	["Brand X", "98", "Modem", "Hourly", 10],
	["Brand Y", "NT", "Modem", "Contr.", 30],
	["Brand X", "XP", "Internal", "Hourly", 30],
	["Brand X", "2000", "Modem", "Salaried", 30],
	["Brand Y", "2000", "Internal", "Contr.", 6],
	["Brand Y", "NT", "Internal", "Part-Time", 60],
	["Brand Y", "XP", "Modem", "Salaried", 15],
	["Brand X", "98", "Modem", "Part-Time", 60],
	["Brand X", "XP", "Modem", "Salaried", 60],
	["Brand X", "2000", "Internal", "Part-Time", 15],
	["Brand X", "2000", "Modem", "Contr.", 60],
	["Brand X", "98", "Modem", "Salaried", 10],
	["Brand X", "98", "Modem", "Part-Time", 30],
	["Brand X", "NT", "Modem", "Part-Time", 10],
	["Brand Y", "NT", "Modem", "Salaried", 60],
	["Brand Y", "NT", "Modem", "Hourly", 15],
	["Brand Y", "NT", "Modem", "Hourly", 30],
	["Brand Y", "NT", "Modem", "Hourly", 60],
	["Brand Y", "NT", "Modem", "Hourly", 10],
]

Correct me if I'm wrong, but as I understand it, when we say n = 3, we want all triple-wise combinations to be covered. So in the result we want to see, for example, the following Brand X - NT - ... combinations:

Brand X - NT - Internal
Brand X - NT - Modem [+]
Brand X - NT - Salaried 
Brand X - NT - Hourly
Brand X - NT - Part-Time [+]
Brand X - NT - Contr.
Brand X - NT - 6
Brand X - NT - 10 [+]
Brand X - NT - 15
Brand X - NT - 30
Brand X - NT - 60

But currently only 3 of them are present in the expected result (marked with [+]).

The filtering fix also resolved this issue.
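For reference, a quick way to verify this kind of claim is to enumerate every value triple spanning three different parameters and check that some generated row contains it (a hypothetical helper, not part of the test suite):

from itertools import combinations, product

def missing_triples(parameters, rows):
    """Return all 3-wise value combinations not covered by any row."""
    missing = []
    for cols in combinations(range(len(parameters)), 3):
        for values in product(*(parameters[c] for c in cols)):
            if not any(all(row[c] == v for c, v in zip(cols, values))
                       for row in rows):
                missing.append(tuple(zip(cols, values)))
    return missing

# With the fix applied, missing_triples(parameters, list(AllPairs(parameters, n=3)))
# should return an empty list.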


@pavelicii (Author)

Quick update: I found one more issue.
I'm fixing it in my free time and plan to commit it in the next 1-2 weeks, together with the requested changes.

@pavelicii (Author) commented Feb 15, 2022

So I swapped the last 2 weights for better results.

--

Unfortunately, I wasn't able to find a way to fix an issue I discovered after applying all my changes. Let me try to explain it.

As I said in this comment, in order for the algorithm to correctly cover all pairs when filtering is applied, we need to go back to scanning if the currently chosen combination doesn't produce new pairs. This way we either find new pairs, or brute-force all items and raise StopIteration() when we get back to i == 0.

The problem appears when there are too many parameters. In my testing, starting from 8 parameters with 8 values each (or more), plus a filter which restricts a single pair, the brute force takes minutes to finish. I never waited for it to complete, but it certainly runs for more than 10 minutes, which I don't think is acceptable. For example:

parameters = [
    ("1", ["1-1", "1-2", "1-3", "1-4", "1-5", "1-6", "1-7", "1-8"]),
    ("2", ["2-1", "2-2", "2-3", "2-4", "2-5", "2-6", "2-7", "2-8"]),
    ("3", ["3-1", "3-2", "3-3", "3-4", "3-5", "3-6", "3-7", "3-8"]),
    ("4", ["4-1", "4-2", "4-3", "4-4", "4-5", "4-6", "4-7", "4-8"]),
    ("5", ["5-1", "5-2", "5-3", "5-4", "5-5", "5-6", "5-7", "5-8"]),
    ("6", ["6-1", "6-2", "6-3", "6-4", "6-5", "6-6", "6-7", "6-8"]),
    ("7", ["7-1", "7-2", "7-3", "7-4", "7-5", "7-6", "7-7", "7-8"]),
    ("8", ["8-1", "8-2", "8-3", "8-4", "8-5", "8-6", "8-7", "8-8"]),
]

rules = [
    lambda d: "1-1" == d["1"] and "2-1" == d["2"],
]

Run this with only 6 parameters and it will finish in less than a second.
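For completeness, a minimal way to reproduce this, assuming the standard AllPairs entry point where filter_func receives the partial row as a list of values in parameter order:

from allpairspy import AllPairs

# 8 parameters x 8 values each, matching the example above.
parameters = [[f"{p}-{v}" for v in range(1, 9)] for p in range(1, 9)]

def filter_func(row):
    # Forbid the single pair ("1-1", "2-1"), mirroring the rule above.
    return not (len(row) > 1 and row[0] == "1-1" and row[1] == "2-1")

# With 8 parameters this loop runs for many minutes;
# with 6 it finishes in under a second.
for row in AllPairs(parameters, filter_func=filter_func):
    print(row)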

--

I still believe the fix to the current AllPairs algorithm is needed, since right now filtering works completely incorrectly (removing valid pairs from the resulting set). This PR fixes that, but it introduces the issue of the algorithm running for too long on large input data. So you can either ship it with this flaw, or wait until someone finds a way to make the described case faster.

If anything is unclear, feel free to ask! I'm still very open to finishing this PR, especially if there are any ideas on how to fix the issue.

@JonatanAntoni

Hi @thombashi, @pavelicii,

I recently stumbled across the same issue, and I can confirm this patch fixes it.

For my matrix (5x26x4) it still shows over-representation of some values, but at least all pairs are now covered.

Thank you very much for providing the fix. Any idea when you will be able to update PyPI?

Cheers,
Jonatan

energy6 added a commit to energy6/python-matrix-runner that referenced this pull request Feb 22, 2022
Using AllPairs 2.5.0 some combinations are missing from
the result once filtering is used. This refers to

thombashi/allpairspy#2
thombashi/allpairspy#10
@thombashi (Owner)

@pavelicii
Thank you for your investigation and explanations.

One possible approach is to add an option to the AllPairs class for choosing an algorithm.
That way, users could select the algorithm that suits their needs: fast (but with uneven parameter distribution) or precise (but potentially slow for large inputs).
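For illustration, the option might look something like this (a hypothetical signature; the parameter name is not decided):

# Current behavior: fast, but uneven distribution and may miss pairs.
AllPairs(parameters, filter_func=is_valid, mode="fast")

# Behavior from this PR: covers all pairs, but may be slow on large inputs.
AllPairs(parameters, filter_func=is_valid, mode="precise")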

@pavelicii (Author)

@thombashi
The fast/precise solution is OK, I could implement it.

However, I also came up with another solution which could potentially be better, but it would require a major version change since it introduces breaking changes. Let me explain.

I will refer to the problematic input which I provided earlier:

parameters = [
    ("1", ["1-1", "1-2", "1-3", "1-4", "1-5", "1-6", "1-7", "1-8"]),
    ("2", ["2-1", "2-2", "2-3", "2-4", "2-5", "2-6", "2-7", "2-8"]),
    ("3", ["3-1", "3-2", "3-3", "3-4", "3-5", "3-6", "3-7", "3-8"]),
    ("4", ["4-1", "4-2", "4-3", "4-4", "4-5", "4-6", "4-7", "4-8"]),
    ("5", ["5-1", "5-2", "5-3", "5-4", "5-5", "5-6", "5-7", "5-8"]),
    ("6", ["6-1", "6-2", "6-3", "6-4", "6-5", "6-6", "6-7", "6-8"]),
    ("7", ["7-1", "7-2", "7-3", "7-4", "7-5", "7-6", "7-7", "7-8"]),
    ("8", ["8-1", "8-2", "8-3", "8-4", "8-5", "8-6", "8-7", "8-8"]),
]

rules = [
    lambda d: "1-1" == d["1"] and "2-1" == d["2"],
]

It has 1792 unique pairs: C(8,2) = 28 parameter pairs, each with 8 × 8 = 64 value combinations. This number is stored in the max_unique_pairs_expected variable, and the algorithm constantly checks whether we have reached it. We filter out 1 pair using the rules, so once the algorithm finds 1791 unique pairs, it hangs brute-forcing for the last one. But in fact we know there is no need to brute-force for the pair we excluded.

Here's what we could do to tell the algorithm not to do that. Before generating cases, we can create a list of dictionaries of all possible pairs. This is easy to do in Python, check this SO answer. Then we can run our rules against this list to filter out the unwanted pairs. Once it is filtered, we can update max_unique_pairs_expected. In this case it would be 1791, so the algorithm wouldn't hang anymore. See the sketch below.
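A minimal sketch of that idea (the helper below is hypothetical; it assumes named parameters given as (name, values) tuples, and rules that raise KeyError when they reference a parameter absent from the pair):

from itertools import combinations, product

def rule_excludes(rule, pair):
    try:
        return rule(pair)
    except KeyError:
        # The rule references a parameter that is not part of
        # this pair, so it cannot exclude it.
        return False

def count_valid_pairs(parameters, rules):
    valid = 0
    for (name_a, values_a), (name_b, values_b) in combinations(parameters, 2):
        for value_a, value_b in product(values_a, values_b):
            pair = {name_a: value_a, name_b: value_b}
            if not any(rule_excludes(rule, pair) for rule in rules):
                valid += 1
    return valid

# For the input above: 28 parameter pairs * 64 value pairs = 1792,
# minus the one excluded pair, so count_valid_pairs(parameters, rules) == 1791.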

Why would it be a breaking change? To be able to run rules against the list of all possible pairs, the parameters have to be named, so they would have to be created as dictionaries. Currently there is an option to provide parameters without names. If we apply what I suggest, that would no longer be possible (to be honest, I don't think it should be possible at all, because it is very inconvenient to describe rules for unnamed parameters).

Let me know what you think.

@thombashi (Owner)

@pavelicii
Thank you for your ideas.

In my opinion, breaking changes are acceptable if they are needed.
My concern is that applying a filter to all possible combinations might require a long processing time, especially for large inputs (like 1 billion possible combinations).

@Huang-Jack

It is a good fix and works well. When can it be merged?

@thombashi (Owner)

@pavelicii
Are there any updates on this?

@coveralls

Coverage Status

coverage: 93.814% (+2.6%) from 91.192%
when pulling 4a2bdfd on pavelicii:fix/filtering-and-distribution
into 415a2d2 on thombashi:master.

@pavelicii (Author) commented Nov 3, 2023

@thombashi Yes, I stumbled upon this once again, and I plan to implement what I suggested in this comment, hopefully in the next 1-2 weeks. I implemented this solution in my Java port more than a year ago, and it has been working fine with a few caveats (I will try to cover everything in the README). I will also try to avoid a major version change, but we'll see; I don't remember why I assumed it would be needed.

About your comment on '1 billion possible combinations': here by 'combinations' we mean possible n-wise combinations, e.g. possible pairs (not to be confused with the Cartesian product of all parameters). For instance, the 8-parameter example above has only 1792 possible pairs, even though its Cartesian product has 8^8 ≈ 16.7 million rows. Such an amount is very hard to hit, and I don't think the algorithm was ready for it before anyway, so I believe this shouldn't be a blocker.

UPD: Sorry, I didn't have much time recently, but it is still in my TODOs.

@thombashi (Owner)

@pavelicii
Thank you for your answer.
I'm looking forward to your update.

@JonatanAntoni @Huang-Jack
Please wait a little while longer.
