
Fix filtering and uneven parameter distribution #10

Open

pavelicii wants to merge 3 commits into thombashi:master from pavelicii:fix/filtering-and-distribution

Conversation

pavelicii

Current problems:

  1. Filtering removes valid pairs. More info in this issue.
  2. Uneven distribution of parameters. More info in this issue.
  3. When n > 2, resulting combinations are not complete.

I think I was able to fix them by tuning the scanning algorithm a bit. I can see that it could probably be improved even further, so feel free to ask any questions or request any additional changes.

I will write detailed comments in the Changes dialog.

20: ['Brand X', '2000', 'Internal', 'Salaried', 6]
21: ['Brand Y', 'NT', 'Modem', 'Part-Time', 60]
22: ['Brand X', '98', 'Internal', 'Hourly', 60]
23: ['Brand Y', 'XP', 'Modem', 'Salaried', 30]
Author

Taking the filtering function into account, this output was lacking the following pairs:

98 - Hourly
98 - Contr.
98 - 10
98 - 30
98 - 60
NT - 15
2000 - 6
XP - 10
Salaried - 30
Hourly - 30
Hourly - 60
Part-Time - 60

Now they are all present in the resulting set.

20: ['Brand X', '98', 'Internal', 'Hourly', 10]
21: ['Brand Y', 'XP', 'Modem', 'Contr.', 6]
22: ['Brand Y', 'XP', 'Modem', 'Salaried', 60]
23: ['Brand X', 'NT', 'Internal', 'Salaried', 15]
Author

Non-filtered generations now also have different outputs due to the fix for the uneven distribution of parameters in the resulting set.

In this example, 'Modem' was used more often than 'Internal'. Now they are distributed evenly.

7: Pairs(brand='Brand X', os='NT', minute=30)
8: Pairs(brand='Brand X', os='NT', minute=60)
9: Pairs(brand='Brand Y', os='98', minute=30)
10: Pairs(brand='Brand Y', os='XP', minute=15)
11: Pairs(brand='Brand X', os='2000', minute=15)
@pavelicii (Author) Dec 10, 2021

Resulting set changed due to uneven parameter distribution fix.

  weights = [-len(new_combs[-1])]

  # less used outbound connections most likely to produce more new
  # pairs while search continues
  weights.extend(
      [len(data_node.out)]
      + [len(x) for x in reversed(new_combs[:-1])]
-     + [-data_node.counter]  # less used node is better
+     + [data_node.counter]  # less used node is better
@pavelicii (Author) Dec 10, 2021

This fixes uneven distribution of parameters.

If you want less used nodes to come first after re-sorting, you shouldn't negate the node's usage counter. I believe it was just a typo here.

--

You could also consider swapping this weight (data_node.counter) with the last one (-len(data_node.in_)). Sometimes I was getting better results after doing this, especially for inputs like these. Let me know what you think; I can update it in the next commit.
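A toy illustration of the sign issue (plain Python, not allpairspy code), assuming candidates are re-sorted by their weight values in ascending order before the next pick:

usage = {"a": 3, "b": 1}  # node -> usage counter

# Negated counter: the most used node sorts first (the typo).
sorted(usage, key=lambda node: -usage[node])  # -> ['a', 'b']

# Plain counter: the least used node sorts first (the fix).
sorted(usage, key=lambda node: usage[node])   # -> ['b', 'a']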

Owner

As you mentioned, -len(data_node.in_) would get better results.
Could you update the code to use -len(data_node.in_)?

Author

Sure, but I'll only be able to do it a week from now.

if len(self.__pairs) == previous_unique_pairs_count:
    # could not find new unique pairs - go back to scanning
    direction = -1

i += direction
@pavelicii (Author) Dec 10, 2021

This fixes insufficient combinations when filtering is applied.

If chosen_item_list doesn't produce new pairs, we should go back to scanning until we have tried all items. This way we either find new pairs, or brute-force all items and raise StopIteration() once we get back to i == 0.
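Schematically, the control flow looks roughly like this (hypothetical helper names, not the actual allpairspy source):

i = start_position
direction = 1
while i < len(items):
    previous_unique_pairs_count = pairs_count()
    choose_item(items[i])  # hypothetical: apply the candidate at position i
    if pairs_count() == previous_unique_pairs_count:
        # could not find new unique pairs - go back to scanning
        direction = -1
    else:
        direction = 1
    i += direction
    if i == 0 and direction == -1:
        raise StopIteration()  # got back to the start: all items brute-forced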

["Brand Y", "XP", "Internal", "Hourly", 30],
["Brand X", "2000", "Internal", "Part-Time", 15],
["Brand X", "2000", "Internal", "Hourly", 30],
["Brand X", "2000", "Internal", "Contr.", 10],
]
@pavelicii (Author) Dec 10, 2021

Major increase in number of combinations when n > 2.

Currently, this test has the following input and expected result:

parameters = [
	["Brand X", "Brand Y"],
	["98", "NT", "2000", "XP"],
	["Internal", "Modem"],
	["Salaried", "Hourly", "Part-Time", "Contr."],
	[6, 10, 15, 30, 60],
]

assert list(AllPairs(parameters, n=3)) == [
	["Brand X", "98", "Internal", "Salaried", 6],
	["Brand Y", "NT", "Modem", "Hourly", 6],
	["Brand Y", "2000", "Modem", "Part-Time", 10],
	["Brand X", "XP", "Internal", "Contr.", 10],
	["Brand X", "XP", "Modem", "Part-Time", 6],
	["Brand Y", "2000", "Internal", "Hourly", 15],
	["Brand Y", "NT", "Internal", "Salaried", 10],
	["Brand X", "98", "Modem", "Contr.", 15],
	["Brand X", "98", "Modem", "Hourly", 10],
	["Brand Y", "NT", "Modem", "Contr.", 30],
	["Brand X", "XP", "Internal", "Hourly", 30],
	["Brand X", "2000", "Modem", "Salaried", 30],
	["Brand Y", "2000", "Internal", "Contr.", 6],
	["Brand Y", "NT", "Internal", "Part-Time", 60],
	["Brand Y", "XP", "Modem", "Salaried", 15],
	["Brand X", "98", "Modem", "Part-Time", 60],
	["Brand X", "XP", "Modem", "Salaried", 60],
	["Brand X", "2000", "Internal", "Part-Time", 15],
	["Brand X", "2000", "Modem", "Contr.", 60],
	["Brand X", "98", "Modem", "Salaried", 10],
	["Brand X", "98", "Modem", "Part-Time", 30],
	["Brand X", "NT", "Modem", "Part-Time", 10],
	["Brand Y", "NT", "Modem", "Salaried", 60],
	["Brand Y", "NT", "Modem", "Hourly", 15],
	["Brand Y", "NT", "Modem", "Hourly", 30],
	["Brand Y", "NT", "Modem", "Hourly", 60],
	["Brand Y", "NT", "Modem", "Hourly", 10],
]

Correct me if I'm wrong, but as I understand it, when we say n = 3, we want all triple-wise combinations to be covered. So in the result we want to see, for example, the following Brand X - NT - ... combinations:

Brand X - NT - Internal
Brand X - NT - Modem [+]
Brand X - NT - Salaried 
Brand X - NT - Hourly
Brand X - NT - Part-Time [+]
Brand X - NT - Contr.
Brand X - NT - 6
Brand X - NT - 10 [+]
Brand X - NT - 15
Brand X - NT - 30
Brand X - NT - 60

But currently only 3 of them are present in the expected result (marked with [+]).

The filtering fix also resolved this issue.
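For reference, a quick way to verify this kind of claim is to enumerate every value triple spanning three different parameters and check that some generated row contains it (a hypothetical helper, not part of the test suite):

from itertools import combinations, product

def missing_triples(parameters, rows):
    """Return all 3-wise value combinations not covered by any row."""
    missing = []
    for cols in combinations(range(len(parameters)), 3):
        for values in product(*(parameters[c] for c in cols)):
            if not any(all(row[c] == v for c, v in zip(cols, values))
                       for row in rows):
                missing.append(tuple(zip(cols, values)))
    return missing

# With the fix applied, missing_triples(parameters, list(AllPairs(parameters, n=3)))
# should return an empty list.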


@pavelicii (Author)

Quick update: I found one more issue.
I'm fixing it in my free time and plan to commit it in the next 1-2 weeks, together with the requested changes.

@pavelicii (Author) commented Feb 15, 2022

So I swapped the last 2 weights for better results.

--

Unfortunately, I wasn't able to find a way to fix an issue I discovered after applying all my changes. Let me try to explain it.

As I said in this comment, in order for the algorithm to correctly cover all pairs when filtering is applied, we need to go back to scanning if the currently chosen combination doesn't produce new pairs. This way we either find new pairs, or brute-force all items and raise StopIteration() when we get back to i == 0.

The problem appears when there are too many parameters. In my testing, starting from 8 parameters with 8 values each (or more), plus a filter which restricts a single pair, the brute force takes minutes to finish. I never waited for it to complete, but it certainly runs for more than 10 minutes, which I don't think is acceptable. For example:

parameters = [
    ("1", ["1-1", "1-2", "1-3", "1-4", "1-5", "1-6", "1-7", "1-8"]),
    ("2", ["2-1", "2-2", "2-3", "2-4", "2-5", "2-6", "2-7", "2-8"]),
    ("3", ["3-1", "3-2", "3-3", "3-4", "3-5", "3-6", "3-7", "3-8"]),
    ("4", ["4-1", "4-2", "4-3", "4-4", "4-5", "4-6", "4-7", "4-8"]),
    ("5", ["5-1", "5-2", "5-3", "5-4", "5-5", "5-6", "5-7", "5-8"]),
    ("6", ["6-1", "6-2", "6-3", "6-4", "6-5", "6-6", "6-7", "6-8"]),
    ("7", ["7-1", "7-2", "7-3", "7-4", "7-5", "7-6", "7-7", "7-8"]),
    ("8", ["8-1", "8-2", "8-3", "8-4", "8-5", "8-6", "8-7", "8-8"]),
]

rules = [
    lambda d: "1-1" == d["1"] and "2-1" == d["2"],
]

Run this with only 6 parameters and it will finish in less than a second.
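For completeness, a minimal way to reproduce this, assuming the standard AllPairs entry point where filter_func receives the partial row as a list of values in parameter order:

from allpairspy import AllPairs

# 8 parameters x 8 values each, matching the example above.
parameters = [[f"{p}-{v}" for v in range(1, 9)] for p in range(1, 9)]

def filter_func(row):
    # Forbid the single pair ("1-1", "2-1"), mirroring the rule above.
    return not (len(row) > 1 and row[0] == "1-1" and row[1] == "2-1")

# With 8 parameters this loop runs for many minutes;
# with 6 it finishes in under a second.
for row in AllPairs(parameters, filter_func=filter_func):
    print(row)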

--

I still believe the fix to the current AllPairs algorithm is needed, since right now filtering works completely incorrectly (removing valid pairs from the resulting set). This PR fixes that, but it introduces the issue of the algorithm running for too long on large input data. So you can either ship it with this flaw, or wait until someone finds a way to make the described case faster.

If anything is unclear, feel free to ask! I'm still very open to finishing this PR, especially if there are any ideas on how to fix the issue.

@JonatanAntoni

Hi @thombashi, @pavelicii,

I recently stumbled across the same issue, and I can confirm this patch fixes it.

For my matrix (5x26x4) it still shows over-representation of some values, but at least all pairs are now covered.

Thank you very much for providing the fix. Any idea when you will be able to update PyPI?

Cheers,
Jonatan

energy6 added a commit to energy6/python-matrix-runner that referenced this pull request Feb 22, 2022
Using AllPairs 2.5.0 some combinations are missing from
the result once filtering is used. This refers to

thombashi/allpairspy#2
thombashi/allpairspy#10
@thombashi (Owner)

@pavelicii
Thank you for your investigation and explanations.

One possible approach is to add an option to the AllPairs class for choosing an algorithm.
That way, users could select the algorithm that suits their needs: fast (but with uneven parameter distribution) or precise (but potentially slow for large inputs).
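For illustration, the option might look something like this (a hypothetical signature; the parameter name is not decided):

# Current behavior: fast, but uneven distribution and may miss pairs.
AllPairs(parameters, filter_func=is_valid, mode="fast")

# Behavior from this PR: covers all pairs, but may be slow on large inputs.
AllPairs(parameters, filter_func=is_valid, mode="precise")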

@pavelicii (Author)

@thombashi
The fast/precise solution is OK, I could implement it.

However, I also came up with another solution which could potentially be better, but it would require a major version change since it introduces breaking changes. Let me explain.

I will refer to the problematic input which I provided earlier:

parameters = [
    ("1", ["1-1", "1-2", "1-3", "1-4", "1-5", "1-6", "1-7", "1-8"]),
    ("2", ["2-1", "2-2", "2-3", "2-4", "2-5", "2-6", "2-7", "2-8"]),
    ("3", ["3-1", "3-2", "3-3", "3-4", "3-5", "3-6", "3-7", "3-8"]),
    ("4", ["4-1", "4-2", "4-3", "4-4", "4-5", "4-6", "4-7", "4-8"]),
    ("5", ["5-1", "5-2", "5-3", "5-4", "5-5", "5-6", "5-7", "5-8"]),
    ("6", ["6-1", "6-2", "6-3", "6-4", "6-5", "6-6", "6-7", "6-8"]),
    ("7", ["7-1", "7-2", "7-3", "7-4", "7-5", "7-6", "7-7", "7-8"]),
    ("8", ["8-1", "8-2", "8-3", "8-4", "8-5", "8-6", "8-7", "8-8"]),
]

rules = [
    lambda d: "1-1" == d["1"] and "2-1" == d["2"],
]

It has 1792 unique pairs: C(8,2) = 28 parameter pairs, each with 8 × 8 = 64 value combinations. This number is stored in the max_unique_pairs_expected variable, and the algorithm constantly checks whether we have reached it. We filter out 1 pair using the rules, so once the algorithm finds 1791 unique pairs, it hangs brute-forcing for the last one. But in fact we know there is no need to brute-force for the pair we excluded.

Here's what we could do to tell the algorithm not to do that. Before generating cases, we can create a list of dictionaries of all possible pairs. This is easy to do in Python, check this SO answer. Then we can run our rules against this list to filter out the unwanted pairs. Once it is filtered, we can update max_unique_pairs_expected. In this case it would be 1791, so the algorithm wouldn't hang anymore. See the sketch below.
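A minimal sketch of that idea (the helper below is hypothetical; it assumes named parameters given as (name, values) tuples, and rules that raise KeyError when they reference a parameter absent from the pair):

from itertools import combinations, product

def rule_excludes(rule, pair):
    try:
        return rule(pair)
    except KeyError:
        # The rule references a parameter that is not part of
        # this pair, so it cannot exclude it.
        return False

def count_valid_pairs(parameters, rules):
    valid = 0
    for (name_a, values_a), (name_b, values_b) in combinations(parameters, 2):
        for value_a, value_b in product(values_a, values_b):
            pair = {name_a: value_a, name_b: value_b}
            if not any(rule_excludes(rule, pair) for rule in rules):
                valid += 1
    return valid

# For the input above: 28 parameter pairs * 64 value pairs = 1792,
# minus the one excluded pair, so count_valid_pairs(parameters, rules) == 1791.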

Why would it be a breaking change? To be able to run rules against the list of all possible pairs, the parameters have to be named, so they would have to be created as dictionaries. Currently there is an option to provide parameters without names. If we apply what I suggest, that would no longer be possible (to be honest, I don't think it should be possible at all, because it is very inconvenient to describe rules for unnamed parameters).

Let me know what you think.

@thombashi (Owner)

@pavelicii
Thank you for your ideas.

In my opinion, breaking changes are acceptable if they are needed.
My concern is that applying a filter to all possible combinations might require a long processing time, especially for large inputs (like 1 billion possible combinations).

@Huang-Jack

It is a good fix and works well. When can it be merged?

@thombashi (Owner)

@pavelicii
Are there any updates on this?

@coveralls

Coverage Status

coverage: 93.814% (+2.6%) from 91.192%
when pulling 4a2bdfd on pavelicii:fix/filtering-and-distribution
into 415a2d2 on thombashi:master.

@pavelicii (Author) commented Nov 3, 2023

@thombashi Yes, I stumbled upon this once again, and I plan to implement what I suggested in this comment, hopefully in the next 1-2 weeks. I implemented this solution in my Java port more than a year ago, and it has been working fine with a few caveats (I will try to cover everything in the README). I will also try to avoid a major version change, but we'll see; I don't remember why I assumed it would be needed.

About your comment on '1 billion possible combinations': here by 'combinations' we mean possible n-wise combinations, e.g. possible pairs (not to be confused with the Cartesian product of all parameters). For instance, the 8-parameter example above has only 1792 possible pairs, even though its Cartesian product has 8^8 ≈ 16.7 million rows. Such an amount is very hard to hit, and I don't think the algorithm was ready for it before anyway, so I believe this shouldn't be a blocker.

UPD: Sorry, I didn't have much time recently, but it is still in my TODOs.

@thombashi (Owner)

@pavelicii
Thank you for your answer.
I'm looking forward to your update.

@JonatanAntoni @Huang-Jack
Please wait a little while longer.
