Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

error_set not populated if fail_ok true and assertion fails #97

Open
ExalDraen opened this issue Dec 6, 2019 · 8 comments
Open

error_set not populated if fail_ok true and assertion fails #97

ExalDraen opened this issue Dec 6, 2019 · 8 comments

Comments

@ExalDraen
Copy link

Summary

Observation:

When fail_ok is set and an assertion is also set, the resultant choria task has an empty error_set.

Expected

Per the tips on error handling the documentation I expect the error_set to be populated with the tasks whose assertions failed even though fail_ok is set.

Details

Given the following plan with an assertion that will never succeed

plan orchestration::assert_err_test (
) {
  $servers = ['kdhstage-ems03.grass.corp']
  $t1_res = choria::task(
    'action'    => 'puppet.status',
    'nodes'     => $servers,
    'assert'    => 'applying=nevermatch',
    'fail_ok'   => true,
    'silent'    => true,
  )
  notice($t1_res)
  notice("------ Error set ----------")
  notice($t1_res.error_set)
  notice("------ OK set ----------")
  notice($t1_res.ok_set)
}

the $t1_res results set will have an empty error_set - all the results will be in the ok_set.
The execution log shows this:

alexander.hermes orchestration $ mco playbook run orchestration::assert_err_test --modulepath $HOME/code/puppet/modules
Notice: Scope(<module>/orchestration/plans/assert_err_test.pp, 4): About to run task: mcollective task
Notice: Scope(<module>/orchestration/plans/assert_err_test.pp, 4): Starting request for puppet#status against 1 nodes
Warning: Scope(<module>/orchestration/plans/assert_err_test.pp, 4): Assertion 'applying=nevermatch' failed on 1/1 nodes
Warning: Scope(<module>/orchestration/plans/assert_err_test.pp, 4): Task failed but fail_ok is true, treating as success
Notice: Scope(<module>/orchestration/plans/assert_err_test.pp, 4): Failed request 841cc322e989540d86c7d74bd16953d8 for puppet#status assertion failed on 1 node(s)
Notice: Scope(<module>/orchestration/plans/assert_err_test.pp, 11): Choria::TaskResults({'results' => [Choria::TaskResult({'host' => 'kdhstage-ems03.grass.corp', 'result' => {'value' => {'agent' => 'puppet', 'action' => 'status', 'sender' => 'kdhstage-ems03.grass.corp', 'statuscode' => 0, 'statusmsg' => 'OK', 'data' => {'applying' => false, 'idling' => false, 'enabled' => true, 'daemon_present' => false, 'lastrun' => 1575606931, 'since_lastrun' => 1316, 'status' => 'stopped', 'disable_message' => '', 'message' => 'Currently stopped; last completed run 21 minutes 56 seconds ago'}, 'requestid' => '841cc322e989540d86c7d74bd16953d8'}, 'type' => 'mcollective', 'fail_ok' => true}})]})
Notice: Scope(<module>/orchestration/plans/assert_err_test.pp, 12): ------ Error set ----------
Notice: Scope(<module>/orchestration/plans/assert_err_test.pp, 13): Choria::TaskResults({'results' => []})
Notice: Scope(<module>/orchestration/plans/assert_err_test.pp, 14): ------ OK set ----------
Notice: Scope(<module>/orchestration/plans/assert_err_test.pp, 15): Choria::TaskResults({'results' => [Choria::TaskResult({'host' => 'kdhstage-ems03.grass.corp', 'result' => {'value' => {'agent' => 'puppet', 'action' => 'status', 'sender' => 'kdhstage-ems03.grass.corp', 'statuscode' => 0, 'statusmsg' => 'OK', 'data' => {'applying' => false, 'idling' => false, 'enabled' => true, 'daemon_present' => false, 'lastrun' => 1575606931, 'since_lastrun' => 1316, 'status' => 'stopped', 'disable_message' => '', 'message' => 'Currently stopped; last completed run 21 minutes 56 seconds ago'}, 'requestid' => '841cc322e989540d86c7d74bd16953d8'}, 'type' => 'mcollective', 'fail_ok' => true}})]})

NOTE that if I remove the assertion and instead make the task fail in another way the error_set is populated as expected:

alexander.hermes orchestration $ mco playbook run orchestration::assert_err_test --modulepath $HOME/code/puppet/modules
Notice: Scope(<module>/orchestration/plans/assert_err_test.pp, 4): About to run task: mcollective task
Notice: Scope(<module>/orchestration/plans/assert_err_test.pp, 4): Starting request for puppet#enable against 1 nodes
Warning: Scope(<module>/orchestration/plans/assert_err_test.pp, 4): kdhstage-ems03.grass.corp puppet#enable failure: {:status=>"Could not enable Puppet: Already enabled", :enabled=>true}
Warning: Scope(<module>/orchestration/plans/assert_err_test.pp, 4): Failed request 8b4a8e41cf8c5d4687cee7c46643869f for puppet#enable in 1.12s
Warning: Scope(<module>/orchestration/plans/assert_err_test.pp, 4): Failed result from kdhstage-ems03.grass.corp: Could not enable Puppet: Already enabled
Warning: Scope(<module>/orchestration/plans/assert_err_test.pp, 4): Task failed but fail_ok is true, treating as success
Notice: Scope(<module>/orchestration/plans/assert_err_test.pp, 4): Failed request 8b4a8e41cf8c5d4687cee7c46643869f for puppet#enable on 1 failed node(s)
Notice: Scope(<module>/orchestration/plans/assert_err_test.pp, 13): Choria::TaskResults({'results' => [Choria::TaskResult({'host' => 'kdhstage-ems03.grass.corp', 'result' => {'value' => {'agent' => 'puppet', 'action' => 'enable', 'sender' => 'kdhstage-ems03.grass.corp', 'statuscode' => 1, 'statusmsg' => 'Could not enable Puppet: Already enabled', 'data' => {'status' => 'Could not enable Puppet: Already enabled', 'enabled' => true}, 'requestid' => '8b4a8e41cf8c5d4687cee7c46643869f'}, 'type' => 'mcollective', 'fail_ok' => true, 'error' => {'msg' => 'Could not enable Puppet: Already enabled', 'kind' => 'choria.playbook/taskerror', 'details' => {'agent' => 'puppet', 'action' => 'enable', 'issue_code' => 1}}}})]})
Notice: Scope(<module>/orchestration/plans/assert_err_test.pp, 14): ------ Error set ----------
Notice: Scope(<module>/orchestration/plans/assert_err_test.pp, 15): Choria::TaskResults({'results' => [Choria::TaskResult({'host' => 'kdhstage-ems03.grass.corp', 'result' => {'value' => {'agent' => 'puppet', 'action' => 'enable', 'sender' => 'kdhstage-ems03.grass.corp', 'statuscode' => 1, 'statusmsg' => 'Could not enable Puppet: Already enabled', 'data' => {'status' => 'Could not enable Puppet: Already enabled', 'enabled' => true}, 'requestid' => '8b4a8e41cf8c5d4687cee7c46643869f'}, 'type' => 'mcollective', 'fail_ok' => true, 'error' => {'msg' => 'Could not enable Puppet: Already enabled', 'kind' => 'choria.playbook/taskerror', 'details' => {'agent' => 'puppet', 'action' => 'enable', 'issue_code' => 1}}}})]})
Notice: Scope(<module>/orchestration/plans/assert_err_test.pp, 16): ------ OK set ----------
Notice: Scope(<module>/orchestration/plans/assert_err_test.pp, 17): Choria::TaskResults({'results' => []})

Versions

Puppet AIO: 6.3.0
mcollective_choria module: 0.17.1

@ripienaar
Copy link
Member

The code treats a task that failed, while fail_ok is true and that only had 1 try as success.

https://github.com/choria-io/mcollective-choria/blob/44ce8237ddd2447770a141c449d8a1c2c79b83d0/lib/mcollective/util/playbook/task_result.rb#L35

Now, debatable if this is right - seems to me its not - but I am also I am not sure I can change this behaviour as it will break things for people.

Difficult to know whats right here

@ExalDraen
Copy link
Author

@ripienaar the line you pointed out is a problem (for me), but I think the issue I reported is about something slightly different:

Namely, that when an assertion fails with fail_ok the error_set is not populated, but if a task fails for another reason, error_set is populated (see the last log trace in the original post).

In other words, failure by assertion is treated differently than other kinds of failures which is problematic.


Regarding the problem you pointed out: I would suggest that fail_ok should apply regardless of the number of tries. To maintain backwards compatibility, I have a couple ideas:

  • introduce a fail_ok_count parameter that says how many times a task can fail and still be considered a pass. This could be set to 1 if fail_ok is set to maintain behaviour
  • introduce an any_fail_ok type parameter that would let a task succeed if it is set, no matter the number of tries

I'm happy to create a separate issue if you want me to.

@ripienaar
Copy link
Member

ripienaar commented Dec 7, 2019

yes, I understand the problem, error set is only populated for failed requests - the request didnt fail, it merely didnt match. Where as it is error set is only for actual failed requests where the RPC request failed.

The task system detect this more or less correctly though and correctly marks the overall task run as failed despite the rpc requests passing because it understands the assert intention, however this line that I highlighted then takes the failed request - because assert set it to fail - and force it back to success because it was fail_ok, 1 try and marked as failed.

Ultimately the problem is the line I highlighted, right before this block the result from your scenario is:

[false,
 "Failed request bfd9f34a20c858ca9ae85332221e1cfd for puppet#status assertion failed on 2 node(s)", <data>]

this then sets it to successful and so it never makes it into error set. task behavior is right, exposing that result set to puppet is not - its this exposing that gets done right here and munges the data.

I appreciate the desire though but as things stand I don't think the basic intent with the design is to treat individual PRC replies that does not pass assertion as error, it operates at a higher level - it makes the whole task fail or not based on assertion over replies and this seems right to me. The fail_ok then being a task level thing also treats the resulting task as success which is what the line here does.

@ExalDraen
Copy link
Author

Hi @ripienaar ,

Thanks for the detail - I see what you mean now.

I see the subtle distinction you are making between assertion and RPC failure, but I consider this to be problematic on the following counts:

  1. It introduces a subtle difference in failure modes. In particular, it means that users have to have separate code to discover the nodes that have failed an RPC request vs. an assertion.
    The interaction between assertions, result set and fail_ok are not that obvious.

So you would need something like the following code to extract the set of nodes that failed RPC or assertion:

  $t4_res = choria::task(
    'action'    => 'puppet.last_run_summary',
    'nodes'     => $nodes,
    'fail_ok'   => true,
    'assert'    => 'summary.events.failure=0',
    'tries'     => 1,
    'silent'    => true,
  )
  # Work around https://github.com/choria-io/mcollective-choria/issues/617
  notice($t4_res)
  $fail_res = $t4_res.ok_set.filter |Choria::TaskResult $r| {$r['data']['summary']['events']['failure'] != 0 + $t4_res.error_set }
  $good_res = $t4_res.results.filter |Choria::TaskResult $r| {$r['data']['summary']['events']['failure'] == 0 }
  1. This difference in concept / design is not documented very clearly as far as I can see. We should at least document the subtle distinctions here - e.g. clarify that errors_set only ever contains nodes that fail the RPC request, not assertions, etc.

Since we are supporting assertions as a first class concept, maybe the solution is to return another result set that is the set of nodes for which an assertion failed.

Or alternatively, it could be a property of nodes in the ok_set?

@ripienaar
Copy link
Member

yeah the docs took me a long time to figure out too and I had to spelunk into the code to figure out the answer and what the intended behaviour is, we definitely have a huge gap in the docs here.

We can create something like a method on the Results set, something like:

$matched = $res.jgrep("summary.events.failure = 0")

Would that help a bit? At least that would be the same assertion language etc and feels a bit more natural?

@ExalDraen
Copy link
Author

I think that would help and be a decent addition.

It still leaves the potential for confusion around the concept of "failure", though.
A task can fail in at least two ways now: RPC failure and assertion failure and they're not distinguished in the docs or in log output but are distinct such as in the case discussed above.

To keep code changes minimal, how about this for now:

  1. Update the docs to clarify a) that assertion failures != rpc failures. Maybe clarify that a task can result in one of three states: OK, RPC fail, Assertion fail ? I think the main area in the docs is the bit that talks about the error_set. @ripienaar I don't know if it affects other things, if yes we should document those too.
  2. Add a way to gather results with failed assertions. Either the jgrep approach you outlined and/or a way to just retrieve results with failed assertions (like $res.assert_fail_set)
  3. Add an example in the tips and tricks section of the docs that gives an example of dealing with different kinds of failure

I can do the first task fairly easily, but I'm less sure about 2+3.

@ExalDraen
Copy link
Author

@ripienaar Now that the holidays are over I'm keen to get a resolution on this.

What do you think about the options I outlined? I can probably help with the implementation if you give me some pointers.

@ripienaar
Copy link
Member

I havnt had a chance to go through this again and think through all the implications. Gathering results on actual fails isnt possible without major plumbum, assert fails we can probably do and I like the jgrep helper in general

@ripienaar ripienaar transferred this issue from choria-legacy/mcollective-choria Jan 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants