Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OPA: improve error messages returned to the end user #796

Closed
wants to merge 4 commits into from

Conversation

aliculPix4D
Copy link

@aliculPix4D aliculPix4D commented Sep 25, 2024

Draft PR just to fish for some initial feedback and to show what changes are required here and in our ci-for-concourse repo..

Related to: https://github.com/Pix4D/ci-for-concourse/pull/18

@@ -10,7 +10,7 @@ import (
"testing"
)

func TestAccessor(t *testing.T) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems as copy pasta...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes. We contributing upstream, we can even split this PR in 2. The first one with this fix and the equivalent you did below. At least such PR should be accepted without problems.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My plan:

  • merge it to our fork as a single PR to our master (and release branch)... I will now read our instructions on release procedure
  • after we release and deploy I plan to prepare the PRs to upstream
  • then cherry pick specific commits and open 2 PRs upstream if your think its necessary to split this changes

@aliculPix4D
Copy link
Author

@marco-m-pix4d fishing for feedback

@aliculPix4D
Copy link
Author

aliculPix4D commented Sep 25, 2024

Two things I forgot to mention:

  1. in theory the same confusing error can also arise from (both for json and yaml):
    err = json.Unmarshal(body, &input.Data)

    Expect(checkErr.Error()).To(Equal(`invalid character 'h' looking for beginning of value`))

    I choose not to change anything there because in my understanding this can happen only if the request is crafted wrongly by one of the Concourse internal components or fly... which would mean there is an existing issue in the code.
    Ok, sure it can also happen if user uses curl to craft request and targets the internal API request directly (without going through fly). I mean we do that in reaper and and skyguide but not e.g. to set the pipeline etc..

In any case, its easy to do the same change here so, opinions?

  1. the PR as it is may or may not be enough... the end goal is to improve the error msg to the end user...

The question is: is this error reported back to the use if e.g. is using fly?

I will test this part now... I already know for example that fly expose-pipeline does not report any meaningful error back to the user, just prints forbidden with no reason (on the othehand you see the response in the reason in the WebUI)... This was one of the problems when I started with OPA actually because I was testing first with ExposePipeline rule...

For other commands, like fly set-pipeline I need to ensure that this msg reaches the end user in full... I don't remember seeing anything else than invalid character .... but looking already at existing code parsing opa results: invalid character... Anyway, I just wanted to make it clear that there is more to check and test here...

@@ -205,7 +205,7 @@ var _ = Describe("OPA Policy Checker", func() {
It("should return error", func() {
result, err := agent.Check(policy.PolicyCheckInput{})
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("invalid character 'h' looking for beginning of value"))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so ironic I thought you invented it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I know it's even worse... The code from current master would craft this error msg:

parsing opa results: invalid character 'h' looking for beginning of value

and this is not the error msg end-user was getting... but instead end user saw only:

invalid character 'p' looking for beginning of value

because another JSON parser got the body starting with parsing...
:)

Unbelievable...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because another JSON parser got the body starting with parsing...

Ah. Can we fix also this properly ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We did... 4th commit...

marco-m-pix4d
marco-m-pix4d previously approved these changes Sep 25, 2024
Copy link

@marco-m-pix4d marco-m-pix4d left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. It was easier than I though, better like this.

@aliculPix4D
Copy link
Author

Looks good. It was easier than I though, better like this.

@marco-m-pix4d please read also: #796 (comment)

@marco-m-pix4d
Copy link

Looks good. It was easier than I though, better like this.

@marco-m-pix4d please read also: #796 (comment)

Ah OK. The root cause is that the Go json library sometimes doesn't prefix its error messages with "json", and sometimes it does, so we even risk stuttering ("parsing JSON: json: xxx") :-(

So we can do the following:

Grep for all "invalid character 'h' looking for beginning of value`" in the tests for this area of the code and see if we can prefix the error with "parsing JSON" as we are doing here.

@aliculPix4D
Copy link
Author

aliculPix4D commented Sep 26, 2024

Update: I still didn't manage to fully fix this...

I don't know why I thought set-pipeline command was working...
Screenshot from 2024-09-26 09-26-27

it certainly does not work. If I add the --verbose flag I see the full error msg...

policy check error: parsing opa response: JSON parsing: error: json: cannot unmarshal array into Go value of type vars.StaticVariables. Hint: response from opa server is not a valid JSON
error: invalid character 'p' looking for beginning of value

So yes, to reach the end goal (improve the error msg to the end user) I must sort this with fly also...

Irony: fly expose-pipeline does print it properly. On the other hand, fly execute just doesn't report the reason why it was rejected when the parsing is correct.

@aliculPix4D
Copy link
Author

aliculPix4D commented Sep 26, 2024

So I managed to solve the issue with both (maybe all commands)... As I expected the difference of behaviour between commands arrived from usage of different clients:

  • some commands like set-pipeline use new httpClient

    response, err := team.httpAgent.Send(internal.Request{

    (FYI its even crazier than that: for fetching pipeline config (GET) you use deprecated connection client while for creating/updating (PUT) of the config an HTTP client is used)

  • other commands still use deprecated connection client

    err = team.connection.Send(internal.Request{

  1. for the connection client the cause and solution is quite straightforward and exactly what we discussed during standup:

currently if the response is forbidden by the e.g. OPA just the status code is returned without any reason.

  if response.StatusCode == http.StatusForbidden {
    return ErrForbidden
  }

what we actually need is:

  if response.StatusCode == http.StatusForbidden {
    body, _ := io.ReadAll(response.Body)
    return ForbiddenError{
      Reason: string(body),
    }
  }

to print also the reason from the response body.

Note that this simple fix actually affect many many commands. I would even say majority of the commands suffered from the same issue.

Note also that these commands using connection client don't have a problem when there is an BadRequest error:

return UnexpectedResponseError{

because they would return the proper error msg to the end user.

  1. for the set pipeline command things are a bit different than we expected

Even though we fixed the server side already, I was still saying: invalid character 'p' looking for beginning of value. OK, it was clear its another JSON parser, but which one?

The cause of the error arrives from:

err = json.Unmarshal(body, &validationErr)

So my explanation is this: originally you enter the case http.StatusBadRequest: when you fail to set/update the pipeline due to some configuration errors in the pipeline itself, so the Concourse server rejects it and returns a StatusBadRequest.
But they wanted to give a proper error msg to the end user so they decided to extract it from the body and convert it to JSON so that they can do something like:

var validationErr atc.SaveConfigResponse
err = json.Unmarshal(body, &validationErr)
return InvalidConfigError{Errors: validationErr.Errors}

The problem is that now with OPA (but maybe also due to other cases) one can not expect that body is in JSON format, and especially you should not expect that it can be converted to atc.SaveConfigResponse type.

so when server was returning a body with an error msg like:

Body: policy check error: parsing opa response: JSON parsing: error: invalid character 'h' looking for beginning of value. Hint: response from opa server is not a valid JSON

we were just returning the error msg from the JSON parser, which is this time:

invalid character 'p' looking for beginning of value

even though OPA policy checker returned:
policy check error: parsing opa response: JSON parsing: error: invalid character 'h' looking for beginning of value
because OPA server return a body hello we have: invalid character 'h' looking for beginning of value, but because the second JSON parser started to parse a body:

policy check error: parsing opa response: JSON parsing: error: invalid character 'h' looking for beginning of value. Hint: response from opa server is not a valid JSON

we have:

invalid character 'p' looking for beginning of value

seen by the end user...

My first attempt to fix this is:

    var validationErr atc.SaveConfigResponse
    err = json.Unmarshal(body, &validationErr)
    if err != nil {
      return false, false, []ConfigWarning{}, internal.UnexpectedResponseError{
        StatusCode: response.StatusCode,
        Status:     response.Status,
        Body:       string(body),
      }
    }

This prints what we want to the end user, but we lost the actual error msg from the json.Unmarshal just above...
Personally, I am fine with that... I am just not sure yet this is the best solution.

@marco-m-pix4d FYI

@aliculPix4D aliculPix4D force-pushed the pci-3852-improve-opa-error-msg branch 5 times, most recently from fe42ea1 to 8cb0c3c Compare September 26, 2024 13:04
@aliculPix4D
Copy link
Author

aliculPix4D commented Sep 26, 2024

Summary of changes:

  1. example: fly set-pipeline when OPA responded with 200 but with non-json body

What user saw before:

$ fly -t main set-pipeline -c pipeline.yml -p test

jobs:
  job job-1 has been added:
+ name: job-1
+ plan: null
  
pipeline name: test

apply configuration? [yN]: y

error: invalid character 'p' looking for beginning of value

What it sees now?

$ fly -t main set-pipeline -c pipeline.yml -p test

jobs:
  job job-1 has been added:
+ name: job-1
+ plan: null
  
pipeline name: test

apply configuration? [yN]: y

error: Unexpected Response
Status: 400 Bad Request
Body: policy check error: parsing OPA results: parsing JSON: invalid character 'h' looking for beginning of value
  1. example: fly expose-pipeline (and many other command) when OPA returned 200 and denied the action (correct json body).

What user saw before:

$ fly -t main expose-pipeline -p test
error: forbidden

What it sees now?

$ fly -t main expose-pipeline -p test
error: forbidden: policy check failed: 
 * because we said so

@@ -30,7 +30,7 @@ func ParseOpaResult(bytesResult []byte, opaConfig OpaConfig) (opaResult, error)
var results vars.StaticVariables
err := json.Unmarshal(bytesResult, &results)
if err != nil {
return opaResult{}, err
return opaResult{}, fmt.Errorf("parsing JSON: %w", err)
Copy link
Author

@aliculPix4D aliculPix4D Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I am still not 100% happy with the error msg user sees:

policy check error: parsing OPA results: parsing JSON: invalid character 'h' looking for beginning of value

for sure its better than before:

invalid character 'p' looking for beginning of value

because I am somehow lacking the information that this is because the response body from the OPA server is not a valid JSON.. or in other words that the problem is because OPA server being misconfigured and sending something unexpected...

Can we append something like: Hint: targeted OPA server seems to be misconfigured

Just as an example, I will give the error msg from the fly when Concourse is not reachable...

could not reach the Concourse server called developers:

    Get "https://concourse-web/api/v1/info": dial tcp XX.XX.XX.XX:443: i/o timeout

is the targeted Concourse running? better go catch it lol

e.g. this seems quite clear. Can we do something like that?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you think about it, from the point of view of the end user (the dev setting the pipeline), nothing changes:

  • policy check error: parsing OPA results: parsing JSON: invalid character 'h' looking for beginning of value
  • policy check error: parsing OPA results: parsing JSON: invalid character 'h' looking for beginning of value. Hint: targeted OPA server seems to be misconfigured

In both cases, the user will ask for help on the channel: he cannot do anything about it and he is still confused.

I really resist this "Hint" because it is adding a patch to something that is broken already.

If we really want to give a better message to the user, I think we can, because we control the system (Concourse). This is different from cogito: the system is GitHub that we do not control.

So we should be able to rework some more logic how Concourse interacts with the OPA server and detect the real problem. Once we do that, then we could report something like

  • policy check error: OPA server misconfigured

Now, I don't even understand what we mean by "misconfigured".:

  • Do we mean that the OPA server configuration is wrong? (so: independently from Concourse).
  • Do we mean that the Concourse settings about the OPA server is wrong?

To summarize, I think that this PR already made the situation way better and we should merge and deploy.

I am OK if we want to create a ticket, with some of the discussion here, to go more at the root of the problem (assuming I am not misunderstanding something).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Said in another way: this PR is too big to follow in details, at least for me. All the discussions here are useful, but we can still do the secondary improvements in a separate ticket, I think.

@aliculPix4D aliculPix4D dismissed marco-m-pix4d’s stale review September 26, 2024 13:47

The PR change quite a bit since the approval..

}
defer resp.Body.Close()

statusCode := resp.StatusCode
if statusCode != http.StatusOK {
return nil, fmt.Errorf("opa returned status: %d", statusCode)
return nil, fmt.Errorf("OPA server returned status: %d", statusCode)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to do something with the response body here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be useful e.g. when you configure wrong value for:
CONCOURSE_OPA_URL: http://opa:8181/v1/data/concourse/decision
and opa is there but the path is wrong or something?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine if we do as in other places and add also the body to the error.

- the existing names seem to be a copy paste mistake

Signed-off-by: aliculPix4D <[email protected]>
Majority of fly commands still use the deprecated "connection" client and when and HTTP request is forbidden by the ATC server e.g. when OPA action is not allowed (or any other internal Concourse reason) the error message is not printed to the end user, leaving them in questioning why. This is especially important when OPA block the action in question, because in that case its imperative the end-user see the reason why the action was blocked by the OPA.

Signed-off-by: aliculPix4D <[email protected]>
…o misconfigured OPA server) the response body is not a parsablble JSON, and hence instead of returning the JSON parser error, we need to return an actual response body containing the reason for the error. e.g. in case of OPA error the ATC server response body will never be a parsable JSON and instead its string (byte) starting with:

> policy check error: reason
This will always be unparsable by the JSON parser used in the HTTP client and instead of returning JSON parser error from the client itself (`invalid character 'p' looking for beginning of value`) we want to return the respons body error msg returned by the ATC server. This is especially important e.g. when a OPA server is misconfigured.

Instead of:
> error: invalid character 'p' looking for beginning of value
end user will see:

> error: Unexpected Response
> Status: 400 Bad Request
> Body: policy check error: reason

Signed-off-by: aliculPix4D <[email protected]>
@aliculPix4D aliculPix4D marked this pull request as ready for review September 26, 2024 14:24
@aliculPix4D aliculPix4D requested a review from a team as a code owner September 26, 2024 14:24
@aliculPix4D
Copy link
Author

@Pix4D/integration ready for review...

I split the work into 4 commits. The two very important ones (third and fourth) have longer commit msg explaining the reasoning why they are required (shorter version of my PR comments)

@@ -353,7 +353,7 @@ var _ = Describe("ATC Connection", func() {
})
})

Describe("403 response", func() {
Describe("HTTPAgent: 403 response", func() {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they had a test for 403 response for what they call HTTPagent client, but not for what they call Connection client

@aliculPix4D
Copy link
Author

@Pix4D/integration one more note... I think reading carefully:
#796 (comment)
#796 (comment)

before reviewing will help you a lot to understand what is going on...

Copy link

@marco-m-pix4d marco-m-pix4d left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left my detailed reasoning in my latest inline comment.
I think we can proceed (unless I am missing something).
Thanks!

@@ -96,7 +96,11 @@ func (team *team) CreateOrUpdatePipelineConfig(pipelineRef atc.PipelineRef, conf
var validationErr atc.SaveConfigResponse
err = json.Unmarshal(body, &validationErr)
if err != nil {
return false, false, []ConfigWarning{}, err
return false, false, []ConfigWarning{}, internal.UnexpectedResponseError{
Copy link
Author

@aliculPix4D aliculPix4D Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marco-m-pix4d I am curious to hear/read your opinion on this part...

if you follow closely:

  • before we were returning the err which is a useless error from the JSON parse on L97...
  • now we return internal.UnexpectedResponseError which I craft using response body and status code, but I never actually use the useless err anywhere... I think it is not needed to append this err.Error() because this error msg would always look the same invalid character.

Personally, I was uncomfortable just ignoring the err like that... We usually don't have this practice... but I don't think it would bring any value appending it.

What do you think?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the reasoning. As usual, the answer is that it depends.

The problem is that json.Unmarshal can, at least in theory, return any type of error, not only parse errors.
If the json package had an error sentinel, say ErrParse, then one could say:
I check for json.ErrParse. If it matches, then I do not pass it to internal.UnexpectedResponseError{}.
If on the other hand it doesn't match, then it can be helpful to know what has happened, so in this case I pass it to internal.UnexpectedResponseError{}.

Also if json.ErrParse existed, doing the reasoning just above would still be risky, because it is possible that a new version of the json package added another error sentinel, and so on...

So my approach is to almost always include errors. If I know like in this case that it can be confusing, I add it in parenthesis "(%s)" or similar...

@aliculPix4D
Copy link
Author

As discussed in video chat, I am closing this. I will split it in smaller separate PRs as suggested.

@aliculPix4D aliculPix4D deleted the pci-3852-improve-opa-error-msg branch October 7, 2024 14:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants