fix: prevent concurrent build to edit the same namespace #6953
Conversation
@@ -189,6 +189,12 @@ func (o *StepHelmApplyOptions) Run() error {
		defer os.RemoveAll(rootTmpDir)
	}

	if release, err := kube.AcquireBuildLock(kubeClient, devNs, ns); err != nil {
		return errors.Wrapf(err, "fail to acquire the lock")
	} else {
if block ends with a return statement, so drop this else and outdent its block (move short variable declaration to its own line if necessary)
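In other words, the linter wants the happy path unindented, along these lines (a sketch only: `release` is assumed to be the unlock function returned by `AcquireBuildLock`, and the `log.Logger().Warnf` call just mirrors the logging style seen elsewhere in this diff):

```go
	release, err := kube.AcquireBuildLock(kubeClient, devNs, ns)
	if err != nil {
		return errors.Wrapf(err, "fail to acquire the lock")
	}
	defer func() {
		// release returns an error; real code should at least log it
		if err := release(); err != nil {
			log.Logger().Warnf("failed to release the build lock: %s", err)
		}
	}()
```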
pkg/kube/build_lock.go
Outdated
	"jenkins-x.io/kind": "build-lock",
}

// Acquires a build lock, to avoid other builds to edit the same namespace
comment on exported function AcquireBuildLock should be of the form "AcquireBuildLock ..."
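That is, something along these lines (the signature is the one quoted later in this thread):

```go
// AcquireBuildLock acquires a build lock, to avoid other builds editing the same namespace.
// It returns a function to release the lock (to be called in a defer).
func AcquireBuildLock(kubeClient kubernetes.Interface, devNamespace, namespace string) (func() error, error) {
	// ...
}
```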
Woo! Happy to see this being worked on. I'll get you the failed build logs in a bit. The labels you're looking at are consistent for all pipelines. I'll look into the env vars exposing pod info, but I think that will probably need to be dealt with in Tekton.
So a quick bit on the failed tests - when booting for the bdd tests, …
/test bdd
Ok, the env vars are annoyingly different on static masters, and, frankly, we don't care a ton about static masters since we're removing them next month. I'm trying to think of an env var we could look for that would signify "this is running in a static master" that we could just use to ignore the locking behavior when running in a Jenkins pipeline via a static master.
Ah-ha - better than an env var, we could just check Line 129 in a306c29 (…) in AcquireLock and return a no-op function if it's false.
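In other words, something like this sketch at the top of `AcquireBuildLock` (the exact flag or field to check is not settled in this thread, so `tektonEnabled` is a placeholder):

```go
	// if locking does not apply (e.g. not running in Tekton), hand back a
	// release function that does nothing, so callers can defer it unconditionally
	if !tektonEnabled { // hypothetical condition
		return func() error { return nil }, nil
	}
```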
I am wondering whether it would make sense to write some integration tests for …
Force-pushed from 6eb52d7 to b477808 (compare)
I updated the PR (I forgot to fix the commit message, will fix next update). I hopefully fixed the integration test.
I totally agree! But writing tests takes time; I wanted to check if people agree with my fix first.
Force-pushed from b477808 to 645758c (compare)
pkg/kube/build_lock_test.go
Outdated
}

// build_lock_check checks if the lock configmap is matching the given pod
func build_lock_check(t *testing.T, client kubernetes.Interface, namespace string, pod *v1.Pod) {
don't use underscores in Go names; func build_lock_check should be buildLockCheck
pkg/kube/build_lock_test.go
Outdated
// returns a chan that is filled once AcquireBuildLock returns
// its item will perform some check and call the callback
// its item is nil on timeout
func build_lock_acquire(t *testing.T, client kubernetes.Interface, namespace string, pod *v1.Pod, fails bool) (func(), chan func()) {
don't use underscores in Go names; func build_lock_acquire should be buildLockAcquire
pkg/kube/build_lock_test.go
Outdated
// build_lock_env prepares the environment for calling AcquireBuildLock
// returns a defer function to restore the environment
func build_lock_env(t *testing.T, owner, repository, branch, build string) func() {
don't use underscores in Go names; func build_lock_env should be buildLockEnv
pkg/kube/build_lock_test.go
Outdated
}

// build_lock_timestamp create the timestamp for a lock, now plus or minus some minutes
func build_lock_timestamp(minutes int) string {
don't use underscores in Go names; func build_lock_timestamp should be buildLockTimestamp
pkg/kube/build_lock_test.go
Outdated
}

// build_lock_lock creates a lock that matches a pod
func build_lock_lock(t *testing.T, client kubernetes.Interface, namespace string, pod *v1.Pod, minutes int) *v1.ConfigMap {
don't use underscores in Go names; func build_lock_lock should be buildLockLock
pkg/kube/build_lock_test.go
Outdated
var build_lock_uid int = 1 << 20 // the pid of out fake pods
// build_lock_pod creates a running pod, looking close enough to a pipeline pod
func build_lock_pod(t *testing.T, client kubernetes.Interface, owner, repository, branch, build string) *v1.Pod {
don't use underscores in Go names; func build_lock_pod should be buildLockPod
pkg/kube/build_lock_test.go
Outdated
	return c
}

var build_lock_uid int = 1 << 20 // the pid of out fake pods
don't use underscores in Go names; var build_lock_uid should be buildLockUID
pkg/kube/build_lock_test.go
Outdated
}

// build_lock_count_watch count watchers for synchronization reasons
func build_lock_count_watch(client *fake.Clientset) chan int {
don't use underscores in Go names; func build_lock_count_watch should be buildLockCountWatch
pkg/kube/build_lock_test.go
Outdated
}

// build_lock_client creates a fake client with a fake tekton deployment
func build_lock_client(t *testing.T) *fake.Clientset {
don't use underscores in Go names; func build_lock_client should be buildLockClient
Force-pushed from 645758c to c1047b4 (compare)
Force-pushed from c1047b4 to 6bc87f4 (compare)
I've created tests to try most situations, and consequently fixed a bug.
I have no idea why lint is failing; it seems to work fine locally (or at least as fine as it used to).
/test boot-vault
FYI, lint is failing due to: …

And …
Force-pushed from 6bc87f4 to 754df94 (compare)
Ok so this should fix …
go.mod
Outdated
github.com/go-openapi/jsonreference v0.19.2
github.com/go-openapi/spec v0.19.3
github.com/go-stack/stack v1.8.0 // indirect
github.com/go-openapi/jsonreference v0.19.3
is there anything in your code which required an update of the dependencies? I don't think so, right? It would be nice to take these version changes out of the PR.
Those dependencies were updated by `make lint` for a reason I don't know (or by `gofmt`, not sure anymore). I'll try to clean most of them then.
pkg/kube/build_lock.go
Outdated
	return nil, err
}
// Find our pod
podList, err := kubeClient.CoreV1().Pods(devNamespace).List(metav1.ListOptions{
I am struggling a bit with why we need to tie the locking via a ConfigMap to a running pod. Even though uncommon, I think about the case where the step is executed locally; `jx boot` comes to mind.

Could we not just work with the ConfigMap? I think that's also what @ccojocar suggested in a comment on the issue.
The idea to use a ConfigMap is because we need an atomic operation at the Kubernetes level. Only create-update-delete are atomic as far as I know. Looking at the present pods is not a strong guarantee. Plus the ConfigMap is a playground for other jobs to negotiate which one will be next.

The ConfigMap is linked to a pod so that we can know if the pipeline is still running, in case the pipeline did not have the opportunity to clean the ConfigMap (for instance with `jx stop pipeline`). If not, the lock could block every future build.

I agree that this mechanism is not adapted to local commands like `jx boot`. But first, those are manual commands; there is no reason for them to be launched twice. Second, I think blocking `jx boot` because another one may be running (and we don't have any way to be sure) could be horribly wrong when somebody is trying to fix a failed `jx boot`.
> The idea to use a ConfigMap is because we need an atomic operation at the Kubernetes level. Only create-update-delete are atomic as far as I know.

I am onboard with the ConfigMap. I think that's the right thing to do.

> The ConfigMap is linked to a pod so that we can know if the pipeline is still running,

That's the part I am not so sure about: the linking to a pod.

> If not, the lock could block every future build.

Well, right. There is a potential for a lock not getting cleaned up if something crashes, but I was more thinking about working with timeouts. If there is a lock which does not get cleaned up and another build exceeds its timeout, it will fail with an appropriate error message. And yes, one then needs to either manually delete the lock (which might be the first approach) or eventually build something into `jx`. I think that would also make things a bit simpler and solve the running-locally case as well.

> Second, I think blocking `jx boot` because another one may be running ...

I am not so sure. There is actually an open issue for this as well. Given that the outcome of two concurrent boot runs would be unclear at best, I'd argue a `jx boot` needs to be locked as well. If you try to "fix" a failed install and there are locking issues, I would expect that you sort them out first.

@jenkins-x/core thoughts?
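For readers following the atomicity argument, here is a minimal sketch (a hypothetical helper, not the PR's exact code; the lock name and label come from the PR description and the quoted diff) of using `Create` as a try-lock:

```go
import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// tryCreateLock attempts to take the lock: Create either succeeds or fails
// with AlreadyExists, so exactly one build can own jx-lock-{namespace}.
func tryCreateLock(kubeClient kubernetes.Interface, devNamespace, namespace string, build map[string]string) (bool, error) {
	lock := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:   fmt.Sprintf("jx-lock-%s", namespace),
			Labels: map[string]string{"jenkins-x.io/kind": "build-lock"},
		},
		Data: build, // owner/repository/branch/build of this pipeline
	}
	_, err := kubeClient.CoreV1().ConfigMaps(devNamespace).Create(lock)
	if err == nil {
		return true, nil // we own the lock
	}
	if apierrors.IsAlreadyExists(err) {
		return false, nil // another build holds it: read, compare, wait or fail
	}
	return false, err
}
```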
// Returns a function to release the lock (to be called in a defer)
// Returns an error if a newer build is already running, or if an error happened
func AcquireBuildLock(kubeClient kubernetes.Interface, devNamespace, namespace string) (func() error, error) {
	// Only lock if running in Tekton
this is a heck of a function, I am wondering whether it could not be split up a bit?
Sure it can be, I'll work on that.
	lock.Labels[k] = v
}
// this loop continuously tries to create the lock
Create:
hmmm, do we really need to use labels?
We never have to use labels (same as you never have to use GOTO). But in that case, we will instead have to signify which loop should break and which one should continue. I don't think it will make the algorithm clearer. The algorithm is basically a state machine, and those are well described with GOTO ...
Oh sorry, did you mean labels on the lock? Well, most of them let you know who is locking currently. They are similar to the ones on the pod. I also added a label to be sure that builds are not blocked by another configmap that shouldn't be here (or maybe later another version of the same lock), which would cause a bug due to a different format. Nothing mandatory, just felt like it was cleaner.
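For context on the labels-versus-nested-loops point, a hedged sketch (the names and the polling shortcut are mine, not the PR's; it uses the same imports as the sketch above, plus "time") of the labeled-loop style being defended:

```go
// the "Create" label lets nested checks jump straight back to the
// lock-creation attempt, which reads like the state machine described above
func acquireSketch(kubeClient kubernetes.Interface, devNamespace string, lock *corev1.ConfigMap) error {
Create:
	for {
		_, err := kubeClient.CoreV1().ConfigMaps(devNamespace).Create(lock)
		if err == nil {
			return nil // we own the lock
		}
		if !apierrors.IsAlreadyExists(err) {
			return err
		}
		// another build holds the lock; the real code reads it, compares builds
		// and watches for changes, then decides to retry, wait or fail
		time.Sleep(10 * time.Second) // simplified: poll instead of watch
		continue Create
	}
}
```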
Force-pushed from 754df94 to 61fff74 (compare)
	log.Logger().Infof("locking pod %s finished", owner.Name)
	return nil, nil
// an error while getting the pod
} else {
if block ends with a return statement, so drop this else and outdent its block
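For reference, the check being discussed around this snippet boils down to something like the following (the helper name is hypothetical; the PR appears to do this inline):

```go
import corev1 "k8s.io/api/core/v1"

// lockingPodFinished reports whether the pod owning the lock is done
// (Succeeded or Failed), in which case a waiting build may delete the ConfigMap.
func lockingPodFinished(pod *corev1.Pod) bool {
	if pod == nil {
		return true
	}
	return pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed
}
```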
OK I've restored … I'm actually not clear how … Then I'll implement 2 mechanisms: if there is a pod (ie no …
Force-pushed from 61fff74 to ca8672a (compare)
/retest
So as said before, there are now 2 versions of the lock: …

This should make …
func watchBuildLock(kubeClient kubernetes.Interface, lock *v1.ConfigMap, pod *v1.Pod, build map[string]string) (*v1.ConfigMap, error) {
	log.Logger().Infof("waiting for updates on the lock configmap %s", lock.Name)
	// watch a timer for expiration
	var expChan <-chan time.Time
👍
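The `expChan` line relies on the nil-channel idiom; here is a hedged sketch (the `watcher` and `expires` parameters are assumptions, not the PR's exact signature):

```go
import (
	"time"

	"k8s.io/apimachinery/pkg/watch"
)

// waitForLockEvent waits for either the lock to expire or the watched
// ConfigMap/pod to change; a nil channel blocks forever, so the expiry
// case only fires when a deadline has actually been set.
func waitForLockEvent(watcher watch.Interface, expires time.Time) {
	var expChan <-chan time.Time
	if !expires.IsZero() {
		expChan = time.After(time.Until(expires))
	}
	select {
	case <-expChan:
		// the lock expired: the waiting build may try to take over
	case event := <-watcher.ResultChan():
		_ = event // something changed: re-read the lock and re-evaluate
	}
}
```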
Some minor nitpicks. I like this much better now. Personally, I would only use time-based locks, but I can see the benefits of tying this to the pods.

	"github.com/stretchr/testify/require"
)

func Test_compareBuildLocks(t *testing.T) {
:+1:
pkg/kube/build_lock_test.go
Outdated
	return clean, c
}

func builLockNoLock(t *testing.T, client kubernetes.Interface, namespace string) {
I guess you are missing a 'd' here - buildLockNoLock
Actually, instead of buildLockNoLock, is this not more of an assertNoLock or expectNoLock?
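Something like this hedged sketch, for instance (the lock name comes from the PR description; whether the ConfigMap lives in this namespace is an assumption):

```go
// assertNoLock fails the test if the jx-lock-{namespace} ConfigMap still exists
func assertNoLock(t *testing.T, client kubernetes.Interface, namespace string) {
	_, err := client.CoreV1().ConfigMaps(namespace).Get("jx-lock-"+namespace, metav1.GetOptions{})
	require.True(t, apierrors.IsNotFound(err), "expected no build lock in namespace %s", namespace)
}
```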
pkg/kube/build_lock_test.go
Outdated
}

// builLockCheckLock checks if the lock configmap is correct
func builLockCheckLock(t *testing.T, client kubernetes.Interface, namespace, owner, repository, branch, build string) {
same as above
Actually, instead of builLockCheckLock, is this not more of an assertLock, expectLock or verifyLock?
pkg/kube/build_lock_test.go
Outdated
}

// builLockCheckLockFromPod checks if the lock configmap is matching the given pod
func builLockCheckLockFromPod(t *testing.T, client kubernetes.Interface, namespace string, pod *v1.Pod) {
same as above
This PR prevents builds from editing the same namespace at the same time. Such a behavior could lead to a namespace being accidentally wiped out. Only `jx step helm apply` is locked.

A ConfigMap `jx-lock-{namespace}` is used as a lock. No other build can run while this configmap exists. Waiting builds can edit the data of the ConfigMap in order to be the next one to run. If a build sees that the locking or waiting build is "higher", it will fail. When the build finishes, the ConfigMap is removed. A waiting build can also remove the ConfigMap if the locking pod has finished.

The algorithm is approximately:

```
Label: CREATE
  try to create the ConfigMap
  if it succeeds: return
Label: READ
  get the ConfigMap and the locking Pod
  if the locking Pod has finished
    remove the ConfigMap
    goto CREATE
  if the ConfigMap references a "higher" build
    fail
  if the ConfigMap references a "lower" build
    update the ConfigMap
  wait for the ConfigMap or the Pod to be updated
    if the ConfigMap is deleted
      goto CREATE
    if the ConfigMap references a different build
      goto READ
    if the Pod has finished
      goto CREATE
```

fixes jenkins-x#6167

Signed-off-by: Aurélien Lambert <[email protected]>
Force-pushed from ca8672a to 4a1ab6e (compare)
}

// buildLock_AssertLockFromPod checks if the lock configmap is matching the given pod
func buildLock_AssertLockFromPod(t *testing.T, client kubernetes.Interface, namespace string, pod *v1.Pod) {
don't use underscores in Go names; func buildLock_AssertLockFromPod should be buildLockAssertLockFromPod
}

// buildLock_AssertLock checks if the lock configmap is correct
func buildLock_AssertLock(t *testing.T, client kubernetes.Interface, namespace, owner, repository, branch, build string) {
don't use underscores in Go names; func buildLock_AssertLock should be buildLockAssertLock
	return clean, c
}

func buildLock_AssertNoLock(t *testing.T, client kubernetes.Interface, namespace string) {
don't use underscores in Go names; func buildLock_AssertNoLock should be buildLockAssertNoLock
// returns a chan that is filled once AcquireBuildLock returns
// its item will perform some check and call the callback
// its item is nil on timeout
func buildLock_AcquireFromPod(t *testing.T, client kubernetes.Interface, namespace string, pod *v1.Pod, fails bool) (func(), chan func()) {
don't use underscores in Go names; func buildLock_AcquireFromPod should be buildLockAcquireFromPod
// returns a chan that is filled once AcquireBuildLock returns
// its item will perform some check and call the callback
// its item is nil on timeout
func buildLock_Acquire(t *testing.T, client kubernetes.Interface, namespace, owner, repository, branch, build string, fails bool) (func(), chan func()) {
don't use underscores in Go names; func buildLock_Acquire should be buildLockAcquire
}

// buildLock_Lock creates a lock
func buildLock_Lock(t *testing.T, client kubernetes.Interface, namespace, owner, repository, branch, build string, minutes int, expires time.Duration) *v1.ConfigMap {
don't use underscores in Go names; func buildLock_Lock should be buildLockLock
var buildLock_UID int = 1 << 20 // the pid of out fake pods
// buildLock_Pod creates a running pod, looking close enough to a pipeline pod
func buildLock_Pod(t *testing.T, client kubernetes.Interface, owner, repository, branch, build string) *v1.Pod {
don't use underscores in Go names; func buildLock_Pod should be buildLockPod
	return c
}

var buildLock_UID int = 1 << 20 // the pid of out fake pods
don't use underscores in Go names; var buildLock_UID should be buildLockUID
}

// buildLock_CountWatch count watchers for synchronization reasons
func buildLock_CountWatch(client *fake.Clientset) chan int {
don't use underscores in Go names; func buildLock_CountWatch should be buildLockCountWatch
}

// buildLock_Client creates a fake client with a fake tekton deployment
func buildLock_Client(t *testing.T) *fake.Clientset {
don't use underscores in Go names; func buildLock_Client should be buildLockClient
I had to write those "hard to read" names because hound seems to care quite a lot about function names in tests. I reverted them to a more readable form. The prefix is to avoid collisions with other tests in the package.
/lgtm
@aure-olli thanks a lot for your pull request and your patience. Much appreciated.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hferentschik

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing …
Submitter checklist

Description

This PR prevents builds from editing the same namespace at the same time. Such a behavior could lead to a namespace being accidentally wiped out.

Only `jx step helm apply` is locked.

Special notes for the reviewer(s)

A ConfigMap `jx-lock-{namespace}` is used as a lock. No other build can run while this configmap exists. Waiting builds can edit the data of the ConfigMap in order to be the next one to run. If a build sees that the locking or waiting build is "higher", it will fail. When the build finishes, the ConfigMap is removed. A waiting build can also remove the ConfigMap if the locking pod has finished.

The algorithm is approximately the one described in the commit message above.

Which issue this PR fixes

fixes #6167

Issues

- It is not straightforward to know which pod `jx` is running in. The pipeline could be changed to give such an info: https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/ For now the pod is found with the labels `owner={REPO_OWNER},repository={REPO_NAME}s,branch={BRANCH_NAME},build={BUILD_NUMBER},jenkins.io/pipelineType=build`, using the headers. I'm afraid such a filter is specific to specific kinds of pipelines.
- The lock could maybe apply to other `jx` commands. But I'm not sure which ones need it.
- `make lint` made a huge update of `go.mod`. No idea why.