Skip to content

Commit

Permalink
Check archived jobs for last known job number before creating cluster
Browse files Browse the repository at this point in the history
We ran into a bug where we deleted a job cluster and then recreated the
job cluster with the same name.  The old job cluster had 4 stages and
the new one had two.  When a job was completed, it would write to the
archived tables.  However, there already existed a job cluster there
with the same ID.  The KV provider only overwrote the rows for stages 1
and 2. It did not delete the values for stages 3 and 4.

When Mantis tried to load the archived job, it would see job metadata
indicating 2 stages, but then would receive 4 stages (two from the new
job and 4 from the old job).  This would lead to the Mantis not loading
the job.

We could probably consider this a bug in the Dynamo KV Provider, _but_
it felt like we don't want to overwrite archived jobs in any scenario
since we'd like to maintain a record of those jobs.  Instead, the
problem is further upstream.  When we create a job, we should be
reasonably confident that the Job ID is globally unique.  However, when
creating a job cluster, the `lastJobCount` value is always set to 0.  We
should instead check if there are any archived jobs with the same
cluster name.  If so, we should grab the last value and set that as the
last known job number.

We desire the following scenario

1. Create a job cluster "MyJob"
2. Create a job "MyJob-1"
3. Delete the job and job cluster
4. Create another job cluster MyJob
5. Create a job "MyJob-2" instead of "MyJob-1"

Previously, we would have an archived job "MyJob-1" and an active job
"MyJob-1" that are distinct.  Stopping the active one would overwrite
the archived one.
  • Loading branch information
timmartin-stripe committed Jan 3, 2025
1 parent a18a200 commit e8dadba
Show file tree
Hide file tree
Showing 4 changed files with 87 additions and 52 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -447,7 +447,18 @@ public void onJobClusterCreate(final CreateJobClusterRequest request) {
try {
Optional<JobClusterInfo> jobClusterInfoO = jobClusterInfoManager.createClusterActorAndRegister(request.getJobClusterDefinition());
if (jobClusterInfoO.isPresent()) {
jobClusterInfoManager.initializeClusterAsync(jobClusterInfoO.get(), new JobClusterProto.InitializeJobClusterRequest(request.getJobClusterDefinition(), request.getUser(), getSender()));
// Check if the job cluster name used to exist but was deleted
// If it was, use the last known job number from that cluster instead of starting from 0
// This ensures that there are no conflicts between archived jobs and old jobs
// e.g. If a job cluster with the name "MyJobCluster" was deleted, but had run a job, then
// we'd have a Job ID of `MyJobCluster-1` in the archived jobs table. When the new job cluster
// came online it may partially overwrite the old archived job if we started a new job with ID MyJobCluster-1.
List<CompletedJob> completedJobs = jobStore.loadCompletedJobsForCluster(request.getJobClusterDefinition().getName(), 1, null);
long lastJobNum = completedJobs.stream()
.map((job) -> JobId.fromId(job.getJobId()).map(JobId::getJobNum).orElse(0L))
.max(Long::compareTo)
.orElse(0L);
jobClusterInfoManager.initializeClusterAsync(jobClusterInfoO.get(), new JobClusterProto.InitializeJobClusterRequest(request.getJobClusterDefinition(), lastJobNum, request.getUser(), getSender()));
} else {
getSender().tell(new CreateJobClusterResponse(
request.requestId, CLIENT_ERROR,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -74,11 +74,12 @@ public InitializeJobClusterRequest(final JobClusterDefinitionImpl jobClusterDefi
/**
* Invoked during Job Cluster Creation
* @param jobClusterDefinition
* @param lastJobNumber
* @param user
* @param requestor
*/
public InitializeJobClusterRequest(final JobClusterDefinitionImpl jobClusterDefinition, String user, ActorRef requestor) {
this(jobClusterDefinition, false, 0, Lists.newArrayList(), user, requestor, true);
public InitializeJobClusterRequest(final JobClusterDefinitionImpl jobClusterDefinition, long lastJobNumber, String user, ActorRef requestor) {
this(jobClusterDefinition, false, lastJobNumber, Lists.newArrayList(), user, requestor, true);

}

Expand Down
Loading

0 comments on commit e8dadba

Please sign in to comment.