
Hangfire.ProRedis parse error seems to cause infinite retry #860

Closed
markalanevans opened this issue Apr 7, 2017 · 13 comments

@markalanevans

On one of our integration servers we silently started getting the error below. It wasn't easy to figure out what was causing it, so we had to wipe the Redis DB.

Unfortunately it caused thousands of recurring errors, and since we use the RayGun error logging service it pushed our plan up to the next tier. :(

What sort of things could possibly cause this?

Message: [ArgumentNullException: String reference not set to an instance of a String.
Parameter name: s]
 System.DateTimeParse.Parse(String s, DateTimeFormatInfo dtfi, DateTimeStyles styles):34
 System.DateTime.Parse(String s, IFormatProvider provider, DateTimeStyles styles):11
 Hangfire.Common.JobHelper.DeserializeDateTime(String value):0
 Hangfire.Pro.Redis.RedisConnection.GetJobData(String jobId):267
 Hangfire.States.BackgroundJobStateChanger.GetJobData(StateChangeContext context):0
 Hangfire.States.BackgroundJobStateChanger.ChangeState(StateChangeContext context):23
 Hangfire.Server.Worker.Execute(BackgroundProcessContext context):413
 Hangfire.Server.ServerProcessExtensions.Execute(IServerProcess process, BackgroundProcessContext context):51
 Hangfire.Server.AutomaticRetryProcess.Execute(BackgroundProcessContext context):26
@odinserj
Member

odinserj commented Apr 8, 2017

Oh dear, this can happen when the job:{job-id} hash key was removed or expired during a state change. So a call to flushdb while background processing is active, or a problem with an NTP server that moves the system time forward (not a DST transition), can cause this, because Redis doesn't use a monotonic clock for expirations. I'll make the following changes next week to prevent this in the future:

  1. Increase the default expiration time for non-initialised jobs, to be more robust against NTP problems.
  2. Start catching serialisation exceptions during the state change process, to prevent infinite retries.
  3. Add a default value for StartedAt in Hangfire.Pro.Redis, so it doesn't mask the actual problem (missing type/method/arguments data).

Duplicates #790.
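
To illustrate the failure mode: DateTime.Parse(null) is what throws in the trace above, once the hash field comes back empty. A minimal null-tolerant deserializer might look like the sketch below (an illustration only, not the actual Hangfire.Common.JobHelper implementation):

    using System;
    using System.Globalization;

    static class SafeJobHelper
    {
        // Sketch only: tolerate a missing hash field instead of letting
        // DateTime.Parse(null) throw ArgumentNullException, as in the
        // stack trace above.
        public static DateTime? DeserializeNullableDateTime(string value)
        {
            // If the job:{job-id} hash was removed or expired, the field
            // arrives as null; report "no value" rather than crashing.
            if (string.IsNullOrEmpty(value))
                return null;

            return DateTime.Parse(value, CultureInfo.InvariantCulture,
                DateTimeStyles.AdjustToUniversal | DateTimeStyles.AssumeUniversal);
        }
    }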

@odinserj odinserj added this to the 1.6.13 milestone Apr 8, 2017
@markalanevans
Author

That makes sense.

I noticed that the Redis instance did fill up: there were "workers" and hundreds of thousands of jobs queued up, but no jobs were being processed.

Thank you Sergey.

@aidanmorgan

I am also encountering this issue, with the same stack trace.

@odinserj
Member

Guys, please tell me which Hangfire packages you are using and their versions. This behavior could also be caused by a bug related to expiring jobs in a batch continuation, fixed in Hangfire.Pro 1.4.4. The problem is that a background job is expired or removed before the correct time. I've added a workaround for this to the upcoming Hangfire.Pro.Redis 2.1.0, but we need to know the exact problem.
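
If you want to check whether a particular job hash was removed or expired prematurely, you can inspect it directly. Here is a rough sketch using StackExchange.Redis; it assumes Hangfire's default "hangfire:" key prefix and a local server, so adjust both to your configuration:

    using System;
    using StackExchange.Redis;

    class JobHashCheck
    {
        static void Main(string[] args)
        {
            // Sketch: check whether a job hash still exists and what TTL
            // it carries. Prefix and connection string are assumptions.
            var jobId = args[0];
            using var redis = ConnectionMultiplexer.Connect("localhost:6379");
            var db = redis.GetDatabase();

            var key = (RedisKey)("hangfire:job:" + jobId);
            Console.WriteLine($"Exists: {db.KeyExists(key)}");
            Console.WriteLine($"TTL:    {db.KeyTimeToLive(key)?.ToString() ?? "(none)"}");

            // Dump the remaining fields; a healthy job hash includes the
            // serialized type/method/arguments data.
            foreach (var entry in db.HashGetAll(key))
                Console.WriteLine($"{entry.Name} = {entry.Value}");
        }
    }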

@aidanmorgan

Hangfire: 1.6.8
Hangfire.Core: 1.6.8
Hangfire.Pro.Redis: 2.0.6

@aidanmorgan

I just checked my production server: I have 300k jobs backed up and they will not dequeue. Is there a change I can make to get these to dequeue?

@aidanmorgan

Interestingly, if I restart the server it will dequeue 10-15 jobs and execute them, then it will no longer dequeue any more.

@aidanmorgan

I manually connected to the Redis server and removed the jobs that were showing a problem, and it appears that my processing has resumed (although it will take several hours to catch back up).

I think any fix also needs to handle the exception and allow the workers to keep running, perhaps transitioning the job into some "fault" state that can later be resumed. At the moment it has blocked over 350k jobs and caused downstream impact on my users.
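
A rough sketch of that kind of manual cleanup, again assuming StackExchange.Redis and the default "hangfire:" key prefix (the pattern and field names are illustrative; run it against a non-production copy first, since it permanently deletes job state, and note that queue or set entries referencing the deleted jobs may need cleanup too):

    using System;
    using StackExchange.Redis;

    class CleanupBrokenJobs
    {
        static void Main()
        {
            using var redis = ConnectionMultiplexer.Connect("localhost:6379");
            var server = redis.GetServer(redis.GetEndPoints()[0]);
            var db = redis.GetDatabase();

            // Scan for job hashes; the pattern also matches :state/:history
            // keys, so skip anything that is not a hash.
            foreach (var key in server.Keys(pattern: "hangfire:job:*"))
            {
                if (db.KeyType(key) != RedisType.Hash)
                    continue;

                // A healthy job hash carries serialized invocation data;
                // the broken jobs described above were missing it.
                if (db.HashGet(key, "Type").IsNullOrEmpty ||
                    db.HashGet(key, "Method").IsNullOrEmpty)
                {
                    Console.WriteLine($"Removing broken job hash: {key}");
                    db.KeyDelete(key);
                }
            }
        }
    }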

@odinserj
Member

I've just released Hangfire.Pro.Redis 2.1.0. Missing type and method information will not lead to an exception anymore; the job will be moved to the Failed state instead. I've increased the initial expiration timeouts to prevent job expiration during clock changes. So the consequences of such a problem will no longer stop the whole processing. I'll also add a protection layer to the Hangfire.Core package.

Thank you for reporting such a non-trivial problem!

@markalanevans
Author

markalanevans commented Apr 17, 2017

Sorry for the delayed response.

Our versions are:
Hangfire.Pro.Redis 2.0.6
Hangfire.Core 1.6.8

Looks like we have a little updating to do.

@odinserj I'll see about upgrading all of our Hangfire packages on Wed.

Thank you so much for your work on this.

-Mark

@aidanmorgan

I've deployed the new Redis library into our test environment, and it appears to skip the failed job for me.

@aidanmorgan

We have been running the new Redis library in production for three weeks now. The only issue seems to be that the "failed" jobs tab on the dashboard now returns a 404, but I will raise a separate issue for that.

@markalanevans
Author

Same here. So far so good.
