
Hangfire.ProRedis parse error seems to cause infinite retry #860

Closed
markalanevans opened this issue Apr 7, 2017 · 13 comments

@markalanevans

On one of our integration servers we silently started getting the error below. It wasn't easy to figure out what was causing it, so we had to wipe the Redis DB.

Unfortunately it caused thousands of recurring errors, and since we use the RayGun error logging service it pushed our plan up to the next tier. :(

What sort of things could possibly cause this?

Message: [ArgumentNullException: String reference not set to an instance of a String.
Parameter name: s]
 System.DateTimeParse.Parse(String s, DateTimeFormatInfo dtfi, DateTimeStyles styles):34
 System.DateTime.Parse(String s, IFormatProvider provider, DateTimeStyles styles):11
 Hangfire.Common.JobHelper.DeserializeDateTime(String value):0
 Hangfire.Pro.Redis.RedisConnection.GetJobData(String jobId):267
 Hangfire.States.BackgroundJobStateChanger.GetJobData(StateChangeContext context):0
 Hangfire.States.BackgroundJobStateChanger.ChangeState(StateChangeContext context):23
 Hangfire.Server.Worker.Execute(BackgroundProcessContext context):413
 Hangfire.Server.ServerProcessExtensions.Execute(IServerProcess process, BackgroundProcessContext context):51
 Hangfire.Server.AutomaticRetryProcess.Execute(BackgroundProcessContext context):26
@odinserj
Member

odinserj commented Apr 8, 2017

Oh dear, this can happen when the job:{job-id} hash key was removed or expired during a state change. So a call to flushdb while background processing is active, or a problem with an NTP server that moves the system time forward (not a DST transition), can cause this, because Redis doesn't use a monotonic clock for expirations. I'll make the following changes next week to prevent this in the future:

  1. Increase the default expiration time for non-initialised jobs, to be more robust against NTP problems.
  2. Start catching serialisation exceptions during the state change process, to prevent infinite retries.
  3. Add a default value for StartedAt in Hangfire.Pro.Redis, so it doesn't mask the actual problem (missing type/method/arguments data).

Duplicates #790.
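
To illustrate the failure mode: DateTime.Parse(null) is what throws in the trace above, once the hash field comes back empty. A minimal null-tolerant deserializer might look like the sketch below (an illustration only, not the actual Hangfire.Common.JobHelper implementation):

    using System;
    using System.Globalization;

    static class SafeJobHelper
    {
        // Sketch only: tolerate a missing hash field instead of letting
        // DateTime.Parse(null) throw ArgumentNullException, as in the
        // stack trace above.
        public static DateTime? DeserializeNullableDateTime(string value)
        {
            // If the job:{job-id} hash was removed or expired, the field
            // arrives as null; report "no value" rather than crashing.
            if (string.IsNullOrEmpty(value))
                return null;

            return DateTime.Parse(value, CultureInfo.InvariantCulture,
                DateTimeStyles.AdjustToUniversal | DateTimeStyles.AssumeUniversal);
        }
    }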

@odinserj odinserj added this to the 1.6.13 milestone Apr 8, 2017
@markalanevans
Author

That makes sense.

I noticed that the Redis instance did fill up: there were "workers" and hundreds of thousands of jobs queued up, but no jobs were being processed.

Thank you Sergey.

@aidanmorgan

I am also encountering this issue, with the same stack trace.

@odinserj
Member

Guys, please tell me which Hangfire packages you are using and their versions. This behavior could also be caused by a bug related to expiring jobs in a batch continuation, fixed in Hangfire.Pro 1.4.4. The problem is that a background job is expired or removed before the correct time. I've added a workaround for this to the upcoming Hangfire.Pro.Redis 2.1.0, but we need to know the exact problem.
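
If you want to check whether a particular job hash was removed or expired prematurely, you can inspect it directly. Here is a rough sketch using StackExchange.Redis; it assumes Hangfire's default "hangfire:" key prefix and a local server, so adjust both to your configuration:

    using System;
    using StackExchange.Redis;

    class JobHashCheck
    {
        static void Main(string[] args)
        {
            // Sketch: check whether a job hash still exists and what TTL
            // it carries. Prefix and connection string are assumptions.
            var jobId = args[0];
            using var redis = ConnectionMultiplexer.Connect("localhost:6379");
            var db = redis.GetDatabase();

            var key = (RedisKey)("hangfire:job:" + jobId);
            Console.WriteLine($"Exists: {db.KeyExists(key)}");
            Console.WriteLine($"TTL:    {db.KeyTimeToLive(key)?.ToString() ?? "(none)"}");

            // Dump the remaining fields; a healthy job hash includes the
            // serialized type/method/arguments data.
            foreach (var entry in db.HashGetAll(key))
                Console.WriteLine($"{entry.Name} = {entry.Value}");
        }
    }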

@aidanmorgan

Hangfire: 1.6.8
Hangfire.Core: 1.6.8
Hangfire.Pro.Redis: 2.0.6

@aidanmorgan

I just checked my production server: I have 300k jobs backed up and they will not dequeue. Is there a change I can make to get these to dequeue?

@aidanmorgan

Interestingly, if I restart the server it will dequeue 10-15 jobs and execute them, then it will no longer dequeue any more.

@aidanmorgan

I manually connected to the Redis server and removed the jobs that were showing a problem, and it appears that my processing has resumed (although it will take several hours to catch back up).

I think any fix also needs to handle the exception and allow the workers to keep running, perhaps transitioning the job into some "fault" state that can later be resumed. At the moment it has blocked over 350k jobs and caused downstream impact on my users.
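
A rough sketch of that kind of manual cleanup, again assuming StackExchange.Redis and the default "hangfire:" key prefix (the pattern and field names are illustrative; run it against a non-production copy first, since it permanently deletes job state, and note that queue or set entries referencing the deleted jobs may need cleanup too):

    using System;
    using StackExchange.Redis;

    class CleanupBrokenJobs
    {
        static void Main()
        {
            using var redis = ConnectionMultiplexer.Connect("localhost:6379");
            var server = redis.GetServer(redis.GetEndPoints()[0]);
            var db = redis.GetDatabase();

            // Scan for job hashes; the pattern also matches :state/:history
            // keys, so skip anything that is not a hash.
            foreach (var key in server.Keys(pattern: "hangfire:job:*"))
            {
                if (db.KeyType(key) != RedisType.Hash)
                    continue;

                // A healthy job hash carries serialized invocation data;
                // the broken jobs described above were missing it.
                if (db.HashGet(key, "Type").IsNullOrEmpty ||
                    db.HashGet(key, "Method").IsNullOrEmpty)
                {
                    Console.WriteLine($"Removing broken job hash: {key}");
                    db.KeyDelete(key);
                }
            }
        }
    }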

@odinserj
Member

I've just released Hangfire.Pro.Redis 2.1.0. Missing type and method information will not lead to an exception anymore; the job will be moved to the Failed state instead. I've increased the initial expiration timeouts to prevent job expiration during clock changes. So the consequences of such a problem will no longer stop the whole processing. I'll also add a protection layer to the Hangfire.Core package.

Thank you for reporting such a non-trivial problem!

@markalanevans
Author

markalanevans commented Apr 17, 2017

Sorry for the delayed response.

Our versions are:
Hangfire.Pro.Redis 2.0.6
Hangfire.Core 1.6.8

Looks like we have a little updating to do.

@odinserj I'll see about upgrading all of our Hangfire packages on Wed.

Thank you so much for your work on this.

-Mark

@aidanmorgan

I've deployed the new Redis library into our test environment, and it appears to skip the failed job for me.

@aidanmorgan

We have been running the new Redis library in production for three weeks now. The only issue seems to be that the "failed" jobs tab on the dashboard now returns a 404, but I will raise a separate issue for that.

@markalanevans
Author

Same here. So far so good.
