FAQ

Failed to start a job with legacy Nightfall: v1.x

If your job fails with the exception Exception in thread "main" java.lang.IncompatibleClassChangeError: class com.netflix.governator.lifecycle.AnnotationFinder has interface org.objectweb.asm.ClassVisitor as super class, it means you need to set the configuration spark.driver.userClassPathFirst=true.
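
For example, assuming the job is launched with spark-submit (the job class and jar name below are illustrative):

    # make the driver prefer the job's own jars over Spark's bundled ones,
    # which resolves the ASM/ClassVisitor conflict above
    spark-submit \
      --conf spark.driver.userClassPathFirst=true \
      --class com.example.MyNightfallJob \
      my-nightfall-job.jar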

Failed to start a job: v2.x

If your job fails with the exception Exception in thread "main" java.lang.IncompatibleClassChangeError: class com.netflix.governator.lifecycle.AnnotationFinder has interface org.objectweb.asm.ClassVisitor as super class, it means you should use the shaded jar of Nightfall; see add Nightfall to your project for more information.
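
A minimal sketch of what that looks like in a Maven pom.xml, assuming the shaded artifact is published under a shaded classifier (the coordinates below are illustrative; the real ones are on the add Nightfall to your project page):

    <!-- illustrative coordinates: check "add Nightfall to your project" for the real ones -->
    <dependency>
        <groupId>com.example</groupId>
        <artifactId>nightfall</artifactId>
        <version>2.0.0</version>
        <classifier>shaded</classifier>
    </dependency>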

Adjusting Stream Batch Interval and Maximum Consumption Rate

We spent a lot of time finding the settings that best balance the runtime of the micro-batches against the consumption of Kafka messages. The configuration that performs best in most scenarios is a 30-second batch interval with a maximum consumption rate of 1,000 messages per second, that is, 30,000 messages per batch interval.
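
As a sketch, assuming the rate is passed through plain Spark properties when submitting the job (whether the per-receiver or per-Kafka-partition key applies depends on how your job consumes Kafka, and the batch interval itself is set where the streaming context is created or through Nightfall's own job configuration):

    # receiver-based consumption: cap each receiver at 1,000 messages/second
    --conf spark.streaming.receiver.maxRate=1000
    # direct Kafka stream: the cap is per partition instead
    --conf spark.streaming.kafka.maxRatePerPartition=1000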

S3 Timeouts

If you see warnings like the one below, you can try to increase the write-ahead log batching timeout used when checkpointing to S3 (spark.streaming.driver.writeAheadLog.batchingTimeout), but change it with caution. Also see Reliability issues with Checkpointing/WAL in Spark Streaming 1.6.0:

2016-04-27 11:26:14,974 [WARN ] [JobGenerator] aming.scheduler.ReceivedBlockTracker Exception thrown while writing record: BatchCleanupEvent(ArrayBuffer()) to the WriteAheadLog. java.util.concurrent.TimeoutException: Futures timed out after [5000 milliseconds]
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
	at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
	at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
	at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
	at scala.concurrent.Await$.result(package.scala:107)
	at org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:81)
	at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:232)
	at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.cleanupOldBatches(ReceivedBlockTracker.scala:169)
	at org.apache.spark.streaming.scheduler.ReceiverTracker.cleanupOldBlocksAndBatches(ReceiverTracker.scala:223)
	at org.apache.spark.streaming.scheduler.JobGenerator.clearCheckpointData(JobGenerator.scala:285)
	at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:185)
	at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:87)
	at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:86)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
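
If you decide to raise it, the timeout is given in milliseconds (5000 is the value that produces the timeout above); for example, tripling it when submitting the job:

    # write-ahead log batching timeout, in milliseconds (raise with caution)
    --conf spark.streaming.driver.writeAheadLog.batchingTimeout=15000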

Also, it is not recommended to use S3 for checkpoints.

Executor stopped by resource manager

When your job consumes more memory than expected, the resource manager will stop it. You will see a message like this in your logs:

2016-04-27 01:37:49,032 [ERROR] [dispatcher-event-loop-4] he.spark.scheduler.TaskSchedulerImpl Lost executor 4784a4aa-0af9-4585-9c73-94abf0fa3086-S5 on <HOSTNAME>: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
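
If the cause really is memory pressure (rather than the network issues the message also mentions), a starting point is to give the executors more headroom; the values below are illustrative, and the exact overhead property depends on your cluster manager (spark.yarn.executor.memoryOverhead on YARN, spark.mesos.executor.memoryOverhead on Mesos):

    # illustrative values; tune them for your workload
    --conf spark.executor.memory=4g
    --conf spark.mesos.executor.memoryOverhead=512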

Memory leak with Spark

We have seen Spark behave as if it had a memory leak, but in fact it was the malloc arena configuration from glibc. We solved it by setting the environment variable MALLOC_ARENA_MAX=4. More details in Malloc per-thread arenas in glibc.
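
Spark lets you set environment variables for executors through spark.executorEnv.*, so one way to apply this (a sketch; the variable also needs to be exported in the environment that launches the driver) is:

    # cap glibc malloc arenas on the executors
    --conf spark.executorEnv.MALLOC_ARENA_MAX=4
    # and export it in the shell that launches the driver
    export MALLOC_ARENA_MAX=4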

Deleting checkpoints when configuration changes

Unfortunately, it is necessary to delete the checkpoint files when the configuration or code changes. To avoid having to erase them on every new release of our projects, we include the version number in the default checkpoint directory, so each release writes its checkpoints to a fresh path.
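
As a sketch of the idea, assuming a property-style job configuration with a hypothetical checkpoint-directory key:

    # hypothetical key and path, for illustration only:
    # the trailing "1.2.0" comes from the project version, so the next
    # release (e.g. 1.3.0) automatically starts from a fresh checkpoint path
    checkpoint.directory=hdfs:///checkpoints/my-job/1.2.0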