Is your feature request related to a problem? Please describe.
The core module needs a significant amount of heap memory to process the event log successfully.
After performing several iterations to enumerate bottlenecks, it was found that around 60% of the heap utilization is occupied by the Task accumulable updates.
The flow is as follows:
A taskEnd event is scanned.
For each accumulable in the taskEnd, a new entry (long, long) is added to AccumInfo.taskUpdateMap.
This map is kept in memory.
The map is accessed to aggregate the statistics of the accumulables across the tasks of a specific stage.
In the above flow, the per-task accumulable updates are stored only to build the statistics. Since the total is already stored as part of the StageAccumulable, taskUpdateMap represents a huge and avoidable overhead.
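The flow above can be illustrated with a small sketch. It is not the core's code: the class, function names, and the task and accumulable counts are hypothetical, and Python is used only for brevity. It shows why the retained entries grow with (number of tasks) x (number of accumulables), even though only a few per-stage statistics are read back.

```python
class AccumInfo:
    """Hypothetical stand-in for the core's AccumInfo."""

    def __init__(self):
        # taskId -> update value; one (long, long) entry per task is retained
        self.task_update_map = {}


def on_task_end(accums, task_id, updates):
    """Store every accumulable update carried by a taskEnd event.

    `updates` maps accumulable id -> value reported by this task.
    """
    for accum_id, value in updates.items():
        info = accums.setdefault(accum_id, AccumInfo())
        info.task_update_map[task_id] = value


def stage_stats(info, task_ids):
    """Aggregate the stored updates for the tasks of one stage."""
    values = [info.task_update_map[t] for t in task_ids if t in info.task_update_map]
    return {
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


# Illustrative numbers only: 1000 tasks, each reporting 20 accumulables.
accums = {}
for task_id in range(1000):
    on_task_end(accums, task_id, {accum_id: task_id for accum_id in range(20)})

retained_entries = sum(len(info.task_update_map) for info in accums.values())
stats = stage_stats(accums[0], range(1000))
```

In this sketch 20,000 entries stay in memory to produce three numbers per accumulable.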
Describe the solution you'd like
A quick workaround is to skip storing the individual task-to-accumulable updates. Instead, the core aggregates the statistics per stage on the fly.
This requires the following:
Approximate the moving average of the task updates.
With every task-end, update the statistics metrics of the stage. In other words, we create an extra statistics object per stage, but hopefully this still uses less memory than the task updates would.
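A minimal sketch of the on-the-fly aggregation, assuming the stage statistics are count, sum, min, max, and mean. The names `RunningStats`, `stage_stats`, and `on_task_end` are hypothetical and not part of the core. The mean is maintained incrementally, so no per-task value is retained. A statistic such as the median cannot be kept exactly this way and would need an approximation.

```python
class RunningStats:
    """Constant-size statistics for one (stage, accumulable) pair."""

    __slots__ = ("count", "total", "min", "max", "mean")

    def __init__(self):
        self.count = 0
        self.total = 0
        self.min = None
        self.max = None
        self.mean = 0.0

    def update(self, value):
        """Fold one task update into the statistics, then forget the value."""
        self.count += 1
        self.total += value
        self.min = value if self.min is None else min(self.min, value)
        self.max = value if self.max is None else max(self.max, value)
        # Incremental mean: new_mean = old_mean + (value - old_mean) / count
        self.mean += (value - self.mean) / self.count


# (stage_id, accum_id) -> RunningStats; size is independent of the task count
stage_stats = {}


def on_task_end(stage_id, updates):
    """Update the per-stage statistics for every accumulable in a taskEnd.

    `updates` maps accumulable id -> value reported by the finished task.
    """
    for accum_id, value in updates.items():
        stats = stage_stats.setdefault((stage_id, accum_id), RunningStats())
        stats.update(value)


# Five tasks of stage 0, each reporting a value for accumulable 7.
for value in (1, 2, 3, 4, 5):
    on_task_end(0, {7: value})

result = stage_stats[(0, 7)]
```

Here memory scales with the number of (stage, accumulable) pairs rather than with the number of tasks, which is the trade-off described above.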
Describe alternatives you've considered
We can free memory by offloading the taskUpdateMap to disk (using binary storage such as RocksDB or Parquet), then loading the updates by stageID as necessary.
This is the ultimate fix for the problem; however, it would require significant code rewriting and synchronization. Since the task-accumulable data is not used so far in the core, the effort would not be justified.