[TASK] Improve core's memory utilization by reducing storage of Accum task updates #1545

Open · Tracked by #815
amahussein opened this issue Feb 13, 2025 · 0 comments · May be fixed by #1543
Labels: core_tools (Scope the core module (scala)), performance (performance and scalability of tools)

@amahussein (Collaborator)

Is your feature request related to a problem? Please describe.

The core module uses a significant amount of heap memory to process an eventlog successfully.
After several iterations to enumerate the bottlenecks, it was found that around 60% of the heap utilization is occupied by the task accumulable updates.

The flow is as follows:

  • A taskEnd event is scanned.
  • For each accumulable in the taskEnd, a new entry (long, long) is added to AccumInfo.taskUpdateMap.
  • This map is kept in memory.
  • The map is accessed to aggregate the statistics of the accumulables across the tasks of a specific stage.

In the above flow, the storage of the task accumulables is needed only to build the stats. Given that the total is already stored as part of the StageAccumulable, it is clear that this taskUpdateMap represents a huge overhead.
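For illustration, here is a simplified sketch of the current flow (hypothetical names, not the actual AccumInfo implementation), showing how one entry per task accumulates in memory and is only ever read back to compute stage-level statistics:

```scala
import scala.collection.mutable

// Simplified stand-in for AccumInfo: one (taskId -> update) entry is
// retained per task that touched this accumulable.
class AccumInfoSketch {
  val taskUpdateMap: mutable.Map[Long, Long] = mutable.HashMap.empty[Long, Long]

  // Called for every accumulable carried by a taskEnd event.
  def addTaskUpdate(taskId: Long, update: Long): Unit =
    taskUpdateMap(taskId) = update

  // The only consumer of the map: aggregate stats over the tasks of a stage.
  def statsFor(stageTaskIds: Set[Long]): Option[(Long, Long, Long, Double)] = {
    val updates = taskUpdateMap.collect { case (id, v) if stageTaskIds(id) => v }
    if (updates.isEmpty) None
    else Some((updates.min, updates.max, updates.sum, updates.sum.toDouble / updates.size))
  }
}
```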

Describe the solution you'd like

A quick workaround is to skip storing the individual task-to-accumulable updates. Instead, the core aggregates the statistics per stage on the fly.
This requires the following (see the sketch after this list):

  • approximately track the moving average of the task updates as they arrive.
  • with every taskEnd, update the statistics metrics of the stage. In other words, this means we are creating an extra statistics object per stage, but hopefully this still uses less memory than the taskUpdates would.
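A minimal sketch of what such a per-stage statistics object could look like (names are illustrative, not the actual core API; the running mean below is exact, while order statistics such as percentiles would need an approximate sketch if ever required):

```scala
// One constant-size object per (stage, accumulable) replaces the
// per-task map entirely.
class StageAccumStats {
  private var count: Long = 0L
  private var total: Long = 0L
  private var minVal: Long = Long.MaxValue
  private var maxVal: Long = Long.MinValue

  // Called once per taskEnd: O(1) time, no per-task storage retained.
  def addTaskUpdate(update: Long): Unit = {
    count += 1
    total += update
    if (update < minVal) minVal = update
    if (update > maxVal) maxVal = update
  }

  // The moving average falls out of the running aggregates.
  def mean: Double = if (count == 0L) 0.0 else total.toDouble / count
  def min: Long = minVal
  def max: Long = maxVal
  def sum: Long = total
}
```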

Describe alternatives you've considered

We can free memory by offloading the taskUpdateMap to disk (using a binary store such as RocksDB or Parquet files), then loading the updates by stageID as necessary.
This is the ultimate fix for the problem; however, it would require significant code rewriting and synchronization. Since the task-accumulable data is not used anywhere in the core so far, the effort would not be justified.
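For completeness, a rough sketch of what the offloading could look like with RocksDB (this assumes the rocksdbjni dependency; the class name and key layout are hypothetical, and reading updates back by stage would additionally need a prefix iterator):

```scala
import java.nio.ByteBuffer
import org.rocksdb.{Options, RocksDB}

// Hypothetical disk-backed store: spill each (stageId, taskId) -> update
// pair to RocksDB instead of holding it on the heap.
class SpilledAccumStore(path: String) extends AutoCloseable {
  RocksDB.loadLibrary()
  private val db = RocksDB.open(new Options().setCreateIfMissing(true), path)

  // Key layout: 4-byte stageId followed by 8-byte taskId, so all updates
  // of a stage are contiguous and can be scanned together.
  private def key(stageId: Int, taskId: Long): Array[Byte] =
    ByteBuffer.allocate(12).putInt(stageId).putLong(taskId).array()

  def put(stageId: Int, taskId: Long, update: Long): Unit =
    db.put(key(stageId, taskId), ByteBuffer.allocate(8).putLong(update).array())

  override def close(): Unit = db.close()
}
```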
