
[FEA] Generate ONLY required metrics from process speed perspective #1541

Open
wjxiz1992 opened this issue Feb 11, 2025 · 2 comments

@wjxiz1992
Collaborator

Is your feature request related to a problem? Please describe.

We now receive more than 100 eventlogs per day, and each eventlog is GB-scale: the largest can reach 7 GB, while the smaller ones are around 1–2 GB.

The Profiling tool currently produces the full set of metrics extracted from the eventlogs, which usually takes a significant amount of time.

Is it possible to select only the required parts, so the Profiling tool can skip most of the work and we can get the results quickly? For example, we only need "failed_jobs.csv"; none of the other metrics are needed.

Describe the solution you'd like
N/A

Describe alternatives you've considered
N/A

Additional context
If possible, the Profiling Tool could accept an extra argument pointing to a config file (JSON/YAML or similar) in which I can say "I only need failed_jobs.csv" as output.
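For illustration, such a config could look like the sketch below. The file layout and key names here are purely hypothetical, not an existing Profiling Tool format:

```
# hypothetical output-filter config (illustrative only, not a real Profiling Tool schema)
output:
  reports:
    - failed_jobs.csv   # the only report file we need generated
```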

@amahussein
Collaborator

The issue description does not make the context clear:

  • Is this the Python CLI (rapids-tools) or the Java cmd? There are ways to boost the runtime of the tools, but that depends on how the tool is triggered (the number of eventlogs processed in parallel and the memory allocated to the process).
  • Is it taking a long time writing to disk, or processing the eventlogs?
    • If it is taking a long time writing to disk, then it is doable to allow only a subset of files to be written.
    • If it is taking a long time processing, then we go back to the first question to see why it takes so long.

Regarding the feature request: it is quite tough to implement.

  • There are heavy dependencies between the pieces of extracted data.
  • In some cases, the output cannot be interpreted unless it is complete. For example, users might need to look at the SQL failures or stages in order to understand what a job was doing. This implies they would have to rerun the tools again to generate those files.
  • If we refactor the code so that each feature runs independently, the runtime would end up much longer because of the work needed to isolate the shared state.

@amahussein
Collaborator

CC: @mattahrens
