
[FEA] Generate ONLY required metrics from process speed perspective #1541

Open
wjxiz1992 opened this issue Feb 11, 2025 · 2 comments

@wjxiz1992
Collaborator

Is your feature request related to a problem? Please describe.

We now receive more than 100 eventlogs per day, and each eventlog is GB-scale: the largest can reach 7 GB, while the smaller ones are around 1–2 GB.

The Profiling tool currently produces the full set of metrics extracted from the eventlogs, which usually takes a significant amount of time.

Is it possible to select only the required parts, so the Profiling tool can skip most of the work and we can get the results quickly? For example, we only need "failed_jobs.csv"; none of the other metrics are needed.

Describe the solution you'd like
N/A

Describe alternatives you've considered
N/A

Additional context
If possible, the Profiling Tool could accept an extra argument pointing to a config file (JSON/YAML or similar) in which I can say "I only need failed_jobs.csv" as output.
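For illustration, such a config could look like the sketch below. The file layout and key names here are purely hypothetical, not an existing Profiling Tool format:

```
# hypothetical output-filter config (illustrative only, not a real Profiling Tool schema)
output:
  reports:
    - failed_jobs.csv   # the only report file we need generated
```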

@amahussein
Collaborator

The issue description does not make the context clear:

  • Is this the Python CLI (rapids-tools) or the Java cmd? There are ways to boost the runtime of the tools, but that depends on how the tool is triggered (the number of eventlogs processed in parallel and the memory allocated to the process).
  • Is it taking a long time writing to disk, or processing the eventlogs?
    • If it is taking a long time writing to disk, then it is doable to allow only a subset of files to be written.
    • If it is taking a long time processing, then we go back to the first question to see why it takes so long.

Regarding the feature request: it is quite tough to implement.

  • There are heavy dependencies between the pieces of extracted data.
  • In some cases, the output cannot be interpreted unless it is complete. For example, users might need to look at the SQL failures or stages in order to understand what a job was doing. This implies they would have to rerun the tools again to generate those files.
  • If we refactor the code so that each feature runs independently, the runtime would end up much longer because of the work needed to isolate the shared state.

@amahussein
Collaborator

CC: @mattahrens
