Investigate nightly benchmarks 0 events/s issue #13738

Closed
carsonip opened this issue Jul 22, 2024 · 20 comments
@carsonip
Member

carsonip commented Jul 22, 2024

Nightly benchmarks occasionally report 0 events/s. Investigate the root cause of it.

@carsonip carsonip added the bug label Jul 22, 2024
@lahsivjar lahsivjar self-assigned this Jul 22, 2024
@lahsivjar
Contributor

lahsivjar commented Jul 29, 2024

Status update

The first thing I looked at was what the failing benchmark runs reported. Here are links to 2 benchmark runs:

  1. Run with events/sec metric populated - Link to APM-Server logs - Link to deployment
  2. Run without events/sec metric populated - Link to APM-Server logs - Link to deployment

Both of these show 500 internal errors. However, the logs for the 0 events/sec run additionally show data validation errors due to unexpected EOF. These errors seem to be logged from here. This could be an issue with our sender, but the most intriguing question is why only a subset of the delta metrics is reported as 0. For example, in the above link, txn/sec and metrics/sec are reported correctly whereas other delta metrics are reported as zero.

I have tried reproducing the errors locally but haven't succeeded (note that the expvar metrics collection is designed for benchtimes in minutes, so if testing locally make sure the benchtime is long enough for the expvar metrics to work correctly). I did see some special handling in the expvar metric collection, but nothing that explains this bug.

I have also created a PR to log errors in the expvar endpoint, which was not done before. I am not sure how helpful it will be, though.

@lahsivjar lahsivjar removed their assignment Jul 29, 2024
@1pkg 1pkg removed their assignment Aug 13, 2024
@simitt
Contributor

simitt commented Oct 16, 2024

Is this still happening?

@rubvs
Contributor

rubvs commented Oct 23, 2024

@simitt I had this happen to me in a run on GH Actions last week, see Slack Thread: https://elastic.slack.com/archives/C95SB62AG/p1729263104854879

@raultorrecilla

Moving this out of the iteration, to the backlog. If it happens more frequently we can reprioritize.
Out of curiosity, how many times has this happened between the end of October and now? cc @rubvs

@rubvs
Contributor

rubvs commented Jan 11, 2025

@raultorrecilla it happens very infrequently. For reference, I've probably run 300+ benchmarks since Oct, and have only observed this 3 or 4 times.

@carsonip
Member Author

carsonip commented Jan 22, 2025

It happened today to an on-demand benchmark run. Also updated the issue description.

edit: there seems to be a real issue with main

@1pkg
Member

1pkg commented Jan 23, 2025

resolved in #15338

@1pkg 1pkg closed this as completed Jan 23, 2025
@carsonip
Member Author

> resolved in #15338

@1pkg #15338 resolved a real issue that caused reproducible 0 events/s.

This GH issue was created for occasional 0 events/s results that resolve themselves. However, I'm happy to keep it closed and reopen it when needed.

@carsonip
Member Author

Reopening as it is happening quite frequently lately.

@carsonip carsonip reopened this Jan 27, 2025
@carsonip carsonip assigned 1pkg and unassigned inge4pres Jan 27, 2025
@inge4pres inge4pres self-assigned this Jan 27, 2025
@raultorrecilla

moving it as a candidate for next iteration (it-107)

@simitt
Contributor

simitt commented Jan 28, 2025

IMO we should treat this with urgency; ideally one of the on-support-duty engineers could look into it. I believe @1pkg started investigating it as part of support duty last week and could hand over.

@simitt
Contributor

simitt commented Jan 28, 2025

Let's wait for #15360 (comment) to be resolved and see if this still happens afterwards.

@kruskall
Member

See #15360 (comment)

There might be more to this. The expvar endpoint is working fine and when running the benchmarks locally the numbers are reported correctly (events is not 0).

@kruskall
Member

kruskall commented Jan 28, 2025

started a benchmark targeting main without the last 3 commits and it seems to work: https://github.com/elastic/apm-server/actions/runs/13012315722

tip at 1e021db

@kruskall
Member

started another benchmark targeting main without the last commit and it works too: https://github.com/elastic/apm-server/actions/runs/13012675235/job/36293952533

tip at 8aac9e3

@kruskall
Member

ok we're running the main branch with and without moxy.

with moxy: everything works and events are not 0
without moxy: 0 events/s

the commit/tip is exactly the same

@carsonip
Member Author

carsonip commented Jan 28, 2025

> with moxy: everything works and events are not 0
> without moxy: 0 events/s

I can reproduce this. Specifically, in the 0 events/s case (without moxy), the ESS ES has all apm documents indexed and there is no error. What is missing is the libbeat.output.* metrics, which are absent from /debug/vars:

(truncated output)

"apm-server.processor.stream.accepted": 5257954,
"apm-server.processor.stream.errors.invalid": 0,
"apm-server.processor.stream.errors.toolarge": 0,
"apm-server.processor.span.transformations": 3173610,
"apm-server.processor.error.transformations": 105356,
"apm-server.processor.transaction.transformations": 1214548,
"libbeat.config.scans": 0,
"libbeat.config.reloads": 0,
"libbeat.config.module.starts": 0,
"libbeat.config.module.stops": 0,
"libbeat.config.module.running": 0,
"badger_v2_blocked_puts_total": 0,
"badger_v2_disk_reads_total": 0,
"badger_v2_disk_writes_total": 0,

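A check like the following could flag this condition automatically. This is only a sketch under assumptions: `hasKeyWithPrefix` is a hypothetical helper, and the payload is assumed to be a flat JSON object with dotted keys, as in the truncated output above. The sample payload is a shortened copy of that output, not a full /debug/vars response.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// hasKeyWithPrefix reports whether any top-level key of the JSON object in
// payload starts with prefix. It assumes a flat object with dotted keys.
func hasKeyWithPrefix(payload []byte, prefix string) (bool, error) {
	var vars map[string]json.RawMessage
	if err := json.Unmarshal(payload, &vars); err != nil {
		return false, err
	}
	for key := range vars {
		if strings.HasPrefix(key, prefix) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Shortened sample based on the truncated output above.
	payload := []byte(`{
		"apm-server.processor.stream.accepted": 5257954,
		"libbeat.config.scans": 0,
		"badger_v2_disk_reads_total": 0
	}`)

	for _, prefix := range []string{"apm-server.processor.", "libbeat.output."} {
		found, err := hasKeyWithPrefix(payload, prefix)
		if err != nil {
			fmt.Println("decode error:", err)
			return
		}
		fmt.Printf("%s present: %t\n", prefix, found)
	}
}
```

With the sample above, the `libbeat.output.` prefix is reported as not present, which matches the failing case described in this comment.
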
@carsonip
Member Author

carsonip commented Jan 28, 2025

@kruskall
Member

should be fixed by #15439

benchmarks using the pr branch are working fine

@kruskall
Member

PR merged. Closing this
