
Random and frequent usePermission timeouts causing 404s #603

Open
4 tasks done
marsea24 opened this issue Oct 11, 2024 · 1 comment
Labels
status/triage/completed Automatic triage completed status/triage/manual Manual triage in progress type/bug Something isn't working

Comments


marsea24 commented Oct 11, 2024

Issue submitter TODO list

  • I've looked up my issue in FAQ
  • I've searched for already existing issues here
  • I've tried running main-labeled docker image and the issue still persists there
  • I'm running a supported version of the application which is listed here

Describe the bug (actual behavior)

Users are encountering pages that refuse to load (spinning) and eventually time out with a 404. This happens during normal browsing through pages like Topics and Messages for a particular cluster. Additionally, some users are seeing timeouts for random frontend assets such as other JS or CSS items.

The usual culprit when this happens is usePermission.js, which seems to time out after some time. The full request URL for this specific asset, from what we've encountered, is http://kafka-ui.example.io/assets/usePermission-D1xwE96v.js, and if we curl the asset continuously from the CLI, we can usually reproduce a hanging timeout somewhere after the 3rd or 4th try.
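The continuous-curl reproduction described above can be scripted. This is a minimal sketch, not our exact procedure; the hash-suffixed asset URL is taken from this report and will differ per deployment. With `--max-time`, a hanging response shows up as curl exit code 28 (operation timed out) instead of blocking the shell indefinitely:

```shell
# probe_asset URL [ATTEMPTS] — fetch URL repeatedly, printing the HTTP
# status and curl exit code for each try. A hang surfaces as curl_exit=28
# because --max-time bounds each attempt.
probe_asset() {
  local url="$1" attempts="${2:-10}" i code rc
  for i in $(seq 1 "$attempts"); do
    rc=0
    code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$url") || rc=$?
    echo "attempt $i: http_status=$code curl_exit=$rc"
  done
}

# Example (hash-suffixed filename from this report; yours will differ):
# probe_asset "http://kafka-ui.example.io/assets/usePermission-D1xwE96v.js" 10
```

In our experience a hang typically appears within the first handful of attempts, so ten iterations are usually enough to see it.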

In our kafka-ui logs, we see the following printed that correlates with the failure:

2024-10-09 19:34:18,224 INFO  [kafka-admin-client-thread | kafbat-ui-admin-1728494355-1] o.a.k.c.NetworkClient: [AdminClient clientId=kafbat-ui-admin-1728494355-1] Cancelled in-flight METADATA request with correlation id 351 due to node 5 being disconnected (elapsed time since creation: 1ms, elapsed time since send: 1ms, request timeout: 30000ms)
2024-10-10 15:29:40,715 INFO [kafka-admin-client-thread | kafbat-ui-admin-1728494355-1] o.a.k.c.NetworkClient: [AdminClient clientId=kafbat-ui-admin-1728494355-1] Cancelled in-flight METADATA request with correlation id 2479 due to node 14 being disconnected (elapsed time since creation: 1ms, elapsed time since send: 1ms, request timeout: 29797ms)

However, we've also seen errors like these:

org.apache.kafka.common.errors.InterruptException: java.lang.InterruptedException
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.maybeThrowInterruptException(ConsumerNetworkClient.java:535)
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:296)
	at org.apache.kafka.clients.consumer.internals.AbstractFetch.maybeCloseFetchSessions(AbstractFetch.java:756)
	at org.apache.kafka.clients.consumer.internals.AbstractFetch.close(AbstractFetch.java:777)
	at org.apache.kafka.clients.consumer.internals.Fetcher.close(Fetcher.java:110)
	at org.apache.kafka.clients.consumer.KafkaConsumer.lambda$close$3(KafkaConsumer.java:2472)
	at org.apache.kafka.common.utils.Utils.swallow(Utils.java:1025)
	at org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:2472)
	at org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:2415)
	at io.kafbat.ui.emitter.EnhancedConsumer.close(EnhancedConsumer.java:75)
	at org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:2388)
	at io.kafbat.ui.emitter.RangePollingEmitter.accept(RangePollingEmitter.java:51)
	at io.kafbat.ui.emitter.ForwardEmitter.accept(ForwardEmitter.java:14)
	at io.kafbat.ui.emitter.RangePollingEmitter.accept(RangePollingEmitter.java:18)
	at reactor.core.publisher.FluxCreate.subscribe(FluxCreate.java:95)
	at reactor.core.publisher.Flux.subscribe(Flux.java:8773)
	at reactor.core.publisher.FluxFlatMap$FlatMapMain.onNext(FluxFlatMap.java:427)
	at reactor.core.publisher.FluxPublishOn$PublishOnSubscriber.runAsync(FluxPublishOn.java:446)
	at reactor.core.publisher.FluxPublishOn$PublishOnSubscriber.run(FluxPublishOn.java:533)
	at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:84)
	at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:37)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.InterruptedException: null
	... 26 common frames omitted
kafka-ui-9d87748df-v7gxr 2024-10-10 15:51:56,826 ERROR [parallel-1] o.s.b.a.w.r.e.AbstractErrorWebExceptionHandler: [91c18eb1-6690] 500 Server Error for HTTP GET "/api/clusters/herp-derpnet/topics/assets.herpderp.evm_1.dlq.v1/messages/v2?limit=100&mode="
java.lang.NullPointerException: null
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:903)
Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException:
Error has been observed at the following site(s):
*__checkpoint ⇢ io.kafbat.ui.config.CorsGlobalConfiguration$$Lambda$1013/0x00007ff51a62a728 [DefaultWebFilterChain]
*__checkpoint ⇢ io.kafbat.ui.config.CustomWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ io.kafbat.ui.config.ReadOnlyModeFilter [DefaultWebFilterChain]
*__checkpoint ⇢ AuthorizationWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ ExceptionTranslationWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ LogoutWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ ServerRequestCacheWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ SecurityContextServerWebExchangeWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ ReactorContextWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ HttpHeaderWriterWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ ServerWebExchangeReactorContextWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ org.springframework.security.web.server.WebFilterChainProxy [DefaultWebFilterChain]
*__checkpoint ⇢ org.springframework.web.filter.reactive.ServerHttpObservationFilter [DefaultWebFilterChain]
*__checkpoint ⇢ HTTP GET "/api/clusters/herp-derpnet/topics/assets.herpderptopic.evm_1.dlq.v1/messages/v2?limit=100&mode=" [ExceptionHandlingWebHandler]

Expected behavior

Expected pages to load on refresh with full content within a few seconds, as usually happens after one or more refreshes when encountering the problem. Note that the same page may have an issue one time but not another.

Your installation details

  1. We've seen this issue with 91ed167 and 273e64c.
  2. We've seen this issue with both Helm Chart versions 1.4.2 and 1.4.6
  3. Here are the Helm and application env configs: https://gist.github.com/marsea24/01fe5002eb4b363e35b8c5166da95797
  4. See the above gist for what should be all relevant kafka-ui configuration; happy to provide any other specifics needed.

Steps to reproduce

Clicking around in the UI under any cluster, diving into Topics pages and even Messages, is usually where we see the issue happen most (although this is biased; other pages might actually exhibit it more often). Note there is no authentication configured in our kafka-ui instance.

Screenshots

No response

Logs

No response

Additional context

  1. We've tried having other users reproduce the problem, using different Helm chart and image tags of kafka-ui, granting our Confluent Cloud service account full permissions, and running kafka-ui locally. None of these have made any difference; the issue still persists.
  2. We have one particular user who gets random timeouts of kafka-ui frontend assets, which cause more frequent page-load spinning for him. In many cases when this happens, it takes multiple refreshes or reloading the site in another tab before he can get anything to load.
  3. All logs are provided in the above sections; happy to provide more upon request.
  4. The impact on end users here is a really difficult and drawn-out process of debugging and developing on our platform. With all the issues, sometimes they just have to give up and move on to other things. Watching users deal with the issue directly, we see them spamming the refresh button, opening multiple tabs, trying to refresh VPN connections, and continually coming to the Infra team looking for help (which we can't provide).

Finally, we've encountered other UI weirdness like ERR_EMPTY_RESPONSE with assets such as http://kafka-ui.example.io/assets/Indicator-BUTjfyDu.js or http://kafka-ui.example.io/assets/Input-BPtTPA5k.js, as well as timeouts with other random CSS or JS files, which cause the UI to spin. It's about a 50/50 chance whether a single refresh solves the problem or multiple refreshes/a full reload are required to even get the UI to load successfully.

@marsea24 marsea24 added status/triage Issues pending maintainers triage type/bug Something isn't working labels Oct 11, 2024
@kapybro kapybro bot added status/triage/manual Manual triage in progress status/triage/completed Automatic triage completed and removed status/triage Issues pending maintainers triage labels Oct 11, 2024

Hi marsea24! 👋

Welcome, and thank you for opening your first issue in the repo!

Please wait for triaging by our maintainers.

As development is carried out in our spare time, you can support us by sponsoring our activities or even funding the development of specific issues.
Sponsorship link

If you plan to raise a PR for this issue, please take a look at our contributing guide.
