
Random and frequent usePermission timeouts causing 404s #603

Open
4 tasks done
marsea24 opened this issue Oct 11, 2024 · 1 comment
Labels
status/triage/completed Automatic triage completed status/triage/manual Manual triage in progress type/bug Something isn't working

Comments


marsea24 commented Oct 11, 2024

Issue submitter TODO list

  • I've looked up my issue in FAQ
  • I've searched for already existing issues here
  • I've tried running main-labeled docker image and the issue still persists there
  • I'm running a supported version of the application which is listed here

Describe the bug (actual behavior)

Users are encountering pages that refuse to load (spinning) and eventually time out with a 404. This happens during normal browsing through pages like Topics and Messages for a particular cluster. Additionally, some users are seeing timeouts for random frontend assets such as other JS or CSS items.

The usual culprit when this happens is usePermission.js, which seems to time out after some time. The full request URL for this specific asset, from what we've encountered, is http://kafka-ui.example.io/assets/usePermission-D1xwE96v.js, and if we curl the asset continuously from the CLI, we can usually reproduce a hanging timeout somewhere after the 3rd or 4th try.
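The continuous-curl reproduction described above can be scripted. This is a minimal sketch, not our exact procedure; the hash-suffixed asset URL is taken from this report and will differ per deployment. With `--max-time`, a hanging response shows up as curl exit code 28 (operation timed out) instead of blocking the shell indefinitely:

```shell
# probe_asset URL [ATTEMPTS] — fetch URL repeatedly, printing the HTTP
# status and curl exit code for each try. A hang surfaces as curl_exit=28
# because --max-time bounds each attempt.
probe_asset() {
  local url="$1" attempts="${2:-10}" i code rc
  for i in $(seq 1 "$attempts"); do
    rc=0
    code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$url") || rc=$?
    echo "attempt $i: http_status=$code curl_exit=$rc"
  done
}

# Example (hash-suffixed filename from this report; yours will differ):
# probe_asset "http://kafka-ui.example.io/assets/usePermission-D1xwE96v.js" 10
```

In our experience a hang typically appears within the first handful of attempts, so ten iterations are usually enough to see it.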

In our kafka-ui logs, we see the following printed that correlates with the failure:

2024-10-09 19:34:18,224 INFO  [kafka-admin-client-thread | kafbat-ui-admin-1728494355-1] o.a.k.c.NetworkClient: [AdminClient clientId=kafbat-ui-admin-1728494355-1] Cancelled in-flight METADATA request with correlation id 351 due to node 5 being disconnected (elapsed time since creation: 1ms, elapsed time since send: 1ms, request timeout: 30000ms)
2024-10-10 15:29:40,715 INFO [kafka-admin-client-thread | kafbat-ui-admin-1728494355-1] o.a.k.c.NetworkClient: [AdminClient clientId=kafbat-ui-admin-1728494355-1] Cancelled in-flight METADATA request with correlation id 2479 due to node 14 being disconnected (elapsed time since creation: 1ms, elapsed time since send: 1ms, request timeout: 29797ms)

However, we've also seen errors like these:

org.apache.kafka.common.errors.InterruptException: java.lang.InterruptedException
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.maybeThrowInterruptException(ConsumerNetworkClient.java:535)
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:296)
	at org.apache.kafka.clients.consumer.internals.AbstractFetch.maybeCloseFetchSessions(AbstractFetch.java:756)
	at org.apache.kafka.clients.consumer.internals.AbstractFetch.close(AbstractFetch.java:777)
	at org.apache.kafka.clients.consumer.internals.Fetcher.close(Fetcher.java:110)
	at org.apache.kafka.clients.consumer.KafkaConsumer.lambda$close$3(KafkaConsumer.java:2472)
	at org.apache.kafka.common.utils.Utils.swallow(Utils.java:1025)
	at org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:2472)
	at org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:2415)
	at io.kafbat.ui.emitter.EnhancedConsumer.close(EnhancedConsumer.java:75)
	at org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:2388)
	at io.kafbat.ui.emitter.RangePollingEmitter.accept(RangePollingEmitter.java:51)
	at io.kafbat.ui.emitter.ForwardEmitter.accept(ForwardEmitter.java:14)
	at io.kafbat.ui.emitter.RangePollingEmitter.accept(RangePollingEmitter.java:18)
	at reactor.core.publisher.FluxCreate.subscribe(FluxCreate.java:95)
	at reactor.core.publisher.Flux.subscribe(Flux.java:8773)
	at reactor.core.publisher.FluxFlatMap$FlatMapMain.onNext(FluxFlatMap.java:427)
	at reactor.core.publisher.FluxPublishOn$PublishOnSubscriber.runAsync(FluxPublishOn.java:446)
	at reactor.core.publisher.FluxPublishOn$PublishOnSubscriber.run(FluxPublishOn.java:533)
	at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:84)
	at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:37)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.InterruptedException: null
	... 26 common frames omitted
kafka-ui-9d87748df-v7gxr 2024-10-10 15:51:56,826 ERROR [parallel-1] o.s.b.a.w.r.e.AbstractErrorWebExceptionHandler: [91c18eb1-6690] 500 Server Error for HTTP GET "/api/clusters/herp-derpnet/topics/assets.herpderp.evm_1.dlq.v1/messages/v2?limit=100&mode="
java.lang.NullPointerException: null
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:903)
Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException:
Error has been observed at the following site(s):
*__checkpoint ⇢ io.kafbat.ui.config.CorsGlobalConfiguration$$Lambda$1013/0x00007ff51a62a728 [DefaultWebFilterChain]
*__checkpoint ⇢ io.kafbat.ui.config.CustomWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ io.kafbat.ui.config.ReadOnlyModeFilter [DefaultWebFilterChain]
*__checkpoint ⇢ AuthorizationWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ ExceptionTranslationWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ LogoutWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ ServerRequestCacheWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ SecurityContextServerWebExchangeWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ ReactorContextWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ HttpHeaderWriterWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ ServerWebExchangeReactorContextWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ org.springframework.security.web.server.WebFilterChainProxy [DefaultWebFilterChain]
*__checkpoint ⇢ org.springframework.web.filter.reactive.ServerHttpObservationFilter [DefaultWebFilterChain]
*__checkpoint ⇢ HTTP GET "/api/clusters/herp-derpnet/topics/assets.herpderptopic.evm_1.dlq.v1/messages/v2?limit=100&mode=" [ExceptionHandlingWebHandler]

Expected behavior

Expected pages to load on refresh with full content within a few seconds, as usually happens after one or more refreshes when encountering the problem. Note that the same page may have an issue one time but not another.

Your installation details

  1. We've seen this issue with 91ed167 and 273e64c.
  2. We've seen this issue with both Helm Chart versions 1.4.2 and 1.4.6
  3. Here are the Helm and application env configs: https://gist.github.com/marsea24/01fe5002eb4b363e35b8c5166da95797
  4. See the above gist for what should be all relevant kafka-ui configuration; happy to provide any other specifics needed.

Steps to reproduce

Clicking around in the UI under any cluster, diving into Topics pages and even Messages, is usually where we see the issue happen most (although this is biased; other pages might actually exhibit it more often). Note there is no authentication configured in our kafka-ui instance.

Screenshots

No response

Logs

No response

Additional context

  1. We've tried having other users reproduce the problem, using different Helm chart and image tags of kafka-ui, granting our Confluent Cloud service account full permissions, and running kafka-ui locally. None of these have made any difference; the issue still persists.
  2. We have one particular user who gets random timeouts of kafka-ui frontend assets, which cause more frequent page-load spinning for him. In many cases when this happens, it takes multiple refreshes or reloading the site in another tab before he can get anything to load.
  3. All logs are provided in the above sections; happy to provide more upon request.
  4. The impact on end users here is a really difficult and drawn-out process of debugging and developing on our platform. With all the issues, sometimes they just have to give up and move on to other things. Watching users deal with the issue directly, we see them spamming the refresh button, opening multiple tabs, trying to refresh VPN connections, and continually coming to the Infra team looking for help (which we can't provide).

Finally, we've encountered other UI weirdness like ERR_EMPTY_RESPONSE with assets such as http://kafka-ui.example.io/assets/Indicator-BUTjfyDu.js or http://kafka-ui.example.io/assets/Input-BPtTPA5k.js, as well as timeouts with other random CSS or JS files, which cause the UI to spin. It's about a 50/50 chance whether a single refresh solves the problem or multiple refreshes/a full reload are required to even get the UI to load successfully.

@marsea24 marsea24 added status/triage Issues pending maintainers triage type/bug Something isn't working labels Oct 11, 2024
@kapybro kapybro bot added status/triage/manual Manual triage in progress status/triage/completed Automatic triage completed and removed status/triage Issues pending maintainers triage labels Oct 11, 2024

Hi marsea24! 👋

Welcome, and thank you for opening your first issue in the repo!

Please wait for triaging by our maintainers.

As development is carried out in our spare time, you can support us by sponsoring our activities or even funding the development of specific issues.
Sponsorship link

If you plan to raise a PR for this issue, please take a look at our contributing guide.
