
Memory Leak after updating Akka from 1.5.27.1 to 1.5.37 #516
Open · lucavice opened this issue Feb 3, 2025 · 15 comments · May be fixed by #518
Labels: bug (Something isn't working)

Comments

lucavice (Contributor) commented Feb 3, 2025

Version Information
Version of Akka.NET? 1.5.37
Which Akka.NET Modules? Akka.NET + Persistence, Remoting, Clustering. The journal plugin is Akka.Persistence.Sql.

Describe the bug
Last week I updated our staging environment from Akka 1.5.27.1 to 1.5.37.
Specifically, I updated the following packages:

  • Akka: 1.5.27.1 -> 1.5.37
  • Akka.Cluster.Sharding: 1.5.27.1 -> 1.5.37
  • Akka.Persistence.Sql: 1.5.25 -> 1.5.37
  • Akka.Serialization.Hyperion: 1.5.27.1 -> 1.5.37

As part of this update, I also retargeted from .NET 7 to .NET 8.

After a few days, I noticed our staging environment running low on memory:

[image]

Inspecting the VMs revealed that most of the processes hosting Akka.NET were over 1 GB in size.
The only exceptions were processes that do not perform any journal reads; those were not affected. This suggests the memory leak is coming from Akka.Persistence or the Akka.Persistence.Sql plugin.

Additionally, I took a memory dump of one of the affected processes:

[images]

Task<QueryStartGranted> seems to be the culprit, which further indicates that the memory leak is coming from the Akka.Persistence module or its plugin.

Environment
Windows on .NET 8

Additional context
I am currently reverting our environment to 1.5.27.1 to verify that the memory leak disappears (just to confirm that this is not caused by some other change). It could potentially also be due to the upgrade to .NET 8, though that seems less likely. I will monitor RAM usage and check back tomorrow to see whether simply reverting the package versions resolves the issue.

Let me know if there is anything else I can help with to narrow this down.

lucavice changed the title from "Memory Leak after updating Akka from 1.5.34 to 1.5.37" to "Memory Leak after updating Akka from 1.5.27.1 to 1.5.37" Feb 3, 2025
lucavice (Contributor, Author) commented Feb 3, 2025

I did some further testing in a local environment and some profiling with dotMemory.
I can replicate the issue immediately with 1.5.37, but not with 1.5.34, so the source of the issue must have been introduced between those two versions.

As soon as I start profiling, the number of instances of these objects starts to grow quickly:

[image]

lucavice (Contributor, Author) commented Feb 3, 2025

OK, I narrowed it down even further. The call stack that creates the new objects is:

System.Threading.Tasks.TaskCompletionSource<TResult>..ctor(Object, TaskCreationOptions)
Akka.Util.Internal.TaskEx.NonBlockingTaskCompletionSource<T>()
Akka.Actor.Futures.Ask<T>(ICanTell, Func<T, TResult>, Nullable<T>, CancellationToken)
Akka.Persistence.Sql.Extensions.ConnectionFactoryExtensions+<ExecuteQueryWithTransactionAsync>d__3<TState, T>.MoveNext()
System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start<TStateMachine>(TStateMachine)
Akka.Persistence.Sql.Query.Dao.BaseByteReadArrayJournalDao+<>c+<<EventsByTag>b__5_0>d.MoveNext()
System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start<TStateMachine>(TStateMachine)
Akka.Streams.Implementation.Fusing.SelectAsync+Logic<TIn, TOut>.OnPush()
Akka.Streams.Implementation.Fusing.GraphInterpreter.Execute(Int32)
Akka.Streams.Implementation.Fusing.GraphInterpreterShell.RunBatch(Int32)
Akka.Streams.Implementation.Fusing.ActorGraphInterpreter.TryInit(GraphInterpreterShell)
Akka.Streams.Implementation.Fusing.ActorGraphInterpreter.PreStart()
Akka.Actor.ActorCell.UseThreadContext(Action)
Akka.Actor.ActorCell.Create(Exception)
Akka.Actor.ActorCell.SysMsgInvokeAll(EarliestFirstSystemMessageList, Int32)
Akka.Actor.ActorCell.SystemInvoke(ISystemMessage)
Akka.Dispatch.Mailbox.ProcessAllSystemMessages()
Akka.Dispatch.Mailbox.Run()
System.Threading.ThreadPoolWorkQueue.Dispatch()
System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart()
[AllThreadsRoot]

Because of the mention of Akka.Persistence.Sql.Query.Dao.BaseByteReadArrayJournalDao and Akka.Persistence.Sql.Extensions.ConnectionFactoryExtensions, I tried keeping all packages at 1.5.37 and downgrading only Akka.Persistence.Sql to 1.5.30. The issue disappears with that version.

Investigating the source differences between these two versions, it appears that the Query Throttling PR touched several of these code paths. I suspect it is a good candidate for the source of the memory leak.

Arkatufus (Contributor) commented:

Thank you for the very detailed report @lucavice, we will look into this as soon as possible

Arkatufus transferred this issue from akkadotnet/akka.net Feb 3, 2025
Arkatufus (Contributor) commented:

Moving the issue to this repo because it is not related to Akka.NET

Arkatufus (Contributor) commented:

@lucavice, do you see any timeout exceptions thrown from your queries in your logs? Either AskTimeoutException or OperationCanceledException?

lucavice (Contributor, Author) commented Feb 4, 2025

Hey @Arkatufus. No, I don't see any exceptions in my logs.

For example, I am now running the system locally and I see roughly 1,000 new objects of type System.Threading.Tasks.Task<QueryStartGranted> and Akka.Actor.FutureActorRef<QueryStartGranted> created and never collected every minute (comparing dotMemory snapshots). There are no exceptions and nothing unusual with my actors that use the EventsByTag event stream; as far as I can see, they are working fine.

In case it helps, this happens both locally using SQL Server and on Azure using Azure SQL Database for querying the events.

lucavice (Contributor, Author) commented Feb 6, 2025

Hi @Arkatufus, I am able to replicate it with a very simple program that I prepared here: https://github.com/lucavice/AkkaNetBugRepro/tree/memory-leak/AkkaNetBugRepro

Hopefully that helps track it down.
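
For readers who don't want to open the repro, here is a rough sketch of the kind of workload it exercises - a continuously running EventsByTag query against Akka.Persistence.Sql. The journal identifier, tag name, and event handling below are illustrative assumptions (and the HOCON persistence configuration is omitted), not code copied from the repro:

using System;
using Akka.Actor;
using Akka.Persistence.Query;
using Akka.Persistence.Sql.Query;
using Akka.Streams;

var system = ActorSystem.Create("repro"); // persistence/query HOCON config omitted

// Grab the Akka.Persistence.Sql read journal (identifier constant assumed here).
var readJournal = PersistenceQuery.Get(system)
    .ReadJournalFor<SqlReadJournal>(SqlReadJournal.Identifier);

// Each query batch issued by the read journal asks the query permitter and produces
// a Task<QueryStartGranted>; with 1.5.37 those tasks accumulate and are never collected.
readJournal
    .EventsByTag("my-tag", Offset.NoOffset())
    .RunForeach(env => Console.WriteLine(env.PersistenceId), system.Materializer());

Console.ReadLine();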

Arkatufus (Contributor) commented:

That's great! Thank you, I'll use a profiler on your reproduction to try and catch the problem.

Aaronontheweb (Member) commented:

Thanks Luca - I took your code and built a self-contained reproduction here https://github.com/Aaronontheweb/AkkaPersistenceSqlMemoryLeak/

I can reproduce the issue! Going to see about trying to fix it next

Aaronontheweb (Member) commented:

I am 99% sure the problem is how we're closing over the FutureActorRef used in Ask<T> here:

internal static async Task<T> ExecuteQueryWithTransactionAsync<T>(
    this AkkaPersistenceDataConnectionFactory factory,
    DbStateHolder state,
    Func<AkkaDataConnection, CancellationToken, Task<T>> handler)
{
    using var cts = CancellationTokenSource.CreateLinkedTokenSource(state.ShutdownToken);
    {
        cts.CancelAfter(state.QueryThrottleTimeout);
        await state.QueryPermitter.Ask<QueryStartGranted>(RequestQueryStart.Instance, cts.Token);
    }

    try
    {
        return await factory.ExecuteWithTransactionAsync(state.IsolationLevel, state.ShutdownToken, handler);
    }
    finally
    {
        state.QueryPermitter.Tell(ReturnQueryStart.Instance);
    }
}

Aaronontheweb (Member) commented:

However, I have tried several permutations of this:

{
    // using var requestTimeout = new CancellationTokenSource(state.QueryThrottleTimeout);
    // using var cts = CancellationTokenSource.CreateLinkedTokenSource(state.ShutdownToken, requestTimeout.Token);
    await factory.QueryPermitter.Ask<QueryStartGranted>(RequestQueryStart.Instance, factory.QueryThrottleTimeout);
}

The leak is still reproducible even with those, so the issue might be a higher-level async chaining problem.

Aaronontheweb (Member) commented:

I think the issue here is that AsyncSource is retaining Task instances - something weird is going on there.

I added a private build with some details here: Aaronontheweb/AkkaPersistenceSqlMemoryLeak#3

But basically, the problem is that the Task<QueryStartGranted> is getting rooted - I think it's generally due to retention around the task state machines in use.

Aaronontheweb (Member) commented:

@lucavice, so @Arkatufus found the source of the leak, and it's in the PR you linked to - as it turns out, we've been calling Context.Watch on the temporary IActorRefs created by Ask<T>, and since those actor refs never send a Terminated message back, the internal watch collection maintained by the QueryThrottler never shrinks.

I might need to open an issue on the general Akka.NET repo for this because that problem could occur anywhere
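
A hypothetical minimal illustration of the leak pattern described above - this is not the actual Akka.Persistence.Sql QueryThrottler code; the actor and message types below are simplified stand-ins:

using Akka.Actor;

public sealed class RequestQueryStart
{
    public static readonly RequestQueryStart Instance = new();
    private RequestQueryStart() { }
}

public sealed class QueryStartGranted
{
    public static readonly QueryStartGranted Instance = new();
    private QueryStartGranted() { }
}

public class Permitter : ReceiveActor
{
    public Permitter()
    {
        Receive<RequestQueryStart>(_ =>
        {
            // Sender is the temporary FutureActorRef<QueryStartGranted> created by Ask<T>.
            // Watching it adds an entry to this actor's internal watch list, but a
            // FutureActorRef never sends Terminated back after the ask completes, so the
            // entry (and the completed Task<QueryStartGranted> it roots) is never removed.
            Context.Watch(Sender);
            Sender.Tell(QueryStartGranted.Instance);
        });
    }
}

Every await permitter.Ask<QueryStartGranted>(RequestQueryStart.Instance, timeout) against an actor like this leaks one watch entry, which matches the steady growth of Task<QueryStartGranted> instances seen in the dumps.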

Aaronontheweb added the bug label Feb 8, 2025
lucavice (Contributor, Author) commented:

Thanks for the update @Aaronontheweb.

While trying to find the memory leak myself, I was also looking at that Context.Watch in the QueryThrottler, but I assumed that a Terminated message was guaranteed to be received when the temporary Ask actor stopped.

Just for my own understanding, is the reason why the Terminated message is not sent back known, or is that part of a bug that will be investigated in the main Akka.NET repo?
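
For context, the contract being assumed here is the ordinary Watch/Terminated behavior, sketched below with a regular (non-FutureActorRef) target; the actor names are hypothetical:

using System;
using Akka.Actor;

public class Watcher : ReceiveActor
{
    public Watcher(IActorRef target)
    {
        // For an ordinary actor, stopping it reliably delivers a Terminated message
        // to every watcher, which lets the watcher clean up its bookkeeping.
        Context.Watch(target);

        Receive<Terminated>(t =>
            Console.WriteLine($"{t.ActorRef.Path} terminated, removing it from my state"));
    }
}

Per the discussion above, the temporary FutureActorRef created by Ask<T> never sends that Terminated message back to its watchers, which is why the QueryThrottler's watch list keeps growing.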

Aaronontheweb (Member) commented:

I think it does merit an investigation in the main Akka.NET repo

Arkatufus linked a pull request Feb 10, 2025 that will close this issue