
Fix server stuck issue when test timeout exception #61

Merged: 9 commits into main, Jun 28, 2024

Conversation

kaibocai (Member):

Resolves #58.

Thanks @ItalyPaleAle for the fix!

kaibocai requested a review from cgillum on January 16, 2024 at 22:40
backend/executor.go (resolved)
backend/executor.go (outdated, resolved)
executor.pendingActivities.Store(key, result)
defer executor.pendingActivities.Delete(key)

Member:

Why are we removing the calls to delete these keys? Won't that result in a memory leak?

Contributor:

They will be deleted by CompleteActivityTask. Additionally, if the GetWorkItems stream is closed while the operation is in progress, the cleanup logic added to GetWorkItems will delete any "leftover" items in this map.
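For context, a minimal sketch of the completion path being described here; the names are simplified stand-ins inspired by the snippets quoted in this review, not the actual executor.go implementation:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Simplified, hypothetical stand-ins for the types discussed in this thread.
type activityExecutionResult struct {
	response string        // placeholder for the real activity response payload
	complete chan struct{} // closed exactly once when the result is ready
}

type executor struct {
	pendingActivities sync.Map // key (string) -> *activityExecutionResult
}

// CompleteActivityTask sketches the deletion path: the key is removed with
// LoadAndDelete, so no other code path (stream cleanup, shutdown) can see it
// anymore, and only then is the channel closed.
func (e *executor) CompleteActivityTask(key, response string) error {
	v, ok := e.pendingActivities.LoadAndDelete(key)
	if !ok {
		return errors.New("unknown activity task: " + key)
	}
	result := v.(*activityExecutionResult)
	result.response = response
	close(result.complete) // unblocks the goroutine waiting on this activity
	return nil
}

func main() {
	e := &executor{}
	r := &activityExecutionResult{complete: make(chan struct{})}
	e.pendingActivities.Store("task-1", r)

	// A waiter, similar to what an ExecuteActivity-style call would do.
	done := make(chan string)
	go func() {
		<-r.complete
		done <- r.response
	}()

	if err := e.CompleteActivityTask("task-1", "hello"); err != nil {
		panic(err)
	}
	fmt.Println(<-done) // prints "hello"
}
```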

Contributor:

Is there any reason for moving the delete operation from here to other places? If I understand correctly, the delete then needs to be added in CompleteActivityTask, in the defer block in GetWorkItems, and potentially in Shutdown as well. If we just keep it here instead, it would handle all scenarios, and it also looks more intuitive to me. (We would still need to close the complete channel in GetWorkItems and Shutdown.)

Contributor:

Good question. I wrote this a week ago and now I cannot remember why, even after reading the code :( It may work

Member Author:

Then this may help avoid closing an already-closed channel, since we delete the key in time.

Contributor:

It's impossible currently for close() to be called on the same channel twice

Member Author:

Yes, I mean the current implementation helps avoid closing an already-closed channel, but with defer executor.pendingActivities.Delete(key) we could potentially close an already-closed channel if the stream is closed and then Shutdown is called.

backend/executor.go (resolved)
// closing the work item queue is a signal for shutdown
close(g.workItemQueue)

// Iterate through all pending items and close them to unblock the goroutines waiting on this
Member:

Are these changes to the Shutdown routine related to the original issue or is this unrelated?

Contributor:

This was my first attempt at fixing the issue. It turned out it didn't fix it, but it seemed like something that would be helpful to have nevertheless, to ensure we don't leave goroutines waiting.

Contributor:

Should we not also delete the items from pendingActivities/pendingOrchestrators in case of shutdown?

Contributor:

The idea is that shutdown is called before the object is abandoned. The memory is released by the GC regardless. The important thing is to close the channels in case there are goroutines waiting on them.
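A minimal sketch of the shutdown pattern under discussion, with simplified stand-in types rather than the actual executor.go code: closing the work item queue signals the streams, and closing each pending "complete" channel unblocks any goroutine still waiting on a result.

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical, simplified model of the executor for illustration only.
type activityExecutionResult struct {
	complete chan struct{}
}

type executor struct {
	workItemQueue        chan string
	pendingActivities    sync.Map // key -> *activityExecutionResult
	pendingOrchestrators sync.Map // key -> *activityExecutionResult
}

// Shutdown mirrors the idea in the diff: the memory is reclaimed by the GC
// anyway, so the important part is unblocking waiters. (In the real code,
// other closers use LoadAndDelete first, so a channel still present in the
// map has not been closed yet.)
func (e *executor) Shutdown() {
	// Closing the work item queue is a signal for shutdown.
	close(e.workItemQueue)

	closeAll := func(m *sync.Map) {
		m.Range(func(_, value any) bool {
			if r, ok := value.(*activityExecutionResult); ok {
				close(r.complete) // unblocks goroutines waiting on this result
			}
			return true // keep iterating
		})
	}
	closeAll(&e.pendingActivities)
	closeAll(&e.pendingOrchestrators)
}

func main() {
	e := &executor{workItemQueue: make(chan string, 1)}
	r := &activityExecutionResult{complete: make(chan struct{})}
	e.pendingActivities.Store("a1", r)

	go e.Shutdown()

	<-r.complete // returns because Shutdown closed the channel
	fmt.Println("waiter unblocked by shutdown")
}
```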

if ok {
p.(*ExecutionResults).pending = pendingOrchestratorCh
}
g.logger.Debugf("pending orchestrators after add %s: %#v\n", key, pendingOrchestrators)
Contributor:

I think these debug logs could be removed @kaibocai. I had left the logs in the code I shared with you just to show the flow of data! These logs may be too verbose even for debug-level logging.

Member Author:

I removed the logs here and just left a few that track the newly added logic.

pendingOrchestratorCh := make(chan string, 1)
defer func() {
// If there's any pending activity left, remove them
for key := range pendingActivities {
Contributor:

What is the use of pendingActivities or pendingOrchestrators here? Since we just need to delete all pending keys that are in g.pendingActivities or g.pendingOrchestrators, can we not simply iterate over the map and delete? This would also avoid sending the items to pendingChannel for every item being processed.

Contributor:

No, we do not delete all keys in g.pendingActivities.

There could be multiple connections to GetWorkItems, so each connection needs its own map to track what pending items are related to that connection.

Using a local variable, rather than depending on the global g.pendingActivities/g.pendingOrchestrators, means we can avoid using locks (because of how gRPC works, there's no issue with concurrent access there) and simplify the cleanup process.
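A rough sketch of this per-stream bookkeeping, with hypothetical names and simplified types (the real GetWorkItems handler differs, but the ownership idea is the same): the shared state lives in a sync.Map, while each stream keeps a plain local map of the keys it created, so cleanup needs no extra locking and only touches that stream's keys.

```go
package main

import (
	"fmt"
	"sync"
)

type executor struct {
	pendingActivities sync.Map    // shared: key -> chan struct{}
	workItemQueue     chan string // closed on shutdown
}

// streamWorkItems models one GetWorkItems connection: it records the keys it
// created in a stream-local map, and its deferred cleanup removes any
// leftovers so waiters are unblocked when this stream goes away.
func (e *executor) streamWorkItems(send func(key string) error) error {
	pendingActivities := map[string]struct{}{} // local to this stream

	defer func() {
		// If there's any pending activity left, remove it and unblock waiters.
		for key := range pendingActivities {
			// LoadAndDelete returns false if CompleteActivityTask already
			// removed (and closed) this entry, so there is no double close.
			if v, ok := e.pendingActivities.LoadAndDelete(key); ok {
				close(v.(chan struct{}))
			}
		}
	}()

	for key := range e.workItemQueue {
		e.pendingActivities.Store(key, make(chan struct{}))
		pendingActivities[key] = struct{}{}
		if err := send(key); err != nil {
			return err // broken stream: cleanup releases only this stream's keys
		}
	}
	return nil // queue closed (shutdown); cleanup runs as well
}

func main() {
	e := &executor{workItemQueue: make(chan string, 2)}
	e.workItemQueue <- "a1"
	e.workItemQueue <- "a2"
	close(e.workItemQueue)

	_ = e.streamWorkItems(func(key string) error {
		fmt.Println("sent", key)
		return nil
	})
}
```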

Contributor:

Oh, interesting. We support multiple connections to GetWorkItems? I didn't know and was actually going to open an issue for it!

A question though: in case we have, say, 2 connections to the server, when the work item gets added to g.workItemQueue, how does it get allocated to the correct stream? It seems like two concurrent streams trying to read a common channel; won't that cause any issues?

Contributor:

Oh, I had assumed it was supported to have multiple GetWorkItems streams. But this patch doesn't prevent it.

> In case we have, say, 2 connections to the server, when the work item gets added to g.workItemQueue, how does it get allocated to the correct stream?

With multiple GetWorkItems, there's multiple goroutines blocked here:

https://github.com/microsoft/durabletask-go/pull/61/files#diff-fe6ca2dcacabca215bef6921c53b3c197d9c9ea4d1febcf15392a1e441dddc07R166

In this case, one of those listening, at "random", will get the message
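A small self-contained Go example of the channel behavior described here: with several goroutines (several GetWorkItems streams, in this discussion) receiving from the same channel, each value is delivered to exactly one of them, effectively at random.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	workItemQueue := make(chan string)
	var wg sync.WaitGroup

	// Two "streams" blocked on the same queue.
	for stream := 1; stream <= 2; stream++ {
		wg.Add(1)
		go func(stream int) {
			defer wg.Done()
			for item := range workItemQueue {
				// Each item is printed by exactly one of the two goroutines.
				fmt.Printf("stream %d got %s\n", stream, item)
			}
		}(stream)
	}

	for i := 0; i < 4; i++ {
		workItemQueue <- fmt.Sprintf("work-item-%d", i)
	}
	close(workItemQueue) // lets both range loops exit
	wg.Wait()
}
```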

Contributor:

Got it. In that case, it's required to maintain the stream local map. Thanks for explaining.

}
g.logger.Debugf("pending activities after add %s: %#v\n", key, pendingActivities)
}

if err := stream.Send(wi); err != nil {
return err
Contributor:

Can we add a debug log here as well, e.g. "work item stream closed"?

Member Author:

I added g.logger.Errorf("encountered an error while sending work item: %v", err); I think an error here does not always mean the stream was closed, right?

tests/grpc/grpc_test.go (resolved)
g.pendingActivities.Range(func(_, value any) bool {
p, ok := value.(*activityExecutionResult)
if ok {
close(p.complete)
Member Author:

Shouldn't we check whether the complete channel is already closed, since it's closed in multiple places?

Contributor:

If the channel is in the map, it's not been closed. Before closing the channel, the calls use LoadAndDelete which means that the channel is "removed from the map" before close is called on it
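A minimal illustration of this invariant, assuming a simplified map of key to channel: because sync.Map.LoadAndDelete is atomic, only one of two racing cleanup paths can observe the entry, so close is called at most once.

```go
package main

import (
	"fmt"
	"sync"
)

// closeOnce removes the entry before closing it. Only one caller can win the
// LoadAndDelete, so the channel is never closed twice.
func closeOnce(pending *sync.Map, key string) bool {
	if v, ok := pending.LoadAndDelete(key); ok {
		close(v.(chan struct{}))
		return true
	}
	return false // someone else already removed (and closed) it
}

func main() {
	var pending sync.Map
	pending.Store("a1", make(chan struct{}))

	var wg sync.WaitGroup
	var mu sync.Mutex
	wins := 0

	// Two racing cleanup paths (e.g. stream cleanup and shutdown).
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if closeOnce(&pending, "a1") {
				mu.Lock()
				wins++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	fmt.Println("close() calls:", wins) // always 1, never 2
}
```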

@ItalyPaleAle (Contributor):

@cgillum are you good with this PR? I think it fixes some bugs that could make dtf-go hang

@cgillum (Member) commented Mar 26, 2024:

@ItalyPaleAle I'm quite uncomfortable with all the changes related to shutdown, which aren't even relevant for the original issue. I need more time to understand all the implications. If we can reduce the scope of this PR to just fixing the hangs, then I'd be okay with merging it.

@ItalyPaleAle (Contributor):

I had to go back and look at the PR again since it's been a while. I don't remember the reason behind the shutdown changes, but looking at it, there's code that closes channels that would otherwise be left hanging, so it seems to prevent leaking goroutines.

@yaron2 (Contributor) commented Jun 20, 2024:

We've tested this code internally and can confirm it improves reliability greatly.

@yaron2 (Contributor) commented Jun 20, 2024:

cc @cgillum @ItalyPaleAle

@famarting (Contributor):

looks good to me!

Making the ExecuteOrchestrator and ExecuteActivity calls fail when the GetWorkItems function exits greatly increases reliability, because it allows signaling an error to the backend so it can call AbandonActivityWorkItem; that way, depending on the backend implementation, the activity will be retried as soon as the GetWorkItems stream is recreated.
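A much-simplified sketch of the flow described above, with hypothetical types and signatures rather than the real durabletask-go API: when the pending channel is closed without a result, the ExecuteActivity-style call returns an error, and the caller can abandon the work item so the backend redelivers it once a new GetWorkItems stream exists.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-in for the pending activity record.
type activityResult struct {
	response string
	complete chan struct{} // closed by CompleteActivityTask or by stream cleanup
}

var errStreamClosed = errors.New("work item stream closed before a result arrived")

// executeActivity waits for the worker to report a result. If the pending
// channel is closed without a response being set, the stream went away.
func executeActivity(r *activityResult) (string, error) {
	<-r.complete
	if r.response == "" {
		return "", errStreamClosed
	}
	return r.response, nil
}

func main() {
	r := &activityResult{complete: make(chan struct{})}

	// Simulate the GetWorkItems stream dying: cleanup closes the channel
	// without ever setting a response.
	close(r.complete)

	if _, err := executeActivity(r); err != nil {
		// At this point the backend caller would abandon the work item
		// (e.g. via its AbandonActivityWorkItem path) so it is retried later.
		fmt.Println("abandoning work item for retry:", err)
	}
}
```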

kaibocai merged commit 0948712 into main on Jun 28, 2024 (4 checks passed).
kaibocai mentioned this pull request on Jun 28, 2024.
cgillum deleted the debug/server-stuck branch on June 28, 2024 at 22:45.
Merging this pull request may close: Grpc server stuck for certain tests.
6 participants