
BatchGet doesn't wait for BasePolicy.TotalTimeout #354

Open
xqzhang2015 opened this issue Jun 8, 2021 · 11 comments

@xqzhang2015

When waiting for a new connection:

  • a Get op keeps waiting, up to BasePolicy.TotalTimeout
  • a BatchGet op only waits for roughly BasePolicy.MaxRetries * BasePolicy.SleepBetweenRetries, and SleepMultiplier has no effect
    • replica=2 for the namespace
    • no connection is available in this case

if retry, err := bc.retryBatch(bc, cmd.node.cluster, deadline, iterations, commandSentCounter); retry {

For BatchGet, why doesn't it wait until BasePolicy.TotalTimeout?
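
A minimal sketch of the setup (host, namespace, and key values are placeholders; import path as for the v3 client; the policy values are just examples, not the exact ones in production):

package main

import (
	"log"
	"time"

	as "github.com/aerospike/aerospike-client-go"
)

func main() {
	client, err := as.NewClient("10.x.y.1", 3000) // placeholder host
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	key, err := as.NewKey("test", "demo", "key1") // placeholder namespace/set/key
	if err != nil {
		log.Fatal(err)
	}

	// Single-record Get: observed to keep waiting for a connection up to TotalTimeout.
	readPolicy := as.NewPolicy()
	readPolicy.TotalTimeout = 500 * time.Millisecond
	readPolicy.MaxRetries = 2
	readPolicy.SleepBetweenRetries = 1 * time.Millisecond
	_, err = client.Get(readPolicy, key)
	log.Println("Get:", err)

	// BatchGet with the same base-policy values: observed to give up after
	// roughly MaxRetries * SleepBetweenRetries, well before TotalTimeout.
	batchPolicy := as.NewBatchPolicy()
	batchPolicy.TotalTimeout = 500 * time.Millisecond
	batchPolicy.MaxRetries = 2
	batchPolicy.SleepBetweenRetries = 1 * time.Millisecond
	records, err := client.BatchGet(batchPolicy, []*as.Key{key})
	log.Println("BatchGet:", records, err)
}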

@xqzhang2015 changed the title from "BatchGet doesn't wait until total timeout exceeded" to "BatchGet doesn't wait for BasePolicy.TotalTimeout" on Jun 8, 2021
@khaf
Collaborator

khaf commented Jun 8, 2021

retryBatch makes a new command with the same policy, and executes it, which will take care of the timeout logic.
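
Roughly, the intent is something like the following conceptual sketch (placeholder types and functions, not the client's actual source): each retry builds a fresh command against a newly selected node with the same policy, while the deadline derived from TotalTimeout is carried across attempts.

package main

import (
	"errors"
	"fmt"
	"time"
)

// Placeholder types for illustration only; not the client's real types.
type node struct{ name string }
type policy struct {
	TotalTimeout        time.Duration
	MaxRetries          int
	SleepBetweenRetries time.Duration
}
type batchCommand struct {
	node   *node
	policy *policy
}

var errConnectionPoolEmpty = errors.New("connection pool is empty")

// executeOnce stands in for sending the batch to a single node.
func executeOnce(cmd *batchCommand) error { return errConnectionPoolEmpty }

// selectNode stands in for picking a (possibly different) replica node.
func selectNode(attempt int) *node { return &node{name: fmt.Sprintf("node-%d", attempt%2)} }

// retryBatchSketch rebuilds the command with the same policy on every attempt.
// The deadline computed once from TotalTimeout bounds all attempts, but the
// loop also stops as soon as MaxRetries is exhausted.
func retryBatchSketch(p *policy) error {
	deadline := time.Now().Add(p.TotalTimeout)
	for attempt := 0; attempt <= p.MaxRetries; attempt++ {
		cmd := &batchCommand{node: selectNode(attempt), policy: p} // new command, same policy
		if err := executeOnce(cmd); err == nil {
			return nil
		}
		if time.Now().Add(p.SleepBetweenRetries).After(deadline) {
			return errors.New("total timeout exceeded")
		}
		time.Sleep(p.SleepBetweenRetries)
	}
	return errors.New("max retries exhausted before total timeout")
}

func main() {
	err := retryBatchSketch(&policy{
		TotalTimeout:        500 * time.Millisecond,
		MaxRetries:          2,
		SleepBetweenRetries: time.Millisecond,
	})
	fmt.Println(err)
}

Note that in this sketch the loop also ends as soon as MaxRetries is exhausted: with MaxRetries = 2 and SleepBetweenRetries = 1ms it returns after roughly 3ms, long before the 500ms deadline, which is essentially what the rest of this thread discusses.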

@xqzhang2015
Author

xqzhang2015 commented Jun 9, 2021

@khaf thanks for your quick reply.

For batchCommand.Execute(), the iterations argument of executeAt() is -1, but for the first call of retryBatch() the iterations value is already 0, and it keeps increasing for the following retryBatch() calls. So it seems the snippet below doesn't help for batch retries.

cmd.conn, err = ifc.getConnection(policy)
if err != nil {
	isClientTimeout = true

	if err == ErrConnectionPoolEmpty {
		// if the connection pool is empty, we still haven't tried
		// the transaction to increase the iteration count.
		iterations--
	}
	Logger.Debug("Node " + cmd.node.String() + ": " + err.Error())
	continue
}

@khaf
Collaborator

khaf commented Jun 9, 2021

It does. In the new command, a new node will be selected (you never know why the connection pool is empty, but it is mostly due to a faulty node), and the new command will be retried on that node.
This logic, of course, does not retry on the same node.

@xqzhang2015
Author

xqzhang2015 commented Jun 10, 2021

@khaf The empty connection pool here happens because we don't maintain a min idle connection pool, due to a historical limitation. So a new BatchGet after a long idle period always has to wait for a connection, and burst traffic also has to wait for new connections.

The question is that a new BatchGet doesn't wait for BasePolicy.TotalTimeout. Here is the debug log below. So is there a way for BatchGet to wait until TotalTimeout?

  • BasePolicy.MaxRetries = 2
  • BasePolicy.SleepBetweenRetries = 1ms
  • Aerospike cluster replica = 2
  • only one key in the BatchGet request
I0607 07:36:39.613429   14051 command.go:1872] [aerospike] Node BB955E61AF14F0A 10.x.y.1:3000: Connection pool is empty. This happens when either all connection are in-use already, or no connections were available

I0607 07:36:39.614553   14051 command.go:1872] [aerospike] Node BB9D3C58669050A 10.x.y.2:3000: Connection pool is empty. This happens when either all connection are in-use already, or no connections were available

I0607 07:36:39.615713 14051 command.go:1872] [aerospike] Node BB955E61AF14F0A 10.x.y.1:3000: Connection pool is empty. This happens when either all connection are in-use already, or no connections were available

I0607 07:36:39.616863 14051 ... Get response: records=[nil]

Regarding a new node being selected in this case: the replica node will be selected for the retry, right?

aerospike client version: 3.1.1, and it seems the latest code on the master branch has the same logic.

@xqzhang2015
Author

@khaf any comment?

@khaf
Collaborator

khaf commented Jun 14, 2021

Sorry I missed your comment. While a new replica should be selected, the problem is that if that replica is also connection starved, you'll end up with errors. I'll have to think about how to deal with this situation. Will get back to you later this week.

P.S. Why are you so worried about using MinIdleConns?
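
For reference, a minimal sketch of enabling a minimum idle pool via the client policy (the MinConnectionsPerNode field name, the host, and the values are assumptions to adjust for your setup):

package main

import (
	"log"

	as "github.com/aerospike/aerospike-client-go"
)

func main() {
	clientPolicy := as.NewClientPolicy()
	// Keep a floor of warm connections per node so a BatchGet issued after an
	// idle period does not have to open a connection inline (field name assumed).
	clientPolicy.MinConnectionsPerNode = 10
	clientPolicy.ConnectionQueueSize = 256 // upper bound on pooled connections per node

	client, err := as.NewClientWithPolicy(clientPolicy, "10.x.y.1", 3000) // placeholder host
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
}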

@xqzhang2015
Author

@khaf thanks for your comment.

MinIdleConns is a great feature.
For historical reasons, there are multiple clients in one app for the same cluster, and there are thousands of app instances. If each maintained a proper MinIdleConns, the cluster couldn't support that many connections. BTW, refactoring my app can't be done in the short term because of some limitations.

Most importantly, even with MinIdleConns enabled, the connections would NOT be enough for traffic peaks (10x normal), because of the number of app instances.

@xqzhang2015
Author

@khaf any update from your side?
I plan to increase BasePolicy.MaxRetries and SleepBetweenRetries to mitigate the waiting-time issue.

@khaf
Collaborator

khaf commented Jun 21, 2021

I'm still working on this, trying multiple different approaches. The current workaround would be increasing SleepBetweenRetries, but I hope I can get to a better solution.

@khaf
Collaborator

khaf commented Jun 23, 2021

After working on this issue for a few days, I have come to the conclusion that the most scalable way to handle this use case for now is to tweak the values of the following policy attributes:

	.MaxRetries
	.SleepBetweenRetries
	.TotalTimeout
	.SocketTimeout

Any other solution adds a bottleneck for a different use case. The retry mechanism should handle these cases well, given enough time and retries.
I may introduce a policy selector in the future that would allow waiting for the inline (and synchronous) creation of a new connection, to handle these use cases.
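
For example, a sketch of the kind of tuning meant here (the concrete values are placeholders, not recommendations):

package main

import (
	"fmt"
	"time"

	as "github.com/aerospike/aerospike-client-go"
)

// batchPolicyForStarvedPools returns a batch policy tuned so that retries can
// keep running until the total timeout instead of giving up after a couple of
// quick attempts. The numbers are placeholders to adjust per workload.
func batchPolicyForStarvedPools() *as.BatchPolicy {
	p := as.NewBatchPolicy()
	p.TotalTimeout = 500 * time.Millisecond  // overall budget for the whole batch
	p.SocketTimeout = 100 * time.Millisecond // per-attempt network budget
	p.MaxRetries = 10                        // enough attempts to use the budget
	p.SleepBetweenRetries = 20 * time.Millisecond
	p.SleepMultiplier = 1.5 // back off a little between attempts
	return p
}

func main() {
	p := batchPolicyForStarvedPools()
	fmt.Printf("retries=%d total=%s socket=%s\n", p.MaxRetries, p.TotalTimeout, p.SocketTimeout)
}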

@xqzhang2015
Author

@khaf Sounds good, thanks for the work!
