Balanced channel measures RTT, hoping to stay within AZ (and save $) #794

iamdanfox · 2020-05-27T00:11:09Z

Before this PR

There's an outstanding FR to make dialogue smart enough to reduce our AWS spend (#686).

After this PR

There are several ways to achieve this, such as calling the AWS API directly or baking in a hostname convention. This PR implements a version that just measures the RTT of an HTTP OPTIONS call.

==COMMIT_MSG==
Balanced channel should now bias towards whichever node has the lowest latency, which should reduce AWS spend by routing requests within AZ.
==COMMIT_MSG==
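
A minimal sketch of the measurement idea, not the code in this PR: time one probe call and prefer the node with the lowest observed RTT. `RttProbe`, `measureNanos`, and `lowestRttNode` are hypothetical names, and the probe is an arbitrary `Runnable` standing in for the OPTIONS request.

```java
import java.util.Map;
import java.util.function.LongSupplier;

final class RttProbe {
    private RttProbe() {}

    /** Times a single run of the probe (a stand-in for one HTTP OPTIONS call). */
    static long measureNanos(Runnable probe, LongSupplier nanoClock) {
        long start = nanoClock.getAsLong();
        probe.run();
        return nanoClock.getAsLong() - start;
    }

    /** Returns the node with the smallest RTT, or null if nothing has been measured. */
    static String lowestRttNode(Map<String, Long> rttNanosByNode) {
        String best = null;
        long bestRtt = Long.MAX_VALUE;
        for (Map.Entry<String, Long> entry : rttNanosByNode.entrySet()) {
            if (best == null || entry.getValue() < bestRtt) {
                best = entry.getKey();
                bestRtt = entry.getValue();
            }
        }
        return best;
    }
}
```

In real use the clock would be `System::nanoTime`; it is injected here only so the sketch can be exercised deterministically.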

TODO:

- validate whether we can actually identify different AZs by RTT (validated using snapshots)
- ~~maybe just go back to HEAD, OPTIONS doesn't buy us much~~
- don't send an RTT-sampling request for every RPC
- do we need to expire the RTTs eventually?
- tests
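
One way to avoid sampling on every RPC is a non-blocking gate that admits at most one sampling round per interval and never lets two rounds overlap. This is a sketch under assumptions, not the PR's implementation: `SampleGate`, `tryAcquire`, and `release` are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.LongSupplier;

final class SampleGate {
    private final long minIntervalNanos;
    private final LongSupplier nanoClock;
    private final AtomicBoolean inFlight = new AtomicBoolean(false);
    // Only read or written while holding the inFlight flag.
    private volatile long nextAllowedNanos;

    SampleGate(long minIntervalNanos, LongSupplier nanoClock) {
        this.minIntervalNanos = minIntervalNanos;
        this.nanoClock = nanoClock;
        this.nextAllowedNanos = nanoClock.getAsLong();
    }

    /** Returns true if the caller may start a sampling round now. Never blocks. */
    boolean tryAcquire() {
        if (!inFlight.compareAndSet(false, true)) {
            return false; // another round is still running
        }
        long now = nanoClock.getAsLong();
        if (now - nextAllowedNanos < 0) {
            inFlight.set(false); // too soon since the last round
            return false;
        }
        nextAllowedNanos = now + minIntervalNanos;
        return true;
    }

    /** Must be called when the sampling round finishes, successfully or not. */
    void release() {
        inFlight.set(false);
    }
}
```

Callers that get `false` simply skip sampling and carry on with the RPC, so the request path is never delayed.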

Possible downsides?

- if dialogue is used to talk to a server that doesn't support OPTIONS requests, its logs might look weird
- ~~after merging this, the graphs will probably show 'balanced' sending lots of traffic to one node (within AZ) and very little to another. this might confuse people.~~ EDIT: it's now opt-in
- simulations don't exercise this, as they all respond to OPTIONS instantly.

changelog-app · 2020-05-27T00:11:13Z

Generate changelog in `changelog/@unreleased`

Type

Description

Balanced channel now biases towards whichever node has the lowest latency, which should reduce AWS spend by routing requests within AZ.


carterkozak · 2020-05-27T13:40:55Z

...ogue-core/src/main/java/com/palantir/dialogue/core/BalancedNodeSelectionStrategyChannel.java

        @Override
        public String toString() {
-            return "SortableChannel{score=" + score + ", delegate=" + delegate + '}';
+            return "SortableChannel{" + "score=" + score + ", rtt=" + rtt + ", delegate=" + delegate + '}';


Suggested change

-            return "SortableChannel{" + "score=" + score + ", rtt=" + rtt + ", delegate=" + delegate + '}';
+            return "SortableChannel{score=" + score + ", rtt=" + rtt + ", delegate=" + delegate + '}';

iamdanfox · 2020-05-27T22:43:48Z

~~Ok so I've stumbled across an edge case that is probably worth solving:~~

What if we're talking to 10 upstreams, 5 of which are local to my AZ and 5 of which are in another AZ? We probably don't want to firehose all our traffic at whichever of the local 5 happens to be the fastest.

EDIT: solved.
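
A sketch of how a range-based score can avoid this, in the spirit of the "Compute score based on the range of observed rtts" commit: each RTT is scaled against the spread between the best and worst node, so nodes in the same AZ end up with near-identical scores near zero. `RttScores` and `normalize` are hypothetical names, and the exact formula in the PR may differ.

```java
import java.util.OptionalLong;

final class RttScores {
    private RttScores() {}

    /**
     * Maps each known RTT to [0, 1] relative to the observed range.
     * Unknown RTTs, or a zero-width range, score 0 so they add no bias.
     */
    static double[] normalize(OptionalLong[] rttNanos) {
        double[] scores = new double[rttNanos.length];
        long best = Long.MAX_VALUE;
        long worst = Long.MIN_VALUE;
        boolean any = false;
        for (OptionalLong rtt : rttNanos) {
            if (rtt.isPresent()) {
                any = true;
                best = Math.min(best, rtt.getAsLong());
                worst = Math.max(worst, rtt.getAsLong());
            }
        }
        if (!any) {
            return scores; // nothing measured yet
        }
        long range = worst - best;
        if (range <= 0) {
            return scores; // all nodes look equally close
        }
        for (int i = 0; i < rttNanos.length; i++) {
            if (rttNanos[i].isPresent()) {
                scores[i] = (double) (rttNanos[i].getAsLong() - best) / range;
            }
        }
        return scores;
    }
}
```

With local nodes at roughly 1 ms and remote nodes at roughly 5 ms, every local node scores close to 0 and every remote node close to 1, so traffic is spread across the local group rather than concentrated on its single fastest member.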

ferozco · 2020-05-28T17:03:27Z

...ogue-core/src/main/java/com/palantir/dialogue/core/BalancedNodeSelectionStrategyChannel.java

@@ -56,13 +68,24 @@
 final class BalancedNodeSelectionStrategyChannel implements LimitedChannel {


I think we should update the description so that it includes the additional parameters to our cost function

carterkozak

What would it take to support a flag to opt in/out of this feature for initial rollout now that we have excavated out BALANCED?

Added a few comments from an initial read through, still need to do a higher level pass.

carterkozak · 2020-05-28T16:53:25Z

...-core/src/test/java/com/palantir/dialogue/core/BalancedNodeSelectionStrategyChannelTest.java

                .describedAs("%s: Constant 4xxs did move the needle %s", Duration.ofNanos(clock.read()), channel)
                .containsExactly(1, 0);

        incrementClockBy(Duration.ofSeconds(5));

-        assertThat(channel.getScores())
+        assertThat(channel.getScoresForTesting().map(c -> c.getScore()))


peanut gallery: I want an error-prone rule that can refactor this automatically

Suggested change

-        assertThat(channel.getScoresForTesting().map(c -> c.getScore()))
+        assertThat(channel.getScoresForTesting().map(BalancedNodeSelectionStrategyChannel::getScore))

That would be pretty cool! Think we just need to improve https://github.com/palantir/gradle-baseline/blob/15661f333a76997d0a5b79a52af8767e01a86ca1/baseline-error-prone/src/main/java/com/palantir/baseline/errorprone/LambdaMethodReference.java#L45 to handle more than suppliers

there ya go: palantir/gradle-baseline#1359

dialogue-target/src/main/java/com/palantir/dialogue/HttpMethod.java

carterkozak · 2020-05-28T17:06:11Z

...ogue-core/src/main/java/com/palantir/dialogue/core/BalancedNodeSelectionStrategyChannel.java

+        // can return in ~1 ms but others return in ~5ms, the 1ms nodes will all have a similar rttScore (near zero).
+        // Note, this can only be computed when we have all the snapshots in front of us.
+        long rttRange = worstRttNanos - bestRttNanos;
+        if (bestRttNanos != Long.MAX_VALUE && worstRttNanos != 0 && rttRange > 0) {


Perhaps a negative sentinel value, extracted to a static field to make the code easier to read? e.g. `bestRttNanos != UNKNOWN_RTT`

These values are actually the result of the min and max computations above, so they're not arbitrary sentinels. I tried using streams to compute them:

    Arrays.stream(snapshotArray)
            .filter(snapshot -> snapshot.rttNanos.isPresent())
            .mapToLong(snapshot -> snapshot.rttNanos.getAsLong())
            .summaryStatistics();

but `summaryStatistics()` also gives you `Long.MAX_VALUE` as the min and `Long.MIN_VALUE` as the max if there are no inputs.
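
For reference, a small self-contained illustration of that JDK behaviour, alongside `LongStream.min()`, which reports the empty case as an empty `OptionalLong` instead. `EmptyStats` and its methods are hypothetical helper names used only for this sketch.

```java
import java.util.LongSummaryStatistics;
import java.util.OptionalLong;
import java.util.stream.LongStream;

final class EmptyStats {
    private EmptyStats() {}

    /** Summary statistics over the given values; extremes are sentinels when empty. */
    static LongSummaryStatistics stats(long... values) {
        return LongStream.of(values).summaryStatistics();
    }

    /** Minimum as an OptionalLong, which is empty when there are no values. */
    static OptionalLong min(long... values) {
        return LongStream.of(values).min();
    }
}
```

Checking `getCount() == 0` on the statistics object, or using the `OptionalLong` form, avoids having to compare against `Long.MAX_VALUE` directly.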

ferozco · 2020-05-30T00:27:15Z

dialogue-core/src/main/java/com/palantir/dialogue/core/RttSampler.java

+                Arrays.fill(samples, newMeasurement);
+                bestRttNanos = newMeasurement;
+            } else {
+                System.arraycopy(samples, 1, samples, 0, NUM_MEASUREMENTS - 1);


Instead of copying the array, could we treat it as a circular buffer and keep track of our current position?

Yeah I considered this, but figured it was actually gonna be more readable to try and keep it simple here - especially when this codepath only gets invoked at most once per second, and in a non-blocking way :)
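
For comparison, a sketch of the circular-buffer alternative suggested above. It is not what the PR merged (the PR keeps the `System.arraycopy` shift); `MinOfLastN` is a hypothetical name and the class is not thread-safe.

```java
import java.util.OptionalLong;

final class MinOfLastN {
    private final long[] samples;
    private int next = 0;  // index the next sample will be written to
    private int count = 0; // number of valid samples, capped at capacity

    MinOfLastN(int capacity) {
        this.samples = new long[capacity];
    }

    /** Records a sample, overwriting the oldest one once the buffer is full. */
    void add(long value) {
        samples[next] = value;
        next = (next + 1) % samples.length;
        if (count < samples.length) {
            count++;
        }
    }

    /** Minimum of the retained samples, or empty if nothing has been recorded. */
    OptionalLong min() {
        if (count == 0) {
            return OptionalLong.empty();
        }
        long result = Long.MAX_VALUE;
        for (int i = 0; i < count; i++) {
            result = Math.min(result, samples[i]);
        }
        return OptionalLong.of(result);
    }
}
```

This avoids the copy at the cost of an extra index to maintain; for a handful of samples updated about once per second, the difference is unlikely to be measurable, which is consistent with the reply above.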

ferozco

overall looks great! excited to try this out

ferozco · 2020-06-01T12:29:27Z

👍

iamdanfox added 4 commits May 27, 2020 00:47

Balanced measures RTT using OPTIONS requests

5c39335

Compile

bc5b31d

Handl OPTIONS

109b05d

Update simulations

4661549

iamdanfox requested review from carterkozak and ferozco May 27, 2020 00:11

probot-autolabeler bot added the autorelease label May 27, 2020

iamdanfox changed the title ~~Balanced channel tiebreaks using RTT, hoping to stay within AZ (and save $)~~ [proof of concept] Balanced channel tiebreaks using RTT, hoping to stay within AZ (and save $) May 27, 2020

Updatee tests for rtt sampling

17c248e

carterkozak reviewed May 27, 2020

iamdanfox added 4 commits May 27, 2020 16:26

Accumulate RTT average

a5e7531

Sample all channels, with two sequential calls

e398af0

Also log accumulated ones

0701a64

Just store the min

bef11a4

iamdanfox added 7 commits May 28, 2020 00:49

Compute score based on the range of observed rtts

2284f62

Delete misleading method

961b240

Fix unit tests

3c2dc94

Update simulations

9514f91

streams are fine here

cb3b440

test for rate limiter

9398d9a

Simulations

b17e311

iamdanfox changed the title ~~[proof of concept] Balanced channel tiebreaks using RTT, hoping to stay within AZ (and save $)~~ [proof of concept] Balanced channel measures RTT, hoping to stay within AZ (and save $) May 28, 2020

iamdanfox changed the title ~~[proof of concept] Balanced channel measures RTT, hoping to stay within AZ (and save $)~~ Balanced channel measures RTT, hoping to stay within AZ (and save $) May 28, 2020

iamdanfox added 4 commits May 28, 2020 02:08

Smaller diff

3beac0f

Add generated changelog entries

c40950f

Don't allow multiple samples to run at the same time

c913334

Return the min of the last 5 measurements

7005529

ferozco reviewed May 28, 2020

ferozco reviewed May 28, 2020

carterkozak reviewed May 28, 2020

Pull everything out to a dedicated 'RttSampler' class

64a955b

ferozco reviewed May 28, 2020

ferozco reviewed May 28, 2020

iamdanfox added 6 commits May 28, 2020 18:18

Use OptionalLong instead of Long.MAX_VALUE as special value

2389a06

be more immutable, reduce diff

3adc16c

Feature flag it off

eb76abb

Allow servers to enable it with BALANCED_RTT

4f73a90

Re-run simulations

f364205

Fix tests

6f9804a

iamdanfox mentioned this pull request May 29, 2020

Support http OPTIONS #802

Merged

iamdanfox added 8 commits May 29, 2020 16:12

Merge remote-tracking branch 'origin/develop' into dfox/rtt

d81eb17

Merge remote-tracking branch 'origin/develop' into dfox/rtt

c0a90b4

Merge branch 'dfox/rtt' of ssh://github.com/palantir/dialogue into dfox/rtt

384cf55

Ensure we send a good user agent with these OPTIONS requests

0128c4e

More CR

1af6a22

Move logic to RttSampler

f156637

Appease errorprone

f62429d

Minimise diff

d873cd1

iamdanfox added the merge when ready label May 29, 2020

ferozco reviewed May 30, 2020

Remember to close the response!

f6f6917

bulldozer-bot bot merged commit df5329a into develop Jun 1, 2020

bulldozer-bot bot deleted the dfox/rtt branch June 1, 2020 12:29

iamdanfox mentioned this pull request Jun 12, 2020

Promote BALANCED_RTT2 #843

Closed
