Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Balanced channel measures RTT, hoping to stay within AZ (and save $) #794

Merged
merged 38 commits into from
Jun 1, 2020

Conversation

iamdanfox
Copy link
Contributor

@iamdanfox iamdanfox commented May 27, 2020

Before this PR

There's an outstanding FR to make dialogue smart enough to reduce our AWS spend (#686).

After this PR

There are a variety of ways to achieve this, calling the AWS API directly, baking in a hostname convention, but this PR implements a version which just measures the RTT for a HTTP OPTIONS call.

==COMMIT_MSG==
Balanced channel should now bias towards whichever node has the lowest latency, which should reduce AWS spend by routing requests within AZ.
==COMMIT_MSG==

TODO:

  • validate whether we can actually identify different AZs by RTT (validated using snapshots)
  • maybe just go back to HEAD, OPTIONS doesn't buy us much
  • don't send a RTT-sampling request for every RPC
  • do we need to expire the RTTs eventually?
  • tests

Possible downsides?

  • if dialogue is used to talk to a server that doesn't support OPTIONS requests, their logs might look weird
  • after merging this, the graphs will probably show 'balanced' sending lots of traffic to one node (within AZ) and very little to another. this might confuse people. EDIT it's now opt-in
  • simulations don't exercise this as they all return OPTIONS instantly.

@changelog-app
Copy link

changelog-app bot commented May 27, 2020

Generate changelog in changelog/@unreleased

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

Balanced channel now biases towards whichever node has the lowest latency, which should reduce AWS spend by routing requests within AZ.

Check the box to generate changelog(s)

  • Generate changelog entry

@iamdanfox iamdanfox changed the title Balanced channel tiebreaks using RTT, hoping to stay within AZ (and save $) [proof of concept] Balanced channel tiebreaks using RTT, hoping to stay within AZ (and save $) May 27, 2020
@Override
public String toString() {
return "SortableChannel{score=" + score + ", delegate=" + delegate + '}';
return "SortableChannel{" + "score=" + score + ", rtt=" + rtt + ", delegate=" + delegate + '}';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return "SortableChannel{" + "score=" + score + ", rtt=" + rtt + ", delegate=" + delegate + '}';
return "SortableChannel{score=" + score + ", rtt=" + rtt + ", delegate=" + delegate + '}';

@iamdanfox
Copy link
Contributor Author

iamdanfox commented May 27, 2020

Ok so I've stumbled across an edge case that is probably worth solving:

what if we're talking to 10 upstreams, 5 of which are local to my AZ and 5 of which are in another AZ. We probably don't want to firehose all our traffic at whichever of the local 5 happens to be the fastest.

EDIT solved.

@iamdanfox iamdanfox changed the title [proof of concept] Balanced channel tiebreaks using RTT, hoping to stay within AZ (and save $) [proof of concept] Balanced channel measures RTT, hoping to stay within AZ (and save $) May 28, 2020
@iamdanfox iamdanfox changed the title [proof of concept] Balanced channel measures RTT, hoping to stay within AZ (and save $) Balanced channel measures RTT, hoping to stay within AZ (and save $) May 28, 2020
@@ -56,13 +68,24 @@
final class BalancedNodeSelectionStrategyChannel implements LimitedChannel {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should update the description so that it includes the additional parameters to our cost function

Copy link
Contributor

@carterkozak carterkozak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would it take to support a flag to opt in/out of this feature for initial rollout now that we have excavated out BALANCED?

Added a few comments from an initial read through, still need to do a higher level pass.

.describedAs("%s: Constant 4xxs did move the needle %s", Duration.ofNanos(clock.read()), channel)
.containsExactly(1, 0);

incrementClockBy(Duration.ofSeconds(5));

assertThat(channel.getScores())
assertThat(channel.getScoresForTesting().map(c -> c.getScore()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

peanut gallery: I want an error-prone rule that can refactor this automatically

Suggested change
assertThat(channel.getScoresForTesting().map(c -> c.getScore()))
assertThat(channel.getScoresForTesting().map(BalancedNodeSelectionStrategyChannel::getScore))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// can return in ~1 ms but others return in ~5ms, the 1ms nodes will all have a similar rttScore (near zero).
// Note, this can only be computed when we have all the snapshots in front of us.
long rttRange = worstRttNanos - bestRttNanos;
if (bestRttNanos != Long.MAX_VALUE && worstRttNanos != 0 && rttRange > 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps a negative sentinel value, extracted to a static field to make the code easier to read? bestRttNanos != UNKNOWN_RTT

Copy link
Contributor Author

@iamdanfox iamdanfox May 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is actually the result of doing the min and max things above, so it's not actually an arbitrary sentinel. I tried using streams to compute these:

Arrays.stream(snapshotArray)
                .filter(snapshot -> snapshot.rttNanos.isPresent())
                .mapToLong(snapshot -> snapshot.rttNanos.getAsLong())
                .summaryStatistics();

but the summaryStatistics also gives you a Long.MIN_VALUE/MAX_VALUE if there is no inputs.

@iamdanfox iamdanfox mentioned this pull request May 29, 2020
Arrays.fill(samples, newMeasurement);
bestRttNanos = newMeasurement;
} else {
System.arraycopy(samples, 1, samples, 0, NUM_MEASUREMENTS - 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of copying the array could we treat the array as a circular buffer keeping track of our current location

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I considered this, but figured it was actually gonna be more readable to try and keep it simple here - especially when this codepath only gets invoked at most once per second, and in a non-blocking way :)

Copy link
Contributor

@ferozco ferozco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall looks great! excited to try this out

@ferozco
Copy link
Contributor

ferozco commented Jun 1, 2020

👍

@bulldozer-bot bulldozer-bot bot merged commit df5329a into develop Jun 1, 2020
@bulldozer-bot bulldozer-bot bot deleted the dfox/rtt branch June 1, 2020 12:29
@iamdanfox iamdanfox mentioned this pull request Jun 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants