metrics: report pod communication latency percentiles in pdf #212
base: master
Conversation
Request for comments:
Thanks, this looks good. There are a couple of issues to address. One is that, without a legend, the chart is impossible to interpret without reading the R code. The second is rendering more than one test run (data directory) for comparison: the code loops over more than one directory if present, but only the last result will be rendered.
I've done one R metric latency graph before, over in the kata fio metrics. I was only plotting two percentiles, but you can see the example graphs on the issue at kata-containers/tests#885. Note I used colours/opacity to try to group and separate the tests and the values. Yeah, I do think we need a legend on the graph to show which colours/lines mean what :-)
Force-pushed from b3c1aee to 53b2412.
metrics/scaling/k8s_scale_nc.sh (outdated)
@@ -22,6 +22,7 @@ nc_req_msg_len=${nc_req_msg_len:-1000}
nc_port=33101
# request message
nc_req_msg=$(head -c $nc_req_msg_len /dev/zero | tr '\0' 'x')
nc_percentiles=(5 25 50 75 95)
I'm testing this PR (struggling a little to get good data to check the views - that is more a 'local' issue for me though - working on it).... a high-level observation though: I think it would be a good idea if we stored the raw data in the JSON results, as well as the cooked percentiles. That way, we can:
- potentially draw some other useful graphs in the R - boxplots, for instance - covering all the data/results for better fine-grained insights
- later potentially modify the R and/or any ELK stack views or post processing we have, and still be able to apply that to historic data

I suspect we also want to hold some 'cooked' data in the JSON, like at present, as that cooking is not necessarily going to be easy to do inside ELK, and generating and presenting pre-cooked data into ELK is for now the easiest UI visualisation route.
Probably don't go adding this until we've reviewed all the code - may as well do a single-pass PR update in one go ;-)
OOI @askervin - I see you are taking the time measures for the
What is interesting for me right now is that we don't really get to see that spread across the 4-5ms range, and we don't get to see those tail-end outliers that ramp from 6ms up to 11ms. This is the sort of reason we should probably capture the raw data in the JSON and find some way to show that data in the graphs - I think it can give us more insight than the rather step-function result we are currently seeing.
median_index = match("p50", names(ltpt))
if (!is.na(median_index)) {
	# Draw median line
	latp = latp + geom_line(data=latencydata[latencydata$testname==plot_test,], aes_string(x="n_pods", y=names(ltpt)[median_index], color="testname"))
@askervin - I was wondering if there might be some built-in R functionality for generating a similar chart with 'fading percentile bands' from the raw data - it looks very similar to a fanchart/fanplot. Not sure if we can utilise that with our data under R though (I did not dig too far): https://journal.r-project.org/archive/2015-1/abel.pdf
@grahamwhaley , I agree that millisecond granularity is rather coarse for reporting network latency. I get 12 ms on my VM cluster, but 3-4 ms on real hardware. However, I think we should make sure that we'd really benefit from finer granularity instead of just adding noise. For the current measurement loop, I compared
On my high-performance desktop measuring the time of "do nothing":
Another thing is the latency server implementation. Trying out different server+client combos on localhost:
I would conclude that if we want to get to the microsecond level, we should not do the time measurement with
Good data, and a valid point on the noise/overheads of the tools contributing significantly to the end result ;-) That fits/correlates with what I was inferring - that we are trimming off too many significant digits right now, I think. In which case, I agree, we probably want to be using better tools to measure the latency. First, we should probably define exactly what we are trying to measure (TCP, or just network response time from a pod, for instance) - and then we can consider other specific network tools. One of the problems with network tools is that there are just soooo many to choose from...
@grahamwhaley , good point. I changed the time in the result JSON to microsecond accuracy. We can handle the measurement method as another topic; in any case, the current round-trip-time accuracy is enough for spotting a reproducible anomaly after scaling to 400+ pods. I also changed the number of percentiles from 5 to 9, so that we can see where the previously hidden 5 % of latency times land. (Added the min, max, 1 % and 99 % percentiles.) Saving the original measurements with all necessary details (like the pod IPs associated with each latency time, and the timestamp) helps debugging anomalies. It is also a remedy to the concern about having only "cooked" data available after a test run.
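For reference, a minimal sketch of the microsecond timing in question (assuming bash 5+ for EPOCHREALTIME; pod_ip, nc_port and nc_req_msg below are illustrative stand-ins for the script's variables):

#!/usr/bin/env bash
# Sketch only: $EPOCHREALTIME expands to "seconds.microseconds", so stripping
# the dot yields integer microseconds since the epoch, without the fork
# overhead of command substitutions like $(date +%s%N).
pod_ip="10.0.0.1"
nc_port=33101
nc_req_msg="xxxx"
t_start=${EPOCHREALTIME/./}
echo "${nc_req_msg}" | nc "${pod_ip}" "${nc_port}" > /dev/null
t_end=${EPOCHREALTIME/./}
echo "round trip: $(( t_end - t_start )) us"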
Whilst I remember, something I spotted that is not actually modified in this PR....

"Pod_command": "${pod_command//\"/\\\"}",
"Request_length": "${nc_req_msg_len}",
"Requests_per_pod": "${nc_reqs_per_pod}",
"Sender": "serial",
"Percentiles": [$(IFS=, ; echo "${latency_percentiles[*]}")],
@askervin - some feedback, mostly just little things.
I think if/before we land this in the code base, though, we would want to clean up a couple of things, like the graph legend and, if we still use them, the box plots.
But, overall, looking useful and hopefully giving you the data you need.
metrics/scaling/k8s_scale_nc.sh (outdated)
@@ -338,6 +339,8 @@ run() {
	sleep ${settle_time}

	if [[ ${nc_reqs_per_pod} -ge 1 ]]; then
		mkdir -p "$RESULT_DIR" 2>/dev/null || true
		local latency_raw_output="$RESULT_DIR/${TEST_NAME// /-}.tmaster_tworker_pods_r_ipord_ipaddr_lattot_latconn_latio_latdisconn_rx.raw"
Took me a moment to figure out what that lengthy name was... ah, you have encoded the column names into the filename? Is that a common idiom?
Given you are storing the raw data in a separate file, rather than in the JSON (which is where I thought it would go), maybe you can make that file a CSV: then you can encode the column names in the first row and just use .csv on the filename?
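A minimal sketch of that suggestion (the column and variable names are hypothetical, lifted from the ones currently packed into the .raw filename):

# Sketch only: the header row carries the column names, so the filename can
# stay short and the file is directly loadable with R's read.csv().
latency_raw_output="$RESULT_DIR/${TEST_NAME// /-}-latency.csv"
echo "tmaster,tworker,pods,r,ipord,ipaddr,lattot,latconn,latio,latdisconn,rx" > "$latency_raw_output"
# ...then one comma-separated row per measurement:
echo "${t_master},${t_worker},${n_pods},${r},${ip_ord},${pod_ip},${lat_tot},${lat_conn},${lat_io},${lat_disconn},${rx}" >> "$latency_raw_output"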
metrics/scaling/k8s_scale_nc.sh (outdated)
if [[ $(echo ${nc_req_msg} | nc ${pod_ip} ${nc_port}) != "${nc_req_msg}" ]]; then
pod_ip_ord=$(( pod_ip_ord + 1 ))
local latency_failed=0
local latency_pod_start_time=${EPOCHREALTIME/./}
Nice find with EPOCHREALTIME :-) - nitpick: the JSON still says Units of ms, I think - let's update that to be correct (even if we don't presently use that Units field).
library(tibble) # tibbles for tidy data

testnames=c(
	"k8s-scaling-nc.*"
I found in another PR that the .* is a bogus overhang from another R file, and it messes up the regexp that is then constructed below. Just delete the .* from the string, I think.
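i.e. the suggested fix is just (assuming, as in the other R scripts, that the matching regexp is built from this string later):

testnames=c(
	"k8s-scaling-nc"	# plain name, no trailing ".*"
)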
#### Now extract latency time percentiles (ltp) ########
########################################################
ltp=br$latency_time$Percentiles
# Percentile thresholds, for example [5, 25, 50, 75, 95]
Is this comment now out of date, since you added 0/1/99/100 as well, IIRC?
	}
}

if (length(latencydata[[1]]) <= 20 && perc_count == 5) {
Why perc_count == 5? - ah, I guess as the boxplot takes 5 input values....
Hmm, so, two things then:
- given we now generate >5 items in the test run, I suspect this code will no longer run?
- there is no strict standard for what all the whiskers on a boxplot mean - I suspect the outer ones are maybe expected to be some stddev extraction, but I think we are using them as percentiles. We should probably document that in the graph (unless it is obvious from the axis - I don't have a graph to hand to check); see the sketch after this list. There is an example at https://owi.usgs.gov/blog/boxplots/, but that might be a little OTT.
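A hedged sketch of one way to make the whisker meaning explicit (the p5..p95 column names and the latencydata layout are assumptions based on this PR's five default percentiles):

library(ggplot2)
# Sketch only: stat="identity" feeds pre-computed percentiles straight into
# the boxplot instead of ggplot2's default 1.5*IQR whiskers, and the caption
# documents what the whiskers mean.
latp = ggplot(latencydata, aes(x=factor(n_pods))) +
	geom_boxplot(aes(ymin=p5, lower=p25, middle=p50, upper=p75, ymax=p95),
		stat="identity") +
	labs(x="n_pods", y="latency",
		caption="whiskers: 5th/95th percentiles; box: 25th-75th; line: median")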
	}
}

cat("\n\nLatency percentiles illustrated in the Figure below: ", paste0(ltp_perc, "\\%"), "\n\n")
Ah, look at that - so, maybe disregard some of my above comments if we think the graph is clear enough already ;-)
suppressMessages(suppressWarnings(library(ggplot2))) # ability to plot nicely.
suppressMessages(library(jsonlite)) # to load the data.
suppressMessages(library(scales)) # For de-science notation of axis
library(tibble) # tibbles for tidy data
I think you might need gridExtra here, like we have in tidy_scaling.R. Right now it works because you inherit it from the tidy_scaling.R that runs before this script, but when run on its own in debug mode I see:

Error in grid.arrange(latp, ncol = 1) :
  could not find function "grid.arrange"
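i.e. presumably just adding, next to the other library() calls:

suppressMessages(library(gridExtra))	# provides grid.arrange() when run standalone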
nudge.
upper_prev_name = names(ltpt)[perc_count-n+1]
perc_dist = abs(n-perc_mid)
alpha = 0.7 * ((n+1) / (perc_mid+1))**2
latp = latp + geom_ribbon(data=latencydata[latencydata$testname==plot_test,], aes_string(x="n_pods", ymin=lower_name, ymax=lower_next_name, fill="testname"), alpha=alpha)
Before we land this, we should try to generate a proper legend for the plot. I suspect that will either involve some heavy aes() usage, or generating a custom legend.
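One possible direction, sketched on the assumption that each ribbon can be given a named percentile band: move the alpha into aes() as a discrete label and let scale_alpha_manual() generate the legend (band names and alpha values below are illustrative):

# Sketch only: mapping alpha to a named band inside aes() makes ggplot2
# produce a legend entry per band, unlike the silent alpha=... argument.
latp = latp +
	geom_ribbon(data=latencydata[latencydata$testname==plot_test,],
		aes(x=n_pods, ymin=p5, ymax=p25, fill=testname, alpha="p5-p25")) +
	geom_ribbon(data=latencydata[latencydata$testname==plot_test,],
		aes(x=n_pods, ymin=p25, ymax=p75, fill=testname, alpha="p25-p75")) +
	scale_alpha_manual(name="percentile band",
		values=c("p5-p25"=0.3, "p25-p75"=0.7))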
Force-pushed from dfed4dc to 2ff6fc1.
Rebased the whole series. Now this is based on the rapid test. Quite a few changes; see the last commit message.
Hi @askervin - thanks for the update. I tried to give it a quick spin - initial feedback...
(oh, I got a run by hacking out my collectd check bug and setting no_api, btw)
Force-pushed from 2ff6fc1 to b7ea60c.
- Add R routine to load and plot percentiles from JSON.
- Support any number of percentiles 1..n.
- Store percentile configuration in JSON.

Signed-off-by: Antti Kervinen <[email protected]>
Force-pushed from e67521b to 866a20b.
@grahamwhaley , thanks for your feedback!
Issue:
But curl gets stuck if I just s/http:/https:/. I need to figure out how to make the apiserver accept http requests. Do you have any tips for this?
Todo:
I may know what the …
I'm slightly confused though, as I was sure I'd been running API-based tests on the …
(oops, an accidental mis-click closed this.... re-opened now....)
OK, take that kubeproxy item with a little pinch of salt - it turned out that on my test machine I still had a kube-proxy running (probably from the clr-k8s-examples stackup) when I tried to run and bind to the kind cluster. I still got some 'nc TIMEOUT' errors, but wrt the use of
### For developers: uncomment following variables to run this as-is in R
# resultdirs=c("PATH/TO/RES1/", ...) # keep the ending slash on result paths
# inputdir=""
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
	}
}

# Table presentation.
I don't comprehend how the code relates to the final table - can you add some commentary please?
For instance, in my 20-pod run I see a table in the pdf with rows for 1, 5 and 10 pods - how is that math done? I thought maybe it would be 1, 10, 20 for a 20-pod run, for instance?
local latency_json="$(cat << EOF
"latency_time": {
	"Percentiles": [$(IFS=, ; echo "${latency_percentiles[*]}")],
	"Result": ${latency_percentiles[$(( ${#latency_percentiles[@]} / 2 ))]},
Nitpick - we might want to rename Result to something like Midpoint, to reflect the actual value it holds.
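For clarity, what that index picks (values illustrative): with an odd-length percentile array, integer division lands on the middle element, i.e. the median bucket - hence the Midpoint suggestion.

latency_percentiles=(1200 1350 1500 1800 2400)
echo "${latency_percentiles[$(( ${#latency_percentiles[@]} / 2 ))]}"	# index 5/2 = 2, prints 1500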
# Measure network latency
if [[ ${nc_reqs} -ge 1 ]]; then
	mkdir -p "$RESULT_DIR" 2>/dev/null || true
	local latency_raw_output="$RESULT_DIR/${TEST_NAME// /-}.tmaster_tworker_pods_req_ipaddr_lattot_latconn_latio_latdisconn_rx.raw"
I still don't like the mega-huge raw name.... or that it's not a JSON or actual CSV :-)
Signed-off-by: Antti Kervinen <[email protected]>