-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add latency timeseries output #8
base: main
Are you sure you want to change the base?
Conversation
Looks good so far. |
arcaflow_plugin_rtla/rtla_plugin.py
Outdated
# the relative times will be accurate. We'll accept this as good enough. | ||
uptime_offset = time.time() - time.monotonic() | ||
|
||
if all(False for _ in timeseries_lines): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ought to "eat" the first element in your iterator so that the subsequent for
loop won't see it. Is that what you intended, or is this a bug?
Why not just
timeseries_lines = timeseries_output.splitlines()
if len(timeseries_lines) == 0:
print(...)
for line in timeseries_lines:
...
Or
timeseries_lines = timeseries_output.splitlines()
for line in timeseries_lines:
...
else:
print("No results ...")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that second option, but don't think it will actually do what we want. 🙃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh ... you're probably right. The else
triggers only if there's no break
, which we don't want here. Sigh. Well, it seemed like a good idea at the time. 😦
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're simply testing for the presence of output, I think we should be looking at timeseries_output
rather than the iterator that we created for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sigh. Well, it seemed like a good idea at the time. 😦
It did! (I really wanted it to be the answer!) In fact, I almost offered a similar suggestion later...and then realized that it was just a convolution of the same, non-working, approach. 😢
In the win column, on the other hand, I confirmed for myself that while
also offers an else
clause (but, it didn't turn out to help in the case I was looking at, either.) (I think I was "a man with a hammer" there for awhile. 😆)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good to me, I just have a few (pointed) questions...and a bunch of suggestions....
arcaflow_plugin_rtla/rtla_plugin.py
Outdated
import re | ||
import sys | ||
import typing | ||
import os.path | ||
import time | ||
from threading import Event | ||
from datetime import datetime | ||
from arcaflow_plugin_sdk import plugin, predefined_schemas | ||
from rtla_schema import ( | ||
TimerlatInputParams, | ||
latency_stats_schema, | ||
latency_timeseries_schema, | ||
TimerlatOutput, | ||
ErrorOutput, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since these are "early days", it would be nice to establish some practice around imports. PEP 8 has some guidance, of particular note is the suggestion of having three sections of imports with blank lines between them, to which I would add the suggestion of keeping the lists in alphabetical order within the sections.
There's a nice tool, isort
which will check and/or maintain this for you. (You can install it with pip
.)
arcaflow_plugin_rtla/rtla_plugin.py
Outdated
wait_time = 0 | ||
timeout_seconds = 5 | ||
sleep_time = 0.5 | ||
trace_file_exists = False | ||
while not trace_file_exists and wait_time < timeout_seconds: | ||
time.sleep(sleep_time) | ||
trace_file_exists = os.path.isfile(trace_path) | ||
wait_time += sleep_time | ||
|
||
if not trace_file_exists: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would use an absolute time deadline instead of a counted number of intervals; and, I would avoid capturing the isfile()
result in a local variable, since only the last call really matters. This reduces the loop setup and makes the loop body much simpler.
wait_time = 0 | |
timeout_seconds = 5 | |
sleep_time = 0.5 | |
trace_file_exists = False | |
while not trace_file_exists and wait_time < timeout_seconds: | |
time.sleep(sleep_time) | |
trace_file_exists = os.path.isfile(trace_path) | |
wait_time += sleep_time | |
if not trace_file_exists: | |
timeout = datetime.now() + timedelta(seconds=5) | |
while not os.path.isfile(trace_path) and datetime.now() < timeout: | |
time.sleep(0.5) | |
if not os.path.isfile(trace_path): |
arcaflow_plugin_rtla/rtla_plugin.py
Outdated
if params.enable_time_series and trace_file_exists: | ||
# Interrupt the time series collection process and capture the output |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would base this test on whether the subprocess, timeseries_proc
exists -- that's what we want to signal, not the file, and that's the value that we're going to use to send the signal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a bad idea. timerlat_proc.send_signal(0)
should raise an OSError
if the pid is invalid or doesn't exist... (although this will somewhat depend on exactly how timerlat_proc.pid
is managed under the covers, and I haven't experimented).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was actually thinking more about the memory object, timeseries_proc
, as opposed to the process which it represents (especially since we would need a valid object in order to invoke send_signal(0)
, and, if we have one, it's reasonable to assume that the process was successfully created).
In any case, sending "signal zero" seems like a good idea except for the fact that the next thing we intend to do is to send signal 2
-- we might as well skip the preamble once we've decided to try sending a signal.
arcaflow_plugin_rtla/rtla_plugin.py
Outdated
|
||
if params.enable_time_series and trace_file_exists: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, we don't care about whether the trace file exists/ed, here; what we care about is whether we got anything in timeseries_output
, so we should test that, instead.
arcaflow_plugin_rtla/rtla_plugin.py
Outdated
# the relative times will be accurate. We'll accept this as good enough. | ||
uptime_offset = time.time() - time.monotonic() | ||
|
||
if all(False for _ in timeseries_lines): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're simply testing for the presence of output, I think we should be looking at timeseries_output
rather than the iterator that we created for it.
arcaflow_plugin_rtla/rtla_plugin.py
Outdated
line_list = line.split() | ||
# Because the tracer format is dependent on the underlying OS and cannot | ||
# be controlled by the container, check the tracer output format and | ||
# break gracefully if we don't recognize it | ||
try: | ||
cpu = int(line_list[1][1:-1]) | ||
uptimestamp = float(line_list[3][:-1]) | ||
timestamp = str( | ||
datetime.fromtimestamp(uptimestamp + uptime_offset) | ||
.astimezone() | ||
.isoformat() | ||
) | ||
context = str(line_list[6]) | ||
latency = int(line_list[8]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using a regex with capture groups, instead. I think it will ultimately be simpler (to read and maintain, at least), and the "unknown format" result will be much simpler and more obvious.
arcaflow_plugin_rtla/rtla_plugin.py
Outdated
timestamp = str( | ||
datetime.fromtimestamp(uptimestamp + uptime_offset) | ||
.astimezone() | ||
.isoformat() | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think astimezone()
is doing for you, here? If you're trying to produce a timezone-aware datetime
value, perhaps you should specify the tz
parameter on the fromtimestamp()
call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The advantage of .astimezone()
is that it defaults to local time without having to supply a parameter, I suppose. Specifying an explicit zone in fromtimestamp
would be a bit better documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right but we have the whole provenience problem: what timezone was the uptimestamp
in?...what timezone are we trying to display it in?...how is using timezone-naive times (even briefly) going to relieve the confusion? Wouldn't it be better to be explicit through the whole chain?
arcaflow_plugin_rtla/rtla_plugin.py
Outdated
if cpu_context not in timeseries_dict: | ||
timeseries_dict[cpu_context] = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using a defaultdict
instead of rolling your own.
arcaflow_plugin_rtla/rtla_plugin.py
Outdated
if params.enable_time_series and trace_file_exists: | ||
# Interrupt the time series collection process and capture the output |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a bad idea. timerlat_proc.send_signal(0)
should raise an OSError
if the pid is invalid or doesn't exist... (although this will somewhat depend on exactly how timerlat_proc.pid
is managed under the covers, and I haven't experimented).
arcaflow_plugin_rtla/rtla_plugin.py
Outdated
# The calculation of the offset from uptime to current time certainly means | ||
# that our time series is not 100% accurately aligned to the nanosecond, but | ||
# the relative times will be accurate. We'll accept this as good enough. | ||
uptime_offset = time.time() - time.monotonic() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In terms of the generic/standard definitions of "clock monotonic" this is certainly true; whether the actual relationships of time.time
and time.monotonic
on RHEL 8/9/10 provides additional guarantees that make this operation reasonable is a different matter...
arcaflow_plugin_rtla/rtla_plugin.py
Outdated
# Because the tracer format is dependent on the underlying OS and cannot | ||
# be controlled by the container, check the tracer output format and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you thought of using uname -r
, which should report the "host" kernel version as it uses the kernel service?
The biggest concern might be superficially compatible formats that don't break your limited parsing constraints but result in bad data ... if the format really changes significantly.
arcaflow_plugin_rtla/rtla_plugin.py
Outdated
timestamp = str( | ||
datetime.fromtimestamp(uptimestamp + uptime_offset) | ||
.astimezone() | ||
.isoformat() | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The advantage of .astimezone()
is that it defaults to local time without having to supply a parameter, I suppose. Specifying an explicit zone in fromtimestamp
would be a bit better documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found a small bug with the CPU number handling.
if all(False for _ in timeseries_output): | ||
print( | ||
"No results reading tracer output; " | ||
"Skipping time series collection\n" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As Dave mentioned earlier, this check is going to eat the first line of the timeseries output. You should put a comment here noting that, if it's actually intentional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not following how you are seeing that the first line gets eaten.
>>> i = ["a", "b", "c", "d"]
>>> if all(False for _ in i):
... print("bad juju")
...
>>> for line in i:
... print(line)
...
a
b
c
d
>>> i = []
>>> if all(False for _ in i):
... print("bad juju")
...
bad juju
>>> for line in i:
... print(line)
...
>>>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try your experiment with i
set to an iterator or a generator. With it set to a list
, of course you see the first item when you check after the all()
call, because the list state is unchanged. 😁
(Just to be blisteringly clear, reading a file line-by-line in a loop uses an iterator/generator.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some typos and a suggestion. (I'm "approving" to remove my previous "changes requested" mark; however, there are still a number of open conversations, above.)
# The first object in the line is a string, and it is possible for | ||
# it to have spaces in it. We'll try to cast the second object, | ||
# which should be the CPU number surrounded with [], to an int. If | ||
# that fails, we discard that object from the line_list. | ||
for i, _ in enumerate(line_list): | ||
if i == 1: | ||
try: | ||
cpu_val = re.compile(r"\[([0-9][0-9][0-9])\]") | ||
cpu = int(cpu_val.match(line_list[i]).group(1)) | ||
break | ||
except (ValueError, AttributeError): | ||
line_list.pop(i) | ||
except IndexError as error: | ||
print( | ||
"Unable to determine CPU number; Skipping time " | ||
f"series collection: {error}\n{line}" | ||
) | ||
timeseries_dict = {} | ||
break | ||
# it to have spaces in it, so we split first on the expected "[" | ||
# deliniator for the CPU number field. | ||
# FIXME: I can't be 100% sure that a "[" wont' appear in the string | ||
# that is the first object of the line. | ||
stripped_line = line.split("[") | ||
line_list = stripped_line[1].split() | ||
|
||
# Because the tracer format is dependent on the underlying OS and | ||
# cannot be controlled by the container, check the tracer output | ||
# format and break gracefully if we don't recognize it | ||
try: | ||
uptimestamp = float(line_list[3][:-1]) | ||
timestamp = str( | ||
cpu = int(line_list[0][:-1]) | ||
uptimestamp = float(line_list[2][:-1]) | ||
timestamp = ( | ||
datetime.fromtimestamp(uptimestamp + uptime_offset) | ||
.astimezone() | ||
.isoformat() | ||
) | ||
context = str(line_list[6]) | ||
latency = int(line_list[8]) | ||
context = str(line_list[5]) | ||
latency = int(line_list[7]) | ||
|
||
except (IndexError, ValueError) as error: | ||
print( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since what you're trying to do is recognition and parsing, why not do that with a regex? Admitted, the expression itself is not easy to read, but that's a single sunk cost, and the remainder of the code is very easy and expressive (and you don't need a try
block). (Note that the re.compile()
call should be moved above the loop, perhaps to module-scope, so that it is executed only once.)
# Match a line which starts with zero or more spaces, at
# least one character between a pair of angle-brackets, a
# hyphen, a sequence of digits, zero or more spaces, a
# sequence of three digits (which we capture) between a
# pair of square-brackets, a space, two characters, a space,
# a sequence of digits followed by a dot followed by six
# digits (which we capture), a colon, a space, an octathorp
# followed by a sequence of digits, a space, "context", one
# or more spaces, a sequence of non-spaces (which we
# capture), "timer_latency", a space, a sequence of digits
# (which we capture), a space, and "ns". E.g.,
# <idle>-0 [009] d. 123.769498: #1002 context irq timer_latency 458 ns
# or
# <...>-625890 [009] .. 123.769499: #1002 context thread timer_latency 666 ns
data_line = re.compile(r" *<[^>]+>-[0-9]+ *\[([0-9]{3})\] .. ([0-9]+\.[0-9]{6}): #[0-9]+ context +([^ ]+) timer_latency ([0-9]+) ns")
[...]
matches = data_line.match(line)
if matches:
(cpu, uptimestamp, context, latency) = matches.groups()
cpu = int(cpu)
uptimestamp = float(uptimestamp)
# `context` is already a string, so no conversion is required.
latency = int(latency)
else:
print("Unable to determine CPU number; Skipping time "
f"series collection:\n\t{line}")
timeseries_dict = {}
break
timestamp = (
datetime.fromtimestamp(uptimestamp + uptime_offset)
.astimezone()
.isoformat()
)
# deliniator for the CPU number field. | ||
# FIXME: I can't be 100% sure that a "[" wont' appear in the string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"deliniator" ==> "delineator"
"wont'" ==> "won't"
Changes introduced with this PR
Adding CPU latency collection from kernel debug tracing.
By contributing to this repository, I agree to the contribution guidelines.