Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add latency timeseries output #8

Draft
wants to merge 16 commits into
base: main
Choose a base branch
from
Draft

add latency timeseries output #8

wants to merge 16 commits into from

Conversation

dustinblack
Copy link
Member

Changes introduced with this PR

Adding CPU latency collection from kernel debug tracing.


By contributing to this repository, I agree to the contribution guidelines.

@jaredoconnell
Copy link

Looks good so far.

@dustinblack dustinblack marked this pull request as ready for review November 18, 2024 14:11
@dustinblack dustinblack requested a review from a team November 18, 2024 14:12
@dustinblack dustinblack marked this pull request as draft November 18, 2024 14:56
# the relative times will be accurate. We'll accept this as good enough.
uptime_offset = time.time() - time.monotonic()

if all(False for _ in timeseries_lines):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ought to "eat" the first element in your iterator so that the subsequent for loop won't see it. Is that what you intended, or is this a bug?

Why not just

timeseries_lines = timeseries_output.splitlines()
if len(timeseries_lines) == 0:
    print(...)
for line in timeseries_lines:
    ...

Or

timeseries_lines = timeseries_output.splitlines()
for line in timeseries_lines:
    ...
else:
    print("No results ...")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that second option, but don't think it will actually do what we want. 🙃

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh ... you're probably right. The else triggers only if there's no break, which we don't want here. Sigh. Well, it seemed like a good idea at the time. 😦

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're simply testing for the presence of output, I think we should be looking at timeseries_output rather than the iterator that we created for it.

Copy link
Contributor

@webbnh webbnh Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sigh. Well, it seemed like a good idea at the time. 😦

It did! (I really wanted it to be the answer!) In fact, I almost offered a similar suggestion later...and then realized that it was just a convolution of the same, non-working, approach. 😢

In the win column, on the other hand, I confirmed for myself that while also offers an else clause (but, it didn't turn out to help in the case I was looking at, either.) (I think I was "a man with a hammer" there for awhile. 😆)

Copy link
Contributor

@webbnh webbnh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good to me, I just have a few (pointed) questions...and a bunch of suggestions....

Comment on lines 4 to 18
import re
import sys
import typing
import os.path
import time
from threading import Event
from datetime import datetime
from arcaflow_plugin_sdk import plugin, predefined_schemas
from rtla_schema import (
TimerlatInputParams,
latency_stats_schema,
latency_timeseries_schema,
TimerlatOutput,
ErrorOutput,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these are "early days", it would be nice to establish some practice around imports. PEP 8 has some guidance, of particular note is the suggestion of having three sections of imports with blank lines between them, to which I would add the suggestion of keeping the lists in alphabetical order within the sections.

There's a nice tool, isort which will check and/or maintain this for you. (You can install it with pip.)

Comment on lines 80 to 89
wait_time = 0
timeout_seconds = 5
sleep_time = 0.5
trace_file_exists = False
while not trace_file_exists and wait_time < timeout_seconds:
time.sleep(sleep_time)
trace_file_exists = os.path.isfile(trace_path)
wait_time += sleep_time

if not trace_file_exists:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would use an absolute time deadline instead of a counted number of intervals; and, I would avoid capturing the isfile() result in a local variable, since only the last call really matters. This reduces the loop setup and makes the loop body much simpler.

Suggested change
wait_time = 0
timeout_seconds = 5
sleep_time = 0.5
trace_file_exists = False
while not trace_file_exists and wait_time < timeout_seconds:
time.sleep(sleep_time)
trace_file_exists = os.path.isfile(trace_path)
wait_time += sleep_time
if not trace_file_exists:
timeout = datetime.now() + timedelta(seconds=5)
while not os.path.isfile(trace_path) and datetime.now() < timeout:
time.sleep(0.5)
if not os.path.isfile(trace_path):

arcaflow_plugin_rtla/rtla_plugin.py Outdated Show resolved Hide resolved
Comment on lines 124 to 125
if params.enable_time_series and trace_file_exists:
# Interrupt the time series collection process and capture the output
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would base this test on whether the subprocess, timeseries_proc exists -- that's what we want to signal, not the file, and that's the value that we're going to use to send the signal.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a bad idea. timerlat_proc.send_signal(0) should raise an OSError if the pid is invalid or doesn't exist... (although this will somewhat depend on exactly how timerlat_proc.pid is managed under the covers, and I haven't experimented).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was actually thinking more about the memory object, timeseries_proc, as opposed to the process which it represents (especially since we would need a valid object in order to invoke send_signal(0), and, if we have one, it's reasonable to assume that the process was successfully created).

In any case, sending "signal zero" seems like a good idea except for the fact that the next thing we intend to do is to send signal 2 -- we might as well skip the preamble once we've decided to try sending a signal.

Comment on lines 225 to 226

if params.enable_time_series and trace_file_exists:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, we don't care about whether the trace file exists/ed, here; what we care about is whether we got anything in timeseries_output, so we should test that, instead.

# the relative times will be accurate. We'll accept this as good enough.
uptime_offset = time.time() - time.monotonic()

if all(False for _ in timeseries_lines):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're simply testing for the presence of output, I think we should be looking at timeseries_output rather than the iterator that we created for it.

Comment on lines 241 to 254
line_list = line.split()
# Because the tracer format is dependent on the underlying OS and cannot
# be controlled by the container, check the tracer output format and
# break gracefully if we don't recognize it
try:
cpu = int(line_list[1][1:-1])
uptimestamp = float(line_list[3][:-1])
timestamp = str(
datetime.fromtimestamp(uptimestamp + uptime_offset)
.astimezone()
.isoformat()
)
context = str(line_list[6])
latency = int(line_list[8])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using a regex with capture groups, instead. I think it will ultimately be simpler (to read and maintain, at least), and the "unknown format" result will be much simpler and more obvious.

Comment on lines 248 to 252
timestamp = str(
datetime.fromtimestamp(uptimestamp + uptime_offset)
.astimezone()
.isoformat()
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think astimezone() is doing for you, here? If you're trying to produce a timezone-aware datetime value, perhaps you should specify the tz parameter on the fromtimestamp() call?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The advantage of .astimezone() is that it defaults to local time without having to supply a parameter, I suppose. Specifying an explicit zone in fromtimestamp would be a bit better documentation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right but we have the whole provenience problem: what timezone was the uptimestamp in?...what timezone are we trying to display it in?...how is using timezone-naive times (even briefly) going to relieve the confusion? Wouldn't it be better to be explicit through the whole chain?

Comment on lines 267 to 268
if cpu_context not in timeseries_dict:
timeseries_dict[cpu_context] = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using a defaultdict instead of rolling your own.

arcaflow_plugin_rtla/rtla_plugin.py Outdated Show resolved Hide resolved
Comment on lines 124 to 125
if params.enable_time_series and trace_file_exists:
# Interrupt the time series collection process and capture the output

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a bad idea. timerlat_proc.send_signal(0) should raise an OSError if the pid is invalid or doesn't exist... (although this will somewhat depend on exactly how timerlat_proc.pid is managed under the covers, and I haven't experimented).

arcaflow_plugin_rtla/rtla_plugin.py Show resolved Hide resolved
Comment on lines 229 to 232
# The calculation of the offset from uptime to current time certainly means
# that our time series is not 100% accurately aligned to the nanosecond, but
# the relative times will be accurate. We'll accept this as good enough.
uptime_offset = time.time() - time.monotonic()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In terms of the generic/standard definitions of "clock monotonic" this is certainly true; whether the actual relationships of time.time and time.monotonic on RHEL 8/9/10 provides additional guarantees that make this operation reasonable is a different matter...

Comment on lines 242 to 243
# Because the tracer format is dependent on the underlying OS and cannot
# be controlled by the container, check the tracer output format and

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you thought of using uname -r, which should report the "host" kernel version as it uses the kernel service?

The biggest concern might be superficially compatible formats that don't break your limited parsing constraints but result in bad data ... if the format really changes significantly.

Comment on lines 248 to 252
timestamp = str(
datetime.fromtimestamp(uptimestamp + uptime_offset)
.astimezone()
.isoformat()
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The advantage of .astimezone() is that it defaults to local time without having to supply a parameter, I suppose. Specifying an explicit zone in fromtimestamp would be a bit better documentation.

Copy link
Contributor

@webbnh webbnh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a small bug with the CPU number handling.

arcaflow_plugin_rtla/rtla_plugin.py Outdated Show resolved Hide resolved
arcaflow_plugin_rtla/rtla_plugin.py Outdated Show resolved Hide resolved
Comment on lines +147 to +151
if all(False for _ in timeseries_output):
print(
"No results reading tracer output; "
"Skipping time series collection\n"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As Dave mentioned earlier, this check is going to eat the first line of the timeseries output. You should put a comment here noting that, if it's actually intentional.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not following how you are seeing that the first line gets eaten.

>>> i = ["a", "b", "c", "d"]
>>> if all(False for _ in i):
...   print("bad juju")
... 
>>> for line in i:
...   print(line)
... 
a
b
c
d
>>> i = []
>>> if all(False for _ in i):
...   print("bad juju")
... 
bad juju
>>> for line in i:
...   print(line)
... 
>>> 

Copy link
Contributor

@webbnh webbnh Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try your experiment with i set to an iterator or a generator. With it set to a list, of course you see the first item when you check after the all() call, because the list state is unchanged. 😁

(Just to be blisteringly clear, reading a file line-by-line in a loop uses an iterator/generator.)

arcaflow_plugin_rtla/rtla_plugin.py Outdated Show resolved Hide resolved
arcaflow_plugin_rtla/rtla_plugin.py Outdated Show resolved Hide resolved
arcaflow_plugin_rtla/rtla_plugin.py Show resolved Hide resolved
Copy link
Contributor

@webbnh webbnh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some typos and a suggestion. (I'm "approving" to remove my previous "changes requested" mark; however, there are still a number of open conversations, above.)

Comment on lines 155 to 178
# The first object in the line is a string, and it is possible for
# it to have spaces in it. We'll try to cast the second object,
# which should be the CPU number surrounded with [], to an int. If
# that fails, we discard that object from the line_list.
for i, _ in enumerate(line_list):
if i == 1:
try:
cpu_val = re.compile(r"\[([0-9][0-9][0-9])\]")
cpu = int(cpu_val.match(line_list[i]).group(1))
break
except (ValueError, AttributeError):
line_list.pop(i)
except IndexError as error:
print(
"Unable to determine CPU number; Skipping time "
f"series collection: {error}\n{line}"
)
timeseries_dict = {}
break
# it to have spaces in it, so we split first on the expected "["
# deliniator for the CPU number field.
# FIXME: I can't be 100% sure that a "[" wont' appear in the string
# that is the first object of the line.
stripped_line = line.split("[")
line_list = stripped_line[1].split()

# Because the tracer format is dependent on the underlying OS and
# cannot be controlled by the container, check the tracer output
# format and break gracefully if we don't recognize it
try:
uptimestamp = float(line_list[3][:-1])
timestamp = str(
cpu = int(line_list[0][:-1])
uptimestamp = float(line_list[2][:-1])
timestamp = (
datetime.fromtimestamp(uptimestamp + uptime_offset)
.astimezone()
.isoformat()
)
context = str(line_list[6])
latency = int(line_list[8])
context = str(line_list[5])
latency = int(line_list[7])

except (IndexError, ValueError) as error:
print(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since what you're trying to do is recognition and parsing, why not do that with a regex? Admitted, the expression itself is not easy to read, but that's a single sunk cost, and the remainder of the code is very easy and expressive (and you don't need a try block). (Note that the re.compile() call should be moved above the loop, perhaps to module-scope, so that it is executed only once.)

                     # Match a line which starts with zero or more spaces, at
                     # least one character between a pair of angle-brackets, a
                     # hyphen, a sequence of digits, zero or more spaces, a
                     # sequence of three digits (which we capture) between a
                     # pair of square-brackets, a space, two characters, a space,
                     # a sequence of digits followed by a dot followed by six
                     # digits (which we capture), a colon, a space, an octathorp
                     # followed by a sequence of digits, a space, "context", one
                     # or more spaces, a sequence of non-spaces (which we
                     # capture), "timer_latency", a space, a sequence of digits
                     # (which we capture), a space, and "ns".  E.g.,
                     # <idle>-0      [009] d. 123.769498: #1002 context    irq timer_latency  458 ns
                     # or
                     #  <...>-625890 [009] .. 123.769499: #1002 context thread timer_latency 666 ns
                     data_line = re.compile(r" *<[^>]+>-[0-9]+ *\[([0-9]{3})\] .. ([0-9]+\.[0-9]{6}): #[0-9]+ context +([^ ]+) timer_latency ([0-9]+) ns")

                      [...]

                     matches = data_line.match(line)
                     if matches:
                         (cpu, uptimestamp, context, latency) = matches.groups()
                         cpu = int(cpu)
                         uptimestamp = float(uptimestamp)
                         # `context` is already a string, so no conversion is required.
                         latency = int(latency)
                    else:
                        print("Unable to determine CPU number; Skipping time "
                            f"series collection:\n\t{line}")
                        timeseries_dict = {}
                        break

                    timestamp = (
                            datetime.fromtimestamp(uptimestamp + uptime_offset)
                            .astimezone()
                            .isoformat()
                        )

Comment on lines +157 to +158
# deliniator for the CPU number field.
# FIXME: I can't be 100% sure that a "[" wont' appear in the string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"deliniator" ==> "delineator"
"wont'" ==> "won't"

arcaflow_plugin_rtla/rtla_plugin.py Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants