add latency timeseries output #8

dustinblack · 2024-11-15T18:02:56Z

Changes introduced with this PR

Adding CPU latency collection from kernel debug tracing.

By contributing to this repository, I agree to the contribution guidelines.

jaredoconnell · 2024-11-16T00:21:10Z

Looks good so far.

arcaflow_plugin_rtla/rtla_plugin.py

dbutenhof · 2024-11-18T15:10:01Z

arcaflow_plugin_rtla/rtla_plugin.py

+            # the relative times will be accurate. We'll accept this as good enough.
+            uptime_offset = time.time() - time.monotonic()
+
+            if all(False for _ in timeseries_lines):


This ought to "eat" the first element in your iterator so that the subsequent for loop won't see it. Is that what you intended, or is this a bug?

Why not just

timeseries_lines = timeseries_output.splitlines() if len(timeseries_lines) == 0: print(...) for line in timeseries_lines: ...

Or

timeseries_lines = timeseries_output.splitlines() for line in timeseries_lines: ... else: print("No results ...")

I like that second option, but don't think it will actually do what we want. 🙃

Oh ... you're probably right. The else triggers only if there's no break, which we don't want here. Sigh. Well, it seemed like a good idea at the time. 😦

Since we're simply testing for the presence of output, I think we should be looking at timeseries_output rather than the iterator that we created for it.

Sigh. Well, it seemed like a good idea at the time. 😦

It did! (I really wanted it to be the answer!) In fact, I almost offered a similar suggestion later...and then realized that it was just a convolution of the same, non-working, approach. 😢

In the win column, on the other hand, I confirmed for myself that while also offers an else clause (but, it didn't turn out to help in the case I was looking at, either.) (I think I was "a man with a hammer" there for awhile. 😆)

…utput

webbnh

Looks pretty good to me, I just have a few (pointed) questions...and a bunch of suggestions....

webbnh · 2024-11-18T16:47:47Z

arcaflow_plugin_rtla/rtla_plugin.py

 import re
 import sys
 import typing
+import os.path
+import time
 from threading import Event
+from datetime import datetime
 from arcaflow_plugin_sdk import plugin, predefined_schemas
 from rtla_schema import (
    TimerlatInputParams,
    latency_stats_schema,
+    latency_timeseries_schema,
    TimerlatOutput,
    ErrorOutput,
 )


Since these are "early days", it would be nice to establish some practice around imports. PEP 8 has some guidance, of particular note is the suggestion of having three sections of imports with blank lines between them, to which I would add the suggestion of keeping the lists in alphabetical order within the sections.

There's a nice tool, isort which will check and/or maintain this for you. (You can install it with pip.)

webbnh · 2024-11-18T17:09:19Z

arcaflow_plugin_rtla/rtla_plugin.py

+            wait_time = 0
+            timeout_seconds = 5
+            sleep_time = 0.5
+            trace_file_exists = False
+            while not trace_file_exists and wait_time < timeout_seconds:
+                time.sleep(sleep_time)
+                trace_file_exists = os.path.isfile(trace_path)
+                wait_time += sleep_time
+
+            if not trace_file_exists:


I think I would use an absolute time deadline instead of a counted number of intervals; and, I would avoid capturing the isfile() result in a local variable, since only the last call really matters. This reduces the loop setup and makes the loop body much simpler.

Suggested change

wait_time = 0

timeout_seconds = 5

sleep_time = 0.5

trace_file_exists = False

while not trace_file_exists and wait_time < timeout_seconds:

time.sleep(sleep_time)

trace_file_exists = os.path.isfile(trace_path)

wait_time += sleep_time

if not trace_file_exists:

timeout = datetime.now() + timedelta(seconds=5)

while not os.path.isfile(trace_path) and datetime.now() < timeout:

time.sleep(0.5)

if not os.path.isfile(trace_path):

arcaflow_plugin_rtla/rtla_plugin.py

webbnh · 2024-11-18T17:19:59Z

arcaflow_plugin_rtla/rtla_plugin.py

+        if params.enable_time_series and trace_file_exists:
+            # Interrupt the time series collection process and capture the output


I would base this test on whether the subprocess, timeseries_proc exists -- that's what we want to signal, not the file, and that's the value that we're going to use to send the signal.

Not a bad idea. timerlat_proc.send_signal(0) should raise an OSError if the pid is invalid or doesn't exist... (although this will somewhat depend on exactly how timerlat_proc.pid is managed under the covers, and I haven't experimented).

I was actually thinking more about the memory object, timeseries_proc, as opposed to the process which it represents (especially since we would need a valid object in order to invoke send_signal(0), and, if we have one, it's reasonable to assume that the process was successfully created).

In any case, sending "signal zero" seems like a good idea except for the fact that the next thing we intend to do is to send signal 2 -- we might as well skip the preamble once we've decided to try sending a signal.

webbnh · 2024-11-18T17:23:49Z

arcaflow_plugin_rtla/rtla_plugin.py

+
+        if params.enable_time_series and trace_file_exists:


Again, we don't care about whether the trace file exists/ed, here; what we care about is whether we got anything in timeseries_output, so we should test that, instead.

webbnh · 2024-11-18T17:33:25Z

arcaflow_plugin_rtla/rtla_plugin.py

+            # the relative times will be accurate. We'll accept this as good enough.
+            uptime_offset = time.time() - time.monotonic()
+
+            if all(False for _ in timeseries_lines):


Since we're simply testing for the presence of output, I think we should be looking at timeseries_output rather than the iterator that we created for it.

webbnh · 2024-11-18T17:35:09Z

arcaflow_plugin_rtla/rtla_plugin.py

+                line_list = line.split()
+                # Because the tracer format is dependent on the underlying OS and cannot
+                # be controlled by the container, check the tracer output format and
+                # break gracefully if we don't recognize it
+                try:
+                    cpu = int(line_list[1][1:-1])
+                    uptimestamp = float(line_list[3][:-1])
+                    timestamp = str(
+                        datetime.fromtimestamp(uptimestamp + uptime_offset)
+                        .astimezone()
+                        .isoformat()
+                    )
+                    context = str(line_list[6])
+                    latency = int(line_list[8])


Consider using a regex with capture groups, instead. I think it will ultimately be simpler (to read and maintain, at least), and the "unknown format" result will be much simpler and more obvious.

webbnh · 2024-11-18T17:42:54Z

arcaflow_plugin_rtla/rtla_plugin.py

+                    timestamp = str(
+                        datetime.fromtimestamp(uptimestamp + uptime_offset)
+                        .astimezone()
+                        .isoformat()
+                    )


What do you think astimezone() is doing for you, here? If you're trying to produce a timezone-aware datetime value, perhaps you should specify the tz parameter on the fromtimestamp() call?

The advantage of .astimezone() is that it defaults to local time without having to supply a parameter, I suppose. Specifying an explicit zone in fromtimestamp would be a bit better documentation.

Right but we have the whole provenience problem: what timezone was the uptimestamp in?...what timezone are we trying to display it in?...how is using timezone-naive times (even briefly) going to relieve the confusion? Wouldn't it be better to be explicit through the whole chain?

webbnh · 2024-11-18T17:45:15Z

arcaflow_plugin_rtla/rtla_plugin.py

+                if cpu_context not in timeseries_dict:
+                    timeseries_dict[cpu_context] = []


Consider using a defaultdict instead of rolling your own.

arcaflow_plugin_rtla/rtla_plugin.py

dbutenhof · 2024-11-18T20:13:33Z

arcaflow_plugin_rtla/rtla_plugin.py

+        if params.enable_time_series and trace_file_exists:
+            # Interrupt the time series collection process and capture the output


Not a bad idea. timerlat_proc.send_signal(0) should raise an OSError if the pid is invalid or doesn't exist... (although this will somewhat depend on exactly how timerlat_proc.pid is managed under the covers, and I haven't experimented).

arcaflow_plugin_rtla/rtla_plugin.py

dbutenhof · 2024-11-18T21:10:57Z

arcaflow_plugin_rtla/rtla_plugin.py

+            # The calculation of the offset from uptime to current time certainly means
+            # that our time series is not 100% accurately aligned to the nanosecond, but
+            # the relative times will be accurate. We'll accept this as good enough.
+            uptime_offset = time.time() - time.monotonic()


In terms of the generic/standard definitions of "clock monotonic" this is certainly true; whether the actual relationships of time.time and time.monotonic on RHEL 8/9/10 provides additional guarantees that make this operation reasonable is a different matter...

dbutenhof · 2024-11-18T21:15:09Z

arcaflow_plugin_rtla/rtla_plugin.py

+                # Because the tracer format is dependent on the underlying OS and cannot
+                # be controlled by the container, check the tracer output format and


Have you thought of using uname -r, which should report the "host" kernel version as it uses the kernel service?

The biggest concern might be superficially compatible formats that don't break your limited parsing constraints but result in bad data ... if the format really changes significantly.

dbutenhof · 2024-11-18T21:19:26Z

arcaflow_plugin_rtla/rtla_plugin.py

+                    timestamp = str(
+                        datetime.fromtimestamp(uptimestamp + uptime_offset)
+                        .astimezone()
+                        .isoformat()
+                    )


The advantage of .astimezone() is that it defaults to local time without having to supply a parameter, I suppose. Specifying an explicit zone in fromtimestamp would be a bit better documentation.

webbnh

I found a small bug with the CPU number handling.

arcaflow_plugin_rtla/rtla_plugin.py

webbnh · 2024-11-26T18:22:38Z

arcaflow_plugin_rtla/rtla_plugin.py

+                if all(False for _ in timeseries_output):
+                    print(
+                        "No results reading tracer output; "
+                        "Skipping time series collection\n"
+                    )


As Dave mentioned earlier, this check is going to eat the first line of the timeseries output. You should put a comment here noting that, if it's actually intentional.

I'm not following how you are seeing that the first line gets eaten.

>>> i = ["a", "b", "c", "d"] >>> if all(False for _ in i): ... print("bad juju") ... >>> for line in i: ... print(line) ... a b c d

>>> i = [] >>> if all(False for _ in i): ... print("bad juju") ... bad juju >>> for line in i: ... print(line) ... >>>

Try your experiment with i set to an iterator or a generator. With it set to a list, of course you see the first item when you check after the all() call, because the list state is unchanged. 😁

(Just to be blisteringly clear, reading a file line-by-line in a loop uses an iterator/generator.)

arcaflow_plugin_rtla/rtla_plugin.py

webbnh

Some typos and a suggestion. (I'm "approving" to remove my previous "changes requested" mark; however, there are still a number of open conversations, above.)

webbnh · 2024-12-02T20:25:42Z

arcaflow_plugin_rtla/rtla_plugin.py

                    # The first object in the line is a string, and it is possible for
-                    # it to have spaces in it. We'll try to cast the second object,
-                    # which should be the CPU number surrounded with [], to an int. If
-                    # that fails, we discard that object from the line_list.
-                    for i, _ in enumerate(line_list):
-                        if i == 1:
-                            try:
-                                cpu_val = re.compile(r"\[([0-9][0-9][0-9])\]")
-                                cpu = int(cpu_val.match(line_list[i]).group(1))
-                                break
-                            except (ValueError, AttributeError):
-                                line_list.pop(i)
-                            except IndexError as error:
-                                print(
-                                    "Unable to determine CPU number; Skipping time "
-                                    f"series collection: {error}\n{line}"
-                                )
-                                timeseries_dict = {}
-                                break
+                    # it to have spaces in it, so we split first on the expected "["
+                    # deliniator for the CPU number field.
+                    # FIXME: I can't be 100% sure that a "[" wont' appear in the string
+                    # that is the first object of the line.
+                    stripped_line = line.split("[")
+                    line_list = stripped_line[1].split()

                    # Because the tracer format is dependent on the underlying OS and
                    # cannot be controlled by the container, check the tracer output
                    # format and break gracefully if we don't recognize it
                    try:
-                        uptimestamp = float(line_list[3][:-1])
-                        timestamp = str(
+                        cpu = int(line_list[0][:-1])
+                        uptimestamp = float(line_list[2][:-1])
+                        timestamp = (
                            datetime.fromtimestamp(uptimestamp + uptime_offset)
                            .astimezone()
                            .isoformat()
                        )
-                        context = str(line_list[6])
-                        latency = int(line_list[8])
+                        context = str(line_list[5])
+                        latency = int(line_list[7])

                    except (IndexError, ValueError) as error:
                        print(


Since what you're trying to do is recognition and parsing, why not do that with a regex? Admitted, the expression itself is not easy to read, but that's a single sunk cost, and the remainder of the code is very easy and expressive (and you don't need a try block). (Note that the re.compile() call should be moved above the loop, perhaps to module-scope, so that it is executed only once.)

# Match a line which starts with zero or more spaces, at # least one character between a pair of angle-brackets, a # hyphen, a sequence of digits, zero or more spaces, a # sequence of three digits (which we capture) between a # pair of square-brackets, a space, two characters, a space, # a sequence of digits followed by a dot followed by six # digits (which we capture), a colon, a space, an octathorp # followed by a sequence of digits, a space, "context", one # or more spaces, a sequence of non-spaces (which we # capture), "timer_latency", a space, a sequence of digits # (which we capture), a space, and "ns". E.g., # <idle>-0 [009] d. 123.769498: #1002 context irq timer_latency 458 ns # or # <...>-625890 [009] .. 123.769499: #1002 context thread timer_latency 666 ns data_line = re.compile(r" *<[^>]+>-[0-9]+ *\[([0-9]{3})\] .. ([0-9]+\.[0-9]{6}): #[0-9]+ context +([^ ]+) timer_latency ([0-9]+) ns") [...] matches = data_line.match(line) if matches: (cpu, uptimestamp, context, latency) = matches.groups() cpu = int(cpu) uptimestamp = float(uptimestamp) # `context` is already a string, so no conversion is required. latency = int(latency) else: print("Unable to determine CPU number; Skipping time " f"series collection:\n\t{line}") timeseries_dict = {} break timestamp = ( datetime.fromtimestamp(uptimestamp + uptime_offset) .astimezone() .isoformat() )

webbnh · 2024-12-02T20:27:16Z

arcaflow_plugin_rtla/rtla_plugin.py

+                    # deliniator for the CPU number field.
+                    # FIXME: I can't be 100% sure that a "[" wont' appear in the string


"deliniator" ==> "delineator"
"wont'" ==> "won't"

arcaflow_plugin_rtla/rtla_plugin.py

dustinblack and others added 2 commits November 15, 2024 19:02

add latency timeseries output

9610e29

Automatic update of README.md by arcaflow-docsgen arcabot

ab15ce4

dustinblack and others added 2 commits November 18, 2024 13:44

check tracer output and convert timestamp to datetime

15a4ab9

Automatic update of README.md by arcaflow-docsgen arcabot

66294be

dustinblack commented Nov 18, 2024

View reviewed changes

arcaflow_plugin_rtla/rtla_plugin.py Show resolved Hide resolved

dustinblack added 3 commits November 18, 2024 14:47

tweak for schema consistency

3e5b003

check for empty time series; tweak user-facing language

d899d65

fix oops

e2ee97c

dustinblack marked this pull request as ready for review November 18, 2024 14:11

Automatic update of README.md by arcaflow-docsgen arcabot

eeafca0

dustinblack requested a review from a team November 18, 2024 14:12

dustinblack marked this pull request as draft November 18, 2024 14:56

dbutenhof reviewed Nov 18, 2024

View reviewed changes

add check for trace file and a delay to ensure we can read the pipe o…

fe8e81a

…utput

webbnh reviewed Nov 18, 2024

View reviewed changes

dbutenhof reviewed Nov 18, 2024

View reviewed changes

dustinblack and others added 6 commits November 26, 2024 16:38

time series complete, and resolution user-controllable

8fd1974

Automatic update of README.md by arcaflow-docsgen arcabot

8b5ac0d

use regex to match cpu number object precisely

43ecded

simplify time series file checking and command

70bd08f

change how we read the timerlat output

d1761b6

restructure to merge separate if blocks

2e9fe43

webbnh requested changes Nov 26, 2024

View reviewed changes

updates from PR feedback

77d19de

webbnh approved these changes Dec 2, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add latency timeseries output #8

add latency timeseries output #8

dustinblack commented Nov 15, 2024

jaredoconnell commented Nov 16, 2024

dbutenhof Nov 18, 2024

webbnh Nov 18, 2024

dbutenhof Nov 18, 2024

webbnh Nov 18, 2024

webbnh Nov 18, 2024 •

edited

Loading

webbnh left a comment

webbnh Nov 18, 2024

webbnh Nov 18, 2024

webbnh Nov 18, 2024

dbutenhof Nov 18, 2024

webbnh Nov 19, 2024

webbnh Nov 18, 2024

webbnh Nov 18, 2024

webbnh Nov 18, 2024

webbnh Nov 18, 2024

dbutenhof Nov 18, 2024

webbnh Nov 19, 2024

webbnh Nov 18, 2024

dbutenhof Nov 18, 2024

dbutenhof Nov 18, 2024

dbutenhof Nov 18, 2024

dbutenhof Nov 18, 2024

webbnh left a comment

webbnh Nov 26, 2024

dustinblack Nov 27, 2024

webbnh Dec 2, 2024 •

edited

Loading

webbnh left a comment •

edited

Loading

webbnh Dec 2, 2024

webbnh Dec 2, 2024

		if params.enable_time_series and trace_file_exists:
		# Interrupt the time series collection process and capture the output

		if cpu_context not in timeseries_dict:
		timeseries_dict[cpu_context] = []

		# Because the tracer format is dependent on the underlying OS and cannot
		# be controlled by the container, check the tracer output format and

		# deliniator for the CPU number field.
		# FIXME: I can't be 100% sure that a "[" wont' appear in the string

add latency timeseries output #8

Are you sure you want to change the base?

add latency timeseries output #8

Conversation

dustinblack commented Nov 15, 2024

Changes introduced with this PR

jaredoconnell commented Nov 16, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

webbnh Nov 18, 2024 • edited Loading

Choose a reason for hiding this comment

webbnh left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

webbnh left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

webbnh Dec 2, 2024 • edited Loading

Choose a reason for hiding this comment

webbnh left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

webbnh Nov 18, 2024 •

edited

Loading

webbnh Dec 2, 2024 •

edited

Loading

webbnh left a comment •

edited

Loading