Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding containerd compatability to oom_logger - COMPINFRA-3947 #3904

Merged
merged 30 commits into from
Aug 29, 2024

Conversation

EmanElsaban
Copy link
Contributor

@EmanElsaban EmanElsaban commented Jun 27, 2024

Adds containerd support to oom_logger in paasta_tools.

[Testing]

Jul  4 14:03:23 10-40-11-245-uswest1adevc kernel: : [ 7195.442928] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-e216d2f1e6c625d363c71edb6b3cbab5a9e1b447641b61028d0b94b077adf27c.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod08768c36_163c_40e5_8e49_09cf42ff5046.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod08768c36_163c_40e5_8e49_09cf42ff5046.slice/cri-containerd-e216d2f1e6c625d363c71edb6b3cbab5a9e1b447641b61028d0b94b077adf27c.scope,task=python3,pid=485850,uid=33

which is called oom_regex_kubernetes_containerd_systemd_cgroup and tested it matched:
https://fluffy.yelpcorp.com/i/zTVqjL76X4tQDRG4wbpMdcX83BcxP7R8.html

NOTE

we need to make sure we merge the paasta pr first and have it rolled out everywhere before merging the puppet pr otherwise syslog wont send the logs as we have seen in testing where we had to first install the paasta debian package before running the d-branch to have syslog send the logs to scribereader

@EmanElsaban EmanElsaban force-pushed the u/emanelsabban/COMPINFRA-3947 branch from cedce50 to a978a3b Compare June 27, 2024 21:32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: we'll want to remove any references of this from puppet if we haven't already

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll also need to remove this from setup.py and debian/paasta-tools.links I believe.

That said, I don't think you'll need to do anything in puppet -- when I looked through for references to this script I found references to a script that had a similar name, but this specific paasta version was never actually referenced/used.

parser.add_argument(
"--containerd",
action="store_true",
help="Use containerd to inspect containers",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
help="Use containerd to inspect containers",
help="Use containerd to inspect containers, otherwise use docker",

@@ -136,11 +151,15 @@ def capture_oom_events_from_stdin():
break


def get_container_env_as_dict(docker_inspect):
def get_container_env_as_dict(is_cri_containerd: bool, container_inspect):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
def get_container_env_as_dict(is_cri_containerd: bool, container_inspect):
def get_container_env_as_dict(is_cri_containerd: bool, container_inspect: dict) -> Dict[str, str]:

comment: typing container_inspect is probably a PITA if the containerd/docker libraries don't have a good TypedDict/dataclass/etc that we can use - maybe once we're only using containerd we can have a non-total TypedDict with the bits that we do use tho :p

Comment on lines 236 to 239
if is_cri_containerd:
namespace = "k8s.io"
else:
namespace = "moby"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, could we use the containerd library for both cases? if so, we could remove the conditional in main() :)

otherwise, we can just assume that if this method is being called we want the k8s namespace

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, I was thinking of also just assuming if we call this function then we are using containerd but I wasn't sure if we would still need the moby ns

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll also need to remove this from setup.py and debian/paasta-tools.links I believe.

That said, I don't think you'll need to do anything in puppet -- when I looked through for references to this script I found references to a script that had a similar name, but this specific paasta version was never actually referenced/used.

Comment on lines 94 to 96
def capture_oom_events_from_stdin():
process_name_regex = re.compile(
r"^\d+\s[a-zA-Z0-9\-]+\s.*\]\s(.+)\sinvoked\soom-killer:"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll need to do some experimentation w/ an actual oomkilled pod on nodes w/ containerd to verify what format the logs look like, and if we need to update any of these regexes.

Copy link
Contributor

@jfongatyelp jfongatyelp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python dependencies are pain 🫨

Comment on lines 169 to 175
if is_cri_containerd:
config = container_inspect.get("process")
env = config.get("env", [])
else:
config = container_inspect.get("Config")
env = config.get("Env", [])
if config is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically this if config is not None should never be hit if we've got that far; if config was none we'd run into an exception when reaching into it while trying to get the env in each of the cases above.
Maybe this should be more like:

Suggested change
if is_cri_containerd:
config = container_inspect.get("process")
env = config.get("env", [])
else:
config = container_inspect.get("Config")
env = config.get("Env", [])
if config is not None:
if is_cri_containerd:
config = container_inspect.get("process")
env_key = "env"
else:
config = container_inspect.get("Config")
env_key = "Env"
if config is not None:
env = config.get(env_key, [])
...

Copy link
Member

@nemacysts nemacysts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should add some tests for the new functionality, but otherwise looks good :)

@@ -115,6 +150,8 @@ def capture_oom_events_from_stdin():
oom_regex_kubernetes,
oom_regex_kubernetes_structured,
oom_regex_kubernetes_systemd_cgroup,
oom_regex_kubernetes_containerd_systemd_cgroup,
oom_regex_kubernetes_containerd_systemd_cgroup_structured,
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at some point we should probably audit all of these + remove unused ones + comment the remaining ones

from chatting in slack, it seems like we're not quite sure yet why sometimes we're only seeing the structured oomkill loglines and other times the unstructured one - but it does seem a bit weird that we can't use a single regex for containerd pods (or that in the past we needed 3 regexes for dockershim pods!)

@@ -136,11 +173,18 @@ def capture_oom_events_from_stdin():
break


def get_container_env_as_dict(docker_inspect):
def get_container_env_as_dict(
is_cri_containerd: bool, container_inspect: dict
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be nice to have some TypedDicts for the container_inspect dict - but that'll be a little annoying since you'd need a Union of two TypedDicts and you'd need to add some casts in the the code below :p

Comment on lines 348 to 350
@patch("paasta_tools.oom_logger.argparse", autospec=True)
def test_parse_args(mock_argparse):
assert parse_args() == mock_argparse.ArgumentParser.return_value.parse_args()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it's fine to leave this untested since we're not really checking for much here :)

tests/test_oom_logger.py Show resolved Hide resolved
@@ -12,6 +12,7 @@ passenv = SSH_AUTH_SOCK PAASTA_ENV DOCKER_HOST CI
setenv =
TZ = UTC
deps =
--only-binary=grpcio
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wonder if there's a way for us to set this in the requirements file so that we only need to specify this once - i was going to suggest adding a comment as to why this is here would be nice (i.e., we have wheels readily available internally but the same is not true in public pypi), but we'd need to copy-paste that quite a bit :p

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh nice, it is possible: https://pip.pypa.io/en/stable/reference/requirements-file-format/#global-options

if it doesn't break requirements-tools (i.e., check-requirements) to do this, that'd make this diff smaller (we wouldn't have to add this --only-binary option to all the toxenvs nor to the dockerfiles!) and it would let us add a comment explaining why we're singling out grpcio

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(that said, also fine as a followup)

Comment on lines +68 to +69
setenv =
PIP_INDEX_URL = http://169.254.255.254:20641/simple/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably refactor our env handling at some point so that this isn't required here, but that's probably not something we wanna tackle in this PR

Copy link
Member

@nemacysts nemacysts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nits, but looks good otherwise!

"PAASTA_SERVICE=fake_service",
"PAASTA_INSTANCE=fake_instance",
"PAASTA_RESOURCE_MEM=512",
"MESOS_CONTAINER_NAME=mesos-a04c14a6-83ea-4047-a802-92b850b1624e",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a more representative test would probably be to remove this env var

Suggested change
"MESOS_CONTAINER_NAME=mesos-a04c14a6-83ea-4047-a802-92b850b1624e",

since we no longer actually set this :)

service="fake_service",
instance="fake_instance",
process_name="python3",
mesos_container_id="mesos-a04c14a6-83ea-4047-a802-92b850b1624e",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, but

Suggested change
mesos_container_id="mesos-a04c14a6-83ea-4047-a802-92b850b1624e",
mesos_container_id="mesos-null",

instead of deleting this line

def log_line_containerd():
return LogLine(
timestamp=1720128512,
hostname="dev208-uswest1adevc",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to make sure that there's nothing in this test that depends on where we run it - can we re-use the hostname from previous tests? i.e.,,

Suggested change
hostname="dev208-uswest1adevc",
hostname="dev37-devc",

@EmanElsaban EmanElsaban merged commit f2cc8fc into master Aug 29, 2024
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants