
E1014: Unknown error - weak object has gone away #817

Open
mattmiller87 opened this issue Oct 10, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@mattmiller87
Contributor

Environment

  • Python version: 3.11
  • Nautobot version: 2.3.4
  • nautobot-golden-config version: 2.1.2

Expected Behavior

Observed Behavior

E1014: Unknown error - weak object has gone away

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/nornir/core/task.py", line 99, in start
    r = self.task(self, **self.params)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/nornir_jinja2/plugins/tasks/template_file.py", line 42, in template_file
    t = env.get_template(template)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/jinja2/environment.py", line 1013, in get_template
    return self._load_template(name, globals)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/jinja2/environment.py", line 961, in _load_template
    template = self.cache.get(cache_key)
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/jinja2/utils.py", line 466, in get
    return self[key]
           ~~~~^^^^^
  File "/usr/local/lib/python3.11/site-packages/jinja2/utils.py", line 504, in __getitem__
    rv = self._mapping[key]
         ~~~~~~~~~~~~~^^^^^
TypeError: weak object has gone away

Steps to Reproduce

Apologies for the lack of detail; I will update once I have more. I wanted to at least capture the error and get the issue submitted.

@mattmiller87 mattmiller87 added the bug Something isn't working label Oct 10, 2024
@jdrew82
Contributor

jdrew82 commented Nov 14, 2024

We've determined this to be caused by an object being garbage collected before its use. It's resolved by increasing resources on the worker or reducing the number of devices being processed.
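
For context, here is a minimal sketch (not from the issue, just illustrating the mechanism) of how this TypeError arises: Jinja2 keys its template cache on (weakref.ref(environment.loader), template_name), and CPython raises exactly this message when a dead weakref is hashed for the first time after its referent has been collected. The class and template name below are placeholders.

import weakref

class Loader:
    """Stand-in for a Jinja2 template loader."""

loader = Loader()

# Jinja2's Environment._load_template builds its cache key roughly like this:
cache_key = (weakref.ref(loader), "placeholder_template.j2")

cache = {}
del loader  # last strong reference gone, so the weakref is now dead

try:
    cache.get(cache_key)  # hashing the dead weakref for the first time
except TypeError as exc:
    print(exc)  # prints: weak object has gone away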

@itdependsnetworks
Contributor

@lampwins @glennmatthews any thoughts on this?

@gioccher

I'm having the same issue with:

  • Python version: 3.9
  • Nautobot version: 2.3.16
  • nautobot-golden-config version: 2.2.1

I'm running celery-worker in k8s on a host with plenty of memory and the default ~6 GB limit for the celery worker containers, and it doesn't seem to request more memory from the cluster.
Here's a memory graph of a freshly started celery-worker container that shows running the generate intended configs job 3 times: memory goes up on the first run as expected, but stays well below the 6.5 GB limit.
[attached: memory usage graph of the celery-worker container over the three runs]

Every few runs, depending on luck, one or a handful of devices fail to generate the running config with "TypeError: weak object has gone away". Most of the time the traceback is exactly like the one mattmiller87 posted; sometimes it includes references to our Jinja template files:

Traceback (most recent call last):
  File "/opt/nautobot/.local/lib/python3.9/site-packages/nornir/core/task.py", line 99, in start
    r = self.task(self, **self.params)
  File "/opt/nautobot/.local/lib/python3.9/site-packages/nornir_jinja2/plugins/tasks/template_file.py", line 43, in template_file
    text = t.render(host=task.host, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/jinja2/environment.py", line 1304, in render
    self.environment.handle_exception()
  File "/usr/local/lib/python3.9/site-packages/jinja2/environment.py", line 939, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "/opt/nautobot/git/golden-config-templates/arista_eos.j2", line 34, in top-level template code
    {% include './arista_eos/interfaces.j2' %}
  File "/opt/nautobot/git/golden-config-templates/arista_eos/interfaces.j2", line 21, in top-level template code
    {%       include './arista_eos/interfaces/ethernet.j2' %}
  File "/opt/nautobot/git/golden-config-templates/arista_eos/interfaces/ethernet.j2", line 10, in top-level template code
    {% include './arista_eos/interfaces/_switchport.j2' %}
  File "/usr/local/lib/python3.9/site-packages/jinja2/utils.py", line 466, in get
    return self[key]
  File "/usr/local/lib/python3.9/site-packages/jinja2/utils.py", line 504, in __getitem__
    rv = self._mapping[key]
TypeError: weak object has gone away

So when garbage collection kicks in (and it can trigger when enough objects in a generation are deemed ready for collection, not solely when the program is running out of memory), part of the job fails. Giving more resources to Celery or reducing the number of devices doesn't seem to be a robust workaround.
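
(As a quick reference for the point about GC triggers, the interpreter exposes the per-generation allocation thresholds and counters that drive collection; the values in the comments below are CPython defaults, not numbers from this deployment.)

import gc

print(gc.get_threshold())  # default (700, 10, 10): gen0 collected after ~700 net allocations
print(gc.get_count())      # current allocation counters for generations 0, 1, 2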

@gioccher

gioccher commented Jan 14, 2025

I tried disabling garbage collection during this job by calling gc.disable() in IntendedJob.run() before config_intended(self):

        gc.collect(generation=2)
        gc.disable()
        config_intended(self)
        gc.enable()

but that didn't solve the issue.

Perhaps nornir runs in its own process and so uses its own GC, or this is not strictly related to GC, or I'm disabling it in the wrong place?
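
One more avenue that might be worth testing (my suggestion, not something tried in this thread): Jinja2 disables its template cache entirely when the environment is created with cache_size=0, so an environment built that way never performs the failing cache lookup, at the cost of re-parsing templates on every render. The loader path below is just an example; how to wire such an environment into the rendering task is not covered here.

from jinja2 import Environment, FileSystemLoader, StrictUndefined

# With cache_size=0, Environment.cache is None, so _load_template never hashes a
# (weakref.ref(loader), name) cache key like the one blowing up in the tracebacks above.
jinja_env = Environment(
    loader=FileSystemLoader("/opt/nautobot/git/golden-config-templates"),  # example path
    undefined=StrictUndefined,
    cache_size=0,
)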

@gioccher

I'm running the patch below to https://github.com/pallets/jinja/blob/main/src/jinja2/utils.py as a workaround now. It doesn't seem to have serious side effects (other than making the Jinja cache less effective, which is better than crashing).

--- jinja_utils_before.py	2025-01-14 13:06:18
+++ jinja_utils_after.py	2025-01-14 13:44:31
@@ -512,7 +512,10 @@
         Raise a `KeyError` if it does not exist.
         """
         with self._wlock:
-            rv = self._mapping[key]
+            try:
+                rv = self._mapping[key]
+            except TypeError:
+                raise KeyError
 
             if self._queue[-1] != key:
                 try:
@@ -532,13 +535,16 @@
         has the highest priority then.
         """
         with self._wlock:
-            if key in self._mapping:
-                self._remove(key)
-            elif len(self._mapping) == self.capacity:
-                del self._mapping[self._popleft()]
+            try:
+                if key in self._mapping:
+                    self._remove(key)
+                elif len(self._mapping) == self.capacity:
+                    del self._mapping[self._popleft()]
 
-            self._append(key)
-            self._mapping[key] = value
+                self._append(key)
+                self._mapping[key] = value
+            except TypeError:
+                pass
 
     def __delitem__(self, key: t.Any) -> None:
         """Remove an item from the cache dict.
@@ -549,7 +555,7 @@
 
             try:
                 self._remove(key)
-            except ValueError:
+            except (ValueError, TypeError):
                 pass
 
     def items(self) -> t.Iterable[t.Tuple[t.Any, t.Any]]:

I still can't tell whether this is a bug inside Jinja or in the way Jinja is used by the nornir task.
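
If patching the installed jinja2 package is awkward to deploy, roughly the same workaround could be applied as a monkeypatch from a module imported at worker startup. A sketch mirroring the diff above (where to hook it in is deployment-specific and not specified here):

import jinja2.utils

_orig_getitem = jinja2.utils.LRUCache.__getitem__
_orig_setitem = jinja2.utils.LRUCache.__setitem__

def _getitem(self, key):
    try:
        return _orig_getitem(self, key)
    except TypeError:
        # A key containing a dead weakref can no longer be hashed; treat it as a cache miss.
        raise KeyError(key) from None

def _setitem(self, key, value):
    try:
        _orig_setitem(self, key, value)
    except TypeError:
        # Skip caching entries whose key can no longer be hashed.
        pass

jinja2.utils.LRUCache.__getitem__ = _getitem
jinja2.utils.LRUCache.__setitem__ = _setitem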

@itdependsnetworks
Contributor

@gioccher I don't expect we will have a quick answer, but @cmsirbu can I get this on your radar to look into?
