
E1014: Unknown error - weak object has gone away #817

Open
mattmiller87 opened this issue Oct 10, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@mattmiller87
Contributor

Environment

  • Python version: 3.11
  • Nautobot version: 2.3.4
  • nautobot-golden-config version: 2.1.2

Expected Behavior

Observed Behavior

E1014: Unknown error - weak object has gone away

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/nornir/core/task.py", line 99, in start
    r = self.task(self, **self.params)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/nornir_jinja2/plugins/tasks/template_file.py", line 42, in template_file
    t = env.get_template(template)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/jinja2/environment.py", line 1013, in get_template
    return self._load_template(name, globals)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/jinja2/environment.py", line 961, in _load_template
    template = self.cache.get(cache_key)
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/jinja2/utils.py", line 466, in get
    return self[key]
           ~~~~^^^^^
  File "/usr/local/lib/python3.11/site-packages/jinja2/utils.py", line 504, in __getitem__
    rv = self._mapping[key]
         ~~~~~~~~~~~~~^^^^^
TypeError: weak object has gone away

Steps to Reproduce

Apologies for the lack of detail; I will update once I have more. I wanted to at least capture the error and get the issue submitted.

@mattmiller87 mattmiller87 added the bug Something isn't working label Oct 10, 2024
@jdrew82
Contributor

jdrew82 commented Nov 14, 2024

We've determined this to be caused by an object being garbage collected before its use. It's resolved by increasing resources on the worker or reducing the number of devices being processed.
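
For context, here is a minimal sketch (not from the issue, just illustrating the mechanism) of how this TypeError arises: Jinja2 keys its template cache on (weakref.ref(environment.loader), template_name), and CPython raises exactly this message when a dead weakref is hashed for the first time after its referent has been collected. The class and template name below are placeholders.

import weakref

class Loader:
    """Stand-in for a Jinja2 template loader."""

loader = Loader()

# Jinja2's Environment._load_template builds its cache key roughly like this:
cache_key = (weakref.ref(loader), "placeholder_template.j2")

cache = {}
del loader  # last strong reference gone, so the weakref is now dead

try:
    cache.get(cache_key)  # hashing the dead weakref for the first time
except TypeError as exc:
    print(exc)  # prints: weak object has gone away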

@itdependsnetworks
Contributor

@lampwins @glennmatthews any thoughts on this?

@gioccher

I'm having the same issue with:

  • Python version: 3.9
  • Nautobot version: 2.3.16
  • nautobot-golden-config version: 2.2.1

I'm running celery-worker in k8s on a host with plenty of memory and the default ~6 GB limit for the celery worker containers, and it doesn't seem to request more memory from the cluster.
Here's a memory graph of a freshly started celery-worker container that shows running the generate intended configs job 3 times: memory goes up on the first run as expected, but stays well below the 6.5 GB limit.
[attached: memory usage graph of the celery-worker container over the three runs]

Every few runs, depending on luck, one or a handful of devices fail to generate the running config with "TypeError: weak object has gone away". Most of the time the traceback is exactly like the one mattmiller87 posted; sometimes it includes references to our Jinja template files:

Traceback (most recent call last):
  File "/opt/nautobot/.local/lib/python3.9/site-packages/nornir/core/task.py", line 99, in start
    r = self.task(self, **self.params)
  File "/opt/nautobot/.local/lib/python3.9/site-packages/nornir_jinja2/plugins/tasks/template_file.py", line 43, in template_file
    text = t.render(host=task.host, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/jinja2/environment.py", line 1304, in render
    self.environment.handle_exception()
  File "/usr/local/lib/python3.9/site-packages/jinja2/environment.py", line 939, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "/opt/nautobot/git/golden-config-templates/arista_eos.j2", line 34, in top-level template code
    {% include './arista_eos/interfaces.j2' %}
  File "/opt/nautobot/git/golden-config-templates/arista_eos/interfaces.j2", line 21, in top-level template code
    {%       include './arista_eos/interfaces/ethernet.j2' %}
  File "/opt/nautobot/git/golden-config-templates/arista_eos/interfaces/ethernet.j2", line 10, in top-level template code
    {% include './arista_eos/interfaces/_switchport.j2' %}
  File "/usr/local/lib/python3.9/site-packages/jinja2/utils.py", line 466, in get
    return self[key]
  File "/usr/local/lib/python3.9/site-packages/jinja2/utils.py", line 504, in __getitem__
    rv = self._mapping[key]
TypeError: weak object has gone away

So when garbage collection kicks in (and it can trigger when enough objects in a generation are deemed ready for collection, not solely when the program is running out of memory), part of the job fails. Giving more resources to Celery or reducing the number of devices doesn't seem to be a robust workaround.
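
(As a quick reference for the point about GC triggers, the interpreter exposes the per-generation allocation thresholds and counters that drive collection; the values in the comments below are CPython defaults, not numbers from this deployment.)

import gc

print(gc.get_threshold())  # default (700, 10, 10): gen0 collected after ~700 net allocations
print(gc.get_count())      # current allocation counters for generations 0, 1, 2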

@gioccher

gioccher commented Jan 14, 2025

I tried disabling garbage collection during this job by calling gc.disable() in IntendedJob.run() before config_intended(self):

        gc.collect(generation=2)
        gc.disable()
        config_intended(self)
        gc.enable()

but that didn't solve the issue.

Perhaps nornir runs in its own process and so uses its own GC, or this is not strictly related to GC, or I'm disabling it in the wrong place?
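
One more avenue that might be worth testing (my suggestion, not something tried in this thread): Jinja2 disables its template cache entirely when the environment is created with cache_size=0, so an environment built that way never performs the failing cache lookup, at the cost of re-parsing templates on every render. The loader path below is just an example; how to wire such an environment into the rendering task is not covered here.

from jinja2 import Environment, FileSystemLoader, StrictUndefined

# With cache_size=0, Environment.cache is None, so _load_template never hashes a
# (weakref.ref(loader), name) cache key like the one blowing up in the tracebacks above.
jinja_env = Environment(
    loader=FileSystemLoader("/opt/nautobot/git/golden-config-templates"),  # example path
    undefined=StrictUndefined,
    cache_size=0,
)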

@gioccher

I'm running the patch below to https://github.com/pallets/jinja/blob/main/src/jinja2/utils.py as a workaround now. It doesn't seem to have serious side effects (other than making the Jinja cache less effective, which is better than crashing).

--- jinja_utils_before.py	2025-01-14 13:06:18
+++ jinja_utils_after.py	2025-01-14 13:44:31
@@ -512,7 +512,10 @@
         Raise a `KeyError` if it does not exist.
         """
         with self._wlock:
-            rv = self._mapping[key]
+            try:
+                rv = self._mapping[key]
+            except TypeError:
+                raise KeyError
 
             if self._queue[-1] != key:
                 try:
@@ -532,13 +535,16 @@
         has the highest priority then.
         """
         with self._wlock:
-            if key in self._mapping:
-                self._remove(key)
-            elif len(self._mapping) == self.capacity:
-                del self._mapping[self._popleft()]
+            try:
+                if key in self._mapping:
+                    self._remove(key)
+                elif len(self._mapping) == self.capacity:
+                    del self._mapping[self._popleft()]
 
-            self._append(key)
-            self._mapping[key] = value
+                self._append(key)
+                self._mapping[key] = value
+            except TypeError:
+                pass
 
     def __delitem__(self, key: t.Any) -> None:
         """Remove an item from the cache dict.
@@ -549,7 +555,7 @@
 
             try:
                 self._remove(key)
-            except ValueError:
+            except (ValueError, TypeError):
                 pass
 
     def items(self) -> t.Iterable[t.Tuple[t.Any, t.Any]]:

I still can't tell whether this is a bug inside Jinja or in the way Jinja is used by the nornir task.
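
If patching the installed jinja2 package is awkward to deploy, roughly the same workaround could be applied as a monkeypatch from a module imported at worker startup. A sketch mirroring the diff above (where to hook it in is deployment-specific and not specified here):

import jinja2.utils

_orig_getitem = jinja2.utils.LRUCache.__getitem__
_orig_setitem = jinja2.utils.LRUCache.__setitem__

def _getitem(self, key):
    try:
        return _orig_getitem(self, key)
    except TypeError:
        # A key containing a dead weakref can no longer be hashed; treat it as a cache miss.
        raise KeyError(key) from None

def _setitem(self, key, value):
    try:
        _orig_setitem(self, key, value)
    except TypeError:
        # Skip caching entries whose key can no longer be hashed.
        pass

jinja2.utils.LRUCache.__getitem__ = _getitem
jinja2.utils.LRUCache.__setitem__ = _setitem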

@itdependsnetworks
Contributor

@gioccher I don't expect we will have a quick answer, but @cmsirbu can I get this on your radar to look into?
