
memcache write errors in the LMS ("object too large for cache") #877

Open
timmc-edx opened this issue Dec 18, 2024 · 4 comments

@timmc-edx (Member)

The vast majority of a certain class of memcache calls to set a key are failing with the error "object too large for cache".

These can be identified with `@error.message:"b'object too large for cache'" error.type:pymemcache.exceptions.MemcacheServerError` in a Datadog query.

Notes

  • The failing spans are all operation_name:memcached.command resource_name:set. These come from the memcache library integration. These failing writes do not propagate their error upwards, which is for the best but does mean that querying is a little complicated; to get more information about what memcache operation was attempted, you'll need to look at their parent spans, which are operation_name:django.cache. You'll need to do an a => b trace search.
  • At the django.cache level, the resource names all seem to follow the pattern django.core.cache.backends.memcached.OPERATION KEY_PREFIX (note the space). Three key prefixes are in effect: default, course_structure, and (uncommonly) general. (See the illustrative settings sketch after this list for where these prefixes come from.)
  • The vast majority of these errors are coming from set on course_structure. Here's a status breakdown for those resources. A few of the errors come from default.
  • Slicing a different way: almost all course_structure sets are failing, while almost all default sets are succeeding. The two groups are of roughly equal volume.
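
For context on where these prefixes come from: each one is the `KEY_PREFIX` of a configured Django cache alias. A minimal, illustrative sketch of such a configuration (the alias names, locations, and backends below are assumptions, not copied from the real settings):

```python
# Illustrative sketch only; alias names, locations, and backends are assumptions.
CACHES = {
    # Most ordinary cache.set() calls in the LMS go through this alias.
    "default": {
        "BACKEND": "django.core.cache.backends.memcached.PyMemcacheCache",
        "LOCATION": "memcached:11211",
        "KEY_PREFIX": "default",
    },
    # Holds serialized course structures, which can be very large for big courses.
    "course_structure_cache": {
        "BACKEND": "django.core.cache.backends.memcached.PyMemcacheCache",
        "LOCATION": "memcached:11211",
        "KEY_PREFIX": "course_structure",
    },
    "general": {
        "BACKEND": "django.core.cache.backends.memcached.PyMemcacheCache",
        "LOCATION": "memcached:11211",
        "KEY_PREFIX": "general",
    },
}
```
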
timmc-edx converted this from a draft issue Dec 18, 2024
github-project-automation bot moved this to Todo in Arbi-BOM Jan 6, 2025
@jristau1984

@UsamaSadiq @iamsobanjaved please consider this a discovery ticket to try and find a root cause for this, instead of simply bumping up the max threshold. Thanks!

jristau1984 moved this to Backlog in Arch-BOM Jan 6, 2025
@jristau1984

@dianakhuang can you confirm that Arbi-BOM can fit this work into their current schedule? Thanks!

@mumarkhan999 (Member) commented Feb 18, 2025

During my work to replace the deprecated python-memcache library with pymemcache, I identified this issue related to cache size limitations. This issue was discussed in this Slack thread (link). I also created an SRE ticket to address the cache size limitation (DOS-3846). However, after consulting with Robert, it was determined that increasing the cache size would have a significant impact, as it would require restarting the memcache server, which would result in the loss of active user sessions. As a result, this approach was not pursued.

The root cause of the issue is straightforward: we are attempting to save data to the cache that exceeds the predefined size limit. In the production environment, this limit is set to 2MB. As demonstrated in the attached screenshots, data is successfully cached when it falls within this limit.

Image

Image
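
To make the failure mode easy to reproduce outside the LMS, here is a minimal sketch (not LMS code) of how pymemcache surfaces the limit. It assumes a local memcached whose maximum item size (the `-I` option, 2MB in our production environment) is smaller than the payload being written:

```python
# Minimal reproduction sketch; assumes a local memcached whose -I (max item size)
# is smaller than len(big_value). Not LMS code.
from pymemcache.client.base import Client
from pymemcache.exceptions import MemcacheServerError

client = Client(("localhost", 11211))

big_value = b"x" * (3 * 1024 * 1024)  # 3MB payload, above a 2MB item-size limit

try:
    # noreply=False makes the client wait for the server's response, so the
    # "SERVER_ERROR object too large for cache" reply is actually seen.
    client.set("course_structure.some_large_course", big_value, noreply=False)
except MemcacheServerError as exc:
    print("write rejected:", exc)
```
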

Previously, when using python-memcache, such errors were not encountered because the library silently handled this scenario without raising exceptions. This behavior masked the issue, whereas pymemcache explicitly raises an error when the data exceeds the cache size limit, making the problem more apparent.
The following screenshot shows the silent behavior of the python-memcache package:
Image
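
For comparison, a minimal sketch of the old behavior with the python-memcache package (`import memcache`); as I understand it, the client refuses to send values above `server_max_value_length` and simply returns a falsy result instead of raising:

```python
# Sketch of the legacy python-memcache behavior; kwarg values are illustrative.
import memcache

# python-memcache drops over-limit writes client-side at server_max_value_length
# rather than letting the server reject them.
mc = memcache.Client(["127.0.0.1:11211"], server_max_value_length=2 * 1024 * 1024)

big_value = b"x" * (3 * 1024 * 1024)

result = mc.set("course_structure.some_large_course", big_value)
# No exception is raised; the call just reports failure, so callers that never
# check the return value never notice the dropped write.
print("stored?", bool(result))
```
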

iamsobanjaved moved this from Todo to In Progress in Arbi-BOM Feb 18, 2025
iamsobanjaved moved this from In Progress to Owner Review in Arbi-BOM Feb 18, 2025
@robrap (Contributor) commented Feb 19, 2025

Additional thoughts:

  1. Do you know if the client raises the error before attempting to store in memcached, or if it sends the large data to memcached only to learn that it is too large? If it actually makes the memcached call, we might want to detect oversized values beforehand to avoid the unnecessary call with large data.
  2. Can you add a call to function_trace (see examples in GitHub) so we can see how long it takes to do all the work on cache misses for these large objects?
    a. You may also want to add a call to set_custom_attribute('too_big_for_memcached', True) when we get this error, so we can easily filter traces for this case. But maybe the error is enough? I'm not certain. (A rough sketch of both ideas follows this list.)
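
To illustrate both ideas, here is a rough sketch (the function names, cache alias, and size constant are hypothetical, and it assumes the `edx_django_utils.monitoring` helpers) of timing the cache-miss work with `function_trace`, tagging over-limit payloads with `set_custom_attribute`, and skipping the doomed memcached call entirely:

```python
# Rough sketch only; rebuild_course_structure, the cache alias, and
# MAX_MEMCACHED_ITEM_SIZE are hypothetical.
import pickle

from django.core.cache import caches
from edx_django_utils.monitoring import function_trace, set_custom_attribute

MAX_MEMCACHED_ITEM_SIZE = 2 * 1024 * 1024  # matches the production item-size limit


def cache_course_structure(cache_key, course_key):
    cache = caches["course_structure_cache"]

    # 1. Time the expensive work done on a cache miss so its cost shows up in APM.
    with function_trace("build_course_structure_on_cache_miss"):
        structure = rebuild_course_structure(course_key)  # hypothetical helper

    # 2. Estimate the serialized size before calling memcached so we can skip the
    #    unnecessary network call and tag the trace for easy filtering.
    #    (Approximate: Django's backend does its own pickling of the value.)
    serialized_size = len(pickle.dumps(structure))
    if serialized_size > MAX_MEMCACHED_ITEM_SIZE:
        set_custom_attribute("too_big_for_memcached", True)
        set_custom_attribute("course_structure_serialized_size", serialized_size)
        return structure  # skip the write; the next request recomputes

    cache.set(cache_key, structure)
    return structure
```
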
