
memcache write errors in the LMS ("object too large for cache") #877

Open
timmc-edx opened this issue Dec 18, 2024 · 4 comments

@timmc-edx (Member)

The vast majority of a certain class of memcache calls to set a key are failing with the error "object too large for cache".

These can be identified with `@error.message:"b'object too large for cache'" error.type:pymemcache.exceptions.MemcacheServerError` in a Datadog query.

Notes

  • The failing spans are all operation_name:memcached.command resource_name:set. These come from the memcache library integration. These failing writes do not propagate their error upwards, which is for the best but does mean that querying is a little complicated; to get more information about what memcache operation was attempted, you'll need to look at their parent spans, which are operation_name:django.cache. You'll need to do an a => b trace search.
  • At the django.cache level, the resource names all seem to follow the pattern django.core.cache.backends.memcached.OPERATION KEY_PREFIX (note the space). Three key prefixes are in effect: default, course_structure, and (uncommonly) general. (See the illustrative settings sketch after this list for where these prefixes come from.)
  • The vast majority of these errors are coming from set on course_structure. Here's a status breakdown for those resources. A few of the errors come from default.
  • Slicing a different way: almost all course_structure sets are failing, while almost all default sets are succeeding. The two groups are of roughly equal volume.
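
For context on where these prefixes come from: each one is the `KEY_PREFIX` of a configured Django cache alias. A minimal, illustrative sketch of such a configuration (the alias names, locations, and backends below are assumptions, not copied from the real settings):

```python
# Illustrative sketch only; alias names, locations, and backends are assumptions.
CACHES = {
    # Most ordinary cache.set() calls in the LMS go through this alias.
    "default": {
        "BACKEND": "django.core.cache.backends.memcached.PyMemcacheCache",
        "LOCATION": "memcached:11211",
        "KEY_PREFIX": "default",
    },
    # Holds serialized course structures, which can be very large for big courses.
    "course_structure_cache": {
        "BACKEND": "django.core.cache.backends.memcached.PyMemcacheCache",
        "LOCATION": "memcached:11211",
        "KEY_PREFIX": "course_structure",
    },
    "general": {
        "BACKEND": "django.core.cache.backends.memcached.PyMemcacheCache",
        "LOCATION": "memcached:11211",
        "KEY_PREFIX": "general",
    },
}
```
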
timmc-edx converted this from a draft issue Dec 18, 2024
github-project-automation bot moved this to Todo in Arbi-BOM Jan 6, 2025
@jristau1984

@UsamaSadiq @iamsobanjaved please consider this a discovery ticket to try and find a root cause for this, instead of simply bumping up the max threshold. Thanks!

jristau1984 moved this to Backlog in Arch-BOM Jan 6, 2025
@jristau1984

@dianakhuang can you confirm that Arbi-BOM can fit this work into their current schedule? Thanks!

@mumarkhan999 (Member) commented Feb 18, 2025

During my work to replace the deprecated python-memcache library with pymemcache, I identified this issue related to cache size limitations. This issue was discussed in this Slack thread (link). I also created an SRE ticket to address the cache size limitation (DOS-3846). However, after consulting with Robert, it was determined that increasing the cache size would have a significant impact, as it would require restarting the memcache server, which would result in the loss of active user sessions. As a result, this approach was not pursued.

The root cause of the issue is straightforward: we are attempting to save data to the cache that exceeds the predefined size limit. In the production environment, this limit is set to 2MB. As demonstrated in the attached screenshots, data is successfully cached when it falls within this limit.

Image

Image
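
To make the failure mode easy to reproduce outside the LMS, here is a minimal sketch (not LMS code) of how pymemcache surfaces the limit. It assumes a local memcached whose maximum item size (the `-I` option, 2MB in our production environment) is smaller than the payload being written:

```python
# Minimal reproduction sketch; assumes a local memcached whose -I (max item size)
# is smaller than len(big_value). Not LMS code.
from pymemcache.client.base import Client
from pymemcache.exceptions import MemcacheServerError

client = Client(("localhost", 11211))

big_value = b"x" * (3 * 1024 * 1024)  # 3MB payload, above a 2MB item-size limit

try:
    # noreply=False makes the client wait for the server's response, so the
    # "SERVER_ERROR object too large for cache" reply is actually seen.
    client.set("course_structure.some_large_course", big_value, noreply=False)
except MemcacheServerError as exc:
    print("write rejected:", exc)
```
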

Previously, when using python-memcache, such errors were not encountered because the library silently handled this scenario without raising exceptions. This behavior masked the issue, whereas pymemcache explicitly raises an error when the data exceeds the cache size limit, making the problem more apparent.
The following screenshot shows the silent behavior of the python-memcache package:
Image
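
For comparison, a minimal sketch of the old behavior with the python-memcache package (`import memcache`); as I understand it, the client refuses to send values above `server_max_value_length` and simply returns a falsy result instead of raising:

```python
# Sketch of the legacy python-memcache behavior; kwarg values are illustrative.
import memcache

# python-memcache drops over-limit writes client-side at server_max_value_length
# rather than letting the server reject them.
mc = memcache.Client(["127.0.0.1:11211"], server_max_value_length=2 * 1024 * 1024)

big_value = b"x" * (3 * 1024 * 1024)

result = mc.set("course_structure.some_large_course", big_value)
# No exception is raised; the call just reports failure, so callers that never
# check the return value never notice the dropped write.
print("stored?", bool(result))
```
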

iamsobanjaved moved this from Todo to In Progress in Arbi-BOM Feb 18, 2025
iamsobanjaved moved this from In Progress to Owner Review in Arbi-BOM Feb 18, 2025
@robrap (Contributor) commented Feb 19, 2025

Additional thoughts:

  1. Do you know if the client raises the error before attempting to store in memcached, or if it sends the large data to memcached only to learn that it is too large? If it actually makes the memcached call, we might want to detect oversized values beforehand to avoid the unnecessary call with large data.
  2. Can you add a call to function_trace (see examples in GitHub) so we can see how long it takes to do all the work on cache misses for these large objects?
    a. You may also want to add a call to set_custom_attribute('too_big_for_memcached', True) when we get this error, so we can easily filter traces for this case. But maybe the error is enough? I'm not certain. (A rough sketch of both ideas follows this list.)
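
To illustrate both ideas, here is a rough sketch (the function names, cache alias, and size constant are hypothetical, and it assumes the `edx_django_utils.monitoring` helpers) of timing the cache-miss work with `function_trace`, tagging over-limit payloads with `set_custom_attribute`, and skipping the doomed memcached call entirely:

```python
# Rough sketch only; rebuild_course_structure, the cache alias, and
# MAX_MEMCACHED_ITEM_SIZE are hypothetical.
import pickle

from django.core.cache import caches
from edx_django_utils.monitoring import function_trace, set_custom_attribute

MAX_MEMCACHED_ITEM_SIZE = 2 * 1024 * 1024  # matches the production item-size limit


def cache_course_structure(cache_key, course_key):
    cache = caches["course_structure_cache"]

    # 1. Time the expensive work done on a cache miss so its cost shows up in APM.
    with function_trace("build_course_structure_on_cache_miss"):
        structure = rebuild_course_structure(course_key)  # hypothetical helper

    # 2. Estimate the serialized size before calling memcached so we can skip the
    #    unnecessary network call and tag the trace for easy filtering.
    #    (Approximate: Django's backend does its own pickling of the value.)
    serialized_size = len(pickle.dumps(structure))
    if serialized_size > MAX_MEMCACHED_ITEM_SIZE:
        set_custom_attribute("too_big_for_memcached", True)
        set_custom_attribute("course_structure_serialized_size", serialized_size)
        return structure  # skip the write; the next request recomputes

    cache.set(cache_key, structure)
    return structure
```
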
