Version check makes potentially invalid assumptions about ELF layout #174
Comments
Hi @jhance, and thanks for opening the issue. We will take a look soon. Meanwhile, could you tell us what version of Ubuntu you are using and how you are obtaining Python (deadsnakes, main repo, pyenv...)? Also, is this when analysing a live process or a core file?
Also, just to make sure we build on what you already debugged: what's making your rodata section different from a regular binary? (It's not immediately clear from your comment.)
I am compiling Python from source myself with a crosstool targeting a non-distribution-provided version of glibc. As such, I don't expect many people to be following the same process. I am not sure what is causing this difference, though; maybe it's because I use gold instead of ld? I was analyzing a core file while pointing to the same Python binary that the core file was extracted from. I just recently found that a workaround is to essentially disable the
My suspicion is that you have a first PT_LOAD segment that doesn't map to the start of the file. When the loader maps the file into memory, the sections (such as .rodata) don't matter anymore; the only thing the loader sees is LOAD segments. For this reason, we just need to find where the first LOAD segment (the mount point) is in the file and correct by that (which we aren't doing at the moment). To corroborate this, do you mind sending the output of `readelf -a` over the binary?
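A minimal sketch of the correction described above, assuming we have the first PT_LOAD's `p_vaddr` and `p_offset` from the program headers. The helper name and the sample values are illustrative, not pystack's actual code:

```python
# Translate a virtual address into a file offset using the containing
# LOAD segment, instead of assuming the segment starts at file offset 0:
#   file_offset = vaddr - p_vaddr + p_offset
def vaddr_to_file_offset(vaddr, load_vaddr, load_offset):
    return vaddr - load_vaddr + load_offset

# Common case: first LOAD has p_offset 0, so vaddr and offset line up.
assert vaddr_to_file_offset(0x4011C0, 0x400000, 0x0) == 0x11C0

# The case suspected here: the segment is "mounted" at a nonzero offset,
# so the naive (offset-0) calculation would be off by 0x400000.
assert vaddr_to_file_offset(0x8741C0, 0x800000, 0x400000) == 0x4741C0
```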
(I cut out some sections that contain literally all of the Python symbols). We pass
Ah there you go:
I will try to make a patch this week
Yeah, that's also my guess, but I have seen this in the wild as well, so I don't think it's unique to this situation
Thanks for the quick help. I am hardly an ELF expert, so I ended up reading a lot of docs today to figure out why this was not working...
After playing with this for a while, I still cannot reproduce it in a variety of situations, so making a patch is going to be very difficult. I still think we are handling the code correctly: the calculation works if you do:
(which is wrong; it only works for offset 0). But that works for the case in this issue, so the resulting offset is correct. The offset was calculated later:
But according to
If you just do what’s in main now:
Which should be correct. That’s also the first map I get:
So I don't understand the problem as described above. I will close this issue until we have a better reproducer that we can investigate.
I am sad to see this was closed; I didn't realize you needed more information. Would you be willing to investigate if I sent you the Python binary itself? I think that should be legal per Python's license, without you being able to reconstruct the binary, and I don't think you should need to actually run the Python binary in order to figure out what is wrong. Alternatively, could there be a loophole to just pass the Python version?
We would need the Python binary and all transitive dependencies (shared objects) in a container or environment where we can execute it. Alternatively, you would need to tell us how to build the Python binary that you are experiencing the problem with. Ideally you can provide us with a Docker container or something similar that we can inspect. Unfortunately, it is impossible for us to debug something as tricky as this without access to a reliable reproducer.
I think it should be possible if I give you the Python binary and you run it against the system libc. It's statically linked to everything except libc, and running it against your libc is a little sketchy but should be mostly fine if we use a jammy container. I will still need to get approval from someone to send you the binary, though. Otherwise you would need to run https://github.com/dropbox/dbx_build_tools/blob/master/build_tools/drte/tools/drte-build.sh yourself, because I don't think it's going to be possible for me to send you the libc artifact, and that is way too much of a pain to ask. I guess I will poke around some more later and see if I can find something different on my end from what you said is happening.
Ok, meanwhile could you do the following:
Hopefully I can spot what the hell is going on from how the loader is mapping the segments in memory.
Reopening as I have convinced myself we are doing it wrong
@jhance gentle ping
Sorry, will check
It is still giving me an invalid version error. I am trying to run the suggested gdb output stuff now. I attempted it by cloning that branch and doing
Readelf: https://gist.github.com/jhance/b6d9db093077cf7fd872f219581cc774
Wow, that's surprising! Could you maybe add some print calls, or use gdb in the getMemoryLocationFromElf function, to find out why it is failing to locate the version in the segments?
I will review your last comment in an hour to see if I can make sense of what's going on
You are going to need to help me here, because I think what the PR is doing should be correct. The first PT_LOAD according to readelf is loaded at:

```
LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
```

Which is also the load point of the binary according to gdb:

```
Start Addr    End Addr    Size    Offset    Perms    objfile
```

That gives us a bias of 0, so the address of the symbol relocated to the ELF would be:

```
0xbb8ea8 - (0x400000 - 0x400000) = 0xbb8ea8
```

That is the same as the address of the symbol in the readelf output. From there, the computation uses only data from the file, by subtracting the virtual address and adding the offset. Do you mind trying to step through that function, or adding some print debugging, so we can find out what we are getting wrong?
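The bias arithmetic above can be spelled out as a small check, using the values quoted from the readelf and gdb output:

```python
# Bias = (runtime start of first mapping) - (p_vaddr of first PT_LOAD).
# With the values quoted above, the bias is 0, so the symbol's runtime
# address equals its address in the ELF file.
map_start = 0x400000      # Start Addr of the first mapping, per gdb
load_vaddr = 0x400000     # p_vaddr of the first PT_LOAD, per readelf
symbol_addr = 0xBB8EA8    # runtime address of the symbol

bias = map_start - load_vaddr
elf_addr = symbol_addr - bias

assert bias == 0
assert elf_addr == 0xBB8EA8  # matches the address in the readelf output
```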
Turning on
Is there something wrong with my
I realize this error is now being thrown from |
It's passing
We still need to know whether the function is doing what it should be doing. The fact that the address of the symbol is correct doesn't mean that the contents of the symbol are. The contents are fetched from the ELF, so that's why I am asking you to add some print debugging, or use gdb, to check that we are reading the memory correctly. The fact that we are getting (0, 0) likely means that we are reading it wrong somehow
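One way to sanity-check what those contents should look like: `Py_Version` holds CPython's `PY_VERSION_HEX` value, so the four bytes read from the file can be decoded and compared against the interpreter's real version. This is a hand-rolled sketch, assuming a little-endian binary:

```python
import struct

def decode_py_version(raw):
    """Decode a 4-byte little-endian PY_VERSION_HEX value into
    (major, minor, micro)."""
    (hexversion,) = struct.unpack("<I", raw)
    major = (hexversion >> 24) & 0xFF
    minor = (hexversion >> 16) & 0xFF
    micro = (hexversion >> 8) & 0xFF
    return major, minor, micro

# 3.11.0 final is PY_VERSION_HEX 0x030B00F0:
assert decode_py_version(struct.pack("<I", 0x030B00F0)) == (3, 11, 0)
```

If the bytes at the computed offset decode to something nonsensical (like the 12.41 in this issue), the read landed in the wrong place.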
Figured out how to get debug logs to fire and added some more prints:
I will try to get some more debug logs in the internals of fetching the object from the process
Basically we need to know that this is doing the correct thing:
Try to follow all the values to check that we are doing the calculation correctly. If we are, maybe the problem is somewhere else, but it's important to confirm that this part of the PR is working correctly
Yeah, that suggests we are copying the wrong thing, so let's try to see what part of the calculation is wrong. We are very close!
I was able to get a print from
Can you try to follow why we are not calling into the code I referred to? I think that is likely what I am not understanding. If the function is not being called, it's because it somehow thinks the memory is in the core... which should not be the case, because that is what is failing. Try to give me as much print debugging of what's going on in that function as possible
Yeah, working on it. The way I am testing it doesn't seem to have very good partial compilation
These are the log lines I added:
Me not knowing anything about how this works thinks that the `unsigned long offset - start` where `offset < start` is a little bit sus. Seems like this is underflowing?
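For illustration, the wraparound being suspected here looks like this (simulated in Python with 64-bit masking, since C++ unsigned arithmetic wraps modulo 2^64 rather than going negative; the sample values are made up):

```python
MASK64 = (1 << 64) - 1

def unsigned_sub(offset, start):
    """Mimic C++ `unsigned long offset - start` on a 64-bit platform."""
    return (offset - start) & MASK64

# When offset >= start the result is the ordinary difference...
assert unsigned_sub(0x4000, 0x1000) == 0x3000

# ...but when offset < start it wraps to a huge value instead of -0x3000,
# which downstream code may then treat as a (bogus) valid offset.
assert unsigned_sub(0x1000, 0x4000) == 0xFFFFFFFFFFFFD000
```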
Oh wow, according to this we have a map in the core with the symbol! That's super strange. Can you give me the following:
Also, can you try to force it to go to the ELF file for that address in particular, instead of to the core, to confirm that the function I am fixing works? Just hardcode a conditional for that specific address
Guess the core I'm using in this example has a lot of numpy <_< The output of
I am also suspicious of this:

```
DEBUG(process_core): map start = 4194304
```

That seems to imply that the map covers from 4194304 to 13377536 but only has a file size of 4k!
It seems to work when I hack it to not use the core for that address, like you suggested. One other thing that might be relevant: there is no libpython.so or the like; it is all statically linked
Can you give me the output of `readelf -a` on the core file itself?
This is my understanding of the situation: the core contains a mapping that covers a big range but somehow doesn't contain everything in that range. That's super weird, as my understanding is that normally you just use a file size of 0 to say "the data is somewhere else in the original file"; but if I am correct, it means we need to account for a mapping having only partial content of its full range. Once we identify how to detect that we are in this situation, we can properly redirect the copy to the ELF file, update my PR, and the problem is solved. We just need to be able to confirm this situation.
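A sketch of the redirect logic described above, under the assumption that a core mapping can store fewer bytes than its virtual range covers. The `Mapping` type and its field names are hypothetical, not pystack's actual structures:

```python
from dataclasses import dataclass

@dataclass
class Mapping:
    start: int       # virtual start of the mapping
    mem_size: int    # size of the virtual range it covers
    file_size: int   # bytes actually stored in the core for this range

def present_in_core(mapping, vaddr):
    """True if the byte at vaddr is actually stored in the core; if the
    address falls in the mapped range but past file_size, the read must
    fall back to the original ELF file."""
    stored_end = mapping.start + min(mapping.file_size, mapping.mem_size)
    return mapping.start <= vaddr < stored_end

# Numbers from the DEBUG output above: a ~9 MiB range backed by only 4 KiB.
m = Mapping(start=4194304, mem_size=13377536 - 4194304, file_size=4096)
assert present_in_core(m, 4194304 + 100)   # inside the stored 4 KiB
assert not present_in_core(m, 0xBB8EA8)    # in range, but past the stored bytes
```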
Ok, I think I got this. I will update my PR in an hour
@jhance I updated the PR, can you give it another go?
Yeah it works.
Ok, I will clean it up and merge it soon. Thanks for all the patience, all the help, and for insisting that we fix the issue :)
Thank you for supporting our obscure python-built-with-bazel setup
Is there an existing issue for this?
Current Behavior
I have been trying to debug why `pystack` thinks I am using Python version `12.41`, and it turns out that my Python binary has a different layout. This problem seems to only occur (or at least I've only noticed it so far) when pystack attempts to find the value of `Py_Version`. The version of Python supplied by Ubuntu has a section entry for `rodata` that looks like this (obtained with `readelf -S`). Mine looks like this:
Expected Behavior
The proper way to look up the address would be something like:

`<addr of Py_Version> - 0x00000000008741c0 + 0x004741c0`

Since these values are the same in most Python binaries, the issue would normally go unnoticed. I am not sure whether this is a guarantee that the normal Python build process makes, so this could also bite regular Python builds later.
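Concretely, with my binary's `rodata` values, the calculation works out as follows (a sketch; the section address and offset are the ones from my readelf output):

```python
rodata_vaddr = 0x8741C0     # sh_addr of .rodata in my binary
rodata_offset = 0x4741C0    # sh_offset of .rodata in my binary
py_version_addr = 0xBA2C68  # address pystack is attempting to look up

# file offset = symbol address - section address + section file offset
file_offset = py_version_addr - rodata_vaddr + rodata_offset
assert file_offset == 0x7A2C68  # where the bytes actually live in the file
```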
I was able to validate that the calculation above would work for my binary, where `0x0000000000ba2c68` is the address that pystack is attempting to look up.
Steps To Reproduce
I am not sure how you would easily reproduce this issue, as you'd need to produce a Python binary that has rodata addresses like the ones in my example. If you are able to do that, the issue reproduces very easily and all functionality of `pystack` will fail.
Pystack Version
1.3.0
Python Version
3.11
Linux distribution
Ubuntu
Anything else?
No response