Take the interpreter holding references to the arguments into account #1032

Merged: 19 commits into Lightning-AI:main on Sep 4, 2024

Conversation

@shino16 (Contributor) commented on Aug 23, 2024

Fixes #1029. The tests in test_examine_memory.py use some magic numbers based on an incorrect assumption; I have yet to update them.

@shino16 (Contributor, Author) commented on Aug 26, 2024

While rewriting the tests, I ran into #1043. We previously did not test the backward of split.

@shino16 (Contributor, Author) commented on Aug 27, 2024

I tried to eliminate the "golden" values from the tests, but this seems impossible when nvFuser is involved. For example, for the following test case, nvFuser returns result2 as a tensor with stride (0, 0, 1).

def bar(a, b):  # [4] [2,2]
    a_1 = torch.unsqueeze(a, 0)  # [1,4]
    a_2 = torch.unsqueeze(a_1, 1)  # [1,1,4]
    a_3 = a_2.expand(2, 3, 4)  # [2,3,4]

    b_1 = torch.reshape(b, (4,))  # [4]
    b_2 = torch.unsqueeze(b_1, 0)  # [1,4]
    b_3 = torch.unsqueeze(b_2, 1)  # [1,1,4]
    b_4 = b_3.expand(2, 3, 4)  # [2,3,4]

    result1 = a_2 + b_3
    result2 = b_4 + a_3
    return result1, result2
Generated trace:
def augmented_forward_fn(a, b):
  # a: "cuda:0 f32[4]"
  # b: "cuda:0 f32[2, 2]"
  [t14, t15] = nvFusion0(a, b)
    # t0 = prims.broadcast_in_dim(a, [1, 4], [1])  # t0: "cuda:0 f32[1, 4]"
    # t1 = prims.broadcast_in_dim(t0, [1, 1, 4], [0, 2])  # t1: "cuda:0 f32[1, 1, 4]"
    # t5 = prims.broadcast_in_dim(t1, (2, 3, 4), (0, 1, 2))  # t5: "cuda:0 f32[2, 3, 4]"
    # t7 = prims.reshape(b, (4,))  # t7: "cuda:0 f32[4]"
    # t8 = prims.broadcast_in_dim(t7, [1, 4], [1])  # t8: "cuda:0 f32[1, 4]"
    # t9 = prims.broadcast_in_dim(t8, [1, 1, 4], [0, 2])  # t9: "cuda:0 f32[1, 1, 4]"
    # t13 = prims.broadcast_in_dim(t9, (2, 3, 4), (0, 1, 2))  # t13: "cuda:0 f32[2, 3, 4]"
    # t14 = prims.add(t1, t9)  # t14: "cuda:0 f32[1, 1, 4]"
    # t15 = prims.add(t13, t5)  # t15: "cuda:0 f32[2, 3, 4]"
  return {'output': (t14, t15), 'flat_args': [a, b], 'flat_output': (t14, t15)}, ((), ())
(144, {'unpack_trivial a': 16, 'unpack_trivial b': 16, 'nvFusion0 t14, t15': 112})
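
For reference, a minimal way to observe this (a sketch only; `jbar` is an illustrative name, `bar` is the function above, and it assumes a CUDA device with the nvFuser executor available):

import torch
import thunder

jbar = thunder.jit(bar)  # bar as defined above
a = torch.randn(4, device="cuda")
b = torch.randn(2, 2, device="cuda")
result1, result2 = jbar(a, b)
# With nvFuser, result2 can come back as an expanded (broadcast) view,
# e.g. shape [2, 3, 4] with stride (0, 0, 1), so its real allocation is far
# smaller than numel * itemsize, which invalidates any hard-coded "golden" value.
print(result2.shape, result2.stride())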

Can we skip nvFuserExecutor for such tests as this PR does? It still serves the purpose of checking if get_alloc_memory scans the trace correctly, and it does not make much sense to test something that gives untrue values.

shino16 marked this pull request as ready for review on August 27, 2024, 06:13
@IvanYashchuk (Collaborator) commented:
> Can we skip nvFuserExecutor for such tests as this PR does? It still serves the purpose of checking if get_alloc_memory scans the trace correctly, and it does not make much sense to test something that gives untrue values.

Yes, let's skip nvFuser with a comment explaining why. In the future, we need to update get_alloc_memory to account for broadcasts in fused regions correctly.

@IvanYashchuk (Collaborator) left a review comment:


Python has an interesting behavior here: it holds references to a function's input variables for the duration of the call. To make get_alloc_memory represent Python's memory deallocation more accurately, we need to take into account the type of container the tensors are stored in. If the container is immutable, Python holds a reference to the immutable object and will not let us modify its contents, so the tensors can never be freed during the function call. If the container is mutable, the tensors can be freed once the container itself no longer holds a reference to them. We should take this into consideration in a follow-up.
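
A minimal sketch of the difference (plain Python for illustration, not Thunder code):

import torch

def consume_tuple(args):
    # `args` is a tuple: it is immutable, so it keeps referencing its element
    # for the entire call, and the tensor cannot be freed before returning.
    return args[0] * 2

def consume_list(args):
    # `args` is a list: pop() removes the container's reference, and after
    # `del x` the tensor can be freed before the function returns.
    x = args.pop(0)
    out = x * 2
    del x
    return out

out1 = consume_tuple((torch.randn(4),))  # the tuple pins the tensor until the call returns
out2 = consume_list([torch.randn(4)])    # the list no longer references the tensor after pop()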

Review threads (resolved) on: thunder/examine/memory_caculation.py, thunder/tests/test_examine_memory.py
@jjsjann123 (Collaborator) commented:

> nvFuser returns result2 as a tensor with stride (0, 0, 1).

I'm disappointed that eager doesn't do so by default. 😢

@shino16 (Contributor, Author) commented on Aug 29, 2024

As Ivan pointed out, the previous code did not take clear_mutable_collection into account, which forces the interpreter to release its references to a collection's elements. Indeed, when I passed lists as arguments to bw_trace instead of tuples, it gave different numbers:

def f(x, y, z):
    return x * y * z

get_alloc_memory's output (commit 05c0a5d):

{'unpack_trivial ': 0, 'unpack_sequence ': 0, 'clear_mutable_collection ': 0, 'unpack_sequence t2': 16, 'unpack_sequence t0, x, y, z': 64, 'mul t8': 16, 'mul t9': 16, 'mul t10': 16, 'mul t11': 16, 'python_del t8': -16}
peak = 144, after = 128

Actual memory usage:

peak = 96, after = 64

I slightly changed thunder/core/transforms.py so that bw_trace.args matches what ThunderFunction.backward actually passes, which is lists rather than tuples. get_alloc_memory follows this change and now takes clear_mutable_collection on arguments into account.

Now (commit 7c5318c) get_alloc_memory gives correct numbers:

{'argument t0': 16, 'argument x': 16, 'argument y': 16, 'argument z': 16, 'argument t2': 16, 'mul t8': 16, 'python_del z': -16, 'mul t9': 16, 'python_del t0': -16, 'mul t10': 16, 'python_del y': -16, 'mul t11': 16, 'python_del x': -16, 'python_del t8': -16}
peak = 96, after = 64
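
A rough illustration of the pattern (the sizes, names, and computation are illustrative; clearing the list stands in for clear_mutable_collection):

import torch

def backward_like(saved):
    # `saved` is a mutable list, as ThunderFunction.backward passes to the
    # backward trace; emptying it means the caller-visible container no
    # longer pins the tensors.
    x, y, z = saved
    saved.clear()  # analogous to clear_mutable_collection
    t8 = x * y
    del x, y       # with the list cleared, these tensors become freeable now
    t9 = t8 * z
    del z, t8      # the last argument and the intermediate can also be released early
    return t9

grad = backward_like([torch.randn(4), torch.randn(4), torch.randn(4)])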

@shino16 (Contributor, Author) commented on Sep 2, 2024

@t-vi This is ready for review. Thank you!

@t-vi (Collaborator) commented on Sep 2, 2024

> Yes, let's skip nvFuser with a comment explaining why.

Unless I missed it, we still need to add the comment.

@shino16 (Contributor, Author) commented on Sep 2, 2024

Thank you for pointing that out! I added the comment.

The current test failure is apparently due to a breaking change in actions/upload-artifact@v3. I described the cause in Lightning-AI/utilities#302 and made a fix in Lightning-AI/utilities#303.

@t-vi (Collaborator) left a review.


t-vi merged commit d05ebc6 into Lightning-AI:main on Sep 4, 2024. 37 checks passed.
shino16 added a commit to shino16/lightning-thunder that referenced this pull request Sep 5, 2024

Successfully merging this pull request may close the following issue: get_alloc_memory deallocates arguments before the function returns.

5 participants