Take the interpreter holding references to the arguments into account #1032

Merged: 19 commits into Lightning-AI:main on Sep 4, 2024

Conversation

@shino16 (Contributor) commented on Aug 23, 2024

Fixes #1029. The tests in test_examine_memory.py use some magic numbers based on an incorrect assumption; I have yet to update them.

@shino16 (Contributor, Author) commented on Aug 26, 2024

While rewriting the tests, I ran into #1043. We previously did not test the backward of split.

@shino16 (Contributor, Author) commented on Aug 27, 2024

I tried to eliminate the "golden" values from the tests, but this seems impossible when nvFuser is involved. For example, for the following test case, nvFuser returns result2 as a tensor with stride (0, 0, 1).

def bar(a, b):  # [4] [2,2]
    a_1 = torch.unsqueeze(a, 0)  # [1,4]
    a_2 = torch.unsqueeze(a_1, 1)  # [1,1,4]
    a_3 = a_2.expand(2, 3, 4)  # [2,3,4]

    b_1 = torch.reshape(b, (4,))  # [4]
    b_2 = torch.unsqueeze(b_1, 0)  # [1,4]
    b_3 = torch.unsqueeze(b_2, 1)  # [1,1,4]
    b_4 = b_3.expand(2, 3, 4)  # [2,3,4]

    result1 = a_2 + b_3
    result2 = b_4 + a_3
    return result1, result2
Generated trace:
def augmented_forward_fn(a, b):
  # a: "cuda:0 f32[4]"
  # b: "cuda:0 f32[2, 2]"
  [t14, t15] = nvFusion0(a, b)
    # t0 = prims.broadcast_in_dim(a, [1, 4], [1])  # t0: "cuda:0 f32[1, 4]"
    # t1 = prims.broadcast_in_dim(t0, [1, 1, 4], [0, 2])  # t1: "cuda:0 f32[1, 1, 4]"
    # t5 = prims.broadcast_in_dim(t1, (2, 3, 4), (0, 1, 2))  # t5: "cuda:0 f32[2, 3, 4]"
    # t7 = prims.reshape(b, (4,))  # t7: "cuda:0 f32[4]"
    # t8 = prims.broadcast_in_dim(t7, [1, 4], [1])  # t8: "cuda:0 f32[1, 4]"
    # t9 = prims.broadcast_in_dim(t8, [1, 1, 4], [0, 2])  # t9: "cuda:0 f32[1, 1, 4]"
    # t13 = prims.broadcast_in_dim(t9, (2, 3, 4), (0, 1, 2))  # t13: "cuda:0 f32[2, 3, 4]"
    # t14 = prims.add(t1, t9)  # t14: "cuda:0 f32[1, 1, 4]"
    # t15 = prims.add(t13, t5)  # t15: "cuda:0 f32[2, 3, 4]"
  return {'output': (t14, t15), 'flat_args': [a, b], 'flat_output': (t14, t15)}, ((), ())
(144, {'unpack_trivial a': 16, 'unpack_trivial b': 16, 'nvFusion0 t14, t15': 112})
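
For reference, a minimal way to observe this (a sketch only; `jbar` is an illustrative name, `bar` is the function above, and it assumes a CUDA device with the nvFuser executor available):

import torch
import thunder

jbar = thunder.jit(bar)  # bar as defined above
a = torch.randn(4, device="cuda")
b = torch.randn(2, 2, device="cuda")
result1, result2 = jbar(a, b)
# With nvFuser, result2 can come back as an expanded (broadcast) view,
# e.g. shape [2, 3, 4] with stride (0, 0, 1), so its real allocation is far
# smaller than numel * itemsize, which invalidates any hard-coded "golden" value.
print(result2.shape, result2.stride())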

Can we skip nvFuserExecutor for such tests as this PR does? It still serves the purpose of checking if get_alloc_memory scans the trace correctly, and it does not make much sense to test something that gives untrue values.

shino16 marked this pull request as ready for review on August 27, 2024, 06:13
@IvanYashchuk (Collaborator) commented:
> Can we skip nvFuserExecutor for such tests as this PR does? It still serves the purpose of checking if get_alloc_memory scans the trace correctly, and it does not make much sense to test something that gives untrue values.

Yes, let's skip nvFuser with a comment explaining why. In the future, we need to update get_alloc_memory to account for broadcasts in fused regions correctly.

@IvanYashchuk (Collaborator) left a review comment:


Python has an interesting behavior here: it holds references to a function's input variables for the duration of the call. To make get_alloc_memory represent Python's memory deallocation more accurately, we need to take into account the type of container the tensors are stored in. If the container is immutable, Python holds a reference to the immutable object and will not let us modify its contents, so the tensors can never be freed during the function call. If the container is mutable, the tensors can be freed once the container itself no longer holds a reference to them. We should take this into consideration in a follow-up.
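
A minimal sketch of the difference (plain Python for illustration, not Thunder code):

import torch

def consume_tuple(args):
    # `args` is a tuple: it is immutable, so it keeps referencing its element
    # for the entire call, and the tensor cannot be freed before returning.
    return args[0] * 2

def consume_list(args):
    # `args` is a list: pop() removes the container's reference, and after
    # `del x` the tensor can be freed before the function returns.
    x = args.pop(0)
    out = x * 2
    del x
    return out

out1 = consume_tuple((torch.randn(4),))  # the tuple pins the tensor until the call returns
out2 = consume_list([torch.randn(4)])    # the list no longer references the tensor after pop()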

Review threads (resolved) on: thunder/examine/memory_caculation.py, thunder/tests/test_examine_memory.py
@jjsjann123 (Collaborator) commented:

> nvFuser returns result2 as a tensor with stride (0, 0, 1).

I'm disappointed that eager doesn't do so by default. 😢

@shino16 (Contributor, Author) commented on Aug 29, 2024

As Ivan pointed out, the previous code did not take clear_mutable_collection into account, which forces the interpreter to release its references to a collection's elements. Indeed, when I passed lists as arguments to bw_trace instead of tuples, it gave different numbers:

def f(x, y, z):
    return x * y * z

get_alloc_memory's output (commit 05c0a5d):

{'unpack_trivial ': 0, 'unpack_sequence ': 0, 'clear_mutable_collection ': 0, 'unpack_sequence t2': 16, 'unpack_sequence t0, x, y, z': 64, 'mul t8': 16, 'mul t9': 16, 'mul t10': 16, 'mul t11': 16, 'python_del t8': -16}
peak = 144, after = 128

Actual memory usage:

peak = 96, after = 64

I slightly changed thunder/core/transforms.py so that bw_trace.args matches what ThunderFunction.backward actually passes, which is lists rather than tuples. get_alloc_memory follows this change and now takes clear_mutable_collection on arguments into account.

Now (commit 7c5318c) get_alloc_memory gives correct numbers:

{'argument t0': 16, 'argument x': 16, 'argument y': 16, 'argument z': 16, 'argument t2': 16, 'mul t8': 16, 'python_del z': -16, 'mul t9': 16, 'python_del t0': -16, 'mul t10': 16, 'python_del y': -16, 'mul t11': 16, 'python_del x': -16, 'python_del t8': -16}
peak = 96, after = 64
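
A rough illustration of the pattern (the sizes, names, and computation are illustrative; clearing the list stands in for clear_mutable_collection):

import torch

def backward_like(saved):
    # `saved` is a mutable list, as ThunderFunction.backward passes to the
    # backward trace; emptying it means the caller-visible container no
    # longer pins the tensors.
    x, y, z = saved
    saved.clear()  # analogous to clear_mutable_collection
    t8 = x * y
    del x, y       # with the list cleared, these tensors become freeable now
    t9 = t8 * z
    del z, t8      # the last argument and the intermediate can also be released early
    return t9

grad = backward_like([torch.randn(4), torch.randn(4), torch.randn(4)])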

@shino16 (Contributor, Author) commented on Sep 2, 2024

@t-vi This is ready for review. Thank you!

@t-vi (Collaborator) commented on Sep 2, 2024

> Yes, let's skip nvFuser with a comment explaining why.

Unless I missed it, we still need to add the comment.

@shino16 (Contributor, Author) commented on Sep 2, 2024

Thank you for pointing that out! I added the comment.

The current test failure is apparently due to a breaking change in actions/upload-artifact@v3. I described the cause in Lightning-AI/utilities#302 and made a fix in Lightning-AI/utilities#303.

@t-vi (Collaborator) left a review.


t-vi merged commit d05ebc6 into Lightning-AI:main on Sep 4, 2024. 37 checks passed.
shino16 added a commit to shino16/lightning-thunder that referenced this pull request Sep 5, 2024

Successfully merging this pull request may close the following issue: get_alloc_memory deallocates arguments before the function returns.

5 participants