Proper dataset that reduces VRAM usage and provides higher performance. #37
base: main
Conversation
@fire Hey, I think you can use this information. I don't think you have implemented this. |
Thanks! Happy holidays. I am still sleeping on how to do chunking as an autocomplete, because I don’t have 10x the GPU RAM. |
I think it needs the concept of mesh-to-mesh, and the idea that you can localize the input. |
Happy holidays 🎉 Do you mean chunking this?
Could you clarify? Are you talking about the augmentations/data cleaning? |
Here's a rewording of the problem.

**Problem Context**

The problem at hand involves processing a large triangle mesh with limited GPU RAM. This is challenging because the size of the mesh may exceed the available memory, causing performance issues or outright failure to process the mesh. The issue becomes more pronounced when dealing with 10x the input size.

**Current Approach and its Limitations**

Currently, I cache the generated data. To clarify, the T5 embedding is not what is causing the problem.

**Alternative Approach**

Given the limitations of the current approach, an alternative could be to divide the mesh into smaller chunks and process each chunk separately. This way, larger meshes can be handled without exceeding GPU RAM capacity: only a portion of the triangle mesh is processed at a time, which effectively manages GPU RAM use and should scale to larger inputs. However, we don't have a mesh-to-mesh workflow yet. |
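A minimal sketch of that chunked alternative, assuming the full faces tensor stays in CPU RAM and only one chunk at a time is moved onto the GPU. `process_chunk` is a hypothetical placeholder for whatever per-chunk work is run, and anything that needs cross-chunk relations (such as face edges spanning two chunks) would need extra handling on top of this:

```python
# Sketch only: keep the full faces tensor on the CPU and move one chunk at a
# time onto the GPU, so VRAM holds a single chunk rather than the whole mesh.
# `process_chunk` is a hypothetical stand-in for the actual per-chunk work.
import torch

def process_in_chunks(faces, chunk_size, process_chunk, device="cuda"):
    """faces: (num_faces, 3) tensor kept in CPU RAM."""
    results = []
    for start in range(0, faces.shape[0], chunk_size):
        chunk = faces[start:start + chunk_size].to(device)  # only this chunk lives in VRAM
        results.append(process_chunk(chunk).cpu())           # move the output back off the GPU
        del chunk
        torch.cuda.empty_cache()                              # free the chunk's VRAM before the next one
    return results
```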
Hmm, have you tested using the CPU only? Instead of the faces being on the GPU, you could move them onto the CPU, since the computer might have more RAM than the GPU. Does this issue only occur when you generate the face edges? |
Hi Marcus, You're correct in your understanding of the problem. Due to the size of the triangle mesh, we can only process about 10% of it at a time, such as the 7k portion of the subway car or the feminine character. I have indeed tested using only the CPU for processing. While this approach works because I have close to 200GB of CPU RAM, it significantly slows down the transformer stage. As you mentioned, if the computer RAM isn't enough, it will use the virtual RAM which is quite slow. The issue primarily occurs when training the mesh transformer. The autoencoder stage doesn't seem to require very large inputs, but rather a variety of inputs. Therefore, replacing the autoencoder with MeshDiscretizer might not be necessary in this case. Thank you for your suggestions and insights. They are greatly appreciated as we continue to work on optimizing this process. |
I see your problem: derive_face_edges_from_faces is optimized for speed, since it doesn't loop through the edges but processes the whole dimension at once.
Hmm, well, since you can store the face edges on disk, this is something you'll only need to do once, and only once per 3D model, since the augmented versions still use the same faces.
I'm not 100% sure what happens, but the transformer has a max token length; I'm not sure what happens when it exceeds 8192 tokens, e.g. 1365 triangles (8192 / 6 tokens per triangle). Have you seen any difference when training with more than 1365 triangles vs. meshes with fewer than 1300 triangles? |
Have you tried not using the cache function? I'm not familiar with lru_cache, and I'm not sure whether it copies or references the data, or how it deals with data on another device (e.g. CPU/GPU). Best practice would be to generate the face edges beforehand: if you take a 3D model and augment it 100 times, the face edges won't change after augmentation, so you only need to store them in VRAM once and reference them, instead of storing 100 copies in the cache. Also, if you are using a batch size of more than 1, you aren't calling derive_face_edges_from_faces with just one mesh; it will process the entire batch at once, e.g. 16 x 7000 x 3 = a lot of data. |
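A small sketch of that suggestion: generate the face edges once per source model, cache them on disk, and share the same tensor across all augmented copies. The import path and exact signature of derive_face_edges_from_faces are assumptions to check against your meshgpt-pytorch version, and `augment` is a hypothetical placeholder:

```python
# Sketch: compute face edges once per source mesh, cache them on disk, and let
# every augmented copy reference the same tensor instead of re-deriving (or
# lru_cache-ing) them per sample. Import path and signature are assumed.
import torch
from meshgpt_pytorch import derive_face_edges_from_faces  # assumed export location

def face_edges_for_model(faces, cache_path):
    """faces: (num_faces, 3) long tensor for ONE mesh, kept on the CPU."""
    try:
        return torch.load(cache_path)                      # reuse the on-disk copy if it exists
    except FileNotFoundError:
        face_edges = derive_face_edges_from_faces(faces)   # assuming an unbatched tensor is accepted
        torch.save(face_edges, cache_path)
        return face_edges

# face_edges = face_edges_for_model(faces, "model_face_edges.pt")
# entries = [
#     # augment() is hypothetical; augmentation changes vertices, not faces,
#     # so every entry can point at the SAME face_edges tensor (a reference, not a copy)
#     {"vertices": augment(vertices), "faces": faces, "face_edges": face_edges}
#     for _ in range(100)
# ]
```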
I am trying a different approach.
Later today I can try implementing it the way you suggested. |
Hmm, that might be unnecessary. I did some tests with a 6206-face model:
Good. I think it's much more efficient if you get the face edges and store one array per model, instead of creating and storing hundreds of identical copies due to augmentation. |
Hi @lucidrains, I think it's time for a dataset class, since people are not training properly. If you don't preprocess the dataset, the model is forced to generate the face_edges, codes, and text embeddings at each step, which increases VRAM usage and the time required to generate this data. This isn't a small amount but a big one; for example, face_edges can require 4-96 GB of VRAM. I have multiple people not quite understanding this, and they don't understand why their VRAM usage is so high. Since VRAM usage is linear, it is about 12 MB per face at a batch size of 64; with a face count of 6000, generating the face_edges and tokenizing the data will use about 75 GB, all of which can be pre-generated without any significant VRAM usage. |
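As a rough illustration of that preprocessing (not the PR's actual code), here is a sketch of a one-off pass that generates face_edges, codes, and text embeddings before training. The method names `autoencoder.tokenize` and `transformer.embed_texts` are assumptions to verify against the meshgpt-pytorch version in use:

```python
# Sketch of an offline preprocessing pass: generate face_edges, codes and text
# embeddings once, store them with each entry, and never regenerate them inside
# the training loop. Method names below are assumptions, not confirmed API.
import torch

@torch.no_grad()
def preprocess(entries, autoencoder, transformer, derive_face_edges_fn):
    for entry in entries:  # entry: dict with "vertices", "faces", "texts"
        faces = entry["faces"]
        # face edges depend only on the faces, so compute them once here
        entry["face_edges"] = derive_face_edges_fn(faces)
        # codes: run the frozen autoencoder once instead of at every training step
        entry["codes"] = autoencoder.tokenize(            # assumed method name
            vertices=entry["vertices"], faces=faces, face_edges=entry["face_edges"]
        )
        # text embedding: embed the caption once instead of at every training step
        entry["text_embeds"] = transformer.embed_texts([entry["texts"]])  # assumed method name
    return entries
```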
yup, will look into this Sunday! |
I've created a dataset class which will hopefully help beginners; a rough sketch of the idea follows the feature list below.
Features:
Save & load compressed data in .npz format.
Generate text embeddings to reduce the VRAM bottleneck
Generate codes to reduce the VRAM bottleneck
Generate face_edges to reduce the VRAM bottleneck
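A minimal sketch of what such a class could look like (field names and structure are illustrative, not the PR's actual implementation), using numpy's compressed .npz format for save and load:

```python
# Illustrative sketch only: a dataset that holds precomputed arrays per entry
# and round-trips them through a single compressed .npz file.
import numpy as np
from torch.utils.data import Dataset

class MeshDatasetSketch(Dataset):
    def __init__(self, entries):
        self.entries = entries                    # list of dicts of numpy arrays

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, idx):
        return self.entries[idx]

    def save(self, path):
        # flatten to "index_field" keys so everything fits in one .npz archive
        flat = {
            f"{i}_{key}": np.asarray(value)
            for i, entry in enumerate(self.entries)
            for key, value in entry.items()
        }
        np.savez_compressed(path, **flat)

    @classmethod
    def load(cls, path):
        data = np.load(path, allow_pickle=True)
        grouped = {}
        for key in data.files:
            idx, field = key.split("_", 1)       # split off the entry index
            grouped.setdefault(int(idx), {})[field] = data[key]
        return cls([grouped[i] for i in sorted(grouped)])
```

Storing everything in one compressed archive keeps disk usage down, and because face_edges, codes, and text embeddings are already in the entries, the training loop only has to move precomputed tensors to the GPU.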