
Proper dataset that reduces VRAM usage and provides higher performance. #37

Open · wants to merge 3 commits into base: main

Conversation

@MarcusLoppe (Contributor)

I've created a dataset class that will hopefully help beginners.

Features:

  • Save & load compressed data in the .npz format (see the sketch below).

  • Generate text embeddings ahead of time to reduce the VRAM bottleneck.

  • Generate codes ahead of time to reduce the VRAM bottleneck.

  • Generate face_edges ahead of time to reduce the VRAM bottleneck.

    • Since the autoencoder doesn't modify the dataset entry itself, it otherwise needs to regenerate this data each step.
    • Just preparing and generating the face edges gives a very good performance boost: deriving them for a large 3D model uses about 2-6 GB of VRAM and takes 100-400 milliseconds. If the dataset doesn't contain this information, that cost is paid on every training step.
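
For the save/load feature, here is a minimal sketch of one way this could work with np.savez_compressed. It is illustrative, not the PR's actual implementation; the helper names and the entry layout are assumptions:

```python
import numpy as np

def save_dataset(data, path="dataset.npz"):
    # Flatten the list of dict entries into "<index>_<key>" arrays so the
    # whole dataset fits in a single compressed .npz file.
    flat = {f"{i}_{k}": np.asarray(v) for i, entry in enumerate(data) for k, v in entry.items()}
    np.savez_compressed(path, **flat)

def load_dataset(path="dataset.npz"):
    # Rebuild the list of dict entries from the flattened archive.
    entries = {}
    with np.load(path, allow_pickle=True) as archive:
        for name in archive.files:
            idx, key = name.split("_", 1)
            entries.setdefault(int(idx), {})[key] = archive[name]
    return [entries[i] for i in sorted(entries)]
```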

@MarcusLoppe (Contributor, Author)

@fire Hey, I think you can use this information. I don't think you have implemented this.

@fire commented Dec 27, 2023

Thanks! Happy holidays. I am still sleeping on how to do chunking as an autocomplete, because I don't have 10x the GPU RAM.

@fire commented Dec 27, 2023

I think it needs the concept of mesh-to-mesh, and the idea that you can localize the input.

@MarcusLoppe (Contributor, Author)

> Thanks! Happy holidays. I am still sleeping on how to do chunking as an autocomplete, because I don't have 10x the GPU RAM.

Happy holidays 🎉
But VRAM usage should be a lot lower if you preprocess the face edges, at the very least. I see that you are calling derive_face_edges_from_faces in __getitem__ but not storing the results, so you need to derive the face edges each step, which might cost a couple of GB of VRAM.
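
A hypothetical sketch of the "derive once, store on the entry" approach (the method name and entry layout are assumptions; derive_face_edges_from_faces is the repo's existing helper):

```python
from meshgpt_pytorch import derive_face_edges_from_faces  # assumed import path

def generate_face_edges(self):
    # Derive the face edges once per dataset entry and store them, so
    # __getitem__ can return the cached tensor instead of re-deriving it
    # (and paying the VRAM spike) on every training step.
    for item in self.data:
        if 'face_edges' not in item:
            faces = item['faces'].unsqueeze(0)  # (1, nf, 3)
            item['face_edges'] = derive_face_edges_from_faces(faces).squeeze(0)
```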

Do you mean chunking this?
I think it's the T5 encoder that embeds these, so the VRAM usage shouldn't be too high:


```python
def embed_texts(self, transformer: MeshTransformer):
    # Embed each unique text only once, then map the embeddings back
    # onto the dataset entries.
    unique_texts = list(set(item['texts'] for item in self.data))

    text_embeddings = transformer.embed_texts(unique_texts)
    print(f"[MeshDataset] Generated {len(text_embeddings)} text_embeddings")
    text_embedding_dict = dict(zip(unique_texts, text_embeddings))

    for item in self.data:
        # Replace the raw text with its precomputed embedding.
        item['text_embeds'] = text_embedding_dict.get(item['texts'], None)
        del item['texts']
```
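
Hypothetical usage, assuming a constructed MeshTransformer:

```python
# Run once before training; afterwards the T5 encoder never has to run
# inside the training loop.
dataset.embed_texts(transformer)
```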

> I think it needs the concept of mesh-to-mesh, and the idea that you can localize the input.

Could you clarify? Are you talking about the augmentations/data cleaning?

@fire

This comment was marked as outdated.

@fire commented Dec 27, 2023

Here's a rewording of the problem.

Problem Context

The problem at hand involves processing a large triangle mesh with limited GPU RAM. This can be challenging as the size of the mesh may exceed the available memory, causing performance issues or even failure to process the mesh. The issue becomes more pronounced when dealing with 10x the input size.

Current Approach and its Limitations

Currently, I cache the derive_face_edges_from_faces function in __getitem__. While this approach works, it's not efficient because it derives the face edges on each step, which can be computationally expensive and time-consuming. Moreover, the savings from caching might not be sufficient for larger inputs.

I am clarifying that the T5 embedding is not causing a problem.

Alternative Approach

Given the limitations of the current approach, an alternative could be to divide the mesh into smaller chunks and process each chunk separately. This way, you can handle larger meshes without exceeding your GPU RAM capacity.

This approach allows for the processing of only a portion of the triangle mesh at a time, effectively managing the use of GPU RAM. It should provide a scalable solution for handling larger inputs. However, we don't have a mesh to mesh workflow yet.
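
A naive version of that chunked processing might look like the sketch below (the function name is assumed; note the caveat in the comments, which comes up again further down):

```python
import torch
from meshgpt_pytorch import derive_face_edges_from_faces  # assumed import path

def chunked_face_edges(faces: torch.Tensor, chunk_size: int = 1024) -> torch.Tensor:
    # Derive face edges chunk by chunk to bound peak VRAM. The entries are
    # indices into the face list, so each chunk's result is shifted back
    # into global indices. Caveat: adjacency between faces that land in
    # different chunks is missed by this naive split.
    all_edges = []
    for start in range(0, faces.shape[0], chunk_size):
        chunk = faces[start:start + chunk_size]
        edges = derive_face_edges_from_faces(chunk.unsqueeze(0)).squeeze(0)
        all_edges.append(edges + start)
    return torch.cat(all_edges, dim=0)
```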

@MarcusLoppe (Contributor, Author)

> The problem being solved is I have a 70,000 triangle mesh and I can't process all of it at once.
>
> We can process only like 10% of that. Like the 7k portion of the triangle mesh of the named subway car or the named feminine character.

Hmm, have you tested using the CPU only? Instead of keeping the faces on the GPU you could move them onto the CPU, since the computer might have more RAM than the GPU.
If the computer's RAM isn't enough it will fall back on virtual memory, which might be quite slow though.
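
In code, the offload could be as simple as this sketch (assuming faces is a tensor that currently lives on the GPU):

```python
# Derive the face edges on the CPU, where more RAM is available, then move
# the result to the GPU only when it is needed for training.
face_edges = derive_face_edges_from_faces(faces.cpu()).to('cuda')
```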

Does this issue only occur when you generate the face edges?
Or are you running out of memory when training the autoencoder too? If so, you might want to check out replacing the autoencoder with MeshDiscretizer, since it only discretizes and involves no machine-learning model.

@fire commented Dec 27, 2023

Hi Marcus,

You're correct in your understanding of the problem. Due to the size of the triangle mesh, we can only process about 10% of it at a time, such as the 7k portion of the subway car or the feminine character.

I have indeed tested using only the CPU for processing. While this approach works because I have close to 200GB of CPU RAM, it significantly slows down the transformer stage. As you mentioned, if the computer RAM isn't enough, it will use the virtual RAM which is quite slow.

The issue primarily occurs when training the mesh transformer. The autoencoder stage doesn't seem to require very large inputs, but rather a variety of inputs. Therefore, replacing the autoencoder with MeshDiscretizer might not be necessary in this case.

Thank you for your suggestions and insights. They are greatly appreciated as we continue to work on optimizing this process.

@MarcusLoppe (Contributor, Author)

> Hi Marcus,
>
> You're correct in your understanding of the problem. Due to the size of the triangle mesh, we can only process about 10% of it at a time, such as the 7k portion of the subway car or the feminine character.

I see your problem: derive_face_edges_from_faces is optimized for speed, since it doesn't loop through the edges but processes the whole dimension at once.
The current method can't really chunk the process, since it needs to check all faces for matches, and a connected face might be in another chunk.
It should be able to run with a lower amount of memory; I can try to give it a shot using dicts, but it will be slower without the GPU's parallel processing. Or maybe split the mesh into octrees, or traverse the mesh.

> I have indeed tested using only the CPU for processing. While this approach works because I have close to 200GB of CPU RAM, it significantly slows down the transformer stage. As you mentioned, if the computer RAM isn't enough, it will use the virtual RAM which is quite slow.

Hmm, well, since you can store the face edges on disk, this is something you'll only need to do once, and only once per 3D model, since the augmented versions still use the same faces.

> The issue primarily occurs when training the mesh transformer. The autoencoder stage doesn't seem to require very large inputs, but rather a variety of inputs. Therefore, replacing the autoencoder with MeshDiscretizer might not be necessary in this case.
>
> Thank you for your suggestions and insights. They are greatly appreciated as we continue to work on optimizing this process.

I'm not 100% sure what happens, but the transformer has a max token length; I'm not sure what happens when it exceeds 8192 tokens, i.e. 1365 triangles (8192 / 6 tokens per triangle).

Have you seen any difference when training with more than 1365 triangles vs. meshes with fewer than 1300 triangles?

@MarcusLoppe (Contributor, Author) commented Dec 27, 2023

> Here's a rewording of the problem.
>
> Problem Context
>
> The problem at hand involves processing a large triangle mesh with limited GPU RAM. This can be challenging as the size of the mesh may exceed the available memory, causing performance issues or even failure to process the mesh. The issue becomes more pronounced when dealing with 10x the input size.
>
> Current Approach and its Limitations
>
> Currently, I cache the derive_face_edges_from_faces function in __getitem__. While this approach works, it's not efficient because it derives the face edges on each step, which can be computationally expensive and time-consuming. Moreover, the savings from caching might not be sufficient for larger inputs.

Have you tried not using the cache function? I'm not familiar with lru_cache, so I'm not sure whether it copies or references the data, or how it deals with data that lives on another device (e.g. CPU/GPU).

Best practice would be to generate them beforehand: if you take a 3D model and augment it 100 times, the face edges won't change after augmentation, so you only need to store one copy in VRAM and reference it, instead of storing 100 copies of it in the cache.
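
A hypothetical sketch of that sharing (the model_id field is an assumption; any key identifying the source model would do):

```python
# Compute face edges once per source model and reuse the same tensor for
# every augmented copy: augmentation moves vertices but leaves the face
# connectivity, and hence the face edges, unchanged.
face_edges_per_model = {}
for item in dataset.data:
    key = item['model_id']  # assumed field identifying the source model
    if key not in face_edges_per_model:
        faces = item['faces'].unsqueeze(0)
        face_edges_per_model[key] = derive_face_edges_from_faces(faces).squeeze(0)
    item['face_edges'] = face_edges_per_model[key]
```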

Also, if you are using a batch size of more than 1, you are calling derive_face_edges_from_faces with not just one mesh; it will process the entire batch at once, e.g. 16 x 7000 x 3 = a lot of data.

@fire commented Dec 27, 2023

I am trying a different approach.

I added a commit that uses KDTree from the scipy.spatial module to improve the efficiency of nearest neighbor search in the MeshDataset class. The KDTree is used to extract a subset of faces based on their proximity to a randomly generated point within the bounding box of the mesh. This subset is then used to create new vertices and faces for augmentation. Additionally, the maximum number of faces allowed in a mesh has been set to 500.

I have now set it to 1365

Later today I can try implementing the way you suggested.

@MarcusLoppe (Contributor, Author)

> I am trying a different approach.
>
> I added a commit that uses KDTree from the scipy.spatial module to improve the efficiency of nearest neighbor search in the MeshDataset class. The KDTree is used to extract a subset of faces based on their proximity to a randomly generated point within the bounding box of the mesh. This subset is then used to create new vertices and faces for augmentation. Additionally, the maximum number of faces allowed in a mesh has been set to 500.
>
> I have now set it to 1365

Hmm, that seems like it might be unnecessary.

I did some tests with a 6,206-face model:
if I created the face edges using generate_face_edges(), this consumed 2.2 GB of VRAM. I then saved the dataset and restarted the session; after loading the dataset from disk, the VRAM usage was just 592 MB.
I ran torch.cuda.empty_cache() after generate_face_edges() to clear some garbage, but it was still at that usage.

> Later today I can try implementing the way you suggested.

Good. I think it's much more efficient to derive the face edges once and store one array per model, instead of creating and storing hundreds of copies of the same array due to augmentation.

@MarcusLoppe (Contributor, Author) commented Jan 5, 2024

Hi @lucidrains

I think it's time for a proper dataset class, since people are not training properly.
I currently see misconceptions about how face edges, codes and text embeddings are generated during training.

If you don't preprocess the dataset:
each time the data is generated, the model moves on to the next batch and the data is thrown away, since it's stored on a temporary object and is never stored or used again. This is because the dataloader doesn't expose the real data, only a copy of it.
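
A sketch of the pitfall (a hypothetical __getitem__; this is the pattern to avoid):

```python
def __getitem__(self, idx):
    item = self.data[idx]
    # Anything derived here lives only in the copy handed to the DataLoader;
    # self.data is never updated, so this expensive derivation runs again on
    # every single training step.
    face_edges = derive_face_edges_from_faces(item['faces'].unsqueeze(0)).squeeze(0)
    return item['vertices'], item['faces'], face_edges
```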

This forces the model to generate the face_edges, codes and text embeddings on every step, which increases VRAM usage and the time each step takes. This isn't a small amount but a big one; generating face_edges, for example, can require 4-96 GB of VRAM.

I've had multiple people not quite understanding this, and they don't understand why their VRAM usage is so high.

Since the VRAM usage is linear, at a batch size of 64 it uses about 12 MB per face, so with a face count of 6,000 it will use about 75 GB (12 MB × 6,000 ≈ 72 GB) to generate the face_edges and tokenize the data, all of which can be pre-generated without any significant VRAM usage.

[attached image]

@lucidrains (Owner)

yup, will look into this Sunday!
