how to prepare the train data? #1

Open
Bagfish opened this issue May 2, 2024 · 10 comments

Bagfish commented May 2, 2024

Dear sir
Thank you for your implementation based on PyTorch! I want to train the model, but I can't understand how to prepare the training data. In the paper, the speech and face images are paired, but in the readme I only see vox1, vox2 and HQvox - which dataset is used to generate the face vectors?

Kacper-Pietkun (Owner) commented

Hello @Bagfish,
When it comes to training the VoiceEncoder, you need to prepare a directory with a dataset, as described here - datasets (under the S2fDataset entry). In short, you must create a separate directory for each person, and inside each directory there must be two additional directories - one for calculated spectrograms (audios directory) and one for calculated face embeddings (images directory). Such a directory can be used as a training set. If you want to prepare a validation or a test set, just follow the same steps.
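For illustration only, a hypothetical layout (all directory and file names below are examples, not taken from the repository) and a small helper that checks it could look like this:

from pathlib import Path

# Hypothetical S2fDataset-style layout (names are examples only):
#
#   dataset_root/
#       person_001/
#           audios/    <- precomputed spectrograms
#           images/    <- precomputed face embeddings from the FaceEncoder
#       person_002/
#           audios/
#           images/

def check_s2f_layout(root: str) -> None:
    """Assert that every person directory contains 'audios' and 'images' subdirectories."""
    for person_dir in Path(root).iterdir():
        if not person_dir.is_dir():
            continue
        assert (person_dir / "audios").is_dir(), f"missing audios/ in {person_dir}"
        assert (person_dir / "images").is_dir(), f"missing images/ in {person_dir}"

check_s2f_layout("dataset_root")  # hypothetical path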

Note: To calculate spectrograms from audio files, you can use the scripts located here: audio_spectrograms.py (for preprocessing like in the Speech2Face: Learning the Face Behind a Voice paper) and ast_audio_preprocess.py (if you want to use the AST voice encoder). On the other hand, the face embeddings, which must be located in the images directories, must be calculated using the FaceEncoder model - here is the script: image_face_embeddings.py
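As a rough, generic illustration of what spectrogram preprocessing can look like (the repository's audio_spectrograms.py and ast_audio_preprocess.py may use different transforms and parameters, so treat this only as a sketch):

import torchaudio

# Load a waveform (path is hypothetical) and compute a log-scaled mel spectrogram
waveform, sample_rate = torchaudio.load("person_001/raw_audio/clip_0001.wav")
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_fft=1024, n_mels=128)(waveform)
log_mel = torchaudio.transforms.AmplitudeToDB()(mel)
print(log_mel.shape)  # (channels, n_mels, time_frames)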

Bagfish commented May 3, 2024

Thank you for your reply, I will follow your guide! @Kacper-Pietkun

Bagfish commented May 4, 2024

@Kacper-Pietkun I'm very sorry to bother you again. Could I ask you for the VGG model converted to PyTorch? The reasons why I can't convert it myself are: 1. I can't find the model download link in https://github.com/serengil/deepface. 2. I haven't been able to install TensorFlow on my computer.

Kacper-Pietkun (Owner) commented

Here you will find PyTorch weights for the VGGFace_serengil model: https://drive.google.com/drive/u/2/folders/1DCqvpZYkd0chupA3mQeCVS7p69WAjnER

Bagfish commented May 7, 2024

I got it! Thank you very much!

Bagfish commented May 9, 2024

@Kacper-Pietkun When I train the VoiceEncoder, the loss becomes NaN. I really can't figure out what the problem is.
[attached screenshot: loss]

Kacper-Pietkun (Owner) commented

  1. Have you trained the FaceDecoder model beforehand? (When training the VoiceEncoder, the FaceDecoder model's weights should be frozen.)
  2. Which VoiceEncoder model are you training? I had similar problems with the ve_conv model. Try training the ast model instead.
  3. One approach that should help is playing with the values of the loss function's coefficients - coe_1, coe_2, coe_3 - as well as the learning_rate hyperparameter.

Bagfish commented May 10, 2024

@Kacper-Pietkun thank you for your reply
1. I have already trained the FaceDecoder model, but I didn't freeze its weights. How can I freeze the FaceDecoder weights?
2. I will try the ast model instead.
3. I will try other hyperparameters.
Thank you very much!

Bagfish commented May 10, 2024

During the first part, the whole model is frozen except the head, which is trained. During the second part, the whole model is unfrozen and fine-tuned.
What does this mean? In the first step, which args in train/train_ast.py should I set? And how do I fine-tune using train/train_ast.py - is just "python train/train_ast.py --fine-tune" OK?

Kacper-Pietkun (Owner) commented

@Kacper-Pietkun thank you for your reply 1. I have already trained the FaceDecoder model, but I didn't freeze its weights. How can I freeze the FaceDecoder weights? 2. I will try the ast model instead. 3. I will try other hyperparameters. Thank you very much!

Actually, I was asking whether you had trained the FaceDecoder model beforehand, because it is necessary to calculate the loss function. You don't need to do anything extra to "freeze" the FaceDecoder's weights, because the optimizer was created only to optimize the VoiceEncoder model's weights.
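To make that concrete, here is a minimal, self-contained sketch (with stand-in modules and a stand-in loss, not the repository's actual classes) showing how the FaceDecoder can sit in the loss computation while only the VoiceEncoder's parameters are given to the optimizer:

import torch
import torch.nn as nn

# Stand-in modules - the real VoiceEncoder / FaceDecoder come from the repository
voice_encoder = nn.Linear(512, 4096)   # dummy "spectrogram" -> 4096-d voice embedding
face_decoder = nn.Linear(4096, 4096)   # dummy decoder used only inside the loss

# The FaceDecoder stays fixed: its parameters are not passed to the optimizer
# (setting requires_grad=False as well makes the intent explicit)
for param in face_decoder.parameters():
    param.requires_grad = False
face_decoder.eval()

optimizer = torch.optim.Adam(voice_encoder.parameters(), lr=1e-4)  # only VoiceEncoder params

dummy_spectrograms = torch.randn(8, 512)
target_face_embeddings = torch.randn(8, 4096)

voice_embeddings = voice_encoder(dummy_spectrograms)
decoded = face_decoder(voice_embeddings)                        # frozen FaceDecoder in the loss path
loss = nn.functional.mse_loss(decoded, target_face_embeddings)  # stand-in for the real multi-term loss
loss.backward()
optimizer.step()                                                # updates only the VoiceEncoder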

During the first part, the whole model is frozen except the head, which is trained. During the second part, the whole model is unfrozen and fine-tuned. What does this mean? In the first step, which args in train/train_ast.py should I set? And how do I fine-tune using train/train_ast.py - is just "python train/train_ast.py --fine-tune" OK?

Okay, so basically it looks like this. In the training script, the AST VoiceEncoder model is downloaded from HuggingFace's transformers library, along with the pretrained weights. However, to adapt the model to the problem of generating voice embedding vectors, it needs a new "head", so that the last layer's output dimension is equal to 4096 (just like the face embedding vector size).

Here are the lines of code from the training script which are responsible for downloading the model and swapping its "head":

import torch
import torch.nn as nn
from transformers import AutoModelForAudioClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load the pretrained AST with a 4096-output classifier head and wrap that head in a ReLU
ast = AutoModelForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593", num_labels=4096, ignore_mismatched_sizes=True).to(device)
head = ast.classifier
new_head = nn.Sequential(
    head,
    nn.ReLU()
)
ast.classifier = new_head

So, I split the AST VoiceEncoder model training into two stages.

  1. The first stage is responsible only for training this new "head" of the model (the other layers are frozen). This is typically done during transfer learning, to avoid a situation where, due to a freshly initialized layer (the head), the gradient updates would be so large that the other, previously trained parameters would be altered too much and the model would forget what it had learned. (Remember that the other parameters, which are frozen during this step, were initialized with pretrained weights.)

Here are a few lines from the training script which are responsible for freezing all of the model's parameters except the head. Additionally, you can see that the head's parameters are initialized with a truncated normal distribution:

# freeze every layer except the new head - classifier.0.dense.weight and classifier.0.dense.bias
for name, param in ast.named_parameters():
    if name != "classifier.0.dense.weight" and name != "classifier.0.dense.bias":
        param.requires_grad = False
    else:
        nn.init.trunc_normal_(param)

  2. During the second stage, more of the model's parameters should be unfrozen, so that they can be optimized for the problem of generating voice embeddings. You can unfreeze the whole model, or only some parts of it. Recently I have added the --unfreeze-number parameter to the training script, with which you can control how many layers are unfrozen. (Actually, this parameter specifies from which layer the model should be unfrozen.)

Generally, to run the second stage, beyond all of the other necessary parameters like --train-dataset-path, --face-decoder-weights-path and so on, you need to pass these parameters to the script:

  • --fine-tune - a flag marking that the model's head has already been trained
  • --continue-training-path - specifies the path to the weights of the ast model (the one whose head was already trained)
  • --unfreeze-number - this one is optional, because by default when fine-tuning the whole model will be unfrozen. But as I said, you can use it as a hyperparameter. During my research I achieved the best results when I unfroze the model from the 165th layer (see the sketch below).
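For illustration, a minimal sketch (not the actual code from train/train_ast.py, which may walk the layers differently) of unfreezing everything from a given layer index onward could look like this:

# Hypothetical sketch: unfreeze all parameters starting from layer index `unfreeze_number`
# (the real --unfreeze-number handling in train/train_ast.py may differ)
unfreeze_number = 165  # treat this as a hyperparameter, as mentioned above
for index, (name, param) in enumerate(ast.named_parameters()):
    param.requires_grad = index >= unfreeze_number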
