When I try to run diarize.py or the Jupyter Notebook version, I encounter a dimension mismatch error during the MSDD step.
100% [................................................................................] 7646 / 7646[NeMo I 2024-08-17 17:31:13 msdd_models:1092] Loading pretrained diar_msdd_telephonic model from NGC [NeMo I 2024-08-17 17:31:13 cloud:58] Found existing object C:\Users\ckic\.cache\torch\NeMo\NeMo_1.20.0\diar_msdd_telephonic\3c3697a0a46f945574fa407149975a13\diar_msdd_telephonic.nemo. [NeMo I 2024-08-17 17:31:13 cloud:64] Re-using file from: C:\Users\ckic\.cache\torch\NeMo\NeMo_1.20.0\diar_msdd_telephonic\3c3697a0a46f945574fa407149975a13\diar_msdd_telephonic.nemo [NeMo I 2024-08-17 17:31:13 common:913] Instantiating model from pre-trained checkpoint [NeMo W 2024-08-17 17:31:14 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader. Train config : manifest_filepath: null emb_dir: null sample_rate: 16000 num_spks: 2 soft_label_thres: 0.5 labels: null batch_size: 15 emb_batch_size: 0 shuffle: true [NeMo W 2024-08-17 17:31:14 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). Validation config : manifest_filepath: null emb_dir: null sample_rate: 16000 num_spks: 2 soft_label_thres: 0.5 labels: null batch_size: 15 emb_batch_size: 0 shuffle: false [NeMo W 2024-08-17 17:31:14 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s). 
Test config : manifest_filepath: null emb_dir: null sample_rate: 16000 num_spks: 2 soft_label_thres: 0.5 labels: null batch_size: 15 emb_batch_size: 0 shuffle: false seq_eval_mode: false [NeMo I 2024-08-17 17:31:14 features:289] PADDING: 16 [NeMo I 2024-08-17 17:31:14 features:289] PADDING: 16 [NeMo I 2024-08-17 17:31:15 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from C:\Users\ckic\.cache\torch\NeMo\NeMo_1.20.0\diar_msdd_telephonic\3c3697a0a46f945574fa407149975a13\diar_msdd_telephonic.nemo. [NeMo I 2024-08-17 17:31:15 features:289] PADDING: 16 [NeMo I 2024-08-17 17:31:16 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC [NeMo I 2024-08-17 17:31:16 cloud:58] Found existing object C:\Users\ckic\.cache\torch\NeMo\NeMo_1.20.0\vad_multilingual_marblenet\670f425c7f186060b7a7268ba6dfacb2\vad_multilingual_marblenet.nemo. [NeMo I 2024-08-17 17:31:16 cloud:64] Re-using file from: C:\Users\ckic\.cache\torch\NeMo\NeMo_1.20.0\vad_multilingual_marblenet\670f425c7f186060b7a7268ba6dfacb2\vad_multilingual_marblenet.nemo [NeMo I 2024-08-17 17:31:16 common:913] Instantiating model from pre-trained checkpoint [NeMo W 2024-08-17 17:31:16 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader. 
Train config : manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json sample_rate: 16000 labels: - background - speech batch_size: 256 shuffle: true is_tarred: false tarred_audio_filepaths: null tarred_shard_strategy: scatter augmentor: shift: prob: 0.5 min_shift_ms: -10.0 max_shift_ms: 10.0 white_noise: prob: 0.5 min_level: -90 max_level: -46 norm: true noise: prob: 0.5 manifest_path: /manifests/noise_0_1_musan_fs.json min_snr_db: 0 max_snr_db: 30 max_gain_db: 300.0 norm: true gain: prob: 0.5 min_gain_dbfs: -10.0 max_gain_dbfs: 10.0 norm: true num_workers: 16 pin_memory: true [NeMo W 2024-08-17 17:31:16 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
Validation config : manifest_filepath: /manifests/ami_dev_0.63.json,/manifests/freesound_background_dev.json,/manifests/freesound_laughter_dev.json,/manifests/ch120_moved_0.63.json,/manifests/fisher_2005_500_speech_sampled.json,/manifests/google_dev_manifest.json,/manifests/musan_music_dev.json,/manifests/mandarin_dev.json,/manifests/german_dev.json,/manifests/spanish_dev.json,/manifests/french_dev.json,/manifests/russian_dev.json sample_rate: 16000 labels: - background - speech batch_size: 256 shuffle: false val_loss_idx: 0 num_workers: 16 pin_memory: true [NeMo W 2024-08-17 17:31:16 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s). Test config : manifest_filepath: null sample_rate: 16000 labels: - background - speech batch_size: 128 shuffle: false test_loss_idx: 0 [NeMo I 2024-08-17 17:31:16 features:289] PADDING: 16 [NeMo I 2024-08-17 17:31:16 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from C:\Users\ckic\.cache\torch\NeMo\NeMo_1.20.0\vad_multilingual_marblenet\670f425c7f186060b7a7268ba6dfacb2\vad_multilingual_marblenet.nemo. [NeMo I 2024-08-17 17:31:16 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1] [NeMo I 2024-08-17 17:31:16 msdd_models:865] Clustering Parameters: { "oracle_num_speakers": false, "max_num_speakers": 8, "enhanced_count_thres": 80, "max_rp_threshold": 0.25, "sparse_search_volume": 30, "maj_vote_spk_count": false, "chunk_cluster_count": 50, "embeddings_per_chunk": 10000 } [NeMo W 2024-08-17 17:31:16 clustering_diarizer:411] Deleting previous clustering diarizer outputs. 
[NeMo I 2024-08-17 17:31:16 speaker_utils:93] Number of files to diarize: 1 [NeMo I 2024-08-17 17:31:16 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue splitting manifest: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.50s/it] [NeMo I 2024-08-17 17:31:18 vad_utils:107] The prepared manifest file exists. Overwriting! [NeMo I 2024-08-17 17:31:18 classification_models:272] Perform streaming frame-level VAD [NeMo I 2024-08-17 17:31:18 collections:301] Filtered duration for loading collection is 0.00 hours. [NeMo I 2024-08-17 17:31:18 collections:302] Dataset loaded with 212 items, total duration of 2.95 hours. [NeMo I 2024-08-17 17:31:18 collections:304] # 212 files loaded accounting to # 1 labels vad: 100%|███████████████████████████████████████████████████████████████████████████| 212/212 [01:02<00:00, 3.41it/s] [NeMo I 2024-08-17 17:32:21 clustering_diarizer:250] Generating predictions with overlapping input segments [NeMo I 2024-08-17 17:34:04 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format. creating speech segments: 100%|██████████████████████████████████████████████████████████| 1/1 [00:08<00:00, 8.84s/it] [NeMo I 2024-08-17 17:34:14 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, C:\Users\ckic\Desktop\whisper-diarization-main\temp_outputs\speaker_outputs\subsegments_scale0.json [NeMo I 2024-08-17 17:34:14 clustering_diarizer:343] Extracting embeddings for Diarization [NeMo I 2024-08-17 17:34:14 collections:301] Filtered duration for loading collection is 0.00 hours. [NeMo I 2024-08-17 17:34:14 collections:302] Dataset loaded with 11136 items, total duration of 4.28 hours. 
[NeMo I 2024-08-17 17:34:14 collections:304] # 11136 files loaded accounting to # 1 labels [1/5] extract embeddings: 100%|██████████████████████████████████████████████████████| 174/174 [00:19<00:00, 8.79it/s] [NeMo I 2024-08-17 17:34:38 clustering_diarizer:389] Saved embedding files to C:\Users\ckic\Desktop\whisper-diarization-main\temp_outputs\speaker_outputs\embeddings [NeMo I 2024-08-17 17:34:38 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, C:\Users\ckic\Desktop\whisper-diarization-main\temp_outputs\speaker_outputs\subsegments_scale1.json [NeMo I 2024-08-17 17:34:38 clustering_diarizer:343] Extracting embeddings for Diarization [NeMo I 2024-08-17 17:34:38 collections:301] Filtered duration for loading collection is 0.00 hours. [NeMo I 2024-08-17 17:34:38 collections:302] Dataset loaded with 13484 items, total duration of 4.39 hours. [NeMo I 2024-08-17 17:34:38 collections:304] # 13484 files loaded accounting to # 1 labels [2/5] extract embeddings: 100%|██████████████████████████████████████████████████████| 211/211 [00:20<00:00, 10.54it/s] [NeMo I 2024-08-17 17:35:04 clustering_diarizer:389] Saved embedding files to C:\Users\ckic\Desktop\whisper-diarization-main\temp_outputs\speaker_outputs\embeddings [NeMo I 2024-08-17 17:35:04 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, C:\Users\ckic\Desktop\whisper-diarization-main\temp_outputs\speaker_outputs\subsegments_scale2.json [NeMo I 2024-08-17 17:35:05 clustering_diarizer:343] Extracting embeddings for Diarization [NeMo I 2024-08-17 17:35:05 collections:301] Filtered duration for loading collection is 0.00 hours. [NeMo I 2024-08-17 17:35:05 collections:302] Dataset loaded with 17027 items, total duration of 4.51 hours. 
[NeMo I 2024-08-17 17:35:05 collections:304] # 17027 files loaded accounting to # 1 labels [3/5] extract embeddings: 100%|██████████████████████████████████████████████████████| 267/267 [00:24<00:00, 10.77it/s] [NeMo I 2024-08-17 17:35:39 clustering_diarizer:389] Saved embedding files to C:\Users\ckic\Desktop\whisper-diarization-main\temp_outputs\speaker_outputs\embeddings [NeMo I 2024-08-17 17:35:39 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, C:\Users\ckic\Desktop\whisper-diarization-main\temp_outputs\speaker_outputs\subsegments_scale3.json [NeMo I 2024-08-17 17:35:40 clustering_diarizer:343] Extracting embeddings for Diarization [NeMo I 2024-08-17 17:35:40 collections:301] Filtered duration for loading collection is 0.00 hours. [NeMo I 2024-08-17 17:35:40 collections:302] Dataset loaded with 23024 items, total duration of 4.64 hours. [NeMo I 2024-08-17 17:35:40 collections:304] # 23024 files loaded accounting to # 1 labels [4/5] extract embeddings: 100%|██████████████████████████████████████████████████████| 360/360 [00:31<00:00, 11.53it/s] [NeMo I 2024-08-17 17:36:29 clustering_diarizer:389] Saved embedding files to C:\Users\ckic\Desktop\whisper-diarization-main\temp_outputs\speaker_outputs\embeddings [NeMo I 2024-08-17 17:36:29 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, C:\Users\ckic\Desktop\whisper-diarization-main\temp_outputs\speaker_outputs\subsegments_scale4.json [NeMo I 2024-08-17 17:36:29 clustering_diarizer:343] Extracting embeddings for Diarization [NeMo I 2024-08-17 17:36:30 collections:301] Filtered duration for loading collection is 0.00 hours. [NeMo I 2024-08-17 17:36:30 collections:302] Dataset loaded with 35138 items, total duration of 4.78 hours. 
[NeMo I 2024-08-17 17:36:30 collections:304] # 35138 files loaded accounting to # 1 labels [5/5] extract embeddings: 100%|██████████████████████████████████████████████████████| 550/550 [00:40<00:00, 13.63it/s] [NeMo I 2024-08-17 17:37:49 clustering_diarizer:389] Saved embedding files to C:\Users\ckic\Desktop\whisper-diarization-main\temp_outputs\speaker_outputs\embeddings clustering: 0%| | 0/1 [01:26<?, ?it/s] --------------------------------------------------------------------------- RuntimeError Traceback (most recent call last) Cell In[11], line 3 1 # Initialize NeMo MSDD diarization model 2 msdd_model = NeuralDiarizer(cfg=create_config(temp_path)).to("cuda") ----> 3 msdd_model.diarize() 5 del msdd_model 6 torch.cuda.empty_cache() File ~\anaconda3\envs\whisper-diarization\lib\site-packages\torch\utils\_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs) 113 @functools.wraps(func) 114 def decorate_context(*args, **kwargs): 115 with ctx_factory(): --> 116 return func(*args, **kwargs) File ~\anaconda3\envs\whisper-diarization\lib\site-packages\nemo\collections\asr\models\msdd_models.py:1180, in NeuralDiarizer.diarize(self) 1173 @torch.no_grad() 1174 def diarize(self) -> Optional[List[Optional[List[Tuple[DiarizationErrorRate, Dict]]]]]: 1175 """ 1176 Launch diarization pipeline which starts from VAD (or a oracle VAD stamp generation), initialization clustering and multiscale diarization decoder (MSDD). 1177 Note that the result of MSDD can include multiple speakers at the same time. Therefore, RTTM output of MSDD needs to be based on `make_rttm_with_overlap()` 1178 function that can generate overlapping timestamps. `self.run_overlap_aware_eval()` function performs DER evaluation. 
1179 """ -> 1180 self.clustering_embedding.prepare_cluster_embs_infer() 1181 self.msdd_model.pairwise_infer = True 1182 self.get_emb_clus_infer(self.clustering_embedding) File ~\anaconda3\envs\whisper-diarization\lib\site-packages\nemo\collections\asr\models\msdd_models.py:699, in ClusterEmbedding.prepare_cluster_embs_infer(self) 695 """ 696 Launch clustering diarizer to prepare embedding vectors and clustering results. 697 """ 698 self.max_num_speakers = self.cfg_diar_infer.diarizer.clustering.parameters.max_num_speakers --> 699 self.emb_sess_test_dict, self.emb_seq_test, self.clus_test_label_dict, _ = self.run_clustering_diarizer( 700 self._cfg_msdd.test_ds.manifest_filepath, self._cfg_msdd.test_ds.emb_dir 701 ) File ~\anaconda3\envs\whisper-diarization\lib\site-packages\nemo\collections\asr\models\msdd_models.py:866, in ClusterEmbedding.run_clustering_diarizer(self, manifest_filepath, emb_dir) 864 logging.info(f"Multiscale Weights: {self.clus_diar_model.multiscale_args_dict['multiscale_weights']}") 865 logging.info(f"Clustering Parameters: {clustering_params_str}") --> 866 scores = self.clus_diar_model.diarize(batch_size=self.cfg_diar_infer.batch_size) 868 # If RTTM (ground-truth diarization annotation) files do not exist, scores is None. 
869 if scores is not None: File ~\anaconda3\envs\whisper-diarization\lib\site-packages\nemo\collections\asr\models\clustering_diarizer.py:456, in ClusteringDiarizer.diarize(self, paths2audio_files, batch_size) 451 embs_and_timestamps = get_embs_and_timestamps( 452 self.multiscale_embeddings_and_timestamps, self.multiscale_args_dict 453 ) 455 # Clustering --> 456 all_reference, all_hypothesis = perform_clustering( 457 embs_and_timestamps=embs_and_timestamps, 458 AUDIO_RTTM_MAP=self.AUDIO_RTTM_MAP, 459 out_rttm_dir=out_rttm_dir, 460 clustering_params=self._cluster_params, 461 device=self._speaker_model.device, 462 verbose=self.verbose, 463 ) 464 logging.info("Outputs are saved in {} directory".format(os.path.abspath(self._diarizer_params.out_dir))) 466 # Scoring File ~\anaconda3\envs\whisper-diarization\lib\site-packages\nemo\collections\asr\parts\utils\speaker_utils.py:486, in perform_clustering(embs_and_timestamps, AUDIO_RTTM_MAP, out_rttm_dir, clustering_params, device, verbose) 482 num_speakers = -1 484 base_scale_idx = uniq_embs_and_timestamps['multiscale_segment_counts'].shape[0] - 1 --> 486 cluster_labels = speaker_clustering.forward_infer( 487 embeddings_in_scales=uniq_embs_and_timestamps['embeddings'], 488 timestamps_in_scales=uniq_embs_and_timestamps['timestamps'], 489 multiscale_segment_counts=uniq_embs_and_timestamps['multiscale_segment_counts'], 490 multiscale_weights=uniq_embs_and_timestamps['multiscale_weights'], 491 oracle_num_speakers=int(num_speakers), 492 max_num_speakers=int(clustering_params.max_num_speakers), 493 max_rp_threshold=float(clustering_params.max_rp_threshold), 494 sparse_search_volume=int(clustering_params.sparse_search_volume), 495 ) 497 del uniq_embs_and_timestamps 498 if cuda: File ~\anaconda3\envs\whisper-diarization\lib\site-packages\nemo\collections\asr\parts\utils\offline_clustering.py:1288, in SpeakerClustering.forward_infer(self, embeddings_in_scales, timestamps_in_scales, multiscale_segment_counts, multiscale_weights, 
oracle_num_speakers, max_rp_threshold, max_num_speakers, enhanced_count_thres, sparse_search_volume, fixed_thres, kmeans_random_trials)
1285 if oracle_num_speakers > 0:
1286 max_num_speakers = oracle_num_speakers
-> 1288 mat = getMultiScaleCosAffinityMatrix(
1289 multiscale_weights, self.embeddings_in_scales, self.timestamps_in_scales, self.device
1290 )
1292 nmesc = NMESC(
1293 mat,
1294 max_num_speakers=max_num_speakers,
(...)
1303 device=self.device,
1304 )
1306 # If there are less than `min_samples_for_nmesc` segments, est_num_of_spk is 1.
File ~\anaconda3\envs\whisper-diarization\lib\site-packages\nemo\collections\asr\parts\utils\offline_clustering.py:529, in getMultiScaleCosAffinityMatrix(multiscale_weights, embeddings_in_scales, timestamps_in_scales, device)
527 repeated_tensor_0 = torch.repeat_interleave(score_mat_torch, repeats=repeat_list, dim=0).to(device)
528 repeated_tensor_1 = torch.repeat_interleave(repeated_tensor_0, repeats=repeat_list, dim=1).to(device)
--> 529 fused_sim_d += multiscale_weights[scale_idx] * repeated_tensor_1
530 return fused_sim_d
RuntimeError: The size of tensor a (35138) must match the size of tensor b (31801) at non-singleton dimension 1
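For anyone reading the traceback: the failing line adds an expanded per-scale similarity matrix to `fused_sim_d`. Since `torch.repeat_interleave` produces `sum(repeat_list)` rows, the error suggests that for one scale the repeat counts add up to 31801 while the base scale (scale4 in the log) has 35138 segments. The following is only a toy sketch of that invariant in plain Python, not NeMo's code; `repeat_interleave` and `check_scale` here are hypothetical helpers written for illustration, and the numbers are made up.

```python
from itertools import chain


def repeat_interleave(rows, repeats):
    """Pure-Python stand-in for torch.repeat_interleave along dim 0 (illustration only)."""
    if len(rows) != len(repeats):
        raise ValueError("need exactly one repeat count per row")
    return list(chain.from_iterable([row] * n for row, n in zip(rows, repeats)))


def check_scale(repeats, base_count):
    """Hypothetical check: does the expanded size match the base-scale segment count?"""
    expanded = sum(repeats)
    return expanded == base_count, expanded


# Toy data: the base (finest) scale has 6 segments, a coarser scale has 3.
coarse_rows = ["a", "b", "c"]
good_repeats = [2, 2, 2]  # sums to 6, so the expanded matrix lines up with the base scale
bad_repeats = [2, 2, 1]   # sums to 5, so one base-scale segment has no coarse partner

print(repeat_interleave(coarse_rows, good_repeats))
print(check_scale(good_repeats, 6))
print(check_scale(bad_repeats, 6))
```

In the failing run the equivalent of `check_scale` would report 31801 against a base count of 35138. Why the counts diverge for this particular audio file is not something the log shows, so that part remains open.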
Please upload an audio file to reproduce the problem
I tried this audio file https://content.blubrry.com/takeituneasy/lex_ai_elon_musk_and_neuralink_team.mp3