Add Phoneme labels and timestamps - take two #1377

Open: wants to merge 13 commits into base: master

@madhephaestus madhephaestus commented Jun 1, 2023

The first PR seems to have died. rutujaubale made the original effort to add the feature, and Nathravorn fixed the build in their branch. I am now opening a new PR to get this feature merged.

This PR replaces #528 and closes #687 with a solution.

@Shallowmallow

Shallowmallow commented Jul 20, 2023

Really nice. But it doesn't seem to work when you use alternatives? It would be really cool if it did :)

@tobiasalanboyd

Hello! I am trying to make a version of test_microphone.py that recognizes phonemes rather than words or sentences. However, I am struggling to figure out what the Python equivalent would be to
vosk_recognizer_set_result_options(recognizer, "phones");
from test_phone_results.c
I thought perhaps that would be
SetResultOptions(rec, "phones")
but when I add this line I get the message that SetResultOptions is not defined.
Apologies if the answer to this is obvious, I am new to working with this type of code. Thank you in advance!

@madhephaestus
Author

It looks like the method would be SetResultOptions(self, options), so there is no need to pass in the recognizer instance: in the Python API it seems to be held as a private member of the class rather than taken as a parameter.
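
To illustrate the call shape this implies, here is a minimal sketch. It does not import Vosk: `StandInRecognizer` is a hypothetical stand-in for `KaldiRecognizer`, and whether the real binding exposes `SetResultOptions` depends on the build in use. It only shows how Python treats a method declared as `SetResultOptions(self, options)`.

```python
class StandInRecognizer:
    """Hypothetical stand-in for KaldiRecognizer; not the real Vosk class."""

    def __init__(self):
        self.result_options = None

    def SetResultOptions(self, options):
        # Assumption: the real binding would forward this to the C function
        # vosk_recognizer_set_result_options(handle, options).
        self.result_options = options


rec = StandInRecognizer()

# Method call: Python supplies `rec` as `self` automatically.
rec.SetResultOptions("phones")

# A bare call looks up a module-level name, which does not exist here.
try:
    SetResultOptions(rec, "phones")
    bare_call_error = None
except NameError as exc:
    bare_call_error = str(exc)
```

The bare call fails with a NameError regardless of which Vosk build is installed, which matches the "is not defined" message reported above.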

@tobiasalanboyd

tobiasalanboyd commented Oct 9, 2024

Thanks for getting back to me! I have tried pretty much every variation on the above that I can think of, and am not sure if the issue is due to me being new to Python or if there's something else happening here.
Each of the examples below was inserted directly after this line in my copy of test_microphone.py:
rec = KaldiRecognizer(model, args.samplerate)
Examples of what I have tried adding so far with no success:
SetResultOptions("phones")
SetResultOptions(rec, "phones")
SetResultOptions(rec._handle, "phones")
rec.SetResultOptions("phones")
rec._handle.SetResultOptions("phones")
rec.SetResultOptions(rec, "phones")
rec.SetResultOptions(rec._handle, "phones")
rec.SetResultOptions()
rec._handle.SetResultOptions()

If this is helpful to know, I am running the program in CMD with the following command:
C:\Users\myusername\vosk-api\python\example>py .\test_microphone_phon.py

EDIT: Before realizing that regular Vosk would not provide individual phonemes, I installed it via PyPI. Is it possible this is contributing to the difficulties?
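
One way to test that hypothesis is to check whether the recognizer object that is actually imported has the method at all. The sketch below assumes (this is not confirmed in the thread) that a released package may not include this unmerged PR's changes. Both recognizer classes are hypothetical stand-ins, and `enable_phone_results` is an illustrative helper name, not part of Vosk.

```python
class BranchBuildRecognizer:
    """Hypothetical stand-in for a recognizer built from this PR's branch."""

    def __init__(self):
        self.result_options = None

    def SetResultOptions(self, options):
        self.result_options = options


class ReleaseBuildRecognizer:
    """Hypothetical stand-in for a recognizer that lacks the method."""


def enable_phone_results(rec, options="phones"):
    """Call rec.SetResultOptions(options) if this build provides it."""
    setter = getattr(rec, "SetResultOptions", None)
    if setter is None:
        raise RuntimeError(
            "This recognizer has no SetResultOptions method; the installed "
            "package was probably not built from the PR branch."
        )
    setter(options)
    return rec


patched = enable_phone_results(BranchBuildRecognizer())

try:
    enable_phone_results(ReleaseBuildRecognizer())
    failure = None
except RuntimeError as exc:
    failure = str(exc)
```

Without such a check, calling a missing method as `rec.SetResultOptions("phones")` raises an AttributeError, which is a different error from the NameError produced by a bare `SetResultOptions(...)` call.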

@321Proteus

321Proteus commented Dec 3, 2024

Hello, I'm currently trying to build your version of Vosk, but I keep getting the same error as in #1082:

recognizer.cc: In member function ‘const char* Recognizer::PartialResult()’:
recognizer.cc:855:13: error: ‘WordAlignLatticePartial’ was not declared in this scope
  855 |             WordAlignLatticePartial(clat, *model_->trans_model_, *model_->winfo_, 0, &aligned_lat);
      |             ^~~~~~~~~~~~~~~~~~~~~~~

I'm using the AlphaCephei branch of Kaldi (with OpenFST 1.7.2, tried also 1.8.3 from Kaldi-ASR with the same result). Any idea what's going on?

@nshmyrev
Collaborator

nshmyrev commented Dec 4, 2024

> I'm using the AlphaCephei branch of Kaldi

WordAlignLatticePartial is there. You are probably using an old version. Please recheck.
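
A quick way to recheck is to search the local Kaldi checkout for the symbol. The following is a minimal sketch: `files_mentioning` is an illustrative helper, and the demo runs against a throwaway directory with made-up file names rather than a real Kaldi tree.

```python
import os
import tempfile


def files_mentioning(root, symbol, extensions=(".h", ".cc")):
    """Return sorted paths under root whose text contains symbol."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    if symbol in handle.read():
                        hits.append(path)
    return sorted(hits)


# Demo on a temporary tree; the file names and contents are made up.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "new_header.h"), "w", encoding="utf-8") as f:
        f.write("bool WordAlignLatticePartial();\n")
    with open(os.path.join(tmp, "old_header.h"), "w", encoding="utf-8") as f:
        f.write("bool WordAlignLattice();\n")
    found = [
        os.path.basename(p)
        for p in files_mentioning(tmp, "WordAlignLatticePartial")
    ]
```

Pointing `root` at the Kaldi source directory used for the build and getting an empty list back would suggest the checkout predates the function.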

@321Proteus

OK, I did it. I redownloaded the Dockerfile and ran it on my machine instead of building everything locally (normally I'd just build Kaldi and OpenFST using Docker, then copy them to local and build Vosk from there). Now everything compiles just fine!

Development

Successfully merging this pull request may close these issues.

Is it possible to get the timing of phonemes, instead of full words?
7 participants