alphacep · vadimdddd · Nov 9, 2021 · Nov 10, 2021 · Nov 10, 2021 · Nov 10, 2021
diff --git a/python/setup.py b/python/setup.py
@@ -54,7 +54,7 @@ def get_tag(self):
     packages=setuptools.find_packages(),
     package_data = {'vosk': ['*.so', '*.dll', '*.dyld']},
     entry_points = {
-        'console_scripts': ['vosk-transcriber=vosk.transcriber.cli:main'],
+        'console_scripts': ['vosk-aligner=vosk.aligner.vosk_align:main', 'vosk-transcriber=vosk.transcriber.cli:main'],
     },
     include_package_data=True,
     classifiers=[

diff --git a/python/vosk/aligner/__init__.py b/python/vosk/aligner/__init__.py
diff --git a/python/vosk/aligner/examples/cats.txt b/python/vosk/aligner/examples/cats.txt
@@ -0,0 +1 @@
+There was in this singular caravan little boy with no father or mother, but only a tiny kitten to cherish. The plague had not to him, yet had left him furry thing to mitigate his sorrow; and when one, one can find great relief in the lively antics of. So the boy whom the dark people called than he wept as he sat playing with his steps of an oddly painted wagon.
diff --git a/python/vosk/aligner/examples/cats.wav b/python/vosk/aligner/examples/cats.wav
diff --git a/python/vosk/aligner/examples/dagon.txt b/python/vosk/aligner/examples/dagon.txt
@@ -0,0 +1 @@
+I am writing this under an supernatural mental strain, since by tonight I shall so no more. Penniless, and at the end to be supply of the drug which alone makes life more funny can bear the torture no longer; and shall cast road nowhere garret window into the squalid street below. Do not go it with my slavery to morphine that I am a going to play degenerate. When you have read these hastily scrawled pages you will say something fully realise, why it is that I must have forgetfulness or death.
diff --git a/python/vosk/aligner/examples/dagon.wav b/python/vosk/aligner/examples/dagon.wav
diff --git a/python/vosk/aligner/examples/glorious.txt b/python/vosk/aligner/examples/glorious.txt
@@ -0,0 +1 @@
+The drop is short ends with terrible force again my body moves within the fluids that protected. I recall similar drop from my other life when I was a warrior of flash and blood then the blow of landing jar every bone in my body. Now I am from the most. Numbed to it. I am distant from every sensation and move as if in a dream. Only the pain is constant curled around me in my tomb intimately embracing my shattered body. The doors blow upwards pale light falls across invictus's metal hall. Ahead of me is an ugly orc fortress an asteroid landed directly on the surface of the world. The land here is dry but not the driest. Sub savannah. Low thorny trees and gray grass old parched. A lush landscape by standard. All is caked with ash. The season of fire has recently drawn to it close the weather is calming not that you guess it. The season of Shadows has began. It is my task to aid the rocks colizeum. Worthy task. Battle rages already I stride into it with great joy in my heart praise be. Praise be. Drop pods fall from the sky all around me igniting the scrubby vegetation with that breaking jets. I am one of the first the spearhead of the ash waste crusade second group, praise be. Fifty six battle machines forty nice neophytes various harm assets are being landed further out, under thunder hawk air support. All this and other information scrolls along the edges of my sensorium. Bright flashes and war lighting show through the ash trained sky the void crusade embattled in orbit as above so below.
diff --git a/python/vosk/aligner/examples/glorious.wav b/python/vosk/aligner/examples/glorious.wav
diff --git a/python/vosk/aligner/examples/polar.txt b/python/vosk/aligner/examples/polar.txt
@@ -0,0 +1,2 @@
+Upon my memory was gravens the vision of the city, and within frem had arisen another and vaguer recollection, of whose nature I was not then certain. Thereafters, uftz cloudy nights when I astfer sleep, I saw the city often booled under that bluue aspiredz moon, and sometimes sunders the hot bitzf rays of a sun which did not set, but which spolus low hepfe the horizon. And on the clear nights the Pole Star leered as never before.
+
diff --git a/python/vosk/aligner/examples/polar.wav b/python/vosk/aligner/examples/polar.wav
diff --git a/python/vosk/aligner/scripts/__init__.py b/python/vosk/aligner/scripts/__init__.py
@@ -0,0 +1,2 @@
+from .forced_aligner import ForcedAligner
+from .transcription import Transcription
diff --git a/python/vosk/aligner/scripts/diff_align.py b/python/vosk/aligner/scripts/diff_align.py
@@ -0,0 +1,96 @@
+import difflib
+import numpy
+import sys
+
+from . import transcription
+# TODO(maxhawkins): try using the (apparently-superior) time-mediated dynamic
+# programming algorithm used in sclite's alignment process:
+# http://www1.icsi.berkeley.edu/Speech/docs/sctk-1.2/sclite.htm#time-mediated
+def align(alignment, ms):
+    '''Use the diff algorithm to align the raw tokens recognized by Kaldi
+    to the words in the transcript (tokenized by MetaSentence as ms).
+
+    The output combines information about the timing and alignment of
+    correctly-aligned words as well as words that Kaldi failed to recognize
+    and extra words not found in the original transcript.
+    '''
+    conf = [X['conf'] for X in alignment]
+    start = [X['start'] for X in alignment]
+    end = [X['end'] for X in alignment]
+    duration = list(numpy.around([end[X]-start[X] for X in range(len(end))], 2))
+    hypothesis = [X['word'] for X in alignment]
+    reference = ms.get_kaldi_sequence()
+    display_seq = ms.get_display_sequence()
+    txt_offsets = ms.get_text_offsets()
+    out = []
+
+    for op, a, b in word_diff(hypothesis, reference):
+        try:
+            display_word = display_seq[b] # index
+        except IndexError:
+            print('Please compare your txt and wav files: the txt file probably contains more words than the wav file.')
+            sys.exit(1)
+        start_offset, end_offset = txt_offsets[b]
+        if op == 'equal':
+            hyp_word = hypothesis[a]
+            hyp_token = alignment[a]
+            out.append(transcription.Word(
+                case=transcription.Word.SUCCESS,
+                startOffset=start_offset,
+                endOffset=end_offset,
+                word=display_word,
+                alignedWord=hyp_word,
+                realign=False,
+                conf=conf[a],
+                start=start[a],
+                end=end[a],
+                duration=duration[a]))
+        elif op == 'replace': # insert/delete ?
+            if reference[b] == '<unk>':
+                out.append(transcription.Word(
+                    case=transcription.Word.NOT_FOUND_IN_TRANSCRIPT,
+                    startOffset=start_offset,
+                    endOffset=end_offset,
+                    word=display_word,
+                    realign=False))
+            else:
+                out.append(transcription.Word(
+                    case=transcription.Word.NOT_FOUND_IN_AUDIO,
+                    startOffset=start_offset,
+                    endOffset=end_offset,
+                    word=display_word,
+                    realign=True))
+    return out
+
+def word_diff(a, b):
+    '''Like difflib.SequenceMatcher but it only compares one word
+    at a time. Returns an iterator whose elements are like
+    (operation, index in a, index in b)
+    '''
+    matcher = difflib.SequenceMatcher(a=a, b=b)
+    for op, a_idx, _, b_idx, _ in by_word(matcher.get_opcodes()):
+        yield (op, a_idx, b_idx)
+
+def by_word(opcodes):
+    '''Take difflib.SequenceMatcher.get_opcodes() output and
+    return an equivalent opcode sequence that only modifies
+    one word at a time
+    '''
+    for op, s1, e1, s2, e2 in opcodes:
+        if op == 'delete':
+            for i in range(s1, e1):
+                yield (op, i, i+1, s2, s2)
+        elif op == 'insert':
+            for i in range(s2, e2):
+                yield (op, s1, s1, i, i+1)
+        else:
+            len1 = e1-s1
+            len2 = e2-s2
+            for i1, i2 in zip(range(s1, e1), range(s2, e2)):
+                yield (op, i1, i1 + 1, i2, i2 + 1)
+            if len1 > len2:
+                for i in range(s1 + len2, e1):
+                    yield ('delete', i, i+1, e2, e2)
+            if len2 > len1:
+                for i in range(s2 + len1, e2):
+                    yield ('insert', s1, s1, i, i+1)
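
Review note on diff_align.py: the per-word opcode expansion above can be tried in isolation. The sketch below copies `by_word` and `word_diff` from the patch unchanged and runs them on two made-up word lists (the lists are illustrative assumptions, not data from the PR). It may also help when checking the `# insert/delete ?` question in `align`, which as written only acts on `equal` and `replace` operations.

```python
import difflib


def by_word(opcodes):
    # Expand difflib opcodes so that each yielded opcode touches one word.
    for op, s1, e1, s2, e2 in opcodes:
        if op == 'delete':
            for i in range(s1, e1):
                yield (op, i, i + 1, s2, s2)
        elif op == 'insert':
            for i in range(s2, e2):
                yield (op, s1, s1, i, i + 1)
        else:
            len1 = e1 - s1
            len2 = e2 - s2
            for i1, i2 in zip(range(s1, e1), range(s2, e2)):
                yield (op, i1, i1 + 1, i2, i2 + 1)
            # Leftover words on the longer side become deletes or inserts.
            if len1 > len2:
                for i in range(s1 + len2, e1):
                    yield ('delete', i, i + 1, e2, e2)
            if len2 > len1:
                for i in range(s2 + len1, e2):
                    yield ('insert', s1, s1, i, i + 1)


def word_diff(a, b):
    # Yield (operation, index in a, index in b), one word at a time.
    matcher = difflib.SequenceMatcher(a=a, b=b)
    for op, a_idx, _, b_idx, _ in by_word(matcher.get_opcodes()):
        yield (op, a_idx, b_idx)


# Hypothetical recognizer output vs. transcript tokens.
hypothesis = ["the", "cat", "sat"]
reference = ["the", "black", "cat", "sat"]
ops = list(word_diff(hypothesis, reference))
for op in ops:
    print(op)
```

Here the transcript word "black" has no counterpart in the hypothesis, so it shows up as an `insert` operation rather than a `replace`.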
diff --git a/python/vosk/aligner/scripts/forced_aligner.py b/python/vosk/aligner/scripts/forced_aligner.py
@@ -0,0 +1,36 @@
+from .diff_align import align
+from .text_processor import text_processor
+from .metasentence import MetaSentence as metasentence
+from .multipass import realign
+from .transcriber import Transcriber as transcriber
+from .transcription import Transcription
+
+class ForcedAligner():
+    '''Top-level class of the aligner. It takes the transcript and the
+    model as input, runs the alignment pass and, if some words remain
+    unaligned, the realignment pass, and returns the result. Each word
+    in the output sequence carries its full information (status,
+    timings, etc.).
+    '''
+    def __init__(self, transcript, model):
+        self.model = model
+        self.transcript = transcript
+        self.ms = metasentence(self.transcript, self.model)
+        self.text = text_processor(self.transcript, self.model)
+
+    def get_number_unsuccessful_words(self, align_words):
+        NFIA = len([X for X in align_words if (X.not_found_in_audio())])
+        NFIT = len([X for X in align_words if (X.not_found_in_transcript())])
+        return NFIA + NFIT
+
+    def transcribe(self, wavfile, progress_cb=None, logging=None):
+        words = transcriber.transcribe(self.text, wavfile)
+        align_words = align(words, self.ms) # align
+        unsuccessful_number = self.get_number_unsuccessful_words(align_words)
+        logging.info("%d unaligned words (of %d)", unsuccessful_number, len(align_words))
+        if unsuccessful_number != 0:
+            align_words = realign(align_words, self.ms, self.model, wavfile) # realign
+            unsuccessful_number = self.get_number_unsuccessful_words(align_words)
+            logging.info("after 2nd pass: %d unaligned words (of %d)", unsuccessful_number, len(align_words))
+        return Transcription(words=align_words, transcript=self.transcript)
+
diff --git a/python/vosk/aligner/scripts/metasentence.py b/python/vosk/aligner/scripts/metasentence.py
@@ -0,0 +1,55 @@
+# coding=utf-8
+import re
+OOV_TERM = '<unk>'
+
+def kaldi_normalize(word, model):
+    '''Take a token extracted from a transcript by MetaSentence and
+    transform it to use the same format as Kaldi's vocabulary files.
+    Removes fancy punctuation and strips out-of-vocabulary words.
+    Uses the vosk_model_find_word method to check whether the given word
+    is in the Vosk vocabulary.
+    '''
+    norm = word.lower()
+    # Turn fancy apostrophes into simpler apostrophes
+    norm = norm.replace("’", "'")
+    status = model.vosk_model_find_word(str(norm))
+    if len(norm) > 0 and status == -1:
+        norm = OOV_TERM
+    return norm
+
+class MetaSentence:
+    '''Maintain two parallel representations of a sentence: one for
+    Kaldi's benefit, and the other in human-legible form.
+    '''
+    def __init__(self, transcript, model):
+        self.raw_transcript = transcript
+        self.model = model
+        if isinstance(transcript, bytes):
+            self.raw_transcript = transcript.decode('utf-8')
+        self._tokenize()
+
+    def _tokenize(self):
+        self._seq = []
+        for m in re.finditer(r'(\w|\’\w|\'\w)+', self.raw_transcript, re.UNICODE):
+            start, end = m.span()
+            word = m.group()
+            token = kaldi_normalize(word, self.model)
+            self._seq.append({
+                "start": start, # as unicode codepoint offset
+                "end": end, # as unicode codepoint offset
+                "token": token,
+            })
+
+    def get_kaldi_sequence(self):
+        return [x["token"] for x in self._seq]
+
+    def get_display_sequence(self):
+        display_sequence = []
+        for x in self._seq:
+            start, end = x["start"], x["end"]
+            word = self.raw_transcript[start:end]
+            display_sequence.append(word)
+        return display_sequence
+
+    def get_text_offsets(self):
+        return [(x["start"], x["end"]) for x in self._seq]
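
Review note on metasentence.py: a minimal sketch of what the tokenizer in `_tokenize` extracts, without needing a model. The pattern below is an equivalent rewrite of the one in the patch (a character class instead of two escaped alternatives); `tokenize` and the sample sentence are hypothetical names and data used only for illustration.

```python
import re

# Same language as r'(\w|\’\w|\'\w)+' in the patch: word characters, with
# a straight or typographic apostrophe allowed when followed by a word char.
TOKEN_RE = re.compile(r"(?:\w|[\u2019']\w)+", re.UNICODE)


def tokenize(text):
    # Return (display word, (start, end)) pairs; offsets are code points.
    return [(m.group(), m.span()) for m in TOKEN_RE.finditer(text)]


sample = "Don\u2019t stop, it's fine."
tokens = tokenize(sample)
for word, span in tokens:
    print(word, span)
```

The offsets are what `get_text_offsets` returns, and slicing the raw transcript with them gives back the display words.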
diff --git a/python/vosk/aligner/scripts/multipass.py b/python/vosk/aligner/scripts/multipass.py
@@ -0,0 +1,95 @@
+import logging
+import wave
+import sys
+
+from . import metasentence
+from . import text_processor
+from . import diff_align
+from . import transcription
+from .transcriber import Transcriber as transcriber
+'''This script is due to be reworked.
+Multipass realignment of unaligned words.
+prepare_multipass scans the word sequence: for each word whose case is
+not-found-in-audio or not-found-in-transcript it marks a chunk [words before,
+unaligned words, words after]; each chunk is then re-transcribed with a new recognizer and put back into the result sequence.
+'''
+def prepare_multipass(alignment):
+    to_realign = []
+    cur_list = []
+    chunks = 0
+    reserve_words = 3
+    NOT_FOUND_IN_AUDIO = 2
+    NOT_FOUND_IN_TRANSCRIPT = 3
+    for i, w in enumerate(alignment):
+        if w.case == NOT_FOUND_IN_AUDIO or w.case == NOT_FOUND_IN_TRANSCRIPT:
+            for j, wd in enumerate(alignment):
+                if j >= max(0, i - reserve_words) and j <= min(len(alignment), i + reserve_words):
+                    wd.realign = True
+    for j, wd in enumerate(alignment):
+        if wd.realign:
+            cur_list.append(wd)
+        else:
+            if len(cur_list) != 0:
+                to_realign.append(cur_list)
+                cur_list = []
+                chunks += 1
+    if len(cur_list) != 0:
+        to_realign.append(cur_list)
+        chunks += 1
+    return to_realign, chunks
+
+def realign(alignment, ms, model, wavfile, progress_cb=None):
+    to_realign, chunks = prepare_multipass(alignment)
+    tasks = []
+
+    def realign(chunk):
+        realignments = []
+        if chunk[0].start is None:
+            start_t = 0
+        else:
+            start_t = chunk[0].start
+        if chunk[-1].end is None:
+            end_t = wavfile.getnframes() / float(wavfile.getframerate())
+        else:
+            end_t = chunk[-1].end
+        shift_start = 0.5
+        shift_end = 2
+        duration = end_t - start_t
+        chunk_start_word = chunk[0].word
+        chunk_end_word = chunk[-1].word
+        # set start/end to get chunk's text part
+        chunk_start = chunk[0].startOffset
+        chunk_end = chunk[-1].endOffset
+        chunk_transcript = ms.raw_transcript[chunk_start:chunk_end]
+        chunk_ms = metasentence.MetaSentence(chunk_transcript, model)
+        chunk_ks = chunk_ms.get_kaldi_sequence()
+        chunk_length = len(chunk_ks)
+        # getting chunk's sound part as value 'words'
+        text_chunk = text_processor.text_processor(chunk_transcript + '.', model)
+        start_pos = int(((start_t - shift_start) * wavfile.getframerate()))
+        if start_pos < 0:
+            start_pos = 0
+        wavfile.setpos(start_pos)
+        end_pos = int(((2 * duration) + shift_end) * wavfile.getframerate())
+        chunk_end = end_pos + start_pos
+        words = transcriber.transcribe(text_chunk, wavfile, chunk_end)[0:chunk_length + 1]
+        if words[0]['word'] != chunk_start_word:
+            words = words[1:len(words)]
+        if words[-1]['word'] != chunk_end_word:
+            words = words[0:len(words) - 1]
+        start_t_chunk = words[0]['start']
+        for i in range(len(words)):
+            words[i]['start'] = words[i]['start'] - start_t_chunk + start_t
+            words[i]['end'] = words[i]['end'] - start_t_chunk + start_t
+        word_alignment = diff_align.align(words, chunk_ms)
+        realignments.append({"chunk": chunk, "words": word_alignment})
+        return realignments
+
+    for i in range(chunks):
+        tasks.extend(realign(to_realign[i]))
+    output_words = alignment
+    for i, obj in enumerate(tasks):
+        start_task = output_words.index(tasks[i]["chunk"][0])
+        duration_task = len(tasks[i]["chunk"])
+        output_words = output_words[:start_task] + tasks[i]["words"]  + output_words[start_task + duration_task:]
+    return output_words
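
Review note on multipass.py: the chunking done by `prepare_multipass` can be sketched on plain booleans, which makes the effect of `reserve_words` easier to see. This is a simplified illustration, not the patch's code: `mark_and_group` is a hypothetical helper, it works on indices instead of `Word` objects, and it replaces the nested full scan with a bounded range over the same window.

```python
def mark_and_group(failed, reserve_words=3):
    """Group indices around failed words into contiguous realign chunks.

    failed: list of booleans, True where a word was not aligned.
    Returns a list of chunks, each a list of consecutive indices.
    """
    n = len(failed)
    realign = [False] * n
    # Mark every failed word plus reserve_words neighbours on each side.
    for i, bad in enumerate(failed):
        if bad:
            for j in range(max(0, i - reserve_words), min(n, i + reserve_words + 1)):
                realign[j] = True
    # Collect consecutive marked indices into chunks.
    chunks, cur = [], []
    for idx, flag in enumerate(realign):
        if flag:
            cur.append(idx)
        elif cur:
            chunks.append(cur)
            cur = []
    if cur:
        chunks.append(cur)
    return chunks


failed = [False] * 10
failed[5] = True
print(mark_and_group(failed))
```

Two failures whose windows overlap end up in one chunk, so nearby errors are re-transcribed together.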
diff --git a/python/vosk/aligner/scripts/text_processor.py b/python/vosk/aligner/scripts/text_processor.py
@@ -0,0 +1,53 @@
+import logging
+import re
+
+from vosk import KaldiRecognizer
+'''This script is due to be reworked.
+    1.1 & 1.2 prepare the input text
+    2.1 get the end position of the current sentence as a number
+    3.1 split the text into sentences
+    4.1 cut the current sentence off
+    5.1 format the text as a grammar for KaldiRecognizer
+'''
+def text_processor(text, model):
+
+    # 3.1 
+    def get_sentence(preprocess_result):
+        symbols = re.findall(r'([^\.\?\!]{1})', preprocess_result)
+        return symbols
+    # 2.1
+    def get_sentence_separator(preprocess_result):
+        current_sentence = re.search(r'([\.\?\!]{1})', preprocess_result)
+        current_separator_position = current_sentence.start()
+        return current_separator_position
+    # 5.1
+    def prepared_part_for_KaldiRecognizer(make_sentence):
+        final_result = '["{}", "[unk]"]'.format(make_sentence.strip('[]'))
+        return final_result
+    # 1.1
+    def preprocess(text):
+        preprocessed_result = ''
+        cleaning = re.sub(r'[\,\;]', '', text)
+        lower_case = cleaning.lower()
+        for symbol in lower_case:
+            preprocessed_result += symbol
+        return preprocessed_result
+    # 4.1
+    def raw_part(make_sentence, prepared_text):
+        prepared_text  = ''.join(prepared_text.split(make_sentence))[1:]
+        return prepared_text
+
+    make_text = ''
+    preprocessed_result = preprocess(text.strip()) # 1
+    while(len(preprocessed_result) > 0):
+        current_separator_position = get_sentence_separator(preprocessed_result)
+        # 2
+        symbols = get_sentence(preprocessed_result) # 3
+        make_sentence = ''
+        for symbol in range(current_separator_position):
+            make_sentence += symbols[symbol]
+        make_text += make_sentence
+        preprocessed_result = raw_part(make_sentence, preprocessed_result) # 4
+    final = prepared_part_for_KaldiRecognizer(make_text) # 5
+    rec = KaldiRecognizer(model, 16000, final)
+    return rec
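
Review note on text_processor.py: for ordinary punctuated input, the loop above appears to reduce to "drop commas and semicolons, lowercase, drop sentence separators, wrap as a two-element JSON list". A possible sketch of that reduction is below. `build_grammar` is a hypothetical helper, it uses `json.dumps` instead of manual string joins, and it does not reproduce the original's `strip('[]')` or its behaviour on edge cases such as text with no `.`, `?` or `!`.

```python
import json
import re


def build_grammar(text):
    # Steps 1.x: remove commas/semicolons and lowercase the text.
    cleaned = re.sub(r"[,;]", "", text.strip()).lower()
    # Steps 2-4: remove the sentence separators themselves.
    phrase = re.sub(r"[.?!]", "", cleaned)
    # Step 5: one phrase plus the "[unk]" token, as a JSON list string.
    return json.dumps([phrase, "[unk]"], ensure_ascii=False)


grammar = build_grammar("Hello, world. How are you?")
print(grammar)
```

The resulting string is the kind of value the patch passes as the third argument to `KaldiRecognizer(model, 16000, final)`.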