Extract both sentences and words from publication content? #395
I am working on a project that aligns text with a narration of the same (or similar) text. To do this, I need to be able to reference any word in a textual publication, along with its sentence, so that I can align the text with audio transcriptions and then highlight the correct word or sentence as it is being narrated. In other words, I want a function of

My implementation of this seems to be extracting paragraphs instead of sentences, though I am using
Replies: 1 comment 6 replies
There is no API to do this directly, but you could build it by creating two tokenizers, one for sentences and one for words, and nesting them:

    publication.content()
        .elements()
        .flatMap { element in
            // Split the element into sentences first.
            let sentences = sentenceTokenizer(element)
            return sentences.map { sentence in
                (
                    sentence: sentence,
                    // Then split each sentence into words.
                    words: wordsTokenizer(sentence)
                )
            }
        }

My implementation of this seems to be extracting paragraphs instead of sentences, though I am using

Maybe a bug, but I doubt it as the
I see. I forgot that the text tokenizers split the segments of a TextContentElement instead of returning more TextContentElement objects. But this is an implementation detail; you should assume that a tokenizer might return more than one element. That is also why you didn't get the individual word locators: you need to check the segments. You can try this version:
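The version posted in the thread is not included here. Purely as an illustration of the segment idea, here is a minimal self-contained sketch. It does not use Readium's API: `Segment`, `TextElement`, `ElementTokenizer`, `makeTokenizer`, and the naive splitters are all hypothetical stand-ins, assumed only to mimic a tokenizer that returns elements whose segments have been split.

```swift
import Foundation

// Hypothetical stand-ins for illustration; these are NOT Readium types.
struct Segment { let text: String }
struct TextElement { let segments: [Segment] }

// Assumed shape: a tokenizer takes one element and may return several.
typealias ElementTokenizer = (TextElement) -> [TextElement]

/// Naive sentence splitter: cuts after '.', '!' or '?'.
func naiveSentences(_ text: String) -> [String] {
    var result: [String] = []
    var current = ""
    for ch in text {
        current.append(ch)
        if ch == "." || ch == "!" || ch == "?" {
            let trimmed = current.trimmingCharacters(in: .whitespacesAndNewlines)
            if !trimmed.isEmpty { result.append(trimmed) }
            current = ""
        }
    }
    let rest = current.trimmingCharacters(in: .whitespacesAndNewlines)
    if !rest.isEmpty { result.append(rest) }
    return result
}

/// Naive word splitter: keeps runs of letters and digits.
func naiveWords(_ text: String) -> [String] {
    return text
        .split(whereSeparator: { !($0.isLetter || $0.isNumber) })
        .map { String($0) }
}

/// Builds a tokenizer that splits the segments of an element
/// and returns a single element holding the finer segments.
func makeTokenizer(_ split: @escaping (String) -> [String]) -> ElementTokenizer {
    return { element in
        let segments = element.segments.flatMap { segment in
            split(segment.text).map { Segment(text: $0) }
        }
        return [TextElement(segments: segments)]
    }
}

/// Pairs every sentence with its words by walking the segments of each
/// returned element, without assuming how many elements come back.
func sentencesAndWords(in elements: [TextElement]) -> [(sentence: String, words: [String])] {
    let sentenceTokenizer = makeTokenizer(naiveSentences)
    let wordTokenizer = makeTokenizer(naiveWords)
    var result: [(sentence: String, words: [String])] = []
    for element in elements {
        for sentenceElement in sentenceTokenizer(element) {
            for sentence in sentenceElement.segments {
                let single = TextElement(segments: [sentence])
                let words = wordTokenizer(single).flatMap { wordElement in
                    wordElement.segments.map { $0.text }
                }
                result.append((sentence: sentence.text, words: words))
            }
        }
    }
    return result
}

let paragraph = TextElement(segments: [Segment(text: "Hello world. How are you?")])
let pairs = sentencesAndWords(in: [paragraph])
for pair in pairs {
    print(pair.sentence, pair.words)
}
```

The point of the sketch is the inner loop over `segments`: mapping over the returned elements alone would yield one entry per paragraph, which matches the symptom described above. In a real app each segment would also carry its own locator, which the sketch leaves out.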