Extract both sentences and words from publication content? #395

domkm · 2024-02-28T05:48:18Z

domkm
Feb 28, 2024

I am working on a project that aligns text with narration of the same (or similar) text. To do this, I need to be able to reference any word in a textual publication as well as the associated sentence for aligning with audio transcriptions and subsequently for highlighting the correct word/sentence as they're being narrated. In other words, I want a function of (Publication) -> [(sentence: Locator, words: [Locator])].

My implementation of this seems to be extracting paragraphs instead of sentences, though I am using TextUnit.sentence. Is there a recommended technique for extracting multiple TextUnit cases simultaneously?


mickael-menu · 2024-03-03T16:44:06Z

mickael-menu
Mar 3, 2024
Maintainer

There is no API to do this directly, but you could build it by creating two ContentTokenizers with makeTextContentTokenizer(), one for .sentence and the other for .word. Then iterate over publication.content() using something like this (not tested):

// content() is optional and the tokenizers can throw.
try publication.content()?
    .elements()
    .flatMap { element in
        let sentences = try sentenceTokenizer(element)
        return try sentences.map { sentence in
            (
                sentence: sentence,
                words: try wordsTokenizer(sentence)
            )
        }
    }

> My implementation of this seems to be extracting paragraphs instead of sentences, though I am using TextUnit.sentence.

It may be a bug, but I doubt it, since PublicationSpeechSynthesizer also uses TextUnit.sentence. Could you open a bug report with a code sample that reproduces the issue?

6 replies

mickael-menu Mar 5, 2024
Maintainer

I'm not sure I understand why you only use the first element returned by the tokenizers:

try? sentenceTokenizer($0)[0] as? TextContentElement
...
try? wordTokenizer($0)[0] as? TextContentElement

And why you create a new TextContentElement using the first segment of the sentence, instead of passing the whole sentence element to the wordTokenizer?

domkm Mar 6, 2024
Author

tokenizer($0)[0] is because ContentTokenizer always returns a [any ContentElement] of length 1.

Creating new TextContentElements is because each sentence is a TextContentElement.Segment but ContentTokenizer expects a TextContentElement.

domkm Mar 6, 2024
Author

Alternatively, and perhaps this should actually be a bug report, but extracted locators do not have unique totalProgression values; they repeat for chunks of text. In other words, multiple extracted adjacent locators have the same totalProgression.

Solving this would actually be my preferred solution, because it would provide an alternative solution to the current topic and also allow me to implement a feature that requires finding the current word/sentence given an arbitrary locator. For example, if a user swipes between pages or double taps somewhere on a page (I know the latter isn't possible yet without #273), I want to seek the current audio playback location to the new text location. Though I can map bidirectionally from my canonical extracted word locators and my audio transcript, I do not yet know how to map from an arbitrary text locator to extracted text locator. A totalProgression value which increments for every extracted text element would solve this.
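To make the lookup concrete: if the extracted locators did carry strictly increasing totalProgression values, mapping an arbitrary locator to the nearest extracted one could be a plain binary search. The sketch below is only an illustration under that assumption. It works on bare Double values rather than Locator, and indexOfEntry(atOrBefore:in:) is a hypothetical helper, not a toolkit API.

```swift
/// Returns the index of the last progression that is <= target,
/// or nil if the array is empty or the target precedes every entry.
/// Assumes `progressions` is sorted in ascending order (hypothetical input).
func indexOfEntry(atOrBefore target: Double, in progressions: [Double]) -> Int? {
    var low = 0
    var high = progressions.count // exclusive upper bound
    while low < high {
        let mid = low + (high - low) / 2
        if progressions[mid] <= target {
            low = mid + 1
        } else {
            high = mid
        }
    }
    return low > 0 ? low - 1 : nil
}

// Hypothetical progression values for four extracted words.
let wordProgressions = [0.0, 0.1, 0.25, 0.6]
if let index = indexOfEntry(atOrBefore: 0.3, in: wordProgressions) {
    print("Nearest extracted word index:", index)
}
```

The returned index could then be used to look up the matching entry in the audio transcript.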

mickael-menu Mar 7, 2024
Maintainer

> tokenizer($0)[0] is because ContentTokenizer always returns a [any ContentElement] of length 1.

I see, I forgot that the text tokenizers split the segments of a TextContentElement instead of returning more TextContentElements. But this is an implementation detail; you should assume that a tokenizer might return more than one element.

And that's why you didn't get the individual word locators: you need to check the segments. You can try this version:

func contentPairs(publication: Publication) throws -> [(sentence: Locator, words: [Locator])] {
    // Without a content service there is nothing to extract.
    guard let content = publication.content() else {
        return []
    }

    // Splits the text of an element into word segments.
    let wordTokenizer = makeTextContentTokenizer(
        defaultLanguage: publication.metadata.language,
        textTokenizerFactory: { language in
            makeDefaultTextTokenizer(unit: .word, language: language)
        }
    )

    // Splits the text of an element into sentence segments.
    let sentenceTokenizer = makeTextContentTokenizer(
        defaultLanguage: publication.metadata.language,
        textTokenizerFactory: { language in
            makeDefaultTextTokenizer(unit: .sentence, language: language)
        }
    )

    // The tokenizer returns elements holding several segments, so split them
    // into one element per sentence.
    let sentenceElems = try content.sequence()
        .flatMap { element in
            try sentenceTokenizer(element)
                .flatMap { $0.splitSegments() }
        }

    // Each sentence produces exactly one pair, so use map rather than flatMap.
    return try sentenceElems.map { sentenceElem in
        let wordElems = try wordTokenizer(sentenceElem)
            .flatMap { $0.splitSegments() }

        return (
            sentence: sentenceElem.locator,
            words: wordElems.map(\.locator)
        )
    }
}

extension ContentElement {
    func splitSegments() -> [ContentElement] {
        guard let textContent = self as? TextContentElement else {
            return [self]
        }

        return textContent.segments.map { segment in
            TextContentElement(
                locator: segment.locator,
                role: textContent.role,
                segments: [segment]
            )
        }
    }
}

I added a ContentElement.splitSegments() extension to split the segments into individual ContentElements.
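For highlighting during narration, it may help to flatten the result into a single word list where each word remembers its sentence. The following is a minimal sketch of that idea, not part of the toolkit: flattenWords is a hypothetical helper, and it is generic so the example can run with String stand-ins instead of real Locator values.

```swift
/// Flattens (sentence, words) pairs into one list of words, each tagged with
/// the index of the sentence it belongs to. `L` would be Locator in practice.
func flattenWords<L>(_ pairs: [(sentence: L, words: [L])]) -> [(word: L, sentenceIndex: Int)] {
    var flat: [(word: L, sentenceIndex: Int)] = []
    for (index, pair) in pairs.enumerated() {
        for word in pair.words {
            flat.append((word: word, sentenceIndex: index))
        }
    }
    return flat
}

// String stand-ins for the locators returned by contentPairs(publication:).
let samplePairs: [(sentence: String, words: [String])] = [
    (sentence: "A b.", words: ["A", "b"]),
    (sentence: "C.", words: ["C"]),
]
for entry in flattenWords(samplePairs) {
    print(entry.word, "belongs to sentence", entry.sentenceIndex)
}
```

With such a list, the index of the word being narrated gives both the word and the sentence to highlight.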

> Alternatively, and perhaps this should actually be a bug report, but extracted locators do not have unique totalProgression values; they repeat for chunks of text. In other words, multiple extracted adjacent locators have the same totalProgression.

The issue is that we can't really compute a new totalProgression for each segment unless we know the starting totalProgression of the next content element. But you could probably write an algorithm that adjusts the totalProgression by looking at the current element's locator and the next one's. This has to be done before using a tokenizer; otherwise, you will have several elements with the same totalProgression.
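As a rough illustration of such an algorithm, here is a minimal sketch that spreads totalProgression across the segments of one element, between that element's value and the next element's value. It is untested against the toolkit and uses no toolkit types: interpolatedProgressions and its parameters are hypothetical names, and weighting by character count is an assumption. The two element-level values would be recorded from the untokenized elements.

```swift
/// Assigns each segment a progression between `elementStart` and
/// `nextElementStart`, proportional to the character offset at which the
/// segment begins. `segmentLengths` holds the character count of each segment.
func interpolatedProgressions(
    elementStart: Double,
    nextElementStart: Double,
    segmentLengths: [Int]
) -> [Double] {
    let total = segmentLengths.reduce(0, +)
    // With no text to measure, fall back to the element's own value.
    guard total > 0 else {
        return segmentLengths.map { _ in elementStart }
    }
    let span = nextElementStart - elementStart
    var offset = 0
    return segmentLengths.map { length in
        let progression = elementStart + span * Double(offset) / Double(total)
        offset += length
        return progression
    }
}

// An element starting at 0.25, followed by one starting at 0.5,
// tokenized into three segments of 1, 1 and 2 characters.
let adjusted = interpolatedProgressions(
    elementStart: 0.25,
    nextElementStart: 0.5,
    segmentLengths: [1, 1, 2]
)
print(adjusted) // [0.25, 0.3125, 0.4375]
```

For the last element of a publication, 1.0 could serve as the next starting value.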

Answer selected by domkm

domkm Mar 7, 2024
Author

That worked! Thank you!
