Text direction needs to be taken into account #4
Comments
Can you give some examples of how these mixed-direction texts are written? I mean the actual writing process (e.g. which character and which strokes are written first). We didn't expect this to be a problem, though. Our assumption was that handwriting follows the natural flow of speech. In other words, we didn't expect characters to be written in reverse relative to their speech / interpretation direction. For example, we didn't expect "hello" to be written in "elloh" order.
I grabbed some examples from Wikipedia home pages. First example. Unidirectional text, but the recogniser has to scan from right to left. Second example. Numbers and Latin text run LTR within the overall RTL flow. People writing the text tend to leave a gap and write the LTR text from left to right. They don't write the numbers or the Latin text backwards. Note, btw, that in the example just above, the parenthesis on the left is U+0029 RIGHT PARENTHESIS, and the one on the right is U+0028 LEFT PARENTHESIS. These are mirrored characters, whose glyph in typed text is established only when the directional context is known. The recogniser will also need to assign the glyph to a code point depending on the current base direction. Third example. An overall LTR sentence contains RTL text, which in turn has LTR text embedded in it. I expect that 'W3C' would probably be the last 3 code points written and stored in memory once the text has been recognised. To be honest, it can be difficult to know where the boundaries are for the changes in base direction here, though in this example the quote marks help. I don't know how this is done in practice, I'm just flagging up that it will be necessary. When it comes to speech, there is no flip-flopping of direction involved, and in fact in memory all code points are also arranged in one logical, unidirectional sequence. The changes in direction are only a feature of the written text. Unfortunately for you, that's what you're starting from.
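To make the mirrored-parenthesis point concrete, here is a minimal Python sketch of how a recogniser might pick the code point for a bracket-like glyph once the base direction is known. The function name `code_point_for_glyph` and the small `MIRROR_PAIRS` table are hypothetical and exist only for illustration; the complete pairing data is published by Unicode in BidiMirroring.txt. It assumes the recogniser has already identified the visual shape of the glyph.

```python
import unicodedata

# Hand-written subset of mirror pairs, for illustration only.
# The authoritative list is Unicode's BidiMirroring.txt.
MIRROR_PAIRS = {"(": ")", ")": "(", "[": "]", "]": "[", "<": ">", ">": "<"}


def code_point_for_glyph(glyph_shape: str, base_direction: str) -> str:
    """Return the character to store for a recognised glyph shape.

    glyph_shape is the character whose LTR rendering looks like the ink.
    In an RTL context a mirrored character is drawn flipped, so the shape
    that looks like "(" actually stands for U+0029 RIGHT PARENTHESIS.
    """
    is_mirrored = bool(unicodedata.mirrored(glyph_shape))
    if base_direction == "rtl" and is_mirrored and glyph_shape in MIRROR_PAIRS:
        return MIRROR_PAIRS[glyph_shape]
    return glyph_shape


stored = code_point_for_glyph("(", "rtl")
print(f"U+{ord(stored):04X} {unicodedata.name(stored)}")
```

The same ink therefore maps to different code points depending on direction, which is why the base direction has to be known (or inferred) before this step.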
Sorry about the delay. I forgot to mention you in w3ctag/design-reviews#591 (comment) Let's continue the discussion here. WDYT about a For distinguishing between "Score: 28" and "Score: 82" (esp. rule based ones). I imagine the recognizer can determine the script of each word, use script's LTR or RTL to decide. In the above case, "Score:" is Hebrew, and "82" is Latin. With the presence of For machine learning based recognizers (the ones we currently have), handwriting "Score: 82" is part of their training dataset. The Hebrew recognizer will learn from the dataset and output characters in the correct in-memory order (i.e. direction hint is unnecessary). As for how it knows the right order, we don't know (hence why it's ML based).
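The rule-based idea of deciding each word's direction from its script could look roughly like the Python sketch below. This is not how any existing recognizer works; `word_direction` is a hypothetical helper that uses the Unicode bidirectional class of the first strong character, and it assumes that digit-only words are treated as LTR runs, since numbers are written left to right even inside RTL text.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}    # Hebrew letters, Arabic letters
LTR_CLASSES = {"L"}          # Latin and other left-to-right letters
DIGIT_CLASSES = {"EN", "AN"}  # European and Arabic-Indic digits


def word_direction(word: str) -> str:
    """Classify a word as 'rtl', 'ltr' or 'neutral'."""
    # The first strongly directional character decides.
    for ch in word:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in RTL_CLASSES:
            return "rtl"
        if bidi_class in LTR_CLASSES:
            return "ltr"
    # Assumption: a word with digits but no letters is an LTR number run.
    if any(unicodedata.bidirectional(ch) in DIGIT_CLASSES for ch in word):
        return "ltr"
    return "neutral"


hebrew_score = "\u05e0\u05d9\u05e7\u05d5\u05d3:"  # Hebrew word followed by a colon
for word in (hebrew_score, "82", "W3C"):
    print(repr(word), word_direction(word))
```

With directions like these, a rule-based recognizer could tell "82" from "28" by reading the digit run left to right regardless of the surrounding RTL flow.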
Not only will the recogniser need to take into account the language, but it will be unable to decipher the text unless it understands whether the glyphs it recognises proceed from right to left, from left to right, or vertically from top to bottom with lines stacked LTR or RTL.
This includes orthographies that are generally written in one direction, but that have embedded text that runs in the opposite direction, and sometimes embedded text within that.
To some extent the recogniser will be able to apply the Unicode bidi algorithm to reverse engineer the logical character sequence, but in other bidirectional cases this will not be sufficient. Also it would probably be beneficial to indicate for the recogniser the overall scanning direction for the text being entered, for which it may be useful to apply a directional label, in a similar way to how one does this for language.
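As a rough illustration of reverse engineering the logical sequence, the Python sketch below converts one line of glyphs, given in left-to-right page order, into logical order under an assumed RTL base direction. It is a deliberately simplified inverse of the bidi algorithm, not an implementation of it: the name `visual_to_logical` is hypothetical, neutral characters are always attached to the RTL flow, mirrored characters are not swapped, and nested embeddings are not handled. Those gaps are the kind of case where this approach is not sufficient.

```python
import unicodedata

LTR_LIKE = {"L", "EN", "AN"}  # letters and digits that run left to right


def visual_to_logical(visual: str) -> str:
    """Convert glyphs in left-to-right page order to logical order.

    Assumes a single line with RTL base direction.
    """
    # 1. Split into maximal runs of LTR-like and non-LTR characters.
    runs = []  # list of (is_ltr, text)
    for ch in visual:
        is_ltr = unicodedata.bidirectional(ch) in LTR_LIKE
        if runs and runs[-1][0] == is_ltr:
            runs[-1] = (is_ltr, runs[-1][1] + ch)
        else:
            runs.append((is_ltr, ch))
    # 2. An RTL line is read from its right edge, so reverse the run order.
    # 3. RTL runs appear reversed on the page; LTR runs keep their order.
    return "".join(text if is_ltr else text[::-1] for is_ltr, text in reversed(runs))


# "Score: 82" style line: digits on the left of the page, Hebrew word on the right.
visual_line = "82 :" + "\u05d3\u05d5\u05e7\u05d9\u05e0"
logical_line = visual_to_logical(visual_line)
print([f"U+{ord(c):04X}" for c in logical_line])
```

In the result the Hebrew word comes first in memory and "82" last, with its digits still in written order. A declared overall direction for the text being entered would remove the need to guess the base direction that this sketch takes for granted.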