file2brl generating a wrong hyphenated word with hungarian eurobraille document #48
This is the HTML content that produces the wrong output:
<title>teszt</title>Innentől folytatódik a sima szöveg. Vajon ez behúzásos bekezdés lesz-e?
This is the hu.cfg file content:
This file contains all possible configuration settings, with sample values, where appropriate. It is used by the file2brl command-line interface if no configuration file is given. It is also part of the documentation.
outputFormat
The number of cells on a line in Braille translations
The number of lines per page in Braille translations
Whether to format the Braille translation for output on interpoint embossers
NOTE: Other than the formatting, there is nothing specific to interpoint embossing. This means that even if this is set to no, the output can still be embossed on an interpoint embosser
What emphasis to include, comma separated list using values: italic, bold, computerBraille and underline. For all emphasis you may just use the value all
Whether to separate into Braille pages. If no then linesPerPage is ignored.
When numbering print pages should continuation be indicated in page numbers. For example first Braille page of print page 26 it will be a26, second page will be b26, etc.
Whether to include a print page separator mark in Braille at print page breaks.
Whether to include a page number on the page separator line.
Include Braille page numbers
What format should be produced from back translations
Line length for files produced from backtranslation.
Hyphenate translations
What type of Braille device should output be formatted for.
What characters mark a line ending, mostly relevant for text/brf format.
What character marks end of page, again mostly suitable for text/brf format.
What page number should Braille page numbers start from
Whether to format paragraphs. If set to no then a paragraph is one long line and cellsPerLine is ignored
Whether to show print page numbers
Where to place print page numbers
Where to place Braille page numbers
Encoding of output file.
Whether to produce a table of contents
The character to fill lines with (eg. in tables tracker dots)
The below settings for margins and paper dimensions are only used for UTD output. When formatting for UTD cellsPerLine and linesPerPage are ignored.
The margin at the top of the page in inches
The margin at the left of the page
The margin at the right of the page
The margin at the bottom of the page
Height of the Braille page in inches
Width of the Braille page in inches
If a print page has no page number, do not insert a page separator and so merge it with the previous page in the Braille translation
Whether to place any page numbers at the top of a page on a separate line
Whether to place any page numbers at the bottom of a page on a separate line
If a Braille page has more than one print page on it, whether to show the range of print page numbers present on the Braille page.
If there is an empty page in Print whether to ignore it in Braille.
Whether to include print page numbers in table of contents
Whether to include Braille page numbers in table of contents
translation
What Braille table to use for the literary text
What table to use for computer Braille
What table to use for uncontracted Braille
NOTE: This setting is possibly deprecated.
What table to use for non-mathematical content in books containing maths. This option is normally not needed in many codes and so should be the same as literaryTextTable.
What Braille table to use for mathematical content
What table should be used to edit together parts of documents (eg. to join maths and text)
xml
The XML header assumed for XML input documents with no header
Entity definitions
The semantic action files to be used
Whether to use the internet to get DTDs
Whether to create new semantic action definitions
What semantic action file to convert from UTD.
#(miscellaneous)
Directive for including other configuration files
The mode for translation
The input encoding of text files
Whether to use debug mode
You can override any style setting and define new styles. A style name will normally match the semantic action name. Refer to the liblouisutdml documentation for details on possible options which can be used in styles.
style document
style arith
style heading2
This is the hyphenate.py file content:
# -*- coding: utf-8 -*-
import louis, sys
word=sys.argv[1]
This is the relevant part of the wrong test.brf content:
The word bekezdés needs to be hyphenated as be-2d1s, because the bekez- part does not fit within the 32-character line length. The internal louis.hyphenate function hyphenates the word bekezdés at the right places (be-kez-dés). With Hungarian grade2 Braille, using any of the us-table.dis, de-eurobrl6.dis and unicode.dis files is OK, except that the word bekezdés is hyphenated as bekez-, and if I see it right, the line is then longer than 32 characters.
Attila
Hi List,
In 2017 Norbert and I found an interesting situation when using file2brl with the following parameters:
file2brl -f hu.cfg -t test.html test.brf
If anybody would like to reproduce or fix this issue, I am attaching four files:
test.htm: the small source HTML document, which I cut down to the affected part.
test.brf: the incorrectly generated Hungarian grade 1 Braille document, whose 29th line contains the wrong Hungarian hyphenation.
hu.cfg: this file contains my Hungarian-specific preferences for file2brl.
On Linux, anybody can reproduce this issue by copying the hu.cfg file into the /usr/share/liblouisutdml/lbu_files directory and typing the following command:
file2brl -f hu.cfg -t test.htm test.brf
On the 29th line of the generated test.brf document, the file2brl utility hyphenates the word "bekezdés" incorrectly.
In this case the hyphen character lands at the 32nd character position of the 29th line.
With Liblouis I verified where the word bekezdés can be hyphenated in Hungarian. The following split is the correct hyphenation:
be-kez-dés
Because the lou_checkhyphens utility cannot test the word bekezdés, as it contains an accented character, I wrote a small Python script to easily test any Hungarian word.
The code is as follows:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import louis, sys

# Hungarian grade 1 table plus the Hungarian hyphenation dictionary.
TABLES = ['hu-hu-g1.ctb', 'hyph_hu_HU.dic']

def apply_mask(word, hyphen_mask):
    # Insert "-" before every character whose mask value is '1'.
    return "".join("-" + a if b == '1' else a for a, b in zip(word, hyphen_mask))

def hyphenate_word(word):
    try:
        hyphenated_word = apply_mask(word, louis.hyphenate(TABLES, word, 0))
    except RuntimeError:
        # louis.hyphenate failed on the whole word, so hyphenate each
        # hyphen-separated part on its own and join the parts again.
        parts = word.split('-')
        hyphenated_word = '-'.join(
            apply_mask(part, louis.hyphenate(TABLES, part, 0)) for part in parts)
    return hyphenated_word

word = sys.argv[1]
hyphenated_word = hyphenate_word(word)
print('normal word: ' + word)
print('hyphenated word: ' + hyphenated_word)
If I run the command python3 hyphenate.py bekezdés, I get the following correct output:
"normal word: bekezdés
hyphenated word: be-kez-dés"
I am attaching this small test program too.
The built-in Liblouis hyphenate function confirms that the generated beke- hyphenation is not valid.
On the 29th line, the first valid hyphenation part that fits within the maximum line length of 32 characters is "be-", and the "kezdés" part needs to go on the next line.
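To make the expected behaviour concrete, here is a minimal sketch of choosing the break point from a hyphenation mask. It is only an illustration and not how file2brl or liblouisutdml work internally. The helper name break_word is hypothetical, the mask is written by hand from the be-kez-dés result (a '1' means a hyphen may go before that character, the same convention as the script above), and it counts print characters rather than Braille cells, which is an assumption that may not hold for every word.

```python
def break_word(word, hyphen_mask, room):
    """Split word at the last allowed hyphenation point that still fits.

    hyphen_mask: string of '0'/'1', same length as word; a '1' at index i
    means a hyphen may be inserted before word[i].
    room: free characters left on the current line; the trailing hyphen
    uses one of them.
    Returns (head_with_hyphen, tail); head is '' when no break fits.
    """
    # The head word[:i] + '-' is i + 1 characters long, so i <= room - 1.
    for i in range(min(len(word) - 1, room - 1), 0, -1):
        if hyphen_mask[i] == '1':
            return word[:i] + '-', word[i:]
    return '', word


# Hand-written mask for be-kez-dés (hypothetical input, not taken from liblouis).
mask = "00100100"
print(break_word("bekezdés", mask, 3))  # ('be-', 'kezdés')
print(break_word("bekezdés", mask, 2))  # ('', 'bekezdés')
```

With only three characters of room the sketch picks "be-", and with less room it moves the whole word to the next line instead of breaking at a position the mask does not allow.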
After manual correction, the correct Braille output for the affected text is the following, in Eurobraille format, Hungarian grade 1 Braille:
"5qveg. $vajon e2 beh02"sos be-
ke2d1s le5-e?"
How can this situation be prevented during automatic Braille conversion? For example, is it possible to blacklist this wrong hyphenation, given that Liblouis itself generates a correct hyphenation mask for this word?
In small texts this type of error is easy to correct, but in a large document intended to become a printable Braille book, it is a very tedious task for the people proofreading the document.
There is a good chance that a large text will contain more issues of this type.
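Until there is a real fix, one possible way to narrow down the manual checking in a large document is to list the lines that are longer than the configured line length. The following is only a rough sketch: overlong_lines is a hypothetical helper, it assumes one character per Braille cell in the .brf text and a line length of 32, and it cannot tell whether a hyphenation point itself is valid.

```python
def overlong_lines(brf_text, cells_per_line=32):
    """Return (line_number, length) for every line longer than cells_per_line."""
    found = []
    # Split on '\n' only, so a form feed (page break) does not shift line numbers.
    for number, line in enumerate(brf_text.split('\n'), start=1):
        cells = line.strip('\r\f')  # drop carriage returns and page-break marks
        if len(cells) > cells_per_line:
            found.append((number, len(cells)))
    return found


# Small made-up sample: the second line is one character too long.
sample = "a" * 32 + "\n" + "b" * 33 + "\n"
print(overlong_lines(sample))  # [(2, 33)]
```

To run it on a real file, the text would be read from test.brf first; which encoding to use when opening the file depends on the hu.cfg settings, so I leave that open here.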
I am attaching the affected files.
Attila