file2brl generating a wrong hyphenated word with hungarian eurobraille document #48

hammera · 2018-08-24T14:35:52Z

Hi List,

In 2017 Norbert and I found an interesting problem when running file2brl with the following parameters:
file2brl -f hu.cfg -t test.htm test.brf
If anybody would like to reproduce or fix this issue, I am attaching four files:
test.htm: the small source HTML document, cut down to the affected part.
test.brf: the incorrectly generated Hungarian grade 1 braille document; its 29th line contains the wrong Hungarian hyphenation.
hu.cfg: my Hungarian-specific preferences for file2brl.

On Linux, anybody can reproduce this issue by copying the hu.cfg file into the /usr/share/liblouisutdml/lbu_files directory and running the following command:
file2brl -f hu.cfg -t test.htm test.brf

On the 29th line of the generated test.brf document, file2brl hyphenates the word "bekezdés" incorrectly.
The hyphen character ends up on the 29th line at character position 32.
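
To locate such line-final breaks in a larger .brf file, a small scan like the following may help. It is only a sketch: the function name `hyphenated_line_ends` is made up for illustration, and the sample text is a shortened stand-in for the real test.brf. It assumes lines are separated by `\n`, as set by `lineEnd` in hu.cfg.

```python
def hyphenated_line_ends(text):
    """Return (line number, line length, last fragment) for every line ending in '-'."""
    hits = []
    # Split on '\n' only, so that a form feed (pageEnd) does not shift the numbering.
    for number, line in enumerate(text.split('\n'), start=1):
        if line.endswith('-'):
            fragment = line.rsplit(' ', 1)[-1]
            hits.append((number, len(line), fragment))
    return hits


# Shortened sample, not the full attached file.
sample = '$innent7l a sima\n5qveg. $vajon e2 beh02"sos beke-\n2d1s le5-e?\n'
for number, length, fragment in hyphenated_line_ends(sample):
    print(number, length, fragment)
```

Each reported fragment can then be compared by hand with the hyphenation that Liblouis gives for the word.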

With Liblouis I verified where the word bekezdés may be hyphenated in Hungarian. The correct hyphenation is:
be-kez-dés
Because the lou_checkhyphens utility cannot test the word bekezdés (it contains an accented character), I wrote a small Python script to easily test any Hungarian word.
The code is as follows:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import sys

import louis

# Translation table plus hyphenation dictionary for Hungarian.
TABLES = ['hu-hu-g1.ctb', 'hyph_hu_HU.dic']


def apply_mask(word, hyphen_mask):
    # A '1' in the mask marks a character that starts a new syllable,
    # so a hyphen is inserted in front of it.
    return "".join("-" + a if b == '1' else a for a, b in zip(word, hyphen_mask))


def hyphenate_word(word):
    try:
        return apply_mask(word, louis.hyphenate(TABLES, word, 0))
    except RuntimeError:
        # louis.hyphenate raised an error for the whole word: retry on the
        # parts between the hyphens that the word already contains.
        parts = word.split('-')
        return '-'.join(apply_mask(p, louis.hyphenate(TABLES, p, 0)) for p in parts)


word = sys.argv[1]
hyphenated_word = hyphenate_word(word)
print('normal word: ' + word)
print('hyphenated word: ' + hyphenated_word)

If I run the command python3 hyphenate.py bekezdés, I get the following correct output:
"normal word: bekezdés
hyphenated word: be-kez-dés"
I am attaching this small test program too.

The built-in hyphenate function of Liblouis confirms that the generated beke- break is not valid.
On the 29th line, the longest valid fragment that fits within the maximum line length of 32 characters is "be-", and the remaining part "kezdés" needs to go on the next line.
After manual correction, the correct braille output for the affected text, in eurobraille format in Hungarian grade 1 braille, is the following:
"5qveg. $vajon e2 beh02"sos be-
ke2d1s le5-e?"
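
The reasoning above can be checked with a few lines of Python. This is a sketch of the expected line-filling rule, not of how file2brl works internally. The mask `00100100` is what I assume louis.hyphenate returns for bekezdés, derived from the be-kez-dés result. The function names are made up for illustration, and applying the print mask directly to the braille word only works here because this word has one braille cell per letter.

```python
def break_points(mask):
    """Indices where a '1' in the mask marks the start of a new syllable."""
    return [i for i, flag in enumerate(mask) if flag == '1']


def best_break(word, mask, cells_left):
    """Return (head, tail) for the longest valid fragment that fits, or None."""
    for i in reversed(break_points(mask)):
        # The fragment needs i cells plus one cell for the hyphen.
        if i + 1 <= cells_left:
            return word[:i] + '-', word[i:]
    return None


line = '5qveg. $vajon e2 beh02"sos '   # text already on the line, 27 cells
word = 'beke2d1s'                      # braille form of bekezdés
mask = '00100100'                      # assumed mask for be-kez-dés
cells_left = 32 - len(line)            # cellsPerLine is 32 in hu.cfg
print(best_break(word, mask, cells_left))
```

With 5 cells left, "beke2-" would need 6 cells, so the only fragment that is both valid and short enough is "be-".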

How can this situation be prevented during automatic braille conversion? For example, is it possible to blacklist this wrong hyphenation, given that Liblouis itself generates a correct hyphenation mask for this word?
In small texts this type of error is easy to correct, but in a large document intended as a printable braille book it is a very tedious task for the proofreaders.
There is a good chance that a large text contains more issues of this type.
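
Until there is a real fix, proofreaders could at least flag suspicious breaks automatically. The sketch below works on the print word rather than on the braille cells and checks a line-final fragment against a hyphenation mask. The function `is_valid_break` is hypothetical, and the mask is again the one I assume louis.hyphenate returns for bekezdés.

```python
def is_valid_break(word, mask, fragment):
    """True if fragment (with or without its trailing hyphen) ends at a syllable boundary of word."""
    head = fragment.rstrip('-')
    if not head or len(head) >= len(word) or not word.startswith(head):
        return False
    # A '1' at this position means the next character starts a new syllable.
    return mask[len(head)] == '1'


word = 'bekezdés'
mask = '00100100'  # assumed mask for be-kez-dés
for fragment in ('be-', 'beke-', 'bekez-'):
    print(fragment, is_valid_break(word, mask, fragment))
```

This only reports problems; it does not stop file2brl from producing them.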

I am attaching the affected files.
Attila

hammera · 2018-08-24T14:38:28Z

This is the HTML file that produces the wrong output

hammera · 2018-08-24T14:47:27Z

This is the HTML content that produces the wrong output:

<title>teszt</title>

Innentől folytatódik a sima szöveg. Vajon ez behúzásos bekezdés lesz-e?

This is the hu.cfg file content:

# This file contains all possible configuration settings, with sample
# values, where appropriate. It is used by the file2brl command-line
# interface if no configuration file is given. It is also part of the
# documentation.

outputFormat
    # The number of cells on a line in Braille translations
    cellsPerLine 32

    # The number of lines per page in Braille translations
    linesPerPage 25

    # Whether to format the Braille translation for output on interpoint embossers
    # NOTE: Other than the formatting, there is nothing specific to interpoint
    # embossing. This means that even if this is set to no, the output can
    # still be embossed on an interpoint embosser
    interpoint yes

    # What emphasis to include, comma separated list using values:
    # italic, bold, computerBraille and underline.
    # For all emphasis you may just use the value all
    emphasis all

    # Whether to separate into Braille pages. If no then linesPerPage is ignored.
    braillePages yes

    # When numbering print pages should continuation be indicated in page numbers.
    # For example first Braille page of print page 26 it will be a26,
    # second page will be b26, etc.
    continuePages yes

    # Whether to include a print page separator mark in Braille at print page
    # breaks.
    pageSeparator yes

    # Whether to include a page number on the page separator line.
    pageSeparatorNumber yes

    # Include Braille page numbers
    numberBraillePages yes

    # What format should be produced from back translations
    backFormat html

    # Line length for files produced from backtranslation.
    backLineLength 70

    # Hyphenate translations
    hyphenate yes

    # What type of Braille device should output be formatted for.
    formatFor textDevice

    # What characters mark a line ending, mostly relevant for text/brf format.
    lineEnd \n

    # What character marks end of page, again mostly suitable for text/brf format.
    pageEnd \f

    # What page number should Braille page numbers start from
    beginningPageNumber 1

    # Whether to format paragraphs. If set to no then a paragraph is one long
    # line and cellsPerLine is ignored
    paragraphs yes

    # Whether to show print page numbers
    printPages yes

    # Where to place print page numbers
    printPageNumberAt top

    # Where to place Braille page numbers
    braillePageNumberAt bottom

    # Encoding of output file.
    outputEncoding utf8

    # Whether to produce a table of contents
    contents yes

    # The character to fill lines with (eg. in tables tracker dots)
    lineFill '

    # The below settings for margins and paper dimensions are only used for UTD
    # output. When formatting for UTD cellsPerLine and linesPerpage are
    # ignored.

    # The margin at the top of the page in inches
    topMargin 0.5

    # The margin at the left of the page
    leftMargin 1

    # The margin at the right of the page
    rightMargin 0.5

    # The margin at the bottom of the page
    bottomMargin 0.5

    # Height of the Braille page in inches
    paperHeight 11

    # Width of the Braille page in inches
    paperWidth 9.5
braillePageNumber

    # If a print page has no page number, do not insert a page separator and
    # so merge it with the previous page in the Braille translation
    mergeUnnumberedPages yes

    # Whether to place any page numbers at the top of a page on a separate line
    pageNumberTopSeparateLine no

    # Whether to place any page numbers at the bottom of a page on a separate line
    pageNumberBottomSeparateLine no

    # If a Braille page has more than one print page on it, whether to show the
    # range of print page numbers present on the Braille page.
    printPageNumberRange yes

    # If there is an empty page in Print whether to ignore it in Braille.
    ignoreEmptyPages yes

    # Whether to include print page numbers in table of contents
    printPageNumbersInContents yes

    # Whether to include Braille page numbers in table of contents
    braillePageNumbersInContents yes

translation
    # What Braille table to use for the literary text
    literaryTextTable hu-hu-g1.ctb,hyph_hu_HU.dic

    # What table to use for computer Braille
    compbrlTable hu-hu-comp8.ctb

    # What table to use for uncontracted Braille
    # NOTE: This setting is possibly depricated.
    uncontractedTable en-us-g1.ctb

    # What table to use for non-mathematical content in books containing maths.
    # This option is normally not needed in many codes and so should be the
    # same as literaryTextTable.
    mathtextTable hu-hu-g1.ctb

    # What Braille table to use for mathematical content
    mathexprTable nemeth.ctb

    # What table should be used to edit together parts of documents (eg. to
    # join maths and text)
    editTable nemeth_edit.ctb

xml
    # The XML header assumed for XML input documents with no header
    xmlheader "<?xml version='1.0' encoding='UTF-8' standalone='yes'?>

    # Entity definitions
    #entity (an entity definition for the DTD)

    # The semantic action files to be used
    semanticFiles *,nemeth.sem

    # Whether to use the internet to get DTDs
    internetAccess no

    # Whether to create new semantic action definitions
    newEntries yes

    # What semantic action file to convert from UTD.
    converterSem utd.sem

#(miscellaneous)

    # Directive for including other configuration files

    # The mode for translation
    mode dotsIO

    # The input encoding of text files
    inputTextEncoding utf8

    # Whether to use debug mode
    debug no

# You can override any style setting and define new styles.
# A style name will normally match the semantic action name
# Refer to the liblouisutdml documentation for details on possible options
# which can be used in styles.

style document
    #This style contains all possible style settings.
    linesBefore 0
    linesAfter 0
    leftMargin 0
    firstLineIndent 0
    #translationTable (a table name)
    skipNumberLines no
    format leftJustified
    newPageBefore no
    newPageAfter no
    righthandPage no
    braillePageNumberFormat normal
    keepWithNext no
    dontSplit no
    orphanControl 0
    newlineAfter yes

style arith
style attribution
    format rightJustified
style biblio
style caption
    leftMargin 4
    firstLineIndent 2
style code
    linesBefore 1
    linesAfter 1
    skipNumberLines yes
    format computerCoded
style contentsheader
    linesBefore 1
    format centered
    linesAfter 1
style contents1
    firstLineIndent -2
    leftMargin 2
    format contents
style contents2
    firstLineIndent -2
    leftMargin 4
    format contents
style contents3
    firstLineIndent -2
    leftMargin 6
    format contents
style contents4
    firstLineIndent -2
    leftMargin 8
    format contents
style dedication
    newPageBefore yes
    newPageAfter yes
    format centered
style directions
style dispmath
    leftMargin 2
style disptext
    leftMargin 2
    firstLineIndent 2
style exercise1
    leftMargin 2
    firstLineIndent -2
style exercise2
    leftMargin 4
    firstLineIndent -2
style exercise3
    leftMargin 6
    firstLineIndent -2
style glossary
    firstLineIndent 2
style graph
    skipNumberLines yes
style graphlabel
style heading1
    linesBefore 1
    format centered
    linesAfter 1
    keepWithNext yes
    dontSplit yes

style heading2
    linesBefore 1
    firstLineIndent 4
style heading3
    firstLineIndent 4
style heading4
    firstLineIndent 4
style index
style line
    firstLineIndent -2
    leftMargin 2
style list
    firstLineIndent -2
    leftMargin 2
style matrix
    format alignColumnsLeft
style music
    skipNumberLines yes
style note
style para
    firstLineIndent 2
style quotation
    linesBefore 1
    linesAfter 1
style section
    firstLineIndent 4
style spatial
style stanza
    linesBefore 1
    linesAfter 1
style style1
style style2
style style3
style style4
style style5
style subsection
    firstLineIndent 4
style table
    linesBefore 1
    linesAfter 1
style titlepage
    newPageAfter yes
style trnote
    firstLineIndent 7
    leftMargin 5
style volume
style boxline
    topBoxline c
    bottomBoxline c

This is the hyphenate.py file content:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import sys

import louis

# Translation table plus hyphenation dictionary for Hungarian.
TABLES = ['hu-hu-g1.ctb', 'hyph_hu_HU.dic']


def apply_mask(word, hyphen_mask):
    # A '1' in the mask marks a character that starts a new syllable,
    # so a hyphen is inserted in front of it.
    return "".join("-" + a if b == '1' else a for a, b in zip(word, hyphen_mask))


def hyphenate_word(word):
    try:
        return apply_mask(word, louis.hyphenate(TABLES, word, 0))
    except RuntimeError:
        # louis.hyphenate raised an error for the whole word: retry on the
        # parts between the hyphens that the word already contains.
        parts = word.split('-')
        return '-'.join(apply_mask(p, louis.hyphenate(TABLES, p, 0)) for p in parts)


word = sys.argv[1]
hyphenated_word = hyphenate_word(word)
print('normal word: ' + word)
print('hyphenated word: ' + hyphenated_word)

This is the relevant part of the incorrect test.brf content:
$innent7l fo�tat9dik a sima
5qveg. $vajon e2 beh02"sos beke-
2d1s le5-e?

The word bekezdés needs to be broken as be- / ke2d1s, because the bekez- fragment does not fit within the 32-character line length.

So the internal louis.hyphenate function hyphenates the word bekezdés at the right places (be-kez-dés).

With Hungarian grade 2 braille, the output is fine with each of the us-table.dis, de-eurobrl6.dis and unicode.dis files, except that the word bekezdés is broken after bekez-, and if I see it correctly, the line is then longer than 32 characters.

Attila
