Hyphenation: consider words that contain a NBSP #2271

ondras · 2024-10-07T20:37:32Z

The main goal here is to fix #2270.

My change is simple:

- adjust the get_next_word_boundaries function so it considers the whole word-nbsp-word cluster as a single potentially-hyphenable word,
- ignore suggested hyphenation dictionary item(s) that end with a nbsp.
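
For readers who have not opened the diff, the two steps can be pictured roughly as below. This is only a minimal sketch, not the code in this pull request: `next_cluster_boundaries`, `usable_break_positions` and the regular expression are hypothetical names made up for illustration, and `positions` stands for whatever offsets the hyphenation dictionary suggests.

```python
import re

NBSP = '\u00a0'

# Assumption for this sketch: a "cluster" is a run of word characters,
# optionally glued to further runs by single no-break spaces.
CLUSTER_RE = re.compile(rf'\w+(?:{NBSP}\w+)*')


def next_cluster_boundaries(text):
    """Return (start, stop) of the first word-nbsp-word cluster, or None."""
    match = CLUSTER_RE.search(text)
    return match.span() if match else None


def usable_break_positions(cluster, positions):
    """Drop suggested break offsets whose first part would end with a NBSP."""
    return [pos for pos in positions if not cluster[:pos].endswith(NBSP)]
```

With `', k\u00a0hyphenation rest'` as input, the cluster found is `'k\u00a0hyphenation'`, and a suggested break right after the no-break space is discarded.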

Tests seem to pass. However:

liZe · 2024-10-25T08:56:37Z

Thanks a lot for the pull request!

I think your code improves the result in some cases, but I’m not comfortable passing "word-nbsp-word" to Pyphen. We used to (and still do?) have problems with punctuation marks and other non-letter characters, as Pyphen is only designed to handle clean words, and I’d like to avoid this kind of problem again. We’ll also get right or wrong results depending on the hyphenation methods used by different languages.

Moreover, I think that get_next_word_boundaries should really find word boundaries, even those that are not line-break opportunities (such as around a no-break space; there are others, and using is_line_break may help).
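
To make the distinction concrete, here is a rough sketch of a word finder that reports every word boundary and separately records whether a line break would be allowed before the word. It is an illustration under simplifying assumptions, not WeasyPrint code: `word_boundaries` is a hypothetical helper, and the break test only looks at a few no-break space characters instead of asking the text layout engine.

```python
import re

# Assumption: only these "glue" spaces forbid a break; real line breaking
# follows many more rules than this sketch does.
NO_BREAK_SPACES = {'\u00a0', '\u2007', '\u202f'}


def word_boundaries(text):
    """Yield (start, stop, break_before) for each word found in text."""
    previous_stop = 0
    for match in re.finditer(r'\w+', text):
        gap = text[previous_stop:match.start()]
        # A break is assumed possible only if the gap before the word
        # contains whitespace that is not a no-break space.
        break_before = any(
            char.isspace() and char not in NO_BREAK_SPACES for char in gap)
        yield match.start(), match.end(), break_before
        previous_stop = match.end()
```

In `'in\u00a0the long run'`, the words "in" and "the" are two separate words, but no break is reported between them.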

The origin of the problem is that we assume that lines break at word boundaries, and that the next word is the only one we try to hyphenate:

WeasyPrint/weasyprint/text/line_break.py

Lines 362 to 379 in becf6dc

    next_word_boundaries = get_next_word_boundaries(second_line_text, lang)
    if next_word_boundaries:
        # We have a word to hyphenate
        start_word, stop_word = next_word_boundaries
        next_word = second_line_text[start_word:stop_word]
        if stop_word - start_word >= total:
            # This word is long enough
            first_line_width, _ = line_size(first_line, style)
            space = max_width - first_line_width
            if style['hyphenate_limit_zone'].unit == '%':
                limit_zone = (
                    max_width * style['hyphenate_limit_zone'].value / 100)
            else:
                limit_zone = style['hyphenate_limit_zone'].value
            if space > limit_zone or space < 0:
                # Available space is worth the try, or the line is even too
                # long to fit: try to hyphenate
                auto_hyphenation = True

Instead of finding only the next word, we should find the next words until we reach a line-break opportunity, and try to hyphenate these words separately. It requires some work, but it seems to be possible.
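
As a rough idea of what that could look like, the sketch below collects the words up to the first assumed line-break opportunity and asks for hyphenation points word by word, so the hyphenation dictionary only ever sees clean words. Everything here is hypothetical: the function names, the rule that any gap containing something other than a no-break space counts as a break opportunity, and the `positions_for_word` callable (any function mapping a word to break offsets inside that word).

```python
import re

# Assumption: words separated only by these characters stay on one line.
NO_BREAK_SPACES = {'\u00a0', '\u2007', '\u202f'}


def words_until_break_opportunity(text):
    """Return (start, stop) spans of words up to the first break opportunity."""
    spans = []
    previous_stop = None
    for match in re.finditer(r'\w+', text):
        if previous_stop is not None:
            gap = text[previous_stop:match.start()]
            # Anything other than a no-break space is treated as a place
            # where the line could break: stop collecting words there.
            if any(char not in NO_BREAK_SPACES for char in gap):
                break
        spans.append(match.span())
        previous_stop = match.end()
    return spans


def hyphenation_offsets(text, positions_for_word):
    """Collect break offsets in text, hyphenating each word separately."""
    offsets = []
    for start, stop in words_until_break_opportunity(text):
        word = text[start:stop]  # a clean word, without any NBSP
        offsets.extend(start + pos for pos in positions_for_word(word))
    return offsets
```

The returned offsets are relative to the whole text, so the caller can still choose the best one across the glued words, while each lookup stays limited to a single word.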

lame attempt at hyphenation when nbsp is present

619e3bc

ondras mentioned this pull request Oct 7, 2024

Hyphenation not working when a nbsp is present elsewhere in the word #2270

Open

ruff fixes

2ae30bc

liZe added the bug Existing features not working as expected label Oct 25, 2024
