Hyphenation: consider words that contain a NBSP #2271

Open
wants to merge 2 commits into
base: main


@ondras ondras commented Oct 7, 2024

The main goal here is to fix #2270.

My change is simple:

  • adjust the get_next_word_boundaries function so that it treats the whole word-nbsp-word cluster as a single, potentially hyphenable word;
  • ignore any item suggested by the hyphenation dictionary that ends with an NBSP.
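
The second point can be illustrated with a minimal sketch. It assumes the hyphenation candidates arrive as (head, tail) string pairs; the function name and the sample pairs are hypothetical and are not taken from the pull request's diff.

```python
NBSP = "\u00a0"  # no-break space


def drop_nbsp_splits(splits):
    """Keep only the (head, tail) pairs whose head does not end with an NBSP.

    Breaking right after a no-break space would defeat its purpose, so
    such candidates are discarded. Hypothetical helper, for illustration.
    """
    return [(head, tail) for head, tail in splits if not head.endswith(NBSP)]


# Hand-written candidates for the cluster "a<NBSP>priori" (illustrative only).
candidates = [("a" + NBSP, "priori"), ("a" + NBSP + "pri", "ori")]
kept = drop_nbsp_splits(candidates)
```

Only the split that falls inside "priori" survives the filter here; the one immediately after the NBSP is dropped.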

Tests seem to pass. However:

@liZe
Member

liZe commented Oct 25, 2024

Thanks a lot for the pull request!

I think your code improves the result in some cases, but I’m not comfortable giving "word-nbsp-word" to Pyphen. We used to (and still do?) have problems with punctuation marks and other non-letter characters, as Pyphen is only designed to handle clean words, and I’d like to avoid this kind of problem again. We’d also get right or wrong results depending on the hyphenation methods used by different languages.

Moreover, I think that get_next_word_boundaries should really find word boundaries, even the ones that are not line-break opportunities (a no-break space is one such case, but there are others; using is_line_break may help).
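
To make the distinction concrete, here is a rough sketch that reports every word boundary and separately flags whether a line break is allowed after the word. It is a stand-in based on a regular expression, not the is_line_break attribute mentioned above: only whitespace is considered, the set of no-break spaces is an assumption, and other break opportunities (such as hyphens) are ignored.

```python
import re

# Assumption: these are the space characters that forbid a line break.
NO_BREAK_SPACES = {"\u00a0", "\u2007", "\u202f"}


def word_boundaries(text):
    """Return a list of (start, stop, can_break_after) for each word.

    Words are runs of word characters. A break is allowed after a word
    if the gap before the next word contains an ordinary (breakable)
    whitespace character, or if the word is the last one in the text.
    """
    matches = list(re.finditer(r"\w+", text))
    result = []
    for index, match in enumerate(matches):
        if index + 1 < len(matches):
            gap = text[match.end():matches[index + 1].start()]
            can_break = any(
                char.isspace() and char not in NO_BREAK_SPACES
                for char in gap)
        else:
            # End of text: nothing follows, so a break is trivially allowed.
            can_break = True
        result.append((match.start(), match.end(), can_break))
    return result
```

With this shape, "a" and "house" in "a<NBSP>house" are still two words, but the first one is marked as not followed by a break opportunity.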

The origin of the problem is that we assume that lines break at word boundaries, and that the next word is the only one we try to hyphenate:

next_word_boundaries = get_next_word_boundaries(second_line_text, lang)
if next_word_boundaries:
    # We have a word to hyphenate
    start_word, stop_word = next_word_boundaries
    next_word = second_line_text[start_word:stop_word]
    if stop_word - start_word >= total:
        # This word is long enough
        first_line_width, _ = line_size(first_line, style)
        space = max_width - first_line_width
        if style['hyphenate_limit_zone'].unit == '%':
            limit_zone = (
                max_width * style['hyphenate_limit_zone'].value / 100)
        else:
            limit_zone = style['hyphenate_limit_zone'].value
        if space > limit_zone or space < 0:
            # Available space is worth the try, or the line is even too
            # long to fit: try to hyphenate
            auto_hyphenation = True

Instead of finding only the next word, we should find the next words until we reach a line-break opportunity, and try to hyphenate these words separately. It requires some work, but it seems to be possible.
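
A minimal sketch of that idea, under stated assumptions: words are found with a regular expression rather than with the real boundary detection, only breakable whitespace counts as a line-break opportunity, and `hyphenate` is a hypothetical callable that maps a single clean word to a list of split offsets. None of these names come from the existing code.

```python
import re

# Assumption: these are the space characters that forbid a line break.
NO_BREAK_SPACES = {"\u00a0", "\u2007", "\u202f"}


def words_until_break(text):
    """Return (start, stop) spans of the words that precede the first
    breakable whitespace, i.e. the words glued together by no-break spaces.
    """
    spans = []
    position = 0
    for match in re.finditer(r"\w+", text):
        gap = text[position:match.start()]
        breakable = any(
            char.isspace() and char not in NO_BREAK_SPACES for char in gap)
        if spans and breakable:
            # A line-break opportunity was reached: stop collecting words.
            break
        spans.append(match.span())
        position = match.end()
    return spans


def hyphenation_offsets(text, hyphenate, min_length=1):
    """Hyphenate each collected word on its own and return the candidate
    split offsets relative to the start of `text`.

    `hyphenate` is a hypothetical callable: word -> list of offsets.
    """
    offsets = []
    for start, stop in words_until_break(text):
        if stop - start < min_length:
            continue  # Word too short to be worth hyphenating.
        offsets.extend(
            start + offset for offset in hyphenate(text[start:stop]))
    return offsets
```

Each word is passed to the hyphenator separately, so it only ever sees clean words and never the no-break space itself. With Pyphen, its positions method could presumably play the role of `hyphenate`, but that wiring is not shown or tested here.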

@liZe liZe added the bug Existing features not working as expected label Oct 25, 2024
Successfully merging this pull request may close these issues.

Hyphenation not working when a nbsp is present elsewhere in the word