Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix RT_USE_LOOKUP_TABLE when reading unicode chars #66

Merged

Conversation

laurenthuberdeau
Copy link
Collaborator

Context

I tried to compile examples/repl.c with the bootstrap-results/pnut.sh and it failed with line 6774: __c2i_é: parameter not set. This is because the [a-zA-Z0-9] pattern can include non-ASCII characters depending on the locale.

This change also has the nice side effect of speeding up ksh by 12% taking 28s instead of 32s on my machine. Other shells show no significant improvement when bootstrapping and in the compile-times benchmarks.

sh-runtime.c Outdated
@@ -371,7 +371,7 @@ DEFINE_RUNTIME_FUN(char_to_int)
putstr(" '}') __c=125 ;;\n");
putstr(" '~') __c=126 ;;\n");
putstr(" *)\n");
putstr(" __c=$(LC_CTYPE=C printf \"%d\" \"'$1\")\n");
putstr(" __c=$(printf \"%d\" \"'$1\")\n");
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting LC_ALL sets LC_CTYPE so we can remove this.

@monnier
Copy link
Collaborator

monnier commented Aug 18, 2024

Do we know if LC_ALL=... is guaranteed to immediately affect the locale behavior of the current shell process?
The easiest implementation would instead set the locale once and for all at startup.

@laurenthuberdeau
Copy link
Collaborator Author

Do we know if LC_ALL=... is guaranteed to immediately affect the locale behavior of the current shell process?

It's part of the prologue so it's one of the first thing that's executed when the script starts. I don't think any of the commands we use overwrite it, so setting it once should be fine (as the passing CI tests demonstrate).

@monnier
Copy link
Collaborator

monnier commented Aug 20, 2024 via email

This change also has the nice side effect of speeding up ksh by 12%.
Other shells show no significant improvement when bootstrapping
and in the compile-times benchmarks.
@laurenthuberdeau laurenthuberdeau force-pushed the laurent/fix-locale-bug-with-RT_USE_LOOKUP_TABLE branch from 30b6c62 to 5fab88c Compare September 4, 2024 15:41
@laurenthuberdeau laurenthuberdeau force-pushed the laurent/fix-locale-bug-with-RT_USE_LOOKUP_TABLE branch from 5fab88c to eefc97b Compare September 4, 2024 15:44
@laurenthuberdeau
Copy link
Collaborator Author

Ah I see what you mean. Setting the variable seems to affect the locale of the current shell for all shells except yash as demonstrated by this test:

➜ cat test-locale.sh
str="é"

case "$str" in
  [[:alnum:]]) echo "[:alnum:]";;
  [a-zA-Z0-9]) echo "[a-zA-Z0-9]";;
  *) echo "other";;
esac

LC_ALL=C

case "$str" in
  [[:alnum:]]) echo "[:alnum:]";;
  [a-zA-Z0-9]) echo "[a-zA-Z0-9]";;
  *) echo "other";;
esac

( # Open subshell
case "$str" in
  [[:alnum:]]) echo "[:alnum:]";;
  [a-zA-Z0-9]) echo "[a-zA-Z0-9]";;
  *) echo "other";;
esac
) # Close subshell
➜ locale
LANG="en_CA.UTF-8"
LC_COLLATE="en_CA.UTF-8"
LC_CTYPE="en_CA.UTF-8"
LC_MESSAGES="en_CA.UTF-8"
LC_MONETARY="en_CA.UTF-8"
LC_NUMERIC="en_CA.UTF-8"
LC_TIME="en_CA.UTF-8"
LC_ALL=
➜ ksh  test-locale.sh
[:alnum:]
other
other
➜ dash test-locale.sh
other
other
other
➜ bash test-locale.sh
[:alnum:]
other
other
➜ zsh  test-locale.sh
[:alnum:]
other
other
➜ mksh test-locale.sh
other
other
other
➜ yash test-locale.sh
[:alnum:]
[:alnum:] # Wrong, yash does not update the locale when `LC_` variables are set
[:alnum:] # Even more wrong!? I would have expected the locale to be updated in the subshell

As the test shows, the implementation of locale varies between shells, with dash and mksh not respecting the LC_COLLATE value when evaluating case patterns. I'm surprised opening a subshell doesn't fix the issue, but subshells come with their own performance problem (the ksh/bash bootstrap takes minutes with it for some weird reason) so it wouldn't be a viable solution anyway.

Anyway, the PR fixes the bug on all shells except yash which seems to be broken.

@laurenthuberdeau laurenthuberdeau merged commit 8aa6f5a into main Sep 4, 2024
28 checks passed
@laurenthuberdeau laurenthuberdeau deleted the laurent/fix-locale-bug-with-RT_USE_LOOKUP_TABLE branch September 4, 2024 16:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants