[GLSL 4.6 Specification] Clarify the translation from UTF-8 scalar values to the corresponding character set tokens #220

Open
ContingencyOfTautologicalContradictions opened this issue Nov 7, 2023 · 3 comments

@ContingencyOfTautologicalContradictions

In the GLSL 4.6 specification, add the following paragraph to section 3.1:

The given files for compilation must be in the form of a well-formed UTF-8 code unit sequence. These files are decoded to produce their corresponding sequence of Unicode scalar values. A sequence of character set tokens is then formed by mapping each Unicode scalar value to the corresponding character set token. In the resulting sequence, each pair of characters in the input sequence consisting of U+000D CARRIAGE RETURN followed by U+000A LINE FEED, as well as each U+000D CARRIAGE RETURN not immediately followed by a U+000A LINE FEED, is replaced by a single new-line character.
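As a rough illustration of what the proposed paragraph asks for, here is a minimal Python sketch. The function name `decode_shader_source` is hypothetical, and the sketch assumes that the "character set tokens" are simply the decoded Unicode scalar values (the proposal does not define the mapping). It is not taken from glslang or any other compiler.

```python
def decode_shader_source(data: bytes) -> str:
    """Decode a shader file as strict UTF-8 and normalize line endings.

    Hypothetical helper: mirrors the proposed paragraph, not any real compiler.
    """
    # Strict decoding raises UnicodeDecodeError on ill-formed UTF-8
    # (overlong forms, encoded surrogates, stray continuation bytes, ...).
    text = data.decode("utf-8", errors="strict")

    # Replace CR LF pairs first, so the CR of a pair is not also treated
    # as a lone CR; then replace every remaining lone CR.
    return text.replace("\r\n", "\n").replace("\r", "\n")


if __name__ == "__main__":
    print(repr(decode_shader_source(b"void main()\r\n{\r}\n")))
```

The order of the two `replace` calls matters: handling lone CR first would turn every CR LF pair into two new-line characters.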

@arcady-lunarg arcady-lunarg transferred this issue from KhronosGroup/glslang Nov 7, 2023
@arcady-lunarg

This sounds like an issue with the spec rather than with the glslang compiler, so I transferred it to the appropriate repository for that sort of issue.

@pdaniell-nv pdaniell-nv added this to the Needs Triage milestone Nov 8, 2023
@gnl21
Contributor

gnl21 commented Nov 8, 2023

I'm not sure what ambiguity you're aiming to clear up here, perhaps because I'm not sufficiently knowledgeable about UTF-8. Is there an alternative way of interpreting a UTF-8 sequence other than what you describe? I'm fine with spelling things out clearly, but this seems to be straying into territory that should be covered by the UTF-8 spec, rather than GLSL.

One specific concern that I have, for example, is that the proposed text talks about mapping the UTF-8 characters into the character set but doesn't say what the mapping is. I think that the UTF-8 codepoints actually already represent the characters, so don't need mapping, which is why the correct mapping is obvious, but if they're different enough to require mapping then we should say what the mapping is.

I'm not convinced that the handling of new lines in the proposed text is correct according to the current spec. GLSL currently says that any of "\r", "\n" or "\r\n" are a valid line break, which isn't the same as in your comment. I'm not sure what glslang implements for this.
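For concreteness, a line splitter that follows the reading described here (any of "\r", "\n" or "\r\n" ends a line) could be sketched in Python as below. The name `split_glsl_lines` is made up for illustration, and this is an assumption about how the rule might be applied, not a description of what glslang does.

```python
import re

# "\r\n" is listed first so that a CR LF pair is consumed as one terminator
# instead of being matched as a lone CR followed by a lone LF.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def split_glsl_lines(source: str) -> list[str]:
    """Split source text into lines, treating CR, LF and CR LF as one break each."""
    return _LINE_BREAK.split(source)


if __name__ == "__main__":
    for number, line in enumerate(split_glsl_lines("a\rb\r\nc\nd"), start=1):
        print(number, repr(line))
```

An explicit pattern is used rather than `str.splitlines()`, because `splitlines()` also breaks on characters such as form feed and U+2028, which are not mentioned in this discussion.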

@arcady-lunarg

It looks like glslang currently treats "\n" or "\r\n" as line terminators. The situation with bare "\r" is more complicated: I think it will not produce syntax errors, but it also will not give the right line numbers. Note that the spec actually limits the valid characters in GLSL tokens to (a subset of) ASCII and the core language does not have strings. The GLSL_EXT_debug_printf extension does add string literals but the extension spec language still does not allow the use of codepoints above 126 in tokens, so the only place where non-ASCII characters can occur is in comments, where the current spec allows any byte values and doesn't require well-formed UTF-8. In practice, glslang doesn't enforce this and just accepts any sequence of bytes in a string literal (or in a header name in a #include, another place where arbitrary strings are allowed).
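To make the "only in comments" point concrete, here is a small Python sketch that reports the offsets of bytes above 126 that fall outside `//` and `/* */` comments. The function name is hypothetical. The sketch assumes core GLSL without string literals, and it ignores backslash line continuations, so it is an illustration of the rule rather than a faithful tokenizer.

```python
def non_ascii_outside_comments(src: bytes) -> list[int]:
    """Return byte offsets of values above 126 found outside comments."""
    offsets = []
    state = "code"  # one of "code", "line_comment", "block_comment"
    i = 0
    while i < len(src):
        pair = src[i:i + 2]
        if state == "code":
            if pair == b"//":
                state = "line_comment"
                i += 2
                continue
            if pair == b"/*":
                state = "block_comment"
                i += 2
                continue
            if src[i] > 126:
                offsets.append(i)
        elif state == "line_comment":
            # A line comment ends at the next CR or LF.
            if src[i] in (0x0A, 0x0D):
                state = "code"
        else:  # block_comment
            if pair == b"*/":
                state = "code"
                i += 2
                continue
        i += 1
    return offsets


if __name__ == "__main__":
    sample = b"float x; // \xcf\x80\nfloat \xc3\xa9;\n/* \xc3\xbc */"
    print(non_ascii_outside_comments(sample))
```

Because the scan works on raw bytes, ill-formed UTF-8 inside a comment is skipped without complaint, which matches the current spec wording described above.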


4 participants