2.0.0-beta1 script properties not up to date for unicode 16 #6041

Open
cmyr opened this issue Jan 27, 2025 · 9 comments · May be fixed by #6044

Comments

@cmyr
Contributor

cmyr commented Jan 27, 2025

Per the Unicode 16 version of ScriptExtensions.txt, the following should pass:

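    #[test]
    fn expected_script_thing() {
        // Crate-relative imports, as when the test lives inside the crate itself.
        use crate::props::Script;
        use crate::script::ScriptWithExtensions;
        let scripts = ScriptWithExtensions::new()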
            .get_script_extensions_val('\u{2bc}')
            .iter()
            .collect::<Vec<_>>();
        assert_eq!(
            scripts,
            [
                Script::Bengali,
                Script::Cyrillic,
                Script::Devanagari,
                Script::Latin,
                Script::Lisu,
                Script::Thai,
                Script::Toto
            ]
        );
    }

but we end up with just Script::Common, which would have been expected for Unicode 15 and earlier.

To make this more confusing, if I look at the raw data files in the release-76-1 tag, they do appear up to date. I haven't dug much past that.
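
For anyone who wants to check the raw data by hand, here is a minimal sketch of reading a line in the ScriptExtensions.txt format (`<code point or range> ; <script codes> # <comment>`). The function names `parse_scx_line` and `scx_for` are hypothetical helpers, not part of ICU4X or ICU4C, and the sample line is illustrative, written from the script list above rather than copied from the data file.

```rust
/// Parse one data line in the `ScriptExtensions.txt` format:
/// `<code point or range> ; <script codes> # <comment>`.
/// Returns the inclusive code point range and the script codes.
/// (Hypothetical helper, not an ICU4X API.)
fn parse_scx_line(line: &str) -> Option<(u32, u32, Vec<&str>)> {
    // Drop the trailing comment, then skip blank and comment-only lines.
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }
    let (range, scripts) = data.split_once(';')?;
    let range = range.trim();
    // A range is written `START..END`; a single code point stands alone.
    let (start, end) = match range.split_once("..") {
        Some((s, e)) => (s, e),
        None => (range, range),
    };
    let start = u32::from_str_radix(start, 16).ok()?;
    let end = u32::from_str_radix(end, 16).ok()?;
    Some((start, end, scripts.split_whitespace().collect()))
}

/// Look up the script codes listed for `c`, if any line covers it.
/// (Hypothetical helper, not an ICU4X API.)
fn scx_for<'a>(data: &'a str, c: char) -> Option<Vec<&'a str>> {
    let cp = c as u32;
    data.lines()
        .filter_map(parse_scx_line)
        .find(|(start, end, _)| (*start..=*end).contains(&cp))
        .map(|(_, _, scripts)| scripts)
}

fn main() {
    // Illustrative sample line, not copied from the real file.
    let sample = "# comment line\n\
                  02BC ; Beng Cyrl Deva Latn Lisu Thai Toto # Lm\n";
    println!("{:?}", scx_for(sample, '\u{2bc}'));
}
```

A code point with no matching line has no explicit extensions in the file, so `scx_for` returns `None` for it.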

@Manishearth
Member

cc @robertbastian what's the status of icuexportdata being updated? I thought we were already on Unicode 16.

@robertbastian
Member

All I can tell you is that 2.0.0-beta1 is on ICU release-76-1. Whether that correctly exports Unicode 16 requires me to debug ICU4C, which I'm not familiar with.

@Manishearth
Member

It's supposed to, per the release notes: https://github.com/unicode-org/icu/releases/tag/release-76-1

Confirmed that this reproduces on ICU4X main, and confirmed that the Unicode 16 data has a whole bunch of scx values for low code points that are not present in Unicode 15.

@Manishearth
Member

Trying to build ICU4C to see what's up

@Manishearth
Member

Found the culprit: https://unicode-org.atlassian.net/browse/ICU-21821

That hardcoded table in icuexportdata needs to be updated

cc @sffc @echeran

@Manishearth
Member

New data in #6044

Confirmed that it passes the following test:

    #[test]
    fn expected_script_thing() {
        use crate::props::Script;
        use crate::script::ScriptWithExtensions;
        let scripts = ScriptWithExtensions::new()
            .get_script_extensions_val('\u{2bc}')
            .iter()
            .collect::<Vec<_>>();
        assert_eq!(
            scripts,
            [
                Script::Bengali,
                Script::Cyrillic,
                Script::Devanagari,
                Script::Latin,
                Script::Thai,
                Script::Lisu,
                Script::Toto
            ]
        );
    }

@robertbastian
Member

Linking #4602

@Manishearth
Member

It seems you have a workaround for this for now. We have fixed the ICU4C data export for this and could do the work for a patch release, but @sffc and I would prefer to wait for ICU4X 2.0.0-beta2, which should happen in the next few weeks, rather than doing a transient patch release.

@cmyr
Contributor Author

cmyr commented Feb 3, 2025

A few weeks is fine, thanks for tracking this down!


3 participants