Add back character sets that had characters outside 16 bit plane #1964

rmkaplan · 2025-01-13T17:34:54Z

Some of the mappings had Unicodes outside of 16 bits, those character sets had been excluded before.

Now the character sets are included, but those particular lines in the mapping file (e.g. the Gothic characters in the Runic-Gothic character set) have been commented out, so that the other characters can be included.

Update title line

rmkaplan · 2025-01-18T07:27:43Z

Based on @hjellinek suggestion, I did some timing tests comparing the original table format with formats that uses the same top-level array but use either a hashtable or a digital search for the second level. The hash and digital search were both better than what I had before, so I simplified the code to the hash. (I might eventually go to the digital search, but I would first have to move by MULTI-ALIST macros over to Lispusers).

I also added a new format, :UTF-8-SLUG just like LUTF-8 except that its OUTCHARFN produces the unicode slug for codes whose mappings were not found in the table-files.

And new functions XTOUCODE? and UTOXCODE? that return the corresponding mapping for codes in the table-files, NIL otherwise.

If multiple XCCS codes map to the same Unicode, the normal UNICODE.TRANSLATE (and XTOUCODE) will return the lowest Unicode. But XTOUCODE? will return the list--caller has to decide what to do. Alternatives in the inverse direction behave in the same way.

Note that callers of UNICODE.TRANSLATE must be recompiled.

Please test this functionality/interface. Also I hope that the previously reported performance issues have been fixed.

rmkaplan · 2025-01-18T20:52:51Z

I did some timing using only a single global hash array for all the characters, that is at least as fast, maybe faster, then doing an initial array branch into smaller hash arrays. And simpler still.

hjellinek · 2025-01-19T19:22:35Z

I did some timing using only a single global hash array for all the characters, that is at least as fast, maybe faster, then doing an initial array branch into smaller hash arrays. And simpler still.

Thanks, @rmkaplan, for the new functionality. The increased speed is a bonus. I'm glad my suggestion worked out so well.

rmkaplan · 2025-01-20T02:57:02Z

I did some more careful speed testing with mapping tables that contained all of the X-to-U pairs, not just the common ones, and with looking up all of the possible codes, not just charset 0. The single-hash was a significant loser with the much larger mappings, by a factor of 6. So I reverted to a top-level branch to hash arrays that contain no more than 128 characters.

The multi-alist is slightly better than the multi-hash for a 512 branching array, but significantly better (~ 25%) with a 1024 branch. But I'll stick with the hash for now.

rmkaplan · 2025-01-20T06:09:45Z

I reworked the UNICODE.TRANSLATE macro so that it could be shared by XTOUCODE and XTOUCODE? etc.

This should not be called directly by functions outside of UNICODE, to avoid dependencies on internal structures. Use the XTOUCODE etc. function interface.

hjellinek · 2025-01-21T00:17:31Z

I'm testing it now. For whatever reason, my htmltest.funny-chars function is able to display characters in more character sets than before, e.g., Arabic and Hebrew work now. Which is good!

I did a spot test with Runic. XCCS defines characters in several Runic variants, and, as I just learned with the help of the new APIs, Unicode seems only to define characters in a single Runic script.

I guessed that there's an invariant such that, given an open output stream STREAM with format set to :UTF-8-SLUG, it is the case that:
for all X such that (XTOUCODE? X) returns NIL, (\OUTCHAR STREAM X) should write the Unicode slug -- REPLACEMENT CHARACTER U+FFFD (�) -- to the output stream STREAM.

However, instead of REPLACEMENT CHARACTER U+FFFD (�) I see U+E000, which is the initial codepoint of the Unicode private use area. Does this mean that the :UTF-8-SLUG format is acting like the :UTF-8 format, adding to the unmapped character table instead of outputting slugs? (EDIT: no, if it were acting like the :UTF-8 format I'd see U+E000, U+E001, U+E002, etc.)

Here's a screenshot from Chrome:

rmkaplan · 2025-01-21T05:44:05Z

On Jan 20, 2025, at 4:17 PM, Herb Jellinek ***@***.***> wrote: I guessed that there's an invariant such that, given an open output stream STREAM with format set to :UTF-8-SLUG, it is the case that: for all X such that (XTOUCODE? X) returns NIL, (\OUTCHAR STREAM X) should write the Unicode slug -- REPLACEMENT CHARACTER U+FFFD (�) -- to the output stream STREAM. However, instead of REPLACEMENT CHARACTER U+FFFD (�) I see U+E000, which is the initial codepoint of the Unicode private use area. Does this mean that the :UTF-8-SLUG format is acting like the :UTF-8 format, adding to the unmapped character table instead of outputting slugs?

Is this the right logic? If an XCODE doesn’t have a true (unfaked map), then call the user outcharfn giving it the slug code, but forcing the RAW flag to T. That suppresses the call to UNICODE.TRANSLATE. (LET ((UCODE (XTOUCODE? XCCSCODE))) (CL:IF UCODE (UTF8.OUTCHARFN STREAM UCODE RAW) (UTF8.OUTCHARFN STREAM (CONSTANT (HEXNUM? "FFFD")) T))) (Do you also want a separate raw-slug format, where the caller passes RAW=T? That would just convert the given code to utf-8 bytes without ever trying to map it.)

hjellinek · 2025-01-21T19:04:27Z

On Jan 20, 2025, at 4:17 PM, Herb Jellinek @.***> wrote: I guessed that there's an invariant such that, given an open output stream STREAM with format set to :UTF-8-SLUG, it is the case that: for all X such that (XTOUCODE? X) returns NIL, (\OUTCHAR STREAM X) should write the Unicode slug -- REPLACEMENT CHARACTER U+FFFD (�) -- to the output stream STREAM. However, instead of REPLACEMENT CHARACTER U+FFFD (�) I see U+E000, which is the initial codepoint of the Unicode private use area. Does this mean that the :UTF-8-SLUG format is acting like the :UTF-8 format, adding to the unmapped character table instead of outputting slugs?
Is this the right logic? If an XCODE doesn’t have a true (unfaked map), then call the user outcharfn giving it the slug code, but forcing the RAW flag to T. That suppresses the call to UNICODE.TRANSLATE. (LET ((UCODE (XTOUCODE? XCCSCODE))) (CL:IF UCODE (UTF8.OUTCHARFN STREAM UCODE RAW) (UTF8.OUTCHARFN STREAM (CONSTANT (HEXNUM? "FFFD")) T)))

Yes, that's the right logic. (I didn't know about HEXNUM?, which could make some of my code a lot easier to read.)

(Do you also want a separate raw-slug format, where the caller passes RAW=T? That would just convert the given code to utf-8 bytes without ever trying to map it.)

Hmm, that's tempting. What would that look like from the client point of view? At the moment OPENHTMLSTREAM opens an underlying output stream (BACKING) with FORMAT :UTH-8-SLUG, and \HTML.OUTCHARFN calls plain old \OUTCHAR to write to BACKING.

How would I need to change my code to work as you describe? Would I open the BACKING stream with a different FORMAT and have my OUTCHARFN call some alternative to \OUTCHAR, like UTF8.OUTCHARFN you mentioned above?

rmkaplan · 2025-01-21T21:09:40Z

I now remember why I set up the raw formats, I was anticipating an improbable future in which we have switched the internal encoding from XCCS to Unicode. So there would be no code translation either in or out, just conversion to and from the proper sequence of file bytes. On that view, there should probably also be a raw slug format.

…

On Jan 21, 2025, at 11:04 AM, Herb Jellinek ***@***.***> wrote: On Jan 20, 2025, at 4:17 PM, Herb Jellinek @.***> wrote: I guessed that there's an invariant such that, given an open output stream STREAM with format set to :UTF-8-SLUG, it is the case that: for all X such that (XTOUCODE? X) returns NIL, (\OUTCHAR STREAM X) should write the Unicode slug -- REPLACEMENT CHARACTER U+FFFD (�) -- to the output stream STREAM. However, instead of REPLACEMENT CHARACTER U+FFFD (�) I see U+E000, which is the initial codepoint of the Unicode private use area. Does this mean that the :UTF-8-SLUG format is acting like the :UTF-8 format, adding to the unmapped character table instead of outputting slugs? Is this the right logic? If an XCODE doesn’t have a true (unfaked map), then call the user outcharfn giving it the slug code, but forcing the RAW flag to T. That suppresses the call to UNICODE.TRANSLATE. (LET ((UCODE (XTOUCODE? XCCSCODE))) (CL:IF UCODE (UTF8.OUTCHARFN STREAM UCODE RAW) (UTF8.OUTCHARFN STREAM (CONSTANT (HEXNUM? "FFFD")) T))) Yes, that's the right logic. (I didn't know about HEXNUM?, which could make some of my code a lot easier to read.) (Do you also want a separate raw-slug format, where the caller passes RAW=T? That would just convert the given code to utf-8 bytes without ever trying to map it.) Hmm, that's tempting. What would that look like from the client point of view? At the moment OPENHTMLSTREAM opens an underlying output stream (BACKING) with FORMAT :UTH-8-SLUG, and \HTML.OUTCHARFN calls plain old \OUTCHAR to write to BACKING. How would I need to change my code to work as you describe? Would I open the BACKING stream with a different FORMAT and have my OUTCHARFN call some alternative to \OUTCHAR, like UTF8.OUTCHARFN you mentioned above? — Reply to this email directly, view it on GitHub <#1964 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQSTUJLZKAYY4CED2X7JUET2L2K5FAVCNFSM6AAAAABVDD6UISVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMMBVGUZDINRQGM>. You are receiving this because you were mentioned.

hjellinek · 2025-01-22T21:59:08Z

I noticed something funny when testing XCCS charset 0xEB, General and Technical Symbols.

My test code writes a single XCCS character to the UTF-8 backing stream and expects to see its Unicode equivalent or the REPLACEMENT CHARACTER, 0xFFFD. And that's the case with all of my other tests until this one.

Writing any of the characters in the range 0xEB21 - 0xEB2B outputs a duplicate Unicode character except for 0xEB27:

Code 0xEB21 = ℙℙ
Code 0xEB22 = ℋℋ
Code 0xEB23 = ℐℐ
Code 0xEB24 = ≋≋
Code 0xEB25 = ⊜⊜
Code 0xEB26 = ℇℇ
Code 0xEB27 = ̲̲
Code 0xEB28 = ‽‽
Code 0xEB29 = ⌘⌘
Code 0xEB2A = �
Code 0xEB2B = ℌℌ

Interestingly, XTOUCODE? returns duplicate results for all of these but 0xEB2A GROUP:

_ (FOR X FROM #xEB21 TO #xEB2B DO (CL:FORMAT T "XCCS 0x~4,,'0x = ~A~%" X
(XTOUCODE? X)))

XCCS 0xEB21 = (8473 8473)
XCCS 0xEB22 = (8459 8459)
XCCS 0xEB23 = (8464 8464)
XCCS 0xEB24 = (8779 8779)
XCCS 0xEB25 = (8860 8860)
XCCS 0xEB26 = (8455 8455)
XCCS 0xEB27 = (818 818)
XCCS 0xEB28 = (8253 8253)
XCCS 0xEB29 = (8984 8984)
XCCS 0xEB2A = NIL
XCCS 0xEB2B = (8460 8460)

hjellinek · 2025-01-22T22:12:52Z

I noticed the same thing partway through a sample of charset 238:

Code 0xEE2F = ℏ
Code 0xEE30 = ≬
Code 0xEE31 = ℀
Code 0xEE32 = �
Code 0xEE33 = ☏☏
Code 0xEE34 = ∶∶
Code 0xEE35 = !!
Code 0xEE36 = ⊦⊦
Code 0xEE37 = ⌯⌯
Code 0xEE38 = ⌰⌰
Code 0xEE39 = ƵƵ
Code 0xEE3A = ⌂⌂

rmkaplan · 2025-01-22T22:14:52Z

I also discovered this last night, there was a missing remove-duplicates on the file names when it was building the UNICODE.MAPPINGS file. I’m putting a duplicate check in the actual code-insert loop.

…

On Jan 22, 2025, at 1:59 PM, Herb Jellinek ***@***.***> wrote: I noticed something funny when testing XCCS charset 0xEB, General and Technical Symbols. My test code writes a single XCCS character to the UTF-8 backing stream and expects to see its Unicode equivalent or the REPLACEMENT CHARACTER, 0xFFFD. And that's the case with all of my other tests until this one. Writing any of the characters in the range 0xEB21 - 0xEB2B outputs a duplicate Unicode character except for 0xEB27: Code 0xEB21 = ℙℙ Code 0xEB22 = ℋℋ Code 0xEB23 = ℐℐ Code 0xEB24 = ≋≋ Code 0xEB25 = ⊜⊜ Code 0xEB26 = ℇℇ Code 0xEB27 = ̲̲ Code 0xEB28 = ‽‽ Code 0xEB29 = ⌘⌘ Code 0xEB2A = � Code 0xEB2B = ℌℌ Interestingly, XTOUCODE? returns duplicate results for all of these but 0xEB2A GROUP: _ (FOR X FROM #xEB21 TO #xEB2B DO (CL:FORMAT T "XCCS 0x~4,,'0x = ~A~%" X (XTOUCODE? X))) XCCS 0xEB21 = (8473 8473) XCCS 0xEB22 = (8459 8459) XCCS 0xEB23 = (8464 8464) XCCS 0xEB24 = (8779 8779) XCCS 0xEB25 = (8860 8860) XCCS 0xEB26 = (8455 8455) XCCS 0xEB27 = (818 818) XCCS 0xEB28 = (8253 8253) XCCS 0xEB29 = (8984 8984) XCCS 0xEB2A = NIL XCCS 0xEB2B = (8460 8460) — Reply to this email directly, view it on GitHub <#1964 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQSTUJOU4KWZRQKOBJK343L2MAIEHAVCNFSM6AAAAABVDD6UISVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMMBYGM2TOMBYGI>. You are receiving this because you were mentioned.

rmkaplan · 2025-01-22T22:40:39Z

I updated the UNICODE-MAPPINGS and INVERTED-UNICODE-MAPPINGS files, to remove duplicate mappings that were sneaking in.

Also, I did another performance test, this time looking up a more representative set of codes--the codes that actually exist in the mappings instead of all codes from 0 to 65535. The array-branch/multiple-hash had an undeserved advantage in my prior test, because the array lookup suppressed the gethash for about 80% of the codes.

On this test, the multiple-hash was 4 times slower than the single hash. So I reverted to the simpler, single-hash implementation. On this pass I also changed the indexing of the all-mappings file to use the standard high-order charset bits and not the 9 bits I had before.

hjellinek · 2025-01-23T00:22:15Z

I've been testing the :UTF-8-SLUG format and, incidentally, the function XTOUCODE?. It's working nicely now, and the double characters have vanished.

I haven't yet gotten to my code that exercises UTOXCODE?. I believe @MattHeffron is planning to work in that area too. So I'm not ready to say this PR is ready to merge, but maybe soon.

(Incidentally, I had thought of XCCS as encoding a far smaller set of characters than Unicode, given the sizes of their respective codespaces, but I'm impressed and surprised by the number of characters in XCCS that seem to have no Unicode equivalent.)

MattHeffron · 2025-01-25T19:20:23Z

I'm not sure what's going on, but it appears that UTOXCODE? is returning values that aren't in any standard charset.
(e.g., Charsets 1-6, 9, 10, 14, 16, 17, and others).
Here's what I'm using to check:

(DEFUN U-TO-XCHARSET ()
   (LET ((CSETS (MAKE-ARRAY 256 :INITIAL-CONTENTS (LOOP :REPEAT 256 :COLLECT (CONS NIL))))
         XCODE CS XCS)
        (LOOP :FOR ENC :FROM 1 :TO 65535 :DO (SETQ XCODE (IL:UTOXCODE? ENC))
              (UNLESS (NULL XCODE)
                  (COND
                     ((AND (INTEGERP XCODE)
                           (<= 0 XCODE 65535))
                      (IL:TCONC (AREF CSETS (IL:LRSH XCODE 8))
                             ENC))
                     ((LISTP XCODE)
                      (LOOP :FOR XC :IN XCODE :UNLESS (MEMBER (SETQ CS (IL:LRSH XC 8))
                                                             XCS)
                            :DO
                            (PUSH CS XCS)
                            (IL:TCONC (AREF CSETS CS)
                                   ENC))))))
        (LOOP :FOR I :FROM 0 :TO 255 :DO (SETF (AREF CSETS I)
                                               (CAR (AREF CSETS I))))
        CSETS))

There are Unicode encoding values in unexpected places.
(Of course, this may be a totally incorrect way to check this, but it seems the most straightforward to me.)

Edit: I'm using rmk55 as of 2025-01-22 14:31:07.

rmkaplan · 2025-01-25T19:43:12Z

If you have found a unicode for which UTOXCODE? has returned what looks like a bad XCODE, can you go to a fresh sysout and apply UTOXCODE? again to that unicode?

MattHeffron · 2025-01-25T23:58:47Z

If you have found a unicode for which UTOXCODE? has returned what looks like a bad XCODE, can you go to a fresh sysout and apply UTOXCODE? again to that unicode?

OK, in a fresh sysout, (UTOXCODE? #O306) returns 225, but my looping code gives

Unicode: #o306 => #o1406 :: (#o6) in CS #o3

Likewise, in the fresh sysout, (UTOXCODE? #O272) returns 235, looping gives

Unicode: #o272 => #o20035 :: (#o35) in CS #o40

(I revised my code to print all mappings as they occur, and I capture that with IL:DRIBBLE.)

(DEFUN U-TO-XCHARSET2 (&OPTIONAL AS-LIST)             (IL:* IL:\; "Edited 25-Jan-2025 15:30 by mth")
   (LET ((CSETS (MAKE-ARRAY 256))
         XCODE)
        (LOOP :FOR UC :FROM 1 :TO 65535 :DO 
              (SETQ XCODE (IL:UTOXCODE? UC))
              (UNLESS (NULL XCODE)
                  (COND
                     ((AND (INTEGERP XCODE)
                           (<= 0 XCODE 65535))
                      (FORMAT T "Unicode: #o~O => #o~O :: (#o~O) in CS #o~O~%" UC XCODE
                             (LOGAND XCODE 255)
                             (IL:LRSH XCODE 8))
                      (PUSH (CONS UC XCODE)
                            (AREF CSETS (IL:LRSH XCODE 8))))
                     ((LISTP XCODE)
                      (LOOP :FOR XC :IN XCODE :DO 
                            (FORMAT T "Unicode: #o~O => #o~O :: (#o~O) in CS #o~O~%"
                                UC XC (LOGAND XC 255)
                                (IL:LRSH XC 8))
                            (PUSH (CONS UC XC)
                                  (AREF CSETS (IL:LRSH XC 8))))))))
        (LOOP :FOR I :FROM 0 :TO 255 :DO 
              (SETF (AREF CSETS I) (REVERSE (AREF CSETS I))))
        (IF AS-LIST
            (SETQ CSETS (LOOP :FOR I :FROM 0 :TO 255 :NCONC 
                              (LET ((CS (AREF CSETS I)))
                                   (WHEN CS
                                         (LIST (LIST I CS)))))))
        CSETS))

rmkaplan · 2025-01-26T00:26:14Z

In the mapping tables there is a correspondence between X code x2336 (= 9014) and U code x0306 (= 774). Maybe there is a confusion between hex and octal?

…

On Jan 25, 2025, at 3:59 PM, Matt Heffron ***@***.***> wrote: If you have found a unicode for which UTOXCODE? has returned what looks like a bad XCODE, can you go to a fresh sysout and apply UTOXCODE? again to that unicode? OK, in a fresh sysout, (UTOXCODE? #O306) returns 225, but my looping code gives Unicode: #o306 => #o1406 :: (#o6) in CS #o3 Likewise, in the fresh sysout, (UTOXCODE? #O272) returns 235, looping gives Unicode: #o272 => #o20035 :: (#o35) in CS #o40 (I revised my code to print all mappings as they occur, and I capture that with IL:DRIBBLE.) (DEFUN U-TO-XCHARSET2 (&OPTIONAL AS-LIST) (IL:* IL:\; "Edited 25-Jan-2025 15:30 by mth") (LET ((CSETS (MAKE-ARRAY 256)) XCODE) (LOOP :FOR UC :FROM 1 :TO 65535 :DO (SETQ XCODE (IL:UTOXCODE? UC)) (UNLESS (NULL XCODE) (COND ((AND (INTEGERP XCODE) (<= 0 XCODE 65535)) (FORMAT T "Unicode: #o~O => #o~O :: (#o~O) in CS #o~O~%" UC XCODE (LOGAND XCODE 255) (IL:LRSH XCODE 8)) (PUSH (CONS UC XCODE) (AREF CSETS (IL:LRSH XCODE 8)))) ((LISTP XCODE) (LOOP :FOR XC :IN XCODE :DO (FORMAT T "Unicode: #o~O => #o~O :: (#o~O) in CS #o~O~%" UC XC (LOGAND XC 255) (IL:LRSH XC 8)) (PUSH (CONS UC XC) (AREF CSETS (IL:LRSH XC 8)))))))) (LOOP :FOR I :FROM 0 :TO 255 :DO (SETF (AREF CSETS I) (REVERSE (AREF CSETS I)))) (IF AS-LIST (SETQ CSETS (LOOP :FOR I :FROM 0 :TO 255 :NCONC (LET ((CS (AREF CSETS I))) (WHEN CS (LIST (LIST I CS))))))) CSETS)) — Reply to this email directly, view it on GitHub <#1964 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQSTUJOSH5XX5PPVMBPBR5D2MQQM3AVCNFSM6AAAAABVDD6UISVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMMJUGE2DANZSGI>. You are receiving this because you were mentioned.

MattHeffron · 2025-01-26T00:45:22Z

In the mapping tables there is a correspondence between X code x2336 (= 9014) and U code x0306 (= 774). Maybe there is a confusion between hex and octal?

Numbers are numbers internally. My code is just displaying in octal; radix doesn't matter.
I just rebuilt the loadups again for rmk55.
I entered (UTOXCODE? 198) and got 225. Likewise for 186 got 235.
Then (in XCL Exec) I ran the

(PROGN (DRIBBLE "CSets2-rmk55.txt")(SETQ CSETS2 (U-TO-XCHARSET2 T))(DRIBBLE))

After that I repeated (UTOXCODE? 198) and got (774). (UTOXCODE? 186) got (8221).
Notice that this time the singleton values were returned in lists (of 1 item)!

So, it appears that the Unicode to XCCS table is getting corrupted pretty early when probing all 16-bit Unicode values. (Or the mapping files have errors that somehow clobber the initial state of the table.)

rmkaplan · 2025-01-26T02:30:48Z

In a simple test (collect all the values for ucodes from 0 to 255 twice and compare the differences) it looks like a small number of values are showing up as (CONS X) instead of just X the second time. But the actual code numbers are the same.

…

On Jan 25, 2025, at 4:45 PM, Matt Heffron ***@***.***> wrote: In the mapping tables there is a correspondence between X code x2336 (= 9014) and U code x0306 (= 774). Maybe there is a confusion between hex and octal? Numbers are numbers internally. My code is just displaying in octal; radix doesn't matter. I just rebuilt the loadups again for rmk55. I entered (UTOXCODE? 198) and got 225. Likewise for 186 got 235. Then (in XCL Exec) I ran the (PROGN (DRIBBLE "CSets2-rmk55.txt")(SETQ CSETS2 (U-TO-XCHARSET2 T))(DRIBBLE)) After that I repeated (UTOXCODE? 198) and got (774). (UTOXCODE? 186) got (8221). Notice that this time the singleton values were returned in lists (of 1 item)! So, it appears that the Unicode to XCCS table is getting corrupted pretty early when probing all 16-bit Unicode values. (Or the mapping files have errors that somehow clobber the initial state of the table.) — Reply to this email directly, view it on GitHub <#1964 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQSTUJNT5VSIAHJ2DLVRWO32MQV3RAVCNFSM6AAAAABVDD6UISVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMMJUGE2TIMRXHE>. You are receiving this because you were mentioned.

MattHeffron · 2025-01-26T04:33:01Z

In a simple test (collect all the values for ucodes from 0 to 255 twice and compare the differences) it looks like a small number of values are showing up as (CONS X) instead of just X the second time. But the actual code numbers are the same.

That's not what I'm seeing.
In a fresh sysout, IL:UTOXCODE? of 198 and 186 return 225 and 235, respectively.
If I then do a version of U-TO-XCHARSET2 (above) (i.e., U-TO-XCHARSET3 below) that probes only 0 through 255, then repeating the IL:UTOXCODE? of 198 and 186 now return (774) and (8221), respectively.
So, something in that first block of Unicode values clobbers the table.

(DEFUN U-TO-XCHARSET3 (OUTFILEPATH &AUX XCODE)        (IL:* IL:\; "Edited 25-Jan-2025 20:04 by mth")
   (WITH-OPEN-STREAM (OUT (OPEN OUTFILEPATH :DIRECTION :OUTPUT :IF-EXISTS :NEW-VERSION))
          (LOOP :FOR UC :FROM 0 :TO 255 :NCONC
                (UNLESS (NULL (SETQ XCODE (IL:UTOXCODE? UC)))
                    (SETQ XCODE (IL:MKLIST XCODE))
                    (LOOP :FOR XC :IN XCODE :COLLECT (PROGN (FORMAT OUT 
                                                          "Unicode: U+~4,'0X (~D) => #x~4,'0X (~D)~%"
                                                                   UC UC XC XC)
                                                            (CONS UC XC)))))))

Called as (U-TO-XCHARSET3 "UTEST.TXT").
Here is UTEST.TXT

MattHeffron · 2025-01-26T05:18:16Z

So, I wrote this to find which Unicode value passed to IL:UTOXCODE? causes things to go sideways.

(DEFUN U-TO-XCHARSET4 (TEST-PAIRS &AUX XCODE FAILED)  (IL:* IL:\; "Edited 25-Jan-2025 21:02 by mth")
   (LOOP :FOR UC :FROM 0 :TO 255 :DO
       (UNLESS (NULL (SETQ XCODE (IL:UTOXCODE? UC)))
           (UNLESS (OR FAILED (LOOP :FOR TP :IN TEST-PAIRS :ALWAYS
                                    (EQUAL (IL:UTOXCODE? (CAR TP))
                                           (CDR TP))))
               (FORMAT T 
                    "Test fails after probing Unicode: U+~4,'0X (~D)~%"
                      UC UC)
               (SETQ FAILED T))
           (SETQ XCODE (IL:MKLIST XCODE))
           (LOOP :FOR XC :IN XCODE :COLLECT
                 (PROGN (FORMAT T 
                            "Unicode: U+~4,'0X (~D) => #x~4,'0X (~D)~%"
                               UC UC XC XC)
                        (CONS UC XC))))))

And called it as: (U-TO-XCHARSET4 '((198 . 225) (186 . 235)))
The failure was reported:
Test fails after probing Unicode: U+00A0 (160)
It may fail before that if something earlier corrupts one of the values not in the TEST-PAIRS list.

nbriggs · 2025-01-26T05:23:38Z

Could we be seeing unhandled/unexpected hash table collisions?

rmkaplan · 2025-01-26T08:19:07Z

Below is my simple test function. It returns a list of 132 mismatches, basically one each for the ascii codes plus a few others. Most of the discrepancies have the same values, except that a CONS is returned on the second pass. But for a few a value got added during the first pass that showed up on the second.

I have an inkling of some of what's going on, but I have to look further. The tables are initialized with a default collection of XCCS character sets with their mappings to Unicode. So in character set 0 XCCS code x0063 (= 99 = c) maps to 99, as would be expected for Ascii. And Unicode 99 maps back to XCCS 99, if only character set 0 is involved.

However, XCCS character xE2D6 (=58072 in character-set 343, IPA) also has x0063 has its corresponding Unicode. But XCCS character-set 343 isn't loaded in the initial set, and it's only when you have later asked for characters in 343 that that mapping gets installed. After that, when you ask for the XCCS codes corresponding to Unicode x63 (99), you get both 99 and 58072.

So I think that the problem of seeing one value first and 2 values later is because the initial inverted mapping is still not correct. I don't yet know why the other ascii entries get an extra CONS the second time.

(LAMBDA NIL
(LET (VAL1 VAL2)
(SETQ VAL1 (for U from 0 to 65535 collect (LIST U (UTOXCODE? U))))
(SETQ VAL2 (for U from 0 to 65535 collect (LIST U (UTOXCODE? U))))
(for V1 in VAL1 as V2 in VAL2
unless (EQUAL V1 V2)
collect
(CL:UNLESS (EQ (CAR V1)
(CAR V2))
(HELP "UCODE mismatch"))
(LIST (CAR V1)
(CADR V1)
(CADR V2)))))

hjellinek · 2025-01-26T17:37:48Z

Last week I wrote a short function that goes in the other direction, calling XTOUCODE? to create a static mapping table I can use in JavaScript. I noticed that its behavior seemed different after its first run, but I attributed that to my not paying close enough attention to the results of the first run. Specifically, certain calls on the first run had returned SMALLP results, I thought, but on subsequent runs those same SMALLP values were wrapped in a list as the only element - (SMALLP). (I may have that backwards.)

I thought nothing of it - I just changed my code to adapt - but reading @MattHeffron's UTOXCODE? results make me think I was seeing something similar.

rmkaplan · 2025-01-27T03:44:50Z

It should now be the case that XTOUCODE and UTOXCODE always return SMALLP characters (possibly faked), and XTOUCODE? and UTOXCODE? return SMALLP's for singletons, lists for alternatives, and NIL if nonexistent.

I still haven't worked out the back-and-forth logic for keeping tables in both directions complete and consistent for incremental on-demand updates. So this version creates the tables on load up for all possible character sets (including Japanese) instead of the much smaller number of default sets. So instead of hash arrays of size about 1.5K, they are about 12K, most of which would never be used.

But I hope this now gives the behavior you expect.

hjellinek · 2025-01-27T23:17:01Z

I pulled the latest changes to this PR and built a new loadup. I performed two tests.

(1) I wrote a quickie function that applies XTOUCODE? to every possible 16-bit integer and records the charsets of the valid XCCS codes. It returns a list of 105 character sets:

(0 33 34 35 36 37 38 39 40 41 42 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106
107 108 109 110 111 112 113 114 115 116 117 118 224 225 226 227 228 229 230 231 232 235 236 237 238 239 240 241
242 243 244 245 253)

(2) I regenerated my static JavaScript XCCS-to-Unicode mapping table. It's byte-for-byte identical to the one I created prior to pulling the latest changes.

rmkaplan · 2025-01-28T00:38:22Z

105 is correct. I had an extra file name, because I didn’t ask for only the highest version. I’ll fix that and update again.

…

On Jan 27, 2025, at 3:17 PM, Herb Jellinek ***@***.***> wrote: I pulled the latest changes to this PR and built a new loadup. I performed two tests. (1) I wrote a quickie function that applies XTOUCODE? to every possible 16-bit integer and records the charsets of the valid XCCS codes. It returns a list of 105 character sets: (0 33 34 35 36 37 38 39 40 41 42 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 224 225 226 227 228 229 230 231 232 235 236 237 238 239 240 241 242 243 244 245 253) (2) I regenerated my static JavaScript XCCS-to-Unicode mapping table. It's byte-for-byte identical to the one I created prior to pulling the latest changes. — Reply to this email directly, view it on GitHub <#1964 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQSTUJPPULXLM4JLKYKUXEL2M25AHAVCNFSM6AAAAABVDD6UISVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMMJXGEYDIMRYGA>. You are receiving this because you were mentioned.

hjellinek · 2025-01-28T19:05:01Z

A minor correction to my previous comment:

(2) I regenerated my static JavaScript XCCS-to-Unicode mapping table and found a small block of changes in the Unicode characters that correspond to the XCCS codes from 0x7521 through 0x7526.

t is a table that maps XCCS codes to Unicode.

Previous

// t[XCCS] = Unicode;
t[0x7521] = 0x5B57;
t[0x7522] = 0x69C7;
t[0x7523] = 0x9059;
t[0x7524] = 0x7464;
t[0x7525] = 0x8655;
t[0x7526] = 0x76F8;

Now

// t[XCCS] = Unicode;
t[0x7521] = 0x582F;
t[0x7522] = 0x600E;
t[0x7523] = 0x5FEB;
t[0x7524] = 0x5E2B;
t[0x7525] = 0x51DC;
t[0x7526] = 0x7199;

rmkaplan · 2025-01-28T19:41:56Z

Interesting. It turns out that those XCCS codes appear in 2 Japanese character sets, 164 and 165, with different corresponding unicodes. Hard to say whether that is an error in the tables (in which case, which is correct?), or whether the claim is false that the tables in the X-to-U direction are functional (in which case that assumption should be removed from the code so the lookup would give you a list of 2 alternative unicodes). We would need more Japanese expertise to figure this out. In the meantime, I think whatever mapping the current code picks out is good enough.

…

On Jan 28, 2025, at 11:05 AM, Herb Jellinek ***@***.***> wrote: A minor correction to my previous comment <#1964 (comment)>: (2) I regenerated my static JavaScript XCCS-to-Unicode mapping table and found a small block of changes in the Unicode characters that correspond to the XCCS codes from 0x7521 through 0x7526. t is a table that maps XCCS codes to Unicode. Previous // t[XCCS] = Unicode; t[0x7521] = 0x5B57; t[0x7522] = 0x69C7; t[0x7523] = 0x9059; t[0x7524] = 0x7464; t[0x7525] = 0x8655; t[0x7526] = 0x76F8; Now // t[XCCS] = Unicode; t[0x7521] = 0x582F; t[0x7522] = 0x600E; t[0x7523] = 0x5FEB; t[0x7524] = 0x5E2B; t[0x7525] = 0x51DC; t[0x7526] = 0x7199; — Reply to this email directly, view it on GitHub <#1964 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQSTUJOEPYPVLCBQN77ZVPD2M7IHJAVCNFSM6AAAAABVDD6UISVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMMJZHAZTQNJTGI>. You are receiving this because you were mentioned.

hjellinek · 2025-01-28T20:03:40Z

Interesting. It turns out that those XCCS codes appear in 2 Japanese character sets, 164 and 165, with different corresponding unicodes. Hard to say whether that is an error in the tables (in which case, which is correct?), or whether the claim is false that the tables in the X-to-U direction are functional (in which case that assumption should be removed from the code so the lookup would give you a list of 2 alternative unicodes). We would need more Japanese expertise to figure this out. In the meantime, I think whatever mapping the current code picks out is good enough.

"I think whatever mapping the current code picks out is good enough." I agree.

It would be cool to be able to ask the creators of XCCS questions like these. The evolution of XCCS into Unicode is a chunk of CS history that may be in danger of being lost.

rmkaplan · 2025-01-28T22:12:34Z

I poked around a bit more. I think that the table we have for charset 164 is goofy, because it includes mappings whose XCCS codes are actually in 165. That can’t possibly be correct. I don’t remember the provenance of all the Japanese tables, Peter Cravens filled in a lot in the last round. I can probably clean up some of this by going back and forth between the images in the XCCS document and the images in my big Unicode book. Maybe there was just a wholesale translation. But for sure the 165 mappings don’t belong in 164.

…

On Jan 28, 2025, at 12:04 PM, Herb Jellinek ***@***.***> wrote: Interesting. It turns out that those XCCS codes appear in 2 Japanese character sets, 164 and 165, with different corresponding unicodes. Hard to say whether that is an error in the tables (in which case, which is correct?), or whether the claim is false that the tables in the X-to-U direction are functional (in which case that assumption should be removed from the code so the lookup would give you a list of 2 alternative unicodes). We would need more Japanese expertise to figure this out. In the meantime, I think whatever mapping the current code picks out is good enough. "I think whatever mapping the current code picks out is good enough." I agree. It would be cool to be able to ask the creators of XCCS questions like these. The evolution of XCCS into Unicode is a chunk of CS history that may be in danger of being lost. — Reply to this email directly, view it on GitHub <#1964 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQSTUJM7UGSCIQ5Y2C32QJT2M7PDHAVCNFSM6AAAAABVDD6UISVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMMJZHE2DSMBYGA>. You are receiving this because you were mentioned.

rmkaplan added 7 commits January 13, 2025 09:31

Add back character sets that had characters outside 16 bit plane

c305cbe

Update XCCS-353=SYMBOLS3.TXT

4e6d8dd

Update title line

Update UNICODE.TEDIT

11ac36f

Merge branch 'master' into rmk55--Add-Unicode-character-sets

98ea51d

Fix charset names

6ec2c35

Merge branch 'master' into rmk55--Add-Unicode-character-sets

870c68e

Reorganized the tables, added requested interfaces

a256d0d

rmkaplan mentioned this pull request Jan 18, 2025

Unmapped XCCS and Unicode characters #1971

Open

Use a single hash

fbaeb35

rmkaplan added 2 commits January 19, 2025 11:32

Merge branch 'master' into rmk55--Add-Unicode-character-sets

0e5c9a1

Top-level array branch beats a single hash

7f8c57a

cleanup UNICODE.TRANSLATE macro

3e276eb

Merge branch 'master' into rmk55--Add-Unicode-character-sets

751bc94

rmkaplan added 2 commits January 20, 2025 21:46

Fix slug in outcharfn

0ffa3f3

Remove a stray line

5fd0b39

Another try, would work for raw

4a9f0c2

rmkaplan added 2 commits January 22, 2025 14:31

Remove duplicates, redo hashing

8ed8151

Merge branch 'master' into rmk55--Add-Unicode-character-sets

28821b9

rmkaplan added 2 commits January 26, 2025 19:25

Getting complete maps in both directions

95e39f7

Initializing

d990b46

Only the latest file versions

8b6fc45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add back character sets that had characters outside 16 bit plane #1964

Add back character sets that had characters outside 16 bit plane #1964

rmkaplan commented Jan 13, 2025

rmkaplan commented Jan 18, 2025

rmkaplan commented Jan 18, 2025

hjellinek commented Jan 19, 2025

rmkaplan commented Jan 20, 2025

rmkaplan commented Jan 20, 2025

hjellinek commented Jan 21, 2025 •

edited

Loading

rmkaplan commented Jan 21, 2025 via email

hjellinek commented Jan 21, 2025

rmkaplan commented Jan 21, 2025 via email

hjellinek commented Jan 22, 2025

hjellinek commented Jan 22, 2025

rmkaplan commented Jan 22, 2025 via email

rmkaplan commented Jan 22, 2025

hjellinek commented Jan 23, 2025

MattHeffron commented Jan 25, 2025 •

edited

Loading

rmkaplan commented Jan 25, 2025

MattHeffron commented Jan 25, 2025

rmkaplan commented Jan 26, 2025 via email

MattHeffron commented Jan 26, 2025

rmkaplan commented Jan 26, 2025 via email

MattHeffron commented Jan 26, 2025

MattHeffron commented Jan 26, 2025

nbriggs commented Jan 26, 2025

rmkaplan commented Jan 26, 2025

hjellinek commented Jan 26, 2025 •

edited

Loading

rmkaplan commented Jan 27, 2025

hjellinek commented Jan 27, 2025

rmkaplan commented Jan 28, 2025 via email

hjellinek commented Jan 28, 2025

rmkaplan commented Jan 28, 2025 via email

hjellinek commented Jan 28, 2025

rmkaplan commented Jan 28, 2025 via email

Add back character sets that had characters outside 16 bit plane #1964

Are you sure you want to change the base?

Add back character sets that had characters outside 16 bit plane #1964

Conversation

rmkaplan commented Jan 13, 2025

rmkaplan commented Jan 18, 2025

rmkaplan commented Jan 18, 2025

hjellinek commented Jan 19, 2025

rmkaplan commented Jan 20, 2025

rmkaplan commented Jan 20, 2025

hjellinek commented Jan 21, 2025 • edited Loading

rmkaplan commented Jan 21, 2025 via email

hjellinek commented Jan 21, 2025

rmkaplan commented Jan 21, 2025 via email

hjellinek commented Jan 22, 2025

hjellinek commented Jan 22, 2025

rmkaplan commented Jan 22, 2025 via email

rmkaplan commented Jan 22, 2025

hjellinek commented Jan 23, 2025

MattHeffron commented Jan 25, 2025 • edited Loading

rmkaplan commented Jan 25, 2025

MattHeffron commented Jan 25, 2025

rmkaplan commented Jan 26, 2025 via email

MattHeffron commented Jan 26, 2025

rmkaplan commented Jan 26, 2025 via email

MattHeffron commented Jan 26, 2025

MattHeffron commented Jan 26, 2025

nbriggs commented Jan 26, 2025

rmkaplan commented Jan 26, 2025

hjellinek commented Jan 26, 2025 • edited Loading

rmkaplan commented Jan 27, 2025

hjellinek commented Jan 27, 2025

rmkaplan commented Jan 28, 2025 via email

hjellinek commented Jan 28, 2025

rmkaplan commented Jan 28, 2025 via email

hjellinek commented Jan 28, 2025

rmkaplan commented Jan 28, 2025 via email

hjellinek commented Jan 21, 2025 •

edited

Loading

MattHeffron commented Jan 25, 2025 •

edited

Loading

hjellinek commented Jan 26, 2025 •

edited

Loading