String performance #5

walterdejong · 2016-10-19T08:46:23Z

The performance of the String class is rather poor because its methods call utf8_decode() all the time. This is a consequence of the design decision to store the String as UTF-8 internally while presenting it as a string of characters rather than bytes.
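To illustrate why character-based access on UTF-8 data is expensive, here is a minimal sketch in Python 3. The function name utf8_char_at is hypothetical and this is not the library's utf8_decode(); it only shows that finding the n-th character means decoding every sequence before it, while a byte lookup is a direct index.

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Return the index-th code point of UTF-8 data by scanning from the start.

    Hypothetical helper for illustration only; it does not validate
    continuation bytes beyond what bytes.decode() checks.
    """
    if index < 0:
        raise IndexError("negative index not supported in this sketch")
    pos = 0   # current byte offset
    seen = 0  # number of code points skipped so far
    while pos < len(data):
        lead = data[pos]
        # The lead byte determines how many bytes the sequence occupies.
        if lead < 0x80:
            width = 1
        elif lead >> 5 == 0b110:
            width = 2
        elif lead >> 4 == 0b1110:
            width = 3
        elif lead >> 3 == 0b11110:
            width = 4
        else:
            raise ValueError("invalid UTF-8 lead byte")
        if seen == index:
            return data[pos:pos + width].decode("utf-8")
        pos += width
        seen += 1
    raise IndexError("character index out of range")


text = "普通话/普通話".encode("utf-8")
print(utf8_char_at(text, 3))  # walks over three multi-byte sequences first
print(text[3])                # plain byte access, no decoding needed
```

The cost of the character lookup grows with the index, so a loop that indexes every character this way does quadratic work in total.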

It's probably better to have both a UTF-8 byte-string String class and a UTF-32 String32 or uString class, and let the programmer decide which one to use.
For example, like in Python 2:

>>> s = '普通话/普通話'
>>> s
'\xe6\x99\xae\xe9\x80\x9a\xe8\xaf\x9d/\xe6\x99\xae\xe9\x80\x9a\xe8\xa9\xb1'
>>> len(s)
19
>>> s[0]
'\xe6'
>>> s[1]
'\x99'
>>> s[2]
'\xae'

>>> us = u'普通话/普通話'
>>> len(us)
7
>>> us[0]
u'\u666e'

(This example demonstrates the behavior of len and operator[].)
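For readers on Python 3, where the two types are spelled bytes and str, the same distinction might be sketched as follows. The variable names mirror the session above; note that indexing a bytes object in Python 3 yields an integer rather than a one-byte string.

```python
text = "普通话/普通話"

s = text.encode("utf-8")  # byte string: length and indexing count bytes
us = text                 # character string: length and indexing count code points

print(len(s))             # number of UTF-8 bytes
print(hex(s[0]))          # first byte, as an integer
print(len(us))            # number of characters
print(us[0] == "\u666e")  # first character is U+666E
```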

Note that changing the design of String is a major change that would break backwards compatibility.



walterdejong · 2016-10-22T15:51:47Z

Commit 05b020b presumably fixes this. It changes the String class as described.
There is no String32 class as of yet.

walterdejong added a commit that referenced this issue Oct 22, 2016

change design of String to a regular byte-string

05b020b

This fixes performance issue mentioned in github issue #5
