refine the Unicode Cheat Sheet #90

Merged
merged 6 commits into from
Dec 10, 2018
235 changes: 123 additions & 112 deletions docs/notes/python-unicode.rst
Expand Up @@ -6,48 +6,29 @@
Unicode
=======

The main goal of this cheat sheet is to collect some common snippets which are
related to Unicode. In Python 3, strings are represented by Unicode instead of
bytes. Further information can be found in PEP `3100 <https://www.python.org/dev/peps/pep-3100>`_.

**ASCII** is the best-known standard that assigns numeric codes to
characters. It originally defined only 128 values, so ASCII covers just
control codes, digits, lowercase letters, uppercase letters, and so on.
That is not enough to represent the accented characters, Chinese characters,
or emoji used around the world. **Unicode** was developed to solve this
problem. Like ASCII, it assigns each character a numeric *code point*, but
it has room for up to 1,111,998 characters.

.. contents:: Table of Contents
:backlinks: none

String
------

Encode: unicode code point to bytes
------------------------------------

.. code-block:: python

>>> s = u'Café'
>>> type(s.encode('utf-8'))
<class 'bytes'>

Decode: bytes to unicode code point
------------------------------------

.. code-block:: python

>>> s = bytes('Café', encoding='utf-8')
>>> s.decode('utf-8')
'Café'

Get unicode code point
-----------------------

.. code-block:: python

>>> s = u'Café'
>>> for _c in s: print('U+%04x' % ord(_c))
...
U+0043
U+0061
U+0066
U+00e9
>>> u = '中文'
>>> for _c in u: print('U+%04x' % ord(_c))
...
U+4e2d
U+6587

python2 ``str`` is equivalent to byte string
---------------------------------------------
In Python 2, strings are represented in *bytes*, not *Unicode*. Python provides
several kinds of string literal, such as Unicode strings and raw strings.
To declare a Unicode string, add the ``u`` prefix to the string literal.

.. code-block:: python

Expand All @@ -62,9 +43,12 @@ python2 ``str`` is equivalent to byte string
>>> type(u)
<type 'unicode'>


python3 ``str`` is equivalent to unicode string
-------------------------------------------------
In Python 3, strings are represented in *Unicode*. To represent a byte
string, add the ``b`` prefix to the string literal. Note that early Python 3
versions (3.0-3.2) do not support the ``u`` prefix. To ease the migration of
Unicode-aware applications from Python 2, Python 3.3 supports the ``u``
prefix for string literals again. Further information can be found in
PEP `414 <https://www.python.org/dev/peps/pep-0414>`_.

.. code-block:: python

Expand All @@ -78,9 +62,13 @@ python3 ``str`` is equivalent to unicode string
>>> s.encode('utf-8').decode('utf-8')
'Café'
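
As a minimal sketch of the prefixes described above, assuming Python 3.3 or newer, the following snippet shows that an unprefixed literal and a ``u`` literal are the same ``str``, while a ``b`` literal is ``bytes``. The variable names are only illustrative.

```python
# Assumes Python 3.3+, where the u prefix is accepted again (PEP 414).
text = 'Café'          # str: a sequence of Unicode code points
legacy = u'Café'       # same type and value as the unprefixed literal
raw = b'Caf\xc3\xa9'   # bytes: the UTF-8 encoding of 'Café'

print(type(text).__name__, type(raw).__name__)
print(text == legacy)
print(raw.decode('utf-8') == text)
```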

Characters
----------

python2 take ``str`` char as byte character
--------------------------------------------
Python 2 treats every character of a ``str`` as a byte, so the length of a
string may not equal the number of characters. For example, the length of
``Café`` is 5, not 4, because ``é`` is encoded as a 2-byte character.

.. code-block:: python

Expand All @@ -95,8 +83,8 @@ python2 take ``str`` char as byte character
>>> len(s)
4

python3 take ``str`` char as unicode character
-----------------------------------------------
Python 3 treats every character of a ``str`` as a Unicode code point. The
length of a string is always equal to the number of code points it contains.

.. code-block:: python

Expand All @@ -109,11 +97,62 @@ python3 take ``str`` char as unicode character
>>> print(bs)
b'Caf\xc3\xa9'
>>> len(bs)
5
5
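
The difference between counting code points and counting bytes grows for characters outside the Basic Multilingual Plane. The sketch below uses an emoji chosen only for illustration and assumes UTF-8 as the encoding.

```python
face = '\U0001F600'               # GRINNING FACE, written as an escape
encoded = face.encode('utf-8')    # UTF-8 needs four bytes for this code point

str_len = len(face)       # len() on str counts code points
bytes_len = len(encoded)  # len() on bytes counts bytes
print(str_len, bytes_len)
```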

Unicode Code Point
------------------

unicode normalization
----------------------
`ord <https://docs.python.org/3/library/functions.html#ord>`_ is a built-in
function that returns the Unicode code point of a given character. To check
the code point of a character, we can use ``ord``.

.. code-block:: python

>>> s = u'Café'
>>> for _c in s: print('U+%04x' % ord(_c))
...
U+0043
U+0061
U+0066
U+00e9
>>> u = '中文'
>>> for _c in u: print('U+%04x' % ord(_c))
...
U+4e2d
U+6587
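
The built-in ``chr`` is the inverse of ``ord``. The following sketch round-trips a string through its code point labels; the helper names ``to_code_points`` and ``from_code_points`` are hypothetical and exist only for this example.

```python
def to_code_points(text):
    """Return the 'U+XXXX' label of every code point in text."""
    return ['U+%04X' % ord(ch) for ch in text]


def from_code_points(labels):
    """Rebuild a string from 'U+XXXX' labels by using chr()."""
    return ''.join(chr(int(label[2:], 16)) for label in labels)


labels = to_code_points('Café')
print(labels)
print(from_code_points(labels))
```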


Encoding
--------

Converting a string of *Unicode code points* to a *byte string* is called encoding.

.. code-block:: python

>>> s = u'Café'
>>> type(s.encode('utf-8'))
<class 'bytes'>
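
The resulting bytes depend on the codec. As a small illustration, the same string is encoded here with three standard codecs; the choice of codecs is only an example.

```python
s = 'Café'

utf8 = s.encode('utf-8')        # é becomes two bytes
latin1 = s.encode('latin-1')    # é becomes one byte
utf16 = s.encode('utf-16-le')   # every character here becomes two bytes

print(utf8)
print(latin1)
print(utf16)
```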

Decoding
--------

Converting a *byte string* to a string of *Unicode code points* is called decoding.

.. code-block:: python

>>> s = bytes('Café', encoding='utf-8')
>>> s.decode('utf-8')
'Café'
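
Decoding only gives the original text back when the codec matches the one used for encoding. The sketch below decodes the same UTF-8 bytes twice, once correctly and once with ``latin-1`` picked as a deliberately wrong codec.

```python
data = 'Café'.encode('utf-8')    # b'Caf\xc3\xa9'

right = data.decode('utf-8')     # matching codec restores the text
wrong = data.decode('latin-1')   # no error, but each byte becomes one character

print(right)
print(wrong)
```

The second result has five characters instead of four, because the two bytes of ``é`` are read as two separate Latin-1 characters.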

Unicode Normalization
---------------------

Some characters can be represented in two similar forms. For example, the
character ``é`` can be written as ``e ́`` (Canonical Decomposition) or ``é``
(Canonical Composition). In this case, two strings may compare unequal even
though they look alike. Normalizing both strings to the same Unicode form
solves the issue.

.. code-block:: python

Expand Down Expand Up @@ -147,85 +186,57 @@ unicode normalization
(b'Cafe\xcc\x81', b'Cafe\xcc\x81')
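
One way to apply this when comparing strings is to normalize both sides first with ``unicodedata.normalize``. The helper ``same_text`` below is a hypothetical name, and NFC as the default form is an assumption made for this sketch.

```python
import unicodedata


def same_text(a, b, form='NFC'):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = 'Caf\u00e9'      # é as a single code point
decomposed = 'Cafe\u0301'   # e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)
print(same_text(composed, decomposed))
```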


Avoid UnicodeDecodeError
-------------------------

.. code-block:: python

# raise a UnicodeDecodeError
Avoid ``UnicodeDecodeError``
----------------------------

>>> u = b"0xff"
>>> u.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Python raises ``UnicodeDecodeError`` when a byte string cannot be decoded to
Unicode code points. To avoid this exception, we can pass *replace*,
*backslashreplace*, or *ignore* to the ``errors`` argument of `decode <https://docs.python.org/3/library/stdtypes.html#bytes.decode>`_.

# raise a UnicodeDecodeError
.. code-block:: python

>>> u.decode('utf-8', "strict")
Traceback (most recent call last):
>>> u = b"\xff"
>>> u.decode('utf-8', 'strict')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

# use U+FFFD, REPLACEMENT CHARACTER

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
>>> # use U+FFFD, REPLACEMENT CHARACTER
>>> u.decode('utf-8', "replace")
'\ufffd'

# inserts a \xNN escape sequence

>>> # inserts a \xNN escape sequence
>>> u.decode('utf-8', "backslashreplace")
'\\xff'

# leave the character out of the Unicode result

>>> # leave the character out of the Unicode result
>>> u.decode('utf-8', "ignore")
''
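
An alternative to the ``errors`` argument is to catch the exception and retry with another codec. This is a different technique from the one shown above. The helper name ``decode_or_fallback`` and the choice of ``latin-1`` as the fallback are assumptions for this sketch; ``latin-1`` is used because it maps every byte value to a character and therefore never fails.

```python
def decode_or_fallback(data, encoding='utf-8', fallback='latin-1'):
    """Decode data with encoding; on failure, retry with fallback."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        # The fallback may produce the wrong characters, but it keeps every byte.
        return data.decode(fallback)


valid = decode_or_fallback(b'Caf\xc3\xa9')
invalid = decode_or_fallback(b'\xff')
print(valid)
print(invalid)
```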

Long String
-----------

Original long string
The following snippet shows common ways to write a long string across
multiple lines of source code in Python.

.. code-block:: python

# original long string
>>> s = 'This is a very very very long python string'
>>> s
'This is a very very very long python string'

Single quote with an escaping backslash

.. code-block:: python

>>> s = "This is a very very very " \
... "long python string"
>>> s
'This is a very very very long python string'

Using brackets

.. code-block:: python

>>> s = ("This is a very very very "
... "long python string")
>>> s
'This is a very very very long python string'

Using ``+``

.. code-block:: python

>>> s = ("This is a very very very " +
... "long python string")
>>> s
'This is a very very very long python string'

Using triple-quote with an escaping backslash

.. code-block:: python

>>> s = '''This is a very very very \
... long python string'''
>>> s
'This is a very very very long python string'
s = 'This is a very very very long python string'

# Single quote with an escaping backslash
s = "This is a very very very " \
"long python string"

# Using brackets
s = (
"This is a very very very "
"long python string"
)

# Using ``+``
s = (
"This is a very very very " +
"long python string"
)

# Using triple-quote with an escaping backslash
s = '''This is a very very very \
long python string'''
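
As a quick check, the forms above should all produce the same single-line string. The sketch below compares four of them; the variable names are only illustrative. The triple-quoted form only matches when its continuation line starts at column zero, because any leading whitespace would become part of the string.

```python
a = 'This is a very very very long python string'

# Adjacent literals joined with an escaping backslash
b = "This is a very very very " \
    "long python string"

# Adjacent literals inside brackets
c = ("This is a very very very "
     "long python string")

# Triple-quote with an escaping backslash
d = '''This is a very very very \
long python string'''

print(a == b == c == d)
```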