crazyguitar · crazyguitar · Dec 10, 2018 · Nov 30, 2018 · Nov 30, 2018 · Dec 1, 2018
diff --git a/docs/notes/python-unicode.rst b/docs/notes/python-unicode.rst
@@ -6,48 +6,29 @@
 Unicode
 =======
 
+This cheat sheet collects common snippets related to Unicode. In Python 3,
+strings are sequences of Unicode code points rather than bytes. Further
+information can be found in PEP `3100 <https://www.python.org/dev/peps/pep-3100>`_.
+
+**ASCII** is the best-known standard that assigns numeric codes to
+characters. It originally defined only 128 values, covering control codes,
+digits, lowercase letters, uppercase letters, and some punctuation. That is
+not enough to represent accented characters, Chinese characters, or emoji
+used around the world. **Unicode** was developed to solve this issue. Like
+ASCII, it assigns each character a number, called a *code point*, but it has
+room for up to 1,111,998 characters.
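+
+One way to arrive at the 1,111,998 figure is sketched below. It assumes the
+count means every code point in the Unicode codespace minus the surrogates
+and the noncharacters; the variable names are only illustrative.
+
+.. code-block:: python
+
+    import sys
+
+    # The codespace runs from U+0000 to U+10FFFF (sys.maxunicode on Python 3.3+).
+    codespace = sys.maxunicode + 1        # 0x110000 == 1,114,112 code points
+
+    # Code points that can never be assigned to a character:
+    surrogates = 0xDFFF - 0xD800 + 1      # 2,048 UTF-16 surrogate code points
+    noncharacters = 32 + 2 * 17           # U+FDD0..U+FDEF plus the last two of each of the 17 planes
+
+    assignable = codespace - surrogates - noncharacters
+    print(assignable)                     # 1111998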
+
 .. contents:: Table of Contents
     :backlinks: none
 
+String
+------
 
-Encode: unicode code point to bytes
-------------------------------------
-
-.. code-block:: python
-
-    >>> s = u'Café'
-    >>> type(s.encode('utf-8'))
-    <class 'bytes'>
-
-Decode: bytes to unicode code point
-------------------------------------
-
-.. code-block:: python
-
-    >>> s = bytes('Café', encoding='utf-8')
-    >>> s.decode('utf-8')
-    'Café'
-
-Get unicode code point
------------------------
-
-.. code-block:: python
-
-    >>> s = u'Café'
-    >>> for _c in s: print('U+%04x' % ord(_c))
-    ...
-    U+0043
-    U+0061
-    U+0066
-    U+00e9
-    >>> u = '中文'
-    >>> for _c in u: print('U+%04x' % ord(_c))
-    ...
-    U+4e2d
-    U+6587
-
-python2 ``str`` is equivalent to byte string
----------------------------------------------
+In Python 2, string literals are *bytes* by default, not *Unicode*. Python
+provides several kinds of string literal, such as Unicode strings and raw
+strings. To declare a Unicode string in Python 2, we add the ``u`` prefix to
+the string literal.
 
 .. code-block:: python
 
@@ -62,9 +43,12 @@ python2 ``str`` is equivalent to byte string
     >>> type(u)
     <type 'unicode'>
 
-
-python3 ``str`` is equivalent to unicode string
--------------------------------------------------
+In Python 3, strings are represented in *Unicode*. To declare a byte string,
+we add the ``b`` prefix to the string literal. Note that early Python 3
+versions (3.0-3.2) do not support the ``u`` prefix. To ease the migration of
+Unicode-aware applications from Python 2, Python 3.3 supports the ``u``
+prefix for string literals again. Further information can be found in
+PEP `414 <https://www.python.org/dev/peps/pep-0414>`_.
 
 .. code-block:: python
 
@@ -78,9 +62,13 @@ python3 ``str`` is equivalent to unicode string
     >>> s.encode('utf-8').decode('utf-8')
     'Café'
 
+Characters
+----------
 
-python2 take ``str`` char as byte character
---------------------------------------------
+Python 2 treats each character of a ``str`` as a byte, so the length of a
+string may not equal the number of characters. For example, the length of
+``Café`` is 5, not 4, because ``é`` is encoded as two bytes in UTF-8.
 
 .. code-block:: python
 
@@ -95,8 +83,8 @@ python2 take ``str`` char as byte character
     >>> len(s)
     4
 
-python3 take ``str`` char as unicode character
------------------------------------------------
+Python 3 treats each character of a ``str`` as a Unicode code point, so the
+length of a string is always the number of code points it contains.
 
 .. code-block:: python
 
@@ -109,11 +97,62 @@ python3 take ``str`` char as unicode character
     >>> print(bs)
     b'Caf\xc3\xa9'
     >>> len(bs)
-   5
+    5
 
+Unicode Code Point
+------------------
 
-unicode normalization
-----------------------
+`ord <https://docs.python.org/3/library/functions.html#ord>`_ is a built-in
+function that returns the Unicode code point of a given character. To check
+the code point of a character, we can use ``ord``.
+
+.. code-block:: python
+
+    >>> s = u'Café'
+    >>> for _c in s: print('U+%04x' % ord(_c))
+    ...
+    U+0043
+    U+0061
+    U+0066
+    U+00e9
+    >>> u = '中文'
+    >>> for _c in u: print('U+%04x' % ord(_c))
+    ...
+    U+4e2d
+    U+6587
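+
+As a small complement, the sketch below goes the other way with the built-in
+``chr`` and looks up each character's official name with
+``unicodedata.name`` from the standard library. The ``names`` list is only
+there for illustration.
+
+.. code-block:: python
+
+    import unicodedata
+
+    names = []
+    for ch in 'Caf\u00e9':                   # 'Café' with é written as an escape
+        cp = ord(ch)                         # character -> code point (int)
+        assert chr(cp) == ch                 # chr() maps the code point back
+        names.append(unicodedata.name(ch))
+        print('U+%04X %s' % (cp, names[-1]))
+
+    # U+0043 LATIN CAPITAL LETTER C
+    # U+0061 LATIN SMALL LETTER A
+    # U+0066 LATIN SMALL LETTER F
+    # U+00E9 LATIN SMALL LETTER E WITH ACUTE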
+
+
+Encoding
+--------
+
+Converting *Unicode code points* into a *byte string* is called encoding.
+
+.. code-block:: python
+
+    >>> s = u'Café'
+    >>> type(s.encode('utf-8'))
+    <class 'bytes'>
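+
+The resulting bytes depend on the codec that is chosen. The following sketch
+encodes the same string with three standard codecs to compare the byte
+lengths; the variable names are illustrative.
+
+.. code-block:: python
+
+    s = 'Caf\u00e9'                  # 'Café'
+
+    utf8 = s.encode('utf-8')         # b'Caf\xc3\xa9', é takes two bytes
+    latin1 = s.encode('latin-1')     # b'Caf\xe9', é takes one byte
+    utf16 = s.encode('utf-16-le')    # two bytes for each of these characters
+
+    print(len(s), len(utf8), len(latin1), len(utf16))   # 4 5 4 8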
+
+Decoding
+--------
+
+Converting a *byte string* into *Unicode code points* is called decoding.
+
+.. code-block:: python
+
+    >>> s = bytes('Café', encoding='utf-8')
+    >>> s.decode('utf-8')
+    'Café'
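+
+Decoding only gives back the original text when the codec matches the one
+used for encoding. The sketch below, with illustrative variable names,
+decodes UTF-8 bytes as Latin-1 to show the garbled result and how it can be
+undone in this particular case.
+
+.. code-block:: python
+
+    data = b'Caf\xc3\xa9'                  # UTF-8 bytes for 'Café'
+
+    right = data.decode('utf-8')           # 'Café'
+    wrong = data.decode('latin-1')         # 'CafÃ©', each byte became one character
+
+    print(len(right), len(wrong))          # 4 5
+
+    # Latin-1 maps every byte to a code point, so the mistake is reversible here.
+    print(wrong.encode('latin-1').decode('utf-8') == right)   # True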
+
+Unicode Normalization
+---------------------
+
+Some characters can be represented in two equivalent forms. For example, the
+character ``é`` can be written as ``e`` followed by a combining acute accent
+(U+0301, Canonical Decomposition) or as the single code point U+00E9
+(Canonical Composition). Two strings that look alike may therefore compare
+as unequal. Normalizing both strings to the same Unicode form solves the
+issue.
 
 .. code-block:: python
 
@@ -147,85 +186,57 @@ unicode normalization
     (b'Cafe\xcc\x81', b'Cafe\xcc\x81')
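+
+A minimal sketch of a normalization-aware comparison follows. ``nfc_equal``
+is a hypothetical helper name, not part of the standard library; it simply
+wraps ``unicodedata.normalize``.
+
+.. code-block:: python
+
+    import unicodedata
+
+    def nfc_equal(a, b):
+        """Hypothetical helper: compare two strings after NFC normalization."""
+        return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)
+
+    composed = 'Caf\u00e9'        # é as a single code point
+    decomposed = 'Cafe\u0301'     # e followed by COMBINING ACUTE ACCENT
+
+    print(composed == decomposed)             # False
+    print(len(composed), len(decomposed))     # 4 5
+    print(nfc_equal(composed, decomposed))    # True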
 
 
-Avoid UnicodeDecodeError
--------------------------
-
-.. code-block:: python
-
-    # raise a UnicodeDecodeError
+Avoid ``UnicodeDecodeError``
+----------------------------
 
-    >>> u = b"0xff"
-    >>> u.decode('utf-8')
-    Traceback (most recent call last):
-      File "<stdin>", line 1, in <module>
-    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
+Python raises ``UnicodeDecodeError`` when a byte string cannot be decoded to
+Unicode code points. To avoid this exception, we can pass *replace*,
+*backslashreplace*, or *ignore* as the ``errors`` argument of `decode <https://docs.python.org/3/library/stdtypes.html#bytes.decode>`_.
 
-    # raise a UnicodeDecodeError
+.. code-block:: python
 
-    >>> u.decode('utf-8', "strict")
-    Traceback (most recent call last):
+    >>> u = b"\xff"
+    >>> u.decode('utf-8', 'strict')
+    Traceback (most recent call last):
       File "<stdin>", line 1, in <module>
-    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
-
-    # use U+FFFD, REPLACEMENT CHARACTER
-
+    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
+    >>> # use U+FFFD, REPLACEMENT CHARACTER
     >>> u.decode('utf-8', "replace")
     '\ufffd'
-
-    # inserts a \xNN escape sequence
-
+    >>> # inserts a \xNN escape sequence
     >>> u.decode('utf-8', "backslashreplace")
     '\\xff'
-
-    # leave the character out of the Unicode result
-
+    >>> # leave the character out of the Unicode result
     >>> u.decode('utf-8', "ignore")
     ''
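+
+Another option is to decode strictly first and fall back only when needed.
+The sketch below assumes a hypothetical helper named ``decode_lenient``; it
+relies on the ``start`` and ``end`` attributes of ``UnicodeDecodeError`` to
+report the offending bytes.
+
+.. code-block:: python
+
+    def decode_lenient(data, encoding='utf-8'):
+        """Hypothetical helper: decode strictly, fall back to 'replace'."""
+        try:
+            return data.decode(encoding)      # errors='strict' is the default
+        except UnicodeDecodeError as exc:
+            # exc.start and exc.end delimit the bytes that failed to decode.
+            print('bad bytes at %d:%d' % (exc.start, exc.end))
+            return data.decode(encoding, errors='replace')
+
+    good = decode_lenient(b'Caf\xc3\xa9')     # 'Café', no fallback needed
+    bad = decode_lenient(b'Caf\xff')          # reports 3:4, returns 'Caf\ufffd'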
 
 Long String
 -----------
 
-Original long string
+The following snippet shows common ways to declare a multi-line string in
+Python.
 
 .. code-block:: python
 
     # original long string
-    >>> s = 'This is a very very very long python string'
-    >>> s
-    'This is a very very very long python string'
-
-Single quote with an escaping backslash
-
-.. code-block:: python
-
-    >>> s = "This is a very very very " \
-    ...     "long python string"
-    >>> s
-    'This is a very very very long python string'
-
-Using brackets
-
-.. code-block:: python
-
-    >>> s = ("This is a very very very "
-    ...      "long python string")
-    >>> s
-    'This is a very very very long python string'
-
-Using ``+``
-
-.. code-block:: python
-
-    >>> s = ("This is a very very very " +
-    ...      "long python string")
-    >>> s
-    'This is a very very very long python string'
-
-Using triple-quote with an escaping backslash
-
-.. code-block:: python
-
-    >>> s = '''This is a very very very \
-    ... long python string'''
-    >>> s
-    'This is a very very very long python string'
+    s = 'This is a very very very long python string'
+
+    # Adjacent string literals joined with a line-continuation backslash
+    s = "This is a very very very " \
+        "long python string"
+
+    # Using brackets
+    s = (
+        "This is a very very very "
+        "long python string"
+    )
+
+    # Using ``+``
+    s = (
+        "This is a very very very " +
+        "long python string"
+    )
+
+    # Using triple-quote with an escaping backslash
+    s = '''This is a very very very \
+    long python string'''
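+
+As a quick check, the sketch below confirms that these forms build the same
+string, and shows what changes when the backslash is left out of a
+triple-quoted literal. The variable names are illustrative.
+
+.. code-block:: python
+
+    a = "This is a very very very " \
+        "long python string"
+    b = ("This is a very very very "
+         "long python string")
+    c = '''This is a very very very \
+    long python string'''
+    d = '''This is a very very very
+    long python string'''      # no backslash: the newline stays in the string
+
+    print(a == b == c)         # True
+    print('\n' in c)           # False
+    print('\n' in d)           # True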