
Embedding Parser: Encoding

Martin Mitáš edited this page Dec 15, 2016 · 1 revision

Mostly Encoding-Agnostic

The CommonMark specification generally assumes UTF-8 input, but on closer inspection, Unicode plays a role in only a few very specific situations when handling Markdown documents.

MD4C relies on this property of CommonMark, and its implementation is, to a large degree, encoding-agnostic. Most of the MD4C code only assumes that the encoding of your choice is compatible with ASCII, i.e. that the codepoints below 128 have the same numeric values as in ASCII.

Any input MD4C does not recognize as a Markdown syntax construct is simply treated as part of the document text and passed to the renderer's callback functions unchanged.
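To illustrate why ASCII compatibility is enough for most of the parsing, here is a minimal sketch, not taken from the MD4C sources. It relies on a property of UTF-8: every byte of a multi-byte sequence is 0x80 or higher, so a byte-wise search for an ASCII syntax character can never match inside a non-ASCII character. The helper name `count_ascii_mark` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for illustration only; not part of MD4C.
 * Counts occurrences of an ASCII syntax mark (e.g. '*') in a byte buffer
 * without decoding the buffer's encoding. */
static size_t count_ascii_mark(const char *text, size_t size, char mark)
{
    size_t i;
    size_t n = 0;

    assert(text != NULL || size == 0);
    assert((unsigned char) mark < 128);   /* only ASCII marks make sense here */

    for (i = 0; i < size; i++) {
        /* Bytes >= 0x80 (any part of a multi-byte UTF-8 sequence) can never
         * compare equal to an ASCII mark, so they are skipped implicitly. */
        if (text[i] == mark)
            n++;
    }
    return n;
}
```

For example, in the UTF-8 buffer `"*\xC3\xA9*"` (an asterisk, the letter é, an asterisk) the function finds two marks; the two bytes of é are passed over as ordinary text.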

Where Unicode Matters

If you carefully study the CommonMark specification, you will see that Unicode really matters in only a few specific situations:

  • For detecting word boundaries when processing emphasis and strong emphasis, a classification of Unicode characters is used. It determines whether an emphasis or strong-emphasis mark can open a span, close a span, or both. The parser only checks whether (Unicode) whitespace or (Unicode) punctuation precedes or follows the mark.

  • For matching a link reference with the corresponding link reference definition, the parser case-folds the link label to perform Unicode case-insensitive matching.

  • For translating HTML entities (e.g. &amp;) and numeric character references (e.g. &#35; or &#xcab;) into their Unicode equivalents. However, MD4C leaves this translation to the renderer/application, because the renderer is the one that knows the output encoding and whether this kind of translation is needed at all. (Consider that a renderer converting Markdown to HTML may leave the entities untranslated and defer the work to a web browser.)
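Since the last point leaves the translation to the renderer, the following sketch shows what a renderer producing UTF-8 output might do with a numeric character reference. It is an illustration under stated assumptions, not MD4C code: the function names `encode_utf8` and `translate_numeric_ref` are hypothetical, and the sketch simply rejects invalid references instead of substituting a replacement character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode a Unicode codepoint as UTF-8.
 * 'out' must have room for 4 bytes. Returns the number of bytes written,
 * or 0 if the codepoint is a surrogate or out of range. */
static size_t encode_utf8(unsigned long cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char) (0xC0 | (cp >> 6));
        out[1] = (char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                          /* surrogates are not characters */
    if (cp < 0x10000) {
        out[0] = (char) (0xE0 | (cp >> 12));
        out[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (char) (0xF0 | (cp >> 18));
        out[1] = (char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: translate a decimal ("&#35;") or hexadecimal
 * ("&#xcab;") reference of 'size' bytes into UTF-8.
 * Returns the number of bytes written to 'out', or 0 on malformed input. */
static size_t translate_numeric_ref(const char *ref, size_t size, char *out)
{
    unsigned long cp = 0;
    unsigned base = 10;
    size_t start = 2;
    size_t i;

    assert(ref != NULL && out != NULL);

    if (size < 4 || ref[0] != '&' || ref[1] != '#' || ref[size - 1] != ';')
        return 0;
    if (ref[2] == 'x' || ref[2] == 'X') {
        base = 16;
        start = 3;
    }
    if (size < start + 2)
        return 0;                          /* no digits between prefix and ';' */

    for (i = start; i + 1 < size; i++) {
        char c = ref[i];
        unsigned digit;

        if (c >= '0' && c <= '9')
            digit = (unsigned) (c - '0');
        else if (base == 16 && c >= 'a' && c <= 'f')
            digit = (unsigned) (c - 'a') + 10;
        else if (base == 16 && c >= 'A' && c <= 'F')
            digit = (unsigned) (c - 'A') + 10;
        else
            return 0;

        cp = cp * base + digit;
        if (cp > 0x10FFFF)
            return 0;                      /* beyond the Unicode range */
    }

    if (cp == 0)
        return 0;                          /* this sketch rejects U+0000 */
    return encode_utf8(cp, out);
}
```

A renderer that emits HTML can skip all of this and copy the reference through verbatim, which is exactly why the decision is left to the renderer.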

Unicode Support

MD4C implements Unicode support. Note, however, that if you embed the MD4C parser in your application, you have to enable it explicitly.

If the preprocessor macro MD4C_USE_UTF8 is defined, MD4C assumes UTF-8 for word boundary detection and case-folding.

On Windows, if the preprocessor macro MD4C_USE_UTF16 is defined, MD4C uses WCHAR instead of char and assumes UTF-16 encoding in those situations. (UTF-16 is what Windows developers usually call just "Unicode" and what the Win32 API works with.)

By default (when neither macro is defined), ASCII-only mode is used even in these specific situations. That effectively means that non-ASCII whitespace or punctuation characters won't be recognized as such and that case-folding is performed only on ASCII letters (i.e. [a-zA-Z]).
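The consequences of the ASCII-only mode can be sketched as follows. This is an illustration of the behavior described above, not the actual MD4C implementation: all function names are hypothetical, and the label comparison ignores other normalization a real parser performs on link labels (such as collapsing internal whitespace).

```c
#include <assert.h>
#include <stddef.h>

/* ASCII-only whitespace: space, tab, LF, VT, FF, CR. */
static int is_ascii_whitespace(unsigned char ch)
{
    return ch == ' ' || (ch >= '\t' && ch <= '\r');
}

/* ASCII-only punctuation: the four punctuation ranges of the ASCII table. */
static int is_ascii_punct(unsigned char ch)
{
    return (ch >= 33 && ch <= 47) || (ch >= 58 && ch <= 64) ||
           (ch >= 91 && ch <= 96) || (ch >= 123 && ch <= 126);
}

/* ASCII-only case folding: maps [A-Z] to [a-z], leaves everything else alone. */
static unsigned char ascii_fold(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        return (unsigned char) (ch - 'A' + 'a');
    return ch;
}

/* Compare two link labels byte by byte after ASCII-only folding.
 * Returns 1 if they match, 0 otherwise. */
static int labels_match_ascii(const char *a, size_t a_size,
                              const char *b, size_t b_size)
{
    size_t i;

    assert(a != NULL && b != NULL);

    if (a_size != b_size)
        return 0;
    for (i = 0; i < a_size; i++) {
        if (ascii_fold((unsigned char) a[i]) != ascii_fold((unsigned char) b[i]))
            return 0;
    }
    return 1;
}
```

With these helpers, the labels `Foo` and `FOO` match, but the UTF-8 encodings of `É` and `é` do not, and neither byte of a UTF-8 no-break space (0xC2 0xA0) is classified as whitespace.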

(Adding support for further encodings should be relatively simple because the respective code is isolated.)