Parsing Unicode text with Python PLY

Python PLY is quite helpful in writing a simple scanner and parser, but I had some trouble figuring out how to make it accept Unicode tokens. I kept getting weird errors like this:

Getr. Zählung
Syntax Error: '̈hlung'

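The syntax error is displayed as an h with umlaut, something that does not even exist in German. On closer inspection, the problem turns out to be the COMBINING DIAERESIS character (U+0308): the input spells ä as a plain a followed by the combining mark, the regular expressions in PLY stop matching at that mark, and the rest of the word ends up in the error message.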
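
A quick way to see what is really inside such a string is to print the Unicode name of every character. This is only a minimal sketch, assuming Python 3; the describe helper and the sample string are made up for illustration and are not part of PLY.

```python
import unicodedata

def describe(text):
    """Return a (code point, Unicode name) pair for each character in text."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>")) for ch in text]

# Hypothetical sample: "Zä" spelled as 'a' followed by a combining diaeresis.
sample = "Za\u0308"
for code, name in describe(sample):
    print(code, name)
```

The third entry is the combining mark, even though the string looks like two characters on screen.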

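The solution is to use the Unicode normalization form KC (Compatibility Decomposition, followed by Canonical Composition), which folds the a and the combining mark back into a single ä, and to make sure that PLY uses Unicode for matching (for example by including re.UNICODE in the reflags argument of lex.lex()):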

lex.input(unicodedata.normalize('NFKC', s))
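
To see the effect of the normalization without a full lexer, here is a small stand-in that uses a plain regular expression in place of a PLY token rule. It is a sketch under assumptions: Python 3, and the names WORD and first_word are hypothetical, not anything PLY defines.

```python
import re
import unicodedata

# Stand-in for a PLY token rule such as r'\w+' (hypothetical rule).
WORD = re.compile(r"\w+", re.UNICODE)

def first_word(text, normalize=True):
    """Return the first word matched at the start of text, optionally after NFKC."""
    if normalize:
        text = unicodedata.normalize("NFKC", text)
    match = WORD.match(text)
    return match.group(0) if match else ""

decomposed = "Za\u0308hlung"  # 'a' followed by COMBINING DIAERESIS

# Without normalization the match stops right before the combining mark.
print(repr(first_word(decomposed, normalize=False)))
# With NFKC the whole word is matched as one token.
print(repr(first_word(decomposed)))
```

The unnormalized case leaves exactly the leftover that showed up in the syntax error above: the combining mark followed by "hlung".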


Because I keep losing this snippet, I will record it here. It’s what you add at the top of the file if print chokes on non-ASCII output under Python 2. Don’t ask.

import codecs, sys
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
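
The snippet above is for Python 2. Under Python 3, sys.stdout is a text stream, so the writer would presumably have to wrap the underlying binary stream (sys.stdout.buffer) instead. The sketch below shows the same codecs.getwriter idea on an in-memory buffer so it can be tried safely; utf8_writer is a made-up helper name.

```python
import codecs
import io

def utf8_writer(byte_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    return codecs.getwriter("utf-8")(byte_stream)

# Demonstration on an in-memory buffer. For real output one would pass
# sys.stdout.buffer here (assumption: Python 3).
buffer = io.BytesIO()
out = utf8_writer(buffer)
out.write("Z\u00e4hlung\n")
print(buffer.getvalue())
```

On Python 3.7 and later, sys.stdout.reconfigure(encoding='utf-8') is another option that avoids replacing sys.stdout altogether.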