Python2, string .format(), and unicode

Primer

If you’ve dealt with unicode and byte str mixing in python2 before, you’ll know that there are certain percent-formatting operations that you absolutely should not do with them. For instance, if you are combining a string of each type and they both have non-ascii characters then you are going to get a traceback:

>>> print(u'くら%s' % (b'とみ',))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
>>> print(b'くら%s' % (u'とみ',))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)

The canonical answer to this is to clean up your code to not mix unicode and byte str which seems fair enough here. You can convert one of the two strings to match with the other fairly easily:

>>> print(u'くら%s' % (unicode(b'とみ', 'utf-8'),))
くらとみ

However, if you’re part of a project which was written before the need to separate the two string types was realized you may be mixing the two types sometimes and relying on bug reports and python tracebacks to alert you to pieces of the code that need to be fixed. If you don’t get tracebacks then you may not bother to explicitly convert in some cases. Unfortunately, as code is changed you may find that the areas you thought of as safe to mix aren’t quite as broad as they first appeared. That can lead to UnicodeError exceptions suddenly popping up in your code with seemingly harmless changes….

A New Idiom

If you’re like me and trying to adopt python3-supported idioms into your python-2.6+ code bases then one of the changes you may be making is to switch from using percent formatting to construct your strings to the new string .format() method. This is usually fairly straightforward:

name = u"Kuratomi"

# Old style
print("Hello Mr. %s!" % (name,))

# New style
print("Hello Mr. {0}!".format(name))

# Output:
Hello Mr. Kuratomi!
Hello Mr. Kuratomi!

This seems like an obvious transformation with no possibility of UnicodeError being thrown. And for this simple example you’d be right. But we all know that real code is a little more obfuscated than that. So let’s start making this a little more real-world, shall we?

name = u"くらとみ"
print("Hello Mr. %s!" % (name,))
print("Hello Mr. {0}!".format(name))

# Output
Hello Mr. くらとみ!
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

What happened here? In our code we set name to a unicode string that has non-ascii characters. Used with the old-style percent formatting, this continued to work fine. But with the new-style .format() method we ended up with a UnicodeError. Why? Well under the hood, the percent formatting uses the “%” operator. The function that handles the “%” operator (__mod__()) sees that you were given two strings one of which is a byte str and one of which is a unicode string. It then decides to convert the byte str to a unicode string and combine the two. Since our example only has ascii characters in the byte string, it converts successfully and python can then construct the unicode string u"Hello Mr. くらとみ!". Since it’s always the byte str that’s converted to unicode type we can build up an idea of what things will work and which will throw an exception:

# These are good as the byte string
# which is converted is ascii-only
"Mr. %s" % (u"くらとみ",)
u"%s くらとみ" % ("Mr.",)

# Output of either of those:
u"Mr. くらとみ"

# These will throw an exception as the
# *byte string* contains non-ascii characters
u"Mr. %s" % ("くらとみ",)
"%s くらとみ" % (u"Mr",)

Okay, so that explains what’s happening with the percent-formatting example. What’s happening with the .format() code? .format() is a method of one of the two string types (str for python2 byte strings or unicode for python2 text strings). This gives programmers a feeling that the method is more closely associated with the type it is a method of than the parameters that it is given. So the design decision was made that the method should convert to the type that the method is bound to instead of always converting to unicode string type. This means that we have to make sure parameters can be converted to the type of the format string rather than always to unicode. Taking that in mind, this is the matrix of things we expect to work and expect to fail:

# These are good as the parameter string
# which is converted is ascii-only
u"{0} くらとみ".format("Mr.")
"{0} くらとみ".format(u"Mr.")

# Output (first is a unicode, second is a str):
u"Mr. くらとみ"
"Mr. くらとみ"

# These will throw an exception as the
# parameters contain non-ascii characters
u"Mr. {0}".format("くらとみ")
"Mr. {0}".format(u"くらとみ")

So now we know why we get a traceback in the converted code but not in the original code. Let’s apply this to our example:

name = u"くらとみ"
# name is a unicode type so we need to make
# sure .format() does not implicitly convert it
print(u"Hello Mr. {0}!".format(name))

# Output
Hello Mr. くらとみ!

Alright! That seems good now, right? Are we done? Well, let’s take this real-world thing one step farther. With real-world users we often get transient errors because users are entering a value we didn’t test with. In real-world code, variables often aren’t being set a few lines above where you’re using them. Instead, they’re coming from user input or a config file or command line parsing which happened tens of function calls and thousands of lines away from where you are encountering your traceback. After you step through your program for a few hours you may be able to realize that the relation between your variable and where it is used looks something like this:

# Near the start of your program
name = raw_input("Your name")
if not name.strip():
    name = u"くらとみ"

# [..thousands of lines of code..]

print(u"Hello Mr. {0}!".format(name))

So what’s happening? There’s two ways that our variable could be set. One of those ways (the return from raw_input()) sets it to a byte str. The other way (when we set the default value) sets it to a unicode string. The way we’re using the variable in the print() function means that the value will be converted to a unicode string if it’s a byte string. Remember that we earlier determined that ascii-only byte strings would convert but non-ascii byte strings would throw an error. So that means the code will behave correctly if the default is used or if the user enters “Kuratomi” but it will throw an exception if the user enters “くらとみ” because it has non-ascii characters.

This is where explicit conversion comes in. We need to explicitly convert the value to a unicode string so that we do not throw a traceback when we use it later. There’s two sensible locations to do that conversion. The better long term option is to convert where the variable is being set:

name = raw_input("Your name")
name = unicode(name, "utf-8", "replace")
if not name.strip():
    name = u"くらとみ"

Doing it there means that everywhere in your code you know that the variable will contain a unicode string. If you do this to all of your variables you will get to the point where you know that all of your variables are unicode strings unless you are explicitly converting them to byte str (or have special variables that should always be bytes — in which case you should have a naming convention to identify them). Having this sort of default makes it much easier to write code that uses the variable without fearing that it will unexpectedly cause tracebacks.

The other point at which you can convert is at the point that the variable is being used:

if isinstance(name, 'str'):
    name = unicode(name, 'utf-8', 'replace')
print(u"Hello Mr. {0}!".format(name))

The drawbacks to setting the variable here include having to put this code in wherever you are using it (usually more places than the variable could be set) and having to add the isinstance check because you don’t know whether it was set to a unicode or str type at this point. However, it can be useful to use this strategy when you have some critical code deployed and you know you’re getting tracebacks at a specific location but don’t know what unintended consequences might occur from changing the type of the variable everywhere. In this case you might analyze the problem for a bit and decide to hotfix your production machines to convert at the point of use but in your development tree you change it where the variable is being set so that you have a bit more time to work your way through all the places that shows you that you are mixing string types.