Python2, string .format(), and unicode

Primer

If you’ve dealt with unicode and byte str mixing in python2 before, you’ll know that there are certain percent-formatting operations that you absolutely should not do with them. For instance, if you are combining a string of each type and they both have non-ascii characters then you are going to get a traceback:

>>> print(u'くら%s' % (b'とみ',))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
>>> print(b'くら%s' % (u'とみ',))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)

The canonical answer to this is to clean up your code to not mix unicode and byte str which seems fair enough here. You can convert one of the two strings to match with the other fairly easily:

>>> print(u'くら%s' % (unicode(b'とみ', 'utf-8'),))
くらとみ

However, if you’re part of a project which was written before the need to separate the two string types was realized you may be mixing the two types sometimes and relying on bug reports and python tracebacks to alert you to pieces of the code that need to be fixed. If you don’t get tracebacks then you may not bother to explicitly convert in some cases. Unfortunately, as code is changed you may find that the areas you thought of as safe to mix aren’t quite as broad as they first appeared. That can lead to UnicodeError exceptions suddenly popping up in your code with seemingly harmless changes….

A New Idiom

If you’re like me and trying to adopt python3-supported idioms into your python-2.6+ code bases then one of the changes you may be making is to switch from using percent formatting to construct your strings to the new string .format() method. This is usually fairly straightforward:

name = u"Kuratomi"

# Old style
print("Hello Mr. %s!" % (name,))

# New style
print("Hello Mr. {0}!".format(name))

# Output:
Hello Mr. Kuratomi!
Hello Mr. Kuratomi!

This seems like an obvious transformation with no possibility of UnicodeError being thrown. And for this simple example you’d be right. But we all know that real code is a little more obfuscated than that. So let’s start making this a little more real-world, shall we?

name = u"くらとみ"
print("Hello Mr. %s!" % (name,))
print("Hello Mr. {0}!".format(name))

# Output
Hello Mr. くらとみ!
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

What happened here? In our code we set name to a unicode string that has non-ascii characters. Used with the old-style percent formatting, this continued to work fine. But with the new-style .format() method we ended up with a UnicodeError. Why? Well, under the hood, percent formatting uses the “%” operator. The function that handles the “%” operator (__mod__()) sees that it was given two strings, one of which is a byte str and one of which is a unicode string. It then decides to convert the byte str to a unicode string and combine the two. Since the byte string in our example only has ascii characters, it converts successfully and python can then construct the unicode string u"Hello Mr. くらとみ!". Since it’s always the byte str that’s converted to unicode type, we can build up an idea of which things will work and which will throw an exception:

# These are good as the byte string
# which is converted is ascii-only
"Mr. %s" % (u"くらとみ",)
u"%s くらとみ" % ("Mr.",)

# Output of either of those:
u"Mr. くらとみ"

# These will throw an exception as the
# *byte string* contains non-ascii characters
u"Mr. %s" % ("くらとみ",)
"%s くらとみ" % (u"Mr",)

Okay, so that explains what’s happening with the percent-formatting example. What’s happening with the .format() code? .format() is a method of one of the two string types (str for python2 byte strings or unicode for python2 text strings). This gives programmers a feeling that the method is more closely associated with the type it is a method of than with the parameters that it is given. So the design decision was made that the method should convert to the type that the method is bound to instead of always converting to unicode string type. This means that we have to make sure parameters can be converted to the type of the format string rather than always to unicode. Keeping that in mind, this is the matrix of things we expect to work and expect to fail:

# These are good as the parameter string
# which is converted is ascii-only
u"{0} くらとみ".format("Mr.")
"{0} くらとみ".format(u"Mr.")

# Output (first is a unicode, second is a str):
u"Mr. くらとみ"
"Mr. くらとみ"

# These will throw an exception as the
# parameters contain non-ascii characters
u"Mr. {0}".format("くらとみ")
"Mr. {0}".format(u"くらとみ")

So now we know why we get a traceback in the converted code but not in the original code. Let’s apply this to our example:

name = u"くらとみ"
# name is a unicode type so we need to make
# sure .format() does not implicitly convert it
print(u"Hello Mr. {0}!".format(name))

# Output
Hello Mr. くらとみ!
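
To recap the two coercion rules side by side, here is a rough model of them. It is only a sketch: it is written for python3 (where bytes plays the part of the python2 byte str and str the part of the python2 unicode type) so that it can be run today, and py2_percent(), py2_format() and outcome() are made-up names for illustration, not anything python itself provides.

```python
def py2_percent(template, value):
    """Model python2's ``template % (value,)`` for a pair of strings."""
    # Mixed types: the byte string, on whichever side it sits, is decoded
    # with the ascii codec and the result is text.
    if isinstance(template, bytes) and isinstance(value, str):
        template = template.decode("ascii")
    elif isinstance(template, str) and isinstance(value, bytes):
        value = value.decode("ascii")
    return template % (value,)


def py2_format(template, value):
    """Model python2's ``template.format(value)`` for a pair of strings."""
    # Mixed types: the parameter is converted to the template's type.
    if isinstance(template, bytes):
        if isinstance(value, str):
            value = value.encode("ascii")
        # python3 bytes has no .format(), so fill the one {0} field by hand
        return template.replace(b"{0}", value)
    if isinstance(value, bytes):
        value = value.decode("ascii")
    return template.format(value)


def outcome(func, template, value):
    """Return the result, or the name of the exception that was raised."""
    try:
        return func(template, value)
    except UnicodeError as exc:
        return type(exc).__name__


name = "くらとみ"
# Byte template, text parameter: percent formatting decodes the template
percent_result = outcome(py2_percent, b"Hello Mr. %s!", name)
# Byte template, text parameter: .format() tries to encode the parameter
format_result = outcome(py2_format, b"Hello Mr. {0}!", name)
# Text template, text parameter: nothing needs converting
fixed_result = outcome(py2_format, "Hello Mr. {0}!", name)
```

In this model percent_result and fixed_result both hold the greeting as text while format_result holds "UnicodeEncodeError", which mirrors the three outcomes seen in the examples so far.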

Alright! That seems good now, right? Are we done? Well, let’s take this real-world thing one step further. With real-world users we often get transient errors because users are entering a value we didn’t test with. In real-world code, variables often aren’t being set a few lines above where you’re using them. Instead, they’re coming from user input or a config file or command line parsing which happened tens of function calls and thousands of lines away from where you are encountering your traceback. After you step through your program for a few hours you may be able to realize that the relation between your variable and where it is used looks something like this:

# Near the start of your program
name = raw_input("Your name")
if not name.strip():
    name = u"くらとみ"

# [..thousands of lines of code..]

print(u"Hello Mr. {0}!".format(name))

So what’s happening? There are two ways that our variable could be set. One of those ways (the return from raw_input()) sets it to a byte str. The other way (when we set the default value) sets it to a unicode string. Because the format string we hand to print() is a unicode string, the value will be converted to a unicode string if it’s a byte string. Remember that we earlier determined that ascii-only byte strings would convert but non-ascii byte strings would throw an error. So that means the code will behave correctly if the default is used or if the user enters “Kuratomi” but it will throw an exception if the user enters “くらとみ” because it has non-ascii characters.

This is where explicit conversion comes in. We need to explicitly convert the value to a unicode string so that we do not throw a traceback when we use it later. There are two sensible locations to do that conversion. The better long term option is to convert where the variable is being set:

name = raw_input("Your name")
name = unicode(name, "utf-8", "replace")
if not name.strip():
    name = u"くらとみ"

Doing it there means that everywhere in your code you know that the variable will contain a unicode string. If you do this to all of your variables you will get to the point where you know that all of your variables are unicode strings unless you are explicitly converting them to byte str (or have special variables that should always be bytes — in which case you should have a naming convention to identify them). Having this sort of default makes it much easier to write code that uses the variable without fearing that it will unexpectedly cause tracebacks.
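
If you end up converting at many input boundaries, a small helper keeps the conversion in one place. The following is a minimal sketch; to_text() is a name made up for this example (not a stdlib function), and the utf-8 and "replace" defaults simply copy the choices made in the snippet above.

```python
def to_text(value, encoding="utf-8", errors="replace"):
    """Return value as a text string, decoding it if it is a byte string."""
    # bytes is an alias for the byte str type on python2.6+ and the real
    # bytes type on python3, so this check works on both.
    if isinstance(value, bytes):
        return value.decode(encoding, errors)
    # Already text (or not a string at all): hand it back untouched
    return value


# Byte input (what raw_input() would give on python2) becomes text
decoded = to_text(b"\xe3\x81\x8f\xe3\x82\x89")
# Text input passes through unchanged
unchanged = to_text(u"Kuratomi")
```

With something like this, the assignment above could be written as name = to_text(raw_input("Your name")), and the same helper could also be reused for a point-of-use hotfix.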

The other point at which you can convert is at the point that the variable is being used:

if isinstance(name, str):
    name = unicode(name, 'utf-8', 'replace')
print(u"Hello Mr. {0}!".format(name))

The drawbacks to converting the variable here include having to put this code in wherever you are using it (usually more places than the variable could be set) and having to add the isinstance check because you don’t know whether it was set to a unicode or str type at this point. However, it can be useful to use this strategy when you have some critical code deployed and you know you’re getting tracebacks at a specific location but don’t know what unintended consequences might occur from changing the type of the variable everywhere. In this case you might analyze the problem for a bit and decide to hotfix your production machines to convert at the point of use while, in your development tree, you convert where the variable is being set. That gives you a bit more time to work your way through all the places where the change shows that you are mixing string types.

Dear Lazyweb, how would you nicely bundle python code?

I’ve been looking into bundling the python six library into ansible because it’s getting painful to maintain compatibility with the old versions on some distros. However, the distribution developer in me wanted to make it easy for distro packagers to make sure the system copy was used rather than the bundled copy if needed and also make it easy for other ansible developers to make use of it. It seemed like the way to achieve that was to make an import in our namespace that would transparently decide which version of six was needed and use that. I figured out three ways of doing this but haven’t figured out which is better. So throwing the three ways out there in the hopes that some python gurus can help me understand the pros and cons of each (and perhaps improve on what I have so far).

Boilerplate

To be both transparent to our developers and use system packages if the system had a recent enough six, I created a six package in our namespace. Inside of this package I included the real six library as _six.py. Then I created an __init__.py with code to decide whether to use the system six or the bundled _six.py. So the directory layout is like this:

+ ansible/
  + __init__.py
  + compat/
    + __init__.py
    + six/
      + __init__.py
      + _six.py

__init__.py has two tasks. It has to determine whether we want the system six library or the bundled one. And then it has to make that choice be what other code gets when it does import ansible.compat.six. Here’s the basic boilerplate:

# Does the system have a six library installed?
try:
    import six as _system_six
except ImportError:
    _system_six = None

if _system_six:
    # Various checks that system six library is current enough
    if not hasattr(_system_six.moves, 'shlex_quote'):
        _system_six = None

if _system_six:
    # Here's where we have to load up the system six library
    pass
else:
    # Alternatively, we load up the bundled library
    pass
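
If the list of "is the system copy current enough" checks grows, it can be factored into a small function. This is only a sketch of one way to do that: has_attr_path() and usable() are hypothetical helper names, and the two fake modules below stand in for an old and a new system library so that the example runs without six installed.

```python
import types


def has_attr_path(module, dotted_path):
    """Return True if e.g. module.moves.shlex_quote resolves."""
    obj = module
    for part in dotted_path.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True


def usable(module, required_paths):
    """A module is usable if it was found and has every required attribute."""
    return module is not None and all(
        has_attr_path(module, path) for path in required_paths)


# Stand-ins for a too-old and a new-enough system library
old_lib = types.ModuleType("old_lib")
old_lib.moves = types.ModuleType("old_lib.moves")

new_lib = types.ModuleType("new_lib")
new_lib.moves = types.ModuleType("new_lib.moves")
new_lib.moves.shlex_quote = lambda text: text

REQUIRED = ["moves.shlex_quote"]
old_ok = usable(old_lib, REQUIRED)
new_ok = usable(new_lib, REQUIRED)
missing_ok = usable(None, REQUIRED)
```

Here only new_lib passes; a library that failed to import (None) or that lacks moves.shlex_quote is rejected, which is the same decision the boilerplate makes inline.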

Loading using standard import

Now things start to get interesting. We know which version of the six library we want. We just have to make it available to people who are now going to use it. In the past, I’d used the standard import mechanism so that was the first thing I tried here:

if _system_six:
    from six import *
else:
    from ._six import *

As a general way of doing this, it has some caveats. It only pulls in the symbols that the module considers public. If a module has any functions or variables that are supposed to be public and marked with a leading underscore then they won’t be pulled in. Or if a module has an __all__ = [...] that doesn’t contain all of the public symbols then those won’t get pulled in. You can pull those additions in by specifying them explicitly if you have to.
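
The two caveats can be seen with a throwaway module. This is a small sketch; star_demo and the names inside it are invented for the example, and the module is built in memory rather than loaded from a file.

```python
import sys
import types

# Build a tiny module in memory and register it so it can be imported
demo = types.ModuleType("star_demo")
demo.public_name = 1
demo._underscored = 2
sys.modules["star_demo"] = demo

# Without __all__, import * skips names with a leading underscore
namespace = {}
exec("from star_demo import *", namespace)
without_all = set(name for name in namespace if name != "__builtins__")

# With __all__, import * pulls in exactly what __all__ lists
demo.__all__ = ["_underscored"]
namespace = {}
exec("from star_demo import *", namespace)
with_all = set(name for name in namespace if name != "__builtins__")
```

The first set contains only public_name and the second only _underscored, so anything else a caller needs has to be imported by name.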

For this case, we don’t have any issues with those as six doesn’t use __all__ and none of the public symbols are marked with a leading underscore. However, when I then started porting the ansible code to use ansible.compat.six I encountered an interesting problem:

# Simple things like this work
>>> from ansible.compat.six import moves
>>> moves.urllib.parse.urlsplit('https://toshio.fedorapeople.org/')
SplitResult(scheme='https', netloc='toshio.fedorapeople.org', path='/', query='', fragment='')

# this throws an error:
>>> from ansible.compat.six.moves import urllib
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named moves

Hmm… I’m not quite sure what’s happening but I zero in on the word “module”. Maybe there’s something special about modules such that import * doesn’t give me access to import subpackages or submodules of that. Time to look for answers on the Internet…

The Sorta, Kinda, Hush-Hush, Semi-Official Way

Googling for a means to replace a module from itself eventually leads to a strategy that seems to have both some people who like it and some who don’t. It seems to be supported officially but people don’t want to encourage people to use it. It involves a module replacing its own entry in sys.modules. Going back to our example, it looks like this:

import sys
[...]
if _system_six:
    six = _system_six
else:
    from . import _six as six

sys.modules['ansible.compat.six'] = six

When I ran this with a simple test case of a python package with several nested modules, that seemed to clear up the problem. I was able to import submodules of the real module from my fake module just fine. So I was hopeful that everything would be fine when I implemented it for six.

Nope:

>>> from ansible.compat.six.moves import urllib
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named moves

Hmm… same error. So I take a look inside of six.py to see if there’s any clue as to why my simple test case with multiple files and directories worked but six’s single file is giving us headaches. Inside I find that six is doing its own magic with a custom importer to make moves work. I spend a little while trying to figure out if there’s something specifically conflicting between my code and six’s code and then throw my hands up. There’s a lot of stuff that I’ve never used before here… it’ll take me a while to wrap my head around it and there’s no assurance that I’ll be able to make my code work with what six is doing even after I understand it. Is there anything else I could try to just tell my code to run everything that six would normally do when it is imported but do it in my ansible.compat.six namespace?
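
A toy illustration of one general import rule that can produce this kind of error follows. It is not a diagnosis of six's importer (that code has not been traced here), and all of the toy_* names are invented. The point is that an alias in sys.modules makes the module itself importable under the new name, but a dotted import is looked up under the alias's own dotted name.

```python
import importlib
import sys
import types

# A pretend library with one "submodule" hanging off it as an attribute
real = types.ModuleType("toy_real")
real_sub = types.ModuleType("toy_real.sub")
real.sub = real_sub
sys.modules["toy_real"] = real
sys.modules["toy_real.sub"] = real_sub

# The sys.modules trick: a second name for the same module object
sys.modules["toy_alias"] = real
alias = importlib.import_module("toy_alias")

# Attribute access through the alias works...
sub_via_attribute = alias.sub

# ...but nothing is registered as "toy_alias.sub" and the aliased module
# is not a package, so the dotted import has nowhere to look
try:
    importlib.import_module("toy_alias.sub")
    dotted_import_worked = True
except ImportError:
    dotted_import_worked = False

# Registering the submodule under the aliased dotted name fills the gap
sys.modules["toy_alias.sub"] = real_sub
sub_via_import = importlib.import_module("toy_alias.sub")
```

This has the same shape as the symptoms above: reaching moves as an attribute works while importing through the dotted name fails until something answers for that exact name.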

You tell me: Am I beating my code with the ugly stick?

As a matter of fact, python does provide us with a keyword in python2 and a function in python3 that might do exactly that. So here’s strategy number three:

import os.path
[...]
if _system_six:
    import six
else:
    from . import _six as six
six_py_file = '{0}.py'.format(os.path.splitext(six.__file__)[0])
exec (open(six_py_file, 'r'))

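Yep, the python2 exec statement will take an open file handle of a python module and execute it in the current namespace (the python3 exec() function wants the source text or a code object instead, so there you would hand it the result of read()). So this seems like it will do what we want. Let’s test it: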

>>> from ansible.compat.six.moves import urllib
>>>
>>> from ansible.compat.six.moves.urllib.parse import urlsplit
>>> urlsplit('https://toshio.fedorapeople.org/')
SplitResult(scheme='https', netloc='toshio.fedorapeople.org', path='/', query='', fragment='')
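
What distinguishes exec from the previous two strategies is that the source runs as though it had been written in the target namespace. Here is a minimal sketch of that property, using a made-up source string and a made-up module name instead of six, and passing an explicit namespace dictionary so the effect is easy to inspect:

```python
# Source for a pretend library that records where it thinks it lives
source = (
    "module_name = __name__\n"
    "def greet():\n"
    "    return 'hello from ' + module_name\n"
)

# Stand-in for the globals of the wrapper module doing the exec
namespace = {"__name__": "toy.compat.lib"}

# compile() + exec() in this form runs on both python2 and python3
exec(compile(source, "<toy source>", "exec"), namespace)

seen_name = namespace["module_name"]
greeting = namespace["greet"]()
```

The executed code sees the wrapper's __name__ rather than its own, so any code that builds dotted names from __name__ would pick up the new location. It also means every top-level name in the source lands in, and can overwrite, the wrapper's namespace.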

So dear readers, you tell me: I now have some code that works but it relies on exec. And moreover, it relies on exec to overwrite the current namespace. Is this a good idea or a bad idea? Let’s contemplate a little further. Is this an idea that should only be applied sparingly (using sys.modules instead if the module isn’t messing around with a custom importer of its own) or is it a general purpose strategy that should be applied to other libraries that I might bundle as well? Are there caveats to doing things this way? For instance, is it bypassing the standard import caching and so might be slower? Is there a better way to do this that in my ignorance I just don’t know about?

Why sys.setdefaultencoding() will break code

I know wiser and more experienced Python coders have written to python-dev about this before but every time I’ve needed to reference one of those messages for someone else I have trouble finding one. This time when I did my google search the most relevant entry was a post from myself to the yum-devel mailing list in 2011. Since I know I’ll need to prove why setdefaultencoding() is to be avoided in the future I figured I should post the reasoning here so that I don’t have to search the web next time.

Some Background

15 years ago: Creating a Unicode Aware Python

In Python 2 it is possible to mix byte strings (str type) and text strings (unicode type) together to a limited extent. For instance:

>>> u'Toshio' == 'Toshio'
True
>>> print(u'Toshio' + ' Kuratomi')
Toshio Kuratomi

When you perform these operations Python sees that you have a unicode type on one side and a str type on the other. It takes the str value and decodes it to a unicode type and then performs the operation. The encoding it uses to interpret the bytes is what we’re going to call Python’s defaultencoding (named after sys.getdefaultencoding(), which allows you to see what this value is set to).

When the Python developers were first experimenting with a unicode-aware text type that was distinct from byte strings it was unclear what the value of defaultencoding should be. So they created a function to set the defaultencoding when Python started in order to experiment with different settings. The function they created was sys.setdefaultencoding() and the Python authors would modify their individual site.py files to gain experience with how different encodings would change the experience of coding in Python.

Eventually, in October of 2000 (fourteen and a half years prior to me writing this) that experimental version of Python became Python-2.0 and the Python authors had decided that the sensible setting for defaultencoding should be ascii.

I know it’s easy to second guess the ascii decision today but remember that 14 years ago the encoding landscape was a lot more cluttered. New programming languages and new APIs were emerging that optimized for fixed-width 2-byte encodings of unicode. 1-byte, non-unicode encodings for specific natural languages were even more popular then than they are now. Many pieces of data (even more than today!) could include non-ascii text without specifying what encoding to interpret that data as. In that environment anyone venturing outside of the ascii realm needed to be warned that they were entering a world where encoding dragons roamed freely. The ascii encoding helped to warn people that they were entering a land where their code had to take special precautions by throwing an error in many of the cases where the boundary was crossed.

However, there was one oversight about the unicode functionality that went into Python-2.0 that the Python authors grew to realize was a bad idea. That oversight was not removing the setdefaultencoding() function. They had taken some steps to prevent it being used outside of initialization (in the site.py file) by deleting the reference to it from the sys module after Python initialized but it still existed for people to modify the defaultencoding in site.py.
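
You can check both halves of this from any interpreter. A minimal sketch; the variable names are just for the example:

```python
import sys

# The encoding used for implicit conversions on python2; python3 still
# reports a value here even though it no longer converts implicitly
current_default = sys.getdefaultencoding()

# On a normally started python2 the setter has already been deleted from
# the sys module by the time your code runs; python3 removed it entirely
setter_visible = hasattr(sys, "setdefaultencoding")
```

On a stock python2 this gives 'ascii' and False, and on python3 'utf-8' and False, assuming nobody has tampered with site.py or reloaded sys.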

The rise of the sys.setdefaultencoding() hack

As time went on, the utf-8 encoding emerged as the dominant encoding of both Unix-like systems and the Internet. Many people who only had to deal with utf-8 encoded text were tired of getting errors when they mixed byte strings and text strings together. Seeing that there was a function called setdefaultencoding(), people started trying to use it to get rid of the errors they were seeing.

At first, those with the ability to do so tried modifying their Python installation’s global site.py to make sys.setdefaultencoding() do its thing. This is what the Python documentation suggests is the proper way to use it and it seemed to work on the users’ own machines. Unfortunately, the users often turned out to be coders. And it turned out that what these coders were doing was writing programs that had to work on machines run by other people: the IT department, customers, and users all over the Internet. That meant that applying the change in their site.py often left them in a worse position than before: they would code something which would appear to work on their machines but which would fail for the people who were actually using their software.

Since the coders’ concern was confined to whether people would be able to run their software the coders figured if their software could set the defaultencoding as part of its initialization that would take care of things. They wouldn’t have to force other people to modify their Python install; their software could make that decision for them when the software was invoked. So they took another look at sys.setdefaultencoding(). Although the Python authors had done their best to make the function unavailable after python started up these coders hit upon a recipe to get at the functionality anyway:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Once this was run in the coders’ software, the default encoding for coercing byte strings to text strings was utf-8. This meant that when utf-8 encoded byte strings were mixed with unicode text strings, Python would successfully convert the str type data to unicode type and combine the two into one unicode string. This is what this new generation of coders was expecting from the majority of their data so the idea of solving their problem with just these few lines of (admittedly very hacky) code was very attractive to them. Unfortunately, there are non-obvious drawbacks to doing this….

Why sys.setdefaultencoding() will break your code

(1) Write once, change everything

The first problem with sys.setdefaultencoding() is not obviously a problem at first glance. When you call sys.setdefaultencoding() you are telling Python to change the defaultencoding for all of the code that it is going to run. Your software’s code, the code in the stdlib, and third-party library code all end up running with your setting for defaultencoding. That means that code which you weren’t responsible for that relied on the behaviour of having the defaultencoding be ascii would now stop throwing errors and potentially start creating garbage values. For instance, let’s say one of the libraries you rely on does this:

# detect_encoding() is a placeholder for whatever encoding-guessing
# helper the library uses; it is not defined here
def welcome_message(byte_string):
    try:
        return u"%s runs your business" % byte_string
    except UnicodeError:
        return u"%s runs your business" % unicode(byte_string,
            encoding=detect_encoding(byte_string))

print(welcome_message(u"Angstrom (Å®)".encode("latin-1")))

Before defaultencoding was changed, this code would be unable to decode the “Å” with the ascii encoding and would enter the exception handler to guess the encoding and properly turn the bytes into unicode, printing: Angstrom (Å®) runs your business. Once you’ve set the defaultencoding to utf-8, the code will find that the byte_string can be interpreted as utf-8 and so it will mangle the data and return this instead: Angstrom (Ů) runs your business.
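
The mangling comes down to the same two bytes being valid in more than one encoding. A short sketch that decodes the bytes explicitly with each codec, so it does not depend on any interpreter-wide setting:

```python
# u"\xc5\xae" is the pair of characters Å and ®
raw = u"Angstrom (\xc5\xae)".encode("latin-1")

# ascii refuses the bytes, which is what sends the library into its
# exception handler when defaultencoding is left alone
try:
    raw.decode("ascii")
    ascii_accepts = True
except UnicodeDecodeError:
    ascii_accepts = False

# latin-1 gives back the two characters the data really contains
as_latin1 = raw.decode("latin-1")

# utf-8 also accepts the bytes but reads 0xc5 0xae as a single character,
# U+016E (a capital U with a ring above)
as_utf8 = raw.decode("utf-8")
```

Because utf-8 raises no error here, the library's fallback never runs, and the wrong text is returned without any warning.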

Naturally, if this was your code, in your piece of software, you’d be able to fix it to deal with the defaultencoding being set to utf-8. But if it’s in a third party library that luxury may not exist for you.

(2) Let’s break dictionaries!

The most important problem with setting defaultencoding to the utf-8 encoding is that it will break certain assumptions about dictionaries. Let’s write a little code to show this:

def key_in_dict(key, dictionary):
    if key in dictionary:
        return True
    return False

def key_found_in_dict(key, dictionary):
    for dict_key in dictionary:
        if dict_key == key:
            return True
    return False

Would you assume that given the same inputs the output of both functions will be the same? In Python, if you don’t hack around with sys.setdefaultencoding(), your assumption would be correct:

>>> # Note: the following is the same as d = {'Café': 'test'} on
>>> #       systems with a utf-8 locale
>>> d = { u'Café'.encode('utf-8'): 'test' }
>>> key_in_dict('Café', d)
True
>>> key_found_in_dict('Café', d)
True
>>> key_in_dict(u'Café', d)
False
>>> key_found_in_dict(u'Café', d)
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False

But what happens if you call sys.setdefaultencoding('utf-8')? Answer: the assumption breaks:

>>> import sys
>>> reload(sys)
>>> sys.setdefaultencoding('utf-8')
>>> d = { u'Café'.encode('utf-8'): 'test' }
>>> key_in_dict('Café', d)
True
>>> key_found_in_dict('Café', d)
True
>>> key_in_dict(u'Café', d)
False
>>> key_found_in_dict(u'Café', d)
True

This happens because the in operator hashes the keys and then compares the hashes to determine if they are equal. In the utf-8 encoding, only the characters represented by ascii hash to the same values whether in a byte string or a unicode type text string. For all other characters the hash for the byte string and the unicode text string will be different values. The comparison operator (==), on the other hand, converts the byte string to a unicode type and then compares the results. When you call setdefaultencoding('utf-8') you allow the byte string to be transformed into a unicode type. Then the two text strings will be compared and found to be equal. The ramifications of this are that containment tests with in now yield different values than equality testing individual entries via ==. This is a pretty big difference in behaviour to get used to and for most people would count as having broken a fundamental assumption of the language.
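
The same split between hashing and equality can be reproduced on any python version with a small stand-in class. This is a model only: Utf8Key is a hypothetical class written for python3 that hashes like raw bytes but compares like decoded text, which is the combination that setdefaultencoding('utf-8') creates for python2 byte strings.

```python
class Utf8Key(object):
    """Hashes like a byte string, compares like utf-8 decoded text."""

    def __init__(self, raw):
        self.raw = raw

    def __hash__(self):
        # Hash the raw bytes, as a byte string would
        return hash(self.raw)

    def __eq__(self, other):
        if isinstance(other, Utf8Key):
            return self.raw == other.raw
        if isinstance(other, str):
            # Equality decodes first, like the implicit coercion does
            return self.raw.decode("utf-8") == other
        return NotImplemented


d = {Utf8Key(b"Caf\xc3\xa9"): "test"}
key = "Caf\xe9"

# Dictionary lookup goes by hash first; the hashes differ so it misses
found_by_in = key in d
# Walking the keys and comparing with == finds a match
found_by_eq = any(dict_key == key for dict_key in d)
```

found_by_in comes out False while found_by_eq comes out True, the same disagreement that key_in_dict() and key_found_in_dict() showed above.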

So how does Python 3 fix this?

You may have heard that in Python 3 the default encoding has been switched from ascii to utf-8. How does it get away with that without encountering the equality versus containment problem? The answer is that python3 does not perform implicit conversions between byte strings (python3 bytes type) and text strings (python3 str type). Since the two objects are now entirely separate comparing them via both equality and containment will always yield False:

$ python3
>>> a = {'A': 1}
>>> b'A' in a
False
>>> b'A' == list(a.keys())[0]
False

At first, coming from python2 where ascii values were the same this might look a little funny. But just remember that bytes are really a type of number and you wouldn’t expect this to work either:

>>> a = {'1': 'one'}
>>> 1 in a
False
>>> 1 == list(a.keys())[0]
False
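
The flip side is that once you convert explicitly, containment and equality agree again. A minimal sketch, meant to be run under python3:

```python
a = {"A": 1}
raw_key = b"A"

# No implicit conversion: the bytes key is simply a different object
implicit_in = raw_key in a
implicit_eq = raw_key == list(a.keys())[0]

# Decode explicitly and both tests succeed together
text_key = raw_key.decode("ascii")
explicit_in = text_key in a
explicit_eq = text_key == list(a.keys())[0]
```

Both implicit tests are False and both explicit tests are True, so in and == never disagree with each other the way they did after setdefaultencoding('utf-8') on python2.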