Python2, string .format(), and unicode

Primer

If you’ve dealt with unicode and byte str mixing in python2 before, you’ll know that there are certain percent-formatting operations that you absolutely should not do with them. For instance, if you are combining a string of each type and they both have non-ascii characters then you are going to get a traceback:

>>> print(u'くら%s' % (b'とみ',))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
>>> print(b'くら%s' % (u'とみ',))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)

The canonical answer to this is to clean up your code to not mix unicode and byte str which seems fair enough here. You can convert one of the two strings to match with the other fairly easily:

>>> print(u'くら%s' % (unicode(b'とみ', 'utf-8'),))
くらとみ

However, if you’re part of a project which was written before the need to separate the two string types was realized you may be mixing the two types sometimes and relying on bug reports and python tracebacks to alert you to pieces of the code that need to be fixed. If you don’t get tracebacks then you may not bother to explicitly convert in some cases. Unfortunately, as code is changed you may find that the areas you thought of as safe to mix aren’t quite as broad as they first appeared. That can lead to UnicodeError exceptions suddenly popping up in your code with seemingly harmless changes….

A New Idiom

If you’re like me and trying to adopt python3-supported idioms into your python-2.6+ code bases then one of the changes you may be making is to switch from using percent formatting to construct your strings to the new string .format() method. This is usually fairly straightforward:

name = u"Kuratomi"

# Old style
print("Hello Mr. %s!" % (name,))

# New style
print("Hello Mr. {0}!".format(name))

# Output:
Hello Mr. Kuratomi!
Hello Mr. Kuratomi!

This seems like an obvious transformation with no possibility of UnicodeError being thrown. And for this simple example you’d be right. But we all know that real code is a little more obfuscated than that. So let’s start making this a little more real-world, shall we?

name = u"くらとみ"
print("Hello Mr. %s!" % (name,))
print("Hello Mr. {0}!".format(name))

# Output
Hello Mr. くらとみ!
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

What happened here? In our code we set name to a unicode string that has non-ascii characters. Used with the old-style percent formatting, this continued to work fine. But with the new-style .format() method we ended up with a UnicodeError. Why? Well under the hood, the percent formatting uses the “%” operator. The function that handles the “%” operator (__mod__()) sees that you were given two strings one of which is a byte str and one of which is a unicode string. It then decides to convert the byte str to a unicode string and combine the two. Since our example only has ascii characters in the byte string, it converts successfully and python can then construct the unicode string u"Hello Mr. くらとみ!". Since it’s always the byte str that’s converted to unicode type we can build up an idea of what things will work and which will throw an exception:

# These are good as the byte string
# which is converted is ascii-only
"Mr. %s" % (u"くらとみ",)
u"%s くらとみ" % ("Mr.",)

# Output of either of those:
u"Mr. くらとみ"

# These will throw an exception as the
# *byte string* contains non-ascii characters
u"Mr. %s" % ("くらとみ",)
"%s くらとみ" % (u"Mr",)

Okay, so that explains what’s happening with the percent-formatting example. What’s happening with the .format() code? .format() is a method of one of the two string types (str for python2 byte strings or unicode for python2 text strings). This gives programmers a feeling that the method is more closely associated with the type it is a method of than the parameters that it is given. So the design decision was made that the method should convert to the type that the method is bound to instead of always converting to unicode string type. This means that we have to make sure parameters can be converted to the type of the format string rather than always to unicode. Taking that in mind, this is the matrix of things we expect to work and expect to fail:

# These are good as the parameter string
# which is converted is ascii-only
u"{0} くらとみ".format("Mr.")
"{0} くらとみ".format(u"Mr.")

# Output (first is a unicode, second is a str):
u"Mr. くらとみ"
"Mr. くらとみ"

# These will throw an exception as the
# parameters contain non-ascii characters
u"Mr. {0}".format("くらとみ")
"Mr. {0}".format(u"くらとみ")

So now we know why we get a traceback in the converted code but not in the original code. Let’s apply this to our example:

name = u"くらとみ"
# name is a unicode type so we need to make
# sure .format() does not implicitly convert it
print(u"Hello Mr. {0}!".format(name))

# Output
Hello Mr. くらとみ!

Alright! That seems good now, right? Are we done? Well, let’s take this real-world thing one step farther. With real-world users we often get transient errors because users are entering a value we didn’t test with. In real-world code, variables often aren’t being set a few lines above where you’re using them. Instead, they’re coming from user input or a config file or command line parsing which happened tens of function calls and thousands of lines away from where you are encountering your traceback. After you step through your program for a few hours you may be able to realize that the relation between your variable and where it is used looks something like this:

# Near the start of your program
name = raw_input("Your name")
if not name.strip():
    name = u"くらとみ"

# [..thousands of lines of code..]

print(u"Hello Mr. {0}!".format(name))

So what’s happening? There’s two ways that our variable could be set. One of those ways (the return from raw_input()) sets it to a byte str. The other way (when we set the default value) sets it to a unicode string. The way we’re using the variable in the print() function means that the value will be converted to a unicode string if it’s a byte string. Remember that we earlier determined that ascii-only byte strings would convert but non-ascii byte strings would throw an error. So that means the code will behave correctly if the default is used or if the user enters “Kuratomi” but it will throw an exception if the user enters “くらとみ” because it has non-ascii characters.

This is where explicit conversion comes in. We need to explicitly convert the value to a unicode string so that we do not throw a traceback when we use it later. There’s two sensible locations to do that conversion. The better long term option is to convert where the variable is being set:

name = raw_input("Your name")
name = unicode(name, "utf-8", "replace")
if not name.strip():
    name = u"くらとみ"

Doing it there means that everywhere in your code you know that the variable will contain a unicode string. If you do this to all of your variables you will get to the point where you know that all of your variables are unicode strings unless you are explicitly converting them to byte str (or have special variables that should always be bytes — in which case you should have a naming convention to identify them). Having this sort of default makes it much easier to write code that uses the variable without fearing that it will unexpectedly cause tracebacks.

The other point at which you can convert is at the point that the variable is being used:

if isinstance(name, 'str'):
    name = unicode(name, 'utf-8', 'replace')
print(u"Hello Mr. {0}!".format(name))

The drawbacks to setting the variable here include having to put this code in wherever you are using it (usually more places than the variable could be set) and having to add the isinstance check because you don’t know whether it was set to a unicode or str type at this point. However, it can be useful to use this strategy when you have some critical code deployed and you know you’re getting tracebacks at a specific location but don’t know what unintended consequences might occur from changing the type of the variable everywhere. In this case you might analyze the problem for a bit and decide to hotfix your production machines to convert at the point of use but in your development tree you change it where the variable is being set so that you have a bit more time to work your way through all the places that shows you that you are mixing string types.

My first python3 script

I’ve been hacking on other people’s python3 code for a while doing porting and bugfixes but so far my own code has been tied to python2 because of dependencies. Yesterday I ported my first personal script from python2 to python3. This was just a simple, one file script that hacks together a way to track how long my kids are using the computer and log them off after they’ve hit a quota. The kind of thing that many a home sysadmin has probably hacked together to automate just a little bit of their routine. For that use, it seemed very straightforward to make the switch. There were only three changes in the language that I encountered when making the transition:

  • octal values. I use octal for setting file permissions. The syntax for octal values has changed from "0755" to "0o755"
  • exception catching. No longer can you do: except Exception, exc. The new syntax is: except Exception as exc.
  • print function. In python2, print is a keyword so you do this: print 'hello world'. In python3, it’s a function so you write it this way: print('hello world')
  • The strict separation of bytes and string types. Required me to specify that one subprocess function should return string instead of bytes to me

When I’ve worked on porting libraries that needed to maintain some form of compat between python2 (older versions… no nice shiny python-2.7 for you!) and python3 these concerns were harder to address as there needed to be two versions of the code (usually, maintained via automatic build-time invocation of 2to3). With this application/script, throwing out python2 compatibility was possible so switching over was just a matter of getting an error when the code executed and switching the syntax over.

This script also didn’t use any modules that had either not ported, been dropped, or been restructured in the switch from python2 to python3. Unlike my day job where urllib’s restructuring would affect many of the things that we’ve written and lack of ported third-party libraries would prevent even more things from being ported, this script (and many other of my simple-home-use scripts) didn’t require any changes due to library changes.

Verdict? Within these constraints, porting to python3 was as painless as porting between some python2.x releases has been. I don’t see any reason I won’t use python3 for new programming tasks like this. I’ll probably port other existing scripts as I need to enhance them.

Account System 0.8.11 Release

We’ve just deployed a new Fedora Account System to production. This release just pulls a few new features that didn’t quite make the 0.8.10 release:

  • Ian Cole (icole) Added a feature to allow for email address to be used instead of login name for logging in. Because of the way we do authentication, this means that email addresses can also be used on the other applications on admin.fedoraproject.org as well.
  • Pierre-Yves Chibon (pingou) Implemented an audio captcha for people signing up for a new account. It generates a wav file that gets downloaded to your computer that you can listen to and then type in the proper answer to the captcha.
  • Adam M. Dutko (addutko) Standardized some of the errors that can be returned from our JSON API.
  • Our translation team pointed out a few areas where we weren’t loading translations correctly and I fixed them. Look forward to more complete translations in the future.

That’s it for this minor update.

/me goes to play with the audio captcha some more.

Kitchen 1.1.0 release

As mentioned last week a new kitchen release went out today. Since last week some small changes were made to the documentation and a few changes to make building kitchen easier were implemented but nothing has changed in the code. Here’s the text of the release announcement:

== Kitchen 1.1.0 has been released ==

Kitchen is a python library that brings together small snippets of code that you might otherwise find yourself reimplementing via cut and paste. Each little bit is useful and important but they usually feel too small and too trivial to create a whole module just for that one little function. However, experience has shown that any code implemented by copying will inevitably be shown to have bugs. And when you fix those bugs, you’ll wish you had created the module so you could fix the bug in one place rather than two (or five.. or ten…). Kitchen aims to be that one place.

Kitchen currently has code for easily setting up gettext so it won’t throw UnicodeErrors in corner cases, compatibility modules for different python2 stdlib versions (2.4, 2.5, 2.7), a little bit of iterators, and a whole lot of code for unicode-byte string conversion. In addition to the code, kitchen contains documentation that explains some of the common problems that arise when dealing with unicode in python2 and how to fix them through changes in coding practices and/or making use of functions from kitchen.

The 1.1.0 release enhances the gettext portion of kitchen. The major enhancements are:

  • get_translation_object can now be used as a drop in replacement for the stdlib’s gettext.translations() function.
  • If get_translation_object finds multiple message catalogs for the domain, it will setup the additional catalogs as fallbacks in case the message isn’t found in the first one.
  • The gettext and lgettext functions were reworked so that they guarantee that the string they return is both a byte str (this was present in previous kitchen releases) and is a valid sequence of bytes in the selected output_charset. This should prevent tracebacks if your code decodes and reencodes a value returned from the gettext and lgettext family of functions.
  • Several fixes to the way fallback message catalogs interacted with input and output charsets.

For the complete set of changes, see the NEWS file.

New kitchen release coming soon

[EDIT]For those who are curious, kitchen is a python module for miscellaneous code snippets. Things that people end up reimplementing via cut and paste because they seem to be too small to write a module for but are so useful that they need them in many places. Currently, it has code for i18n, compatibility modules for different python2 stdlib versions, a little bit of iterators, and a whole lot of code for unicode-byte string conversions.

Over the recent vacation I put the finishing (code) touches on a new kitchen release. I’ve scheduled the release of this code for next week on January 10, 2012. This is mainly since I just added the kitchen module on transifex.net and I’d like to see if any translations show up before next week. If anyone finds any bugs in the code on python-2.3.1 through python-2.7.x, please bring them up on the mailing list, on irc.freenode.net (I hang out in #fedora-admin and #fedora-python), or in the kitchen bug tracker so that I can address them before the release date.

The beta code is available from fedorahosted.org at: https://fedorahosted.org/releases/k/i/kitchen/kitchen-1.1.0b1.tar.gz

or from the bzr repository:

  bzr branch bzr://bzr.fedorahosted.org/bzr/kitchen/devel

The major changes are in the kitchen.i18n module. Previously, kitchen.i18n.*Translations objects guaranteed that they would return byte str when requested (via gettext(), ngettext(), lgettext(), and lngettext() methods) and unicode strings when requested (via ugettext() and ungettext()). The new code makes the additional guarantee that byte str‘s that are returned are valid in the requested output charset.

Here’s an example of the old behaviour vs new behaviour:

   >>> from kitchen.i18n import get_translation_object
   >>> translations = get_translation_object('kitchen')
   >>> b_ = translations.lgettext
   >>> translations.set_output_charset('utf-8')
   >>> translations.input_charset = 'latin-1'
   >>> # This would be: 'Café does not exist in the message catalog'
   >>> print repr(b_('Caf\xe9 does not exist in the message catalog'))
   # Old behaviour =>
   'Caf\xe9 does not exist in the message catalog'
   # New behaviour =>
   'Caf\xc3\xa9 does not exist in the message catalog'

   # Example 2: with wrong input_charset =>
   >>> translations.input_charset = 'utf-8'
   >>> print repr(b_('Caf\xe9 does not exist in the message catalog'))
   # New behaviour yields valid utf-8 bytes even when input_charset is wrong =>
   'Caf\xef\xbf\xbd does not exist in the message catalog'

Notice that this is not a magical panacea. The second example, shows that if input_encoding does not match the byte encoding of the strings that are given, the output string will be mangled (replacement characters or garbage characters). However, all the bytes in the output string will be valid in the chosen encoding so you won’t have to worry about exceptions if you attempt to transform the string again.

The other major change is that the kitchen.i18n.get_translation_object() function has been rewritten to be a drop in replacement for the stdlib’s gettext.translations(). The behaviour changes from that include the code now attempting to discover translations in every message catalog that it finds in the paths given to it. Additionally, those code changes lead to bugs in the *Translations classes fallback code being discovered and squashed.

See the NEWS file for other changes.

Brought to you by a cast of thousands (well, tens, at least)

The past few weeks I’ve been coordinating a new release of the Fedora Account System(FAS). Since FAS is something used within Fedora but not a whole lot of other places, development is usually driven by a relatively small handful of people: Ricky Zhou, Mike Mcgrath, and I. This release saw a large number of other contributors which has been very good as the three of us have been increasingly pulled into other projects so our time for FAS has steadily decreased.

  • Adam M. Dutko fixed several long standing bugs and feature requests
  • Luis Bazán updated several pieces of the UI
  • Sijis Aviles switched us from signing the CLA to the new FPCA
  • Pierre-Yves Chibon created a new captcha to replace the universally hated one that we were employing
  • Jun Chen added a means to clear a user’s ssh key
  • Nick Bebout started the work of removing copyright phrase-ology that we no longer want to use (“All rights reserved”) and tracking down which people needed to agree that we could switch licenses from GPLv2-only to GPLv2-or-later
  • Jim Lieb contributed code to make handling of languages easier and made FAS more configurable for use in other sites.
  • and for the first time in several releases we coordinated our release with the Fedora Translation Team on transifex so that the translations they contributed could go out with the first release instead of when a subsequent bugfix was released.

So let’s take a brief tour of some of these new features.

More Translations

Although we’ve been using transifex to manage translations for FAS for a while now, I hadn’t really understood how to leverage the full power of the Fedora Translation Team to get translations.  Thanks to some prodding by pingou, I got in touch with the translation team this time around and arranged for a string freeze before release during which they worked hard to translate FAS into their native languages. Thanks to transifex, I can show you this nice graph of their hard work:

Top translations: fas » faspot Full graph of translation stats

Clearing SSH Keys

When Fedora Infrastructure recently made the decision to invalidate public ssh keys because we had no way to tell which users might have hosted their ssh private keys on other projects servers which had been attacked and infiltrated, one of the options was for a user who didn’t actually need to use ssh to simply remove their ssh. Unfortunately, the web interface didn’t include the ability to do that so user’s who wanted to go this route had to contact one of the admins and have them remove it for them manually. Thanks to Jun Chen, users can now perform this step for themselves:
ssh clear key button

New Captcha!

There have always been many times more accounts in FAS than there were active contributors to Fedora. In itself, this wasn’t a problem. However, at some point, spammers started signing their bots up for Fedora accounts as they found that with that, they could write to the Fedora Wiki. To combat this, we added a captcha to the signup process. However, we quickly found that the captcha we added was too hard. Many people came to us to complain that they could not answer the captcha successfully. Thanks to pingou, we have a new captcha which displays a simple math equation in a much less distorted image. Writing the correct answer to the equation is all you need to do.
New captcha

These are just some of the more user visible changes. If you’re interested in the more behind the scenes changes (SELinux fixes from ricky, password strength checking, and more), check out the changes in FAS’s git repository.