Porting Kitchen to Python3: Part 1 — Detecting string types

I’ve spent a good part of the last week working on the python3 port of kitchen. It’s now to the point where I’ve reviewed all of the code and got the unittests passing. I still need to add some deprecation warnings and a gettext object that mirrors the python3 API instead of the python2 API. Then it’ll be ready for an alpha release. Still a lot of work to do before a final release. Most of the documentation will need to be updated to change from unicode + str to str + bytes and the best practices sections will need a major overhaul since a lot of the problems with python2 and unicode have either been fixed, mitigated, or moved to a different level.

It was both an easy and hard undertaking. The easy part was that kitchen is largely a collection of independent, loosely related functions. So it’s reasonably easy to pick a set of functions, figure out that they don’t depend on anything else in kitchen, and then port them one by one.

The hard part is that a lot of those functions deal with things that are explicitly unicode and things that are explicitly byte strings; an area that has both changed dramatically in python3 and that 2to3 doesn’t handle very well. Here’s a couple of things I ended up doing to help out:

Detecting String Types

Kitchen has several places that need to know whether an object it’s been given is a byte string, unicode string, or a generic string. The python2 idioms for this are:

if isinstance(obj, basestring):
    pass # object is any of the string types
    if isinstance(obj, str):
        pass # object is a byte string
    elif isinstance(obj, unicode):
        pass # object is a unicode string
else:
    pass # object was not a string type

In python3, a couple things have changed.

  • There’s no longer a basestring type as byte strings and unicode strings are no longer meant to be related types.
  • Byte strings now have an immutable (bytes) and mutable (bytearray) type.

With these changes, the python3 idioms equivalent to the python2 ones look something like this:

if isinstance(obj, str) or isinstance(obj, bytes) or isinstance(obj, bytearray):
    pass # any string type
    if isinstance(obj, bytes) or isinstance(obj, bytearray):
        pass # byte string
    elif isinstance(obj, str):
        pass # unicode string

There’s two issues with these changes:

  • code that needs to do this needs to be manually ported when moving from python2 to python3. 2to3 can correctly change all occurrences of isinstance(obj, unicode) to isinstance(obj, str) but occurrences of isinstance(obj, basestring) and isinstance(obj, str) will also be rendered as isinstance(obj, str) in the 2to3 output. This is correct for a lot of normal python2 code that is trying to separate strings from ints, floats, or other types but it is incorrect for code that’s trying to explicitly separate bytes from unicode. So you’ll need to hand-audit and fix your code wherever these idioms are being used.
  • This is more prolix and tedious to write than the python2 version and if your code has to do this sort of differentiation in many places you’ll soon get bored of it.
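
To make the first point concrete, here is a minimal sketch of what a mechanical rename does to code that separates bytes from unicode. The function names are hypothetical (they are not part of kitchen), the python2 original is shown only in a comment, and the rest is python3:

```python
# Python2 original (for reference only):
#
#   def to_text(obj, encoding='utf-8'):
#       if isinstance(obj, basestring):
#           if isinstance(obj, str):          # byte string: decode it
#               return obj.decode(encoding)
#           return obj                        # already unicode
#       return unicode(obj)


def to_text_mechanical(obj, encoding='utf-8'):
    # What the renames described above produce: basestring -> str,
    # unicode -> str, and the python2 str check left untouched.
    if isinstance(obj, str):
        if isinstance(obj, str):
            return obj.decode(encoding)  # python3 str has no .decode()
        return obj
    return str(obj)  # bytes land here and come back as their repr


def to_text_ported(obj, encoding='utf-8'):
    # Hand-audited version: byte strings are detected and decoded explicitly.
    if isinstance(obj, (bytes, bytearray)):
        return bytes(obj).decode(encoding)
    if isinstance(obj, str):
        return obj
    return str(obj)
```

In the mechanical version, text input raises AttributeError and bytes input silently turns into a string like "b'abc'". Neither problem shows up until the code path is actually exercised, which is why the hand audit matters.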

For kitchen, I added a few helper functions into kitchen.text.misc that encapsulate the python2 and python3 idioms. For instance:

def isbasestring(obj):
    if isinstance(obj, str) or isinstance(obj, bytes) or isinstance(obj, bytearray):
        return True
    return False

and similar for isunicodestring() and isbytestring(). [In case you're curious, I broke with PEP8 style for these function names because of the long history of is* functions and methods in python and other programming languages.] By pushing these into functions, I can use if isbasestring(obj): on both python2 and python3. I only have to change the implementation of the is*string() functions in a single place when porting from python2 to python3.
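
As a rough sketch of how the three helpers could look on python3, assuming the other two mirror isbasestring() (kitchen's actual implementations may differ in detail), here they are written with the tuple form of isinstance():

```python
def isbasestring(obj):
    """Return True for any string type, text or bytes."""
    return isinstance(obj, (str, bytes, bytearray))


def isbytestring(obj):
    """Return True for byte strings, mutable or immutable."""
    return isinstance(obj, (bytes, bytearray))


def isunicodestring(obj):
    """Return True for unicode (text) strings only."""
    return isinstance(obj, str)
```

On python2 the same three names would wrap basestring, str, and unicode instead, so calling code stays identical on both versions.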

Now let’s mention some of the caveats to using this:

  • In python, calling a function (isbasestring()) is somewhat expensive. So if you use this in any hot inner loops, you may want to benchmark with the function and with the expanded version to see whether you take a noticeable loss of speed.
  • Not every piece of code is going to want to define “string” in the same way. For instance, bytearrays are mutable so maybe your code shouldn’t include those with the “normal” string types.
  • Maybe your code can be changed to only deal with unicode strings (str). In python3 byte strings are not as ubiquitous as they were in python2 so maybe your code can be changed to stop checking for the type of the object altogether or reduced to a single isinstance(obj, str). The language has evolved so when possible, evolve your code to adapt as well.
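
For the first caveat, one way to measure the difference is the stdlib timeit module. This is only a sketch: the workload and the count_* function names are made up for illustration, and the numbers will vary by machine and python version.

```python
import timeit


def isbasestring(obj):
    return isinstance(obj, (str, bytes, bytearray))


def count_with_helper(items):
    # Pays for one extra function call per item.
    return sum(1 for item in items if isbasestring(item))


def count_inline(items):
    # Same check, expanded in place.
    return sum(1 for item in items
               if isinstance(item, (str, bytes, bytearray)))


items = ['text', b'bytes', 42, 3.14, bytearray(b'ba'), None] * 1000

helper_time = timeit.timeit(lambda: count_with_helper(items), number=20)
inline_time = timeit.timeit(lambda: count_inline(items), number=20)
print('helper: %.4fs  inline: %.4fs' % (helper_time, inline_time))
```

If the two timings are close for your real data, the helper's readability is probably worth keeping.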

Next time: Literals

Kitchen 1.1.0 release

As mentioned last week a new kitchen release went out today. Since last week some small changes were made to the documentation and a few changes to make building kitchen easier were implemented but nothing has changed in the code. Here’s the text of the release announcement:

== Kitchen 1.1.0 has been released ==

Kitchen is a python library that brings together small snippets of code that you might otherwise find yourself reimplementing via cut and paste. Each little bit is useful and important but they usually feel too small and too trivial to create a whole module just for that one little function. However, experience has shown that any code implemented by copying will inevitably be shown to have bugs. And when you fix those bugs, you’ll wish you had created the module so you could fix the bug in one place rather than two (or five.. or ten…). Kitchen aims to be that one place.

Kitchen currently has code for easily setting up gettext so it won’t throw UnicodeErrors in corner cases, compatibility modules for different python2 stdlib versions (2.4, 2.5, 2.7), a little bit of iterators, and a whole lot of code for unicode-byte string conversion. In addition to the code, kitchen contains documentation that explains some of the common problems that arise when dealing with unicode in python2 and how to fix them through changes in coding practices and/or making use of functions from kitchen.

The 1.1.0 release enhances the gettext portion of kitchen. The major enhancements are:

  • get_translation_object can now be used as a drop-in replacement for the stdlib’s gettext.translation() function.
  • If get_translation_object finds multiple message catalogs for the domain, it will set up the additional catalogs as fallbacks in case the message isn’t found in the first one.
  • The gettext and lgettext functions were reworked so that they guarantee that the string they return is both a byte str (this was present in previous kitchen releases) and is a valid sequence of bytes in the selected output_charset. This should prevent tracebacks if your code decodes and reencodes a value returned from the gettext and lgettext family of functions.
  • Several fixes to the way fallback message catalogs interacted with input and output charsets.

For the complete set of changes, see the NEWS file.

New kitchen release coming soon

[EDIT]For those who are curious, kitchen is a python module for miscellaneous code snippets. Things that people end up reimplementing via cut and paste because they seem to be too small to write a module for but are so useful that they need them in many places. Currently, it has code for i18n, compatibility modules for different python2 stdlib versions, a little bit of iterators, and a whole lot of code for unicode-byte string conversions.

Over the recent vacation I put the finishing (code) touches on a new kitchen release. I’ve scheduled the release of this code for next week on January 10, 2012. This is mainly since I just added the kitchen module on transifex.net and I’d like to see if any translations show up before next week. If anyone finds any bugs in the code on python-2.3.1 through python-2.7.x, please bring them up on the mailing list, on irc.freenode.net (I hang out in #fedora-admin and #fedora-python), or in the kitchen bug tracker so that I can address them before the release date.

The beta code is available from fedorahosted.org at: https://fedorahosted.org/releases/k/i/kitchen/kitchen-1.1.0b1.tar.gz

or from the bzr repository:

  bzr branch bzr://bzr.fedorahosted.org/bzr/kitchen/devel

The major changes are in the kitchen.i18n module. Previously, kitchen.i18n.*Translations objects guaranteed that they would return byte str when requested (via gettext(), ngettext(), lgettext(), and lngettext() methods) and unicode strings when requested (via ugettext() and ungettext()). The new code makes the additional guarantee that any byte str that is returned is valid in the requested output charset.

Here’s an example of the old behaviour vs new behaviour:

   >>> from kitchen.i18n import get_translation_object
   >>> translations = get_translation_object('kitchen')
   >>> b_ = translations.lgettext
   >>> translations.set_output_charset('utf-8')
   >>> translations.input_charset = 'latin-1'
   >>> # This would be: 'Café does not exist in the message catalog'
   >>> print repr(b_('Caf\xe9 does not exist in the message catalog'))
   # Old behaviour =>
   'Caf\xe9 does not exist in the message catalog'
   # New behaviour =>
   'Caf\xc3\xa9 does not exist in the message catalog'

   # Example 2: with wrong input_charset =>
   >>> translations.input_charset = 'utf-8'
   >>> print repr(b_('Caf\xe9 does not exist in the message catalog'))
   # New behaviour yields valid utf-8 bytes even when input_charset is wrong =>
   'Caf\xef\xbf\xbd does not exist in the message catalog'

Notice that this is not a magical panacea. The second example shows that if input_charset does not match the byte encoding of the strings that are given, the output string will be mangled (replacement characters or garbage characters). However, all the bytes in the output string will be valid in the chosen encoding so you won’t have to worry about exceptions if you attempt to transform the string again.
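
The same transformation can be reproduced with nothing but the stdlib codecs. This sketch uses python3 syntax and does not call kitchen at all. It only illustrates the decode-then-re-encode idea, on the assumption that undecodable bytes are replaced rather than raising:

```python
raw = b'Caf\xe9 does not exist in the message catalog'  # latin-1 bytes

# Correct input charset: decode as latin-1, re-encode as utf-8.
good = raw.decode('latin-1', 'replace').encode('utf-8', 'replace')

# Wrong input charset: \xe9 is not valid utf-8 here, so it is replaced
# with U+FFFD, which then encodes to the three bytes \xef\xbf\xbd.
mangled = raw.decode('utf-8', 'replace').encode('utf-8', 'replace')

print(repr(good))
print(repr(mangled))

# Both results are valid utf-8, so decoding them again cannot raise.
good.decode('utf-8')
mangled.decode('utf-8')
```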

The other major change is that the kitchen.i18n.get_translation_object() function has been rewritten to be a drop-in replacement for the stdlib’s gettext.translation(). The behaviour changes from that include the code now attempting to discover translations in every message catalog that it finds in the paths given to it. Additionally, those code changes led to bugs in the *Translations classes’ fallback code being discovered and squashed.
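
The fallback idea can be sketched with the stdlib alone (python3 syntax). DictTranslations is a hypothetical in-memory stand-in for a message catalog and is not a kitchen class. The only real API used is gettext.NullTranslations and its add_fallback() method:

```python
import gettext


class DictTranslations(gettext.NullTranslations):
    """Hypothetical in-memory catalog, used only for this illustration."""

    def __init__(self, messages):
        super().__init__()
        self._messages = messages

    def gettext(self, message):
        if message in self._messages:
            return self._messages[message]
        # NullTranslations.gettext() asks the fallback, if one is set,
        # and otherwise returns the message unchanged.
        return super().gettext(message)


primary = DictTranslations({'Hello': 'Bonjour'})
secondary = DictTranslations({'Goodbye': 'Au revoir'})
primary.add_fallback(secondary)

print(primary.gettext('Hello'))
print(primary.gettext('Goodbye'))
print(primary.gettext('Unknown'))
```

A message missing from the first catalog is looked up in the second, and a message missing from both comes back untranslated.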

See the NEWS file for other changes.

kitchen 0.2.4 released

I realize I didn’t announce 0.2.3 so here’s the NEWS entries for both of 0.2.3 and 0.2.4:

0.2.4

  • Have easy_gettext_setup() return lgettext functions instead of gettext
    functions when use_unicode=False
  • Correct docstring for kitchen.text.converters.exception_to_bytes() — we’re
    transforming into a byte str, not into unicode.
  • Correct some examples in the unicode frustrations documentation
  • Correct some cross-references in the documentation

0.2.3

  • Expose MAXFD, list2cmdline(), and mswindows in kitchen.pycompat27.subprocess.
    These are undocumented, and not in upstream’s __all__ but google (and bug
    reports against kitchen) show that some people are using them. Note that
    upstream is leaning towards these being private so they may be deprecated in
    the python3 subprocess.

So what do these changes mean for you? Hopefully it’ll just be bugfixes for everyone. The subprocess changes in 0.2.3 make more of the subprocess interface public because some code uses those functions and variables. People using them are advised to stop using them as this upstream bug report shows that the python maintainers don’t intend them to be public and will be deprecating them in the future. Since I had to dig into the code to look into this, I’ll also note that if your code is using list2cmdline() it’s likely that it’s buggy in corner cases. From the subprocess documentation: “list2cmdline() is designed for applications using the same rules as the MS C runtime.” That means that it’s not intended for dealing with Unix shells or even the MS-DOS command prompt. It’s only intended for the MS C runtime itself.
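
A short python3 sketch of the kind of corner case meant here. It calls the undocumented subprocess.list2cmdline() purely for comparison, and assumes shlex.quote() is what you would use to quote for a POSIX shell:

```python
import shlex
import subprocess

args = ['echo', 'a b', '$HOME']

# list2cmdline() follows MS C runtime rules: it quotes arguments that
# contain whitespace but knows nothing about shell metacharacters.
windows_style = subprocess.list2cmdline(args)

# shlex.quote() quotes each argument for a POSIX shell.
posix_style = ' '.join(shlex.quote(arg) for arg in args)

print(windows_style)
print(posix_style)
```

In the first string $HOME is left bare, so a Unix shell would expand it. In the second it is single-quoted and passed through literally.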

The 0.2.4 change to easy_gettext_setup() alters behaviour, so there is a potential to break code, although I still classify it as a bugfix. easy_gettext_setup() is intended to return the gettext functions needed to translate an application. Since python has both byte str and unicode string types that can be used, there are gettext functions that return one or the other of those. easy_gettext_setup() takes a parameter, use_unicode, to know whether to return a set of functions that work with byte str or a set of functions that work with unicode strings. There’s only one set of functions that return unicode so when unicode is requested the code returns the ugettext() and ungettext() functions as expected. When byte str is requested, however, things are a little messier as there are two sets of functions to choose from: gettext()/ngettext() or lgettext()/lngettext().

Prior to 0.2.4, easy_gettext_setup() returned gettext() and ngettext(). The gettext functions do return byte strings. However, the byte strings they return are in the encoding that was saved in the message catalogs on the filesystem. So, if the translators used utf-8 to encode their strings, you’d get utf-8 output; if they used latin-1, you’d get latin-1 output and so forth. This works fine as long as you’re using the same encoding as the translators were. However, when the translator uses a different encoding than you, you get mojibake.

In 0.2.4, we’ve switched to returning the lgettext functions to address this. lgettext and lngettext take the byte strings and the encoding information from the message catalog that the translator provided and use that to re-encode the strings in the desired encoding. That way if you have a locale setting of ja_JP.EUC_JP you get text encoded in EUC_JP and if you have a locale setting of ja_JP.UTF8 your text is encoded in UTF8.
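
The re-encoding step can be sketched with plain codecs (python3 syntax, no kitchen calls). The reencode() helper and the sample message are made up for illustration, and the sketch assumes the catalog was saved as utf-8 while the terminal expects EUC_JP:

```python
message = '日本語のメッセージ'
catalog_bytes = message.encode('utf-8')   # what the translator saved
catalog_charset = 'utf-8'                 # recorded in the catalog


def reencode(byte_string, input_charset, output_charset):
    # Hypothetical helper: decode with the catalog's charset, then
    # encode for the locale. Bad bytes are replaced instead of raising.
    text = byte_string.decode(input_charset, 'replace')
    return text.encode(output_charset, 'replace')


# lgettext-style: bytes are converted to the locale's encoding.
for_euc_terminal = reencode(catalog_bytes, catalog_charset, 'euc_jp')

# gettext-style: the utf-8 bytes are passed through untouched, so an
# EUC_JP terminal interprets them with the wrong codec.
misread = catalog_bytes.decode('euc_jp', 'replace')

print(for_euc_terminal.decode('euc_jp') == message)
print(misread == message)
```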

Results from a program before and after updating the kitchen.i18n.easy_gettext_setup() function to use lgettext. In both terminals, the terminal is set to display characters using the EUC_JP encoding. The terminal on the left displays mojibake because the earlier version of easy_gettext_setup() uses the gettext() function which returns the characters in utf8 (the encoding that the translator used). The terminal on the right displays correctly because lgettext reencodes the strings as EUC_JP.

kitchen 0.2.2 released

Kitchen is a python module of small, useful snippets of code. It has functions to help with internationalizing applications, working with unicode and byte strings, iterators, and a whole bunch more.

0.2.2 is the first release that we’ve marked as a beta. The plan is to have a few weeks for people to try this out and report any bugs. If no bugs are reported we’ll release 1.0 soon afterwards. With that out the door, we’ll spend some time working on addon modules for a while — getting new features prototyped in separate submodules before merging them into a later kitchen 1.x (or 2.x) release.

Now bringing you everything but the kitchen sink!

Those of you who hang out in the same IRC channels as me may have heard me mention the python module I’ve been working on with some other Fedora python programmers. For those who haven’t, the 0.2.1a1 release seems like the perfect time for me to invade the blogosphere!

Kitchen is a module that aims to collect small, useful pieces of code into one python package for install on your machine. It’s a kind of a library of miscellaneous, small python functions. Why is that something special? Well, what those of us working on kitchen were realizing is that small pieces of code are a strange beast. They’re so small that it feels like the overhead of creating a python package just for them will result in a setup.py that’s larger than the module. On the other hand, the code is so useful that you end up reimplementing it in every project that you work on. And of course, once you start cutting and pasting and reimplementing between projects you have the problem of keeping those copies in sync; making sure that bugs that were fixed in the copy in one project are fixed in all of them.

What’s needed is to have one larger umbrella package that pulls all of those functions together. Then you can fix problems in a single place and also only have the overhead of writing setup.py files and making releases once for all of those functions. This is exactly what kitchen is.

So what do you get with the 0.2.1-alpha1 release? You get a bunch of code that originated in yum, python-fedora, a few snippets from the ActiveState Python Recipe Collection, and some backports from python-2.7 for earlier python. You get API documentation and a few tutorials on using kitchen to ease your programming burden.

What can you do with these shiny new tools? Let’s take a really brief tour of the modules included in the release:

i18n
Functions and objects in here will help you set up gettext for internationalizing your application’s messages.
text
Have you ever been stumped by the difference between unicode and str types in python? How many times have you written your own to_unicode and to_bytes functions? Well, you can stop cutting and pasting that because this kitchen module has both of those functions built in for you! As a bonus, we also throw in a

collections
Currently this only provides a dictionary implementation that keeps strict separation between str and unicode keys, i.e. 'a' and u'a' are different keys.
iterutils
Contains two functions at the moment:

  • isiterable() will accurately detect if an object is an iterable.
  • iterate takes a value and iterates over it. If the value is a scalar, it becomes an iterator that returns a single value. If it’s already an iterable, we just iterate that.
versioning
Currently has one function for assembling a PEP-386 compliant version string from a set of tuples (format of which is also in PEP-386)
Compatibility modules for things added in Python-2.4, Python-2.5, and Python-2.7.

The compatibility modules implement functionality that was merged into python at a later version than you might have. So if you need defaultdict that appeared in python-2.5, it’s here. The nice thing about these modules is that we take care to import from the python standard library if it’s available and only use the copied compat library if it’s not available.
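
The import-with-fallback pattern described above looks roughly like this. The stand-in class in the except branch is a deliberately minimal, hypothetical substitute written for this sketch. It is not kitchen's backport, which a real project would import instead:

```python
try:
    # Prefer the standard library when it provides the feature.
    from collections import defaultdict
except ImportError:
    # Only reached on a python without collections.defaultdict.
    class defaultdict(dict):
        """Minimal stand-in for illustration only."""

        def __init__(self, default_factory=None, *args, **kwargs):
            dict.__init__(self, *args, **kwargs)
            self.default_factory = default_factory

        def __getitem__(self, key):
            try:
                return dict.__getitem__(self, key)
            except KeyError:
                if self.default_factory is None:
                    raise
                value = self[key] = self.default_factory()
                return value


counts = defaultdict(int)
for word in ['a', 'b', 'a']:
    counts[word] += 1
print(sorted(counts.items()))
```

Calling code uses the name defaultdict either way and never needs to know which implementation it received.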

As with any open source project, this is an evolving code base. We’re doing our best to have a long deprecation cycle for functions that we remove and clearly document how to replace the functionality with other functions in kitchen. If you have a favourite useful piece of code that you’d like to see merged, feel free to send us a message on the mailing list or open up a ticket in our trac instance.