GNOME strongly encourages the use of Python 3 for writing applications!
Python 2 comes with two different kinds of objects that can be used to represent strings, str and unicode. Instances of unicode are used to express Unicode strings, whereas instances of the str type are byte representations (the encoded string). Under the hood, Python represents Unicode strings as either 16- or 32-bit integers, depending on how the Python interpreter was compiled.
>>> unicode_string = u"Fu\u00dfb\u00e4lle"
>>> print unicode_string
Fußbälle
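The build-dependent storage width mentioned above can be inspected at runtime through sys.maxunicode. The following is a minimal sketch; describe_build is a hypothetical helper name, not part of any library. Note that since Python 3.3 the value is always 0x10FFFF, so the narrow/wide distinction only matters on older interpreters.

```python
import sys


def describe_build(maxunicode=sys.maxunicode):
    """Describe the interpreter's Unicode range (hypothetical helper)."""
    # 0xFFFF is reported by narrow builds that store 16-bit code units.
    if maxunicode == 0xFFFF:
        return "narrow build (16-bit code units)"
    # 0x10FFFF is reported by wide builds and by every Python 3.3+.
    return "full Unicode range (wide build, or any Python 3.3+)"


print(hex(sys.maxunicode), describe_build())
```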
Unicode strings can be converted to 8-bit strings with unicode.encode(). Python's 8-bit strings have a str.decode() method that interprets the string using the given encoding (that is, it is the inverse of unicode.encode()):
>>> type(unicode_string)
<type 'unicode'>
>>> unicode_string.encode("utf-8")
'Fu\xc3\x9fb\xc3\xa4lle'
>>> utf8_string = unicode_string.encode("utf-8")
>>> type(utf8_string)
<type 'str'>
>>> unicode_string == utf8_string.decode("utf-8")
True
Unfortunately, Python 2.x allows you to mix unicode and str if the 8-bit string happens to contain only 7-bit (ASCII) bytes, but raises a UnicodeDecodeError if it contains non-ASCII values.
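When unicode and str were mixed, Python 2 silently decoded the 8-bit string with its default codec, which is ASCII unless the interpreter was reconfigured. The sketch below is written for Python 3 and makes that hidden step explicit; the sample strings are illustrative assumptions only.

```python
# Python 3 sketch of the decoding step Python 2 performed implicitly
# when a unicode object was combined with an 8-bit str.
text = "Fu\u00dfb\u00e4lle"

ascii_bytes = b" rollen"                # contains only 7-bit (ASCII) bytes
utf8_bytes = "\u00e4".encode("utf-8")   # contains non-ASCII bytes

# Succeeds: every byte is valid ASCII.
mixed = text + ascii_bytes.decode("ascii")

# Fails: the first byte (0xc3) is outside the ASCII range.
try:
    text + utf8_bytes.decode("ascii")
    failed = False
except UnicodeDecodeError:
    failed = True

print(mixed, failed)
```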
Since Python 3.0, all strings are stored as Unicode in an instance of the str type. Encoded strings, on the other hand, are represented as binary data in the form of instances of the bytes type. Conceptually, str refers to text, whereas bytes refers to data. Use encode() to go from str to bytes, and decode() to go from bytes to str.
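As a small illustration of the round trip between text and data in Python 3, the sketch below encodes a str to bytes and decodes it again. The variable names are chosen only for this example.

```python
text = "Fu\u00dfb\u00e4lle"

data = text.encode("utf-8")       # str -> bytes
restored = data.decode("utf-8")   # bytes -> str

# The text has 8 characters, but the two non-ASCII characters
# each take two bytes in UTF-8, so the data is 10 bytes long.
print(type(text).__name__, len(text))   # str 8
print(type(data).__name__, len(data))   # bytes 10
print(restored == text)                 # True
```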
In addition, it is no longer possible to mix Unicode strings with encoded strings, because it will result in a TypeError:
>>> text = "Fu\u00dfb\u00e4lle"
>>> data = b" sind rund"
>>> text + data
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Can't convert 'bytes' object to str implicitly
>>> text + data.decode("utf-8")
'Fußbälle sind rund'
>>> text.encode("utf-8") + data
b'Fu\xc3\x9fb\xc3\xa4lle sind rund'
GTK+ uses UTF-8 encoded strings for all text. This means that if you call a method that returns a string, you will always obtain an instance of the str type. The same applies to methods that expect one or more strings as parameters: they must be UTF-8 encoded. However, for convenience, PyGObject will automatically convert any unicode instance to str if it is supplied as an argument:
>>> from gi.repository import Gtk
>>> label = Gtk.Label()
>>> unicode_string = u"Fu\u00dfb\u00e4lle"
>>> label.set_text(unicode_string)
>>> txt = label.get_text()
>>> type(txt)
<type 'str'>
Furthermore:
>>> txt == unicode_string
would return False, with the warning __main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal (Gtk.Label.get_text() will always return a str instance; therefore, txt and unicode_string are not equal).
This is especially important if you want to internationalize your program using gettext. You have to make sure that gettext will return UTF-8 encoded 8-bit strings for all languages.
In general, it is recommended not to use unicode objects in GTK+ applications at all, and to use only UTF-8 encoded str objects, since GTK+ does not fully integrate with unicode objects.
String encoding is more consistent in Python 3.x because PyGObject will automatically encode/decode to/from UTF-8 if you pass a string to a method or a method returns a string. Strings, or text, will always be represented as instances of str only.
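Because text is always str in Python 3, data that reaches your program as bytes (for instance from a file opened in binary mode) would need to be decoded before it is handed to a widget. The following is a minimal sketch under that assumption; to_text is a hypothetical helper, and the Gtk.Label call is shown only as a comment so the snippet runs without a display.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding bytes first (hypothetical helper)."""
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding)
    return value


raw = b"Fu\xc3\x9fb\xc3\xa4lle"   # e.g. read from a file in binary mode
label_text = to_text(raw)

# With an existing Gtk.Label named label (not created here):
# label.set_text(label_text)
print(label_text)
```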
How To Deal With Strings - The Python GTK+ 3 Tutorial