All About Python and Unicode
March 4, 2007 - 3:39pm — frank
... and even more about Unicode
Contents

A Starting Point
Unicode Text in Python
    Converting Unicode symbols to Python literals
    Why doesn't "print" work?
    Codecs
    From Unicode to binary
    From binary to Unicode
    String Operations
    A wrinkle in \U
Bugs in Python 2.0 & 2.1
Python as a "universal recoder"
Now the Fun Begins ... Unicode and the Real World
    Unicode Filenames
        Microsoft Windows
        Unix/POSIX/Linux
        Mac OS/X
    Unicode and HTML
    Unicode and XML
    Unicode and network shares (Samba)
Summary
A Starting Point
Two weeks before I started writing this document, my knowledge of using Python and Unicode was about like this:

    All there is to using Unicode in Python is just passing your strings to unicode()

Now where would I get such a strange idea? Oh, that's right, from the Python tutorial on Unicode, which states: "Creating Unicode strings in Python is just as simple as creating normal strings":
>>> u'Hello World !'
u'Hello World !'
While this example is technically correct, it can be misleading to the Unicode newbie, since it glosses over several
details needed for real-life usage. This overly-simplified explanation gave me a completely wrong understanding of
how Unicode works in Python.
If you have been led down the overly-simplistic path as well, then this tutorial will hopefully help you out. This
tutorial contains a set of examples, tests, and demos that document my "relearning" of the correct way to work
with Unicode in Python. It includes cross-platform issues, as well as issues that arise when dealing with HTML,
XML, and filesystems.
By the way, Unicode is fairly simple; I just wish I had learned it correctly the first time.
Where to begin?
At a top level, computers use three types of text representations:
1. ASCII
2. Multibyte character sets
3. Unicode
I think Unicode is easier to understand if you understand how it evolved from ASCII. The following is a brief
synopsis of this evolution.
From ASCII to Multibyte
In the beginning, there was ASCII. (OK, there was also EBCDIC, but that never caught on outside of mainframes, so I'm omitting it here.) ASCII proper defines 128 characters (codes 0-127), as you can see on this ASCII Chart; the chart also shows the upper 128 byte values (128-255), which are not part of ASCII itself but are used by various extended sets. Early email systems would only allow you to transmit characters 0-127 (i.e. "7-bit text"), and this is still true of many systems today. As you can see from the chart, ASCII is sufficient for English language documents.
Problems arose as computer use grew in countries where ASCII was not sufficient. ASCII lacks the ability to
handle Greek, Cyrillic, or Japanese texts, to name a few. Furthermore, Japanese texts alone need thousands of
characters, so there is no way to fit them into an 8-bit scheme. To overcome this, Multibyte Character Sets were
invented. Most (if not all?) Multibyte Character Sets take advantage of the fact that only the first 128 characters of
the ASCII set are commonly used (codes 0-127 in decimal, or 0x00-0x7f in hex). The upper codes (128..255 in
decimal, or 0x80-0xff in hex) are used to define the non-English extended sets.
Let's look at an example: Shift-JIS is one encoding for Japanese text. You can see its character table here. Notice that the first byte of each two-byte character begins with a hex value from 0x80 - 0xfc. This is an interesting property, because it means that English and Japanese text can be freely mixed! The string "Hello World!" is a perfectly valid Shift-JIS encoding of English text. When parsing Shift-JIS, if you get a byte in the range 0x80-0xff, you know it is the first byte of a two-byte sequence. Else, it is a single byte of regular ASCII.
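To make that lead-byte rule concrete, here is a small splitter sketch (my illustration, not from the original article; it follows the simplified rule above and ignores details like the single-byte katakana range 0xA1-0xDF):

def split_shift_jis(data):
    # split a Shift-JIS byte string into characters, using the
    # simplified rule: a byte >= 0x80 starts a two-byte character
    chars = []
    i = 0
    while i < len(data):
        if data[i] >= '\x80':
            chars.append(data[i:i+2])   # lead byte + trail byte
            i += 2
        else:
            chars.append(data[i])       # plain single-byte ASCII
            i += 1
    return chars

print split_shift_jis('Hello!')   # every byte is its own character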
This works just fine as long as you are working only in Japanese, but what happens if you switch to a Greek
character set? As you can see from the table, ISO-8859-7 has redefined the codes from 0x80-0xff in a completely
different way than Shift-JIS defines them. So, although you can mix English and Japanese, you cannot mix Greek
and Japanese since they would step on each other. This is a common problem with mixing any multibyte character
sets.
From Multibyte to Unicode
To overcome the problem of mixing different languages, Unicode proposes to combine all of the world's character
sets into a single huge table. Take a look at the Unicode character set.
At first glance, there appear to be separate tables for each language, so you may not see the improvement over ASCII. In reality, though, these are all in the same table, and are just indexed here for easy (human) reference.
The key thing to notice is that since these are all part of the same table, they don't overlap like in the
ASCII/multibyte world. This allows Unicode documents to freely mix languages with no coding conflicts.
Unicode terminology
Let's look at the Greek chart and grab a few characters:

Sample Unicode Symbols

    Code    Symbol    Name
    03A0    Π         Greek Capital Letter Pi
    03A3    Σ         Greek Capital Letter Sigma
    03A9    Ω         Greek Capital Letter Omega
It is common to refer to these symbols using the notation U+NNNN, for example U+03A0. So we could define a
string that contains these characters, using the following notation (I added brackets for clarity):
uni = {U+03A0} + {U+03A3} + {U+03A9}

Now, even though we know exactly what 'uni' represents (ΠΣΩ), note that there is no way to:
Print uni to the screen.
Save uni to a file.
Add uni to another piece of text.
Tell me how many bytes it takes to store uni .
Why? Because uni is an idealized Unicode string - nothing more than a concept at this point. Shortly we'll see
how to print it, save it, and manipulate it, but for now, take note of the last statement: There is no way to tell me
how many bytes it takes to store uni . In fact, you should forget all about bytes and think of Unicode strings as
sets of symbols.
Why should you forget about bytes in the Unicode world? Take the Greek symbol Omega: Ω. There are at least 4
ways to encode this as binary:
    Encoding name    Binary representation
    ISO-8859-7       \xD9                                ("Native" Greek encoding)
    UTF-8            \xCE\xA9
    UTF-16           \xFF\xFE\xA9\x03
    UTF-32           \xFF\xFE\x00\x00\xA9\x03\x00\x00
Each of these is a perfectly valid coding of Ω, but trying to work with bytes like this is no better than dealing with
the ASCII/Multibyte world. This is why I say you should think of Unicode as symbols (Ω), not as bytes.
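You can reproduce that table directly in Python (a quick check of my own; the utf-32 codec assumes Python 2.6 or later):

omega = u'\u03a9'
for codec in ('iso-8859-7', 'utf-8', 'utf-16', 'utf-32'):
    print codec, repr(omega.encode(codec))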
Unicode Text in Python
To convert our idealized Unicode string uni (ΠΣΩ) to a useful form, we need to look at a few things:

1. Representing Unicode literals
2. Converting Unicode to binary
3. Converting binary to Unicode
4. Using string operations
Converting Unicode symbols to Python literals
Creating a Unicode string from symbols is very easy. Recall our Greek symbols from above:
Sample Unicode Symbols

    Code    Symbol    Name
    03A0    Π         Greek Capital Letter Pi
    03A3    Σ         Greek Capital Letter Sigma
    03A9    Ω         Greek Capital Letter Omega
Let's say we want to make a Unicode string with those characters, plus some good old-fashioned ASCII characters.
Pseudocode:
uni = 'abc_' + {U+03A0} + {U+03A3} + {U+03A9} + '.txt'
Here is how you make that string in Python:
uni = u"abc_\u03a0\u03a3\u03a9.txt"
A few things to notice:
* Plain-ASCII characters can be written as themselves. You can just say "a", and not have to use the Unicode symbol "\u0061". (But remember, "a" really is {U+0061}; there is no such thing as a Unicode symbol "a".)
* The \u escape sequence is used to denote Unicode codes. This is somewhat like the traditional C-style \xNN to insert binary values. However, a glance at the Unicode table shows values with up to 6 digits. These cannot be represented conveniently by \xNN, so \u was invented.
* For Unicode values up to (and including) 4 digits, use the 4-digit version: \uNNNN. Note that you must include all 4 digits, using leading 0's as needed.
* For Unicode values longer than 4 digits, use the 8-digit version: \UNNNNNNNN. Note that you must include all 8 digits, using leading 0's as needed.
Here is another example:

Pseudocode:
uni = {U+1A} + {U+BC3} + {U+1451} + {U+1D10C}

Python:
uni = u'\u001a\u0bc3\u1451\U0001d10c'

Note how I padded each of the values to 4/8 digits as appropriate. Python will give you an error if you don't do this. Also note that you can use either capital or lowercase letters in the codes. The following would give you exactly the same thing:

Python:
uni = u'\u001A\u0BC3\u1451\U0001D10C'
Why doesn't "print" work?
Remember how I said earlier that uni has no fixed computer representation? So what happens if we try to print uni?
uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni
You would see:
Traceback (most recent call last):
File "t6.py", line 2, in ?
print uni
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-4:
ordinal not in range(128)
What happened? Well, you told Python to print uni , but since uni has no fixed computer representation,
Python first had to convert uni to some printable form. Since you didn't tell Python how to do the conversion, it
assumed you wanted ASCII. Unfortunately, ASCII can only handle values from 0 to 127, and uni contains values
out of that range, hence you see an error.
A quick method to print uni is to use Python's built-in repr() function:
uni = u"\u001A\u0BC3\u1451\U0001D10C"
print repr(uni)
This prints:
u'\x1a\u0bc3\u1451\U0001d10c'
This of course makes sense, since that's exactly how we just defined uni . But repr(uni) is just as useless in
the real world as uni itself. What we really need to do is learn about codecs.
Codecs
In general, Python's codecs allow arbitrary object-to-object transformations. However, in the context of this
article, it is enough to think of codecs as functions that transform Unicode objects into binary Python strings,
and vice versa.
Why do we need them?
Unicode objects have no fixed computer representation. Before a Unicode object can be printed, stored to
disk, or sent across a network, it must be encoded into a fixed computer representation. This is done using a
codec. Some popular codecs you may have heard about in your day to day experiences: ascii, iso-8859-7,
UTF-8, UTF-16.
From Unicode to binary
To turn a Unicode value into a binary representation, you call its .encode method with the name of the codec. For
example, to convert a Unicode value to UTF-8:
binary = uni.encode("utf-8")
How about we make uni more interesting and add some plain text characters:
uni = u"Hello\u001A\u0BC3\u1451\U0001D10CUnicode"
Now let's have a look at how different codecs represent uni. Here is a little test program:
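(The original test_codec01.py was lost from this page; the following sketch reproduces the output below, as captured on a narrow UTF-16 build. Note the lossy codecs need the 'replace' error handler, or .encode() would raise UnicodeEncodeError; the \x1a control character is invisible in the printed ASCII/ISO lines.)

uni = u"Hello\u001A\u0BC3\u1451\U0001D10CUnicode"
print "UTF-8     ", repr(uni.encode('utf-8'))
print "UTF-16    ", repr(uni.encode('utf-16'))
print "ASCII     ", uni.encode('ascii', 'replace')
print "ISO-8859-1", uni.encode('iso-8859-1', 'replace')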
This results in the output:
UTF-8      'Hello\x1a\xe0\xaf\x83\xe1\x91\x91\xf0\x9d\x84\x8cUnicode'
UTF-16     '\xff\xfeH\x00e\x00l\x00l\x00o\x00\x1a\x00\xc3\x0bQ\x144\xd8\x0c\xddU\x00n\x00i\x00c\x00o\x00d\x00e\x00'
ASCII      Hello????Unicode
ISO-8859-1 Hello????Unicode
Note that I still used repr() to print the UTF-8 and UTF-16 strings. Why? Well, otherwise, it would have printed
raw binary values to the screen which would have been hard to capture in this document.
From binary to Unicode
Say someone gives you a UTF-8 encoded version of a Unicode object. How do you convert it back into Unicode?
You might naively try this:
The Naive (and Wrong) Way
uni = unicode( utf8_string )
Why is this wrong? Here is a sample program doing exactly that:
uni = u"Hello\u001A\u0BC3\u1451\U0001D10CUnicode"
utf8_string = uni.encode('utf-8')
# naively convert back to Unicode
uni = unicode(utf8_string)
Here is what happens:
Traceback (most recent call last):
File "t6.py", line 5, in ?
uni = unicode(utf8_string)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0
in position 6: ordinal not in range(128)
You see, the function unicode() really takes two parameters:
def unicode(string, encoding):
....
In the above example, we omitted the encoding so Python, in faithful style, assumed once again that we wanted
ASCII (footnote 1), and gave us the wrong thing.
Here is the correct way to do it:
uni = u"Hello\u001A\u0BC3\u1451\U0001D10CUnicode"
utf8_string = uni.encode('utf-8')
# have to decode with the same codec the encoder used!
uni = unicode(utf8_string,'utf-8')
print "Back from UTF-8: ",repr(uni)
Which gives the output:
Back from UTF-8: u'Hello\x1a\u0bc3\u1451\U0001d10cUnicode'
String Operations
The above examples hopefully give you a good idea of why you want to avoid dealing with Unicode values as
binary strings as much as possible! The UTF-8 version was 23 bytes long, the UTF-16 version was 36 bytes, the
ASCII version was only 16 bytes (but it completely discarded 4 Unicode values) and similarly with ISO-8859-1.
This is why, at the very start of this document I suggested that you forget all about bytes!
The good news is that once you have a Unicode object, it behaves exactly like a regular string object, so there is
no new syntax to learn (other than the \u and \U escapes). Here is a short sample that shows Unicode objects
behaving the way you would expect:
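(test_stringops01.py is likewise missing from this page; here is a minimal sketch that would produce the output below, run on a UTF-16 build of Python:)

uni = u"Hello\u001A\u0BC3\u1451\U0001D10CUnicode"
print "uni =", repr(uni)
print "len(uni) =", len(uni)
print "uni[:5] =", uni[:5]
for i in range(5, 10):
    print "uni[%d] =" % i, repr(uni[i])
print "uni[10:] =", repr(uni[10:])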
Running this sample gives the output:
uni = u'Hello\x1a\u0bc3\u1451\U0001d10cUnicode'
len(uni) = 17
uni[:5] = Hello
uni[5] = u'\x1a'
uni[6] = u'\u0bc3'
uni[7] = u'\u1451'
uni[8] = u'\ud834'
uni[9] = u'\udd0c'
uni[10:] = u'Unicode'
A wrinkle in \U
Depending on how your version of Python was compiled, it will store Unicode objects internally in either UTF-16 (2
bytes/character) or UTF-32 (4 bytes/character) format. Unfortunately this low-level detail is exposed through the
normal string interface.
For 4-digit (16-bit) characters like \u03a0 , there is no difference.
a = u'\u03a0'
print len(a)
Will show a length of 1, regardless of how your Python was built, and a[0] will always be \u03a0 . However, for 8-digit (32-bit) characters, like \U0001FF00 , you will see a difference. Obviously, 32-bit values cannot be directly represented as a single 16-bit code, so a pair of 16-bit values is used. (Codes 0xD800 - 0xDFFF , called "surrogates", are reserved for these two-character sequences. These values are invalid when used by themselves, per the Unicode specification.)
A sample program that shows what happens:
What happens with \U ...
a = u'\U0001ff00'
print "Length:",len(a)
print "Chars:"
for c in a:
print repr(c)
If you run this under a "UTF-16" Python, you will see:
Output, 'UTF-16' Python
Length: 2
Chars:
u'\ud83f'
u'\udf00'
Under a 'UTF-32' Python, you will see:
Output, 'UTF-32' Python
Length: 1
Chars:
u'\U0001ff00'
This is an annoying detail to have to worry about. I wrote a module that lets you step character-by-character
through a Unicode string, regardless of whether you are running on a 'UTF-16' or 'UTF-32' flavor of Python. It is
called xmlmap and is part of Gnosis Utils. Here are two examples, one using xmlmap, one not.
Without xmlmap
a = u'A\U0001ff00C\U0001fafbD'
print "Length:",len(a)
print "Chars:"
for c in a:
print repr(c)
Results without xmlmap , on a UTF-16 Python
Length: 7
Chars:
u'A'
u'\ud83f'
u'\udf00'
u'C'
u'\ud83e'
u'\udefb'
u'D'
Now, using the usplit() function, to get the characters one-at-a-time, combining split values where needed:
With xmlmap
from gnosis.xml.xmlmap import usplit
a = u'A\U0001ff00C\U0001fafbD'
print "Length:",len(a)
print "Chars:"
for c in usplit(a):
print repr(c)
Results with xmlmap , on a UTF-16 Python
Length: 7
Chars:
u'A'
u'\U0001ff00'
u'C'
u'\U0001fafb'
u'D'
Now you will get identical results regardless of how your Python was compiled. (Note that the length is still the
same, but usplit() has combined the surrogate pairs so you don't see them.)
Bugs in Python 2.0 & 2.1
Yes, you may wonder "who cares" when it comes to Python 2.0 and 2.1, but when writing code that's supposed to be completely portable, it does matter! Python 2.0.x and 2.1.x have a fatal bug when trying to handle single-character codes in the range \uD800 - \uDFFF. The sample code below demonstrates the problem:
u = unichr(0xd800)
print "Orig: ",repr(u)
# create utf-8 from '\ud800'
ue = u.encode('utf-8')
print "UTF-8: ",repr(ue)
# decode back to unicode
uu = unicode(ue,'utf-8')
print "Back: ",repr(uu)
Running this under Python 2.2 and up gives the expected result:
Orig: u'\ud800'
UTF-8: '\xed\xa0\x80'
Back: u'\ud800'
Python 2.0.x gives:
Orig: u'\uD800'
UTF-8: '\240\200'
Traceback (most recent call last):
File "test_utf8_bug.py", line 9, in ?
uu = unicode(ue,'utf-8')
UnicodeError: UTF-8 decoding error: unexpected code byte
Python 2.1.x gives:
Orig: u'\ud800'
UTF-8: '\xa0\x80'
Traceback (most recent call last):
File "test_utf8_bug.py", line 9, in ?
uu = unicode(ue,'utf-8')
UnicodeError: UTF-8 decoding error: unexpected code byte
As you can see, both fail to encode u'\ud800' when used as a single character. While it is true that the characters from 0xD800 .. 0xDFFF are not valid when used by themselves, the fact is that Python will let you use them alone.
But if they're invalid, why should Python bother?
I came up with a good example, completely by accident while working on the code for this tutorial. Create two
Python files:
aaa.py
x = u'\ud800'
bbb.py
import sys
sys.path.insert(0,'.')
import aaa
Now, use Python 2.0.x/2.1.x to run bbb.py twice (it needs to run twice so it will load aaa.pyc the second time).
On the second run, you'll get:
Traceback (most recent call last):
File "bbb.py", line 3, in ?
import aaa
UnicodeError: UTF-8 decoding error: unexpected code byte
That's right, Python 2.0.x/2.1.x are unable to reload their own bytecode from a .pyc file if the source contains a string
like u'\ud800' . A portable workaround in that case would be to use unichr(0xd800) instead of u'\ud800' (this
is what gnosis.xml.pickle does).
Python as a "universal recoder"
Up to this point, I've been translating Unicode to/from UTF for purposes of demonstration. However, Python lets
you do much more than that. It allows you to translate nearly any multibyte character string into Unicode (and vice
versa). Implementing all of these translations is a lot of work. Fortunately, it has been done for us, so all we have
to do is know how to use it.
Let's revisit our Greek table, except this time I'm going to list the characters both in Unicode as well as ISO-8859-7 ("native Greek").
    Character    Name                          As Unicode    As ISO-8859-7
    Π            Greek Capital Letter Pi       03A0          0xD0
    Σ            Greek Capital Letter Sigma    03A3          0xD3
    Ω            Greek Capital Letter Omega    03A9          0xD9
With Python, using unicode() and .encode() makes it trivial to translate between these.
# {Pi}{Sigma}{Omega} as ISO-8859-7 encoded string
b = '\xd0\xd3\xd9'
# Convert to Unicode ('universal format')
u = unicode(b, 'iso-8859-7')
print repr(u)
# ... and back to ISO-8859-7
c = u.encode('iso-8859-7')
print repr(c)
Shows:
u'\u03a0\u03a3\u03a9'
'\xd0\xd3\xd9'
You can also use Python as a "universal recoder". Say you received a file in the Japanese encoding ShiftJIS and
wanted to convert to the EUC-JP encoding:
txt = ... the ShiftJIS-encoded text ...
# convert to Unicode ("universal format")
u = unicode(txt, 'shiftjis')
# convert to EUC-JP
out = u.encode('eucjp')
Of course, this only works when translating between compatible character sets. Trying to translate between
Japanese and Greek character sets this way would not work.
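A quick illustration of that failure (my example, not from the original text): Japanese characters simply have no coding in the Greek character set, so the second encode step raises an exception:

# "nihongo" (Japanese), as a Shift-JIS byte string
txt = '\x93\xfa\x96\x7b\x8c\xea'
u = unicode(txt, 'shiftjis')
try:
    u.encode('iso-8859-7')
except UnicodeEncodeError, e:
    print "Cannot recode Japanese as Greek:", e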
Now the Fun Begins ... Unicode and the Real World
Now you know about everything you need to know to work with Unicode objects within Python. Isn't that nice? However, the rest of the world isn't quite as nice and neat as Python, so you need to understand how the non-Python portion of the world handles Unicode. It isn't terribly hard, but there are a lot of special cases to consider.
From here on out, we'll be looking at Unicode issues that arise when dealing with:
1. Filenames (Operating System specific issues)
2. XML
3. HTML
4. Network files (Samba)
Unicode Filenames
Sounds simple enough, right? If I want to name a file with my Greek letters, I just say:
open(unicode_name, 'w')
In theory, yes, that's supposed to be all there is to it. However, there are many ways for this to not work, and
they depend on the platform your program is running on.
Microsoft Windows
There are at least two ways of running Python under Windows. The first is to use the Win32 binaries from
www.python.org. I will refer to this method as "Windows-native Python".
The other method is by using the version of Python that comes with Cygwin. This version of Python looks (to user code) more like POSIX, instead of like a Windows-native environment.
For many things, the two versions are interchangeable. As long as you write portable Python code, you shouldn't
have to care which interpreter you are running under. However, one important exception is when handling
Unicode. That is why I'll be specific here about which version I am running.
Using Windows-native Python
Let's keep using our familiar Greek symbols:
Sample Unicode Symbols

    Code    Symbol    Name
    03A0    Π         Greek Capital Letter Pi
    03A3    Σ         Greek Capital Letter Sigma
    03A9    Ω         Greek Capital Letter Omega
Our sample Unicode filename will be:
# this is: abc_{PI}{Sigma}{Omega}.txt
uname = u"abc_\u03A0\u03A3\u03A9.txt"
Let's create a file with that name, containing a single line of text:
open(uname,'w').write('Hello world!\n')
Opening up an Explorer window shows the results (click for a larger version): [screenshot win32_01.jpg]

There the filename is in all its Unicode glory.
Now, let's see how os.listdir() works with this name. The first thing to know is that os.listdir() has two modes of operation:

* Non-Unicode, achieved by passing a non-Unicode string to os.listdir(), i.e. os.listdir('.')
* Unicode, achieved by passing a Unicode string to os.listdir(), i.e. os.listdir(u'.')
First, let's try as Unicode:
os.chdir('ttt')
# there is only one file in directory 'ttt'
name = os.listdir(u'.')[0]
print "Got name: ",repr(name)
print "Line: ",open(name,'r').read()
Running this program gives the following output:
Got name: u'abc_\u03a0\u03a3\u03a9.txt'
Line: Hello world!
Comparing with above, that looks correct. Note that print repr(name) was required, since an error would have
occurred if I had tried to print name directly to the screen. Why? Yep, once again Python would have assumed you
wanted an ASCII coding, and would have failed with an error.
Now let's try the above sample again, but using the non-Unicode version of os.listdir() :
os.chdir('ttt')
# there is only one file in directory 'ttt'
name = os.listdir('.')[0]
print "Got name: ",repr(name)
print "Line: ",open(name,'r').read()
Gives this output:
Got name: 'abc_?SO.txt'
Line: Traceback (most recent call last):
File "c:\frank\src\unicode\t2.py", line 8, in ?
print "Line: ",open(name,'r').read()
IOError: [Errno 2] No such file or directory: 'abc_?SO.txt'
Yikes! What happened? Welcome to the wonderful world of the win32 "dual-API".
A little background:
Windows NT/2000/XP always write filenames to the underlying filesystem as Unicode (footnote 2). So in theory, Unicode filenames should work flawlessly with Python.
Unfortunately, win32 actually provides two sets of APIs for interfacing with the filesystem. And in true
Microsoft style, they are incompatible. The two APIs are:
1. A set of APIs for Unicode-aware applications, that return the true Unicode names.
2. A set of APIs for non-Unicode aware applications, that return a locale-dependent coding of the true Unicode filenames.
Python (for better or worse) follows this convention on win32 platforms, so you end up with two incompatible
ways of calling os.listdir() and open() :
1. When you call os.listdir() , open() , etc. with a Unicode string, Python calls the Unicode version of the APIs, and you get the true Unicode filenames. (This corresponds to the first set of APIs above.)
2. When you call os.listdir() , open() , etc. with a non-Unicode string, Python calls the non-Unicode version of the APIs, and here is where the trouble creeps in. The non-Unicode APIs handle Unicode with a particular codec called MBCS. MBCS is a lossy codec: every MBCS name can be represented as Unicode, but not vice versa. MBCS coding also changes depending on the current locale. In other words, if I write a CD with a multibyte-character filename as MBCS on my English locale machine, then send the CD to Japan, the filename there may appear to contain completely different characters.
Now that we know the background facts, we can see what happened above. By using os.listdir('.') , you are
getting the MBCS-version of the true Unicode name that is stored on the filesystem. And, on my English-locale
computer, there is no accurate mapping for the Greek characters, so you end up with "?" , "S" , and "O" . This
leads to the weird result that there is no way to open our Greek-lettered file using the MBCS APIs in an English
locale (!!).
Bottom line
I recommend always using Unicode strings in os.listdir() , open() , etc. Remember that Windows NT/2000/XP always stores filenames as Unicode, and so this is the native behavior. And, as shown above, it can sometimes be the only way to open a Unicode filename.
Danger! Cygwin
Cygwin has a huge problem here. It (currently, at least) has no support for Unicode. That
is, it will never call the Unicode versions of the win32 APIs. Hence, it is impossible to open
certain files (like our Greek-lettered filename) from Cygwin. It doesn't matter if you use
os.listdir(u'.') or os.listdir('.') ; you always get the MBCS-coded versions.
Please note that this isn't a Python-specific problem; it is a systemic problem with Cygwin.
All Cygwin utilities, such as zsh , ls , zip , unzip , mkisofs , will be unable to recognize
our Greek-lettered name, and will report various errors.
Unix/POSIX/Linux
Unlike Windows NT/2000/XP, which always store filenames in Unicode format, POSIX systems (including Linux)
always store filenames as binary strings. This is somewhat more flexible, since the operating system itself doesn't
have to know (or care) what encoding is used for filenames. The downside is that the user is responsible for
setting up their environment ("locale") for the proper coding.
Setting a locale
The specifics of setting up your POSIX box to handle Unicode filenames are beyond the scope of this document,
but it generally comes down to setting a few environment variables. In my case, I wanted to use the UTF-8 codec
in a U.S. English locale, so my setup involved adding a few lines to these startup files (I've tried this under Gentoo
Linux and Ubuntu, though all Linux systems should be similar):
Additions to .bashrc :
LANG="en_US.utf8"
LANGUAGE="en_US.utf8"
LC_ALL="en_US.utf8"
export LANG
export LANGUAGE
export LC_ALL
For good measure, I added the same lines to my .zshrc file.
Additionally, I added the first three lines to /etc/env.d/02locale .
CAUTION
Please do not blindly make changes like the above to your system if you aren't sure what you're
doing. You could make your files unreadable by switching locales. The above is meant only as an
example of a simple case of switching from an ASCII locale to a UTF-8 locale.
Python under POSIX
A big advantage under POSIX, as far as Python is concerned, is that you can use either:
os.listdir('.')
Or:
os.listdir(u'.')
Both methods will give you strings that you can pass to open() to open the files. This is much better than Windows, where os.listdir('.') returns mangled versions of the Unicode names and, as shown above, can sometimes fail to give you a valid name to open the file at all. You will always get a valid name under POSIX/Linux.
Here is a sample function to demonstrate that:
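(The original test_posix01 sample didn't survive in this page; here is a sketch that would produce the output below, assuming a directory ttt containing only our Greek-lettered file:)

import os
os.chdir('ttt')
# Unicode form of the name
uname = os.listdir(u'.')[0]
print "As unicode:", repr(uname)
print "Read line:", open(uname).read(),
# bytestring form (UTF-8 bytes, in a UTF-8 locale)
bname = os.listdir('.')[0]
print "As bytestring:", repr(bname)
print "Read line:", open(bname).read(),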
If you run this you'll get:
As unicode: u'abc_\u03a0\u03a3\u03a9.txt'
Read line: Hello unicode!
As bytestring: 'abc_\xce\xa0\xce\xa3\xce\xa9.txt'
Read line: Hello unicode!
As you can see, we were able to successfully read the file, no matter if we used the Unicode or bytestring version
of the filename.
Application Demos
Unlike the Microsoft Windows world where you basically have a "DOS box" and Windows Explorer, under Linux you
have many choices about what terminal and file manager you want to run. This is both a blessing and a curse: a blessing in that you can pick an application that suits your preferences, but also a curse in that not all applications support Unicode to the same extent.
The following is a survey of several popular applications to see what they support.
Applications that support Unicode filenames
My personal current favorite is mlterm , a multi-lingual terminal (click for a larger version): [screenshot mlterm_01.jpg]

The GNOME terminal ( gnome-terminal ): [screenshot gnome_terminal_01.jpg]

The KDE terminal ( konsole ): [screenshot konsole_01.jpg]

A modified version of rxvt ( rxvt-unicode ) handles Unicode, although it has some issues with underscore characters in the font I've chosen: [screenshot urxvt_01.jpg]

Here is our Greek-lettered file in the KDE file manager (Konqueror): [screenshot konq_01.jpg]

And here it is in the GNOME file manager (Nautilus): [screenshot naut_01.jpg]

The XFCE 4 file manager: [screenshot xfce_01.jpg]

The standard KDE file selector supports Unicode filenames: [screenshot kfilesel_01.jpg]

As does the GNOME file selector: [screenshot gfilesel_01.jpg]
Applications that do not support Unicode filenames
The standard rxvt does not handle Unicode correctly: [screenshot rxvt_01.jpg]

The Xfm file manager does not handle Unicode filenames: [screenshot xfm_01.jpg]
Mac OS/X
I don't have an OSX machine to test this on, but helpful readers have contributed some information on Unicode
support in OSX.
One reader pointed out that os.listdir('.') and os.listdir(u'.') both return objects that can be passed
directly to open() , as you can do under POSIX.
Reader Hraban noted:
You should mention that MacOS X uses a special kind of decomposed UTF-8 to store filenames. If you need to
e.g. read in filenames and write them to a "normal" UTF-8 file, you must normalize them (at least if your editor,
or my TeX system, doesn't understand decomposed UTF-8):
filename = unicodedata.normalize('NFC', unicode(filename, 'utf-8')).encode('utf-8')
For others reading this who aren't familiar with this issue (like I wasn't) here are a few references:
Text Encodings in VFS
unicode filenames
My understanding of this is that when you pass a name with an accented character like é, it will decompose this
into e plus ' before saving it to the filesystem (this behavior is defined by the Unicode standard).
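A tiny demo of the difference, using the standard unicodedata module (my sketch, not from the original page):

import unicodedata

nfd = u'e\u0301'                          # 'e' + combining acute, as OS X stores it
nfc = unicodedata.normalize('NFC', nfd)   # composed form
print repr(nfd), repr(nfc)                # u'e\u0301' u'\xe9'
print nfd == nfc                          # False: same glyph, different code points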
If you can add anything else to this section, please leave a comment below!
Unicode and HTML
You may find yourself generating HTML with Python (i.e. when using mod_python, CherryPy, or such). So how do
you use Unicode characters in an HTML document?
The answer involves these easy steps:

1. Use a <meta> tag to let the user's browser know the encoding you used. (footnote 3)
2. Generate your HTML as a Unicode object.
3. Write your HTML bytestream using whichever codec you prefer.
Here is an example, writing the same Greek-lettered string I've been using all along:
code = 'utf-8' # make it easy to switch the codec later
html = u'<html>'
# use a <meta> tag to specify the document encoding used
html += u'<meta http-equiv="content-type" content="text/html; charset=%s">' % code
html += u'<head></head><body>'
# my actual Unicode content ...
html += u'abc_\u03A0\u03A3\u03A9.txt'
html += u'</body></html>'
# Now, you cannot write Unicode directly to a file.
# First have to either convert it to a bytestring using a codec, or
# open the file with the 'codecs' module.
# Method #1, doing the conversion yourself:
open('t.html','w').write( html.encode( code ) )
# Or, by using the codecs module:
import codecs
codecs.open('t.html','w',code).write( html )
# .. the method you use depends on personal preference and/or
# convenience in the code you are writing.
Now let's open the page (t.html) in Firefox: [screenshot win32_02.jpg]

Just as expected!
Now, if you go back into the sample code and replace the line:
code = 'utf-8'
With ...
code = 'utf-16'
... the HTML file will now be written in UTF-16 format, but the result displayed in the browser window will be
exactly the same.
Unicode and XML
The XML 1.0 standard requires all parsers to support UTF-8 and UTF-16 encoding. So, it would seem obvious that
an XML parser would allow any legal UTF-8 or UTF-16 encoded document as input, right?
Nope!
Have a look at this sample program:
xml = u'<?xml version="1.0" encoding="utf-8" ?>'
xml += u'<H> \u0019 </H>'
# encode as UTF-8
utf8_string = xml.encode( 'utf-8' )
At this point, utf8_string is a perfectly valid UTF-8 string representing the XML. So we should be able to parse
it, right?:
from xml.dom.minidom import parseString
parseString( utf8_string )
Here is what happens when we run the above code:
Traceback (most recent call last):
File "t9.py", line 9, in ?
parseString( utf8_string )
File "c:\py23\lib\xml\dom\minidom.py", line 1929, in parseString
return expatbuilder.parseString(string)
File "c:\py23\lib\xml\dom\expatbuilder.py", line 940, in parseString
return builder.parseString(string)
File "c:\py23\lib\xml\dom\expatbuilder.py", line 223, in parseString
parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 43
Whoa - what happened there? It gave us an error at column 43. Let's see what column 43 is:
>>> print repr(utf8_string[43])
'\x19'
You can see that it doesn't like the Unicode character U+0019 . Why is this? Section 2.2 of the XML 1.0 standard
defines the set of legal characters that may appear in a document. From the standard:
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] |
[#xE000-#xFFFD] | [#x10000-#x10FFFF]
Clearly, there are some major gaps in the characters that are legal to include in an XML document. Let's turn the above into a Python function that can be used to test whether a given Unicode value is legal to write to an XML stream:
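(The gnosis source was not captured in this page. Here is a sketch of an equivalent regex, my reconstruction rather than the actual gnosis code: it matches anything outside the Char production, plus unpaired surrogates.)

import re

def raw_illegal_xml_regex():
    # chars outside the XML 1.0 Char production, plus a high surrogate
    # not followed by a low one, or a low surrogate with no high before it
    return (u'[\u0000-\u0008\u000b\u000c\u000e-\u001f\ufffe\uffff]'
            u'|[\ud800-\udbff](?![\udc00-\udfff])'
            u'|(?<![\ud800-\udbff])[\udc00-\udfff]')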
Using the code ...
def make_illegal_xml_regex():
return re.compile( raw_illegal_xml_regex() )
c_re_xml_illegal = make_illegal_xml_regex()
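Finally, is_legal_xml() itself; again a sketch of the idea rather than the actual gnosis source:

def is_legal_xml(uval):
    # legal if no illegal character (or unpaired surrogate) is found
    return c_re_xml_illegal.search(uval) is None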
The above function is good for when you have a Unicode string, but could be a little slow when searching a
character at a time. So here is an alternate function for doing that (note this makes use of the usplit() function
defined earlier):
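(Again the gnosis source is missing here; a sketch of the idea. Each 'c' is one element from usplit(): a single character, or a two-character surrogate pair on UTF-16 builds.)

def is_legal_xml_char(c):
    if len(c) == 2:
        # a proper surrogate pair always encodes a legal char (U+10000-U+10FFFF)
        return (u'\ud800' <= c[0] <= u'\udbff' and
                u'\udc00' <= c[1] <= u'\udfff')
    n = ord(c)
    return (n in (0x09, 0x0A, 0x0D) or
            0x20 <= n <= 0xD7FF or
            0xE000 <= n <= 0xFFFD or
            0x10000 <= n <= 0x10FFFF)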
Here is a fairly extensive test case to demonstrate the above functions:
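(The test_xml_legality module isn't available either; here is a shortened sketch of what it evidently did, using the values visible in the output below. The helper name check_chars is mine.)

from gnosis.xml.xmlmap import usplit

def check_chars(s):
    # 1 if every character (with surrogate pairs combined) is legal, else 0
    for c in usplit(s):
        if not is_legal_xml_char(c):
            return 0
    return 1

bad = [u'abc\x01def', u'abc\x0cdef', u'abc\x15def', u'abc\ud900def',
       u'abc\udddddef', u'abc\ufffedef', u'abc\ud800', u'\udc00']
good = [u'abc\tdef\nghi', u'abc\rdef', u'abc def\u8112ghi\ud7ffjkl',
        u'abc\ue000def\uf123ghi\ufffdjkl',
        u'abc\ud800\udc00def\ud84d\udc56ghi\udbc4\ude34jkl']

print "** BAD VALUES **"
for s in bad:
    print repr(s), is_legal_xml(s), check_chars(s)
print "** GOOD VALUES **"
for s in good:
    print repr(s), is_legal_xml(s), check_chars(s)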
I'm going to run this under two different versions of Python to show the differences you can see in \U coding.
First, under Python 2.0 (which uses 2-char \U encoding, on my machine):
** BAD VALUES **
u'abc\001def' 0 0
u'abc\014def' 0 0
u'abc\025def' 0 0
u'abc\uD900def' 0 0
u'abc\uDDDDdef' 0 0
u'abc\uFFFEdef' 0 0
u'abc\uD800' 0 0
u'\uDC00' 0 0
** GOOD VALUES **
u'abc\011def\012ghi' 1 1
u'abc\015def' 1 1
u'abc def\u8112ghi\uD7FFjkl' 1 1
u'abc\uE000def\uF123ghi\uFFFDjkl' 1 1
u'abc\uD800\uDC00def\uD84D\uDC56ghi\uDBC4\uDE34jkl' 1 1
Testing one char at a time ...
u'\000\005\010\013\014\016\020\031\uD800\uD900\000\uDC00\uDD00
\uDFFF\uFFFE\uFFFF'
OK
u'\011\012\015 \u2345\uD7FF\uE000\uE876\uFFFD\uD800\uDC00\uD808
\uDF45\uDBC0\uDC00\uDBFF\uDFFF\uD800\uDC00'
OK
And now under Python 2.3, which on my machine stores \U as a single character:
** BAD VALUES **
u'abc\x01def' False 0
u'abc\x0cdef' False 0
u'abc\x15def' False 0
u'abc\ud900def' False 0
u'abc\udddddef' False 0
u'abc\ufffedef' False 0
u'abc\ud800' False 0
u'\udc00' False 0
** GOOD VALUES **
u'abc\tdef\nghi' True 1
u'abc\rdef' True 1
u'abc def\u8112ghi\ud7ffjkl' True 1
u'abc\ue000def\uf123ghi\ufffdjkl' True 1
u'abc\U00010000def\U00023456ghi\U00101234jkl' True 1
Testing one char at a time ...
u'\x00\x05\x08\x0b\x0c\x0e\x10\x19\ud800\ud900\x00\udc00\udd00
\udfff\ufffe\uffff'
OK
u'\t\n\r \u2345\ud7ff\ue000\ue876\ufffd\U00010000\U00012345
\U00100000\U0010ffff\U00010000'
OK
You can see that both versions of Python give the same answers (except Python 2.0 uses 1/0 instead of True/False). But you can see in the repr() coding at the end that the two versions represent \U in different ways. As long as you use the usplit() function defined earlier, you will see no differences in your code.
OK, so now we've established that you cannot put certain characters in an XML file. How do we get around this?
Maybe we can encode the illegal values as XML entities?
xml = u'<?xml version="1.0" encoding="utf-8" ?>'
# try to cheat and put \u0019 as an entity ...
xml += u'<H> &#x19; </H>'
# encode as UTF-8
utf8_string = xml.encode( 'utf-8' )
# parse it
from xml.dom.minidom import parseString
parseString( utf8_string )
Running this gives the output:
Traceback (most recent call last):
File "t10.py", line 11, in ?
parseString( utf8_string )
File "c:\py23\lib\xml\dom\minidom.py", line 1929, in parseString
return expatbuilder.parseString(string)
File "c:\py23\lib\xml\dom\expatbuilder.py", line 940, in parseString
return builder.parseString(string)
File "c:\py23\lib\xml\dom\expatbuilder.py", line 223, in parseString
parser.Parse(string, True)
xml.parsers.expat.ExpatError: reference to invalid character number: line 1, column 43
Nope! According to the XML 1.0 standard, the illegal characters are not allowed, no matter how we try to cheat
and stuff them in there. In fact, if a parser allows any of the illegal characters, then by definition it is not an XML
parser. A key idea of XML is that parsers are not allowed to be "forgiving", to avoid the mess of incompatibility
that exists in the HTML world.
So how do we handle the illegal characters?
Due to the fact that the characters are illegal, there is no standard way to handle them. It is up to the XML
author (or application) to find another way to represent the illegal characters. Perhaps a future version of
XML standard will help address this situation.
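For example, one ad-hoc workaround (my sketch, not a standard mechanism) is to substitute the legal replacement character U+FFFD before serializing, accepting that the original characters cannot be recovered on the other end:

import re

# matches the BMP characters that are illegal in XML 1.0
c_illegal = re.compile(u'[\u0000-\u0008\u000b\u000c\u000e-\u001f\ufffe\uffff]')

def sanitize_for_xml(utext):
    # replace each illegal character with U+FFFD (the replacement character)
    return c_illegal.sub(u'\ufffd', utext)

print repr(sanitize_for_xml(u'<H> \u0019 </H>'))   # u'<H> \ufffd </H>'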
Unicode and network shares (Samba)
Samba 3.0 and up has the ability to share files with Unicode filenames. In fact, the test was very uneventful: I simply opened a Samba share (from my Linux machine) on a Windows client, opened the folder with the Greek-lettered filename in it, and the result is: [screenshot samba_01.jpg]
Perhaps there are more complicated setups out there where this wouldn't work so well, but it was completely
painless for me. Samba defaults to UTF-8 coding, so I didn't even have to modify my smb.conf file.
Summary
There are a few topics I've omitted, but plan to add them later. Among them:

1. Some examples of how to work around the "illegal XML character" issues, by defining our own coding transforms.
2. It is perfectly possible for os.listdir(u'.') to return non-Unicode strings (it means that the filename was not stored with a coding legal in the current locale). The problem is that if you have a mix of legal and illegal names, e.g. /a-legal/b-illegal/c-legal , you cannot use os.path.join() to concatenate the Unicode and non-Unicode parts, since that would not be the correct filename (due to b-illegal not having a valid Unicode coding, in the above example). The only solution I've found is to os.chdir() to each path component, one at a time, when opening files, traversing directories, etc. I need to write a section to expand on this issue.
Several of the functions defined in this document ( usplit() , is_legal_xml() , is_legal_xml_string() ) are
available as part of Gnosis Utils (of which I'm a coauthor). Version 1.2.0 is the first release with the functions. They
are available in the package gnosis.xml.xmlmap . In upcoming versions, I plan to incorporate the Unicode->XML
transforms mentioned above.
Footnotes:

1. In my opinion, if the creators of Python's Unicode support had merely omitted the "default ASCII" logic, it would have been much clearer, as that would force newbies to understand what was going on, instead of blindly using unicode(value) without an explicit coding. Now, to be fair, using ASCII as a default encoding is reasonable. Since Python's ASCII codec only accepts codes from 0-127, if unicode() works, ASCII is almost certainly the correct codec.

2. I'm not sure what earlier versions do (95/98 era), but I'm guessing their Unicode support is not up to current standards.

3. Actually, Firefox and Internet Explorer were able to correctly display the page without a correct <meta> tag, but in general you should always include it, since auto-guessing may not work on all platforms, or for all HTML documents.
About this document ...
Author: Frank McIngvale
Version: 1.3
Last Revised: Apr 22, 2007
Written in WikklyText.
Tags: Python, XML
Comments
many thanks!
June 24, 2009 - 5:28am — yuri (not verified)
many thanks for this precious guide!!!
bizarre behaviour
January 28, 2009 - 4:59pm — rog peppe (not verified)
I've been trying to get utf-8 output working properly in python (2.5.1) and I've encountered some strangely inconsistent behaviour. It seems like a bug, but I'm probably just getting something wrong. Basically print works, but sys.stdout.write of the same thing doesn't.
Any ideas?
Python 2.5.1 (r251:54863, Apr 15 2008, 22:57:26) [GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> print unichr(0xe9)
é
>>> sys.stdout.write(unichr(0xe9))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
>>> class Test:
...     def write(self, x):
...         sys.stdout.write(x)
...
>>> print >> Test(), unichr(0xe9)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in write
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
>>> print >> sys.stdout, unichr(0xe9)
é
>>> sys.stdout.encoding
'UTF-8'
printing unicode strings to console or file
October 18, 2008 - 6:44pm — Hanan (not verified)
Thanks for the informative and well written page.
I've been trying to write a script that renames files. Basically:
1. reading the files with os.listdir('.') # also tried (u'.')
2. manipulate filenames
3. print "mv \"" + src + "\" \"" + dst + "\""
The weird problem is that it prints to the console, but when I try to redirect the output to a file "script > /tmp/a" I
get the following:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 9-15: ordinal not in range(128)
any clue?
thanks,
hanan
Great, but you left out a major gotcha
June 10, 2008 - 9:41pm — Cameron Kerr (not verified)
As is common to most Unicode-related tutorials I have seen on the net (and including documents such as GNU’s
Libc manual), there are a number of very important gotchas that programmers are just going to have to know
about; the best place to read up on these issues is by browsing the Unicode Technical Documents and Unicode
Standard Annexes; the first that I would suggest is the Unicode Standard Annex regarding Normalisation as this
has a major impact on correct code.
Briefly, there are four normal forms, but you only need to know about two of them: Normalisation Form C (NFC)
and Normalisation Form KC (NFKC). Put very simply, Unicode text sent out (eg. to the network) SHOULD be in NFC.
Any string comparisons — such as for things like filenames, usernames or any string you wish to sort on — need to
be normalised into NFKC prior to comparison. NFKC remaps the string so any “compatibility characters” are
canonicalised, meaning their memory order is consistent and so can be compared. Here is an example:
>>> "Richard IV" == "Richard \u2163"
False
The strings "Richard IV" and "Richard Ⅳ" are considered identical for the purposes of human consideration. The
first string is composed of 'I' + 'V', but the second string is composed of a single code-point U+2163 ROMAN
NUMERAL FOUR. This is a ‘compatibility character’. It generally shouldn’t be used but some systems may
automatically have mapped ‘I’ + ‘V’ into U+2163 (or it may have been transcoded from another character set). The
NFKC normalisation process essentially changes occurrences of U+2163 to ‘I’ + ‘V’. Other examples in European
scripts typically come from ligatures such as U+0133 LATIN SMALL LIGATURE IJ (ij), which under NFKC would be recomposed to ‘i’ + ‘j’; there are plenty more examples in the Normalisation document.
Another major issue to do with normalisation is that normalisation is not closed under concatenation, which means
the string formed by NFKC(string1) + NFKC(string2) is not guaranteed to be NFKC normalised, although an
optimised function such as NFKC_concat(string1, string2) can be defined.
Another very important document to read is the Security Considerations document, which very broadly speaking
covers two themes: visual security, as illustrated by paypa<capital-i>.com, and technical security issues. This
document also gives user-agents a number of security related recommendations.
I think I shall eventually put together a reading list for people wanting to correctly get into Unicode — I’m just
getting into it myself and been exploring the various normative documents — but that may be a little time in
coming.
People with an interest in network protocols should also arm themselves with a knowledge of standards such as
stringprep (RFC 3454) which allows us to pick and choose rules (creating what is known as a stringprep 'profile')
to limit how particular strings, such as usernames, may be represented.
In conclusion, the programming world is in for a nasty shake-up; there is a definite requirement for CORRECT
training resources to be available to the programming world at large.
Unicode in HTML emails
February 6, 2008 - 11:17am — Pete (not verified)
Thanks for the great intro. I was still struggling to get some HTML formatted emails I was sending with python to
show unicode characters correctly. In the end, I found that the encode method has a really useful parameter:
'xmlcharrefreplace' that will turn any characters that can't be represented in ascii into &#nnnn;-type html/xml code
example:
str.encode('us-ascii','xmlcharrefreplace')
probably like this not
November 12, 2008 - 4:52pm — Anonymous (not verified)
probably like this not encode? str.decode('us-ascii','xmlcharrefreplace')
Portuguese Translation
January 29, 2008 - 9:19am — Nilo Menezes (not verified)
Hello,
Excellent article. Do you mind if I translate it to Portuguese? I would like to publish it on PythonBrasil wiki with
credits and links, of course.
http://www.pythonbrasil.com.br/moin.cgi/
Best Regards,
Nilo
Translations
February 1, 2008 - 10:07pm — frank
Sure, no problem, just add a link back here (as you mentioned). Thanks.
Brazilian Portuguese version
February 25, 2008 - 3:18am — Nilo Menezes (not verified)
Hello Frank,
Just to let you know the Brazilian Portuguese version is at
http://www.pythonbrasil.com.br/moin.cgi/TudoSobrePythoneUnicode.
Best Regards,
Nilo Menezes
Brazilian Portuguese version
February 28, 2008 - 6:05am — frank
Thanks! Nice job.
Having trouble displaying html unicode
December 10, 2007 - 1:58am — weheh (not verified)
Hi frank, Again ... great article. I'm running into trouble with my python program not displaying what I want to see
in the browser. I have a file with the word 'años' in it. I read the file and display it on the browser using repr() but it comes out in hex as 'a\xf1os'. If I do a straight print it comes out 'aos'. How do I get it to show as 'años'?
HTML
December 11, 2007 - 7:55pm — frank
repr() definitely won't work. You need to write your Unicode strings out as encoded binary (e.g. UTF-8) and
set the coding of the HTML page the same. Search in the article for HTML, I have a section on this.
Good summary.
July 6, 2007 - 1:44pm — Zart (not verified)
Excellent summary. I think you ought to mention that under Windows sys.argv and os.environ aren't Unicode-aware, so you can't pass unicode filenames on the command-line, for example. There is a proposal to fix this which is at the moment rejected.
Well Done!!!
May 13, 2007 - 5:39pm — Bruce Tenison (not verified)
Well done article! Very informative!
"Frank's smart!" ;)
Fantastic
May 10, 2007 - 3:57pm — Warren (not verified)
Excellent article about python and unicode, not to mention extremely useful information about using unicode in other situations, such as XML. The article is very clear and concise, thank you so much for spending the time to document this!
Excellent tutorial -- A++
May 3, 2007 - 7:03pm — weheh (not verified)
After wading through various tutorials and references on unicode, I found yours! I only wish I had found it first as yours is easily the best of the bunch. I had been struggling for hours with an XML printing problem, but after reading your article, I was easily able to get my code to work. Many thanks! One comment -- it seems like something on your page isn't displaying correctly. Where it says, "Click for a larger version" I see nothing, nor is there anything for me to click. Otherwise, I can't thank you enough.
Comments appreciated!
May 3, 2007 - 9:42pm — frank
Thanks, I appreciate the comments. On the "Click for larger version" problem, I recently changed the links to
open in a new window. Could they be opening in a separate tab, or maybe you have a popup blocker? Just
guessing. Let me know if it persists. I made the change because, for some reason, Firefox forgets where it
was in the page when clicking "Back" to here, so I made the links open a new window.
"click for a larger version" image/link not showing
May 5, 2007 - 12:22pm — weheh (not verified)
The issue is not with popups. The issue is that the link isn't showing. I looked at your html source and the
href tag isn't closed properly, from the looks of it. You may wish to inspect the source carefully to make
sure it's legal html. Possibly run it through the W3C html validator to help debug it. Thanks again for the
article.
Image link problem
May 5, 2007 - 8:12pm — frank
Ah, I see it now. I had a bug in my HTML generator. Thanks for letting me know about that.
Thanks thanks and thanks
March 28, 2007 - 7:34am — Seb (not verified)
Man,
Thanks for this one. I need to use unicode and utf8 encoding for a project. I've been browsing the web to understand how all this works and so far I was struggling big time.
Now I almost got it right! (need to review it one more time though)
Thanks again !!!
Cheers
seb
Thanks!
March 15, 2007 - 7:42pm — Einars (not verified)
Thank you very much for writing this up. This has cleared up some of my misinterpretations, and now I'm generally a better person - at least when it comes to dealing with unicode in python (which I had to do just five minutes ago) - it all just 'clicked' now ;)
MacOS X
March 15, 2007 - 1:47pm — Hraban (not verified)
You should mention that MacOS X uses a special kind of decomposed UTF-8 to store filenames. If you need to e.g. read in filenames and write them to a "normal" UTF-8 file, you must normalize them (at least if your editor, or my TeX system, doesn't understand decomposed UTF-8):

filename = unicodedata.normalize('NFC', unicode(filename, 'utf-8')).encode('utf-8')

See also the Python module docs for unicodedata.
MacOS X
March 15, 2007 - 3:41pm — frank
Thanks for pointing that out, I don't have a MacOS X machine so didn't realize that. For others reading this
who aren't familiar with this issue (like I wasn't) here are a few references:
Text Encodings in VFS
unicode filenames
My five-minute understanding of this is that when you pass a name with an accented character like é, it will decompose this into e plus ' before saving it to the filesystem (this behavior is defined by the Unicode standard). I will add something on this above after I play with it and make a few examples.
Good one!
March 14, 2007 - 10:29am — Anonymous (not verified)
Thanks for this really informative article! Unicode is fairly simple! Spot on! Joel Spolsky talked about this very fact
some time back!
Mac OS/X
March 12, 2007 - 10:55pm — Michael (not verified)
I just tested it on OS X -- and os.listdir('.') and os.listdir(u'.') both return objects that can be passed directly to
open(), as you can do under POSIX. Seems to be well-behaved ;-)
Mac OS/X
March 13, 2007 - 10:23pm — frank
Thanks! I'll add that info.