[UPDATECENTER2-1458] traceback when title with UTF-8 encoding ending with non break space is entered in updatetool
Created: 14/May/09  Updated: 21/May/09  Resolved: 21/May/09
Status: Resolved
Project: updatecenter2
Component/s: dependencies
Affects Version/s: current
Fix Version/s: B29
Type: Bug
Priority: Major
Reporter: Tom Mueller
Assignee: Joe Di Pol
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Operating System: All
Platform: All
Issuezilla Id: 1,458
Tags: i18n
Description
If the following string is entered as a title for an image, problems result:
éíáýí9àèç_èà
First, the following tracebacks are output to the error_log.txt file:
Traceback (most recent call last):
  File "/BUILD_AREA/workspace/updatecenter2-trunk/uc2/build/dist/sunos-i386/updatetool/vendorpackages/updatetool/gui/mainframe.py", line 1253, in OnEditImage
  File "/BUILD_AREA/workspace/updatecenter2-trunk/uc2/build/dist/sunos-i386/updatetool/vendorpackages/updatetool/dialogs/imagecreateeditdialog.py", line 356, in __init__
  File "/BUILD_AREA/workspace/updatecenter2-trunk/uc2/build/dist/sunos-i386/updatetool/vendorpackages/updatetool/common/ips/__init__.py", line 415, in get_publishers
  File "/export/home/trm/pkg-toolkit/pkg/vendor-packages/pkg/client/image.py", line 268, in load_config
    ic.read(self.imgdir)
  File "/export/home/trm/pkg-toolkit/pkg/vendor-packages/pkg/client/imageconfig.py", line 170, in read
    o, raw=True).decode('utf-8')
  File "/python2.4/lib/python2.4/encodings/utf_8.py", line 16, in decode
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 20: unexpected end of data
Traceback (most recent call last):
  File "/python2.4/lib/python2.4/logging/handlers.py", line 73, in emit
  File "/python2.4/lib/python2.4/logging/handlers.py", line 146, in shouldRollover
  File "/python2.4/lib/python2.4/logging/__init__.py", line 617, in format
  File "/python2.4/lib/python2.4/logging/__init__.py", line 405, in format
  File "/python2.4/lib/python2.4/logging/__init__.py", line 276, in getMessage
TypeError: not enough arguments for format string
When any tree item for that image (Addon, Updates, etc) is clicked a dialog with
the following error message is generated:
'utf8' codec can't decode byte 0xc3 in position 20: unexpected end of data
and the content for the panel is not displayed. The Image Properties dialog will
not come up so the title cannot be set back to something else. Also,
the pkg(1) command line cannot process the value either (see issue 1442) so the
title cannot be reset with that either. The only work-around is to manually edit
the cfg_cache file to change the value of the title to a string without
non-ascii characters.
This is being filed against the dependencies subcategory because I suspect that
the root cause of this problem is that something may be missing from the
minimized python. The following test reproduces the stack trace in
Image.load_config when run with the minimized python but not when run with
the full python.
$ pkg/python2.4-minimal/bin/python
Python 2.4.4 (#2, Apr 11 2008, 12:11:12) [C] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> import pkg.client.image as image
>>> i = image.Image()
>>> i.find_root('.')
>>> i.load_config()
>>> import locale
>>> locale.setlocale(locale.LC_ALL, '')
'en_US.UTF-8'
>>> i.load_config()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/export/home/trm/pkg-toolkit/pkg/vendor-packages/pkg/client/image.py", line 268, in load_config
    ic.read(self.imgdir)
  File "/export/home/trm/pkg-toolkit/pkg/vendor-packages/pkg/client/imageconfig.py", line 170, in read
    o, raw=True).decode('utf-8')
  File "/python2.4/lib/python2.4/encodings/utf_8.py", line 16, in decode
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 20: unexpected end of data
With the full python, the load_config call returns successfully.
Comments
Comment by Joe Di Pol [ 14/May/09 ]
I've reproduced this with a simpler program that
has no dependencies on IPS:
-------------- snip --------------------
import locale
l = locale.setlocale(locale.LC_ALL, '')
print "locale=",l
cfg_cache="cfg_cache"
from ConfigParser import *
conf = ConfigParser()
conf.add_section("filter")
conf.add_section("property")
conf.add_section("variant")
conf.add_section("authority_localhost")
print "Reading with ConfigParser....................."
conf.read(cfg_cache)
title = conf.get("property", "title")
print "Hex value of title:"
print title.encode("hex")
print "Converting utf-8 to binary."
print title.decode("utf-8")
print ""
-------------- snip --------------------
When run with our python build (even the full build) you get the exception:
locale= en_US.UTF-8
Reading with ConfigParser.....................
Hex value of title:
c3a9c3adc3a1c3bdc3ad39c3a0c3a8c3a75fc3a8c3
Converting utf-8 to binary.
Traceback (most recent call last):
  File "readit.py", line 24, in ?
    print title.decode("utf-8")
  File "/python2.4/lib/python2.4/encodings/utf_8.py", line 16, in decode
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 20: unexpected end of data
But with the OpenSolaris python all is well:
$ python readit.py
locale= en_US.UTF-8
Reading with ConfigParser.....................
Hex value of title:
c3a9c3adc3a1c3bdc3ad39c3a0c3a8c3a75fc3a8c3a0
Converting utf-8 to binary.
éíáýí9àèç_èà
Note the value of the title data in hex. When our python build is
used the trailing "a0" is missing which is what causes the error
(the c3 should be followed by another byte).
Another thing to note is if you remove the locale.setlocale() call
then using our python works (the trailing "a0" returns).
So the cfg_cache file appears to contain valid UTF-8, but something
is happening when the raw data is read with our python build.
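The two hex dumps above can be checked without ConfigParser or a particular python build. The sketch below is written for Python 3 (an assumption; the runs above use Python 2.4), and the helper name `try_decode` is hypothetical. It decodes both byte strings and reports where decoding fails.

```python
# Hex dumps copied from the two runs above.
TRUNCATED_HEX = "c3a9c3adc3a1c3bdc3ad39c3a0c3a8c3a75fc3a8c3"
COMPLETE_HEX = "c3a9c3adc3a1c3bdc3ad39c3a0c3a8c3a75fc3a8c3a0"


def try_decode(hex_text):
    """Return (decoded text, None) on success or (None, error) on failure."""
    raw = bytes.fromhex(hex_text)
    try:
        return raw.decode("utf-8"), None
    except UnicodeDecodeError as err:
        return None, err


text, err = try_decode(COMPLETE_HEX)
print(repr(text), err)

text, err = try_decode(TRUNCATED_HEX)
# err.start is the offset of the lead byte that has no continuation byte.
print(text, err.start, err.reason)
```

The failing offset reported for the truncated dump should match the position named in the UnicodeDecodeError above.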
Comment by Joe Di Pol [ 14/May/09 ]
Narrowed this down to string.strip() which config parser
runs on the value of an attribute:
-------------- snip --------------------
import locale
l = locale.setlocale(locale.LC_ALL, '')
print "locale=",l
s="\xc3\xa9\xc3\xa0"
print "Original string=%s" % s.encode("hex")
ss=s.strip()
print " After strip=%s" % ss.encode("hex")
-------------- snip --------------------
When run with our python:
locale= en_US.UTF-8
Original string=c3a9c3a0
After strip=c3a9c3
When run with OpenSolaris python:
locale= en_US.UTF-8
Original string=c3a9c3a0
After strip=c3a9c3a0
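The same truncation can be simulated on any interpreter by passing the whitespace set to strip explicitly. This is a sketch for Python 3, where `bytes.strip()` never consults the locale. The constant names are illustrative, and the extended set rests on the assumption that the failing build classifies byte 0xA0 as whitespace, which is what the output above suggests.

```python
ASCII_WHITESPACE = b"\t\n\x0b\x0c\r "
# Assumption: the failing build also classifies byte 0xA0 as whitespace.
EXTENDED_WHITESPACE = ASCII_WHITESPACE + b"\xa0"

s = "\u00e9\u00e0".encode("utf-8")  # b"\xc3\xa9\xc3\xa0"

# strip() works byte by byte, so it cannot tell that 0xA0 is half of a character.
ascii_stripped = s.strip(ASCII_WHITESPACE)
extended_stripped = s.strip(EXTENDED_WHITESPACE)

print("Original string=%s" % s.hex())
print("ASCII strip    =%s" % ascii_stripped.hex())
print("Extended strip =%s" % extended_stripped.hex())
```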
Comment by Joe Di Pol [ 14/May/09 ]
The small test program above fails on Windows but works on Ubuntu when
run with our minimized python.
My bet is this has to do with some build options, or one of the zillion
#define's that "configure" sets for a build. My hunch is on Solaris and
Windows our python is using some builtin ctypes functions, but on Linux
(and the OpenSolaris python) it is using the OS library.
Also, it seems like the python strip logic has a bug on UTF-8 since
it should know it needs to strip the full character – not
just one byte.
Anyway, Re-configuring our python builds this late in the game seems risky.
Another approach may be to check for this condition ourselves. I'm
not sure exactly what "0xc30xa0" is in UTF-8, but some tables do show it
as a space – so maybe we just check for it at the end of a string and
trim it.
Comment by Tom Mueller [ 15/May/09 ]
Excellent sleuthing, Joe.
0xC3 0xA0 in UTF-8 is the Unicode character U+00E0, which is the latin small letter
A with grave, i.e., à, which is the last character in the sample string.
However, the byte 0xA0 taken on its own (as in ISO 8859-1) is the character NO BREAK SPACE.
So what is happening here is that the UTF-8 encoding for this particular string ends in a
byte that is being treated as a space character, and the strip method is stripping that off.
One way to look at this is that the real root cause is that ConfigParser is
calling strip on a raw value. Or, pkg(5) isn't dealing with that by surrounding
the encoded value with delimiters that prevent ConfigParser from changing it.
This problem is only going to show up with strings that have a UTF-8 encoding
that ends with a character that strip removes and which mess up UTF-8 decoding.
All multi-byte encodings for UTF-8 end with a byte with the high bit set.
So none of the ASCII characters are an issue here. It is probable that 0xA0,
the no break space, is the only issue. So that means this is really only an
issue for character strings whose UTF-8 encoding ends with 0xA0. The latin small
letter a with grave is one such character. For letters encoded with 2 bytes,
there are 30 such characters, one for each lead byte from 0xC2 to 0xDF (one of
them being a with grave). For 3 and 4 byte encodings, there would be many more
such characters.
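Which characters are affected can be checked by enumeration rather than by hand. A minimal Python 3 sketch (the helper name `ends_with_nbsp_byte` is hypothetical) that lists the two-byte code points whose encoding ends in 0xA0:

```python
def ends_with_nbsp_byte(ch):
    """True if the UTF-8 encoding of ch is multi-byte and ends in 0xA0."""
    encoded = ch.encode("utf-8")
    return len(encoded) > 1 and encoded[-1] == 0xA0


# Code points U+0080..U+07FF are exactly those encoded with two bytes.
two_byte_hits = [
    chr(cp) for cp in range(0x80, 0x800) if ends_with_nbsp_byte(chr(cp))
]

print(len(two_byte_hits))
print(["U+%04X" % ord(ch) for ch in two_byte_hits[:4]])
```

Widening the range beyond U+07FF (while skipping the surrogate block U+D800..U+DFFF, which cannot be encoded) would cover the 3 and 4 byte encodings in the same way.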
What we can conclude though is that this problem is not as severe as first
assumed. The vast majority of non-ascii strings work. It is only those whose
UTF-8 encoding ends with a non break space that don't work.
Changing the title to reflect this more narrow problem.
Comment by Joe Di Pol [ 15/May/09 ]
To round out the analysis I'll expand on an observation made by Chris.
Using the OS python on OpenSolaris (which works) the value of
string.whitespace is:
0x09 HT
0x0a LF
0x0b VT
0x0c FF
0x0d CR
0x20 SP
Looks good. That's your basic ASCII whitespace. When using our minimized
python it adds these two unicode characters to the list:
0x85 NEL (Next Line)
0xa0 NBSP (No-Break Space)
This explains the behavioral difference. With the OS python it's just
using the ASCII definition of whitespace and all is well since in
UTF-8 those are represented by a single ASCII byte. Our minimized
python adds two additional unicode characters as white space leading
to the problem for any trailing UTF-8 character that ends in either
0xa0 or 0x85.
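One way to audit a build for this is to look for whitespace entries with the high bit set, since those are the ones that can collide with the trailing bytes of a UTF-8 sequence. The sketch below assumes Python 3, where `string.whitespace` is a fixed ASCII constant, so the minimized build's value is reconstructed by hand from the two lists above. `high_bit_whitespace` is a hypothetical helper.

```python
import string


def high_bit_whitespace(whitespace):
    """Return the sorted code values >= 0x80 found in a whitespace set."""
    return sorted(ord(ch) for ch in whitespace if ord(ch) >= 0x80)


# On Python 3 this is the plain ASCII list: HT, LF, VT, FF, CR, SP.
print([hex(ord(ch)) for ch in sorted(string.whitespace)])
print(high_bit_whitespace(string.whitespace))

# Assumed value for the minimized build, taken from the two lists above.
minimized_whitespace = string.whitespace + "\x85\xa0"
print([hex(value) for value in high_bit_whitespace(minimized_whitespace)])
```

An empty result means strip can only ever remove single-byte ASCII characters, which is the safe case for UTF-8 data.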
Should we do anything for this in 2.2? Some options:
o Check for the condition when the user sets the title or description
and add a character (or quotes) to protect the trailing character
o Check for the condition and inform the user that they must use
another title. The problem here is that this message would not get
localized in time for the release.
Even though this condition may not get hit often in the wild, the
fact that it hoses the image means we should make some effort to
prevent "bad" data from getting into cfg_cache.
Comment by Joe Di Pol [ 18/May/09 ]
Due to time constraints this has been fixed by implementing the
following workaround:
If we detect an image title or description that ends with trailing
whitespace that could encounter this problem, we add a guard character
at the end of the string to prevent hitting the exception down the road.
For titles we append ":", for descriptions we append "."
This is a bit silly, but it's better than having the image get into
a state where you can't operate on it. Keep in mind we only append the
guard character in cases that would have triggered the exception.
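A rough sketch of the guard check described above, written for Python 3. This is not the code that was committed: the function name `add_guard`, the constant, and the exact test for "risky" trailing bytes are assumptions based on the analysis in the earlier comments.

```python
# Trailing bytes that the affected builds strip as whitespace
# (assumption drawn from the string.whitespace comparison above).
RISKY_TRAILING_BYTES = (0x85, 0xA0)


def add_guard(value, guard):
    """Append guard if the UTF-8 encoding of value ends in a risky byte."""
    encoded = value.encode("utf-8")
    if encoded and encoded[-1] in RISKY_TRAILING_BYTES:
        return value + guard
    return value


title = add_guard("\u00e9\u00e0", ":")            # encoding ends in 0xA0, so guarded
description = add_guard("plain ASCII text", ".")  # left unchanged
print(repr(title), repr(description))
```

Because the guard is an ASCII character that strip does not remove, the byte in front of it is no longer at the end of the value and survives the round trip through ConfigParser.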
I've opened UPDATECENTER2-1475 to cover the behavior of appending these characters.
Comment by mnsingh [ 21/May/09 ]
Verified on Solaris/x86.
Generated at Fri Apr 28 18:56:58 UTC 2017 using JIRA 6.2.3#6260sha1:63ef1d6dac3f4f4d7db4c1effd405ba38ccdc558.