[UPDATECENTER2-1458] traceback when title with UTF-8 encoding ending with non break space is entered in updatetool

Created: 14/May/09  Updated: 21/May/09  Resolved: 21/May/09

Status: Resolved
Project: updatecenter2
Component/s: dependencies
Affects Version/s: current
Fix Version/s:

Type: Bug
Priority: Major
Reporter: Tom Mueller
Assignee: Joe Di Pol
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Operating System: All Platform: All
Issuezilla Id: 1,458
Tags: i18n B29

Description

If the following string is entered as a title for an image, problems result:

éíáýí9àèç_èà

First, the following tracebacks are output to the error_log.txt file:

Traceback (most recent call last):
  File "/BUILD_AREA/workspace/updatecenter2-trunk/uc2/build/dist/sunos-i386/updatetool/vendorpackages/updatetool/gui/mainframe.py", line 1253, in OnEditImage
  File "/BUILD_AREA/workspace/updatecenter2-trunk/uc2/build/dist/sunos-i386/updatetool/vendorpackages/updatetool/dialogs/imagecreateeditdialog.py", line 356, in __init__
  File "/BUILD_AREA/workspace/updatecenter2-trunk/uc2/build/dist/sunos-i386/updatetool/vendorpackages/updatetool/common/ips/__init__.py", line 415, in get_publishers
  File "/export/home/trm/pkg-toolkit/pkg/vendor-packages/pkg/client/image.py", line 268, in load_config
    ic.read(self.imgdir)
  File "/export/home/trm/pkg-toolkit/pkg/vendor-packages/pkg/client/imageconfig.py", line 170, in read
    o, raw=True).decode('utf-8')
  File "/python2.4/lib/python2.4/encodings/utf_8.py", line 16, in decode
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 20: unexpected end of data

Traceback (most recent call last):
  File "/python2.4/lib/python2.4/logging/handlers.py", line 73, in emit
  File "/python2.4/lib/python2.4/logging/handlers.py", line 146, in shouldRollover
  File "/python2.4/lib/python2.4/logging/__init__.py", line 617, in format
  File "/python2.4/lib/python2.4/logging/__init__.py", line 405, in format
  File "/python2.4/lib/python2.4/logging/__init__.py", line 276, in getMessage
TypeError: not enough arguments for format string

When any tree item for that image (Addon, Updates, etc.) is clicked, a dialog with the following error message is generated:

'utf8' codec can't decode byte 0xc3 in position 20: unexpected end of data

and the content for the panel is not displayed.

The Image Properties dialog will not come up, so the title cannot be set back to something else. The pkg(1) command line cannot process the value either (see issue 1442), so the title cannot be reset that way. The only workaround is to manually edit the cfg_cache file and change the title to a string without non-ASCII characters.

This is being filed against the dependencies subcategory because I suspect that the root cause is something missing from the minimized python. When I run the following test, which reproduces the stack trace in Image.load_config, the failure occurs with the minimized python but not with the full python.

$ pkg/python2.4-minimal/bin/python
Python 2.4.4 (#2, Apr 11 2008, 12:11:12) [C] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> import pkg.client.image as image
>>> i = image.Image()
>>> i.find_root('.')
>>> i.load_config()
>>> import locale
>>> locale.setlocale(locale.LC_ALL, '')
'en_US.UTF-8'
>>> i.load_config()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/export/home/trm/pkg-toolkit/pkg/vendor-packages/pkg/client/image.py", line 268, in load_config
    ic.read(self.imgdir)
  File "/export/home/trm/pkg-toolkit/pkg/vendor-packages/pkg/client/imageconfig.py", line 170, in read
    o, raw=True).decode('utf-8')
  File "/python2.4/lib/python2.4/encodings/utf_8.py", line 16, in decode
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 20: unexpected end of data

With the full python, the load_config call returns successfully.
Comments

Comment by Joe Di Pol [ 14/May/09 ]

I've reproduced this with a simpler program that has no dependencies on IPS:

-------------- snip --------------------
import locale
l = locale.setlocale(locale.LC_ALL, '')
print "locale=",l
cfg_cache="cfg_cache"
from ConfigParser import *
conf = ConfigParser()
conf.add_section("filter")
conf.add_section("property")
conf.add_section("variant")
conf.add_section("authority_localhost")
print "Reading with ConfigParser....................."
conf.read(cfg_cache)
title = conf.get("property", "title")
print "Hex value of title:"
print title.encode("hex")
print "Converting utf-8 to binary."
print title.decode("utf-8")
print ""
-------------- snip --------------------

When run with our python build (even the full build) you get the exception:

locale= en_US.UTF-8
Reading with ConfigParser.....................
Hex value of title:
c3a9c3adc3a1c3bdc3ad39c3a0c3a8c3a75fc3a8c3
Converting utf-8 to binary.
Traceback (most recent call last):
  File "readit.py", line 24, in ?
    print title.decode("utf-8")
  File "/python2.4/lib/python2.4/encodings/utf_8.py", line 16, in decode
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 20: unexpected end of data

But with the OpenSolaris python all is well:

$ python readit.py
locale= en_US.UTF-8
Reading with ConfigParser.....................
Hex value of title:
c3a9c3adc3a1c3bdc3ad39c3a0c3a8c3a75fc3a8c3a0
Converting utf-8 to binary.
éíáýí9àèç_èà

Note the value of the title data in hex. When our python build is used, the trailing "a0" is missing, which is what causes the error (the c3 should be followed by another byte).

Another thing to note: if you remove the locale.setlocale() call, then our python works (the trailing "a0" returns).

So the cfg_cache file appears to contain valid UTF-8, but something is happening when the raw data is read with our python build.
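The difference between the two hex dumps above can be checked in isolation. The following is a minimal sketch, written for Python 3 rather than the Python 2.4 builds discussed in this issue, and it uses only the hex values quoted above. The variable names are illustrative and do not come from updatetool or pkg(5).

```python
# Hex dumps quoted above: the value returned by the OpenSolaris python
# (complete) and the value returned by the failing python build.
full_hex = "c3a9c3adc3a1c3bdc3ad39c3a0c3a8c3a75fc3a8c3a0"
truncated_hex = "c3a9c3adc3a1c3bdc3ad39c3a0c3a8c3a75fc3a8c3"

full = bytes.fromhex(full_hex)
truncated = bytes.fromhex(truncated_hex)

# The two values differ only by the final 0xa0 byte.
assert full[:-1] == truncated and full[-1] == 0xA0

# The complete byte string is valid UTF-8.
title = full.decode("utf-8")

# The truncated one ends with a lone lead byte (0xc3), so decoding fails.
try:
    truncated.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(len(title), "characters decoded from the full value")
print("truncated value fails at byte", error.start, "-", error.reason)
```

The reported position (byte 20) matches the position in the tracebacks above: it is the index of the final 0xc3, which is left without its continuation byte.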
Comment by Joe Di Pol [ 14/May/09 ]

Narrowed this down to string.strip(), which ConfigParser runs on the value of an attribute:

-------------- snip --------------------
import locale
l = locale.setlocale(locale.LC_ALL, '')
print "locale=",l
s="\xc3\xa9\xc3\xa0"
print "Original string=%s" % s.encode("hex")
ss=s.strip()
print " After strip=%s" % ss.encode("hex")
-------------- snip --------------------

When run with our python:

locale= en_US.UTF-8
Original string=c3a9c3a0
 After strip=c3a9c3

When run with OpenSolaris python:

locale= en_US.UTF-8
Original string=c3a9c3a0
 After strip=c3a9c3a0

Comment by Joe Di Pol [ 14/May/09 ]

The small test program above fails on Windows but works on Ubuntu when run with our minimized python.

My bet is this has to do with some build options, or one of the zillion #define's that "configure" sets for a build. My hunch is that on Solaris and Windows our python is using some builtin ctypes functions, but on Linux (and with the OpenSolaris python) it is using the OS library. Also, it seems like the python strip logic has a bug on UTF-8, since it should know it needs to strip the full character, not just one byte.

Anyway, reconfiguring our python builds this late in the game seems risky. Another approach may be to check for this condition ourselves. I'm not sure exactly what "0xc3 0xa0" is in UTF-8, but some tables do show it as a space, so maybe we just check for it at the end of a string and trim it.

Comment by Tom Mueller [ 15/May/09 ]

Excellent sleuthing, Joe.

0xC3 0xA0 in UTF-8 is the Unicode character U+00E0, LATIN SMALL LETTER A WITH GRAVE, i.e., à, which is the last character in the sample string. However, the single byte 0xA0 is the character NO-BREAK SPACE. So what is happening here is that the UTF-8 encoding of this particular string ends in a byte that strip treats as a space character, and the strip method removes it.

One way to look at this is that the real root cause is that ConfigParser is calling strip on a raw value.
Or, pkg(5) isn't dealing with that by surrounding the encoded value with delimiters that prevent ConfigParser from changing it.

This problem is only going to show up with strings whose UTF-8 encoding ends with a byte that strip removes, which then breaks UTF-8 decoding. All multi-byte encodings in UTF-8 end with a byte that has the high bit set, so none of the ASCII characters are an issue here. It is probable that 0xA0, the no-break space, is the only issue. That means this is really only an issue for character strings whose UTF-8 encoding ends with 0xA0. The latin small letter a with grave is one such character. For letters encoded with 2 bytes, there are 8 such characters (one of them being a with grave). For 3 and 4 byte encodings, there would be many more such characters.

What we can conclude, though, is that this problem is not as severe as first assumed. The vast majority of non-ASCII strings work. It is only those whose UTF-8 encoding ends with a no-break space byte that don't work.

Changing the title to reflect this narrower problem.

Comment by Joe Di Pol [ 15/May/09 ]

To round out the analysis I'll expand on an observation made by Chris. Using the OS python on OpenSolaris (which works), the value of string.whitespace is:

0x09 HT
0x0a LF
0x0b VT
0x0c FF
0x0d CR
0x20 SP

Looks good. That's your basic ASCII whitespace. When using our minimized python, it adds these two unicode characters to the list:

0x85 NEL (Next Line)
0xa0 NBSP (No-Break Space)

This explains the behavioral difference. The OS python uses just the ASCII definition of whitespace, and all is well, since in UTF-8 each of those characters is represented by a single ASCII byte. Our minimized python treats two additional unicode characters as whitespace, leading to the problem for any trailing UTF-8 character whose encoding ends in either 0xa0 or 0x85.

Should we do anything for this in 2.2?
Some options:

o Check for the condition when the user sets the title or description and add a character (or quotes) to protect the trailing character.
o Check for the condition and inform the user that they must use another title. The problem here is that this message would not get localized in time for the release.

Even though this condition may not get hit often in the wild, the fact that it hoses the image means we should make some effort to prevent "bad" data from getting into cfg_cache.

Comment by Joe Di Pol [ 18/May/09 ]

Due to time constraints this has been fixed by implementing the following workaround:

If we detect an image title or description that ends with trailing whitespace that could encounter this problem, we add a guard character at the end of the string to prevent hitting the exception down the road. For titles we append ":", for descriptions we append ".".

This is a bit silly, but it's better than having the image get into a state where you can't operate on it. Keep in mind we only append the guard character in cases that would have triggered the exception.

I've opened UPDATECENTER2-1475 to cover the behavior of appending these characters.

Comment by mnsingh [ 21/May/09 ]

Verified on Solaris/x86.

Generated at Fri Apr 28 18:56:58 UTC 2017 using JIRA 6.2.3#6260sha1:63ef1d6dac3f4f4d7db4c1effd405ba38ccdc558.
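The guard-character workaround described in the 18/May/09 comment might look roughly like the sketch below. This is not the code that was committed for this issue: it is written for Python 3, and the names RISKY_TRAILING_BYTES, AFFECTED_WHITESPACE, needs_guard, protect and affected_strip are hypothetical. The affected build's strip behavior is emulated with an explicit byte set, on the assumption that it removed the six ASCII whitespace bytes plus 0x85 and 0xa0 as listed in the string.whitespace comparison.

```python
# Assumption: the two extra bytes the affected build treated as whitespace.
RISKY_TRAILING_BYTES = (0x85, 0xA0)

# Assumption: emulation of string.whitespace on the affected build
# (ASCII whitespace plus NEL and NBSP).
AFFECTED_WHITESPACE = b"\t\n\x0b\x0c\r \x85\xa0"


def needs_guard(text):
    """Return True if the UTF-8 encoding of text ends in a risky byte."""
    encoded = text.encode("utf-8")
    return bool(encoded) and encoded[-1] in RISKY_TRAILING_BYTES


def protect(text, guard):
    """Append guard only when the value would otherwise be corrupted."""
    return text + guard if needs_guard(text) else text


def affected_strip(raw):
    """Emulate strip() on the affected build for a raw byte string."""
    return raw.strip(AFFECTED_WHITESPACE)


# ":" for titles and "." for descriptions, as described in the comment.
safe_title = protect("\u00e9\u00e0", ":")      # encoding ends in c3 a0
plain_title = protect("My image", ":")         # ASCII only, left unchanged
safe_description = protect("caf\u00e9", ".")   # ends in c3 a9, left unchanged

# Without the guard the value loses its last byte; with it, it survives.
unguarded = affected_strip("\u00e9\u00e0".encode("utf-8"))
guarded = affected_strip(safe_title.encode("utf-8"))
print(unguarded.hex(), guarded.decode("utf-8"))
```

Only the trailing end needs protection in this sketch, because neither 0x85 nor 0xa0 can be the first byte of a UTF-8 encoded character.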