RE: String handling broken?
by Fuzzier
Jun 27 2007 6:46PM
Thanks a lot!
The problem is solved... but only partially.
Now I know that the possible solutions are:
1. Explicitly convert all string variables to unicode with the unicode() function. But I suspect this solution is a little slow and makes the code harder to write.
2. Implicitly force functions such as os.listdir() to return unicode strings by passing unicode arguments. That seems a little ridiculous, though: for functions that take more than one argument, what happens if I pass one as unicode and another as ANSI?
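Both options, and the mixed-argument question, can be sketched roughly as follows. This is only an illustration under assumptions: it uses Python 3 syntax, where bytes plays the role of the 2.x str type and str that of unicode, and the scratch directory and the file name sample.txt are made up for the demo.

```python
import os
import tempfile

# Hypothetical scratch directory with one file, so the demo is self-contained.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, 'sample.txt'), 'w'):
    pass

# Option 2: the type of the argument decides the type of the results.
text_names = os.listdir(workdir)                # text in, text out
byte_names = os.listdir(os.fsencode(workdir))   # bytes in, bytes out

# Option 1: convert the byte strings explicitly, one by one.
converted = [os.fsdecode(name) for name in byte_names]

# Mixing both kinds in a single call is refused rather than guessed at.
try:
    os.path.join(workdir, b'sample.txt')
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(text_names, byte_names, converted, mixed_ok)
```

In Python 2.x the mixed case behaves differently: the str argument is implicitly decoded as ASCII when it meets a unicode one, which raises UnicodeDecodeError as soon as it contains non-ASCII bytes.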
But why is the problem only partially solved?
Suppose I don't use unicode at all: the script is saved as an ANSI text file, and the encoding is specified explicitly (in this case, 'shift-jis').
I made another script that prints the internal representations of the strings. Let's see what I found.
################ Script ############################################
# -*- encoding: shift_jis -*-
# script5.py (saved as ANSI text file)
import os, re
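def rename():
    # Four katakana characters; in Shift-JIS the third one ends in byte 0x5C.
    pattern = 'パイソン\.txt' # ANSI
    print 'pattern: ', repr(pattern)
    myre = re.compile(pattern)
    # A byte-string argument makes os.listdir() return byte strings.
    for f in os.listdir('.'):
        m = myre.match(f)
        if m is not None: print repr(f), ': match!'
        else: print repr(f), ': doesn\'t match!'
rename()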
################# Output ###########################################
pattern: '\x83p\x83C\x83\\\x83\x93\\.txt'
'\x83p\x83C\x83\\\x83\x93.txt' : doesn't match!
As we can see, there is a '\\' inside the internal representation of both the pattern string and the file name.
I think this is why a match is not possible: this '\\' is perceived as the start of an escape sequence, rather than as what it really is, namely the second byte of an MBCS character.
It's a bug, I think?
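The effect can be reproduced in isolation from the bytes shown in the output. The sketch below is a minimal illustration, not a definitive diagnosis: it assumes Python 3 syntax (bytes standing in for the 2.x str type), and the two workarounds, re.escape() and decoding before matching, are suggestions that do not come from the thread.

```python
import re

# Bytes copied from the repr() output above: four Shift-JIS katakana,
# the third of which ends in the byte 0x5C, i.e. an ASCII backslash.
stem = b'\x83p\x83C\x83\\\x83\x93'
filename = stem + b'.txt'

# Same pattern as in script5.py: the stray 0x5C escapes the byte after it,
# so the regex no longer expects a backslash at that position.
naive = re.compile(stem + b'\\.txt')
naive_hit = naive.match(filename) is not None

# Workaround A: escape the literal part so 0x5C is matched as itself.
escaped = re.compile(re.escape(stem) + b'\\.txt')
escaped_hit = escaped.match(filename) is not None

# Workaround B: decode first and match on characters instead of bytes.
text_stem = stem.decode('shift_jis')
text_re = re.compile(re.escape(text_stem) + r'\.txt')
text_hit = text_re.match(filename.decode('shift_jis')) is not None

print(naive_hit, escaped_hit, text_hit)  # prints: False True True
```

Seen this way, it is the re module that consumes the 0x5C byte while parsing the pattern; the file name returned by os.listdir() is left untouched.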
In the ActiveState Python 2.5 documentation, I find this: 'On systems whose native character set is not ASCII, strings may use EBCDIC in their internal representation, provided the functions chr() and ord() implement a mapping between ASCII and EBCDIC, and string comparison preserves the ASCII order. Or perhaps someone can propose a better rule?'.
So string matching is based on the rather messy internal representation, not on anything like unicode. I mean, strings that look the same externally (on stdin) are in fact perceived very differently internally, yet matching works on their internal representations rather than, as it should, on their external ones.
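A small sketch of that difference between external appearance and internal bytes, again assuming Python 3 syntax (bytes for the 2.x str type, str for unicode). UTF-8 is used here only as a second encoding for comparison; it does not appear in the thread.

```python
# The same visible text in two encodings (code points written as escapes
# so the snippet does not depend on how this file itself is saved).
text = u'\u30d1\u30a4\u30bd\u30f3'   # the katakana used in script5.py
as_sjis = text.encode('shift_jis')
as_utf8 = text.encode('utf-8')

# Byte strings compare byte by byte, so the two spellings differ...
same_bytes = (as_sjis == as_utf8)

# ...but once decoded with the right codec they are the same characters.
same_text = (as_sjis.decode('shift_jis') == as_utf8.decode('utf-8'))

print(repr(as_sjis), repr(as_utf8), same_bytes, same_text)
```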
BTW, Perl on the other hand handles strings quite well. Also, the Python interpreter doesn't recognize unicode text files saved by Notepad. I hope such a feature can be added in the future to prevent confusion and to boost performance, since we wouldn't have to convert anything to unicode (at least on Windows NT systems ;)).
Thanks again!