RE: String handling broken?
by Fuzzier
Jun 27 2007 6:46PM
Thanks a lot!
The problem is solved... but only partially.

Now I know that the possible solutions are:
1. Explicitly convert all string variables to unicode with the unicode() function. But I suspect this solution is a little slow and makes the code harder to write.
2. Implicitly force functions such as os.listdir() to return unicode strings by passing unicode arguments. That seems a little ridiculous, though: for functions that take more than one argument, what happens if I pass one as unicode and another as ANSI?
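
For option 2, here is a rough sketch of what mixing argument types does. It is written for Python 3 rather than the Python 2.5 under discussion, so bytes stands in for the ANSI str type and str for unicode; the scratch directory and file name are made up for the illustration. As far as I know, Python 2 would instead try to decode the byte string implicitly as ASCII, which fails for non-ASCII bytes.

```python
import os
import tempfile

# Hypothetical scratch directory holding a single file.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "sample.txt"), "w").close()

    text_names = os.listdir(tmp)               # text argument -> text names
    byte_names = os.listdir(os.fsencode(tmp))  # bytes argument -> bytes names

# Mixing the two kinds of argument is refused rather than guessed at.
try:
    os.path.join("some_dir", b"some_file")
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(text_names, byte_names, mixed_ok)
```

So the type of the argument decides the type of the result, and at least in Python 3 a mixed call fails loudly instead of silently picking one.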

But why is the problem only partially solved?
Suppose I don't use unicode at all, which means the script is saved as an ANSI text file and the encoding is specified explicitly (in this case, 'shift-jis').
I wrote another script that prints the internal representations of the strings; here is what I found.
################ Script ############################################
# -*- encoding: shift_jis -*-
# script5.py (saved as ANSI text file)

import os, re

def rename():
	pattern = 'パイソン\.txt'	# ANSI
	print 'pattern: ', repr(pattern)

	myre = re.compile(pattern)
	for f in os.listdir('.'):
		m = myre.match(f)
		if m != None: print repr(f), ': match!'
		else: print repr(f), ': doesn\'t match!'

rename()
################# Output ###########################################
pattern:  '\x83p\x83C\x83\\\x83\x93\\.txt'
'\x83p\x83C\x83\\\x83\x93.txt' : doesn't match!

As we can see, there is a '\\' inside the internal representation of both the pattern string and the file name.
I think this is why a match is not possible: the interpreter perceives this '\\' as the start of an escape sequence, rather than as what it really is, the second byte of an MBCS character.
It's a bug, I think?
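
To check that reading, here is a small sketch that replays the script's bytes. It is written for Python 3 rather than the Python 2.5 used above, so a bytes object stands in for the ANSI string, and the variable names are my own. It suggests that the stray 0x5C byte is consumed by the re module as an escape character, and that either escaping the literal part with re.escape() or matching on decoded text avoids the problem.

```python
import re

# The file name from the output above, as Shift-JIS bytes.
# Its third character encodes to 0x83 0x5C, and 0x5C is the ASCII backslash.
stem = "パイソン"
raw_name = (stem + ".txt").encode("shift_jis")

# Same construction as script5.py: raw bytes used directly as the pattern.
naive = re.compile(stem.encode("shift_jis") + rb"\.txt")

# Escape the literal part first, so the 0x5C byte is matched literally.
escaped = re.compile(re.escape(stem.encode("shift_jis")) + rb"\.txt")

# Or decode once and match on text, where each character is a single unit.
text = re.compile(re.escape(stem) + r"\.txt")

naive_hit = naive.match(raw_name) is not None
escaped_hit = escaped.match(raw_name) is not None
text_hit = text.match(raw_name.decode("shift_jis")) is not None
print(naive_hit, escaped_hit, text_hit)  # False True True
```

If that is right, the byte-string behaviour is less a bug than a consequence of the regex engine working byte by byte, with no knowledge of the encoding.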

In the ActiveState Python 2.5 documentation, I found this: 'On systems whose native character set is not ASCII, strings may use EBCDIC in their internal representation, provided the functions chr() and ord() implement a mapping between ASCII and EBCDIC, and string comparison preserves the ASCII order. Or perhaps someone can propose a better rule?'.
So string matching is based on the rather messy internal representation, not on something like Unicode. I mean that strings that look the same externally (on stdin) are in fact perceived very differently internally, yet string matching is based on their internal representations rather than, as it should be, on their external representations.
BTW, Perl on the other hand handles strings quite well. Also, the Python interpreter doesn't recognize Unicode text files saved by Notepad; I hope such a feature will be added in the future to prevent confusion and to boost performance, since we wouldn't have to convert anything to unicode (at least on Windows NT systems ;)).
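
On the Notepad point: as far as I know, Notepad's "Unicode" option writes UTF-16 little-endian with a byte order mark, and its "UTF-8" option writes a UTF-8 byte order mark. Below is a rough sketch of telling these apart before decoding. It is Python 3; the helper name and the shift_jis fallback for ANSI files are my own choices, and UTF-32 is deliberately not handled.

```python
import codecs

def decode_notepad_text(data, ansi_encoding="shift_jis"):
    """Decode bytes read from a text file, using its BOM when there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        # The "utf-16" codec reads the BOM itself to pick the byte order.
        return data.decode("utf-16")
    # No BOM found: assume the caller's ANSI code page (an assumption).
    return data.decode(ansi_encoding)

sample = "パイソン.txt"
as_unicode = codecs.BOM_UTF16_LE + sample.encode("utf-16-le")
as_utf8 = codecs.BOM_UTF8 + sample.encode("utf-8")
as_ansi = sample.encode("shift_jis")
results = [decode_notepad_text(b) == sample for b in (as_unicode, as_utf8, as_ansi)]
print(results)
```

This only covers reading data files; it does not change which encodings the interpreter accepts for script source.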

Thanks again!

_______________________________________________
ActivePython mailing list
ActivePython@[...].com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
Other options: http://listserv.ActiveState.com/mailman/listinfo/ActivePython