perl5-porters
Re: [perl #57040] pos() function doesn't handle unicode well
by Moritz Lenz
Jul 17 2008 2:20PM
Marcela Maslanova wrote:
>  # New Ticket Created by  Marcela Maslanova 
>  # Please include the string:  [perl #57040]
>  # in the subject line of all future correspondence about this issue. 
>  # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=57040 >
>  
>  
>  generated with the help of perlbug 1.36 running under perl 5.10.0.
>  
>  
>  -----------------------------------------------------------------
>  [Please enter your report here]
>  
>  Function pos() doesn't return correct values for unicode strings.
>  For example:
>  perl -e '$string = "Ä?ščÅ?žýáíéÅ?";while ($string =~ /š/gi) {printf "Found 
>  š at %d\n", pos($string)-1;}';

I don't see the bug here. pos() returns byte offsets if you use the
string with byte semantics (for example UTF-8 that has not been
upgraded), and codepoint offsets with text semantics (here, under 'use
utf8';). In both cases substr() works with the same semantics, so
it'll do the right thing.
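
To illustrate the two semantics, here is a minimal sketch. The code points
U+011B and U+0161 and the helper name end_of_first_match are assumptions
chosen for illustration; they are not taken from the report. The same text is
matched once as raw UTF-8 bytes and once as a decoded character string, and in
both cases substr() with the pos()-derived offset recovers the match.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: return pos() after the first match of $needle in $str.
sub end_of_first_match {
    my ($str, $needle) = @_;
    return $str =~ /\Q$needle\E/g ? pos($str) : undef;
}

# Assumed sample text: the UTF-8 encoding of U+011B U+0161 (two characters).
my $bytes = "\xC4\x9B\xC5\xA1";
my $chars = decode('UTF-8', $bytes);

my $needle_chars = "\x{161}";
my $needle_bytes = encode('UTF-8', $needle_chars);    # "\xC5\xA1"

my $pos_bytes = end_of_first_match($bytes, $needle_bytes);
my $pos_chars = end_of_first_match($chars, $needle_chars);

printf "bytes: length=%d pos=%d\n", length($bytes), $pos_bytes;   # length=4 pos=4
printf "chars: length=%d pos=%d\n", length($chars), $pos_chars;   # length=2 pos=2

# substr() uses the same units as pos(), so the match is recovered either way.
my $found_bytes = substr($bytes, $pos_bytes - length($needle_bytes), length($needle_bytes));
my $found_chars = substr($chars, $pos_chars - length($needle_chars), length($needle_chars));
print "substr agrees with pos() in both cases\n"
    if $found_bytes eq $needle_bytes && $found_chars eq $needle_chars;
```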

I don't see how that principle is violated in your example above.
So pos() and length() agree that "Ä?š" is four bytes long.
$ perl -wle 'print length "Ä?š"'
4

Or am I missing a subtle off-by-one error?
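
One way to check for an off-by-one is to compare pos()-1, which the report
prints, with $-[0], the documented offset of the start of the last match. The
sketch below again assumes U+011B U+0161 as sample text, and the helper name
start_and_pos_minus_one is hypothetical. The two values coincide only when the
match is exactly one unit long in the string's own semantics.

```perl
use strict;
use warnings;

# Hypothetical helper: for the first match of $needle in $str, return the
# match start ($-[0]) and the value the report prints (pos() - 1).
sub start_and_pos_minus_one {
    my ($str, $needle) = @_;
    return unless $str =~ /\Q$needle\E/g;
    return ($-[0], pos($str) - 1);
}

# Byte semantics: the needle is two bytes long, so pos()-1 is its last byte.
my @as_bytes = start_and_pos_minus_one("\xC4\x9B\xC5\xA1", "\xC5\xA1");

# Character semantics: the needle is one character long.
my @as_chars = start_and_pos_minus_one("\x{11B}\x{161}", "\x{161}");

print "bytes: start=$as_bytes[0] pos-1=$as_bytes[1]\n";   # start=2 pos-1=3
print "chars: start=$as_chars[0] pos-1=$as_chars[1]\n";   # start=1 pos-1=1
```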

>  In this case it could be solved 'use utf8'. But the problem is still in 
>  other functions, which are
>  using pos(). For example expand from Text::Tabs:
>  perl -e'chop($ustr="\taa\t..\t\x{100}");for my 
>  $s("\t\x{010a}\x{010a}\t..\t","\taa\t..\t",$ustr){ 
>  $_=$s;s/\t/print(pos(),$");"\t"/ge; print "\n"}'
>  Here should be all numbers the same.

As a non-golfed version:

for my $s ( "\t\x{010a}\x{010a}\t..\t", "\taa\t..\t" ) {
    $_ = $s;
    s/\t/print(pos(),$");"\t"/ge;
    print "\n"
}

Output:
0 2 4
0 3 6

This looks a bit weird indeed. At least to me ;-)
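
For comparison, here is a sketch that collects the tab offsets with a plain
m//g loop and $-[0] instead of calling pos() inside the replacement part of
s///ge. The helper name tab_offsets and the extra utf8::upgrade()d copy of the
ASCII string are my additions, not part of the report. With character
semantics all three strings have tabs at the same offsets.

```perl
use strict;
use warnings;

# Hypothetical helper: return the character offsets of every tab in $str.
sub tab_offsets {
    my ($str) = @_;
    my @offsets;
    while ($str =~ /\t/g) {
        push @offsets, $-[0];    # start of the match just found
    }
    return @offsets;
}

my $wide     = "\t\x{010a}\x{010a}\t..\t";   # contains characters above 0xFF
my $ascii    = "\taa\t..\t";                 # plain ASCII
my $upgraded = $ascii;
utf8::upgrade($upgraded);                    # same text, UTF-8 internally

for my $s ($wide, $ascii, $upgraded) {
    print join(' ', tab_offsets($s)), "\n";  # prints "0 3 6" each time
}
```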

>  [Please do not change anything below this line]
>  -----------------------------------------------------------------
>  ---
>  Flags:
>      category=core
>      severity=medium
>  ---
>  This perlbug was built using Perl 5.10.0 in the Fedora build system.
>  It is being executed now by Perl 5.10.0 - Wed Jul  2 05:13:09 EDT 2008.
>  
>  Site configuration information for perl 5.10.0:
>  
>  Configured by Red Hat, Inc. at Wed Jul  2 05:13:09 EDT 2008.
>  
>  Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
>    Platform:
>      osname=linux, osvers=2.6.18-92.1.6.el5, archname=i386-linux-thread-multi
>      uname='linux x86-6 2.6.18-92.1.6.el5 #1 smp fri jun 20 02:36:06 edt 
>  2008 i686 i686 i386 gnulinux '
>      config_args='-des -Doptimize=-O2 -g -pipe -Wall 
>  -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector 
>  --param=ssp-buffer-size=4 -m32 -march=i386 -mtune=generic 
>  -fasynchronous-unwind-tables -DPERL_USE_SAFE_PUTENV -Dversion=5.10.0 
>  -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red 
>  Hat, Inc. -Dinstallprefix=/usr -Dprefix=/usr 
>  -Dprivlib=/usr/lib/perl5/5.10.0 
>  -Dsitelib=/usr/local/lib/perl5/site_perl/5.10.0 
>  -Dvendorlib=/usr/lib/perl5/vendor_perl/5.10.0 
>  -Darchlib=/usr/lib/perl5/5.10.0/i386-linux-thread-multi 
>  -Dsitearch=/usr/local/lib/perl5/site_perl/5.10.0/i386-linux-thread-multi 
>  -Dvendorarch=/usr/lib/perl5/vendor_perl/5.10.0/i386-linux-thread-multi 
>  -Darchname=i386-linux-thread-multi 
>  -Dotherlibdirs=/usr/lib/perl5/site_perl/5.10.0 -Dvendorprefix=/usr 
>  -Dsiteprefix=/usr/local -Duseshrplib -Dusethreads -Duseithreads 
>  -Duselargefiles -Dd_dosuid -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm 
>  -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl=n 
>  -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr 
>  -Dd_gethostent_r_proto -Ud_endhostent_r_proto -Ud_sethostent_r_proto 
>  -Ud_endprotoent_r_proto -Ud_setprotoent_r_proto -Ud_endservent_r_proto 
>  -Ud_setservent_r_proto -Dscriptdir=/usr/bin'
>      hint=recommended, useposix=true, d_sigaction=define
>      useithreads=define, usemultiplicity=define
>      useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
>      use64bitint=undef, use64bitall=undef, uselongdouble=undef
>      usemymalloc=n, bincompat5005=undef
>    Compiler:
>      cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBUGGING 
>  -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE 
>  -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
>      optimize='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions 
>  -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i386 
>  -mtune=generic -fasynchronous-unwind-tables -DPERL_USE_SAFE_PUTENV',
>      cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBUGGING 
>  -fno-strict-aliasing -pipe -I/usr/local/include -I/usr/include/gdbm'
>      ccversion='', gccversion='4.3.0 20080428 (Red Hat 4.3.0-8)', 
>  gccosandvers=''
>      intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
>      d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
>      ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', 
>  lseeksize=8
>      alignbytes=4, prototype=define
>    Linker and Libraries:
>      ld='gcc', ldflags =' -L/usr/local/lib'
>      libpth=/usr/local/lib /lib /usr/lib
>      libs=-lresolv -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc
>      perllibs=-lresolv -lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
>      libc=/lib/libc-2.8.so, so=so, useshrplib=true, libperl=libperl.so
>      gnulibc_version='2.8'
>    Dynamic Linking:
>      dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E 
>  -Wl,-rpath,/usr/lib/perl5/5.10.0/i386-linux-thread-multi/CORE'
>      cccdlflags='-fPIC', lddlflags='-shared -O2 -g -pipe -Wall 
>  -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector 
>  --param=ssp-buffer-size=4 -m32 -march=i386 -mtune=generic 
>  -fasynchronous-unwind-tables -DPERL_USE_SAFE_PUTENV -L/usr/local/lib'
>  
>  Locally applied patches:
>     
>  
>  ---
>  @INC for perl 5.10.0:
>      /usr/lib/perl5/5.10.0/i386-linux-thread-multi
>      /usr/lib/perl5/5.10.0
>      /usr/local/lib/perl5/site_perl/5.10.0/i386-linux-thread-multi
>      /usr/local/lib/perl5/site_perl/5.10.0
>      /usr/lib/perl5/vendor_perl/5.10.0/i386-linux-thread-multi
>      /usr/lib/perl5/vendor_perl/5.10.0
>      /usr/lib/perl5/vendor_perl
>      /usr/lib/perl5/site_perl/5.10.0/i386-linux-thread-multi
>      /usr/lib/perl5/site_perl/5.10.0
>      .
>  
>  ---
>  Environment for perl 5.10.0:
>      HOME=/home/marca
>      LANG=en_US.UTF-8
>      LANGUAGE=
>      LD_LIBRARY_PATH (unset)
>      LOGDIR (unset)
>      
>  PATH=/usr/lib/qt-3.3/bin:/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/home/marca/bin
>      PERL_BADLANG (unset)
>      SHELL=/bin/bash
>  