Re: UTF8 in 5.8.1
by Aaron Sherman
Mar 1 2005 6:11AM
On Mon, 2005-02-28 at 11:23 -0800, Gisle Aas wrote:
> Aaron Sherman <ajs@[...].com> writes:
> > errors in a function that's dealing only with strings that are read from
> > a file that was written to a file, and is being read back using the
> > :utf8 encoding layer. I had thought that substr was always safe on such
> > strings, but it's starting to look like that was a vain hope....
>
> The :utf8 layer just slaps on the UTF8 flag trusting the data it reads
> to be well formed utf8.
It was well-formed. I was doing something like this:
script 1:
    open(IN, "<:encoding(windows-1252)", "name-list");
    open(OUT, "> :utf8", "name-list.utf8");
    while (<IN>) { print OUT $_ }
I then verified that name-list.utf8 contained valid UTF-8 sequences
using od -c.
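The same check can also be done from Perl rather than by eye. What follows is only a minimal sketch, not what was actually run: it assumes the Encode module is available, and the helper names is_valid_utf8 and first_bad_line are made up for illustration. It reads the file as raw bytes and asks Encode to decode each line strictly, so any malformed sequence is reported by line number.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # Work on a copy, since decode() may modify its source when CHECK is set.
    my $copy = $bytes;
    # "UTF-8" (with the hyphen) is Encode's strict variant; FB_CROAK makes
    # decode() die on malformed input instead of substituting characters.
    return eval { decode("UTF-8", $copy, FB_CROAK); 1 } ? 1 : 0;
}

# Hypothetical helper: line number of the first malformed line, or 0 if none.
sub first_bad_line {
    my ($path) = @_;
    open(my $in, "<:raw", $path) or die "Cannot open $path: $!";
    while (my $line = <$in>) {
        return $. unless is_valid_utf8($line);
    }
    return 0;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    my $bad = first_bad_line($ARGV[0]);
    print $bad
        ? "malformed UTF-8 at line $bad\n"
        : "all lines are well-formed UTF-8\n";
}
```

Run as "perl check.pl name-list.utf8" (check.pl being whatever the sketch is saved as). Splitting on newlines before validating is safe here because the byte 0x0A never occurs inside a multi-byte UTF-8 sequence.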
script 2:
    my %parts;
    my %bits;
    open(IN, "<:utf8", "name-list.utf8");
    while (<IN>) {
        chomp;
        $parts{substr($_,-3,2)}{substr($_,-1)}++ if length($_) > 2;
    }
    foreach my $part2 (keys %parts) {
        my $part = substr($part2,-1); # UTF-8 error here
        foreach my $part1 (keys %{$parts{$part2}}) {
            $bits{$part.$part1}++;
        }
    }
This is a contrived simplification, and fails to reproduce the problem,
but it gives you an idea of what I was doing. 5.8.3 does not produce the
same warning.
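To narrow down which string triggers the warning, one option is to dump the internal state of each key just before the substr call. This is a rough diagnostic sketch only, and describe_string is a hypothetical name. It relies on the built-in utf8::is_utf8 and utf8::valid functions, which report whether the UTF8 flag is set and whether the internal representation is consistent.

```perl
use strict;
use warnings;

# Hypothetical helper: summarise how Perl currently stores a string.
sub describe_string {
    my ($s) = @_;
    return sprintf("chars=%d flag=%s valid=%s",
        length($s),                        # length in characters
        utf8::is_utf8($s) ? "on" : "off",  # is the UTF8 flag set?
        utf8::valid($s)   ? "yes" : "no"); # is the internal form well-formed?
}
```

Inside the outer loop of script 2, something like warn describe_string($part2), "\n"; before the substr call would show whether a key has "valid=no", which would point at the data rather than at substr itself.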