activeperl
charsets generally and with lwp
by Hildegard Schedthelm
Oct 21 2009 10:25AM
Hello folks

I'm having some trouble with a Perl script, which you can see below.
The problem is that some German special characters (umlauts) are not displayed as
they should be. This seems to be an encoding issue: either the internal
Perl variables have the wrong encoding, or the LWP module does when grabbing the HTML.
In addition, the output that writes the data into the MS Access DB may also have the
wrong encoding. How can I narrow this down? What can I do to ensure the
right encoding at all levels?

Thanks a lot in advance!

Johannes

Here comes the script:

#!C:\Program Files\Perl\bin\perl.exe -w

use strict;
use LWP::UserAgent;
use Win32::ODBC;


    my $db = new Win32::ODBC('PerlRes') ;    
        
    my($inhalt, $detail, @compInfo, $datum, $headline, $company, $message, $content, $ua, $request,
        $response, $ua2, $request2, $response2);

    for(my $i = 1; $i < 2; $i++) {
    
        $ua = LWP::UserAgent->new();
        $request = HTTP::Request->new('GET', "http://www.dgap.de/dgap/static/News/?newsType=ADHOC&page=" . $i . "&limit=20");
        $request->header('Content-Type' => 'text/html; charset=iso-8859-1');
        $response = $ua->request($request);
        $inhalt = $response->content;
        
        while($inhalt =~ /alt="DGAP-Ad-hoc" \/> \s+?<\/td>\s+?<td class="content_text">\s+?<a href="(.+)"> \s+?<strong>/g) {
            
            $ua2 = LWP::UserAgent->new();
            $request2 = HTTP::Request->new('GET', $1);
            $request2->header('Content-Type' => 'text/html; charset=iso-8859-1');
            $response2 = $ua2->request($request2);
            $detail = $response2->content;
    
            if($detail =~ /news_content "> \s+?<h2 class="darkblue">\s+?(.+)\s+?<\/h2>/) {
                    $datum = $1;
                    $datum =~ s/^\s+//;    # strip leading whitespace
                    $datum =~ s/\s+$//;    # strip trailing whitespace
            }
            
            if($detail =~ /<h2 class="darkblue"> \s+?.+?\s+?<\/h2>\s+?<div>\s+?<h1>(.+)<\/h1>/) {
                    $headline = $1;
                    $headline =~ s/;/|/g;
                    $headline =~ s/\n//g;
            }
            
            if($detail =~ /<div class="newsDetail_body_pre"> <pre>\s+?<b>(.+)<\/b>/) {
                @compInfo = split("/",$1);
                $company = $compInfo[0];
                $company =~ s/\n//g;
                $message = $compInfo[1];
                $message =~ s/\s//g;
                
            }
            
            if($detail =~ /<pre> (.+)<\/pre>/s) {
                $content = $1;
                $content =~ s/;/|/g;
                $content =~ s/<\/?.+?> //g;
            }
            
                        
            $db->Sql("INSERT into results VALUES('" . $datum  . "','" . $headline . "','" . $company . "','" . $message . "','" . $content . "')");

        }

    }

    # Close the database connection once, after the loop has finished.
    $db->Close();
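
One way to remove the uncertainty is to make the encoding explicit at each boundary: decode the bytes that come in, work with Perl character strings internally, and encode again just before writing out. The following is only a minimal sketch of that idea using the core Encode module, not a tested fix for the script above. The helper names bytes_to_chars and chars_to_bytes are hypothetical, the sample string stands in for $response->content, and the choice of cp1252 for the ODBC side is an assumption that would need checking against the actual driver. Note that a Content-Type header set on a GET request describes the request, not the response, so it does not change how the page is decoded; LWP's $response->decoded_content can do the first step using the charset the server declares.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn raw bytes in a known charset into Perl characters.
sub bytes_to_chars {
    my ($bytes, $charset) = @_;
    return decode($charset, $bytes);
}

# Hypothetical helper: turn Perl characters into bytes for the output side.
sub chars_to_bytes {
    my ($chars, $charset) = @_;
    return encode($charset, $chars);
}

# "Mueller" with a u-umlaut, as ISO-8859-1 bytes (stand-in for $response->content).
my $raw  = "M\xFCller";
my $text = bytes_to_chars($raw, 'iso-8859-1');

# Assumption: the ODBC driver expects Windows-1252 bytes.
my $for_db  = chars_to_bytes($text, 'cp1252');
# For comparison: the same text as UTF-8 needs one more byte for the umlaut.
my $as_utf8 = chars_to_bytes($text, 'UTF-8');

printf "%d characters, %d bytes as cp1252, %d bytes as UTF-8\n",
    length($text), length($for_db), length($as_utf8);
```

If the character count and the byte counts differ in the way you expect for a string with umlauts, the decoding step is working; the remaining question is then only which encoding the database side wants.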


      