ActivePerl 5.8 documentation

perlebcdic - Considerations for running Perl on EBCDIC platforms


NAME

perlebcdic - Considerations for running Perl on EBCDIC platforms


DESCRIPTION

An exploration of some of the issues facing Perl programmers on EBCDIC based computers. We do not cover localization, internationalization, or multi byte character set issues other than some discussion of UTF-8 and UTF-EBCDIC.

Portions that are still incomplete are marked with XXX.


COMMON CHARACTER CODE SETS

ASCII

The American Standard Code for Information Interchange is a set of integers running from 0 to 127 (decimal) that are interpreted as characters by the display and other systems of a computer. The range 0..127 can be covered by setting the bits in a 7-bit binary number, hence the set is sometimes referred to as "7-bit ASCII". ASCII was described by the American National Standards Institute document ANSI X3.4-1986. It was also described by ISO 646:1991 (with localization for currency symbols). The full ASCII set is given in the table below as the first 128 elements. Languages that can be written adequately with the characters in ASCII include English, Hawaiian, Indonesian, Swahili and some Native American languages.

There are many character sets that extend the range of integers from 0..2**7-1 up to 2**8-1, or 8 bit bytes (octets if you prefer). One common one is the ISO 8859-1 character set.

ISO 8859

The ISO 8859-$n are a collection of character code sets from the International Organization for Standardization (ISO), each of which adds characters to the ASCII set that are typically found in European languages, many of which are based on the Roman, or Latin, alphabet.

Latin 1 (ISO 8859-1)

A particular 8-bit extension to ASCII that includes grave and acute accented Latin characters. Languages that can employ ISO 8859-1 include all the languages covered by ASCII as well as Afrikaans, Albanian, Basque, Catalan, Danish, Faroese, Finnish, Norwegian, Portuguese, Spanish, and Swedish. Dutch is covered albeit without the ij ligature. French is covered too but without the oe ligature. German can use ISO 8859-1 but must do so without German-style quotation marks. This set is based on Western European extensions to ASCII and is commonly encountered in world wide web work. In IBM character code set identification terminology ISO 8859-1 is also known as CCSID 819 (or sometimes 0819 or even 00819).

EBCDIC

The Extended Binary Coded Decimal Interchange Code refers to a large collection of slightly different single and multi byte coded character sets that are different from ASCII or ISO 8859-1 and are typically used on host computers. The EBCDIC encodings derive from 8 bit byte extensions of Hollerith punched card encodings. The layout on the cards was such that high bits were set for the upper and lower case alphabet characters [a-z] and [A-Z], but there were gaps within each Latin alphabet range.

Some IBM EBCDIC character sets may be known by character code set identification numbers (CCSID numbers) or code page numbers. Leading zero digits in CCSID numbers within this document are insignificant. E.g. CCSID 0037 may be referred to as 37 in places.
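
Because leading zeros carry no meaning, CCSID strings can be compared after stripping them. The following is a minimal sketch; the helper name normalize_ccsid is hypothetical and not part of Perl or any module.

```perl
use strict;
use warnings;

# Hypothetical helper: strip insignificant leading zeros from a CCSID string.
sub normalize_ccsid {
    my ($ccsid) = @_;
    $ccsid =~ s/^0+(?=\d)//;    # always keep at least one digit
    return $ccsid;
}

# 0037 and 37 name the same set, as do 00819 and 819.
print normalize_ccsid($_), "\n" for qw(0037 37 00819 1047);
```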

13 variant characters

Among IBM EBCDIC character code sets there are 13 characters that are often mapped to different integer values. Those characters are known as the 13 "variant" characters and are:

    \ [ ] { } ^ ~ ! # | $ @ `
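
As a rough illustration, the code points of the variant characters can be listed with the Encode module. This sketch assumes that your Encode installation provides the encodings named cp37, cp1047 and posix-bc, and that the script itself is written on an ASCII platform; the values printed come from Encode's tables, which may not match every vendor's code page exactly.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The 13 variant characters listed above.
my @variant = ('\\', '[', ']', '{', '}', '^', '~', '!', '#', '|', '$', '@', '`');
my @sets    = qw(cp37 cp1047 posix-bc);

# Map each character to its code point in each EBCDIC set.
my %code;
for my $char (@variant) {
    for my $set (@sets) {
        $code{$char}{$set} = ord encode($set, $char);
    }
}

# Print one row per character, one column per code set.
printf "%-5s%-8s%-8s%-8s\n", 'chr', @sets;
for my $char (@variant) {
    printf "%-5s%-8d%-8d%-8d\n", $char, map { $code{$char}{$_} } @sets;
}
```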

0037

Character code set ID 0037 is a mapping of the ASCII plus Latin-1 characters (i.e. ISO 8859-1) to an EBCDIC set. 0037 is used in North American English locales on the OS/400 operating system that runs on AS/400 computers. CCSID 37 differs from ISO 8859-1 in 236 places; in other words, they agree on only 20 code point values (19 of them C0 controls, plus the paragraph sign at 182).
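
The overlap between the two sets can be counted directly. The sketch below assumes an ASCII platform and relies on the cp37 table shipped with Encode, so it measures that table rather than any particular vendor code page.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Collect the byte values whose cp37 meaning has the same ISO 8859-1 code point.
my @same;
for my $byte (0 .. 255) {
    my $octet = chr $byte;                 # one raw byte
    my $char  = decode('cp37', $octet);    # its character under CCSID 0037
    push @same, $byte if ord($char) == $byte;
}

printf "agree on %d code points, differ on %d\n", scalar @same, 256 - @same;
print "agreeing values: @same\n";
```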

1047

Character code set ID 1047 is also a mapping of the ASCII plus Latin-1 characters (i.e. ISO 8859-1) to an EBCDIC set. 1047 is used under Unix System Services for OS/390 or z/OS, and OpenEdition for VM/ESA. CCSID 1047 differs from CCSID 0037 in eight places.
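
A comparison of the two sets can be sketched in the same way. This assumes the cp37 and cp1047 tables that ship with Encode; the exact list it prints depends on those tables, which may treat the line feed and next line controls differently from the tables Perl itself is built with.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Collect byte values that decode to different characters under the two sets.
my @differ;
for my $byte (0 .. 255) {
    my $in_37   = decode('cp37',   chr $byte);
    my $in_1047 = decode('cp1047', chr $byte);
    push @differ, $byte if $in_37 ne $in_1047;
}

printf "%d byte values differ: %s\n", scalar @differ, join ' ', @differ;
```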

POSIX-BC

The EBCDIC code page in use on Siemens' BS2000 system is distinct from 1047 and 0037. It is identified below as the POSIX-BC set.

Unicode code points versus EBCDIC code points

In Unicode terminology a code point is the number assigned to a character: for example, in EBCDIC the character "A" is usually assigned the number 193. In Unicode the character "A" is assigned the number 65. This causes a problem with the semantics of the pack/unpack "U", which are supposed to pack Unicode code points to characters and back to numbers. The problem is: which code points to use for code points less than 256? (for 256 and over there's no problem: Unicode code points are used) In EBCDIC, for the low 256 the EBCDIC code points are used. This means that the equivalences

        pack("U", ord($character)) eq $character
        unpack("U", $character) == ord $character

will hold. (If Unicode code points were applied consistently over all the possible code points, then on an EBCDIC platform pack("U", ord("A")) would yield A with acute, that is chr(101), because ord("A") is 193 there; and unpack("U", "A") would yield 65, which is the EBCDIC code point of the non-breaking space, rather than 193, which is ord("A").)
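
The two equivalences can be checked on whatever platform the script runs on. This is a minimal sketch restricted to a handful of characters from the ASCII repertoire; it does not exercise code points of 128 and above.

```perl
use strict;
use warnings;

# Check both equivalences for a few characters, using whatever code points
# the current platform assigns (ASCII or EBCDIC).
my @failed;
for my $character ('A', 'z', '0', '^') {
    push @failed, $character
        unless pack("U", ord($character)) eq $character
           and unpack("U", $character) == ord $character;
}

print @failed ? "failed for: @failed\n" : "both equivalences hold\n";
```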

Remaining Perl Unicode problems in EBCDIC

  • Many of the remaining problems seem to be related to case-insensitive matching: for example, /[\x{131}]/ (LATIN SMALL LETTER DOTLESS I) does not match "I" case-insensitively, as it should under Unicode. (The match succeeds on ASCII-derived platforms.)

  • The extensions Unicode::Collate and Unicode::Normalize are not supported under EBCDIC, and neither is the encoding pragma.

Unicode and UTF

UTF is a Unicode Transformation Format. UTF-8 is a Unicode conforming representation of the Unicode standard that looks very much like ASCII. UTF-EBCDIC is an attempt to represent Unicode characters in an EBCDIC transparent manner.
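
To see why UTF-8 "looks very much like ASCII", compare the octets it produces for an ASCII character and for a Latin-1 character. This sketch assumes an ASCII platform; utf8_octets is a hypothetical helper that prints octets as dotted decimals, the notation used in the tables later in this document.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the UTF-8 octets of a string as dotted decimals.
sub utf8_octets {
    my ($string) = @_;
    return join '.', unpack('C*', encode('UTF-8', $string));
}

print utf8_octets('A'), "\n";             # one octet, same value as in ASCII
print utf8_octets("\N{U+00A0}"), "\n";    # NO-BREAK SPACE needs two octets
```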

Using Encode

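Starting with Perl 5.8 you can use the new standard module Encode to translate from EBCDIC to Latin-1 code points. The keys of %ebcdic are the values that ord '^' returns under each of the three code sets, so the lookup selects the encoding that matches the platform the script runs on: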

        use Encode 'from_to';
        my %ebcdic = ( 176 => 'cp37', 95 => 'cp1047', 106 => 'posix-bc' );
        # $a is in EBCDIC code points
        from_to($a, $ebcdic{ord '^'}, 'latin1');
        # $a is ISO 8859-1 code points

and from Latin-1 code points to EBCDIC code points:

        use Encode 'from_to';
        my %ebcdic = ( 176 => 'cp37', 95 => 'cp1047', 106 => 'posix-bc' );
        # $a is ISO 8859-1 code points
        from_to($a, 'latin1', $ebcdic{ord '^'});
        # $a is in EBCDIC code points
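
The same calls can be tried on an ASCII platform by naming the EBCDIC encoding explicitly instead of detecting it with ord '^'. The following is a sketch under that assumption, using cp37 as the example target.

```perl
use strict;
use warnings;
use Encode qw(from_to);

my $text = "Hello";                  # ISO 8859-1 octets on an ASCII platform
from_to($text, 'latin1', 'cp37');    # converts $text in place
my @ebcdic = unpack 'C*', $text;     # the CCSID 0037 code points
from_to($text, 'cp37', 'latin1');    # and back again

print "cp37 octets: @ebcdic\n";
print "round trip:  $text\n";
```

Note that from_to() modifies its first argument and returns the length of the converted string, not the string itself.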

For doing I/O it is suggested that you use the autotranslating features of PerlIO; see the perluniintro manpage.

Since version 5.8 Perl uses the new PerlIO I/O library. This enables you to use different encodings per IO channel. For example you may use

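    use Encode;
    open(my $f, ">:encoding(ascii)", "test.ascii") or die "Could not open test.ascii: $!";
    print $f "Hello World!\n";
    close($f);
    open($f, ">:encoding(cp37)", "test.ebcdic") or die "Could not open test.ebcdic: $!";
    print $f "Hello World!\n";
    close($f);
    open($f, ">:encoding(latin1)", "test.latin1") or die "Could not open test.latin1: $!";
    print $f "Hello World!\n";
    close($f);
    open($f, ">:encoding(utf8)", "test.utf8") or die "Could not open test.utf8: $!";
    print $f "Hello World!\n";
    close($f);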

to get four files containing "Hello World!\n" in ASCII, CP 37 EBCDIC, ISO 8859-1 (Latin-1) (in this example identical to ASCII), and UTF-EBCDIC (in this example identical to normal EBCDIC), respectively. See the documentation of Encode::PerlIO for details.

As the PerlIO layer uses raw IO (bytes) internally, all this totally ignores things like the type of your filesystem (ASCII or EBCDIC).
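
One way to confirm what an encoding layer writes, without touching the filesystem at all, is to print through the layer into an in-memory file and inspect the resulting octets. This is a sketch assuming an ASCII platform and the cp37 encoding from Encode.

```perl
use strict;
use warnings;

# Write through an encoding layer into an in-memory file, then inspect the octets.
my $buffer = '';
open(my $out, '>:encoding(cp37)', \$buffer)
    or die "Could not open in-memory file: $!";
print {$out} "Hi";
close($out) or die "Could not close in-memory file: $!";

my @octets = unpack 'C*', $buffer;
print "cp37 octets: @octets\n";
```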


SINGLE OCTET TABLES

The following tables list the ASCII and Latin 1 ordered sets including the subsets: C0 controls (0..31), ASCII graphics (32..126), delete (127), C1 controls (128..159), and Latin-1 (a.k.a. ISO 8859-1) (160..255). In the table non-printing control character names as well as the Latin 1 extensions to ASCII have been labelled with character names roughly corresponding to The Unicode Standard, Version 3.0 albeit with substitutions such as s/LATIN// and s/VULGAR// in all cases, s/CAPITAL LETTER// in some cases, and s/SMALL LETTER ([A-Z])/\l$1/ in some other cases (the charnames pragma names unfortunately do not list explicit names for the C0 or C1 control characters). The "names" of the C1 control set (128..159 in ISO 8859-1) listed here are somewhat arbitrary.

The differences between the 0037 and 1047 sets are flagged with ***. The differences between the 1047 and POSIX-BC sets are flagged with ###. All ord() numbers listed are decimal. If you would rather see this table listing octal values then run the table (that is, the pod version of this document since this recipe may not work with a pod2_other_format translation) through:

recipe 0
    perl -ne 'if(/(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \
     -e '{printf("%s%-9o%-9o%-9o%o\n",$1,$2,$3,$4,$5)}' perlebcdic.pod

If you want to retain the UTF-x code points then in script form you might want to write:

recipe 1
    open(FH,"<perlebcdic.pod") or die "Could not open perlebcdic.pod: $!";
    while (<FH>) {
        if (/(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\.?(\d*)\s+(\d+)\.?(\d*)/)  {
            if ($7 ne '' && $9 ne '') {
                printf("%s%-9o%-9o%-9o%-9o%-3o.%-5o%-3o.%o\n",$1,$2,$3,$4,$5,$6,$7,$8,$9);
            }
            elsif ($7 ne '') {
                printf("%s%-9o%-9o%-9o%-9o%-3o.%-5o%o\n",$1,$2,$3,$4,$5,$6,$7,$8);
            }
            else {
                printf("%s%-9o%-9o%-9o%-9o%-9o%o\n",$1,$2,$3,$4,$5,$6,$8);
            }
        }
    }

If you would rather see this table listing hexadecimal values then run the table through:

recipe 2
    perl -ne 'if(/(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \
     -e '{printf("%s%-9X%-9X%-9X%X\n",$1,$2,$3,$4,$5)}' perlebcdic.pod

Or, in order to retain the UTF-x code points in hexadecimal:

recipe 3
    open(FH,"<perlebcdic.pod") or die "Could not open perlebcdic.pod: $!";
    while (<FH>) {
        if (/(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\.?(\d*)\s+(\d+)\.?(\d*)/)  {
            if ($7 ne '' && $9 ne '') {
                printf("%s%-9X%-9X%-9X%-9X%-2X.%-6X%-2X.%X\n",$1,$2,$3,$4,$5,$6,$7,$8,$9);
            }
            elsif ($7 ne '') {
                printf("%s%-9X%-9X%-9X%-9X%-2X.%-6X%X\n",$1,$2,$3,$4,$5,$6,$7,$8);
            }
            else {
                printf("%s%-9X%-9X%-9X%-9X%-9X%X\n",$1,$2,$3,$4,$5,$6,$8);
            }
        }
    }
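
The recipes above all rely on the same layout: a label that occupies the first 33 columns, followed by decimal code points. As a small sketch, the row for "A" is rebuilt here with sprintf (an assumption about the exact column widths of the pod source) and converted to hexadecimal with the pattern used in recipe 2.

```perl
use strict;
use warnings;

# A sample row laid out like the table: a 33-column label (including the
# 4-space indent), then decimal code points in 9-column fields.
my $row = sprintf "    %-29s%-9d%-9d%-9d%d", 'A', 65, 193, 193, 193;

my @hex;
if ($row =~ /(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/) {
    # $1 is the label; $2..$5 are the 8859-1, 0037, 1047 and POSIX-BC columns.
    @hex = map { sprintf('%X', $_) } $2, $3, $4, $5;
    print "$1@hex\n";
}
```
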
                                                                     incomp-  incomp-
                                 8859-1                              lete     lete
    chr                          0819     0037     1047     POSIX-BC UTF-8    UTF-EBCDIC
    ------------------------------------------------------------------------------------
    <NULL>                       0        0        0        0        0        0 
    <START OF HEADING>           1        1        1        1        1        1
    <START OF TEXT>              2        2        2        2        2        2
    <END OF TEXT>                3        3        3        3        3        3
    <END OF TRANSMISSION>        4        55       55       55       4        55 
    <ENQUIRY>                    5        45       45       45       5        45 
    <ACKNOWLEDGE>                6        46       46       46       6        46 
    <BELL>                       7        47       47       47       7        47 
    <BACKSPACE>                  8        22       22       22       8        22 
    <HORIZONTAL TABULATION>      9        5        5        5        9        5 
    <LINE FEED>                  10       37       21       21       10       21       ***
    <VERTICAL TABULATION>        11       11       11       11       11       11
    <FORM FEED>                  12       12       12       12       12       12
    <CARRIAGE RETURN>            13       13       13       13       13       13
    <SHIFT OUT>                  14       14       14       14       14       14
    <SHIFT IN>                   15       15       15       15       15       15
    <DATA LINK ESCAPE>           16       16       16       16       16       16
    <DEVICE CONTROL ONE>         17       17       17       17       17       17
    <DEVICE CONTROL TWO>         18       18       18       18       18       18
    <DEVICE CONTROL THREE>       19       19       19       19       19       19
    <DEVICE CONTROL FOUR>        20       60       60       60       20       60
    <NEGATIVE ACKNOWLEDGE>       21       61       61       61       21       61
    <SYNCHRONOUS IDLE>           22       50       50       50       22       50
    <END OF TRANSMISSION BLOCK>  23       38       38       38       23       38
    <CANCEL>                     24       24       24       24       24       24
    <END OF MEDIUM>              25       25       25       25       25       25
    <SUBSTITUTE>                 26       63       63       63       26       63
    <ESCAPE>                     27       39       39       39       27       39
    <FILE SEPARATOR>             28       28       28       28       28       28
    <GROUP SEPARATOR>            29       29       29       29       29       29
    <RECORD SEPARATOR>           30       30       30       30       30       30
    <UNIT SEPARATOR>             31       31       31       31       31       31
    <SPACE>                      32       64       64       64       32       64
    !                            33       90       90       90       33       90
    "                            34       127      127      127      34       127
    #                            35       123      123      123      35       123
    $                            36       91       91       91       36       91
    %                            37       108      108      108      37       108
    &                            38       80       80       80       38       80
    '                            39       125      125      125      39       125
    (                            40       77       77       77       40       77
    )                            41       93       93       93       41       93
    *                            42       92       92       92       42       92
    +                            43       78       78       78       43       78
    ,                            44       107      107      107      44       107
    -                            45       96       96       96       45       96
    .                            46       75       75       75       46       75
    /                            47       97       97       97       47       97
    0                            48       240      240      240      48       240
    1                            49       241      241      241      49       241
    2                            50       242      242      242      50       242
    3                            51       243      243      243      51       243
    4                            52       244      244      244      52       244
    5                            53       245      245      245      53       245
    6                            54       246      246      246      54       246
    7                            55       247      247      247      55       247
    8                            56       248      248      248      56       248
    9                            57       249      249      249      57       249
    :                            58       122      122      122      58       122
    ;                            59       94       94       94       59       94
    <                            60       76       76       76       60       76
    =                            61       126      126      126      61       126
    >                            62       110      110      110      62       110
    ?                            63       111      111      111      63       111
    @                            64       124      124      124      64       124
    A                            65       193      193      193      65       193
    B                            66       194      194      194      66       194
    C                            67       195      195      195      67       195
    D                            68       196      196      196      68       196
    E                            69       197      197      197      69       197
    F                            70       198      198      198      70       198
    G                            71       199      199      199      71       199
    H                            72       200      200      200      72       200
    I                            73       201      201      201      73       201
    J                            74       209      209      209      74       209
    K                            75       210      210      210      75       210
    L                            76       211      211      211      76       211
    M                            77       212      212      212      77       212
    N                            78       213      213      213      78       213
    O                            79       214      214      214      79       214
    P                            80       215      215      215      80       215
    Q                            81       216      216      216      81       216
    R                            82       217      217      217      82       217
    S                            83       226      226      226      83       226
    T                            84       227      227      227      84       227
    U                            85       228      228      228      85       228
    V                            86       229      229      229      86       229
    W                            87       230      230      230      87       230
    X                            88       231      231      231      88       231
    Y                            89       232      232      232      89       232
    Z                            90       233      233      233      90       233
    [                            91       186      173      187      91       173      *** ###
    \                            92       224      224      188      92       224      ### 
    ]                            93       187      189      189      93       189      ***
    ^                            94       176      95       106      94       95       *** ###
    _                            95       109      109      109      95       109
    `                            96       121      121      74       96       121      ###
    a                            97       129      129      129      97       129
    b                            98       130      130      130      98       130
    c                            99       131      131      131      99       131
    d                            100      132      132      132      100      132
    e                            101      133      133      133      101      133
    f                            102      134      134      134      102      134
    g                            103      135      135      135      103      135
    h                            104      136      136      136      104      136
    i                            105      137      137      137      105      137
    j                            106      145      145      145      106      145
    k                            107      146      146      146      107      146
    l                            108      147      147      147      108      147
    m                            109      148      148      148      109      148
    n                            110      149      149      149      110      149
    o                            111      150      150      150      111      150
    p                            112      151      151      151      112      151
    q                            113      152      152      152      113      152
    r                            114      153      153      153      114      153
    s                            115      162      162      162      115      162
    t                            116      163      163      163      116      163
    u                            117      164      164      164      117      164
    v                            118      165      165      165      118      165
    w                            119      166      166      166      119      166
    x                            120      167      167      167      120      167
    y                            121      168      168      168      121      168
    z                            122      169      169      169      122      169
    {                            123      192      192      251      123      192      ###
    |                            124      79       79       79       124      79
    }                            125      208      208      253      125      208      ###
    ~                            126      161      161      255      126      161      ###
    <DELETE>                     127      7        7        7        127      7
    <C1 0>                       128      32       32       32       194.128  32
    <C1 1>                       129      33       33       33       194.129  33
    <C1 2>                       130      34       34       34       194.130  34
    <C1 3>                       131      35       35       35       194.131  35
    <C1 4>                       132      36       36       36       194.132  36
    <C1 5>                       133      21       37       37       194.133  37       ***
    <C1 6>                       134      6        6        6        194.134  6
    <C1 7>                       135      23       23       23       194.135  23
    <C1 8>                       136      40       40       40       194.136  40
    <C1 9>                       137      41       41       41       194.137  41
    <C1 10>                      138      42       42       42       194.138  42
    <C1 11>                      139      43       43       43       194.139  43
    <C1 12>                      140      44       44       44       194.140  44
    <C1 13>                      141      9        9        9        194.141  9
    <C1 14>                      142      10       10       10       194.142  10
    <C1 15>                      143      27       27       27       194.143  27
    <C1 16>                      144      48       48       48       194.144  48
    <C1 17>                      145      49       49       49       194.145  49
    <C1 18>                      146      26       26       26       194.146  26
    <C1 19>                      147      51       51       51       194.147  51
    <C1 20>                      148      52       52       52       194.148  52
    <C1 21>                      149      53       53       53       194.149  53
    <C1 22>                      150      54       54       54       194.150  54
    <C1 23>                      151      8        8        8        194.151  8
    <C1 24>                      152      56       56       56       194.152  56
    <C1 25>                      153      57       57       57       194.153  57
    <C1 26>                      154      58       58       58       194.154  58
    <C1 27>                      155      59       59       59       194.155  59
    <C1 28>                      156      4        4        4        194.156  4
    <C1 29>                      157      20       20       20       194.157  20
    <C1 30>                      158      62       62       62       194.158  62
    <C1 31>                      159      255      255      95       194.159  255      ###
    <NON-BREAKING SPACE>         160      65       65       65       194.160  128.65
    <INVERTED EXCLAMATION MARK>  161      170      170      170      194.161  128.66
    <CENT SIGN>                  162      74       74       176      194.162  128.67   ###
    <POUND SIGN>                 163      177      177      177      194.163  128.68
    <CURRENCY SIGN>              164      159      159      159      194.164  128.69
    <YEN SIGN>                   165      178      178      178      194.165  128.70
    <BROKEN BAR>                 166      106      106      208      194.166  128.71   ###
    <SECTION SIGN>               167      181      181      181      194.167  128.72
    <DIAERESIS>                  168      189      187      121      194.168  128.73   *** ###
    <COPYRIGHT SIGN>             169      180      180      180      194.169  128.74
    <FEMININE ORDINAL INDICATOR> 170      154      154      154      194.170  128.81
    <LEFT POINTING GUILLEMET>    171      138      138      138      194.171  128.82
    <NOT SIGN>                   172      95       176      186      194.172  128.83   *** ###
    <SOFT HYPHEN>                173      202      202      202      194.173  128.84
    <REGISTERED TRADE MARK SIGN> 174      175      175      175      194.174  128.85
    <MACRON>                     175      188      188      161      194.175  128.86   ###
    <DEGREE SIGN>                176      144      144      144      194.176  128.87
    <PLUS-OR-MINUS SIGN>         177      143      143      143      194.177  128.88
    <SUPERSCRIPT TWO>            178      234      234      234      194.178  128.89
    <SUPERSCRIPT THREE>          179      250      250      250      194.179  128.98
    <ACUTE ACCENT>               180      190      190      190      194.180  128.99
    <MICRO SIGN>                 181      160      160      160      194.181  128.100
    <PARAGRAPH SIGN>             182      182      182      182      194.182  128.101
    <MIDDLE DOT>                 183      179      179      179      194.183  128.102
    <CEDILLA>                    184      157      157      157      194.184  128.103
    <SUPERSCRIPT ONE>            185      218      218      218      194.185  128.104
    <MASC. ORDINAL INDICATOR>    186      155      155      155      194.186  128.105
    <RIGHT POINTING GUILLEMET>   187      139      139      139      194.187  128.106
    <FRACTION ONE QUARTER>       188      183      183      183      194.188  128.112
    <FRACTION ONE HALF>          189      184      184      184      194.189  128.113
    <FRACTION THREE QUARTERS>    190      185      185      185      194.190  128.114
    <INVERTED QUESTION MARK>     191      171      171      171      194.191  128.115
    <A WITH GRAVE>               192      100      100      100      195.128  138.65
    <A WITH ACU