|
perlport - Writing portable Perl
Perl runs on numerous operating systems. While most of them share
much in common, they also have their own unique features.
This document is meant to help you to find out what constitutes portable
Perl code. That way, once you make a decision to write portably,
you know where the lines are drawn, and you can stay within them.
There is a tradeoff between taking full advantage of one particular
type of computer and taking advantage of a full range of them.
Naturally, as you broaden your range and become more diverse, the
common factors drop, and you are left with an increasingly smaller
area of common ground in which you can operate to accomplish a
particular task. Thus, when you begin attacking a problem, it is
important to consider under which part of the tradeoff curve you
want to operate. Specifically, you must decide whether it is
important that the task that you are coding have the full generality
of being portable, or whether to just get the job done right now.
This is the hardest choice to be made. The rest is easy, because
Perl provides many choices, whichever way you want to approach your
problem.
Looking at it another way, writing portable code is usually about
willfully limiting your available choices. Naturally, it takes
discipline and sacrifice to do that. The product of portability
and convenience may be a constant. You have been warned.
Be aware of two important points:
- Not all Perl programs have to be portable
There is no reason you should not use Perl as a language to glue Unix
tools together, or to prototype a Macintosh application, or to manage the
Windows registry. If it makes no sense to aim for portability for one
reason or another in a given program, then don't bother.
- Nearly all of Perl already is portable
Don't be fooled into thinking that it is hard to create portable Perl
code. It isn't. Perl tries its level-best to bridge the gaps between
what's available on different platforms, and all the means available to
use those features. Thus almost all Perl code runs on any machine
without modification. But there are some significant issues in
writing portable code, and this document is entirely about those issues.
Here's the general rule: When you approach a task commonly done
using a whole range of platforms, think about writing portable
code. That way, you don't sacrifice much by way of the implementation
choices you can avail yourself of, and at the same time you can give
your users lots of platform choices. On the other hand, when you have to
take advantage of some unique feature of a particular platform, as is
often the case with systems programming (whether for Unix, Windows,
Mac OS, VMS, etc.), consider writing platform-specific code.
When the code will run on only two or three operating systems, you
may need to consider only the differences of those particular systems.
The important thing is to decide where the code will run and to be
deliberate in your decision.
The material below is separated into three main sections: main issues of
portability (ISSUES), platform-specific issues (PLATFORMS), and
built-in perl functions that behave differently on various ports
(FUNCTION IMPLEMENTATIONS).
This information should not be considered complete; it includes possibly
transient information about idiosyncrasies of some of the ports, almost
all of which are in a state of constant evolution. Thus, this material
should be considered a perpetual work in progress
(<IMG SRC="yellow_sign.gif" ALT="Under Construction">).
In most operating systems, lines in files are terminated by newlines.
Just what is used as a newline may vary from OS to OS. Unix
traditionally uses \012, one type of DOSish I/O uses \015\012,
and Mac OS uses \015.
Perl uses \n to represent the "logical" newline, where what is
logical may depend on the platform in use. In MacPerl, \n always
means \015. In DOSish perls, \n usually means \012, but
when accessing a file in "text" mode, STDIO translates it to (or
from) \015\012, depending on whether you're reading or writing.
Unix does the same thing on ttys in canonical mode. \015\012
is commonly referred to as CRLF.
To trim trailing newlines from text lines use chomp(). With default
settings that function looks for a trailing \n character and thus
trims in a portable way.
When dealing with binary files (or text files in binary mode) be sure
to explicitly set $/ to the appropriate value for your file format
before using chomp().
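As a minimal sketch of the point above, the following reads CRLF-terminated records through a filehandle in binary mode. The in-memory filehandle and the sample data are assumptions made only to keep the example self-contained.

```perl
use strict;
use warnings;

# Sample data with CRLF line endings (illustrative only).
my $raw = "first\015\012second\015\012";

# Read from an in-memory filehandle in binary mode, so no
# text-mode newline translation is involved.
open(my $fh, '<', \$raw) or die "Cannot open in-memory file: $!";
binmode($fh);

my @records;
{
    # Tell both readline and chomp() that records end in CRLF.
    local $/ = "\015\012";
    while (my $line = <$fh>) {
        chomp($line);            # removes the whole CRLF, not just LF
        push @records, $line;
    }
}
close($fh);

print join('|', @records), "\n";   # prints "first|second"
```

Localizing $/ inside a block keeps the change from leaking into other code that still expects the default separator.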
Because of the "text" mode translation, DOSish perls have limitations
in using seek and tell on a file accessed in "text" mode.
Stick to seek-ing to locations you got from tell (and no
others), and you are usually free to use seek and tell even
in "text" mode. Using seek or tell or other file operations
may be non-portable. If you use binmode on a file, however, you
can usually seek and tell with arbitrary values in safety.
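The following sketch shows the safe pattern: it calls binmode on the handle, records a position with tell, and later returns to that same position with seek. The scratch file and its contents are assumptions for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file with LF-terminated lines (illustrative data).
my ($out, $filename) = tempfile(UNLINK => 1);
binmode($out);
print {$out} "line one\012line two\012line three\012";
close($out) or die "Cannot close $filename: $!";

open(my $in, '<', $filename) or die "Cannot open $filename: $!";
binmode($in);                       # byte offsets are now reliable

my ($pos, $second, $again);
{
    local $/ = "\012";              # the file format uses LF explicitly
    my $first = <$in>;
    $pos    = tell($in);            # remember where the second line starts
    $second = <$in>;

    seek($in, $pos, 0) or die "Cannot seek: $!";   # 0 means SEEK_SET
    $again = <$in>;                 # re-reads the second line
}
close($in);

print "re-read the same line\n" if $again eq $second;
```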
A common misconception in socket programming is that \n eq \012
everywhere. When using protocols such as common Internet protocols,
\012 and \015 are called for specifically, and the values of
the logical \n and \r (carriage return) are not reliable.
print SOCKET "Hi there, client!\r\n";      # WRONG
print SOCKET "Hi there, client!\015\012";  # RIGHT
However, using \015\012 (or \cM\cJ, or \x0D\x0A) can be tedious
and unsightly, as well as confusing to those maintaining the code. As
such, the Socket module supplies the Right Thing for those who want it.
use Socket qw(:DEFAULT :crlf);
print SOCKET "Hi there, client!$CRLF";
When reading from a socket, remember that the default input record
separator $/ is \n, but robust socket code will recognize
either \012 or \015\012 as end of line:
while (<SOCKET>) {
}
Because both CRLF and LF end in LF, the input record separator can
be set to LF and any CR stripped later. Better to write:
use Socket qw(:DEFAULT :crlf);
local($/) = LF;
while (<SOCKET>) {
    s/$CR?$LF/\n/;   # not sure if socket uses LF or CRLF, OK
}
This example is preferred over the previous one--even for Unix
platforms--because now any \015's (\cM's) are stripped out
(and there was much rejoicing).
Similarly, functions that return text data--such as a function that
fetches a web page--should sometimes translate newlines before
returning the data, if they've not yet been translated to the local
newline representation. A single line of code will often suffice:
$data =~ s/\015?\012/\n/g;
return $data;
Some of this may be confusing. Here's a handy reference to the ASCII CR
and LF characters. You can print it out and stick it in your wallet.
LF eq \012 eq \x0A eq \cJ eq chr(10) eq ASCII 10
CR eq \015 eq \x0D eq \cM eq chr(13) eq ASCII 13
     | Unix | DOS  | Mac  |
---------------------------
\n   |  LF  |  LF  |  CR  |
\r   |  CR  |  CR  |  LF  |
\n * |  LF  | CRLF |  CR  |
\r * |  CR  |  CR  |  LF  |
---------------------------
* text-mode STDIO
The Unix column assumes that you are not accessing a serial line
(like a tty) in canonical mode. If you are, then CR on input becomes
"\n", and "\n" on output becomes CRLF.
These are just the most common definitions of \n and \r in Perl.
There may well be others. For example, on an EBCDIC implementation
such as z/OS (OS/390) or OS/400 (using the ILE, the PASE is ASCII-based)
the above material is similar to "Unix" but the code numbers change:
LF  eq  \025  eq  \x15  eq  \cU  eq  chr(21)  eq  CP-1047 21
LF  eq  \045  eq  \x25  eq           chr(37)  eq  CP-0037 37
CR  eq  \015  eq  \x0D  eq  \cM  eq  chr(13)  eq  CP-1047 13
CR  eq  \015  eq  \x0D  eq  \cM  eq  chr(13)  eq  CP-0037 13
     | z/OS | OS/400 |
----------------------
\n   |  LF  |  LF    |
\r   |  CR  |  CR    |
\n * |  LF  |  LF    |
\r * |  CR  |  CR    |
----------------------
* text-mode STDIO
Different CPUs store integers and floating point numbers in different
orders (called endianness) and widths (32-bit and 64-bit being the
most common today). This affects your programs when they attempt to transfer
numbers in binary format from one CPU architecture to another,
usually either "live" via network connection, or by storing the
numbers to secondary storage such as a disk file or tape.
Conflicting storage orders make an utter mess of the numbers. If a
little-endian host (Intel, VAX) stores 0x12345678 (305419896 in
decimal), a big-endian host (Motorola, Sparc, PA) reads it as
0x78563412 (2018915346 in decimal). Alpha and MIPS can be either:
Digital/Compaq used/uses them in little-endian mode; SGI/Cray uses
them in big-endian mode. To avoid this problem in network (socket)
connections use the pack and unpack formats n and N, the
"network" orders. These are guaranteed to be portable.
As of perl 5.9.2, you can also use the > and < modifiers
to force big- or little-endian byte-order. This is useful if you want
to store signed integers or 64-bit integers, for example.
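A short sketch of these formats follows. It packs the same value used in the example above with N, then uses the > modifier for a signed value that n and N cannot express. The hex dumps are only there to make the byte order visible, and the last part assumes a perl recent enough to support the modifiers.

```perl
use strict;
use warnings;

# "N" is a 32-bit unsigned integer in big-endian (network) order,
# regardless of the byte order of the host.
my $wire = pack("N", 0x12345678);
print unpack("H*", $wire), "\n";               # prints "12345678"

# "n" is the 16-bit counterpart.
my $short = pack("n", 0x1234);
print unpack("H*", $short), "\n";              # prints "1234"

# The ">" modifier forces big-endian order on a native format,
# here a signed 32-bit integer.
my $signed = pack("l>", -2);
print unpack("H*", $signed), "\n";             # prints "fffffffe"

# Unpacking with the same template recovers the original value.
my ($value) = unpack("l>", $signed);
print "$value\n";                              # prints "-2"
```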
You can explore the endianness of your platform by unpacking a
data structure packed in native format such as:
print unpack("h*", pack("s2", 1, 2)), "\n";
If you need to distinguish between endian architectures you could use
either of the variables set like so:
$is_big_endian = unpack("h*", pack("s", 1)) =~ /01/;
$is_little_endian = unpack("h*", pack("s", 1)) =~ /^1/;
Differing widths can cause truncation even between platforms of equal
endianness. The platform of shorter width loses the upper parts of the
number. There is no good solution for this problem except to avoid
transferring or storing raw binary numbers.
One can circumvent both these problems in two ways. Either
transfer and store numbers always in text format, instead of raw
binary, or else consider using modules like Data::Dumper (included in
the standard distribution as of Perl 5.005) and Storable (included as
of perl 5.8). Keeping all data as text significantly simplifies matters.
The v-strings are portable only up to v2147483647 (0x7FFFFFFF), because
that is how far EBCDIC, or more precisely UTF-EBCDIC, will go.
Most platforms these days structure files in a hierarchical fashion.
So, it is reasonably safe to assume that all platforms support the
notion of a "path" to uniquely identify a file on the system. How
that path is really written, though, differs considerably.
Although similar, file path specifications differ between Unix,
Windows, Mac OS, OS/2, VMS, VOS, RISC OS, and probably others.
Unix, for example, is one of the few OSes that has the elegant idea
of a single root directory.
DOS, OS/2, VMS, VOS, and Windows can work similarly to Unix with /
as path separator, or in their own idiosyncratic ways (such as having
several root directories and various "unrooted" device files such as NIL:
and LPT:).
Mac OS uses : as a path separator instead of /.
The filesystem may support neither hard links (link) nor
symbolic links (symlink, readlink, lstat).
The filesystem may support neither an access timestamp nor a change
timestamp (meaning that about the only portable timestamp is the
modification timestamp), and it may not offer one-second granularity
for any timestamp (e.g. the FAT filesystem limits the time granularity
to two seconds).
The "inode change timestamp" (the -C filetest) may really be the
"creation timestamp" (which it is not in UNIX).
VOS perl can emulate Unix filenames with / as path separator. The
native pathname characters greater-than, less-than, number-sign, and
percent-sign are always accepted.
RISC OS perl can emulate Unix filenames with / as path
separator, or go native and use . for path separator and : to
signal filesystems and disk names.
Don't assume UNIX filesystem access semantics: that read, write,
and execute are all the permissions there are, and even if they exist,
that their semantics (for example what do r, w, and x mean on
a directory) are the UNIX ones. The various UNIX/POSIX compatibility
layers usually try to make interfaces like chmod() work, but sometimes
there simply is no good mapping.
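Some of these differences can be probed at run time instead of assumed. The sketch below checks whether perl implements symlink() on the current platform and reads the modification timestamp of a file. The scratch file is an assumption for illustration. Note that the probe only tells you whether the function exists; a particular filesystem may still refuse to create links.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# symlink() raises an exception on platforms that do not implement it,
# so the probe is wrapped in eval. The empty arguments make the call
# fail harmlessly where symbolic links do exist.
my $have_symlinks = eval { symlink("", ""); 1 } ? 1 : 0;
print "symlink() is ", ($have_symlinks ? "" : "not "), "implemented here\n";

# Element 9 of the list returned by stat() is the modification time,
# the one timestamp treated above as broadly portable.
my ($fh, $filename) = tempfile(UNLINK => 1);
close($fh) or die "Cannot close $filename: $!";
my $mtime = (stat($filename))[9];
print "last modified: ", scalar(localtime($mtime)), "\n";
```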
If all this is intimidating, have no (well, maybe only a little)
fear. There are modules that can help. The File::Spec modules
provide methods to do the Right Thing on whatever platform happens
to be running the program.
use File::Spec::Functions;
chdir(updir());
$file = catfile(curdir(), 'temp', 'file.txt');
File::Spec is available in the standard distribution as of version
5.004_05. File::Spec::Functions is only in File::Spec 0.7 and later,
and some versions of perl come with version 0.6. If File::Spec
is not updated to 0.7 or later, you must use the object-oriented
interface from File::Spec (or upgrade File::Spec).
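For comparison, here is a sketch of the same operations written with the object-oriented interface, where each function becomes a class method on File::Spec. The directory and file names are the same illustrative ones used above.

```perl
use strict;
use warnings;
use File::Spec;

# Class-method equivalents of updir(), curdir() and catfile().
my $parent = File::Spec->updir();
my $file   = File::Spec->catfile(File::Spec->curdir(), 'temp', 'file.txt');

# The exact strings depend on the platform the program runs on.
print "parent directory: $parent\n";
print "file path:        $file\n";
```

Because the results are platform dependent, code should pass them on to functions such as open or chdir rather than compare them against hardcoded strings.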
In general, production code should not have file paths hardcoded.
Making them user-supplied or read from a configuration file is
better, keeping in mind that file path syntax varies on different
machines.
This is especially noticeable in scripts like Makefiles and test suites,
which often assume / as a path separator for subdirectories.
Also of use is File::Basename from the standard distribution, which
splits a pathname into pieces (base filename, full path to directory,
and file suffix).
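A minimal sketch of File::Basename in use follows. The path components are hypothetical, and the path is built with File::Spec so that it uses the local syntax. The suffix pattern is one common choice, not the only one.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use File::Spec;

# Hypothetical path, assembled in the syntax of the current platform.
my $path = File::Spec->catfile('lib', 'My', 'Module.pm');

# The second argument is a pattern describing what counts as a suffix;
# this one matches a final dot followed by non-dot characters.
my ($name, $dir, $suffix) = fileparse($path, qr/\.[^.]*/);

print "base name: $name\n";
print "directory: $dir\n";
print "suffix:    $suffix\n";
```

On a Unix-like system the base name is "Module" and the suffix is ".pm"; the directory part is returned in whatever form the local platform uses.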
Even when on a single platform (if you can call Unix a single platform),
remember not to count on the existence or the contents of particular
system-specific files or directories, like /etc/passwd,
/etc/sendmail.conf, /etc/resolv.conf, or even /tmp/. For
example, /etc/passwd may exist but not contain the encrypted
passwords, because the system is using some form of enhanced security.
Or it may not contain all the accounts, because the system is using NIS.
If code does need to rely on such a file, include a description of the
file and its format in the code's documentation, then make it easy for
the user to override the default location of the file.
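One simple way to make a location overridable is an environment variable with a documented default, sketched below. The helper name config_path and the variable MYAPP_PASSWD_FILE are hypothetical; a command-line option or configuration file would serve equally well.

```perl
use strict;
use warnings;

# Return the value of the named environment variable if it is set and
# non-empty, otherwise the supplied default. (Hypothetical helper.)
sub config_path {
    my ($env_name, $default) = @_;
    my $override = $ENV{$env_name};
    return $override if defined $override && length $override;
    return $default;
}

# MYAPP_PASSWD_FILE is an illustrative name, not a real convention.
my $passwd_file = config_path('MYAPP_PASSWD_FILE', '/etc/passwd');

if (-r $passwd_file) {
    print "Reading account data from $passwd_file\n";
} else {
    warn "Cannot read $passwd_file; set MYAPP_PASSWD_FILE to override\n";
}
```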
Don't assume a text file will end with a newline. It should,
but people forget.
Do not have two files or directories of the same name with different
case, like test.pl and Test.pl, as many platforms have
case-insensitive (or at least case-forgiving) filenames. Also, try
not to have non-word characters (except for .) in the names, and
keep them to the 8.3 convention, for maximum portability, onerous a
burden though this may appear.
Likewise, when using the AutoSplit module, try to keep your functions to
8.3 naming and case-insensitive conventions; or, at the least,
make it so the resulting files have a unique (case-insensitively)
first 8 characters.
Whitespace in filenames is tolerated on most systems, but not all,
and even on systems where it might be tolerated, some utilities
might become confused by such whitespace.
Many systems (DOS, VMS ODS-2) cannot have more than one . in their
filenames.
Don't assume > won't be the first character of a filename.
Always use < explicitly to open a file for reading, or even
better, use the three-arg version of open, unless you want the user to
be able to specify a pipe open.
open(FILE, '<', $existing_file) or die $!;
If filenames might use strange characters, it is safest to open them
with sysopen instead of open. open is magic and can
translate characters like >, <, and |, which may
be the wrong thing to do. (Sometimes, though, it's the right thing.)
Three-arg open can also help protect against this translation in cases
where it is undesirable.
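A sketch of the sysopen approach follows. The helper name first_line is hypothetical, and the scratch file exists only to make the example runnable. The point is that sysopen takes the path verbatim, with the access mode supplied separately through Fcntl constants.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY);
use File::Temp qw(tempfile);

# Read the first line of a file whose name is used exactly as given;
# sysopen never interprets characters in the path as a mode or a pipe.
sub first_line {
    my ($path) = @_;
    sysopen(my $fh, $path, O_RDONLY)
        or die "Cannot open '$path' for reading: $!";
    my $line = <$fh>;
    close($fh);
    return $line;
}

# Scratch file for demonstration purposes only.
my ($out, $filename) = tempfile(UNLINK => 1);
print {$out} "hello\n";
close($out) or die "Cannot close $filename: $!";

my $line = first_line($filename);
print $line;                        # prints "hello"
```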
Don't use : as a part of a filename since many systems use that for
their own semantics (Mac OS Classic for separating pathname components,
many networking schemes and utilities for separating the nodename and
the pathname, and so on). For the same reasons, avoid @, ; and
|.
Don't assume that in pathnames you can collapse two leading slashes
// into one: some networking and clustering filesystems have special
semantics for that. Let the operating system sort it out.
The portable filename characters as defined by ANSI C are
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
. _ -
and the "-" shouldn't be the first character. If you want to be
hypercorrect, stay case-insensitive and within the 8.3 naming
convention (all the files and directories have to be unique within one
directory if their names are lowercased and truncated to eight
characters before the ., if any, and to three characters after the
., if any). (And do not use .s in directory names.)
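These rules are easy to check mechanically. The sketch below tests a single name (not a whole path) against the portable character set and, separately, against the 8.3 shape. The function names are hypothetical, and the check does not cover the requirement that names stay unique within a directory after lowercasing.

```perl
use strict;
use warnings;

# True if the name uses only letters, digits, ".", "_" and "-",
# and does not begin with "-".
sub is_portable_filename {
    my ($name) = @_;
    return 0 unless defined $name && length $name;
    return 0 unless $name =~ /\A[A-Za-z0-9._-]+\z/;
    return 0 if $name =~ /\A-/;
    return 1;
}

# True if the name is also in 8.3 form: at most eight characters,
# optionally followed by one "." and at most three more characters.
sub is_8_3_name {
    my ($name) = @_;
    return 0 unless is_portable_filename($name);
    return $name =~ /\A[^.]{1,8}(?:\.[^.]{1,3})?\z/ ? 1 : 0;
}

for my $candidate ('perlport.pod', 'my.tar.gz', '-rf', 'a:b') {
    printf "%-14s portable=%d 8.3=%d\n",
        $candidate, is_portable_filename($candidate), is_8_3_name($candidate);
}
```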
Not all platforms provide a command line. These are usually platforms
that rely primarily on a Graphical User Interface (GUI) for user
interaction. A program requiring a command line interface might
not work everywhere. This is probably for the user of the program
to deal with, so don't stay up late worrying about it.
Some platforms can't delete or rename files held open by the system;
this limitation may also apply to changing filesystem metainformation
like file permissions or owners. Remember to close files when you
are done with them. Don't unlink or rename an open file. Don't
tie or open a file already tied or opened; untie or close
it first.
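The ordering matters more than the individual calls, as this sketch shows: the file is closed before it is renamed, and nothing holds it open when it is finally removed. The temporary directory and file names are assumptions for illustration.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Work inside a scratch directory (illustrative names throughout).
my $dir = tempdir(CLEANUP => 1);
my $old = File::Spec->catfile($dir, 'draft.txt');
my $new = File::Spec->catfile($dir, 'final.txt');

open(my $fh, '>', $old) or die "Cannot create $old: $!";
print {$fh} "some data\n";

# Close first: some platforms refuse to rename or delete an open file.
close($fh) or die "Cannot close $old: $!";

rename($old, $new) or die "Cannot rename $old to $new: $!";
my $renamed = -e $new ? 1 : 0;

# The file is not open anywhere in this program, so removing it is safe.
unlink($new) or die "Cannot remove $new: $!";
```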
Don't open the same file more than once at a time for writing, as some
operating systems put mandatory locks on such files.
Don't assume that write/modify permission on a directory gives the
right to add or delete files/directories in that directory. That is
filesystem specific: in some filesystems you need write/modify
permission also (or even just) in the file/directory itself. In some
filesystems (AFS, DFS) the permission to add/delete directory entries
is a completely separate permission.
Don't assume that a single unlink completely gets rid of the file:
some filesystems (most notably the ones in VMS) keep multiple versions
of a file, and unlink() removes only the most recent one (it doesn't
remove all the versions because by default the native tools on those
platforms remove just the most recent version, too). The portable
idiom to remove all the versions of a file is
1 |