Takes a LIST of values and converts it into a string using the rules given by the TEMPLATE. The resulting string is the concatenation of the converted values. Typically, each converted value looks like its machine-level representation. For example, on 32-bit machines an integer may be represented by a sequence of 4 bytes, which will in Perl be presented as a string that's 4 characters long.
See perlpacktut for an introduction to this function.
The TEMPLATE is a sequence of characters that give the order and type of values, as follows:
- a A string with arbitrary binary data, will be null padded.
- A A text (ASCII) string, will be space padded.
- Z A null-terminated (ASCIZ) string, will be null padded.
- b A bit string (ascending bit order inside each byte,
- like vec()).
- B A bit string (descending bit order inside each byte).
- h A hex string (low nybble first).
- H A hex string (high nybble first).
- c A signed char (8-bit) value.
- C An unsigned char (octet) value.
- W An unsigned char value (can be greater than 255).
- s A signed short (16-bit) value.
- S An unsigned short value.
- l A signed long (32-bit) value.
- L An unsigned long value.
- q A signed quad (64-bit) value.
- Q An unsigned quad value.
- (Quads are available only if your system supports 64-bit
- integer values _and_ if Perl has been compiled to support
- those. Raises an exception otherwise.)
- i A signed integer value.
- I A unsigned integer value.
- (This 'integer' is _at_least_ 32 bits wide. Its exact
- size depends on what a local C compiler calls 'int'.)
- n An unsigned short (16-bit) in "network" (big-endian) order.
- N An unsigned long (32-bit) in "network" (big-endian) order.
- v An unsigned short (16-bit) in "VAX" (little-endian) order.
- V An unsigned long (32-bit) in "VAX" (little-endian) order.
- j A Perl internal signed integer value (IV).
- J A Perl internal unsigned integer value (UV).
- f A single-precision float in native format.
- d A double-precision float in native format.
- F A Perl internal floating-point value (NV) in native format
- D A float of long-double precision in native format.
- (Long doubles are available only if your system supports
- long double values _and_ if Perl has been compiled to
- support those. Raises an exception otherwise.
- Note that there are different long double formats.)
- p A pointer to a null-terminated string.
- P A pointer to a structure (fixed-length string).
- u A uuencoded string.
- U A Unicode character number. Encodes to a character in char-
- acter mode and UTF-8 (or UTF-EBCDIC in EBCDIC platforms) in
- byte mode.
- w A BER compressed integer (not an ASN.1 BER, see perlpacktut
- for details). Its bytes represent an unsigned integer in
- base 128, most significant digit first, with as few digits
- as possible. Bit eight (the high bit) is set on each byte
- except the last.
- x A null byte (a.k.a ASCII NUL, "\000", chr(0))
- X Back up a byte.
- @ Null-fill or truncate to absolute position, counted from the
- start of the innermost ()-group.
- . Null-fill or truncate to absolute position specified by
- the value.
- ( Start of a ()-group.
One or more modifiers below may optionally follow certain letters in the TEMPLATE (the second column lists letters for which the modifier is valid):
- ! sSlLiI Forces native (short, long, int) sizes instead
- of fixed (16-/32-bit) sizes.
- ! xX Make x and X act as alignment commands.
- ! nNvV Treat integers as signed instead of unsigned.
- ! @. Specify position as byte offset in the internal
- representation of the packed string. Efficient
- but dangerous.
- > sSiIlLqQ Force big-endian byte-order on the type.
- jJfFdDpP (The "big end" touches the construct.)
- < sSiIlLqQ Force little-endian byte-order on the type.
- jJfFdDpP (The "little end" touches the construct.)
The >
and <
modifiers can also be used on ()
groups
to force a particular byte-order on all components in that group,
including all its subgroups.
The following rules apply:
Each letter may optionally be followed by a number indicating the repeat
count. A numeric repeat count may optionally be enclosed in brackets, as
in pack("C[80]", @arr)
. The repeat count gobbles that many values from
the LIST when used with all format types other than a
, A
, Z
, b
,
B
, h
, H
, @
, .
, x
, X
, and P
, where it means
something else, described below. Supplying a *
for the repeat count
instead of a number means to use however many items are left, except for:
@
, x
, and X
, where it is equivalent to 0
.
<.>, where it means relative to the start of the string.
u
, where it is equivalent to 1 (or 45, which here is equivalent).
One can replace a numeric repeat count with a template letter enclosed in brackets to use the packed byte length of the bracketed template for the repeat count.
For example, the template x[L]
skips as many bytes as in a packed long,
and the template "$t X[$t] $t"
unpacks twice whatever $t (when
variable-expanded) unpacks. If the template in brackets contains alignment
commands (such as x![d]
), its packed length is calculated as if the
start of the template had the maximal possible alignment.
When used with Z
, a *
as the repeat count is guaranteed to add a
trailing null byte, so the resulting string is always one byte longer than
the byte length of the item itself.
When used with @
, the repeat count represents an offset from the start
of the innermost ()
group.
When used with .
, the repeat count determines the starting position to
calculate the value offset as follows:
If the repeat count is 0
, it's relative to the current position.
If the repeat count is *
, the offset is relative to the start of the
packed string.
And if it's an integer n, the offset is relative to the start of the
nth innermost ( )
group, or to the start of the string if n is
bigger then the group level.
The repeat count for u
is interpreted as the maximal number of bytes
to encode per line of output, with 0, 1 and 2 replaced by 45. The repeat
count should not be more than 65.
The a
, A
, and Z
types gobble just one value, but pack it as a
string of length count, padding with nulls or spaces as needed. When
unpacking, A
strips trailing whitespace and nulls, Z
strips everything
after the first null, and a
returns data with no stripping at all.
If the value to pack is too long, the result is truncated. If it's too
long and an explicit count is provided, Z
packs only $count-1
bytes,
followed by a null byte. Thus Z
always packs a trailing null, except
when the count is 0.
Likewise, the b
and B
formats pack a string that's that many bits long.
Each such format generates 1 bit of the result. These are typically followed
by a repeat count like B8
or B64
.
Each result bit is based on the least-significant bit of the corresponding
input character, i.e., on ord($char)%2
. In particular, characters "0"
and "1"
generate bits 0 and 1, as do characters "\000"
and "\001"
.
Starting from the beginning of the input string, each 8-tuple
of characters is converted to 1 character of output. With format b
,
the first character of the 8-tuple determines the least-significant bit of a
character; with format B
, it determines the most-significant bit of
a character.
If the length of the input string is not evenly divisible by 8, the remainder is packed as if the input string were padded by null characters at the end. Similarly during unpacking, "extra" bits are ignored.
If the input string is longer than needed, remaining characters are ignored.
A *
for the repeat count uses all characters of the input field.
On unpacking, bits are converted to a string of 0
s and 1
s.
The h
and H
formats pack a string that many nybbles (4-bit groups,
representable as hexadecimal digits, "0".."9"
"a".."f"
) long.
For each such format, pack generates 4 bits of result.
With non-alphabetical characters, the result is based on the 4 least-significant
bits of the input character, i.e., on ord($char)%16
. In particular,
characters "0"
and "1"
generate nybbles 0 and 1, as do bytes
"\000"
and "\001"
. For characters "a".."f"
and "A".."F"
, the result
is compatible with the usual hexadecimal digits, so that "a"
and
"A"
both generate the nybble 0xA==10
. Use only these specific hex
characters with this format.
Starting from the beginning of the template to
pack, each pair
of characters is converted to 1 character of output. With format h
, the
first character of the pair determines the least-significant nybble of the
output character; with format H
, it determines the most-significant
nybble.
If the length of the input string is not even, it behaves as if padded by a null character at the end. Similarly, "extra" nybbles are ignored during unpacking.
If the input string is longer than needed, extra characters are ignored.
A *
for the repeat count uses all characters of the input field. For
unpack, nybbles are converted to a string of
hexadecimal digits.
The p
format packs a pointer to a null-terminated string. You are
responsible for ensuring that the string is not a temporary value, as that
could potentially get deallocated before you got around to using the packed
result. The P
format packs a pointer to a structure of the size indicated
by the length. A null pointer is created if the corresponding value for
p
or P
is undef; similarly with
unpack, where a null pointer unpacks into
undef.
If your system has a strange pointer size--meaning a pointer is neither as big as an int nor as big as a long--it may not be possible to pack or unpack pointers in big- or little-endian byte order. Attempting to do so raises an exception.
The /
template character allows packing and unpacking of a sequence of
items where the packed structure contains a packed item count followed by
the packed items themselves. This is useful when the structure you're
unpacking has encoded the sizes or repeat counts for some of its fields
within the structure itself as separate fields.
For pack, you write
length-item/
sequence-item, and the
length-item describes how the length value is packed. Formats likely
to be of most use are integer-packing ones like n
for Java strings,
w
for ASN.1 or SNMP, and N
for Sun XDR.
For pack, sequence-item may have a repeat count, in which case the minimum of that and the number of available items is used as the argument for length-item. If it has no repeat count or uses a '*', the number of available items is used.
For unpack, an internal stack of integer
arguments unpacked so far is
used. You write /
sequence-item and the repeat count is obtained by
popping off the last element from the stack. The sequence-item must not
have a repeat count.
If sequence-item refers to a string type ("A"
, "a"
, or "Z"
),
the length-item is the string length, not the number of strings. With
an explicit repeat count for pack, the packed string is adjusted to that
length. For example:
The length-item is not returned explicitly from unpack.
Supplying a count to the length-item format letter is only useful with
A
, a
, or Z
. Packing with a length-item of a
or Z
may
introduce "\000"
characters, which Perl does not regard as legal in
numeric strings.
The integer types s
, S
, l
, and L
may be
followed by a !
modifier to specify native shorts or
longs. As shown in the example above, a bare l
means
exactly 32 bits, although the native long
as seen by the local C compiler
may be larger. This is mainly an issue on 64-bit platforms. You can
see whether using !
makes any difference this way:
i!
and I!
are also allowed, but only for completeness' sake:
they are identical to i
and I
.
The actual sizes (in bytes) of native shorts, ints, longs, and long longs on the platform where Perl was built are also available from the command line:
- $ perl -V:{short,int,long{,long}}size
- shortsize='2';
- intsize='4';
- longsize='4';
- longlongsize='8';
or programmatically via the Config module:
$Config{longlongsize}
is undefined on systems without
long long support.
The integer formats s
, S
, i
, I
, l
, L
, j
, and J
are
inherently non-portable between processors and operating systems because
they obey native byteorder and endianness. For example, a 4-byte integer
0x12345678 (305419896 decimal) would be ordered natively (arranged in and
handled by the CPU registers) into bytes as
- 0x12 0x34 0x56 0x78 # big-endian
- 0x78 0x56 0x34 0x12 # little-endian
Basically, Intel and VAX CPUs are little-endian, while everybody else, including Motorola m68k/88k, PPC, Sparc, HP PA, Power, and Cray, are big-endian. Alpha and MIPS can be either: Digital/Compaq uses (well, used) them in little-endian mode, but SGI/Cray uses them in big-endian mode.
The names big-endian and little-endian are comic references to the egg-eating habits of the little-endian Lilliputians and the big-endian Blefuscudians from the classic Jonathan Swift satire, Gulliver's Travels. This entered computer lingo via the paper "On Holy Wars and a Plea for Peace" by Danny Cohen, USC/ISI IEN 137, April 1, 1980.
Some systems may have even weirder byte orders such as
- 0x56 0x78 0x12 0x34
- 0x34 0x12 0x78 0x56
These are called mid-endian, middle-endian, mixed-endian, or just weird.
You can determine your system endianness with this incantation:
The byteorder on the platform where Perl was built is also available via Config:
or from the command line:
- $ perl -V:byteorder
Byteorders "1234"
and "12345678"
are little-endian; "4321"
and "87654321"
are big-endian. Systems with multiarchitecture binaries
will have "ffff"
, signifying that static information doesn't work,
one must use runtime probing.
For portably packed integers, either use the formats n
, N
, v
,
and V
or else use the >
and <
modifiers described
immediately below. See also perlport.
Also floating point numbers have endianness. Usually (but not always)
this agrees with the integer endianness. Even though most platforms
these days use the IEEE 754 binary format, there are differences,
especially if the long doubles are involved. You can see the
Config
variables doublekind
and longdblkind
(also doublesize
,
longdblsize
): the "kind" values are enums, unlike byteorder
.
Portability-wise the best option is probably to keep to the IEEE 754
64-bit doubles, and of agreed-upon endianness. Another possibility
is the "%a"
) format of printf.
Starting with Perl 5.10.0, integer and floating-point formats, along with
the p
and P
formats and ()
groups, may all be followed by the
>
or <
endianness modifiers to respectively enforce big-
or little-endian byte-order. These modifiers are especially useful
given how n
, N
, v
, and V
don't cover signed integers,
64-bit integers, or floating-point values.
Here are some concerns to keep in mind when using an endianness modifier:
Exchanging signed integers between different platforms works only when all platforms store them in the same format. Most platforms store signed integers in two's-complement notation, so usually this is not an issue.
The >
or <
modifiers can only be used on floating-point
formats on big- or little-endian machines. Otherwise, attempting to
use them raises an exception.
Forcing big- or little-endian byte-order on floating-point values for
data exchange can work only if all platforms use the same
binary representation such as IEEE floating-point. Even if all
platforms are using IEEE, there may still be subtle differences. Being able
to use >
or <
on floating-point values can be useful,
but also dangerous if you don't know exactly what you're doing.
It is not a general way to portably store floating-point values.
When using >
or <
on a ()
group, this affects
all types inside the group that accept byte-order modifiers,
including all subgroups. It is silently ignored for all other
types. You are not allowed to override the byte-order within a group
that already has a byte-order modifier suffix.
Real numbers (floats and doubles) are in native machine format only. Due to the multiplicity of floating-point formats and the lack of a standard "network" representation for them, no facility for interchange has been made. This means that packed floating-point data written on one machine may not be readable on another, even if both use IEEE floating-point arithmetic (because the endianness of the memory representation is not part of the IEEE spec). See also perlport.
If you know exactly what you're doing, you can use the >
or <
modifiers to force big- or little-endian byte-order on floating-point values.
Because Perl uses doubles (or long doubles, if configured) internally for
all numeric calculation, converting from double into float and thence
to double again loses precision, so unpack("f", pack("f", $foo)
)
will not in general equal $foo.
Pack and unpack can operate in two modes: character mode (C0
mode) where
the packed string is processed per character, and UTF-8 byte mode (U0
mode)
where the packed string is processed in its UTF-8-encoded Unicode form on
a byte-by-byte basis. Character mode is the default
unless the format string starts with U
. You
can always switch mode mid-format with an explicit
C0
or U0
in the format. This mode remains in effect until the next
mode change, or until the end of the ()
group it (directly) applies to.
Using C0
to get Unicode characters while using U0
to get non-Unicode
bytes is not necessarily obvious. Probably only the first of these
is what you want:
- $ perl -CS -E 'say "\x{3B1}\x{3C9}"' |
- perl -CS -ne 'printf "%v04X\n", $_ for unpack("C0A*", $_)'
- 03B1.03C9
- $ perl -CS -E 'say "\x{3B1}\x{3C9}"' |
- perl -CS -ne 'printf "%v02X\n", $_ for unpack("U0A*", $_)'
- CE.B1.CF.89
- $ perl -CS -E 'say "\x{3B1}\x{3C9}"' |
- perl -C0 -ne 'printf "%v02X\n", $_ for unpack("C0A*", $_)'
- CE.B1.CF.89
- $ perl -CS -E 'say "\x{3B1}\x{3C9}"' |
- perl -C0 -ne 'printf "%v02X\n", $_ for unpack("U0A*", $_)'
- C3.8E.C2.B1.C3.8F.C2.89
Those examples also illustrate that you should not try to use pack/unpack as a substitute for the Encode module.
You must yourself do any alignment or padding by inserting, for example,
enough "x"
es while packing. There is no way for
pack and unpack
to know where characters are going to or coming from, so they
handle their output and input as flat sequences of characters.
A ()
group is a sub-TEMPLATE enclosed in parentheses. A group may
take a repeat count either as postfix, or for
unpack, also via the /
template character. Within each repetition of a group, positioning with
@
starts over at 0. Therefore, the result of
- pack("@1A((@2A)@3A)", qw[X Y Z])
is the string "\0X\0\0YZ"
.
x
and X
accept the !
modifier to act as alignment commands: they
jump forward or back to the closest position aligned at a multiple of count
characters. For example, to pack or
unpack a C structure like
- struct {
- char c; /* one signed, 8-bit character */
- double d;
- char cc[2];
- }
one may need to use the template c x![d] d c[2]
. This assumes that
doubles must be aligned to the size of double.
For alignment commands, a count
of 0 is equivalent to a count
of 1;
both are no-ops.
n
, N
, v
and V
accept the !
modifier to
represent signed 16-/32-bit integers in big-/little-endian order.
This is portable only when all platforms sharing packed data use the
same binary representation for signed integers; for example, when all
platforms use two's-complement representation.
Comments can be embedded in a TEMPLATE using #
through the end of line.
White space can separate pack codes from each other, but modifiers and
repeat counts must follow immediately. Breaking complex templates into
individual line-by-line components, suitably annotated, can do as much to
improve legibility and maintainability of pack/unpack formats as /x
can
for complicated pattern matches.
If TEMPLATE requires more arguments than pack
is given, pack
assumes additional ""
arguments. If TEMPLATE requires fewer arguments
than given, extra arguments are ignored.
Attempting to pack the special floating point values Inf
and NaN
(infinity, also in negative, and not-a-number) into packed integer values
(like "L"
) is a fatal error. The reason for this is that there simply
isn't any sensible mapping for these special values into integers.
Examples:
- $foo = pack("WWWW",65,66,67,68);
- # foo eq "ABCD"
- $foo = pack("W4",65,66,67,68);
- # same thing
- $foo = pack("W4",0x24b6,0x24b7,0x24b8,0x24b9);
- # same thing with Unicode circled letters.
- $foo = pack("U4",0x24b6,0x24b7,0x24b8,0x24b9);
- # same thing with Unicode circled letters. You don't get the
- # UTF-8 bytes because the U at the start of the format caused
- # a switch to U0-mode, so the UTF-8 bytes get joined into
- # characters
- $foo = pack("C0U4",0x24b6,0x24b7,0x24b8,0x24b9);
- # foo eq "\xe2\x92\xb6\xe2\x92\xb7\xe2\x92\xb8\xe2\x92\xb9"
- # This is the UTF-8 encoding of the string in the
- # previous example
- $foo = pack("ccxxcc",65,66,67,68);
- # foo eq "AB\0\0CD"
- # NOTE: The examples above featuring "W" and "c" are true
- # only on ASCII and ASCII-derived systems such as ISO Latin 1
- # and UTF-8. On EBCDIC systems, the first example would be
- # $foo = pack("WWWW",193,194,195,196);
- $foo = pack("s2",1,2);
- # "\001\000\002\000" on little-endian
- # "\000\001\000\002" on big-endian
- $foo = pack("a4","abcd","x","y","z");
- # "abcd"
- $foo = pack("aaaa","abcd","x","y","z");
- # "axyz"
- $foo = pack("a14","abcdefg");
- # "abcdefg\0\0\0\0\0\0\0"
- $foo = pack("i9pl", gmtime);
- # a real struct tm (on my system anyway)
- $utmp_template = "Z8 Z8 Z16 L";
- $utmp = pack($utmp_template, @utmp1);
- # a struct utmp (BSDish)
- @utmp2 = unpack($utmp_template, $utmp);
- # "@utmp1" eq "@utmp2"
- sub bintodec {
- unpack("N", pack("B32", substr("0" x 32 . shift, -32)));
- }
- $foo = pack('sx2l', 12, 34);
- # short 12, two zero bytes padding, long 34
- $bar = pack('s@4l', 12, 34);
- # short 12, zero fill to position 4, long 34
- # $foo eq $bar
- $baz = pack('s.l', 12, 4, 34);
- # short 12, zero fill to position 4, long 34
- $foo = pack('nN', 42, 4711);
- # pack big-endian 16- and 32-bit unsigned integers
- $foo = pack('S>L>', 42, 4711);
- # exactly the same
- $foo = pack('s<l<', -42, 4711);
- # pack little-endian 16- and 32-bit signed integers
- $foo = pack('(sl)<', -42, 4711);
- # exactly the same
The same template may generally also be used in unpack.