Object and Cross-Reference Streams
This chapter provides information about the implementation of object stream and cross-reference stream support in qpdf.
Object Streams
Object streams can contain any regular object except the following:
stream objects
objects with generation > 0
the encryption dictionary
objects containing the /Length of another stream
In addition, Adobe reader (at least as of version 8.0.0) appears to not be able to handle having the document catalog appear in an object stream if the file is encrypted, though this is not specifically disallowed by the specification.
There are additional restrictions for linearized files. See Implications for Linearized Files for details.
The PDF specification refers to objects in object streams as “compressed objects” regardless of whether the object stream is compressed.
The generation number of every object in an object stream must be zero. It is possible to delete and replace an object in an object stream with a regular object.
The object stream dictionary has the following keys:
/N
: number of objects/First
: byte offset of first object/Extends
: indirect reference to stream that this extends
Stream collections are formed with /Extends
. They must form a
directed acyclic graph. These can be used for semantic information and
are not meaningful to the PDF document’s syntactic structure. Although
qpdf preserves stream collections, it never generates them and doesn’t
make use of this information in any way.
The specification recommends limiting the number of objects in object
stream for efficiency in reading and decoding. Acrobat 6 uses no more
than 100 objects per object stream for linearized files and no more 200
objects per stream for non-linearized files. QPDFWriter
, in object
stream generation mode, never puts more than 100 objects in an object
stream.
Object stream contents consists of N pairs of integers, each of which is the object number and the byte offset of the object relative to the first object in the stream, followed by the objects themselves, concatenated.
Cross-Reference Streams
For non-hybrid files, the value following startxref
is the byte
offset to the xref stream rather than the word xref
.
For hybrid files (files containing both xref tables and cross-reference
streams), the xref table’s trailer dictionary contains the key
/XRefStm
whose value is the byte offset to a cross-reference stream
that supplements the xref table. A PDF 1.5-compliant application should
read the xref table first. Then it should replace any object that it has
already seen with any defined in the xref stream. Then it should follow
any /Prev
pointer in the original xref table’s trailer dictionary.
The specification is not clear about what should be done, if anything,
with a /Prev
pointer in the xref stream referenced by an xref table.
The QPDF
class ignores it, which is probably reasonable since, if
this case were to appear for any sensible PDF file, the previous xref
table would probably have a corresponding /XRefStm
pointer of its
own. For example, if a hybrid file were appended, the appended section
would have its own xref table and /XRefStm
. The appended xref table
would point to the previous xref table which would point the
/XRefStm
, meaning that the new /XRefStm
doesn’t have to point to
it.
Since xref streams must be read very early, they may not be encrypted, and the may not contain indirect objects for keys required to read them, which are these:
/Type
: value/XRef
/Size
: value n+1: where n is highest object number (same as/Size
in the trailer dictionary)/Index
(optional): value[:samp:`{n count}` ...]
used to determine which objects’ information is stored in this stream. The default is[0 /Size]
./Prev
: valueoffset
: byte offset of previous xref stream (same as/Prev
in the trailer dictionary)/W [...]
: sizes of each field in the xref table
The other fields in the xref stream, which may be indirect if desired, are the union of those from the xref table’s trailer dictionary.
Cross-Reference Stream Data
The stream data is binary and encoded in big-endian byte order. Entries
are concatenated, and each entry has a length equal to the total of the
entries in /W
above. Each entry consists of one or more fields, the
first of which is the type of the field. The number of bytes for each
field is given by /W
above. A 0 in /W
indicates that the field
is omitted and has the default value. The default value for the field
type is 1
. All other default values are 0
.
PDF 1.5 has three field types:
0: for free objects. Format:
0 obj next-generation
, same as the free table in a traditional cross-reference table1: regular non-compressed object. Format:
1 offset generation
2: for objects in object streams. Format:
2 object-stream-number index
, the number of object stream containing the object and the index within the object stream of the object.
It seems standard to have the first entry in the table be 0 0 0
instead of 0 0 ffff
if there are no deleted objects.
Implications for Linearized Files
For linearized files, the linearization dictionary, document catalog, and page objects may not be contained in object streams.
Objects stored within object streams are given the highest range of object numbers within the main and first-page cross-reference sections.
It is okay to use cross-reference streams in place of regular xref tables. There are on special considerations.
Hint data refers to object streams themselves, not the objects in the streams. Shared object references should also be made to the object streams. There are no reference in any hint tables to the object numbers of compressed objects (objects within object streams).
When numbering objects, all shared objects within both the first and second halves of the linearized files must be numbered consecutively after all normal uncompressed objects in that half.
Implementation Notes
There are three modes for writing object streams:
disable
, preserve
, and
generate
. In disable mode, we do not generate
any object streams, and we also generate an xref table rather than xref
streams. This can be used to generate PDF files that are viewable with
older readers. In preserve mode, we write object streams such that
written object streams contain the same objects and /Extends
relationships as in the original file. This is equal to disable if the
file has no object streams. In generate, we create object streams
ourselves by grouping objects that are allowed in object streams
together in sets of no more than 100 objects. We also ensure that the
PDF version is at least 1.5 in generate mode, but we preserve the
version header in the other modes. The default is
preserve
.
We do not support creation of hybrid files. When we write files, even in preserve mode, we will lose any xref tables and merge any appended sections.