amccormack.net

Things I've learned and suspect I'll forget.

Anatomy of a PDF document 2012-01-22

This post contains a line-by-line analysis of the structure of a sample PDF. I wrote it so that I could gain a better understanding of the PDF format. The example PDF is taken from a simpler explanation by Didier Stevens, and the rest of the details are filled in from the Adobe PDF Specification. I must admit that much of this post is a gross plagiarism of the PDF Specification; I would describe it merely as a structural change so that a PDF can be explained line by line. There are a lot of topics concerning PDFs which I don't explain or reference, because I intended this post to explain only this specific PDF and not all PDFs in general.

I have two forms of the PDF available: the PDF Version and the TXT Version. They are the exact same file with different extensions. You should be able to edit these files with a basic text editor such as Notepad. The PDF is delicate and relies heavily on byte offsets, so be sure to check the values in your cross-reference table and trailer if you decide to edit the file.

The file structure of a PDF is made up of 4 distinct elements:

  • A one-line header containing the PDF magic number (%PDF-) and the version of the PDF format the file uses.
  • A body containing the hierarchical objects that make up the document contained in the file.
  • A cross-reference table, which gives the byte offset of each indirect object in the file.
  • A trailer giving the location of the cross-reference table.

The body is a list of sequential indirect objects and is hierarchical. That is, the objects in the body point to other objects, making a tree-like structure. The root of this tree is called the Document Catalog and it contains references to other important objects throughout the document. An example image from the PDF Specification is shown:

"Structure of a PDF Document"

An example of a PDF is given below.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
%PDF-1.7

1 0 obj
<<
 /Type /Catalog
 /Outlines 2 0 R
 /Pages 3 0 R
>>
endobj

2 0 obj
<<
 /Type /Outlines
 /Count 0
>>
endobj

3 0 obj
<<
 /Type /Pages
 /Kids [4 0 R]
 /Count 1
>>
endobj

4 0 obj
<<
 /Type /Page
 /Parent 3 0 R
 /MediaBox [0 0 612 792]
 /Contents 5 0 R
 /Resources
 << /ProcSet 6 0 R
    /Font << /F1 7 0 R >>
 >>
>>
endobj

5 0 obj
<< /Length 48 >>
stream
BT
/F1 24 Tf
100 700 Td
(Hello World)Tj
ET
endstream
endobj

6 0 obj
[/PDF /Text]
endobj

7 0 obj
<<
 /Type /Font
 /Subtype /Type1
 /Name /F1
 /BaseFont /Helvetica
 /Encoding /MacRomanEncoding
>>
endobj

xref
0 8
0000000000 65535 f
0000000012 00000 n
0000000089 00000 n
0000000145 00000 n
0000000214 00000 n
0000000381 00000 n
0000000485 00000 n
0000000518 00000 n
trailer
<<
 /Size 8
 /Root 1 0 R
>>
startxref
642
%%EOF

Let's break the PDF down into sections and explain them a little bit more.

%PDF-1.7

This is the one-line header section, and all it does is declare the file as a PDF file of version 1.7.
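As a rough sketch of how a reader might check this header, the Python below looks for the %PDF- marker at the very start of the data and pulls out the version. The function name parse_header is my own invention, and the sketch assumes the header sits at byte 0 of the file.

```python
import re


def parse_header(data: bytes) -> str:
    """Return the version string from a PDF header line, e.g. '1.7'."""
    # Assumption: the header starts at the first byte of the file.
    match = re.match(rb"%PDF-(\d+\.\d+)", data)
    if match is None:
        raise ValueError("missing %PDF- header")
    return match.group(1).decode("ascii")


print(parse_header(b"%PDF-1.7\r\n"))  # prints 1.7
```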

Next we have the body of the PDF document. The body is a sequence of objects that make up the document. There are 8 types of objects, and each one listed in the body is an indirect object. An indirect object is a labelled object, so that it may be referenced by other objects. Most of the objects in the body of this PDF are dictionary objects. A dictionary object is an associative table containing pairs of objects (known as entries), each made up of a key and a value. The key must be a name and the value may be of any kind (including another dictionary). The keys in a single dictionary must be unique. A dictionary is written as a sequence of key-value pairs enclosed in double angle brackets (<< and >>).

Let's take a look at the first object in our file:

3
4
5
6
7
8
9
1 0 obj
<<
 /Type /Catalog
 /Outlines 2 0 R
 /Pages 3 0 R
>>
endobj

Line 3 declares the indirect object and line 9 ends it. An indirect object is defined as:

X Y obj
ExampleObject
endobj
  • X is referred to as the object number
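  • Y is referred to as the generation number. The generation number belongs to the object rather than to the document: it starts at 0 and is incremented when an object is freed and its object number is reused, which can happen because PDF documents may be incrementally updated.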
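To make the X Y obj pattern concrete, here is a minimal Python sketch that scans raw file bytes for object headers and records where each one starts. The names OBJ_HEADER and find_objects are my own, and the scan is deliberately naive: it assumes every header begins a line and it would also match text that merely looks like a header inside a stream.

```python
import re

# An object header at the start of a line: object number, generation number, "obj".
OBJ_HEADER = re.compile(rb"^(\d+) (\d+) obj\b", re.MULTILINE)


def find_objects(data: bytes) -> dict:
    """Map (object number, generation number) to the byte offset of its header."""
    return {
        (int(m.group(1)), int(m.group(2))): m.start()
        for m in OBJ_HEADER.finditer(data)
    }


sample = b"%PDF-1.7\r\n\r\n1 0 obj\r\n<< /Type /Catalog >>\r\nendobj\r\n"
print(find_objects(sample))  # {(1, 0): 12}
```

The offsets found this way are the same kind of number that the cross-reference table stores, so a scan like this is one way to double check a file after editing it by hand.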

Inside the object declaration is the dictionary itself.

  • Lines 4 and 8 start and end the dictionary.
  • Line 5 describes the type of the dictionary object.

We see that the type is a Catalog type. This is a special (required) type and is the root of the document. The catalog contains references to other objects defining the document's contents, outlines, and other attributes. A Catalog dictionary contains two required entries:

  • Type always has a value of Catalog (by definition).
  • Pages points to the object that is the root of the page tree. The page tree contains references to each page, and each page contains references to the content that makes up that page such as strings and images (see image above).

Outlines, an optional entry, references the root of the outline hierarchy. The document outline consists of a tree-structured hierarchy of outline items (sometimes called bookmarks), which serve as a visual table of contents to display the document's structure to the user. Since Outlines and Pages both reference indirect objects, we can see how such references are written. The value 2 0 R refers to an indirect object. This is called an indirect reference. The indirect reference consists of the object number, the generation number, and the character R.
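A small Python sketch can pick the indirect references out of an object's text. The names REF and find_references are hypothetical, and a regular expression like this is only a rough approximation: it would also match something that looks like a reference inside a string or a stream.

```python
import re

# object number, generation number, then the letter R
REF = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def find_references(obj_body: bytes) -> list:
    """Return (object number, generation number) for each 'X Y R' found."""
    return [(int(num), int(gen)) for num, gen in REF.findall(obj_body)]


catalog = b"<<\n /Type /Catalog\n /Outlines 2 0 R\n /Pages 3 0 R\n>>"
print(find_references(catalog))  # [(2, 0), (3, 0)]
```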

Let's look at the next object:

11
12
13
14
15
16
2 0 obj
<<
 /Type /Outlines
 /Count 0
>>
endobj

This describes the document outline object. We see that this object has object number 2 and generation number 0. In addition, the dictionary is declared as the Outlines type. Count gives the total number of visible outline items at all levels of the outline, which is 0 here because the outline is empty.

Next we have object 3 which contains the dictionary for Pages, known as the Page Tree.

18
19
20
21
22
23
24
3 0 obj
<<
 /Type /Pages
 /Kids [4 0 R]
 /Count 1
>>
endobj

Page tree nodes are made up of the following:

  • Type - (Required) which is always Pages for a page tree node.
  • Parent - (Required - but it is prohibited in the root node) The page tree node that is the immediate parent of this one. We can tell that 3 is the root page tree node because it does not list a Parent entry.
  • Kids - (Required) An array of indirect references to the immediate children of this node. In this case the node has 1 Kid and it is object 4.
  • Count - (Required) The number of leaf nodes (page objects) that are descendants of this node within the page tree
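The relationship between Kids and Count can be sketched by modelling objects 3 and 4 as plain Python dictionaries keyed by object number. This layout and the function count_pages are my own simplification for illustration, not how a real PDF library stores objects.

```python
# Hypothetical in-memory model: object number -> dictionary entries.
# Indirect references such as "4 0 R" are stored as the bare object number.
objects = {
    3: {"Type": "Pages", "Kids": [4], "Count": 1},
    4: {"Type": "Page", "Parent": 3},
}


def count_pages(objects: dict, node_num: int) -> int:
    """Count the leaf (Page) nodes at or below the given page tree node."""
    node = objects[node_num]
    if node["Type"] == "Page":
        return 1
    return sum(count_pages(objects, kid) for kid in node["Kids"])


print(count_pages(objects, 3))  # 1, which matches /Count in object 3
```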

This brings us to the page object. The source for our one page object is:

26
27
28
29
30
31
32
33
34
35
36
37
4 0 obj
<<
 /Type /Page
 /Parent 3 0 R
 /MediaBox [0 0 612 792]
 /Contents 5 0 R
 /Resources
 << /ProcSet 6 0 R
    /Font << /F1 7 0 R >>
 >>
>>
endobj

The page object is a dictionary specifying the attributes of a single page of the document. Let's discuss the entries which have not been described previously.

  • MediaBox - (Required, inheritable) - A rectangle, written as an array of four numbers giving the lower-left and upper-right corners, that defines the boundaries of the physical medium on which the page is displayed or printed. The numbers are in default user space units of 1/72 inch.
  • Contents - (Optional) - A content stream that describes the contents of this page.
  • Resources - (Required, inheritable) - A dictionary containing any resources required by the page. Here we have two entries in Resources:
      • ProcSet - References the object that describes the procedure sets.
      • Font - A dictionary that maps resource names to font dictionaries. In this case, a font named F1 is located in object 7.
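The MediaBox numbers are easier to picture once converted to inches. The sketch below assumes the default user space unit of 1/72 inch from the PDF Specification; the function name media_box_inches is my own.

```python
def media_box_inches(box):
    """Convert a [llx lly urx ury] rectangle to (width, height) in inches."""
    llx, lly, urx, ury = box
    # Assumption: default user space, where one unit is 1/72 inch.
    return ((urx - llx) / 72, (ury - lly) / 72)


print(media_box_inches([0, 0, 612, 792]))  # (8.5, 11.0)
```

So the page in our sample is 8.5 by 11 inches, which is US Letter size.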

Next we have object 5, which contains the content stream of our page.

39
40
41
42
43
44
45
46
47
48
5 0 obj
<< /Length 48 >>
stream
BT
/F1 24 Tf
100 700 Td
(Hello World)Tj
ET
endstream
endobj

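The dictionary in this object holds only the Length entry, which gives the length of the stream data in bytes. The value is 48 here because each of the five lines in the stream ends with a two-byte line ending (CR LF).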

Next we see how the text is shown. A content stream is written as operators and operands, and the operands (the objects that are acted on) precede the operator. In mathematics, we see this with the squaring operator: if 5^2 is written, we know that 5 (the operand) is to be squared (the operator).

  • On lines 41 and 47 we see the keywords that start and end the stream.
  • Lines 42 and 46 (BT and ET) begin and end the text object.
  • Line 43 sets the font and font size. Tf is the operator; its operands are /F1, the name of a font resource (that is, an entry in the Font subdictionary of the current resource dictionary), and 24, the font size.
  • Line 44 specifies the starting position for the text on the page. Td is a text-positioning operator; its operands 100 and 700 are the x and y offsets of the text.
  • Line 45 contains the string, enclosed in parentheses, to be displayed. Tj takes a string operand and paints it using the font and other text-related parameters.
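The operands-then-operator order can be sketched with a toy interpreter in Python. It collects operands until it meets an operator, then records the operator together with the operands that preceded it. The names TOKEN and run are my own, and the tokenizer is a simplification that only copes with the token kinds in this sample: it does not handle nested parentheses, escapes, arrays, or dictionaries.

```python
import re

TOKEN = re.compile(
    r"\((?P<string>[^()]*)\)"        # (literal string), no nesting or escapes
    r"|(?P<name>/\w+)"               # /Name
    r"|(?P<number>-?\d+(?:\.\d+)?)"  # integer or real number
    r"|(?P<op>[A-Za-z]+)"            # operator keyword such as BT, Tf, Tj
)


def run(stream: str) -> list:
    """Return a list of (operator, operands) pairs in the order they occur."""
    operands, calls = [], []
    for m in TOKEN.finditer(stream):
        kind = m.lastgroup
        if kind == "op":
            calls.append((m.group(), operands))  # the operator consumes the operands
            operands = []
        elif kind == "number":
            operands.append(float(m.group()))
        elif kind == "string":
            operands.append(m.group("string"))
        else:
            operands.append(m.group())           # a name such as /F1
    return calls


content = "BT\n/F1 24 Tf\n100 700 Td\n(Hello World)Tj\nET\n"
for operator, args in run(content):
    print(operator, args)
```

Running it on our content stream lists BT with no operands, Tf with /F1 and 24.0, Td with 100.0 and 700.0, Tj with the string Hello World, and finally ET.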

Next we look at object 6.

50
51
52
6 0 obj
[/PDF /Text]
endobj

We remember that this object was referenced by object 4 (the page node) in the resource dictionary under the ProcSet key. The PDF operators used in content streams are grouped into categories of related operators called procedure sets. This object holds an array (delimited by square brackets [ ]) of two procedure sets called PDF and Text. It should be noted that as of PDF version 1.4 this information is not used by the reader, but it is still generated so that older readers may work.

The final object is object 7, shown below.

54
55
56
57
58
59
60
61
62
7 0 obj
<<
 /Type /Font
 /Subtype /Type1
 /Name /F1
 /BaseFont /Helvetica
 /Encoding /MacRomanEncoding
>>
endobj

Object 7 was also referenced by object 4 (the page node) in the resources dictionary as the value of the F1 entry under the Font key. The entries listed in this object are straightforward. Notice that the name /F1 is the same one used in the page's resource dictionary and by the Tf operator in the content stream.

This brings us to the cross-reference table. The cross-reference table lists the information that permits access to indirect objects within the file. Listing the objects in this way allows a reader to jump straight to the parts of the file it needs without reading the entire thing (known as random access). The cross-reference table is shown below.

64
65
66
67
68
69
70
71
72
73
xref
0 8
0000000000 65535 f
0000000012 00000 n
0000000089 00000 n
0000000145 00000 n
0000000214 00000 n
0000000381 00000 n
0000000485 00000 n
0000000518 00000 n

Line 64 declares the start of the cross-reference table. The next line introduces the cross-reference subsection. For a file that has never been incrementally updated (such as this one), there will be only one cross-reference subsection. Each cross-reference subsection contains entries for a contiguous range of object numbers. The subsection begins with a line containing two numbers. The first (0 in our case) is the object number of the first object in the subsection, and the second (8) is the number of entries in that subsection. Lines 66 through 73 contain the cross-reference entries themselves, one per line. Lines are constructed as follows:

  • If an entry is free:
      • The entry should end with an f
      • The first group of 10 digits should be the (zero-padded) object number of the next free object
      • The group of 5 digits should be the 5-digit generation number
  • If an entry is in use:
      • The entry should end with an n
      • The first group of 10 digits should be the (zero-padded) byte offset of the object from the start of the file
      • The group of 5 digits should be the 5-digit generation number

The first entry in the table will always be free and shall have a generation number of 65,535. If it is the only free object (as in our case), it will have 0000000000 (itself) as the listing to the next free object.
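Reading a single cross-reference entry is simple enough to sketch in a few lines of Python. The function name parse_xref_entry and the shape of the returned dictionary are my own choices for illustration.

```python
def parse_xref_entry(line: str) -> dict:
    """Parse one cross-reference entry such as '0000000012 00000 n'."""
    first, generation, kind = line.split()
    if kind == "n":
        # In use: the first field is the byte offset of the object.
        return {"in_use": True, "offset": int(first), "generation": int(generation)}
    if kind == "f":
        # Free: the first field is the object number of the next free object.
        return {"in_use": False, "next_free": int(first), "generation": int(generation)}
    raise ValueError(f"unknown entry type: {kind!r}")


print(parse_xref_entry("0000000012 00000 n"))
print(parse_xref_entry("0000000000 65535 f"))
```

For our table, the first call reports that object 1 lives at byte offset 12, and the second reports the free entry for object 0 with generation number 65535.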

Finally, the PDF file ends with the file trailer. The file trailer links to the cross-reference table and other special objects.

74
75
76
77
78
79
80
81
trailer
<<
 /Size 8
 /Root 1 0 R
>>
startxref
642
%%EOF

The trailer is declared by the word trailer. Next we see the trailer dictionary:

  • Size - Contains the total number of entries in the file's cross-reference table.
  • Root - Contains the indirect reference to the root (catalog dictionary) of the document.

After the trailer dictionary is the startxref keyword, followed by the byte offset of the xref keyword from the start of the file (642 here). Finally, %%EOF declares the end of the PDF document.
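Since a hand-edited file is easy to break, here is a hedged Python sketch that reads the startxref value from the end of a file and checks that the xref keyword really sits at that offset. The function names are my own, and looking only at the last 1024 bytes is an assumption I chose for the sketch.

```python
def read_startxref(data: bytes) -> int:
    """Return the byte offset written after the last startxref keyword."""
    tail = data[-1024:]  # assumption: the trailer fits in the last 1024 bytes
    idx = tail.rfind(b"startxref")
    if idx == -1:
        raise ValueError("startxref not found")
    # The first token after the keyword is the offset; %%EOF follows it.
    return int(tail[idx + len(b"startxref"):].split()[0])


def startxref_is_valid(data: bytes) -> bool:
    """Check that the xref keyword is located where the trailer says it is."""
    offset = read_startxref(data)
    return data[offset:offset + 4] == b"xref"
```

To try it on the sample, read the file in binary mode (for example with open(path, "rb").read(), where path is wherever you saved it) and pass the bytes to startxref_is_valid.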

published on 2012-01-22 by alex