Things I've learned and suspect I'll forget.
This post contains a line-by-line analysis of the structure of a sample PDF. I wrote it to gain a better understanding of the PDF format. The example PDF is taken from a simpler explanation by Didier Stevens, and the rest of the details are filled in from the Adobe PDF Specification. I must admit that much of this post is a gross plagiarism of the PDF Specification; I would describe it merely as a restructuring so that a PDF can be explained line by line. There are many topics concerning PDFs that I don't explain or reference, because I intended this post to explain only this specific PDF and not PDFs in general.
I have two forms of the PDF available: the PDF Version and the TXT Version. They are the exact same file with different extensions. You should be able to edit these files with a basic text editor such as Notepad. The PDF is delicate and relies heavily on byte offsets, so be sure to check the values in your cross-reference table and trailer if you decide to edit the file.
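The file structure of a PDF is made up of 4 distinct elements:
-Header - A one-line declaration of the PDF version
-Body - The objects that make up the document's contents
-Cross-reference table - The byte offsets of the indirect objects in the file
-Trailer - The location of the cross-reference table and of certain special objects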
The body is a sequence of indirect objects and is hierarchical. That is, the objects in the body point to other objects, making a tree-like structure. The root of this tree is called the Document Catalog and it contains references to other important objects throughout the document. The PDF Specification includes a figure illustrating this structure.
An example of a PDF is given below.
1 | %PDF-1.7
2 |
3 | 1 0 obj
4 | <<
5 | /Type /Catalog
6 | /Outlines 2 0 R
7 | /Pages 3 0 R
8 | >>
9 | endobj
10 |
11 | 2 0 obj
12 | <<
13 | /Type /Outlines
14 | /Count 0
15 | >>
16 | endobj
17 |
18 | 3 0 obj
19 | <<
20 | /Type /Pages
21 | /Kids [4 0 R]
22 | /Count 1
23 | >>
24 | endobj
25 |
26 | 4 0 obj
27 | <<
28 | /Type /Page
29 | /Parent 3 0 R
30 | /MediaBox [0 0 612 792]
31 | /Contents 5 0 R
32 | /Resources
33 | << /ProcSet 6 0 R
34 | /Font << /F1 7 0 R >>
35 | >>
36 | >>
37 | endobj
38 |
39 | 5 0 obj
40 | << /Length 48 >>
41 | stream
42 | BT
43 | /F1 24 Tf
44 | 100 700 Td
45 | (Hello World)Tj
46 | ET
47 | endstream
48 | endobj
49 |
50 | 6 0 obj
51 | [/PDF /Text]
52 | endobj
53 |
54 | 7 0 obj
55 | <<
56 | /Type /Font
57 | /Subtype /Type1
58 | /Name /F1
59 | /BaseFont /Helvetica
60 | /Encoding /MacRomanEncoding
61 | >>
62 | endobj
63 |
64 | xref
65 | 0 8
66 | 0000000000 65535 f
67 | 0000000012 00000 n
68 | 0000000089 00000 n
69 | 0000000145 00000 n
70 | 0000000214 00000 n
71 | 0000000381 00000 n
72 | 0000000485 00000 n
73 | 0000000518 00000 n
74 | trailer
75 | <<
76 | /Size 8
77 | /Root 1 0 R
78 | >>
79 | startxref
80 | 642
81 | %%EOF
Let's break the PDF down into sections and explain each one a little more.
%PDF-1.7
This is the one-line header section. All it does is declare the file to be a PDF file of version 1.7.
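As a small illustration, the version can be read back out of the header with a regular expression. This is only a sketch: the function name pdf_version is my own, and it assumes the header sits at the very first byte of the data.

```python
import re

def pdf_version(data: bytes):
    """Return (major, minor) from a header such as b"%PDF-1.7", or None."""
    # Assumption: the header starts at byte 0 of the data.
    match = re.match(rb"%PDF-(\d+)\.(\d+)", data)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

print(pdf_version(b"%PDF-1.7\n1 0 obj"))  # (1, 7)
```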
Next we have the body of the PDF document. The body is a sequence of objects that make up the document. There are 8 types of objects, and each one listed in the body is an indirect object. An indirect object is a labelled object, so that it may be referenced by other objects. Most of the objects in the body of our document are dictionary objects. A dictionary object is an associative table containing pairs of objects (known as entries), each made up of a key and a value. The key must be a name and the value may be of any kind (including another dictionary). The keys in a single dictionary must be unique. A dictionary is written as a sequence of key-value pairs enclosed in double angle brackets (<< and >>).
Let's take a look at the first object in our file:
3 | 1 0 obj
4 | <<
5 | /Type /Catalog
6 | /Outlines 2 0 R
7 | /Pages 3 0 R
8 | >>
9 | endobj
Line 3 declares the indirect object and line 9 ends it. An indirect object is defined as follows, where X is the object number and Y is the generation number:
X Y obj
ExampleObject
endobj
Inside the object declaration is the dictionary itself.
We see that the type is Catalog. This is a special (required) type and is the root of the document. The catalog contains references to other objects defining the document's contents, outlines, and other attributes. A Catalog dictionary contains two required entries:
-Type - Must be Catalog for the catalog dictionary
-Pages - An indirect reference to the root of the document's page tree
Outlines, an optional entry, references the root of the outline hierarchy. The document outline consists of a tree-structured hierarchy of outline items (sometimes called bookmarks), which serve as a visual table of contents to display the document's structure to the user. Both Outlines and Pages point to indirect objects, so they show how such pointers are written. The value 2 0 R refers to an indirect object. This is called an indirect reference. The indirect reference consists of the object number, the generation number and the character R.
Let's look at the next object:
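To make the shape of an indirect reference concrete, here is a minimal sketch that pulls every reference out of a dictionary's raw bytes. The name find_references is my own, and the pattern is a simplification: it would also match text that merely looks like a reference inside a string.

```python
import re

# object number, generation number, then the character R
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")

def find_references(body: bytes) -> list:
    """Return (object number, generation number) for each indirect reference."""
    return [(int(num), int(gen)) for num, gen in REF_RE.findall(body)]

catalog = b"<< /Type /Catalog /Outlines 2 0 R /Pages 3 0 R >>"
print(find_references(catalog))  # [(2, 0), (3, 0)]
```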
11 | 2 0 obj
12 | <<
13 | /Type /Outlines
14 | /Count 0
15 | >>
16 | endobj
This describes the document outline object. We see that this object has object number 2 and generation number 0. In addition, the dictionary is declared to be of the Outlines type. Count gives the total number of visible outline items at all levels of the outline; it is 0 here because our document has no outline items.
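Since every object in the body is wrapped in the same "X Y obj ... endobj" pair, a short script can split a file into its indirect objects. The sketch below is a simplification rather than a real parser: find_objects is a name I made up, and it assumes the word endobj never appears inside a stream or a string.

```python
import re

# "<object number> <generation number> obj", the body, then "endobj".
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)

def find_objects(data: bytes) -> dict:
    """Map (object number, generation number) to the raw object body."""
    return {
        (int(num), int(gen)): body.strip()
        for num, gen, body in OBJ_RE.findall(data)
    }

sample = b"""2 0 obj
<<
/Type /Outlines
/Count 0
>>
endobj

6 0 obj
[/PDF /Text]
endobj
"""
objects = find_objects(sample)
print(sorted(objects))   # [(2, 0), (6, 0)]
print(objects[(6, 0)])   # b'[/PDF /Text]'
```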
Next we have object 3 which contains the dictionary for Pages, known as the Page Tree.
18 | 3 0 obj
19 | <<
20 | /Type /Pages
21 | /Kids [4 0 R]
22 | /Count 1
23 | >>
24 | endobj
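Page tree nodes are made up of the following:
-Type - Must be Pages for a page tree node
-Parent - The page tree node that is the immediate parent of this one; required in every node except the root, which is why object 3 has none
-Kids - An array of indirect references to the immediate children of this node, which may be page objects or other page tree nodes
-Count - The number of leaf nodes (page objects) that are descendants of this node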
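To show how Kids and Count relate, here is a toy walk over the page tree. The objects table below is a hypothetical in-memory model that I wrote by hand from objects 3 and 4 (indirect references are stored as plain object numbers); it is not the output of a real PDF parser.

```python
# Hypothetical model: object number -> dictionary entries.
objects = {
    3: {"Type": "Pages", "Kids": [4], "Count": 1},
    4: {"Type": "Page", "Parent": 3},
}

def leaf_pages(objects, node_num):
    """Return the object numbers of all page objects under a node."""
    node = objects[node_num]
    if node["Type"] == "Page":
        return [node_num]
    pages = []
    for kid in node["Kids"]:
        pages.extend(leaf_pages(objects, kid))
    return pages

pages = leaf_pages(objects, 3)
print(pages)                              # [4]
print(len(pages) == objects[3]["Count"])  # True
```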
This brings us to the page object. The source for our one page object is:
26 | 4 0 obj
27 | <<
28 | /Type /Page
29 | /Parent 3 0 R
30 | /MediaBox [0 0 612 792]
31 | /Contents 5 0 R
32 | /Resources
33 | << /ProcSet 6 0 R
34 | /Font << /F1 7 0 R >>
35 | >>
36 | >>
37 | endobj
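The page object is a dictionary specifying the attributes of a single page of the document. Let's discuss the entries which have not been described previously.
-MediaBox - A rectangle, in default user space units, giving the boundaries of the physical medium on which the page is displayed or printed
-Contents - The content stream (object 5) describing the contents of the page
-Resources - A dictionary of the resources required by the page, here the procedure sets and the font F1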
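The MediaBox numbers in the page object are easier to picture as a paper size. The sketch below assumes the default user space unit of 1/72 inch; media_box_inches is a name I chose for illustration.

```python
def media_box_inches(box):
    """Convert [llx lly urx ury] in default user space units to inches."""
    # Assumption: one default user space unit is 1/72 inch.
    llx, lly, urx, ury = box
    return ((urx - llx) / 72, (ury - lly) / 72)

print(media_box_inches([0, 0, 612, 792]))  # (8.5, 11.0)
```

So our page is 8.5 by 11 inches, which is US Letter.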
Next we have object 5, which contains the content stream of our page.
39 | 5 0 obj
40 | << /Length 48 >>
41 | stream
42 | BT
43 | /F1 24 Tf
44 | 100 700 Td
45 | (Hello World)Tj
46 | ET
47 | endstream
48 | endobj
The dictionary in this object describes only the length of the stream.
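Next we see how the text is shown. It should be noted that the text uses operators and operands. The operand (the object that is acted on) precedes the operator. In mathematics, we see this with squaring: if 5^2 is written, we know that 5 (the operand) is to be squared (the operator). The lines of the stream are:
-BT - Begins a text object
-/F1 24 Tf - Selects the font named F1 at a size of 24
-100 700 Td - Moves the text position to (100, 700)
-(Hello World)Tj - Shows the string Hello World
-ET - Ends the text object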
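Because the operands come first, a reader can simply collect tokens until it meets an operator. Here is a rough sketch of that idea for our five operators only. The tokenizer is deliberately naive (it assumes strings contain no nested or escaped parentheses), and read_operations is my own name for it.

```python
import re

# A token is either a (string) or a run of non-space, non-parenthesis bytes.
TOKEN_RE = re.compile(r"\([^)]*\)|[^\s()]+")
OPERATORS = {"BT", "ET", "Tf", "Td", "Tj"}  # only those used in our stream

def read_operations(content: str) -> list:
    """Pair each operator with the operands that preceded it."""
    operands, operations = [], []
    for token in TOKEN_RE.findall(content):
        if token in OPERATORS:
            operations.append((token, operands))
            operands = []
        else:
            operands.append(token)
    return operations

stream = """BT
/F1 24 Tf
100 700 Td
(Hello World)Tj
ET"""
for operator, operands in read_operations(stream):
    print(operator, operands)
# BT []
# Tf ['/F1', '24']
# Td ['100', '700']
# Tj ['(Hello World)']
# ET []
```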
Next we look at object 6.
50 | 6 0 obj
51 | [/PDF /Text]
52 | endobj
We remember that this object was referenced by object 4 (the page node) in the resource dictionary under the ProcSet key. The PDF operators used in content streams are grouped into categories of related operators called Procedure Sets. This object holds an array (enclosed in left and right square brackets [ ]) of two procedure sets called PDF and Text. It should be noted that as of PDF version 1.4 this information is not used by the reader, but it is still generated so that older readers may work.
The final object is object 7, shown below.
54 | 7 0 obj
55 | <<
56 | /Type /Font
57 | /Subtype /Type1
58 | /Name /F1
59 | /BaseFont /Helvetica
60 | /Encoding /MacRomanEncoding
61 | >>
62 | endobj
Object 7 was also referenced by object 4 (the page node) in the resources dictionary, as the value of the F1 entry under the Font key. The entries listed in this object are straightforward. Notice that the name /F1 is the same one used in the page's resource dictionary and in the content stream.
This brings us to the cross-reference table. The cross-reference table lists the information that permits access to indirect objects within the file. Listing the objects in this way allows a reader to jump to parts of the file without reading the entire thing (known as random access). The cross-reference table is shown below.
64 | xref
65 | 0 8
66 | 0000000000 65535 f
67 | 0000000012 00000 n
68 | 0000000089 00000 n
69 | 0000000145 00000 n
70 | 0000000214 00000 n
71 | 0000000381 00000 n
72 | 0000000485 00000 n
73 | 0000000518 00000 n
Line 64 declares the start of the cross-reference table. The next line introduces the cross-reference subsection. For a file that has never been incrementally updated (such as this one), there will be only one cross-reference subsection. Each cross-reference subsection contains entries for a contiguous range of object numbers. The subsection begins with a line containing two numbers. The first (0 in our case) is the object number of the first object in the subsection and the second (8) is the number of entries in that subsection. Lines 66 through 73 contain the cross-reference entries themselves, one per line. Each line is constructed as follows:
-The first 10 digits give the byte offset of the object from the beginning of the file (for a free entry, the object number of the next free object instead)
-The next 5 digits give the generation number
-The final keyword is n for an object that is in use or f for one that is free
The first entry in the table will always be free and shall have a generation number of 65,535. If it is the only free object (as in our case), it will have 0000000000 (itself) as the listing to the next free object.
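The subsection above can be read into a small lookup table, and an in-use entry can then be checked against the file bytes. This is a sketch under assumptions: both function names are mine, and the data bytes at the bottom are a hypothetical file start (header, then a blank line, with CRLF line endings) chosen because that layout puts object 1 at offset 12.

```python
def parse_xref_subsection(lines):
    """Parse one cross-reference subsection into {object number: entry}.

    Each entry is (first field, generation number, in_use). The first
    field is a byte offset for an in-use entry, or the object number of
    the next free object for a free entry.
    """
    first, count = (int(field) for field in lines[0].split())
    table = {}
    for index, line in enumerate(lines[1:1 + count]):
        number, generation, keyword = line.split()
        table[first + index] = (int(number), int(generation), keyword == "n")
    return table

def points_at_object(data: bytes, obj_num: int, entry) -> bool:
    """Check that an in-use entry's offset lands on "N G obj"."""
    offset, generation, in_use = entry
    marker = f"{obj_num} {generation} obj".encode("ascii")
    return in_use and data.startswith(marker, offset)

xref_lines = [
    "0 8",
    "0000000000 65535 f",
    "0000000012 00000 n",
    "0000000089 00000 n",
    "0000000145 00000 n",
    "0000000214 00000 n",
    "0000000381 00000 n",
    "0000000485 00000 n",
    "0000000518 00000 n",
]
table = parse_xref_subsection(xref_lines)
print(table[0])  # (0, 65535, False)
print(table[1])  # (12, 0, True)

# Hypothetical start of a file, assuming CRLF line endings.
data = b"%PDF-1.7\r\n\r\n1 0 obj\r\n"
print(points_at_object(data, 1, table[1]))  # True
```

A check like this is handy after editing the file by hand, since any change in length shifts the offsets of everything after it.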
Finally, the PDF file ends with the file trailer. The file trailer links to the cross-reference table and other special objects.
74 | trailer
75 | <<
76 | /Size 8
77 | /Root 1 0 R
78 | >>
79 | startxref
80 | 642
81 | %%EOF
The trailer is declared by the word trailer. Next we see the trailer dictionary:
-Size - Contains the total number of entries in the file's cross-reference table
-Root - Contains the indirect reference to the root (catalog dictionary) of the document
After the trailer dictionary is the startxref keyword, which gives the byte offset of the xref keyword. Finally, %%EOF declares the end of the PDF document.
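Since the trailer sits at the end of the file, a reader can start there and work backwards. The sketch below looks for the last startxref in the final kilobyte of the data. The 1024-byte window and the name find_startxref are my own choices, and the test file is a tiny hypothetical one built just for this example.

```python
import re

def find_startxref(data: bytes) -> int:
    """Return the byte offset that follows the last startxref keyword."""
    tail = data[-1024:]  # assumption: the trailer fits in the last 1024 bytes
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref found near the end of the file")
    return int(matches[-1])

# A tiny hypothetical file whose xref keyword starts right after the header.
header = b"%PDF-1.7\n"
xref_offset = len(header)
data = (
    header
    + b"xref\n0 1\n0000000000 65535 f \n"
    + b"trailer\n<< /Size 1 >>\n"
    + b"startxref\n" + str(xref_offset).encode("ascii") + b"\n%%EOF\n"
)
offset = find_startxref(data)
print(offset, data[offset:offset + 4])  # 9 b'xref'
```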
published on 2012-01-22 by alex