Parsing PDF documents
This isn't a guide to parsing everything; it's just notes on the general document structure and the approach you need to take in order to parse it efficiently.
The XREF table
Every PDF has one or more cross-reference tables, which contain the byte offset and status of every object in the PDF. When the PDF is updated, new objects and tables are appended; each new table has a link to the previous one. So when constructing a master table for the current version, you'll work your way backwards through the tables.
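A minimal sketch of that backwards merge, assuming each table has already been parsed into a dictionary. The function name and the (offset, generation, in_use) tuple layout are my own choices for illustration, not something the spec dictates.

```python
def build_master_table(tables):
    """Merge cross-reference tables into one lookup.

    `tables` is ordered newest first, which is the order you meet them when
    following the links backwards. Each table maps an object number to an
    (offset, generation, in_use) tuple (a hypothetical layout).
    """
    master = {}
    for table in tables:
        for obj_num, entry in table.items():
            # The first entry seen comes from the newest table, so it wins.
            master.setdefault(obj_num, entry)
    return master


# Object 3 was rewritten and object 4 was freed by the update.
newest = {3: (900, 0, True), 4: (0, 1, False)}
oldest = {1: (15, 0, True), 3: (120, 0, True), 4: (300, 0, True)}
master = build_master_table([newest, oldest])
```

Because the newest table is visited first, older entries for the same object number are simply ignored.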
Determining mode
If at all possible, you should know the length of the file before you begin; in HTTP you can get this from the headers, and locally you can get this from the filesystem.
You only need the first 1024 bytes to make the most important decision: are you going to be in linear mode? If you're not in linear mode, you're going to need the whole file before you can start parsing.
First up is a version header; this identifies it as a PDF document and you might want to switch defaults, rendering paths, or features based on the version number. Luckily, the spec says to ignore anything you don't understand, so you have minimal forward compatibility.
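Reading the version header could look something like this. The helper name is made up, and searching the whole first 1024 bytes (rather than insisting the header sits at byte zero) is a leniency I'm assuming you want.

```python
import re

HEADER_RE = re.compile(rb"%PDF-(\d+)\.(\d+)")


def parse_version(head):
    """Return (major, minor) from the first bytes of a file, or None."""
    match = HEADER_RE.search(head[:1024])
    if match is None:
        return None  # no header found: probably not a PDF
    return int(match.group(1)), int(match.group(2))
```

You can then compare the tuple directly, for example `parse_version(head) >= (1, 5)`, when deciding which features to enable.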
After the version is a series of PDF objects; usually the first object is a PDF dictionary. If it is a linearization dictionary, you will be in linear mode and can really speed things up.
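A rough way to make the linear-mode decision from the first 1024 bytes is sketched below. It assumes the first object's dictionary has no nested dictionaries (a regex is not a real PDF tokenizer), and the function name is only illustrative.

```python
import re

# First "N G obj << ... >>" in the buffer; non-greedy, so it stops at the
# first ">>" and will misread a dictionary that nests another one.
FIRST_OBJ_RE = re.compile(rb"\d+\s+\d+\s+obj\s*<<(.*?)>>", re.DOTALL)


def is_linearized(head):
    """Guess whether the first object is a linearization dictionary."""
    match = FIRST_OBJ_RE.search(head[:1024])
    return match is not None and b"/Linearized" in match.group(1)
```

A real parser would tokenize the dictionary properly; this is only enough to pick the mode early.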
Linear mode initialisation
One of the parameters in the linearization dictionary is the file size; check that against the actual file size. If it's different, the file has been updated and you will need to do some analysis or get out of linear mode. The analysis is mentioned in the spec but not covered here.
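The size check itself is tiny. This sketch assumes you already have the raw bytes of the linearization dictionary and reads its L entry (the file length) with a regex; both helper names are hypothetical.

```python
import re


def declared_length(lin_dict):
    """Return the file length recorded in the /L entry, or None."""
    # "\s+" after /L keeps this from matching the /Linearized key.
    match = re.search(rb"/L\s+(\d+)", lin_dict)
    return int(match.group(1)) if match else None


def still_linear(lin_dict, actual_size):
    """True when the recorded length matches the real file size."""
    return declared_length(lin_dict) == actual_size
```

If `still_linear` comes back false, the file has grown since it was linearized, and you either do the analysis or drop out of linear mode.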
Let's assume you're in linear mode. The next object will be an XREF table; read that in and remember the PREV location to locate the other entries later. You now have all the entries needed to draw the first page.
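For a classic text-format table, reading it in and picking up the PREV link might look like the sketch below. The function name and the tuple layout are my own, and it assumes the bytes start at the `xref` keyword; cross-reference streams (the compressed form in newer files) are not handled.

```python
import re


def parse_xref_section(data):
    """Parse one classic xref section plus its trailer.

    Returns (entries, prev), where entries maps object number to
    (offset, generation, in_use) and prev is the older table's offset
    or None.
    """
    body, _, trailer = data.partition(b"trailer")
    tokens = body.split()
    if not tokens or tokens[0] != b"xref":
        raise ValueError("not a classic xref section")
    entries = {}
    pos = 1
    while pos < len(tokens):
        # Each subsection starts with "first_object_number count".
        first, count = int(tokens[pos]), int(tokens[pos + 1])
        pos += 2
        for obj_num in range(first, first + count):
            offset, gen, kind = tokens[pos:pos + 3]
            entries[obj_num] = (int(offset), int(gen), kind == b"n")
            pos += 3
    prev = re.search(rb"/Prev\s+(\d+)", trailer)
    return entries, (int(prev.group(1)) if prev else None)
```

Keep following `prev` until it comes back as `None` and you have every table in the chain.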
Next up will be the document catalog, possibly followed by some document-level objects specified in the spec (encryption and outlines, for example). Read these in.
Now you will get either the first page (and its attendant objects) or the hint table. It's easy to tell which, because the offset and length of the hint table were given to you in the linearization dictionary. If it's the first page, you can go ahead and render it (see page rendering, below).
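Telling the two apart is just a range check on your current read position. This sketch assumes the H entry of the linearization dictionary starts with the hint table's offset and length, and that you have the dictionary as raw bytes; the helper names are made up.

```python
import re


def hint_range(lin_dict):
    """Return (offset, length) of the hint table from the /H array."""
    match = re.search(rb"/H\s*\[\s*(\d+)\s+(\d+)", lin_dict)
    if match is None:
        raise ValueError("no /H entry in linearization dictionary")
    return int(match.group(1)), int(match.group(2))


def at_hint_table(lin_dict, position):
    """True when `position` falls inside the hint table's byte range."""
    offset, length = hint_range(lin_dict)
    return offset <= position < offset + length
```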
Once you have the hint table, STOP! Decide if you're going to get the rest of the file, abort the request and use HTTP range (or its equivalent), or what.
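If you go the HTTP range route, the request header is easy to build from an offset and a length. The helper below is only a sketch with a made-up name; note that HTTP byte ranges are inclusive at both ends.

```python
def range_header(offset, length):
    """Build a Range header for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # The last byte position is inclusive, hence the -1.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}
```

For example, `range_header(500, 120)` asks for bytes 500 through 619.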
Normal mode initialisation
PDF files are a little backwards; you'll need to go to the very end of the file for the next step. The line before the EOF marker is the offset of the current XREF table. You'll need to read the XREF table plus any previous tables (handily linked as a PREV entry) to construct a master table.
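Finding that offset from the tail of the file can be sketched like this. It assumes you have read the last chunk of the file into memory (how much to read is up to you), and the function name is my own.

```python
import re

STARTXREF_RE = re.compile(rb"startxref\s+(\d+)\s+%%EOF")


def find_startxref(tail):
    """Return the offset of the current XREF table, or None."""
    matches = STARTXREF_RE.findall(tail)
    # An updated file can contain several EOF markers; the last one
    # belongs to the current version.
    return int(matches[-1]) if matches else None
```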
The XREF table has a ROOT entry that points to the document catalog. You'll want to read that in, plus dereference the same entries that appear in linear mode (you don't have to dereference, but it simplifies things if both modes initialize the same way).
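In a classic file the ROOT entry sits in the trailer dictionary that follows the table, written as an indirect reference such as `1 0 R`. Here is a sketch of turning it into a byte offset, assuming a master table that maps object numbers to (offset, generation, in_use) tuples; that layout and both function names are hypothetical.

```python
import re


def root_reference(trailer):
    """Return (object_number, generation) of the catalog, or None."""
    match = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", trailer)
    return (int(match.group(1)), int(match.group(2))) if match else None


def catalog_offset(trailer, master):
    """Look up the catalog's byte offset in the master table."""
    ref = root_reference(trailer)
    if ref is None or ref[0] not in master:
        return None
    offset, generation, in_use = master[ref[0]]
    # A freed entry or a generation mismatch means a stale reference.
    if not in_use or generation != ref[1]:
        return None
    return offset
```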
CHECKPOINT
Now you have:
- A cross-reference table with at least all the first-page entries.
- The document catalog, with some dereferenced blocks.
- Possibly a hint table for fast rendering.
- Possibly the objects needed for the first page.
Drawing a page
Locate the page object. In linear mode, consult the hint table to find out what range to request. In normal mode you'll probably be consulting the pagetree structure.
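In normal mode, walking the page tree to find page N could look like the sketch below. It assumes the objects are already parsed into plain dictionaries keyed by object number, with Type, Kids and Count entries mirroring the PDF keys; that in-memory model is my own simplification.

```python
def find_page(objects, node_id, index):
    """Return the object number of the page at zero-based `index`."""
    node = objects[node_id]
    if node["Type"] == "Page":
        if index == 0:
            return node_id
        raise IndexError("page index out of range")
    for kid_id in node["Kids"]:
        kid = objects[kid_id]
        # An intermediate node's Count covers all pages beneath it.
        count = kid["Count"] if kid["Type"] == "Pages" else 1
        if index < count:
            return find_page(objects, kid_id, index)
        index -= count  # skip this whole subtree
    raise IndexError("page index out of range")


objects = {
    1: {"Type": "Pages", "Count": 3, "Kids": [2, 5]},
    2: {"Type": "Pages", "Count": 2, "Kids": [3, 4]},
    3: {"Type": "Page"},
    4: {"Type": "Page"},
    5: {"Type": "Page"},
}
third_page = find_page(objects, 1, 2)
```

Using the Count entries means you only descend into the subtree that actually holds the page you want.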
When you have the page contents stream (first to arrive in linear mode) you can begin drawing. In linear mode the other objects will continue to load asynchronously. In normal mode you may want to dereference the objects and start reading them asynchronously.
Draw the page - skip images that are not ready and substitute for any embedded fonts that have not arrived. Put up the page when done.
Offscreen, redraw the page as images and fonts arrive. When all objects have shown up, swap with the existing page. You can swap earlier but that often results in flicker.
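The draw-then-swap bookkeeping can be sketched with a small class. Nothing here actually draws; `render` is a stand-in that just records what would be drawn and what would get a placeholder, and all names are invented for illustration.

```python
class PageView:
    """Tracks what is on screen while late resources arrive."""

    def __init__(self, needed):
        self.pending = set(needed)     # images and fonts not yet loaded
        self.arrived = set()
        self.offscreen = None
        self.onscreen = self.render()  # first draw, with placeholders

    def render(self):
        # Stand-in for real drawing: note which resources were used
        # and which were skipped or substituted.
        return {"drawn": frozenset(self.arrived),
                "placeholders": frozenset(self.pending)}

    def resource_arrived(self, name):
        self.pending.discard(name)
        self.arrived.add(name)
        self.offscreen = self.render()  # redraw offscreen only
        if not self.pending:
            # Everything is here: swap once to avoid flicker.
            self.onscreen, self.offscreen = self.offscreen, None


view = PageView(["image1", "font1"])
view.resource_arrived("image1")  # still showing the placeholder version
view.resource_arrived("font1")   # now the finished page is swapped in
```

Swapping on every arrival instead would only need the `if` removed, at the cost of the flicker mentioned above.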