07 October 2013

Working with PDF on iOS is a genuinely low-level task. It is not just a matter of calling a few built-in iOS features to build your product: it requires solid knowledge of the low-level iOS layers and at least a basic understanding of the PDF specification. This post covers several aspects of processing PDF on iOS, with the aim of making this hard job easier for developers.

PDF Specification

Basic Definitions (mixed with Typography)

Leading

Leading is the space between two lines of text. According to Wikipedia, in typography, leading /ˈlɛdɪŋ/ refers to the distance between the baselines of successive lines of type. However, there is more than one definition of leading. The image below shows three ways leading is defined.

[Image: three ways leading is defined]

X-height

X-height is the height of the lowercase letter x in a font, that is, the distance between the baseline and the meanline, as the image below demonstrates.

[Image: x-height of a font]

Descender and Ascender

In typography, a descender is the part of a character that extends below the baseline (the line on which the word sits), and an ascender is the part of a character that rises above the meanline. For example, in the image above, the descender runs from the baseline to the descent line (also known as the beardline), and the ascender runs from the meanline to the ascender line (also known as the topline). Their heights are the descent and the ascent respectively. That means the full height of a character is the sum of three parts: descent, x-height and ascent.

[Image: ascender and descender]

Font size

Font size is measured as the distance from the ascent line to the descent line. The usual unit for font size is the point, which equals 1/72 inch (so Arial 12 pt = 1/6 inch ≈ 4.2 mm). In the PDF specification, font information usually lives in a Resources dictionary, which contains a Font dictionary. We can open a PDF file in a text editor and search for the Resources keyword to locate the font description.
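The point arithmetic above can be checked with a few lines of C. This is only a sketch: the function names (points_to_inches, points_to_mm, print_font_size) are made up for this post and are not part of any iOS API.

```c
#include <stdio.h>

/* One PDF/PostScript point is 1/72 inch; one inch is 25.4 mm. */
#define POINTS_PER_INCH 72.0
#define MM_PER_INCH     25.4

/* Convert a length in points to inches. */
double points_to_inches(double points) {
    return points / POINTS_PER_INCH;
}

/* Convert a length in points to millimetres. */
double points_to_mm(double points) {
    return points_to_inches(points) * MM_PER_INCH;
}

/* Print one size in all three units (illustration helper only). */
void print_font_size(double points) {
    printf("%.1f pt = %.4f in = %.2f mm\n",
           points, points_to_inches(points), points_to_mm(points));
}
```

For a 12 pt font, points_to_mm(12.0) gives roughly 4.23 mm, which matches the 1/6 inch figure in the text.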

Glyph

( Need more study )

Font

Font type (in PDF scope)

There are several ways to classify font types. We can classify them by how a font is organized or by how its glyphs are drawn.

By font structure, there are two kinds:

  • Simple Font
  • Composite Font

Only Type 0 is a composite font. The simple fonts are Type 1, Type 3 and TrueType. A CIDFont is not a font in its own right; it only appears as the descendant of a Type 0 font.

By drawing method, there are also two buckets:

  • Bitmap-based font: Type 3
  • Outline (vector) font: Type 0, Type 1, TrueType, CIDFont.

Font and CIDFont

A font is a collection of glyphs. A CIDFont is also a set of glyphs, but a CIDFont is only used as a component of a Type 0 font; it cannot be used directly like other fonts.

Descendant Font

A descendant font is a CIDFont. A Type 0 font uses the glyphs of its descendant font to draw characters.

CIDToGIDMap

When CIDToGIDMap is Identity:

Font Symbolic
Font program
Font dictionary

Reading PDF Flow

PDF Operator

PDF uses operators to describe, in a pre-defined format, what text is shown and how it is shown. However, the way a PDF is organized is complex to understand.

BT % Begin text object
/F1 1 Tf % Set text font and size 
64 0 0 64 7.1771 2.4414 Tm % Set text matrix 
0 Tc % Set character spacing 
0 Tw % Set word spacing
ET

In the block above, Tf, Tm, Tc and Tw are four operators that set the font (and font size), the text matrix, the character spacing and the word spacing. Put plainly, if you want to read PDF content and process it, for example to highlight or embolden a word, you must handle as many operators as possible to obtain accurate information about each text block. You may need to check the operator table to learn more about PDF operators.
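To see why the sample sets a font size of 1 and then a text matrix of 64 0 0 64 7.1771 2.4414, it helps to apply the matrix by hand. The sketch below is plain C, not CoreGraphics; PdfMatrix, PdfPoint, pdf_matrix_apply and effective_font_size are hypothetical names. It assumes an unrotated, unskewed matrix and ignores the current transformation matrix, which a real renderer would also apply.

```c
#include <stdio.h>

/* A PDF matrix [a b c d e f], in the order the Tm operator takes its operands. */
typedef struct { double a, b, c, d, e, f; } PdfMatrix;
typedef struct { double x, y; } PdfPoint;

/* Map a point through the matrix:
   x' = a*x + c*y + e,  y' = b*x + d*y + f. */
PdfPoint pdf_matrix_apply(PdfMatrix m, PdfPoint p) {
    PdfPoint out = { m.a * p.x + m.c * p.y + m.e,
                     m.b * p.x + m.d * p.y + m.f };
    return out;
}

/* Glyph height for an unrotated, unskewed matrix (b == c == 0):
   the Tf size multiplied by the vertical scale d. */
double effective_font_size(double tf_size, PdfMatrix m) {
    return tf_size * m.d;
}
```

With the values from the block above, the effective size is 1 × 64 = 64 units, and the text origin (0, 0) lands at (7.1771, 2.4414). This is why reading only the Tf operand is not enough to know how large the text is drawn.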

Scanner Stack

A stack is the data structure that stores PDF objects while a PDF file is being read. Content streams use postfix notation: operands come first and are pushed onto the stack, and the operator that follows reads them back from the stack. Below is an example.

BT % Begin text object
/F1 1 Tf % Set text font and size 
64 0 0 64 7.1771 2.4414 Tm % Set text matrix 
0 Tc % Set character spacing 
0 Tw % Set word spacing
ET

Stack will be:

BT -> /F1 -> 1 -> Tf(operator) -> 64 -> 0 -> 0 -> 64 -> 7.1771 -> 2.4414 -> Tm(operator) -> … -> ET (top of the stack).
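The idea can be sketched in plain C without CoreGraphics. This is a toy model, not how CGPDFScanner is implemented: the tokens are assumed to be already split into strings, and OperandStack, TextState and scan_tokens are hypothetical names. Operands are pushed as they arrive, and each operator pops what it needs, most recent value first.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_OPERANDS 16

/* Operand stack: the raw tokens seen since the last operator. */
typedef struct {
    const char *items[MAX_OPERANDS];
    int count;
} OperandStack;

/* Text state collected while scanning (hypothetical, for illustration). */
typedef struct {
    char font_name[32];
    double font_size;
    double char_spacing;
    double word_spacing;
} TextState;

static void push(OperandStack *s, const char *tok) {
    if (s->count < MAX_OPERANDS) s->items[s->count++] = tok;
}

/* Pop the most recently pushed operand, or NULL if the stack is empty. */
static const char *pop(OperandStack *s) {
    return s->count > 0 ? s->items[--s->count] : NULL;
}

static double pop_number(OperandStack *s) {
    const char *tok = pop(s);
    return tok ? strtod(tok, NULL) : 0.0;
}

/* A name (/F1) or a number is an operand; anything else is treated as an operator. */
static int is_operand(const char *tok) {
    return tok[0] == '/' || tok[0] == '-' || tok[0] == '+' || tok[0] == '.' ||
           (tok[0] >= '0' && tok[0] <= '9');
}

/* Scan tokens in order: push operands, let each operator pop its own. */
void scan_tokens(const char *tokens[], int n, TextState *state) {
    OperandStack stack = { {0}, 0 };
    for (int i = 0; i < n; ++i) {
        const char *tok = tokens[i];
        if (is_operand(tok)) { push(&stack, tok); continue; }
        if (strcmp(tok, "Tf") == 0) {
            state->font_size = pop_number(&stack);   /* top of stack: the size */
            const char *name = pop(&stack);          /* below it: the name /F1 */
            snprintf(state->font_name, sizeof state->font_name, "%s",
                     name ? name + 1 : "");          /* drop the leading slash */
        } else if (strcmp(tok, "Tc") == 0) {
            state->char_spacing = pop_number(&stack);
        } else if (strcmp(tok, "Tw") == 0) {
            state->word_spacing = pop_number(&stack);
        }
        stack.count = 0;  /* leftover operands do not carry over to the next operator */
    }
}
```

Feeding it the tokens of the block above leaves font_name set to F1 and font_size set to 1. Operators this sketch does not know, such as BT, Tm and ET, simply discard their operands.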

Getting PDF Objects from Scanner Stack

On iOS, you implement callback functions for the operators you care about, so that the scanner knows which values to fetch for each operator as it scans through the PDF document. Fortunately, iOS provides Pop functions to retrieve the desired objects when the scanner meets a specific operator. Therefore, we must know the exact operand format of each operator. For example, in the block above, Tf is the font operator: when the scanner meets Tf, its callback pops the two most recent values, 1 and /F1, with CGPDFScannerPopNumber and CGPDFScannerPopName. Because the stack is last-in, first-out, the size is popped before the name.

The following list shows the Pop functions supplied by iOS.

  • CGPDFScannerPopObject
  • CGPDFScannerPopBoolean
  • CGPDFScannerPopInteger
  • CGPDFScannerPopNumber to get number, for example: get font’s size
  • CGPDFScannerPopName to get name, for example: get font’s name
  • CGPDFScannerPopString to get content string
  • CGPDFScannerPopArray
  • CGPDFScannerPopDictionary
  • CGPDFScannerPopStream

Copied from the CGPDFScanner Reference

Parsing PDF Content

To parse PDF content accurately, we need to read all of the PDF data in plain form. However, apart from some pure control data, streams in a PDF are usually stored in compressed form, so we need to decompress them to get the plain text. The compression algorithm is stated in the stream's dictionary, for example:

12 0 obj
<< /Length 13 0 R /N 3 /Alternate /DeviceRGB /Filter /FlateDecode >>

/FlateDecode tells us that the stream is encoded with the Flate algorithm and that we should use the same algorithm to decode it.
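As a rough illustration, the filter name can be pulled out of such a dictionary with a few string operations. This sketch is deliberately naive and is not a PDF parser: pdf_filter_name is a hypothetical helper, it assumes the dictionary text is already in memory, and it only handles a /Filter entry whose value is a single name (the specification also allows an array of filters).

```c
#include <stdio.h>
#include <string.h>

/* Copy the name that follows /Filter into out (without the slash).
   Returns 1 on success, 0 if there is no single-name /Filter entry. */
int pdf_filter_name(const char *dict, char *out, size_t out_size) {
    const char *p = strstr(dict, "/Filter");
    if (p == NULL || out_size == 0) return 0;

    p += strlen("/Filter");
    while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n') ++p;  /* skip whitespace */

    if (*p != '/') return 0;  /* an array of filters, or malformed: not handled here */
    ++p;

    size_t n = 0;
    /* A name ends at whitespace or at a PDF delimiter character. */
    while (*p && !strchr(" \t\r\n/<>[]()", *p) && n + 1 < out_size) {
        out[n++] = *p++;
    }
    out[n] = '\0';
    return n > 0;
}
```

For the dictionary shown above it yields FlateDecode, which tells the reader to inflate the stream with a zlib-compatible decoder before interpreting it.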

In practice, most PDF creators use the Flate encoder to compress streams. To decode that data, we can use existing tools. In my experience, I usually use PDFtk (the PDF Toolkit, Server version) like this:

$ pdftk compressedPDF.pdf output decompressedPDF.pdf uncompress

Then use the less command to view the decoded text:

$ less decompressedPDF.pdf

After that, we can see the PDF content as plain text and parse it according to the PDF operator specification.

To summarize: because PDF streams (text content, image content, …) are usually compressed to reduce data size, a PDF reader or parser must decompress each stream before rendering or parsing it. Most PDF streams are decoded with the FlateDecode filter, which is based on the zlib/deflate algorithm. To get plain text for analysis, we can use a tool to convert a compressed PDF into an uncompressed one. In my case, I use pdftk with the command:

pdftk compressedPDF.pdf output uncompressedPDF.pdf uncompress

Besides that, some people suggest iText RUPS, which is built on the iText library (written in Java). See the discussion here.

Processing PDF content

Get Text Rect

CGPDFArrayRef rectArray;
// Declared outside the if block so that it is still in scope for CGRectMake below.
CGPDFReal coords[4] = {0, 0, 0, 0};

// /Rect holds four numbers: bottom-left x, bottom-left y, top-right x, top-right y.
if (CGPDFDictionaryGetArray(annotDict, "Rect", &rectArray)) {
    size_t arrayCount = CGPDFArrayGetCount(rectArray);

    for (size_t k = 0; k < arrayCount && k < 4; ++k) {
        CGPDFReal coord;
        // CGPDFArrayGetNumber accepts both integer and real entries.
        if (!CGPDFArrayGetNumber(rectArray, k, &coord)) {
            continue;
        }

        coords[k] = coord;
    }
}

// blx, bly, trx, try -> tlx, tly, w, h
CGRect rect = CGRectMake(coords[0], coords[3], coords[2] - coords[0], coords[3] - coords[1]);
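The corner-to-rectangle conversion is easy to get wrong, so here is the same arithmetic as a standalone C sketch that can be tested off-device. ViewRect and pdf_rect_to_view_rect are hypothetical names, not CoreGraphics types. The sketch goes one step further than the snippet above: it assumes you want a top-left origin with y growing downward, as in UIKit, so it flips the y axis using the page height. It also assumes an unrotated, unscaled page whose origin is (0, 0).

```c
#include <stdio.h>

/* A rectangle with a top-left origin and y growing downward. */
typedef struct { double x, y, width, height; } ViewRect;

/* Convert a PDF rectangle [x1 y1 x2 y2] (bottom-left origin, y up)
   into a top-left-origin rectangle for a page of the given height. */
ViewRect pdf_rect_to_view_rect(const double r[4], double page_height) {
    /* The two corners may come in any order, so normalise them first. */
    double left   = r[0] < r[2] ? r[0] : r[2];
    double right  = r[0] < r[2] ? r[2] : r[0];
    double bottom = r[1] < r[3] ? r[1] : r[3];
    double top    = r[1] < r[3] ? r[3] : r[1];

    ViewRect out;
    out.x = left;
    out.y = page_height - top;   /* flip the y axis */
    out.width = right - left;
    out.height = top - bottom;
    return out;
}
```

For a /Rect of [100 700 200 720] on a 792-point-high page, this gives x = 100, y = 72, width = 100 and height = 20.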

Some add-on elements for showing PDF

Annotation

Linking Media/Web Elements

Working with Font and Charset Encoding

Problem with Identity-H Encoding

Identity-H is an encoding used by Google Docs (when you export a document to a PDF file).
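As far as the PDF specification goes, Identity-H means that every character code in a shown string is two bytes long and is used directly as the CID, so the bytes in the content stream are not readable text. The C sketch below shows only that decoding step; identity_h_decode is a hypothetical name. Turning the CIDs into Unicode needs extra information, typically the font's ToUnicode CMap, and is not covered here.

```c
#include <stddef.h>

/* Decode a string shown with an Identity-H font: every two bytes form one
   big-endian code, and that code is used directly as the CID.
   Returns the number of CIDs written; a trailing odd byte is ignored. */
size_t identity_h_decode(const unsigned char *bytes, size_t len,
                         unsigned short *cids, size_t max_cids) {
    size_t n = 0;
    for (size_t i = 0; i + 1 < len && n < max_cids; i += 2) {
        cids[n++] = (unsigned short)((bytes[i] << 8) | bytes[i + 1]);
    }
    return n;
}
```

The bytes 00 2B 00 48 decode to the CIDs 0x002B and 0x0048. A parser that treats the same bytes as single-byte characters sees NUL bytes mixed with unrelated letters, which is one likely reason such documents come out garbled.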

References

For Starters

Advanced

Open Sources for iOS

Open Sources for Other Languages

  • Mozilla PDF. Good, but it makes some mistakes when parsing complex tables. Test it with this doc and this.

  • PDFMiner. A great PDF parser, but it still makes mistakes with PDFs generated by Google Docs, which use a composite font and Identity-H encoding. Try it with this document (its output is here).

  • mupdf. This library is used by the Android version of librelio.

  • PDFBox. An Apache open-source project that fully supports PDF tasks, including text extraction, filling, splitting, merging, signing, printing and PDF/A validation. It is written in Java.

  • iText. Also written mainly in Java. It supports backend use and has an Android port.


