Complete Guide to PDFs

Creation, Editing, and Optimization: Everything You Need to Know

25 min read • Updated February 2025

1. What is PDF? History and the PDF Standard

The Portable Document Format (PDF) is a file format developed by Adobe Systems in 1992 to present documents in a manner independent of application software, hardware, and operating systems. The core idea behind PDF was revolutionary at the time: a document should look exactly the same regardless of where or how it is viewed.

John Warnock, co-founder of Adobe, initiated the "Camelot Project" in 1991, which eventually became PDF. The first version of the PDF specification was published alongside Acrobat 1.0 in June 1993. In those early days, the format was proprietary and the Acrobat Reader software cost $50, a significant barrier to adoption.

Adobe made a pivotal decision in 1994 by making Acrobat Reader free, which dramatically accelerated adoption. Over the next decade, PDF evolved through multiple versions, each adding significant capabilities:

Version | Year | Key Features Added
PDF 1.0 | 1993 | Core format, basic text and graphics
PDF 1.1 | 1994 | Passwords, encryption, device-independent color
PDF 1.2 | 1996 | Interactive forms (AcroForms), Unicode support
PDF 1.3 | 2000 | Digital signatures, JavaScript, ICC color profiles
PDF 1.4 | 2001 | Transparency, 128-bit encryption, accessibility (tagged PDF)
PDF 1.5 | 2003 | JPEG 2000 compression, cross-reference streams, object streams
PDF 1.6 | 2004 | 3D artwork, AES encryption, embedded files
PDF 1.7 | 2006 | XFA forms, 3D annotations, ISO standardization
PDF 2.0 | 2017 | AES-256 encryption, improved tagged PDF, unencrypted wrapper documents

The ISO Standard: ISO 32000

In 2008, PDF 1.7 was published as ISO 32000-1:2008, making it an open international standard no longer under Adobe's sole control. This was a watershed moment. Any company or developer could now implement the PDF specification without licensing concerns. The standard is maintained by ISO Technical Committee 171 (TC 171), Sub-Committee 2 (SC 2).

PDF 2.0, published as ISO 32000-2:2017 and later revised as ISO 32000-2:2020, was the first version of the PDF specification developed entirely within the ISO standardization process. It introduced several improvements including deprecation of proprietary technologies like XFA forms and Adobe's Flash-based rich media, better encryption with AES-256 as the only supported algorithm, and enhanced accessibility features.

Did You Know?

PDF is one of the most widely used file formats in the world. Over 2.5 trillion PDFs are created every year, according to Adobe. Government agencies, banks, universities, and businesses worldwide rely on PDF as their primary document format for contracts, invoices, academic papers, and regulatory filings.

2. How PDFs Work Internally

Understanding PDF internals is valuable for anyone who works with documents at a deeper level. A PDF file is not simply a flat image or a text file; it is a sophisticated structured binary format built around a hierarchical object system. Every PDF file, no matter how simple, consists of four main sections.

The Four Parts of a PDF File

1. Header: The file begins with a header line that identifies the PDF version. For example:

%PDF-1.7
%âãÏÓ

The first line declares the PDF version. The second line contains at least four binary characters (bytes with values above 127), which signals to file transfer programs that the file contains binary data and should not be treated as plain text.
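
As a rough illustration of the header check described above, the following Python sketch reads the version and tests for the binary marker. The function name read_pdf_header and the sample bytes are assumptions made for this example; they are not part of any PDF library.

```python
import re

# Matches the version comment at the very start of a PDF file, e.g. b"%PDF-1.7".
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def read_pdf_header(data: bytes) -> tuple[int, int, bool]:
    """Return (major, minor, has_binary_marker) for the first bytes of a PDF."""
    match = HEADER_RE.match(data)
    if match is None:
        raise ValueError("not a PDF: missing %PDF-x.y header")
    major, minor = int(match.group(1)), int(match.group(2))

    # The second line should be a comment holding at least four bytes >= 128.
    lines = data.splitlines()
    second = lines[1] if len(lines) > 1 else b""
    high_bytes = sum(1 for byte in second if byte >= 128)
    has_marker = second.startswith(b"%") and high_bytes >= 4
    return major, minor, has_marker


sample = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"
print(read_pdf_header(sample))  # (1, 7, True)
```

In practice only the first kilobyte or so of the file needs to be read for this check.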

2. Body: The body contains all the objects that make up the document content: pages, fonts, images, annotations, and more. Each object is assigned a unique identifier consisting of an object number and a generation number.

3 0 obj          % Object number 3, generation 0
<<
  /Type /Page
  /Parent 2 0 R   % Reference to object 2
  /MediaBox [0 0 612 792]  % US Letter size in points
  /Contents 4 0 R
  /Resources <<
    /Font <<
      /F1 5 0 R   % Reference to font object
    >>
  >>
>>
endobj

3. Cross-Reference Table (xref): This table provides the byte offset of every object in the file, enabling random access. A PDF reader does not need to scan the entire file to find a specific object; it looks up the offset in the cross-reference table and jumps directly to it.

xref
0 7
0000000000 65535 f    % Free object (object 0 is always free)
0000000015 00000 n    % Object 1 at byte offset 15
0000000078 00000 n    % Object 2 at byte offset 78
0000000192 00000 n    % Object 3 at byte offset 192
0000000413 00000 n    % Object 4 at byte offset 413
0000000560 00000 n    % Object 5 at byte offset 560
0000000732 00000 n    % Object 6 at byte offset 732
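
To make the table layout concrete, here is a minimal Python sketch that turns a classic xref section into a lookup dictionary. The function name parse_xref_section is hypothetical, and stripping the "%" annotations is only there so that the commented example above can be fed in; real xref entries are fixed-width and carry no comments. Cross-reference streams (PDF 1.5 and later) are not handled.

```python
def parse_xref_section(text: str) -> dict[int, tuple[int, int, bool]]:
    """Parse a classic xref section into {object_number: (offset, generation, in_use)}."""
    # Strip the illustrative "% ..." comments and drop empty lines.
    lines = [line.split("%")[0].strip() for line in text.splitlines()]
    lines = [line for line in lines if line]
    if not lines or lines[0] != "xref":
        raise ValueError("expected the 'xref' keyword")

    entries: dict[int, tuple[int, int, bool]] = {}
    i = 1
    while i < len(lines) and lines[i] != "trailer":
        # Subsection header: first object number and number of entries.
        first, count = (int(part) for part in lines[i].split())
        i += 1
        for obj_num in range(first, first + count):
            offset, generation, kind = lines[i].split()
            # "n" marks an object in use, "f" marks a free entry.
            entries[obj_num] = (int(offset), int(generation), kind == "n")
            i += 1
    return entries


table = parse_xref_section("""
xref
0 3
0000000000 65535 f    % Free object
0000000015 00000 n    % Object 1
0000000078 00000 n    % Object 2
""")
print(table[2])  # (78, 0, True)
```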

4. Trailer: The trailer sits at the end of the file and tells the PDF reader where to find the cross-reference table, the document catalog (root object), and other essential metadata.

trailer
<<
  /Size 7
  /Root 1 0 R     % The document catalog
  /Info 6 0 R     % Document information dictionary
>>
startxref
892               % Byte offset of the xref table
%%EOF
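
A reader typically starts at the end of the file. The sketch below shows one way that first step could look in Python: search the last bytes for startxref and return the offset that follows it. The helper name find_last_xref_offset and the 1024-byte tail size are assumptions chosen for this example.

```python
import re

# "startxref", the decimal offset, then the end-of-file marker.
STARTXREF_RE = re.compile(rb"startxref\s+(\d+)\s+%%EOF")


def find_last_xref_offset(data: bytes, tail_size: int = 1024) -> int:
    """Return the byte offset named by the last startxref in the file tail."""
    matches = STARTXREF_RE.findall(data[-tail_size:])
    if not matches:
        raise ValueError("no startxref/%%EOF pair found near the end of the file")
    return int(matches[-1])


sample = (
    b"%PDF-1.7\n...objects...\ntrailer\n"
    b"<< /Size 7 /Root 1 0 R >>\nstartxref\n892\n%%EOF\n"
)
print(find_last_xref_offset(sample))  # 892
```

From that offset, a parser would read the cross-reference table, then the trailer dictionary, and finally follow /Root to the document catalog.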

PDF Object Types

PDF defines eight basic object types that are used to construct all objects (integers and reals together make up the numeric type):

  • Boolean: true or false
  • Numeric: Integers like 42 or -17, and reals (floating-point numbers) like 3.14
  • String: Literal strings (Hello World) or hexadecimal strings <48656C6C6F>
  • Name: Slash-prefixed identifiers like /Type, /Page, /Font
  • Array: Ordered collections, e.g. [0 0 612 792]
  • Dictionary: Key-value pairs enclosed in << >>
  • Stream: A dictionary followed by a sequence of bytes, used for page content, images, fonts, and more
  • Null: The keyword null, which stands for a missing or undefined value

Content Streams and Operators

The actual visible content on a PDF page is described by a content stream, a sequence of operators that draw text, lines, curves, and images. Content streams use a stack-based postfix notation inherited from PostScript. Here is an example that draws text on a page:

BT              % Begin text object
  /F1 12 Tf     % Set font F1 at 12 points
  1 0 0 1 72 720 Tm  % Set text matrix (position at 72, 720)
  (Hello, World!) Tj  % Show the string
ET              % End text object

Graphics operators draw shapes and paths. The coordinate system in PDF places the origin (0,0) at the bottom-left corner of the page, with units in points (1 point = 1/72 of an inch). A standard US Letter page is 612 by 792 points.
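
Because most layout tools measure from the top-left corner, a conversion step is usually needed. The following Python sketch converts a position given in inches from the top-left into PDF points and builds a small text content stream with it. The function names, the /F1 font resource, and the US Letter default are assumptions for illustration; the string escaping covers only backslashes and parentheses.

```python
POINTS_PER_INCH = 72.0
LETTER_HEIGHT_PT = 792.0  # US Letter height in points


def top_left_to_pdf(x_in: float, y_from_top_in: float,
                    page_height_pt: float = LETTER_HEIGHT_PT) -> tuple[float, float]:
    """Convert inches measured from the top-left corner to PDF points (origin bottom-left)."""
    x_pt = x_in * POINTS_PER_INCH
    y_pt = page_height_pt - y_from_top_in * POINTS_PER_INCH
    return x_pt, y_pt


def text_content_stream(text: str, x_pt: float, y_pt: float,
                        font: str = "/F1", size: float = 12) -> str:
    """Build a minimal content stream that shows one line of text."""
    # Backslashes and parentheses must be escaped inside a literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return (
        "BT\n"
        f"  {font} {size:g} Tf\n"
        f"  1 0 0 1 {x_pt:g} {y_pt:g} Tm\n"
        f"  ({escaped}) Tj\n"
        "ET\n"
    )


x, y = top_left_to_pdf(1.0, 1.0)  # one inch in from the left and top edges
print(text_content_stream("Hello, World!", x, y))
```

With a one-inch margin on a Letter page this reproduces the position (72, 720) used in the example above.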

Incremental Updates

One of PDF's most clever design features is incremental updates. When you modify a PDF (for example, by adding an annotation or filling in a form field), the application does not need to rewrite the entire file. Instead, it appends a new body section, a new cross-reference table, and a new trailer at the end of the file. The new cross-reference table only lists the objects that changed, and its trailer points back to the previous cross-reference table. This chain of updates allows the complete edit history to be reconstructed and makes saving changes to large files very fast.

Technical Insight

Incremental updates are the reason why a PDF file can sometimes become larger after you "delete" content. The original objects remain in the file; they are simply no longer referenced by the newer cross-reference table. To truly remove old content, you must perform a "Save As" operation that rewrites the entire file from scratch, or use a tool that rewrites and optimizes the PDF.
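
One rough way to spot incremental updates is to look for end-of-file markers, since each appended revision ends with its own %%EOF. The Python sketch below is a heuristic under that assumption (a linearized file, for instance, can contain an extra marker without having been edited), and revision_boundaries is a hypothetical helper name.

```python
def revision_boundaries(data: bytes) -> list[int]:
    """Return the end offset of every %%EOF marker found in the file."""
    marker = b"%%EOF"
    ends = []
    pos = data.find(marker)
    while pos != -1:
        ends.append(pos + len(marker))
        pos = data.find(marker, pos + 1)
    return ends


original = b"%PDF-1.7\n...objects, xref, trailer...\nstartxref\n892\n%%EOF\n"
update = b"7 0 obj\n...\nendobj\nxref\n...\ntrailer\n<< /Prev 892 >>\nstartxref\n1200\n%%EOF\n"
saved = original + update

ends = revision_boundaries(saved)
print(len(ends))  # 2 markers: the original save plus one appended update

# Truncating at the first marker recovers the bytes of the first revision.
first_revision = saved[:ends[0]]
```

This is also why "deleted" content can often be recovered: the earlier revision is still present byte for byte.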

3. Creating PDFs

There are many ways to create PDF files, ranging from simple "print to PDF" features built into operating systems to sophisticated programmatic generation using specialized libraries. The method you choose depends on your source material and requirements.

From Word Processors and Office Applications

The most common way to create PDFs is through export or "Save As" functionality in office applications. Microsoft Word, Google Docs, LibreOffice Writer, and Apple Pages all support direct PDF export. These applications convert their internal document model (which includes paragraphs, styles, tables, and images) into PDF objects. The quality of the resulting PDF varies by application:

  • Microsoft Word: Produces generally good PDFs with embedded fonts and hyperlinks. Recent versions also support creating tagged (accessible) PDFs.
  • Google Docs: Export quality is adequate for simple documents. Complex layouts with multiple columns or precise typography may not convert perfectly.
  • LibreOffice: Offers extensive PDF export options including PDF/A compliance, encryption, and initial view settings. It is one of the most configurable free PDF exporters available.
  • LaTeX: Produces exceptionally high-quality PDFs, especially for mathematical and scientific documents. The pdflatex, xelatex, and lualatex engines generate PDFs directly from source.

From Web Pages (HTML to PDF)

Converting web pages to PDF is useful for archiving, sharing, and offline reading. There are several approaches:

  • Browser Print Dialog: Every modern browser (Chrome, Firefox, Safari, Edge) includes a "Print to PDF" option. This uses the browser's rendering engine to convert the page layout, including CSS styles, into a PDF.
  • Headless Browser Tools: Tools like Puppeteer (using headless Chrome) or Playwright provide programmatic control over the conversion. They render the page identically to how a user would see it and then generate a high-quality PDF.
  • Server-Side Rendering: Libraries like wkhtmltopdf use the WebKit rendering engine to convert HTML and CSS to PDF on the server. Prince XML is a commercial tool that produces particularly high-quality PDFs with excellent CSS support, including CSS Paged Media.

// Example: Generating a PDF from HTML using Puppeteer (Node.js)
const puppeteer = require('puppeteer');

async function htmlToPdf(url, outputPath) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until the network is idle so late-loading assets are rendered.
    await page.goto(url, { waitUntil: 'networkidle0' });
    await page.pdf({
      path: outputPath,
      format: 'A4',
      margin: { top: '1cm', right: '1cm', bottom: '1cm', left: '1cm' },
      printBackground: true // include CSS background colors and images
    });
  } finally {
    // Always close the browser, even if navigation or rendering fails.
    await browser.close();
  }
}

From Images

Converting images to PDF is commonly needed for scanned documents, photo portfolios, and multi-page image collections. Each image becomes a page (or part of a page) in the resulting PDF. The image data can be embedded using various compression methods:

  • JPEG (DCTDecode): Best for photographs. The image data is stored in its original JPEG-compressed form, so no additional compression penalty is incurred.
  • PNG-style (FlateDecode): Best for screenshots, diagrams, and images with text. Uses lossless zlib/deflate compression.
  • JPEG 2000 (JPXDecode): Supported from PDF 1.5 onwards. Offers better compression ratios than JPEG, especially at low bitrates, but has less universal tool support.
  • JBIG2: Highly efficient compression for bi-level (black and white) images, commonly used for scanned text documents.

You can use our Images to PDF converter to combine multiple images into a single PDF document directly in your browser, with no uploads required.

From Scanners (Scan to PDF)

Scanning physical documents to PDF is a fundamental use case. Modern scanners and multifunction printers can produce PDFs directly. For best results, consider:

  • Scanning at 300 DPI for general documents, 600 DPI for fine print or archival quality
  • Using black-and-white mode for text-only documents to minimize file size
  • Applying OCR (Optical Character Recognition) to make scanned text searchable and selectable
  • Deskewing and cleaning up scanned images for readability

4. PDF Editing and Manipulation

PDF was originally designed as a final-form format, a digital equivalent of paper. Editing PDFs is therefore more complex than editing a word processor document. Unlike formats such as DOCX where content flows and reflows naturally, PDF positions every character, line, and image at an exact coordinate on the page. Modifying text can disrupt the layout because the format does not inherently understand paragraphs, columns, or other high-level structure.

Types of PDF Editing

PDF editing falls into several categories, each with different levels of complexity:

Page-Level Operations: These manipulate whole pages without touching their content. Operations include merging multiple PDFs, splitting a PDF into separate files, reordering pages, rotating pages, and extracting specific page ranges. These are the safest and most reliable PDF operations because they do not modify page content streams.

Annotation and Form Filling: Adding comments, highlights, stamps, and filling in form fields. These operations add new objects on top of existing page content without modifying the original content stream.

Content Editing: Directly modifying text, images, or graphics within a page's content stream. This is the most complex type of editing. Professional tools like Adobe Acrobat Pro, Foxit PDF Editor, and PDF-XChange Editor can do this, but results vary depending on the PDF's structure and font embedding.

Our PDF Tools

We offer several browser-based PDF tools that handle common manipulation tasks such as merging and splitting. All processing happens locally in your browser, so your files are never uploaded to any server.

Merging PDFs: How It Works

Merging PDFs is conceptually simple but technically nuanced. The merge process involves copying page objects from multiple source files into a new PDF, re-numbering all objects to avoid ID collisions, and rebuilding the page tree. Complications arise when source PDFs share font names but use different font files, when documents have different page sizes, or when interactive elements like hyperlinks and bookmarks reference specific pages.

A well-implemented merge tool handles all of these cases. Font subsetting must be preserved or re-computed, shared resources should be deduplicated to avoid bloating the output file, and cross-references within each source document must be updated to reflect their new positions in the merged file.
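
The renumbering step can be sketched with a toy model in Python. Here each document is reduced to a dictionary that maps an object number to the object numbers it references; this representation and the name merge_object_graphs are assumptions made for the example. Real merge tools also rebuild the page tree and deduplicate resources, which this sketch leaves out.

```python
def merge_object_graphs(docs: list[dict[int, list[int]]]) -> dict[int, list[int]]:
    """Merge toy object graphs, renumbering objects so that IDs never collide."""
    merged: dict[int, list[int]] = {}
    next_free = 1
    for doc in docs:
        # Assign a fresh number to every object of this source document.
        mapping = {}
        for old_num in sorted(doc):
            mapping[old_num] = next_free
            next_free += 1
        # Copy each object, rewriting its references through the mapping.
        for old_num, refs in doc.items():
            merged[mapping[old_num]] = [mapping[ref] for ref in refs]
    return merged


doc_a = {1: [2], 2: [3], 3: []}  # catalog -> pages -> page
doc_b = {1: [2], 2: []}          # both documents use object numbers 1 and 2
print(merge_object_graphs([doc_a, doc_b]))
# {1: [2], 2: [3], 3: [], 4: [5], 5: []}
```

Note that the second document's object 1 becomes object 4, and its reference to object 2 is rewritten to 5 so that it still points at the right target.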

Splitting PDFs: Practical Uses

Splitting a PDF is valuable in many scenarios: extracting a specific chapter from a book, separating individual invoices from a batch print, or reducing file size by removing unnecessary pages before sharing. The split operation creates new PDF files that each contain a subset of the original pages, along with all the resources (fonts, images, color profiles) needed by those pages.

Tip

When splitting a PDF, be aware that shared resources are typically duplicated across all output files. If the original PDF has a 5 MB embedded font used on every page, each split file will include that font. The sum of split file sizes may therefore exceed the original file size.

5. PDF Compression and Optimization

PDF files can become surprisingly large, especially when they contain high-resolution images, embedded fonts, or multiple pages. Compression and optimization reduce file size while preserving visual quality, making PDFs easier to share, email, and store. Understanding the different techniques helps you choose the right trade-offs for your use case.

Image Compression

Images are almost always the largest component of a PDF. A single uncompressed 8-bit RGB photograph covering a letter-size page at 300 DPI would require approximately 25 MB of data. Compression techniques reduce this dramatically:

  • JPEG compression: Lossy compression ideal for photographs. Quality levels of 75-85% typically produce visually indistinguishable results while reducing file size by 10-20x compared to uncompressed data.
  • Flate (zlib/deflate) compression: Lossless compression used for line art, diagrams, and screenshots. Typically achieves 2-5x compression on suitable images.
  • Downsampling: Reducing image resolution to match the output requirement. A 600 DPI image in a PDF intended for screen viewing can be downsampled to 150 DPI without visible quality loss, reducing data by 16x.
  • Color space conversion: Converting images from CMYK to RGB can save space, as CMYK uses four channels instead of three. This is only appropriate when print-specific color accuracy is not required.
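
The figures in this section can be checked with a little arithmetic. The Python sketch below assumes 8 bits per channel and a US Letter page of 8.5 by 11 inches; the function name uncompressed_image_bytes is made up for this example.

```python
def uncompressed_image_bytes(width_in: float, height_in: float, dpi: int,
                             channels: int = 3, bits_per_channel: int = 8) -> int:
    """Raw size in bytes of an image covering width_in x height_in inches at the given DPI."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * channels * bits_per_channel // 8


# Full-page RGB photograph at 300 DPI: 2550 x 3300 pixels x 3 bytes.
letter_rgb = uncompressed_image_bytes(8.5, 11, 300)
print(letter_rgb)  # 25245000 bytes, roughly 25 MB

# Downsampling from 600 DPI to 150 DPI divides the pixel count by (600 / 150) ** 2.
ratio = uncompressed_image_bytes(8.5, 11, 600) / uncompressed_image_bytes(8.5, 11, 150)
print(ratio)  # 16.0

# CMYK stores four channels per pixel instead of three, about a third more data.
letter_cmyk = uncompressed_image_bytes(8.5, 11, 300, channels=4)
print(letter_cmyk - letter_rgb)  # 8415000 extra bytes
```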

Font Optimization

Fonts are often the second largest contributor to PDF file size. A complete font file can be 500 KB to several MB. Optimization strategies include:

  • Font subsetting: Including only the glyphs (characters) actually used in the document. If your document only uses 100 different characters from a font with 5,000 glyphs, subsetting can reduce the font data by 95% or more.
  • Font deduplication: When multiple pages or merged documents embed the same font separately, deduplication consolidates them into a single copy. This is especially impactful in merged PDFs.
  • Using standard fonts: The PDF specification defines 14 standard fonts (including Times, Helvetica, and Courier families) that PDF readers are expected to have available. Using these fonts means no font data needs to be embedded at all, though this sacrifices precise typographic control.

Structural Optimization

Beyond content compression, the structure of the PDF file itself can be optimized:

  • Removing unused objects: Incremental updates leave orphaned objects in the file. A full rewrite removes these, often saving significant space.
  • Object stream compression: Introduced in PDF 1.5, object streams pack multiple small objects into a single compressed stream, reducing overhead.
  • Cross-reference stream compression: Also from PDF 1.5, replacing the text-based cross-reference table with a compressed binary stream.
  • Linearization (Fast Web View): Restructures the PDF so that the first page can be displayed before the entire file has been downloaded. This reorganizes objects so that first-page resources appear at the beginning of the file. Linearization adds a small amount of overhead to the file size but dramatically improves perceived performance when viewed over a network.

Compression Comparison

To give a practical sense of what optimization achieves, here is a representative example:

Optimization | Before | After | Reduction
Image downsampling (600 to 150 DPI) | 48 MB | 4.2 MB | 91%
Font subsetting | 12 MB | 3.8 MB | 68%
Removing incremental updates | 8 MB | 5.1 MB | 36%
Object/xref stream compression | 5 MB | 4.3 MB | 14%
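
The reduction column is simply the size saved divided by the original size. As a quick check, this Python sketch recomputes it from the before and after values in the table; reduction_percent is a name chosen for the example.

```python
def reduction_percent(before_mb: float, after_mb: float) -> float:
    """Percentage of the original size that was removed."""
    return (before_mb - after_mb) / before_mb * 100


rows = [
    ("Image downsampling (600 to 150 DPI)", 48, 4.2),
    ("Font subsetting", 12, 3.8),
    ("Removing incremental updates", 8, 5.1),
    ("Object/xref stream compression", 5, 4.3),
]

for name, before, after in rows:
    print(f"{name}: {reduction_percent(before, after):.0f}%")
```

Running it prints 91%, 68%, 36%, and 14%, matching the table.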

Optimization Rule of Thumb

If your PDF is large, look at images first. In a typical document-with-images scenario, 80-95% of the file size comes from embedded images. Downsampling and recompressing images will almost always yield the biggest size reduction. Font subsetting comes second, followed by structural optimizations.

6. PDF Security

PDF supports a comprehensive security model with encryption, digital signatures, and fine-grained permissions. Understanding these features is essential for anyone who handles sensitive documents.

Password Encryption

PDF encryption protects document content so that only authorized users can view or modify it. The format supports two types of passwords:

  • User Password (Open Password): Required to open and view the document. Without this password, the PDF content cannot be decrypted.
  • Owner Password (Permissions Password): Controls what operations are allowed on the document (printing, copying text, editing, etc.). The document can be viewed without the owner password, but restricted operations are blocked.

The encryption algorithms have evolved significantly over PDF versions:

  • 40-bit RC4 (PDF 1.1): The original encryption. Now considered completely insecure and can be cracked in seconds.
  • 128-bit RC4 (PDF 1.4): A significant improvement, but RC4 itself has known vulnerabilities.
  • 128-bit AES (PDF 1.6): Adopted the Advanced Encryption Standard. Much more secure than RC4.
  • 256-bit AES (PDF 2.0): The current standard. Provides strong encryption suitable for sensitive documents. This is the only algorithm permitted in PDF 2.0-compliant files.

Security Warning

The owner password (permissions password) is fundamentally a "gentleman's agreement." PDF viewers are supposed to enforce the restrictions, but the document content is encrypted with the same key regardless of which password is used. Third-party tools can easily ignore permission restrictions. If you need true security, always set a strong user (open) password and use AES-256 encryption. The owner password alone is not a reliable security measure.

Digital Signatures

Digital signatures in PDF provide three guarantees: authentication (verifying who signed the document), integrity (confirming the document has not been altered since signing), and non-repudiation (the signer cannot deny having signed). PDF supports several signature standards:

  • PKCS#7 / CMS: The most common signature format in PDFs. Uses X.509 certificates and supports RSA, DSA, and ECDSA algorithms.
  • PAdES (PDF Advanced Electronic Signatures): Built on CMS and defined in ETSI standards. PAdES defines four levels of signatures: PAdES-B (basic), PAdES-T (with timestamp), PAdES-LT (with long-term validation data), and PAdES-LTA (with long-term archival). PAdES-LTA signatures remain verifiable even after the signing certificate has expired.
  • Timestamp signatures: Obtained from a Time Stamp Authority (TSA), these prove the document existed in its current form at a specific point in time.

Signed PDFs use incremental updates to ensure the signed byte range remains unchanged when subsequent modifications are made. Each signature covers a specific byte range, and any modification to those bytes will invalidate the signature. Multiple signatures can coexist in the same document, each covering a different byte range.
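
The byte-range idea can be illustrated with a short Python sketch. It assumes a /ByteRange array of the form [offset1 length1 offset2 length2], where the gap between the two spans holds the signature value itself. SHA-256 is used here only as an example digest; actual verification involves the CMS structure and certificate checks, which are out of scope. All function names are hypothetical.

```python
import hashlib


def signed_bytes(data: bytes, byte_range: list[int]) -> bytes:
    """Concatenate the spans named by a /ByteRange array [off1 len1 off2 len2 ...]."""
    if len(byte_range) % 2 != 0:
        raise ValueError("ByteRange must contain offset/length pairs")
    pairs = zip(byte_range[0::2], byte_range[1::2])
    return b"".join(data[offset:offset + length] for offset, length in pairs)


def byte_range_digest(data: bytes, byte_range: list[int]) -> str:
    """Digest of exactly the bytes that the signature covers."""
    return hashlib.sha256(signed_bytes(data, byte_range)).hexdigest()


def covers_whole_file(data: bytes, byte_range: list[int]) -> bool:
    """True if the range starts at byte 0 and ends at the last byte of the file."""
    last_offset, last_length = byte_range[-2], byte_range[-1]
    return byte_range[0] == 0 and last_offset + last_length == len(data)


signed = b"AAAA<SIG>BBBB"     # "<SIG>" stands in for the signature value
byte_range = [0, 4, 9, 4]     # covers "AAAA" and "BBBB", skips the gap

updated = signed + b"CCCC"    # a later incremental update appended to the file
tampered = b"AAAX<SIG>BBBB"   # one signed byte changed

print(byte_range_digest(updated, byte_range) == byte_range_digest(signed, byte_range))   # True
print(byte_range_digest(tampered, byte_range) == byte_range_digest(signed, byte_range))  # False
print(covers_whole_file(updated, byte_range))  # False: content was added after signing
```

Appending data leaves the signed bytes intact, which is why a viewer reports such a file as signed but modified afterwards rather than as an invalid signature.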

Permission Flags

PDF permission flags control individual operations on the document. These are set when applying the owner password and include:

  • Printing (standard resolution or high quality)
  • Content copying (extracting text and images)
  • Document modification (page insertion, deletion, rotation)
  • Annotation and form fill-in
  • Content extraction for accessibility purposes
  • Document assembly (page manipulation without content changes)
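
Internally, these permissions are stored as bits of a single integer (the /P entry of the encryption dictionary). The Python sketch below decodes such a value. The bit positions are given as I recall them from the permissions table in ISO 32000-1 and should be checked against the standard before being relied on; the dictionary keys and the function name are assumptions for this example.

```python
# Bit positions are 1-based, following the convention used in the PDF standard.
PERMISSION_BITS = {
    "print": 3,
    "modify": 4,
    "copy": 5,
    "annotate_and_fill_forms": 6,
    "fill_forms": 9,
    "extract_for_accessibility": 10,
    "assemble": 11,
    "print_high_quality": 12,
}


def decode_permissions(p: int) -> dict[str, bool]:
    """Decode a /P value (a signed 32-bit integer) into named permission flags."""
    value = p & 0xFFFFFFFF  # view the signed value as an unsigned 32-bit pattern
    return {
        name: bool(value & (1 << (bit - 1)))
        for name, bit in PERMISSION_BITS.items()
    }


flags = decode_permissions(-44)
print([name for name, allowed in flags.items() if not allowed])
# ['modify', 'annotate_and_fill_forms']
```

A value of -1 has every bit set and therefore grants all permissions.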

Redaction

Redaction is the permanent removal of sensitive information from a PDF. This is critically different from simply placing a black rectangle over text. Proper redaction actually removes the underlying text, image data, and metadata from the file. Improper "redaction" by covering text with a shape is a common security mistake: the hidden text can easily be extracted by selecting it, searching for it, or inspecting the PDF's internal objects. Always use a proper redaction tool (such as Adobe Acrobat Pro's redaction feature) and verify the result by attempting to search for or select the redacted content.

7. PDF Accessibility

An accessible PDF is one that can be effectively used by people with disabilities, including those who use screen readers, magnification software, and other assistive technologies. Accessibility is not just a nice-to-have; in many jurisdictions, it is a legal requirement for government agencies, educational institutions, and organizations receiving public funding. Regulations and standards such as Section 508 (US), EN 301 549 (EU), and WCAG 2.1 set the requirements for accessible electronic documents.

Tagged PDF: The Foundation of Accessibility

The key technology behind PDF accessibility is the tag tree (also called the structure tree). A tagged PDF contains a logical structure that maps visual elements on the page to semantic roles. This structure is separate from the visual content stream and provides the information assistive technologies need to present the document's content in a meaningful order.

Common structure tags include:

  • <Document>: The root element of the tag tree
  • <H1> through <H6>: Heading levels
  • <P>: Paragraphs
  • <L>, <LI>, <Lbl>, <LBody>: Lists, list items, labels, and list bodies
  • <Table>, <TR>, <TH>, <TD>: Table structure
  • <Figure>: Images and illustrations (with alt text)
  • <Link>: Hyperlinks
  • <Span>: Inline text with specific attributes (e.g., language changes)

If these tags look familiar, it is because they are intentionally modeled on HTML elements. The PDF standard drew direct inspiration from HTML's semantic structure when designing the tag system.

Key Accessibility Requirements

Creating an accessible PDF involves more than just adding tags. The following requirements are essential:

  • Reading order: The logical reading order must match the visual order. Multi-column layouts require careful tagging so screen readers present content in the correct sequence.
  • Alternative text: Every image must have descriptive alt text. Decorative images should be marked as artifacts so screen readers skip them.
  • Document language: The primary language must be specified in the document catalog, and language changes within the text must be tagged (e.g., a French phrase in an English document).
  • Table structure: Complex tables need header cell associations so screen readers can announce column and row headers when reading data cells.
  • Bookmarks: Long documents should have a bookmark tree (outline) that corresponds to the heading structure, enabling quick navigation.
  • Color contrast: Text must have sufficient contrast against its background (WCAG requires at least 4.5:1 for normal text and 3:1 for large text).
  • Font information: Fonts must be embedded with proper Unicode mappings so text can be extracted correctly. Without a ToUnicode CMap, screen readers may not be able to decode the text.
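
The color contrast requirement is the one item on this list that can be checked with a formula. The Python sketch below follows the WCAG definition of relative luminance and contrast ratio for 8-bit sRGB colors; the function names are chosen for this example.

```python
def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to its linear-light value."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple[int, int, int]) -> float:
    """Relative luminance of an sRGB color, weighted per WCAG."""
    r, g, b = (_linearize(value) for value in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground: tuple[int, int, int],
                   background: tuple[int, int, int]) -> float:
    """Contrast ratio (lighter + 0.05) / (darker + 0.05), between 1 and 21."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


WHITE = (255, 255, 255)
print(round(contrast_ratio((0, 0, 0), WHITE), 2))          # 21.0
print(contrast_ratio((0x76, 0x76, 0x76), WHITE) >= 4.5)    # True: passes for normal text
print(contrast_ratio((0x77, 0x77, 0x77), WHITE) >= 4.5)    # False: just below the threshold
```

The two gray values show how narrow the margin can be: a barely visible change in shade decides whether normal-size text meets the 4.5:1 requirement.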

Testing PDF Accessibility

Several tools can help verify PDF accessibility:

  • Adobe Acrobat Pro Accessibility Checker: Built-in tool that checks for common accessibility issues including missing tags, missing alt text, incorrect reading order, and insufficient contrast.
  • PAC (PDF Accessibility Checker): A free tool by the Swiss foundation Access for All. It performs comprehensive checks against the PDF/UA standard (ISO 14289) and generates detailed reports.
  • Screen reader testing: The ultimate test is to navigate the document with an actual screen reader (JAWS, NVDA, or VoiceOver) to verify that content is presented logically and completely.

Accessibility Fact

According to the World Health Organization, over 2.2 billion people worldwide have a vision impairment. Accessible PDFs ensure these individuals can access the same information as everyone else. Many lawsuits have been filed against organizations that distribute inaccessible PDFs, particularly in the education and government sectors.

8. PDF/A for Archival

PDF/A is a specialized subset of the PDF format designed for long-term digital preservation. Standardized as ISO 19005, PDF/A restricts certain PDF features to ensure that documents remain fully self-contained and reproducible decades or even centuries into the future, regardless of the software or operating system used to open them.

Why PDF/A Exists

Standard PDFs can reference external resources: fonts that are not embedded, color profiles stored elsewhere, multimedia content that requires specific plugins, and JavaScript that may behave differently across viewers. Over time, these dependencies break. A PDF created in 2003 that relies on a Flash plug-in for embedded video is now unplayable. A PDF with non-embedded fonts may render with substitution fonts, altering its appearance.

PDF/A eliminates these risks by requiring all resources to be embedded and prohibiting features that depend on external software or have unpredictable behavior.

PDF/A Conformance Levels

PDF/A has evolved through several parts, each based on a different version of the PDF specification:

PDF/A-1 (ISO 19005-1:2005): Based on PDF 1.4. The original standard with two conformance levels:

  • PDF/A-1b (Basic): Ensures reliable visual reproduction. Requires all fonts to be embedded, no encryption, no external content references, device-independent color, and XMP metadata.
  • PDF/A-1a (Accessible): Everything in 1b plus full tagging (logical structure) and Unicode character mapping. This level ensures both visual fidelity and content extraction capability.

PDF/A-2 (ISO 19005-2:2011): Based on PDF 1.7. Adds support for JPEG 2000 compression, transparency, layers (optional content groups), and PDF/A-compliant file attachments. Introduces a third conformance level:

  • PDF/A-2u (Unicode): Like 2b but requires all text to have Unicode mapping. This is a practical middle ground between basic visual reproduction and full accessibility.

PDF/A-3 (ISO 19005-3:2012): Based on PDF 1.7. Identical to PDF/A-2 but allows embedding of arbitrary file formats (not just other PDF/A files) as attachments. This enables scenarios like attaching the original XML data source or a spreadsheet alongside the rendered PDF.

PDF/A-4 (ISO 19005-4:2020): Based on PDF 2.0. Simplifies the conformance levels and aligns with the latest PDF specification features.

What PDF/A Prohibits

  • Encryption of any kind (the content must be freely accessible)
  • JavaScript and executable code
  • Audio and video content (PDF/A-1 and PDF/A-2)
  • External content references (all fonts, images, and color profiles must be embedded)
  • Non-standard fonts without embedding
  • LZW compression (due to historical patent concerns)
  • Transparency (in PDF/A-1 only; allowed from PDF/A-2 onwards)

Who Uses PDF/A?

PDF/A is mandated or strongly recommended by numerous organizations worldwide:

  • National archives and libraries (US Library of Congress, German Federal Archives, Swiss Federal Archives)
  • Court systems for electronic filing (e.g., the US federal court system's CM/ECF)
  • European Union institutions for official documents
  • Healthcare systems for patient records archival
  • Financial institutions for regulatory compliance

Validation Tip

Creating a file with a ".pdf" extension and claiming it is PDF/A does not make it so. PDF/A compliance must be validated using a conformance checker. The industry-standard tool is veraPDF, an open-source PDF/A validator developed by the Open Preservation Foundation with EU funding. It checks every requirement of the PDF/A specification and generates detailed compliance reports.

9. Working with PDFs Programmatically

Developers frequently need to create, modify, or extract data from PDFs in their applications. The ecosystem of PDF libraries is rich, spanning virtually every programming language. Here is a survey of the most established options.

JavaScript / TypeScript

  • pdf-lib: A pure JavaScript library for creating and modifying PDFs. Works in both Node.js and browsers. Excellent for tasks like merging PDFs, adding pages, embedding images, and filling forms. Does not handle rendering or text extraction.
  • PDF.js (pdfjs-dist): Mozilla's open-source PDF rendering library, used in Firefox. Can render PDF pages to Canvas elements and extract text content. Primarily a viewer/renderer, not a creation tool.
  • jsPDF: Generates PDFs from scratch in the browser. Good for creating simple documents with text, images, and basic vector graphics. Often paired with html2canvas for HTML-to-PDF conversion.
  • Puppeteer / Playwright: Browser automation tools that can generate high-quality PDFs from HTML pages using headless Chrome or Firefox.

// Example: Merging PDFs with pdf-lib
import { PDFDocument } from 'pdf-lib';

async function mergePdfs(pdfBuffers: ArrayBuffer[]): Promise<Uint8Array> {
  const mergedPdf = await PDFDocument.create();

  for (const buffer of pdfBuffers) {
    const sourcePdf = await PDFDocument.load(buffer);
    const pages = await mergedPdf.copyPages(
      sourcePdf,
      sourcePdf.getPageIndices()
    );
    pages.forEach((page) => mergedPdf.addPage(page));
  }

  return mergedPdf.save();
}

Python

  • pypdf (formerly PyPDF2): The most popular pure-Python PDF library. Handles merging, splitting, extracting text, encrypting/decrypting, and manipulating metadata. The library was forked, reunified, and renamed to pypdf in 2022.
  • ReportLab: A powerful PDF generation library. The open-source toolkit creates PDFs from scratch with precise control over layout, fonts, graphics, and tables. Used by major organizations for automated report generation.
  • pdfplumber: Built on pdfminer.six, this library excels at extracting structured data from PDFs, including table extraction with row and column detection.
  • Camelot: Specializes in extracting tables from PDFs. Uses either stream-based or lattice-based detection to identify table boundaries and extract data into pandas DataFrames.
  • FPDF2: A lightweight PDF generation library (Python port of the PHP FPDF library). Simple API for creating documents with text, images, and basic graphics.
# Example: Extracting text from a PDF with pypdf
from pypdf import PdfReader

reader = PdfReader("document.pdf")
for page_num, page in enumerate(reader.pages):
    text = page.extract_text()
    print(f"--- Page {page_num + 1} ---")
    print(text)

Java / JVM

  • Apache PDFBox: A comprehensive open-source Java library for working with PDFs. Supports creation, manipulation, rendering, text extraction, form filling, digital signatures, and PDF/A validation. Used extensively in enterprise applications.
  • iText: A feature-rich library available in Java and C#. Both the legacy iText 5 and its successor iText 7 are dual-licensed: AGPL for open-source use, with a commercial license for closed-source projects. iText is known for strong support of PDF/A creation, digital signatures, and high-volume document generation.
  • OpenPDF: A community fork of iText 4 under the LGPL/MPL license. Good for basic PDF operations in projects that cannot accept the AGPL's terms.

C / C++ / .NET

  • Poppler: An open-source PDF rendering library forked from Xpdf. Powers many Linux PDF viewers and provides command-line utilities like pdftotext, pdftoppm, and pdfinfo.
  • MuPDF: A lightweight, high-performance PDF and e-book renderer. Known for exceptional rendering quality and small footprint. Licensed under AGPL.
  • QPDF: A C++ library and command-line tool focused on structural transformation of PDFs (linearization, encryption, decryption, page manipulation). Excellent for automated workflows.
  • iTextSharp / iText 7 for .NET: The .NET ports of iText. iTextSharp corresponds to iText 5, and iText 7 for .NET is its successor. Widely used in ASP.NET applications for PDF generation and manipulation.

Command-Line Tools

Several powerful command-line tools are invaluable for scripting and automation:

  • Ghostscript: The Swiss Army knife of PDF processing. Handles conversion, optimization, repair, rendering, and format conversion (PostScript to PDF, PDF to images). Used as a backend by many other tools.
  • QPDF: Specializes in structural transformations: linearization, encryption, decryption, merging, splitting, and page extraction.
  • pdftk: "The PDF Toolkit" provides a simple command-line interface for merging, splitting, rotating, encrypting, decrypting, and watermarking PDFs.
  • Poppler utilities: A suite of tools including pdftotext, pdftoppm, pdfimages, pdffonts, and pdfinfo.
# Ghostscript: Compress a PDF by downsampling images
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
   -dPDFSETTINGS=/ebook \
   -dNOPAUSE -dBATCH -dQUIET \
   -sOutputFile=output.pdf input.pdf

# QPDF: Linearize a PDF for fast web viewing
qpdf --linearize input.pdf output.pdf

# pdftk: Merge two PDFs
pdftk file1.pdf file2.pdf cat output merged.pdf

# Poppler: Extract all text from a PDF
pdftotext input.pdf output.txt
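
When these tools are called from an application rather than a shell script, it helps to build the argument list in code. The following is a minimal Python sketch that wraps the Ghostscript compression command shown above. The function names build_gs_compress_command and compress_pdf are hypothetical helpers, not part of Ghostscript, and the sketch assumes the gs binary is on PATH.

```python
import shutil
import subprocess

# Preset names accepted by Ghostscript's -dPDFSETTINGS switch.
GS_PRESETS = {"screen", "ebook", "printer", "prepress", "default"}


def build_gs_compress_command(src: str, dst: str, preset: str = "ebook") -> list[str]:
    """Return the argument list for the Ghostscript compression command."""
    if preset not in GS_PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    return [
        "gs",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        f"-dPDFSETTINGS=/{preset}",
        "-dNOPAUSE", "-dBATCH", "-dQUIET",
        f"-sOutputFile={dst}",
        src,
    ]


def compress_pdf(src: str, dst: str, preset: str = "ebook") -> None:
    """Run Ghostscript; raises CalledProcessError on a non-zero exit code."""
    if shutil.which("gs") is None:
        raise FileNotFoundError("Ghostscript ('gs') was not found on PATH")
    # Passing a list (not a shell string) avoids quoting problems in file names.
    subprocess.run(build_gs_compress_command(src, dst, preset), check=True)
```

Keeping command construction separate from execution makes the wrapper easy to unit-test without Ghostscript installed.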

10. Common PDF Problems and Solutions

Despite being a mature format, PDFs can present various challenges. Here are the most common problems and their solutions.

Problem: "The PDF displays garbled text or wrong characters"

Cause: Missing or improperly embedded fonts. When a PDF references a font but does not embed it, the viewer substitutes a system font that may have different character mappings, especially for non-Latin scripts.

Solution: Re-create the PDF with font embedding enabled. In most applications, this is an export setting. Ensure that the ToUnicode CMap is included so characters can be correctly mapped. For existing PDFs, tools like Ghostscript can re-embed fonts if they are available on the system.

Problem: "Copied text from the PDF is jumbled or in the wrong order"

Cause: PDF stores text as positioned glyphs, not as a logical text flow. If the PDF was generated with each line (or even each word) as a separate text operation, the reading order may not match the visual order. Multi-column layouts are particularly problematic.

Solution: Use a text extraction tool that performs layout analysis, such as pdfplumber or pdftotext with the -layout flag. For reliable text extraction, the PDF should ideally be tagged (see Section 7 on accessibility).
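
To illustrate what layout analysis involves, here is a simplified Python sketch that rebuilds reading order from positioned text fragments. The Fragment type and reading_order function are hypothetical; this is not how pdfplumber or pdftotext work internally. It handles a single column only, which is exactly why multi-column pages need more sophisticated tools.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float   # left edge, in PDF points
    y: float   # baseline; the PDF y axis grows upward
    text: str


def reading_order(fragments: list[Fragment], line_tolerance: float = 2.0) -> str:
    """Group fragments into lines by baseline, then read top-to-bottom, left-to-right."""
    lines: list[list[Fragment]] = []
    # Visit fragments from the top of the page down (largest y first).
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)   # same baseline, within tolerance
        else:
            lines.append([frag])     # start a new line
    return "\n".join(
        " ".join(f.text for f in sorted(line, key=lambda f: f.x)) for line in lines
    )


# Fragments stored in an order that does not match the visual order.
fragments = [
    Fragment(120.0, 700.0, "world"),
    Fragment(72.0, 680.0, "second"),
    Fragment(72.0, 700.5, "Hello"),
    Fragment(115.0, 680.0, "line"),
]
print(reading_order(fragments))
```

Read in storage order, the fragments give "world second Hello line". Sorting by position recovers "Hello world" followed by "second line".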

Problem: "The PDF file size is too large to email"

Cause: Usually caused by high-resolution images, uncompressed or unnecessarily detailed graphics, non-subsetted fonts, or accumulated incremental updates.

Solution: Apply the optimization techniques from Section 5. Start by downsampling images, then subset fonts, and finally rewrite the file to remove incremental update overhead. Ghostscript's -dPDFSETTINGS=/ebook preset is a quick way to reduce size for screen-resolution documents.

Problem: "The PDF looks different when printed vs. on screen"

Cause: Transparency, RGB-to-CMYK color conversion, or missing print-specific elements. PDF transparency is rendered differently by different print workflows. Bright RGB colors may appear dull when converted to CMYK for printing.

Solution: For print-critical documents, use PDF/X (a print-production subset of PDF). Flatten all transparency, use CMYK color spaces, and embed ICC color profiles. Run a preflight check using Adobe Acrobat Pro or a dedicated preflight tool.

Problem: "The scanned PDF is not searchable"

Cause: Scanned documents are stored as images. The PDF contains the pixel data of the scanned pages but no actual text objects that could be searched or selected.

Solution: Apply OCR (Optical Character Recognition) to create a text layer on top of the scanned image. Adobe Acrobat Pro, ABBYY FineReader, and the open-source tool Tesseract (via OCRmyPDF) can all add searchable text layers to scanned PDFs without altering the visual appearance.

Problem: "The PDF is corrupt or will not open"

Cause: File transfer errors (truncated download, email attachment corruption), software bugs during creation, or disk errors. Common symptoms include missing cross-reference tables, invalid object references, or premature end-of-file.

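Solution: Try opening the file in different viewers; some are more tolerant of minor corruption. Ghostscript can often repair moderately damaged PDFs by reading and rewriting the file: gs -o repaired.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress damaged.pdf. QPDF attempts to recover damaged cross-reference data automatically when it reads a file, so rewriting to a new file often helps as well: qpdf damaged.pdf repaired.pdf. Work on a copy so the original remains available if the repair fails.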
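
Before attempting a repair, a quick structural check can tell you whether the file is truncated. The following Python sketch looks for the markers mentioned above. The function name quick_pdf_check is hypothetical, and the 1024-byte windows are an assumption based on what viewers commonly tolerate. A file that passes can still be corrupt in other ways.

```python
def quick_pdf_check(data: bytes) -> list[str]:
    """Return the structural problems found; an empty list means none were detected."""
    problems = []
    # The header normally sits at byte 0, but some leading junk is often tolerated.
    if b"%PDF-" not in data[:1024]:
        problems.append("missing %PDF- header")
    tail = data[-1024:]
    # A truncated download usually loses the end-of-file marker.
    if b"%%EOF" not in tail:
        problems.append("missing %%EOF marker (file may be truncated)")
    # startxref points to the cross-reference table that viewers read first.
    if b"startxref" not in tail:
        problems.append("missing startxref pointer")
    return problems


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        for problem in quick_pdf_check(handle.read()) or ["no obvious structural damage"]:
            print(problem)
```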

Problem: "Form fields lose their data when the PDF is opened in a different viewer"

Cause: There are two incompatible form technologies in PDF: AcroForms (the standard) and XFA Forms (an Adobe/XML-based system). XFA forms are only fully supported in Adobe Acrobat and Reader. Most other viewers cannot process XFA forms correctly.

Solution: Use AcroForms instead of XFA when creating new forms. PDF 2.0 officially deprecated XFA. If you must work with an existing XFA form, open it in Adobe Acrobat and "flatten" or convert it to AcroForm format. Alternatively, use the "Print to PDF" option to create a static version with the form data baked in.

11. PDF Best Practices Checklist

Whether you are creating PDFs for business, archival, web distribution, or accessibility, following these best practices will produce reliable, high-quality documents.

Creation

  • Always embed all fonts used in the document. Never rely on system fonts being available on the reader's machine.
  • Use font subsetting for documents that use only a small portion of a large font's glyph set.
  • Include a ToUnicode CMap for every font so text can be correctly extracted and searched.
  • Set the document language in the catalog dictionary.
  • Include meaningful metadata: title, author, subject, and keywords in the document information dictionary.
  • Use vector graphics instead of rasterized images where possible for diagrams, charts, and illustrations.

Optimization

  • Downsample images to match the intended output resolution (150 DPI for screen, 300 DPI for print).
  • Use appropriate compression: JPEG for photos, Flate for screenshots and line art.
  • Remove incremental updates by performing a full "Save As" rewrite.
  • Linearize PDFs intended for web distribution so the first page loads immediately.
  • Enable object stream and cross-reference stream compression for PDF 1.5+ files.
  • Remove metadata you do not want to distribute (author names, creation software, modification dates).
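
The downsampling targets above depend on an image's effective resolution, which comes from its pixel size and the size at which it is placed on the page. Here is a minimal Python sketch of that arithmetic. The function names are hypothetical, and it assumes the placed width is known in PDF points (72 points per inch).

```python
def effective_dpi(pixels: int, placed_points: float) -> float:
    """Resolution of an image along one axis; 72 PDF points equal one inch."""
    return pixels / (placed_points / 72.0)


def downsample_size(
    width_px: int,
    height_px: int,
    placed_width_pt: float,
    target_dpi: float = 150.0,
) -> tuple[int, int]:
    """Return pixel dimensions that meet target_dpi, never upsampling."""
    current = effective_dpi(width_px, placed_width_pt)
    if current <= target_dpi:
        return width_px, height_px   # already at or below the target
    scale = target_dpi / current
    return round(width_px * scale), round(height_px * scale)
```

For example, a 3000 x 2000 pixel photo placed 6 inches (432 points) wide has an effective resolution of 500 DPI. Downsampling it for screen use at 150 DPI gives 900 x 600 pixels, roughly a ninth of the original pixel count.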

Accessibility

  • Create tagged PDFs with a complete structure tree mirroring the document's logical organization.
  • Provide alternative text for every meaningful image. Mark decorative images as artifacts.
  • Ensure the reading order is correct, especially in multi-column layouts.
  • Use heading levels (H1-H6) consistently to create a navigable hierarchy.
  • Add bookmarks for documents longer than a few pages.
  • Ensure sufficient color contrast (4.5:1 minimum for normal text).
  • Test with a screen reader and an accessibility checker before distribution.
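
The 4.5:1 contrast threshold can be checked numerically. The sketch below implements the WCAG 2.x relative luminance and contrast ratio formulas for 8-bit sRGB colors. The function names are illustrative, and the sketch is not a substitute for a full accessibility checker.

```python
def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light, per the WCAG definition."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple[int, int, int]) -> float:
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> float:
    """Ratio between 1.0 (identical colors) and 21.0 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_normal_text(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> bool:
    """True when the pair meets the 4.5:1 minimum for normal-size text."""
    return contrast_ratio(fg, bg) >= 4.5
```

Pure black on white yields the maximum ratio of 21:1. Mid-gray text on a white background sits close to the threshold, so it is worth checking rather than judging by eye.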

Security

  • Use AES-256 encryption for any document containing sensitive information.
  • Set a strong user password (at least 12 characters with mixed case, numbers, and symbols).
  • Do not rely on the owner password alone for security; it only controls permissions, not access.
  • Use proper redaction tools (not just black rectangles) when removing sensitive information.
  • Apply digital signatures for documents requiring authentication and integrity verification.
  • Consider PAdES-LTA signatures for documents that need long-term signature validation.
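
The password guideline above is easy to enforce in an automated pipeline. Here is a minimal Python sketch that checks only the rule as stated in this checklist. The function name is hypothetical, and the check says nothing about whether a password is guessable or reused.

```python
import string


def meets_password_guideline(password: str, min_length: int = 12) -> bool:
    """Check length plus the presence of lower case, upper case, digits, and symbols."""
    return (
        len(password) >= min_length
        and any(c.islower() for c in password)
        and any(c.isupper() for c in password)
        and any(c.isdigit() for c in password)
        and any(c in string.punctuation for c in password)
    )
```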

Archival

  • Use PDF/A-2b at minimum for documents that need to be preserved long-term.
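  • Use PDF/A-2u when reliable text extraction is required (all text must map to Unicode), and PDF/A-2a when accessibility is also required (it adds tagged structure on top of the 2u requirements).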
  • Validate PDF/A compliance with veraPDF or a similar conformance checker before archiving.
  • Do not use encryption in archival PDFs (PDF/A prohibits it).
  • Embed all color profiles and ensure device-independent color specifications.
  • Remove any JavaScript, multimedia, or external references before archiving.

Distribution

  • Optimize file size before distributing via email or web (target under 10 MB for email attachments).
  • Linearize PDFs served over HTTP for progressive loading.
  • Set the initial view options (page layout, zoom level, bookmarks panel) appropriately for your audience.
  • Include a descriptive filename: "Q4_2024_Financial_Report.pdf" is far better than "document(3).pdf".
  • Test the PDF in multiple viewers (Adobe Acrobat, Chrome built-in viewer, Firefox, Preview on macOS) to ensure consistent rendering.

Final Thought

PDF is a remarkably capable format that has stood the test of time for over three decades. By understanding its internals, choosing the right tools, and following best practices, you can create PDFs that are compact, accessible, secure, and future-proof. Whether you are merging a few documents for a meeting or building an automated pipeline that generates thousands of reports, the principles covered in this guide will serve you well.
