
How to extract data from PDF to Excel/Spreadsheet: Advanced parsing with n8n.io and LlamaParse

Eliott Ardisson

Founder & CEO - Basalt Studio

tutorials

Learn how to automate PDF-to-spreadsheet extraction using n8n and LlamaParse — a practical guide for SMBs processing invoices, contracts, and reports at scale.

ai agents
automation
programmatic

TL;DR

  • LlamaParse uses large language models to understand document structure, making it significantly more reliable than character-level OCR for tabular data in PDFs.
  • Combining LlamaParse with n8n gives you a flexible, automatable pipeline: ingest a PDF, parse it, extract structured data with an AI model, and write it to a spreadsheet.
  • The workflow handles invoices, financial reports, contracts, and forms — any document where data relationships matter and manual copying is a bottleneck.
  • Error handling and rate limit management are the two most common points of failure in production; this guide covers both.
  • The stack is accessible for SMBs: n8n’s self-hosted option is free, LlamaParse has a generous free tier, and the Claude or OpenAI API costs cents per document at typical volumes.

The Actual Problem This Solves

If you run a 20-person accounting practice, a mid-size real estate brokerage, or a recruitment agency, you probably have someone on your team whose job includes copying numbers from PDFs into spreadsheets. Not because it’s a good use of their time. Because no one has set up a better system yet.

PDF data extraction sounds like a solved problem. It isn’t. Basic export tools handle clean, single-page documents reasonably well. They fall apart on multi-page tables, inconsistent formatting, scanned files, or documents where the same field appears in different positions depending on the vendor or source.

This guide walks through a production-viable approach: LlamaParse for document parsing, n8n for workflow automation, and an AI model (Claude or GPT-4) for structured data extraction. The result is a pipeline that processes a typical business document in two to four minutes, with meaningful accuracy on tables, line items, dates, and totals.


Why LlamaParse Instead of Standard OCR

Traditional OCR reads text as a sequence of characters. It does not understand that the number in column three, row seven belongs to the label in column one of that same row. When a table spans two pages or a column shifts position between documents, standard OCR either mangles the output or loses relationships entirely.

LlamaParse processes PDFs using large language models that understand layout semantics. It can identify that a block of numbers is a table, preserve column-row relationships, and handle layout variations across documents from different sources. The output is Markdown with properly formatted tables, which is both human-readable and easy to feed into a subsequent AI extraction step.

For document types that matter most in SMB contexts — invoices, purchase orders, financial statements, lease agreements, payroll summaries — this structural understanding is the difference between automation that actually works and automation that produces garbage requiring manual correction anyway.


What You Need Before You Start

Here is the core stack:

  • n8n: workflow automation. Self-hosted is free; cloud plans start at around $20/month. If you are processing sensitive documents, self-hosted is worth the setup time.
  • LlamaParse: PDF parsing via the LlamaIndex cloud API. The free tier is metered in pages rather than documents (it has offered around 1,000 pages per day), which is more than adequate for most SMBs; check the current pricing page for exact limits. Paid plans exist for higher volume.
  • An AI model API: Claude via the Anthropic API or GPT-4 via the OpenAI API. Either works. Use whichever you already have access to. Claude handles structured output instructions cleanly; GPT-4 is equally capable with the right prompt.
  • A destination spreadsheet: Google Sheets is easiest due to its API accessibility. Microsoft Excel via OneDrive works too, though the API setup is slightly more involved.

You do not need all of these to be enterprise accounts. The free or entry-level tiers of each tool are sufficient to build and test this pipeline.


Step 1: Configure LlamaParse Credentials in n8n

Create an account at cloud.llamaindex.ai and generate an API key from your account settings. Keep this key out of your workflow code — use n8n’s credential store.

In n8n, create a new Header Auth credential:

  • Header Name: Authorization
  • Header Value: Bearer YOUR_API_KEY

This credential is used for every API call to LlamaParse. Name it something descriptive like “LlamaParse API” so it is easy to identify when you have multiple credentials configured.


Step 2: Build the Ingestion Trigger

Your pipeline needs to start somewhere. Common trigger patterns for SMB document workflows:

  • Email trigger — monitor a Gmail or Outlook inbox for messages with PDF attachments. Useful for invoice processing where vendors email PDFs directly.
  • Webhook trigger — accept PDF uploads from a web form or an internal tool. Good for HR or legal teams uploading documents on demand.
  • File watcher — monitor a Dropbox or Google Drive folder. Works well for accounting teams who already have a “drop files here” habit.
  • Manual trigger — for testing and ad-hoc runs during development.

Add a condition node after the trigger to verify the file is actually a PDF. Check both the file extension and the MIME type (application/pdf). This prevents edge cases where a renamed file or a forwarded email attachment wastes your parsing quota.
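
The check above can be sketched in Python to make the logic explicit. This is an illustration, not n8n configuration: the function name looks_like_pdf is hypothetical, and the optional signature check on the first bytes of the file is an extra safeguard assumed here, not something the workflow requires.

```python
from pathlib import PurePath

PDF_MIME_TYPE = "application/pdf"
PDF_SIGNATURE = b"%PDF-"  # PDF files normally begin with this marker


def looks_like_pdf(filename: str, mime_type: str, head: bytes = b"") -> bool:
    """Return True only when the extension and MIME type both indicate a PDF.

    `head` is optional. When the first bytes of the file are supplied,
    they must also start with the PDF signature.
    """
    has_pdf_extension = PurePath(filename).suffix.lower() == ".pdf"
    # MIME types may carry parameters, e.g. "application/pdf; charset=binary".
    has_pdf_mime = mime_type.split(";")[0].strip().lower() == PDF_MIME_TYPE
    has_pdf_signature = head.startswith(PDF_SIGNATURE) if head else True
    return has_pdf_extension and has_pdf_mime and has_pdf_signature
```

In n8n the same two comparisons would typically live in an IF node, with the false branch ending the run before any parsing quota is spent.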


Step 3: Submit the PDF to LlamaParse

LlamaParse uses an asynchronous job model. You upload the document, get back a job ID, then poll for completion.

Create an HTTP Request node:

  • Method: POST
  • URL: https://api.cloud.llamaindex.ai/api/v1/parsing/upload
  • Authentication: the Header Auth credential you created
  • Body Type: Form-Data
  • Parameters: file mapped to the binary PDF data from your trigger

LlamaParse accepts an optional parsing_instruction parameter. This is worth using. A targeted instruction improves accuracy on domain-specific documents. Examples that work well in practice:

  • For invoices: “Extract all table data including line items, quantities, unit prices, and totals. Preserve column headers.”
  • For lease agreements: “Extract key dates, party names, rent amounts, and term length. Identify clause headings.”
  • For payroll summaries: “Extract all rows from earnings and deductions tables. Preserve row labels.”

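The API responds with a job identifier, referred to in this guide as job_id. Store this value; you need it for the next steps. Check the actual response before wiring up the expressions below: the LlamaParse REST API may return the identifier under a field named id, in which case the expressions should reference that field instead.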
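
For testing the upload outside n8n, the same request can be assembled with the Python standard library. This is a minimal sketch under stated assumptions: build_upload_request is a hypothetical helper, and sending parsing_instruction as an ordinary form field alongside file is an assumption about how the API expects it.

```python
import urllib.request
import uuid

UPLOAD_URL = "https://api.cloud.llamaindex.ai/api/v1/parsing/upload"


def build_upload_request(pdf_bytes, filename, api_key, parsing_instruction=None):
    """Build (but do not send) a multipart/form-data upload request."""
    boundary = uuid.uuid4().hex
    parts = []
    if parsing_instruction:
        # Assumption: the instruction travels as a plain form field.
        parts.append((
            f"--{boundary}\r\n"
            'Content-Disposition: form-data; name="parsing_instruction"\r\n\r\n'
            f"{parsing_instruction}\r\n"
        ).encode("utf-8"))
    # The PDF itself goes in the "file" field, as in the n8n node above.
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    parts.append(file_header + pdf_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        UPLOAD_URL,
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to urllib.request.urlopen would perform the upload; the sketch stops short of that so it can be inspected without spending quota. Filenames containing double quotes would need escaping first.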


Step 4: Poll for Parsing Completion

PDF parsing is not instant. Complex multi-page documents can take 60 to 90 seconds. You need a polling loop.

Add a Wait node set to 10 seconds, then an HTTP Request node to check job status:

  • Method: GET
  • URL: https://api.cloud.llamaindex.ai/api/v1/parsing/job/{{$json["job_id"]}}

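The response includes a status field, described here as pending, processing, or completed. The exact strings depend on the API version (the live API may report uppercase values such as PENDING, SUCCESS, and ERROR), so inspect a real response and match on what it actually returns. Use a Switch node to branch: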

  • completed → proceed to result retrieval
  • pending or processing → loop back to the Wait node
  • anything else → route to error handling

Add a counter to your loop. If a job has not completed after 20 polling cycles (roughly three to four minutes total), flag it as failed and trigger a notification. This prevents infinite loops on documents that get stuck.
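
The Wait, status check, Switch, and counter combination amounts to a bounded polling loop. The sketch below shows that logic in Python, assuming the status names used in this guide; wait_for_job and ParsingFailed are hypothetical names, and get_status stands in for the HTTP status request.

```python
import time
from typing import Callable

DONE = "completed"
IN_PROGRESS = {"pending", "processing"}


class ParsingFailed(Exception):
    """Raised when a job errors out or exceeds the polling budget."""


def wait_for_job(
    get_status: Callable[[], str],
    max_cycles: int = 20,
    wait_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until the job completes; return the number of status checks used."""
    for cycle in range(1, max_cycles + 1):
        sleep(wait_seconds)       # the Wait node
        status = get_status()     # the status HTTP Request node
        if status == DONE:
            return cycle          # proceed to result retrieval
        if status not in IN_PROGRESS:
            raise ParsingFailed(f"unexpected status: {status!r}")
    # The counter hit its limit: stop looping and let the caller notify someone.
    raise ParsingFailed(f"job still running after {max_cycles} checks")
```

With the defaults, the loop gives up after 20 checks at 10-second intervals, which matches the three-to-four-minute ceiling described above.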


Step 5: Retrieve the Parsed Output

Once the job status is completed, retrieve the Markdown output:

  • Method: GET
  • URL: https://api.cloud.llamaindex.ai/api/v1/parsing/job/{{$json["job_id"]}}/result/raw/markdown

The response is a Markdown string. Tables are formatted with pipe syntax, which preserves column-row relationships. A parsed invoice might look like:

## Line Items

| Description | Qty | Unit Price | Total |
|-------------|-----|------------|-------|
| Consulting services | 8 | 150.00 | 1200.00 |
| Travel expenses | 1 | 85.00 | 85.00 |

**Invoice Total**: 1285.00

This is much cleaner than raw OCR output and significantly easier for a language model to process accurately in the next step.
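
Because the tables use plain pipe syntax, they are also easy to read programmatically. The pipeline in this guide hands the Markdown to an AI model instead, but a small parser like the one below can serve as a sanity check, for example to compare the number of table rows against the number of extracted line items. It is a minimal sketch: parse_pipe_table is a hypothetical helper that reads only the first table and does not handle escaped pipes inside cells.

```python
def parse_pipe_table(markdown: str) -> list[dict[str, str]]:
    """Convert the first pipe-syntax table found in `markdown` into row dicts."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if line.startswith("|") and line.endswith("|"):
            # Drop the outer pipes, then split the remaining cells.
            rows.append([cell.strip() for cell in line[1:-1].split("|")])
        elif rows:
            break  # the first table has ended
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator
    return [dict(zip(header, row)) for row in body]
```

Every value comes back as a string, so amounts such as "1200.00" still need converting before any arithmetic.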


Step 6: Extract Structured Data with an AI Model

The Markdown output still needs to be converted into row-and-column data for a spreadsheet. This is where a language model API handles the heavy lifting.

Add an HTTP Request node pointed at the Claude API (or OpenAI, if you prefer) with a structured extraction prompt. The key is being explicit about the output format you want. A prompt that works consistently:

You are a document data extraction assistant. Extract structured data from the following parsed document and return valid JSON only — no explanation, no markdown wrapper.

Document content:
{{$json["content"]}}

Return this structure:
{
  "document_type": "invoice | receipt | report | contract | other",
  "document_date": "YYYY-MM-DD or null",
  "counterparty_name": "string or null",
  "total_amount": number or null,
  "currency": "string or null",
  "line_items": [
    {
      "description": "string",
      "quantity": number or null,
      "unit_price": number or null,
      "line_total": number or null
    }
  ],
  "notes": "any other relevant extracted information"
}

Rules:
- Remove currency symbols from numeric fields
- Normalize all dates to YYYY-MM-DD
- Use null for any field that cannot be determined from the document
- Return only valid, parseable JSON

Adjust the schema for your use case. A real estate brokerage processing lease agreements needs different fields than an HVAC contractor processing supplier invoices. The schema is the part you customize per workflow; the polling and parsing logic stays the same.
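
Even with a strict prompt, it is worth validating what the model returns before anything reaches the spreadsheet. The sketch below assumes the schema shown above; parse_extraction is a hypothetical helper, and stripping a surrounding code fence is a defensive assumption about how models sometimes respond.

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable

REQUIRED_KEYS = {
    "document_type", "document_date", "counterparty_name",
    "total_amount", "currency", "line_items", "notes",
}


def parse_extraction(raw: str) -> dict:
    """Parse the model's reply and check it against the expected schema."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (with any language tag) and the closing fence.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    data = json.loads(text)  # raises json.JSONDecodeError (a ValueError) if malformed
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(data["line_items"], list):
        raise ValueError("line_items must be a list")
    return data
```

Any ValueError raised here is a signal to route the document to error handling rather than write a partial row.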


Step 7: Write to Google Sheets

Parse the JSON response from the AI model and map it to your spreadsheet columns. In n8n, the Google Sheets node with the Append operation handles this well.

A few practical notes:

  • Create a dedicated “Imports” tab in your spreadsheet for automated data. Keep your analysis or reporting tabs separate. This gives you a clean audit trail and prevents automated writes from overwriting formulas.
  • For line item arrays, you may need to flatten them first. n8n’s Code node or Split Out node can expand an array of line items into individual rows.
  • Add a timestamp column populated with {{$now}} so you know when each row was processed. Useful for debugging and for reconciling with source documents later.
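
Flattening line items can be sketched as follows: each line item becomes one spreadsheet row that repeats the document-level fields and carries a processing timestamp. The column order and the name flatten_line_items are assumptions for illustration; in n8n the equivalent logic would sit in a Code node or be handled by the Split Out node.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical column order for the "Imports" tab.
COLUMNS = [
    "processed_at", "document_date", "counterparty_name", "currency",
    "description", "quantity", "unit_price", "line_total",
]


def flatten_line_items(doc: dict, processed_at: Optional[str] = None) -> list[list]:
    """Expand one extracted document into one row per line item."""
    stamp = processed_at or datetime.now(timezone.utc).isoformat()
    rows = []
    for item in doc.get("line_items") or []:
        # Merge document-level and item-level fields, then pick the columns in order.
        record = {**doc, **item, "processed_at": stamp}
        rows.append([record.get(column) for column in COLUMNS])
    return rows
```

A document with no line items produces no rows, so documents such as contracts would need a separate one-row-per-document mapping.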

Error Handling: The Part Most Tutorials Skip

In production, workflows fail. PDFs get corrupted, API keys expire, rate limits get hit, documents arrive in unexpected formats. Without proper error handling, these failures are silent until someone notices data is missing.

Build these into your workflow from the start:

Rate limit handling: Both LlamaParse and AI model APIs have rate limits. If you are processing batches, add a Wait node between iterations. Implement exponential backoff on retry — wait 10 seconds on the first retry, 30 on the second, 90 on the third, then alert and abandon.
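
The retry schedule described above (10, 30, then 90 seconds, then give up) looks like this as a Python sketch. The helper name call_with_backoff is hypothetical, and operation stands in for whichever API call is being retried.

```python
import time


def call_with_backoff(operation, delays=(10, 30, 90), sleep=time.sleep,
                      retry_on=(Exception,)):
    """Run operation(); on failure wait 10s, 30s, then 90s between retries."""
    # One initial attempt plus one retry per delay; None marks the final attempt.
    for delay in (*delays, None):
        try:
            return operation()
        except retry_on:
            if delay is None:
                raise  # retries exhausted: let the caller alert and abandon
            sleep(delay)
```

In practice retry_on would be narrowed to rate-limit and transient network errors, so that a genuine failure such as an invalid API key is reported immediately instead of being retried.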

Corrupted or unreadable files: Add a validation check after the LlamaParse result is retrieved. If the Markdown output is empty or below a minimum character threshold, route to a manual review queue rather than attempting AI extraction.
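
As a sketch, the validation check can be a single comparison. The threshold of 200 characters is an arbitrary placeholder and needs_manual_review is a hypothetical name; the right value depends on how short your smallest legitimate document is.

```python
MIN_CHARS = 200  # placeholder threshold; tune it against your own documents


def needs_manual_review(markdown, min_chars=MIN_CHARS):
    """Return True when parsed output is missing or too short to trust."""
    if markdown is None:
        return True
    # Ignore leading and trailing whitespace so a blank page does not pass.
    return len(markdown.strip()) < min_chars
```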

Failed AI extraction: If the AI model returns malformed JSON or an error, catch it with n8n's error handling (for example, a node's error output or an error workflow started by the Error Trigger node) and log the raw LlamaParse output alongside the error. This gives you something to work with manually without losing the parsed content.

Notifications: Set up an email or Slack notification node that fires on any workflow failure. Include the document name, the failure stage, and the error message. A daily summary of processing stats — documents processed, failures, average processing time — is worth adding once you are in production.
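
A daily summary can be computed from a simple log of runs. The sketch below assumes each run is recorded as a dictionary with document, status, seconds, stage, and error keys; that record shape and the name summarize_runs are hypothetical, chosen only for illustration.

```python
from statistics import mean


def summarize_runs(runs):
    """Summarize one day of workflow runs for an email or Slack message."""
    failures = [run for run in runs if run["status"] != "success"]
    durations = [run["seconds"] for run in runs if run["status"] == "success"]
    return {
        "processed": len(runs),
        "failures": len(failures),
        # Average processing time over successful runs only.
        "avg_seconds": round(mean(durations), 1) if durations else None,
        # Document name, failure stage, and error message for each failure.
        "failed_documents": [
            (run["document"], run.get("stage"), run.get("error"))
            for run in failures
        ],
    }
```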

In our work helping founder-led accounting and legal firms deploy document processing pipelines, the most common post-launch issue is not accuracy — it is silent failures that go unnoticed for days because no one built monitoring into the workflow from the start.


Document Types and Expected Accuracy

Accuracy varies by document type and quality. For the kinds of documents SMBs typically process, expect the following:

  • Structured invoices (consistent vendor templates): high accuracy on line items, dates, and totals
  • Financial statements and reports: good accuracy on tables; footnotes and narrative sections require more prompt tuning
  • Lease and contract documents: good on key terms and dates; nuanced legal language benefits from a more specific extraction schema
  • Scanned documents with clean scans: reasonable accuracy, degraded compared to native PDFs
  • Handwritten content or image-only PDFs: LlamaParse is not the right tool; these require a different approach

If you are processing documents from multiple vendors or sources, expect some initial tuning. The parsing instruction and extraction prompt will need adjustments for edge cases. Budget time for this in your implementation plan.


Common Mistakes to Avoid

Hardcoding API keys: Use n8n’s credential store. Never paste API keys directly into node parameters.

No loop counter on polling: Without a maximum iteration check, a stuck parsing job creates an infinite loop that consumes resources and potentially hits rate limits.

Overfitting the prompt to one document type: If your workflow will process invoices from different vendors, test with at least five to ten examples from different sources before going live.

Writing directly to a production spreadsheet: Use a staging tab for automated imports. Review before promoting to your primary dataset until you have confidence in accuracy.

Ignoring document security: PDFs often contain sensitive financial or personal data. If you are using n8n cloud, confirm you are comfortable with data transiting their infrastructure. For highly sensitive documents — HR files, legal agreements with personal data — self-hosted n8n keeps everything within your own environment.


Scaling This Workflow

The architecture described here handles one document at a time. For batch processing, you have a few options:

  • Trigger the workflow in parallel for multiple documents, with rate limit awareness built into the design
  • Use n8n’s workflow queuing to serialize processing at a controlled rate
  • For very high volume, consider a dedicated queue service upstream of n8n
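
The first option, parallel processing with a cap on how many documents are in flight, can be sketched with a thread pool. This is one simple form of rate limit awareness, not a full rate limiter; process_batch is a hypothetical name, process_one stands in for the whole single-document pipeline, and the limit of three is a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor


def process_batch(documents, process_one, max_parallel=3):
    """Process documents with at most `max_parallel` in flight at once.

    Returns (results, failures), both keyed by document name.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {doc: pool.submit(process_one, doc) for doc in documents}
        for doc, future in futures.items():
            try:
                results[doc] = future.result()
            except Exception as exc:
                # One bad document should not stop the rest of the batch.
                failures[doc] = exc
    return results, failures
```

Collecting failures separately keeps the batch moving and leaves a list that can feed the failure notifications described earlier.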

The core workflow does not need to change as volume grows. The rate limit and error handling logic becomes more important, but each document is parsed and extracted independently, so processing time and API cost grow roughly linearly with document count.


Where to Go From Here

This pipeline — PDF ingestion, LlamaParse parsing, AI extraction, spreadsheet output — is a solid foundation. Once it is running reliably, natural extensions include:

  • Routing different document types through specialized extraction schemas
  • Feeding extracted data directly into accounting software or CRM systems rather than a spreadsheet
  • Adding a human review step for low-confidence extractions before they are committed to your records
  • Building a simple dashboard showing processing volume, error rates, and document types over time

The tooling here (n8n, LlamaParse, Claude or GPT-4 APIs) is also what powers more complex document intelligence workflows — vendor onboarding automation, contract clause extraction, compliance checking. Getting comfortable with this stack opens up a significant range of automation possibilities beyond the spreadsheet use case.


If you want help scoping or implementing a document processing workflow for your team, Basalt Studio works with founder-led SMBs to design and deploy exactly this kind of AI-powered automation. You can book a strategy call to walk through your specific document types and volume: https://cal.com/eliott-ardisson-kzq7zs/ai-strategy-call