A few words on RAG.
The promise is tempting: “connect” a chatbot to your database of PDF documents, and it will answer questions about them.
In theory, it works. In practice, it often fails. Why? Because in 90% of cases, RAG systems fail not at the generation stage but at the data preparation stage: “raw” files are thrown in with the hope that the AI will “somehow figure it out.”
Today, I’m not going to show you a ready-made script to copy. I’m going to show you something much more valuable: an architectural strategy for building an intelligent document processing system. It’s a hybrid approach that combines classic precision with the contextual understanding of Large Language Models (LLMs).
Three “Chunking” Strategies – and Why Two of Them Can Be Wrong
The main problem in RAG is “chunking,” or splitting long documents into smaller pieces. A language model has a limited context window, so we must feed it only the fragments that are relevant to the question.
- The Naive / Lazy RAG Approach (Simple Splitting): Most tutorials show a simple split: “divide the text every 500 characters.” For a technical document or a legal act, this is a disaster. This method cuts sentences in half, separates definitions from examples, and worst of all – it loses all structural context. A fragment like “shall apply accordingly” means nothing without the information that it came from Section IV, Chapter 2, Art. 15a, para. 3.
- The “AI Only” (Brute Force) Approach: The other extreme is sending the entire document (e.g., 100 pages of PDF) to a super-model (like GPT-5) with the command: “split this intelligently for me.” This might work, but it will be:
- Expensive: Processing millions of tokens costs money.
- Slow: The response will take many minutes, not seconds.
- Unpredictable: The model might do it slightly differently each time.
- The Hybrid Approach (Strategy and Precision): This is the approach I design. We treat the system like a specialized team of experts:
- A Strategist (LLM / SLM): Analyzes the document structure to select the appropriate regex strategy.
- A Fast Analyst (e.g., Regex): Scans the text in milliseconds and creates a “map” of the document, finding all structural anchor points (Sections, Chapters, Articles).
- A Contextual Expert (LLM): Steps in only when absolutely necessary. We use it for “surgical” tasks.
Here’s how to translate this strategy into an architecture.
Step 1: The Foundation – Structure-Preserving Conversion
Before we start splitting anything, we must extract the text from the PDF. But we don’t want “bare” text. We want to preserve the structure – headings, lists, bold text.
That’s why, instead of a simple pdf.read(), we use tools like pymupdf4llm, which convert PDF to Markdown format.
Python
# Concept: Convert PDF to Markdown
# We use a library that understands page layouts
import pymupdf4llm
try:
    # This gives us text with preserved headers (#), lists (*), and bold (**)
    md_text = pymupdf4llm.to_markdown("my_document.pdf")
except Exception as e:
    print(f"Conversion error: {e}")
Markdown is ideal because **Art. 1.** or # SECTION I are clear structural signals that we will use in the next step.
Step 2: Map the Terrain – Quick Structural Scanning
Now, instead of sending the entire text to an LLM, we use a fast and precise parser module (e.g., based on regular expressions) to map the document’s skeleton.
We want to instantly find all occurrences of structural headings:
Python
# Concept: Structural pattern definitions
# We compile Regexes once for maximum efficiency
import re
SECTION_RE = re.compile(r"^(SECTION\s+[IVXLCDM]+[A-Z]*)", re.MULTILINE | re.IGNORECASE)
CHAPTER_RE = re.compile(r"^(Chapter\s+\d+[a-z]*)", re.MULTILINE | re.IGNORECASE)
ARTICLE_RE = re.compile(r"^(Art\.\s*\d+[a-z]*\s*\.)", re.MULTILINE | re.IGNORECASE)
# ... and so on for other elements ...
Scanning a 100-page document with these patterns takes fractions of a second. The result is a “table of contents”: a list of all Sections, Chapters, and Articles, along with their position in the text. We have a map of the terrain before the LLM even enters the game.
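To make this concrete, here is a minimal sketch of that scanning pass. The helper name and the output shape are my own assumptions; only the compiled patterns come from the snippet above:
Python
# Sketch: build the document "map" from the patterns compiled above
def build_structure_map(md_text: str) -> list[dict]:
    """Return every structural anchor with its character offset."""
    anchors = []
    for kind, pattern in [
        ("section", SECTION_RE),
        ("chapter", CHAPTER_RE),
        ("article", ARTICLE_RE),
    ]:
        for match in pattern.finditer(md_text):
            anchors.append({
                "type": kind,
                "heading": match.group(0).strip(),
                "start": match.start(),
            })
    # Sort by position so the map follows document order
    anchors.sort(key=lambda a: a["start"])
    return anchors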
Step 3: Surgical Precision – When Does the LLM Step In?
Only now do we activate the AI. But not on the whole document. We assign it only two precise, “surgical” tasks that Regex can’t handle.
Task 1: Understanding the Title
Regex won’t “understand” that “ACT of September 29, 1994, on accounting” means the logical title is “on accounting”. An LLM can do this. We send the model only the beginning of the document with a very specific prompt:
Plaintext
# Prompt Example: Title Extraction
Analyze the beginning of this legal act. Identify the main title of the act
(e.g., "on personal income tax").
Text fragment:
---
[...first 3000 characters of text...]
---
Respond ONLY in JSON format:
{
"act_title": "..."
}
The token savings are enormous, and the result is precise.
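In code, this task reduces to a few lines. Below is a sketch, assuming a hypothetical call_llm callable (any function that takes a prompt string and returns the model’s raw text; Step 4 shows where it comes from) and a TITLE_PROMPT template wrapping the prompt above:
Python
import json

# Hypothetical template wrapping the title-extraction prompt shown above
TITLE_PROMPT = """Analyze the beginning of this legal act. Identify the main title
of the act (e.g., "on personal income tax").
Text fragment:
---
{fragment}
---
Respond ONLY in JSON format: {{ "act_title": "..." }}"""

def extract_act_title(md_text: str, call_llm) -> str | None:
    # Send only the beginning of the document, never all 100 pages
    prompt = TITLE_PROMPT.format(fragment=md_text[:3000])
    try:
        return json.loads(call_llm(prompt)).get("act_title")
    except json.JSONDecodeError:
        return None  # the caller decides on a fallback (see Step 4)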
Task 2: Intelligent Splitting of Complex Articles
The main system logic knows (from Step 2) where each article begins and ends. When it encounters an article that is very long (e.g., > 800 characters), it knows a simple split will fail.
At that point, it delegates the task to the LLM, sending it only the content of that single article:
Plaintext
# Prompt Example: Article Splitting
You are a legal expert. Split the following article text into a logical hierarchy:
paragraphs (e.g., "1.", "2."), points (e.g., "1)", "2)"), sub-points (e.g., "a)", "b)").
Preserve the full content.
Text to split:
---
[...content of Art. 15...]
---
Return ONLY a list of JSON objects:
[
{ "type": "paragraph", "number": "1", "content": "..." },
{ "type": "point", "number": "1)", "content": "...", "parent_type": "paragraph", "parent_number": "1" }
]
This is the heart of the hybrid strategy. Regex does 90% of the work (finding the article), and the LLM does the remaining 10% – the analytical work that Regex cannot do.
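The dispatch logic itself stays tiny. Here is a sketch assuming the same hypothetical call_llm callable and a SPLIT_PROMPT template wrapping the prompt above (error handling is deferred to Step 4):
Python
import json

MAX_SIMPLE_ARTICLE_LEN = 800  # threshold from the text; tune it per corpus

# Hypothetical template wrapping the splitting prompt shown above
SPLIT_PROMPT = """You are a legal expert. Split the following article text into a
logical hierarchy. Preserve the full content.
Text to split:
---
{article}
---
Return ONLY a list of JSON objects."""

def chunk_article(article_text: str, call_llm) -> list[dict]:
    if len(article_text) <= MAX_SIMPLE_ARTICLE_LEN:
        # Short article: one contextually complete chunk, no LLM call needed
        return [{"type": "article", "content": article_text}]
    # Long article: delegate the "surgical" split to the LLM
    return json.loads(call_llm(SPLIT_PROMPT.format(article=article_text)))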
Step 4: Future-Proof Architecture – Abstraction and Resilience
Two things distinguish a good IT system from a prototype: flexibility and resilience to errors.
1. Flexibility (LLM Provider Abstraction): Our main system shouldn’t “know” whether it’s using a free Ollama on a local machine or a paid Azure OpenAI. We design this using an abstract class.
Python
# Concept: Modular architecture
from abc import ABC, abstractmethod

class LLMProvider(ABC):
    """Interface for any language model provider."""

    @abstractmethod
    def execute_prompt(self, prompt: str) -> str:
        pass

class OllamaProvider(LLMProvider):
    """Specific implementation for Ollama."""

    def execute_prompt(self, prompt: str) -> str:
        # ... logic for communicating with Ollama goes here ...
        llm_response = ""  # placeholder for the actual API call
        return llm_response

class AzureOpenAIProvider(LLMProvider):
    """Implementation for Azure."""

    def execute_prompt(self, prompt: str) -> str:
        # ... logic for communicating with the Azure API goes here ...
        llm_response = ""  # placeholder for the actual API call
        return llm_response
This is architectural thinking. If tomorrow the client says “we’re switching to Azure,” we change one line at the system’s initialization, and the rest of the logic works unchanged.
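A minimal usage sketch (reusing the hypothetical extract_act_title helper from Step 3; any callable that maps a prompt to text will do):
Python
# The rest of the pipeline depends only on the interface:
provider: LLMProvider = OllamaProvider()
# provider = AzureOpenAIProvider()  # the promised one-line switch

act_title = extract_act_title(md_text, provider.execute_prompt)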
2. Resilience (Fallback Logic): What if the LLM returns garbage, an empty response, or JSON that can’t be parsed? A prototype system would crash. A production system must be ready for this.
Our main logic must include fallback mechanisms. If the intelligent article split (Step 3, Task 2) fails:
- The system logs the error.
- It does not stop processing.
- Instead, it applies a fallback method (e.g., treats the entire article text as one, large chunk).
It is better to have one large but contextually correct chunk than to let the entire system fail. This is a mature engineering approach to risk management; the sketch below shows what it looks like in code.
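Here is a minimal sketch of that fallback, wrapping the chunk_article helper from Step 3. The exception list is an assumption; tailor it to your provider’s actual failure modes:
Python
import json
import logging

logger = logging.getLogger("chunker")

def chunk_article_safe(article_text: str, call_llm) -> list[dict]:
    try:
        return chunk_article(article_text, call_llm)
    except (json.JSONDecodeError, ValueError, TimeoutError) as e:
        # Log and keep going: one large but contextually correct chunk
        logger.warning("LLM split failed, using whole-article fallback: %s", e)
        return [{"type": "article", "content": article_text}]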
Conclusion: Architecture Over Hype
By building a system this way, we get a solution that is:
- Precise: Thanks to context-aware chunks, our future RAG system will provide accurate answers.
- Efficient: Regex does most of the heavy lifting, and the LLM is used only where necessary, saving time and money.
- Resilient: The system is prepared for LLM errors and can handle them without crashing.
- Flexible: Thanks to provider abstraction, we can easily swap the AI “brain” underneath.
This is precisely my approach to technology, shaped by years of practice – both in IT and in karate. It’s not about using the latest, flashiest tools. It’s about strategically choosing the right tools and using them with precision, building systems that are reliable, scalable, and truly solve business problems.
