Building a PDF-Based RAG Workflow: Step-by-Step Guide
Introduction
Although current pre-trained large language models (LLMs) have demonstrated strong generalizability across various tasks, they often underperform on downstream natural language processing (NLP) tasks due to a lack of domain-specific knowledge. Retrieval-augmented generation (RAG) [1] emerged to address this challenge by retrieving relevant data from a knowledge base to augment the input prompts of LLMs, thereby enhancing their performance on specific tasks. For most users, the relevant data is stored in PDF format or can easily be converted into PDFs. Therefore, this document introduces how to build a knowledge base from your own PDFs and implement PDF-based RAG.
PDF-based RAG generally involves the following three steps:
- PDF Parsing and Text Database Construction: To work with common embedding models that are pre-trained on text, PDFs need to be parsed and split into smaller text chunks.
- Text Database Indexing: An embedding model is used to convert each text chunk into vector representations. Any input query should be processed by the same embedding model.
- Vector Database Preprocessing and Retrieval: Relevant content in the vector database is identified based on the query, and the corresponding text is returned to augment the model's prompt.
In this blog, we will walk you through how to quickly build a PDF-based RAG workflow using the Colossal-AI open-source repository. Colossal-AI is a highly starred training framework on GitHub that natively supports AI applications such as RFT and RAG. To fully leverage the capabilities of Colossal-AI, we recommend following the steps below on HPC-AI.com, our high-performance GPU platform offering powerful H200 and B200 GPUs.
Prerequisite
- Clone the Colossal-AI repository and install dependencies
git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI
git checkout hpc-ai-cloud
cd examples/cloud/pdf-rag
pip install -r requirements.txt
- Download the Colossal-AI paper as an example
wget https://arxiv.org/pdf/2110.14883.pdf
PDF Parsing and Text Database Construction
There are two kinds of approaches for extracting text from PDFs: visual-based and parser-based.
- For PDFs that present content in image form, or can at least be converted into images, Optical Character Recognition (OCR) and Table Structure Recognition (TSR) serve as crucial solutions for text extraction.
- However, most PDF files allow direct text extraction through parsers, which is the most resource-efficient and straightforward method. In this blog, we only present the parser-based solution.
Similar to many other solutions, we utilize PyPDF [2] – a free, open-source pure-Python PDF library that enables parsing PDF pages. Additionally, we have incorporated the Splitter and Cleaner from the open-source project Haystack [3] to further enhance the quality of the extracted text.
Parse the PDF with the following command:
python PDF-RAG/parse.py
Make sure to modify the following variables in parse.py before execution:
- Set PDF_DIRECTORY to your PDF folder path.
- Set OUTPUT_JSON to your desired output path.
Output JSON file example:
{
  "file_path": "2110.14883.pdf",
  "source_id": "ae0987761ddac39396563ddd296661e40db478f2656abab270128af8ff4297ec",
  "page_number": 1,
  "split_id": 20,
  "split_idx_start": 2730,
  "id": "fa190e40280825b20e429f5985886e7d0d5962b09ad0101682306212686e309e",
  "content": "INTRODUCTION\nDeep learning has been successful in many applications and brought\nbreakthroughs in difficult problems. "
}
The elements in the JSON file include the split sentences (content) and their corresponding metadata (file_path, page_number, etc.). The generated JSON file will be used for subsequent embedding.
Text Database Indexing
The current prevalent indexing approach involves using an embedding model (artificial neural networks or Transformer-based models) to map source text into a semantic vector space. In this space, semantically similar terms are positioned closer together, enabling a search paradigm known as dense vector search. For example, word-level embeddings can effectively capture relationships between individual terms based on lexical or contextual similarity.
However, when dealing with longer units of text — such as sentences or paragraphs — the word-level embeddings often fall short in preserving the overall meaning. To achieve more accurate representations at these levels, it is important to use specialized embedding models trained specifically for sentence-level semantics. For many downstream tasks, performance can be significantly improved by choosing models that have been pre-trained or fine-tuned on data relevant to the target language and domain.
We use the relatively versatile sentence-level embedding model BAAI/bge-base-en-v1.5 [4] as an example. Generate embeddings with the following command:
python PDF-RAG/embedding.py
In the config dictionary in embedding.py, modify the following fields before execution:
- json_path: set to the path of your input JSON file.
- embedding_path: set to the desired output path for the embedding file (.npy).
The shape of each vector in the embedding database is determined by the embedding model; for the model used here, the output vector shape is as follows.
query_vector_shape (1, 768)
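For reference, here is a minimal sketch of the embedding step, assuming the sentence-transformers library is used to load BAAI/bge-base-en-v1.5. The actual embedding.py may load the model differently, and the file names below are placeholders.

```python
# Minimal embedding sketch; assumes sentence-transformers is installed.
# File names are placeholders, not the exact config used by embedding.py.
import json

import numpy as np
from sentence_transformers import SentenceTransformer

json_path = "splits.json"          # output of the parsing step
embedding_path = "embeddings.npy"  # desired output .npy file

with open(json_path) as f:
    records = json.load(f)

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
texts = [r["content"] for r in records]

# Normalized embeddings so that inner product equals cosine similarity.
embeddings = model.encode(texts, normalize_embeddings=True)
print(embeddings.shape)  # (num_chunks, 768)

np.save(embedding_path, embeddings)
```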
Vector Database Preprocessing and Retrieval
This section demonstrates the final preprocessing step of building a vector database to enable retrieval of content relevant to a given query. Without preprocessing, searching for source vectors similar to the query vector in this dense vector space would be very time-consuming. To address this, we introduce the Hierarchical Navigable Small World (HNSW) algorithm [5], a key method for approximate nearest neighbor search in high-dimensional vector databases, to preprocess the vector database and conduct approximate searches, thereby significantly reducing search latency. This approach achieves a substantial speedup (from 23 seconds to 1 second on a 10k-paper dataset) with only a marginal sacrifice in retrieval accuracy.
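To give a feel for this step, here is a minimal hnswlib sketch of building an index over the chunk embeddings and querying it. The parameter values (M, ef_construction, ef) and file names are illustrative, not necessarily those used in HNSW_retrieve.py.

```python
# Minimal HNSW sketch with hnswlib; parameters and file names are illustrative.
import hnswlib
import numpy as np
from sentence_transformers import SentenceTransformer

embeddings = np.load("embeddings.npy").astype(np.float32)  # (num_chunks, 768)
dim = embeddings.shape[1]

index = hnswlib.Index(space="cosine", dim=dim)  # 'l2' and 'ip' are also supported
index.init_index(max_elements=embeddings.shape[0], ef_construction=200, M=16)
index.add_items(embeddings, np.arange(embeddings.shape[0]))
index.set_ef(50)  # higher ef -> more accurate but slower queries

# Embed the query with the same model used for the chunks, then search.
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
query_vector = model.encode(["How does Colossal-AI parallelize training?"],
                            normalize_embeddings=True)  # shape (1, 768)
labels, distances = index.knn_query(query_vector, k=5)
```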
To further enhance retrieval quality, we incorporate BM25 [6] as part of our search strategy. While vector search captures semantic similarity in high-dimensional space, BM25 excels at identifying exact keyword matches based on term frequency and document length. By combining both methods, hybrid search leverages the strengths of semantic and lexical retrieval, improving the accuracy and relevance of the retrieved results.
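The lexical side can be sketched with the rank_bm25 package, as below; whether HNSW_retrieve.py uses this exact package is an assumption, and the whitespace tokenization is deliberately simple.

```python
# BM25 sketch using the rank_bm25 package; tokenization is deliberately simple.
import json

from rank_bm25 import BM25Okapi

with open("splits.json") as f:   # output of the parsing step (placeholder name)
    records = json.load(f)

corpus = [r["content"] for r in records]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "How does Colossal-AI parallelize training?"
scores = bm25.get_scores(query.lower().split())
top_bm25 = scores.argsort()[::-1][:5]   # indices of the five best lexical matches
```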
Additionally, we illustrate how user queries are processed. Text-based queries are first mapped to the same vector space using the same embedding model employed during index construction. Subsequently, the most similar content from the sources is retrieved using a predefined similarity algorithm.
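Continuing from the two sketches above, one common way to merge the semantic and lexical rankings is reciprocal rank fusion (RRF). Whether HNSW_retrieve.py merges results exactly this way is an assumption; the sketch only shows the general idea.

```python
# Reciprocal rank fusion of the HNSW and BM25 rankings (illustrative merge rule).
def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """rankings: list of ranked lists of chunk indices (best first)."""
    scores = {}
    for ranking in rankings:
        for rank, idx in enumerate(ranking):
            scores[idx] = scores.get(idx, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# labels[0] comes from the HNSW query, top_bm25 from the BM25 scores above.
merged = reciprocal_rank_fusion([list(labels[0]), list(top_bm25)])
retrieved_data = "\n".join(records[i]["content"] for i in merged)
```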
Retrieve the relevant content using the following command:
python PDF-RAG/HNSW_retrieve.py
When executing the above command, you can consider the following adjustments:
- Modify the embedding path and JSON path (PATH_TO_YOUR_JSON.json).
- Print the construction and retrieval times of HNSW and BM25, as well as the merged retrieval results.
- Adjust the parameters of HNSW and BM25 as described to obtain the desired results. For example, in hnswlib.Index(), space can be set to 'l2' for squared L2 distance, 'ip' for inner product, or 'cosine' for cosine similarity.
The retrieved text is then combined with the query into an augmented prompt, for example:
def generate_rag_prompt(query: str, data: str) -> str:
    prompt_template = "Answer the query using the provided data. Query: {query}. Data: {data}. Respond based on the data."
    return prompt_template.format(query=query, data=data)

prompt = generate_rag_prompt(query, retrieved_data)
Result Example:
[
'In a nutshell, we\nbring the following major contributions to large-scale distributed\ntraining in this work:\n•Colossal-AI is a unified deep learning system that provides\nthe fullest set of acceleration techniques for the AI commu-\nnity. ',
'We have\nthus developed Colossal-AI, which is an open-source system to\ndemocratize complicated distributed training in the AI community\nby unifying an array of training acceleration techniques in one deep\nlearning system. ',
'3.3 Automatic Parallelization on Dynamic\nComputation Graph\nInspired by Alpa [42], Colossal-AI has included an experimental\nautomatic parallelism feature to improve upon the Alpa project.\n',
'7 CONCLUSION\nIn this work, we designed and implemented Colossal-AI which\nintegrated a vast number of advanced acceleration techniques into\none unified system for large-scale distributed training. ',
'Colossal-AI\ncomes with a flexible system design that supports an easy combina-\ntion of different parallelism methods. ',
'3.1 Multi-dimensional model parallelism\nFirst of all, Colossal-AI provides an array of model parallelism\nmethods to cater to the needs of distributed training. ',
'In Colossal-AI, we implemented an adaptive hybrid Adam optimizer\ninstead. ',
'As a system under constant development, Colossal-\nAI provides various interfaces to implement customized functions\nfor future extensions. ',
'The Colossal-AI system addressed\nthe above challenge by introducing a unified interface to scale your\nsequential code of model training to distributed environments. ',
'As a\x0cColossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training ICPP 2023, August 7–10, 2023, Salt Lake City, UT, USA\nGPU 0 GPU 1\ndata data\n(a) data parallel\nGPU 0 GPU 1\ndata data (b) tensor parallel\nGPU 0 GPU 1\ndata (c) pipeline parallel\nFigure 3: Existing parallelism for distributed training\nresult, model parallelism was proposed to tackle this problem. '
]
Now you can use this prompt with any LLM you like!
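For example, the augmented prompt can be sent to any OpenAI-compatible endpoint; the base_url, model name, and API key below are placeholders to replace with your own deployment (e.g., one running on HPC-AI.com).

```python
# Placeholder endpoint, model name, and API key -- replace with your own deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": prompt}],  # prompt from generate_rag_prompt
)
print(response.choices[0].message.content)
```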
PDF-based RAG makes it easy to unlock insights from unstructured documents using powerful retrieval-augmented generation techniques. At HPC-AI.com, we help you smoothly deploy AI workflows with scalable RAG pipelines that efficiently process PDFs and provide accurate, domain-aware responses. Instructions to help you get started are available on HPC-AI.com.
Have any other questions or need help? Drop us a line at service@hpc-ai.com — we would be glad to assist.
References
[1] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In NeurIPS 2020.
[2] pypdf https://pypdf.readthedocs.io/en/stable/
[3] Pietsch, M., Soni, T., Chan, B., Möller, T., & Kostić, B. (2020). Haystack (Version 0.5.0). GitHub. https://github.com/deepset-ai/haystack/
[4] Xiao, S., Liu, Z., Zhang, P., & Muennighoff, N. (2023). C-Pack: Packaged Resources To Advance General Chinese Embedding. arXiv, abs/2309.07597. https://doi.org/10.48550/arxiv.2309.07597
[5] Y. A. Malkov and D. A. Yashunin, “Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 4, pp. 824–836, 1 April 2020, doi: 10.1109/TPAMI.2018.2889473.
[6] Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Now Publishers. doi: 10.1561/1500000019.