Document Loading

Overview

The DocumentLoader class provides a simple interface to load documents from various file formats. It uses MarkItDown under the hood, which supports:

PDF files
Word documents (DOCX, DOC)
PowerPoint presentations (PPTX, PPT)
Excel spreadsheets (XLSX, XLS)
Images with OCR (PNG, JPG, JPEG)
HTML and Markdown files
Plain text files

Basic Usage

from mini.loader import DocumentLoader

# Initialize loader
loader = DocumentLoader()

# Load a single document
text = loader.load("document.pdf")
print(f"Loaded {len(text)} characters")

Loading Multiple Documents

# Load multiple files
texts = loader.load_documents([
    "document1.pdf",
    "document2.docx",
    "presentation.pptx"
])

print(f"Loaded {len(texts)} documents")
for i, text in enumerate(texts):
    print(f"Document {i+1}: {len(text)} characters")

Loading from Directory

# Load all documents from a directory
texts = loader.load_documents_from_directory("./knowledge_base/")

print(f"Loaded {len(texts)} documents from directory")

The load_documents_from_directory method recursively scans the directory and loads all supported file formats.
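
To make the recursive scan concrete, here is a minimal sketch of how such a scan could be written with pathlib. It is not the loader's actual implementation: find_supported_files and SUPPORTED_EXTENSIONS are hypothetical names, and the extension set is an assumption derived from the format list in this guide.

```python
from pathlib import Path

# Assumption: extensions taken from the format list in this guide;
# the loader's real list may differ.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls",
    ".png", ".jpg", ".jpeg", ".html", ".md", ".txt",
}


def find_supported_files(directory):
    """Recursively collect files whose extension is in SUPPORTED_EXTENSIONS."""
    root = Path(directory)
    return sorted(
        path
        for path in root.rglob("*")
        # Compare lowercased suffixes so "REPORT.PDF" is also picked up.
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

A listing like this can be useful for previewing which files a directory load would touch before running it.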

Supported File Formats

PDF Files (.pdf)

Extracts text from PDF documents, including scanned PDFs with OCR.

text = loader.load("report.pdf")

Word Documents (.docx, .doc)

Parses Microsoft Word documents and extracts formatted text.

text = loader.load("proposal.docx")

PowerPoint (.pptx, .ppt)

Extracts text and notes from PowerPoint presentations.

text = loader.load("presentation.pptx")

Excel Spreadsheets (.xlsx, .xls)

Reads data from Excel files, including multiple sheets.

text = loader.load("data.xlsx")

Images (.png, .jpg, .jpeg)

Uses OCR to extract text from images.

text = loader.load("screenshot.png")

HTML and Markdown

Parses HTML and Markdown files.

text = loader.load("readme.md")

Plain Text (.txt)

Loads plain text files with proper encoding detection.

text = loader.load("notes.txt")

Integration with AgenticRAG

When using AgenticRAG, document loading is handled automatically:

from mini import AgenticRAG

rag = AgenticRAG(vector_store=vector_store, embedding_model=embedding_model)

# The loader is used internally
rag.index_document("document.pdf")

# Load multiple documents
rag.index_documents([
    "doc1.pdf",
    "doc2.docx",
    "doc3.pptx"
])

Advanced Usage

Custom Processing

You can use the loader to get text and then apply custom processing:

import re

from mini.loader import DocumentLoader
from mini.chunker import Chunker

loader = DocumentLoader()
chunker = Chunker()

# Load and process
text = loader.load("document.pdf")

# Apply custom preprocessing
text = re.sub(r"\n{3,}", "\n\n", text)  # Collapse runs of 3+ newlines into one blank line
text = text.strip()

# Chunk the processed text
chunks = chunker.chunk(text)

Batch Loading with Metadata

documents = [
    {"path": "2023_report.pdf", "metadata": {"year": 2023, "type": "report"}},
    {"path": "2024_report.pdf", "metadata": {"year": 2024, "type": "report"}},
]

for doc in documents:
    text = loader.load(doc["path"])
    # Process text with metadata
    # ...
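
One way to keep each text together with its metadata is to build a list of records. The sketch below is an illustration only: load_with_metadata is a hypothetical helper, not part of the library, and it accepts any callable that maps a path to text (for example loader.load).

```python
def load_with_metadata(documents, load):
    """Return one record per document: its path, extracted text, and metadata.

    `load` is any callable mapping a path to text, e.g. `loader.load`.
    """
    records = []
    for doc in documents:
        text = load(doc["path"])
        records.append({
            "path": doc["path"],
            "text": text,
            # Copy so later edits to a record do not touch the input dicts.
            "metadata": dict(doc.get("metadata", {})),
        })
    return records
```

Calling load_with_metadata(documents, loader.load) would then give one dictionary per file, ready to pass on to chunking or indexing code.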

Error Handling

The loader raises an exception when a file is missing or cannot be converted. Catch these around each load call:

try:
    text = loader.load("document.pdf")
except FileNotFoundError:
    print("Document not found")
except Exception as e:
    print(f"Error loading document: {e}")

Some inputs (scanned PDFs and images in particular) need OCR, which may require additional system dependencies such as Tesseract. See Troubleshooting for install commands.

Best Practices

File Format Selection

Use text-based formats (TXT, MD) when possible for fastest loading
PDFs work well but may be slower to process
Images require OCR and may have accuracy limitations

Document Preprocessing

Clean documents before indexing (remove headers, footers, etc.)
Ensure proper formatting for better chunking
Remove unnecessary content that won’t be useful for retrieval
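
As an illustration of header and footer removal, the sketch below drops lines that repeat many times in the extracted text, which is typical of running headers and footers. This is a heuristic under stated assumptions, not a feature of the loader: drop_repeated_lines and min_repeats are hypothetical names, and the threshold needs tuning per corpus because genuinely repeated content would be removed too.

```python
from collections import Counter


def drop_repeated_lines(text, min_repeats=3):
    """Remove non-blank lines that occur at least `min_repeats` times."""
    lines = text.splitlines()
    # Count lines by their stripped form so indentation differences do not matter.
    counts = Counter(line.strip() for line in lines if line.strip())
    # Blank lines are never counted, so they are always kept.
    kept = [line for line in lines if counts[line.strip()] < min_repeats]
    return "\n".join(kept)
```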

Batch Processing

Load multiple documents in batches for efficiency
Consider using multiprocessing for large document sets
Monitor memory usage when loading many large documents
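
The batching advice above can be sketched as follows. Note the swap: this example uses a thread pool rather than multiprocessing to stay short, and it assumes the load callable is safe to call from several threads, which this guide does not confirm. iter_batches and load_in_batches are hypothetical helpers.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def iter_batches(paths, batch_size):
    """Yield successive lists of up to `batch_size` paths."""
    it = iter(paths)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def load_in_batches(load, paths, batch_size=8, max_workers=4):
    """Yield (path, text) pairs, holding at most one batch of texts at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch in iter_batches(paths, batch_size):
            # pool.map preserves input order within the batch.
            texts = list(pool.map(load, batch))
            yield from zip(batch, texts)
```

Because load_in_batches is a generator, each pair can be processed and discarded before the next batch is loaded, which keeps peak memory bounded by the batch size.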

Error Handling

Always wrap document loading in try-except blocks
Log failed documents for review
Implement retry logic for transient failures
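
A minimal sketch of these practices is shown below. load_with_retry, attempts, and delay are hypothetical names, and treating OSError as possibly transient is an assumption; which failures are worth retrying depends on where your files live.

```python
import logging
import time

logger = logging.getLogger(__name__)


def load_with_retry(load, path, attempts=3, delay=1.0):
    """Call load(path), retrying on OSError; return None if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return load(path)
        except FileNotFoundError:
            # A missing file will not appear on retry; record it and stop.
            logger.error("Document not found: %s", path)
            return None
        except OSError as exc:
            # Assumption: other OS-level errors may be transient (e.g. a network share).
            logger.warning("Attempt %d/%d failed for %s: %s", attempt, attempts, path, exc)
            if attempt < attempts:
                time.sleep(delay)
    logger.error("Giving up on %s after %d attempts", path, attempts)
    return None
```

Any other exception type propagates to the caller, so parsing errors are not silently retried.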

Performance Considerations

File Type	Speed	Accuracy	Notes
TXT, MD	Fast	Perfect	Best for text-only content
DOCX	Fast	Excellent	Handles formatting well
PDF	Medium	Good	Depends on PDF type (text vs scanned)
Images	Slow	Variable	Requires OCR, accuracy varies
PPTX	Fast	Good	Extracts slide text and notes

Troubleshooting

OCR not working for images

Solution: Install required OCR dependencies:

# macOS
brew install tesseract

# Ubuntu
sudo apt-get install tesseract-ocr

# Windows
# Download from: https://github.com/UB-Mannheim/tesseract/wiki
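
After installing, you can check from Python whether the binary is visible. This sketch assumes OCR depends on a tesseract executable on your PATH, as the install commands above suggest; tesseract_available is a hypothetical helper.

```python
import shutil


def tesseract_available():
    """Return True if a `tesseract` executable is found on PATH."""
    return shutil.which("tesseract") is not None
```

If this returns False after installation, the install directory is probably missing from PATH, which is common on Windows.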

PDF loading fails

Solution: Ensure you have the required PDF libraries:

pip install pymupdf  # or pdfplumber

Encoding errors

Solution: MarkItDown detects the encoding automatically, so no encoding argument is needed:

# The library handles this internally
text = loader.load("document.txt")
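
If a plain text file still comes out garbled, one workaround is to decode it yourself and skip the loader for that file. The sketch below is an assumption-laden fallback, not part of the library: read_text_with_fallback is a hypothetical helper and the encoding order is only a guess that suits many Western-language files.

```python
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Decode a file by trying each encoding in order."""
    data = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode as UTF-8 and replace undecodable bytes with U+FFFD.
    return data.decode("utf-8", errors="replace")
```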

Next Steps

Chunking

Learn how to chunk loaded documents

Embeddings

Generate embeddings from chunks

AgenticRAG

Use the complete RAG pipeline

API Reference

Complete API documentation
