
Overview

DocumentLoader handles loading documents from various file formats using Microsoft’s MarkItDown library. It supports PDFs, Word documents, images with OCR, and more.

Constructor

from mini.loader import DocumentLoader

loader = DocumentLoader()
The DocumentLoader requires no configuration and is ready to use immediately.

Methods

load

Load a single document and return its text content.
def load(self, document_path: str) -> str

Parameters

document_path (str, required): Path to the document file

Returns

text (str): The extracted text content from the document

Example

from mini.loader import DocumentLoader

loader = DocumentLoader()

# Load a PDF
text = loader.load("document.pdf")
print(f"Loaded {len(text)} characters")

# Load a Word document
text = loader.load("document.docx")

# Load an image with OCR
text = loader.load("screenshot.png")

load_documents

Load multiple documents and return a list of their text contents.
def load_documents(self, document_paths: List[str]) -> List[str]

Parameters

document_paths (List[str], required): List of paths to document files

Returns

texts (List[str]): List of extracted text contents, one per document

Example

from mini.loader import DocumentLoader

loader = DocumentLoader()

documents = [
    "doc1.pdf",
    "doc2.docx",
    "doc3.txt"
]

texts = loader.load_documents(documents)
print(f"Loaded {len(texts)} documents")

for i, text in enumerate(texts):
    print(f"Document {i+1}: {len(text)} characters")

load_documents_from_directory

Load all supported documents from a directory.
def load_documents_from_directory(
    self,
    directory_path: str
) -> List[str]

Parameters

directory_path (str, required): Path to the directory containing documents

Returns

texts (List[str]): List of extracted text contents from all supported files in the directory

Example

from mini.loader import DocumentLoader

loader = DocumentLoader()

# Load all documents from a directory
texts = loader.load_documents_from_directory("./documents/")
print(f"Loaded {len(texts)} documents from directory")

Supported Formats

The DocumentLoader supports the following file formats through MarkItDown:
  • PDF (.pdf) - Text extraction with layout preservation
  • Word (.docx, .doc) - Microsoft Word documents
  • PowerPoint (.pptx, .ppt) - Presentation slides
  • Excel (.xlsx, .xls) - Spreadsheets
  • Text (.txt, .md) - Plain text and Markdown
  • HTML (.html, .htm) - Web pages
  • Images (.png, .jpg, .jpeg, .gif, .bmp, .tiff, .tif) - Text extraction via OCR
  • CSV (.csv) - Comma-separated values
  • JSON (.json) - JSON data files
  • XML (.xml) - XML documents
  • YAML (.yaml, .yml) - YAML configuration
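If you want to skip unsupported files before passing paths to load_documents, you can pre-filter by extension. A minimal sketch; the SUPPORTED_EXTENSIONS set below simply mirrors the list on this page and is an assumption for illustration, not a constant exported by the library:

```python
from pathlib import Path

# Mirrors the Supported Formats list above (assumed; the library's
# actual internal format list may differ)
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls",
    ".txt", ".md", ".html", ".htm", ".png", ".jpg", ".jpeg",
    ".gif", ".bmp", ".tiff", ".tif", ".csv", ".json", ".xml",
    ".yaml", ".yml",
}

def is_supported(path: str) -> bool:
    """Return True if the file's extension is in the supported set."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS

print(is_supported("report.PDF"))   # True (comparison is case-insensitive)
print(is_supported("archive.zip"))  # False
```

Filtering a directory listing with is_supported before calling load_documents lets you handle unsupported files explicitly rather than relying on exceptions.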

Error Handling

from mini.loader import DocumentLoader

loader = DocumentLoader()

try:
    text = loader.load("document.pdf")
except FileNotFoundError:
    print("Document not found")
except Exception as e:
    print(f"Error loading document: {e}")

Complete Example

from mini.loader import DocumentLoader
from pathlib import Path

# Initialize loader
loader = DocumentLoader()

# Load different file types
pdf_text = loader.load("research_paper.pdf")
docx_text = loader.load("report.docx")
img_text = loader.load("screenshot.png")  # OCR

print(f"PDF: {len(pdf_text)} chars")
print(f"DOCX: {len(docx_text)} chars")
print(f"Image: {len(img_text)} chars")

# Batch load from directory
docs_dir = Path("./documents")
all_texts = loader.load_documents_from_directory(str(docs_dir))

print(f"\nLoaded {len(all_texts)} documents from directory")

# Process loaded texts
for i, text in enumerate(all_texts):
    if len(text) > 100:
        print(f"Doc {i+1}: {text[:100]}...")

Integration Example

Using DocumentLoader with other Mini RAG components:
from mini.loader import DocumentLoader
from mini.chunker import Chunker
from mini.embedding import EmbeddingModel

# Initialize components
loader = DocumentLoader()
chunker = Chunker()
embedding_model = EmbeddingModel()

# Load document
text = loader.load("document.pdf")

# Chunk text
chunks = chunker.chunk(text)
print(f"Created {len(chunks)} chunks")

# Generate embeddings
embeddings = embedding_model.embed_chunks([c.text for c in chunks])
print(f"Generated {len(embeddings)} embeddings")

Best Practices

Always wrap load operations in try-except blocks:
documents = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
loaded = []

for doc in documents:
    try:
        text = loader.load(doc)
        loaded.append(text)
    except Exception as e:
        print(f"Failed to load {doc}: {e}")
For large files, chunk the text immediately after loading and process the chunks in batches:
# Load large document
text = loader.load("large_document.pdf")

# Chunk immediately
chunks = chunker.chunk(text)

# Process in batches
batch_size = 100
for i in range(0, len(chunks), batch_size):
    batch = chunks[i:i + batch_size]
    # Process batch
Clear large text objects after processing:
import gc

text = loader.load("huge_document.pdf")
chunks = chunker.chunk(text)

# Clear large text from memory
del text
gc.collect()
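For many large files, the same idea can be taken further by processing one document at a time, so each text is released before the next is loaded. A minimal sketch; process_documents and the stand-in load/chunk callables are illustrative, not part of the library (swap in loader.load and chunker.chunk):

```python
import gc

def process_documents(paths, load, chunk):
    """Load and chunk one document at a time so each large text
    can be garbage-collected before the next file is read."""
    all_chunks = []
    for path in paths:
        text = load(path)   # e.g. DocumentLoader().load
        all_chunks.extend(chunk(text))
        del text            # drop the large string before the next file
        gc.collect()
    return all_chunks

# Stand-in callables for illustration only
chunks = process_documents(
    ["a.txt", "b.txt"],
    load=lambda p: f"contents of {p} " * 3,
    chunk=lambda t: t.split(),
)
print(len(chunks))  # prints 18
```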

See Also