Overview

The DocumentLoader class provides a simple interface to load documents from various file formats. It uses MarkItDown under the hood, which supports:
  • PDF files
  • Word documents (DOCX, DOC)
  • PowerPoint presentations (PPTX, PPT)
  • Excel spreadsheets (XLSX, XLS)
  • Images with OCR (PNG, JPG, JPEG)
  • HTML and Markdown files
  • Plain text files
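
Before handing a path to the loader, it can help to check its extension against the list above. The sketch below is illustrative only: `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names, not part of mini or MarkItDown, and the extension set simply mirrors the list above (with `.html`, `.htm`, `.md`, and `.txt` assumed for the HTML, Markdown, and plain text entries).

```python
from pathlib import Path

# Hypothetical set mirroring the formats listed above; it is not read from the library.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls",
    ".png", ".jpg", ".jpeg", ".html", ".htm", ".md", ".txt",
}


def is_supported(path: str) -> bool:
    """Return True if the file extension (case-insensitive) is in the set."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

Filtering a file list with `is_supported` before loading keeps unsupported files from surfacing later as loader errors.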

Basic Usage

from mini.loader import DocumentLoader

# Initialize loader
loader = DocumentLoader()

# Load a single document
text = loader.load("document.pdf")
print(f"Loaded {len(text)} characters")

Loading Multiple Documents

# Load multiple files
texts = loader.load_documents([
    "document1.pdf",
    "document2.docx",
    "presentation.pptx"
])

print(f"Loaded {len(texts)} documents")
for i, text in enumerate(texts):
    print(f"Document {i+1}: {len(text)} characters")
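
To keep track of which text came from which file, you can pair the inputs with the results. This is a minimal sketch that assumes `load_documents` returns one text per input path, in the same order; that behavior is not confirmed here, so the helper (a hypothetical name) raises an error if the lengths differ.

```python
def pair_with_paths(paths, texts):
    """Pair each input path with its loaded text; fail loudly on a length mismatch."""
    if len(paths) != len(texts):
        raise ValueError(f"expected {len(paths)} texts, got {len(texts)}")
    return dict(zip(paths, texts))
```

For example, `pair_with_paths(paths, loader.load_documents(paths))` gives a dictionary keyed by file path.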

Loading from Directory

# Load all documents from a directory
texts = loader.load_documents_from_directory("./knowledge_base/")

print(f"Loaded {len(texts)} documents from directory")
The load_documents_from_directory method recursively scans the directory and loads all supported file formats.
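
If you want to preview which files a recursive scan would pick up, something like the following works with the standard library alone. It is a sketch of the general idea, not the library's actual implementation; `find_supported_files` and its default extension tuple are assumptions.

```python
from pathlib import Path


def find_supported_files(root, extensions=(".pdf", ".docx", ".pptx", ".txt", ".md")):
    """Recursively collect files under root whose suffix is in extensions."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        str(p) for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )
```

The returned paths can be passed to `loader.load_documents` if you need to filter or reorder them first.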

Supported File Formats

Extracts text from PDF documents, including scanned PDFs with OCR:
text = loader.load("report.pdf")

Parses Microsoft Word documents and extracts formatted text:
text = loader.load("proposal.docx")

Extracts text and notes from PowerPoint presentations:
text = loader.load("presentation.pptx")

Reads data from Excel files, including multiple sheets:
text = loader.load("data.xlsx")

Uses OCR to extract text from images:
text = loader.load("screenshot.png")

Parses HTML and Markdown files:
text = loader.load("readme.md")

Loads plain text files with proper encoding detection:
text = loader.load("notes.txt")

Integration with AgenticRAG

When using AgenticRAG, document loading is handled automatically:
from mini import AgenticRAG

# vector_store and embedding_model must already be configured
rag = AgenticRAG(vector_store=vector_store, embedding_model=embedding_model)

# The loader is used internally
rag.index_document("document.pdf")

# Index multiple documents
rag.index_documents([
    "doc1.pdf",
    "doc2.docx",
    "doc3.pptx"
])

Advanced Usage

Custom Processing

You can use the loader to get text and then apply custom processing:
import re

from mini.loader import DocumentLoader
from mini.chunker import Chunker

loader = DocumentLoader()
chunker = Chunker()

# Load and process
text = loader.load("document.pdf")

# Apply custom preprocessing
text = re.sub(r"\n{3,}", "\n\n", text)  # Collapse any run of 3+ newlines
text = text.strip()

# Chunk the processed text
chunks = chunker.chunk(text)
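
Preprocessing steps like these are easier to reuse when wrapped in one function. The helper below is a sketch; `normalize_text` is a hypothetical name, and the specific cleanup rules are assumptions about what typically needs fixing after conversion.

```python
import re


def normalize_text(text: str) -> str:
    """Collapse whitespace noise that often survives document conversion."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # unify line endings
    text = re.sub(r"[ \t]+\n", "\n", text)  # drop trailing spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse runs of blank lines
    return text.strip()
```

You could then call `chunker.chunk(normalize_text(text))` in place of the inline cleanup.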

Batch Loading with Metadata

documents = [
    {"path": "2023_report.pdf", "metadata": {"year": 2023, "type": "report"}},
    {"path": "2024_report.pdf", "metadata": {"year": 2024, "type": "report"}},
]

loaded = []
for doc in documents:
    text = loader.load(doc["path"])
    # Keep the text together with its metadata for later processing
    loaded.append({"text": text, "metadata": doc["metadata"]})

Error Handling

Wrap load calls in a try-except block to handle missing files and other loading failures:
try:
    text = loader.load("document.pdf")
except FileNotFoundError:
    print("Document not found")
except Exception as e:
    print(f"Error loading document: {e}")
Some file formats (especially scanned PDFs) may require additional system dependencies for OCR. Ensure you have the necessary libraries installed.
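
When loading many files, you may prefer to record failures and continue rather than stop at the first error. The sketch below assumes nothing about the loader beyond a callable that takes a path; `load_all` is a hypothetical helper, not part of mini.

```python
import logging

logger = logging.getLogger(__name__)


def load_all(load, paths):
    """Call load(path) for each path; return (texts_by_path, errors_by_path)."""
    texts, errors = {}, {}
    for path in paths:
        try:
            texts[path] = load(path)
        except Exception as exc:  # broad on purpose: one bad file should not stop the batch
            logger.warning("Failed to load %s: %s", path, exc)
            errors[path] = exc
    return texts, errors
```

A call such as `texts, errors = load_all(loader.load, paths)` leaves you with a dictionary of failed paths to review afterwards.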

Best Practices

File Format Selection
  • Use text-based formats (TXT, MD) when possible for fastest loading
  • PDFs work well but may be slower to process
  • Images require OCR and may have accuracy limitations

Document Preparation
  • Clean documents before indexing (remove headers, footers, etc.)
  • Ensure proper formatting for better chunking
  • Remove unnecessary content that won’t be useful for retrieval

Batch Processing
  • Load multiple documents in batches for efficiency
  • Consider using multiprocessing for large document sets
  • Monitor memory usage when loading many large documents

Error Handling
  • Always wrap document loading in try-except blocks
  • Log failed documents for review
  • Implement retry logic for transient failures
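
Retry logic for transient failures can be sketched as a small wrapper with exponential backoff. Which exceptions count as transient is an assumption here (`OSError` by default, excluding missing files); `load_with_retry` and its parameters are hypothetical and should be adapted to the errors you actually observe.

```python
import time


def load_with_retry(load, path, attempts=3, base_delay=0.5, retry_on=(OSError,)):
    """Call load(path), retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return load(path)
        except FileNotFoundError:
            raise  # a missing file will not fix itself, so do not retry
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** (attempt - 1))
```

For example, `load_with_retry(loader.load, "document.pdf")` waits 0.5 s and then 1 s between its three attempts.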

Performance Considerations

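| File Type | Speed  | Accuracy  | Notes                                 |
|-----------|--------|-----------|---------------------------------------|
| TXT, MD   | Fast   | Perfect   | Best for text-only content            |
| DOCX      | Fast   | Excellent | Handles formatting well               |
| PDF       | Medium | Good      | Depends on PDF type (text vs scanned) |
| Images    | Slow   | Variable  | Requires OCR, accuracy varies         |
| PPTX      | Fast   | Good      | Extracts slide text and notes         |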
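
The ratings above are qualitative, so it is worth measuring load times on your own files. A minimal timing sketch, assuming only that the loader is a callable taking a path (`timed_load` is a hypothetical helper):

```python
import time


def timed_load(load, path):
    """Return (result, seconds) for a single load(path) call."""
    start = time.perf_counter()
    result = load(path)
    return result, time.perf_counter() - start
```

Calling `text, seconds = timed_load(loader.load, "report.pdf")` for one representative file per format gives a rough comparison.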

Troubleshooting

If OCR fails on images or scanned PDFs, install the required OCR dependencies:
# macOS
brew install tesseract

# Ubuntu
sudo apt-get install tesseract-ocr

# Windows
# Download from: https://github.com/UB-Mannheim/tesseract/wiki
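
After installing, you can confirm from Python that the binary is visible on your PATH. This sketch uses only the standard library; `has_executable` is a hypothetical helper, and it checks only that the executable can be found, not that OCR works end to end.

```python
import shutil


def has_executable(name: str = "tesseract") -> bool:
    """Return True if an executable with this name is found on PATH."""
    return shutil.which(name) is not None
```

If `has_executable()` returns False, the install location is probably missing from PATH.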
If PDF extraction fails, ensure you have the required PDF libraries:
pip install pymupdf  # or pdfplumber
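
To check whether these packages are importable in the current environment, a small standard-library sketch like the following can help. `missing_modules` is a hypothetical helper; the default names are assumptions (older PyMuPDF releases are imported as `fitz` rather than `pymupdf`).

```python
import importlib.util


def missing_modules(names=("pymupdf", "pdfplumber")):
    """Return the top-level module names from `names` that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]
```

An empty list means every listed module was found in the active interpreter's environment.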
For encoding issues in text files, note that MarkItDown detects the encoding automatically, so a plain load call is usually enough:
# Encoding is detected internally
text = loader.load("document.txt")
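
If a particular text file still comes out garbled, one workaround is to decode it yourself and bypass the loader for that file. The sketch below is independent of mini and MarkItDown; the function name and the order of encodings tried are assumptions.

```python
def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Try each encoding in order; fall back to latin-1, which never fails."""
    with open(path, "rb") as fh:
        data = fh.read()
    for enc in encodings:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this always succeeds
    return data.decode("latin-1")
```

Because the final fallback always succeeds, the result may still contain wrong characters, so inspect the output for files that needed it.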

Next Steps