Overview
The DocumentLoader class provides a simple interface for loading documents from various file formats. It uses MarkItDown under the hood, which supports:
- PDF files
- Word documents (DOCX, DOC)
- PowerPoint presentations (PPTX, PPT)
- Excel spreadsheets (XLSX, XLS)
- Images with OCR (PNG, JPG, JPEG)
- HTML and Markdown files
- Plain text files
Basic Usage
Loading Multiple Documents
Loading from Directory
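The file-discovery step behind directory loading can be sketched with the standard library alone. This is a minimal illustration, not the loader's own implementation: `find_documents` and `SUPPORTED_EXTENSIONS` are hypothetical names, and the extension set is assumed from the format list in the overview.

```python
from pathlib import Path

# Extensions taken from the supported-format list in the overview.
# .html, .htm and .md are assumed spellings for "HTML and Markdown files".
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls",
    ".png", ".jpg", ".jpeg", ".html", ".htm", ".md", ".txt",
}


def find_documents(directory, recursive=True):
    """Return supported files under `directory`, sorted for a stable order."""
    root = Path(directory)
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        p for p in candidates
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

The returned paths can then be handed to whatever loading call the project exposes. Matching on the lower-cased suffix means files such as `REPORT.PDF` are picked up as well.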
Supported File Formats
PDF Files (.pdf)
Extracts text from PDF documents, including scanned PDFs with OCR.
Word Documents (.docx, .doc)
Parses Microsoft Word documents and extracts formatted text.
PowerPoint (.pptx, .ppt)
Extracts text and notes from PowerPoint presentations.
Excel Spreadsheets (.xlsx, .xls)
Reads data from Excel files, including multiple sheets.
Images (.png, .jpg, .jpeg)
Uses OCR to extract text from images.
HTML and Markdown
Parses HTML and Markdown files.
Plain Text (.txt)
Loads plain text files with proper encoding detection.
Integration with AgenticRAG
When using AgenticRAG, document loading is handled automatically.
Advanced Usage
Custom Processing
You can use the loader to get text and then apply custom processing.
Batch Loading with Metadata
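Batch loading with metadata, with custom processing as an optional step, might look like the sketch below. The loader's actual method names are not shown in this section, so `load_fn` is a stand-in for whichever call returns a document's text; `load_with_metadata` and the metadata keys are hypothetical choices made for illustration.

```python
from pathlib import Path


def load_with_metadata(paths, load_fn, process=None):
    """Load each path with `load_fn` and pair the text with basic metadata.

    `load_fn` is a placeholder for the loader call that returns text.
    `process` is an optional custom post-processing function for that text.
    """
    records = []
    for path in map(Path, paths):
        text = load_fn(path)
        if process is not None:
            text = process(text)
        records.append({
            "text": text,
            "metadata": {
                "source": str(path),
                "file_name": path.name,
                "file_type": path.suffix.lower().lstrip("."),
                "char_count": len(text),
            },
        })
    return records
```

Keeping the source path and file type next to the text makes it possible to trace a retrieved passage back to its file later on.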
Error Handling
The document loader includes robust error handling.
Best Practices
File Format Selection
- Use text-based formats (TXT, MD) when possible for fastest loading
- PDFs work well but may be slower to process
- Images require OCR and may have accuracy limitations
Document Preprocessing
- Clean documents before indexing (remove headers, footers, etc.)
- Ensure proper formatting for better chunking
- Remove unnecessary content that won’t be useful for retrieval
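The cleaning steps above can be sketched as two small helpers. This is an illustrative heuristic under stated assumptions, not part of the loader: both function names are hypothetical, the text is assumed to be already split into pages by the caller, and a line is treated as a running header or footer when it recurs on most pages.

```python
import re
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop lines that recur on at least `min_fraction` of the pages.

    Lines that differ per page (such as "Page 3") are not caught by this.
    """
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})
    threshold = max(2, min_fraction * len(pages))
    repeated = {line for line, n in counts.items() if n >= threshold}
    cleaned = []
    for page in pages:
        kept = [ln for ln in page.splitlines() if ln.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned


def normalize_whitespace(text):
    """Collapse runs of spaces and tabs, and limit blank lines to one."""
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    return text.strip()
```

The `max(2, ...)` floor keeps the heuristic from deleting lines in a one-page document, where every line trivially appears on all pages.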
Batch Processing
- Load multiple documents in batches for efficiency
- Consider using multiprocessing for large document sets
- Monitor memory usage when loading many large documents
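A minimal sketch of batched loading follows. It uses a thread pool rather than multiprocessing to keep the example short; `ProcessPoolExecutor` from the same module offers the same `map` interface and may suit CPU-bound parsing better, provided the loading function is picklable. `load_fn` again stands in for the project's loading call, and both helper names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def load_in_batches(paths, load_fn, batch_size=16, max_workers=4):
    """Yield (path, text) pairs, loading one batch at a time.

    Only one batch of results is held by the pool at once, which bounds
    memory use when the caller consumes the pairs as they arrive.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch in batched(paths, batch_size):
            # pool.map preserves the input order within each batch.
            yield from zip(batch, pool.map(load_fn, batch))
```

Because `load_in_batches` is a generator, results can be indexed or written out as they arrive instead of being accumulated in a single list.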
Error Handling
- Always wrap document loading in try-except blocks
- Log failed documents for review
- Implement retry logic for transient failures
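The three practices above can be combined in one helper, sketched here under assumptions: the exceptions the loader actually raises are not documented in this section, so `OSError` is treated as transient purely for illustration, and `load_fn`, `load_with_retry`, and `load_all` are hypothetical names.

```python
import logging
import time

logger = logging.getLogger("document_loading")


def load_with_retry(path, load_fn, retries=2, delay=0.5, transient=(OSError,)):
    """Call `load_fn(path)`, retrying exceptions listed in `transient`.

    Returns the loaded text, or None if loading ultimately failed.
    """
    for attempt in range(retries + 1):
        try:
            return load_fn(path)
        except transient as exc:
            if attempt == retries:
                logger.error("Giving up on %s after %d attempts: %s",
                             path, attempt + 1, exc)
                return None
            logger.warning("Attempt %d failed for %s: %s; retrying",
                           attempt + 1, path, exc)
            time.sleep(delay * 2 ** attempt)  # exponential backoff
        except Exception as exc:
            # Non-transient failures (e.g. a corrupt file) are not retried.
            logger.error("Failed to load %s: %s", path, exc)
            return None


def load_all(paths, load_fn, **retry_kwargs):
    """Load every path and return (documents, failed_paths)."""
    documents, failed = {}, []
    for path in paths:
        text = load_with_retry(path, load_fn, **retry_kwargs)
        if text is None:
            failed.append(path)
        else:
            documents[path] = text
    return documents, failed
```

Returning the failed paths separately lets one bad file be reviewed later without aborting the rest of the run.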
Performance Considerations
| File Type | Speed | Accuracy | Notes |
|---|---|---|---|
| TXT, MD | Fast | Perfect | Best for text-only content |
| DOCX | Fast | Excellent | Handles formatting well |
| PDF | Medium | Good | Depends on PDF type (text vs scanned) |
| Images | Slow | Variable | Requires OCR, accuracy varies |
| PPTX | Fast | Good | Extracts slide text and notes |
Troubleshooting
OCR not working for images
Solution: Install required OCR dependencies:
PDF loading fails
Solution: Ensure you have the required PDF libraries:
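Before reinstalling anything, it can help to confirm which modules are importable in the active environment. The sketch below uses only the standard library; `missing_modules` is a hypothetical helper, and the module names passed to it should be the ones the project's installation notes list.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level module names that cannot be imported."""
    return [name for name in module_names if find_spec(name) is None]
```

For example, `missing_modules(["markitdown"])` returns an empty list when MarkItDown is installed in the current environment. The same check applies to OCR-related packages.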
Encoding errors
Solution: MarkItDown handles encoding automatically, but you can specify encoding if needed:
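How an encoding is passed through the loader is not shown in this section. A loader-independent option for plain-text files is to decode them yourself, as in the following sketch. The helper name and the encoding order are assumptions; adjust the list to the encodings your documents are likely to use.

```python
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8-sig", "cp1252")):
    """Try each encoding in order and return (text, encoding_used)."""
    data = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeError:
            continue
    # latin-1 maps every byte to a character, so this never raises,
    # though the result may contain the wrong characters.
    return data.decode("latin-1"), "latin-1"
```

`utf-8-sig` is tried first because it decodes ordinary UTF-8 and also strips a leading byte-order mark when one is present.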
