A geoscience document analysis system that uses Google's Gemini API to extract structured information from scientific papers, with a focus on geomorphology and sedimentology. It is based on vegaluisjose/mlx-rag and extends it with Gemini API PDF processing, DuckDB storage for embeddings, and enhanced visualization.
- PDF Processing with Gemini API
  - Structured information extraction
  - Automatic metadata parsing
  - Research findings identification
  - Relationship mapping
- Beautiful Output Formats
  - Rich console display
  - HTML export with styling
  - Structured JSON storage
- Vector Database Integration
  - MLX embeddings
  - DuckDB storage
  - Similarity search
- Interactive Chat Interface
  - Context-aware responses
  - Research paper integration
  - Geoscience expertise
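The similarity search listed above can be illustrated with a minimal sketch. It assumes cosine similarity over plain Python lists; the project itself stores MLX embeddings in DuckDB, and its actual metric and query path are not documented here. The function names `cosine_similarity` and `top_k` and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query, store, k=3):
    """Return the k (text, score) pairs most similar to the query vector.

    `store` is a list of (text, embedding) pairs. In the real system these
    would come from the vector database rather than a Python list.
    """
    scored = [(text, cosine_similarity(query, emb)) for text, emb in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional embeddings, invented purely for illustration.
store = [
    ("river avulsion frequency", [0.9, 0.1, 0.0]),
    ("delta progradation rates", [0.7, 0.3, 0.1]),
    ("glacial till provenance", [0.0, 0.2, 0.9]),
]
results = top_k([1.0, 0.0, 0.0], store, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

A linear scan like this is only practical for small collections; it is meant to show the ranking idea, not how the project queries its database.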
- Clone the repository:
git clone https://github.com/jameshgrn/drsedman.git
cd drsedman
- Set up Python environment:
pyenv install 3.10.14
pyenv local 3.10.14
python -m venv .venv
source .venv/bin/activate
- Install dependencies:
poetry install
- Set up credentials:
cp config/gemini_credentials_example.json config/gemini_credentials.json
# Edit config/gemini_credentials.json with your API key
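As a quick sanity check after editing the credentials file, a script along the following lines could confirm that the file parses and contains a non-empty key. This is a sketch under assumptions: the field name `api_key` and the helper `load_api_key` are hypothetical, so check `config/gemini_credentials_example.json` for the key the project actually expects.

```python
import json
import tempfile
from pathlib import Path


def load_api_key(path, field="api_key"):
    """Read the API key from a JSON credentials file.

    The field name "api_key" is an assumption, not the project's
    documented schema.
    """
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"credentials file not found: {path}")
    data = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError(f"expected a JSON object in {path}")
    key = data.get(field, "")
    if not key:
        raise ValueError(f"'{field}' is missing or empty in {path}")
    return key


# Demonstration against a throwaway file, so no real key is needed.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "gemini_credentials.json"
    demo.write_text(json.dumps({"api_key": "demo-key"}), encoding="utf-8")
    loaded_key = load_api_key(demo)
    print("credentials file looks valid")
```

To check the real file, call `load_api_key("config/gemini_credentials.json")`, adjusting `field` if the example file uses a different name.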
# Process all PDFs in a directory
./scripts/run_gemini_processing.zsh data/pdfs gemini_output
# Retry failed processing
./scripts/run_gemini_processing.zsh data/pdfs gemini_output --retry-failed
# View a random processed file
./scripts/view_gemini.sh
# View a specific file
./scripts/view_gemini.sh --file gemini_output/paper_gemini.jsonl
# Save as HTML
./scripts/view_gemini.sh --save-html
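The processed output is stored as JSONL (one JSON value per line), so it can also be inspected directly from Python. The sketch below assumes nothing about the project's schema: the sample records and their field names are invented for illustration, and `read_jsonl` and `key_counts` are hypothetical helpers.

```python
import json
from collections import Counter


def read_jsonl(lines):
    """Parse an iterable of JSONL lines, skipping blank lines."""
    records = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {number} is not valid JSON: {exc}") from exc
    return records


def key_counts(records):
    """Count how often each top-level key appears across dict records."""
    counts = Counter()
    for record in records:
        if isinstance(record, dict):
            counts.update(record.keys())
    return counts


# Invented sample lines; the real field names may differ.
sample = [
    '{"type": "finding", "text": "example finding"}',
    "",
    '{"type": "methodology", "text": "example method"}',
]
records = read_jsonl(sample)
counts = key_counts(records)
print(len(records), "records;", dict(counts))
```

For a real file, pass an open file handle instead of the sample list, for example `read_jsonl(open("gemini_output/paper_gemini.jsonl", encoding="utf-8"))`.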
# Create vector embeddings from processed PDFs
./scripts/process_and_embed.zsh
# Start the interactive chat interface
./scripts/chat.zsh
# Run tests
make test
# Run linter
make lint
# Format code
make format
# Type checking
make mypy
# Run coverage
make coverage
See architecture.md for detailed system design.
See CONTRIBUTING.md for development guidelines.
The system has processed approximately 750 academic papers in geoscience, extracting:
- 6,465 research findings
- 2,849 methodology entries
- 2,672 identified relationships
Copyright 2024 James Hooker Gearon
Licensed under the Apache License, Version 2.0. See LICENSE for details.
This project is based on vegaluisjose/mlx-rag, with significant extensions for geoscience document analysis including Gemini API integration, structured information extraction, and enhanced visualization capabilities.