Update README.md
stochastic-sisyphus authored Dec 17, 2024
1 parent 88f3da2 commit 633f766
Showing 1 changed file with 165 additions and 0 deletions.
165 changes: 165 additions & 0 deletions README.md
# SynSearch

SynSearch is a document processing and semantic search system. It combines embedding generation, clustering, and summarization to process and analyze large collections of text documents.

## 🌟 Features

- **Document Processing Pipeline**
  - Domain-agnostic text preprocessing
  - Supports multiple dataset formats
  - Efficient batch processing capabilities

- **Advanced Embedding Generation**
  - Transformer-based embeddings
  - Configurable model selection
  - GPU acceleration support
  - Optimized batch processing

- **Dynamic Clustering**
  - Adaptive clustering algorithms
  - Theme-based document grouping
  - Support for multiple clustering strategies

- **Intelligent Summarization**
  - Hybrid summarization approach
  - Support for scientific and legal domains
  - Cluster-based summary generation
  - Configurable summary length

- **ArXiv Integration**
  - Direct ArXiv paper search
  - Batch paper fetching
  - Rate-limited API handling
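
The batch fetching and rate limiting listed above can be sketched with a small pair of helpers. This is a minimal illustration, not SynSearch's actual implementation: `batched`, `RateLimiter`, and the `min_interval` parameter are hypothetical names, and the right delay depends on arXiv's current API terms of use.

```python
import time
from typing import Callable, Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


class RateLimiter:
    """Enforce a minimum interval between successive calls to `wait()`.

    Hypothetical helper: the clock and sleep functions are injectable so the
    behavior can be checked without real delays.
    """

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None  # time of the previous call

    def wait(self) -> None:
        """Sleep just long enough that calls are `min_interval` seconds apart."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

A fetch loop would call `limiter.wait()` before each request, for example once per batch produced by `batched(paper_ids, 50)`; the batch size here is arbitrary.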

## 📋 Requirements

- Python 3.8 or higher
- CUDA-compatible GPU (optional, for acceleration)
- Required Python packages:
  - torch
  - transformers
  - pandas
  - numpy
  - spacy
  - pyyaml
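
To confirm an environment meets these requirements, a quick check along the following lines may help. It is only a sketch: `missing_packages` and `python_is_supported` are hypothetical helpers, not part of SynSearch. Note that the `pyyaml` package is imported under the name `yaml`.

```python
import sys
from importlib.util import find_spec

# Maps the pip package name to the module name used in `import` statements.
REQUIRED_MODULES = {
    "torch": "torch",
    "transformers": "transformers",
    "pandas": "pandas",
    "numpy": "numpy",
    "spacy": "spacy",
    "pyyaml": "yaml",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip names of required packages that cannot be imported."""
    return [name for name, module in required.items() if find_spec(module) is None]


def python_is_supported(version_info=sys.version_info, minimum=(3, 8)):
    """Return True if the interpreter is at least the minimum version."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    print("Missing packages:", missing_packages() or "none")
    if find_spec("torch") is not None:
        import torch

        # GPU support is optional; this only reports whether CUDA is usable.
        print("CUDA available:", torch.cuda.is_available())
```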

## 🚀 Installation

1. Clone the repository:
```bash
git clone https://github.com/stochastic-sisyphus/synsearch.git
cd synsearch
```

2. Set up a virtual environment (recommended):
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```

3. Install dependencies:
```bash
pip install -r requirements.txt
```

4. Download required datasets:
```bash
make download-data
```

## ⚙️ Configuration

The system is configured through YAML files located in the `config` directory. Key configuration areas include:

- Data sources and paths
- Embedding model settings
- Preprocessing parameters
- Clustering configuration
- Summarization options

Example configuration:
```yaml
data:
  datasets:
    - name: scisummnet
      enabled: true
  scisummnet_path: "path/to/dataset"

embedding:
  model_name: "bert-base-uncased"
  dimension: 768
  max_seq_length: 512
  batch_size: 32

preprocessing:
  # Preprocessing settings

clustering:
  # Clustering settings

summarization:
  # Summarization settings
```
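
A configuration like the one above can be loaded and sanity-checked before running the pipeline. The sketch below assumes the key names from the example; `load_config` and `validate_config` are hypothetical helpers rather than SynSearch's own loader, and the checks are illustrative, not exhaustive.

```python
from typing import Any, Dict, List


def load_config(path: str) -> Dict[str, Any]:
    """Read a YAML config file into a dict (requires the `pyyaml` package)."""
    import yaml  # imported lazily so validate_config works without pyyaml

    with open(path, "r", encoding="utf-8") as handle:
        return yaml.safe_load(handle) or {}


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems: List[str] = []
    if not isinstance(config.get("data"), dict):
        problems.append("section is missing or not a mapping: data")
    embedding = config.get("embedding")
    if not isinstance(embedding, dict):
        problems.append("section is missing or not a mapping: embedding")
        return problems
    if not embedding.get("model_name"):
        problems.append("embedding.model_name is not set")
    for key in ("dimension", "max_seq_length", "batch_size"):
        value = embedding.get(key)
        # bool is a subclass of int, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool) or value <= 0:
            problems.append(f"embedding.{key} must be a positive integer")
    return problems
```

Calling `validate_config(load_config(path))` on a config file and printing any returned messages gives early feedback on typos before a long embedding run starts. Sections that contain only comments, such as `preprocessing:` in the example, load as `None`.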

## 🔨 Usage

1. Basic usage:
```python
from src.main import main

# Run the complete pipeline
main()
```

2. Using specific components:
```python
from src.preprocessing.domain_agnostic_preprocessor import DomainAgnosticPreprocessor
from src.embedding_generator import EnhancedEmbeddingGenerator

# Initialize components
preprocessor = DomainAgnosticPreprocessor()
embedding_generator = EnhancedEmbeddingGenerator(model_name="bert-base-uncased")

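# Process texts (replace the sample list with your own documents)
your_texts = ["First document to embed.", "Second document to embed."]
processed_texts = preprocessor.preprocess_texts(your_texts)
embeddings = embedding_generator.generate_embeddings(processed_texts)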
```
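
Once embeddings exist, a semantic search step typically ranks documents by their similarity to a query embedding. The README does not say how SynSearch scores matches, so the following is only a sketch using cosine similarity, a common choice. `cosine_similarity` and `rank_documents` are hypothetical helpers, and the code assumes each embedding can be treated as a flat sequence of floats.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    top_k: int = 5,
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the top_k most similar documents."""
    scores = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]
```

For large collections a vectorized implementation (for example with numpy, which is already a requirement) would be much faster; the pure-Python version is used here only to keep the idea visible.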

## 🧪 Testing

Run the test suite:
```bash
pytest tests/
```

Key test areas include:
- Preprocessing functionality
- ArXiv API integration
- Embedding generation
- Clustering algorithms
- Summarization quality
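
As an illustration of the style of test that `pytest` collects, here is a small sketch for the preprocessing area. The function under test, `normalize_whitespace`, is a stand-in written for this example and is not SynSearch's preprocessor; real tests would import the project's own classes instead.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Stand-in preprocessing step: collapse whitespace runs and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def test_normalize_collapses_whitespace():
    assert normalize_whitespace("  deep \n\t learning  ") == "deep learning"


def test_normalize_is_idempotent():
    # Applying the step twice should give the same result as applying it once.
    once = normalize_whitespace("a   b")
    assert normalize_whitespace(once) == once
```

Saved as a file whose name starts with `test_` inside `tests/`, functions like these are discovered automatically by `pytest tests/`.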

## 📁 Project Structure

```
synsearch/
├── src/
│   ├── api/              # API integrations
│   ├── preprocessing/    # Text preprocessing
│   ├── clustering/       # Clustering algorithms
│   ├── summarization/    # Summary generation
│   └── utils/            # Utility functions
├── tests/                # Test suite
├── config/               # Configuration files
├── data/                 # Dataset storage
└── outputs/              # Generated outputs
```

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Open a Pull Request

## 📝 License
