# gprMax RAG Database System
## Overview
This is a production-ready Retrieval-Augmented Generation (RAG) system for the gprMax documentation. It provides efficient vector search over the documentation, enabling intelligent context retrieval for the chatbot.
## Architecture
### Components
1. **Document Processor**: Extracts and chunks documentation from gprMax GitHub repository
2. **Embedding Model**: Qwen2.5-0.5B (will upgrade to Qwen3-Embedding-0.6B when available)
3. **Vector Database**: ChromaDB with persistent storage
4. **Retriever**: Search and context retrieval utilities
### Key Features
- Automatic documentation extraction from gprMax GitHub repository
- Intelligent chunking with configurable size and overlap
- Persistent vector database using ChromaDB
- Efficient similarity search with score thresholding
- Metadata tracking for reproducibility
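The score thresholding mentioned above amounts to a post-filter on search hits. A minimal sketch (the result shape and `score` key here are illustrative assumptions, not the retriever's actual API):

```python
# Hypothetical post-filter: drop hits whose similarity score falls below a cutoff.
# The dict shape and "score" key are assumptions for illustration only.
def filter_by_score(results, threshold=0.5):
    return [r for r in results if r["score"] >= threshold]
```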
## Installation
The database is **automatically generated** on first startup of the application. No manual installation required!
## Automatic Generation
When the app starts:
1. Checks if database exists at `rag-db/chroma_db/`
2. If not found, automatically runs `generate_db.py`
3. Clones gprMax repository and processes documentation
4. Creates ChromaDB with default embeddings (all-MiniLM-L6-v2)
5. Ready to use - this only happens once!
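The first-run check above can be sketched roughly as follows (the function name and generation command are assumptions based on the listed steps, not the app's actual code):

```python
import subprocess
import sys
from pathlib import Path

def ensure_database(db_path="rag-db/chroma_db"):
    """Return True if the database already exists; otherwise build it once."""
    if Path(db_path).exists():
        return True
    # First run: clone gprMax, process the docs, and build the ChromaDB store.
    subprocess.run([sys.executable, "rag-db/generate_db.py"], check=True)
    return False
```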
## Manual Generation (Optional)
If you need to manually regenerate the database:
```bash
cd rag-db
python generate_db.py --recreate
```
Custom settings:
```bash
python generate_db.py \
    --db-path ./custom_db \
    --temp-dir ./temp \
    --device cuda \
    --recreate
```
## Using the Retriever in an Application
```python
from rag_db.retriever import create_retriever

# Initialize retriever
retriever = create_retriever(db_path="./rag-db/chroma_db")

# Search for relevant documents
results = retriever.search("How to create a source?", k=5)

# Get formatted context for LLM
context = retriever.get_context("antenna patterns", k=3)

# Get relevant source files
files = retriever.get_relevant_files("boundary conditions")

# Get database statistics
stats = retriever.get_stats()
```
## Testing the Retriever
```bash
# Test with default query
python retriever.py
# Test with custom query
python retriever.py "How to model soil layers?"
```
## Database Schema
### Document Structure
```json
{
  "id": "unique_hash",
  "text": "document_chunk_text",
  "metadata": {
    "source": "docs/relative/path.rst",
    "file_type": ".rst",
    "chunk_index": 0,
    "char_start": 0,
    "char_end": 1000
  }
}
```
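The `id` field is described only as a unique hash. One plausible construction (purely illustrative; the real generator may derive it differently) hashes the source path, chunk index, and chunk text together:

```python
import hashlib

def chunk_id(source, chunk_index, text):
    """Stable, collision-resistant ID for a chunk (hypothetical scheme)."""
    key = f"{source}:{chunk_index}:{text}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]
```

Deriving the ID from the chunk's content and position makes regeneration deterministic: rebuilding an unchanged document yields the same IDs.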
### Metadata File
Generated `metadata.json` contains:
```json
{
  "created_at": "2024-01-01T00:00:00",
  "embedding_model": "Qwen/Qwen2.5-0.5B",
  "collection_name": "gprmax_docs_v1",
  "chunk_size": 1000,
  "chunk_overlap": 200,
  "total_documents": 1234
}
```
## Configuration
### Chunking Parameters
- `CHUNK_SIZE`: 1000 characters (optimal for context windows)
- `CHUNK_OVERLAP`: 200 characters (ensures continuity)
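A minimal sliding-window chunker matching these parameters might look like this (a sketch; the actual implementation in `generate_db.py` may split on sentence or section boundaries instead):

```python
CHUNK_SIZE = 1000    # characters per chunk
CHUNK_OVERLAP = 200  # characters shared between consecutive chunks

def chunk_text(text, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Split text into overlapping character windows with start/end offsets."""
    chunks, start, step = [], 0, size - overlap
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append({"text": text[start:end],
                       "char_start": start, "char_end": end})
        start += step
    return chunks
```

The 200-character overlap means each window starts 800 characters after the previous one, so a sentence cut at a chunk boundary still appears whole in the next chunk.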
### Embedding Model
- Current: `Qwen/Qwen2.5-0.5B` (512-dim embeddings); note that the automatic first-run build uses ChromaDB's default embeddings (`all-MiniLM-L6-v2`)
- Future: `Qwen/Qwen3-Embedding-0.6B` (when available)
### Database Settings
- Storage: ChromaDB persistent client
- Collection: `gprmax_docs_v1` (versioned for updates)
- Distance Metric: Cosine similarity
## Maintenance
### Regular Updates
Run monthly, or whenever the gprMax documentation is updated:
```bash
# This will pull latest docs and update database
python generate_db.py
```
### Database Backup
```bash
# Backup database
cp -r chroma_db chroma_db_backup_$(date +%Y%m%d)
```
### Performance Tuning
- Adjust `CHUNK_SIZE` and `CHUNK_OVERLAP` in `generate_db.py`
- Modify batch sizes for large datasets
- Use GPU acceleration with `--device cuda`
## Integration with Main App
The RAG system integrates with the main Gradio app:
1. Import retriever in `app.py`
2. Use retriever to augment prompts with context
3. Display source references in UI
Example integration:
```python
# In app.py
from rag_db.retriever import create_retriever

retriever = create_retriever()

def augment_with_context(user_query):
    context = retriever.get_context(user_query, k=3)
    augmented_prompt = f"""
Context from documentation:
{context}

User question: {user_query}
"""
    return augmented_prompt
```
## Troubleshooting
### Common Issues
1. **Database not found**
- Run `python generate_db.py` first
- Check `--db-path` parameter
2. **Out of memory**
- Use smaller batch sizes
- Use CPU instead of GPU
- Reduce chunk size
3. **Slow generation**
- Use GPU with `--device cuda`
- Reduce repository depth with shallow clone
- Use pre-generated database
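The shallow-clone tip above translates to the standard git flag (shown against the upstream gprMax repository; the generator may already do this internally):

```shell
# Fetch only the latest commit instead of the full history
git clone --depth 1 https://github.com/gprMax/gprMax.git
```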
### Logs
Check generation logs for detailed information:
```bash
python generate_db.py 2>&1 | tee generation.log
```
## Future Enhancements
1. **Model Upgrade**: Migrate to Qwen3-Embedding-0.6B when available
2. **Incremental Updates**: Add documents without full regeneration
3. **Multi-modal Support**: Include images and diagrams from docs
4. **Query Expansion**: Automatic query reformulation for better retrieval
5. **Caching Layer**: Redis cache for frequent queries
6. **Fine-tuned Embeddings**: Domain-specific embedding model for gprMax
## License
Same as the parent project.