# gprMax RAG Database System

## Overview

This is a production-ready Retrieval-Augmented Generation (RAG) system for the gprMax documentation. It provides efficient vector search over the docs, enabling intelligent context retrieval for the chatbot.
## Architecture

### Components

1. **Document Processor**: Extracts and chunks documentation from the gprMax GitHub repository
2. **Embedding Model**: Qwen2.5-0.5B (will upgrade to Qwen3-Embedding-0.6B when available)
3. **Vector Database**: ChromaDB with persistent storage
4. **Retriever**: Search and context retrieval utilities

### Key Features

- Automatic documentation extraction from the gprMax GitHub repository
- Intelligent chunking with configurable size and overlap
- Persistent vector database using ChromaDB
- Efficient similarity search with score thresholding
- Metadata tracking for reproducibility
## Installation

The database is **automatically generated** on first startup of the application. No manual installation is required.

## Automatic Generation

When the app starts, it:

1. Checks whether the database exists at `rag-db/chroma_db/`
2. If not found, automatically runs `generate_db.py`
3. Clones the gprMax repository and processes the documentation
4. Creates the ChromaDB collection with default embeddings (all-MiniLM-L6-v2)
5. Is ready to use - this only happens once
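The first-startup check in steps 1-2 can be sketched as follows; `ensure_database` is an illustrative helper, not the app's actual code, and it assumes `generate_db.py` lives next to the database directory:

```python
import subprocess
from pathlib import Path

def ensure_database(db_path: Path = Path("rag-db/chroma_db")) -> bool:
    """Return True if the database already exists; otherwise build it once.

    `generate_db.py` clones the gprMax repository and populates ChromaDB,
    so the expensive branch runs only on the very first startup.
    """
    if db_path.exists():
        return True
    subprocess.run(["python", "generate_db.py"], check=True, cwd=db_path.parent)
    return False
```

Subsequent startups take the fast path and reuse the persisted database.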
## Manual Generation (Optional)

If you need to regenerate the database manually:

```bash
cd rag-db
python generate_db.py --recreate
```

Custom settings:

```bash
python generate_db.py \
    --db-path ./custom_db \
    --temp-dir ./temp \
    --device cuda \
    --recreate
```
### Use the Retriever in the Application
```python
from rag_db.retriever import create_retriever

# Initialize retriever
retriever = create_retriever(db_path="./rag-db/chroma_db")

# Search for relevant documents
results = retriever.search("How to create a source?", k=5)

# Get formatted context for LLM
context = retriever.get_context("antenna patterns", k=3)

# Get relevant source files
files = retriever.get_relevant_files("boundary conditions")

# Get database statistics
stats = retriever.get_stats()
```
### Test the Retriever
```bash
# Test with default query
python retriever.py

# Test with custom query
python retriever.py "How to model soil layers?"
```
## Database Schema

### Document Structure

```json
{
  "id": "unique_hash",
  "text": "document_chunk_text",
  "metadata": {
    "source": "docs/relative/path.rst",
    "file_type": ".rst",
    "chunk_index": 0,
    "char_start": 0,
    "char_end": 1000
  }
}
```
### Metadata File

The generated `metadata.json` contains:

```json
{
  "created_at": "2024-01-01T00:00:00",
  "embedding_model": "Qwen/Qwen2.5-0.5B",
  "collection_name": "gprmax_docs_v1",
  "chunk_size": 1000,
  "chunk_overlap": 200,
  "total_documents": 1234
}
```
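Before reusing an existing database, these fields can be compared against the current configuration. A hypothetical `is_compatible` helper (not part of the actual codebase) might look like:

```python
import json
from pathlib import Path

def is_compatible(db_dir,
                  expected_model="Qwen/Qwen2.5-0.5B",
                  expected_collection="gprmax_docs_v1"):
    """Check that a stored database matches the current settings."""
    meta = json.loads((Path(db_dir) / "metadata.json").read_text())
    return (meta.get("embedding_model") == expected_model
            and meta.get("collection_name") == expected_collection)
```

If the check fails, regenerating with `--recreate` avoids mixing embeddings from different models in one collection.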
## Configuration

### Chunking Parameters

- `CHUNK_SIZE`: 1000 characters (optimal for context windows)
- `CHUNK_OVERLAP`: 200 characters (ensures continuity across chunk boundaries)
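The sliding-window chunking these parameters drive can be sketched as below; this is illustrative, and `generate_db.py` may differ in details such as splitting on sentence boundaries:

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split `text` into chunks of at most `chunk_size` characters.

    Consecutive chunks share `chunk_overlap` characters, so content cut
    at a chunk boundary still appears intact in the next chunk. Returns
    (char_start, char_end, chunk) tuples matching the document schema.
    """
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        chunks.append((start, start + len(chunk), chunk))
        if start + chunk_size >= len(text):
            break
    return chunks
```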
### Embedding Model

- Current: `Qwen/Qwen2.5-0.5B` (512-dim embeddings)
- Future: `Qwen/Qwen3-Embedding-0.6B` (when available)

### Database Settings

- Storage: ChromaDB persistent client
- Collection: `gprmax_docs_v1` (versioned for updates)
- Distance metric: Cosine similarity
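The score thresholding mentioned under Key Features works against this cosine metric. Note that ChromaDB reports cosine *distance* (`1 - similarity`), so results must be converted before thresholding; the helper names below are illustrative:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def filter_by_score(hits, threshold=0.3):
    """Drop hits whose similarity (1 - cosine distance) is below `threshold`."""
    return [h for h in hits if 1.0 - h["distance"] >= threshold]
```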
## Maintenance

### Regular Updates

Run monthly, or whenever the gprMax documentation is updated:

```bash
# Pull the latest docs and update the database
python generate_db.py
```

### Database Backup

```bash
# Back up the database
cp -r chroma_db chroma_db_backup_$(date +%Y%m%d)
```
### Performance Tuning

- Adjust `CHUNK_SIZE` and `CHUNK_OVERLAP` in `generate_db.py`
- Modify batch sizes for large datasets
- Use GPU acceleration with `--device cuda`

## Integration with Main App

The RAG system integrates with the main Gradio app:

1. Import the retriever in `app.py`
2. Use the retriever to augment prompts with context
3. Display source references in the UI

Example integration:
```python
# In app.py
from rag_db.retriever import create_retriever

retriever = create_retriever()

def augment_with_context(user_query):
    context = retriever.get_context(user_query, k=3)
    augmented_prompt = f"""
Context from documentation:
{context}

User question: {user_query}
"""
    return augmented_prompt
```
## Troubleshooting

### Common Issues

1. **Database not found**
   - Run `python generate_db.py` first
   - Check the `--db-path` parameter
2. **Out of memory**
   - Use smaller batch sizes
   - Use CPU instead of GPU
   - Reduce chunk size
3. **Slow generation**
   - Use GPU with `--device cuda`
   - Use a shallow clone to reduce download time
   - Use a pre-generated database

### Logs

Check the generation logs for detailed information:

```bash
python generate_db.py 2>&1 | tee generation.log
```
## Future Enhancements

1. **Model Upgrade**: Migrate to Qwen3-Embedding-0.6B when available
2. **Incremental Updates**: Add documents without full regeneration
3. **Multi-modal Support**: Include images and diagrams from the docs
4. **Query Expansion**: Automatic query reformulation for better retrieval
5. **Caching Layer**: Redis cache for frequent queries
6. **Fine-tuned Embeddings**: Domain-specific embedding model for gprMax

## License

Same as the parent project.