Refactor DESIGN.md for clarity and consistency in problem definition and audience analysis
# AIM AIE 6 Certification Challenge: AI-Driven Blog Chat Component
|
## Task 1: Defining the Problem and Audience

### Problem Statement

Content-rich technical blogs are difficult to navigate: readers struggle to extract the information they need, which reduces engagement and retention.

### Target Audience Analysis

Our initial implementation focuses on the [TheDataGuy](https://thedataguy.pro) blog, which serves a technical audience including:

1. **AI/ML Engineers & Researchers** - Professionals seeking guidance on LLM applications, RAG systems, and evaluation frameworks
2. **Data Scientists & Engineers** - Practitioners implementing data strategies and evaluation methodologies for AI systems
3. **Technical Leaders & Architects** - Decision-makers exploring AI implementation, data strategy, and enterprise solutions
4. **Developers Building AI Applications** - Engineers needing practical advice on RAG systems and evaluation metrics
5. **Data Strategists** - Professionals leveraging data as a strategic business asset

### Potential User Questions

1. What is Ragas and how does it help evaluate LLM applications?
2. How do I set up a basic evaluation workflow with Ragas?
3. What are the key metrics to evaluate RAG systems?
4. How can I generate synthetic test data for my RAG system?
5. How do I create custom metrics in Ragas for specific use cases?
6. What's the difference between Faithfulness and Factual Correctness metrics?
7. How can I build a research agent with RSS feed support?
8. What technologies are optimal for building a research agent?
9. How can I implement feedback loops to improve my LLM application?
10. Why is data strategy critical for business success?

These questions align with the blog's main themes: LLM evaluation, RAG systems, AI tool development, and data strategy.

## Task 2: Proposed Solution

We propose an AI-driven chat assistant for [TheDataGuy](https://thedataguy.pro)'s blog that enables readers to interactively explore technical content. This solution will:

- Provide contextually relevant responses on RAG systems, evaluation metrics, and data strategy
- Enable users to ask clarifying questions without searching across multiple articles
- Deliver code examples and explanations tailored to users' technical backgrounds
- Function as a knowledge companion grounded in the blog's expertise
- Create personalized learning experiences based on visitors' specific interests

This solution transforms passive reading into interactive dialogue, enhancing information discovery and retention.

## Technology Stack

1. **LLM Architecture**:
   - **Primary Model**: OpenAI `gpt-4.1` - For complex tasks including synthetic data generation and evaluation workflows
   - **Inference Model**: OpenAI `gpt-4o-mini` - Powers the chat application with an optimal performance/cost balance

2. **Embedding Model**:
   - **Base Model**: `Snowflake/snowflake-arctic-embed-l` - Foundation embedding capabilities for technical content
   - **Fine-tuned Model**: `mafzaal/thedataguy_arctic_ft` - Custom-tuned using blog-specific query-context pairs

3. **Orchestration**: LangChain - Flexible components for LLM applications with robust RAG pipelines and context management (see the sketch after this list)

4. **Vector Database**: Qdrant - Stores embeddings via `pipeline.py`, with GitHub workflow automation for new blog posts

5. **Monitoring**: LangSmith - Seamless LangChain integration with comprehensive tracing and performance monitoring

6. **Evaluation**: Ragas - Aligns directly with [TheDataGuy](https://thedataguy.pro)'s expertise and covers metrics like faithfulness and relevance:
   - [05_SDG_Eval](/py-src/notebooks/05_SDG_Eval.ipynb)
   - [07_Fine_Tuning_Dataset](/py-src/notebooks/07_Fine_Tuning_Dataset.ipynb)
   - [07_Fine_Tune_Embeddings](/py-src/notebooks/07_Fine_Tune_Embeddings.ipynb)
   - [07_Fine_Tune_Eval](/py-src/notebooks/07_Fine_Tune_Eval.ipynb)

7. **User Interface**:
   - **Current**: Chainlit - Rapid prototyping with built-in chat UI components
   - **Production**: Custom Svelte component - Lightweight, responsive interface integrating with the blog's design
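
To make the orchestration concrete, below is a minimal sketch of how these pieces could be wired together with LangChain. The Qdrant URL, collection name, and prompt wording are illustrative assumptions rather than the project's actual configuration; LangSmith tracing is switched on purely through environment variables.

```python
# Minimal RAG chain sketch: arctic embeddings + Qdrant retriever + gpt-4o-mini.
# The Qdrant URL, collection name, and prompt text are assumptions.
import os

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

# LangSmith tracing only needs environment variables (LANGCHAIN_API_KEY set elsewhere).
os.environ["LANGCHAIN_TRACING_V2"] = "true"

embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")
vector_store = QdrantVectorStore(
    client=QdrantClient(url="http://localhost:6333"),  # assumed local instance
    collection_name="blog_posts",                      # assumed collection name
    embedding=embeddings,
)
retriever = vector_store.as_retriever(search_kwargs={"k": 4})

prompt = ChatPromptTemplate.from_template(
    "Answer the reader's question using only the blog excerpts below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def format_docs(docs):
    """Join retrieved posts into a single context string."""
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("What are the key metrics to evaluate RAG systems?"))
```

In the deployed system the retriever would point at the collection populated by `pipeline.py`, and LangSmith picks up traces automatically once the API key is present.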

## Serving & Inference

- **Development**: Prototype hosted on Hugging Face Spaces ([Let's Talk](https://huggingface.co/spaces/mafzaal/lets_talk)); the Chainlit entry point is sketched below
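
The prototype UI needs very little glue code. Here is a minimal Chainlit entry-point sketch; `app.chain` is a hypothetical module path for the `rag_chain` object sketched in the Technology Stack section.

```python
# Minimal Chainlit app sketch; `app.chain` is a hypothetical module exposing
# the rag_chain from the Technology Stack sketch.
import chainlit as cl

from app.chain import rag_chain  # hypothetical import path

@cl.on_message
async def on_message(message: cl.Message):
    # Answer each incoming chat message with the RAG chain.
    answer = await cl.make_async(rag_chain.invoke)(message.content)
    await cl.Message(content=answer).send()
```

Running `chainlit run app.py` serves a local version of the same experience the Hugging Face Space provides.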

### Future

- **Production**: Azure Container Apps - Event-driven autoscaling with enterprise-grade security
- **API Layer**: FastAPI - High-performance endpoints with automatic OpenAPI documentation (see the sketch below)
- **Deployment**: CI/CD via GitHub Actions - Consistent testing with automated content indexing for new blog posts
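
As a sketch of the future API layer, a single FastAPI route can expose the chain; the route path, request model, and module path are illustrative assumptions.

```python
# Minimal FastAPI serving sketch; route, request model, and import path are
# illustrative assumptions.
from fastapi import FastAPI
from pydantic import BaseModel

from app.chain import rag_chain  # hypothetical module exposing the chain

app = FastAPI(title="Blog Chat API")

class ChatRequest(BaseModel):
    question: str

@app.post("/chat")
async def chat(request: ChatRequest) -> dict:
    """Answer one question against the blog corpus."""
    answer = await rag_chain.ainvoke(request.question)
    return {"answer": answer}
```

FastAPI generates the OpenAPI documentation for this route automatically, which is the main reason it is listed above.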

## Agentic Reasoning

Agentic reasoning will be added in a future version.

## Task 3: Dealing with the Data

### Data Collection

The blog data from [TheDataGuy](https://thedataguy.pro) was collected and processed for our chat component. Summary of collected posts:

| Title | Date | Length (chars) | URL |
|-------|------|----------------|-----|
| "Coming Back to AI Roots - My Professional Journey" | 2025-04-14 | 5,827 | [Link](https://thedataguy.pro/blog/coming-back-to-ai-roots/) |
| "Data is King: Why Your Data Strategy IS Your Business Strategy" | 2025-04-15 | 6,197 | [Link](https://thedataguy.pro/blog/data-is-king/) |
| "A C# Programmer's Perspective on LangChain Expression Language" | 2025-04-16 | 3,361 | [Link](https://thedataguy.pro/blog/langchain-experience-csharp-perspective/) |
| "Building a Research Agent with RSS Feed Support" | 2025-04-20 | 7,320 | [Link](https://thedataguy.pro/blog/building-research-agent/) |
| "Part 1: Introduction to Ragas: The Essential Evaluation Framework" | 2025-04-26 | 6,999 | [Link](https://thedataguy.pro/blog/introduction-to-ragas/) |
| "Part 2: Basic Evaluation Workflow with Ragas" | 2025-04-26 | 11,223 | [Link](https://thedataguy.pro/blog/basic-evaluation-workflow-with-ragas/) |
| "Part 3: Evaluating RAG Systems with Ragas" | 2025-04-26 | 8,811 | [Link](https://thedataguy.pro/blog/evaluating-rag-systems-with-ragas/) |
| "Part 4: Generating Test Data with Ragas" | 2025-04-27 | 14,682 | [Link](https://thedataguy.pro/blog/generating-test-data-with-ragas/) |

### Chunking Strategy

For our blog chat component, we evaluated multiple chunking approaches:

1. **Initial Experiment**:
   - RecursiveCharacterTextSplitter with 1000-character chunks and 200-character overlap
   - Provided granular context chunks for baseline testing and embedding fine-tuning

2. **Final Implementation**:
   - Whole blog posts as individual chunks to preserve complete article context
   - Each post treated as a distinct retrievable unit in the vector database

The whole-document strategy preserves article narrative integrity while providing sufficient context for accurate responses.
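
Below is a minimal sketch of both approaches, including roughly the indexing step that `pipeline.py` automates; the loader choice, URL list, and collection name are illustrative assumptions.

```python
# Chunking and indexing sketch; loader, URLs, and collection name are assumptions.
from langchain_community.document_loaders import WebBaseLoader
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_qdrant import QdrantVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load blog posts, one Document per post (illustrative subset of URLs).
urls = ["https://thedataguy.pro/blog/introduction-to-ragas/"]
docs = WebBaseLoader(urls).load()

# 1. Initial experiment: fixed-size overlapping chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
small_chunks = splitter.split_documents(docs)

# 2. Final implementation: keep each post whole as a single retrievable unit.
whole_post_chunks = docs

# Index the whole posts into Qdrant.
embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")
QdrantVectorStore.from_documents(
    whole_post_chunks,
    embeddings,
    url="http://localhost:6333",   # assumed local instance
    collection_name="blog_posts",  # assumed collection name
)
```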

### Data Statistics Summary

| Statistic | Value |
|-----------|-------|
| Total Blog Posts | 14 |
| Total Characters | 106,275 |
| Minimum Post Length | 1,900 chars |
| Maximum Post Length | 13,468 chars |
| Average Post Length | 7,591 chars |

With an average post length under 8,000 characters, whole-document retrieval remains efficient while maintaining content coherence, supporting comprehensive responses about RAG systems, evaluation frameworks, and data strategy.
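
These figures are plain character counts over the loaded posts. A short sketch of the computation, assuming `docs` holds one LangChain `Document` per post as in the chunking sketch above:

```python
# Character-length statistics over one-Document-per-post `docs`.
lengths = [len(doc.page_content) for doc in docs]
print(f"Posts: {len(lengths)}")
print(f"Total characters: {sum(lengths):,}")
print(f"Min / max / average: {min(lengths):,} / {max(lengths):,} / {sum(lengths) // len(lengths):,}")
```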

### Tools and APIs

We will use the blog's [RSS Feed](https://thedataguy.pro/rss.xml) with tool calls to retrieve the latest posts that have not yet been vectorized.
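
A minimal sketch of such a tool, assuming the `feedparser` package; the tool name and return format are illustrative.

```python
# RSS lookup tool sketch; assumes the feedparser package is installed.
import feedparser
from langchain_core.tools import tool

RSS_URL = "https://thedataguy.pro/rss.xml"

@tool
def latest_blog_posts(limit: int = 5) -> str:
    """Return titles and links of the most recent posts from the blog's RSS feed."""
    feed = feedparser.parse(RSS_URL)
    return "\n".join(f"{entry.title}: {entry.link}" for entry in feed.entries[:limit])
```

The model can call this tool when a question concerns a post newer than the vector store's contents.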

## Task 4: Building a Quick End-to-End Prototype

A live demo is available at [Let's Talk](https://huggingface.co/spaces/mafzaal/lets_talk).

## Task 5: Creating a Golden Test Data Set

Synthetic data is available at [Testset](/evals/testset_2.csv), [Eval set](/evals/rag_eval_2.csv) and [Results](/evals/rag_eval_result_2.csv).
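
A sketch of how such a test set could be generated; the exact Ragas API differs between versions, and this assumes the 0.2-style `TestsetGenerator` plus the `docs` list from the chunking sketch.

```python
# Synthetic test set generation sketch; assumes a Ragas 0.2-style API.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.testset import TestsetGenerator

generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
testset = generator.generate_with_langchain_docs(docs, testset_size=10)
testset.to_pandas().to_csv("evals/testset_2.csv", index=False)
```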

### Evaluation Results Summary

| Metric | Score |
|--------|-------|
| Context Recall | 0.1905 |
| Context Entity Recall | 0.1503 |

These results show strong faithfulness but room for improvement in contextual relevance and factual accuracy. The low context recall and entity recall scores suggest the retrieval component needs refinement to better surface relevant information from the blog content.
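
The metrics above were produced with Ragas; a call along these lines is a minimal sketch of such an evaluation. The single-row dataset is a toy illustration, and metric imports vary slightly between Ragas versions.

```python
# RAG evaluation sketch; the one-row dataset is a toy illustration.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_entity_recall,
    context_recall,
    faithfulness,
)

eval_dataset = Dataset.from_dict({
    "question": ["What is Ragas?"],
    "contexts": [["Ragas is an evaluation framework for LLM applications..."]],
    "answer": ["Ragas is a framework for evaluating LLM and RAG pipelines."],
    "ground_truth": ["Ragas is an open-source framework for evaluating LLM applications."],
})

result = evaluate(
    eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_recall, context_entity_recall],
)
print(result)
```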

## Task 6: Fine-Tuning Open-Source Embeddings

The fine-tuning dataset is available at [Link](/evals/ft_questions.csv) and uploaded to [thedataguy_embed_ft](https://huggingface.co/datasets/mafzaal/thedataguy_embed_ft). The implementation is in [07_Fine_Tune_Embeddings](/py-src/notebooks/07_Fine_Tune_Embeddings.ipynb).
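
A sketch of the fine-tuning step with `sentence-transformers`; the CSV column names, hyperparameters, and output path are illustrative assumptions.

```python
# Embedding fine-tuning sketch; column names, hyperparameters, and output
# path are assumptions.
import pandas as pd
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

pairs = pd.read_csv("evals/ft_questions.csv")  # assumed columns: question, context
examples = [InputExample(texts=[row.question, row.context]) for row in pairs.itertuples()]

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l")
loader = DataLoader(examples, batch_size=16, shuffle=True)
# In-batch negatives work well for (query, positive-passage) pairs.
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=50)
model.save("models/thedataguy_arctic_ft")  # illustrative output path
```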

## Task 7: Assessing Performance

The following evaluation is based on the fine-tuned embedding model:

### Fine-Tuned Embedding Model Evaluation Results

| Metric | Score |
|--------|-------|
| Context Recall | 0.2500 |
| Context Entity Recall | 0.2175 |

The fine-tuned model shows improved answer relevancy and context recall compared to the base model. While faithfulness decreased, the system now retrieves relevant information more effectively. These results suggest that fine-tuning shifted the system's strengths toward contextually appropriate responses, though further optimization is needed for faithfulness and factual accuracy.