mafzaal committed
Commit 35d94ce · 1 Parent(s): 9ffcda2

Refactor DESIGN.md for clarity and consistency in problem definition and audience analysis

Files changed (1)
  1. DESIGN.md +68 -73
DESIGN.md CHANGED
@@ -1,104 +1,103 @@
  # AIM AIE 6 Certification Challenge: AI-Driven Blog Chat Component

- ## Task 1: Defining your Problem and Audience

  ### Problem Statement
- Content-rich blogs are difficult for readers to navigate and extract specific information from, leading to poor information retention and reader engagement.

  ### Target Audience Analysis

- For our first implementation, we will focus on content from [TheDataGuy](https://thedataguy.pro) blog, which caters to a technical audience primarily consisting of:
-
- 1. **AI/ML Engineers & Researchers** - Professionals working on LLM applications and RAG systems who need guidance on evaluation frameworks and best practices.
-
- 2. **Data Scientists & Engineers** - Individuals looking to implement robust data strategies and evaluation methodologies for AI systems.
-
- 3. **Technical Leaders & Architects** - Decision-makers seeking insights on AI implementation, data strategy, and technical approaches for enterprise solutions.
-
- 4. **Developers Building AI Applications** - Software engineers implementing AI/ML capabilities who need practical advice on topics like RAG systems, evaluation metrics, and feedback loops.
-
- 5. **Data Strategists** - Professionals focused on leveraging data as a strategic asset for business success.

  ### Potential User Questions
  1. What is Ragas and how does it help evaluate LLM applications?
  2. How do I set up a basic evaluation workflow with Ragas?
  3. What are the key metrics to evaluate RAG systems?
- 4. How can I generate synthetic test data for evaluating my RAG system?
- 5. How do I create custom metrics in Ragas for my specific use case?
  6. What's the difference between Faithfulness and Factual Correctness metrics?
- 7. How can I build a research agent with RSS feed support like the one described in your blog?
- 8. What technologies did you use to build your research agent?
- 9. How can I implement feedback loops to continuously improve my LLM application?
- 10. Why do you consider data strategy so important for business success?

- These questions align with the main themes in the blog posts, which cover LLM evaluation, RAG systems, custom development of AI tools, and data strategy.

  ## Task 2: Proposed Solution

- We propose an AI-driven chat assistant for [TheDataGuy](https://thedataguy.pro)'s blog that enables readers to interactively explore technical content. This component will:

- - Provide contextually relevant responses about RAG systems, evaluation metrics, and data strategy
- - Allow users to ask clarifying questions about complex concepts without searching across multiple articles
- - Deliver code examples and detailed explanations tailored to the user's technical background
- - Function as a knowledge companion that has internalized the blog's expertise
- - Create a personalized learning experience that adapts to each visitor's specific interests

- The solution transforms passive reading into an interactive dialogue, significantly enhancing information discovery and retention.

  ## Technology Stack

  1. **LLM Architecture**:
- - **Primary Model**: OpenAI `gpt-4.1` - Handles complex tasks including synthetic data generation, sophisticated evaluation workflows, and fine-tuning questions development
- - **Inference Model**: OpenAI `gpt-4o-mini` - Powers the user-facing chat application, offering an optimal balance between performance and cost-efficiency for real-time responses

  2. **Embedding Model**:
- - **Base Model**: `Snowflake/snowflake-arctic-embed-l` - Provides foundation embedding capabilities optimized for technical content with robust semantic understanding
- - **Fine-tuned Model**: `mafzaal/thedataguy_arctic_ft` - Custom-tuned embedding model using query-context pairs extracted from blog content, enhancing retrieval accuracy for domain-specific AI terminology and concepts

- 3. **Orchestration**: LangChain - Provides flexible components for building LLM applications with robust RAG pipelines and context management tailored for technical blog content.

- 4. **Vector Database**: Qdrant - Stores embeddings generated through a `pipeline.py` script which also functions as a GitHub workflow to automatically incorporate new blog posts. Provides robust filtering capabilities for content categorization and delivers high performance with manageable operational complexity.

- 5. **Monitoring**: LangSmith - Integrates seamlessly with LangChain while providing comprehensive tracing, debugging, and performance monitoring specific to LLM applications.

- 6. **Evaluation**: Ragas - Aligns perfectly with [TheDataGuy](https://thedataguy.pro)'s expertise and blog content, enabling evaluation of the chat system on metrics like faithfulness and relevance. See following notebooks
- - [05_SDG_Eval](/py-src/notebooks/05_SDG_Eval.ipynb)
- - [07_Fine_Tuning_Dataset](/py-src/notebooks/07_Fine_Tuning_Dataset.ipynb)
- - [07_Fine_Tune_Embeddings](/py-src/notebooks/07_Fine_Tune_Embeddings.ipynb)
- - [07_Fine_Tune_Eval](/py-src/notebooks/07_Fine_Tune_Eval.ipynb)

  7. **User Interface**:
- - **Current Implementation**: Chainlit - Provides rapid prototyping capabilities with built-in chat UI components
- - **Production Version**: Custom Svelte component - Delivers a lightweight, responsive interface that seamlessly integrates with the blog's existing design language while minimizing impact on page load performance

  ## Serving & Inference
- - **Development Environment**: Prototype deployed on Hugging Face Spaces for rapid testing and validation, visit [Let's Talk](https://huggingface.co/spaces/mafzaal/lets_talk)
  ### Future
- - **Production Infrastructure**: Azure Container Apps - Provides event-driven autoscaling and enterprise-grade security while integrating with [TheDataGuy](https://thedataguy.pro)'s existing Azure-based technical ecosystem
- - **API Layer**: FastAPI - Delivers high-performance endpoints with automatic OpenAPI documentation, facilitating seamless integration with the blog's frontend
- - **Deployment Strategy**: CI/CD pipeline using GitHub Actions - Ensures consistent testing and deployment with automated content indexing whenever new blog posts are published

  ## Agentic Reasoning

- Agentic reasoing will be added in future version.

  # Task 3: Dealing with the Data

  ## Data Collection

- The blog data from [TheDataGuy](https://thedataguy.pro) was collected and processed for our AI-driven chat component. Below is a summary of the collected blog posts:

- | Title | Date | Text Length | URL |
- |-------|------|-------------|-----|
  | "Coming Back to AI Roots - My Professional Journey" | 2025-04-14 | 5,827 | [Link](https://thedataguy.pro/blog/coming-back-to-ai-roots/) |
  | "Data is King: Why Your Data Strategy IS Your Business Strategy" | 2025-04-15 | 6,197 | [Link](https://thedataguy.pro/blog/data-is-king/) |
  | "A C# Programmer's Perspective on LangChain Expression Language" | 2025-04-16 | 3,361 | [Link](https://thedataguy.pro/blog/langchain-experience-csharp-perspective/) |
  | "Building a Research Agent with RSS Feed Support" | 2025-04-20 | 7,320 | [Link](https://thedataguy.pro/blog/building-research-agent/) |
- | "Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications" | 2025-04-26 | 6,999 | [Link](https://thedataguy.pro/blog/introduction-to-ragas/) |
  | "Part 2: Basic Evaluation Workflow with Ragas" | 2025-04-26 | 11,223 | [Link](https://thedataguy.pro/blog/basic-evaluation-workflow-with-ragas/) |
  | "Part 3: Evaluating RAG Systems with Ragas" | 2025-04-26 | 8,811 | [Link](https://thedataguy.pro/blog/evaluating-rag-systems-with-ragas/) |
  | "Part 4: Generating Test Data with Ragas" | 2025-04-27 | 14,682 | [Link](https://thedataguy.pro/blog/generating-test-data-with-ragas/) |
@@ -112,18 +111,17 @@ The blog data from [TheDataGuy](https://thedataguy.pro) was collected and proces
  ## Chunking Strategy

- For our blog chat component, we evaluated multiple chunking approaches to optimize retrieval performance:

- 1. **Initial Experimental Approach**:
- - Used RecursiveCharacterTextSplitter with chunk size of 1000 characters and 200 character overlap
- - This approach provided granular context chunks for both baseline testing and embedding fine-tuning

- 2. **Final Implementation Decision**:
- - Opted to use whole blog posts as individual chunks rather than smaller text segments
- - This approach ensures complete retention of article context and coherence
- - Each blog post is treated as a distinct retrievable unit in the vector database

- The whole-document chunking strategy was selected to retrieve relevant full blog posts for response generation. This approach preserves article narrative integrity while providing sufficient context for accurate responses.

  ## Data Statistics Summary

@@ -131,25 +129,23 @@ The whole-document chunking strategy was selected to retrieve relevant full blog
  |-----------|-------|
  | Total Blog Posts | 14 |
  | Total Characters | 106,275 |
- | Minimum Post Length | 1,900 characters |
- | Maximum Post Length | 13,468 characters |
- | Average Post Length | 7,591 characters |
-
- With average post length under 8,000 characters, whole-document retrieval remains efficient while eliminating contextual fragmentation that occurs with smaller chunks. This approach optimizes for content coherence over granularity, supporting comprehensive responses to technical queries about RAG systems, evaluation frameworks, and data strategy.

  ## Tools and APIs
- We will also using [RSS Feed](https://thedataguy.pro/rss.xml) with tool call to get latest posts which are not yet vectorized
-

  # Task 4: Building a Quick End-to-End Prototype

- Live demo at [Let's Talk](https://huggingface.co/spaces/mafzaal/lets_talk)

  # Task 5: Creating a Golden Test Data Set

- Synthetic data is available at [Testset](/evals/testset_2.csv), [Eval set](/evals/rag_eval_2.csv) and [Results](/evals/rag_eval_result_2.csv) and here is summary

  ## Evaluation Results Summary

@@ -162,16 +158,15 @@ Synthetic data is available at [Testset](/evals/testset_2.csv), [Eval set](/eval
  | Context Recall | 0.1905 |
  | Context Entity Recall | 0.1503 |

- These results indicate strong faithfulness but opportunities for improvement in contextual relevance and factual accuracy. The relatively low context recall and entity recall scores suggest that the retrieval component may need refinement to better surface relevant information from the blog content.

  # Task 6: Fine-Tuning Open-Source Embeddings

- Fine tuning dataset is available at [Link](/evals/ft_questions.csv) and also uploaded at [thedataguy_embed_ft](https://huggingface.co/datasets/mafzaal/thedataguy_embed_ft) and notebook is available at [07_Fine_Tune_Embeddings](/py-src/notebooks/07_Fine_Tune_Embeddings.ipynb)
-

  # Task 7: Assessing Performance

- Following is evalation based on finetuned embedding model.

  ## Fine-Tuned Embedding Model Evaluation Results

@@ -184,5 +179,5 @@ Following is evalation based on finetuned embedding model.
  | Context Recall | 0.2500 |
  | Context Entity Recall | 0.2175 |

- The fine-tuned embedding model shows improved answer relevancy and context recall compared to the base model. While faithfulness decreased, the system demonstrates better ability to retrieve relevant information. These results suggest that the fine-tuning process has shifted the model's strengths toward delivering more contextually appropriate responses, though further optimization is needed to improve faithfulness and factual accuracy.


  # AIM AIE 6 Certification Challenge: AI-Driven Blog Chat Component

+ ## Task 1: Defining the Problem and Audience

  ### Problem Statement
+ Content-rich technical blogs are difficult to navigate, making information extraction challenging and resulting in reduced reader engagement and information retention.

  ### Target Audience Analysis

+ Our initial implementation focuses on the [TheDataGuy](https://thedataguy.pro) blog, which serves a technical audience including:
+
+ 1. **AI/ML Engineers & Researchers** - Professionals seeking guidance on LLM applications, RAG systems, and evaluation frameworks
+
+ 2. **Data Scientists & Engineers** - Practitioners implementing data strategies and evaluation methodologies for AI systems
+
+ 3. **Technical Leaders & Architects** - Decision-makers exploring AI implementation, data strategy, and enterprise solutions
+
+ 4. **Developers Building AI Applications** - Engineers needing practical advice on RAG systems and evaluation metrics
+
+ 5. **Data Strategists** - Professionals leveraging data as a strategic business asset

  ### Potential User Questions
  1. What is Ragas and how does it help evaluate LLM applications?
  2. How do I set up a basic evaluation workflow with Ragas?
  3. What are the key metrics to evaluate RAG systems?
+ 4. How can I generate synthetic test data for my RAG system?
+ 5. How do I create custom metrics in Ragas for specific use cases?
  6. What's the difference between Faithfulness and Factual Correctness metrics?
+ 7. How can I build a research agent with RSS feed support?
+ 8. What technologies are optimal for building a research agent?
+ 9. How can I implement feedback loops to improve my LLM application?
+ 10. Why is data strategy critical for business success?

+ These questions align with the blog's main themes: LLM evaluation, RAG systems, AI tool development, and data strategy.

  ## Task 2: Proposed Solution

+ We propose an AI-driven chat assistant for [TheDataGuy](https://thedataguy.pro)'s blog that enables readers to interactively explore technical content. This solution will:

+ - Provide contextually relevant responses on RAG systems, evaluation metrics, and data strategy
+ - Enable users to ask clarifying questions without searching across multiple articles
+ - Deliver code examples and explanations tailored to users' technical backgrounds
+ - Function as a knowledge companion with the blog's expertise
+ - Create personalized learning experiences based on visitors' specific interests

+ This solution transforms passive reading into interactive dialogue, enhancing information discovery and retention.

  ## Technology Stack

  1. **LLM Architecture**:
+ - **Primary Model**: OpenAI `gpt-4.1` - For complex tasks including synthetic data generation and evaluation workflows
+ - **Inference Model**: OpenAI `gpt-4o-mini` - Powers the chat application with optimal performance/cost balance

  2. **Embedding Model**:
+ - **Base Model**: `Snowflake/snowflake-arctic-embed-l` - Foundation embedding capabilities for technical content
+ - **Fine-tuned Model**: `mafzaal/thedataguy_arctic_ft` - Custom-tuned using blog-specific query-context pairs

+ 3. **Orchestration**: LangChain - Flexible components for LLM applications with robust RAG pipelines and context management

+ 4. **Vector Database**: Qdrant - Stores embeddings via `pipeline.py` with GitHub workflow automation for new blog posts (see the retrieval sketch after this list)

+ 5. **Monitoring**: LangSmith - Seamless LangChain integration with comprehensive tracing and performance monitoring

+ 6. **Evaluation**: Ragas - Aligns closely with [TheDataGuy](https://thedataguy.pro)'s expertise for metrics like faithfulness and relevance:
+ - [05_SDG_Eval](/py-src/notebooks/05_SDG_Eval.ipynb)
+ - [07_Fine_Tuning_Dataset](/py-src/notebooks/07_Fine_Tuning_Dataset.ipynb)
+ - [07_Fine_Tune_Embeddings](/py-src/notebooks/07_Fine_Tune_Embeddings.ipynb)
+ - [07_Fine_Tune_Eval](/py-src/notebooks/07_Fine_Tune_Eval.ipynb)

  7. **User Interface**:
+ - **Current**: Chainlit - Rapid prototyping with built-in chat UI components
+ - **Production**: Custom Svelte component - Lightweight, responsive interface integrating with the blog's design
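
To make the retrieval stack concrete, the sketch below wires the fine-tuned embedding model, Qdrant, and a `gpt-4o-mini` chain together with LangChain. It is illustrative only: the actual ingestion lives in `pipeline.py` (not shown in this document), and the `data/blog` directory, prompt text, and `k=2` retriever setting are assumptions.

```python
# Illustrative sketch only - not the project's pipeline.py.
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_community.vectorstores import Qdrant
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# 1. Load blog posts; whole posts are used as chunks (see Chunking Strategy).
docs = DirectoryLoader("data/blog", glob="**/*.md", loader_cls=TextLoader).load()

# 2. Embed with the fine-tuned arctic model and index into an in-memory Qdrant.
embeddings = HuggingFaceEmbeddings(model_name="mafzaal/thedataguy_arctic_ft")
vectorstore = Qdrant.from_documents(
    docs, embeddings, location=":memory:", collection_name="thedataguy_blog"
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

# 3. Answer questions with gpt-4o-mini over the retrieved posts.
prompt = ChatPromptTemplate.from_template(
    "Answer using only the blog context below.\n\nContext:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def format_docs(docs):
    return "\n\n".join(d.page_content for d in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(chain.invoke("What are the key metrics to evaluate RAG systems?"))
```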

  ## Serving & Inference
+ - **Development**: Prototype on Hugging Face Spaces ([Let's Talk](https://huggingface.co/spaces/mafzaal/lets_talk))

  ### Future
+ - **Production**: Azure Container Apps - Event-driven autoscaling with enterprise-grade security
+ - **API Layer**: FastAPI - High-performance endpoints with automatic OpenAPI documentation (see the endpoint sketch after this list)
+ - **Deployment**: CI/CD via GitHub Actions - Consistent testing with automated content indexing for new blog posts
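
As a sketch of how the future FastAPI layer could expose the chat chain, the endpoint below is hypothetical: the `/chat` route, request/response models, and the `rag_chain` module are assumptions rather than part of the current Chainlit prototype.

```python
# Hypothetical API-layer sketch - not part of the current Chainlit prototype.
from fastapi import FastAPI
from pydantic import BaseModel

from rag_chain import chain  # hypothetical module exporting the chain sketched above

app = FastAPI(title="Let's Talk API")

class ChatRequest(BaseModel):
    question: str

class ChatResponse(BaseModel):
    answer: str

@app.post("/chat", response_model=ChatResponse)
async def chat(req: ChatRequest) -> ChatResponse:
    # Run the retrieval chain asynchronously and return the generated answer.
    answer = await chain.ainvoke(req.question)
    return ChatResponse(answer=answer)
```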

  ## Agentic Reasoning

+ Agentic reasoning will be added in a future version.

  # Task 3: Dealing with the Data

  ## Data Collection

+ The blog data from [TheDataGuy](https://thedataguy.pro) was collected and processed for our chat component. Summary of collected posts:

+ | Title | Date | Length (chars) | URL |
+ |-------|------|----------------|-----|
  | "Coming Back to AI Roots - My Professional Journey" | 2025-04-14 | 5,827 | [Link](https://thedataguy.pro/blog/coming-back-to-ai-roots/) |
  | "Data is King: Why Your Data Strategy IS Your Business Strategy" | 2025-04-15 | 6,197 | [Link](https://thedataguy.pro/blog/data-is-king/) |
  | "A C# Programmer's Perspective on LangChain Expression Language" | 2025-04-16 | 3,361 | [Link](https://thedataguy.pro/blog/langchain-experience-csharp-perspective/) |
  | "Building a Research Agent with RSS Feed Support" | 2025-04-20 | 7,320 | [Link](https://thedataguy.pro/blog/building-research-agent/) |
+ | "Part 1: Introduction to Ragas: The Essential Evaluation Framework" | 2025-04-26 | 6,999 | [Link](https://thedataguy.pro/blog/introduction-to-ragas/) |
  | "Part 2: Basic Evaluation Workflow with Ragas" | 2025-04-26 | 11,223 | [Link](https://thedataguy.pro/blog/basic-evaluation-workflow-with-ragas/) |
  | "Part 3: Evaluating RAG Systems with Ragas" | 2025-04-26 | 8,811 | [Link](https://thedataguy.pro/blog/evaluating-rag-systems-with-ragas/) |
  | "Part 4: Generating Test Data with Ragas" | 2025-04-27 | 14,682 | [Link](https://thedataguy.pro/blog/generating-test-data-with-ragas/) |
 
  ## Chunking Strategy

+ For our blog chat component, we evaluated multiple chunking approaches:

+ 1. **Initial Experiment**:
+ - RecursiveCharacterTextSplitter with 1000-character chunks and 200-character overlap (sketched below)
+ - Provided granular context chunks for baseline testing and embedding fine-tuning

+ 2. **Final Implementation**:
+ - Whole blog posts as individual chunks to preserve complete article context
+ - Each post treated as a distinct retrievable unit in the vector database

+ The whole-document strategy preserves article narrative integrity while providing sufficient context for accurate responses.
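
For reference, here is a minimal sketch of the experimental splitter configuration; the import path reflects current LangChain packaging and the sample file path is hypothetical.

```python
# Sketch of the experimental chunking configuration (not used in the final build,
# which indexes whole posts).
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # characters per chunk
    chunk_overlap=200,  # characters shared between consecutive chunks
)

# Hypothetical local copy of one blog post.
with open("data/blog/introduction-to-ragas.md", encoding="utf-8") as f:
    post = f.read()

chunks = splitter.split_text(post)
print(f"{len(chunks)} chunks, first chunk starts: {chunks[0][:80]}...")
```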

  ## Data Statistics Summary

  |-----------|-------|
  | Total Blog Posts | 14 |
  | Total Characters | 106,275 |
+ | Minimum Post Length | 1,900 chars |
+ | Maximum Post Length | 13,468 chars |
+ | Average Post Length | 7,591 chars |

+ With average post length under 8,000 characters, whole-document retrieval remains efficient while maintaining content coherence, supporting comprehensive responses about RAG systems, evaluation frameworks, and data strategy.

  ## Tools and APIs
+ We will use the [RSS Feed](https://thedataguy.pro/rss.xml) with tool calls to retrieve the latest posts that are not yet vectorized, as sketched below.
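
A hedged sketch of how the RSS lookup could be exposed as a tool: the feed URL comes from above, while the `feedparser` usage, tool name, and `limit` parameter are illustrative assumptions rather than the implemented tool.

```python
# Sketch of an RSS lookup tool so the assistant can surface posts that are not
# yet in the vector store.
import feedparser
from langchain_core.tools import tool

RSS_URL = "https://thedataguy.pro/rss.xml"

@tool
def latest_blog_posts(limit: int = 5) -> str:
    """Return the most recent post titles and links from TheDataGuy RSS feed."""
    feed = feedparser.parse(RSS_URL)
    lines = [f"- {entry.title} ({entry.link})" for entry in feed.entries[:limit]]
    return "\n".join(lines) if lines else "No posts found."

print(latest_blog_posts.invoke({"limit": 3}))
```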
 

  # Task 4: Building a Quick End-to-End Prototype

+ A live demo is available at [Let's Talk](https://huggingface.co/spaces/mafzaal/lets_talk).

  # Task 5: Creating a Golden Test Data Set

+ Synthetic data is available at [Testset](/evals/testset_2.csv), [Eval set](/evals/rag_eval_2.csv), and [Results](/evals/rag_eval_result_2.csv).
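
To illustrate how the golden test set feeds Ragas, here is a hedged evaluation sketch. The CSV column names and the metric list are assumptions, and the project's actual workflow lives in the [05_SDG_Eval](/py-src/notebooks/05_SDG_Eval.ipynb) notebook, so treat this as illustrative rather than the code that produced the numbers below.

```python
# Illustrative Ragas evaluation over the golden test set.
# Assumed CSV columns: question, answer, contexts, ground_truth.
# Requires OPENAI_API_KEY for the judge LLM used by the default metrics.
import ast
import pandas as pd
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_entity_recall,
)

df = pd.read_csv("evals/rag_eval_2.csv")
df["contexts"] = df["contexts"].apply(ast.literal_eval)  # stringified lists -> lists

result = evaluate(
    Dataset.from_pandas(df[["question", "answer", "contexts", "ground_truth"]]),
    metrics=[faithfulness, answer_relevancy, context_recall, context_entity_recall],
)
print(result)  # aggregate scores per metric
```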

  ## Evaluation Results Summary

  | Context Recall | 0.1905 |
  | Context Entity Recall | 0.1503 |

+ These results show strong faithfulness but opportunities for improvement in contextual relevance and factual accuracy. The low context recall and entity recall scores suggest the retrieval component needs refinement to better surface relevant information from blog content.

  # Task 6: Fine-Tuning Open-Source Embeddings

+ The fine-tuning dataset is available at [Link](/evals/ft_questions.csv) and uploaded to [thedataguy_embed_ft](https://huggingface.co/datasets/mafzaal/thedataguy_embed_ft). Implementation is in [07_Fine_Tune_Embeddings](/py-src/notebooks/07_Fine_Tune_Embeddings.ipynb).
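
Below is a minimal sketch of one standard way to fine-tune the arctic embedding model on query-context pairs with sentence-transformers. The column names, loss choice, and hyperparameters are assumptions; the project's actual procedure is in the [07_Fine_Tune_Embeddings](/py-src/notebooks/07_Fine_Tune_Embeddings.ipynb) notebook.

```python
# Sketch of embedding fine-tuning on query-context pairs (assumptions noted inline).
import pandas as pd
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

df = pd.read_csv("evals/ft_questions.csv")  # assumed columns: question, context
examples = [
    InputExample(texts=[q, c]) for q, c in zip(df["question"], df["context"])
]
loader = DataLoader(examples, shuffle=True, batch_size=16)

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l")
loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives for pairs

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=50)
model.save("models/thedataguy_arctic_ft")  # local path; upload to the Hub separately
```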
 
  # Task 7: Assessing Performance

+ The following evaluation uses the fine-tuned embedding model:

  ## Fine-Tuned Embedding Model Evaluation Results

  | Context Recall | 0.2500 |
  | Context Entity Recall | 0.2175 |

+ The fine-tuned model shows improved answer relevancy and context recall compared to the base model. While faithfulness decreased, the system better retrieves relevant information. These results suggest the fine-tuning process shifted strengths toward contextually appropriate responses, though further optimization is needed for faithfulness and factual accuracy.
183