# AI Dataset Studio - Complete Deployment Guide

**Deploy your AI-powered dataset creation platform with Perplexity integration**

---

## Pre-Deployment Checklist

### **Required Files**

Ensure you have all these files ready:

```
ai-dataset-studio/
├── app.py                 # Main application with Perplexity integration
├── perplexity_client.py   # Perplexity AI client
├── config.py              # Configuration management
├── requirements.txt       # Dependencies
├── README.md              # Documentation
├── DEPLOYMENT.md          # This guide
└── utils.py               # Utility functions (optional)
```

### **API Keys & Environment**

- [ ] **Perplexity API Key** - Get from [Perplexity AI](https://www.perplexity.ai/)
- [ ] **HuggingFace Account** - For Space hosting
- [ ] **Optional**: HuggingFace Token for private datasets

---

## Deployment Options

### **Option 1: Full AI-Powered Deployment (Recommended)**

*Best for: Professional use, maximum features*

#### Hardware: **T4 Small** ($0.60/hour)

- GPU acceleration for AI models
- Fast processing (5-15s per article)
- All Perplexity features enabled
- Production-ready performance

#### **Step-by-Step:**

1. **Create HuggingFace Space**

   ```text
   # Go to: https://huggingface.co/new-space
   Space Name: ai-dataset-studio
   SDK: Gradio
   Hardware: T4 Small
   Visibility: Public (or Private)
   ```

2. **Upload Files**
   - Copy all files listed under Required Files
   - Ensure `app.py` is the main file
   - Keep the file structure intact

3. **Set Environment Variables**

   ```text
   # In Space Settings → Repository secrets:
   PERPLEXITY_API_KEY = your_perplexity_api_key_here
   # Optional:
   HF_TOKEN = your_huggingface_token
   LOG_LEVEL = INFO
   DEBUG = false
   ```

4. **Deploy & Test**
   - The Space builds automatically (typically 2-3 minutes)
   - Test Perplexity integration first
   - Verify all templates work
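
The secrets set in step 3 reach the application as ordinary environment variables. The sketch below shows one way the app could fail fast when the required key is missing. `load_settings` and its defaults are illustrative assumptions, not the actual contents of `config.py`.

```python
import os

# Names taken from step 3; the defaults below are assumptions for illustration.
REQUIRED_SECRETS = ("PERPLEXITY_API_KEY",)
OPTIONAL_SETTINGS = {"HF_TOKEN": "", "LOG_LEVEL": "INFO", "DEBUG": "false"}


def load_settings(env=None):
    """Return settings from the environment, raising if a required secret is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SECRETS if not env.get(name, "").strip()]
    if missing:
        raise RuntimeError(f"Missing required secrets: {', '.join(missing)}")
    settings = {name: env[name].strip() for name in REQUIRED_SECRETS}
    for name, default in OPTIONAL_SETTINGS.items():
        settings[name] = env.get(name, default).strip()
    # Interpret DEBUG as a boolean; only a few spellings count as true.
    settings["DEBUG"] = settings["DEBUG"].lower() in ("1", "true", "yes")
    return settings
```

Calling `load_settings()` at startup turns a missing secret into an immediate, readable error in the Space logs instead of a failure on the first API call.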

---

### **Option 2: Budget-Friendly Deployment**

*Best for: Testing, learning, cost-conscious users*

#### Hardware: **CPU Basic** (Free)

- Basic functionality available
- Slower AI processing (30-60s per article)
- Perplexity discovery still works
- Perfect for getting started

#### **Step-by-Step:**

1. **Create Space with CPU Basic**

   ```text
   Space Name: ai-dataset-studio
   SDK: Gradio
   Hardware: CPU Basic (Free)
   ```

2. **Upload Core Files**

   ```text
   # Essential files only:
   app.py
   perplexity_client.py
   requirements.txt
   README.md
   config.py
   ```

3. **Set API Key**

   ```text
   PERPLEXITY_API_KEY = your_api_key
   ```

4. **Gradual Upgrade Path**
   - Start with CPU Basic
   - Test functionality
   - Upgrade to T4 Small when ready

---

### **Option 3: Enterprise Deployment**

*Best for: High-volume usage, team collaboration*

#### Hardware: **A10G Small** ($1.05/hour)

- Maximum performance (3-8s per article)
- Handles large batch processing
- Supports multiple concurrent users
- Production-scale capabilities

#### **Additional Setup:**

1. **Persistent Storage**

   ```text
   # In Space settings:
   Storage: Small Persistent ($5/month)
   # Enables data persistence between restarts
   ```

2. **Advanced Configuration**

   ```text
   # Environment variables:
   MAX_SOURCES_PER_SEARCH = 50
   BATCH_SIZE = 16
   ENABLE_CACHING = true
   CONCURRENT_REQUESTS = 5
   ```

3. **Monitoring Setup**

   ```text
   # Enable detailed logging:
   LOG_LEVEL = DEBUG
   ENABLE_METRICS = true
   ```
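
Environment variables always arrive as strings, so the numeric and boolean settings above need parsing before use. The following is a minimal sketch of that step. `ScalingConfig`, the helper names, and the fallback defaults are assumptions made for illustration; how `config.py` actually reads these values is not shown in this guide.

```python
import os
from dataclasses import dataclass


def _env_int(env, name, default, minimum=1):
    """Parse an integer setting, falling back to the default on bad input."""
    try:
        value = int(env.get(name, ""))
    except (TypeError, ValueError):
        return default
    return max(value, minimum)


def _env_bool(env, name, default):
    """Parse a boolean setting; unset means the default."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")


@dataclass(frozen=True)
class ScalingConfig:
    max_sources_per_search: int
    batch_size: int
    enable_caching: bool
    concurrent_requests: int

    @classmethod
    def from_env(cls, env=None):
        env = os.environ if env is None else env
        # Fallback values here are placeholders, not the project's real defaults.
        return cls(
            max_sources_per_search=_env_int(env, "MAX_SOURCES_PER_SEARCH", 10),
            batch_size=_env_int(env, "BATCH_SIZE", 8),
            enable_caching=_env_bool(env, "ENABLE_CACHING", False),
            concurrent_requests=_env_int(env, "CONCURRENT_REQUESTS", 3),
        )
```

Falling back to a default on malformed input keeps a typo in the Space settings from crashing the app, at the cost of silently ignoring the bad value; logging a warning in `_env_int` would be a reasonable addition.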

---

## Configuration Details

### **Perplexity API Setup**

1. **Get API Key**

   ```text
   # Visit: https://www.perplexity.ai/
   # Sign up for an account
   # Navigate to the API section
   # Generate a new API key
   # Copy the key for environment setup
   ```

2. **Test API Key**

   ```python
   # Quick test script: reads the key from the environment instead of hard-coding it.
   import os

   import requests

   headers = {
       "Authorization": f"Bearer {os.environ['PERPLEXITY_API_KEY']}",
       "Content-Type": "application/json",
   }
   response = requests.post(
       "https://api.perplexity.ai/chat/completions",
       headers=headers,
       json={
           # Model names change over time; check the Perplexity docs for current ones.
           "model": "llama-3.1-sonar-large-128k-online",
           "messages": [{"role": "user", "content": "Test message"}],
       },
       timeout=30,
   )
   print("API Status:", response.status_code)  # 200 means the key was accepted
   ```
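
The test script prints only a status code, so a small helper for interpreting it can speed up debugging. This sketch relies on standard HTTP status semantics; `explain_status` and its hint texts are illustrative, not part of the Perplexity API or this project.

```python
def explain_status(status_code: int) -> str:
    """Map an HTTP status code from the test request to a troubleshooting hint."""
    if 200 <= status_code < 300:
        return "OK: the key was accepted"
    if status_code in (401, 403):
        return "Auth problem: check that PERPLEXITY_API_KEY is set and valid"
    if status_code == 429:
        return "Rate limited: slow down or check remaining credits"
    if 400 <= status_code < 500:
        return "Client error: check the request body and model name"
    if status_code >= 500:
        return "Server error: retry later"
    return "Unexpected status"
```

For example, `print(explain_status(response.status_code))` can follow the last line of the test script.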

### **Hardware Requirements by Use Case**

Monthly costs for the GPU tiers assume continuous operation: the hourly rate quoted above × 720 hours (24 hours × 30 days). Actual cost is lower if the Space is paused or asleep part of the time.

| Use Case | Hardware | Monthly Cost (24/7) | Performance | Best For |
|----------|----------|---------------------|-------------|----------|
| **Learning** | CPU Basic | Free | Basic | Students, hobbyists |
| **Development** | CPU Upgrade | $22 | Good | Developers, testing |
| **Production** | T4 Small | $432 | Excellent | Businesses, researchers |
| **Enterprise** | A10G Small | $756 | Maximum | High-volume, teams |
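
The GPU figures in the table can be reproduced from the hourly rates quoted in this guide, and the same arithmetic gives an estimate for part-time usage. The sketch below assumes a 30-day month; the 4 hours/day scenario is only an example.

```python
# Hourly rates as quoted in this guide; verify against current Spaces pricing.
HOURLY_RATES_USD = {"T4 Small": 0.60, "A10G Small": 1.05}


def monthly_cost(hourly_rate, hours_per_day=24.0, days=30):
    """Estimated monthly cost in USD for a Space billed by the hour."""
    return round(hourly_rate * hours_per_day * days, 2)


for name, rate in HOURLY_RATES_USD.items():
    always_on = monthly_cost(rate)
    part_time = monthly_cost(rate, hours_per_day=4)
    print(f"{name}: ${always_on:.2f} always-on, ${part_time:.2f} at 4 h/day")
```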

### **Memory & Storage Planning**

```text
# Model Memory Usage:
BART Summarization: ~1.5GB
RoBERTa Sentiment: ~500MB
BERT NER: ~400MB
Base Application: ~200MB
Total GPU Memory: ~2.6GB (T4 Small = 16GB, plenty of headroom)

# Storage Usage:
Application Files: ~50MB
Model Cache: ~2GB
Temporary Data: ~100MB per project
Persistent Storage: Optional, recommended for large projects
```
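
The memory total is a plain sum of the approximate figures listed above. A short sketch makes it easy to re-check the budget when a model is swapped. The 20% safety margin is an assumption added here, not a figure from this guide.

```python
# Approximate figures from the planning block above, in megabytes.
MODEL_MEMORY_MB = {
    "BART summarization": 1500,
    "RoBERTa sentiment": 500,
    "BERT NER": 400,
    "Base application": 200,
}

total_mb = sum(MODEL_MEMORY_MB.values())


def fits_on(gpu_memory_mb, safety_margin=0.2):
    """True if all models fit while leaving a fraction of memory free."""
    return total_mb <= gpu_memory_mb * (1 - safety_margin)


T4_MEMORY_MB = 16 * 1024
print(f"Total: {total_mb / 1000:.1f} GB, fits on T4: {fits_on(T4_MEMORY_MB)}")
```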

---

## Testing Your Deployment

### **Basic Functionality Test**

1. **Launch Application**

   ```text
   # Your Space URL: https://huggingface.co/spaces/YOUR_USERNAME/ai-dataset-studio
   # Wait for "Running" status
   # Interface should load within 30-60 seconds
   ```

2. **Test Project Creation**

   ```text
   Project Name: "Test Sentiment Analysis"
   Template: Sentiment Analysis
   Description: "Testing the deployment"
   Click: "Create Project"
   Expected: "Project created successfully"
   ```

3. **Test Perplexity Integration**

   ```text
   AI Search Description: "Product reviews for sentiment analysis"
   Search Type: General
   Max Sources: 10
   Click: "Discover Sources with AI"
   Expected: List of relevant URLs with quality scores
   ```

### **Advanced Testing**

4. **Test Complete Workflow**

   ```text
   # Use discovered sources from step 3
   Click: "Use These Sources"
   Click: "Start Scraping"
   Wait: Processing to complete
   Click: "Process Data"
   Select: Same template as project
   Click: "Export Dataset"
   Format: JSON
   Expected: Downloadable dataset file
   ```

5. **Performance Benchmarks**

   ```text
   # Timing expectations:
   AI Source Discovery: 5-15 seconds
   Scraping 10 URLs: 30-120 seconds
   Processing Data: 30-180 seconds (depends on hardware)
   Export: 5-10 seconds
   ```

---

## Troubleshooting

### **Common Issues & Solutions**

#### **"Perplexity API key not found"**

```text
# Problem: Environment variable not set
# Solution:
1. Go to Space Settings → Repository secrets
2. Add: PERPLEXITY_API_KEY = your_key_here
3. Restart Space
4. Check logs for "Perplexity AI client initialized"
```

#### **"No sources found" from AI discovery**

```text
# Problem: Search query too specific or API limits
# Solutions:
1. Make description more general
2. Try different search types
3. Check API key has sufficient credits
4. Use manual URL entry as fallback
```

#### **"Model loading failed"**

```text
# Problem: Insufficient memory or network issues
# Solutions:
1. Upgrade to T4 Small for GPU memory
2. Wait 2-3 minutes for model downloads
3. Check Space logs for specific errors
4. Restart Space if persistent
```

#### **"Scraping failed" for multiple URLs**

```text
# Problem: Rate limiting or blocked access
# Solutions:
1. Reduce concurrent requests
2. Check robots.txt compliance
3. Use more diverse sources
4. Verify URLs are publicly accessible
```
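
Two of the scraping fixes above, robots.txt compliance and slower request rates, can be sketched with the standard library alone. The helper names below are hypothetical, and the sketch takes the robots.txt text as an argument so that fetching it stays with whatever HTTP client the application already uses.

```python
import time
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against already-downloaded robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class MinIntervalLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval_s: float):
        self.min_interval_s = min_interval_s
        self._last = None

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval_s - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

A scraper would call `limiter.wait()` before each request and skip any URL for which `is_allowed(...)` returns `False`. A per-domain limiter is usually gentler than one global limiter, but that refinement is left out here.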

### **Debug Mode**

Enable detailed logging for troubleshooting:

```text
# Environment variables:
DEBUG = true
LOG_LEVEL = DEBUG
# Then check Space logs for detailed information
```
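
For these variables to have any effect, the application has to apply them to Python's logging setup at startup. The sketch below shows one plausible way to do that. `configure_logging` and the rule that `DEBUG=true` overrides `LOG_LEVEL` are assumptions, not the behaviour of the shipped `app.py`.

```python
import logging
import os


def configure_logging(env=None):
    """Configure root logging from DEBUG and LOG_LEVEL; return the level used."""
    env = os.environ if env is None else env
    debug = env.get("DEBUG", "false").strip().lower() in ("1", "true", "yes")
    level_name = "DEBUG" if debug else env.get("LOG_LEVEL", "INFO").strip().upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        level = logging.INFO  # unknown names fall back to INFO
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        force=True,  # replace handlers installed by earlier imports
    )
    return level
```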

### **Health Check Script**

```python
# Add this to test basic functionality:
def health_check():
    """Test all components."""
    # Test imports
    try:
        import gradio  # noqa: F401
        print("✅ Gradio imported")
    except ImportError:
        print("❌ Gradio import failed")

    # Test Perplexity (PerplexityClient comes from this project's perplexity_client.py)
    try:
        from perplexity_client import PerplexityClient

        client = PerplexityClient()
        if client._validate_api_key():
            print("✅ Perplexity API key valid")
        else:
            print("❌ Perplexity API key invalid")
    except Exception as e:
        print(f"❌ Perplexity error: {e}")

    # Test models
    try:
        from transformers import pipeline  # noqa: F401
        print("✅ Transformers available")
    except ImportError:
        print("⚠️ Transformers not available (CPU fallback)")


# Run health check in your Space
if __name__ == "__main__":
    health_check()
```

---

## Maintenance & Updates

### **Regular Maintenance Tasks**

1. **Monitor API Usage**

   ```text
   # Check Perplexity dashboard for:
   - API calls remaining
   - Rate limit status
   - Billing usage
   ```

2. **Update Dependencies**

   ```text
   # Periodically update requirements.txt:
   gradio>=4.44.0  # Check for latest version
   transformers>=4.30.0
   # Test thoroughly after updates
   ```

3. **Performance Monitoring**

   ```text
   # Monitor Space metrics:
   - CPU/GPU usage
   - Memory consumption
   - Request response times
   - Error rates
   ```

### **Backup Strategy**

```text
# Important data to back up:
1. Configuration files (app.py, config.py)
2. Custom templates or modifications
3. API keys and environment variables (store securely, outside the repository)
4. Any persistent data or datasets

# Each HuggingFace Space is a git repository, so file changes are versioned
# Use git commands to manage versions
```

---

## Scaling & Optimization

### **Performance Optimization**

1. **Model Optimization**

   ```python
   # In config.py, adjust for your needs:
   batch_size = 16             # Increase for better GPU utilization
   max_sequence_length = 256   # Reduce for faster processing
   confidence_threshold = 0.8  # Higher for better quality
   ```

2. **Caching Strategy**

   ```python
   # Enable model caching:
   cache_models = True
   model_cache_dir = "./model_cache"
   # Cache API responses:
   cache_api_responses = True
   cache_ttl_hours = 24
   ```

3. **Resource Management**

   ```python
   # Optimize memory usage:
   clear_cache_after_processing = True
   max_concurrent_requests = 3
   timeout_per_url = 10  # seconds
   ```
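
The `cache_api_responses` and `cache_ttl_hours` settings above imply a time-limited cache in front of the Perplexity client. The following is a minimal in-memory sketch of that idea. `TTLCache` is a hypothetical class, and the project's real cache (if any) may be structured differently.

```python
import time


class TTLCache:
    """In-memory cache whose entries expire after a fixed number of hours."""

    def __init__(self, ttl_hours=24, clock=time.time):
        self.ttl_seconds = ttl_hours * 3600
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._entries = {}

    @staticmethod
    def normalize(query):
        """Treat queries differing only in case or spacing as the same key."""
        return " ".join(query.lower().split())

    def get(self, query):
        key = self.normalize(query)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self.ttl_seconds:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, query, value):
        self._entries[self.normalize(query)] = (self._clock(), value)
```

Typical use is to call `cache.get(description)` before a search and `cache.set(description, results)` after a successful one. Because the cache lives in process memory, it is lost whenever the Space restarts.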

### **Cost Optimization**

1. **Auto-Sleep Configuration**

   ```text
   # Free CPU Basic Spaces go to sleep automatically after a period of
   # inactivity and resume on the next request
   # Paid hardware keeps running (and billing) until you pause the Space or
   # set a sleep time in the Space settings, so configure one for GPU tiers
   ```

2. **Hardware Scheduling**

   ```text
   # Strategy: Start with CPU Basic
   # Upgrade to T4 Small during processing
   # Downgrade back to CPU Basic when idle
   ```

3. **API Cost Management**

   ```text
   # Perplexity API optimization:
   - Cache search results for similar queries
   - Use more specific search terms
   - Implement request batching
   - Set reasonable max_sources limits
   ```

---

## Best Practices

### **Security Best Practices**

1. **API Key Management**

   ```text
   ✅ Store in HuggingFace Spaces secrets
   ✅ Never commit to git repositories
   ✅ Rotate keys periodically
   ✅ Monitor usage for anomalies
   ```

2. **Safe Scraping**

   ```text
   ✅ Respect robots.txt
   ✅ Implement rate limiting
   ✅ Use appropriate user agents
   ✅ Avoid private/internal networks
   ```

3. **Data Privacy**

   ```text
   ✅ No persistent data storage by default
   ✅ Clear temporary files after processing
   ✅ Respect copyright and fair use
   ✅ Provide clear data source attribution
   ```
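
The "avoid private/internal networks" rule under Safe Scraping can be enforced before each request by resolving the hostname and rejecting non-public addresses. The sketch below uses only the standard library. `is_public_url` is a hypothetical helper, and the check is a first line of defence only, since it does not protect against DNS answers that change between the check and the request.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_public_url(url, resolver=socket.getaddrinfo):
    """Return True only if the URL is http(s) and resolves to public addresses."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = resolver(parsed.hostname, None)
    except OSError:
        return False  # unresolvable hosts are rejected
    for info in infos:
        # info[4][0] is the IP string; drop any IPv6 zone suffix like "%eth0".
        address = ipaddress.ip_address(info[4][0].split("%")[0])
        if not address.is_global:
            return False  # loopback, private, link-local, and similar ranges
    return bool(infos)
```

The `resolver` parameter exists so the function can be exercised without network access; in production the default `socket.getaddrinfo` is used.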

### **Development Best Practices**

1. **Testing Strategy**

   ```text
   # Test with small datasets first
   # Verify each step of the pipeline
   # Use diverse source types
   # Test error conditions
   ```

2. **Version Control**

   ```text
   # Use git for file management
   # Tag stable releases
   # Document changes and updates
   # Keep rollback capability
   ```

3. **Documentation**

   ```text
   # Keep README.md updated
   # Document custom configurations
   # Provide usage examples
   # Include troubleshooting guides
   ```

---

## Getting Help

### **Support Channels**

1. **HuggingFace Community**
   - Discussions: Share issues and solutions
   - Discord: Real-time help from community

2. **GitHub Issues**
   - Bug reports and feature requests
   - Include logs and configuration details

3. **Documentation**
   - README.md: Complete usage guide
   - DEPLOYMENT.md: This guide
   - Code comments: Inline documentation

### **Information to Include When Asking for Help**

```text
1. Deployment type (CPU Basic, T4 Small, etc.)
2. Error messages (exact text)
3. Space logs (relevant sections)
4. Configuration details (without API keys)
5. Steps to reproduce the issue
6. Expected vs actual behavior
```

---

## Success Indicators

Your deployment is successful when you see:

```text
✅ Space builds without errors
✅ Interface loads within 60 seconds
✅ Perplexity AI discovery works
✅ Can create projects and scrape URLs
✅ AI processing generates quality data
✅ Export produces valid dataset files
✅ No persistent errors in logs
```
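
The "valid dataset files" indicator can be checked mechanically for JSON exports. The sketch below assumes the export is a JSON array of record objects. That layout is an assumption, since this guide does not specify the export schema, and `count_records` is a hypothetical helper.

```python
import json


def count_records(json_text: str) -> int:
    """Validate exported JSON text and return the number of records."""
    data = json.loads(json_text)  # raises json.JSONDecodeError on malformed input
    if not isinstance(data, list) or not data:
        raise ValueError("expected a non-empty JSON array of records")
    bad = [i for i, record in enumerate(data) if not isinstance(record, dict)]
    if bad:
        raise ValueError(f"records at indexes {bad[:5]} are not JSON objects")
    return len(data)
```

To check a downloaded file, read it with `pathlib.Path(path).read_text(encoding="utf-8")` and pass the text to `count_records`.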

---

## What's Next?

After successful deployment:

1. **Create Your First Dataset**
   - Start with a simple sentiment analysis project
   - Use AI discovery to find sources
   - Process and export a small dataset

2. **Explore Advanced Features**
   - Try different templates
   - Experiment with search types
   - Test batch processing

3. **Optimize for Your Use Case**
   - Adjust configurations
   - Create custom templates
   - Integrate with your ML pipeline

4. **Share and Collaborate**
   - Make Space public to help others
   - Contribute improvements
   - Share success stories

**Your AI Dataset Studio is now ready to revolutionize how you create training datasets!**

*From idea to ML-ready dataset in minutes, not weeks.*