Upload 14 files

Browse files

- .env.example +9 -0
- MODEL_CARD.md +159 -0
- README.md +47 -3
- cloud_agents/__init__.py +10 -0
- cloud_agents/agent.py +161 -0
- cloud_agents/cli.py +64 -0
- cloud_agents/config.py +23 -0
- cloud_agents/coordinator.py +208 -0
- cloud_agents/couchdb_client.py +141 -0
- cloud_agents/db_views.py +74 -0
- cloud_agents/scaling.py +153 -0
- cloud_agents/tensor_ops.py +75 -0
- requirements.txt +10 -0
- setup.py +30 -0

.env.example
ADDED
@@ -0,0 +1,9 @@
```
COUCHDB_URL=http://localhost:5984
COUCHDB_USER=admin
COUCHDB_PASSWORD=password
COORDINATOR_HOST=localhost
COORDINATOR_PORT=8000
MODEL_ID=OpenPeerAI/OpenPeerLLM
RAY_HEAD_PORT=6379
BATCH_SIZE=32
GRADIENT_ACCUMULATION_STEPS=4
```
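For illustration, these values can be sanity-checked before launching anything; a minimal sketch using python-dotenv (already in requirements.txt; the file name `check_env.py` is hypothetical — `config.py` below reads the same file through pydantic-settings):

```python
# check_env.py -- standalone sanity check for the .env file (illustrative)
from dotenv import dotenv_values

config = dotenv_values(".env")  # parses KEY=VALUE pairs without touching os.environ
for key in ("COUCHDB_URL", "COORDINATOR_HOST", "COORDINATOR_PORT"):
    # fail fast if a required key is missing before starting any agents
    assert key in config, f"missing {key} in .env"
print(config["COUCHDB_URL"])
```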
MODEL_CARD.md
ADDED
@@ -0,0 +1,159 @@
# Model Card: Cloud Agents for OpenPeerLLM

## Model Details

- **Model Type:** Distributed Training System for Language Models
- **Primary Purpose:** Training large language models in a distributed environment
- **Framework:** PyTorch with Ray
- **Target Model:** [OpenPeerLLM](https://huggingface.co/OpenPeerAI/OpenPeerLLM)
- **License:** MIT

## Intended Use

### Primary Use

- Distributed training of large language models
- Grid-style distributed computation over tensors
- Horizontal scaling of model training infrastructure

### Out-of-Scope Uses

- Production deployment of models
- Single-machine training
- Real-time inference

## System Architecture

### Components

1. **Distributed Agents**
   - Lightweight worker nodes for distributed computing
   - Automatic scaling based on workload
   - Built-in fault tolerance and recovery

2. **CouchDB Coordination Layer**
   - Job distribution and management
   - State synchronization
   - Agent discovery and registration

3. **Tensor Operations**
   - Distributed gradient computation
   - Efficient parameter updates
   - Gradient averaging and clipping

4. **Training Orchestration**
   - Automated model checkpoint management
   - Dynamic load balancing
   - Progress monitoring and reporting

## Performance

### Scaling Characteristics

- **Minimum Agents:** 2
- **Maximum Agents:** 10 (configurable)
- **Scale-up Threshold:** 80% utilization
- **Scale-down Threshold:** 30% utilization
- **Auto-scaling:** Yes, based on workload (see the sketch after this card)

### Resource Requirements

- **Per Agent:**
  - CPU: 1 core minimum
  - GPU: Optional; supports fractional GPU allocation
  - Memory: Varies with model size
  - Network: Reliable connection to CouchDB and the other agents

## Limitations

1. **Network Dependency**
   - Requires stable network connectivity between agents
   - CouchDB must be accessible to all agents

2. **Scaling Limits**
   - Upper bound on the number of concurrent agents
   - Network latency can slow gradient synchronization

3. **Resource Management**
   - Requires careful monitoring of resource utilization
   - GPU memory management is crucial for large models

## Training Details

### Training Data

- Uses the same training data as OpenPeerLLM
- Supports distributed batch processing
- Configurable gradient accumulation steps

### Training Procedure

1. **Initialization**
   - Model weights loaded from the Hugging Face Hub
   - Agents register with the coordinator
   - Initial state distributed to all agents

2. **Training Loop**
   - Distributed gradient computation
   - Synchronized parameter updates
   - Regular checkpointing
   - Automatic agent scaling

### Hyperparameters

Configurable through environment variables:
- Batch size
- Gradient accumulation steps
- Number of epochs
- Learning rate
- Scaling thresholds

## Getting Started

1. **Installation**
   ```bash
   pip install -r requirements.txt
   ```

2. **Configuration**
   - Copy `.env.example` to `.env`
   - Configure the CouchDB connection
   - Set the desired training parameters

3. **Launch Training**
   ```bash
   python -m cloud_agents.cli train --num-epochs 3 --steps-per-epoch 100
   ```

4. **Monitor Progress**
   ```bash
   python -m cloud_agents.cli status
   ```

## Ethical Considerations

- Resource efficiency through intelligent scaling
- Environmental impact minimized via workload-based scaling
- Distributed approach reduces single-point-of-failure risks

## Maintenance

This system is maintained as an open-source project. Users are encouraged to:
- Report issues and bugs
- Suggest improvements
- Contribute to the codebase
- Share performance metrics and optimization strategies

## Citation

If you use this system in your research, please cite:

```bibtex
@software{cloud_agents_2025,
  title  = {Cloud Agents: Distributed Training System for OpenPeerLLM},
  year   = {2025},
  author = {Andrew Magdy Kamal},
  url    = {https://huggingface.co/OpenPeerAI/Cloud-Agents},
  note   = {Distributed computing framework for training large language models}
}
```
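The scale-up/scale-down rule stated under Scaling Characteristics is simple enough to express in a few lines. This sketch mirrors the card's thresholds (the real logic lives in `cloud_agents/scaling.py`; the function name is hypothetical):

```python
def scaling_decision(busy: int, total: int,
                     min_agents: int = 2, max_agents: int = 10,
                     up: float = 0.8, down: float = 0.3) -> int:
    """Return how many agents to add (+) or remove (-), per the thresholds above."""
    if total == 0:
        return min_agents  # bootstrap an empty cluster
    utilization = busy / total
    if utilization >= up and total < max_agents:
        return min(2, max_agents - total)   # scale up by at most 2 at a time
    if utilization <= down and total > min_agents:
        return -min(1, total - min_agents)  # scale down by 1 at a time
    return 0

assert scaling_decision(busy=9, total=10) == 0  # already at max, no scale-up
assert scaling_decision(busy=8, total=9) == 1   # ~89% utilization, room for 1 more
```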
README.md
CHANGED
@@ -1,3 +1,47 @@
# Cloud Agents for Distributed Model Training

A lightweight, horizontally scalable distributed computing system for training large language models, designed specifically for OpenPeerLLM.

## Features

- Distributed tensor operations for model training
- CouchDB-based coordination layer
- Automatic agent discovery and load balancing
- Horizontal scaling capabilities
- Fault tolerance and recovery
- Integration with OpenPeerAI's OpenPeerLLM

## Installation

```bash
pip install -r requirements.txt
```

## Configuration

1. Set up a CouchDB instance
2. Copy `.env.example` to `.env` and configure your settings
3. Start the coordinator node
4. Launch agent nodes

## Quick Start

```bash
# Start coordinator
python -m cloud_agents.coordinator

# Start agent (on each machine)
python -m cloud_agents.agent
```

## Architecture

- `coordinator`: Manages job distribution and agent coordination
- `agent`: Handles tensor operations and model training
- `couchdb_client`: Interface for CouchDB communication
- `tensor_ops`: Distributed tensor operations
- `scaling`: Horizontal scaling of the agent pool
- `db_views`: CouchDB design documents for efficient querying

## License

MIT
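For a single-machine smoke test, the same pieces can also be driven from Python. A hedged sketch, assuming a running Ray head and CouchDB instance, with at least one agent launched as shown above:

```python
import asyncio
from cloud_agents import Coordinator
from cloud_agents.scaling import ScalingManager

async def main():
    coordinator = Coordinator()   # downloads the model and stores the initial state
    scaler = ScalingManager()     # connects to the configured Ray cluster
    # Run the autoscaler in the background while training proceeds.
    asyncio.create_task(scaler.monitor_and_scale())
    await coordinator.coordinate_training({"num_epochs": 1, "steps_per_epoch": 10})

asyncio.run(main())
```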
cloud_agents/__init__.py
ADDED
@@ -0,0 +1,10 @@
```python
"""
Cloud Agents package initialization.
"""
from .agent import Agent
from .coordinator import Coordinator
from .couchdb_client import CouchDBClient
from .config import settings

__version__ = "0.1.0"
__all__ = ["Agent", "Coordinator", "CouchDBClient", "settings"]
```
cloud_agents/agent.py
ADDED
@@ -0,0 +1,161 @@
```python
"""
Base agent class for distributed computing.
"""
import torch
import ray
import uuid
import asyncio
from typing import Dict, Any, Optional
import logging
from .couchdb_client import CouchDBClient
from .config import settings

logger = logging.getLogger(__name__)

@ray.remote
class Agent:
    """Distributed computing agent for tensor operations and model training."""

    def __init__(self):
        self.agent_id = str(uuid.uuid4())
        self.db_client = CouchDBClient()
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.current_job: Optional[Dict] = None
        self._register_agent()
        self._start_heartbeat()

    def get_id(self) -> str:
        """Return this agent's id (used by the scaling manager)."""
        return self.agent_id

    def _register_agent(self):
        """Register agent with the cluster."""
        capabilities = {
            "device": str(self.device),
            "cuda_available": torch.cuda.is_available(),
            "cuda_devices": torch.cuda.device_count() if torch.cuda.is_available() else 0,
            "memory_available": torch.cuda.get_device_properties(0).total_memory if torch.cuda.is_available() else 0
        }
        success = self.db_client.register_agent(self.agent_id, capabilities)
        if not success:
            raise RuntimeError("Failed to register agent")

    def _start_heartbeat(self):
        """Start agent heartbeat. Requires a running event loop (true inside a Ray async actor)."""
        async def heartbeat_loop():
            while True:
                try:
                    self.db_client.update_heartbeat(self.agent_id)
                    await asyncio.sleep(30)
                except Exception as e:
                    logger.error(f"Heartbeat error: {e}")
                    await asyncio.sleep(5)

        asyncio.create_task(heartbeat_loop())

    def process_tensors(self, tensors: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        """Process tensor operations."""
        results = {}
        for name, tensor in tensors.items():
            tensor = tensor.to(self.device)
            # Perform tensor operations
            results[name] = self._compute_tensor(tensor)
        return results

    def _compute_tensor(self, tensor: torch.Tensor) -> torch.Tensor:
        """Compute operations on a single tensor."""
        # Add custom tensor operations here; the default is a pass-through.
        return tensor

    async def run(self):
        """Main agent loop."""
        while True:
            try:
                # Try to claim a job
                job = self.db_client.claim_job(self.agent_id)
                if job:
                    self.current_job = job
                    await self._process_job(job)
                else:
                    await asyncio.sleep(1)
            except Exception as e:
                logger.error(f"Error in agent loop: {e}")
                await asyncio.sleep(5)

    async def _process_job(self, job: Dict[str, Any]):
        """Process a claimed job."""
        try:
            job_type = job['type']
            params = job['params']

            result = None
            if job_type == 'gradient_computation':
                result = await self._compute_gradients(params)
            elif job_type == 'model_update':
                result = await self._update_model(params)

            # Store job results
            self.db_client.update_job_status(
                job['_id'],
                'completed',
                result
            )
        except Exception as e:
            logger.error(f"Job processing error: {e}")
            self.db_client.update_job_status(
                job['_id'],
                'failed',
                {'error': str(e)}
            )
        finally:
            self.current_job = None

    async def _compute_gradients(self, params: Dict[str, Any]) -> Dict[str, Any]:
        """Compute gradients for model training."""
        try:
            # Load model checkpoint
            checkpoint = params.get('checkpoint')
            if checkpoint:
                state_dict = torch.load(checkpoint, map_location=self.device)
                # Compute gradients
                gradients = self._compute_model_gradients(state_dict, params.get('batch'))
                # Store gradients in CouchDB
                gradient_id = self.db_client.store_gradients(
                    self.current_job['_id'],
                    gradients
                )
                return {'gradient_id': gradient_id}
        except Exception as e:
            logger.error(f"Gradient computation error: {e}")
            raise

    def _compute_model_gradients(self, state_dict: Dict[str, torch.Tensor], batch: Dict[str, Any]) -> Dict[str, Any]:
        """Collect gradients from a model state in a serializable format.

        Note: this is effectively a stub. Tensors loaded from a state_dict
        carry no .grad data; a full implementation would first run a
        forward/backward pass over `batch`.
        """
        gradients = {}
        for name, param in state_dict.items():
            if param.requires_grad:
                grad = param.grad
                if grad is not None:
                    gradients[name] = grad.cpu().numpy().tolist()
        return gradients

    async def _update_model(self, params: Dict[str, Any]) -> Dict[str, Any]:
        """Update model with new parameters."""
        try:
            new_state = params.get('state')
            if new_state:
                # Apply model updates
                state_id = self.db_client.store_model_state(new_state)
                return {'state_id': state_id}
        except Exception as e:
            logger.error(f"Model update error: {e}")
            raise

    def shutdown(self):
        """Shutdown the agent."""
        # Mark this agent's record inactive in the agents database
        try:
            db = self.db_client.server['agents']
            doc = db[self.agent_id]
            doc['status'] = 'inactive'
            db.save(doc)
        except Exception as e:
            logger.error(f"Failed to mark agent inactive: {e}")
        # Clean up resources
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
```
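Since `Agent` is a Ray actor, the simplest interaction is remote construction followed by a remote method call. A minimal sketch, assuming CouchDB is reachable with the configured credentials (construction registers the agent, so it fails without one):

```python
import ray
import torch
from cloud_agents.agent import Agent

ray.init()  # or ray.init(address="auto") to join an existing cluster
agent = Agent.remote()  # __init__ registers the agent and starts its heartbeat

# Ship a batch of tensors to the actor and fetch the processed results.
out = ray.get(agent.process_tensors.remote({"x": torch.randn(4, 4)}))
print(out["x"].shape)
```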
cloud_agents/cli.py
ADDED
@@ -0,0 +1,64 @@
```python
"""
Command-line interface for the Cloud Agents system.
"""
import click
import asyncio
import logging
from .coordinator import Coordinator
from .scaling import ScalingManager

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@click.group()
def cli():
    """Cloud Agents CLI for distributed model training."""
    pass

@cli.command()
@click.option('--num-epochs', default=1, help='Number of training epochs')
@click.option('--steps-per-epoch', default=100, help='Steps per epoch')
def train(num_epochs, steps_per_epoch):
    """Start distributed training."""
    try:
        coordinator = Coordinator()
        scaling_manager = ScalingManager()

        async def run_training():
            # Start scaling manager in the background
            asyncio.create_task(scaling_manager.monitor_and_scale())

            # Start training
            await coordinator.coordinate_training({
                'num_epochs': num_epochs,
                'steps_per_epoch': steps_per_epoch
            })

        asyncio.run(run_training())

    except Exception as e:
        logger.error(f"Training failed: {e}")
        raise

@cli.command()
def status():
    """Get cluster status."""
    try:
        scaling_manager = ScalingManager()
        status = scaling_manager.get_cluster_status()

        click.echo("Cluster Status:")
        click.echo(f"Total Agents: {status['total_agents']}")
        click.echo(f"Busy Agents: {status['busy_agents']}")
        click.echo(f"Idle Agents: {status['idle_agents']}")
        click.echo(f"Utilization: {status['utilization']:.2%}")
        click.echo(f"Can Scale Up: {status['can_scale_up']}")
        click.echo(f"Can Scale Down: {status['can_scale_down']}")

    except Exception as e:
        logger.error(f"Failed to get status: {e}")
        raise

if __name__ == '__main__':
    cli()
```
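Because the commands are plain click groups, they can be exercised in-process with click's built-in test runner; a sketch (the file name `test_cli.py` is hypothetical):

```python
# test_cli.py -- illustrative in-process invocation of the CLI
from click.testing import CliRunner
from cloud_agents.cli import cli

runner = CliRunner()
result = runner.invoke(cli, ["status"])  # runs the status command without a subprocess
print(result.exit_code, result.output)
```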
cloud_agents/config.py
ADDED
@@ -0,0 +1,23 @@
```python
"""
Configuration settings for Cloud Agents.
"""
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    """Settings for Cloud Agents configuration."""
    COUCHDB_URL: str = "http://localhost:5984"
    COUCHDB_USER: str = "admin"
    COUCHDB_PASSWORD: str = "password"
    COORDINATOR_HOST: str = "localhost"
    COORDINATOR_PORT: int = 8000
    MODEL_ID: str = "OpenPeerAI/OpenPeerLLM"
    RAY_HEAD_PORT: int = 6379
    BATCH_SIZE: int = 32
    GRADIENT_ACCUMULATION_STEPS: int = 4

    class Config:
        env_file = ".env"
        env_file_encoding = "utf-8"

settings = Settings()
```
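With pydantic-settings, process environment variables take precedence over `.env` values, which is how per-machine overrides work without editing the file. A quick sketch:

```python
import os
from cloud_agents.config import Settings

# Environment variables override values read from .env.
os.environ["BATCH_SIZE"] = "64"
print(Settings().BATCH_SIZE)  # -> 64, coerced to int by the field's type annotation
```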
cloud_agents/coordinator.py
ADDED
@@ -0,0 +1,208 @@
```python
"""
Coordinator for distributed model training.
"""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from typing import Dict, List, Any, Optional
import asyncio
import logging
from datetime import datetime
from huggingface_hub import snapshot_download
import os
import ray
from .couchdb_client import CouchDBClient
from .config import settings
from .tensor_ops import TensorOps

logger = logging.getLogger(__name__)

class Coordinator:
    """Coordinator for distributed training of OpenPeerLLM."""

    def __init__(self):
        self.db_client = CouchDBClient()
        self.model_id = settings.MODEL_ID
        self.batch_size = settings.BATCH_SIZE
        self.gradient_accumulation_steps = settings.GRADIENT_ACCUMULATION_STEPS
        self._initialize_model()

    def _initialize_model(self):
        """Initialize the model and tokenizer."""
        try:
            # Download model and tokenizer from Hugging Face
            cache_dir = snapshot_download(self.model_id)
            self.model = AutoModelForCausalLM.from_pretrained(cache_dir)
            self.tokenizer = AutoTokenizer.from_pretrained(cache_dir)

            # Store initial model state
            initial_state = {
                'model_state': self.model.state_dict(),
                'step': 0,
                'epoch': 0
            }
            self.db_client.store_model_state(initial_state)

        except Exception as e:
            logger.error(f"Failed to initialize model: {e}")
            raise

    async def coordinate_training(self, training_config: Dict[str, Any]):
        """Coordinate distributed training across agents."""
        try:
            num_epochs = training_config.get('num_epochs', 1)
            steps_per_epoch = training_config.get('steps_per_epoch', 100)

            for epoch in range(num_epochs):
                logger.info(f"Starting epoch {epoch}")
                await self._train_epoch(epoch, steps_per_epoch)

                # Save checkpoint after each epoch
                self._save_checkpoint(epoch)
        except Exception as e:
            logger.error(f"Training coordination error: {e}")
            raise

    async def _train_epoch(self, epoch: int, steps_per_epoch: int):
        """Train for one epoch."""
        for step in range(steps_per_epoch):
            # Get active agents
            active_agents = self.db_client.get_active_agents()
            if not active_agents:
                logger.warning("No active agents available")
                await asyncio.sleep(5)
                continue

            # Distribute gradient computation jobs
            gradient_jobs = await self._distribute_gradient_computation(
                active_agents,
                self.batch_size
            )

            # Collect and process gradients
            gradients = await self._collect_gradients(gradient_jobs)
            if gradients:
                # Update model with collected gradients
                self._update_model_parameters(gradients)

                # Distribute updated model state to agents
                await self._distribute_model_update()

    async def _distribute_gradient_computation(
        self,
        agents: List[Dict[str, Any]],
        batch_size: int
    ) -> List[str]:
        """Distribute gradient computation jobs to available agents."""
        job_ids = []

        # Get current model state
        current_state = self.db_client.get_latest_model_state()
        if not current_state:
            raise RuntimeError("No model state available")

        # Create one gradient computation job per agent
        for agent in agents:
            job_id = self.db_client.create_job(
                'gradient_computation',
                {
                    'batch_size': batch_size,
                    'state': current_state['state']
                }
            )
            job_ids.append(job_id)

        return job_ids

    async def _collect_gradients(self, job_ids: List[str]) -> Optional[List[Dict[str, Any]]]:
        """Collect gradients from completed jobs."""
        timeout = 300  # 5 minutes timeout

        async def wait_for_job(job_id: str) -> Optional[Dict[str, Any]]:
            loop = asyncio.get_event_loop()
            start_time = loop.time()
            while True:
                if loop.time() - start_time > timeout:
                    logger.warning(f"Job {job_id} timed out")
                    return None

                job = self.db_client.get_job(job_id)
                if job['status'] == 'completed':
                    gradient_id = job['result']['gradient_id']
                    return self.db_client.get_gradients(gradient_id)
                elif job['status'] == 'failed':
                    logger.error(f"Job {job_id} failed: {job.get('result', {}).get('error')}")
                    return None

                await asyncio.sleep(1)

        # Wait for all gradient computations to complete
        gradient_tasks = [wait_for_job(job_id) for job_id in job_ids]
        gradients = await asyncio.gather(*gradient_tasks)

        # Filter out None results (failed jobs)
        return [g for g in gradients if g is not None]

    def _update_model_parameters(self, gradients: List[Dict[str, Any]]):
        """Update model parameters with collected gradients."""
        try:
            # Average gradients from all workers
            avg_gradients = TensorOps.average_gradients([
                {k: torch.tensor(v) for k, v in g.items()}
                for g in gradients
            ])

            # Apply gradient clipping
            clipped_gradients = TensorOps.gradient_clipping(avg_gradients, max_norm=1.0)

            # Apply a plain SGD step; transformer configs normally do not
            # define a learning rate, so fall back to a default if absent.
            lr = getattr(self.model.config, 'learning_rate', 1e-5)
            with torch.no_grad():
                for name, param in self.model.named_parameters():
                    if name in clipped_gradients:
                        param.sub_(clipped_gradients[name] * lr)

        except Exception as e:
            logger.error(f"Error updating model parameters: {e}")
            raise

    async def _distribute_model_update(self):
        """Distribute updated model state to all agents."""
        try:
            # Store updated model state
            state = {
                'model_state': self.model.state_dict(),
                'timestamp': datetime.utcnow().isoformat()
            }
            state_id = self.db_client.store_model_state(state)

            # Create model update jobs for all active agents
            active_agents = self.db_client.get_active_agents()
            for agent in active_agents:
                self.db_client.create_job(
                    'model_update',
                    {
                        'state_id': state_id,
                        'state': state
                    }
                )

        except Exception as e:
            logger.error(f"Error distributing model update: {e}")
            raise

    def _save_checkpoint(self, epoch: int):
        """Save a checkpoint of the current model state."""
        try:
            checkpoint_dir = os.path.join(os.getcwd(), 'checkpoints')
            os.makedirs(checkpoint_dir, exist_ok=True)

            checkpoint_path = os.path.join(checkpoint_dir, f"checkpoint_epoch_{epoch}.pt")
            torch.save({
                'epoch': epoch,
                'model_state_dict': self.model.state_dict(),
                'optimizer_state_dict': self.optimizer.state_dict() if hasattr(self, 'optimizer') else None
            }, checkpoint_path)

            logger.info(f"Saved checkpoint for epoch {epoch}")

        except Exception as e:
            logger.error(f"Error saving checkpoint: {e}")
            raise
```
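A checkpoint written by `_save_checkpoint` can be restored with plain `torch.load`; a hedged sketch assuming the model loads with default `from_pretrained` settings:

```python
import torch
from transformers import AutoModelForCausalLM

# Restore the checkpoint format written by Coordinator._save_checkpoint (epoch 0 here).
ckpt = torch.load("checkpoints/checkpoint_epoch_0.pt", map_location="cpu")
model = AutoModelForCausalLM.from_pretrained("OpenPeerAI/OpenPeerLLM")
model.load_state_dict(ckpt["model_state_dict"])
print(f"resumed from epoch {ckpt['epoch']}")
```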
cloud_agents/couchdb_client.py
ADDED
@@ -0,0 +1,141 @@
```python
"""
CouchDB client for distributed coordination.
"""
import couchdb
import uuid
from datetime import datetime
from typing import Dict, List, Optional, Any
from .config import settings

class CouchDBClient:
    """Client for interacting with CouchDB for distributed coordination."""

    def __init__(self):
        self.server = couchdb.Server(settings.COUCHDB_URL)
        self.server.resource.credentials = (
            settings.COUCHDB_USER,
            settings.COUCHDB_PASSWORD
        )
        self._ensure_databases()

    def _ensure_databases(self):
        """Ensure required databases exist."""
        required_dbs = ['agents', 'jobs', 'gradients', 'model_state']
        for db_name in required_dbs:
            if db_name not in self.server:
                self.server.create(db_name)

    def register_agent(self, agent_id: str, capabilities: Dict[str, Any]) -> bool:
        """Register an agent in the cluster."""
        db = self.server['agents']
        doc = {
            '_id': agent_id,
            'status': 'active',
            'capabilities': capabilities,
            'last_heartbeat': datetime.utcnow().isoformat(),
            'current_job': None
        }
        try:
            db.save(doc)
            return True
        except couchdb.http.ResourceConflict:
            return False

    def update_heartbeat(self, agent_id: str) -> bool:
        """Update agent heartbeat."""
        db = self.server['agents']
        try:
            doc = db[agent_id]
            doc['last_heartbeat'] = datetime.utcnow().isoformat()
            db.save(doc)
            return True
        except couchdb.http.ResourceNotFound:
            return False

    def create_job(self, job_type: str, params: Dict[str, Any]) -> str:
        """Create a new job in the job queue."""
        db = self.server['jobs']
        job_id = str(uuid.uuid4())
        doc = {
            '_id': job_id,
            'type': job_type,
            'params': params,
            'status': 'pending',
            'created_at': datetime.utcnow().isoformat(),
            'assigned_to': None
        }
        db.save(doc)
        return job_id

    def get_job(self, job_id: str) -> Optional[Dict[str, Any]]:
        """Fetch a job document by id (used by the coordinator to poll results)."""
        db = self.server['jobs']
        try:
            return db[job_id]
        except couchdb.http.ResourceNotFound:
            return None

    def claim_job(self, agent_id: str) -> Optional[Dict[str, Any]]:
        """Attempt to claim a pending job; CouchDB's revision check makes the claim atomic."""
        db = self.server['jobs']
        for row in db.view('_all_docs', include_docs=True):
            doc = row.doc
            if doc.get('status') == 'pending':
                try:
                    doc['status'] = 'in_progress'
                    doc['assigned_to'] = agent_id
                    doc['claimed_at'] = datetime.utcnow().isoformat()
                    db.save(doc)
                    return doc
                except couchdb.http.ResourceConflict:
                    # Another agent claimed this job first; try the next one
                    continue
        return None

    def update_job_status(self, job_id: str, status: str, result: Optional[Dict[str, Any]] = None) -> bool:
        """Update job status and optionally store results."""
        db = self.server['jobs']
        try:
            doc = db[job_id]
            doc['status'] = status
            if result:
                doc['result'] = result
            doc['updated_at'] = datetime.utcnow().isoformat()
            db.save(doc)
            return True
        except couchdb.http.ResourceNotFound:
            return False

    def store_gradients(self, job_id: str, gradients: Dict[str, Any]) -> str:
        """Store computed gradients."""
        db = self.server['gradients']
        gradient_id = str(uuid.uuid4())
        doc = {
            '_id': gradient_id,
            'job_id': job_id,
            'gradients': gradients,
            'timestamp': datetime.utcnow().isoformat()
        }
        db.save(doc)
        return gradient_id

    def get_gradients(self, gradient_id: str) -> Optional[Dict[str, Any]]:
        """Fetch stored gradients by id (the name -> values mapping)."""
        db = self.server['gradients']
        try:
            return db[gradient_id]['gradients']
        except couchdb.http.ResourceNotFound:
            return None

    def get_active_agents(self) -> List[Dict[str, Any]]:
        """Get list of currently active agents."""
        db = self.server['agents']
        active_agents = []
        for row in db.view('_all_docs', include_docs=True):
            doc = row.doc
            if doc.get('status') == 'active':
                active_agents.append(doc)
        return active_agents

    def store_model_state(self, state: Dict[str, Any]) -> str:
        """Store current model state."""
        db = self.server['model_state']
        state_id = str(uuid.uuid4())
        doc = {
            '_id': state_id,
            'state': state,
            'timestamp': datetime.utcnow().isoformat()
        }
        db.save(doc)
        return state_id

    def get_latest_model_state(self) -> Optional[Dict[str, Any]]:
        """Retrieve the most recent model state by timestamp.

        Document ids are random UUIDs, so _all_docs order is unrelated to
        recency; scan and keep the newest timestamp instead.
        """
        db = self.server['model_state']
        latest = None
        for row in db.view('_all_docs', include_docs=True):
            doc = row.doc
            if latest is None or doc['timestamp'] > latest['timestamp']:
                latest = doc
        return latest
```
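The job lifecycle (enqueue, claim, complete) can be walked through directly against a local CouchDB; a sketch with hypothetical ids:

```python
from cloud_agents.couchdb_client import CouchDBClient

client = CouchDBClient()  # connects using the settings loaded from .env

# Enqueue a job, claim it as a worker, then mark it finished.
job_id = client.create_job("gradient_computation", {"batch_size": 32})
job = client.claim_job(agent_id="agent-123")          # illustrative agent id
if job is not None:
    client.update_job_status(job["_id"], "completed",
                             {"gradient_id": "example-gradient-id"})
```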
cloud_agents/db_views.py
ADDED
@@ -0,0 +1,74 @@
```python
"""
Database views for CouchDB.
"""
from typing import Dict

# Views to be created in CouchDB for efficient querying

VIEWS: Dict[str, Dict] = {
    'agents': {
        '_design/agents': {
            'views': {
                'active': {
                    'map': '''function(doc) {
                        if (doc.status === 'active') {
                            emit(doc._id, doc);
                        }
                    }'''
                },
                'by_status': {
                    'map': '''function(doc) {
                        emit(doc.status, doc);
                    }'''
                }
            }
        }
    },
    'jobs': {
        '_design/jobs': {
            'views': {
                'pending': {
                    'map': '''function(doc) {
                        if (doc.status === 'pending') {
                            emit(doc._id, doc);
                        }
                    }'''
                },
                'by_agent': {
                    'map': '''function(doc) {
                        if (doc.assigned_to) {
                            emit(doc.assigned_to, doc);
                        }
                    }'''
                }
            }
        }
    },
    'gradients': {
        '_design/gradients': {
            'views': {
                'by_job': {
                    'map': '''function(doc) {
                        emit(doc.job_id, doc);
                    }'''
                },
                'by_timestamp': {
                    'map': '''function(doc) {
                        emit(doc.timestamp, doc);
                    }'''
                }
            }
        }
    },
    'model_state': {
        '_design/model_state': {
            'views': {
                'by_timestamp': {
                    'map': '''function(doc) {
                        emit(doc.timestamp, doc);
                    }'''
                }
            }
        }
    }
}
```
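These design documents are declared but nothing in the upload installs them; a hedged sketch of how they could be pushed to CouchDB and queried with the same `couchdb` package the client uses:

```python
import couchdb
from cloud_agents.db_views import VIEWS
from cloud_agents.config import settings

server = couchdb.Server(settings.COUCHDB_URL)
server.resource.credentials = (settings.COUCHDB_USER, settings.COUCHDB_PASSWORD)

for db_name, designs in VIEWS.items():
    db = server[db_name]
    for design_id, design_doc in designs.items():
        if design_id not in db:   # install once; updating requires the current _rev
            db[design_id] = design_doc

# Once installed, queries can hit the indexed views instead of _all_docs scans:
for row in server["jobs"].view("jobs/pending"):
    print(row.id)
```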
cloud_agents/scaling.py
ADDED
@@ -0,0 +1,153 @@
```python
"""
Scaling manager for horizontal scaling of cloud agents.
"""
import ray
import asyncio
import logging
from typing import Dict, Any
from datetime import datetime, timedelta
from .couchdb_client import CouchDBClient
from .agent import Agent
from .config import settings

logger = logging.getLogger(__name__)

class ScalingManager:
    """Manager for horizontal scaling of cloud agents."""

    def __init__(self):
        self.db_client = CouchDBClient()
        self._initialize_ray()
        self.min_agents = 2
        self.max_agents = 10
        self.scale_up_threshold = 0.8    # Scale up when 80% of agents are busy
        self.scale_down_threshold = 0.3  # Scale down when less than 30% of agents are busy
        self.agent_refs: Dict[str, ray.actor.ActorHandle] = {}

    def _initialize_ray(self):
        """Initialize Ray for distributed computing."""
        if not ray.is_initialized():
            ray.init(address=f"ray://{settings.COORDINATOR_HOST}:{settings.RAY_HEAD_PORT}")

    async def monitor_and_scale(self):
        """Monitor cluster health and scale as needed."""
        while True:
            try:
                await self._check_agent_health()
                await self._scale_cluster()
                await asyncio.sleep(60)  # Check every minute
            except Exception as e:
                logger.error(f"Error in monitor and scale loop: {e}")
                await asyncio.sleep(5)

    async def _check_agent_health(self):
        """Check health of all agents and remove dead ones."""
        try:
            active_agents = self.db_client.get_active_agents()
            current_time = datetime.utcnow()

            for agent in active_agents:
                last_heartbeat = datetime.fromisoformat(agent['last_heartbeat'])
                if current_time - last_heartbeat > timedelta(minutes=5):
                    # Agent is considered dead
                    logger.warning(f"Agent {agent['_id']} appears to be dead. Removing...")
                    await self._remove_agent(agent['_id'])
        except Exception as e:
            logger.error(f"Error checking agent health: {e}")
            raise

    async def _scale_cluster(self):
        """Scale the cluster based on workload."""
        try:
            active_agents = self.db_client.get_active_agents()
            total_agents = len(active_agents)
            busy_agents = len([a for a in active_agents if a['current_job'] is not None])

            if total_agents < 1:
                # Always ensure at least one agent is running
                await self._add_agent()
                return

            utilization = busy_agents / total_agents

            # Scale up if needed
            if utilization >= self.scale_up_threshold and total_agents < self.max_agents:
                num_to_add = min(2, self.max_agents - total_agents)  # Add up to 2 agents at a time
                logger.info(f"Scaling up: Adding {num_to_add} agents")
                for _ in range(num_to_add):
                    await self._add_agent()

            # Scale down if needed
            elif utilization <= self.scale_down_threshold and total_agents > self.min_agents:
                num_to_remove = min(1, total_agents - self.min_agents)  # Remove 1 agent at a time
                logger.info(f"Scaling down: Removing {num_to_remove} agents")
                idle_agents = [a for a in active_agents if a['current_job'] is None]
                for _ in range(num_to_remove):
                    if idle_agents:
                        await self._remove_agent(idle_agents.pop()['_id'])

        except Exception as e:
            logger.error(f"Error scaling cluster: {e}")
            raise

    async def _add_agent(self):
        """Add a new agent to the cluster."""
        try:
            # Create a new agent actor (Agent is already declared with @ray.remote)
            agent_ref = Agent.options(
                num_cpus=1,
                num_gpus=0.5 if ray.cluster_resources().get("GPU", 0) else 0
            ).remote()

            # Store reference
            agent_id = await agent_ref.get_id.remote()
            self.agent_refs[agent_id] = agent_ref

            # Start the agent's main loop without blocking on it
            agent_ref.run.remote()

            logger.info(f"Added new agent {agent_id}")
            return agent_id

        except Exception as e:
            logger.error(f"Error adding agent: {e}")
            raise

    async def _remove_agent(self, agent_id: str):
        """Remove an agent from the cluster."""
        try:
            # Get agent reference
            agent_ref = self.agent_refs.get(agent_id)
            if agent_ref:
                # Shutdown agent gracefully
                await agent_ref.shutdown.remote()
                # Remove from Ray
                ray.kill(agent_ref)
                # Remove from local tracking
                del self.agent_refs[agent_id]

            logger.info(f"Removed agent {agent_id}")

        except Exception as e:
            logger.error(f"Error removing agent: {e}")
            raise

    def get_cluster_status(self) -> Dict[str, Any]:
        """Get current status of the cluster."""
        try:
            active_agents = self.db_client.get_active_agents()
            total_agents = len(active_agents)
            busy_agents = len([a for a in active_agents if a['current_job'] is not None])

            return {
                'total_agents': total_agents,
                'busy_agents': busy_agents,
                'idle_agents': total_agents - busy_agents,
                'utilization': busy_agents / total_agents if total_agents > 0 else 0,
                'can_scale_up': total_agents < self.max_agents,
                'can_scale_down': total_agents > self.min_agents
            }

        except Exception as e:
            logger.error(f"Error getting cluster status: {e}")
            raise
```
cloud_agents/tensor_ops.py
ADDED
@@ -0,0 +1,75 @@
```python
"""
Tensor operations for distributed computing.
"""
import torch
import numpy as np
from typing import Dict, List, Union

class TensorOps:
    """Utility class for distributed tensor operations."""

    @staticmethod
    def split_tensor(tensor: torch.Tensor, num_parts: int) -> List[torch.Tensor]:
        """Split a tensor into multiple parts for distributed processing."""
        return list(torch.chunk(tensor, num_parts))

    @staticmethod
    def merge_tensors(tensors: List[torch.Tensor], dim: int = 0) -> torch.Tensor:
        """Merge multiple tensors back into a single tensor."""
        return torch.cat(tensors, dim=dim)

    @staticmethod
    def average_gradients(gradients: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
        """Average gradients from multiple workers."""
        avg_gradients = {}
        for key in gradients[0].keys():
            avg_gradients[key] = torch.mean(torch.stack([g[key] for g in gradients]), dim=0)
        return avg_gradients

    @staticmethod
    def serialize_tensor(tensor: torch.Tensor) -> Dict[str, Union[List, str]]:
        """Serialize a tensor for storage/transmission."""
        return {
            'data': tensor.cpu().numpy().tolist(),
            'shape': list(tensor.shape),
            'dtype': str(tensor.dtype)
        }

    @staticmethod
    def deserialize_tensor(tensor_dict: Dict[str, Union[List, str]]) -> torch.Tensor:
        """Deserialize a tensor from storage/transmission format."""
        data = np.array(tensor_dict['data'])
        shape = tensor_dict['shape']
        dtype = getattr(torch, tensor_dict['dtype'].split('.')[-1])
        return torch.tensor(data, dtype=dtype).reshape(shape)

    @staticmethod
    def gradient_clipping(gradients: Dict[str, torch.Tensor], max_norm: float) -> Dict[str, torch.Tensor]:
        """Clip gradients by their global norm to prevent exploding gradients.

        The tensors here are detached values, not parameters with .grad,
        so the norm is computed and the scaling applied directly.
        """
        norms = [v.norm() for v in gradients.values() if v is not None]
        if not norms:
            return gradients
        total_norm = torch.norm(torch.stack(norms))
        clip_coef = max_norm / (total_norm + 1e-6)
        if clip_coef < 1:
            for k, v in gradients.items():
                if v is not None:
                    gradients[k] = v * clip_coef
        return gradients

    @staticmethod
    def reduce_precision(tensor: torch.Tensor, bits: int = 16) -> torch.Tensor:
        """Reduce tensor precision for efficient transmission."""
        if bits == 16:
            return tensor.half()
        elif bits == 32:
            return tensor.float()
        else:
            raise ValueError("Unsupported precision bits")

    @staticmethod
    def shard_tensor(tensor: torch.Tensor, shard_size: int) -> List[torch.Tensor]:
        """Shard a tensor into smaller pieces for distributed processing."""
        return [tensor[i:i + shard_size] for i in range(0, tensor.size(0), shard_size)]

    @staticmethod
    def compute_parameter_norm(parameters: Dict[str, torch.Tensor]) -> float:
        """Compute the total norm of all parameters."""
        total_norm = 0.0
        for param in parameters.values():
            total_norm += param.norm().item() ** 2
        return total_norm ** 0.5
```
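These helpers compose into the coordinator's gradient path (average, clip, serialize for CouchDB); a self-contained sketch with dummy gradients:

```python
import torch
from cloud_agents.tensor_ops import TensorOps

# Average two workers' gradients, clip, then round-trip through serialization.
g1 = {"w": torch.tensor([1.0, 2.0])}
g2 = {"w": torch.tensor([3.0, 4.0])}
avg = TensorOps.average_gradients([g1, g2])        # {"w": tensor([2., 3.])}
clipped = TensorOps.gradient_clipping(avg, max_norm=1.0)

payload = TensorOps.serialize_tensor(clipped["w"])  # JSON-friendly dict
restored = TensorOps.deserialize_tensor(payload)
assert torch.allclose(restored, clipped["w"])
```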
requirements.txt
ADDED
@@ -0,0 +1,10 @@
```
torch>=2.0.0
transformers>=4.30.0
couchdb>=1.2
numpy>=1.24.0
python-dotenv>=1.0.0
aiohttp>=3.8.0
pydantic>=2.0.0
pydantic-settings>=2.0.0
ray>=2.6.0
click>=8.0.0
tqdm>=4.65.0
huggingface_hub>=0.16.0
```
setup.py
ADDED
@@ -0,0 +1,30 @@
```python
from setuptools import setup, find_packages

setup(
    name="cloud_agents",
    version="0.1.0",
    packages=find_packages(),
    install_requires=[
        "torch>=2.0.0",
        "transformers>=4.30.0",
        "couchdb>=1.2",
        "numpy>=1.24.0",
        "python-dotenv>=1.0.0",
        "aiohttp>=3.8.0",
        "pydantic>=2.0.0",
        "pydantic-settings>=2.0.0",
        "ray>=2.6.0",
        "click>=8.0.0",
        "tqdm>=4.65.0",
        "huggingface_hub>=0.16.0",
    ],
    python_requires=">=3.8",
    author="Andrew Magdy Kamal",
    description="Distributed cloud agents for training OpenPeerLLM",
    long_description=open("README.md").read(),
    long_description_content_type="text/markdown",
    classifiers=[
        "Development Status :: 3 - Alpha",
        "Intended Audience :: Science/Research",
        "License :: OSI Approved :: MIT License",
        "Programming Language :: Python :: 3.8",
    ],
)
```