Chapter 19: Deploy to Production (Hosting, Monitoring, and Scaling)

What you will have by the end of this chapter

A working Dockerfile and docker-compose.yml -- an agent packaged in a container, ready to deploy to any cloud
A production API server built with FastAPI -- REST and SSE streaming endpoints, API-key authentication, rate limiting
Durable execution configured -- the agent survives crashes, restarts, and deployment updates through checkpointing
A full monitoring stack -- OpenTelemetry traces, structured logs, token-usage metrics, and a Grafana dashboard
A cost control system -- per-request cost tracking, budget alerts, and model routing that can cut LLM spend by 60-90%
A security layer -- JWT authentication, input validation against prompt injection, secrets management, audit logging
A CI/CD pipeline with GitHub Actions -- unit tests, eval suite, canary deployments, one-click rollback
A complete production deployment -- an agent running 24/7 with monitoring, cost controls, security, and CI/CD

What you will be able to do after this chapter

Package an AI agent in a Docker container and deploy it to the cloud -- serverless (Lambda, Workers) or always-on (VPS, Kubernetes)
Build a production API with FastAPI that supports streaming, authentication, and rate limiting, plus a TypeScript alternative with Express
Configure durable execution so the agent survives crashes and restarts and can run long-running workflows
Monitor agents with OpenTelemetry -- traces, logs, metrics, alerts -- and build a custom Grafana dashboard
Control production costs with model routing, caching, batch processing, and budget alerts

Before you start

Previous chapters: Chapter 11 (Tool Use Mastery), Chapter 14 (Human-in-the-Loop & Safety), and Chapter 18 (Code Review Agent) or any other Build chapter (15-18)
What you need: Python 3.11+ and/or Node.js 18+, Docker Desktop installed, a GitHub account, an API key (Anthropic / OpenAI), a cloud account (AWS / GCP / Cloudflare -- the free tier is enough)
Required knowledge: having built a working AI agent (from the Build chapters), basic familiarity with HTTP APIs, the command line, and Git
Estimated time: 4-6 hours | Estimated cost: $5-15 (LLM calls + hosting free tier)

Your project

In Chapter 18 you built a Code Review & DevOps Agent that reviews Pull Requests, helps with deployments, and monitors production; you learned GitHub API integration, static analysis, and deployment automation. In this chapter you will take any agent you have built (Support, Research, Marketing, or Code Review) and deploy it to real production. You will learn the patterns that turn "a script that runs on my laptop" into "a service that runs 24/7 in the cloud": containerization, API design, monitoring, cost control, security, and CI/CD. Chapter 20 wraps up the course with Strategy: how to choose SDKs, build teams, and plan for the future.

Glossary -- new concepts in this chapter

Term	Explanation
Containerization	Packaging the application (code, dependencies, configuration) inside a Docker container, so that what works on your machine also works in the cloud. The end of "It works on my machine"
Serverless	An execution model in which the cloud provider manages the servers. You pay only for execution time. Advantage: zero ops. Drawback: cold starts and runtime limits
SSE (Server-Sent Events)	An HTTP-based protocol in which the server sends a stream of events to the client. A good fit for AI agents that need to stream answers token by token
Durable Execution	Infrastructure that lets an agent survive crashes, restarts, and deployments. State is saved in checkpoints and the agent resumes from where it stopped
Circuit Breaker	A pattern that stops the system from retrying a failing operation over and over. After X failures the breaker "opens" and stops trying, protecting the system from a full collapse
OpenTelemetry (OTel)	An open standard for collecting traces, metrics, and logs from distributed systems. Connects to Grafana, Datadog, and others without vendor lock-in
Model Routing	Sending simple requests to a cheap model (Haiku, Flash) and complex requests to a strong model (Sonnet, GPT-4o). Saves 60-80% of LLM costs
Canary Deployment	Rolling a new version out to 5% of traffic, monitoring performance, and expanding to 100% only if everything is healthy. The name comes from coal miners who took a canary into the mine to test the air
Rate Limiting	Capping the number of requests a user or system can send in a time window. Protects against overuse, DDoS, and unexpected costs
Idempotency	A property of an operation that can run twice and produce the same result. Critical for agents: if a request is sent twice, the agent must not perform the action twice
Dead Letter Queue (DLQ)	A queue that collects requests that failed and cannot be retried. Enables after-the-fact analysis: why did it fail, and what needs fixing?
Prompt Caching	An Anthropic feature that cuts the cost of repeated system prompts by up to 90%. The provider caches the processed system prompt instead of reprocessing it on every request

Intermediate | 30 minutes | Concept | Free

Production Architecture

You have a working AI agent. It runs in a terminal, answers questions, uses tools, and returns excellent answers. Great -- but that is like cooking an amazing meal in your own kitchen. To open a restaurant you need an industrial kitchen, waiters, a cash register, inventory management, and insurance. This chapter is about the move from demo to production.

The Production Agent Stack

Every production agent is built from the same layers. Here is the full architecture:

┌─────────────────────────────────────────────────────────┐
│                    CLIENTS                               │
│  Web App  │  Mobile App  │  CLI  │  Slack Bot  │  API   │
└────────────────────┬────────────────────────────────────┘
                     │ HTTPS / WSS
            ┌────────▼─────────┐
            │  Load Balancer   │   (Nginx / CloudFlare / ALB)
            │  + Rate Limiter  │
            └────────┬─────────┘
                     │
            ┌────────▼─────────┐
            │   API Server     │   (FastAPI / Express)
            │  - Auth (JWT)    │   - REST endpoints
            │  - Validation    │   - SSE streaming
            │  - Rate limits   │   - WebSocket (optional)
            └────────┬─────────┘
                     │
     ┌───────────────┼───────────────┐
     │               │               │
┌────▼────┐   ┌──────▼──────┐  ┌─────▼──────┐
│  Queue  │   │ Agent       │  │ Background │
│ (Redis/ │   │ Runtime     │  │ Workers    │
│  SQS)   │   │ (LangGraph/ │  │ (Celery/   │
│         │   │  Pydantic/  │  │  Bull)     │
└────┬────┘   │  custom)    │  └────────────┘
     │        └──────┬──────┘
     │               │
     │    ┌──────────┼──────────┐
     │    │          │          │
     │ ┌──▼───┐ ┌───▼────┐ ┌──▼────────┐
     │ │Tools │ │  LLM   │ │ Memory    │
     │ │(MCP) │ │Provider│ │ (Redis/   │
     │ │      │ │(Claude/│ │ Postgres/ │
     │ └──────┘ │ GPT-4) │ │ Pinecone) │
     │          └────────┘ └───────────┘
     │
┌────▼────────────────────┐
│  Observability Layer    │
│  - OpenTelemetry traces │
│  - Structured logs      │
│  - Metrics (Prometheus) │
│  - Alerts (PagerDuty)   │
└─────────────────────────┘

Stateless vs. Stateful Agents

The first and most critical decision: is your agent stateless or stateful?

Property	Stateless Agent	Stateful Agent
Definition	Every request is independent; no memory between requests	The agent keeps state between requests (conversation, context)
Scaling	Easy -- any instance can handle any request	Complex -- needs sticky sessions or shared state
Examples	Translation agent, summarization agent, code review agent	Chatbot, support agent, long-running research agent
Serverless	Ideal	Problematic -- cold starts, timeout limits
Recommendation	Always start here	Only when you need an ongoing conversation

Interesting data point

80% of production AI-agent use cases are stateless: one request, one response. Even agents that seem to need state (such as a support chatbot) can be stateless if the conversation history lives in an external database and is sent along with every request.
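The "stateless with external history" idea can be sketched in a few lines. This is a minimal illustration, not the chapter's implementation: HISTORY_STORE is a hypothetical in-memory stand-in for Redis or Postgres, and fake_llm is a placeholder for a real model call.

```python
from collections import defaultdict

# Hypothetical stand-in for Redis/Postgres: conversation_id -> list of messages.
HISTORY_STORE: dict[str, list[dict]] = defaultdict(list)


def fake_llm(messages: list[dict]) -> str:
    """Placeholder for a real LLM call; reports how many messages it received."""
    return f"(reply based on {len(messages)} messages)"


def handle_turn(conversation_id: str, user_text: str, llm=fake_llm) -> str:
    """Handle one request without keeping any state inside the process."""
    # 1. Load the history from the external store.
    history = list(HISTORY_STORE[conversation_id])
    # 2. Send the full history plus the new message to the model.
    messages = history + [{"role": "user", "content": user_text}]
    reply = llm(messages)
    # 3. Persist the updated history for the next request.
    HISTORY_STORE[conversation_id] = messages + [
        {"role": "assistant", "content": reply}
    ]
    return reply
```

Because the handler reads and writes only the store, any instance behind the load balancer can serve any turn of any conversation.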

Containerization with Docker

Docker is the common language of deployment. Whether you deploy to AWS, GCP, DigitalOcean, or the Raspberry Pi at home, if you have a Dockerfile, you can deploy.

# Dockerfile -- Production Agent
FROM python:3.12-slim

# Security: non-root user
RUN groupadd -r agent && useradd -r -g agent agent

WORKDIR /app

# Dependencies first (cache layer)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Application code
COPY . .

# Don't run as root
USER agent

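# Health check (the slim image ships without curl, so use Python's stdlib)
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=5)" || exit 1

EXPOSE 8000

# Module path matches src/server.py
CMD ["uvicorn", "src.server:app", "--host", "0.0.0.0", "--port", "8000"]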

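# Dockerfile -- TypeScript Agent
FROM node:20-slim

RUN groupadd -r agent && useradd -r -g agent agent

WORKDIR /app

# Install all dependencies: the build step needs devDependencies (TypeScript)
COPY package*.json ./
RUN npm ci

COPY . .
# Build, then drop devDependencies from the final image
RUN npm run build && npm prune --omit=dev

USER agent

# Health check (the slim image ships without curl; Node 20 has a global fetch)
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD node -e "fetch('http://localhost:3000/health').then(r => process.exit(r.ok ? 0 : 1)).catch(() => process.exit(1))"

EXPOSE 3000

CMD ["node", "dist/server.js"]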

Do it now | 5 minutes

Pick an agent you built in Chapters 15-18 (Support, Research, Marketing, or Code Review). Create a new directory:

mkdir production-agent && cd production-agent
mkdir -p src config monitoring
cp ../your-agent/main.py src/  # or main.ts
touch Dockerfile docker-compose.yml .env.example
echo "ANTHROPIC_API_KEY=sk-ant-..." > .env.example
echo "PORT=8000" >> .env.example

Make sure Docker Desktop is installed and running: docker --version

Framework: the production architecture decision tree

Question	Answer	Recommendation
Requests per day?	< 1,000	Serverless (Lambda / Workers)
Requests per day?	1,000 - 100,000	Docker Compose on a VPS
Requests per day?	> 100,000	Kubernetes cluster
Is the agent stateful?	Yes -- long conversations	Always-on with Redis/Postgres state
Is the agent stateful?	No -- request-response	Serverless is ideal
Is latency important?	Yes -- < 200ms to first token	Always-on (no cold start)
Is latency important?	No -- users can wait	Serverless / queue-based
Ops budget?	None -- no DevOps team	Managed platform (LangGraph Cloud, Bedrock)
Ops budget?	A DevOps team, or willingness to learn	Self-hosted (Docker/K8s)

Intermediate | 25 minutes | Concept | Free

Hosting Options

There are three main categories: self-hosted (you manage everything), serverless (you write code, the provider runs it), and managed agent platforms (the provider also manages the agent runtime).

Self-hosted (VPS/VM)

Full control, cheapest at scale, most work. Suited to teams with DevOps knowledge.

Provider	Monthly price (4 vCPU / 8GB RAM)	Advantage	Drawback
AWS EC2	~$70	Huge ecosystem, auto-scaling	Complexity, confusing pricing
GCP Compute	~$65	Built-in AI/ML tools	Less popular than AWS
DigitalOcean	~$48	Simplicity, excellent UX	Fewer services than AWS
Hetzner	~$22	Cheapest, servers in Europe	No managed Kubernetes
Kamatera (Israel)	~$30	Servers in Israel, low latency	Small ecosystem

Docker Compose for small deployments: if you handle up to a few thousand requests per day, docker-compose on a single VPS behind an Nginx reverse proxy is enough:

# docker-compose.yml
version: "3.9"
services:
  agent:
    build: .
    ports:
      - "8000:8000"
    env_file: .env
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 2G
    depends_on:
      - redis

  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
      - ./certs:/etc/nginx/certs
    depends_on:
      - agent

volumes:
  redis_data:

Serverless

Zero ops, pay-per-use, cold starts. Ideal for stateless agents with uneven traffic.

Platform	Max Runtime	Cold Start	Advantage
AWS Lambda	15 minutes	1-3 seconds	AWS ecosystem, auto-scale
Cloudflare Workers	30 seconds (CPU)	< 5ms (edge)	Fastest, edge computing
Google Cloud Functions	60 minutes (v2)	1-5 seconds	Integration with GCP
Vercel Serverless	5 minutes (pro)	~250ms	Excellent DX, Vercel AI SDK

Managed Agent Platforms

Least work, highest price. The provider manages all of the infrastructure, including the agent runtime.

Platform	SDK	Advantage	Price
LangGraph Platform	LangGraph (Python)	Managed hosting, monitoring, checkpointing	Usage-based + platform fee
Vertex AI Agent Engine	Google ADK	Google ecosystem, Gemini models	Usage-based
AWS Bedrock Agents	AWS SDK	Managed on AWS, multi-model	Usage-based + Bedrock fees

Israeli context -- hosting from Israel

Latency from Israel to LLM servers: Anthropic (us-east-1, US) = 120-180ms, OpenAI (US) = 100-160ms. A server in Israel (Kamatera) saves about 20ms on the latency of your own API, but the LLM call itself still goes to the US. So hosting location mainly affects client-facing latency, not LLM latency. For Israeli businesses with users in Israel, Hetzner (Europe) or Kamatera (Israel) are excellent choices. If you have data residency requirements (medical data, GDPR), host in a region that satisfies them; for GDPR that typically means European hosting.

Cost comparison -- full table

Platform	10K runs/month	100K runs/month	1M runs/month
AWS Lambda	$2-5	$20-50	$200-500
Cloudflare Workers	$5 (flat)	$5-15	$50-150
VPS (Hetzner)	$22 (flat)	$22 (flat)	$44-88 (2-4 VPS)
VPS (DigitalOcean)	$48 (flat)	$48 (flat)	$96-192
Managed (LangGraph)	$50-100	$200-500	$1,000-3,000

Note: hosting is only a small part of the total cost. In 90% of cases, LLM API costs are 10-100x higher than hosting costs. An agent on Claude Sonnet ($3 input / $15 output per million tokens) that on average sends 1,000 tokens and receives 2,000 tokens per request costs about $0.033 per request (1,000 x $3/M + 2,000 x $15/M). At 100K requests that is about $3,300/month for the LLM alone.
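A small helper makes this per-request arithmetic easy to repeat for your own token counts. It is a sketch that assumes the Sonnet list prices of $3 per million input tokens and $15 per million output tokens; llm_cost_usd is a hypothetical name, and the prices are parameters so you can plug in another model.

```python
def llm_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float = 3.0,    # assumed: USD per 1M input tokens
    output_price_per_m: float = 15.0,  # assumed: USD per 1M output tokens
) -> float:
    """Cost of a single LLM call in USD."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


per_request = llm_cost_usd(1_000, 2_000)
monthly = per_request * 100_000  # 100K requests per month
print(f"per request: ${per_request:.3f}, per 100K requests: ${monthly:,.0f}")
```

With 1,000 input and 2,000 output tokens, the input side contributes $0.003 and the output side $0.030, so output tokens dominate the bill.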

Do it now | 3 minutes

Look at the table above and estimate: how many requests per day is your agent expected to handle? Pick a hosting option. Write down your choice and the reasoning.

Advanced | 40 minutes | Practice | $1-3

API Design for Agent Services

Your agent needs a front door: an API that clients (web app, mobile, CLI, Slack bot) can talk to. Here is a standard API for an agent service:

Endpoints

Endpoint	Method	Description
`POST /agent/run`	POST	Run the agent -- send a prompt, receive the full response
`POST /agent/stream`	POST (SSE)	Run the agent with streaming -- receive tokens in real time
`GET /agent/status/:id`	GET	Status of an async run (polling)
`GET /health`	GET	Health check -- the load balancer checks that the server is alive
`GET /metrics`	GET	Prometheus metrics -- token usage, latency, errors

Python: FastAPI Production Server

# src/server.py -- Production Agent API
import os
import uuid
import time
import logging
from fastapi import FastAPI, HTTPException, Depends, Request
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from anthropic import Anthropic

# --- Structured Logging ---
logging.basicConfig(
    format='{"time":"%(asctime)s","level":"%(levelname)s","msg":"%(message)s"}',
    level=logging.INFO
)
logger = logging.getLogger(__name__)

app = FastAPI(title="Production Agent API", version="1.0.0")
client = Anthropic()

# --- Request/Response Models ---
class AgentRequest(BaseModel):
    prompt: str
    model: str = "claude-sonnet-4-20250514"
    max_tokens: int = 4096
    user_id: str | None = None

class AgentResponse(BaseModel):
    id: str
    response: str
    model: str
    tokens_used: int
    cost_usd: float
    latency_ms: float

# --- Authentication ---
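# Drop empty entries so an unset API_KEYS variable yields an empty set
API_KEYS = {k.strip() for k in os.getenv("API_KEYS", "").split(",") if k.strip()}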

async def verify_api_key(request: Request):
    api_key = request.headers.get("X-API-Key")
    if not api_key or api_key not in API_KEYS:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return api_key

# --- Rate Limiting (simple in-memory, use Redis in prod) ---
from collections import defaultdict
rate_limits: dict[str, list[float]] = defaultdict(list)
RATE_LIMIT = 60  # requests per minute

def check_rate_limit(api_key: str):
    now = time.time()
    window = [t for t in rate_limits[api_key] if now - t < 60]
    rate_limits[api_key] = window
    if len(window) >= RATE_LIMIT:
        raise HTTPException(status_code=429, detail="Rate limit exceeded")
    rate_limits[api_key].append(now)

# --- Endpoints ---
@app.get("/health")
async def health():
    return {"status": "healthy", "timestamp": time.time()}

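# Plain `def` (not `async def`): FastAPI runs it in a worker thread, so the
# blocking SDK call below does not stall the event loop.
@app.post("/agent/run", response_model=AgentResponse)
def run_agent(req: AgentRequest, api_key: str = Depends(verify_api_key)):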
    check_rate_limit(api_key)
    request_id = str(uuid.uuid4())
    start = time.time()

    logger.info(f"request_id={request_id} user={req.user_id} model={req.model}")

    try:
        result = client.messages.create(
            model=req.model,
            max_tokens=req.max_tokens,
            messages=[{"role": "user", "content": req.prompt}]
        )

        latency = (time.time() - start) * 1000
        input_tokens = result.usage.input_tokens
        output_tokens = result.usage.output_tokens
        cost = (input_tokens * 0.003 + output_tokens * 0.015) / 1000  # Sonnet pricing

        logger.info(
            f"request_id={request_id} tokens_in={input_tokens} "
            f"tokens_out={output_tokens} cost=${cost:.4f} latency={latency:.0f}ms"
        )

        return AgentResponse(
            id=request_id,
            response=result.content[0].text,
            model=req.model,
            tokens_used=input_tokens + output_tokens,
            cost_usd=cost,
            latency_ms=latency
        )
    except Exception as e:
        logger.error(f"request_id={request_id} error={str(e)}")
        raise HTTPException(status_code=500, detail="Agent execution failed")

# --- SSE Streaming Endpoint ---
@app.post("/agent/stream")
async def stream_agent(req: AgentRequest, api_key: str = Depends(verify_api_key)):
    check_rate_limit(api_key)

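    # Sync generator: Starlette iterates it in a worker thread, so the
    # blocking SDK stream does not stall the event loop.
    def generate():
        with client.messages.stream(
            model=req.model,
            max_tokens=req.max_tokens,
            messages=[{"role": "user", "content": req.prompt}]
        ) as stream:
            for text in stream.text_stream:
                # SSE frames are newline-delimited, so send each line of the
                # chunk as its own "data:" field; the client rejoins them with "\n".
                for line in text.split("\n"):
                    yield f"data: {line}\n"
                yield "\n"
        yield "data: [DONE]\n\n"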

    return StreamingResponse(generate(), media_type="text/event-stream")

TypeScript: Express Production Server

// src/server.ts -- Production Agent API (TypeScript)
import express from 'express';
import Anthropic from '@anthropic-ai/sdk';
import { randomUUID } from 'crypto';

const app = express();
app.use(express.json());

const client = new Anthropic();
const API_KEYS = new Set((process.env.API_KEYS || '').split(','));

// --- Auth Middleware ---
function authMiddleware(req: express.Request, res: express.Response, next: express.NextFunction) {
  const apiKey = req.headers['x-api-key'] as string;
  if (!apiKey || !API_KEYS.has(apiKey)) {
    return res.status(401).json({ error: 'Invalid API key' });
  }
  next();
}

// --- Health Check ---
app.get('/health', (req, res) => {
  res.json({ status: 'healthy', timestamp: Date.now() });
});

// --- Run Agent ---
app.post('/agent/run', authMiddleware, async (req, res) => {
  const requestId = randomUUID();
  const start = Date.now();
  const { prompt, model = 'claude-sonnet-4-20250514', max_tokens = 4096 } = req.body;

  try {
    const result = await client.messages.create({
      model,
      max_tokens,
      messages: [{ role: 'user', content: prompt }],
    });

    const latency = Date.now() - start;
    const text = result.content[0].type === 'text' ? result.content[0].text : '';

    console.log(JSON.stringify({
      request_id: requestId,
      tokens_in: result.usage.input_tokens,
      tokens_out: result.usage.output_tokens,
      latency_ms: latency,
    }));

    res.json({
      id: requestId,
      response: text,
      model,
      tokens_used: result.usage.input_tokens + result.usage.output_tokens,
      latency_ms: latency,
    });
  } catch (error) {
    console.error(JSON.stringify({ request_id: requestId, error: String(error) }));
    res.status(500).json({ error: 'Agent execution failed' });
  }
});

// --- SSE Streaming ---
app.post('/agent/stream', authMiddleware, async (req, res) => {
  const { prompt, model = 'claude-sonnet-4-20250514', max_tokens = 4096 } = req.body;

  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

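  try {
    const stream = client.messages.stream({
      model,
      max_tokens,
      messages: [{ role: 'user', content: prompt }],
    });

    for await (const event of stream) {
      if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
        // SSE frames are newline-delimited: emit one "data:" field per line of the chunk
        const frame = event.delta.text.split('\n').map((line) => `data: ${line}`).join('\n');
        res.write(`${frame}\n\n`);
      }
    }
    res.write('data: [DONE]\n\n');
  } catch (error) {
    // Headers are already sent, so report the failure as an SSE event
    console.error(JSON.stringify({ error: String(error) }));
    res.write('event: error\ndata: Agent execution failed\n\n');
  } finally {
    res.end();
  }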
});

const PORT = process.env.PORT || 3000;
app.listen(PORT, () => console.log(`Agent API running on port ${PORT}`));

Do it now | 10 minutes

Copy the FastAPI (Python) or Express (TypeScript) code into your project. Install the dependencies:

# Python
pip install fastapi uvicorn anthropic pydantic

# TypeScript
npm install express @anthropic-ai/sdk
npm install -D @types/express typescript

Run the server: uvicorn src.server:app --reload (Python) or npx tsx src/server.ts (TS). Check the health endpoint: curl http://localhost:8000/health (the Express server listens on port 3000 unless PORT is set, so use curl http://localhost:3000/health there).

Common mistake: forgetting streaming

Without streaming, the user waits 5-30 seconds with no feedback and then receives one huge response. That is terrible UX. Always support SSE streaming: the user watches the answer build up token by token, so it feels fast even when it takes time. All modern SDKs (Anthropic, OpenAI, Vercel AI) support streaming out of the box.

Advanced | 25 minutes | Concept | Free

Scaling Strategies

Your agent works great with 10 requests a day. What happens at 10,000? 100,000? A million? The road from 10 to 10M runs through four key strategies.

1. Horizontal Scaling

Instead of one powerful server, run many small servers behind a load balancer. Each instance handles X requests concurrently.

How many concurrent sessions per instance? An AI agent is I/O-bound (waiting on the LLM and on tools), not CPU-bound. An instance with 4 vCPU can therefore handle 50-200 concurrent sessions, depending on agent complexity. Beyond that, add another instance.
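One way to enforce a per-instance session cap is an asyncio semaphore. The sketch below is illustrative only: MAX_CONCURRENT_SESSIONS is an assumed value inside the 50-200 range, and asyncio.sleep stands in for the time a real session spends waiting on the LLM or a tool.

```python
import asyncio

MAX_CONCURRENT_SESSIONS = 100  # assumed value; tune per agent complexity


async def run_session(session_id: int, limiter: asyncio.Semaphore, stats: dict) -> int:
    """One agent session; waits for a free slot before doing any work."""
    async with limiter:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # stands in for waiting on the LLM or a tool
        stats["active"] -= 1
    return session_id


async def serve(n_sessions: int, max_concurrent: int = MAX_CONCURRENT_SESSIONS) -> dict:
    """Run n_sessions sessions with at most max_concurrent in flight at once."""
    limiter = asyncio.Semaphore(max_concurrent)
    stats = {"active": 0, "peak": 0, "done": 0}
    results = await asyncio.gather(
        *(run_session(i, limiter, stats) for i in range(n_sessions))
    )
    stats["done"] = len(results)
    return stats
```

Sessions beyond the cap simply wait for a slot; once the wait itself becomes noticeable, that is the signal to add another instance.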

2. Queue-Based Architecture

Instead of the API server running the agent directly, it puts the request on a queue and a separate worker processes it. This decouples accepting requests from processing them:

# Queue-based architecture with Redis
# producer.py -- API server pushes jobs onto the queue
import json
import redis

r = redis.Redis()

def enqueue_agent_request(request_id: str, prompt: str) -> str:
    job = json.dumps({"id": request_id, "prompt": prompt})
    r.lpush("agent:queue", job)  # LPUSH here + BRPOP in the worker = FIFO order
    return request_id

# consumer.py -- worker pulls jobs off the queue
import json
import redis

r = redis.Redis()

def process_queue():
    while True:
        _, job_data = r.brpop("agent:queue")  # blocks until a job arrives
        job = json.loads(job_data)

        try:
            result = run_agent(job["prompt"])  # run_agent: your agent's entry point
            r.set(f"agent:result:{job['id']}", json.dumps(result), ex=3600)  # keep for 1 hour
        except Exception as e:
            # Send to the Dead Letter Queue for later inspection
            r.lpush("agent:dlq", json.dumps({**job, "error": str(e)}))

3. Caching

Do not call the LLM for a question that has already been answered. Three cache layers:

Cache layer	What is stored	Expected hit rate	Savings
Prompt Cache (Anthropic)	The system prompt -- the provider caches it	~95%	90% of the system prompt cost
Response Cache	Identical questions + answers (hash-based)	5-30%	100% on a cache hit
Tool Result Cache	Results of tool calls (API, search, DB)	20-60%	Saves latency + API costs
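The hash-based response cache from the table can be sketched as follows. This is an in-memory illustration under stated assumptions: ResponseCache is a hypothetical class, the key normalization (lowercasing and collapsing whitespace) is one possible choice, and a production version would more likely keep entries in Redis.

```python
import hashlib
import json
import time


class ResponseCache:
    """In-memory response cache keyed by a hash of (model, normalized prompt)."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(model: str, prompt: str) -> str:
        # Normalize case and whitespace so trivially different prompts share a key.
        normalized = " ".join(prompt.lower().split())
        payload = json.dumps({"model": model, "prompt": normalized}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, model: str, prompt: str, compute) -> str:
        """Return a fresh cached answer, or call compute(prompt) and store the result."""
        k = self.key(model, prompt)
        entry = self._store.get(k)
        if entry is not None and time.monotonic() - entry[0] < self.ttl:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = compute(prompt)
        self._store[k] = (time.monotonic(), value)
        return value
```

The model name is part of the key on purpose: an answer produced by a cheap model should not be served for a request routed to a stronger one.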

4. Model Routing

Not every question needs the strongest model. A simple question like "What time is it?" does not need Claude Sonnet ($3/$15 per M tokens); Claude 3.5 Haiku (about $0.80/$4) is enough. The pattern:

# Model routing -- save 60-80% on LLM costs
def route_to_model(prompt: str, complexity: str = "auto") -> str:
    if complexity == "auto":
        # Use a cheap model to classify complexity
        classification = client.messages.create(
            model="claude-3-5-haiku-20241022",
            max_tokens=10,
            messages=[{"role": "user", "content":
                       f"Rate complexity 1-3. Reply with one digit only: {prompt[:200]}"}]
        )
        reply = classification.content[0].text.strip()
        # Fall back to the strongest tier if the reply is not a clean 1-3
        score = int(reply[0]) if reply[:1] in ("1", "2", "3") else 3
    else:
        score = {"low": 1, "medium": 2, "high": 3}[complexity]

    models = {
        1: "claude-3-5-haiku-20241022",    # Simple: about $0.80/$4 per M tokens
        2: "claude-sonnet-4-20250514",     # Medium: $3/$15 per M tokens
        3: "claude-sonnet-4-20250514",     # Complex: best available
    }
    return models[score]

Framework: The Scaling Playbook -- from 10 to 10M

Stage	Runs/month	Architecture	Priorities
1. MVP	< 1K	Single process, SQLite, no queue	Make it work. Do not over-engineer
2. Early Users	1K - 10K	Docker Compose, Redis cache, 1 VPS	Add monitoring, structured logging, caching
3. Growth	10K - 100K	Queue + workers, Postgres, 2-4 instances	Model routing, prompt caching, rate limiting
4. Scale	100K - 1M	Kubernetes, auto-scaling, multi-region	Consider self-hosting the LLM, batch processing
5. Massive	> 1M	K8s + dedicated LLM infra, CDN, edge	Custom solutions, dedicated support from LLM providers

The rule: do not build for stage 4 while you are at stage 1. Each stage brings 3-5x more complexity. Move to the next stage only when you feel real pain.

Exercise 1: Add a queue-based architecture (25 minutes)

Install Redis: docker run -d -p 6379:6379 redis:7-alpine
Add a POST /agent/async endpoint that puts the request on the queue and returns a request_id
Write a worker script that pulls requests off the queue, runs the agent, and stores the result in Redis
Add a GET /agent/status/:id endpoint for polling the result
Run 10 requests in parallel and check that all of them are processed

Expected result: an API that accepts requests immediately, processes them in the background, and lets clients poll for the result.

Do it now | 2 minutes

Think: which stage of the Scaling Playbook is your agent at? What is the next thing you would need to add to move to the following stage?

Advanced | 35 minutes | Practice | $1-3

Durable Execution -- agents that survive crashes

An AI agent is not like a regular API that returns a response within 100ms. An agent can run for minutes, hours, or even days: thinking, calling tools, waiting for human approval, thinking again. What happens when:

The server crashes in the middle of an agent run?
A deployment update shuts down the container?
The LLM API times out at step 7 of 10?
The agent is waiting for a human approval that will only arrive tomorrow?

Without durable execution, all the work is lost and the agent starts from scratch. With durable execution, the agent resumes from where it stopped.

By 2026 durable execution had gone mainstream: AWS released Durable Functions, Cloudflare took Workflows to GA, and Vercel launched Workflow DevKit. It is no longer a niche; it is the standard.

Durable Execution Platforms

Platform	Approach	Best for	Example
LangGraph	Automatic checkpointing + time-travel debugging	Python agents, complex graphs	Automatic save after every node in the graph
Temporal	Workflow-as-code -- workflows lasting hours/days/months	Long-running processes, enterprise	An agent that waits days for human approval
Cloudflare Workflows	Step-based checkpoints -- LLM responses are persisted	Edge/serverless agents	Workers that run on the edge and survive cold restarts
Inngest	Event-driven durable execution with automatic retries	Event-driven workflows	Step functions with built-in retry logic
Pydantic AI	Built-in durable execution	Python agents, simple durability	Survives API failures and restarts

Example: LangGraph Checkpointing

# Durable execution with LangGraph checkpointing
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.postgres import PostgresSaver
from typing import TypedDict

class AgentState(TypedDict):
    messages: list
    current_step: str
    tool_results: dict
    approved: bool

graph = StateGraph(AgentState)

# Define nodes (state is checkpointed after each one).
# research_step, analyze_step, etc. are your own node functions.
graph.add_node("research", research_step)
graph.add_node("analyze", analyze_step)
graph.add_node("wait_approval", human_approval_step)
graph.add_node("execute", execute_step)

# Define edges
graph.add_edge("research", "analyze")
graph.add_edge("analyze", "wait_approval")
graph.add_conditional_edges(
    "wait_approval",
    lambda state: "execute" if state["approved"] else END
)
graph.add_edge("execute", END)

graph.set_entry_point("research")

# Create the persistent checkpointer; from_conn_string() is a context manager
with PostgresSaver.from_conn_string(
    "postgresql://user:pass@localhost:5432/agent_state"
) as checkpointer:
    checkpointer.setup()  # create the checkpoint tables on first run

    # Pause before the approval node so a human can review the analysis
    agent = graph.compile(checkpointer=checkpointer, interrupt_before=["wait_approval"])

    # Run -- if the server crashes after "research", it resumes from "analyze"
    config = {"configurable": {"thread_id": "task-123"}}
    result = agent.invoke({"messages": [...], "current_step": "start"}, config)

    # Later: resume after human approval
    agent.update_state(config, {"approved": True})
    result = agent.invoke(None, config)  # Continues from wait_approval

Example: Temporal Workflow (TypeScript)

// Durable agent workflow with Temporal
import { proxyActivities, defineSignal, setHandler, condition } from '@temporalio/workflow';
import type { AgentActivities } from './activities';

const { research, analyze, sendApprovalRequest, executeAction } =
  proxyActivities<AgentActivities>({ startToCloseTimeout: '5 minutes' });

// Signal sent by the approval UI or API once a human has decided
export const approvalSignal = defineSignal<[boolean]>('approval');

// This workflow survives server crashes, restarts, deployments
export async function agentWorkflow(taskId: string, prompt: string) {
  let approved: boolean | undefined;
  setHandler(approvalSignal, (decision) => { approved = decision; });

  // Step 1: Research (checkpointed automatically)
  const researchResults = await research(prompt);

  // Step 2: Analyze (if server crashes here, resumes from this point)
  const analysis = await analyze(researchResults);

  // Step 3: Wait for human approval (can wait days!)
  await sendApprovalRequest(taskId, analysis);
  // Temporal persists state -- even if the server restarts, the workflow
  // is still waiting here for the approval signal
  await condition(() => approved !== undefined);
  if (!approved) return null;

  // Step 4: Execute after approval signal
  const result = await executeAction(analysis);
  return result;
}

Framework: The Durability Decision -- when to add durable execution

Condition	Need durability?	Recommended platform
Agent runs < 30 seconds, stateless	No -- a plain retry is enough	Not needed
Agent runs 1-10 minutes	Yes -- a crash means lost work	LangGraph checkpointing / Inngest
Agent with human approval steps	Required -- the wait can last days	Temporal / LangGraph
Serverless / edge agent	Yes -- frequent cold restarts	Cloudflare Workflows / Inngest
Agent runs for hours or days (research, pipeline)	Absolutely required	Temporal (enterprise-grade)

Rule of thumb: if your agent runs for more than a minute, or has a step that waits for a person, you need durable execution.

Do it now | 3 minutes

Think about the agent you built: what is its longest run? Does it have a human approval step? If so, write down which durability platform fits it.

Intermediate | 25 minutes | Concept | Free

Error Handling and Resilience

AI agents fail in unique ways that do not exist in regular software. Beyond ordinary bugs, there are:

Failure type	Description	Solution
LLM Timeout	The LLM API does not respond (overload, maintenance)	Retry with exponential backoff, fallback model
Hallucination	The agent invents facts or tool names	Output validation, fact-checking tools
Infinite Loop	The agent calls a failing tool, retries, fails again...	Max iterations limit, circuit breaker
Tool Failure	An external API goes down (Google, GitHub)	Graceful degradation, cached responses
Token Overflow	The context window fills up and the agent "forgets"	Context truncation, summarization
Cost Runaway	A bug makes the agent run without stopping	Budget limits, max steps, alerts
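The "max iterations" and "budget limits" remedies for the Infinite Loop and Cost Runaway rows can share a single guard object. The sketch below is hypothetical: RunGuard, BudgetExceeded, and the step() contract (it returns a done flag and the cost of that step) are assumptions made for illustration.

```python
class BudgetExceeded(Exception):
    """Raised when a run hits its step or cost limit."""


class RunGuard:
    """Tracks steps and spend for one agent run and stops it at the limits."""

    def __init__(self, max_steps: int = 10, max_cost_usd: float = 0.50):
        self.max_steps = max_steps
        self.max_cost_usd = max_cost_usd
        self.steps = 0
        self.cost_usd = 0.0

    def before_step(self) -> None:
        # Check before doing more work, so a runaway loop cannot start another step.
        if self.steps >= self.max_steps:
            raise BudgetExceeded(f"step limit reached ({self.max_steps})")
        if self.cost_usd >= self.max_cost_usd:
            raise BudgetExceeded(f"cost limit reached (${self.max_cost_usd:.2f})")

    def record(self, cost_usd: float) -> None:
        self.steps += 1
        self.cost_usd += cost_usd


def run_agent_loop(step, guard: RunGuard) -> RunGuard:
    """Call step() until it reports done; step() returns (done, cost_usd)."""
    while True:
        guard.before_step()
        done, cost_usd = step()
        guard.record(cost_usd)
        if done:
            return guard
```

Catching BudgetExceeded at the API layer is a natural place to log the run, raise an alert, and return a partial result instead of an open-ended bill.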

Circuit Breaker Pattern

# Circuit breaker for agent tool calls
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: int = 60):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.last_failure_time = 0
        self.state = "CLOSED"  # CLOSED = normal, OPEN = blocking, HALF_OPEN = testing

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.reset_timeout:
                self.state = "HALF_OPEN"
            else:
                raise Exception("Circuit breaker is OPEN -- service unavailable")

        try:
            result = func(*args, **kwargs)
            # Any success closes the circuit and resets the count, so only
            # consecutive failures can trip the breaker.
            self.state = "CLOSED"
            self.failures = 0
            return result
        except Exception as e:
            self.failures += 1
            self.last_failure_time = time.time()
            if self.failures >= self.failure_threshold:
                self.state = "OPEN"
            raise e

# Usage
llm_breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30)

def call_llm_with_protection(prompt: str):
    try:
        return llm_breaker.call(
            client.messages.create,
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
    except Exception:
        # Fallback: return cached response or graceful error
        return {"text": "I'm temporarily unable to process requests. Please try again shortly."}

Graceful Degradation

When something fails, the agent should not crash. It should degrade gracefully:

# Graceful degradation pattern
async def agent_with_graceful_degradation(prompt: str):
    results = {}

    # Try primary data source
    try:
        results["live_data"] = await fetch_live_analytics()
    except Exception:
        logger.warning("Live analytics unavailable, using cached data")
        results["live_data"] = await get_cached_analytics()  # 1-hour old

    # Try primary LLM
    try:
        response = await call_claude_sonnet(prompt, context=results)
    except Exception:
        logger.warning("Claude Sonnet unavailable, falling back to Haiku")
        try:
            response = await call_claude_haiku(prompt, context=results)
        except Exception:
            logger.error("All LLMs unavailable")
            response = "I'm experiencing technical difficulties. " \
                       "Here's what I can tell you from cached data: " \
                       f"{summarize_cached(results)}"

    return response

Common mistake: retrying without exponential backoff

Code like while True: try: call_api() except: time.sleep(1) is a recipe for disaster. If the API is overloaded, 1,000 clients retrying every second only make things worse. Always use exponential backoff: 1s, 2s, 4s, 8s, 16s, with jitter (random noise) so that clients do not all retry at the same moment. After 5 attempts, the circuit breaker opens and stops trying.
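A minimal sketch of exponential backoff with jitter follows. retry_with_backoff is a hypothetical helper; the jitter rule used here (sleep between 50% and 100% of the nominal delay) is one common choice, not the only one, and the sleep and rng parameters exist only so the function can be tested without real waiting.

```python
import random
import time


def retry_with_backoff(func, max_retries: int = 5, base: float = 1.0,
                       sleep=time.sleep, rng=random.random):
    """Call func(); on failure wait 1s, 2s, 4s, 8s, 16s (with jitter) between tries."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: let the caller (or a circuit breaker) decide
            nominal = base * 2 ** attempt  # 1, 2, 4, 8, 16 for base=1.0
            # Jitter: sleep 50-100% of the nominal delay so clients spread out.
            sleep(nominal * (0.5 + 0.5 * rng()))
```

Usage: wrap the LLM call in a zero-argument function, for example retry_with_backoff(lambda: call_api()), where call_api is whatever function performs the request.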

Do it now | 5 minutes

Add a simple circuit breaker to your server: if the LLM API fails 3 times in a row, stop trying for 30 seconds. Copy the CircuitBreaker class above and use it in the /agent/run endpoint.

Advanced | 35 minutes | Practice | $0-5

Monitoring and Observability

You cannot fix what you cannot see. Observability is the ability to understand what is happening inside a system from the outside. For AI agents it is especially critical because their behavior is non-deterministic: the same question can produce different answers, different tool calls, and a different number of steps.

The Four Pillars of Observability

Pillar	What it measures	Tool	Example
Metrics	Aggregate numbers: request rate, latency, error rate	Prometheus + Grafana	p95 latency = 3.2s, error rate = 0.5%
Logs	Events: what happened, when, and for which request	Structured JSON logs	{"request_id": "abc", "tool": "search", "duration_ms": 450}
Traces	The flow of a request through every stage	OpenTelemetry + Jaeger	Request -> LLM (2.1s) -> Tool (0.5s) -> LLM (1.8s)
Alerts	Notifications when something deviates from the norm	PagerDuty / Slack	"Error rate > 5% for 5 minutes"
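
As an illustration of the Logs pillar, here is one possible way to emit structured JSON logs with Python's standard `logging` module. It is a sketch under assumptions: `JsonFormatter` and `build_logger` are hypothetical names, and passing custom fields through `extra={"fields": {...}}` is just one convention among several.

```python
import json
import logging
import time
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} are merged into the payload.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, ensure_ascii=False)


def build_logger(name: str = "agent-service") -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid adding duplicate handlers on re-import
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger()
logger.info("tool call finished", extra={"fields": {
    "request_id": str(uuid.uuid4()), "tool": "search", "duration_ms": 450,
}})
```

Because every line is valid JSON, a log aggregator can filter by `request_id` or `tool` without regex parsing.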

OpenTelemetry Instrumentation

# OpenTelemetry setup for agent monitoring
import time

from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Setup tracing
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
tracer = trace.get_tracer("agent-service")

# Setup metrics (the reader is what actually ships metrics to the collector)
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://otel-collector:4317")
)
metrics.set_meter_provider(MeterProvider(metric_readers=[metric_reader]))
meter = metrics.get_meter("agent-service")

# Custom metrics for AI agents
request_counter = meter.create_counter("agent.requests.total")
token_counter = meter.create_counter("agent.tokens.total")
cost_counter = meter.create_counter("agent.cost.usd")
latency_histogram = meter.create_histogram("agent.latency.ms")

# Instrumented agent call
@app.post("/agent/run")
async def run_agent(req: AgentRequest):
    with tracer.start_as_current_span("agent.run") as span:
        span.set_attribute("agent.model", req.model)
        span.set_attribute("agent.user_id", req.user_id or "anonymous")

        # Track LLM call and measure its latency
        start = time.perf_counter()
        with tracer.start_as_current_span("llm.call"):
            result = client.messages.create(...)
        latency_ms = (time.perf_counter() - start) * 1000

        # Record metrics
        request_counter.add(1, {"model": req.model, "status": "success"})
        token_counter.add(result.usage.input_tokens, {"type": "input"})
        token_counter.add(result.usage.output_tokens, {"type": "output"})
        cost = calculate_cost(result.usage)
        cost_counter.add(cost, {"model": req.model})
        latency_histogram.record(latency_ms, {"model": req.model})

LLM-Specific Monitoring

Beyond standard metrics, AI agents need monitoring of their own:

Metric	What it measures	Alert Threshold
Token Usage per Request	How many tokens each request consumes	p99 > 10K tokens (possible runaway)
Tool Call Count	How many tool calls per run	> 15 calls (possible loop)
Cost per Request	Cost in USD per run	> $0.50 per request (cost anomaly)
Error Rate by Type	LLM errors vs tool errors vs validation	> 5% for any type
Agent Step Count	How many steps (think-act cycles) per run	> 10 steps (complexity alert)
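
The per-run thresholds in the table can be checked with a few lines of code. The sketch below is a simplification under assumptions: `RunStats` and `check_run` are hypothetical names, the token limit is applied to a single run rather than to the p99 across runs, and error rate is left out because it is an aggregate over many requests.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    tokens: int
    tool_calls: int
    cost_usd: float
    steps: int


# Limits taken from the table above; tune them for your own workload.
THRESHOLDS = {"tokens": 10_000, "tool_calls": 15, "cost_usd": 0.50, "steps": 10}


def check_run(stats: RunStats) -> list[str]:
    """Return one alert message for every metric that exceeds its limit."""
    alerts = []
    for name, limit in THRESHOLDS.items():
        value = getattr(stats, name)
        if value > limit:
            alerts.append(f"{name}={value} exceeds {limit}")
    return alerts
```

A run with 12,000 tokens and 20 tool calls would produce two alerts, which you could forward to Slack or PagerDuty instead of only logging them.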

Exercise 2: Add Basic Monitoring (20 minutes)

Add structured logging (JSON format) to every endpoint
For each request, record: request_id, user_id, model, tokens_in, tokens_out, cost, latency_ms, status
Add a GET /metrics endpoint that returns: total requests, total tokens, total cost, average latency
Add a simple alert: if the error rate exceeds 10% over the last 5 minutes, write a WARNING to the log
Verify the metrics are correct by sending 5 requests and checking /metrics

Expected result: an endpoint that shows real-time metrics for your service.

Intermediate | 30 minutes | Concept | Free

Controlling Production Costs

LLM cost is your biggest production expense, 10-100x larger than the cost of hosting. An agent running on Claude Sonnet at 100K requests/month can cost $1,000-$3,000 per month in LLM calls alone. With the right optimizations, you can cut that by 60-90%.

Cost Optimization Levers

Technique	Savings	Complexity	Explanation
Model Routing	60-80%	Medium	Haiku ($0.25/$1.25 per 1M input/output tokens) for simple requests, Sonnet ($3/$15) for complex ones
Prompt Caching	90% on system prompt	Low	Anthropic keeps the system prompt in a cache and does not reprocess it
Batch Processing	50%	Low	Major providers such as Anthropic and OpenAI offer a 50% discount for async batch jobs (results within 24 hours)
Response Caching	100% on hit	Low	Store responses for identical queries; do not call the API twice for the same question
Step Limits	30-50%	Low	Cap tool loops; prevent 20-step runs when 3 steps are enough
Context Truncation	20-40%	Medium	Keep context windows small; every token costs money
Self-hosting	60-80%	High	At 100K+ requests/day, self-hosting Llama/Mistral saves a lot

Key figure: combined savings

Combining prompt caching (90%), batch processing (50%), and model routing (70%) yields total savings of roughly 90%. An agent that costs $3,000/month without optimization can drop to $200-400/month with these three techniques. That is the difference between "too expensive for production" and "negligible."
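
The three percentages do not simply add up, because each discount applies to a different slice of the bill. A rough worked example makes the arithmetic concrete. All shares below are assumptions chosen for illustration: 80% of spend is routed to a model 12x cheaper (the Haiku/Sonnet price ratio from the table), half of the traffic tolerates async batch, and cached prompt tokens account for 40% of what remains. The function name `remaining_cost` is hypothetical.

```python
def remaining_cost(baseline: float, routed_share: float, price_ratio: float,
                   batch_share: float, cached_share: float) -> float:
    """Estimate the monthly bill after applying the three levers in sequence."""
    # Model routing: routed_share of spend moves to a model price_ratio times cheaper.
    cost = baseline * ((1 - routed_share) + routed_share / price_ratio)
    # Batch processing: 50% discount on the share of traffic that can run async.
    cost *= 1 - 0.5 * batch_share
    # Prompt caching: 90% discount on the share of spend that is cached prompt tokens.
    cost *= 1 - 0.9 * cached_share
    return cost


cost = remaining_cost(3000, routed_share=0.8, price_ratio=12,
                      batch_share=0.5, cached_share=0.4)
print(f"${cost:.0f}/month, {1 - cost / 3000:.0%} saved")
```

Under these assumed shares the bill comes out at about $384/month, inside the $200-400 range quoted above. Different traffic mixes will give different results, so plug in your own numbers.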

Nightmare Scenario: Runaway Agent Loops

The scariest thing in production is a bug that sends the agent into an infinite loop. Every iteration calls the LLM, and every call costs money. A buggy agent that runs 1,000 iterations on Sonnet can cost $50-200 within minutes.

# Cost protection: budget limits and runaway detection
class BudgetExceeded(Exception):
    pass

class CostGuard:
    def __init__(self, max_cost_per_request: float = 1.0,
                 max_daily_cost: float = 100.0,
                 max_steps_per_request: int = 15):
        self.max_cost_per_request = max_cost_per_request
        self.max_daily_cost = max_daily_cost
        self.max_steps = max_steps_per_request
        self.daily_cost = 0.0      # accumulates across requests; reset once a day
        self.request_cost = 0.0    # per-request counters; see start_request()
        self.step_count = 0

    def start_request(self):
        # Reset per-request counters at the start of every agent run.
        # With concurrent requests, keep these counters per request instead.
        self.request_cost = 0.0
        self.step_count = 0

    def check_budget(self, cost_increment: float):
        self.request_cost += cost_increment
        self.daily_cost += cost_increment
        self.step_count += 1

        if self.step_count > self.max_steps:
            raise BudgetExceeded(f"Max steps exceeded: {self.step_count}/{self.max_steps}")
        if self.request_cost > self.max_cost_per_request:
            raise BudgetExceeded(f"Request budget exceeded: ${self.request_cost:.2f}")
        if self.daily_cost > self.max_daily_cost:
            # Also: send PagerDuty alert before raising!
            raise BudgetExceeded(f"Daily budget exceeded: ${self.daily_cost:.2f}")

Common mistake: not capping costs before launch

Many teams set up budget alerts only after receiving a surprise $500 bill. Define limits before launch: max cost per request ($1), max daily cost ($100), max steps per request (15), max tokens per request (10K). The specific values depend on your use case, but there must be values. A limit that is too high is better than no limit at all.

Do it now | 5 minutes

Add the CostGuard class to your server. Set max_cost_per_request=$1, max_daily_cost=$50, max_steps=15. Call check_budget() after every LLM call.

Exercise 3: Build a Cost Dashboard (25 minutes)

Add a GET /costs endpoint that returns: total cost today, cost per hour (last 24h), top 5 most expensive requests, average cost per request
Add a cost breakdown by model (Haiku vs Sonnet)
Add an alert: if the hourly cost exceeds $10, write a WARNING to the log
Verify that the dashboard shows correct data after 10 requests

Expected result: an endpoint that shows real-time cost tracking with alerts.

Advanced | 25 minutes | Concept | Free

Security in Production

An AI agent in production is a huge attack surface. It accepts free-form input (natural language) and runs tools that can reach APIs, databases, and even file systems. Every missing security layer is an open door for an attacker.

Authentication and Authorization

# JWT Authentication for agent API
import os
from datetime import datetime, timedelta, timezone

import jwt  # PyJWT
from fastapi import HTTPException

SECRET_KEY = os.environ["JWT_SECRET"]  # fail fast at startup if the secret is missing

def create_token(user_id: str, role: str = "user") -> str:
    now = datetime.now(timezone.utc)  # datetime.utcnow() is deprecated since Python 3.12
    payload = {
        "user_id": user_id,
        "role": role,
        "exp": now + timedelta(hours=24),
        "iat": now
    }
    return jwt.encode(payload, SECRET_KEY, algorithm="HS256")

def verify_token(token: str) -> dict:
    try:
        return jwt.decode(token, SECRET_KEY, algorithms=["HS256"])
    except jwt.ExpiredSignatureError:
        raise HTTPException(status_code=401, detail="Token expired")
    except jwt.InvalidTokenError:
        raise HTTPException(status_code=401, detail="Invalid token")

# Role-based access to agent tools
TOOL_PERMISSIONS = {
    "user":  ["search", "summarize", "translate"],
    "admin": ["search", "summarize", "translate", "database_query", "deploy"],
    "readonly": ["search", "summarize"]
}

def check_tool_access(role: str, tool_name: str):
    allowed = TOOL_PERMISSIONS.get(role, [])
    if tool_name not in allowed:
        raise HTTPException(
            status_code=403,
            detail=f"Role '{role}' cannot use tool '{tool_name}'"
        )

Input Validation Against Prompt Injection

# Input validation layer
import re

def validate_agent_input(prompt: str) -> str:
    # 1. Length limit
    if len(prompt) > 10000:
        raise HTTPException(status_code=400, detail="Prompt too long (max 10K chars)")

    # 2. Detect common prompt injection patterns
    injection_patterns = [
        r"ignore\s+(all\s+)?previous\s+instructions",
        r"system\s*:\s*you\s+are\s+now",
        r"forget\s+(everything|all|your)\s+(instructions|rules)",
        r"override\s+safety",
        r"act\s+as\s+if\s+you\s+have\s+no\s+restrictions",
    ]
    for pattern in injection_patterns:
        if re.search(pattern, prompt, re.IGNORECASE):
            logger.warning(f"Prompt injection attempt detected: {pattern}")
            raise HTTPException(status_code=400, detail="Invalid input detected")

    # 3. Strip potentially dangerous content
    prompt = prompt.replace("\x00", "")  # null bytes

    return prompt

Output Sanitization

# Output sanitization -- strip PII before returning
def sanitize_output(response: str) -> str:
    # Remove Israeli phone numbers: landline (0X-XXXXXXX) and mobile (05X-XXXXXXX).
    # Runs before the ID rule so 9-digit landline numbers are not labeled as IDs.
    response = re.sub(r'\b0[2-9]\d?-?\d{7}\b', '[REDACTED-PHONE]', response)

    # Remove Israeli ID numbers (9 digits)
    response = re.sub(r'\b\d{9}\b', '[REDACTED-ID]', response)

    # Remove credit card numbers
    response = re.sub(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b', '[REDACTED-CC]', response)

    # Remove email addresses
    response = re.sub(r'[\w.-]+@[\w.-]+\.\w+', '[REDACTED-EMAIL]', response)

    return response

Secrets Management

Method	Best for	Example
Environment variables	MVP, small projects	`.env` file, Docker `--env-file`
Cloud Secrets Manager	Production workloads	AWS Secrets Manager, GCP Secret Manager
HashiCorp Vault	Enterprise, multi-cloud	Dynamic secrets, rotation, audit logging
Cloudflare KV / Secrets	Edge / serverless	Secrets bound to Workers at deploy time

Israeli Context: Securing AI Agents

In 2025 the Israel National Cyber Directorate (INCD) published guidelines for securing AI systems. The main points: separating the AI engine from sensitive data, full audit logging of all inputs and decisions, and testing for prompt injection as part of penetration testing. Israeli startups pursuing SOC2 need to show AI-specific controls: logging of inputs/outputs, a data retention policy, and access control for agent tools. In Israel, compliance with the Privacy Protection Regulations (5777-2017) also requires data residency awareness.

Common mistake: API keys in Git

Never commit API keys or secrets to Git. Not even if the repo is private. Not even if you deleted the commit afterwards (it is still in the history). Add .env to .gitignore, and run git-secrets or gitleaks as a pre-commit hook that scans for secrets before every commit.

Do it now | 5 minutes

Add validate_agent_input() to your server, before every call to the LLM. Test it: send a request containing "ignore all previous instructions" and confirm that the server returns 400.

Advanced | 30 minutes | Practice | Free

CI/CD for Agents

An AI agent differs from regular software because it changes through two channels: code (like any software) and prompts (system prompt, tool descriptions, guardrails). Both need versioning, testing, and a deployment pipeline.

Testing Pipeline for Agents

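# Pipeline: unit tests --> integration tests --> eval suite --> staging --> production
import pytest
from fastapi.testclient import TestClient

from main import app  # assumption: the FastAPI app object lives in main.py

client = TestClient(app)  # HTTP test client for the agent API, not the Anthropic client

# 1. Unit Tests (no LLM calls, fast)
# test_tools.py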
def test_search_tool_formatting():
    result = format_search_results([{"title": "Test", "url": "https://test.com"}])
    assert "Test" in result
    assert "https://test.com" in result

def test_cost_guard_limits():
    guard = CostGuard(max_cost_per_request=1.0)
    guard.check_budget(0.5)  # OK
    with pytest.raises(BudgetExceeded):
        guard.check_budget(0.6)  # Exceeds $1.0

# 2. Integration Tests (real LLM calls, slower)
def test_agent_responds_to_simple_query():
    response = client.post("/agent/run", json={"prompt": "What is 2+2?"})
    assert response.status_code == 200
    assert "4" in response.json()["response"]

# 3. Eval Suite (quality checks)
def test_agent_eval_suite():
    test_cases = [
        {"prompt": "Summarize this article: ...", "expected_contains": ["key point"]},
        {"prompt": "Translate to Hebrew: Hello", "expected_contains": ["שלום"]},
    ]
    passed = 0
    for case in test_cases:
        response = run_agent(case["prompt"])
        if all(kw in response for kw in case["expected_contains"]):
            passed += 1
    pass_rate = passed / len(test_cases)
    assert pass_rate >= 0.8, f"Eval pass rate {pass_rate:.0%} below 80% threshold"

GitHub Actions CI/CD Pipeline

# .github/workflows/agent-deploy.yml
name: Agent CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: pip install -r requirements.txt -r requirements-dev.txt

      - name: Unit tests
        run: pytest tests/unit/ -v

      - name: Integration tests
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: pytest tests/integration/ -v --timeout=60

      - name: Eval suite
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: python tests/eval_suite.py --min-pass-rate 0.8

  deploy-staging:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build Docker image
        run: docker build -t agent:${{ github.sha }} .

      - name: Push to registry
        run: |
          docker tag agent:${{ github.sha }} registry.example.com/agent:${{ github.sha }}
          docker push registry.example.com/agent:${{ github.sha }}

      - name: Deploy to staging
        run: |
          # Deploy to staging environment
          kubectl set image deployment/agent-staging agent=registry.example.com/agent:${{ github.sha }}

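  canary-deploy:
    needs: deploy-staging
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4  # needed for the k8s/ manifests and scripts/

      - name: Canary deploy (5% traffic)
        run: |
          # Route 5% of production traffic to new version
          kubectl apply -f k8s/canary-5-percent.yml

      - name: Monitor canary (10 minutes)
        id: monitor
        continue-on-error: true  # let the next steps decide: promote or roll back
        run: |
          # Check error rate and latency for 10 minutes
          python scripts/monitor_canary.py --duration 600 --max-error-rate 0.05

      # Each step runs in its own shell, so $? from the previous step is not available.
      # Branch on the recorded outcome of the monitor step instead.
      - name: Promote to 100%
        if: steps.monitor.outcome == 'success'
        run: kubectl apply -f k8s/production-100-percent.yml

      - name: Rollback
        if: steps.monitor.outcome != 'success'
        run: |
          kubectl apply -f k8s/rollback.yml
          exit 1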
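
The workflow calls scripts/monitor_canary.py, which is not shown in this chapter. Its core decision could look roughly like the sketch below. Everything here is an assumption for illustration: the function name `canary_is_healthy`, the polling interval, and the `fetch_error_rate` callable, which stands in for a query against your own metrics backend.

```python
import time
from typing import Callable


def canary_is_healthy(fetch_error_rate: Callable[[], float], duration_s: float,
                      max_error_rate: float, interval_s: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll the canary's error rate; fail fast on the first sample above the limit."""
    checks = max(1, int(duration_s // interval_s))
    for _ in range(checks):
        if fetch_error_rate() > max_error_rate:
            return False
        sleep(interval_s)
    return True


# Usage with fake samples instead of a real metrics query:
samples = iter([0.01, 0.02, 0.09])
ok = canary_is_healthy(lambda: next(samples), duration_s=90,
                       max_error_rate=0.05, interval_s=30, sleep=lambda s: None)
```

A real script would parse `--duration` and `--max-error-rate` from the command line and exit with a non-zero status when the function returns False, which is what makes the CI step fail.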

Prompt Versioning

Prompts are code, and they need version control. Changing a single word in the system prompt can change the behavior of the entire agent.

# prompts/v2.3.0/system_prompt.txt
You are a customer support agent for ClickFlow.
Your role is to help users with technical issues.

Rules:
- Always be helpful and professional
- If you don't know the answer, say so
- Never share internal system details
- Escalate billing issues to humans

# Version history:
# v2.3.0 - Added escalation rule for billing
# v2.2.0 - Improved professional tone guidelines
# v2.1.0 - Added "don't share internal details" rule
# v2.0.0 - Major rewrite for new product features
# v1.0.0 - Initial version
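
With prompts stored in versioned directories like the one above, the server can pin the exact version it loads. The following is a minimal sketch, assuming the layout prompts/<version>/system_prompt.txt and assuming that lines starting with `#` are file comments (path header, version history) that should not be sent to the model. The function name `load_system_prompt` is hypothetical.

```python
from pathlib import Path


def load_system_prompt(version: str, root: Path = Path("prompts")) -> str:
    """Read prompts/<version>/system_prompt.txt and drop '#' comment lines."""
    text = (root / version / "system_prompt.txt").read_text(encoding="utf-8")
    lines = [line for line in text.splitlines() if not line.startswith("#")]
    return "\n".join(lines).strip()
```

One option is to read the version from an environment variable set at deploy time, so that a rollback of the prompt is a configuration change rather than a code change.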

Exercise 4: Set Up a Basic CI/CD Pipeline (30 minutes)

Create a tests/ directory with test_tools.py, test_api.py, and eval_suite.py
Write 3 unit tests (no LLM calls): validation, cost guard, response formatting
Write 2 integration tests (with LLM calls): agent responds, streaming works
Write an eval suite with 5 test cases that check response quality
Create .github/workflows/agent-deploy.yml that runs the unit tests, integration tests, and eval suite
Push and verify that the pipeline runs in GitHub Actions

Expected result: a CI/CD pipeline that runs automatically on every push and checks code + prompts + quality.

Intermediate | 15 minutes | Concept | Free

Compliance and Audit

When an AI agent handles customers, analyzes data, or makes decisions, you need to record everything. Not only for debugging, but also for compliance:

What to Record

What	Why	How long to keep
Inputs (prompts)	Audit trail: what the agent was asked	90 days (GDPR: right to deletion)
Outputs (responses)	Audit trail: what the agent answered	90 days
Tool calls + results	Understanding the decision chain	90 days
Model + version	Reproducibility	As long as the model is available
User identity	Access audit	Per your privacy policy
Cost + tokens	Financial audit, billing	7 years (Israeli income tax rules)

GDPR and the Right to Deletion

If a user asks you to delete their data (GDPR Article 17), you need to delete:

All of their conversations (inputs + outputs)
All embeddings created from their data (vector DB)
All cached responses related to them
Metadata that identifies them (email, user ID in logs)

# GDPR deletion endpoint
@app.delete("/user/{user_id}/data")
async def delete_user_data(user_id: str, api_key: str = Depends(verify_admin)):
    # 1. Delete conversations from database
    db.execute("DELETE FROM conversations WHERE user_id = ?", [user_id])

    # 2. Delete from vector database
    vector_db.delete(filter={"user_id": user_id})

    # 3. Delete from cache
    for key in redis.scan_iter(f"cache:*:{user_id}:*"):
        redis.delete(key)

    # 4. Anonymize logs (don't delete -- anonymize for compliance)
    db.execute(
        "UPDATE audit_logs SET user_id = 'DELETED' WHERE user_id = ?",
        [user_id]
    )

    logger.info(f"GDPR deletion completed for user {user_id}")
    return {"status": "deleted", "user_id": user_id}

Do it now | 3 minutes

Write down: what are the 3 most sensitive types of data your agent handles? For each one, note where it is stored and what the retention policy is.

Beginner | 10 minutes | Strategy | Free

Work Routine: Production Agent

An agent in production needs ongoing supervision. Here is the routine that keeps everything working:

Work Routine: Production Agent Ops

Frequency	Task	Time
Daily	Check the monitoring dashboard: are error rate, latency, and cost all within the normal range?	3 min
Daily	Review alerts received since yesterday: false positive, or needs handling?	5 min
Daily	Check daily cost: is it within the expected range?	1 min
Weekly	Review error logs: what are the most common failures, and do they need a fix?	15 min
Weekly	Check eval suite results: has quality dropped? Do prompts need updating?	10 min
Weekly	Check cost trends: is there an unexpected increase?	5 min
Monthly	Security review: check audit logs, rotate API keys, review access permissions	30 min
Monthly	Run a load test: can the system handle 2x current traffic?	20 min
Monthly	Update dependencies: LLM SDK versions, security patches, Docker base image	30 min

Do it now | 2 minutes

Set a calendar reminder: "Check production agent dashboard", every morning at 9:00. Even 3 minutes is enough to catch problems before they turn into a disaster.

If You Do Only One Thing from This Chapter

Package the agent in Docker and deploy it. You do not need Kubernetes, a complex CI/CD setup, or full monitoring. Just a Dockerfile + docker-compose.yml + a $20/month VPS. Once the agent runs 24/7 on a server (instead of on your laptop), everything changes: you start thinking about monitoring, security, and costs naturally. The first step is the most important one.

Check Yourself: 5 Questions

What is the difference between a stateless and a stateful agent, and how does it affect the choice of hosting? When can a stateful agent behave like a stateless one? (Hint: external state store)
Explain the Circuit Breaker pattern in three sentences. Why is exponential backoff alone not enough in production? (Hint: cascade failures)
Which three cost optimization levers together save 90%+? For each lever, explain how it works and what the tradeoff is. (Hint: model routing, prompt caching, batch processing)
Why is durable execution critical for agents but not for regular APIs? Give an example of a workflow that breaks without durability. (Hint: long-running, human approval)
What is the difference between a canary deployment and a rollback? Why are both important in CI/CD for agents? (Hint: non-deterministic behavior, eval degradation)

Got 4 out of 5 right? Excellent, you are ready for Chapter 20.

Chapter Summary

In this chapter you made the critical move from "an agent that runs on a laptop" to "a production service that runs 24/7". You started with production architecture: you learned the difference between stateless and stateful agents and built a Docker container ready for deployment. You examined hosting options (self-hosted VPS, serverless with Lambda or Workers, and managed platforms such as LangGraph and Bedrock) with a full cost comparison and Israeli considerations such as latency and data residency. You built a production API server with FastAPI/Express that supports REST, SSE streaming, authentication, and rate limiting. You learned scaling strategies (horizontal scaling, queue-based architecture, caching, and model routing) with a Scaling Playbook that guides you from 10 to 10M requests. You met durable execution, the technology that ensures agents survive crashes, restarts, and deployments, with examples in LangGraph and Temporal. You built error handling with circuit breakers and graceful degradation, monitoring with OpenTelemetry and Grafana, and cost control with budget guards, where model routing + prompt caching + batch processing save 90%+ of LLM costs. You added a security layer: JWT, prompt injection detection, output sanitization, and secrets management. You finished with a CI/CD pipeline that includes unit tests, an eval suite, canary deployments, and rollback, plus compliance with audit logging and GDPR deletion.

The key point: production is not just "running the code on a server". It is whole layers of monitoring, cost control, security, resilience, and testing that ensure the agent works reliably, economically, and safely, 24 hours a day, 7 days a week.

In the next chapter (Chapter 20) you will wrap up the whole course with Strategy: how to choose SDKs, build teams that work with AI agents, and plan for the future.

Checklist: Finishing Chapter 19

I built a Dockerfile for the agent with a non-root user, a health check, and a multi-stage build
I wrote a docker-compose.yml with an agent service, Redis, and an Nginx reverse proxy
I built a production API server (FastAPI or Express) with /agent/run and /agent/stream
I added SSE streaming: tokens arrive in real time, not as a full response after 10 seconds
I added authentication (API keys or JWT) and rate limiting
I configured cost guards: max cost per request, max daily cost, max steps per request
I added a circuit breaker for LLM calls and tool calls
I added structured logging (JSON) with request_id, tokens, cost, latency
I added a monitoring endpoint (/metrics) with token usage, cost, and error rate
I added input validation against prompt injection and output sanitization for PII
I wrote unit tests (3+) and an eval suite (5+ test cases) for the agent
I created a GitHub Actions workflow that runs tests on every push
I understand durable execution and know when LangGraph checkpointing or Temporal is needed
I know The Scaling Playbook and which stage my agent is at
I understand the GDPR implications and have a data retention policy for the agent's data