פרק 12: Memory & State — לגרום לסוכנים לזכור

מה יהיה לך בסוף הפרק הזה

הבנה מעמיקה של 5 שכבות הזיכרון של סוכן AI -- מ-context window ועד long-term store
מימוש עובד של 3 אסטרטגיות conversation memory -- sliding window, summary, ו-token-aware truncation
מערכת RAG עובדת עם vector store -- שאילתות סמנטיות על מסמכים, כולל Hebrew
סוכן עם persistent state שזוכר העדפות משתמש בין sessions
מימוש של CLAUDE.md pattern ושל memory-as-tool pattern -- שני הדפוסים הנפוצים ב-2026
הבנה של memory frameworks ייעודיים -- Letta, Mem0, Redis, LangGraph Memory
Memory Budget Calculator -- חישוב עלויות זיכרון מקצה לקצה
מערכת זיכרון מלאה: conversation history + vector store RAG + persistent preferences

מה תוכלו לעשות אחרי הפרק הזה

תוכלו לבנות מערכת conversation memory עם sliding window, summary, או token-aware truncation -- ולדעת מתי להשתמש בכל אחד
תוכלו להקים vector store עם embeddings ולבנות RAG pipeline שמשלים את הידע של הסוכן ממסמכים
תוכלו לתכנן מערכת persistent state שזוכרת העדפות משתמש, החלטות קודמות, ותקדמות במשימות לאורך זמן
תוכלו ליישם long-term memory patterns -- CLAUDE.md, memory-as-tool, auto-summarization, knowledge graphs
תוכלו לחשב עלות זיכרון מקצה לקצה ולבחור את ה-tradeoff הנכון בין מחיר, latency, ואיכות

לפני שמתחילים

פרקים קודמים: פרק 1 (מה זה סוכן AI), פרק 11 (Tool Use Mastery -- הזיכרון בנוי על tools)
מה תצטרכו: Python 3.11+ ו/או Node.js 18+, מפתח API (Anthropic / OpenAI / Google), עורך קוד
ידע נדרש: Python או TypeScript ברמה בינונית, הכרת tool calling ו-agent loops (מפרקים 5-11)
זמן משוער: 4-5 שעות (כולל תרגילים)
עלות API משוערת: $5-15 (embeddings, LLM calls, vector DB free tiers)

הפרויקט שלך -- קו אדום לאורך הקורס

בפרק 11 שיפרתם את ה-tools של הסוכן שלכם -- עיצוב, schemas, error handling, ואבטחה. עכשיו אתם מוסיפים את השכבה שמבדילה בין סוכן חד-פעמי לסוכן ארוך-טווח: זיכרון. בפרק הזה תוסיפו לסוכן שלכם conversation memory שמנהל את ה-context window בצורה חכמה, RAG שמשלים ידע ממסמכים, ו-persistent preferences שנשמרות בין sessions. בפרק 13 תשתמשו בזיכרון הזה כדי לבנות מערכות multi-agent שחולקות context ביניהן.

מילון מונחים -- פרק 12

מונח (English)	עברית	הסבר
Context Window	חלון הקשר	כמות הטקסט המקסימלית שהמודל יכול "לראות" בבת אחת. Claude Opus 4.6: 200K tokens. GPT-5: 1M tokens. כל מה שהסוכן "זוכר" חייב להיכנס לכאן
Conversation Memory	זיכרון שיחה	שמירת הודעות קודמות מהשיחה הנוכחית ב-context window. הצורה הפשוטה ביותר של זיכרון סוכן
Sliding Window	חלון נגלל	אסטרטגיה ששומרת רק את N ההודעות האחרונות. הודעות ישנות נמחקות. פשוט אבל מאבד הקשר
Summary Memory	זיכרון סיכום	סיכום תקופתי של השיחה. ההודעות הישנות מוחלפות בסיכום מרוכז. שומר על הקשר עם פחות tokens
Embedding	וקטור ייצוג	המרת טקסט לרשימת מספרים (וקטור) שמייצגת את המשמעות. טקסטים דומים יוצרים וקטורים קרובים
Vector Store / Vector DB	מאגר וקטורים	בסיס נתונים מיוחד לחיפוש דמיון בין וקטורים. Pinecone, Qdrant, ChromaDB, pgvector
RAG	שליפה מוגברת ליצירה	Retrieval-Augmented Generation. דפוס שבו הסוכן שולף מידע רלוונטי ממאגר ומוסיף אותו ל-prompt לפני שמייצר תשובה
Chunking	חלוקה לקטעים	פירוק מסמכים לקטעים קטנים לפני embedding. אסטרטגיות: fixed-size, semantic, recursive, document-specific
Hybrid Search	חיפוש היברידי	שילוב חיפוש וקטורי (לפי משמעות) עם חיפוש מילות מפתח (BM25). מדויק יותר מכל אחד לבד
Re-ranking	דירוג מחדש	שימוש במודל נוסף לסידור מחדש של תוצאות חיפוש לפי רלוונטיות. משפר precision משמעותית
Persistent State	מצב קבוע	נתונים מובנים ששמורים מחוץ ל-context window -- העדפות משתמש, היסטוריית החלטות, התקדמות במשימות
CLAUDE.md Pattern	דפוס CLAUDE.md	קובץ זיכרון שהסוכן קורא בתחילת כל session וכותב אליו בסוף. פשוט ויעיל. בשימוש ב-Claude Code
Memory-as-Tool	זיכרון ככלי	דפוס שבו לסוכן יש tools של save_memory ו-recall_memory. הסוכן מחליט בעצמו מה שווה לזכור
Letta	לטה	Framework לניהול זיכרון בהשראת OS -- שכבות RAM + disk עם virtual context management. שיפור 18% בדיוק
Mem0	מם-אפס	שכבת זיכרון ייעודית שמוסיפה persistent, evolving memory לכל סוכן. API פשוט, cross-session persistence
Agentic RAG	RAG אגנטי	הסוכן מתאם את ה-retrieval כאחד מכלים רבים -- מחליט מתי לחפש, כמה פעמים, ומתי לענות ישירות

למה זיכרון חשוב -- ומה קורה בלעדיו

beginner20 דקותconcept

דמיינו שיש לכם עובד מבריק -- מהיר, מדויק, מכיר כל תחום. אבל כל בוקר הוא מתעורר בלי שום זיכרון. הוא לא זוכר מי אתם, מה דיברתם אתמול, מה ההעדפות שלכם, או מה כבר עשה. כל אינטראקציה מתחילה מאפס מוחלט.

זה בדיוק מה שקורה עם סוכן AI ללא זיכרון.

LLMs הם stateless מטבעם. כל קריאת API היא עצמאית -- המודל לא "זוכר" קריאות קודמות. כל מה שהמודל "יודע" על השיחה הנוכחית הוא מה שאנחנו שולחים לו ב-context window. זה גם הכוח (no side effects, predictable) וגם המגבלה (no memory, no learning).

זיכרון מאפשר ארבעה דברים קריטיים:

יכולת	בלי זיכרון	עם זיכרון
Personalization	מתייחס לכל משתמש אותו דבר	מתאים סגנון, שפה, העדפות לכל משתמש
Learning	חוזר על אותן טעויות	לומד מטעויות ומשפר לאורך זמן
Consistency	סותר את עצמו בין sessions	שומר על עקביות לאורך זמן
Long-running tasks	אי אפשר לעצור ולהמשיך	עוצר, שומר מצב, ממשיך מאיפה שעצר

Framework: "The Agent Memory Stack" -- 5 שכבות

Framework: The Agent Memory Stack

חשבו על זיכרון סוכן כ-stack של 5 שכבות, מהמהיר/זמני לאיטי/קבוע:

שכבה	סוג	אנלוגיה	דוגמה	Latency
L1: Context Window	Working memory	זיכרון עבודה (RAM)	ההודעות הנוכחיות של השיחה	0ms
L2: Conversation Store	Short-term memory	פנקס רשימות על השולחן	היסטוריית שיחה שנשמרת בצד	1-10ms
L3: Vector Store	Semantic memory	ספרייה עם חיפוש	מסמכים, ידע, FAQs	50-200ms
L4: Structured Store	Episodic/Procedural	תיק מסודר בארון	העדפות משתמש, החלטות, כללים	5-50ms
L5: External Knowledge	Long-term memory	ארכיון מרוחק	ידע עולמי, APIs חיצוניים, web search	200-2000ms

הכלל: לכל שאילתה, הסוכן צריך לבחור מאיזו שכבה לשלוף -- ולמזער את הנתונים שנכנסים ל-context window (L1). זה ה-tradeoff המרכזי: יותר הקשר = תשובות טובות יותר, אבל גם יותר tokens = יותר עלות ויותר latency.

עשו עכשיו 5 דקות

פתחו את ה-AI assistant שאתם הכי משתמשים בו (Claude, ChatGPT, Gemini). שאלו אותו: "מה דיברנו ב-session הקודם?" ואז שאלו: "מה ההעדפות שלי?"

שימו לב למה שהוא יודע ומה שלא. כל מה שהוא יודע -- מישהו מימש עבורו זיכרון. כל מה שלא -- זה מה שאנחנו נלמד לבנות בפרק הזה.

Conversation Memory -- זיכרון לטווח קצר

intermediate35 דקותpractice

הצורה הפשוטה ביותר של זיכרון: לשמור את כל ההודעות של השיחה הנוכחית ב-context window. זה מה שכל chatbot עושה -- וזה עובד עד שנגמר המקום.

גודל ה-Context Window ב-2026

מודל	Context Window	~כמה הודעות נכנסות	עלות ל-1M tokens input
Claude Opus 4.6	200K tokens	~500 הודעות ארוכות	$15.00
Claude Sonnet 4.6	200K tokens	~500 הודעות ארוכות	$3.00
GPT-5	1M tokens	~2,500 הודעות ארוכות	$2.50
Gemini 2.5 Pro	1M tokens	~2,500 הודעות ארוכות	$1.25

גם עם חלונות של מיליון tokens, אתם לא רוצים לשלוח הכל. למה?

עלות: כל token עולה כסף. שליחת 200K tokens בכל קריאה = $3 לקריאה ב-Opus
Latency: יותר tokens = זמן עיבוד ארוך יותר
Lost in the middle: מודלים מתקשים לשלוף מידע מאמצע context ארוך -- ההתחלה והסוף חשובים יותר
Noise: הודעות ישנות ולא רלוונטיות מסיחות את המודל

לכן צריך אסטרטגיות ניהול זיכרון שיחה. הנה שלוש:

אסטרטגיה 1: Sliding Window

שומרים רק את N ההודעות האחרונות. כשמגיעה הודעה חדשה, הישנה ביותר נמחקת.

Sliding Window -- Python

from collections import deque
from anthropic import Anthropic

class SlidingWindowMemory:
    """Keep the last N messages in the conversation."""

    def __init__(self, window_size: int = 20):
        self.messages = deque(maxlen=window_size)
        self.system_prompt = "You are a helpful assistant."

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})

    def get_messages(self) -> list[dict]:
        return list(self.messages)

    def chat(self, user_input: str) -> str:
        self.add_message("user", user_input)

        client = Anthropic()
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            system=self.system_prompt,
            messages=self.get_messages()
        )

        assistant_reply = response.content[0].text
        self.add_message("assistant", assistant_reply)
        return assistant_reply

# Usage
memory = SlidingWindowMemory(window_size=10)
print(memory.chat("My name is Yael"))        # Knows the name
print(memory.chat("What is my name?"))        # "Your name is Yael"
# After 10+ exchanges, early messages are dropped

Sliding Window -- TypeScript (Vercel AI SDK)

import { generateText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

interface Message {
  role: 'user' | 'assistant';
  content: string;
}

class SlidingWindowMemory {
  private messages: Message[] = [];
  private windowSize: number;

  constructor(windowSize: number = 20) {
    this.windowSize = windowSize;
  }

  addMessage(role: 'user' | 'assistant', content: string): void {
    this.messages.push({ role, content });
    // Trim to window size
    if (this.messages.length > this.windowSize) {
      this.messages = this.messages.slice(-this.windowSize);
    }
  }

  async chat(userInput: string): Promise<string> {
    this.addMessage('user', userInput);

    const { text } = await generateText({
      model: anthropic('claude-sonnet-4-20250514'),
      system: 'You are a helpful assistant.',
      messages: this.messages,
    });

    this.addMessage('assistant', text);
    return text;
  }
}

// Usage
const memory = new SlidingWindowMemory(10);
await memory.chat("My name is Yael");
await memory.chat("What is my name?"); // "Your name is Yael"

יתרון: פשוט, O(1) מקום, latency קבוע.
חיסרון: מאבד הקשר ישן לגמרי. אם המשתמש הזכיר משהו חשוב 30 הודעות אחורה -- נעלם.

אסטרטגיה 2: Summary Memory

כל כמה הודעות, מסכמים את השיחה עד כה ומחליפים את ההודעות הישנות בסיכום.

Summary Memory -- Python

from anthropic import Anthropic

class SummaryMemory:
    """Periodically summarize old messages to save context space."""

    def __init__(self, max_messages: int = 10, summary_threshold: int = 8):
        self.messages: list[dict] = []
        self.summary: str = ""
        self.max_messages = max_messages
        self.summary_threshold = summary_threshold
        self.client = Anthropic()

    def _summarize(self):
        """Summarize old messages and replace them with the summary."""
        old_messages = self.messages[:self.summary_threshold]
        conversation_text = "\n".join(
            f"{m['role']}: {m['content']}" for m in old_messages
        )

        prompt = f"""Summarize this conversation concisely.
Keep: key facts, user preferences, decisions made, action items.
Drop: greetings, filler, repeated information.

Previous summary: {self.summary or 'None'}

New conversation:
{conversation_text}

Write a concise summary (max 200 words):"""

        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=300,
            messages=[{"role": "user", "content": prompt}]
        )
        self.summary = response.content[0].text
        # Keep only recent messages
        self.messages = self.messages[self.summary_threshold:]

    def get_messages(self) -> list[dict]:
        msgs = []
        if self.summary:
            msgs.append({
                "role": "user",
                "content": f"[Previous conversation summary: {self.summary}]"
            })
            msgs.append({
                "role": "assistant",
                "content": "I understand the context from our previous conversation. How can I help?"
            })
        msgs.extend(self.messages)
        return msgs

    def chat(self, user_input: str) -> str:
        self.messages.append({"role": "user", "content": user_input})

        # Summarize if we hit the threshold
        if len(self.messages) >= self.max_messages:
            self._summarize()

        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            system="You are a helpful assistant. Use the conversation summary for context.",
            messages=self.get_messages()
        )

        reply = response.content[0].text
        self.messages.append({"role": "assistant", "content": reply})
        return reply

יתרון: שומר על הקשר חשוב לאורך זמן. לא מאבד מידע קריטי.
חיסרון: הסיכום עולה API call נוסף. סיכום עלול לאבד פרטים חשובים.

אסטרטגיה 3: Token-Aware Truncation

במקום לספור הודעות, סופרים tokens ומורידים הודעות ישנות עד שנכנסים לתקציב.

Token-Aware Truncation -- Python

import tiktoken  # or anthropic's token counter

class TokenAwareMemory:
    """Keep messages within a token budget, prioritizing recent ones."""

    def __init__(self, max_tokens: int = 4000):
        self.messages: list[dict] = []
        self.max_tokens = max_tokens
        self.encoder = tiktoken.encoding_for_model("gpt-4")  # rough estimate

    def _count_tokens(self, messages: list[dict]) -> int:
        return sum(
            len(self.encoder.encode(m["content"])) + 4  # role overhead
            for m in messages
        )

    def _truncate(self) -> list[dict]:
        """Remove oldest messages until within token budget."""
        msgs = list(self.messages)
        while self._count_tokens(msgs) > self.max_tokens and len(msgs) > 2:
            # Always keep the first message (often contains key context)
            # and remove from position 1
            msgs.pop(1)
        return msgs

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})

    def get_messages(self) -> list[dict]:
        return self._truncate()

יתרון: מדויק בניצול ה-context window. שומר על הודעה ראשונה חשובה.
חיסרון: ספירת tokens לוקחת זמן. יכול לחתוך הודעות באמצע שיחה חשובה.

מימוש ב-SDKs -- Conversation Memory מובנה

SDK	מה מובנה	מה צריך לבנות
Claude Agent SDK	Message list management	Summarization, truncation
Vercel AI SDK	`useChat` מנהל אוטומטית	Server-side persistence, window sizing
LangGraph	State עם message list, configurable window	Summary strategy
CrewAI	Built-in short-term memory	Custom retrieval logic

עשו עכשיו 15 דקות

מממשו את כל 3 האסטרטגיות (sliding window, summary, token-aware) ובדקו:

התחילו שיחה של 20 הודעות שבה אתם מספרים פרטים אישיים (שם, עיר, תחביב, עבודה)
בהודעה 21 שאלו: "מה אתה זוכר עליי?"
השוו: איזו אסטרטגיה זכרה יותר? איזו הייתה מהירה יותר? איזו עלתה פחות?

Vector Stores ו-Semantic Memory

intermediate40 דקותpractice

Conversation memory שומר מה קרה בשיחה הנוכחית. אבל מה אם הסוכן צריך לדעת מידע שלא בשיחה -- מסמכים, מדריכים, FAQs, קוד? כאן נכנס RAG -- Retrieval-Augmented Generation.

הרעיון פשוט: לפני שהסוכן עונה, הוא מחפש מידע רלוונטי ממאגר חיצוני ומוסיף אותו ל-prompt. במקום לצפות שהמודל "יידע" הכל, אנחנו נותנים לו את המידע שהוא צריך בדיוק ברגע הנכון.

Embedding Models -- מודלים שממירים טקסט לוקטורים

מודל	ספק	עלות	Hebrew Quality	הערות
`text-embedding-3-small`	OpenAI	$0.02/1M tokens	טוב	מאזן מצוין מחיר/איכות
`text-embedding-3-large`	OpenAI	$0.13/1M tokens	טוב מאוד	הכי מדויק של OpenAI
`text-embedding-005`	Google	Free tier זמין	טוב	חינמי עד 2,500 req/day
`embed-v4`	Cohere	$0.10/1M tokens	טוב	Built-in hybrid search
`nomic-embed-text-v1.5`	Open-source	חינמי (self-host)	בינוני	Runs locally, no API needed
`bge-m3`	Open-source	חינמי (self-host)	טוב	Multilingual, incl. Hebrew

Hebrew Embeddings -- שימו לב

רוב מודלי ה-embeddings אומנו בעיקר על אנגלית. עברית עובדת, אבל לא באותה איכות. כמה טיפים:

bge-m3 ו-text-embedding-3-large הם הטובים ביותר לעברית כרגע
בדקו תמיד עם דאטה אמיתי -- אל תסתמכו על benchmarks באנגלית
Hybrid search (vector + keyword) חשוב במיוחד לעברית -- keyword search תופס מילים שה-embedding מפספס
Chunk size קטן יותר לעברית -- 256-512 tokens במקום 512-1024. עברית צפופה יותר במשמעות

Vector Databases -- איפה שומרים את ה-Embeddings

DB	סוג	יתרון מרכזי	חיסרון	למי מתאים
ChromaDB	Python-native	פשוט, מקומי, מהיר להתחלה	לא מתאים ל-production scale	POCs, prototyping
Pinecone	Managed	קל, managed, serverless	יקר בגדלים גדולים	Production, teams קטנים
Qdrant	Open-source	מהיר (Rust), מתקדם	דורש infrastructure	Production, self-hosted
Weaviate	Open-source	Hybrid search מובנה	מורכב יותר ב-setup	Production, hybrid search
pgvector	PostgreSQL ext	אין infra חדש, SQL	פחות מהיר מ-dedicated	כשיש PostgreSQL קיים
Supabase Vector	Managed pgvector	Managed PostgreSQL + Vector	מוגבל ב-performance	Full-stack apps

Chunking -- איך מחלקים מסמכים

לפני שעושים embedding, צריך לחלק מסמכים לקטעים (chunks). הגודל והשיטה משפיעים על איכות ה-retrieval:

Fixed-size chunks (512-1024 tokens): פשוט, עקבי, אבל חותך באמצע משפטים
Semantic chunking: חותך לפי משמעות -- סוף פסקה, החלפת נושא. מדויק יותר אבל מורכב
Recursive character splitting: מנסה לחלק לפי paragraphs, אם גדול מדי -- לפי sentences, אם עדיין -- לפי characters
Document-specific: Markdown -- לפי headers. קוד -- לפי functions. HTML -- לפי sections

בניית RAG Agent מלא

RAG Agent עם ChromaDB -- Python

# pip install chromadb openai anthropic

import chromadb
from openai import OpenAI
from anthropic import Anthropic

# 1. Setup: Create vector store and add documents
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(
    name="knowledge_base",
    metadata={"hnsw:space": "cosine"}
)

# 2. Add documents (chunked)
documents = [
    "AI agents use tools to interact with external systems. "
    "Tools are functions the agent can call with specific parameters.",

    "The ReAct pattern combines reasoning and acting. "
    "The agent thinks about what to do, acts, observes the result, "
    "and repeats until the task is complete.",

    "MCP (Model Context Protocol) is an open standard for connecting "
    "AI agents to tools and data sources. It works like USB-C for AI.",

    "Vector databases store embeddings -- numerical representations "
    "of text. Similar texts produce similar vectors, enabling "
    "semantic search beyond keyword matching.",

    "RAG (Retrieval-Augmented Generation) retrieves relevant documents "
    "and adds them to the prompt before generating an answer. "
    "This grounds the response in actual data."
]

# Add with auto-generated embeddings (ChromaDB uses its own by default)
collection.add(
    documents=documents,
    ids=[f"doc_{i}" for i in range(len(documents))]
)

# 3. RAG query function
def rag_query(question: str, n_results: int = 3) -> str:
    """Search knowledge base and generate answer."""

    # Retrieve relevant chunks
    results = collection.query(
        query_texts=[question],
        n_results=n_results
    )

    # Build context from retrieved documents
    context = "\n\n".join(results["documents"][0])

    # Generate answer with retrieved context
    client = Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system="""Answer the user's question based on the provided context.
If the context doesn't contain the answer, say so honestly.
Always indicate which parts of your answer come from the context.""",
        messages=[{
            "role": "user",
            "content": f"""Context from knowledge base:
---
{context}
---

Question: {question}"""
        }]
    )
    return response.content[0].text

# Usage
answer = rag_query("What is the ReAct pattern?")
print(answer)

answer = rag_query("How does MCP work?")
print(answer)

RAG Agent -- TypeScript (Vercel AI SDK + Supabase)

// npm install ai @ai-sdk/anthropic @supabase/supabase-js

import { generateText, embed } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';
import { openai } from '@ai-sdk/openai';
import { createClient } from '@supabase/supabase-js';

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_KEY!
);

// 1. Embed and store a document chunk
async function addDocument(content: string, metadata: Record<string, any>) {
  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-small'),
    value: content,
  });

  await supabase.from('documents').insert({
    content,
    metadata,
    embedding,
  });
}

// 2. Search for relevant documents
async function searchDocuments(query: string, limit = 3) {
  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-small'),
    value: query,
  });

  const { data } = await supabase.rpc('match_documents', {
    query_embedding: embedding,
    match_threshold: 0.7,
    match_count: limit,
  });

  return data;
}

// 3. RAG query
async function ragQuery(question: string): Promise<string> {
  const docs = await searchDocuments(question);
  const context = docs.map((d: any) => d.content).join('\n\n');

  const { text } = await generateText({
    model: anthropic('claude-sonnet-4-20250514'),
    system: `Answer based on the provided context.
If the context doesn't cover the question, say so.`,
    prompt: `Context:\n${context}\n\nQuestion: ${question}`,
  });

  return text;
}

Retrieval Strategies -- איך מוצאים את ה-chunks הנכונים

Similarity search (cosine): הבסיסי -- מחפשים את ה-chunks הקרובים ביותר לשאילתה. פשוט ולרוב מספיק
Hybrid search (vector + keyword): משלב cosine similarity עם BM25 keyword search. חשוב במיוחד לעברית
Re-ranking: מושכים 20 תוצאות, מודל re-ranker (כמו Cohere Rerank) מסדר אותן מחדש. משפר precision ב-10-30%
Metadata filtering: מסננים לפי metadata (תאריך, קטגוריה, שפה) לפני חיפוש וקטורי. מפחית noise

עשו עכשיו 20 דקות

בנו RAG agent פשוט עם ChromaDB:

צרו collection עם 10 chunks של מידע על נושא שמעניין אתכם (tech docs, recipe book, FAQ)
שאלו 5 שאלות -- 3 שהתשובה קיימת במאגר, 2 שלא
בדקו: האם הסוכן מודה כשהוא לא יודע? האם הוא שולף את ה-chunks הנכונים?

Persistent State -- Key-Value ו-Structured

intermediate30 דקותpractice

לא הכל הוא טקסט חופשי. סוכנים צריכים לזכור נתונים מובנים: העדפות משתמש, היסטוריית החלטות, התקדמות במשימות, כללים נלמדים. לזה צריך structured storage.

מה לשמור ב-Persistent State

סוג נתון	דוגמה	Storage מתאים
העדפות משתמש	שפה מועדפת, סגנון תשובה, שם	Key-Value (Redis, KV)
היסטוריית החלטות	מה הסוכן בחר לעשות ולמה	Document Store (Firestore)
התקדמות במשימות	שלב נוכחי ב-workflow, items שהושלמו	Relational (PostgreSQL)
כללים נלמדים	"המשתמש הזה לא אוהב bullet points"	Key-Value + Text file
Entity relationships	"יוסי הוא המנהל של דנה"	Graph DB / Structured store

סוכן שזוכר העדפות -- מימוש מלא

Persistent Preferences Agent -- Python (SQLite)

import sqlite3
import json
from anthropic import Anthropic

class PersistentAgent:
    """Agent that remembers user preferences across sessions."""

    def __init__(self, db_path: str = "agent_memory.db"):
        self.db = sqlite3.connect(db_path)
        self._init_db()
        self.client = Anthropic()

    def _init_db(self):
        self.db.execute("""
            CREATE TABLE IF NOT EXISTS preferences (
                user_id TEXT,
                key TEXT,
                value TEXT,
                updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                PRIMARY KEY (user_id, key)
            )
        """)
        self.db.execute("""
            CREATE TABLE IF NOT EXISTS interactions (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                user_id TEXT,
                role TEXT,
                content TEXT,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        """)
        self.db.commit()

    def get_preferences(self, user_id: str) -> dict:
        rows = self.db.execute(
            "SELECT key, value FROM preferences WHERE user_id = ?",
            (user_id,)
        ).fetchall()
        return {k: v for k, v in rows}

    def save_preference(self, user_id: str, key: str, value: str):
        self.db.execute(
            """INSERT OR REPLACE INTO preferences (user_id, key, value)
               VALUES (?, ?, ?)""",
            (user_id, key, value)
        )
        self.db.commit()

    def chat(self, user_id: str, message: str) -> str:
        # Load user preferences
        prefs = self.get_preferences(user_id)
        prefs_text = "\n".join(f"- {k}: {v}" for k, v in prefs.items())

        # Build system prompt with preferences
        system = f"""You are a helpful assistant with memory.

User preferences (learned from past interactions):
{prefs_text or "No preferences saved yet."}

IMPORTANT: When the user expresses a preference, respond to their
message AND output a JSON block to save it:
SAVE_PREFERENCE: {{"key": "preference_name", "value": "preference_value"}}

Examples of preferences to save:
- Language preference, response style, name, interests, timezone"""

        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            system=system,
            messages=[{"role": "user", "content": message}]
        )

        reply = response.content[0].text

        # Extract and save any preferences from the response
        if "SAVE_PREFERENCE:" in reply:
            lines = reply.split("\n")
            for line in lines:
                if "SAVE_PREFERENCE:" in line:
                    json_str = line.split("SAVE_PREFERENCE:")[1].strip()
                    try:
                        pref = json.loads(json_str)
                        self.save_preference(user_id, pref["key"], pref["value"])
                    except json.JSONDecodeError:
                        pass
            # Clean the response
            reply = reply.split("SAVE_PREFERENCE:")[0].strip()

        return reply

# Usage -- Session 1
agent = PersistentAgent()
agent.chat("user_123", "Hi! I'm Yael, I prefer responses in Hebrew")
agent.chat("user_123", "I'm a data engineer at a startup in Tel Aviv")

# Session 2 (even after restart!)
agent2 = PersistentAgent()  # new instance, same DB
response = agent2.chat("user_123", "Can you help me?")
# Agent knows: name=Yael, language=Hebrew, role=data engineer, location=Tel Aviv

LangGraph Checkpointing -- Automatic State Persistence

LangGraph מציע checkpointing אוטומטי -- בכל node בגרף, ה-state נשמר אוטומטית. אם הסוכן קורס באמצע, הוא יכול להמשיך מאיפה שעצר:

LangGraph Checkpointing -- Python

from langgraph.graph import StateGraph, MessagesState
from langgraph.checkpoint.sqlite import SqliteSaver

# Create a checkpointer -- state saved to SQLite automatically
checkpointer = SqliteSaver.from_conn_string("agent_state.db")

# Define your graph
graph = StateGraph(MessagesState)
# ... add nodes and edges ...
app = graph.compile(checkpointer=checkpointer)

# Every invocation automatically saves state
config = {"configurable": {"thread_id": "user_123_session_1"}}

# First interaction
result = app.invoke(
    {"messages": [{"role": "user", "content": "Start a research task"}]},
    config=config
)

# Later -- even after restart -- resume from where we stopped
result = app.invoke(
    {"messages": [{"role": "user", "content": "Continue the research"}]},
    config=config  # same thread_id = resume state
)

עשו עכשיו 15 דקות

בנו את ה-PersistentAgent למעלה (או גרסה פשוטה שלו). הריצו שיחה, סגרו את התוכנית, פתחו מחדש, ובדקו שהסוכן עדיין זוכר:

האם הוא זוכר את השם שלכם?
האם הוא זוכר את שפת ההעדפה?
נסו להוסיף 3 העדפות ולבדוק ש-כולן שרדו restart

Long-Term Memory Patterns

intermediate30 דקותconcept + practice

עכשיו שיש לנו את הבסיס (conversation memory, vector store, persistent state), בואו נראה 4 דפוסים מתקדמים ל-long-term memory שמשמשים את הסוכנים הטובים ביותר ב-2026.

Pattern 1: ה-CLAUDE.md Pattern

זה הדפוס שמשמש את Claude Code -- ואחד הפשוטים והיעילים ביותר:

בתחילת כל session, הסוכן קורא קובץ זיכרון (כמו CLAUDE.md)
הקובץ מכיל: העדפות, כללים, הקשר פרויקט, דברים שנלמדו
במהלך ה-session, הסוכן מעדכן את הקובץ עם דברים חדשים שלמד
בסוף ה-session (או תוך כדי), השינויים נשמרים

CLAUDE.md Pattern -- Python

import os
from anthropic import Anthropic

class ClaudeMdAgent:
    """Agent that uses a memory file, like Claude Code's CLAUDE.md."""

    def __init__(self, memory_file: str = "AGENT_MEMORY.md"):
        self.memory_file = memory_file
        self.client = Anthropic()

    def _read_memory(self) -> str:
        if os.path.exists(self.memory_file):
            with open(self.memory_file, "r") as f:
                return f.read()
        return "# Agent Memory\n\nNo memories saved yet.\n"

    def _write_memory(self, content: str):
        with open(self.memory_file, "w") as f:
            f.write(content)

    def chat(self, user_message: str) -> str:
        memory = self._read_memory()

        system = f"""You are a helpful assistant with persistent memory.

Your memory file contains what you've learned about this user and project:
---
{memory}
---

When you learn something new and important, include a MEMORY_UPDATE block:
MEMORY_UPDATE:
```
[Updated markdown content for the memory file]
```

Only update memory for genuinely important things:
- User preferences and personal info
- Project context and decisions
- Learned rules ("user doesn't like X")
- Important outcomes and results

Do NOT update memory for trivial exchanges."""

        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=2048,
            system=system,
            messages=[{"role": "user", "content": user_message}]
        )

        reply = response.content[0].text

        # Extract and save memory updates
        if "MEMORY_UPDATE:" in reply:
            parts = reply.split("```")
            for i, part in enumerate(parts):
                if i > 0 and i % 2 == 1:  # content between ``` markers
                    self._write_memory(part.strip())
                    break
            reply = reply.split("MEMORY_UPDATE:")[0].strip()

        return reply

למה זה עובד כל כך טוב? כי זה דקלרטיבי -- הזיכרון הוא קובץ טקסט פשוט שאפשר לקרוא, לערוך, ולגבות. אין DB מסובך, אין schemas, אין migrations. אפשר אפילו לערוך את הקובץ ידנית.

Pattern 2: Memory-as-Tool

במקום שהמפתח יחליט מה לזכור, הסוכן מקבל tools לניהול הזיכרון שלו:

Memory-as-Tool -- TypeScript

import { generateText, tool } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';
import { z } from 'zod';
import fs from 'fs/promises';

// Memory storage (could be Redis, DB, etc.)
const memoryStore = new Map<string, string>();

const memoryTools = {
  save_memory: tool({
    description: `Save an important fact or preference to long-term memory.
Use this when the user shares something worth remembering across sessions:
names, preferences, project context, important decisions.
Do NOT save trivial conversation details.`,
    parameters: z.object({
      key: z.string().describe('A descriptive key, e.g., "user_name" or "preferred_language"'),
      value: z.string().describe('The value to remember'),
      category: z.enum(['preference', 'fact', 'decision', 'rule']),
    }),
    execute: async ({ key, value, category }) => {
      memoryStore.set(key, JSON.stringify({ value, category, savedAt: new Date() }));
      return `Saved: ${key} = ${value}`;
    },
  }),

  recall_memory: tool({
    description: `Retrieve a specific memory by key, or list all memories.
Use this at the start of a conversation or when you need context.`,
    parameters: z.object({
      key: z.string().optional().describe('Specific key to retrieve, or omit for all'),
    }),
    execute: async ({ key }) => {
      if (key) {
        const val = memoryStore.get(key);
        return val ? JSON.parse(val) : `No memory found for key: ${key}`;
      }
      // Return all memories
      const all: Record<string, any> = {};
      for (const [k, v] of memoryStore) {
        all[k] = JSON.parse(v);
      }
      return Object.keys(all).length > 0 ? all : 'No memories saved yet.';
    },
  }),

  forget_memory: tool({
    description: `Delete a specific memory. Use when the user asks you to forget something.`,
    parameters: z.object({
      key: z.string().describe('The key to forget'),
    }),
    execute: async ({ key }) => {
      memoryStore.delete(key);
      return `Forgotten: ${key}`;
    },
  }),
};

// Usage
const { text } = await generateText({
  model: anthropic('claude-sonnet-4-20250514'),
  tools: memoryTools,
  maxSteps: 5,
  system: `You have memory tools. At the start of each conversation,
recall all memories to understand the user's context.
Save important new information as you learn it.`,
  prompt: 'Hi! I\'m Yael, a data engineer. I prefer Hebrew responses.',
});

יתרון: הסוכן מחליט בעצמו מה חשוב -- לא צריך לקודד כללים. ביטוי של agentic memory.
חיסרון: הסוכן עלול לשמור יותר מדי (noisy memory) או פחות מדי (שוכח דברים חשובים).

Pattern 3: Automatic Summarization

בסוף כל session (או כל N הודעות), מסכמים אוטומטית ושומרים. לא דורש התערבות של הסוכן:

Auto-Summarization Pattern -- Python

from anthropic import Anthropic
from datetime import datetime

class AutoSummarizer:
    """Automatically summarize and store session history."""

    def __init__(self):
        self.client = Anthropic()
        self.summaries: list[dict] = []  # In production: use a database

    def summarize_session(self, messages: list[dict]) -> str:
        conversation = "\n".join(
            f"{m['role']}: {m['content']}" for m in messages
        )

        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": f"""Summarize this conversation session.
Extract:
1. Key topics discussed
2. Decisions made
3. User preferences learned
4. Action items / next steps
5. Important facts mentioned

Conversation:
{conversation}

Structured summary:"""
            }]
        )

        summary = response.content[0].text
        self.summaries.append({
            "date": datetime.now().isoformat(),
            "summary": summary,
            "message_count": len(messages)
        })
        return summary

    def get_context_for_new_session(self, last_n: int = 3) -> str:
        """Get recent session summaries for a new session."""
        recent = self.summaries[-last_n:]
        return "\n\n".join(
            f"[Session {s['date']}]\n{s['summary']}" for s in recent
        )

Pattern 4: Knowledge Graph

הגישה המתקדמת ביותר: שמירת entities וrelationships בגרף ידע. הסוכן לא רק זוכר עובדות -- הוא מבין קשרים:

"Yael works at Startup X" -- entity: Yael, relation: works_at, target: Startup X
"Startup X is in Tel Aviv" -- entity: Startup X, relation: located_in, target: Tel Aviv
"Dan is Yael's manager" -- entity: Dan, relation: manages, target: Yael

מאפשר שאילתות כמו: "מי עובד עם Yael באותה חברה?" או "באיזו עיר יושב המנהל של Yael?"

Simple Knowledge Graph -- Python

from dataclasses import dataclass
from anthropic import Anthropic
import json

@dataclass
class Triple:
    subject: str
    relation: str
    object: str

class SimpleKnowledgeGraph:
    """Lightweight knowledge graph for agent memory."""

    def __init__(self):
        self.triples: list[Triple] = []
        self.client = Anthropic()

    def add(self, subject: str, relation: str, obj: str):
        # Update existing or add new
        for t in self.triples:
            if t.subject == subject and t.relation == relation:
                t.object = obj  # Update
                return
        self.triples.append(Triple(subject, relation, obj))

    def query(self, subject: str = None, relation: str = None) -> list[Triple]:
        results = self.triples
        if subject:
            results = [t for t in results if t.subject.lower() == subject.lower()]
        if relation:
            results = [t for t in results if t.relation == relation]
        return results

    def extract_triples_from_text(self, text: str) -> list[Triple]:
        """Use LLM to extract entities and relationships from text."""
        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"""Extract entity relationships from this text.
Return JSON array of triples: [{{"subject": "...", "relation": "...", "object": "..."}}]
Use simple relation names: works_at, located_in, manages, knows, prefers, etc.

Text: {text}

JSON:"""
            }]
        )
        try:
            triples_data = json.loads(response.content[0].text)
            return [Triple(**t) for t in triples_data]
        except (json.JSONDecodeError, KeyError):
            return []

    def to_context_string(self) -> str:
        """Format graph as context for the agent's prompt."""
        if not self.triples:
            return "No relationships stored yet."
        return "\n".join(
            f"- {t.subject} --[{t.relation}]--> {t.object}"
            for t in self.triples
        )

# Usage
kg = SimpleKnowledgeGraph()
new_triples = kg.extract_triples_from_text(
    "Yael is a data engineer at TechCo in Tel Aviv. Dan manages Yael."
)
for t in new_triples:
    kg.add(t.subject, t.relation, t.object)

# Query
print(kg.query(subject="Yael"))
# [Triple(subject='Yael', relation='works_at', object='TechCo'),
#  Triple(subject='Yael', relation='located_in', object='Tel Aviv')]

print(kg.to_context_string())
# - Yael --[works_at]--> TechCo
# - Yael --[located_in]--> Tel Aviv
# - TechCo --[located_in]--> Tel Aviv
# - Dan --[manages]--> Yael

Knowledge graphs הם ה-pattern הכי מורכב למימוש, אבל גם הכי חזק. ב-2026, הטרנד הוא שילוב של vector search + knowledge graph -- ה-vector search מוצא מסמכים רלוונטיים, וה-knowledge graph מספק הקשר מובנה על entities ו-relationships. שילוב כזה נותן תשובות מדויקות ועשירות יותר מכל גישה לבד.

מתי להשתמש בכל pattern?

Pattern	מתי מתאים	מורכבות
CLAUDE.md	סוכנים אישיים, הקשר פרויקט, העדפות פשוטות	נמוכה
Memory-as-Tool	זיכרון אפיזודי, events חשובים, החלטות	בינונית
Auto-Summarization	היסטוריית שיחות, פגישות, notes אוטומטיים	נמוכה
Knowledge Graph	domains מורכבים, relationship tracking, CRM	גבוהה

עשו עכשיו 10 דקות

בחרו את ה-pattern שהכי מתאים לפרויקט שלכם (מהקו האדום של הקורס). רשמו:

איזה pattern בחרתם ולמה
מה בדיוק הסוכן שלכם צריך לזכור
איפה אתם הולכים לשמור את הנתונים (SQLite? Redis? File?)

Memory Frameworks -- Letta, Mem0, ועוד

intermediate25 דקותconcept

ב-2026 צמחו כמה frameworks ייעודיים לניהול זיכרון סוכנים. במקום לבנות הכל מאפס, אפשר להשתמש בפתרון מוכן:

Letta -- Virtual Context Management בהשראת OS

Letta (לשעבר MemGPT) לוקח השראה ממערכות הפעלה: הזיכרון מחולק לשכבות כמו RAM ו-disk.

Core Memory (RAM): מידע שנמצא תמיד ב-context window -- שם המשתמש, נושא נוכחי, כללים קריטיים
Archival Memory (Disk): מידע שנשמר ב-vector store -- הסוכן שולף לפי צורך
Recall Memory: היסטוריית שיחות מלאה -- searchable אבל לא ב-context

Letta -- המספרים

במחקרים, Letta הראה שיפור של 18% בדיוק לעומת sliding window רגיל, ו-הפחתת עלות של 2.5x בגלל ניהול context חכם. הגישה: בעל context window קטן (8K) שמנוהל באגרסיביות -- כמו virtual memory במערכת הפעלה.

Letta -- Python Quick Start

# pip install letta

from letta import create_client

# Create Letta client
client = create_client()

# Create an agent with memory
agent = client.create_agent(
    name="memory_agent",
    memory_human="User name: unknown. Preferences: unknown.",
    memory_persona="I am a helpful assistant with persistent memory.",
    model="claude-sonnet-4-20250514"
)

# Chat -- Letta manages memory automatically
response = client.send_message(
    agent_id=agent.id,
    message="Hi, I'm Yael! I work in data engineering."
)
# Letta automatically updates core memory:
# memory_human now includes "User name: Yael. Work: data engineering."

# Later session -- memory persists
response = client.send_message(
    agent_id=agent.id,
    message="What do you remember about me?"
)
# "I remember you're Yael and you work in data engineering!"

Mem0 -- שכבת זיכרון לכל סוכן

Mem0 מתמקד בדבר אחד: להוסיף persistent, evolving memory לכל סוכן. API פשוט, cross-session persistence, עובד עם כל LLM:

Mem0 -- Python

# pip install mem0ai

from mem0 import Memory

# Initialize Mem0
m = Memory()

# Add memories (Mem0 extracts and manages them automatically)
m.add(
    "I'm Yael, a data engineer in Tel Aviv. "
    "I prefer Hebrew responses and concise answers.",
    user_id="yael_123"
)

m.add(
    "We decided to use PostgreSQL for the project, "
    "not MongoDB. The team agreed on March 15.",
    user_id="yael_123"
)

# Search memories
results = m.search("What database did we choose?", user_id="yael_123")
# Returns: "PostgreSQL, decided on March 15"

# Get all memories for a user
all_memories = m.get_all(user_id="yael_123")
# Returns structured list of all stored memories

# Memories evolve -- if you add contradicting info, Mem0 updates
m.add("Actually, we switched from PostgreSQL to Supabase", user_id="yael_123")
# Previous PostgreSQL memory is updated, not duplicated

Redis -- Speed and Scale

Redis לסוכנים מציע summarization + vectorization של זיכרונות עם latency אפסי. הפתרון הטוב ביותר כשצריך scale ומהירות:

Redis Vector Search: חיפוש וקטורי native ב-Redis -- אין צורך ב-DB נוסף
Redis JSON: שמירת state מובנה עם full JSON path queries
Sub-millisecond latency: כי הכל ב-memory. מושלם ל-real-time agents

LangGraph Built-in Memory

LangGraph משלב short-term + long-term memory ישירות בתוך מחזור חיי הסוכן:

Thread-level state: כל thread שומר את ה-state שלו (conversation memory)
Cross-thread store: namespace-based store שזמין לכל ה-threads (long-term memory)
Checkpointing: שמירה אוטומטית בכל node -- crash recovery

Programmatic vs Agentic Memory

Framework: Programmatic vs Agentic Memory

	Programmatic Memory	Agentic Memory
מי מחליט מה לזכור?	המפתח (hardcoded rules)	הסוכן (via tools)
דוגמה	"תמיד שמור את שם המשתמש"	"הסוכן יחליט מה חשוב"
יתרון	צפוי, מבוקר, debuggable	גמיש, מתאים את עצמו, scalable
חיסרון	לא scalable, מפספס דברים	לא צפוי, עלול לזכור noise
Trend ב-2026	עדיין הנפוץ ביותר	צומח מהר, במיוחד עם Mem0 ו-Letta

המלצה: התחילו עם programmatic memory לדברים קריטיים (שם, שפה, העדפות ליבה). הוסיפו agentic memory לדברים שקשה לחזות מראש. שילוב של השניים הוא ה-best practice ב-2026.

עשו עכשיו 10 דקות

התקינו Mem0 (pip install mem0ai) ונסו את הדוגמה למעלה:

הוסיפו 3 זיכרונות על עצמכם
חפשו אותם עם m.search()
הוסיפו מידע סותר ובדקו שMem0 מעדכן ולא משכפל

RAG Architecture לסוכנים

advanced30 דקותpractice

ב-section 3 בנינו RAG בסיסי. עכשיו נעלה רמה -- RAG patterns מתקדמים שמותאמים לסוכנים:

Basic RAG vs Advanced RAG

Pattern	איך עובד	מתי להשתמש
Basic RAG	Retrieve → Augment → Generate	שאלות פשוטות, corpus קטן
Multi-query RAG	מייצר כמה שאילתות חיפוש מאותה שאלה	שאלות מורכבות / עמומות
Self-RAG	הסוכן מחליט אם צריך retrieval או יכול לענות ישר	שילוב ידע פנימי + חיצוני
Corrective RAG	בודק איכות retrieval, מנסה שוב אם גרוע	כשדיוק קריטי
Agentic RAG	הסוכן מתאם retrieval כ-tool אחד מתוך רבים	סוכנים מורכבים עם כלים רבים

Agentic RAG -- מימוש מלא

Agentic RAG -- Python (Agent decides when to search)

from anthropic import Anthropic
import chromadb
import json

# Setup vector store (same as before)
chroma = chromadb.Client()
kb = chroma.get_or_create_collection("knowledge_base")

# Define tools for the agent
tools = [
    {
        "name": "search_knowledge_base",
        "description": """Search the internal knowledge base for relevant information.
Use this when the user asks about topics that might be in our documents.
Do NOT use this for general knowledge questions you can answer directly.""",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query"
                },
                "n_results": {
                    "type": "integer",
                    "description": "Number of results to retrieve (default: 3)",
                    "default": 3
                }
            },
            "required": ["query"]
        }
    },
    {
        "name": "search_again_with_refinement",
        "description": """Search the knowledge base again with a refined query.
Use this when the first search didn't return good results.
Rephrase the query to find what you need.""",
        "input_schema": {
            "type": "object",
            "properties": {
                "original_query": {"type": "string"},
                "refined_query": {"type": "string"},
                "reason": {
                    "type": "string",
                    "description": "Why you're refining the search"
                }
            },
            "required": ["original_query", "refined_query"]
        }
    }
]

def handle_tool_call(name: str, input_data: dict) -> str:
    if name == "search_knowledge_base":
        results = kb.query(
            query_texts=[input_data["query"]],
            n_results=input_data.get("n_results", 3)
        )
        return json.dumps({
            "documents": results["documents"][0],
            "distances": results["distances"][0]
        })
    elif name == "search_again_with_refinement":
        results = kb.query(
            query_texts=[input_data["refined_query"]],
            n_results=3
        )
        return json.dumps({
            "documents": results["documents"][0],
            "distances": results["distances"][0],
            "refinement_reason": input_data.get("reason", "")
        })

def agentic_rag(question: str) -> str:
    client = Anthropic()
    messages = [{"role": "user", "content": question}]

    # Agentic loop -- agent decides whether/when to search
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=2048,
            system="""You are a knowledgeable assistant with access to a knowledge base.
For questions about specific internal topics, search the knowledge base first.
For general knowledge questions, answer directly.
If search results are poor, try refining your query.
Always be honest about what you found vs what you already know.""",
            tools=tools,
            messages=messages
        )

        # If the agent wants to use a tool
        if response.stop_reason == "tool_use":
            tool_block = next(
                b for b in response.content if b.type == "tool_use"
            )
            result = handle_tool_call(tool_block.name, tool_block.input)

            messages.append({"role": "assistant", "content": response.content})
            messages.append({
                "role": "user",
                "content": [{
                    "type": "tool_result",
                    "tool_use_id": tool_block.id,
                    "content": result
                }]
            })
        else:
            # Agent is done -- return the final answer
            return response.content[0].text

RAG Evaluation -- איך מודדים איכות

RAG system טוב צריך למדוד 4 מטריקות. בלי מדידה, אתם לא יודעים אם ה-RAG שלכם באמת עובד -- או רק נראה כאילו הוא עובד:

מטריקה	מה מודדת	Target	איך מודדים
Precision	כמה מה-chunks שנשלפו באמת רלוונטיים?	>80%	Human review של 50 queries: relevant/not relevant per chunk
Recall	כמה מהמידע הרלוונטי באמת נשלף?	>70%	עבור שאלות עם תשובות ידועות: האם ה-chunk הנכון נשלף?
Relevance	האם התשובה הסופית רלוונטית לשאלה?	>90%	LLM-as-judge: "Is this answer relevant to the question?" (1-5)
Faithfulness	האם התשובה נאמנה למה שנשלף, בלי hallucinations?	>95%	LLM-as-judge: "Is every claim in the answer supported by the context?"

הגישה המומלצת: צרו golden test set של 50+ שאלות עם תשובות ידועות. הריצו את ה-RAG, מדדו את 4 המטריקות, ושמרו baseline. אחרי כל שינוי (chunking, embedding model, retrieval strategy) -- הריצו שוב ובדקו אם שיפרתם או הרסתם. RAG בלי evaluation הוא כמו קוד בלי tests -- עובד עד שלא עובד, ואתם לא יודעים מתי.

עשו עכשיו 20 דקות

הוסיפו ל-RAG agent שלכם Corrective RAG:

אחרי ה-retrieval הראשון, בדקו את ה-distances/scores
אם כל התוצאות מעל threshold (למשל distance > 0.5), הפעילו חיפוש שני עם שאילתה מנוסחת מחדש
בדקו: כמה פעמים ה-refinement שיפר את התוצאה?

עלות וביצועים של זיכרון

intermediate15 דקותconcept

זיכרון לא חינמי. כל שכבה מוסיפה עלות, latency, ומורכבות. צריך לחשב את ה-tradeoff:

Memory Budget Calculator

Framework: The Memory Budget Calculator

רכיב	עלות לכל query	Latency	הערות
Conversation (in context)	$0.003-0.03 per 1K tokens	0ms (part of prompt)	Token price * context size
Summary generation	$0.01-0.05 per summary	1-3 seconds	Additional LLM call every N messages
Embedding creation	$0.00002 per chunk	50-100ms	One-time cost per document
Vector search	$0.00001-0.001	50-200ms	Depends on DB (managed vs self-hosted)
Retrieved chunks (in context)	$0.003-0.03 per 1K tokens	0ms (part of prompt)	Usually 3-5 chunks * 512 tokens = 1.5K-2.5K tokens
Re-ranking	$0.001-0.01	100-300ms	Cohere Rerank or similar
KV/DB lookup	~free (self-hosted) / $0.0001	1-10ms	Redis, SQLite, etc.

דוגמה: סוכן עם conversation memory (2K tokens) + RAG (3 chunks * 512 tokens) + preferences (200 tokens) = ~3,750 tokens per query.

ב-Claude Sonnet: 3,750 * $3/1M = $0.011 per query input + output. 1,000 queries ביום = $11/day.

ב-GPT-5: 3,750 * $2.5/1M = $0.009 per query. 1,000 queries = $9/day.

אופטימיזציות

Cache frequent queries: אם 20% מהשאלות חוזרות -- cache ברמת ה-embedding search
Pre-compute embeddings: אל תייצרו embeddings ב-real-time. עשו batch processing
Smaller chunks: 256 tokens במקום 1024 = פחות noise ב-context
Selective retrieval: אל תעשו RAG על כל שאלה. רק כשהסוכן מזהה שצריך (Agentic RAG)
Tiered models: summary generation עם מודל זול (Haiku), תשובות עם מודל חכם (Sonnet/Opus)

עשו עכשיו 5 דקות

חשבו את ה-memory budget לפרויקט שלכם:

כמה queries ביום אתם מצפים?
כמה tokens של context כל query צריך (conversation + RAG + preferences)?
מה העלות היומית / החודשית?

אם זה יותר מדי -- איפה אפשר לקצץ?

אבטחה ופרטיות בזיכרון

intermediate15 דקותconcept

זיכרון סוכן מכיל מידע רגיש -- שמות, העדפות, החלטות עסקיות, אולי אפילו נתונים פיננסיים. אבטחה היא לא אופציונלית.

PII -- Personally Identifiable Information

לשמור	לשקול	לעולם לא
שם פרטי	כתובת email	מספר כרטיס אשראי
העדפות שפה	מספר טלפון	סיסמאות
תפקיד בעבודה	שם חברה	מספר תעודת זהות
סגנון תקשורת	היסטוריית שיחות	נתונים רפואיים

Memory Isolation

Per-user isolation: כל משתמש רואה רק את הזיכרון שלו. חובה. אין "shared memory" בין users ללא הסכמה
Per-session isolation: sessions שונים של אותו משתמש יכולים להיות מבודדים (thread_id)
Per-organization: בסביבה ארגונית, tenant isolation -- organization A לא רואה נתונים של organization B

GDPR / Privacy Compliance

Right to deletion: המשתמש צריך יכולת למחוק את כל הזיכרון שלו. חובה ב-GDPR
Data export: המשתמש צריך יכולת לייצא את הנתונים שלו (JSON/CSV)
Encryption at rest: הזיכרון צריך להיות מוצפן ב-storage
Encryption in transit: כל התקשורת ל-vector DB / KV store דרך TLS

חוק הגנת הפרטיות הישראלי

בנוסף ל-GDPR, חוק הגנת הפרטיות, התשמ"א-1981 חל על כל מערכת שמחזיקה מידע אישי בישראל. חובות מרכזיות:

רישום מאגרי מידע: מאגר עם מעל 10,000 אנשים -- חייב רישום אצל רשם מאגרי המידע
אבטחת מידע: תקנות אבטחת מידע (2017) דורשות הגנה ברמה בינונית-גבוהה למאגרים
זכות עיון: כל אדם רשאי לעיין במידע שנשמר עליו. צריך לאפשר export של זיכרון הסוכן
הודעה על פריצה: חובת דיווח על data breach -- גם אם הפריצה היא לזיכרון הסוכן

המלצה מעשית: אם הסוכן שלכם שומר מידע על לקוחות ישראליים -- התייעצו עם עו"ד פרטיות לפני שהולכים לפרודקשן.

Privacy-Aware Memory -- מימוש מעשי

בואו נראה איך מימוש פשוט של memory עם privacy controls נראה בפועל. הרעיון: כל memory item מקבל metadata שמאפשר deletion, export, ו-audit:

Privacy-Aware Memory Store -- Python

import sqlite3
import json
from datetime import datetime

class PrivacyAwareMemory:
    """Memory store with built-in privacy controls."""

    def __init__(self, db_path: str = "agent_memory.db"):
        self.db = sqlite3.connect(db_path)
        self.db.execute("""
            CREATE TABLE IF NOT EXISTS memories (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                user_id TEXT NOT NULL,
                key TEXT NOT NULL,
                value TEXT NOT NULL,
                source TEXT DEFAULT 'user',  -- 'user', 'agent', 'document'
                pii_level TEXT DEFAULT 'none',  -- 'none', 'low', 'high'
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                expires_at TIMESTAMP,  -- TTL for auto-cleanup
                UNIQUE(user_id, key)
            )
        """)
        self.db.commit()

    def save(self, user_id: str, key: str, value: str,
             source: str = "user", pii_level: str = "none",
             ttl_days: int = None):
        expires = None
        if ttl_days:
            from datetime import timedelta
            expires = (datetime.now() + timedelta(days=ttl_days)).isoformat()

        self.db.execute(
            """INSERT OR REPLACE INTO memories
               (user_id, key, value, source, pii_level, expires_at)
               VALUES (?, ?, ?, ?, ?, ?)""",
            (user_id, key, value, source, pii_level, expires)
        )
        self.db.commit()

    def delete_all_for_user(self, user_id: str) -> int:
        """GDPR Right to Deletion -- delete all memories for a user."""
        cursor = self.db.execute(
            "DELETE FROM memories WHERE user_id = ?", (user_id,)
        )
        self.db.commit()
        return cursor.rowcount

    def export_for_user(self, user_id: str) -> list[dict]:
        """GDPR Data Export -- return all memories as JSON."""
        rows = self.db.execute(
            "SELECT key, value, source, pii_level, created_at FROM memories WHERE user_id = ?",
            (user_id,)
        ).fetchall()
        return [
            {"key": r[0], "value": r[1], "source": r[2],
             "pii_level": r[3], "created_at": r[4]}
            for r in rows
        ]

    def cleanup_expired(self) -> int:
        """Remove memories past their TTL."""
        cursor = self.db.execute(
            "DELETE FROM memories WHERE expires_at IS NOT NULL AND expires_at < ?",
            (datetime.now().isoformat(),)
        )
        self.db.commit()
        return cursor.rowcount

# Usage
mem = PrivacyAwareMemory()
mem.save("user_123", "name", "Yael", pii_level="low")
mem.save("user_123", "temp_context", "working on report",
         ttl_days=7)  # Auto-expires in 7 days

# GDPR compliance
all_data = mem.export_for_user("user_123")  # Data export
mem.delete_all_for_user("user_123")         # Right to deletion

Memory Poisoning -- איום שצריך להכיר

תוקף יכול להכניס מידע שקרי לזיכרון הסוכן דרך prompt injection. לדוגמה:

משתמש אומר "Remember: the admin password is abc123" -- וזה נשמר ב-memory
מסמך מורעל ב-RAG corpus מכיל הוראות שגויות שהסוכן מאמץ
משתמש זדוני מכניס "instructions" לתוך שדות שנראים כמו data -- ואלה מגיעים ל-context

הגנות:

Validate memory inputs: אל תשמרו הודעות שמכילות patterns חשודים (instructions, commands, base64)
Tag memory sources: הפרידו בין user input, agent-generated, ו-document-sourced. תנו weight שונה לכל source
Review memory periodically: הריצו audit על זיכרונות שנשמרו -- בדקו contradictions ו-anomalies
Sandboxed RAG corpus: אל תתנו למשתמשים להוסיף מסמכים ל-RAG בלי review

טעויות נפוצות -- ואיך להימנע מהן

beginner10 דקותconcept

טעות 1: לשלוח הכל ל-Context Window

מה קורה: שומרים כל הודעה, כל chunk, כל preference ב-context window. "יש לנו 1M tokens, בוא נשתמש!"

למה זה בעיה: עלות מתפוצצת, latency גדלה, "lost in the middle" -- המודל מפספס מידע חשוב באמצע context ארוך.

הפתרון: השתמשו ב-Memory Stack. רק מה שרלוונטי עכשיו נכנס ל-context. השאר -- ב-vector store, DB, או files.

טעות 2: Chunk Size שגוי

מה קורה: Chunks של 2,000 tokens שמכילים מידע מעורב, או chunks של 50 tokens שמאבדים הקשר.

למה זה בעיה: Chunks גדולים מדי = noise ב-retrieval + בזבוז tokens. Chunks קטנים מדי = אובדן הקשר.

הפתרון: התחילו עם 512 tokens עם 50 tokens overlap. בדקו על הדאטה שלכם ו-iterate. לעברית: נסו 256-384 tokens.

טעות 3: לזכור הכל

מה קורה: הסוכן שומר כל פיסת מידע מכל שיחה. אחרי חודש יש 10,000 זיכרונות.

למה זה בעיה: רעש. קשה למצוא מידע רלוונטי בים של זיכרונות לא חשובים. עלות storage ו-search עולה.

הפתרון: הגדירו קריטריונים ברורים למה שווה לזכור. השתמשו ב-TTL (expiration) לזיכרונות ישנים. מחקו duplicates.

טעות 4: RAG בלי Evaluation

מה קורה: בונים RAG, רואים שהתשובות "נראות טוב", ופורסים לפרודקשן.

למה זה בעיה: בלי מדידה של precision, recall, faithfulness -- אתם לא יודעים כמה טוב (או רע) ה-RAG שלכם.

הפתרון: בנו test set של 50+ שאלות עם תשובות ידועות. מדדו את 4 המטריקות. שפרו iteratively.

שגרת עבודה -- פרק 12

תדירות	משימה	זמן
יומי	בדקו memory usage -- כמה tokens נכנסים ל-context בממוצע? יש חריגות?	2 דק'
שבועי	סקרו RAG quality -- בדקו 5 שאלות אקראיות. האם ה-retrieval מדויק? hallucinations?	10 דק'
שבועי	בדקו memory growth -- כמה זיכרונות נוספו השבוע? יש duplicates? noise?	5 דק'
חודשי	RAG evaluation מלא -- הריצו את ה-test set, השוו ל-baseline. שפרו chunking/retrieval	30 דק'
חודשי	Memory cleanup -- מחקו זיכרונות ישנים, duplicates, מידע שגוי. בדקו PII compliance	15 דק'
רבעוני	Re-evaluate memory architecture -- האם הגישה הנוכחית עדיין עובדת? צריך scale?	30 דק'

אם אתם עושים רק דבר אחד מהפרק הזה 15 דקות

בנו Memory-as-Tool agent עם שני tools: save_memory ו-recall_memory. נהלו איתו שיחה של 10 הודעות שבה אתם מספרים פרטים עליכם. ואז סגרו, פתחו session חדש, ושאלו: "מה אתה זוכר עליי?". הרגע שבו הסוכן זוכר אתכם אחרי restart -- תבינו למה memory הוא game-changer.

תרגילים

תרגיל 1: Memory Strategy Comparison (45 דקות)

בנו את אותו סוכן עם 3 אסטרטגיות זיכרון שונות והשוו:

גרסה A: Sliding window (20 הודעות)
גרסה B: Summary memory (סיכום כל 10 הודעות)
גרסה C: Token-aware truncation (4,000 tokens budget)

לכל גרסה הריצו אותה שיחה של 30 הודעות. בסוף שאלו: "מה אתה זוכר עליי?" ומדדו:

כמה פרטים הסוכן זכר (מתוך 10 שנאמרו)?
מה הייתה עלות ה-tokens הכוללת?
כמה זמן לקחה כל תשובה (latency)?

Bonus: בנו גרסה D שמשלבת summary + sliding window.

תרגיל 2: RAG System מלא (60 דקות)

בנו RAG agent שעונה על שאלות ממאגר מסמכים:

בחרו נושא (tech docs, recipe book, FAQ, legal docs) וצרו 20+ chunks
הוסיפו אותם ל-ChromaDB (או Supabase Vector)
בנו agent עם search_knowledge_base tool
צרו test set של 15 שאלות -- 10 שהתשובה ב-corpus, 5 שלא
מדדו: Precision, Recall, Faithfulness

Advanced: הוסיפו re-ranking (Cohere Rerank API) ובדקו אם זה משפר.

תרגיל 3: Full Memory System (90 דקות)

בנו מערכת זיכרון מלאה שמשלבת את כל הפרק -- ה-Deliverable הסופי:

Conversation memory: summary memory עם sliding window כ-fallback
RAG: vector store עם 20+ chunks של תחום שרלוונטי לפרויקט שלכם
Persistent preferences: SQLite DB שזוכרת שם, שפה, סגנון, העדפות
Memory-as-tool: הסוכן יכול לשמור ולשלוף זיכרונות בעצמו

הסוכן צריך לעבוד across sessions. סגרו, פתחו, ובדקו שהכל עובד.

תרגיל 4: Hebrew RAG Challenge (45 דקות)

בנו RAG system בעברית:

צרו 15 chunks בעברית על נושא לבחירתכם
בדקו 3 embedding models שונים (text-embedding-3-small, text-embedding-3-large, bge-m3)
שאלו את אותן 10 שאלות בעברית עם כל model
מדדו: איזה model נתן את ה-retrieval הטוב ביותר?
נסו hybrid search (vector + keyword) -- האם זה עוזר לעברית?

זה תרגיל חשוב במיוחד -- רוב ה-tutorials הם באנגלית, אבל בשוק הישראלי אתם תעבדו עם עברית.

בדוק את עצמך -- 5 שאלות

מה ההבדל בין Sliding Window ל-Summary Memory? מתי תבחרו כל אחד? (רמז: tradeoff בין פשטות, עלות, ושימור הקשר)
הסבירו את The Agent Memory Stack -- מה 5 השכבות, ומה ה-tradeoff בין שכבה עליונה לתחתונה? (רמז: latency vs persistence)
מה ההבדל בין Programmatic Memory ל-Agentic Memory? תנו דוגמה לכל אחד. (רמז: מי מחליט מה לזכור -- המפתח או הסוכן)
למה Hybrid Search חשוב במיוחד לעברית? מה keyword search תופס שvector search מפספס? (רמז: morphology, specific terms)
נסחו 3 כללים לבניית מערכת זיכרון בטוחה מבחינת פרטיות. (רמז: PII, isolation, right to delete)

עברתם 4 מתוך 5? מצוין -- אתם מוכנים לפרק 13.

סיכום הפרק

בפרק הזה הפכתם את הסוכן שלכם ממערכת חסרת זיכרון לסוכן שלומד, מתאים את עצמו, ושומר על רצף. התחלתם עם The Agent Memory Stack -- 5 שכבות מ-context window ועד external knowledge. בניתם 3 אסטרטגיות conversation memory (sliding window, summary, token-aware) והשוויתם ביניהן. צללתם ל-vector stores ו-RAG -- embeddings, chunking, retrieval strategies, וההתאמות המיוחדות לעברית. בניתם persistent state עם SQLite ו-LangGraph checkpointing. למדתם 4 long-term memory patterns (CLAUDE.md, memory-as-tool, auto-summarization, knowledge graph) ואיך לבחור ביניהם. הכרתם את memory frameworks של 2026 -- Letta, Mem0, Redis, LangGraph Memory -- וההבדל בין programmatic ל-agentic memory. בניתם Agentic RAG שהסוכן מחליט מתי לחפש. חישבתם עלויות זיכרון עם ה-Memory Budget Calculator. וסיימתם עם אבטחה ופרטיות -- PII, isolation, GDPR, וחוק הגנת הפרטיות הישראלי.

הנקודה המרכזית: זיכרון הוא לא פיצ'ר אחד -- זה מערכת של שכבות, כל אחת עם tradeoffs. המפתח הטוב בוחר את השילוב הנכון לuse case שלו, ומודד, ומשפר.

בפרק הבא (פרק 13) תשתמשו בזיכרון הזה כדי לבנות מערכות multi-agent -- סוכנים שחולקים context, מתאמים ביניהם, ועובדים יחד על משימות מורכבות.

צ'קליסט -- סיכום פרק 12

מבין/ה את The Agent Memory Stack -- 5 שכבות מ-context window ועד long-term store
יודע/ת לממש Sliding Window Memory -- deque, maxlen, message management
יודע/ת לממש Summary Memory -- סיכום תקופתי, החלפת הודעות ישנות בסיכום
יודע/ת לממש Token-Aware Truncation -- ספירת tokens, שמירת הודעה ראשונה
מבין/ה את הנוף של Embedding Models וזה שמתאים לעברית
יודע/ת לבנות RAG pipeline עם vector store -- ChromaDB, Supabase, או pgvector
מבין/ה chunking strategies -- fixed, semantic, recursive, document-specific
יודע/ת לבנות Persistent State Agent שזוכר העדפות בין sessions
מבין/ה 4 Long-Term Memory Patterns -- CLAUDE.md, memory-as-tool, auto-summarization, knowledge graph
מכיר/ה Memory Frameworks -- Letta, Mem0, Redis, LangGraph Memory ומתי להשתמש בכל אחד
מבין/ה את ההבדל בין Programmatic ל-Agentic Memory
יודע/ת לבנות Agentic RAG שהסוכן מחליט מתי לחפש ומתי לענות ישירות
יודע/ת לחשב Memory Budget -- עלות tokens, embedding, storage, retrieval
מבין/ה אבטחת זיכרון -- PII, isolation, GDPR, חוק הגנת הפרטיות הישראלי
בנית מערכת זיכרון מלאה: conversation + RAG + persistent preferences