# Round 20 — Floor X · The Self-Healing Composer

**Date**: 2026-04-30
**Status**: ✅ COMPLETE
**Round Type**: Tower Special Floor (the secret floor)

---

## 🎯 The Mystery Floor

Floor X **does not appear in the public tower diagram**. It is an internal, secret floor that handles **self-healing**: when the pipeline fails midway, Floor X catches the failure and repairs it.

It is the **immune system** of the tower.

---

## 🔧 Floor X's Role

Three main tasks:

### 1. Validation Failure Recovery
When the LLM returns an invalid A2UI message, Floor X steps in:
- Re-prompt with the specific error
- Fall back to a simpler component
- If 3 attempts fail → degraded mode (text-only)

### 2. Stuck Pipeline Detection
When a station does not respond within its timeout (e.g. J11 FactCheck > 30s):
- Skip + flag as `degraded`
- Continue pipeline
- Mark article as `needs_human_review`

### 3. Cost Spike Mitigation
When an article uses more than $0.10 in tokens (5x normal):
- Throttle to brief writer (J16) only
- Truncate context to top-5 cards
- Notify admin
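
The three mitigation steps above can be sketched as a pure decision function. This is a minimal illustration, not the project's implementation: the `Card` type, its `relevance` score, and the `economy_plan` return shape are hypothetical. The $0.02 baseline is only what "5x normal" implies for a $0.10 threshold.

```python
from dataclasses import dataclass

SPIKE_THRESHOLD_USD = 0.10                 # threshold stated above
NORMAL_COST_USD = SPIKE_THRESHOLD_USD / 5  # implied baseline, about $0.02
TOP_K_CARDS = 5                            # "top-5 cards"


@dataclass
class Card:
    """Hypothetical context card; higher relevance means more useful."""
    card_id: str
    relevance: float


def is_cost_spike(cost_usd: float) -> bool:
    """A spike is any cost strictly above the threshold."""
    return cost_usd > SPIKE_THRESHOLD_USD


def truncate_context(cards: list[Card], k: int = TOP_K_CARDS) -> list[Card]:
    """Keep only the k most relevant cards."""
    return sorted(cards, key=lambda c: c.relevance, reverse=True)[:k]


def economy_plan(cost_usd: float, cards: list[Card]) -> dict:
    """Decide how the rest of the flow should run, given the cost so far."""
    if not is_cost_spike(cost_usd):
        return {"mode": "normal", "writer": None,
                "cards": cards, "notify_admin": False}
    return {"mode": "economy", "writer": "J16",  # brief writer only
            "cards": truncate_context(cards), "notify_admin": True}
```

Keeping the decision free of I/O makes the threshold and the truncation rule easy to unit-test separately from the event stream.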

---

## 🏗️ Implementation

### A2UI Self-Correction Loop
Source: A2UI v0.9 spec, `VALIDATION_FAILED` event.

```python
import json


async def floor_x_validate_and_fix(a2ui_message: dict, retries: int = 3):
    """Validate an A2UI message against its catalog; on failure, re-prompt the LLM.

    load_catalog, validate_against_catalog, gemini and generate_fallback_message
    are assumed to be defined elsewhere in the project.
    """
    catalog = await load_catalog(a2ui_message["catalogId"])
    valid, error = validate_against_catalog(a2ui_message, catalog)

    for _ in range(retries):
        if valid:
            return a2ui_message

        # Send the VALIDATION_FAILED details back to the LLM
        correction_prompt = f"""
        Your previous A2UI message had a validation error:
        - Error: {error.message}
        - Path: {error.path}
        - Expected: {error.expected}
        - Got: {error.got}

        Original message:
        {json.dumps(a2ui_message, indent=2)}

        Please fix and re-emit only the corrected message.
        """

        a2ui_message = await gemini.regenerate(correction_prompt)
        # Re-validate so the last regenerated message is never discarded unchecked
        valid, error = validate_against_catalog(a2ui_message, catalog)

    if valid:
        return a2ui_message

    # All retries failed → fallback (degraded, text-only)
    return generate_fallback_message(a2ui_message)
```
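
The loop above relies on `validate_against_catalog` returning a `(valid, error)` pair whose error exposes `message`, `path`, `expected` and `got`. The document does not show that helper, so the following is only a sketch of one possible shape. It assumes a hypothetical message layout (`{"components": [{"type": ...}, ...]}`) and a hypothetical catalog layout (`{"components": {type_name: [required_props]}}`); the real A2UI catalog format may differ.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ValidationError:
    """Carries the four fields the correction prompt interpolates."""
    message: str
    path: str
    expected: Any
    got: Any


def validate_against_catalog(
    message: dict, catalog: dict
) -> tuple[bool, Optional[ValidationError]]:
    """Return (True, None) if valid, else (False, first error found)."""
    components = message.get("components")
    if not isinstance(components, list):
        return False, ValidationError(
            "missing or non-list 'components'", "components",
            "list", type(components).__name__)

    allowed = catalog["components"]  # hypothetical: {type: [required props]}
    for i, comp in enumerate(components):
        if not isinstance(comp, dict):
            return False, ValidationError(
                "component is not an object", f"components[{i}]",
                "object", type(comp).__name__)
        ctype = comp.get("type")
        if ctype not in allowed:
            # Report the allowed list so the re-prompt can include it
            return False, ValidationError(
                "unknown component", f"components[{i}].type",
                sorted(allowed), ctype)
        for prop in allowed[ctype]:
            if prop not in comp:
                return False, ValidationError(
                    "missing required property", f"components[{i}].{prop}",
                    prop, None)
    return True, None
```

Returning only the first error keeps the correction prompt short; a variant that collects all errors would trade prompt length for fewer round trips.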

### Stuck Pipeline Detection
```python
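import asyncio
import logging

log = logging.getLogger(__name__)


async def watchdog_pipeline(flow_id: int, timeout_per_station: int = 10):
    """Monitor pipeline progress. If a station is stuck, skip it and flag the article."""
    skipped = set()  # stations already skipped, so each is handled only once
    while True:
        await asyncio.sleep(2)

        progress = await get_pipeline_progress(flow_id)
        if progress.completed:
            return

        # Assumes station_states lists only stations that are still running
        for station_id, last_event in progress.station_states.items():
            if station_id in skipped:
                continue
            # total_seconds() covers the whole interval; .seconds drops whole days
            elapsed = (now() - last_event.timestamp).total_seconds()
            if elapsed > timeout_per_station:
                log.warning(f"Station {station_id} stuck for {elapsed:.0f}s; skipping")
                await skip_station(flow_id, station_id, reason="timeout")
                await flag_article(flow_id, "degraded_pipeline")
                skipped.add(station_id)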
```

### Cost Spike Mitigation
```python
async def cost_guardian(flow_id: int):
    """Track per-flow LLM cost. Switch to economy mode and alert on a spike."""
    cost_so_far = 0.0
    threshold = 0.10  # USD

    async for event in subscribe_flow_events(flow_id):
        if event.type == "llm_call_complete":
            cost_so_far += event.data["cost"]

            if cost_so_far > threshold:
                # Switch to economy mode
                await emit_event("force_economy_mode", flow_id=flow_id)
                await alert_admin(
                    f"Flow {flow_id} exceeded ${threshold:.2f} budget "
                    f"(spent ${cost_so_far:.3f})"
                )
                break
```

---

## 🛡️ Failure Modes Handled

| Failure | Detection | Response |
|---|---|---|
| LLM returns invalid JSON | JSON parse error | Re-prompt with `VALIDATION_FAILED` |
| LLM uses unknown component | Catalog validation | Re-prompt with allowed list |
| LLM timeout (>30s) | asyncio timeout | Fallback to brief writer |
| pgvector query timeout | DB timeout | Skip context, write warning |
| FactCheck stuck | Watchdog 10s | Skip + flag |
| Cost spike | Cost guardian | Switch to economy |
| Network failure (ElevenLabs) | aiohttp error | Skip multimedia, continue |
| Token rate limit exceeded | API 429 | Backoff + retry / fallback model |
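
The "Backoff + retry / fallback model" response in the last row could look roughly like the sketch below. It reads the row as "retry the primary model with exponential backoff, then switch to a fallback model", which is one interpretation rather than something the table spells out. `RateLimitError`, `call_with_backoff` and the `call` signature are hypothetical names introduced for illustration.

```python
import asyncio
import random
from typing import Awaitable, Callable


class RateLimitError(Exception):
    """Hypothetical exception an LLM client raises on HTTP 429."""


async def call_with_backoff(
    call: Callable[[str], Awaitable[str]],
    primary_model: str,
    fallback_model: str,
    max_retries: int = 3,
    base_delay: float = 0.5,
) -> str:
    """Try the primary model with exponential backoff, then fall back."""
    for attempt in range(max_retries):
        try:
            return await call(primary_model)
        except RateLimitError:
            if attempt == max_retries - 1:
                break  # no point sleeping before switching models
            # Exponential backoff plus jitter to avoid synchronized retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)
    # Primary model stayed rate-limited: use the fallback model once
    return await call(fallback_model)
```

Any exception other than the rate-limit one propagates unchanged, so the other rows of the table keep their own handlers.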

---

## 🔍 Floor X Observability

Floor X emits dedicated events:
```json
{"event":"floor_x_correction","data":{"original_error":"...","attempts":2,"resolved":true}}
{"event":"floor_x_skip","data":{"station":"J11","reason":"timeout","elapsed_s":12}}
{"event":"floor_x_economy","data":{"reason":"cost_spike","cost_so_far":0.105}}
```
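
A small helper can keep these events uniform. The sketch below is an assumption about how they might be serialized, not the project's emitter; `floor_x_event` and `FLOOR_X_EVENTS` are hypothetical names. It produces the same compact one-line JSON shape as the samples above.

```python
import json

# Event names taken from the samples above
FLOOR_X_EVENTS = {"floor_x_correction", "floor_x_skip", "floor_x_economy"}


def floor_x_event(event: str, **data) -> str:
    """Serialize a Floor X event as one compact JSON line."""
    if event not in FLOOR_X_EVENTS:
        raise ValueError(f"unknown Floor X event: {event}")
    # Compact separators match the one-line format of the samples
    return json.dumps({"event": event, "data": data}, separators=(",", ":"))


line = floor_x_event("floor_x_skip", station="J11", reason="timeout", elapsed_s=12)
print(line)
# {"event":"floor_x_skip","data":{"station":"J11","reason":"timeout","elapsed_s":12}}
```

Rejecting unknown event names at the source keeps typos out of the dashboard feed.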

The `Lobby Dashboard` (Round 46) displays Floor X events in real time, so the admin can watch the **immune system** at work.

---

## 💰 Cost of Floor X

Additional spend on self-healing:
- Re-prompts (~$0.0001 per correction)
- Watchdog (no cost; async)
- Economy fallback (cheaper, saves money)

**Net effect**: -10% costs (because mitigation prevents spikes), +5% latency (avg).

---

## 🎯 Decisions

### 1. Why Floor X rather than building this into every station?

**Separation of concerns**. Each station does its own job. Floor X handles **meta-failures** (timeouts, validation, cost). It is a cross-cutting layer.

### 2. Why is it "secret"?

It is not really a secret. It is an **infrastructure layer**, not a product. The tenant never sees Floor X; they only see that the pipeline keeps working even when the LLM falls apart.

### 3. What if 3 retries fail?

Degraded mode: a text-only article (no animations, no audio, simplified layout). **Something always ships**. Azoulay does not want to see "Server Error 500" in the newsroom.
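
A degraded-mode fallback along these lines might strip a message down to plain text. This is a sketch under assumptions: the message layout (`components` as a list of objects with an optional `text` field), the `degraded` flag, and the placeholder sentence are all hypothetical, and the project's `generate_fallback_message` may behave differently.

```python
PLACEHOLDER = "Content is temporarily unavailable."  # hypothetical wording


def generate_fallback_message(message) -> dict:
    """Reduce any A2UI-like message to text-only components."""
    if not isinstance(message, dict):
        message = {}  # the LLM output may not even be an object
    components = message.get("components")
    if not isinstance(components, list):
        components = []

    texts = []
    for comp in components:
        # Keep only non-empty text; drop animations, audio, layout components
        text = comp.get("text") if isinstance(comp, dict) else None
        if isinstance(text, str) and text.strip():
            texts.append(text.strip())

    if not texts:
        texts = [PLACEHOLDER]  # something always ships

    return {
        "catalogId": message.get("catalogId"),
        "degraded": True,
        "components": [{"type": "Text", "text": t} for t in texts],
    }
```

Because the function accepts any input and always returns a well-formed message, it can serve as the last step of the retry loop without its own error handling.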

---

## ✅ Closure
✅ **Round 20 closed. Floor X (the immune system) documented.**
