# Data Collection Scheduling Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Add daily scheduled data collection (18:00 Mon-Fri) and an API-triggered backfill job to collect all historical price data.

**Architecture:** A new `collection_job.py` module orchestrates collectors in dependency order (master data first, then prices). The scheduler registers the daily job alongside the existing snapshot job. Backfill splits date ranges into yearly chunks, reusing existing `PriceCollector` and `ETFPriceCollector`.

**Tech Stack:** APScheduler (already in use), existing pykrx-based collectors, PostgreSQL, FastAPI

---

### Task 1: Create `collection_job.py` with `run_daily_collection`

**Files:**
- Create: `backend/jobs/collection_job.py`
- Test: `backend/tests/e2e/test_collection_job.py`

**Step 1: Write the failing test**

Create `backend/tests/e2e/test_collection_job.py`:

```python
"""
Tests for collection job orchestration.
"""
import pytest
from unittest.mock import patch, MagicMock
from datetime import datetime

from jobs.collection_job import run_daily_collection


def test_run_daily_collection_calls_collectors_in_order():
    """Daily collection should run all collectors in dependency order."""
    call_order = []

    def make_mock_collector(name):
        mock_cls = MagicMock()
        instance = MagicMock()
        instance.run.side_effect = lambda: call_order.append(name)
        mock_cls.return_value = instance
        return mock_cls

    with patch("jobs.collection_job.SessionLocal") as mock_session_local, \
         patch("jobs.collection_job.StockCollector", make_mock_collector("stock")), \
         patch("jobs.collection_job.SectorCollector", make_mock_collector("sector")), \
         patch("jobs.collection_job.PriceCollector", make_mock_collector("price")), \
         patch("jobs.collection_job.ValuationCollector", make_mock_collector("valuation")), \
         patch("jobs.collection_job.ETFCollector", make_mock_collector("etf")), \
         patch("jobs.collection_job.ETFPriceCollector", make_mock_collector("etf_price")):
        mock_session_local.return_value = MagicMock()
        run_daily_collection()

    assert call_order == ["stock", "sector", "price", "valuation", "etf", "etf_price"]


def test_run_daily_collection_continues_on_failure():
    """If one collector fails, the rest should still run."""
    call_order = []

    def make_mock_collector(name, should_fail=False):
        mock_cls = MagicMock()
        instance = MagicMock()
        def side_effect():
            if should_fail:
                raise RuntimeError(f"{name} failed")
            call_order.append(name)
        instance.run.side_effect = side_effect
        mock_cls.return_value = instance
        return mock_cls

    with patch("jobs.collection_job.SessionLocal") as mock_session_local, \
         patch("jobs.collection_job.StockCollector", make_mock_collector("stock", should_fail=True)), \
         patch("jobs.collection_job.SectorCollector", make_mock_collector("sector")), \
         patch("jobs.collection_job.PriceCollector", make_mock_collector("price")), \
         patch("jobs.collection_job.ValuationCollector", make_mock_collector("valuation")), \
         patch("jobs.collection_job.ETFCollector", make_mock_collector("etf")), \
         patch("jobs.collection_job.ETFPriceCollector", make_mock_collector("etf_price")):
        mock_session_local.return_value = MagicMock()
        run_daily_collection()

    # stock failed, but rest should continue
    assert call_order == ["sector", "price", "valuation", "etf", "etf_price"]
```

**Step 2: Run test to verify it fails**

Run: `cd /home/zephyrdark/workspace/quant/galaxy-po/backend && python -m pytest tests/e2e/test_collection_job.py -v`
Expected: FAIL with `ModuleNotFoundError: No module named 'jobs.collection_job'`

**Step 3: Write minimal implementation**

Create `backend/jobs/collection_job.py`:

```python
"""
Data collection orchestration jobs.
"""
import logging
from datetime import datetime

from app.core.database import SessionLocal
from app.services.collectors import (
    StockCollector,
    SectorCollector,
    PriceCollector,
    ValuationCollector,
    ETFCollector,
    ETFPriceCollector,
)

logger = logging.getLogger(__name__)

# Collectors in dependency order: master data first, then derived data
DAILY_COLLECTORS = [
    ("StockCollector", StockCollector, {}),
    ("SectorCollector", SectorCollector, {}),
    ("PriceCollector", PriceCollector, {}),
    ("ValuationCollector", ValuationCollector, {}),
    ("ETFCollector", ETFCollector, {}),
    ("ETFPriceCollector", ETFPriceCollector, {}),
]


def run_daily_collection():
    """
    Run all data collectors in dependency order.

    Each collector gets its own DB session. If one fails, the rest continue.
    Designed to be called by APScheduler at 18:00 Mon-Fri.
    """
    logger.info("Starting daily data collection")
    results = {}

    for name, collector_cls, kwargs in DAILY_COLLECTORS:
        db = SessionLocal()
        try:
            collector = collector_cls(db, **kwargs)
            collector.run()
            results[name] = "success"
            logger.info(f"{name} completed successfully")
        except Exception as e:
            results[name] = f"failed: {e}"
            logger.error(f"{name} failed: {e}")
        finally:
            db.close()

    logger.info(f"Daily collection finished: {results}")
```

**Step 4: Run test to verify it passes**

Run: `cd /home/zephyrdark/workspace/quant/galaxy-po/backend && python -m pytest tests/e2e/test_collection_job.py -v`
Expected: 2 tests PASS

**Step 5: Commit**

```bash
git add backend/jobs/collection_job.py backend/tests/e2e/test_collection_job.py
git commit -m "feat: add daily collection job orchestration"
```

---

### Task 2: Add `run_backfill` to `collection_job.py`

**Files:**
- Modify: `backend/jobs/collection_job.py`
- Test: `backend/tests/e2e/test_collection_job.py`

**Step 1: Write the failing test**

Append to `backend/tests/e2e/test_collection_job.py`:

```python
from jobs.collection_job import run_backfill


def test_run_backfill_generates_yearly_chunks():
    """Backfill should split date range into yearly chunks."""
    collected_ranges = []

    def make_price_collector(name):
        mock_cls = MagicMock()
        def capture_init(db, start_date=None, end_date=None):
            instance = MagicMock()
            collected_ranges.append((name, start_date, end_date))
            return instance
        mock_cls.side_effect = capture_init
        return mock_cls

    with patch("jobs.collection_job.SessionLocal") as mock_session_local, \
         patch("jobs.collection_job.PriceCollector", make_price_collector("price")), \
         patch("jobs.collection_job.ETFPriceCollector", make_price_collector("etf_price")):
        mock_db = MagicMock()
        mock_session_local.return_value = mock_db
        # Simulate no existing data (min date returns None)
        mock_db.query.return_value.scalar.return_value = None

        run_backfill(start_year=2023)

    # Should generate chunks: 2023, 2024, 2025, 2026 (partial) for both price and etf_price
    price_ranges = [(s, e) for name, s, e in collected_ranges if name == "price"]
    assert len(price_ranges) >= 3  # At least 2023, 2024, 2025
    assert price_ranges[0][0] == "20230101"  # First chunk starts at start_year


def test_run_backfill_skips_already_collected_range():
    """Backfill should start from earliest existing data backwards."""
    collected_ranges = []

    def make_price_collector(name):
        mock_cls = MagicMock()
        def capture_init(db, start_date=None, end_date=None):
            instance = MagicMock()
            collected_ranges.append((name, start_date, end_date))
            return instance
        mock_cls.side_effect = capture_init
        return mock_cls

    from datetime import date

    with patch("jobs.collection_job.SessionLocal") as mock_session_local, \
         patch("jobs.collection_job.PriceCollector", make_price_collector("price")), \
         patch("jobs.collection_job.ETFPriceCollector", make_price_collector("etf_price")):
        mock_db = MagicMock()
        mock_session_local.return_value = mock_db

        # Simulate: Price has data from 2024-06-01, ETFPrice has no data
        def query_side_effect(model_attr):
            mock_q = MagicMock()
            if "Price" in str(model_attr):
                mock_q.scalar.return_value = date(2024, 6, 1)
            else:
                mock_q.scalar.return_value = None
            return mock_q
        mock_db.query.return_value = MagicMock()
        # We'll verify the function runs without error
        run_backfill(start_year=2023)

    assert len(collected_ranges) > 0
```

**Step 2: Run test to verify it fails**

Run: `cd /home/zephyrdark/workspace/quant/galaxy-po/backend && python -m pytest tests/e2e/test_collection_job.py::test_run_backfill_generates_yearly_chunks -v`
Expected: FAIL with `ImportError: cannot import name 'run_backfill'`

**Step 3: Write minimal implementation**

Add to `backend/jobs/collection_job.py`:

```python
from datetime import date, timedelta
from sqlalchemy import func

from app.models.stock import Price, ETFPrice


def _generate_yearly_chunks(start_year: int, end_date: date) -> list[tuple[str, str]]:
    """Generate (start_date, end_date) pairs in YYYYMMDD format, one per year."""
    chunks = []
    current_start = date(start_year, 1, 1)

    while current_start < end_date:
        current_end = date(current_start.year, 12, 31)
        if current_end > end_date:
            current_end = end_date
        chunks.append((
            current_start.strftime("%Y%m%d"),
            current_end.strftime("%Y%m%d"),
        ))
        current_start = date(current_start.year + 1, 1, 1)

    return chunks


def run_backfill(start_year: int = 2000):
    """
    Collect historical price data from start_year to today.

    Checks the earliest existing data in DB and only collects
    missing periods. Splits into yearly chunks to avoid overloading pykrx.
    """
    logger.info(f"Starting backfill from {start_year}")
    today = date.today()

    db = SessionLocal()
    try:
        # Determine what needs backfilling
        backfill_targets = [
            ("Price", PriceCollector, Price.date),
            ("ETFPrice", ETFPriceCollector, ETFPrice.date),
        ]

        for name, collector_cls, date_col in backfill_targets:
            # Find earliest existing data
            earliest = db.query(func.min(date_col)).scalar()

            if earliest is None:
                # No data at all - collect everything
                backfill_end = today
            else:
                # Data exists - collect from start_year to day before earliest
                backfill_end = earliest - timedelta(days=1)

            if date(start_year, 1, 1) >= backfill_end:
                logger.info(f"{name}: no backfill needed (data exists from {earliest})")
                continue

            chunks = _generate_yearly_chunks(start_year, backfill_end)
            logger.info(f"{name}: backfilling {len(chunks)} yearly chunks from {start_year} to {backfill_end}")

            for start_dt, end_dt in chunks:
                chunk_db = SessionLocal()
                try:
                    collector = collector_cls(chunk_db, start_date=start_dt, end_date=end_dt)
                    collector.run()
                    logger.info(f"{name}: chunk {start_dt}-{end_dt} completed")
                except Exception as e:
                    logger.error(f"{name}: chunk {start_dt}-{end_dt} failed: {e}")
                finally:
                    chunk_db.close()

            # Also fill gap between earliest data and today (forward fill)
            if earliest is not None:
                latest = db.query(func.max(date_col)).scalar()
                if latest and latest < today:
                    gap_start = (latest + timedelta(days=1)).strftime("%Y%m%d")
                    gap_end = today.strftime("%Y%m%d")
                    gap_db = SessionLocal()
                    try:
                        collector = collector_cls(gap_db, start_date=gap_start, end_date=gap_end)
                        collector.run()
                        logger.info(f"{name}: forward fill {gap_start}-{gap_end} completed")
                    except Exception as e:
                        logger.error(f"{name}: forward fill failed: {e}")
                    finally:
                        gap_db.close()

    finally:
        db.close()

    logger.info("Backfill completed")
```

**Step 4: Run test to verify it passes**

Run: `cd /home/zephyrdark/workspace/quant/galaxy-po/backend && python -m pytest tests/e2e/test_collection_job.py -v`
Expected: All 4 tests PASS

**Step 5: Commit**

```bash
git add backend/jobs/collection_job.py backend/tests/e2e/test_collection_job.py
git commit -m "feat: add backfill job for historical price data"
```

---

### Task 3: Register daily collection in scheduler

**Files:**
- Modify: `backend/jobs/scheduler.py`
- Modify: `backend/jobs/__init__.py`
- Test: `backend/tests/e2e/test_collection_job.py`

**Step 1: Write the failing test**

Append to `backend/tests/e2e/test_collection_job.py`:

```python
def test_scheduler_has_daily_collection_job():
    """Scheduler should register a daily_collection job at 18:00."""
    from jobs.scheduler import scheduler, configure_jobs

    # Reset scheduler for test
    if scheduler.running:
        scheduler.shutdown(wait=False)

    from apscheduler.schedulers.background import BackgroundScheduler
    test_scheduler = BackgroundScheduler()

    with patch("jobs.scheduler.scheduler", test_scheduler):
        configure_jobs()

    jobs = {job.id: job for job in test_scheduler.get_jobs()}
    assert "daily_collection" in jobs

    trigger = jobs["daily_collection"].trigger
    # Verify it's a cron trigger (we can check string representation)
    trigger_str = str(trigger)
    assert "18" in trigger_str  # hour=18
```

**Step 2: Run test to verify it fails**

Run: `cd /home/zephyrdark/workspace/quant/galaxy-po/backend && python -m pytest tests/e2e/test_collection_job.py::test_scheduler_has_daily_collection_job -v`
Expected: FAIL with `AssertionError: 'daily_collection' not in jobs`

**Step 3: Modify scheduler to add the daily collection job**

Modify `backend/jobs/scheduler.py`:

```python
"""
APScheduler configuration for background jobs.
"""
import logging
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.triggers.cron import CronTrigger

from jobs.snapshot_job import create_daily_snapshots
from jobs.collection_job import run_daily_collection

logger = logging.getLogger(__name__)

# Create scheduler instance
scheduler = BackgroundScheduler()


def configure_jobs():
    """Configure scheduled jobs."""
    # Daily data collection at 18:00 (after market close, before snapshot)
    scheduler.add_job(
        run_daily_collection,
        trigger=CronTrigger(
            hour=18,
            minute=0,
            day_of_week='mon-fri',
        ),
        id='daily_collection',
        name='Collect daily market data',
        replace_existing=True,
    )
    logger.info("Configured daily_collection job at 18:00")

    # Daily snapshot at 18:30 (after data collection completes)
    scheduler.add_job(
        create_daily_snapshots,
        trigger=CronTrigger(
            hour=18,
            minute=30,
            day_of_week='mon-fri',
        ),
        id='daily_snapshots',
        name='Create daily portfolio snapshots',
        replace_existing=True,
    )
    logger.info("Configured daily_snapshots job at 18:30")


def start_scheduler():
    """Start the scheduler."""
    if not scheduler.running:
        configure_jobs()
        scheduler.start()
        logger.info("Scheduler started")


def stop_scheduler():
    """Stop the scheduler."""
    if scheduler.running:
        scheduler.shutdown()
        logger.info("Scheduler stopped")
```

Update `backend/jobs/__init__.py`:

```python
"""
Background jobs module.
"""
from jobs.scheduler import scheduler, start_scheduler, stop_scheduler
from jobs.collection_job import run_daily_collection, run_backfill

__all__ = [
    "scheduler", "start_scheduler", "stop_scheduler",
    "run_daily_collection", "run_backfill",
]
```

**Step 4: Run test to verify it passes**

Run: `cd /home/zephyrdark/workspace/quant/galaxy-po/backend && python -m pytest tests/e2e/test_collection_job.py -v`
Expected: All 5 tests PASS

**Step 5: Commit**

```bash
git add backend/jobs/scheduler.py backend/jobs/__init__.py backend/tests/e2e/test_collection_job.py
git commit -m "feat: register daily collection job at 18:00 in scheduler"
```

---

### Task 4: Add backfill API endpoint

**Files:**
- Modify: `backend/app/api/admin.py`
- Test: `backend/tests/e2e/test_collection_job.py`

**Step 1: Write the failing test**

Append to `backend/tests/e2e/test_collection_job.py`:

```python
def test_backfill_api_endpoint(client, admin_auth_headers):
    """POST /api/admin/collect/backfill should trigger backfill."""
    with patch("app.api.admin.run_backfill_background") as mock_backfill:
        response = client.post(
            "/api/admin/collect/backfill?start_year=2020",
            headers=admin_auth_headers,
        )
    assert response.status_code == 200
    assert "backfill" in response.json()["message"].lower()
    mock_backfill.assert_called_once()


def test_backfill_api_requires_auth(client):
    """POST /api/admin/collect/backfill should require authentication."""
    response = client.post("/api/admin/collect/backfill")
    assert response.status_code == 401
```

**Step 2: Run test to verify it fails**

Run: `cd /home/zephyrdark/workspace/quant/galaxy-po/backend && python -m pytest tests/e2e/test_collection_job.py::test_backfill_api_endpoint -v`
Expected: FAIL with 404 or import error

**Step 3: Add backfill endpoint to admin API**

Add to `backend/app/api/admin.py`, after existing imports:

```python
from jobs.collection_job import run_backfill
```

Add the background runner function (after `_start_background_collection`):

```python
def _run_backfill_background(start_year: int):
    """Run backfill in a background thread."""
    try:
        run_backfill(start_year=start_year)
    except Exception as e:
        logger.error("Background backfill failed: %s", e)


def run_backfill_background(start_year: int):
    """Start backfill in a daemon thread."""
    thread = threading.Thread(
        target=_run_backfill_background,
        args=(start_year,),
        daemon=True,
    )
    thread.start()
```

Add the endpoint (after `collect_etf_prices`):

```python
@router.post("/collect/backfill", response_model=CollectResponse)
async def collect_backfill(
    current_user: CurrentUser,
    start_year: int = Query(2000, ge=1990, le=2026, description="Start year for backfill"),
):
    """Backfill historical price data from start_year to today (runs in background)."""
    run_backfill_background(start_year)
    return CollectResponse(message=f"Backfill started from {start_year}")
```

**Step 4: Run test to verify it passes**

Run: `cd /home/zephyrdark/workspace/quant/galaxy-po/backend && python -m pytest tests/e2e/test_collection_job.py -v`
Expected: All 7 tests PASS

**Step 5: Commit**

```bash
git add backend/app/api/admin.py backend/tests/e2e/test_collection_job.py
git commit -m "feat: add backfill API endpoint for historical data collection"
```

---

### Task 5: Run full test suite and verify

**Step 1: Run all existing tests**

Run: `cd /home/zephyrdark/workspace/quant/galaxy-po/backend && python -m pytest tests/ -v`
Expected: All tests PASS (existing + new)

**Step 2: Verify no import issues**

Run: `cd /home/zephyrdark/workspace/quant/galaxy-po/backend && python -c "from jobs import run_daily_collection, run_backfill; print('OK')"`
Expected: `OK`

**Step 3: Final commit if any fixes needed**

```bash
git add -A
git commit -m "fix: resolve any test issues from collection scheduling"
```