
ML Handbook: ML Engineering with AI
A handbook for ML engineers whose primary implementation engine is an AI agent. Practice for the era when the bottleneck is deciding and verifying, not typing.
Table of Contents
35 chapters
Foundations
5 chapters
- 01
Why AI-augmented ML is different
The interesting changes are second-order. Most posts focus on the first-order wins; the leverage is in what happens when the cost of trying an idea collapses.
3 min read
- 02
The new bottlenecks
An honest accounting of an AI-augmented week shows that "writing code" rarely tops the list anymore. Three things take its place.
3 min read
- 03
Claude Code mental model
To use the agent well, you need an honest model of what it is. Most failures come from treating it as something else.
3 min read
- 04
Agent context files
The highest-leverage category of files in any AI-assisted ML repo is the set agents read to figure out how to behave. CLAUDE.md is the most visible. It is one of several. They form a layered system; teams that treat…
8 min read
- 05
Operating principles
The short list. Print it. Pin it next to your monitor. Apply it on every task.
2 min read
Iteration Loop
6 chapters
- 06
Classic vs AI-augmented loop
The loop has the same shape. The economics are different.
3 min read
- 07
Hypothesis → experiment → measure
The loop survives 100× speedup only if each step is structured. Here is the structure we use.
4 min read
- 08
Reproducibility is non-negotiable
You generate results 5–10× faster. If those results are not reproducible, you generate garbage 5–10× faster. Internalize this chapter before the others.
4 min read
- 09
Designing research loops
The mechanics of building an experiment harness that an agent can drive on its own: a bounded action space, machine-readable results, and a hard budget.
5 min read
- 10
Reward & stop conditions
The two parts of an auto-research loop where teams quietly fool themselves. Each warrants its own discussion.
5 min read
- 11
Case studies
Three concrete shapes of auto-research that work today, and one that does not. Use them as templates, not gospel — adapt to your own constraints.
5 min read
Tooling Stack
5 chapters
- 12
Claude Code as the engine
We default to Claude Code as the primary implementation agent. The reasoning is in Foundations; this chapter is about configuration that actually pays off.
5 min read
- 13
Experiment tracking
Pick one. Use it on every run. The worst tracker used consistently beats the best tracker used sometimes.
4 min read
- 14
Data & model versioning
Code without versioning is unsupportable; data and weights are no different. This chapter is about the practical minimum.
4 min read
- 15
Repo conventions
A repo layout that is friendly to humans and agents. Both audiences benefit from the same property: predictability.
5 min read
- 16
SOTA tool roundup (2026)
A pragmatic survey, not a hype list. For each category we name what exists, what we'd actually pick, and what to skip.
6 min read
Three Modalities
3 chapters
- 17
Traditional ML with AI
Tabular, sklearn, XGBoost, LightGBM, classical NLP. This is where the AI-augmented workflow shines brightest, because…
4 min read
- 18
Deep learning with AI
PyTorch, JAX, fine-tuning, distributed training. The economics are different from traditional ML: each experiment costs real money, runs take hours not seconds, and a wrong configuration can burn a day of GPU time.
6 min read
- 19
Agentic AI with AI
Building agent systems with the help of agents. The framing is recursive but the practice is concrete: you are designing software whose primary primitive is "LLM call + tool use," and your IDE is also an agent.
6 min read
Failure Modes
6 chapters
- 20
Data leakage & dataset mixing
The classical ML failure mode, amplified. Agents refactor data pipelines five times in an afternoon and lose the split each time.
4 min read
- 21
Hardcoded paths & magic constants
The most common AI-written bug by volume. Easy to write. Easy to miss in review. Painful to find later.
4 min read
- 22
Fabricated benchmarks
The most damaging failure mode in AI-augmented ML. The agent produces a number that looks like a measurement but came from nowhere. The number lands in a model card, a slide, a blog post, a tweet. Then someone tries to…
5 min read
- 23
Silent test failures
A test that always passes is worse than no test. It looks like coverage and provides none. Agents produce these by accident with surprising regularity. The patterns below are the common ones; learn to spot them on sight.
4 min read
- 24
Hallucinated APIs & versions
The agent's training data is a snapshot. The libraries you use have moved since. Result: confident code calling functions that no longer exist, with arguments that were renamed, against APIs that were deprecated two…
4 min read
- 25
Guardrails checklist
The consolidated list. Use it before merging any AI-generated PR that touches data, training, eval, or anything customer-facing. Two minutes. Saves hours.
4 min read
Documenting with AI
5 chapters
- 26
Doc-driven development
The shortest path to maintainable AI-augmented code: write the doc first, generate the code from the doc, keep them in sync.
4 min read
- 27
Numbers must be measured
The core rule, restated: every number in user-facing material comes from a measurement script in the repo, or a cited public source with a working URL. Nothing else.
5 min read
- 28
The same-commit rule
When you ship code, you ship the doc that describes it. Same commit. Always.
4 min read
- 29
Model & dataset cards
Model cards and dataset cards are the primary external interface for an ML artifact. The agent can draft them well; only you can verify them.
5 min read
- 30
Client deliverables (DOCX / PPTX / XLSX)
Markdown is the lingua franca for engineers. DOCX, PPTX, and XLSX are the lingua franca for everyone else — clients, leadership, finance, sales, regulators. Treating them as second-class cuts you off from collaboration.
6 min read
Recipes
5 chapters
- 31
Recipe: spinning up a new project
A 30-minute path from empty directory to "agent can ship features here." Skip nothing the first time you do it; everything below has earned its place.
5 min read
- 32
Recipe: adding a baseline
The first model in any new project is the baseline. Not the headline model. The baseline. Without it, no future result has a number to beat.
5 min read
- 33
Recipe: running a benchmark
A benchmark is a runnable measurement script that produces a number you can publish. This recipe walks through writing and running one for inference performance — the most common case.
5 min read
- 34
Recipe: writing a model card
The 30-minute version. Produces a card that survives external review without embarrassing you.
5 min read
- 35
Recipe: debugging with Claude
Patterns that consistently produce a fix in under an hour.
6 min read
