Here are the best YouTube videos covering inference, attention, QKV (Query, Key, Value) and how they connect:
1. Best Overall - Attention in Transformers, Step-by-Step (3Blue1Brown)
The clearest visual explanation of self-attention, multi-head attention, QKV, and how transformers use attention during inference. Highly recommended as a starting point.
2. "Attention is All You Need" - Full Model Walkthrough (Umar Jamil)
A complete deep-dive covering every layer of the Transformer: single-head self-attention, multi-head attention, Query/Key/Value matrices, training, and inference. Includes the math.
Chapters you care about most:
- 20:08 - Single Head Self-Attention
- 28:30 - Multi-Head Attention
- 35:39 - Query, Key, Value
- 52:09 - Inference
3. The Math Behind Attention: Keys, Queries, and Values (Luis Serrano Academy)
A visually-friendly, math-grounded walkthrough of Q, K, and V matrices - how similarity scores are computed, how attention weights are derived, and what the value matrix actually does.
Chapters:
- 11:09 - Attention
- 20:46 - Keys and Queries Matrices
- 25:02 - Values Matrix
- 28:41 - Self and Multi-head Attention
4. How to Explain Q, K, and V in Self-Attention (Discover AI)
A shorter, focused video on the intuition and history behind the Query/Key/Value notation in BERT-style transformers.
How QKV connects to Inference
Briefly: during inference, an autoregressive (decoder-style) transformer generates tokens one at a time. For each new token, attention is computed by:
- Creating a Query (Q) for the current token - "what am I looking for?"
- Comparing it against the Keys (K) of the current token and all previous tokens via scaled dot products, then applying softmax - "who matches?"
- Taking a weighted sum of all their Values (V), using the softmax scores as weights - "what do I take from them?"
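The three steps above can be sketched for a single attention head as follows. This is a minimal NumPy illustration, not code from any of the videos; the function name `attend` and the toy shapes are assumptions made for the example.

```python
import numpy as np

def attend(q, K, V):
    """Single-query scaled dot-product attention (illustrative sketch).

    q: (d,)      query vector for the current token
    K: (t, d)    one key per token seen so far (including the current one)
    V: (t, d_v)  one value per token seen so far
    """
    # "Who matches?": dot product of the query with every key, scaled by sqrt(d).
    scores = K @ q / np.sqrt(q.shape[-1])
    # Softmax turns the scores into weights that sum to 1
    # (subtracting the max is only for numerical stability).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # "What do I take from them?": weighted sum of the value vectors.
    return weights @ V, weights

# Toy example: 3 tokens, head dimension 4 (hypothetical sizes).
rng = np.random.default_rng(0)
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
q = rng.normal(size=4)
out, w = attend(q, K, V)
```

Here `w` holds one weight per token and `out` is the blended value vector that the head passes on.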
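Because every step reuses the Keys and Values of earlier tokens, the KV cache is a major inference optimization: the K and V tensors from prior tokens are stored and reused, so each step only computes Q, K, and V for the newest token instead of recomputing K and V for the whole sequence.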
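A rough sketch of the caching idea, assuming a single head in a single layer with fixed input embeddings (in a real multi-layer model every layer keeps its own cache). The names `KVCache`, `decode_step`, and `recompute_step`, and the matrix sizes, are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class KVCache:
    """Stores one key row and one value row per token processed so far."""
    def __init__(self, d_k, d_v):
        self.K = np.empty((0, d_k))
        self.V = np.empty((0, d_v))

    def append(self, k, v):
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])

def decode_step(x, W_q, W_k, W_v, cache):
    """Attention for one new token: project only x, reuse cached K and V."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    cache.append(k, v)
    weights = softmax(cache.K @ q / np.sqrt(q.shape[-1]))
    return weights @ cache.V

def recompute_step(X_prefix, W_q, W_k, W_v):
    """Same result without a cache: re-project every token in the prefix."""
    K, V = X_prefix @ W_k, X_prefix @ W_v
    q = X_prefix[-1] @ W_q
    return softmax(K @ q / np.sqrt(q.shape[-1])) @ V

rng = np.random.default_rng(0)
d_model, d_head = 8, 4  # hypothetical sizes
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
X = rng.normal(size=(5, d_model))  # stand-in embeddings for 5 tokens

cache = KVCache(d_head, d_head)
cached_outputs = [decode_step(x, W_q, W_k, W_v, cache) for x in X]
```

Both paths give the same attention output for each token. The cached path does one key and one value projection per step, while the uncached path redoes them for the entire prefix, so its projection work grows with sequence length.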
For the attention mechanics described here, the 3Blue1Brown video (#1 above) is the most intuitive, while Umar Jamil's (#2) covers the details most thoroughly.