Search for YouTube videos that talk about inference, attention, QKV

Search for YouTube videos that talk about inference, attention, QKV and how these are associated with

Here are the best YouTube videos covering inference, attention, QKV (Query, Key, Value) and how they connect:

1. Best Overall - Attention in Transformers, Step-by-Step (3Blue1Brown)

The clearest visual explanation of self-attention, multi-head attention, QKV, and how transformers reason during inference. Highly recommended as a starting point.

https://www.youtube.com/watch?v=eMlx5fFNoYc

2. "Attention is All You Need" - Full Model Walkthrough (Umar Jamil)

A complete deep-dive covering every layer of the Transformer: single-head self-attention, multi-head attention, Query/Key/Value matrices, training, and inference. Includes the math.

Chapters you care about most:

20:08 - Single Head Self-Attention
28:30 - Multi-Head Attention
35:39 - Query, Key, Value
52:09 - Inference

https://www.youtube.com/watch?v=bCz4OMemCcA

3. The Math Behind Attention: Keys, Queries, and Values (Luis Serrano Academy)

A visually-friendly, math-grounded walkthrough of Q, K, and V matrices - how similarity scores are computed, how attention weights are derived, and what the value matrix actually does.

Chapters:

11:09 - Attention
20:46 - Keys and Queries Matrices
25:02 - Values Matrix
28:41 - Self and Multi-head Attention

https://www.youtube.com/watch?v=UPtG_38Oq8o

4. How to Explain Q, K, and V in Self-Attention (Discover AI)

A shorter, focused video on the intuition and history behind the Query/Key/Value notation in BERT-style transformers.

https://www.youtube.com/watch?v=PFczJ6NR5rY

How QKV connects to Inference

Briefly: during inference, a transformer generates tokens one at a time. For each new token, attention is computed by:

Creating a Query (Q) for the current token - "what am I looking for?"
Scoring it against the Keys (K) of every token so far, the current one included - "who matches?"
Taking a softmax-weighted sum of the corresponding Values (V) - "what do I take from them?"
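
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from any of the videos: it assumes a single attention head, uses the standard scaled dot-product form (scores divided by the square root of the vector size, then softmax), and the names `attend`, `softmax`, and the toy vectors are made up for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for ONE query vector (hypothetical helper).

    query:  list of d floats (the current token's Q)
    keys:   list of n vectors of d floats (K for every token so far)
    values: list of n vectors (V for every token so far)
    Returns (weights, output).
    """
    d = len(query)
    # Step 2: "who matches?" - dot product of Q with each K, scaled by sqrt(d)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Step 3: "what do I take from them?" - weighted sum of the V vectors
    dim_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return weights, output

# Toy data (illustrative only): tokens 0 and 2 have keys aligned with the query.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [10.0, 0.0]]

weights, output = attend(query, keys, values)
print(weights)  # tokens 0 and 2 receive equal, larger weights than token 1
print(output)   # dominated by the values of tokens 0 and 2
```

Real models run this for many heads and many tokens at once as batched matrix multiplications; the per-vector loops here are only meant to make each step visible.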

This is why the KV cache is a major inference optimization: with causal attention, the K and V tensors of earlier tokens do not change once computed, so they are cached and reused, and each step only has to compute Q, K, and V for the new token.
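
As a rough sketch of that idea, the snippet below compares a decode loop that keeps a KV cache with one that re-projects every token at each step. Everything here is an assumption made for illustration: one head, one layer, random projection matrices `w_q`, `w_k`, `w_v`, random "token embeddings", and a hypothetical `KVCache` class. It is not how any particular library implements caching.

```python
import math
import random

def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum((e / total) * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]

class KVCache:
    """Hypothetical cache: stores K and V of every token processed so far."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

def decode_step(x, w_q, w_k, w_v, cache):
    """With a cache: project ONLY the new token, reuse stored K and V."""
    cache.append(matvec(w_k, x), matvec(w_v, x))
    return attend(matvec(w_q, x), cache.keys, cache.values)

def recompute_step(xs, w_q, w_k, w_v):
    """Without a cache: re-project K and V for every token seen so far."""
    keys = [matvec(w_k, x) for x in xs]
    values = [matvec(w_v, x) for x in xs]
    return attend(matvec(w_q, xs[-1]), keys, values)

random.seed(0)
dim = 4

def rand_matrix():
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

w_q, w_k, w_v = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]

cache = KVCache()
cached_outputs = [decode_step(x, w_q, w_k, w_v, cache) for x in tokens]
reference_outputs = [recompute_step(tokens[: i + 1], w_q, w_k, w_v)
                     for i in range(len(tokens))]

# Both paths produce the same attention outputs; the cached path just does
# one K/V projection per step instead of one per token seen so far.
same = all(abs(a - b) < 1e-9
           for out_c, out_r in zip(cached_outputs, reference_outputs)
           for a, b in zip(out_c, out_r))
print(same)
```

The trade-off is memory: the cache grows by one K and one V vector per token (per layer and per head in a real model), which is why long contexts make the KV cache a significant part of inference memory use.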

The 3Blue1Brown video (link #1 above) covers this most intuitively; Umar Jamil's (link #2) covers the mechanics most thoroughly.
