Great question. The distinction is fundamental and worth unpacking carefully.
The Core Difference: What the Model Is Actually Doing
LLMs — Sequential, Autoregressive Generation
LLMs (and vision-language models like GPT-4o) are autoregressive: they generate one token at a time, where each token depends on all previous tokens.
- You send input → the model runs a full forward pass through billions of parameters → produces token 1 → runs another full forward pass → produces token 2 → and so on.
- This is inherently sequential. You cannot parallelize token generation because token N+1 depends on token N (a minimal sketch of this loop follows the list).
- A 70B-parameter model stores roughly 140 GB of weights at 16-bit precision, and every forward pass has to stream those weights through the compute units again, once per generated token.
- The latency is structural, not accidental. It's baked into the architecture.
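To make that loop concrete, here is a minimal sketch of greedy autoregressive decoding in PyTorch. `TinyLM` is a made-up stand-in, not a real LLM, but the structural point is the same: one full forward pass per generated token, each conditioned on everything produced so far.

```python
import torch
import torch.nn as nn

# Made-up stand-in for an LLM: an embedding plus a linear head.
# A real model has billions of parameters, but the generation loop is the same.
class TinyLM(nn.Module):
    def __init__(self, vocab_size=100, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, token_ids):            # (seq_len,) -> (seq_len, vocab_size)
        return self.head(self.embed(token_ids))

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=20, eos_id=0):
    ids = prompt_ids.clone()
    for _ in range(max_new_tokens):          # strictly sequential: cannot be parallelized
        logits = model(ids)                  # one full forward pass over the whole prefix
        next_id = logits[-1].argmax()        # only the *next* token falls out of it
        ids = torch.cat([ids, next_id.view(1)])
        if next_id.item() == eos_id:
            break
    return ids

out = generate(TinyLM(), torch.tensor([5, 17, 42]))
print(out.tolist())                          # every extra id cost another forward pass
```

Real inference stacks add a KV cache so each step reuses the attention state of the prefix, but the dependency of token N+1 on token N, and therefore the sequential loop, is unavoidable.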
Even the input processing (encoding your image/text) takes time: attention compares every token with every other token, an O(n²) cost in sequence length, although this "prefill" step can at least be parallelized across the whole input, unlike generation.
Self-Driving Cars — Specialized, Feedforward, Parallel Pipelines
Self-driving systems like Waymo use a fundamentally different stack:
1. The models are smaller and purpose-built
They don't use giant 70B LLMs. They use convolutional neural networks (CNNs), vision transformers (ViTs), or custom architectures trained for one specific job: detect objects, estimate depth, predict trajectories. A single-pass CNN classifying objects in a frame can be just a few million parameters — orders of magnitude smaller than an LLM.
2. Feedforward, not autoregressive
Object detection (YOLO, EfficientDet, etc.) is a single forward pass through the network for an entire image frame. Input goes in → activations flow forward → output comes out. No token-by-token generation loop. This is inherently parallelizable and fast (see the detection sketch after this list).
3. Dedicated hardware (edge compute)
Waymo vehicles carry onboard compute: dedicated accelerators such as TPU-style custom ASICs, or SoC/GPU platforms like NVIDIA DRIVE in other stacks. These are physically present in the car, so there's no network round-trip. The frame is processed locally in milliseconds.
For context:
- A camera frame at 30 FPS gives you ~33ms per frame.
- A modern object detection model on a GPU/ASIC can run inference in 5–20ms per frame.
- LLMs generating a response involve seconds of sequential computation.
4. The output is a narrow, structured signal
The car's perception model doesn't need to "think" — it outputs a bounding box, a class label (car/pedestrian/sign), and a confidence score. That's it. The planning and control layers then act on that structured output. There's no generation of free-form text or multi-step reasoning happening in real time.
5. Sensor fusion at the hardware level
LiDAR, radar, and camera data are processed in parallel by separate specialized models, then fused. Each stream runs independently and concurrently, whereas an LLM pushes everything through one monolithic model and emits its output serially. (A toy sketch of this per-stream layout appears below.)
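To make the contrast in points 2 and 4 concrete, here is a minimal sketch using torchvision's off-the-shelf Faster R-CNN as a stand-in detector. Waymo's actual perception models are proprietary and far more specialized, and hitting the 5–20 ms figure needs a lighter single-stage detector on a GPU/ASIC rather than this CPU-bound example; the point is the shape of the computation: one forward pass per frame, structured output, no generation loop.

```python
import time
import torch
import torchvision

# Stand-in detector (not Waymo's stack): one forward pass per frame, no generation loop.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

frame = torch.rand(3, 480, 640)              # fake camera frame, (C, H, W) in [0, 1]

with torch.no_grad():
    start = time.perf_counter()
    (detections,) = model([frame])           # single forward pass for the entire frame
    elapsed_ms = (time.perf_counter() - start) * 1000

# The output is a narrow, structured signal, not free-form text:
print(detections["boxes"].shape)             # (num_detections, 4) bounding boxes
print(detections["labels"][:5], detections["scores"][:5])
print(f"inference took {elapsed_ms:.0f} ms (the 30 FPS budget is ~33 ms per frame)")
```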
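And a toy sketch of point 5: each sensor stream gets its own small model running concurrently, with a late-fusion step combining the structured outputs. The per-stream "models" below are placeholder functions; in a real vehicle each would be a dedicated network with its own slice of the onboard accelerator.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

# Placeholder per-stream "models": in a real stack these are separate networks
# (camera detector, LiDAR segmenter, radar tracker) running on dedicated silicon.
def camera_model(frame):
    return {"boxes": np.random.rand(3, 4), "source": "camera"}

def lidar_model(point_cloud):
    return {"obstacles": np.random.rand(5, 3), "source": "lidar"}

def radar_model(sweep):
    return {"velocities": np.random.rand(4), "source": "radar"}

def fuse(outputs):
    # Late fusion: combine the per-sensor structured outputs into one world model.
    return {out["source"]: out for out in outputs}

frame, cloud, sweep = np.zeros((480, 640, 3)), np.zeros((100_000, 3)), np.zeros(256)

# Each stream is processed independently and concurrently, then fused.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(camera_model, frame),
               pool.submit(lidar_model, cloud),
               pool.submit(radar_model, sweep)]
    world = fuse(f.result() for f in futures)

print(world.keys())   # dict_keys(['camera', 'lidar', 'radar'])
```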
Why Can't You Just Make an LLM Real-Time?
You can use vision models (like CLIP or lightweight ViTs) in real time — and self-driving cars actually do use vision transformers. The key is:
- No autoregressive loop — just a single forward pass.
- Small model size — fits in fast memory, no weight paging.
- Quantization and pruning — models are compressed to INT8 or INT4 for speed (a minimal sketch follows this list).
- Dedicated silicon — not a general-purpose server CPU.
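As a small illustration of the quantization point, here is a sketch of post-training dynamic INT8 quantization in PyTorch on a made-up little network. Production automotive stacks generally go further (static INT8 or lower, compiled for the target silicon with vendor toolchains), but the basic trade is the same: smaller weights, faster and cheaper arithmetic.

```python
import torch
import torch.nn as nn

# Hypothetical small perception head, just to show the mechanics of INT8 quantization.
model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Post-training dynamic quantization: Linear weights are stored as INT8 and
# dequantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.rand(1, 512)
print(model(x).shape, quantized(x).shape)   # same interface, compressed weights
```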
The "latency" you notice in GPT-4o or similar is specifically the generation step. If you just ask it to classify an image (no text output), it's actually much faster. The slowness comes from generating long, sequential outputs.
Summary Table
| Factor | LLM (e.g. GPT-4o) | Self-Driving (e.g. Waymo) |
|---|---|---|
| Architecture | Autoregressive transformer | Feedforward CNN/ViT |
| Output type | Tokens, one at a time | Structured labels/boxes |
| Model size | Tens to hundreds of billions of params | Millions to low billions of params |
| Hardware location | Remote datacenter server | Onboard edge compute |
| Inference latency | Seconds (due to generation loop) | 5–30ms per frame |
| Parallelism | Limited (sequential token gen) | High (whole frame in one pass) |
| Purpose | General reasoning | One specific perception task |
The short version: LLMs are slow because they generate outputs step-by-step through a huge model on remote servers. Self-driving cars use small, task-specific models that run a single fast forward pass on dedicated hardware sitting right inside the vehicle. The "real-time" capability comes from eliminating the autoregressive bottleneck, shrinking the model, and co-locating compute with the sensor.