Great question. The distinction is fundamental and worth unpacking carefully.
The Core Difference: What the Model Is Actually Doing
LLMs — Sequential, Autoregressive Generation
LLMs (and vision-language models like GPT-4o) are autoregressive: they generate one token at a time, where each token depends on all previous tokens.
- You send input → the model runs a full forward pass through billions of parameters → produces token 1 → runs another full forward pass → produces token 2 → and so on.
- This is inherently sequential. You cannot parallelize token generation because token N+1 depends on token N (a minimal sketch of this loop follows the list).
- A 70B-parameter model stores roughly 140 GB of weights at 16-bit precision, and every forward pass has to stream those weights through the compute units again, once per generated token.
- The latency is structural, not accidental. It's baked into the architecture.
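To make that loop concrete, here is a minimal sketch of greedy autoregressive decoding in PyTorch. `TinyLM` is a made-up stand-in, not a real LLM, but the structural point is the same: one full forward pass per generated token, each conditioned on everything produced so far.

```python
import torch
import torch.nn as nn

# Made-up stand-in for an LLM: an embedding plus a linear head.
# A real model has billions of parameters, but the generation loop is the same.
class TinyLM(nn.Module):
    def __init__(self, vocab_size=100, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, token_ids):            # (seq_len,) -> (seq_len, vocab_size)
        return self.head(self.embed(token_ids))

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=20, eos_id=0):
    ids = prompt_ids.clone()
    for _ in range(max_new_tokens):          # strictly sequential: cannot be parallelized
        logits = model(ids)                  # one full forward pass over the whole prefix
        next_id = logits[-1].argmax()        # only the *next* token falls out of it
        ids = torch.cat([ids, next_id.view(1)])
        if next_id.item() == eos_id:
            break
    return ids

out = generate(TinyLM(), torch.tensor([5, 17, 42]))
print(out.tolist())                          # every extra id cost another forward pass
```

Real inference stacks add a KV cache so each step reuses the attention state of the prefix, but the dependency of token N+1 on token N, and therefore the sequential loop, is unavoidable.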
Even the input processing (encoding your image/text) takes time: attention compares every token with every other token, an O(n²) cost in sequence length, although this "prefill" step can at least be parallelized across the whole input, unlike generation.
Self-Driving Cars — Specialized, Feedforward, Parallel Pipelines
Self-driving systems like Waymo use a fundamentally different stack:
1. The models are smaller and purpose-built
They don't use giant 70B LLMs. They use convolutional neural networks (CNNs), vision transformers (ViTs), or custom architectures trained for one specific job: detect objects, estimate depth, predict trajectories. A single-pass CNN classifying objects in a frame can be just a few million parameters — orders of magnitude smaller than an LLM.
2. Feedforward, not autoregressive
Object detection (YOLO, EfficientDet, etc.) is a single forward pass through the network for an entire image frame. Input goes in → activations flow forward → output comes out. No token-by-token generation loop. This is inherently parallelizable and fast (see the detection sketch after this list).
3. Dedicated hardware (edge compute)
Waymo vehicles carry onboard compute: dedicated accelerators such as TPU-style custom ASICs, or SoC/GPU platforms like NVIDIA DRIVE in other stacks. These are physically present in the car, so there's no network round-trip. The frame is processed locally in milliseconds.
For context:
- A camera frame at 30 FPS gives you ~33ms per frame.
- A modern object detection model on a GPU/ASIC can run inference in 5–20ms per frame.
- LLMs generating a response involve seconds of sequential computation.
4. The output is a narrow, structured signal
The car's perception model doesn't need to "think" — it outputs a bounding box, a class label (car/pedestrian/sign), and a confidence score. That's it. The planning and control layers then act on that structured output. There's no generation of free-form text or multi-step reasoning happening in real time.
5. Sensor fusion at the hardware level
LiDAR, radar, and camera data are processed in parallel by separate specialized models, then fused. Each stream runs independently and concurrently, whereas an LLM pushes everything through one monolithic model and emits its output serially. (A toy sketch of this per-stream layout appears below.)
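To make the contrast in points 2 and 4 concrete, here is a minimal sketch using torchvision's off-the-shelf Faster R-CNN as a stand-in detector. Waymo's actual perception models are proprietary and far more specialized, and hitting the 5–20 ms figure needs a lighter single-stage detector on a GPU/ASIC rather than this CPU-bound example; the point is the shape of the computation: one forward pass per frame, structured output, no generation loop.

```python
import time
import torch
import torchvision

# Stand-in detector (not Waymo's stack): one forward pass per frame, no generation loop.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

frame = torch.rand(3, 480, 640)              # fake camera frame, (C, H, W) in [0, 1]

with torch.no_grad():
    start = time.perf_counter()
    (detections,) = model([frame])           # single forward pass for the entire frame
    elapsed_ms = (time.perf_counter() - start) * 1000

# The output is a narrow, structured signal, not free-form text:
print(detections["boxes"].shape)             # (num_detections, 4) bounding boxes
print(detections["labels"][:5], detections["scores"][:5])
print(f"inference took {elapsed_ms:.0f} ms (the 30 FPS budget is ~33 ms per frame)")
```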
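And a toy sketch of point 5: each sensor stream gets its own small model running concurrently, with a late-fusion step combining the structured outputs. The per-stream "models" below are placeholder functions; in a real vehicle each would be a dedicated network with its own slice of the onboard accelerator.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

# Placeholder per-stream "models": in a real stack these are separate networks
# (camera detector, LiDAR segmenter, radar tracker) running on dedicated silicon.
def camera_model(frame):
    return {"boxes": np.random.rand(3, 4), "source": "camera"}

def lidar_model(point_cloud):
    return {"obstacles": np.random.rand(5, 3), "source": "lidar"}

def radar_model(sweep):
    return {"velocities": np.random.rand(4), "source": "radar"}

def fuse(outputs):
    # Late fusion: combine the per-sensor structured outputs into one world model.
    return {out["source"]: out for out in outputs}

frame, cloud, sweep = np.zeros((480, 640, 3)), np.zeros((100_000, 3)), np.zeros(256)

# Each stream is processed independently and concurrently, then fused.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(camera_model, frame),
               pool.submit(lidar_model, cloud),
               pool.submit(radar_model, sweep)]
    world = fuse(f.result() for f in futures)

print(world.keys())   # dict_keys(['camera', 'lidar', 'radar'])
```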
Why Can't You Just Make an LLM Real-Time?
You can use vision models (like CLIP or lightweight ViTs) in real time — and self-driving cars actually do use vision transformers. The key is:
- No autoregressive loop — just a single forward pass.
- Small model size — fits in fast memory, no weight paging.
- Quantization and pruning — models are compressed to INT8 or INT4 for speed (a minimal sketch follows this list).
- Dedicated silicon — not a general-purpose server CPU.
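As a small illustration of the quantization point, here is a sketch of post-training dynamic INT8 quantization in PyTorch on a made-up little network. Production automotive stacks generally go further (static INT8 or lower, compiled for the target silicon with vendor toolchains), but the basic trade is the same: smaller weights, faster and cheaper arithmetic.

```python
import torch
import torch.nn as nn

# Hypothetical small perception head, just to show the mechanics of INT8 quantization.
model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Post-training dynamic quantization: Linear weights are stored as INT8 and
# dequantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.rand(1, 512)
print(model(x).shape, quantized(x).shape)   # same interface, compressed weights
```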
The "latency" you notice in GPT-4o or similar is specifically the generation step. If you just ask it to classify an image (no text output), it's actually much faster. The slowness comes from generating long, sequential outputs.
Summary Table
| Factor | LLM (e.g. GPT-4o) | Self-Driving (e.g. Waymo) |
|---|---|---|
| Architecture | Autoregressive transformer | Feedforward CNN/ViT |
| Output type | Tokens, one at a time | Structured labels/boxes |
| Model size | Tens to hundreds of billions of params | Millions to low billions of params |
| Hardware location | Remote datacenter server | Onboard edge compute |
| Inference latency | Seconds (due to generation loop) | 5–30ms per frame |
| Parallelism | Limited (sequential token gen) | High (whole frame in one pass) |
| Purpose | General reasoning | One specific perception task |
The short version: LLMs are slow because they generate outputs step-by-step through a huge model on remote servers. Self-driving cars use small, task-specific models that run a single fast forward pass on dedicated hardware sitting right inside the vehicle. The "real-time" capability comes from eliminating the autoregressive bottleneck, shrinking the model, and co-locating compute with the sensor.