An essay on AI
How AI Chips Work: The Pause Is Where the Work Lives
A chip is not a swarm. A hundred billion transistors pause together a billion times a second. The pause is the work. The slowest part sets the pace.

A modern chip has about a hundred billion transistors, all running at the same time. The thing that keeps them coherent is not their independence. It is the moment, one billion times a second, when every one of them pauses and steps forward together.

Short answer

How do AI chips work, and why is the pause the part that matters?

The way an AI chip works depends on the pause. The pause is where the actual computation lives. The clock pulse is the announcement, not the work. AI chips win by making the pause longer and more predictable, not by making the clock faster than the silicon allows.

A chip is not a swarm

If you have ever wondered what a gigahertz is, this is the place to start.

Picture a hundred billion transistors on a chip. Now picture them all freezing at the same instant, one billion times a second, then taking one tiny step forward together. That moment of freezing is what a clock cycle is.

It is also, in a way that almost nobody explains, the whole reason the chip works.

Without the freeze, the parts of the chip would finish their work at slightly different times. The answers would not line up. The math would come apart.

Parallel computing is not many things happening independently. It is many things happening at the same beat. The beat is the work.

Figure: sequence diagram of the chip clock pulse, showing every transistor on the chip pausing in lockstep once a nanosecond and stepping forward together, with the pause moment highlighted as the unit of cooperation.
Caption: Every nanosecond, every transistor on the chip pauses, latches its state, and steps forward together. The pause is the work.

When many things are doing work at the same time, the results have to meet somewhere. The clock is the moment they meet.

A chip moving forward in lockstep is like a marching band stepping forward together on the downbeat. The band sounds like one body because every foot lands at the same moment. The chip works for the same reason. The downbeat is the cycle.

At the clock instant, whatever value happens to be on each wire gets stored in that wire’s register. The chip steps forward one beat. The stored values become the inputs to the next computation. The cycle repeats a billion times a second.
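The difference between stepping together and stepping one at a time can be sketched in a few lines of Python. This is a toy model, not a description of any real chip: the register names `a` and `b` and the functions `step` and `sloppy_step` are invented for illustration.

```python
def step(registers, logic):
    """One clock cycle: compute every next value from the old state,
    then store them all at the same instant."""
    # Between clock edges, every piece of logic reads the OLD state.
    next_values = {name: fn(registers) for name, fn in logic.items()}
    # At the clock edge, every register takes its new value together.
    return next_values


def sloppy_step(registers, logic):
    """No shared beat: each register updates as soon as its answer is ready."""
    registers = dict(registers)
    for name, fn in logic.items():
        # Later registers may read values that were already overwritten.
        registers[name] = fn(registers)
    return registers


# Hypothetical circuit: two registers that swap values every cycle.
logic = {
    "a": lambda r: r["b"],
    "b": lambda r: r["a"],
}

print(step({"a": 1, "b": 2}, logic))         # {'a': 2, 'b': 1}
print(sloppy_step({"a": 1, "b": 2}, logic))  # {'a': 2, 'b': 2}
```

With the shared beat, the two values swap cleanly. Without it, `b` reads a value that `a` has already overwritten, and both registers end up holding 2.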

A chip is not a swarm of independent parts. Every nanosecond, every transistor pauses and steps forward together. The pause is the work.

The reader who got this far has the model the rest of the post builds on. The gigahertz number on a chip’s specification sheet is not a speed rating. It is the count of pauses per second, in billions: a one-gigahertz chip pauses a billion times a second, once every nanosecond. A faster chip is a chip that pauses more often.

The slowest part sets the pace

Now you can see what a clock cycle is. The next question is what sets its speed.

The answer is unexpected.

The clock speed of a chip is not set by its fastest part. It is set by its slowest. Whatever takes the longest to finish in a single cycle determines how fast the whole chip can run.

This is true for the same reason a convoy on a mountain road moves at the speed of the slowest truck. The fastest truck in the line cannot pull ahead of the slowest one without losing the convoy. The clock is the convoy. The slowest piece of logic is the slowest truck.
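The convoy rule is simple arithmetic. Here is a minimal sketch, assuming each piece of logic has a single delay in nanoseconds and ignoring real-world details such as register setup time. The path names and delay values are made up for illustration.

```python
def max_clock_ghz(path_delays_ns):
    """Fastest clock, in gigahertz, that lets every path finish in one cycle."""
    # A cycle has to last at least as long as the slowest path.
    slowest_ns = max(path_delays_ns.values())
    # A period in nanoseconds corresponds to a frequency of 1 / period gigahertz.
    return 1.0 / slowest_ns


# Hypothetical delays, in nanoseconds.
paths = {"adder": 0.25, "multiplier": 0.5, "accumulator_loop": 1.0}
print(max_clock_ghz(paths))  # 1.0

# Speeding up a part that was already fast changes nothing.
faster_adder = dict(paths, adder=0.125)
print(max_clock_ghz(faster_adder))  # 1.0

# Speeding up the slowest part raises the limit for the whole chip.
tighter_loop = dict(paths, accumulator_loop=0.5)
print(max_clock_ghz(tighter_loop))  # 2.0
```

Only the last change moves the number. The fastest truck never mattered.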

The standard fix for decades was to split a long logic path in half with a register in the middle. Each half is shorter, so the clock could then run roughly twice as fast, at the cost of one extra register and one extra cycle before a result comes out. This is called pipelining. It worked for thirty years.
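The gain from pipelining can be estimated with the same kind of arithmetic. This sketch assumes the path splits into perfectly even stages, and it adds an optional `register_overhead_ns` parameter, a hypothetical knob for the small delay each extra register adds. The numbers are illustrative, not measurements.

```python
def pipelined_clock_ghz(path_ns, stages, register_overhead_ns=0.0):
    """Clock limit, in gigahertz, after splitting one path into equal stages."""
    # Each stage does 1/stages of the work, plus the cost of its register.
    stage_ns = path_ns / stages + register_overhead_ns
    return 1.0 / stage_ns


print(pipelined_clock_ghz(1.0, stages=1))  # 1.0
print(pipelined_clock_ghz(1.0, stages=2))  # 2.0

# With a register that costs an assumed 0.125 ns, the gain is less than double.
print(pipelined_clock_ghz(1.0, stages=2, register_overhead_ns=0.125))  # 1.6
```

The doubling is the ideal case. Once the register itself takes time, each further split buys a little less.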

The fix stops working at a specific place. Some logic cannot be split that way. A calculation that feeds its result back into itself, such as a running sum or an accumulator, breaks if you put a register in the middle of the loop. The loop has to finish in one cycle. The time the slowest loop takes sets the shortest cycle the whole chip can run.
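A small simulation shows what "breaks" means here. This is a sketch under simple assumptions: one new input arrives every cycle, and the hypothetical register `mid` is placed inside the feedback loop the way a pipeline register would be.

```python
def running_sum(xs):
    """Accumulator whose feedback loop closes within a single cycle."""
    acc = 0
    out = []
    for x in xs:
        acc = acc + x  # this cycle's sum uses last cycle's result
        out.append(acc)
    return out


def running_sum_with_extra_register(xs):
    """Same accumulator with a register placed in the middle of the loop."""
    acc = 0  # accumulator register
    mid = 0  # hypothetical pipeline register inside the loop
    out = []
    for x in xs:
        # Both registers latch at the same clock edge, from the old values.
        # The adder now sees a result that is one cycle stale.
        acc, mid = mid + x, acc
        out.append(acc)
    return out


print(running_sum([1, 2, 3, 4]))                      # [1, 3, 6, 10]
print(running_sum_with_extra_register([1, 2, 3, 4]))  # [1, 2, 4, 6]
```

The second version never produces the running sum. It produces two separate sums interleaved, one over the odd-numbered inputs and one over the even-numbered ones, because each addition reads a result that is a cycle out of date.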

That is why two chips built on the same manufacturing process can end up at different clock speeds. One has a tighter feedback loop than the other. The chip with the tighter loop runs faster. The chip with the looser loop runs slower. Same factory. Same materials. Different speed limit.

That is also why the era of doubling the clock every few years ended. The slowest feedback loop stopped cooperating. Designers turned to other strategies (more cores, smarter caches, specialized circuits) because the clock could not be pushed past the loop.

The slowest piece sets the pace in any parallel system. The chip is the cleanest example. The same rule shows up in any workshop where many hands meet a deadline, any kitchen with five cooks waiting on one oven, any household trying to leave on time when one parent is still tying a shoe.

The slowest part runs the meeting.

Hidden in hardware or exposed to software

There is one more move on the timing side of a chip worth understanding. It is the move that makes an Artificial Intelligence (AI) chip an AI chip.

Every timing decision in a chip lives somewhere on the same line. On one end, the hardware decides on its own, and the programmer never sees the decision. On the other end, the programmer is told to decide explicitly, and the hardware just executes.

A Central Processing Unit (CPU), the chip in a desktop computer or laptop, picks the first end. It uses caches, which are small fast pieces of memory that the hardware fills automatically with whatever data it predicts the program will need next. Caches can make programs about a hundred times faster than they would be without them. They also make timing hard to predict.

A Tensor Processing Unit (TPU), Google’s AI accelerator chip, picks the second end. It uses scratchpads, which are small fast pieces of memory the programmer fills directly. Timing is predictable because nothing is being decided behind the scenes.

Figure: side-by-side comparison of a Central Processing Unit using a cache (hardware decides what to keep nearby, timing is unpredictable, average speed is high) and a Tensor Processing Unit using a scratchpad (programmer decides explicitly, timing is predictable, average speed is lower).
Caption: Cache on the left. Scratchpad on the right. Hidden in hardware, or exposed to software. The choice decides whether the timing is predictable.

The CPU is more flexible and less predictable. The TPU is more rigid and more reliable. Neither chip is wrong. The chips are tuned for different work.

A CPU does not know what program it will run next. The cache is the way the hardware copes with that uncertainty. The CPU is like a passenger elevator that picks its own floors based on who walks in.

A TPU runs the same kind of program over and over. The scratchpad is the right tool because the programmer can plan exactly what data sits where. The TPU is like a freight elevator that waits for the operator to push the button.
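The two elevators can be compared with a toy timing model. Everything here is an assumption made for illustration: a cache hit costs 1 cycle, a trip to main memory costs 100, the cache evicts its least recently used entry, and the scratchpad is large enough to hold every item the programmer loads. Real chips are far more complicated.

```python
from collections import OrderedDict

HIT_CYCLES = 1     # assumed cost of reading fast memory
MISS_CYCLES = 100  # assumed cost of going out to main memory


def cache_cycles(addresses, capacity):
    """Total cycles with a small cache that evicts the least recently used entry."""
    cache = OrderedDict()
    total = 0
    for addr in addresses:
        if addr in cache:
            cache.move_to_end(addr)  # mark as most recently used
            total += HIT_CYCLES
        else:
            total += MISS_CYCLES
            cache[addr] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return total


def scratchpad_cycles(addresses):
    """Total cycles when the programmer loads each item once, up front."""
    loads = len(set(addresses)) * MISS_CYCLES
    reads = len(addresses) * HIT_CYCLES
    return loads + reads


# The same nine reads of the same three items, in two different orders.
grouped = [0, 0, 0, 1, 1, 1, 2, 2, 2]
striped = [0, 1, 2, 0, 1, 2, 0, 1, 2]

print(cache_cycles(grouped, capacity=2))  # 306
print(cache_cycles(striped, capacity=2))  # 900
print(scratchpad_cycles(grouped))         # 309
print(scratchpad_cycles(striped))         # 309
```

In this toy model the cache is slightly ahead on the friendly order and about three times slower on the unfriendly one. The scratchpad costs the same either way, and the programmer could have worked out that number before running anything.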

This choice matters most when many chips have to work together. When a thousand chips have to coordinate on the same matrix multiply, jitter in any one of them slows the whole job. Predictable timing is what lets the thousand stay in step. A cache-heavy chip is fast on average and bad at staying in step. A scratchpad chip is slower on average and excellent at staying in step.
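The cost of jitter at scale can be sketched with a short simulation. The model is deliberately crude and every number in it is hypothetical: each chip takes a base time plus a random delay, and a step of the job ends only when the slowest chip finishes.

```python
import random


def average_step_time(num_chips, base, jitter, steps, seed=0):
    """Average time per step when every step waits for the slowest chip."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        # Each chip finishes at its base time plus a random delay.
        times = [base + rng.uniform(0, jitter) for _ in range(num_chips)]
        # The step is done only when the last chip is done.
        total += max(times)
    return total / steps


# Hypothetical cache-style chip: faster base time, unpredictable delay.
alone = average_step_time(num_chips=1, base=0.8, jitter=0.3, steps=1000)
together = average_step_time(num_chips=1000, base=0.8, jitter=0.3, steps=100)

# Hypothetical scratchpad-style chip: slower base time, no jitter.
steady = average_step_time(num_chips=1000, base=1.0, jitter=0.0, steps=100)

print(alone, together, steady)
```

On its own, the jittery chip averages under the steady chip's 1.0. In a group of a thousand, nearly every step contains at least one chip having a bad moment, so the group average lands above 1.0. The steady chips stay at exactly 1.0 no matter how many there are.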

The AI workload chose the scratchpad. That choice is most of what makes a TPU look different from a CPU on the inside.

There is no free move

Every choice on the timing side of a chip costs something.

A faster clock costs more power. A deeper pipeline costs more registers and more silicon area. A bigger cache costs predictability. A scratchpad costs programmer time. There is no free move.

The trick is matching the cost to the work. A laptop runs a thousand different programs in a day. The CPU pays the cache cost because the average-case speed is what the user feels. A training cluster runs one workload for weeks. The TPU pays the scratchpad cost because the coordination cost across a thousand chips is what the operator feels.

The whole post is about one rule. Coordination has a cost. The cost lives somewhere.

The clock pulse is the cost made visible. The pause every nanosecond is the toll the chip pays to keep a hundred billion transistors moving in the same direction. The faster the chip wants to go, the more often it has to pay the toll, and the harder it gets to keep paying.

The household that has tried to coordinate a Thanksgiving dinner with eight family members in one kitchen knows the rule from the inside. The turkey is the slowest dish. The turkey sets the pace.

The cook who tries to push past the turkey ends up with cold sides and a half-raw bird. The chip designer who tries to push past the slowest feedback loop ends up with a chip that does not work.

The reader who can see the clock pulse is reading every AI chip announcement through a different lens. Clock speed is the toll rate. The slowest part of the chip sets the rate. Cache versus scratchpad is the choice about who pays in time, the hardware or the programmer. None of it is free.

A chip is not a swarm. It is a hundred billion transistors that pause together. The price of getting that many things to work at once is that, every nanosecond, they all have to stop and look at each other.

A chip is a hundred billion transistors pausing together a billion times a second. A reader holding the next chip announcement now has a frame for what the gigahertz number means and why the AI chip looks different from the laptop chip on the same desk.

The pause is the work. The slowest part sets the pace. The next time a child asks how a chip clock works, the answer is one sentence long.

Source

The argument draws on Reiner Pope’s podcast interview with Dwarkesh Patel, 2025.

Questions readers ask

Six questions on this essay.

01 What is a clock cycle in a chip?

A clock cycle is the moment every transistor on the chip pauses, latches its state, and steps forward together. It happens about a billion times a second on a modern chip running at one gigahertz. During the cycle, each wire on the chip is doing work and producing a value. At the clock instant, whatever value happens to be on each wire gets stored in that wire's register, and the chip steps forward one beat. The cycle repeats. The gigahertz number on a chip's specification is the count of clock cycles per second. It is not a raw speed rating. It is the count of pauses. Without the pause, the parts of the chip would finish their work at slightly different times, the answers would not line up, and the math would come apart. The pause is what makes the chip a chip and not a swarm.

02 How does a chip with billions of transistors stay coordinated?

By pausing together a billion times a second. The clock pulse reaches every part of the chip at the same instant. At the pulse, every transistor latches its state into its register and steps forward one beat. Between pulses, the wires are doing work. At the pulse, the work gets stored. The chip is not a swarm of independent parts running on their own. The chip is a hundred billion transistors moving in lockstep. The lockstep is what makes the math come out right. Software people often imagine parallelism as many things happening at once. Hardware people know that parallelism in silicon is many things pausing at once. The pause is the synchronization. Without the pause, the chip would still have a hundred billion transistors. None of them would produce a coherent answer to anything.

03 Why did clock speeds stop increasing?

Because the slowest piece of logic on the chip would not cooperate. For thirty years, designers sped up the clock by splitting long logic paths in half with a register in the middle. Each split let the clock run roughly twice as fast on that path. The fix stopped working at the loops. A calculation that feeds its result back into itself, such as a running sum or an accumulator, has to finish in one cycle. You cannot put a register in the middle of a feedback loop without breaking the loop. Whatever the slowest loop takes is what the whole chip can do. Designers turned to other strategies (more cores, smarter caches, specialized circuits like the ones inside an AI accelerator) because the clock could not be pushed past the slowest feedback loop in the design. The slowest part sets the pace.

04 What is the difference between a cache and a scratchpad?

A cache is a small fast piece of memory that the hardware fills automatically. The hardware predicts what the program needs next and keeps it nearby. The programmer never sees the decision. A scratchpad is a small fast piece of memory the programmer fills directly. The programmer says exactly what data goes there and when. The hardware does not predict. The trade-off is between flexibility and predictability. A cache can make programs about a hundred times faster on average and makes the timing hard to predict. A scratchpad makes the timing predictable and forces the programmer to plan the data layout. A Central Processing Unit uses caches because it runs many different programs and the average case is what matters. A Tensor Processing Unit uses scratchpads because it runs the same kind of program over and over and the coordination across many chips matters more than the average case.

05 Why do AI chips use scratchpads instead of caches?

Because AI training runs the same workload across hundreds or thousands of chips at the same time. When a thousand chips have to coordinate on the same matrix multiplication, jitter in any one chip slows the whole job. Predictable timing is what lets the thousand stay in step. A cache-heavy chip is fast on average and bad at staying in step, because the cache makes the timing depend on what data happens to be sitting in fast memory at that moment. A scratchpad chip is slower on average and excellent at staying in step, because the programmer planned the data layout in advance and the timing does not depend on a hidden hardware decision. The AI workload chose the scratchpad. That single choice is most of what makes a Tensor Processing Unit look different from a Central Processing Unit on the inside.

06 What is the rule that applies to any parallel system, not just chips?

The slowest piece sets the pace for the whole system. The chip is the cleanest example because the pace shows up as the clock speed and the slowest piece is a specific feedback loop nobody can speed up. The same rule applies anywhere parallel work is coordinated. The convoy on a mountain road moves at the speed of the slowest truck. The kitchen on Thanksgiving moves at the speed of the turkey. The household leaving for school moves at the speed of the child who has not found a shoe. The team shipping a project moves at the speed of the slowest dependency. The lesson is the same. Coordination has a cost. The cost lives in the slowest piece. The faster the system wants to go, the harder it has to work on the slowest piece, because everything else is already waiting.

About the author
Hanh D. Brown, writer.

Essayist writing on craft, voice, aging, and what gets harder to say with the years. Twenty years building AI systems for life-stage decisions. Now writing the publication that has the time to ask why.