During the mid-1960s, a revolution in miniaturization was kick-started.
The idea of packing dozens of semiconductor-based transistors
on to a single silicon chip spawned the integrated circuit.
It laid the groundwork for a complete paradigm shift in how
modern society would evolve.
In less than a decade, this marvel of electronic engineering
and materials science would usher in an era of advancement
incomparable to anything else in human history.
In March 1971, the commercial launch of a new
semiconductor product set
the stage for this new era.
Composed of a then-incredible 2,300 transistors, the Intel 4004 central
processing unit, or CPU, was released.
Initially created as a custom solution for the Japanese company Busicom
for use in its 141-PF calculator, it was released later
that year to the general public.
With prophetic irony, the marketing material for
the chip touted the slogan
“Announcing a new era in integrated electronics”.
But what made the Intel 4004 so groundbreaking?
Take a calculator and solve any simple arithmetic operation,
let's say 22 divided by 7.
What we just did was issue a computer an instruction.
Instructions are elementary operations, such as math
commands, that a CPU executes.
Every computer program ever made, from web browsers, to apps, to video
games is composed of millions of these instructions.
The 4004 was capable of executing between 46,250 and
92,500 instructions per second.
For comparison, ENIAC, the first electronic computer, built just 25
years earlier, could only execute 5,000 instructions a second.
But what made the 4004 so powerful wasn't just its 1,800%
increase in processing power: it consumed only 1 watt of
electricity, was about ¾ of an inch long, and cost $5 to produce in today's money.
This was miles ahead of ENIAC's cost of $5.5 million
in today's money, 180 kW power consumption, and 27-ton weight.
Fast forward to September 2017, the launch date of
the Intel Core i9-7980XE.
This CPU is capable of performing over 80 billion instructions
a second, a 900,000-fold increase in processing power.
What did it take to get here?
In this two-part series we explore the engineering and behind-the-scenes
technology that paved the way for that simple 16-pin chip to evolve
into the powerhouse CPUs of today.
This is the evolution of processing power.
HOW A CPU STORES DATA
In order to understand how a CPU derives its processing power, let's
examine what a CPU actually does and how it interfaces with data.
In digital electronics everything is represented by the binary “bit”.
It's an elemental representation of two possible states.
A bit can represent a zero or one, true or false, up or down, on or
off, or any other bi-state value.
In a CPU, a “bit” is physically transmitted as voltage levels.
If we combine multiple “bits” together in a group, we can now represent
more combinations of discrete states.
For example, if we combine eight bits together we form
what's known as a byte.
A byte can represent 256 different states and can be
used to represent numbers.
In the case of a byte, any number between 0 and 255 can be expressed.
But in a CPU, how we choose to represent data
is completely malleable.
That same byte can also represent a number between -128 and 127.
Other expressions of that byte may be colors or levels of sound.
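This malleability can be illustrated with a short sketch that reinterprets the same eight bits as an unsigned number and as a signed one. The signed reading here assumes two's complement, the usual convention, though the video doesn't name it:

```python
# A minimal sketch: the same 8 bits reinterpreted two different ways.
bits = 0b11111011  # one raw byte

unsigned = bits                                # unsigned reading: 0..255
signed = bits - 256 if bits >= 128 else bits   # two's-complement reading: -128..127

print(unsigned)  # 251
print(signed)    # -5
```

The same pattern of bits means 251 or -5 depending purely on how we choose to interpret it.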
When we combine multiple bytes together, we create
what's known as a word.
Words are expressed in their bit capacity.
A 32-bit word contains 32-bits.
A 64-bit word contains 64 bits and so on.
When a processor is created, the native word size it operates on
forms the core of its architecture.
The original Intel 4004 processor operated on a 4-bit word.
This means data moving through the CPU transits in
chunks of four bits at a time.
Modern CPUs are typically 64-bit; however, 32-bit processors
are still quite common.
By making use of larger word sizes we can represent more discrete states
and consequently larger numbers.
A 32-bit word for example, can represent up to 4.2
billion different states.
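The relationship between word size and the number of representable states is simply 2 raised to the number of bits, which a one-liner makes concrete:

```python
# Number of discrete states an n-bit word can represent: 2**n
for n in (4, 8, 32, 64):
    print(f"{n}-bit word: {2**n:,} states")
# A 4-bit word gives 16 states; a 32-bit word gives
# 4,294,967,296 states -- the "4.2 billion" figure above.
```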
Of all the forms data can take inside of a CPU the most important
one is that of an instruction.
Instructions are unique bits of data, that are decoded and
executed by the CPU as operations.
An example of a common instruction would be to add two word
values together or move a word of data from one location in
memory to another location.
The entire list of instructions a CPU supports is called
its instruction set.
Each instruction's binary representation, its machine
code, is typically assigned a human-readable representation
known as assembly language.
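The machine-code-to-mnemonic relationship can be sketched as a simple lookup; the opcodes and mnemonics here are invented for illustration and are not drawn from any real instruction set:

```python
# Hypothetical mapping from opcode bit patterns to assembly mnemonics.
MNEMONICS = {
    0b0001: "ADD",   # add two values
    0b0010: "MOV",   # move data between locations
    0b0011: "JMP",   # branch to a new address
}

def disassemble(opcode: int) -> str:
    """Return the human-readable name for an opcode, if known."""
    return MNEMONICS.get(opcode, "???")

print(disassemble(0b0001))  # ADD
```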
If we look at the instruction sets of most CPUs, they all tend to
focus on performing math or logical operations on data, testing
conditions, or moving data from one location in memory to another.
For all intents and purposes, we can think of a CPU as an
instruction processing machine.
They operate by looping through three basic steps,
fetch, decode, and execute.
As CPU designs evolve, these three steps become dramatically more complicated,
and technologies are implemented that extend this core model of operation.
But in order to fully appreciate these advances, let's first explore
the mechanics of basic CPU operation.
Known today as the “classic Reduced Instruction Set Computer
or [RISC] pipeline”, this paradigm formed the basis for the first CPU
designs, such as the Intel 4004.
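The three-step loop described above can be sketched as a toy simulator. The 8-bit instruction encoding here (a 4-bit opcode plus a 4-bit immediate operand) is invented for illustration and doesn't correspond to the 4004 or any real CPU:

```python
# A toy fetch-decode-execute loop over a tiny program in "memory".
memory = [0x15, 0x23, 0x00]  # LOAD 5, ADD 3, HALT (invented encoding)
pc = 0   # program counter: address of the next instruction
acc = 0  # a single accumulator register

while True:
    instruction = memory[pc]                               # fetch
    pc += 1                                                # advance the program counter
    opcode, operand = instruction >> 4, instruction & 0xF  # decode into bitfields
    if opcode == 0:        # execute: HALT
        break
    elif opcode == 1:      # execute: LOAD immediate into accumulator
        acc = operand
    elif opcode == 2:      # execute: ADD immediate to accumulator
        acc += operand

print(acc)  # 8
```

Even this skeleton shows the essential shape: a counter walking through memory, each word split into fields, and a small dispatch deciding what the hardware does.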
In the fetch phase, the CPU loads the instruction it will
be executing into itself.
A CPU can be thought of as existing in an information bubble.
It pulls instructions and data from outside of itself, performs operations
within its own internal environment, and then returns data back.
This data is typically stored in memory external to the CPU called
Random Access Memory, or RAM.
Software instructions and data are loaded into RAM from
more permanent sources such as hard drives and flash memory.
But at one point in history magnetic tape, punch cards, and
even flip switches were used.
When a CPU loads a word of data it does so by requesting the
contents of a location in RAM.
This is called the data's address.
The amount of data a CPU can address at one time is determined
by its address capacity.
A 4-bit address, for example, can only directly address 16 locations of data.
Mechanisms exist for addressing more data than the CPUs address capacity,
but let's ignore these for now.
The mechanism by which data moves back and forth to RAM is called a bus.
A bus can be thought of as a multi-lane highway between
the CPU and RAM in which each bit of data has its own lane.
But we also need to transmit the location of the data we're requesting,
so a second highway must be added to accommodate both the size of
the data word and the address word.
These are called the data bus and address bus respectively.
In practice these data and address lines are physical electrical
connections between the CPU and RAM and often look exactly like a
superhighway on a circuit board.
When a CPU makes a request for RAM access, a memory control
region of the CPU loads the address bus with the memory word
address it wishes to access.
It then triggers a control line that signals a memory read request.
Upon receiving this request the RAM fills the data bus with the contents
of the requested memory location.
The CPU now sees this data on the bus.
Writing data to RAM works in a similar manner, with the CPU
posting to the data bus instead.
When the RAM receives a “write” signal, the contents of the data
bus are written to the RAM location pointed to by the address bus.
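The read/write handshake described above can be sketched in code. RAM is modeled as a plain list and the buses as function arguments, which is of course a simplification of the real electrical signaling:

```python
# RAM with a 4-bit address bus: 2**4 = 16 addressable locations.
RAM = [0] * 16

def memory_write(address_bus: int, data_bus: int) -> None:
    # "write" signal: store the data bus contents at the addressed location
    RAM[address_bus] = data_bus

def memory_read(address_bus: int) -> int:
    # "read" signal: RAM fills the data bus with the addressed contents
    return RAM[address_bus]

memory_write(0xA, 42)
print(memory_read(0xA))  # 42
```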
The address of the memory location to fetch is stored in the CPU,
in a mechanism called a register.
A register is a high speed internal memory word that is used as a
“notepad” by CPU operations.
It's typically used as a temporary data store for instructions
but can also be assigned to vital CPU functions, such as
keeping track of the current address being accessed in RAM.
Because registers are designed innately into the CPU's hardware, most
CPUs have only a handful of them.
Their word size is generally coupled to the CPU's native architecture.
Once a word of memory is read into the CPU, the register that stores
the address of that word, known as the Program Counter, is incremented.
On the next fetch, it retrieves the next instruction, in sequence.
Accessing data from RAM is typically the bottleneck of a CPU's operation.
This is due to the need to interface with components
physically distant from the CPU.
On older CPUs this doesn't present much of a problem, but as they
get faster the latency of memory access becomes a critical issue.
The mechanism of how this is handled is key to the advancement
of processor performance and will be examined in part 2 of this
series as we introduce caching.
Once an instruction is fetched the decode phase begins.
In classic RISC architecture, one word of memory forms
a complete instruction.
This changes to more elaborate methods as CPUs evolve to complex
instruction set architecture, which will be introduced
in part 2 of this series.
When an instruction is decoded, the word is broken down into
two parts known as bitfields.
These are called an opcode and an operand.
An opcode is a unique series of bits that represents a specific
function within the CPU.
Opcodes generally instruct the CPU to move data to a register, move
data between a register and memory, perform math or logic functions
on registers, or branch.
Branching occurs when an instruction causes a change in
the program counter's address.
This causes the next fetch to occur at a new location in memory as opposed
to the next sequential address.
When this “jump” to a new program location is guaranteed, it's
called an unconditional branch.
In other cases a test can be done to determine if a “jump” should occur.
This is known as a conditional branch.
The tests that trigger these conditions are usually mathematical,
such as whether a register or memory location is less than
or greater than a number, or whether it is zero or non-zero.
Branching allows programs to make decisions and is
crucial to the power of a CPU.
An opcode sometimes requires data to perform its operation on.
This part of an instruction is called an operand.
Operands are bits piggybacked onto an instruction to be used as data.
Let's say we wanted to add 5 to a register.
The binary representation of the number 5 would be embedded in the
instruction and extracted by the decoder for the addition operation.
When an instruction has a constant embedded within it,
it's known as an immediate value.
In some instructions the operand does not specify the value itself,
but contains the address of a location in memory to be accessed.
This is common in opcodes that request a memory word
to be loaded into a register.
This is known as addressing, and can get far more
complicated in modern CPUs.
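The opcode/operand split can be sketched as bitfield extraction. The 12-bit layout here, with a 4-bit opcode over an 8-bit operand, is invented for illustration:

```python
# Split a hypothetical 12-bit instruction word into its two bitfields.
def decode(word: int) -> tuple[int, int]:
    opcode = (word >> 8) & 0xF   # top 4 bits select the operation
    operand = word & 0xFF        # low 8 bits carry the data
    return opcode, operand

# "ADD immediate 5": the operand IS the value
print(decode(0x105))  # (1, 5)
# "LOAD from address 0x20": the operand is a memory address to fetch from
print(decode(0x220))  # (2, 32)
```

Whether the operand is treated as an immediate value or as an address is determined by the opcode, not by the bits themselves.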
Addressing can result in a performance penalty because of the
need to “leave” the CPU, but this is mitigated as CPU designs advance.
Once we have our opcode and operand, the opcode is matched by means of a
table and a combination of circuitry, and a control unit then configures
various operational sections of the CPU to perform the operation.
In some modern CPUs the decode phase isn't hardwired, and can be programmed.
This allows changes in how instructions are decoded and how the
CPU is configured for execution.
In the execution phase, the now-configured CPU is triggered.
This may occur in a single step or a series of steps
depending on the opcode.
One of the most commonly used sections of a CPU in execution is
the Arithmetic Logic Unit or ALU.
This block of circuitry is designed to take in two operands and
perform either basic arithmetic or bitwise logical operations on them.
The results are then output along with respective mathematical
flags, such as a carry-over, an overflow, or a zero result.
The output of the ALU is then sent to either a register or a location
in memory based on the opcode.
Let's say an instruction calls for adding 10 to a register and
placing the result in that register.
The control unit of the CPU will load the immediate value
of the instruction into the ALU,
load the value of the register into the ALU and connect the
ALU output to the register.
On the execute trigger the addition is done and the output
loaded into the register.
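The walkthrough above can be sketched as code. The 8-bit ALU and its flag logic here are a simplified illustration, not any specific CPU's design:

```python
# A sketch of an 8-bit ALU "add" that also produces status flags.
def alu_add(a: int, b: int) -> tuple[int, bool, bool, bool]:
    total = a + b
    result = total & 0xFF      # keep only the low 8 bits of the sum
    carry = total > 0xFF       # a bit carried out of the top position
    # signed overflow: both operands' signs match but the result's differs
    overflow = ((a ^ result) & (b ^ result) & 0x80) != 0
    zero = result == 0
    return result, carry, overflow, zero

# "Add 10 to a register, place the result back in that register":
register = 250
register, carry, overflow, zero = alu_add(register, 10)
print(register, carry, zero)  # 4 True False  (250 + 10 wraps past 255)
```

Note how the carry flag records the wrap-around that the 8-bit result alone cannot express; this is the kind of mathematical flag the ALU reports alongside its output.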
In effect, software distills down to a loop of configuring
groups of circuits to interact with each other within a CPU.
In a CPU these three phases of operation loop continuously, working their
way through the instructions of the computer program loaded in memory.
Gluing this looping machine together is a clock.
A clock is a repeating pulse used to synchronize a CPU's internal
mechanics and its interface with external components.
CPU clock rate is measured by the number of pulses per second, or Hertz.
The Intel 4004 ran at 740 kHz, or 740,000 pulses a second.
Modern CPUs can touch clock rates approaching 5GHz, or
5 billion pulses a second.
On simpler CPUs a single clock triggers the advance of the
fetch, decode, and execute stages.
As CPUs get more sophisticated these stages can take several
clock cycles to complete.
Optimizing these stages and their use of clock cycles is key to
increasing processing power and will be discussed in part 2 of this series.
The throughput of a CPU, the number of instructions that can be executed
per second, determines how “fast” it is.
By increasing the clock rate, we can make a processor go
through its stages faster.
However as we get faster we encounter a new problem.
The period between clock cycles has to allow for enough time
for every possible instruction combination to execute.
If a new clock pulse happens before an instruction cycle
completes, results become unpredictable and the program fails.
Furthermore, increasing clock rates has the side effect of increasing
power dissipation and heat buildup in the CPU, causing a
degradation of circuitry performance.
The battle to run CPUs faster and more efficiently has
dominated their entire existence.
In the next part of this series we'll explore the expansion of CPU designs
from that simple 2,300 transistor device of the 1970s, through the
microcomputing boom of the 1980s, and onward to the multi-million transistor
designs of the 90s and early 2000s.
We'll introduce the rise of pipelining technology, caching,
the move to larger bit CISC architecture, and charge forward
to multi GHz clock rates.