Computer Architectures for Nanoscale Devices
Contribution to the Emerging Devices Section of the 2004 ITRS
John Carruthers, Dan Hammerstrom, Bob Colwell, George Bourianoff, Victor Zhirnov
Rev 16, July 22, 2003
This paper describes the current state of computer architecture approaches that are
relevant to nanoscale devices. It will be published as part of the Emerging Research
Devices section of the 2004 International Technology Roadmap for Semiconductors after
review by the appropriate ITRS Working Groups and integration with other written
contributions to the ITRS. A preceding section of this paper on the state of computer
architectures has been excerpted and published as Part 1 in the SRC Cavin's Corner
section. The present paper will be published as Part 2 in Cavin's Corner.
Definition of Computer Architecture
The architecture of a computer is influenced both by application requirements and by
the capabilities of the technology on which it is implemented. In turn, computer
architecture and its implementation in hardware are the major determinants of which
technology is used. The tight coupling between computer architecture and technology
will continue as CMOS scaling below the 22-25 nm generation in 2015-2016 is extended by
quantum-dominated nanoscale technologies, represented by devices and interconnects that
operate by ballistic electron transport, tunneling electron transport, or even Coulomb
charge transfer.
It is necessary to clarify what is meant by computer architecture, since the term is used in
multiple contexts. Computer architecture is defined as "the minimal set of properties that
determine what programs will run and what results they will produce." The definition has
been expanded over the years to recognize a three-way distinction between architecture
(or behavior), implementation (or structure), and realization. For this discussion the term
architecture is used in the strict sense, and the structures that execute that architecture are
called implementations or organizations.
Architectures become established through the applications that make them useful and
cost-effective. The functional requirements of these applications determine the
appropriate instruction set or algorithm set. The associated infrastructure needed to
execute the instructions or algorithms includes the operating systems and compilers that
serve as the interface between the instruction sets and the program being computed. The
usefulness of any architecture is determined by its ability to meet the functional
requirements of the application as well as the price, performance, and power goals for the
particular market.
New nanoscale devices and implementations must add value to applications within the
context described above. Where there is an intersection between application needs and
device capabilities in satisfying those needs through improved or new
computer/communication architectures, insertion of the new device structures into the
mainstream of computer and communications hardware will be possible, provided that
the appropriate software infrastructure is developed.
General Architecture Principles for Nanoscale Devices
The following discussion applies primarily to nanoscale devices that are dominated by
quantum transport phenomena. Coherent quantum devices and architectures based on
energy transition probabilities and phases are still in the research phase and, although
mentioned briefly below, a detailed discussion must be deferred until later revisions of
the ITRS.
The characteristics of nanoscale devices and fabrication methods that must be considered
in developing appropriate circuits and computing architectures include regularity of
layout, unreliable device performance, device transfer functions, interconnect limitations,
and thermal power generation. The regular layout is a result of the self-assembly methods
that must be used at dimensions below those for which standard "top-down" processing
techniques are used. The device performance is a consequence of both the physical
principles and the inherent variability associated with the nanoscale, where it is estimated
that on the order of a percent of devices will not function adequately for useful circuits.
Device transfer functions include the need for gain, so that complex circuits can be
designed, as well as input/output relationships that are useful for circuit design.
Interconnect limitations come from two origins: the geometrical challenge of accessing
extremely small devices with connections that will transfer information at the needed
speed and bandwidth, and the transformation of interconnect dimensions from the
nanoscale to the physical world of realizable system connections. Thermal power
generation comes from the device switching energy and also from the energy needed to
drive signals through circuits. For electron transport devices such as MOS transistors, the
switching energy has a fundamental limit of about 10⁻¹⁸ J per switching transition at the
20 nm node, which will limit the useable combination of device density and speed, as
discussed below.
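To make the scale of this constraint concrete, the following back-of-the-envelope sketch (an illustration added for this discussion, not taken from any reference) estimates the clock rate at which dynamic power alone reaches an assumed heat-removal budget for a given device density. The 25 W/cm² budget and the 10% activity factor are illustrative assumptions; the 10⁻¹⁸ J switching energy is the value quoted above.

# Rough estimate: at what clock rate does dynamic power hit the thermal budget?
E_SWITCH_J = 1e-18         # switching energy per transition at the 20 nm node (from the text)
POWER_BUDGET_W_CM2 = 25.0  # assumed heat-removal budget per cm^2
ACTIVITY = 0.1             # assumed fraction of devices switching each cycle

def max_clock_hz(density_per_cm2):
    """Clock rate at which dynamic power just reaches the assumed budget."""
    return POWER_BUDGET_W_CM2 / (E_SWITCH_J * density_per_cm2 * ACTIVITY)

for density in (1e8, 1e10, 1e12):   # devices/cm^2, up to the terascale level
    print(f"{density:.0e} devices/cm^2 -> max clock ~ {max_clock_hz(density):.1e} Hz")

At terascale densities the affordable switching rate falls well below today's clock frequencies, which is why the organizations discussed below emphasize locality and the efficient use of many concurrent devices rather than raw clock speed.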
The limitations of nanoscale devices impose restrictions on the organizations that are
available for future architectures. Local computing tiles composed of simple device
structures have been proposed that are interconnected with nearest neighbors through
crossbar interconnect arrangements that bound the devices. Such organizations satisfy
constraints on device gain and interconnect parasitics. Other organizations are based on
molecular devices inspired by biological systems with much larger circuit fanout than
used in today’s technology. Such circuits work by using chemical regulatory approaches.
For all nanoscale organizations, the management of defective devices will be a critical
element of any future architecture, since defect rates are expected to be much higher than
in current practice.
Specific Architectures for Nanoscale Devices
This section describes the coupling of future nanoscale devices to new applications and
the architectures to support them. Please refer to Table 46b below for a summary.
a) Fine-Grained Parallel Implementations in Nanoscale Cellular Arrays
For nanoscale devices, the integration level will be terascale (10¹² devices/cm²). For this
large number of devices, many new information processing and computing capabilities
are possible in principle that would not be considered at the gigascale level of integration.
Architectures built on electron transport devices must accommodate characteristics such
as high output impedance and contact resistance, as well as the inability to provide global
interconnects due to parasitic RC limitations, which require interconnect cross-sections
not to scale and line lengths to be short. Furthermore, patterning techniques will not allow
the random layouts of present logic circuits. Thus the choice of devices and their
organization will need to be quite different from current practice. These devices will need
to be interconnected mostly locally and patterned in grids or arrays of cells using
techniques such as directed self-assembly. Devices such as quantum dots interconnected
in regular arrays by local Coulomb charge interactions are being considered for terascale
densities.
Two architecture implementations proposed for these cellular arrays are quantum cellular
automata (QCA) and cellular nonlinear networks (CNN). QCA implementations can
actually be regarded as a subset of CNN's²; but since they evolved separately, we shall
treat them separately below. These implementations are particularly useful for hybrid
analog/digital systems with data structures that map well to parallel processing.
i) Quantum Cellular Automata Architecture Implementations (QCA)
The QCA paradigm is one in which a regular array of cells, each interacting with its
neighbors, is employed in a locally interconnected manner. Such cells are typically
envisioned to be electrostatically coupled quantum dots or magnetic-field-coupled
nanomagnets; ongoing research is exploring QCA in various molecular structures as
well³. Because the cells couple to their neighbors directly, there are no wires in the signal
paths. If QCA cells are arranged in a closely packed grid, then long-established cellular
automaton theory can be used to implement specific information processing algorithms.
QCA can also be extended to the cellular nonlinear (neural) networks discussed below.
Thus a large body of theoretical algorithm implementations can be applied to QCA
arrays. By departing from close-packed, regular grid structures, it is possible to use
QCA's to carry out general logic functions and universal computing with modest
efficiency. In addition to non-uniform layouts, QCA's need a spatially non-uniform
"adiabatic clocking field" that controls cell switching from one state to another and
allows the cells to evolve rapidly to a stable end state. The clock also produces some gain,
non-linearity, and isolation between neighboring parts of a circuit. It is possible to
construct a complete set of Boolean logic gates with QCA cells and to design arbitrary
computing structures, and current device and circuit analyses indicate that the clock speed
of QCA's may be extendable to the THz regime. The energy per switching transition,
adjusted for the required cooling energy, is expected to be of order 3×10⁻¹⁹ J to
3×10⁻¹⁵ J⁴ at the 100 GHz mark (as compared to the 10⁻¹⁸ J projected for CMOS at the
20 nm node). Although there will be no interconnect capacitance associated with these
structures, there will be a significant capacitance associated with the inter-dot size and
spacing geometries.

² W. Porod, C.S. Lent, G. Toth, A. Csurgay, Y.-F. Huang, and R.-W. Liu, IEEE Abstracts, p. 745, 1997.
³ C.S. Lent, Science, 288, 1597 (2000).
Power gain in QCA's has been demonstrated by using energy from the clock.⁵ Physically
this occurs because the clock must do some work on a slightly unpolarized cell during the
latching phase of the switching operation.
From a manufacturing viewpoint, the tolerances for metal dot quantum device arrays are
very small and beyond projected manufacturing capabilities. From an operational
viewpoint, quantum effects using metal dots will require temperatures from 70 mK to 20 K
and therefore may not be suitable for widespread applications. Molecular QCA structures
can operate at room temperature or above due to the sub-nanometer size of the
zero-dimensional metal cluster⁶; however, defect tolerance issues remain to be explored.
Applications requiring very low power will benefit from QCA implementations.
However, if room temperature operation is desired, then molecular structures will be
needed and the switching speed and defect tolerance must be adequate for applications
and architectures to be useful. Such investigations are currently underway at several
universities.
ii) Cellular Nonlinear Networks (CNN)
A CNN is an array of mainly identical dynamical systems, called cells, that satisfy two
properties: 1) most interactions are local, within a distance of one cell dimension, and 2)
the state variables are continuous-valued signals (not digital). A template specifies the
interaction between each cell and all its neighbor cells in terms of their input, state, and
output variables. The interaction between the variables of one cell may be either a linear
or a nonlinear function of the variables associated with its neighbor cells. A cloning
function determines how the template varies spatially across the grid and determines the
dynamical response of the array to boundary values and initial conditions. Since the range
of interaction and the connection complexity of each cell are independent of the number
of cells, the architecture is extremely scalable, reliable, and robust. Programming the
array consists of specifying the dynamics of a single cell, the connection template, and
the cloning function of the templates. This approach is simpler than traditional VLSI
design methodology since the functional components are simple and reusable.

⁴ J. Timler and C.S. Lent, "Power Gain and Dissipation in Quantum-Dot Cellular Automata," J. Appl. Phys., 91, 823 (2002).
⁵ Ibid.
⁶ C.S. Lent, B. Isaksen, and M. Lieberman, "Molecular Quantum-Dot Cellular Automata," J. Amer. Chem. Soc., 125, 1056 (2003).
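As an illustration of this programming model, the following minimal sketch (the standard Chua-Yang CNN dynamics integrated with a simple forward-Euler step, added here for illustration rather than drawn from this paper) applies one feedback template A and one control template B uniformly across the grid, i.e., a space-invariant cloning function. The specific template values, grid size, and boundary handling are arbitrary assumptions.

import numpy as np
from scipy.signal import convolve2d

def cnn_step(x, u, A, B, z, dt=0.1):
    """One forward-Euler step of dx/dt = -x + A*y + B*u + z,
    with the piecewise-linear cell output y = 0.5*(|x+1| - |x-1|)."""
    y = 0.5 * (np.abs(x + 1) - np.abs(x - 1))
    dx = -x + convolve2d(y, A, mode="same") + convolve2d(u, B, mode="same") + z
    return x + dt * dx

# Illustrative templates (assumed values, roughly edge-detection-like)
A = np.array([[0, 0, 0], [0, 2, 0], [0, 0, 0]], dtype=float)
B = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float)
z = -0.5

u = np.random.rand(32, 32)   # input image, values in [0, 1]
x = np.zeros_like(u)         # initial cell states
for _ in range(50):          # let the array settle toward a stable output
    x = cnn_step(x, u, A, B, z)

Specifying a different cell dynamic, template pair, or cloning function changes the computation performed by the whole array without changing the array itself, which is the sense in which this approach is simpler than conventional VLSI design.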
CNN’s can be used to implement Boolean logic as well as more complex functions such
as majority gates, MUX gates, and switches. CNN’s can simulate many mathematical
problems such as diffusion and convection and nervous system functions. The CNN
organization also lends itself to implementing defect management techniques as
discussed below. Devices that can be used include quantum dots (QCA's)⁷,⁸, SET's, and
RTD's. Tunneling phase logic has been combined with CNN to enable neural-like spike
switching waveforms and ultra-low power dissipation.
One caution concerning CNN's is that, despite the potential applications discussed above,
the only published application to date has been analog image processing. However,
algorithms for pattern recognition and analysis can be implemented very efficiently in
CNN's.
b) Defect Tolerant Architecture Implementations
The goal of defect tolerant implementations is to enable reliable circuits and computing
from unreliable devices. Defect tolerance is distinct from fault tolerance, which implies
the ability of a machine to recover from errors made during a calculation. Defects can
occur as permanent defects from hardware manufacturing and as transient defects such as
random charges that affect single electron transistors. Defective devices may be
functional but still not meet the tolerance and reliability requirements for effective
large-scale circuit operation. These effects are expected to be particularly acute for
quantum-dominated devices at the nano- and molecular scale and will require significant
resources to control⁹.
It is expected that the invention of nanometer-scale devices could eventually permit
extremely large scales of integration, of the order of 10¹² devices per chip. However, it
will almost certainly be very difficult to make nanoscale circuits in which every device
functions correctly. Furthermore, it is probable that the proposed nanoelectronic devices
will be more fragile than conventional devices and will be sensitive to external influences.
Hence, fault-tolerant architectures will certainly be necessary in order to produce reliable
systems that are immune to manufacturing defects and to transient errors.
Several techniques exist for overcoming the effects of inoperative devices. All of them
use the concept of redundancy (in resources or in time). The most representative
techniques are R-fold modular redundancy (RMR)¹⁰, NAND multiplexing (NAND-M)¹¹,
and reconfiguration (RCF)¹². The effectiveness of RCF was successfully demonstrated on
the massively parallel computer 'Teramac'¹³. An analysis of the fault tolerance of
nanocomputers was presented in reference 14. The two characteristic parameters of a
fault-tolerant architecture are the amount of redundancy R and the allowable failure rate
per device p_f.

⁷ G. Toth, C.S. Lent, P.D. Tougaw, Y. Brazhnik, W.W. Weng, W. Porod, R.W. Liu, and Y.F. Huang, "Quantum Cellular Neural Networks," Superlattices and Microstructures, 20(4), 473-478 (1996).
⁸ A.I. Csurgay, "Signal Processing with Near Neighbor Coupled Time Varying Quantum Dot Arrays," IEEE Trans. Circuits and Systems-I: Fundamental Theory and Applications, 47, 1212 (2000).
⁹ J.R. Heath et al., "A Defect Tolerant Computer Architecture: Opportunities for Nanotechnology," Science, 280, 1716 (1998).
In this context, redundancy usually means static redundancy: redundant rows and
columns for example. Dynamic redundancy is used to catch and correct problems “on the
fly” and is a more expensive use of resources. It is not clear how much dynamic
redundancy will be needed at the nano- and molecular levels until new computing models
are developed.
The choice of fault-tolerant scheme may be both manufacturing and application specific.
For example, although the RMR technique is the least effective, with a redundancy level
of R = 5 the same level of chip reliability can be achieved with devices that are four
orders of magnitude less reliable. The price for this improvement is that the effective
number of devices is reduced to N/5 (and p_f for each device must still be smaller than
10⁻⁹ for N = 10¹² devices).
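This trade-off can be made concrete with a simple reliability estimate. The sketch below (an added illustration, not a calculation from the cited references) assumes independent device failures and a perfect majority voter, which overstates the benefit of RMR; it is meant only to show how grouping N devices into N/R voted groups of R replicas buys tolerance of a larger p_f at the cost of effective device count.

from math import comb

def group_failure_prob(p_f, R):
    """Probability that a majority of the R replicas in one voted group fail."""
    k_min = R // 2 + 1
    return sum(comb(R, k) * p_f**k * (1 - p_f)**(R - k) for k in range(k_min, R + 1))

def chip_success_prob(p_f, R, N):
    """Probability that every one of the N/R voted groups delivers a correct output."""
    return (1.0 - group_failure_prob(p_f, R)) ** (N / R)

N = 1e12   # terascale device count from the text
for p_f in (1e-12, 1e-9, 1e-6):
    print(f"p_f = {p_f:.0e}: no redundancy {chip_success_prob(p_f, 1, N):.3f}, "
          f"R = 5 {chip_success_prob(p_f, 5, N):.3f}")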
On the other hand, the reconfigurable computer can in principle handle extremely large
manufacturing defect rates, in the limit even approaching unity, but only at the expense of
colossal amounts of redundancy for defect rates as high as 0.1 (i.e., 10%). If one wishes to
fabricate a chip containing the equivalent of many present-day workstations, then the
device failure rate during manufacturing must be smaller than 10⁻⁵. This may be difficult
to achieve for nanoscale devices.
RMR and NAND-M are in general not as effective as reconfiguration. However, if the
dead devices cannot be located during manufacture, then a fault tolerant strategy must be
adopted that allows a chip to work even with many (either temporarily or permanently)
faulty devices. Furthermore, reconfiguration might be very time consuming for protecting
against transient errors that occur in service, and may therefore demand temporary
shutdown of the system until reconfiguration is performed. It may also be necessary to
use NAND multiplexing if reconfiguration methods are impractical or if the probability of
transient errors is very high.
RMR provides some benefits, but these are unlikely to be useful for chips with 10¹²
devices once the manufacturing defect rate is greater than about 10⁻⁸. The NAND-M
technique in principle would allow chips with 10¹² devices to work even if the fault rate
is as high as 10⁻³ per device. However, this needs even more redundancy than the
reconfiguration technique.
¹⁰ P.G. Depledge, "Fault-tolerant computer systems," IEE Proc., 128, 257-272 (1981).
¹¹ S. Spagocci and T. Fountain, "Fault rates in nanochip devices," Electrochem. Soc. Proc., 99, 354-368 (1999).
¹² J. von Neumann, "Probabilistic logics and the synthesis of reliable organisms from unreliable components," in Automata Studies, ed. C.E. Shannon and J. McCarthy (Princeton, NJ: Princeton University Press, 1955), pp. 43-98.
¹³ J.R. Heath, P.J. Kuekes, G.S. Snider, and R.S. Williams, "A defect-tolerant computer architecture: opportunities for nanotechnology," Science, 280, 1716-1721 (1998).
¹⁴ K. Nikolic, A. Sadek, and M. Forshaw, "Fault-tolerant techniques for nanocomputers," Nanotechnology, 13, 357-362 (2002).
The implication of these results is that the future usefulness of various nanoelectronic
devices may be seriously limited if they cannot be made in large quantities with a high
degree of reliability. The results show that it is theoretically possible to make very large
functional circuits, even with one dead device in ten, but only if the dead devices can be
located and the circuit reconfigured to avoid them. Even so, this technique would require
a redundancy factor of ~10,000; that is, a chip with 10¹² non-perfect devices would
perform as if it had only 10⁸ perfect devices. If it is not possible to locate the dead
devices, then one of the other two techniques would have to be used. These would require
the manufacturing and lifetime failure rate for R = 1000 to be between 10⁻⁷ and 10⁻⁶.
c) Biologically-Inspired Architecture Implementations
Biologically-inspired computing implies emulation of human and biological reasoning
functions. Such architectures possess basic information processing capabilities that are
organized and reorganized in goal-directed systems. The living cell is the biological
example of a goal-directed organism and has the features of flexibility, adaptability,
robustness, autonomy, situation-awareness, and interactivity. The self-organization of a
biological cell is responsible for its own survival, destruction, replication, and
differentiation into multicellular forms, all under the direction of goals encoded in its
genes. The programming model does not involve millions of lines of code but rather
modules of encoded instructions that are activated or deactivated by regulatory modules
to act in concert with an overall goal-directed system. Algorithms inspired by
computational neurobiology have been the first approach to computing systems that
exhibit such behavior, implemented either as specialized processors or on general-purpose
architectures. However, there is an enormous gap in our understanding of how biological
pathways or circuits function, so there is much learning ahead of us before this
knowledge can be captured in useable computing systems.
At the nanoscale, devices are more stochastic in operation and quantum effects become
the rule rather than the exception. It is unlikely that existing computational models will
be an optimal mapping to these new devices and technologies, and this is the motivation
for biologically inspired algorithms. Neural circuits use loosely coupled, relatively slow,
globally asynchronous, distributed computing with unreliable (and occasionally failing)
components. Furthermore, even simple biological systems perform highly sophisticated
pattern recognition and control. Biological systems are self-organizing, tolerant of
manufacturing defects, and they adapt, rather than being programmed, to their
environments. The problems they solve involve the interaction of an organism/system
with the real world¹⁵.
Biological systems are also inherently low power at these relatively slow speeds. The
human brain is known to consume 10-30W in performing its functions at millisecond
timeframes that are compatible with the physiological processes being controlled.
¹⁵ G. Palm et al., "Neural Associative Memories," in Associative Processing and Processors, A. Krikelis and C.C. Weems, eds., IEEE Computer Society, Los Alamitos, CA, 1997, pp. 284-306.
The interconnect capabilities of biologically-inspired architectures are the key to their
massive parallelism. The connectivity of neurons in humans provides the best known
example of this: one cubic millimeter of cortex contains about 10⁵ neurons and 10⁹
synapses (10⁴ synapses/neuron), and the human nervous system has about 10¹² neurons
and 10¹⁵ synapses (10³ synapses/neuron). Thus the fan-out per neuron ranges from 1,000
to 10,000 in humans¹⁶. This amounts to about 1-10 synapses/µm³. Most neurons are not
connected to nearest neighbors but rather to different cell classes required to execute the
goal-directed function. This enormous interconnectivity requires a much different
approach to managing information and algorithmic complexity than we implement in
current computing systems. The large fan-out will require either large-gain devices or
circuit approaches based on additional signal processing inputs such as the regulatory
enzymes of biological reaction pathways.
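The connectivity figures quoted above follow directly from the cortex numbers; the short consistency check below simply recomputes them from the values given in the text.

NEURONS_PER_MM3 = 1e5     # neurons in one cubic millimeter of cortex (from the text)
SYNAPSES_PER_MM3 = 1e9    # synapses in the same volume (from the text)
UM3_PER_MM3 = 1e9         # cubic micrometers per cubic millimeter

fan_out = SYNAPSES_PER_MM3 / NEURONS_PER_MM3   # ~1e4 synapses per neuron locally
density = SYNAPSES_PER_MM3 / UM3_PER_MM3       # ~1 synapse per cubic micrometer

print(f"local fan-out ~ {fan_out:.0e} synapses/neuron, density ~ {density:.0f} synapse/um^3")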
Some of the most important problems in computing involve teaching computers to act in
a more intelligent manner. Key to this is the efficient representation of knowledge or
contextual information. A variety of highly parallel algorithms are being studied,
including neuromorphic structures, with the intent of enhancing the representation and
manipulation of knowledge in silicon. As discussed previously, realistic systems based on
biological precepts are a long way from implementation until we learn more about the
function of the constituent elements and their organization in goal-directed applications.
The feasibility of using nanoscale electronic devices and interconnects to implement such
massively parallel, adaptive, self-organizing computational models is an active research
area. In general, such architectures should be of interest to complex digital and intelligent
signal processing applications such as advanced human computer interfaces. These
interfaces will include elements such as computer recognition of speech, textual, and
image content as well as problems such as computer vision and robotic control. These
classes of problems require computers to find complex structures and relationships in
massive quantities of low-precision, ambiguous, and noisy data.
Implementations of biologically-inspired systems can be either entirely analog or digital,
or a hybrid of the two. Each has its advantages and disadvantages. Analog
implementations are denser than digital ones, and many of the algorithmic operations that
often appear in this class of algorithms, such as leaky integration, can be implemented
very efficiently in analog. Analog can also be much more efficient in terms of power per
operation. Digital representation of computations allows more flexibility and allows
expensive computer hardware to be multiplexed among a number of network nodes. This
is particularly
attractive when the network is sparsely activated. On the other hand, analog is much
harder to design and debug due to the lack of mature design tools. Also analog quantities
are much more difficult to store reliably and bit precision may not be acceptable with the
small numbers of electrons and low values of voltage and current. Digital
implementations use many more transistors and power per operation and must eventually
interface with analog signals from the real world.
¹⁶ P.S. Churchland and T.J. Sejnowski, The Computational Brain, The MIT Press, 1992, ISBN 0-262-03188-4.
The communications functions, even in analog systems, are best performed digitally.
Most neurons communicate via inter-spike intervals, using the time between pulses rather
than a current or voltage level to represent a signal. This type of signaling is very noise tolerant and
scales cleanly to single electron systems. Representing addresses in digital forms, such
as packets, means that dedicated metal interconnect wires are not required and that the
network can grow without adding new wires. Also multiplexing schemes for increasing
bandwidth are enabled by digital systems. However single-electron systems do not have
the gain required to drive large fanout circuits typical of biological implementations.
Very little work has been performed on nanoscale devices and circuits that would provide
such functions.
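The following minimal sketch (an added illustration, not a scheme proposed in this paper) shows the two digital signaling ideas mentioned above: representing an analog value by an inter-spike interval, and tagging each spike with a source address so that many logical connections can share one physical channel. The interval range and address format are arbitrary assumptions.

from dataclasses import dataclass

@dataclass
class SpikeEvent:
    source_address: int   # which unit fired; no dedicated wire per logical connection
    interval_s: float     # time since that unit's previous spike

def encode(value, t_min=1e-6, t_max=1e-3):
    """Map a normalized value in [0, 1] to an inter-spike interval
    (larger value -> shorter interval), within an assumed interval range."""
    value = min(max(value, 0.0), 1.0)
    return t_max - value * (t_max - t_min)

def decode(interval_s, t_min=1e-6, t_max=1e-3):
    """Recover the normalized value from the interval."""
    return (t_max - interval_s) / (t_max - t_min)

event = SpikeEvent(source_address=42, interval_s=encode(0.8))
print(event, decode(event.interval_s))   # ~0.8 recovered from timing alone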
d) Coherent Quantum Computing
Coherent quantum devices rely on the phase information of quantum wavefunctions to
store and manipulate information. The quantum state that stores this information is called
a qubit and is extremely sensitive to its external environment: it is easily connected, or
entangled, with the quantum states of particles in the local environment, and no physical
system can ever be completely isolated from its environment. The same sensitivity,
however, can be used to entangle adjacent qubits in ways that can be controlled by
physical gates. The core idea of quantum information processing, or quantum computing,
is that each individual component of a superposition of wavefunctions is manipulated in
parallel, thereby achieving massive speed-up relative to conventional computers. The
challenge is to manipulate the wavefunctions so that they perform a useful function and
then to find a way to read out the result of the calculation.
Essentially there have been three approaches for the implementation of quantum
computers:
1) Bulk resonance quantum implementations including nuclear magnetic
resonance, linear optics, and cavity quantum electrodynamics (CQED)
2) Atomic quantum implementations including trapped ions and optical lattices
3) Solid-state quantum implementations including semiconductors and
superconductors
Decoherence is a major issue: qubits lose their quantum properties exponentially quickly
in the presence of a constant amount of noise per qubit. The decoherence per operation
ranges from 10⁻³ for electron charge states in semiconductors to 10⁻⁹ for photons, 10⁻¹³
for trapped ions, and 10⁻¹⁴ for nuclear spins.
The emphasis of this description is on solid-state implementations with a focus on
semiconductors, since these are the most attractive for developing the required
manufacturing process control and commercial products.
As stated above, a fundamental notion in quantum computing is the “qubit,” a concept
that parallels the "bit" in conventional computation but carries with it a much broader
set of representations. Rather than a finite-dimensional binary representation for
information, the qubit is a member of a two-dimensional Hilbert space containing a
continuum of elements. Thus quantum computers operate in a much richer space than
binary computers. Researchers have defined many sets of elementary quantum gates
based on the qubit concept that perform mappings from the set of input quantum registers
to a set of output quantum registers. A single gate can entangle the qubits stored in two
adjacent quantum registers and combinations of gates can be used to perform more
complex computations. It can be shown that, just as in Boolean computation, there exist
minimal sets of quantum gates that are complete with respect to the set of computable
functions. Considerable research has been conducted to define the capabilities of
quantum computers. Theoretically quantum computers are not inferior to standard
computers of similar complexity and speed of operation. More interesting is the fact that
for some important classes of problems the quantum computer is superior to its standard
counterpart. In particular, it was shown that the two prime factors of a number can be
determined by a quantum computer in time proportional to a polynomial in the number of
digits in the number.¹⁷ This truly remarkable result showed that for this particular class
of problems, the quantum computer is at least exponentially better than a standard
computer. The key to this result is the capability of a quantum computer to efficiently
compute the quantum Fourier transform. This result has immediate application in
cryptography since it would allow the quick determination of keys to codes such as RSA.
It is estimated that a few thousand quantum gates would be sufficient to break a
representative RSA code containing on the order of one hundred digits. There are
several other applications that are variants of the factorization problem.¹⁸
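The qubit and gate concepts described above can be made concrete with elementary state-vector linear algebra. The sketch below (an added illustration using standard textbook gate matrices) applies a Hadamard gate to put one qubit into superposition and a controlled-NOT gate to entangle it with a second qubit, producing a Bell state.

import numpy as np

ket0 = np.array([1, 0], dtype=complex)                         # |0>, a basis state of the 2-D Hilbert space
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)    # Hadamard gate
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)                 # controlled-NOT on two qubits

state = np.kron(ket0, ket0)            # two-qubit register |00>
state = np.kron(H, np.eye(2)) @ state  # superpose the first qubit
state = CNOT @ state                   # entangle the two qubits
print(np.round(state, 3))              # ~[0.707, 0, 0, 0.707]: the Bell state (|00> + |11>)/sqrt(2)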
The development of a practical architecture for reliable quantum computers is just
beginning¹⁹. Elementary architecture implementation concepts such as quantum storage,
data paths, classical control circuits, parallelism, programming models, and system
integration are not yet available. The overhead requirement for quantum error correction
is a daunting problem; the error probability for a quantum operation can be as high as 10⁻⁴
and requires heroic efforts to manage. Improvements in error correction codes are now
being researched, but their impact is not yet known. Practical architectures will require
error rates between 10⁻⁶ and 10⁻⁹.
A minimum set of architecture building blocks has been proposed²⁰: a quantum
arithmetic logic unit, quantum memory, and a dynamic scheduler. In addition, the
architecture implementation uses a novel wiring technique that exploits quantum
teleportation. In this wiring, the desired operation is performed simultaneously with the
transport.
¹⁷ P.W. Shor, "Algorithms for quantum computation: discrete logarithms and factoring," Proc. 35th Annual Symposium on Foundations of Computer Science, IEEE Computer Society Press (1994), pp. 124-134.
¹⁸ C.P. Williams and S.H. Clearwater, Explorations in Quantum Computing (Springer-Verlag, New York, 1998).
¹⁹ M. Oskin, F.T. Chong, and I.L. Chuang, IEEE Computer, January 2002, p. 79.
²⁰ See previous reference.
Table 46b  Emerging Research Architecture Implementations

The table compares five architecture implementations: quantum cellular automata (QCA) and cellular nonlinear networks (CNN), which are the cellular array implementations; defect tolerant implementations; biologically inspired implementations; and coherent quantum computing.

APPLICATION DOMAIN
  QCA: Not yet demonstrated
  CNN: Fast image processing; associative memory; complex signal processing
  Defect tolerant: Reliable computing with unreliable devices (such as SET's with background noise); historical examples include WSI, Teramac, and FPGA implementations
  Biologically inspired: Goal-driven computing using simple and recursive algorithms; high computational efficiency through data compression algorithms
  Coherent quantum computing: Special algorithms such as factoring and deep data searches

DEVICE AND INTERCONNECT IMPLEMENTATIONS
  QCA: Arrays of nanodots or molecular assemblies
  CNN: Resonant tunneling devices
  Defect tolerant: Molecular switches; crossed arrays of 1-D structures; switchable interconnects
  Biologically inspired: Molecular organic and biomolecular devices and interconnects
  Coherent quantum computing: Spin resonance transistors; NMR devices; single flux quantum devices; photonics

DESIRABLE FUNCTIONAL CHARACTERISTICS AND CHALLENGES

INFORMATION THROUGHPUT
  QCA: Fan-out = 1; functional throughput constrained by inter-dot capacitances
  CNN: Fan-out close to unity
  Defect tolerant: Fan-out variable, but performance degraded by the need for defect management schemes
  Biologically inspired: Massive parallelism; requires some long-range data transfer; fan-out very high in brains (maximum = 10⁴, average = 10³)
  Coherent quantum computing: Exponential performance scaling; presently limited to 5 qubits, but 50-100 qubits needed for large computations

POWER
  QCA: Power comparable to scaled CMOS (0.1-0.5 MIPS/mW); data streaming applications will need 10-100 MOPS/mW
  CNN: Power comparable to scaled CMOS (0.1-0.5 MIPS/mW); data streaming applications will need 10-100 MOPS/mW
  Defect tolerant: Not demonstrated yet
  Biologically inspired: High parallelism results in lower operational speeds; power consumption of the human brain is 10-30 W at millisecond rates
  Coherent quantum computing: Not demonstrated yet for large-scale computations

INTERCONNECTS
  QCA: No local interconnects
  CNN: Local interconnects with neuron-like waveforms
  Defect tolerant: Interconnects by crossed arrays
  Biologically inspired: Interconnects distributed over a range of distances
  Coherent quantum computing: Interconnects through wavefunction coupling and entangled states

DEFECT TOLERANCE
  QCA: Not demonstrated
  CNN: Not yet determined
  Defect tolerant: Techniques used include redundancy, NAND multiplexing, and reconfiguration
  Biologically inspired: Inherently insensitive to defects through adaptive algorithms
  Coherent quantum computing: Error correcting algorithms needed

ERROR TOLERANCE
  QCA: Sensitive to background charge; low temperature operation
  CNN: Not yet determined
  Defect tolerant: Multiple modular redundancy and multiplexing for transient errors
  Biologically inspired: Highly dynamical neural-like systems implement adaptive self-organization and fault tolerance
  Coherent quantum computing: Error correction costs are high

MANUFACTURABILITY
  QCA: Precise dimensional control needed; tight tolerances on tunnel rates of all junctions to minimize jitter; self-assembly possible
  CNN: Not yet demonstrated
  Coherent quantum computing: NMR quantum computing demonstrated with 6 qubits

TEST
  QCA: Not demonstrated
  CNN: Demonstrated only for image processing
  Defect tolerant: Self-test, or requires extensive pre-computing of tests
  Biologically inspired: Test functions are included in the adaptive algorithms used
  Coherent quantum computing: Test not possible directly

MATURITY
  QCA: Demonstration
  CNN: Demonstration
  Defect tolerant: Demonstration
  Biologically inspired: Concept
  Coherent quantum computing: Concept

RESEARCH ACTIVITY (2001-2003)
  QCA: 25 research papers
  CNN: 92 research papers
  Defect tolerant: 10 research papers
  Biologically inspired: 12 research papers
  Coherent quantum computing: 976 research papers

REMARKS
  QCA: Cell and array design immature (no fan-out); no programming model yet
  CNN: Locally active and locally connected; no programming model yet
  Defect tolerant: Supports memory-based computing; applications in dependable systems
  Biologically inspired: Goal-directed programming model; backed by extensive neural network research; algorithmic implementations need more research support
  Coherent quantum computing: Extreme application limitation; no general-purpose architecture or programming model yet
Architecture Performance Limitations
The metrics for evaluating the maturity and relevance of architectural implementations of
nanoscale technologies are difficult to make quantitative. Each of the architecture
implementations in Table 46b is radically different from the others and must be evaluated
with different metrics. Normally, benchmark programs are used to evaluate
implementations. These benchmarks, such as SPECint for microprocessors, must be able
to measure the results of "typical" computations for any given application. However,
benchmark programs also have problems because they are usually so specific that scaling
of architectures for extended applications and extension of architectures into new
applications are not measured and in fact may even be discouraged. Benchmark circuits
can be identified that include memory and register cells, some arithmetic, and long-range
buses. For any technology option to be considered seriously, some benchmark measures
must be devised and agreed to by end users. In the case of emerging research devices,
benchmark logic circuits such as fan-out = 4 circuits and simple programming models that
generate instructions or operations per second (MIPS or MOPS) are the most
recognizable metrics. Increasingly, such metrics must also recognize power limitations.
The most recent targets for processors range from 0.1 MOPS/mW for MPU's to
1000 MOPS/mW for custom-designed DPU's (data processing units).²¹ Any
nanotechnology and its associated computing model must meet or exceed these targets to
be viable. Another measure commonly used is MOPS/mm² or, in the case of 3-D
systems, MOPS/mm³. The current targets for architectures in CMOS silicon are 1-1000
MOPS/mm², depending on the specific implementation being used. The upper bound is
obviously limited by power dissipation.
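As a simple illustration of how these metrics would be applied, the sketch below checks a hypothetical design point against the targets quoted above. The candidate throughput, power, and area figures are invented placeholders, not data for any real device or architecture.

TARGET_MOPS_PER_MW = {"MPU": 0.1, "custom DPU": 1000.0}   # processor targets from the text
TARGET_MOPS_PER_MM2 = (1.0, 1000.0)                       # CMOS area-efficiency target range

# Hypothetical nanoscale fabric: 5e4 MOPS delivered in 50 mW on 10 mm^2 (assumed numbers)
mops, power_mw, area_mm2 = 5e4, 50.0, 10.0

power_efficiency = mops / power_mw    # MOPS/mW
area_efficiency = mops / area_mm2     # MOPS/mm^2
print(f"{power_efficiency:.1f} MOPS/mW, {area_efficiency:.1f} MOPS/mm^2")
print("meets custom-DPU power target:", power_efficiency >= TARGET_MOPS_PER_MW["custom DPU"])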
Future Architecture/Implementation/Technology Tradeoffs
The physical limits of CMOS-based integrated circuits are summarized in Figure 1, taken
from the work of Hadley and Mooij of Delft University²². The average delay per device is
plotted against device density for CMOS and for quantum devices based on electron
transport, and compared to several limits. This average delay is longer than the clock
period, since not every device switches on every clock cycle; the average delay depends
on the circuit design.

The dissipation limit is set by the removal of heat from the gates and is based on an
average value of 25 W/cm². The technology that comes closest to the dissipation limit will
deliver the most computational power.

The quantum limit arises when the energy necessary to switch a bit approaches the limit
given by the Heisenberg uncertainty principle; the delay must be about 2×10⁻¹⁵ s or longer
for circuits to be stable against quantum fluctuations. For the device itself this number
should be 10× greater (2×10⁻¹⁴ s). The quantum limit intersects the dissipation limit at a
device density of about 10⁷ to 10⁸ devices/cm². This is close to the current device density
of CMOS circuits. Thus, at room temperature operation, CMOS devices at current
densities will not be quantum noise limited.
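The crossover quoted above can be reproduced approximately with the rough estimate sketched below. It assumes that every device switches continuously at the per-device delay limit of 2×10⁻¹⁴ s with a switching energy at the Heisenberg bound, so it is only an order-of-magnitude illustration; real circuits switch far less often.

HBAR = 1.055e-34        # reduced Planck constant, J*s
P_LIMIT_W_CM2 = 25.0    # dissipation limit from the text
DELAY_DEVICE_S = 2e-14  # per-device delay limit quoted above

power_per_device = HBAR / DELAY_DEVICE_S**2        # energy hbar/delay dissipated every delay
density_at_crossover = P_LIMIT_W_CM2 / power_per_device
print(f"crossover density ~ {density_at_crossover:.1e} devices/cm^2")   # ~1e8, within the quoted 1e7-1e8 range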
The relativistic limit is caused by the finite speed of light. No information can be
transported over 1 cm in a time less than 0.3 ns. For shorter delays, the distance traveled
must be correspondingly less, and the micro- and nanoarchitecture implementations must
become more localized. Of course, parasitic RC delays will be longer than the relativistic
limit. Such delays are very serious at nanoscale dimensions due to the high output
impedance of the devices (which matters even more than the interconnect resistance at
nanometer dimensions) and are estimated to be 1000× larger in a quantum circuit than in a
CMOS circuit. Thus high-impedance nanodevices, such as single electron transistors, will
be slower than CMOS FET's. SET's that operate at room temperature have low gain, high
output impedance, and background offset charges. Thus no room-temperature SET logic
or memory approach is practical at this time. The issue for SET's is whether the possible
density increase outweighs the increase in associated delays with respect to useful and
cost-effective information processing.

²¹ Private communication, Professor Jan Rabaey, University of California, Berkeley.
²² P. Hadley and J.E. Mooij, "Quantum nanocircuits: chips of the future?" in Quantum Semiconductor Devices and Technologies, edited by T.P. Pearsall, Kluwer Academic Publishers, Dordrecht, pp. 1-20 (2000).
The above discussion shows that quantum devices based on electron transport do not
have any performance advantage over CMOS at room temperature. They will be denser
but slower due to their high output impedance. Moreover, the very small dimensions
(<10 nm) required for room temperature operation are not currently manufacturable at
such densities.
Fig. 1 The average delay vs. device density showing the dissipation limit, the relativistic
limit, and the quantum limit for room temperature CMOS integrated circuits. (From P.
Hadley and J.E. Mooij, Quantum nanocircuits: chips of the future? in Quantum Semiconductor Devices and
Technologies edited by T. P. Pearsall, Kluwer Academic Publishers, Dordrecht, pp 1-20 (2000))
Implications of Emerging Nanoscale Quantum-Dominated Devices on Future Computing
and Communications Needs
The role of nanoscale devices in meeting future computing and communications needs is
not clear at this point. However there are many needs that would benefit from the
terascale level of integration that such devices would potentially offer. There are also
limitations that arise with nanoscale devices that will impact their usefulness.
Layout Regularity
Many applications will require large amounts of memory that integrates closely with
CMOS logic. Nanoscale devices are well suited to providing very dense memories
organized in periodic arrays. However, logic implementations have not traditionally been
based on regular arrays.
Device Performance
The gain of nanodevices is an important limitation for current combinatorial logic where
gate fan-outs require significant drive current and low voltages make gates more noise
sensitive. New logic and low-fanout memory circuit approaches will be needed to use
most of these devices for computing applications. Signal regeneration for large circuits
may need to be accomplished by integration with CMOS. Integratability of nanodevices
to CMOS silicon is a key requirement due to both the need for signal restoration for many
logic implementations and also the established technology and market base. This
integration will be necessary at all levels from design tools and circuits to process
technology.
The total impedance for electron transport devices can be more than 100 kΩ, so that, for
comparable interconnect capacitances, the lowest-impedance device will be favored. In
fact, device output impedance may be even more important than interconnect resistance
for nanoscale devices. This is reflected in the low gain of these devices. Most nanoscale
devices have output impedances of hundreds of megohms or more.
Contact resistances for metal-nanodevice contacts must be comparable to the interconnect
resistances and be repeatable and reliable.
The error rate of all nanoscale devices and circuits is a major concern. These errors arise
from the highly precise dimensional control needed to fabricate the devices and also from
interference from the local environment, such as spurious charges in SET's. It has been
estimated that redundancies of 10³ to 10⁴ will be needed for manufacturing and lifetime
device failure rates of 10⁻⁶ to 10⁻⁷. Thus for nanodevice levels of 10¹², only 10⁸ to 10⁹
devices will be useable for computation. Large-scale error detection and correction will
need to be a central theme of any architecture and implementations that use nanoscale
devices.
Nanodevices must be able to operate at or close to room temperature for practical
applications.
Device Transfer Functions
Nanoscale devices may perform circuit functions directly due to their nonlinear outputs
and therefore save both real estate and power. In addition, nanodevices that implement
both logic and storage in the same device would revolutionize circuit and
nanoarchitecture implementations.
Interconnect Limitations
Nanodevices based on electron transport must be interconnectable without a major loss in
density, performance, or power. The interconnects must demonstrate transmission
resistances of several tens of kilohms. The interconnect pitch transformation from
nanoscale dimensions to the order of millimeters used in most applications will require
sophisticated multiplexing schemes to enable bi-directional signal flows.
Power Limitations
Clock speed vs density tradeoffs for electron transport devices will dictate that for future
technology generations, clock speed will need to be decreased for very high densities or
conversely, density will need to be decreased for very high clock speeds. In other words,
in the quantum limit the power-delay product, (minimum power dissipated) × (switching
time)², cannot be less than Planck's constant h. Nanoscale electron transport devices
mostly fit into the former category and will best suit implementations that rely on the
efficient use of concurrent devices more than on fast switching.
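A rough numerical illustration of this trade-off is sketched below: if the dissipated power per device multiplied by the square of the switching time cannot fall below h, then for a fixed power budget per unit area the fastest allowed switching time lengthens as density grows. The 25 W/cm² budget is the assumed value used earlier; the densities are illustrative.

from math import sqrt

H = 6.626e-34               # Planck's constant, J*s
AREA_POWER_W_CM2 = 25.0     # assumed dissipation budget per cm^2

for density in (1e8, 1e10, 1e12):             # devices/cm^2
    p_device = AREA_POWER_W_CM2 / density     # power available per device
    t_min = sqrt(H / p_device)                # bound from P * t^2 >= h
    print(f"{density:.0e} devices/cm^2 -> switching time >= {t_min:.1e} s")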
Nanoscale devices such as quantum dots and single atomic or nuclear spins, which are
based on quantum electron behavior rather than electron transport, offer significant relief
from the thermal dissipation problems of electron transport devices. However, problems
of manufacturability and low-temperature operation are major obstacles to early
implementation for metal dot structures. For molecular structures, operation speed and
defect tolerance remain to be explored.
Coherent Quantum Computing
Coherent quantum effects using devices based on photons, or electron or nuclear spins,
and superconducting devices will require totally new computing structures. Such devices
function by superposed wave functions that are entangled as qubits and that easily
decohere when interacting with an external environment such as a measurement device.
Although enormously capable for a few selected algorithms such as encryption or deep
database searching, quantum computing is not seen yet as being of more general interest.
Also, the nanofabrication requirements for arrays of coherent quantum devices needed to
maintain the required coherent states are far beyond any extrapolated process control
capabilities. So, for now, there is no known path to produce quantum-computing systems
with more than a few qubits, and we will not consider them further here. Future revisions
of the semiconductor roadmap will require a more detailed consideration of them.
Future R&D Needs
Needless to say, neither the circuit technology nor the design tools are available to
implement the kind of nanoarchitectures being considered here. So significant research
and development will be needed in these areas to support the useful implementation of
quantum-dominated nanoscale devices for computing and communications applications.
Also self-test procedures will need to be developed that include automated re-routing
among tiles, particularly for new information processing structures that self organize
instead of being programmed and that have intrinsic fault tolerance.
In addition to the need for infrastructure research, it will be necessary to fund significant
programming model/algorithm/application research that will throw off the constraints of
traditional instruction sets and prepare for implementation in software-controlled
processor systems that can be reconfigured at various times, including run time, and/or
enable direct bit-mapped implementations. Such implementations should be enabled by
the dense memories made possible by nanodevices. At the very high densities of future
technology generations, reconfigurability at run time through soft control and direct
binary mapping of applications to hardware will enable far more complex algorithm
implementations than are possible with the hard-wired, ISA-bound systems of today.