Computer Architectures for Nanoscale Devices
Contribution to the Emerging Devices Section of the 2004 ITRS
John Carruthers, Dan Hammerstrom, Bob Colwell, George Bourianoff, Victor Zhirnov
Rev 16, July 22, 2003
This paper describes the current state of computer architecture approaches that are
relevant to nanoscale devices. It will be published as part of the Emerging Research
Devices section of the 2004 International Technology Roadmap for Semiconductors after
review by the appropriate ITRS Working Groups and integration with other written
contributions to the ITRS. A preceding section of this paper on the state of computer
architectures has been excerpted and published as Part 1 in the SRC Cavin's Corner
section. The present paper will be published as Part 2 in Cavin's Corner.
Definition of Computer Architecture
The architecture of a computer is influenced both by application requirements and by
the capabilities of the technology on which it is implemented. In turn, computer
architecture and its implementation in hardware are the major determinants of which
technology is used. The tight coupling between computer architecture and technology
will continue as CMOS scaling below the 22-25 nm generation in 2015-2016 is extended by
quantum-dominated nanoscale technologies, represented by devices and interconnects that
operate by ballistic electron transport, tunneling electron transport, or even Coulomb
charge transfer.
It is necessary to clarify what is meant by computer architecture, since the term is used in
multiple contexts. Computer architecture is defined as "the minimal set of properties that
determine what programs will run and what results they will produce." The definition has
been expanded over the years to recognize a three-way distinction between architecture
(or behavior), implementation (or structure), and realization. For this discussion the term
architecture is used in the strict sense, and the structures that execute that architecture are
called implementations or organizations.
Architectures become established through the applications that make them useful and
cost-effective. The functional requirements of these applications determine the
appropriate instruction set or algorithm set. The associated infrastructure needed to
execute the instructions or algorithms includes the operating systems and compilers that
serve as the interface between the instruction sets and the program being computed. The
usefulness of any architecture is determined by its ability to meet the functional
requirements of the application as well as the price, performance, and power goals for the
particular market.
New nanoscale devices and implementations must add value to applications within the
context described above. Where there is an intersection between application needs and
device capabilities in satisfying those needs through improved or new
computer/communication architectures, insertion of the new device structures into the
mainstream of computer and communications hardware will be possible, provided that
the appropriate software infrastructure is developed.
General Architecture Principles for Nanoscale Devices
The following discussion applies primarily to nanoscale devices that are dominated by
quantum transport phenomena. Coherent quantum devices and architectures based on
energy transition probabilities and phases are still in the research phase and, although
mentioned briefly below, a detailed discussion must be deferred until later revisions of
the ITRS.
The characteristics of nanoscale devices and fabrication methods that must be considered
in developing appropriate circuits and computing architectures include regularity of
layout, unreliable device performance, device transfer functions, interconnect limitations,
and thermal power generation. The regular layout is a result of the self-assembly methods
that must be used at dimensions below those for which standard "top-down" processing
techniques are used. The device performance is a consequence of both the physical
principles and the inherent variability associated with the nanoscale, where it is estimated
that on the order of a percent of devices will not function adequately for useful circuits.
Device transfer functions include the need for gain, so that complex circuits can be
designed, as well as input/output relationships that are useful for circuit design.
Interconnect limitations come from two origins: the geometrical challenge of accessing
extremely small devices with connections that will transfer information at the needed
speed and bandwidth, and the transformation of interconnect dimensions from the
nanoscale to the physical world of realizable system connections. Thermal power
generation comes from the device switching energy and also from the energy needed to
drive signals through circuits. For electron transport devices such as MOS transistors, the
switching energy has a fundamental limit of about 10⁻¹⁸ J per switching transition at the
20 nm node, which will limit the useable combination of device density and speed, as
discussed below.
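To make the scale of this constraint concrete, the following back-of-the-envelope sketch (an illustration added for this discussion, not taken from any reference) estimates the clock rate at which dynamic power alone reaches an assumed heat-removal budget for a given device density. The 25 W/cm² budget and the 10% activity factor are illustrative assumptions; the 10⁻¹⁸ J switching energy is the value quoted above.

# Rough estimate: at what clock rate does dynamic power hit the thermal budget?
E_SWITCH_J = 1e-18         # switching energy per transition at the 20 nm node (from the text)
POWER_BUDGET_W_CM2 = 25.0  # assumed heat-removal budget per cm^2
ACTIVITY = 0.1             # assumed fraction of devices switching each cycle

def max_clock_hz(density_per_cm2):
    """Clock rate at which dynamic power just reaches the assumed budget."""
    return POWER_BUDGET_W_CM2 / (E_SWITCH_J * density_per_cm2 * ACTIVITY)

for density in (1e8, 1e10, 1e12):   # devices/cm^2, up to the terascale level
    print(f"{density:.0e} devices/cm^2 -> max clock ~ {max_clock_hz(density):.1e} Hz")

At terascale densities the affordable switching rate falls well below today's clock frequencies, which is why the organizations discussed below emphasize locality and the efficient use of many concurrent devices rather than raw clock speed.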
The limitations of nanoscale devices impose restrictions on the organizations that are
available for future architectures. Local computing tiles composed of simple device
structures have been proposed that are interconnected with nearest neighbors through
crossbar interconnect arrangements that bound the devices. Such organizations satisfy
constraints on device gain and interconnect parasitics. Other organizations are based on
molecular devices inspired by biological systems with much larger circuit fanout than
used in today’s technology. Such circuits work by using chemical regulatory approaches.
For all nanoscale organizations, the management of defective devices will be a critical
element of any future architecture, since defect rates are expected to be much higher than
in current practice.
Specific Architectures for Nanoscale Devices
This section describes the coupling of future nanoscale devices to new applications and
the architectures to support them. Please refer to Table 46b below for a summary.
a) Fine-Grained Parallel Implementations in Nanoscale Cellular Arrays
For nanoscale devices, the integration level will be terascale (10¹² devices/cm²). For this
large number of devices, many new information processing and computing capabilities
are possible in principle that would not be considered at the gigascale level of integration.
Architectures built on electron transport devices must accommodate characteristics such
as high output impedance and contact resistance, as well as the inability to provide global
interconnects due to parasitic RC limitations, which require interconnect cross-sections
not to scale and line lengths to be short. Furthermore, patterning techniques will not allow
the random layouts of present logic circuits. Thus the choice of devices and their
organization will need to be quite different from current practice. These devices will need
to be interconnected mostly locally and patterned in grids or arrays of cells using
techniques such as directed self-assembly. Devices such as quantum dots interconnected
in regular arrays by local Coulomb charge interactions are being considered for terascale
densities.
Two architecture implementations proposed for these cellular arrays are quantum cellular
automata (QCA) and cellular nonlinear networks (CNN). QCA implementations can
actually be regarded as a subset of CNN's²; but since they evolved separately, we shall
treat them separately below. These implementations are particularly useful for hybrid
analog/digital systems with data structures that map well to parallel processing.
i) Quantum Cellular Automata Architecture Implementations (QCA)
The QCA paradigm is one in which a regular array of cells, each interacting with its
neighbors, is employed in a locally interconnected manner. Such cells are typically
envisioned to be electrostatically coupled quantum dots or magnetic-field-coupled
nanomagnets; ongoing research is exploring QCA in various molecular structures as
well³. Because the cells couple to their neighbors directly, there are no wires in the signal
paths. If QCA cells are arranged in a closely packed grid, then long-established cellular
automaton theory can be used to implement specific information processing algorithms.
QCA can also be extended to the cellular nonlinear (neural) networks discussed below.
Thus a large body of theoretical algorithm implementations can be applied to QCA
arrays. By departing from close-packed, regular grid structures, it is possible to use
QCA's to carry out general logic functions and universal computing with modest
efficiency. In addition to non-uniform layouts, QCA's need a spatially non-uniform
"adiabatic clocking field" that controls cell switching from one state to another and
allows the cells to evolve rapidly to a stable end state. The clock also produces some gain,
non-linearity, and isolation between neighboring parts of a circuit. It is possible to
construct a complete set of Boolean logic gates with QCA cells and to design arbitrary
computing structures, and current device and circuit analyses indicate that the clock speed
of QCA's may be extendable to the THz regime. The energy per switching transition,
adjusted for the required cooling energy, is expected to be of order 3×10⁻¹⁹ J to
3×10⁻¹⁵ J⁴ at the 100 GHz mark (as compared to the 10⁻¹⁸ J projected for CMOS at the
20 nm node). Although there will be no interconnect capacitance associated with these
structures, there will be a significant capacitance associated with the inter-dot size and
spacing geometries.

² W. Porod, C.S. Lent, G. Toth, A. Csurgay, Y.-F. Huang, and R.-W. Liu, IEEE Abstracts, p. 745, 1997.
³ C.S. Lent, Science, 288, 1597 (2000).
Power gain in QCA's has been demonstrated by using energy from the clock.⁵ Physically
this occurs because the clock must do some work on a slightly unpolarized cell during the
latching phase of the switching operation.
From a manufacturing viewpoint, the tolerances for metal dot quantum device arrays are
very small and beyond projected manufacturing capabilities. From an operational
viewpoint, quantum effects using metal dots will require temperatures from 70 mK to 20 K
and therefore may not be suitable for widespread applications. Molecular QCA structures
can operate at room temperature or above due to the sub-nanometer size of the
zero-dimensional metal cluster⁶; however, defect tolerance issues remain to be explored.
Applications requiring very low power will benefit from QCA implementations.
However, if room temperature operation is desired, then molecular structures will be
needed and the switching speed and defect tolerance must be adequate for applications
and architectures to be useful. Such investigations are currently underway at several
universities.
ii) Cellular Nonlinear Networks (CNN)
A CNN is an array of mainly identical dynamical systems, called cells, that satisfy two
properties: 1) most interactions are local, within a distance of one cell dimension, and 2)
the state variables are continuous-valued signals (not digital). A template specifies the
interaction between each cell and all its neighbor cells in terms of their input, state, and
output variables. The interaction between the variables of one cell may be either a linear
or a nonlinear function of the variables associated with its neighbor cells. A cloning
function determines how the template varies spatially across the grid and determines the
dynamical response of the array to boundary values and initial conditions. Since the range
of interaction and the connection complexity of each cell are independent of the number
of cells, the architecture is extremely scalable, reliable, and robust. Programming the
array consists of specifying the dynamics of a single cell, the connection template, and
the cloning function of the templates. This approach is simpler than traditional VLSI
design methodology since the functional components are simple and reusable.

⁴ J. Timler and C.S. Lent, "Power Gain and Dissipation in Quantum-Dot Cellular Automata," J. Appl. Phys., 91, 823 (2002).
⁵ Ibid.
⁶ C.S. Lent, B. Isaksen, and M. Lieberman, "Molecular Quantum-Dot Cellular Automata," J. Amer. Chem. Soc., 125, 1056 (2003).
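As an illustration of this programming model, the following minimal sketch (the standard Chua-Yang CNN dynamics integrated with a simple forward-Euler step, added here for illustration rather than drawn from this paper) applies one feedback template A and one control template B uniformly across the grid, i.e., a space-invariant cloning function. The specific template values, grid size, and boundary handling are arbitrary assumptions.

import numpy as np
from scipy.signal import convolve2d

def cnn_step(x, u, A, B, z, dt=0.1):
    """One forward-Euler step of dx/dt = -x + A*y + B*u + z,
    with the piecewise-linear cell output y = 0.5*(|x+1| - |x-1|)."""
    y = 0.5 * (np.abs(x + 1) - np.abs(x - 1))
    dx = -x + convolve2d(y, A, mode="same") + convolve2d(u, B, mode="same") + z
    return x + dt * dx

# Illustrative templates (assumed values, roughly edge-detection-like)
A = np.array([[0, 0, 0], [0, 2, 0], [0, 0, 0]], dtype=float)
B = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float)
z = -0.5

u = np.random.rand(32, 32)   # input image, values in [0, 1]
x = np.zeros_like(u)         # initial cell states
for _ in range(50):          # let the array settle toward a stable output
    x = cnn_step(x, u, A, B, z)

Specifying a different cell dynamic, template pair, or cloning function changes the computation performed by the whole array without changing the array itself, which is the sense in which this approach is simpler than conventional VLSI design.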
CNN’s can be used to implement Boolean logic as well as more complex functions such
as majority gates, MUX gates, and switches. CNN’s can simulate many mathematical
problems such as diffusion and convection and nervous system functions. The CNN
organization also lends itself to implementing defect management techniques as
discussed below. Devices that can be used include quantum dots (QCA's)⁷,⁸, SET's, and
RTD's. Tunneling phase logic has been combined with CNN to enable neural-like spike
switching waveforms and ultra-low power dissipation.
One caution concerning CNN's is that, despite the potential applications discussed above,
the only published application to date has been analog image processing. However,
algorithms for pattern recognition and analysis can be implemented very efficiently in
CNN's.
b) Defect Tolerant Architecture Implementations
The goal of defect tolerant implementations is to enable reliable circuits and computing
from unreliable devices. Defect tolerance is distinct from fault tolerance, which implies
the ability of a machine to recover from errors made during a calculation. Defects can
occur as permanent defects from hardware manufacturing and as transient defects such as
random charges that affect single electron transistors. Defective devices may be
functional but still not meet the tolerance and reliability requirements for effective
large-scale circuit operation. These effects are expected to be particularly acute for
quantum-dominated devices at the nano- and molecular scale and will require significant
resources to control⁹.
It is expected that the invention of nanometer-scale devices could eventually permit
extremely large scales of integration, of the order of 10¹² devices per chip. However, it
will almost certainly be very difficult to make nanoscale circuits in which every device
functions correctly. Furthermore, it is probable that the proposed nanoelectronic devices
will be more fragile than conventional devices and will be sensitive to external influences.
Hence, fault-tolerant architectures will certainly be necessary in order to produce reliable
systems that are immune to manufacturing defects and to transient errors.
Several techniques exist for overcoming the effects of inoperative devices. All of them
use the concept of redundancy (in resources or in time). The most representative
techniques are R-fold modular redundancy (RMR)¹⁰, NAND multiplexing (NAND-M)¹¹,
and reconfiguration (RCF)¹². The effectiveness of RCF was successfully demonstrated on
the massively parallel computer 'Teramac'¹³. An analysis of the fault tolerance of
nanocomputers was presented in reference 14. The two characteristic parameters of a
fault-tolerant architecture are the amount of redundancy R and the allowable failure rate
per device p_f.

⁷ G. Toth, C.S. Lent, P.D. Tougaw, Y. Brazhnik, W.W. Weng, W. Porod, R.W. Liu, and Y.F. Huang, "Quantum Cellular Neural Networks," Superlattices and Microstructures, 20(4), 473-478 (1996).
⁸ A.I. Csurgay, "Signal Processing with Near Neighbor Coupled Time Varying Quantum Dot Arrays," IEEE Trans. Circuits and Systems-I: Fundamental Theory and Applications, 47, 1212 (2000).
⁹ J.R. Heath et al., "A Defect Tolerant Computer Architecture: Opportunities for Nanotechnology," Science, 280, 1716 (1998).
In this context, redundancy usually means static redundancy: redundant rows and
columns for example. Dynamic redundancy is used to catch and correct problems “on the
fly” and is a more expensive use of resources. It is not clear how much dynamic
redundancy will be needed at the nano- and molecular levels until new computing models
are developed.
The choice of fault-tolerant scheme may be both manufacturing and application specific.
For example, although the RMR technique is the least effective, with a redundancy level
of R = 5 the same level of chip reliability can be achieved with devices that are four
orders of magnitude less reliable. The price for this improvement is that the effective
number of devices is reduced to N/5 (and p_f for each device must still be smaller than
10⁻⁹ for N = 10¹² devices).
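This trade-off can be made concrete with a simple reliability estimate. The sketch below (an added illustration, not a calculation from the cited references) assumes independent device failures and a perfect majority voter, which overstates the benefit of RMR; it is meant only to show how grouping N devices into N/R voted groups of R replicas buys tolerance of a larger p_f at the cost of effective device count.

from math import comb

def group_failure_prob(p_f, R):
    """Probability that a majority of the R replicas in one voted group fail."""
    k_min = R // 2 + 1
    return sum(comb(R, k) * p_f**k * (1 - p_f)**(R - k) for k in range(k_min, R + 1))

def chip_success_prob(p_f, R, N):
    """Probability that every one of the N/R voted groups delivers a correct output."""
    return (1.0 - group_failure_prob(p_f, R)) ** (N / R)

N = 1e12   # terascale device count from the text
for p_f in (1e-12, 1e-9, 1e-6):
    print(f"p_f = {p_f:.0e}: no redundancy {chip_success_prob(p_f, 1, N):.3f}, "
          f"R = 5 {chip_success_prob(p_f, 5, N):.3f}")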
On the other hand, the reconfigurable computer can in principle handle extremely large
manufacturing defect rates, in the limit even approaching unity, but only at the expense of
colossal amounts of redundancy for defect rates as high as 0.1 (i.e., 10%). If one wishes to
fabricate a chip containing the equivalent of many present-day workstations, then the
device failure rate during manufacturing must be smaller than 10⁻⁵. This may be difficult
to achieve for nanoscale devices.
RMR and NAND-M are in general not as effective as reconfiguration. However, if the
dead devices cannot be located during manufacture, then a fault tolerant strategy must be
adopted that allows a chip to work even with many (either temporarily or permanently)
faulty devices. Furthermore, reconfiguration might be very time consuming for protecting
against transient errors that occur in service, and may therefore demand temporary
shutdown of the system until reconfiguration is performed. It may also be necessary to
use NAND multiplexing if reconfiguration methods are impractical or if the probability of
transient errors is very high.
RMR provides some benefits, but these are unlikely to be useful for chips with 10¹²
devices once the manufacturing defect rate is greater than about 10⁻⁸. The NAND-M
technique in principle would allow chips with 10¹² devices to work even if the fault rate
is as high as 10⁻³ per device. However, this needs even more redundancy than the
reconfiguration technique.
¹⁰ P.G. Depledge, "Fault-tolerant computer systems," IEE Proc., 128, 257-272 (1981).
¹¹ S. Spagocci and T. Fountain, "Fault rates in nanochip devices," Electrochem. Soc. Proc., 99, 354-368 (1999).
¹² J. von Neumann, "Probabilistic logics and the synthesis of reliable organisms from unreliable components," in Automata Studies, ed. C.E. Shannon and J. McCarthy (Princeton, NJ: Princeton University Press, 1955), pp. 43-98.
¹³ J.R. Heath, P.J. Kuekes, G.S. Snider, and R.S. Williams, "A defect-tolerant computer architecture: opportunities for nanotechnology," Science, 280, 1716-1721 (1998).
¹⁴ K. Nikolic, A. Sadek, and M. Forshaw, "Fault-tolerant techniques for nanocomputers," Nanotechnology, 13, 357-362 (2002).
The implication of these results is that the future usefulness of various nanoelectronic
devices may be seriously limited if they cannot be made in large quantities with a high
degree of reliability. The results show that it is theoretically possible to make very large
functional circuits, even with one dead device in ten, but only if the dead devices can be
located and the circuit reconfigured to avoid them. Even so, this technique would require
a redundancy factor of ~10,000; that is, a chip with 10¹² non-perfect devices would
perform as if it had only 10⁸ perfect devices. If it is not possible to locate the dead
devices, then one of the other two techniques would have to be used. These would require
the manufacturing and lifetime failure rate for R = 1000 to be between 10⁻⁷ and 10⁻⁶.
c) Biologically-Inspired Architecture Implementations
Biologically-inspired computing implies emulation of human and biological reasoning
functions. Such architectures possess basic information processing capabilities that are
organized and reorganized in goal-directed systems. The living cell is the biological
example of a goal-directed organism and has the features of flexibility, adaptability,
robustness, autonomy, situation-awareness, and interactivity. The self-organization of a
biological cell is responsible for its own survival, destruction, replication, and
differentiation into multicellular forms, all under the direction of goals encoded in its
genes. The programming model does not involve millions of lines of code but rather
modules of encoded instructions that are activated or deactivated by regulatory modules
to act in concert with an overall goal-directed system. Algorithms inspired by
computational neurobiology have been the first approach to computing systems that
exhibit such behavior, implemented either as specialized processors or on general-purpose
architectures. However, there is an enormous gap in our understanding of how biological
pathways or circuits function, so there is much learning ahead of us before this
knowledge can be captured in useable computing systems.
At the nanoscale, devices are more stochastic in operation and quantum effects become
the rule rather than the exception. It is unlikely that existing computational models will
be an optimal mapping to these new devices and technologies, and this is the motivation
for biologically inspired algorithms. Neural circuits use loosely coupled, relatively slow,
globally asynchronous, distributed computing with unreliable (and occasionally failing)
components. Furthermore, even simple biological systems perform highly sophisticated
pattern recognition and control. Biological systems are self-organizing, tolerant of
manufacturing defects, and they adapt, rather than being programmed, to their
environments. The problems they solve involve the interaction of an organism/system
with the real world¹⁵.
Biological systems are also inherently low power at these relatively slow speeds. The
human brain is known to consume 10-30W in performing its functions at millisecond
timeframes that are compatible with the physiological processes being controlled.
¹⁵ G. Palm et al., "Neural Associative Memories," in Associative Processing and Processors, A. Krikelis and C.C. Weems, eds., IEEE Computer Society, Los Alamitos, CA, 1997, pp. 284-306.
The interconnect capabilities of biologically-inspired architectures are the key to their
massive parallelism. The connectivity of neurons in humans provides the best known
example of this: one cubic millimeter of cortex contains about 10⁵ neurons and 10⁹
synapses (10⁴ synapses/neuron), and the human nervous system has about 10¹² neurons
and 10¹⁵ synapses (10³ synapses/neuron). Thus the fan-out per neuron ranges from 1,000
to 10,000 in humans¹⁶. This amounts to about 1-10 synapses/µm³. Most neurons are not
connected to nearest neighbors but rather to different cell classes required to execute the
goal-directed function. This enormous interconnectivity requires a much different
approach to managing information and algorithmic complexity than we implement in
current computing systems. The large fan-out will require either large-gain devices or
circuit approaches based on additional signal processing inputs such as the regulatory
enzymes of biological reaction pathways.
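The connectivity figures quoted above follow directly from the cortex numbers; the short consistency check below simply recomputes them from the values given in the text.

NEURONS_PER_MM3 = 1e5     # neurons in one cubic millimeter of cortex (from the text)
SYNAPSES_PER_MM3 = 1e9    # synapses in the same volume (from the text)
UM3_PER_MM3 = 1e9         # cubic micrometers per cubic millimeter

fan_out = SYNAPSES_PER_MM3 / NEURONS_PER_MM3   # ~1e4 synapses per neuron locally
density = SYNAPSES_PER_MM3 / UM3_PER_MM3       # ~1 synapse per cubic micrometer

print(f"local fan-out ~ {fan_out:.0e} synapses/neuron, density ~ {density:.0f} synapse/um^3")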
Some of the most important problems in computing involve teaching computers to act in
a more intelligent manner. Key to this is the efficient representation of knowledge or
contextual information. A variety of highly parallel algorithms are being studied,
including neuromorphic structures, with the intent of enhancing the representation and
manipulation of knowledge in silicon. As discussed previously, realistic systems based on
biological precepts are a long way from implementation until we learn more about the
function of the constituent elements and their organization in goal-directed applications.
The feasibility of using nanoscale electronic devices and interconnects to implement such
massively parallel, adaptive, self-organizing computational models is an active research
area. In general, such architectures should be of interest to complex digital and intelligent
signal processing applications such as advanced human computer interfaces. These
interfaces will include elements such as computer recognition of speech, textual, and
image content as well as problems such as computer vision and robotic control. These
classes of problems require computers to find complex structures and relationships in
massive quantities of low-precision, ambiguous, and noisy data.
Implementations of biologically-inspired systems can be either entirely analog or digital,
or a hybrid of the two. Each has its advantages and disadvantages. Analog
implementations are denser than digital ones, and many of the algorithmic operations that
often appear in this class of algorithms, such as leaky integration, can be implemented
very efficiently in analog. Analog can also be much more efficient in terms of power per
operation. Digital representation of computations allows more flexibility and allows
expensive computer hardware to be multiplexed among a number of network nodes. This
is particularly
attractive when the network is sparsely activated. On the other hand, analog is much
harder to design and debug due to the lack of mature design tools. Also analog quantities
are much more difficult to store reliably and bit precision may not be acceptable with the
small numbers of electrons and low values of voltage and current. Digital
implementations use many more transistors and power per operation and must eventually
interface with analog signals from the real world.
¹⁶ P.S. Churchland and T.J. Sejnowski, The Computational Brain, The MIT Press, 1992, ISBN 0-262-03188-4.
The communications functions, even in analog systems, are best performed digitally.
Most neurons communicate via inter-spike intervals, using the time between pulses rather
than a current or voltage level to represent a signal. This type of signaling is very noise tolerant and
scales cleanly to single electron systems. Representing addresses in digital forms, such
as packets, means that dedicated metal interconnect wires are not required and that the
network can grow without adding new wires. Also multiplexing schemes for increasing
bandwidth are enabled by digital systems. However single-electron systems do not have
the gain required to drive large fanout circuits typical of biological implementations.
Very little work has been performed on nanoscale devices and circuits that would provide
such functions.
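The following minimal sketch (an added illustration, not a scheme proposed in this paper) shows the two digital signaling ideas mentioned above: representing an analog value by an inter-spike interval, and tagging each spike with a source address so that many logical connections can share one physical channel. The interval range and address format are arbitrary assumptions.

from dataclasses import dataclass

@dataclass
class SpikeEvent:
    source_address: int   # which unit fired; no dedicated wire per logical connection
    interval_s: float     # time since that unit's previous spike

def encode(value, t_min=1e-6, t_max=1e-3):
    """Map a normalized value in [0, 1] to an inter-spike interval
    (larger value -> shorter interval), within an assumed interval range."""
    value = min(max(value, 0.0), 1.0)
    return t_max - value * (t_max - t_min)

def decode(interval_s, t_min=1e-6, t_max=1e-3):
    """Recover the normalized value from the interval."""
    return (t_max - interval_s) / (t_max - t_min)

event = SpikeEvent(source_address=42, interval_s=encode(0.8))
print(event, decode(event.interval_s))   # ~0.8 recovered from timing alone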
d) Coherent Quantum Computing
Coherent quantum devices rely on the phase information of quantum wavefunctions to
store and manipulate information. The quantum state that stores this information is called
a qubit and is extremely sensitive to its external environment: it is easily connected, or
entangled, with the quantum states of particles in the local environment, and no physical
system can ever be completely isolated from its environment. The same sensitivity,
however, can be used to entangle adjacent qubits in ways that can be controlled by
physical gates. The core idea of quantum information processing, or quantum computing,
is that each individual component of a superposition of wavefunctions is manipulated in
parallel, thereby achieving massive speed-up relative to conventional computers. The
challenge is to manipulate the wavefunctions so that they perform a useful function and
then to find a way to read out the result of the calculation.
Essentially there have been three approaches for the implementation of quantum
computers:
1) Bulk resonance quantum implementations including nuclear magnetic
resonance, linear optics, and cavity quantum electrodynamics (CQED)
2) Atomic quantum implementations including trapped ions and optical lattices
3) Solid-state quantum implementations including semiconductors and
superconductors
Decoherence is a major issue: qubits lose their quantum properties exponentially quickly
in the presence of a constant amount of noise per qubit. The decoherence per operation
ranges from 10⁻³ for electron charge states in semiconductors to 10⁻⁹ for photons, 10⁻¹³
for trapped ions, and 10⁻¹⁴ for nuclear spins.
The emphasis of this description is on solid-state implementations with a focus on
semiconductors, since these are the most attractive for developing the required
manufacturing process control and commercial products.
As stated above, a fundamental notion in quantum computing is the “qubit,” a concept
that parallels the "bit" in conventional computation but carries with it a much broader
set of representations. Rather than a finite-dimensional binary representation for
information, the qubit is a member of a two-dimensional Hilbert space containing a
continuum of elements. Thus quantum computers operate in a much richer space than
binary computers. Researchers have defined many sets of elementary quantum gates
based on the qubit concept that perform mappings from the set of input quantum registers
to a set of output quantum registers. A single gate can entangle the qubits stored in two
adjacent quantum registers and combinations of gates can be used to perform more
complex computations. It can be shown that, just as in Boolean computation, there exist
minimal sets of quantum gates that are complete with respect to the set of computable
functions. Considerable research has been conducted to define the capabilities of
quantum computers. Theoretically quantum computers are not inferior to standard
computers of similar complexity and speed of operation. More interesting is the fact that
for some important classes of problems the quantum computer is superior to its standard
counterpart. In particular, it was shown that the two prime factors of a number can be
determined by a quantum computer in time proportional to a polynomial in the number of
digits in the number.¹⁷ This truly remarkable result showed that for this particular class
of problems, the quantum computer is at least exponentially better than a standard
computer. The key to this result is the capability of a quantum computer to efficiently
compute the quantum Fourier transform. This result has immediate application in
cryptography since it would allow the quick determination of keys to codes such as RSA.
It is estimated that a few thousand quantum gates would be sufficient to break a
representative RSA code containing on the order of one hundred digits. There are
several other applications that are variants of the factorization problem.¹⁸
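The qubit and gate concepts described above can be made concrete with elementary state-vector linear algebra. The sketch below (an added illustration using standard textbook gate matrices) applies a Hadamard gate to put one qubit into superposition and a controlled-NOT gate to entangle it with a second qubit, producing a Bell state.

import numpy as np

ket0 = np.array([1, 0], dtype=complex)                         # |0>, a basis state of the 2-D Hilbert space
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)    # Hadamard gate
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)                 # controlled-NOT on two qubits

state = np.kron(ket0, ket0)            # two-qubit register |00>
state = np.kron(H, np.eye(2)) @ state  # superpose the first qubit
state = CNOT @ state                   # entangle the two qubits
print(np.round(state, 3))              # ~[0.707, 0, 0, 0.707]: the Bell state (|00> + |11>)/sqrt(2)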
The development of a practical architecture for reliable quantum computers is just
beginning¹⁹. Elementary architecture implementation concepts such as quantum storage,
data paths, classical control circuits, parallelism, programming models, and system
integration are not yet available. The overhead requirement for quantum error correction
is a daunting problem; the error probability for a quantum operation can be as high as 10⁻⁴
and requires heroic efforts to manage. Improvements in error correction codes are now
being researched, but their impact is not yet known. Practical architectures will require
error rates between 10⁻⁶ and 10⁻⁹.
A minimum set of architecture building blocks has been proposed²⁰: a quantum
arithmetic logic unit, quantum memory, and a dynamic scheduler. In addition, the
architecture implementation uses a novel wiring technique that exploits quantum
teleportation. In this wiring, the desired operation is performed simultaneously with the
transport.
¹⁷ P.W. Shor, "Algorithms for quantum computation: discrete logarithms and factoring," Proc. 35th Annual Symposium on Foundations of Computer Science, IEEE Computer Society Press (1994), pp. 124-134.
¹⁸ C.P. Williams and S.H. Clearwater, Explorations in Quantum Computing (Springer-Verlag, New York, 1998).
¹⁹ M. Oskin, F.T. Chong, and I.L. Chuang, IEEE Computer, January 2002, p. 79.
²⁰ See previous reference.
Table 46b  Emerging Research Architecture Implementations

The table compares five architecture implementations: quantum cellular automata (QCA) and cellular nonlinear networks (CNN), which are the cellular array implementations; defect tolerant implementations; biologically inspired implementations; and coherent quantum computing.

APPLICATION DOMAIN
  QCA: Not yet demonstrated
  CNN: Fast image processing; associative memory; complex signal processing
  Defect tolerant: Reliable computing with unreliable devices (such as SET's with background noise); historical examples include WSI, Teramac, and FPGA implementations
  Biologically inspired: Goal-driven computing using simple and recursive algorithms; high computational efficiency through data compression algorithms
  Coherent quantum computing: Special algorithms such as factoring and deep data searches

DEVICE AND INTERCONNECT IMPLEMENTATIONS
  QCA: Arrays of nanodots or molecular assemblies
  CNN: Resonant tunneling devices
  Defect tolerant: Molecular switches; crossed arrays of 1-D structures; switchable interconnects
  Biologically inspired: Molecular organic and biomolecular devices and interconnects
  Coherent quantum computing: Spin resonance transistors; NMR devices; single flux quantum devices; photonics

DESIRABLE FUNCTIONAL CHARACTERISTICS AND CHALLENGES

INFORMATION THROUGHPUT
  QCA: Fan-out = 1; functional throughput constrained by inter-dot capacitances
  CNN: Fan-out close to unity
  Defect tolerant: Fan-out variable, but performance degraded by the need for defect management schemes
  Biologically inspired: Massive parallelism; requires some long-range data transfer; fan-out very high in brains (maximum = 10⁴, average = 10³)
  Coherent quantum computing: Exponential performance scaling; presently limited to 5 qubits, but 50-100 qubits needed for large computations

POWER
  QCA: Power comparable to scaled CMOS (0.1-0.5 MIPS/mW); data streaming applications will need 10-100 MOPS/mW
  CNN: Power comparable to scaled CMOS (0.1-0.5 MIPS/mW); data streaming applications will need 10-100 MOPS/mW
  Defect tolerant: Not demonstrated yet
  Biologically inspired: High parallelism results in lower operational speeds; power consumption of the human brain is 10-30 W at millisecond rates
  Coherent quantum computing: Not demonstrated yet for large-scale computations

INTERCONNECTS
  QCA: No local interconnects
  CNN: Local interconnects with neuron-like waveforms
  Defect tolerant: Interconnects by crossed arrays
  Biologically inspired: Interconnects distributed over a range of distances
  Coherent quantum computing: Interconnects through wavefunction coupling and entangled states

DEFECT TOLERANCE
  QCA: Not demonstrated
  CNN: Not yet determined
  Defect tolerant: Techniques used include redundancy, NAND multiplexing, and reconfiguration
  Biologically inspired: Inherently insensitive to defects through adaptive algorithms
  Coherent quantum computing: Error correcting algorithms needed

ERROR TOLERANCE
  QCA: Sensitive to background charge; low temperature operation
  CNN: Not yet determined
  Defect tolerant: Multiple modular redundancy and multiplexing for transient errors
  Biologically inspired: Highly dynamical neural-like systems implement adaptive self-organization and fault tolerance
  Coherent quantum computing: Error correction costs are high

MANUFACTURABILITY
  QCA: Precise dimensional control needed; tight tolerances on tunnel rates of all junctions to minimize jitter; self-assembly possible
  CNN: Not yet demonstrated
  Coherent quantum computing: NMR quantum computing demonstrated with 6 qubits

TEST
  QCA: Not demonstrated
  CNN: Demonstrated only for image processing
  Defect tolerant: Self-test, or requires extensive pre-computing of tests
  Biologically inspired: Test functions are included in the adaptive algorithms used
  Coherent quantum computing: Test not possible directly

MATURITY
  QCA: Demonstration
  CNN: Demonstration
  Defect tolerant: Demonstration
  Biologically inspired: Concept
  Coherent quantum computing: Concept

RESEARCH ACTIVITY (2001-2003)
  QCA: 25 research papers
  CNN: 92 research papers
  Defect tolerant: 10 research papers
  Biologically inspired: 12 research papers
  Coherent quantum computing: 976 research papers

REMARKS
  QCA: Cell and array design immature (no fan-out); no programming model yet
  CNN: Locally active and locally connected; no programming model yet
  Defect tolerant: Supports memory-based computing; applications in dependable systems
  Biologically inspired: Goal-directed programming model; backed by extensive neural network research; algorithmic implementations need more research support
  Coherent quantum computing: Extreme application limitation; no general-purpose architecture or programming model yet
Architecture Performance Limitations
The metrics for evaluating the maturity and relevance of architectural implementations of
nanoscale technologies are difficult to make quantitative. Each of the architecture
implementations in Table 46b is radically different from the others and must be evaluated
with different metrics. Normally, benchmark programs are used to evaluate
implementations. These benchmarks, such as SPECint for microprocessors, must be able
to measure the results of "typical" computations for any given application. However,
benchmark programs also have problems because they are usually so specific that scaling
of architectures for extended applications and extension of architectures into new
applications are not measured and in fact may even be discouraged. Benchmark circuits
can be identified that include memory and register cells, some arithmetic, and long-range
buses. For any technology option to be considered seriously, some benchmark measures
must be devised and agreed to by end users. In the case of emerging research devices,
benchmark logic circuits such as fan-out = 4 circuits and simple programming models that
generate instructions or operations per second (MIPS or MOPS) are the most
recognizable metrics. Increasingly, such metrics must also recognize power limitations.
The most recent targets for processors range from 0.1 MOPS/mW for MPU's to
1000 MOPS/mW for custom-designed DPU's (data processing units).²¹ Any
nanotechnology and its associated computing model must meet or exceed these targets to
be viable. Another measure commonly used is MOPS/mm² or, in the case of 3-D
systems, MOPS/mm³. The current targets for architectures in CMOS silicon are 1-1000
MOPS/mm², depending on the specific implementation being used. The upper bound is
obviously limited by power dissipation.
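As a simple illustration of how these metrics would be applied, the sketch below checks a hypothetical design point against the targets quoted above. The candidate throughput, power, and area figures are invented placeholders, not data for any real device or architecture.

TARGET_MOPS_PER_MW = {"MPU": 0.1, "custom DPU": 1000.0}   # processor targets from the text
TARGET_MOPS_PER_MM2 = (1.0, 1000.0)                       # CMOS area-efficiency target range

# Hypothetical nanoscale fabric: 5e4 MOPS delivered in 50 mW on 10 mm^2 (assumed numbers)
mops, power_mw, area_mm2 = 5e4, 50.0, 10.0

power_efficiency = mops / power_mw    # MOPS/mW
area_efficiency = mops / area_mm2     # MOPS/mm^2
print(f"{power_efficiency:.1f} MOPS/mW, {area_efficiency:.1f} MOPS/mm^2")
print("meets custom-DPU power target:", power_efficiency >= TARGET_MOPS_PER_MW["custom DPU"])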
Future Architecture/Implementation/Technology Tradeoffs
The physical limits of CMOS-based integrated circuits are summarized in Figure 1, taken
from the work of Hadley and Mooij of Delft University²². The average delay per device is
plotted against device density for CMOS and for quantum devices based on electron
transport, and compared to several limits. This average delay is longer than the clock
period, since not every device switches on every clock cycle; the average delay depends
on the circuit design.

The dissipation limit is set by the removal of heat from the gates and is based on an
average value of 25 W/cm². The technology that comes closest to the dissipation limit will
deliver the most computational power.

The quantum limit arises when the energy necessary to switch a bit approaches the limit
given by the Heisenberg uncertainty principle; the delay must be about 2×10⁻¹⁵ s or longer
for circuits to be stable against quantum fluctuations. For the device itself this number
should be 10× greater (2×10⁻¹⁴ s). The quantum limit intersects the dissipation limit at a
device density of about 10⁷ to 10⁸ devices/cm². This is close to the current device density
of CMOS circuits. Thus, at room temperature operation, CMOS devices at current
densities will not be quantum noise limited.
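The crossover quoted above can be reproduced approximately with the rough estimate sketched below. It assumes that every device switches continuously at the per-device delay limit of 2×10⁻¹⁴ s with a switching energy at the Heisenberg bound, so it is only an order-of-magnitude illustration; real circuits switch far less often.

HBAR = 1.055e-34        # reduced Planck constant, J*s
P_LIMIT_W_CM2 = 25.0    # dissipation limit from the text
DELAY_DEVICE_S = 2e-14  # per-device delay limit quoted above

power_per_device = HBAR / DELAY_DEVICE_S**2        # energy hbar/delay dissipated every delay
density_at_crossover = P_LIMIT_W_CM2 / power_per_device
print(f"crossover density ~ {density_at_crossover:.1e} devices/cm^2")   # ~1e8, within the quoted 1e7-1e8 range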
The relativistic limit is caused by the finite speed of light. No information can be
transported over 1 cm in a time less than 0.3 ns. For shorter delays, the distance traveled
must be correspondingly less, and the micro- and nanoarchitecture implementations must
become more localized. Of course, parasitic RC delays will be longer than the relativistic
limit. Such delays are very serious at nanoscale dimensions due to the high output
impedance of the devices (which matters even more than the interconnect resistance at
nanometer dimensions) and are estimated to be 1000× larger in a quantum circuit than in a
CMOS circuit. Thus high-impedance nanodevices, such as single electron transistors, will
be slower than CMOS FET's. SET's that operate at room temperature have low gain, high
output impedance, and background offset charges. Thus no room-temperature SET logic
or memory approach is practical at this time. The issue for SET's is whether the possible
density increase outweighs the increase in associated delays with respect to useful and
cost-effective information processing.

²¹ Private communication, Professor Jan Rabaey, University of California, Berkeley.
²² P. Hadley and J.E. Mooij, "Quantum nanocircuits: chips of the future?" in Quantum Semiconductor Devices and Technologies, edited by T.P. Pearsall, Kluwer Academic Publishers, Dordrecht, pp. 1-20 (2000).
The above discussion shows that quantum devices based on electron transport do not
have any performance advantage over CMOS at room temperature. They will be denser
but slower due to their high output impedance. Moreover, the very small dimensions
(<10 nm) required for room temperature operation are not currently manufacturable at
such densities.
Fig. 1 The average delay vs. device density showing the dissipation limit, the relativistic
limit, and the quantum limit for room temperature CMOS integrated circuits. (From P.
Hadley and J.E. Mooij, Quantum nanocircuits: chips of the future? in Quantum Semiconductor Devices and
Technologies edited by T. P. Pearsall, Kluwer Academic Publishers, Dordrecht, pp 1-20 (2000))
Implications of Emerging Nanoscale Quantum-Dominated Devices on Future Computing
and Communications Needs
The role of nanoscale devices in meeting future computing and communications needs is
not clear at this point. However there are many needs that would benefit from the
terascale level of integration that such devices would potentially offer. There are also
limitations that arise with nanoscale devices that will impact their usefulness.
Layout Regularity
Many applications will require large amounts of memory that integrates closely with
CMOS logic. Nanoscale devices are well suited to providing very dense memories
organized in periodic arrays. However, logic implementations have not traditionally been
based on regular arrays.
Device Performance
The gain of nanodevices is an important limitation for current combinatorial logic where
gate fan-outs require significant drive current and low voltages make gates more noise
sensitive. New logic and low-fanout memory circuit approaches will be needed to use
most of these devices for computing applications. Signal regeneration for large circuits
may need to be accomplished by integration with CMOS. Integratability of nanodevices
to CMOS silicon is a key requirement due to both the need for signal restoration for many
logic implementations and also the established technology and market base. This
integration will be necessary at all levels from design tools and circuits to process
technology.
The total impedance for electron transport devices can be more than 100 kΩ, so that, for
comparable interconnect capacitances, the lowest-impedance device will be favored. In
fact, device output impedance may be even more important than interconnect resistance
for nanoscale devices. This is reflected in the low gain of these devices. Most nanoscale
devices have output impedances of hundreds of megohms or more.
Contact resistances for metal-nanodevice contacts must be comparable to the interconnect
resistances and be repeatable and reliable.
The error rate of all nanoscale devices and circuits is a major concern. These errors arise
from the highly precise dimensional control needed to fabricate the devices and also from
interference from the local environment, such as spurious charges in SET's. It has been
estimated that redundancies of 10³ to 10⁴ will be needed for manufacturing and lifetime
device failure rates of 10⁻⁶ to 10⁻⁷. Thus for nanodevice levels of 10¹², only 10⁸ to 10⁹
devices will be useable for computation. Large-scale error detection and correction will
need to be a central theme of any architecture and implementations that use nanoscale
devices.
Nanodevices must be able to operate at or close to room temperature for practical
applications.
Device Transfer Functions
Nanoscale devices may perform circuit functions directly due to their nonlinear outputs
and therefore save both real estate and power. In addition, nanodevices that implement
both logic and storage in the same device would revolutionize circuit and
nanoarchitecture implementations.
Interconnect Limitations
Nanodevices based on electron transport must be interconnectable without a major loss in
density, performance, or power. The interconnects must demonstrate transmission
resistances of several tens of kilohms. The interconnect pitch transformation from
nanoscale dimensions to the order of millimeters used in most applications will require
sophisticated multiplexing schemes to enable bi-directional signal flows.
Power Limitations
Clock speed vs density tradeoffs for electron transport devices will dictate that for future
technology generations, clock speed will need to be decreased for very high densities or
conversely, density will need to be decreased for very high clock speeds. In other words,
in the quantum limit the power-delay product, (minimum power dissipated) × (switching
time)², cannot be less than Planck's constant h. Nanoscale electron transport devices
mostly fit into the former category and will best suit implementations that rely on the
efficient use of concurrent devices more than on fast switching.
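A rough numerical illustration of this trade-off is sketched below: if the dissipated power per device multiplied by the square of the switching time cannot fall below h, then for a fixed power budget per unit area the fastest allowed switching time lengthens as density grows. The 25 W/cm² budget is the assumed value used earlier; the densities are illustrative.

from math import sqrt

H = 6.626e-34               # Planck's constant, J*s
AREA_POWER_W_CM2 = 25.0     # assumed dissipation budget per cm^2

for density in (1e8, 1e10, 1e12):             # devices/cm^2
    p_device = AREA_POWER_W_CM2 / density     # power available per device
    t_min = sqrt(H / p_device)                # bound from P * t^2 >= h
    print(f"{density:.0e} devices/cm^2 -> switching time >= {t_min:.1e} s")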
Nanoscale devices such as quantum dots and single atomic or nuclear spins, which are
based on quantum electron behavior rather than electron transport, offer significant relief
from the thermal dissipation problems of electron transport devices. However, problems
of manufacturability and low-temperature operation are major obstacles to early
implementation for metal dot structures. For molecular structures, operation speed and
defect tolerance remain to be explored.
Coherent Quantum Computing
Coherent quantum effects using devices based on photons, or electron or nuclear spins,
and superconducting devices will require totally new computing structures. Such devices
function by superposed wave functions that are entangled as qubits and that easily
decohere when interacting with an external environment such as a measurement device.
Although enormously capable for a few selected algorithms such as encryption or deep
database searching, quantum computing is not seen yet as being of more general interest.
Also, the nanofabrication requirements for arrays of coherent quantum devices needed to
maintain the required coherent states are far beyond any extrapolated process control
capabilities. So, for now, there is no known path to produce quantum-computing systems
with more than a few qubits, and we will not consider them further here. Future revisions
of the semiconductor roadmap will require a more detailed consideration of them.
Future R&D Needs
Needless to say, neither the circuit technology nor the design tools are available to
implement the kind of nanoarchitectures being considered here. So significant research
and development will be needed in these areas to support the useful implementation of
quantum-dominated nanoscale devices for computing and communications applications.
Also self-test procedures will need to be developed that include automated re-routing
among tiles, particularly for new information processing structures that self organize
instead of being programmed and that have intrinsic fault tolerance.
In addition to the need for infrastructure research, it will be necessary to fund significant
programming model/algorithm/application research that will throw off the constraints of
traditional instruction sets and prepare for implementation in software-controlled
processor systems that can be reconfigured at various times, including run time, and/or
enable direct bit-mapped implementations. Such implementations should be enabled by
the dense memories made possible by nanodevices. At the very high densities of future
technology generations, reconfigurability at run time through soft control and direct
binary mapping of applications to hardware will enable far more complex algorithm
implementations than are possible with the hard-wired, ISA-bound systems of today.