ATAC: Enhancing Multicore Programmability through
All-to-All Computing
The computing world has made a generational shift to multicore as a way of addressing the Moore’s
gap, the growing disparity between the performance offered by sequential processors and the
scaling expectations set by Moore’s law. Two- and four-core multicores are commonplace today, with scaling
to thousands of cores expected by the middle of the next decade. Unfortunately, because even two- and
four-core multicores (let alone thousand-core multicores) are extremely hard to program, multicore’s
widespread acceptance is threatened [60]. The multicore programming challenge is a serious issue and
requires us to think about bold new approaches to architecture, programming, and software.
The ATAC project is based on one simple idea. The idea is that a low latency, low energy broadcast
mechanism from any core to all other cores can yield a big step forward in multicore programmability.
The broadcast mechanism is enabled by CMOS-integrated chip-level optical interconnection using WDM
(wavelength-division multiplexing) with multiple add/drop points. The optical interconnect augments a traditional
electrical-mesh interconnect in a tiled multicore processor. We believe that although point-to-point
electrical interconnect is capable of delivering performance that is competitive with on-chip optical
interconnect, it does not solve the programmability issue. Thus, in ATAC, the optical broadcast capability
does not replace basic electrical interconnect; it simply augments it for programmability. Previous work
of the co-PIs of this proposal has demonstrated that the on-chip optical broadcast network integrated
with a standard CMOS process is feasible to build, and that the availability of the broadcast mechanism to
programmers through a software API can yield significant programmability gain.
Accordingly, to drastically ease programming for multicores, we propose the ATAC computer
architecture that augments an on-chip mesh network with an on-chip optical broadcast network. Such a
network enables blazingly fast broadcast communication that will allow programmers to fully take
advantage of the multicore opportunity, even as multicores scale to thousands of cores. Although this
capability has the potential to greatly speed up existing algorithms, its biggest appeal lies in its ability to
facilitate new, easy-to-use programming models. An efficient broadcast mechanism allows for novel,
distributed coherent-shared-memory architectures, as well.
We have assembled a cross-disciplinary team with expertise in computer architecture, programming
languages and compilers, VLSI design, and integrated microphotonics. Our team is led by the MIT
Computer Science and Artificial Intelligence Laboratory (CSAIL) and MIT Microphotonics Center (MPC) in
collaboration with the MIT Microsystems Technology Laboratories (MTL), Sandia Labs, and Intel Corp., to
create a new computing platform that has the potential to significantly simplify multicore programming.
The ATAC project proposes to design and to build a prototype computer system using the ATAC
approach. This includes the ATAC computer architecture design, a detailed computer system simulator, a
compiler system, a runtime system, a programming model and associated APIs that leverage the
broadcast capability of the underlying ATAC multicore, and architectural models and interfaces for the
optical interconnect technology. The optical interconnect design and fabrication is supported by DARPA’s
EPIC (Electrical and Photonic Integrated Circuits) program. The multicore simulator will be developed in
collaboration with Intel. The simulator is based on the Pin dynamic instrumentation infrastructure and will
enable us to simulate 1000’s of cores on our existing compute server farm with over 100 processors.
The proposed research will make five fundamental contributions. The first contribution is increased
programmability for multicores due to ATAC’s seamless integration of the efficient broadcast facility
enabled by an on-chip optical network along with the electrical mesh network. The second contribution is
the development of Application Programming Interfaces (APIs) and programming languages that allow
algorithms to best take advantage of ATAC’s on-chip communication infrastructure while minimizing the
burden on the programmer. The third contribution is the development of appropriate metrics that assess
programmer productivity when developing parallel applications on multicore architectures. Fourth, this
research will implement a multicore simulator that can simulate thousands of cores, and can run on
current multicore hardware. Finally, this research will assess how well multicores scale to thousands of
cores, especially in light of performance and energy scalability.
The broader impacts of this work will include easing the multicore programming burden for future
multicore systems, thereby removing the fundamental constraint to mass adoption of multicore. It will also
create the “killer app” for on-chip optical interconnect, thereby bringing this technology into the
mainstream. The project will also train undergraduate and graduate students in system building and
multicore software and hardware technologies. As with our previous Raw and Alewife projects, we will
involve several undergraduate students in our research, thereby helping to train the next generation of
multicore researchers and programmers.
Results from Prior NSF Support
Project ATAC’s team includes Profs. Anant Agarwal (Computer Science and Artificial Intelligence Lab,
MIT), Saman Amarasinghe (Computer Science and Artificial Intelligence Lab, MIT), and Lionel Kimerling
(Microphotonics Center, MIT).
Anant Agarwal Prof. Agarwal’s research group focuses on computer architecture, compilers, VLSI and
applications. We refer to current and previous NSF grants EIA-9810173 and EIA-0071841, “Baring It All
to Software: The MIT Raw Machine”, as the Raw project, and MIP-9012773, “Automatic Management of
Locality in a Scalable Cache-Coherent Multiprocessor: The MIT Alewife Machine,” as the Alewife project.
Figure 1: Die photo of the Raw processor (left), and photo of the Raw prototype computer system (right).
The Raw project [2,4,5,6,10] designed and implemented [7] the Raw processor (a single-chip tiled
multicore), a distributed ILP and stream compiler [9], runtime and operating systems, tools and other
system software. The Raw processor, implemented in the 0.18-micron IBM SA-27E ASIC process,
occupies an 18.2 mm x 18.2 mm die and runs at 420 MHz. The Raw chip was operational in October 2002. The
prototype system is fully operational and many researchers from MIT as well as other institutions such as
USC/ISI, Lincoln Labs, and Lockheed Martin have used it. We also built a Raw Fabric multiprocessor
system consisting of 4 Raw chips (for a total of 64 tiles).
The Raw effort pioneered the tiled architecture concept. It also conceptualized the notion of a scalar
operand network, or SON [8]. The Raw effort developed the notion of the on-chip distributed direct cache.
The project created a characterization of on-chip networks called Astro [8], which facilitates comparison of
different tiled and conventional superscalar processors such as Trips, Scale, ILDP and others.
The effort also led to many fundamental discoveries in compiler techniques for orchestrating ILP and
streams including algorithms for distributed ILP (DILP), control localization, space-time instruction
scheduling [9], software-serial ordering, modulo unrolling and equivalence class unification. The effort
also produced early work on transactional software methods such as SUDS (Software Undo System) [12].
For more details please refer to the Raw publications site available at, or
Agarwal’s retrospective on Raw and tiled processors given as a keynote at the 2007 Micro 40 conference
(the talk is available from the conference web site).
The Raw effort developed several applications for multicores. It also created a new metric for the
versatility of processors and distributed an associated benchmark suite called VersaBench
( or google versabench).
The Raw project impacted the community in several ways. First, the project produced several dozen
research papers. The project also graduated several Post Docs, PhD, MS and BS students, many of
whom went on to become professors at other universities (e.g., Matt Frank at UIUC, Rajeev Barua at U
Maryland, Andras Moritz at UMass Amherst, Michael Taylor at UC San Diego).
The tiled multicore technology developed by the Raw project was also commercialized by a
venture-funded startup called Tilera. Tilera has made commercially available a 64-tile multicore chip called the
Tile Processor. The Tile Processor was also chosen by the NRO for US space-based applications.
The Alewife project [14] was also funded by NSF and was conducted at MIT in the early 90’s. The
goal of the Alewife experiment was to discover and to evaluate mechanisms for automatic locality
management in scalable multiprocessors. The MIT Alewife machine became operational on May 7, 1994.
Like the Raw project, Alewife involved a major system building effort. A 32-node machine was in regular
use until 1998. The machines and simulators have also been used in graduate-level courses and a
summer course for industry participants at MIT.
Alewife pioneered the integration of message passing and shared memory into a single coherent
interface, and a flexible, software-extended, shared-memory system called LimitLESS directories.
Alewife’s Sparcle processor [15] was an early demonstration of a multithreaded microprocessor. It
created the concept of coarse-grain multithreading (CGMT). Several Sparcle mechanisms — including
trap vector spreading for fast exception handling, rapid context switching, and user-level address space
identifiers for fast, user-level messages — influenced SPARC V9.
The Alewife project produced dozens of publications. Alewife and Virtual Wires papers and pictures
are available through the web sites,, and
Saman Amarasinghe Professor Amarasinghe’s current work on the StreamIt project was supported by
an NSF NGS award (0305453) “StreamIt: A Language and a Compiler for Streaming Applications” and
by an NSF ITR award (0325297) “A Language, Compilers and Tools for the Streaming Application
StreamIt [34,35,36,37,38,39,33,13] is aiming to ease the burden of programming multicore
architectures by developing high-level programming idioms, compiler technologies, and runtime systems.
The StreamIt project has two goals: to improve programmer productivity for a streaming class of applications
and to obtain high performance, portability, and scalability for StreamIt programs.
Improving Programmer Productivity In StreamIt, the programmer builds an application by connecting
components together into a stream graph, where nodes represent filters that carry out the computation,
and edges represent FIFO communication channels between filters. As a result, the parallelism and
communication topology of the application are exposed, empowering the compiler to perform many
stream-aware optimizations that elude other languages.
In StreamIt, all of the processing is encapsulated hierarchically into single-input, single-output
streams with well-defined modular interfaces. This facilitates development and boosts programmer
productivity, as components can be debugged and verified as standalone components.
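To make the stream-graph model concrete, the following is a minimal sketch in Python rather than in StreamIt syntax. It assumes a filter can be modeled as a function from one input stream to one output stream; the names `moving_average`, `scale`, and `pipeline` are hypothetical and are not part of the StreamIt language or compiler.

```python
from collections import deque
from typing import Callable, Iterable, Iterator

# A "filter" here is any function that maps one input stream to one output
# stream (single-input, single-output). This is an illustrative model only.
Filter = Callable[[Iterable[float]], Iterator[float]]


def moving_average(window: int) -> Filter:
    """Build a filter that emits the mean of each sliding window of `window` items."""
    def run(items: Iterable[float]) -> Iterator[float]:
        buf: deque = deque(maxlen=window)  # state private to this filter
        for x in items:
            buf.append(x)
            if len(buf) == window:
                yield sum(buf) / window
    return run


def scale(factor: float) -> Filter:
    """Build a stateless filter that multiplies each item by `factor`."""
    def run(items: Iterable[float]) -> Iterator[float]:
        for x in items:
            yield x * factor
    return run


def pipeline(*filters: Filter) -> Filter:
    """Connect filters in order; the result is itself a single-input, single-output filter."""
    def run(items: Iterable[float]) -> Iterator[float]:
        stream: Iterable[float] = items
        for f in filters:
            stream = f(stream)  # the output stream of one filter feeds the next
        return iter(stream)
    return run


smooth_then_double = pipeline(moving_average(2), scale(2.0))
result = list(smooth_then_double([1.0, 3.0, 5.0, 7.0]))
```

Because each filter and each composed pipeline has the same one-in, one-out shape, a component such as `moving_average(2)` can be run and checked on its own before it is connected to anything else, which is the modularity property described above.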
High performance, portability, and scalability StreamIt attempts to expose the common properties
across all multicore architectures in the language while abstracting away the differences between
architectures. Thus, common properties such as multiple flows of control and multiple local memories
are directly exposed in the language, making the abstraction boundary between the language and the
architecture that the compiler has to bridge as narrow as possible. As a result, unlike existing imperative
languages, where compilers have to perform heroic or even impossible tasks, the StreamIt compiler is able to achieve
respectable performance with relative ease.
StreamIt also hides processor-specific properties such as the number of cores, communication
primitives and topology, computation strength, and memory layout within the cores. In the StreamIt
compiler, we are developing the algorithms needed to effectively take advantage of these properties
without losing portability or performance.
Lionel Kimerling Prof. Kimerling’s research [16,17,18,19,22,23,24,25,26,27,28,29] has focused on
microphotonic integration on the silicon platform for over 20 years. Among the group’s achievements are:
1) a monolithically integrated MOSFET driver and Si:Er LED emitting at 1550nm (operating at room
temperature); 2) the first low loss, silicon channel waveguides; 3) the first omnidirectional dielectric stack
reflector; 4) the first Ge-on-Si photodetector; 5) the first silicon disk, ring and race track resonators and
integrated silicon bus-add/drop filters; 6) discovery of a strong Franz-Keldysh effect in Ge and application
to integrated SiGe optical modulators; 7) the smallest waveguide integrated Ge-on-Si photodetectors
exhibiting 100% quantum efficiency; record low loss silicon channel waveguides (0.35 dB/cm); 8)
demonstrated full CMOS process flow for monolithically integrated Si/Ge-on-Si waveguide/modulator/ring
resonator/photodetector circuit; 9) first monolithically integrated silicon optical RF channelizer.
In terms of NSF sponsored research, Kimerling established and led the Microphotonics IRG in the
MIT MRSEC from 1997-2002. The team studied the silicon platform for HIC photonic materials and
devices. They created the first photonic crystal device to operate at the wavelength of 1550nm, a
waveguide-integrated, photonic crystal add/drop filter. That structure continues to hold the record for the
smallest modal volume of a photonic device.
Strong Atom-Photon Interaction for Microphotonic Devices (2000-2002) We observed the first
enhancement of 1550nm emission from Er2O3 in a Si/SiO2 microcavity; we observed the first evidence of
THz Rabi splitting from a matched cavity structure of the same materials; we created continuously tunable
(1200-1600nm with bias <12volts) MEMS microcavity devices using double resonant structures.
Agglomeration of Ultra-thin Silicon-On-Insulator Films: Understanding Dewetting in Crystalline Thin
Films (2004-2006) We developed a 5-step surface-energy-driven dewetting model for SOI
agglomeration based on the capillary film edge stability and the generalized Rayleigh instability. Our
surface-energy-driven model explains well all of the key experimental observations in the
existing literature and our own new experimental results. For the first time, we observed highly
anisotropic dewetting behavior that was very sensitive to the edge orientation of a patterned mesa. We
demonstrated the effectiveness of a dielectric edge coverage technique for stabilizing patterned SOI
structures against dewetting.
The trend in computer architecture for the foreseeable future is clear: microprocessor designers are
using copious silicon resources to integrate more and more processor cores onto a chip. In fact, within
the next ten years, general purpose multicore processors will likely contain 1,000 cores or more. While
this path towards ever-increasing parallelism will theoretically enable massive performance, it is unlikely
that application developers will be able to harness this potential unless drastic improvements to
programming are made [60]. Current approaches to multicore programming are barely manageable for
multicores with two or four cores, but they certainly will not scale to massive amounts of parallelism. A
new architectural mechanism---fast on-chip broadcast enabled by novel optical technology---will
revolutionize the programmability of future multicore processors.
While parallel programming used to be somewhat of a black art reserved for the handful of rocket
scientists that programmed supercomputers and clusters, multicore’s imminent rule will require most
programmers to implement parallel applications. However, by incorporating powerful hardware and
architectural mechanisms, such as a fast broadcast, and empowering programmers with the right
interfaces to the underlying architecture via APIs and language facilities, all programmers will be able to
efficiently construct programs that exploit multicore’s power.
The broadcast primitive, whereby one node in a parallel computer system communicates some data
to every other core (or, some subset of the cores), is powerful and straightforward to use. Parallel
algorithms often use broadcast to achieve synchronization and communicate sentinel values or data
values. The popular SMP (symmetric multiprocessing) computational model can also use a cheap
scalable broadcast to scale beyond a small handful of cores. However, broadcast operations can be
expensive. Even a state-of-the-art electrical mesh network for a multicore of the future with 1000 cores
would require 100’s of cycles to broadcast a single value to all cores. Such a latency is large enough that
broadcast must still be used judiciously, or not at all, by programmers. In fact, programmers often
implement otherwise straightforward algorithms in complicated ways to work around performance
bottlenecks such as slow broadcasts. As an example, MPI has a broadcast feature, but it is rarely used
for this reason. With an essentially “free” broadcast, however, programming parallel systems would be
hugely simplified, as programmers could use broadcast freely.
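The cost of broadcast on a mesh can be illustrated with a rough lower-bound estimate. The sketch below assumes a square mesh with nearest-neighbor links only, ignores contention and send/receive overhead, and uses a per-hop delay of 2 cycles; that delay and the function names are assumptions made for illustration, not figures or interfaces from this proposal.

```python
def mesh_broadcast_hops(side: int, src_row: int, src_col: int) -> int:
    """Hops needed for a value to reach the farthest core of a side x side mesh.

    With nearest-neighbor links only, the farthest core is the corner with the
    largest Manhattan distance from the source.
    """
    return max(src_row, side - 1 - src_row) + max(src_col, side - 1 - src_col)


def mesh_broadcast_cycles(side: int, src_row: int, src_col: int,
                          cycles_per_hop: int) -> int:
    """Lower bound on broadcast latency, ignoring contention and endpoint overhead."""
    return mesh_broadcast_hops(side, src_row, src_col) * cycles_per_hop


SIDE = 32            # 32 x 32 = 1024 cores, close to the 1000-core case
CYCLES_PER_HOP = 2   # ASSUMED router-plus-link delay, not a measured value

corner_cycles = mesh_broadcast_cycles(SIDE, 0, 0, CYCLES_PER_HOP)
center_cycles = mesh_broadcast_cycles(SIDE, SIDE // 2, SIDE // 2, CYCLES_PER_HOP)
```

Under these assumptions a broadcast from a corner core needs 62 hops and one from a central core needs 32 hops, before any contention or software overhead is counted. The bound grows with the side length of the mesh, which is why the latency worsens as the core count increases.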
Why do this research now? There are two major reasons. First, multicores have recently become
mainstream and are facing a parallel programming crisis [60], so bold architectural changes are
warranted. Second, recent breakthroughs in CMOS integration of nanophotonic components [25] provide
the enabling technology to make the broadcast mechanism viable. Our photonic technology uses planar
lightguide circuits, or PLCs, with wavelength division multiplexing (WDM). CMOS silicon offers all of the
information capacity advantages of fiber and the precision planar processing of PLCs with the additional
advantage of dense integration on a platform that is compatible with electronic integrated circuits. The
basis of this dense integration (~10^6 photonic devices per cm^2) capability is high index contrast. While
fiber and PLCs typically utilize a core/cladding refractive index ratio of <0.01, the Si/SiO2 ratio is ~2.33.
This design paradigm, named high index contrast, HIC, provides strong confinement of light in small
volumes, such as for on-chip planar waveguides. Conventional optical devices utilize low index contrast
and are compatible with silicon circuits neither in size nor in terms of CMOS processes. The layer thickness
and dimensions of silicon waveguides, 200 nm x 500 nm, are similar in size to upper-level metal interconnects,
with much higher information-carrying capacity and no electromagnetic interference (EMI).
MIT’s microphotonics group (led by co-PI Prof. Kimerling) has designed, implemented and
demonstrated HIC devices within a CMOS process flow. Their fabrication efforts are funded under
DARPA’s EPIC program.
Our collaborative effort for this cross-discipline proposal focuses on designing computer architectures
and programming for the novel broadcast interconnect enabled by the on-chip optical technology,
researching the degree to which an efficient broadcast mechanism eases multicore programming,
research on partitioning of function between the optical and electrical-digital domains, and design of
interfaces and models for the optical components to facilitate their incorporation in computer systems.
ATAC creates a fundamental shift in multicore computing that utilizes an electronic mesh for short
range intercore communications and a broadcast optical network optimized for global communication.
Our early results indicate that the ATAC approach will simplify multicore programming significantly, that it is
scalable to 4000 cores/chip, and that it can also ease the off-chip memory bandwidth bottleneck by
extending the optical network off-chip.
Overview of the ATAC Approach
As displayed in Figure 2, the proposed ATAC microprocessor is constructed in a 2-D tile layout of
computing cores, each containing data and instruction caches, communication hardware, and
computational resources. The cores are interconnected via an electrical mesh network for near neighbor
communication, as well as an optical broadcast network for global communication. The optical broadcast
network can be thought of as a global bus whereby every core can communicate with every other core in
a few cycles. However, unlike a standard electrical bus, the optical broadcast network is a global
communication channel that is scalable to thousands of cores. Indeed, the ATAC architecture is being
designed to scale to four thousand cores or more.
Figure 2: High-level view of ATAC composition consisting of a 2-D array of tiles interconnected by both an
electrical mesh network and an optical broadcast network. A conceptual view of the broadcast network is shown
here. The practical implementation using a set of rings is described later. (Figure labels: Electrical Mesh
Network; Optical Broadcast Network.)
At a high level, ATAC overlays a standard multicore processor with an on-chip 2-D mesh network (e.g.,
Raw, Trips) with an optical broadcast network. Some applications require significant near-neighbor
communication due to spatial locality inherent in the algorithm. However, many algorithms, such as
search, are more easily coded if they can use global communication to broadcast the current best value.
SMP architectures also benefit from an efficient broadcast because they can invalidate copies of data that
are cached in multiple caches quickly. The ATAC approach is to blend the best of both electrical mesh
and optical broadcast networks in a way that yields good performance and programming ease.
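The search example can be sketched as a small sequential simulation. It assumes each candidate has a cheap lower bound and a more expensive exact cost, and that cores scan their partitions in lockstep; the function `search` and its `broadcast` flag are hypothetical stand-ins for illustration and do not represent the ATAC API.

```python
import math
from typing import List, Tuple

Candidate = Tuple[float, float]  # (cheap lower bound, exact cost), with bound <= cost


def search(partitions: List[List[Candidate]], broadcast: bool) -> Tuple[float, int]:
    """Simulate cores scanning their own partition in lockstep.

    Returns (best exact cost found, number of exact evaluations performed).
    With broadcast=True every core prunes against one globally shared bound;
    otherwise each core only knows the best value it has found itself.
    """
    local_best = [math.inf] * len(partitions)  # per-core bounds
    shared_best = math.inf                     # bound visible to every core
    evaluated = 0
    longest = max(len(p) for p in partitions)
    for step in range(longest):
        for core, part in enumerate(partitions):
            if step >= len(part):
                continue
            lower, exact = part[step]
            bound = shared_best if broadcast else local_best[core]
            if lower >= bound:
                continue  # pruned without paying for the exact evaluation
            evaluated += 1
            local_best[core] = min(local_best[core], exact)
            shared_best = min(shared_best, exact)  # models broadcasting a new best
    return min(local_best), evaluated


work = [[(1.0, 5.0), (4.0, 6.0)], [(2.0, 3.0), (3.0, 9.0)]]
with_broadcast = search(work, broadcast=True)
without_broadcast = search(work, broadcast=False)
```

In this toy input both runs find the same best cost, but the run with a shared bound performs fewer exact evaluations because a good value found on one core immediately prunes work on the others. The code needed to share the bound is a single assignment, which is the programmability argument in miniature.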
Fast global communication will become increasingly important as multicores scale to hundreds or
thousands of cores. It is estimated that it will take a future multicore with 1,000 cores at least 100 cycles
to perform a global broadcast operation. However, ATAC will be able to perform such an operation on a
1000-core chip in 10 cycles or less. Given that many applications rely heavily on global communication,
such benefits will allow for performance improvements of over 40x for some applications, as seen in
preliminary performance results.
Perhaps more importantly, fast global communication will become essential to enable programmers
to manage the arduous task of programming hundreds or thousands of cores. Programmers have
typically used global communications operations sparingly, as such operations often impose a significant
performance penalty. Accordingly, parallel programmers often decompose otherwise straightforward
algorithms into much more complicated forms to minimize the amount of global communication necessary
to implement the algorithm. Furthermore, programmers have to account for potentially widely varying
communication latencies, depending on the distance between the sending and receiving cores. Not
surprisingly, this all means that getting good performance on standard multicore systems can be
incredibly challenging, and can take a long time. On ATAC, programmers will be able to broadcast values
at will, or use the underlying electrical mesh, without worrying about the typical performance impact of
imprudent use of global communication operations.
The ATAC broadcast and select network optimizes power efficiency by sending the data to multiple
places with little extra power consumption (the source modulator is the primary source of power
consumption and is independent of the number of receivers to first order). The use of wavelength division
multiplexing minimizes interconnect contention. ATAC is scalable to 4000 cores/chip and also to multi-chip
and multi-box architectures. The ATAC solution is also monolithically integrated into standard
microprocessor chip fabrication processes using CMOS thereby improving its performance/cost benefit.
Related Research
Parallel programming is a challenging problem [41]. In order for existing sequential codes to take
advantage of multicores, programming tools and novel architectures are needed to ease the transition
from sequential codes to multicore codes [42]. Much academic and industry research has gone into
easing the effort required to execute codes on parallel computers, including parallel
programming APIs, domain-specific languages, automatic parallelizing compilers, parallel performance
tools, and incremental architectural enhancements, but with little impact. We believe
bold architectural approaches are needed to address this pressing issue.
There are many extensions to sequential programming languages to provide parallel capabilities.
Examples include threads [43], OpenMP [44], MapReduce [45], and MPI [46]. These methods work well
for some set of applications, but none has shown itself capable of tackling all forms of parallelism. Many
of the methods, such as threads, do not scale beyond a few cores. Also, some researchers, including Lee [47],
believe that these extensions allow the programmer to easily introduce program errors. MPI
is difficult to program, and it squanders the advantage of multicore with its high operation overhead.
All-to-all broadcast using optical interconnect addresses both the scalability and programmability issues.
Domain specific languages also attempt to address multicore programmability. StreamIt [48] and
Brook [49] are programming languages primarily focused on signal processing and stream processing.
Parallelism can also be extracted from sequential codes. This allows the programmer to not modify
their sources while still realizing performance improvements on parallel machines. Typically these are
modest gains and unlikely to scale to more than 10 or 20 cores. Examples of distributed ILP (DILP)
compiler efforts include Mahlke’s work [50], GREMIO [51], and our own effort on RawCC [9].
To ease multicore programming, hardware is increasingly being added to processors to provide
easier programming models. An example of this additional hardware is transactional memory systems
[52, 53]. Transactional memory systems allow multiple threads to access shared memory inside of a
transaction. If multiple threads access the same piece of data, then the system rolls back the
transactions such that only one modifies the shared data at a time. It is conjectured that this is easier to
program and less error prone than threaded programs with locks, but needs to be investigated further.
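The conflict-and-rollback behavior can be illustrated with a small software analogy. The sketch below uses optimistic version checking on a single shared cell; it is not how the hardware systems in [52, 53] are implemented, it is not thread-safe, and the classes `VersionedCell` and `Transaction` are hypothetical names introduced only for this example.

```python
from typing import Callable


class VersionedCell:
    """Shared datum tagged with a version number that increases on every commit."""

    def __init__(self, value: int) -> None:
        self.value = value
        self.version = 0


class Transaction:
    """Reads a snapshot, computes privately, and commits only if nobody else committed."""

    def __init__(self, cell: VersionedCell) -> None:
        self.cell = cell
        self.begin()

    def begin(self) -> None:
        # Start (or restart) from the currently committed version and value.
        self.seen_version = self.cell.version
        self.pending = self.cell.value

    def update(self, fn: Callable[[int], int]) -> None:
        self.pending = fn(self.pending)  # private write, invisible to other transactions

    def commit(self) -> bool:
        if self.cell.version != self.seen_version:
            self.begin()  # conflict: discard the private work and roll back
            return False
        self.cell.value = self.pending
        self.cell.version += 1
        return True


counter = VersionedCell(10)
t1, t2 = Transaction(counter), Transaction(counter)
t1.update(lambda v: v + 1)
t2.update(lambda v: v + 5)
first = t2.commit()    # t2 commits first
second = t1.commit()   # t1 conflicts with t2 and is rolled back
t1.update(lambda v: v + 1)
retry = t1.commit()    # the retried transaction builds on t2's result
```

The programmer writes only the update itself; detecting the conflict and re-executing is left to the system, in contrast to lock-based code where the programmer must decide what to lock and in what order.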
Different architectures attempt to solve the problem of organizing and connecting parallel resources
on a single chip. One manner to do this is via processors designed to exploit instruction level parallelism.
The Itanium processor [54] and Multiflow work [55] are examples of this. Streaming processors have
attempted to solve this problem by optimizing for applications with little temporal locality. The prototypical
stream processor is the Imagine processor [56]. Another organization of parallel resources is to build an
SMP on a chip. The Piranha project [57] is an example of an SMP on a chip for commercial workloads.
Finally there are processors which support multiple types of parallelism, examples being Trips [58] and
Raw processors which can support stream processing, thread level parallelism, and ILP.
Although other methods of using photonics for on-chip interconnect are being developed, their limited
gains versus electrical interconnect rarely justify their added cost and complexity. Free space optics has
attracted significant interest because it offers flexibility in terms of hybrid components. The downside lies
in its limited CMOS compatibility and fabrication of reliable optical components. Another approach
replaces the electrical bus with an optical bus. Unfortunately, contention still limits the optical bus. We
believe that our approach using WDM, CMOS process compatibility, and broadcast uniquely leverages
the strengths of photonics for the specific goal of enhancing multicore programmability in an area where
electrical interconnect falls short, thereby justifying its use.
Research Questions
The aim of the ATAC project is to create a multicore computer system that can scale to thousands of
cores, both in terms of performance and ease of programming. ATAC attempts to achieve this goal by
integrating an optical broadcast network into a tiled multicore processor with an electrical mesh network.
ATAC will also provide the programmer with high-level APIs to efficiently and effectively exploit ATAC’s
hardware resources. The research questions for the ATAC project center around how to best achieve and
balance the two goals of performance and programmability. More specifically, our research will attempt to
answer the following questions:
• What is the best way to interface an optical broadcast network in a basic tiled multicore processor?
• What is the right balance in the power budget between energy used in the electrical mesh network
and energy used in the optical broadcast network? Furthermore, what is the right balance between
the power consumption of such communication networks, the computational part of each core, and
the on-chip and off-chip memory resources?
• What is the best API to provide programmers to take advantage of both the optical broadcast
network and the electrical mesh? Should the APIs allow the programmer to observe or control the
spatial location where a particular piece of computation is run, or should the API handle this behind
the scenes? Are there high-level language constructs that would help programmer productivity?
• What is the best way to measure ease of programming and programmer productivity?
• How do performance and programming effort compare between the baseline tiled multicore architecture
and the architecture with the optical network?
• What is the degree to which the broadcast network can make programming easier for a given level
of performance? Or conversely, what is the degree to which performance can be increased with a
broadcast network for a given level of programming effort? This result will provide the justification to
take the next step of actually building a physical prototype of the ATAC processor.
• To what extent can a pipelined broadcast be implemented in software in a
traditional electrical network, and what performance is achievable?
• Which application classes will best take advantage of ATAC’s broadcast network?
• What is the best way to simulate 1000 cores at reasonable speeds and with sufficient flexibility?
The ATAC Approach
The ATAC approach incorporates an optical broadcast network into a tiled multicore processor
architecture. The optical network is enabled by recent advances in electronic-photonic integration using
standard CMOS. ATAC also seamlessly integrates an electrical mesh interconnect for high-bandwidth
near-neighbor communication. Programmers will interact with ATAC via high-level APIs that leverage the
system’s underlying resources. The following sections discuss details of the ATAC architecture, optical
network, and software infrastructure.
The Basic ATAC Architecture
Figure 3: ATAC architecture for a 4096-core microprocessor chip.
The ATAC processor architecture is a tiled multicore architecture combining the best of current
scalable electrical interconnects with cutting-edge on-chip optical communication networks. The tiled
layout uses a 2-D array of simple processing cores, each containing a multiple-instruction-issue in-order
RISC pipeline with an FPU and local memories. Each tile contains an L1 cache and a portion of the
distributed L2 and L3 caches. The L1 and L2 caches are SRAMs while the L3 cache is embedded DRAM.
Chip resources are divided approximately evenly between computation, communication, and memory
(one-third to each) which has been shown to be near-optimal for multicore processors [32].
One of the important and appealing aspects of our design is that we use a modest clock speed of
1GHz for the processors, electrical network, and the optical network. Although optical networks can be
clocked at much higher speeds, the power consumption at the endpoint transducers and interface
circuitry can be prohibitive. Since our optical network is being used only for broadcast it has modest
bandwidth requirements and we do not need to resort to clock speeds that are much higher than the base
processor frequencies.
Each of the cores is connected to its four nearest neighbors using point-to-point electrical connections.
The sum of all the electrical links is a complete mesh network (the “ENET” indicated in Figure 3) capable
of transferring data between any two cores using multiple hops. On top of this state-of-the-art electrical
substrate, ATAC adds an integrated photonic communications network to improve the performance and
efficiency of operations that are costly when using the electrical mesh. These operations include
broadcast/multicast communication, as well as point-to-point communication between cores that are
physically far apart.
The heart of the on-chip optical interconnect is the all-to-all network (or “ANET”). The ANET provides
a low-latency, contention-free connection between a set of optical endpoints, as depicted for 64 tiles in
the center part of Figure 3. This highly interconnected topology is achieved using a set of optical
waveguides that visit every endpoint and loop around on themselves to form continuous rings. Further, as
illustrated in the right side of Figure 3, each sending endpoint can place data onto a waveguide using an
optical modulator (shown as a yellow circle on each of the waveguides) and receive data from the other
endpoints using optical filters and photodetectors (shown as a red circle). Because the data waveguide
forms a loop, a signal sent from any endpoint will quickly reach all of the other endpoints. Thus, every
transmission on the ANET has the potential to be a fast, efficient broadcast. To avoid having all of these
broadcasts interfere with each other, the ANET uses wavelength division multiplexing (WDM). The
processor cores in the ATAC architecture have a 32-bit word size making it desirable for them to be able
to send a 32-bit word on each clock cycle. This is accomplished using a set of parallel waveguides where
each waveguide carries one bit. In a baseline ATAC processor, there would be 32 waveguides, each
transmitting data at the same frequency as the processor core. If chip real-estate for optical components
is limited (as might be the case when scaling up to thousands of cores) serialization can be used to
decrease the number of waveguides. In other words, we can reduce the number of waveguides to 16, 8
or even 4, and serialize a 32-bit word using multiple sub-word transfers.
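As a rough illustration of the serialization trade-off described above, the following Python sketch computes how many clock cycles a 32-bit word needs for a given number of data waveguides and splits a word into sub-word transfers. The function names and the least-significant-first ordering are assumptions made for illustration, not part of the ATAC design.

```python
def cycles_per_word(word_bits: int, waveguides: int) -> int:
    """Clock cycles needed per word when each waveguide carries one bit per cycle."""
    if waveguides <= 0:
        raise ValueError("need at least one waveguide")
    return -(-word_bits // waveguides)  # ceiling division


def serialize(word: int, word_bits: int, waveguides: int) -> list:
    """Split a word into sub-words, one per clock cycle (least-significant first, by assumption)."""
    mask = (1 << waveguides) - 1
    return [(word >> (i * waveguides)) & mask
            for i in range(cycles_per_word(word_bits, waveguides))]


def deserialize(chunks: list, waveguides: int) -> int:
    """Reassemble the sub-words produced by serialize()."""
    word = 0
    for i, chunk in enumerate(chunks):
        word |= chunk << (i * waveguides)
    return word


if __name__ == "__main__":
    for n in (32, 16, 8, 4):
        print(n, "waveguides:", cycles_per_word(32, n), "cycle(s) per 32-bit word")
```

With 8 waveguides, for example, each 32-bit word occupies the link for four cycles, so the waveguide count trades chip area directly against per-word broadcast bandwidth.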
In addition to the primary data waveguides, there are several other special-purpose waveguides. First,
there is an optical “power supply” waveguide that provides the light source for the modulators. Second,
there is a clock waveguide which sources use to send the clock along with the data. Third, there is a
backwards flow-control waveguide that is used to throttle back a sender when a receiver is overwhelmed.
Finally, we are exploring the use of several “metadata” waveguides that are used to indicate a message
type (e.g., cache read, cache write, barrier, ping and raw data) or a message tag (for disambiguating
multiple message streams from the same sender).
In the WDM design, all the modulators on a given sender are tuned to transmit at a unique
wavelength. To receive data from any sender at any time, each receiving endpoint must contain sets of
filters trimmed to the wavelengths of each of the other endpoints’ modulators. Each set of filters then
feeds into a separate FIFO (First-In, First-Out buffer), allowing the data from each sender to be buffered
separately. This saves the processor core from the extra step of examining each message to determine
the sender and find the message it needs. Since every receiver is not necessarily interested in every
message sent on the network, special-purpose hardware is used to pre-screen messages and forward
only messages of interest to the FIFOs. This avoids the extra energy associated with buffering and
handling unwanted data and allows the FIFOs to be kept smaller. It also frees the processing core from
having to sort through messages using software. Messages can be screened based on sender (i.e.,
wavelength) or by the metadata transmitted with each message. This novel buffering and filtering scheme
is an area of active research for this project.
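The receive-side behavior described above can be modeled in software as follows. This is only a minimal sketch: the class names, the message fields, and the policy of screening on both sender and message type are hypothetical, since the actual buffering and filtering hardware is still an open research question.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    sender: int     # index of the sender's wavelength
    msg_type: str   # metadata, e.g. "barrier" or "raw"
    payload: int


@dataclass
class Receiver:
    wanted_senders: set
    wanted_types: set
    fifos: dict = field(default_factory=dict)  # one FIFO per sender

    def on_broadcast(self, msg: Message) -> bool:
        """Pre-screen a broadcast; buffer it only if it is of interest."""
        if msg.sender not in self.wanted_senders:
            return False
        if msg.msg_type not in self.wanted_types:
            return False
        self.fifos.setdefault(msg.sender, deque()).append(msg)
        return True

    def pop_from(self, sender: int):
        """Return the oldest buffered message from one sender, or None if there is none."""
        fifo = self.fifos.get(sender)
        return fifo.popleft() if fifo else None


if __name__ == "__main__":
    rx = Receiver(wanted_senders={1, 2}, wanted_types={"barrier"})
    rx.on_broadcast(Message(1, "barrier", 7))   # buffered
    rx.on_broadcast(Message(3, "barrier", 9))   # dropped: sender not of interest
    print(rx.pop_from(1))
```

Because each sender has its own FIFO, the consuming core never has to inspect a message to find out who sent it; it simply reads from the queue of the sender it is waiting on.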
The design of a single ANET optical link scales to approximately 64 endpoints. Therefore a 64-core
chip (feasible using a 90nm or 65nm CMOS process) could simply place a single core at each ANET
endpoint. Scaling beyond this point requires some number of cores to share a single optical endpoint.
The set of cores sharing one endpoint is referred to as a “region.” For chip designs requiring only a small
number of cores in each region, electrical circuits can be used to negotiate access and distribute
incoming data, preserving the illusion that all cores are optically connected. As regions grow larger, it is
preferable to use a ring-of-rings optical architecture, as illustrated on the left-hand side of Figure 3. This
design uses an optical network within each region, creating a two-level hierarchy in which
a top-level ANET (ANET1) connects together multiple regional
ANETs (ANET0). On the ANET0, there is a single core in each region. The cores connected together by
each ANET0 form a region on the ANET1. Our analysis indicates that this two-level design will scale to
4096 nodes at the 11nm CMOS technology point.
As described in more detail in a later section, the ANET0 and ANET1 networks are connected by an
interface tile using conventional electronics. Our design allows for a pair of values to be transmitted
simultaneously between the two levels. Thus, at any given instant two broadcasts can be happening
simultaneously over all 4096 cores. However, 64 simultaneous broadcasts can be happening within each
64-core regional network or ANET0.
Seamless connections to external DRAM or I/O devices are made by replacing some processing
cores with memory or I/O controllers. Processing cores access off-chip resources by sending messages
to these gateways. Memory cores receive requests on the ATAC network, interpret the requests
electrically, and then send messages to DRAM using a separate waveguide that goes off-chip. We clock
this waveguide at 2GHz. Data is transmitted on this waveguide using 64 different wavelengths to send 64
bits at a time. Replacing 4 cores in each 64-core region with memory controllers yields a memory-bandwidth-to-computation ratio of 1 byte/FLOP (assuming a 2GHz clock for the waveguides going off-chip). A 4096-core chip would require a reasonable 256 memory connections (supplying 4 TB/s
bandwidth) to achieve the same ratio.
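The arithmetic behind these figures can be checked with a short Python sketch. It assumes that each core sustains one FLOP per 1GHz cycle and that all 64 tiles of a region are counted as cores; neither assumption is stated explicitly in the text, and the names below are illustrative only.

```python
CORE_CLOCK_HZ = 1e9            # processor clock from the text
MEM_CLOCK_HZ = 2e9             # off-chip waveguide clock from the text
BITS_PER_TRANSFER = 64         # 64 wavelengths, one bit each
FLOPS_PER_CORE_PER_CYCLE = 1   # assumption, not stated in the text


def controller_bandwidth_bytes_per_s() -> float:
    """Bandwidth of one memory connection in bytes per second."""
    return MEM_CLOCK_HZ * BITS_PER_TRANSFER / 8


def bytes_per_flop(cores: int, controllers: int) -> float:
    """Memory-bandwidth-to-computation ratio for a group of cores."""
    bandwidth = controllers * controller_bandwidth_bytes_per_s()
    flops = cores * CORE_CLOCK_HZ * FLOPS_PER_CORE_PER_CYCLE
    return bandwidth / flops


if __name__ == "__main__":
    print("per-controller GB/s:", controller_bandwidth_bytes_per_s() / 1e9)
    print("64-core region bytes/FLOP:", bytes_per_flop(64, 4))
    print("4096-core chip TB/s:", 256 * controller_bandwidth_bytes_per_s() / 1e12)
```

Under these assumptions each connection supplies 16 GB/s, four connections match the 64 GFLOP/s of a region, and 256 connections supply about 4.1 TB/s, in line with the 4 TB/s quoted above.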
However, because the massive on-chip bandwidth of the combined electrical and optical networks
encourages communication-centric rather than memory-centric computing, the traditional rule-of-thumb
ratio of 1 byte/FLOP is excessive. Communication-centric computing allows processes to exchange
values directly rather than storing them in memory as an intermediary. In addition, ATAC’s efficient
broadcasts act as DRAM bandwidth multipliers, allowing each value fetched from DRAM to be received
by multiple cores. Together, these effects can lower memory bandwidth requirements significantly.
The area required to implement the computational portion (including L1 and L2 caches) of 4096 cores
is approximately 400 mm². Using an additional 200 mm² to implement an L3 cache using embedded
DRAM will allow for over 8 GB of on-chip memory. Because communication-centric computing reduces
the pressure on all levels of the memory hierarchy, this amount should be sufficient for most application
domains. If additional on-chip capacity is desired, 3D integration of chips can be used to stack regular
DRAM above each tile, allowing even more total “on-chip” memory.
The ANET Optical Network Architecture
A broadcast-and-select approach enables ATAC communications in a simple optical network. One of
the principal philosophies we have taken is to use electronics where electronics is best and optics where
optics is best. All switching and routing of data is performed in the electrical domain where switching
circuitry is readily implemented. Doing so eliminates the significant limitations associated with tuning in
the optical domain. Further, we show that despite the broadcast nature of this network, it is highly
efficient. This is because most of the power is consumed in driving the optical modulators, not in the
optical power required for transmission. We refer to this broadcast-and-select optical network as the All-to-All Network or ANET.
The operation of an ANET region is as follows. Each ANET region contains 64 cores. Each core is
assigned a wavelength channel to transmit on and enough receive elements to read all of the data being
transmitted from every other core within an ANET region. Data is transmitted at the chip clock frequency,
clock = 1 GHz, so as to avoid costly serialization and deserialization steps. In order to transmit a 4-byte
word within a single clock cycle, 32 ANET communication lines are required. An additional 8 lines are
added for address information and parity checks, bringing the total number of ANET lines within an ANET
region to 40. The number of connected cores is N=64 in a standard ANET region.
A more detailed design of the optical network is shown in Figure 4. Figure 4 also summarizes our
estimates of optical power consumed by the network. For an ANET with 64 cores and clock = 1 GHz, the
transmit bandwidth is TBW = 2.6 Tb/s and the receive bandwidth is RBW = 161 Tb/s. This simply indicates
that, because of the broadcast capability, there is a roughly 64X multiplier (N-1 = 63) on the send bandwidth. Our analysis
shows that the latency of the ANET network is approximately 3ns. Approximately 0.5ns is due to optical
delay and the rest is electrical. The total power consumption of the ANET, as shown in Figure 4, is 2.6W,
of which only 0.6W is consumed on-chip.
Comparisons of ANET performance with that of an electrical mesh are given in Table 1.
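The device counts and bandwidths quoted for a single ANET region follow directly from the number of lines, the number of cores, and the clock. The Python sketch below reproduces them; the function name and return layout are illustrative only.

```python
def anet_resources(n_cores: int = 64, n_lines: int = 40, clock_hz: float = 1e9):
    """Return (modulators, receivers, transmit bits/s, receive bits/s) for one ANET region."""
    modulators = n_lines * n_cores                 # one modulator per line per sender
    receivers = n_lines * n_cores * (n_cores - 1)  # every core listens to every other core
    transmit_bw = modulators * clock_hz            # one bit per modulator per cycle
    receive_bw = receivers * clock_hz              # one bit per receiver per cycle
    return modulators, receivers, transmit_bw, receive_bw


if __name__ == "__main__":
    nm, nr, tbw, rbw = anet_resources()
    print("modulators:", nm)
    print("receivers:", nr)
    print("TBW (Tb/s):", tbw / 1e12)
    print("RBW (Tb/s):", rbw / 1e12)
```

For N=64 cores and 40 lines this gives 2560 modulators, 161,280 receivers, a transmit bandwidth of 2.56 Tb/s and a receive bandwidth of about 161 Tb/s.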
ANET Resources
• Wavelengths = 64, Data Waveguides = 40
• # Modulators NM = 2560, # Receivers NR = 161,280
• Clock Frequency fc = 1GHz
ANET Power Budget (based on 11nm node)
• Optical Losses (L = 7dB)
  Backplane Coupling: 1dB
  Waveguide Propagation Losses: 2dB
    Regional ANET: 1.5-6cm → 0.33-to-1.3dB/cm
    1-Global ANET: 10-20cm → 0.1-to-0.2dB/cm
  Modulator Drop Loss: 1dB
  Filter Drop Loss: 1dB
  Filter Set Thru Loss: 1dB
  Power Supply Splitter Losses: 1dB
• Off-Chip Optical Power (POptical = 0.2W)
  POptical = 10^(L/10) x Qg x fc x NR / Rd = 0.2W
  Charge to Flip a Gate: Qg = 0.25fC
  Responsivity of Detector: Rd = 1.1A/W
• On-Chip Electrical Power (POn-chip = 0.6W)
  Modulator Power: PM = PMod-driv + PMod = 234µW
  Receiver Power: PR = 0.14W
  PElectrical = NM x PM + NR x PR = 0.6W
• Optical Power Supply (POptical-Supply = 2W)
  Laser: PL = 1.5W → 0.2W Optical (in fiber)
  (JDSU DFB data sheet CQF 935.708-19050)
  TEC: PTEC ~ 0.5W (est., but depends on Tcase)
  Total Optical Power Supply = PL + PTEC = 2W
• Total ‘wallplug’ ANET Power (PANET = 2.6W)
  PANET = Pon-chip + Poff-chip = 0.6W + 2W = 2.6W
Figure 4: ANET network with memory interface and layout detail for optical power coupling to chip and
distribution within the ATAC network. TBW = M x N x clock = 2.6Tb/s; RBW = M x N x (N-1) x clock = 161Tb/s. The
power consumption of a 64-core photonic ANET is 2.6W with only 0.6W consumed on-chip.
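The optical power estimate in Figure 4 can be reproduced numerically. The Python sketch below evaluates the formula as we read it from the figure, POptical = 10^(L/10) x Qg x fc x NR / Rd, and then sums the wallplug budget from the quoted on-chip, laser, and TEC figures. The variable and function names are illustrative.

```python
LOSS_DB = 7.0              # total optical loss L
GATE_CHARGE_C = 0.25e-15   # Qg, charge to flip a gate (0.25 fC)
CLOCK_HZ = 1e9             # fc
N_RECEIVERS = 161_280      # NR
RESPONSIVITY_A_PER_W = 1.1 # Rd


def optical_power_w() -> float:
    """Optical power that must be supplied to drive all receivers through the lossy path."""
    loss_factor = 10 ** (LOSS_DB / 10)
    return loss_factor * GATE_CHARGE_C * CLOCK_HZ * N_RECEIVERS / RESPONSIVITY_A_PER_W


def wallplug_power_w(on_chip_w: float = 0.6, laser_w: float = 1.5, tec_w: float = 0.5) -> float:
    """Total ANET power: on-chip electrical power plus the off-chip optical supply."""
    return on_chip_w + laser_w + tec_w


if __name__ == "__main__":
    print("optical power (W):", round(optical_power_w(), 3))
    print("wallplug power (W):", round(wallplug_power_w(), 2))
```

The formula evaluates to roughly 0.18W, which the figure rounds to 0.2W, and the budget sums to the quoted 2.6W.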
Scaling to 4096 cores The All-to-All photonic network can be scaled to at least 4096 cores with a
hierarchical structure. At 4096 cores, there will be 32 ANET0 networks and a single ANET1 network. The
regional ANET0 networks connect 128 cores using 64 optical nodes (i.e. two cores share a node). The
ANET0 and ANET1 connections are made via an optical-to-electrical-to-optical (OEO) conversion on an ANET0-ANET1 interface tile. This interface tile has the function of routing the data. The aggregate performance
of the photonic network is TBW =80Tb/s (transmit bandwidth) and RBW = 5000 Tb/s (i.e. 5 Pb/s) (receive
bandwidth). Off-chip memory and I/O connections will be made optically as well. The power
consumption of the on-chip photonic network is approximately 26W. Off-chip, memory and chip-to-chip
communications are expected to add only 10-20% to this power budget due to substantially lower off-chip
bandwidth requirements and to the point-to-point nature of memory connections. So, power consumption
is not a major concern. The area constraints of a 2cm x 2cm chip are a bit more stringent. In order to fit
the required 5.4 million devices, each device can take up no more than 75µm².
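The per-device area bound is a simple division of chip area by device count, as the following Python sketch shows. It assumes, for illustration, that the entire 2cm x 2cm die is available to the optical devices; the function name is hypothetical.

```python
def max_device_area_um2(chip_side_cm: float = 2.0, n_devices: float = 5.4e6) -> float:
    """Upper bound on area per device if the whole die is shared equally (an assumption)."""
    chip_side_um = chip_side_cm * 1e4       # 1 cm = 10,000 micrometers
    chip_area_um2 = chip_side_um ** 2
    return chip_area_um2 / n_devices


if __name__ == "__main__":
    print("max area per device (um^2):", round(max_device_area_um2(), 1))
```

The result is about 74µm² per device, consistent with the bound of roughly 75µm² quoted above.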
Interfaces between Digital and Optical Components
A key area of innovation in this project is the interface between traditional digital logic and the novel
integrated photonic devices. This includes both the low-level details of how a digital signal is translated
into an optical signal as well as the higher-level question of how the massive quantity of data received
from the optical network should be screened, sorted, and buffered before it is consumed by the processor.
Figure 5 shows the path a bit takes as it is transmitted through the optical network. First, a signal
stored in a flip-flop is used to activate a modulator driver (an analog electrical circuit) which, in turn,
controls the optical filter/modulator component. The filter/modulator couples light of a specific wavelength
from the wideband source waveguide, modulates it, and transfers it to the data waveguide. When the
light passes a receiver filter, part of it is drawn off and fed into a photodetector. The photodetector
outputs a small electrical signal which is then converted back to a digital bit.
Properly designing the optical components, modulator driver and circuits to receive the signal from
the photodetector requires a close collaboration between the architecture and photonics groups. Since
these optical components are used only to transmit digital signals, their functional requirements may be
significantly different from those needed to transmit analog signals. In addition, the physical
characteristics of the optical devices (e.g., size, capacitance, and manufacturing variation) greatly
influence the design of the electrical interface circuits.
Our proposed design uses a technique to transition from an optical signal to a digital electrical signal.
Photons that are extracted from a data waveguide by a receiver filter are routed to a photodetector whose
output is directly connected to a digital logic gate. The key to making this work is ensuring that the
photodetector output is sufficient to charge the input capacitance of a digital gate to the required
switching voltage. Traditionally a transimpedance amplifier (TIA) is used as the
interface to convert the tiny photodetector output current to digital voltage levels. However, TIAs are
sensitive analog circuits that are difficult to design and consume large amounts of die area (500 μm²) and
power (1mW) per receiver. In addition, placing these sensitive analog circuits and noisy digital circuits in
close proximity decreases reliability. By using high-efficiency filters and small-footprint, waveguide-integrated
photodetectors and carefully managing optical power levels, TIAs can be omitted in future
process generations. As digital circuits shrink, the input capacitances decrease and this technique
becomes feasible. Our research indicates that TIA-free designs will be practical for chips based on 22nm
and smaller process technologies.
Figure 5: Path of a single bit from the digital domain, through the optical network and back into digital
form. Our design does not require the traditional TIA on the receive side. (Diagram labels: sending core,
wideband source waveguide, data waveguide, receiving core.)
Programming and Compilation for ATAC
Given the multicore trend, all computer systems will be parallel systems; all programs will be parallel
programs. Yet, it is still unclear how programmers will make the shift from sequential programming to
efficient parallel programming. Accordingly, a key goal of the ATAC project is to enable programmers to
easily and efficiently harness the system’s computational power by establishing a clear, high-level
programming interface that makes use of the broadcast capability in the short term, and a language
based on broadcast in the longer term.
Our programming and compilation effort will have two facets. The first will develop an API that
exposes broadcast related primitives directly to the user. We will use the experience with this API to
design a language (such as Thinking Machines’ C*) where the basic broadcast primitive will drive the
language design. We have experience with designing APIs and languages for novel architectures. For
example, in the Raw effort, we designed an API for streaming called libStream followed by the
development of the StreamIt language [34].
The ATAC programming model and APIs will allow users to easily write programs employing global
operations such as broadcast, scatter/gather, and reduction via the optical broadcast network. Likewise,
programmers will be able to use the electrical mesh network through the high-level programming interface
as well. In fact, the interface will be able to transparently choose which network is used depending on the
type of communication (e.g., single point-to-point message vs. chip-wide broadcast operation).
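To make the idea of transparent network selection concrete, the Python sketch below shows one way a runtime might pick a network for each communication request. Everything here is hypothetical: the `Network` enum, the `choose_network` function, and the `far_hops` threshold are illustrative assumptions and not the ATAC API, which remains to be designed.

```python
from enum import Enum


class Network(Enum):
    ENET = "electrical mesh"
    ANET = "optical broadcast"


def mesh_hops(src: int, dst: int, width: int) -> int:
    """Manhattan distance between two cores on a width-by-width mesh (row-major numbering)."""
    sx, sy = src % width, src // width
    dx, dy = dst % width, dst // width
    return abs(sx - dx) + abs(sy - dy)


def choose_network(src: int, dests, width: int, far_hops: int = 8) -> Network:
    """Pick a network by communication type; far_hops is an assumed tuning threshold."""
    dests = list(dests)
    if not dests:
        raise ValueError("at least one destination is required")
    if len(dests) > 1:
        return Network.ANET            # multicast or broadcast
    if mesh_hops(src, dests[0], width) >= far_hops:
        return Network.ANET            # point-to-point between distant cores
    return Network.ENET                # near-neighbor point-to-point


if __name__ == "__main__":
    print(choose_network(0, [1], width=8))
    print(choose_network(0, [63], width=8))
    print(choose_network(0, range(1, 64), width=8))
```

The point of the sketch is that the programmer states only the destinations; the decision between mesh and optical network can then be made below the API.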
The second facet will provide ways of implementing existing architectural models that are known to be
reasonably easy to program, but difficult to implement, such as large-scale coherent snooping caches
along with a shared memory programming model, or the Message Passing Interface (MPI) along with multicast.
We will also experiment with hybrid approaches in which the programming model will support both
shared memory and message passing modes, both built upon the underlying on-chip networks and in-core caches. The shared memory model will build upon novel cache coherence algorithms that use the
broadcast network to efficiently communicate shared cache state. The message passing model will build
upon existing message passing models such as MCAPI and MPI, with an emphasis on ease of
programming with the addition of a set of broadcast primitives.
Measuring Programming Ease
Given that programming efficiency is such a key driver of the ATAC project, developing appropriate
metrics for programming efficiency is important. As such, this project will develop metrics that assess
“programming efficiency” and “ease of programming” by taking into account the difficulty of implementing
a variety of algorithms with the ATAC programming model as well as baseline programming models such
as a standard tiled multicore processor (e.g., Raw) or an MPI-based parallel system.
There are several possible ways of measuring the difficulty of implementing algorithms. One measure
that has been used in the past is “lines of code”. Our implementations of applications in both the ATAC
style and existing styles will directly yield the lines of code metric.
However, studies have shown [40] that Lines of Code is not a particularly good measure of
programming ease. This is due to the fact that not all lines of code require the same amount of effort to
write. For example, code that involves communication or synchronization between different processors is
generally more difficult to write than a simple computational loop. “Lines of code” also does not capture
the effort that the programmer had to expend deciding how to partition and spatially distribute an
application. To measure the true benefit to ATAC programmers, more sophisticated measures will need
to be created. These measures include a comparison of the actual time spent developing a given
application in both programming models.
Our team has previous experience with experiments designed to measure programmer productivity.
A study was conducted using the StreamIt programming language [33] where different programmers were
given a variety of application development and debugging problems and the time it took them to reach a
solution was measured. Similar studies might be a good way to more accurately measure the
programming advantages of an ATAC system. One problem with this approach is that, when
programmers are given a problem and asked to program it in both programming models, the second
model tends to appear easier because the programmer is already familiar with the problem the second time around.
A particularly appealing approach that normalizes out the “learning” bias of programmers is to divide
the programmer group into two subgroups. The two subgroups are asked to implement the application in
the two programming models in opposite orders, which tends to even out the learning bias. Thus, the
programming time measured in this manner tends to be a good measure of programming ease.
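The counterbalanced design described above can be sketched as a simple assignment procedure. The Python code below randomly splits a group of programmers into two halves and gives the halves opposite orderings of the two programming models; the function name, the model labels, and the fixed random seed are illustrative assumptions.

```python
import random


def counterbalance(programmers, models=("ATAC", "baseline"), seed=0):
    """Assign each programmer an order in which to use the two programming models."""
    rng = random.Random(seed)          # fixed seed so the assignment is reproducible
    shuffled = list(programmers)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    first, second = models
    plan = {}
    for person in shuffled[:half]:
        plan[person] = (first, second)   # subgroup 1: ATAC first
    for person in shuffled[half:]:
        plan[person] = (second, first)   # subgroup 2: baseline first
    return plan


if __name__ == "__main__":
    for person, order in sorted(counterbalance(["P1", "P2", "P3", "P4"]).items()):
        print(person, "->", " then ".join(order))
```

Averaging the measured programming time for each model across both subgroups then cancels, to first order, the advantage of seeing the problem a second time.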
Proof of Concept, Early Results and Current Status
To assess the performance characteristics of an ATAC multicore processor, we compare its
performance and programmability to a leading-edge processor based on an electrical mesh network
design similar to the MIT Raw processor. In general, the performance of the electrical processor is
expected to be only slightly lower than the ATAC system if programming effort is not an issue. However,
for applications and programming models that benefit from a fast broadcast capability, ATAC will yield a
performance benefit. We estimate performance using an analytical model and also measure
programming ease using lines of code.
                                      64-core System        4096-core System
Theoretical Peak Performance
Actual Performance
Chip Power                            24 W      22.7 W      140 W     155 W
Total System Power (CPU +
DRAM + Optical Supply)                40 W      28 W        225 W     232 W
Total System Actual Power
Table 1: Comparison of performance, power, and efficiency of ATAC and electrical-mesh processors.
Results are presented for 64-core and 4096-core chips.
The ATAC processor is essentially the same as the baseline processor with the addition of an optical
network to optimize global communications. Both processors have the same number of cores running at
the same frequency and therefore have the same theoretical peak performance. However, the theoretical
peak is unachievable on most applications, particularly for those applications that require large amounts
of communication or coordination between the cores. A better way to compare performance is to measure
useful operations performed while executing an actual application. Dividing the number of operations
performed by the total time required to complete the application yields the effective performance. The
effective performance numbers shown in Table 1 are based on a study of an abstracted coherent shared
memory application (described in more detail in the Applications Performance section below).
The actual performance of the ATAC processor is better than the baseline processor due to the
increased efficiency of the global operations necessary to implement coherent shared memory. The
processing cores in ATAC spend less time waiting for global communication operations to complete and
therefore spend a greater fraction of their time doing useful work. Note that the difference between ATAC
and the baseline is greater with a larger number of cores due to the distance-based communication costs
on the electrical mesh. Even though the peak power consumption of the ATAC processor can be
somewhat higher (due to the addition of the optical network), the actual power efficiency of an ATAC
system is substantially better. This is due to two factors: 1) less energy is wasted waiting for global
communication operations and 2) the availability of the broadcast allows a value fetched from DRAM to
be sent to all the cores as opposed to having all the cores access DRAM for that value. This greatly
reduces the number of DRAM accesses.
Applications Performance ATAC particularly helps those applications that have lots of global
communication or irregular communication patterns. The reason is that electrical networks scale poorly in
terms of communication bandwidth and coordination predictability.
ATAC performs well on global operations because of its highly efficient broadcast operations, which
are an order of magnitude faster than a mesh-based multicore. Mesh-based multicores with 64 or more
cores do not perform global operations efficiently because they have large, non-uniform core-to-core
communication latencies. Furthermore, mesh-based multicores exhibit substantial contention during
broadcast operations. ATAC, on the other hand, does not have either of these problems. Broadcast
communication latency is not distance-dependent, but rather about 3ns for all communication, regardless
of the endpoints. The WDM nature of its network enables contention-free broadcasts within a region, allowing
many simultaneous broadcast operations. Furthermore, unlike typical mesh multicores, processor cores
can consume messages immediately when they arrive; ATAC’s novel WDM and buffer scheme obviates
message sorting and reassembly by the receiving core.
These features of ATAC translate into significant performance improvements. Eight applications were
analyzed by estimating their running time on both a mesh-based multicore and ATAC for both 64 cores and
4096 cores. As seen in Figure 6, application performance improves by up to a factor of 80.
Figure 6: Comparison of performance between ATAC and electrical mesh for selected applications.
(Chart: Application Studies, ATAC Speedups vs. Electrical Mesh, at 64 cores and 4096 cores; recoverable
application labels: n-body, snooping shared, matrix multiply.)
Application speedups are calculated using models based on application characteristics. These models
include estimates of time spent in three categories: normal computation, core-to-core communication, and
memory accesses. For all applications except Snooping Shared Memory, the models calculate the exact
number of operations in each category required to execute
the application on a specific problem size. The estimated costs of each operation are then summed to
calculate the total run time. Care was taken to accurately model the operation of the communications
networks, including any end-point contention. A range of problem sizes were examined and
representative examples were chosen for the results shown.
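The structure of such an analytical model can be sketched in a few lines of Python. The operation counts and per-operation costs below are invented purely for illustration and are not results from our study; the sketch also assumes the same operation counts on both architectures, whereas the real models account for effects such as contention and reduced DRAM traffic.

```python
from dataclasses import dataclass


@dataclass
class OpCounts:
    compute: int   # normal computation operations
    comm: int      # core-to-core communication operations
    memory: int    # memory accesses


@dataclass
class OpCosts:
    compute_ns: float
    comm_ns: float
    memory_ns: float


def run_time_ns(counts: OpCounts, costs: OpCosts) -> float:
    """Total run time as the sum of per-category operation counts times their costs."""
    return (counts.compute * costs.compute_ns
            + counts.comm * costs.comm_ns
            + counts.memory * costs.memory_ns)


def speedup(counts: OpCounts, mesh: OpCosts, atac: OpCosts) -> float:
    """Estimated speedup of ATAC over the mesh for one application and problem size."""
    return run_time_ns(counts, mesh) / run_time_ns(counts, atac)


if __name__ == "__main__":
    counts = OpCounts(compute=1000, comm=100, memory=10)          # hypothetical
    mesh = OpCosts(compute_ns=1.0, comm_ns=30.0, memory_ns=100.0) # hypothetical
    atac = OpCosts(compute_ns=1.0, comm_ns=3.0, memory_ns=100.0)  # hypothetical
    print("speedup:", round(speedup(counts, mesh, atac), 2))
```

As the sketch makes plain, the speedup grows with the fraction of run time spent in communication, which is why communication-heavy applications benefit most.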
The Snooping Shared Memory application represents an abstract application that performs
computation and makes memory accesses that sometimes result in cache coherence operations. The
model includes probabilities that each instruction of an application will induce different types of coherence
traffic. Modeled operations include reads and writes to a local region of cores as well as the entire chip.
The time required to resolve these operations is added to the time required to execute the instructions.
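The probabilistic part of this model amounts to an expected-value calculation, sketched below in Python. The event names, probabilities, and costs are hypothetical placeholders chosen for illustration, not the parameters used in our study.

```python
def expected_time_ns(n_instructions: int, base_ns: float, events: dict) -> float:
    """Expected run time when each instruction may trigger coherence traffic.

    events maps an event name to (probability per instruction, resolution cost in ns).
    """
    per_instruction = base_ns + sum(prob * cost for prob, cost in events.values())
    return n_instructions * per_instruction


if __name__ == "__main__":
    # Hypothetical parameters: regional and chip-wide reads and writes.
    events = {
        "regional_read": (0.02, 10.0),
        "regional_write": (0.01, 10.0),
        "global_read": (0.002, 100.0),
        "global_write": (0.001, 100.0),
    }
    print("expected time (ns):", round(expected_time_ns(1000, 1.0, events), 1))
```

Evaluating the same event probabilities with mesh costs and with ATAC costs yields the two run-time estimates that are compared for this application.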
Ease of Programming The primary goal of the ATAC project is to develop ways in which high
performance can be achieved on ATAC with only modest programmer effort.
“Lines of code” is one possible measure of effort that has been used in the past, and we present some
early results based on that metric. Although we realize that it is not the best measure of programming
ease, it was somewhat easier to measure for our initial results than the more sophisticated methods
mentioned in our proposed research such as programming time. Lines of Code measures the amount of
code that a programmer needs to write to implement a particular application. Figure 7 compares ATAC
with a mesh-based multicore architecture for four applications: vector addition, Jacobi relaxation, leader
election, and barrier synchronization. Even on these relatively simple applications, ATAC’s code size
relative to the same programs written for a mesh is smaller.
Figure 7: Lines of code required to implement four algorithms on ATAC vs. a mesh-based multicore.
(Chart: Lines of Code Comparison; vertical axis: lines of code.)
Proposed Research and Experimental ATAC System
The ATAC project proposes to build a prototype computer system, including a detailed system
simulator, a compiler system, a runtime system, a programming model and associated APIs. These
efforts will be driven by models of the optical components that we will develop.
ATAC Architecture The ATAC chip architecture will be developed. This will include the ATAC network
hierarchy, the mesh network, and the processor-to-network interface, including support for broadcast,
external memory and I/O interfaces. This effort will also define the coding and clocking transmission
scheme for the optical network. We will validate our assumptions that we can build message flow control
and receiver-side filtering and buffering for the all-to-all broadcast network with reasonable complexity,
area, speed, and energy. Selected portions of the processor-to-network interface will be implemented in
Verilog, synthesized and prototyped in FPGAs to verify their feasibility.
Optical Interconnect Interfaces and Component Models As research progresses on the novel
integrated photonic components, we will be developing and refining models of these components. This
includes characteristics such as switching speed, propagation delay, propagation loss, insertion loss,
energy consumption, and physical size. Based on these models, we will be designing the interfaces
between the digital electronics of the core processors and the optical components of the broadcast
interconnect. This will include some analog electrical circuitry as well as additional digital logic to pre-filter, sort, and buffer data moving between the processor pipeline and the optical network. As we gain
additional information about the optical components and refine the architectural design, we will update the
simulator’s functional and performance models to reflect these changes. Using these models, we will be
able to accurately estimate the bit-error-rate, latency, on- and off-chip range, and footprint of the optical
network and the performance, energy consumption, and physical size of an ATAC processor.
Pin-Based ATAC Simulator We will use simulation to evaluate and to refine the ATAC architecture and
its broadcast network using parallel applications. In collaboration with Robert Cohn’s Pin group at Intel,
we are using the Intel Pin dynamic binary instrumentation infrastructure to develop a massively parallel
simulator for fast simulation of generic multicore systems with 1000s of cores. The Pin-based simulator
can be used to develop multicore applications, compilers, and operating systems and for rapidly
prototyping and evaluating multicore architectural mechanisms such as the ATAC broadcast network. The
ATAC simulator will incorporate mechanisms for energy modeling at both the chip level and the system
level. We have already successfully created an early version of a multicore Pin simulator.
Pin allows one to insert extra code at specific points in the program at run time; the specific points
and the code to be inserted at each point can be specified in a separate executable called a “Pintool.”
Additionally, Pin allows function calls in the application to be replaced by calls to functions defined within
the Pintool. The inserted/replaced code can be used to simulate new features: it can modify processor
state, change program behavior or use a performance model to adjust a simulated clock. The simulator
uses these features of Pin to implement architectural mechanisms and to model performance.
The Pin-based simulator has been designed to take
advantage of the parallelism of host architectures such as
multicores or clusters. It models each core within the
simulated system as a separate kernel thread,
independently schedulable by the OS (Figure 8). The OS
maps the threads to the hardware, enabling the simulator
to exploit the available parallelism. Cores (threads)
communicate using calls to a simple API which represents
the intrinsic capabilities of the simulated architecture, e.g.,
broadcast and point-to-point message-passing. The
simulator replaces API calls within the application with calls
to functions defined within the Pintool that implement the
corresponding functionality and update the simulation clock
of the appropriate cores using a model of the
communication cost. The implementation of the API
functions within the simulator depends on what
communication mechanisms are available on the host
architecture. For example, the implementation of inter-core
communication may use buffers in shared memory for
threads on a single machine, or use sockets over Ethernet
for threads running on different machines in a cluster.
Figure 8: Mapping of simulated cores
to threads and physical cores.
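The thread-per-core organization described above can be sketched in a few lines. This is a minimal illustration, assuming shared-memory queues as the host communication mechanism and a made-up fixed broadcast cost; `run_simulation`, `BROADCAST_COST`, and the message format are hypothetical and are not taken from the actual simulator.

```python
import queue
import threading

BROADCAST_COST = 3  # assumed cycles for a broadcast to arrive (illustrative)


def run_simulation(num_cores):
    """Run one host thread per simulated core and return clocks and results."""
    inboxes = [queue.Queue() for _ in range(num_cores)]  # shared-memory buffers
    clocks = [0] * num_cores       # one simulated clock per core
    results = [None] * num_cores

    def broadcast(sender, payload):
        # Stamp each message with its simulated arrival time.
        arrival = clocks[sender] + BROADCAST_COST
        for core in range(num_cores):
            if core != sender:
                inboxes[core].put((arrival, payload))

    def receive(core):
        # Block on the host queue, then move the receiver's clock forward
        # to the modeled arrival time if it was behind.
        arrival, payload = inboxes[core].get()
        clocks[core] = max(clocks[core], arrival)
        return payload

    def core_main(core):
        if core == 0:
            clocks[core] += 10     # pretend core 0 computed for 10 cycles
            broadcast(core, "hello")
            results[core] = "sent"
        else:
            results[core] = receive(core)

    threads = [threading.Thread(target=core_main, args=(c,))
               for c in range(num_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return clocks, results


clocks, results = run_simulation(4)
```

Each core writes only its own clock entry, so the host OS can schedule the threads in any order without changing the simulated times. For threads on different machines, the queues would be replaced by sockets, leaving `broadcast` and `receive` with the same interface.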
We have chosen to base our multicore simulator on Pin
because it offers several advantages over creating our own
simulator from scratch as in the Raw project. First, the Pin infrastructure is reliable: it is mature, robust,
and well-supported. Second, it is high performance: it natively executes application code on the host
hardware rather than interpreting it. Third, using Pin shortens our simulator toolchain development time: it
allows us to use existing tools for compiling multicore applications (gcc, binutils, etc.) instead of having to
develop them ourselves. Its major drawback is that it only allows us to model the x86 processor. We
believe this is not a significant issue because in future massive multicores the specifics of the core and
ISA become secondary issues to global communications, and memory and I/O systems.
ATAC API/Language Constructs As the ATAC project heavily emphasizes the importance of ease of
programming, a significant part of the project will involve the development of programming APIs and high-level language constructs. The goal of the API and language construct development is to enable
programmers to quickly implement reasonably complex parallel algorithms on ATAC using straightforward
implementations and relatively minimal effort, while still achieving excellent performance. This goal will be
achieved in part by having the API and language constructs implement all of the “heavy lifting”. For
example, they will handle message management, implement shared memory coherence protocols, and
automatically choose between the optical broadcast network and the electrical mesh network depending on
traffic patterns and current network congestion.
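One possible shape for the automatic network selection is sketched below. It is only a sketch under stated assumptions: the `NetworkState` fields, the two thresholds, and the rule itself are hypothetical and do not describe the ATAC API.

```python
from dataclasses import dataclass


@dataclass
class NetworkState:
    """Snapshot of network conditions (hypothetical fields)."""
    mesh_congestion: float   # 0.0 (idle) to 1.0 (saturated), assumed metric
    optical_busy: bool       # True if the broadcast channel is unavailable


def choose_network(num_destinations, total_cores, state,
                   fanout_threshold=0.5, congestion_threshold=0.8):
    """Return "optical" or "mesh" for one message (illustrative heuristic)."""
    if total_cores < 2:
        raise ValueError("need at least two cores")
    if state.optical_busy:
        return "mesh"                    # fall back when broadcast is taken
    fanout = num_destinations / (total_cores - 1)
    if fanout >= fanout_threshold:
        return "optical"                 # wide fan-out suits broadcast
    if state.mesh_congestion >= congestion_threshold:
        return "optical"                 # offload a congested mesh
    return "mesh"                        # narrow traffic stays on the mesh
```

For example, `choose_network(63, 64, NetworkState(0.1, False))` selects the optical network because the message targets every other core, while a single-destination message on an idle mesh stays on the mesh.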
The ATAC Compiler The ATAC compiler will compile programs written using the aforementioned high-level language constructs into assembly code suitable for the ATAC hardware. It will generate the low-level code needed to send and receive on the optical broadcast network and the electrical mesh network.
It will also incorporate profile information gathered through an application profiling mechanism into the
compilation process.
Applications The ATAC project will include a significant application development effort to help develop
and test our new API. The applications to be implemented will include standard benchmark suites (e.g.,
SPEC, SPLASH [21]), stream-based multimedia codes (e.g., MediaBench II, video encode/decode), and
scientific codes (e.g., FFT, N-body simulation). These applications will also be used to assess
programmer productivity and architectural performance.
Programmer Productivity We will assess the programmer productivity benefits of ATAC by comparing it
to traditional mesh-style multicores. The ATAC ease of use study will use the following three metrics to
quantify programming effort:
1. Lines of code, and lines of communication code.
2. Programming time through a user study. Users will code up several simple benchmarks in C for a
traditional mesh architecture, and in the ATAC API for the ATAC architecture. Time to first result
and time to achieve a given level of performance will be measured.
3. Programming gap. This metric will compare the difference in performance between a quick
implementation and an optimized implementation of a benchmark. Easy-to-program architectures
will have a smaller difference between the two.
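Two of these metrics can be computed mechanically. The sketch below assumes one particular definition of each: communication lines are identified by hypothetical marker strings, and the programming gap is taken as the fraction of optimized performance that the quick implementation gives up. The proposal does not fix either definition.

```python
def count_lines(source, comm_markers):
    """Count non-blank, non-comment lines and those containing a marker."""
    total = comm = 0
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("//"):
            continue                     # skip blanks and C-style comments
        total += 1
        if any(marker in stripped for marker in comm_markers):
            comm += 1
    return total, comm


def programming_gap(quick_runtime, optimized_runtime):
    """Relative slowdown of the quick version (0.0 means no gap)."""
    if quick_runtime <= 0 or optimized_runtime <= 0:
        raise ValueError("runtimes must be positive")
    return 1.0 - optimized_runtime / quick_runtime


sample = "int x = 1;\n\n// note\nsend(x);\nrecv(&y);\n"
loc, comm_loc = count_lines(sample, ("send(", "recv("))
gap = programming_gap(quick_runtime=10.0, optimized_runtime=4.0)
```

Under this definition, a quick version that runs in 10 seconds against an optimized 4 seconds has a gap of about 0.6; an easy-to-program architecture would push that value toward zero.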
Broader Impact of Proposed Research
The ATAC project seeks to improve the programmability of large-scale multicore processors through
a combination of architectural features and programming language development. This work has the
potential to change the way multicore processors are designed and manufactured by demonstrating the
value of integrated on-chip photonic devices. This will spur additional research in the area of on-chip
photonics and speed the adoption of this innovative technology.
By making high-performance multicore programming easier, it also has the potential to solve an
important challenge facing the computer industry today. Although processor manufacturers have realized
that their future products must be multicore, no one has a practical solution that allows the average
programmer to harness the power of these processors. Without further advances in this area, only
specially-trained expert programmers will be able to write high-performance software. Thus, either
application performance will cease to improve or software development costs will skyrocket as more
highly-trained programmers are required. By simplifying programming, the ATAC architecture will allow
the computer industry to continue to produce high-performance software using the existing base of
programmers, even though the underlying hardware will be somewhat more sophisticated.
Besides the potential long-term benefits to the computer industry and thereby society at large, this
project will have a more immediate impact on the community of multicore researchers and educators.
The results of our research will be published in mainstream journals and conferences, allowing
acceptance or criticism from a large community of researchers and industry experts. In addition, we plan
to make much of the infrastructure developed in our project publicly available for use by others. This
includes our new API and languages as well as our multicore simulator in open-source form. Our
simulator infrastructure will need to be flexible enough to model a variety of multicore architectures to
allow comparisons between the ATAC design and other approaches. Therefore, it will be useful to many
other multicore researchers who will be able to modify it for their own experiments. We hope that our
simulator infrastructure will become the de facto standard used by researchers across the world to create
their own multicore simulators.
As with all of our previous projects, both graduate and undergraduate education will be an integral
part of the ATAC project. Graduate students form the core of our research team, working closely with
each other as well as faculty investigators, postdoctoral researchers, and industry experts. In addition to
graduate students, we always include a number of undergraduates participating in MIT’s UROP
(Undergraduate Research Opportunities) program. These graduate and undergraduate students will be directly
involved in the proposed work, learning the techniques and challenges of multicore system building and
programming. In addition to directly involving students in research, this project will influence a larger
group of students through graduate-level courses taught by the PIs at MIT. These courses typically
include hot research topics and reflect the current research of the PIs, exposing students to cutting-edge
ideas and tools, such as the massive multicore simulator. This project will thereby help train the next
generation of multicore researchers, engineers, and programmers.
