Abstract
Computing technology has evolved significantly during the past five decades. As semiconductor scaling is reaching physical and technological limits, it is driving many transformative changes in computing hardware. This has led to computing systems that rely heavily on multi-core processors and GPUs, and resulted in the development of specialized hardware for applications in machine learning and scientific computing. While modern hardware provides significant computing power, and therefore opportunities, it challenges many established algorithms and workflows in scientific computing: these algorithms may not be able to fully leverage modern hardware. Oftentimes, effective use of modern hardware entails revising algorithms, and even rewriting a considerable portion of an existing code. Understanding technology trends in computing hardware is necessary for designing next-generation algorithms for scientific computing. This paper reviews these trends, along with their drivers, in a language that is accessible to computational and data scientists, and applied mathematicians. In this paper (Part I), we review technology evolution in general-purpose microprocessors and hardware accelerators, along with background material. In Part II (Hanindhito et al., 2026), we consider memory systems, inter-device communication, heterogeneous computing and system integration, energy consumption, and how these trends impact scientific computing.
Introduction
Motivation
Computing hardware has undergone substantial changes, especially during the last two decades: multi-core microprocessors are common; GPU computing is gaining traction among a broader group; multi-node computing is more routine in industry and academia due to interest in solving larger and more complex problems; there are several examples of specialized hardware that are being used for scientific computing; and there is renewed interest in more exotic forms of computing, such as quantum and neuromorphic computing.
Some of these changes are stimulated by the end of Moore’s law and semiconductor technology scaling: it is harder to double the processing power of modern chips every 2 years by doubling the number of transistors on a chip, at the same cost; moreover, power consumption is limiting the compute capacity of modern chips, as they typically have a higher power density, and cooling them is becoming increasingly more challenging.
A forward-looking computational scientist, applied mathematician, algorithm developer, or an industry that relies heavily on scientific computing must carefully examine the impact of changes in computing hardware on their algorithms and workflows: should different algorithms be designed to better harness emerging hardware? Is it feasible to design specialized hardware for some compute-intensive workflows? What are possible consequences of inaction, i.e., using current algorithms and workflows on emerging hardware? What problems will become tractable a decade from now due to projected trends in computing hardware? Answering these questions, and quantifying their impact, are quite challenging, and may need a multi-year, multi-disciplinary effort.
Generally speaking, computational and data scientists, and applied mathematicians, are not trained in computer architecture and technology trends in computing hardware at a level of detail that may guide strategic decision-making on future algorithms and workflows. Literature that provides more detail is often too technical, hard to read, not self-contained, or may not clearly outline the interaction between different parts and technologies.
We attempt to bridge this gap: through a collaborative effort between hardware engineers, computational scientists, and applied mathematicians, we provide a holistic view of technology trends in computing hardware, 1 and highlight interactions between different components. We venture into how these trends may impact different algorithms that are commonly used in high-performance scientific computing. This effort is distinct from similar works, by providing a deep level of technical detail, while attempting to keep it accessible to a general computational scientist. By providing numerous examples and illustrations to clarify key trends, we help readers form their own judgement. To improve readability, help readers see the big picture without getting lost in details, and keep the paper self-contained, we provide certain details and examples in footnotes.
We believe this paper provides a clear and detailed perspective about technology trends in computing hardware to our targeted audience, along with their impacts on algorithms. These trends, along with domain knowledge, provide insights to computational scientists, and could position them to make informed decisions about designing algorithms and workflows that are expected to perform well on modern hardware, especially in industries that rely heavily on high-performance scientific computing, such as oil and gas, aerospace, defense, automotive, pharmaceutical, and finance.
Outline and summary
We begin each section by reviewing historical and current trends, along with future projections, based on publicly-available industry roadmaps. We discuss how different technologies interact, and how they impact the overall performance. We end each section with takeaways, possibilities, and key challenges, along with our own opinions that are supported by the trends. In what follows, we provide a high-level summary and refer to different parts of the text for details. Acronyms we used throughout the text are listed in Appendix 1.
Background material on key concepts in computing hardware, along with general trends, is covered in General technology trends and concepts in computing hardware. These concepts are repeatedly referred to throughout the rest of the paper, and therefore, we suggest readers review that section before reading the rest of the text. Specifically, we highlight key characteristics of transistors, which are the fundamental building blocks of computer chips. These include the clock frequency, which impacts the switching speed, and therefore the runtime performance, as well as the power consumption of transistors. Transistors have consistently become smaller over the years. This has allowed placing more of them in the same area, which typically results in more powerful chips. Moore’s law described the rate of transistor miniaturization at the same cost, and thus, it played a key role in setting expectations and guiding industry roadmaps. While some believe Moore’s law is merely losing steam, others believe it is already dead. Semiconductor scaling is facing physical and technological limits. Dennard scaling, which described how computer chips could keep their power consumption under control as transistors became smaller, no longer holds. Consequently, modern computers often need more energy to operate, and therefore, energy efficiency has become a central issue in modern chip design.
As transistors become smaller, more of them are placed on the same silicon die area, and dies themselves have grown larger. Larger dies increase the possibility of defective chips, and therefore result in lower yield. However, advances in packaging technology have resulted in larger and more powerful chips without impacting the yield, while keeping R&D and manufacturing costs under control. The small size of transistors also increases their vulnerability, which makes maintaining reliability for advanced chips increasingly challenging. Not only do modern chips have smaller transistors, they also have thinner wires for internal connectivity. Wire scaling is challenging as it increases resistance and impacts signal integrity, and thus has motivated the development of optical on-chip interconnects. In parallel, as semiconductor technology scaling is reaching its limits, chip designers need to find new ways to improve performance, resulting in the widespread adoption of diversification in the architecture and implementation of modern chips.
Computer chips often need to communicate with off-chip components (e.g., other processors or memory) to perform computations. The communication is enabled through pins. Over the years, the number of transistors on a computer chip has grown much faster than the number of its pins. This has led to communication bottlenecks, and thus prompted substantial alleviation efforts through both hardware-centric and algorithmic approaches. The hardware-centric techniques often attempt to increase the communication bandwidth through new technologies, or by using the communication signals more efficiently, e.g., through signal modulation. The algorithmic approaches often attempt to reduce the communication at the expense of increasing local computations.
In General-purpose microprocessors, we review their evolution during the past several decades. As transistor scaling enabled placing more transistors on a chip, a significant portion of these transistors supported instruction-level parallelism, leading to more powerful, single-core chips. This strategy showed diminishing returns on performance in the early 2000s, and led to the development of multi-core processors. Since then, adding more cores to a chip generally has a greater impact on the overall performance. Indeed, many applications with simpler, more regular structure (e.g., dense linear algebra) do not benefit from the sophisticated units within complex compute cores. The rise of these applications led to the development of many-core processors. Future general-purpose microprocessors are becoming increasingly diverse to meet specific performance needs of various applications. This includes processors that enable a faster runtime, or processors that may not be as fast, but are more energy-efficient, as well as processors that have access to a significant amount of on-package memory.
In Hardware accelerators, we examine how these devices provide performance improvements for select applications. These applications often exhibit substantial regularity and structure, afforded by linear algebra, and a high computation-to-data-movement ratio. These conditions are typically observed in machine learning, and, sometimes, in scientific computing. Accordingly, the application’s structure can be exploited to improve how a chip utilizes silicon area and energy: more silicon can be dedicated to performing the actual computations, as opposed to units that manage on-chip flow control and data movement. In layman’s terms, well-behaved applications require less management, and therefore, the real estate that performs management can shrink to accommodate more workers. This specialization results in chips, such as GPUs, that have significantly more processing power, and consume less energy, compared to general-purpose microprocessors. Availability of well-developed and well-documented software stacks 2 has been a key enabler for the adoption of GPUs by a large group of computational and data scientists. Porting a CPU code to GPUs is typically a tedious task. Oftentimes, the underlying algorithms need to be revised, and a considerable portion of the code needs to be rewritten to maximize performance. Not all workloads and algorithms are suited for GPUs. General-purpose microprocessors are still needed for many applications. GPUs are still programmable hardware, and while efficient for some computations, they do not deliver the best performance for many applications. Given that technology scaling is reaching its limits, more aggressive hardware specialization remains among the very few options to further improve performance for the foreseeable future. Typically, a large market for the specialized hardware is needed to justify research and development costs, as well as the software support. Specialized hardware may be implemented on different fabrics (e.g., FPGA, CGRA, or ASIC), depending on performance requirements and the maturity of the targeted applications. We provide several examples of publicly known specialized hardware used in machine learning and scientific computing.
General technology trends and concepts in computing hardware
Concepts outlined in this part are fundamental to understanding the underlying reasons behind computing technology trends, and are repeatedly referred to in the paper. They are briefly explained to provide context, improve readability, and to keep the paper self-contained.
Clock frequency
Clock frequency has been used as a proxy for chip performance (Agarwal et al., 2000; Henning, 2000). The switching activity of the transistors is governed by a periodic pulse signal, called clock (Xiu, 2019). The clock synchronizes the switching time of millions of transistors inside the chip (Friedman, 2001; Messerschmitt, 1990). This synchronization is vital for the chip to perform its operations correctly, including fetching data from peripheral devices (e.g., main memory (Cristal et al., 2005)), executing a stream of instructions (e.g., program code (Choi et al., 2004; Crawford, 1990)), and storing back the results. Therefore, the clock frequency, measured in Hz, determines how fast the transistors switch states between on and off, roughly translating into how many operations a chip can do per second (Messerschmitt, 1990). Several factors impact the highest possible clock frequency that a chip can run at, while maintaining the integrity of the data that flows throughout the circuitry. These are: a) intrinsic properties of the transistors, which depend on the process node 3 and manufacturing technologies (Geppert, 2002; Shahidi, 2007); b) operating voltage of the transistors (Liu and Svensson, 1993; Meijer et al., 2004); and c) the microarchitecture implementation of the chip itself (Boggs et al., 2004; Marculescu and Talpes, 2005).
Clock frequency may not always be a representative metric for performance. For instance, consider two chips with identical integer units, A and B: chip A runs at a clock frequency that is twice as fast as B’s, and B is equipped with a more advanced floating-point unit that can perform floating-point operations in 75% fewer clock cycles than A (e.g., 2 clock cycles for B vs 8 clock cycles for A). In a workload where 99% of the instructions are floating-point operations, chip B will perform nearly twice as fast as A, despite its lower clock frequency. However, in a workload with minimal floating-point instructions, chip A will perform nearly twice as fast as B. Therefore, the clock frequency is not an accurate metric for comparing the performance of chips with different manufacturing technologies, operating conditions, and microarchitectures, which run diverse workloads.
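To make the example above concrete, the following Python sketch estimates the runtime of both hypothetical chips under the two instruction mixes; the absolute frequencies (4 GHz vs 2 GHz) and the one-cycle integer latency are illustrative assumptions, while the floating-point cycle counts are taken from the example.

```python
# Hypothetical chips from the example above: A runs at twice B's clock,
# but B completes a floating-point operation in 2 cycles vs. A's 8.
def runtime(n_ops, fp_fraction, freq_hz, fp_cycles, int_cycles=1):
    """Estimated runtime (s) for a mix of integer and floating-point operations."""
    cycles = n_ops * (fp_fraction * fp_cycles + (1 - fp_fraction) * int_cycles)
    return cycles / freq_hz

N = 1_000_000
for mix in (0.99, 0.01):  # 99% floating-point vs. 1% floating-point instruction mix
    t_a = runtime(N, mix, freq_hz=4e9, fp_cycles=8)  # chip A (assumed 4 GHz)
    t_b = runtime(N, mix, freq_hz=2e9, fp_cycles=2)  # chip B (assumed 2 GHz)
    print(f"FP fraction {mix:.0%}: A = {t_a*1e3:.2f} ms, B = {t_b*1e3:.2f} ms")
```

Running the sketch reproduces the claims above: for the floating-point-heavy mix, B finishes in roughly half the time of A, whereas for the integer-heavy mix the ordering reverses.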
Transistor power consumption and heat dissipation
Reducing the power consumption of modern transistors is arguably the biggest challenge in modern chip design (Dennard scaling). Electrical power consumed by transistors can be grouped into two major categories: dynamic power and static power.
CMOS 4 (transistors) consume dynamic power when switching states (Kaxiras and Martonosi, 2008; Kocanda and Kos, 2015). The majority of dynamic power consumption is due to charging and discharging of the parasitic capacitance, 5 while a small portion is due to the short-circuit current. 6 The dynamic power (P_d) can be estimated through P_d = C · f · V², where C is the (parasitic) capacitance that depends on the process node and manufacturing technologies, f is the clock frequency, and V is the operating voltage of the transistors. The operating voltage at a transistor gate should be greater than a threshold voltage (V_t) to turn it on (Weste and Harris, 2010). While increasing the clock frequency directly increases the dynamic power, a higher operating voltage is also required for the transistors to sustain faster switching activity and maintain the correctness of the operations (Le Sueur and Heiser, 2010). Therefore, increasing the clock frequency increases the dynamic power significantly.
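As a rough illustration of this compounding effect, the Python sketch below evaluates P_d = C · f · V² for an assumed effective capacitance and assumed frequency/voltage pairs (none of these values describe a real chip): a 1.5× clock increase that also requires a modest voltage bump roughly doubles the dynamic power.

```python
# Dynamic power P_d = C * f * V^2; capacitance, frequencies, and voltages below
# are illustrative assumptions, not measurements of any real chip.
def dynamic_power(c_farads, f_hz, v_volts):
    return c_farads * f_hz * v_volts**2

C = 1e-9                                   # assumed effective switched capacitance (F)
base   = dynamic_power(C, 2.0e9, 0.90)     # 2 GHz at 0.90 V
faster = dynamic_power(C, 3.0e9, 1.05)     # 3 GHz, assumed to need a higher voltage
print(f"baseline: {base:.2f} W, faster: {faster:.2f} W "
      f"({faster / base:.2f}x power for 1.5x frequency)")
```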
Static power is a consequence of current leakage (Butts and Sohi, 2000; Elgharbawy and Bayoumi, 2005). Ideally, when a CMOS transistor holds its state (either on, or off), no current should flow from the voltage source applied to the transistor. In reality, a small amount of current flows due to leakage. The leakage power increases as transistor size becomes smaller.
Power consumed by the transistors is converted into heat (Brooks et al., 2007; Gonzalez and Horowitz, 1996), which then must be removed to prevent damage. Cooling systems, including heat spreaders, heat sinks, and active cooling (e.g., fans), work together to remove the heat from an area on the order of square centimeters. High power density, and the small surface area available for thermal contact, make heat removal challenging (Mahajan et al., 2006a; Wei, 2008). For instance, the Intel Pentium 4 Netburst P68 micro-architecture (Boggs et al., 2004) was expected to reach 10 GHz by 2005 (Ogasawara, 2002). However, even at 3.8 GHz, its power density reached 105 W/cm², comparable to the power density of a nuclear reactor core (Gelsinger, 2001).
The power that can be supplied to a chip is bounded (Aygün et al., 2005; Radhakrishnan et al., 2021) due to physical limits on the amount of current that can be carried by wires and pins, from a power supply and power delivery system 7 to a chip (Lidow and Sheridan, 2003; Yazdani et al., 1997). Aggressive power management features, such as dynamic voltage and frequency scaling (DVFS) (Le Sueur and Heiser, 2010; Papadimitriou et al., 2019), play an important role in keeping the chip power consumption under control. The term typical power is often used to indicate the sustained power that a modern microprocessor consumes under a long heavy load, 8 and is used to design the cooling system needed for heat removal (Ganapathy and Warner, 2008; Guermouche and Orgerie, 2022).
Transistor size and Moore’s law
Transistors are the building blocks of computer chips. They operate like a switch representing a binary value of 0 and 1 (i.e., off and on, respectively). Generally speaking, placing more transistors on a chip improves its processing power.
Gordon Moore, the co-founder of Intel, made several predictions about transistor miniaturization trends due to technology improvements, in Moore (1965) and in a 1975 paper reprinted in Moore (2006). His best-known forecast (1975), which is widely known as Moore’s law, predicted a shrinking of transistor sizes that corresponds to a doubling of transistor density on chips every 2 years. When Moore made his prediction, the design and manufacturing cost of a chip was proportional to its area. Therefore, Moore’s law was intrinsically an economic forecast, stating that computer chips would become twice as powerful every 2 years, at the same price.
Consequences of Moore’s law were significant: it guided technology roadmaps and became a self-fulfilling prophecy; furthermore, many applications enjoyed performance improvements proportional to transistor scaling rates while making minimal changes to their algorithms.
While chip manufacturers still go to great lengths to shrink transistor size according to Moore’s law, the economic aspects of Moore’s forecast are no longer sustained. For instance, Jensen Huang, CEO of NVIDIA, believes Moore’s law is dead (Jain and Murugesan, 2021). Transistor scaling has been challenged by physical 9 and technological limits. We highlight the technological challenges in Transistor power consumption and heat dissipation, Dennard scaling (the need for more power), Transistor count and yield, and manufacturability (lower yield due to higher transistor density on a chip), and Cost of designing and manufacturing semiconductors (skyrocketing costs of designing and manufacturing of modern chips).
The end of Moore’s law has far-reaching impacts. Since performance can no longer be improved through technology scaling, new strategies are being pursued. These are highlighted in Advanced packaging technologies (to improve yield and reduce cost), Hardware accelerators (increasing hardware diversity to maximize performance of various application groups), and Part II: Near-memory processing (NMP) and processing-in-memory (PIM) (Hanindhito et al., 2026) (moving away from von Neumann architecture).
Dennard scaling
The end of Dennard scaling has made power consumption the central issue in modern chip design. As a result, modern chips need more power to run, and their cooling has become more difficult. It has created unprecedented challenges in supplying power to supercomputing and data centers (Part II: Energy consumption of large computing centers and its implications (Hanindhito et al., 2026)).
Dennard et al. (1974) published MOSFET scaling rules, which are widely known as Dennard scaling. Dennard described scaling relationships between transistor density, switching speed, and power dissipation due to MOSFET miniaturization. Dennard scaling states that by scaling down the transistor dimensions by a factor of κ, the voltage (V), electrical current (I), and capacitance (C) are each scaled down by a factor of κ as well, while the switching frequency improves by a factor of κ. Consequently, the power consumed by each transistor (proportional to V·I) drops by a factor of κ², matching the κ² reduction in transistor area, so the power density per unit area remains constant as transistors shrink.
Dennard scaling did not take into account the existence of a physical lower limit for the operating voltage (V), imposed by the threshold voltage (V_t) (Taur, 1999a), as well as implications of scaling down the threshold voltage (V_t) (Stillmaker and Baas, 2017; Taylor, 2013). It assumed the operating voltage (V) and the threshold voltage (V_t) could continue to scale down as transistors become smaller. However, decreasing the threshold voltage (V_t) in smaller transistors increases the current leakage 12 (Ahmed and Schuegraf, 2011; Kim et al., 2003), which leads to increased static power consumption (Kao et al., 2002). This makes static and dynamic power equally important 13 when transistor size becomes very small (Sylvester and Kaul, 2001). At nanometer scales, power density per unit area can no longer be held constant. Power density increases as the transistors become smaller, causing the “power wall” (Wang and Skadron, 2013). This marks the end of Dennard scaling, which occurred in the mid-2000s (DeBenedictis, 2017; Taylor, 2013).
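The following Python sketch contrasts the two regimes using the idealized relations above (all quantities are relative and illustrative): under Dennard scaling the power density stays constant, whereas once the supply voltage can no longer be reduced it grows roughly as κ².

```python
# Idealized scaling sketch: relative power density under Dennard scaling vs. the
# post-Dennard regime in which the supply voltage can no longer be reduced.
def power_density(kappa, voltage_scales=True):
    C = 1.0 / kappa                     # capacitance shrinks with feature size
    f = kappa                           # switching frequency improves
    V = 1.0 / kappa if voltage_scales else 1.0
    area = 1.0 / kappa**2               # transistor footprint shrinks
    return (C * f * V**2) / area        # P_d per unit area, relative to kappa = 1

for kappa in (1, 2, 4):
    print(kappa, power_density(kappa, True), power_density(kappa, False))
# Dennard scaling keeps the density at 1.0; with a fixed voltage it grows as kappa^2.
```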
Transistor count and yield, and manufacturability
Figure 1 shows historical data on the advancement of semiconductor process nodes and a projection for the next few years. When transistor size becomes a few nanometers, it approaches physical and technological limits (Naveh and Likharev, 2000; Taur, 1999b). This makes further shrinking of the size challenging, as indicated by the slowdown in the advancement of the process node.
Due to the slowdown in transistor scaling, larger silicon dies became popular, enabling higher transistor counts. However, larger dies increase the possibility of defects, thus lowering the manufacturing yield and increasing manufacturing costs (Mack, 2015; Sun et al., 2020).
Figure 1. Semiconductor process node and packaging technology trends since the 1970s, with projections up to 2030.
However, the maximum die size that can be manufactured for future process nodes is decreasing. The reticle limit, currently at 26 mm × 33 mm (858 mm²) (Lai, 2021; Suzuki, 2020), acts as the upper limit on how large a silicon die can be manufactured. The reticle limit is expected to be halved to 26 mm × 16.5 mm (429 mm²) due to the anamorphic lens system used in the upcoming process node. Advances in packaging technologies allow the compartmentalization of a chip, by using multiple smaller dies called chiplets. This improves the yield and reduces costs, while conforming to the reticle limit of future process nodes.
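To illustrate why smaller dies improve yield, the sketch below applies a simple Poisson yield model, Y = exp(−A·D); the die areas and the defect density are assumptions chosen only for illustration, not figures from any foundry.

```python
import math

# Simple Poisson yield model: Y = exp(-A * D), with die area A (cm^2) and
# defect density D (defects per cm^2). All numbers are illustrative.
def die_yield(area_cm2, defect_density=0.1):
    return math.exp(-area_cm2 * defect_density)

monolithic = die_yield(8.0)   # one ~800 mm^2 die, near the reticle limit
chiplet    = die_yield(2.0)   # one of four ~200 mm^2 chiplets (same total area)

print(f"monolithic die yield: {monolithic:.1%}")   # ~44.9%
print(f"per-chiplet yield   : {chiplet:.1%}")      # ~81.9%
```

Under these assumptions, splitting the same silicon area into four smaller dies nearly doubles the fraction of good dies, which is the economic motivation behind chiplets discussed next.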
Advanced packaging technologies
Advances in silicon packaging 15 technologies (Figure 1) (Su et al., 2017) will play an important role in increasing the number of transistors on a package in the near future. We highlight two key technologies: Multi-Chip Modules (MCMs) (Naffziger et al., 2021) and chiplets.
An MCM uses multiple (often similar) silicon dies to form a large chip (Arunkumar et al., 2017; Burd et al., 2019; Su et al., 2017). While MCM has been around since the 1980s, its popularity has increased in recent years to overcome the low-yield challenge and reticle limit (Transistor count and yield, and manufacturability). The silicon dies communicate through wires on the packaging substrate. 16 Traditionally, the dies are connected to the wires on the package substrate using wire-bond or flip-chip technology, which is referred to as 2D packaging. However, the wires in the package substrate are orders of magnitude larger than those in the silicon dies. The thicker wires and low wire density lead to routing congestion and therefore limit the attainable die-to-die bandwidth.
To improve connectivity between the silicon dies, an interposer layer is placed between the dies and the packaging substrate. In addition to bridging the connection between the silicon dies and the packaging substrate through vias, the interposer also acts as a conduit for connections between the silicon dies. Interposers are often manufactured using silicon, 17 hence the name silicon interposer. Silicon interposers provide significantly higher wire density, allowing implementation of high-bandwidth connection between dies. This is referred to as 2.5D packaging technology (Lenihan et al., 2013; Zhang et al., 2013), and the silicon dies are placed side-by-side. Further advancement of this technology is referred to as 3D stacking, which allows multiple dies to be stacked on top of each other (Agarwal et al., 2022; Su et al., 2017) and connected through vias. 18 While the 3D stacking technologies are promising, they face many challenges, such as the thermal issues 19 (Gomes et al., 2020; Su et al., 2017).
Specialization of silicon dies is becoming more common. Accordingly, a silicon die may perform only a specific function, and is thus manufactured using the process node best suited for that function. This type of specialization and compartmentalization is called a chiplet (Loh et al., 2021; Naffziger et al., 2020). Just like in a puzzle, multiple chiplets with various functions 20 are used to build a modern System-on-Chip (SoC) MCM, which has continuously grown in popularity over the past decades. The use of chiplets is the recipe to keep Moore’s law (perhaps nominally) alive, and is expected to result in chips with a trillion transistors by 2030 (Gelsinger, 2022; Kelleher, 2022; Moore and Schneider, 2022). Note that communication interfaces between chiplets consume additional silicon area and power. Finally, the development of future scalable interconnect technologies is critical to allow low-latency and high-bandwidth communication between chiplets (Chirkov and Wentzlaff, 2023).
Reliability and availability of computing systems
Transistor size is getting smaller due to advanced manufacturing, leading to increased transistor density on advanced chips, which enables chips to carry out increasingly complex functions. The smaller size of transistors and the rise in complexity make computer systems more susceptible to reliability issues. In this sense, reliability may be defined as a measure of success, where a computing system’s behavior conforms to its specifications over a given operating period (Shooman, 2002). Failure happens when the behavior of the system deviates from its specifications. The relative proportion of time the system meets its specifications is called availability, which depends on the duration of failures as well as the time needed to fix them.
An error is defined as an incorrect state of information stored in the system, whereas a fault is the cause of the error. The fault sources include component failure, equipment damage, interference (cross-talk) between wires, power disturbance, induction due to lightning, electromagnetic fields, electrical noise, and radiation (Randell et al., 1978). Radiation is a common cause of fault, especially for chips that are manufactured through smaller process nodes (Schrimpf et al., 2008), as they become more sensitive to it. Cosmic rays and alpha particles are common sources of radiation. They can cause soft errors (e.g., bit flips) in computing systems, particularly in logic and memory. A soft error is a non-permanent and non-recurring error, which corrupts information while the device itself may still function properly (Karnik and Hazucha, 2004). On the other hand, a hard error is often permanent (e.g., due to hardware failure) and may be repairable (Wang and Agrawal, 2008).
There are two approaches for achieving reliability in computing systems: fault intolerance and fault tolerance (Avižienis, 1975). Fault intolerance seeks to eliminate sources of unreliability. Since eliminating all possible sources of fault is not possible, fault intolerance reduces the probability of fault occurrence to an acceptably low level, and devises maintenance procedures should a fault occur. However, maintenance will impact system availability significantly, which may not be acceptable for some critical computing systems. On the other hand, fault tolerance tolerates sources of unreliability, and seeks to counteract their consequences by using protective redundancy. Accordingly, systems that adopt fault tolerance can continue to operate despite the existence of errors, either at full or reduced capacity and capability, until the fault is addressed. For instance, aircraft computer systems use the Multiple-Instruction Single-Data (MISD) approach, where multiple computer systems operate on the same data simultaneously, and the outputs are compared through a majority-voting mechanism. We remark that achieving reliability adds to the cost of already expensive modern computing systems.
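A minimal sketch of this kind of protective redundancy is shown below: three replicas compute the same result (MISD-style triple modular redundancy), and a majority vote masks a single faulty output. The function and values are illustrative, not taken from any avionics system.

```python
from collections import Counter

# Majority voting over replicated results: any two agreeing replicas mask a
# single faulty output; with no majority, the fault cannot be masked.
def majority_vote(results):
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority: fault cannot be masked")
    return value

replica_outputs = [42, 42, 7]           # one replica suffered a soft error (bit flip)
print(majority_vote(replica_outputs))   # -> 42
```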
Wiring, connectivity, and signal integrity
In this part, we discuss the medium through which data is transmitted, which impacts the achievable bandwidth (Bogatin, 2022; Cho et al., 2007; Saraswat et al., 2008) as it affects signal integrity. The widely-used medium to carry electrical signals is metal-based wires, such as aluminum and copper. Carbon Nanotubes (CNT) are also briefly discussed as a potential replacement for metal-based wires. Optical interconnects, which have gained more traction in recent years and serve as alternatives to electrical interconnects are explored in Optical interconnects.
A widely-used metric to measure the performance of wires for carrying electrical signals is the propagation delay. There are several approaches for modeling the propagation delay, based on physical characteristics of the wires, and their interaction with other materials 21 (Seckin and Yang, 2008; Zhou et al., 1988). A simple but common approach is the RC model, 22 which relies on the resistance (R) and capacitance (C) of a wire to estimate the propagation delay (Qian et al., 1994; Sakurai, 1993). The RC model is also referred to as RC delay (Ciofi et al., 2016; Sylvester and Keutzer, 1998). The resistance of a wire depends on its material properties and geometry (Ciofi et al., 2016; Savage, 2002). The capacitance of a wire is influenced by interactions between adjacent wires, and by dielectric materials 23 (Duan et al., 2001; Ruehli and Brennan, 1975; Zhao et al., 2009).
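The back-of-the-envelope Python sketch below estimates the RC delay of an on-chip wire from its geometry, using R = ρ·L/(W·H) and an assumed capacitance per unit length; the dimensions, resistivity, and capacitance values are illustrative assumptions rather than parameters of any real process.

```python
# Simple RC model of an on-chip wire: delay ~ R * C. Illustrative values only.
def wire_rc_delay(length_m, width_m, thickness_m,
                  resistivity=1.7e-8,      # copper, ohm*m
                  cap_per_meter=2e-10):    # assumed capacitance per unit length (F/m)
    R = resistivity * length_m / (width_m * thickness_m)   # R = rho * L / (W * H)
    C = cap_per_meter * length_m
    return R * C

# Shrinking the wire cross-section raises R and therefore the delay.
print(wire_rc_delay(1e-3, 100e-9, 200e-9))   # 1 mm wire              -> ~1.7e-10 s
print(wire_rc_delay(1e-3,  50e-9, 100e-9))   # smaller cross-section  -> ~6.8e-10 s
```

This simple model already shows why wire scaling is problematic, as discussed next: shrinking the cross-section while keeping the length increases the resistance and hence the delay.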
As the number of transistors on a chip grows, the number of wires that provide connectivity between them has to increase as well (Edelstein et al., 1995). Most of the chip area has already been covered by wires, which limits space for laying out new wires. Therefore, more metal layers are being used (Gelatos et al., 1994; Koyanagi et al., 1998) to construct local, intermediate, and global on-chip interconnects. 24 Increasing the number of metal layers complicates the manufacturing processes, and has negative impacts on the capacitance and cross-talk of the wires (Duan et al., 2001; Sim et al., 2003). In principle, making the wires smaller increases wire density. However, unlike transistors, metal interconnects do not benefit from further scaling (JM Veendrick, 2017; Koo et al., 2007). Scaling negatively impacts metal interconnects: smaller wires have higher resistance, and carry less electrical current (Edelstein et al., 1995; Srivastava and Banerjee, 2004). Increased wire density also exacerbates cross-talk and capacitance effects between wires (Naeemi et al., 2006), affecting signal integrity.
Early process nodes used aluminum (Al) for on-chip wires. However, aluminum could not meet interconnect performance requirements for process nodes beyond 180 nm (Staff, 2019; Sylvester and Keutzer, 1998). The move from aluminum (Al) to copper (Cu) in the late 1990s for metal wires (Andricacos, 1999; Theis, 2000), and the use of low-κ dielectric materials in the 2000s (Beyne, 2003), provided a one-time opportunity to improve propagation delay. 25 Despite their more complicated manufacturing process (Andricacos et al., 1998; Gelatos et al., 1994, 1996), copper wires have significantly smaller resistance, 26 and higher durability, compared to aluminum wires. For a while, this allowed copper wires to be made smaller, keeping pace with transistor scaling. Eventually, however, technology scaling became more challenging for copper, due to complications in electrical conductivity, reliability, 27 latency, and power dissipation (Kapur et al., 2002; Tőkei et al., 2016). Copper saw its limitation 28 at 40 nm (Kaushik et al., 2007). IBM estimated new metal materials 29 are needed for constructing on-chip wires beyond 15 nm (Bonilla et al., 2020; Huang et al., 2020; Staff, 2019).
Replacing copper with carbon nanotubes (CNT) for on-chip wires has been proposed since the early 2000s (Joshi and Soni, 2016; Xu et al., 2022). CNT has a higher reliability than copper, due to its mechanical and thermal stability. It reduces delay for intermediate and global on-chip interconnects by about 30% (Naeemi et al., 2006; Nieuwoudt and Massoud, 2008). However, integrating CNT onto a chip is challenging, due to immature manufacturing techniques, imperfect metal to CNT contacts, and high growth temperature 30 of CNT, which results in higher probability of defects and low achievable wire density (Kaushik et al., 2014; Pasricha et al., 2010; Xu et al., 2022).
Optical interconnects
Some researchers are considering the possibility of moving away from electrical signals to optical signals and optical interconnects for on-chip communication, thus limiting the amount of on-chip metal wires. Optical interconnects operate at the speed of light. Therefore, they provide significantly higher bandwidth for data transmission with lower latency compared to electrical signals. Optical interconnects have also proven to be reliable, performant, and efficient, for long-distance inter-node communication (Part II: Inter-node communication (Hanindhito et al., 2026)). Indeed, research on optical interconnects and silicon photonics 31 dates back to the 1980s (Soref and Bennett, 1987; Soref and Larenzo, 1986). Earlier studies showed using optical signals for on-chip interconnects may not be effective due to silicon area and power requirements needed by electrical-optical conversion devices (Kobrinsky et al., 2004; Sato et al., 2015). Nevertheless, optical interconnects have several merits, compared to copper wires, which make them attractive for process nodes beyond 10 nm.
Co-existence of electrical interconnects and optical interconnects on future chips is likely: electrical interconnects may be used to realize local on-chip connections, whereas optical interconnects can be used to realize intermediate and global on-chip connections (Chen et al., 2006; Cheng et al., 2016; Kapur and Saraswat, 2002; Saraswat et al., 2008). The minimum distance for which using an optical interconnect becomes more efficient than a corresponding electrical interconnect is referred to as the critical length, which gets shorter as more advanced process nodes are used (Chen et al., 2006; Kaushik et al., 2007). The critical length is estimated by using multiple metrics, which measure the performance 32 of an optical interconnect, compared to a corresponding electrical interconnect (Miller, 2009; Rakheja and Kumar, 2012).
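As a toy model of how a critical length arises, the sketch below assumes an optical link pays a fixed electrical-optical conversion cost but grows slowly with distance, while an electrical wire's cost grows linearly with length; the crossover point is the critical length. All constants are made-up illustrative numbers, not measurements from the cited studies.

```python
# Toy crossover model for the critical length of an optical interconnect.
# Costs are in arbitrary units; all constants are illustrative assumptions.
def electrical_cost(length_mm, per_mm=1.0):
    return per_mm * length_mm                           # grows linearly with wire length

def optical_cost(length_mm, conversion_overhead=5.0, per_mm=0.1):
    return conversion_overhead + per_mm * length_mm     # dominated by E/O conversion

length_mm = 0.0
while electrical_cost(length_mm) < optical_cost(length_mm):
    length_mm += 0.1
print(f"critical length under these assumptions: ~{length_mm:.1f} mm")
```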
An on-chip optical interconnect comprises several components: light sources, modulators, waveguides, and photodetectors. Light sources can be either on-chip or off-chip lasers; they are being actively studied in the silicon photonics field (Li et al., 2022; Zhou et al., 2023). On-chip lasers, such as VCSELs, 33 are more efficient than off-chip lasers: they can be modulated at GHz frequencies, at the expense of increased power consumption and reduced thermal dissipation (Amann and Hofmann, 2009). Off-chip lasers have low efficiency: they must be turned on almost all the time to avoid light-generation delays; their integration is also more complex, since waveguides must be constructed to distribute the laser light to each on-chip optical modulator (Bai et al., 2011; Cadien et al., 2005; Peng et al., 2010). The modulator modulates light to encode data 34 (Section 2.14). The modulated light then travels through the waveguides (Mekawey et al., 2022; Ryu et al., 1999) across the chip. Constructing on-chip waveguides to distribute the optical signals is not an easy task. It entails utilizing low-loss materials (Cadien et al., 2005), minimizing power consumption and efficiency loss (Bashir et al., 2019; Rahman et al., 2008), and finding an efficient topology and protocol for the waveguides (Le Beux et al., 2014; Werner et al., 2017).
While on-chip optical interconnects face challenges (Bashir et al., 2019; Chen and Segev, 2021), they have advantages over electrical interconnects (Cho et al., 2008). Optical interconnects enjoy minimal cross-talk and interference, allowing them to maintain signal integrity over long distances (Turkane and Kureshi, 2017). Moreover, due to wavelength division multiplexing (WDM), 35 optical interconnects can achieve significantly higher bandwidth compared to copper wires. With further improvements in light sources, modulators, and wave-guides, higher efficiency can be achieved, resulting in significantly lower energy requirements for data movement, compared to electrical interconnects (Karabchevsky et al., 2020). In addition to being used for on-chip interconnects, optical interconnects may also become candidates for connecting chiplets (Ayar Labs Inc., 2021; Hao et al., 2021; Wade et al., 2020), replacing copper wires on PCB to bridge multiple devices with massive bandwidth (Aleksic, 2017; Sharma et al., 2021), and also becoming the backbone of disaggregated infrastructure (Part II: Disaggregated infrastructure (Hanindhito et al., 2026)).
Cost of designing and manufacturing semiconductors
Design cost 36 of next-generation chips is expected to increase significantly, as the industry moves to smaller process nodes (Li et al., 2020). Both International Business Strategy Corporation (IBS) (Hruska, 2018) and McKinsey (Bauer et al., 2020) estimated the chip design cost to be: $28.5 M for a chip in a 65 nm process node, $51.3 M for a 28 nm process node (doubled), $106.3 M for a 16 nm technology (doubled), and $297.8 M for a 7 nm process node (tripled). The design cost of a chip in the latest 5 nm process node is estimated to be around $542 M, and the design cost for a future 3 nm process node is expected to reach $1 B. Moreover, the investment needed to build a semiconductor manufacturing facility (foundry) has also increased for smaller process nodes: $0.4 B for 65 nm, $0.9 B for 28 nm, $1.3 B for 16 nm, $2.9 B for 7 nm, $5.4 B for 5 nm, and $15 B to $20 B for 3 nm (Hruska, 2018).
The significant cost of designing and manufacturing modern chips may impact the economic feasibility of deploying custom-chips for certain scientific computing applications, especially those that are attractive to smaller markets. An alternative is to design and build chips in older and more cost-effective process nodes, especially when the absolute highest performance is not required.
Execution model, architecture, and implementation style
Since transistor scaling can no longer be a major driver of performance, there is growing interest in designing different classes of chips that are performant for a specific application or a group of applications. In this part, we comment on the execution model, architecture, and implementation style of different types of chips, their significance, and how they impact performance.
Execution model
The top layer of Figure 2 shows the classification of chips based on their execution model, according to Flynn’s taxonomy (Flynn, 1966, 1972). Single-instruction single-data (SISD) is used in single-core processors that execute a single instruction stream to operate on data, which was popular before the 2000s. Multi-core and many-core processors adopted the multiple-instruction multiple-data (MIMD) execution model, where each core can run different instructions and operate on different data. The single-instruction multiple-data (SIMD) execution model can be found in the vector units of CPUs, where each vector unit runs the same operations, and accesses a set of contiguous data to enable parallelization (Hassaballah et al., 2008; Raman et al., 2000). GPUs modify this concept, and use the single-instruction multiple-threads (SIMT) model (Fung and Aamodt, 2011; Habermaier and Knapp, 2012), where each thread executes the same instructions and execution flow. With SIMT, any difference in branch outcomes results in thread divergence. Accordingly, each group of threads with a different branch outcome is executed sequentially, resulting in reduced computational efficiency. Nevertheless, SIMD or SIMT are often more efficient for explicitly parallel applications, compared to MIMD, since they reduce the overhead of processing instructions per data item. The multiple-instruction single-data (MISD) model is less common, and is typically used when there are reliability concerns.
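The following NumPy sketch mirrors these ideas at a high level: a vectorized expression applies one operation to many data elements (SIMD/SIMT style), and a branch over the data is handled by evaluating both sides and selecting under a mask, which is analogous to how divergent SIMT threads serialize both paths. This is only an analogy on a CPU, not GPU code.

```python
import numpy as np

# Data-parallel (SIMD/SIMT-style) execution: one instruction, many data elements.
x = np.arange(8, dtype=np.float64)
y = 2.0 * x + 1.0                      # every "lane" performs the same operation

# Branch divergence, handled mask-style: both branch outcomes are computed for
# all elements and then combined, so divergent paths cost extra work.
mask = x > 3.0
z = np.where(mask, np.sqrt(x), x**2)   # both np.sqrt(x) and x**2 are evaluated

print(y)
print(z)
```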
Figure 2. Execution model (top), architecture design of a chip (middle), and the substrate used for implementation (bottom). At the execution level, chips can have four types of models, according to Flynn’s taxonomy: SISD, SIMD, MISD, and MIMD. At the architecture level, chips can be grouped according to their programmability. A fixed-function chip only operates on a specific input to produce a specific output, based on what operation it is designed to do. A general-purpose chip can perform various operations, based on what program it is running, and hence, is programmable. Some chips exist in between. For instance, a domain-specific architecture chip targets a wider class of applications that share common operations. At the substrate level, chips can be implemented as hard logic, soft logic, or somewhere in the middle. Hard-logic substrates, such as ASICs, have a fixed structure of logic gates and their connections, which cannot be changed once manufactured. On the other hand, soft-logic substrates, such as FPGAs, provide reconfigurable logical functions and connections even after manufacturing.
Chip architecture and programmability
The middle layer of Figure 2 shows the classification of chips at their architectural level. The architecture of a chip defines the structure of logic blocks inside the chip, how they are connected to each other, and how software interacts with them to achieve the desired functionality. A chip can be fully programmable, fixed-function, or somewhere in the middle.
A programmable architecture can interpret different instructions 38 to perform different operations (Peccerillo et al., 2022). Programmable architectures target a wide range of applications and are suitable for workloads whose algorithms or protocols constantly change (Iyer, 2012). While programmability brings versatility, it comes with performance, energy, and silicon area overheads associated with interpreting and decoding instructions (Dally et al., 2020; Hennessy and Patterson, 2019). An example of a fully programmable architecture is the general-purpose microprocessor (CPU) (Section 3), which can be used for virtually any application. 39 By contrast, a fixed-function architecture is designed to perform specific computations for a specific set of inputs, which typically provides the best performance, highest energy efficiency, and smallest silicon area footprint, at the expense of losing flexibility (Tong et al., 2006). Fixed-function architectures are suitable for applications that are mature or have a short lifetime, such as cryptographic processors (Anderson et al., 2006), analog-to-digital or digital-to-analog converters, and multimedia encoding or decoding processors (Harrand et al., 1995; Tamitani et al., 1992), among many others (Peccerillo et al., 2022).
Some architectures may provide a limited degree of programmability for a specific class of applications, to balance performance, energy efficiency, and silicon area. These architectures may not be as versatile as a fully programmable architecture. However, they target a wider range of applications, compared to fixed-function architectures. Some examples include Graphics Processing Units (GPUs), which were originally designed for graphics applications and have become more programmable (Aamodt et al., 2018; Peddie, 2023b) to target a wide range of applications that have significant inherent parallelism, and Digital Signal Processors (DSP) (Lee, 1990), which are designed for signal processing applications, such as audio (Han et al., 1996), image (Chen and Chien, 2008), video (Kim et al., 2005), and sensor data (Gao et al., 2020). Domain-specific architectures (DSAs) (Section 4.2) (Fujiki et al., 2021; Halawani and Mohammad, 2024) target multiple applications within the same domain (Huang et al., 2017).
Implementation fabric
The chip architecture is then implemented on a fabric, which can be a hard-logic fabric or a soft-logic fabric (Gindin et al., 2007; Paulin, 2004; Rose, 2004) (bottom layer of Figure 2). A hard-logic fabric implements logical functions by using logic gates made from transistors on the silicon die, which cannot be modified or altered after the chip is manufactured. A hard-logic fabric provides the best performance, energy efficiency, and mass-production cost (Abdelfattah and Betz, 2012; Gindin et al., 2007). Since a hard-logic chip is not reconfigurable after manufacturing, it may also be called an application-specific integrated circuit (ASIC). Examples include CPUs released by Intel and AMD, GPUs released by Intel, AMD, and NVIDIA, and many hardware accelerators, some of which are discussed in Section 4.
A soft-logic fabric permits reconfigurability after manufacturing, by modifying the logical functions of the logic elements and their interconnections (Peccerillo et al., 2022). Field-programmable gate arrays (FPGAs) (Mencer et al., 2020) are commonly used as soft-logic fabrics that provide fine-grained reconfigurability. This allows users to reconfigure the programmable logic elements and their interconnection through a hardware description language (HDL) (Riesgo et al., 1999), such as Verilog (Dubey, 2009) or VHDL (Shahdad et al., 1985), or by using high-level synthesis (HLS) from C code (Nane et al., 2016). The reconfigurability comes at a price: architectures implemented in FPGAs cannot compete with ASICs in terms of performance, energy efficiency, and silicon area footprint 40 (Boutros et al., 2018; Kuon and Rose, 2006; Zahiri, 2003). Soft-logic fabrics (e.g., FPGAs) are suitable for implementing architectures that are likely to be changed in a short timeframe (Gandhare and Karthikeyan, 2019; Leong, 2008), performing architectural validation before manufacturing them into ASICs (e.g., pre-silicon verification and prototyping) (Huang et al., 2011; Ray and Hoe, 2003), or manufacturing low-volume, fast-time-to-market, and short-lived products (Gandhare and Karthikeyan, 2019; Marquardt et al., 2000).
Besides hard-logic (e.g., CPUs) and soft-logic (e.g., FPGAs) fabrics, Coarse-Grained Reconfigurable Arrays (CGRAs) lie in the middle, providing some degree of reconfigurability for a specific class of applications (Prabhakar et al., 2020). In contrast to FPGAs, whose reconfigurability is performed at the lowest-level logical functions (i.e., Boolean algebra), CGRAs provide hard-coded arithmetic and logic building blocks that are optimized for specific application domains and can be configured to operate in a specific arrangement to perform a larger function (Peccerillo et al., 2022). This allows CGRAs to provide increased hardware efficiency, energy efficiency, faster reconfiguration times, and increased performance, through more specialized functional units (Tan et al., 2021; Wijerathne et al., 2022), bringing CGRAs closer to hard-logic fabrics on the spectrum of power, performance, and silicon area (Liu et al., 2019; Niu and Anderson, 2018). We highlight some chips that use CGRAs in Examples of custom and specialized hardware.
Off-chip interfaces and pin limitations
A modern chip uses hundreds to thousands of pins (Figure 3) as conduits for power delivery and data exchange with (off-chip) peripheral devices when it is mounted on a motherboard. 41 Additional pins are typically needed if a chip requires more memory bandwidth, 42 or needs more inter-device bandwidth to connect to peripheral devices 43 (Burger et al., 1996; Chen et al., 2017). Moreover, due to increasing power requirements, more pins are needed to deliver the electrical current to the microprocessor (Stanley-Marbell et al., 2011). In modern chips, about half of the pins are used to supply power, and the remaining pins are used for data exchange. 44

While the transistors have become much smaller, the pins have not enjoyed the same level of miniaturization, due to physical limitations for carrying electrical current and maintaining physical strength. Adding more pins is expensive. Since neither the pin size nor the pitch 45 is getting significantly smaller anymore, adding pins increases the size of the chip package. A larger package size entails a more complicated mounting mechanism, as it becomes more difficult to maintain uniform pressure for providing sufficient contact for all the pins to the socket (Chow et al., 2006).

Figure 3. Evolution of microprocessor socket and pin (back of the processor), drawn to scale. The Intel Socket 1 is a pin-grid array (PGA) socket, containing 169 pins, with 2.54 mm pitch (i.e., the distance between the centers of adjacent pins). Launched in 1989, it was designed to accommodate Intel 486 SX, Intel 486 DX, and Intel 486 DX2 microprocessors, with 1.2 million to 1.6 million transistors, operating at a typical power of 5 W. Almost three decades later, Intel socket H4, a land-grid array (LGA) socket containing 1151 pins, spaced at 0.914 mm, was released to support Intel Skylake and Kaby Lake microprocessors, with 1.75 billion transistors, operating at a typical power of 91 W. Intel socket H4 was followed by Intel socket H5 5 years later. It retains the same socket dimensions, but has 49 more pins to deliver more power (125 W vs 91 W of typical power) to Intel Comet Lake and Rocket Lake microprocessors, which have double the core counts of their predecessors. A year later, Intel socket V, containing 1700 pins, spaced at 0.8 mm in a land-grid array (LGA), was released to support Intel Alder Lake and Raptor Lake microprocessors, operating at a typical power of 125 W. It has more connectivity than its predecessors. AMD Socket sTRX4 has 4094 pins in a land-grid array (LGA), with the same physical dimensions as AMD Socket SP3. It was launched in 2019 to support AMD Ryzen Threadripper Castle Peak, with up to 64 cores, 30 billion transistors, and 280 W of typical power. AMD Socket SP5 is a land-grid array (LGA) socket, containing 6096 pins, spaced at 0.81 mm. It was released in 2022 to support 96-core AMD EPYC Genoa microprocessors, with 90 billion transistors, 400 W typical power, 12-channel DDR5 memory, and 128 PCI Express 5.0 lanes.
Architecture of communication interfaces
A communication interface enables communication of a microprocessor with other microprocessors, accelerators, or off-chip memory. Parallel interfaces were popular early on. The need for higher bandwidth resulted in the development and adoption of serial interfaces. We review both next, and also highlight implementation aspects.
Parallel interface
Early interfaces were implemented by using a parallel architecture (Figure 4(a)) (Giuma and Hart, 1996; Shanley et al., 1995), which transferred data by dividing it simultaneously over multiple wires (Dawoud and Dawoud, 2020; Roman, 1998). For instance, a 32-bit-wide parallel interface needed at least 32 wires to transfer 32 bits (4 bytes) of data in one cycle, with additional wires for controls. 46 To correctly receive the full data, the receiving end must wait until all signals have arrived (Bandyopadhyay and Cases, 2000; Lee et al., 2013), before assembling them into the complete 4-byte data. While building a wider parallel interface or raising the clock frequency to increase the bandwidth may seem intuitive, it is challenging in practice (Bandyopadhyay and Cases, 2000; Dawoud and Dawoud, 2020). A wider parallel interface entails more pins on the (already pin-limited) microprocessor (Sarmah and Azeemuddin, 2017), and uses more wires (Dawoud and Dawoud, 2020; Roman, 1998), which then makes PCB routing more difficult. That is also why early interfaces were half-duplex 47 (Figure 4(d)). Full-duplex interfaces require dedicated wires for sending and receiving data at the same time (Figure 4(d)) (Dawoud and Dawoud, 2020; Summerville, 2009), doubling the number of wires and further complicating the routing and pin allocation.

Figure 4. Simplified high-level overview of communication interfaces: (a) assume a word w consists of 8 bits of data b0, b1, b2, b3, b4, b5, b6, b7, which can be transmitted in parallel at the same time (i.e., during the same clock cycle) using eight data wires along with dedicated clock signal and control signal wires on the 8-bit-wide synchronous parallel interface; (b) in a serial interface, a serializer takes each bit of the word and transmits it one by one per clock cycle (i.e., eight clock cycles are needed to transmit a word) using one wire along with control signal wires (in this case, the clock signal is embedded into the data signal). Although it needs more cycles to transmit the word, the serial interface can run at a significantly higher frequency than the parallel interface; (c) aside from increasing the clock frequency, the serial interface can have multiple lanes to increase its bandwidth. In this case, there are eight lanes, each transmitting a word at the same time (w0, w1, w2, w3, w4, w5, w6, w7). Each word is 8 bits of data (b0, b1, b2, b3, b4, b5, b6, b7); (d) a two-way communication system, either parallel or serial, can be implemented as a half-duplex or a full-duplex interface. In half-duplex, one wire is used for sending and receiving a bit, and thus sending and receiving cannot happen at the same time. In full-duplex, two wires are used for sending and receiving a bit, and thus sending and receiving can happen at the same time, improving bidirectional bandwidth at the expense of more wires; (e) the communication link can use single-ended or differential-pair signaling. Single-ended signaling is usually used for short-distance communication since it is prone to noise, while differential-pair signaling can be used for longer distances as it is more immune to noise.
Increasing the clock frequency to enable higher data transfer rates introduces signal integrity issues (Bogatin, 2011; Wu et al., 2013), including timing skew 48 (Hu and Yuan, 2009; Lee et al., 2013), and electromagnetic interference 49 (Frenzel, 2007; Karstensen et al., 2000). When wires have different lengths, 50 or are subjected to different noise and cross-talk, they may have different arrival times (skew). Accordingly, implementing a wider parallel interface becomes more challenging at higher clock frequencies: the tolerance window in which the receiver can wait for all signals to arrive becomes shorter, whereas the skew increases due to higher parasitic capacitance, noise, and cross-talk. These challenges led to the development and adoption of serial interfaces.
Serial interface
Modern communication interfaces rely on serial architectures to limit the number of pins on a processor, as well as wiring on the PCB. Since the 2000s, interfaces have been implemented through a serial architecture, 51 where each bit-line is implemented by using either one wire for a single-ended interface, or two wires for a differential-pair interface 52 (Figure 4(e)) (Chen and Katopis, 2004; Mechaik, 2001). To increase the bandwidth, multiple serial lanes are used to form a wider serial interface (Figure 4(c)) (Sarmah and Azeemuddin, 2014, 2015). Unlike the parallel interface, each serial lane is individually synchronized, and multi-lane synchronization can be performed through the use of a special control byte (Wu et al., 2016). However, adding more lanes needs more wires and pins on the microprocessor, leading to increased routing complexity and higher implementation costs (Na et al., 2017; Sreerama et al., 2018). Moreover, adding more lanes also necessitates a dedicated physical layer for each lane, which consumes area on the silicon die and increases energy consumption (Abdennadher et al., 2020; Rashdan et al., 2020).
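For a rough sense of scale, the sketch below compares the raw bandwidth of a wide parallel bus at a modest clock with that of a multi-lane serial interface running at a much higher per-lane rate; the widths, clock, lane rate, and the 128b/130b-style encoding efficiency are assumed figures for illustration only, not the parameters of any particular standard.

```python
# Rough bandwidth estimates in Gbit/s; all figures are illustrative assumptions.
def parallel_bw_gbps(width_bits, clock_ghz):
    return width_bits * clock_ghz                          # bits per clock * clocks per ns

def serial_bw_gbps(lanes, lane_rate_gbps, encoding_efficiency=128 / 130):
    return lanes * lane_rate_gbps * encoding_efficiency

print(parallel_bw_gbps(64, 0.1))    # 64-bit parallel bus at 100 MHz -> 6.4 Gbit/s
print(serial_bw_gbps(16, 32.0))     # 16 serial lanes at 32 Gbit/s   -> ~504 Gbit/s
```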
The physical layer of the serial interface
Serializer-Deserializer (SerDes) implements the physical layer 53 of the serial interface, 54 and its performance directly impacts communication speed. SerDes converts a parallel stream of data into a serial stream; once the serial stream has been transmitted, the SerDes on the receiving end converts it back into a parallel stream 55 (Figure 4(b)) (Frenzel, 2007; Ko, 2022). SerDes resides inside the chips, and is the heart of a serial interface (Rashdan et al., 2020). SerDes also supports inter-node communication interfaces, such as Ethernet (Law et al., 2013) and InfiniBand (Part II: Inter-node communication (Hanindhito et al., 2026)).
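The logical essence of serialization and deserialization can be sketched in a few lines of Python (a conceptual illustration only; a real SerDes is a mixed-signal circuit that also handles clock embedding, equalization, and encoding):

```python
# Minimal sketch of what a SerDes does logically: serialize a byte into bits
# (MSB first here, an arbitrary choice) and deserialize it back on the far end.
def serialize(byte):
    return [(byte >> i) & 1 for i in range(7, -1, -1)]

def deserialize(bits):
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

assert deserialize(serialize(0b10110010)) == 0b10110010
```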
SerDes data rates based on the Optical Internetworking Forum (OIF).
Signal coding and modulation
We highlight techniques and technologies that are used to encode and modulate data for transmission through a medium (e.g., metal wires or optical interconnect). They play a fundamental role in future bandwidth improvements by allowing a signal to carry more data over a long distance. Advanced data encoding and modulation are especially important as they are among the very few options that can alleviate communication bottlenecks.
Before data can be transmitted through a medium that carries either electrical or optical signals, it is encoded through encoding schemes. This improves spectral efficiency 56 (Rysavy, 2014; Winzer, 2012), and enables error detection and error correction (Alyaei and Glass, 2009; Fair et al., 1991; Imai and Hirakawa, 1977), which are needed for maintaining signal integrity (Mercier et al., 2010; Seshadri et al., 1993; Tzimpragos et al., 2016). Then, modulation is performed to transform the data into signals suitable for transmission over long distances. The primary goal of encoding and modulation is to increase the data rate 57 while reducing the signal rate. 58 We focus on digital modulation techniques (Smithson, 1998; Xiong, 2006) since they are widely used in intra-node and inter-node communication (Part II: Inter-node communication (Hanindhito et al., 2026)).
Digital baseband 59 modulation, also known as line coding (Guri et al., 2015; LoCicero and Patel, 2018; Matin, 2018; Rezaei et al., 2023; Teixeira and Zaharov, 2007), is used to encode data into a pattern of voltage, current, or photons for short-distance communication through metal wires or optical cables (Loan, 2007; LoCicero and Patel, 2018). The choice of line coding depends on several considerations, such as timing and synchronization capability, power efficiency and electrical characteristics, error detection and correction capability, probability of error and noise tolerance, and complexity of transmitter and receiver design (Couch, 1994; Lathi and Ding, 2022). Line coding comprises two parts: pulse shaping and block coding.
Pulse shaping
The pulse shape defines the pattern of voltage, current, or photons that transmits information at the bit level. Commonly-used pulse shapes can be grouped into five categories: unipolar, polar, bipolar, multi-transition, and multi-level (Madhow, 2008; Rezaei et al., 2023). The unipolar, polar, and bipolar pulses can be either non-return-to-zero (NRZ) or return-to-zero (RZ), resulting in several combinations. 60 Herein, we only highlight line coding techniques that are used by the communication technologies mentioned in later sections. Interested readers may consult Madhow (2008) for a detailed comparison of pulse schemes.
In NRZ, the non-zero voltage pulse maintains its voltage level throughout the bit-time. 61 With this scheme, NRZ does not provide enough signal transitions to help distinguish each bit during long consecutive transmissions of the same bit value, leading to synchronization problems (Anand and Razavi, 2001; Song and Soo, 1997). RZ provides sufficient transitions by making the non-zero voltage pulse return to zero in the middle of the bit-time. This allows the receiver to distinguish each bit in the case of long consecutive transmissions of the same bit value. 62 The drawbacks are greater complexity and a data rate that is half that of NRZ for the same signal rate.
Due to its simplicity, NRZ is often used for bit rates up to 40 Gbps (Chen et al., 2023; Van Kerrebrouck et al., 2019). However, implementing higher-bandwidth communication interfaces (Section 2.13) necessitates finding denser pulse shapes, such as those that use Pulse-Amplitude Modulation (PAM), which uses multi-level voltages to represent digital symbols. For instance, PAM-4 uses four voltage levels to represent 2-bit symbols, thereby increasing the modulation density. This choice increases the complexity of implementing the transmitter and receiver, and raises signal vulnerability to noise and cross-talk 63 (Chen et al., 2023; Forghani and Razavi, 2022; Van Kerrebrouck et al., 2019), resulting in a vastly higher bit error rate. 64 Increasing the operating voltage 65 to address the issue is not preferred due to higher energy consumption (Garcia et al., 2007; Müller et al., 2015). Higher-order PAM, such as PAM-6 or PAM-8, is currently being developed to fulfill the needs of future data rates (Che and Chen, 2023; Hecht et al., 2022; Yue and Shekhar, 2022).
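The benefit of multi-level signaling can be seen from a simple relation: an m-level pulse carries log2(m) bits per symbol, so the data rate grows with the number of levels at a fixed symbol (baud) rate. A small sketch, with an illustrative 28 GBd lane rate chosen only for the example:

```python
import math

# Bits per symbol for m-level pulse-amplitude modulation: log2(m).
def bits_per_symbol(levels):
    return math.log2(levels)

symbol_rate = 28e9                       # 28 GBd, an illustrative lane rate
for levels in (2, 4, 8):                 # NRZ, PAM-4, PAM-8
    data_rate = symbol_rate * bits_per_symbol(levels)
    print(f"{levels}-level: {data_rate / 1e9:.0f} Gbit/s")
# NRZ: 28 Gbit/s, PAM-4: 56 Gbit/s, PAM-8: 84 Gbit/s at the same symbol rate.
```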
Block coding
Although pulses are sufficient for transmitting information at the bit level, a more efficient scheme is necessary to fulfill the demand for high-bandwidth communication interfaces. This is where block coding can help. In addition to higher bandwidth, block coding also provides signals with higher power efficiency 66 and self-synchronization capabilities, which NRZ and PAM alone cannot achieve. Block coding is combined with both NRZ and PAM, and is used in recent high-bandwidth communication interfaces. 67
Block coding divides the data into fixed-length blocks, and encodes each block into a slightly longer block, according to predefined coding schemes. Popular block codes include 4B/5B (Robe et al., 1993), 2B1Q (Sugimoto et al., 1989), 8B6T (Buchanan, 1999), and 8B/10B (Wang et al., 2010). For instance, the 8B/10B block code, used in many communication interfaces, 68 encodes 8-bit blocks (or 8-bit words) into 10-bit symbols. This increases power efficiency and provides self-synchronization capability, at the expense of a 2-bit overhead for every 8 bits of transmitted data. Denser block codes have lower overhead. These include 64B/66B 69 (Balasubramanian et al., 2011; Mohapatra et al., 2017), 128B/130B 70 (Mhaboobkhan et al., 2019; Weng et al., 2021), 242B/256B, and 512B/514B (Cideciyan et al., 2013; Teshima et al., 2008), and are used for communication interfaces that require even higher bandwidth.
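The overhead of a kB/nB block code follows directly from its definition: k data bits are carried by n line bits. A short calculation for a few of the codes mentioned above:

```python
# Coding efficiency and overhead of common block codes: k data bits are
# transmitted as n line bits.
codes = {"8B/10B": (8, 10), "64B/66B": (64, 66), "128B/130B": (128, 130)}
for name, (k, n) in codes.items():
    print(f"{name}: {k / n:.1%} efficient, {(n - k) / k:.1%} overhead")
# 8B/10B: 80.0% efficient (25% overhead); 64B/66B: ~97%; 128B/130B: ~98.5%.
```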
General-purpose microprocessors
Central Processing Units (CPUs) are designed to support various instructions in diverse workloads, and hence are called general-purpose microprocessors (Gelsinger, 2001) (Figure 2). By using multiple complex hardware structures, they are designed to maximize average performance 71 for a wide range of applications. While this is expensive in terms of implementation 72 and energy consumption (Bhandarkar, 1997; Blem et al., 2013), it makes programming CPUs easier for software developers.
We review how technology trends in computing are impacting CPUs. In this vein, Figure 5 exhibits the evolution of key characteristics of general-purpose microprocessors during the past five decades, along with future projections. 73 These characteristics include transistor counts, single-thread performance, base clock frequency, typical power consumption, and the number of logical cores. Both the transistor counts and the number of logical cores are given per package. Moreover, the computing paradigm (Hennessy, 2021) is also shown in the figure to indicate how it evolved over the years. Among these five characteristics, the clock frequency (Leiserson et al., 2020; Xiu, 2017) is perhaps the most widely-known, due to its prominence in marketing for decades.

Technology trends of general-purpose microprocessors since the 1970s, with projections up to 2030.
We discuss the key drivers of these trends. The end of Moore’s law and Dennard scaling significantly impacted the evolution of general-purpose microprocessors. Therefore, we partition the discussion into the periods before and after the end of Dennard scaling, followed by future possibilities.
Trends between the 1980s and early 2000s
The clock frequency increased significantly between the 1980s and the early 2000s, as Intel and AMD competed to develop their speed-demon chips, in which a higher clock frequency was the main figure of merit of a microprocessor (Ronen et al., 2001), used for marketing (Olukotun and Hammond, 2005). Meanwhile, transistor sizes continuously decreased with more advanced process nodes (Bohr, 2007; Gargini, 2017), from 10 μm during the 1970s to 32 nm in the late 2000s (Figure 1). During this period, Dennard scaling (Bohr, 2007; Dennard et al., 1974) still held, allowing manufacturers to raise the clock frequency while reducing the power cost per transistor switch, thus keeping the overall power under control. Although the increase in the clock frequency improved single-thread performance, measured through the SPEC Integer benchmark 74 (Dujmovic and Dujmovic, 1998), this design strategy came with costs. It increased dynamic power consumption (Liu and Svensson, 1994) and made thermal dissipation more challenging (Gurrum et al., 2004; Kish, 2002).
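The interplay between clock frequency and power can be made concrete with the standard dynamic-power relation P ≈ αCV²f (switching activity, capacitance, supply voltage, frequency). The sketch below, with normalized and purely illustrative values, shows why Dennard scaling kept power density roughly constant: shrinking dimensions by 1/k reduces C and V by 1/k each, so even with a k-times faster clock, power per transistor falls by about 1/k², matching the 1/k² reduction in transistor area.

```python
# Dynamic CMOS power: P ~ alpha * C * V^2 * f.  All quantities are normalized,
# illustrative values, not measurements of any particular chip.
def dynamic_power(alpha, C, V, f):
    return alpha * C * V**2 * f

k = 1.4                                          # one classical scaling generation (~0.7x shrink)
p_old = dynamic_power(0.2, 1.0, 1.0, 1.0)        # baseline transistor
p_new = dynamic_power(0.2, 1.0 / k, 1.0 / k, k)  # scaled transistor at a k-times faster clock
print(p_new / p_old)                             # ~ 1/k^2 ~ 0.51: power density stays ~constant
```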
The advancement in process nodes also allowed designers to pack more transistors into the same area, resulting in a steady increase in the number of transistors per microprocessor package from the 1980s to the early 2000s. These transistors were used to implement more complex functional units, which improved single-thread performance 75 through the out-of-order execution engine, 76 vector extensions, 77 branch predictor, 78 and improvements in the memory system (Peleg and Weister, 1991). During this period, the increase in performance and transistor count followed Moore’s law.
The end of Dennard scaling and emergence of multi-core processors
Comparison of recent, commercially-available datacenter class microprocessors.
aIntel Broadwell-EX features up to 24 cores, using a High Core Count (HCC) die, which measures 456.12 mm2 in area and contains 7.2 billion transistors.
bEach multi-chip module (MCM) has 4.8 billion transistors, for a total of 19.2 billion transistors (Naffziger et al., 2021).
cEstimated from its predecessor, Intel Xeon Platinum 8180 (Skylake-SP), with similar core count and process node.
dEach core chiplet has 3.9 billion transistors. The (I/O die) IOD chiplet has 8.3 billion transistors (Naffziger et al., 2021).
eNumber is estimated from the approximate die size of each ARM Neoverse N1 core, with 1 MB L2 cache (1.4 mm2), L3 cache (Part II Section 2.1.2 (Hanindhito et al., 2026)), integrated memory controller, and PCIe interface, for a total area of 400 mm2, along with TSMC 7 nm transistor density of around 95 million transistors/mm2.
fApproximate memory bandwidth from memory configuration, with 8 channels of DDR4-3200 ECC (Ampere Computing LLC, 2022).
gEach core chiplet has 6.57 billion transistors. The IOD chiplet has 11 billion transistors.
hEach chiplet has an area of 400 mm2, containing approximately 11–12 billion transistors (Nassif et al., 2022).
While increasing the number of cores seems to be an intuitive way of improving the performance of microprocessors, several factors limit the number of cores that can be placed into a microprocessor:

Memory bandwidth bottlenecks. Increasing the number of cores elevates pressure on the memory subsystem (Ahn et al., 2009; Borkar, 2007; Mandal et al., 2010; Sancho et al., 2010), as higher bandwidth is needed to supply the required data to the cores. Insufficient bandwidth may cause cores to wait for memory accesses (Cristal et al., 2005). Microprocessors with a high number of cores usually have multiple memory channels to accommodate the demand for memory bandwidth (Sancho et al., 2010). Adding more memory channels is costly because it needs additional pins (Figure 3), which are already limited, and it necessitates more memory controllers that consume area on the silicon die. Moreover, a more sophisticated motherboard design is needed in order to house more memory modules and maintain signal integrity.

Resource contention and core synchronization. Typically, a core needs to communicate with other cores to share data, share resources, and perform synchronization barriers. Increasing the number of cores in a microprocessor makes synchronization and resource sharing more difficult (Blagodurov et al., 2010; He et al., 2017; Zhuravlev et al., 2010). Extensive inter-core communication and resource contention can limit the attainable performance of multi-core processors (Hood et al., 2010; Xu et al., 2010).

Cache coherency. As the number of cores increases, maintaining coherency across multiple levels of cache becomes more difficult (Part II: Registers (Hanindhito et al., 2026)).

Transistor count, die size, and manufacturing yield. Adding more cores to a microprocessor increases the transistor count. With the slowdown of transistor scaling, a larger die size is then needed, which lowers the yield and increases manufacturing costs (Mack, 2015; Sun et al., 2020) (Section 2.5). To overcome these challenges, since 2017, microprocessors started to use multi-chip module (MCM) technology 80 (Naffziger et al., 2021), followed by chiplet 81 (Loh et al., 2021; Naffziger et al., 2020) and 3D stacking technologies 82 (Agarwal et al., 2022; Beyne et al., 2021; Ingerly et al., 2019; Su et al., 2017). Figure 6 shows the evolution of packaging technologies used by modern microprocessors.

Power consumption and heat dissipation. Packing more transistors on a silicon die, combined with the end of Dennard scaling (Esmaeilzadeh et al., 2011b; Wang and Skadron, 2013), has resulted in higher power consumption, which is one of the major factors limiting the addition of more cores (Horowitz, 2014; Tiwari et al., 1998). To overcome this difficulty, manufacturers typically lower the clock frequency of higher-core-count microprocessors (Gepner and Kowalik, 2006), which has resulted in a slower increase in power consumption since 2005 (Figure 5). The use of dynamic voltage and frequency scaling (DVFS) (Herbert and Marculescu, 2007; Le Sueur and Heiser, 2010; Papadimitriou et al., 2019) makes up for the performance loss due to the lower clock frequency by allowing the microprocessor to momentarily increase its power consumption and boost an individual core’s clock frequency, achieving higher single-thread performance as long as the chip stays within its power and thermal envelopes. 83 This feature is specifically useful for legacy applications that cannot take advantage of all the available cores (Cochran et al., 2011); a minimal sketch of inspecting DVFS operating points is given after this list. Another method to balance power consumption is heterogeneity, i.e., combining high-performance and low-power cores. This approach reduces power when running workloads that do not heavily stress the CPU. Notable examples are the ARM big.LITTLE and Intel Alder Lake architectures (Rotem et al., 2022). Nevertheless, high-core-count microprocessors will soon require liquid cooling systems for effective heat dissipation.

Pin limitation. With the increase in power consumption, higher memory bandwidth requirements, and the need for faster connectivity to off-chip peripherals, modern microprocessors need more pins, which are expensive. 84 Figure 3 shows the pins and the package size used by several microprocessors, as well as their evolution over time. Specifically, it shows that, while the number of transistors on a chip has increased by about three orders of magnitude during the past three decades, the number of pins has increased by only about an order of magnitude, exacerbating communication bottlenecks.
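As a small, hedged illustration of DVFS in practice (Linux-specific, and dependent on the kernel, the cpufreq driver, and the hardware actually exposing these files), per-core operating points can be inspected through sysfs:

```python
# Read per-core DVFS operating points on a Linux system with the cpufreq
# driver.  Paths and availability vary by kernel, driver, and platform.
from pathlib import Path

def read_khz(cpu, name):
    p = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/{name}")
    return int(p.read_text()) if p.exists() else None

for cpu in range(4):
    cur = read_khz(cpu, "scaling_cur_freq")
    mx = read_khz(cpu, "scaling_max_freq")
    print(f"cpu{cpu}: current {cur} kHz, max {mx} kHz")
```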
Evolution of microprocessor packaging technology, drawn to scale. Until recently, microprocessors had a single silicon die on a package. Nowadays, more transistors and more cores are placed on a microprocessor die. The slowdown in the advancement of process nodes and transistor shrinking has resulted in larger silicon die sizes, aimed at fitting more transistors on a die. For instance, the Intel Xeon Platinum 8380 is a 40-core microprocessor, implemented by using a single monolithic die, whose size is approximately 600 mm2. A larger silicon die raises the possibility of manufacturing defects, which lowers the yield and increases the manufacturing cost (Mack, 2015; Sun et al., 2020) (Section 2.5). To improve the yield, AMD uses multiple identical silicon dies (each able to function as a stand-alone die) within a single package, referred to as a multi-chip module (MCM). For instance, the 32-core AMD EPYC 7601 microprocessor has four silicon dies, each containing eight cores and measuring 213 mm2, for a total of 852 mm2 per package (Naffziger et al., 2021). The next advancement in packaging technology uses multiple silicon dies, referred to as chiplets, where each chiplet may have a different functionality and process node. For instance, the 64-core AMD EPYC 7663 consists of 33 billion transistors, implemented in eight 8-core core complex dies manufactured in a 7 nm process node (8 × 81 mm2 in size) and one I/O die manufactured in a 12 nm process node (416 mm2 in size), for a total of 1064 mm2 of silicon die in a package (Naffziger et al., 2021). Chiplets can be stacked on top of each other by using 3D packaging technologies. An example is the AMD EPYC 7773X, where an SRAM cache chiplet (in light grey) is stacked on top of each core complex die, tripling the capacity of the last-level cache (256 MB on the AMD EPYC 7663 vs 768 MB on the AMD EPYC 7773X) (Agarwal et al., 2022). The 56-core Intel Xeon Max microprocessors have four chiplets (each presumed to have a size of 400 mm2) and integrate HBM2e memory (in dark grey) on the same package (Sanca and Ailamaki, 2023).
Many-core processors
Applications that have inherent parallelism, which are abundant in high-performance computing and machine learning, typically have simple execution flows 85 (Mittal, 2020a; Véstias and Neto, 2014). Advanced branch predictors, aggressive speculative execution engines, or features like simultaneous multi-threading (SMT) do not typically benefit these workflows. These applications can enjoy larger performance improvements if area on a silicon die is allocated to build a large number of simple cores, instead of fewer but more sophisticated ones (Carter et al., 2013; Narayanan et al., 2015). More cores allow these applications to improve performance through parallelization (Schmidl et al., 2013; Silva et al., 2019). Many-core processors, followed by hardware accelerators, such as GPUs, have been a response to this need. We highlight the basic design philosophy of many-core processors through an example.
Comparison of multi-core (Intel Xeon) and many-core (Intel Xeon Phi) processors.
The last generation of Intel Xeon Phi was Knights Mill 87 (2017), which was specifically designed for accelerating AI and ML workloads (Domke et al., 2019; Georganas et al., 2018). Intel Xeon Phi offered substantial performance gains for HPC, AI, and ML workloads, due to their abundant data-level parallelism and simple execution flows (Mittal, 2020b; Shao and Brooks, 2013). However, it faced strong competition from GPUs (Mittal, 2020a), which forced Intel to discontinue the Xeon Phi product lineup in 2020. Nevertheless, the spirit of many-core computing is still alive in other areas, such as cloud computing clusters 88 that host small applications and micro-services in containerized form (Pahl et al., 2019; Singh and Singh, 2016) for multiple tenants. These micro-services are typically not as computationally demanding as HPC or ML applications. Therefore, simpler and more energy-efficient cores are often preferred: many-core processors allow the cloud provider to achieve higher compute density through a higher aggregate number of cores per server rack, and to improve energy efficiency, which decreases the total cost of ownership (TCO). This has led to the development of microprocessors designed specifically for cloud computing clusters, such as AMD EPYC Bergamo (2023), which has 128 Zen4c cores 89 that are optimized for performance-per-watt. Intel is expected to release a new product lineup for cloud computing in 2024 with its Sierra Forest processor, which is estimated to have 288 (energy) efficiency-oriented cores.
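How much the extra cores of a multi-core or many-core processor actually help depends on the fraction of the work that can be parallelized. A minimal sketch of Amdahl’s law makes the point (values are generic, not tied to any specific processor):

```python
# Amdahl's law: speedup on n cores for a program whose parallel fraction is p.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

for p in (0.5, 0.9, 0.99):
    print(p, [round(amdahl_speedup(p, n), 1) for n in (8, 64, 512)])
# Even with p = 0.99, 512 cores yield only ~84x: the serial fraction dominates.
```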
Several semiconductor companies are increasingly adopting ARM-based processors due to their lower power consumption, making them a strong alternative to Intel and AMD’s x86 architectures. This shift has long been evident in consumer devices—Apple’s M-series (MacBooks) and A-series (iPhones), as well as Qualcomm’s Snapdragon processors, all use ARM microarchitectures. More recently, major cloud providers have extended this trend to their infrastructures, integrating ARM processors like Amazon AWS Graviton (Loghin, 2024), Microsoft Azure Cobalt, and Google Cloud Axion. Ampere Computing also develops ARM-based many-core server solutions like AmpereOne, incorporating up to 192 cores, with future models expected to reach 256 and 512 cores. NVIDIA develops Grace CPU based on ARM architecture (Evans, 2022) as part of their CPU-GPU heterogeneous system (Part II: System integration and heterogeneous computing (Hanindhito et al., 2026)).
Programmability
CPUs are the most flexible hardware platform, as their Instruction Set Architecture (ISA) allows them to execute any workload and application with relative ease. CPUs can be programmed through several languages: from machine level (e.g., x86 or ARM assembly), to mid-level (e.g., C/C++), to high-level (e.g., Python). Compiler toolchains (e.g., gcc or LLVM) are responsible for converting high-level software code into machine binary code. There exist several tools to extract parallelism from multi-core CPUs. For instance, in C, programmers may use either pthreads, a low-level threading model supported by most operating systems, or OpenMP, a directive-based API that is easier to use.
While CPUs are highly flexible and easy to use, extracting the maximum performance out of them is not trivial. Although compilers offer many optimization techniques, the generated machine code is typically suboptimal. In order to efficiently utilize a CPU, the programmer needs to be aware of hardware-related features of the underlying CPU architecture. 90
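As a minimal illustration of exploiting multiple cores from a high-level language (hedged: this uses Python’s standard-library process pool, a high-level counterpart of the pthread/OpenMP approaches mentioned above; the workload is synthetic):

```python
# Spread independent work items across CPU cores with a process pool.
from concurrent.futures import ProcessPoolExecutor
import math

def work(n):
    # A synthetic, CPU-bound task.
    return sum(math.sqrt(i) for i in range(n))

if __name__ == "__main__":
    inputs = [2_000_000] * 8
    with ProcessPoolExecutor() as pool:      # defaults to one worker per core
        results = list(pool.map(work, inputs))
    print(f"processed {len(results)} chunks in parallel")
```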
What comes next?
Since chip designers have limited options to improve performance further, the next generation of general-purpose microprocessors is expected to have features that are purpose-built and optimized for specific use cases:

Single-thread performance. General-purpose microprocessors will only see slight improvements in individual core performance due to the diminishing returns of adding more transistors to implement a more complex core. With slight performance improvements between generations, manufacturers will, once again, rely on (slightly) increasing the clock frequency to improve performance across generations. 91 Due to this stagnation in performance improvements, it is expected that manufacturers will include more on-chip accelerators to offload popular workloads, such as machine learning (Khaldi et al., 2021; Tukanov et al., 2022), data analytics (Sanca and Ailamaki, 2023), data encryption (Biswas, 2021), and network applications (Nassif et al., 2022) (Section 4).

Number of cores. The number of cores is expected to increase to improve aggregate performance for highly parallelizable applications.

Core architecture and heterogeneity. Manufacturers will continue to develop multiple variants of core architecture: performance-oriented cores, to achieve the highest performance-per-core; and efficiency-oriented cores, to achieve the highest performance-per-watt. This allows manufacturers to develop different product lines with different core architectures to address specific needs 92 (Sideco, 2023), or to integrate both core architectures on the same die or in the same package (Rotem et al., 2022; Vasilakis et al., 2017).

Memory bandwidth. Improvements have been made by integrating high-bandwidth memory (HBM) into the same package as the CPU (Sanca and Ailamaki, 2023; Shipman et al., 2022), through the use of higher-bandwidth memory modules (DDR5), and by adding more memory channels (Part II: Memory systems (Hanindhito et al., 2026)). HBM will provide significant speed-ups for applications that fit inside the memory, such as machine learning inference, where models typically fit in HBM (Part II: High-Bandwidth Memory (HBM) and variants (Hanindhito et al., 2026)).

Packaging. Manufacturers will employ even larger package sizes to increase the number of pins, in order to deliver more power and provide more connectivity, while continuing to use chiplets, 93 along with new substrate materials, such as glass substrates (Kudo et al., 2021; Vanna-Iampikul et al., 2023), to improve connectivity between the chiplets in a package. For instance, Intel’s next-generation data center processors, code-named Granite Rapids and Sierra Forest, are expected to have an LGA 7529 socket, which has 7529 pins (a 60% increase over the current LGA 4677, used by Sapphire Rapids and Emerald Rapids).
Summary and remarks
When Moore’s law was alive, CPUs were becoming steadily faster at the same price tag. Improvements in performance were primarily due to the implementation of more complex features, thanks to transistor miniaturization. In this era, improvements in hardware would typically impact application performance directly, sometimes making it difficult to justify algorithmic modifications.
The end of Moore’s law significantly impacted this trend. Adding more complex features to the chip resulted in minor impacts on performance. Moreover, power density on a chip increased due to the end of Dennard scaling, making energy efficiency a central issue in modern chip design. In this era, better performance could be realized by building multiple simpler and more efficient cores within a chip, as opposed to a single powerful but inefficient core. Multi-core processors became popular for complex parallelizable workflows, and many-core processors could provide significant performance gains to parallelizable applications that enjoyed considerable regularity and structure. Parallel algorithms became fundamental to harness the performance offered by multi- and many-core processors.
CPUs are here to stay. They need to run the operating system and control other hardware, such as GPUs. Algorithms that are hard to parallelize, and legacy software that does not get financial and technical support for modernization, will continue to run on CPUs.
With more transistors needed to realize the higher number of cores, next-generation CPUs will rely on advanced packaging to improve yield and decrease manufacturing costs. On the other hand, the number of CPU pins has grown much more slowly than the number of on-chip transistors, resulting in performance degradation when off-chip communication is significant. On-package high-bandwidth memory alleviates this bottleneck when the application is small enough to fit into that memory. Algorithms that reduce off-chip communication could be valuable for applications that would otherwise require considerable off-chip communication. Lastly, energy considerations will result in different classes of CPUs integrated into heterogeneous systems to support different application needs: cores that are very fast but consume a lot of energy (performance-oriented), and cores that are energy-efficient and thus slower (efficiency-oriented).
Hardware accelerators
Estimated silicon die area (mm2) for the AMD Zen architecture, based on silicon die image analysis. a Silicon dedicated to arithmetic operations constitutes less than 5% of the total die area, whereas the remaining 95% of the silicon die is used to improve efficiency and manage data movement.
aSilicon die images are obtained from Fritz (2019a, 2019b), Fritz (2020), and Killian (2023) for Zen 1, Zen 2, Zen 3, and Zen 4, respectively. Analysis of the die images is based on Locuza (2020) for Zen 1 and Zen 2, and Locuza (2022) for Zen 3 and Zen 4. Percentage is based on the total area of all dies within a package.
bI/O Die (IOD) (Suggs et al., 2020) contains the InfinityFabric interface for inter-die communication, PCI Express interface, Memory Controller, Memory Interface, and Integrated GPU (for Zen 4 only). It is manufactured by using 12 nm (Zen 2 (Naffziger et al., 2021) and Zen 3 (Burd et al., 2022)) and 6 nm (Zen 4 (Munger et al., 2023)) process nodes.
cZen 1 does not have a stand-alone I/O die. It has the Zeppelin Multi-Chip Module (Beck et al., 2018; Burd et al., 2019), where Core and I/O are implemented as one die.
dCore Complex Die (CCD) contains the Core Complex (CCX), L3 Cache, and other functional units. It is manufactured by using 14 nm (Zen 1 (Naffziger et al., 2021)), 7 nm (Zen 2 (Suggs et al., 2020) and Zen 3 (Evers et al., 2022)), and 5 nm (Zen 4 (Munger et al., 2023)) process nodes.
eZen 3 and Zen 4 feature a unified CCD, where a single CCD has a single 8-core CCX (Burd et al., 2022; Munger et al., 2023) (as opposed to two 4-core CCX in Zen 1 and Zen 2 (Naffziger et al., 2021; Suggs et al., 2020)), allowing all eight cores to share L3 cache.
fOut-of-Order scheduler includes both integer and floating-point.
gFloating-point execution unit includes the SIMD units (i.e., SSE/AVX), but excludes floating-point registers.
Remarkably, many classes of applications may not substantially benefit from the above-mentioned exotic features 95 (Giles and Reguly, 2014), and may not even achieve the best execution efficiency by using them 96 (Brooks et al., 2000; Zyuban and Kogge, 2000, 2001). For these applications, which typically enjoy a lot of regularity and parallelism, the silicon area can be used more efficiently. Many-core processors and hardware accelerators have been a response to this reality. It is worth noting that while a key objective of hardware specialization (D’Arnese et al., 2023; Peccerillo et al., 2022; Qasaimeh et al., 2019; Chong et al., 2014) through hardware accelerators has typically been to improve energy efficiency (Part II: Energy consumption of large computing centers and its implications (Hanindhito et al., 2026)), specialization often improves other performance metrics as well (Altaf and Wood, 2017; Hameed et al., 2010b).
A hardware accelerator (accelerator, for short) can be defined as a separate compute structure 97 that has an architecture specifically developed for the needs of a particular application or a class of applications (Hwu and Patel, 2008), and is typically connected to a general-purpose microprocessor for execution of other code that does not fit on the accelerator. An accelerator provides significant improvements in many metrics 98 when the supported applications, referred to as accelerated workloads (Nowatzki et al., 2017), are offloaded from the general-purpose microprocessor to the accelerator. There are four primary techniques that accelerators exploit to deliver performance and efficiency, compared to general-purpose microprocessors (Dally et al., 2020; Hennessy and Patterson, 2019): a) using specialized functional units to operate on particular data-types, which allows for fast execution with low overhead 99 ; b) exploiting parallelism at several levels that are more efficient for particular applications, along with using optimized hardware structure 100 ; c) tailoring the memory hierarchy to the application 101 ; and d) reducing overhead associated with fetching and decoding instructions through specific and simplified control flow. 102 In the remainder of this section, we discuss GPUs, which are among the most well-known hardware accelerators, followed by custom-made hardware for specific applications.
Graphics processing units (GPUs)
In this part, we discuss how GPUs have evolved during the past three decades, their compute unit, memory system, and recent additions, such as matrix accelerators, to make GPUs attractive to a wider class of applications and markets.
Evolution of GPUs
In what follows, we discuss why and how GPUs evolved from rigid hardware, made only to process graphics-related workloads, into a computing unit capable of processing a wider class of applications.
Fixed-function pipeline era
Before the 2000s, GPUs were fixed-function microprocessors, solely responsible for processing graphics. They had a fixed graphics pipeline for performing 2D and 3D transformations and computing lighting equations (Blythe, 2008; Lindholm et al., 2008). The graphics pipeline includes vertex, 103 primitive, 104 fragment, 105 and pixel 106 generation and processing units. These operations are highly parallel, since the vertices, primitives, fragments, and pixels are independent and can be processed in parallel during each stage of the graphics pipeline (Blythe, 2008). While this hardware could support basic graphics processing tasks, it did not have general-purpose computing capability.
Programmable shaders
In the early 2000s, due to the need for creating more complex computer-generated imagery, GPUs became increasingly programmable (Blythe, 2008; Elliott, 2004). Programmability started with the introduction of programmable shaders (Peddie, 2023a), through graphics-focused application programming interfaces (APIs), such as Direct3D by Microsoft and OpenGL by the Khronos Group. More parts of the graphics pipeline became programmable as GPUs and graphics APIs advanced. For instance, Direct3D version 9 (2002) featured programmable vertex and fragment processing (Rodriguez et al., 2011), which was implemented by dividing the GPU hardware into several hardware pipeline stages (Goodnight et al., 2005) that resemble the programmable graphics pipeline (Angel and Shreiner, 2011; Owens et al., 2007). It was difficult for hardware designers to determine how much of the silicon die should be dedicated to each hardware pipeline stage across a wide range of graphics applications, and for software designers to identify bottlenecks in graphics applications in order to balance the performance of each graphics pipeline stage across different GPU architectures 107 (Chen et al., 2005; Peddie, 2023a). Moreover, the graphics APIs were constantly changing, and, at times, the number of graphics pipeline stages was not known during the design of the hardware.
Unified shaders
Terminology equivalency between NVIDIA, AMD, and Intel GPUs.
aUnofficial name, given for completeness.
General-purpose graphics processing units (GP-GPUs)
Evolution of NVIDIA datacenter-class general-purpose GPUs.
aNVIDIA Tesla K40 and NVIDIA Tesla M40 were manufactured using the same 28 nm process node, although with different architectures. The M40 with Maxwell architecture has a higher single-precision performance but much lower double-precision performance, compared to K40 with Kepler architecture. NVIDIA marketed the Tesla M40 as a deep learning training accelerator, where most deep learning applications only need single precision.
bData is obtained from TechPowerUp GPU Specs Database: C870, C1080, M2090, K40, M40, P100.
cTo improve manufacturing yields, some GPCs can have all of their TPCs active, while others can have fewer TPCs enabled. For example, the H100 can have two GPCs with 9 TPCs per GPC and 6 GPCs with 8 TPCs/GPC.
dThe GF110 chip in NVIDIA Tesla M2090 features a unified L1 cache and shared memory of 64 KB per SM, which can be configured as 16 KB:48 KB or 48 KB:16 KB (L1-cache:shared-memory).
eJust like its predecessor, the GK180 chip in NVIDIA Tesla K40 features 64 KB of unified L1 cache and shared memory per SM. It can be configured as 16 KB:48 KB, 48 KB:16 KB, or a split 32 KB:32 KB (L1-cache:shared-memory).
fThe GM200 chip in NVIDIA Tesla M40 has a dedicated 96 KB shared memory and 48 KB of unified L1 cache and texture cache. In its predecessors, the texture cache was a dedicated unit separate from L1 (e.g., Kepler has a 48 KB texture cache and 64 KB unified L1 cache and shared memory).
gThe GP100 chip in NVIDIA Tesla P100 has the same implementation of on-chip memory as its predecessor. Although GP100 has reduced L1 cache and shared memory per SM, due to the half number of CUDA cores in each SM compared to GM200, the total L1 cache and shared memory in P100 are 42% larger than M40 because of the double number of SMs compared to M40.
hThermal design power (TDP) specifies the maximum amount of heat that can be generated by a chip, which a cooling system must handle (Ganapathy and Warner, 2008; Guermouche and Orgerie, 2022). Since heat is a consequence of dissipated power, TDP indicates the power a chip consumes during a sustained, long operation. A chip may consume higher power than TDP for a short period of time (e.g., due to the boost clock) as long as it is still within its thermal and power envelopes.
Evolution of NVIDIA datacenter-class general-purpose GPUs-continued.
aData is obtained from TechPowerUp GPU Specs Database: V100, Quadro RTX 6000, A100, RTX 6000 Ada Generation, H100, H200, and B200.
bThe Tensor Cores are specialized compute blocks optimized for tensor operations. They were first introduced with the V100 GPU.
cThe GV100 chip in NVIDIA V100 features a unified L1 cache, texture cache, and shared memory, for a total of 128 KB per SM. The shared memory can be configured to use up to 96 KB of the unified memory, while the rest is used for both the L1 and texture cache.
dThe TU102 chip in NVIDIA Quadro RTX 6000 features a 96 KB L1 data cache/shared memory. Traditional graphics workloads can partition it as 64 KB of dedicated graphics shader RAM and 32 KB for texture cache and register file spill area. Compute workloads can divide it into 32 KB shared memory and 64 KB L1 cache, or 64 KB shared memory and 32 KB L1 cache.
eThe GA100 chip in NVIDIA A100 features a unified L1 cache, texture cache, and shared memory, for a total of 192 KB per SM. The shared memory can be configured to use up to 164 KB of the unified memory, while the rest is used for both the L1 and texture cache.
fThe GH100 chip in NVIDIA H100 features a unified L1 cache, texture cache, and shared memory, for a total of 256 KB per SM. The shared memory can be configured to use up to 228 KB of the unified memory, while the rest is used for both L1 and texture cache. The preliminary specification of H200 retains the same compute configuration with increased memory bandwidth and capacity, which is useful for handling large language models (LLMs).
gBlackwell is the first chiplet-based NVIDIA GPU consisting of two dies connected through 10 TB/s NVIDIA High Bandwidth Interface (NV-HBI). Its preliminary specification is based on publicly available information. The Blackwell Ultra B300 is fine-tuned specifically to accelerate large language models by adding more memory and a higher power envelope than the Blackwell B200, resulting in 50% higher Tensor FP4 performance at the cost of lower performance on other precisions.
hHBM3 and HBM3E can double the effective bandwidth compared to HBM2E at roughly the same memory clock by utilizing higher number of memory channels to increase parallelism (i.e., eight 128-bit channels in HBM2E and sixteen 64-bit channels in HBM3); See Part II Section 2.2.6 (Hanindhito et al., 2026).
GPU compute unit
GPUs are popular for running and accelerating non-graphics applications that exhibit significant parallelism, such as high-performance computing and machine learning. Abundant parallelism in these applications can be extracted through thousands of GPU cores (e.g., CUDA Cores in NVIDIA GPUs, Stream Processors (SP) in AMD GPUs, and Vector Engines (XVE) in Intel GPUs) with the single-instruction multiple-threads (SIMT) execution model (Section 2.11). While the design philosophy of a GPU is similar to that of many-core microprocessors, the term “core” in a GPU is significantly different from that in many-core and general-purpose microprocessors. In both general-purpose and many-core microprocessors, each individual core has a front-end to fetch and decode instructions, a back-end with multiple functional units to execute the instructions, and dedicated registers and first-level cache to store recent and commonly used data. Each core can run software threads independently, until they arrive at a synchronization barrier, if any.
On the other hand, GPU cores are much simpler. They are grouped into SMs/CUs/XCs, where the cores share commonly-used hardware structures, such as instruction schedulers, the L1 cache, and register files, leaving each core with little more than arithmetic-logic units (e.g., floating-point or integer units). As seen in Table 4, these commonly-used hardware structures typically occupy more than 95% of the silicon die area in general-purpose microprocessors; sharing them across hundreds of cores thus makes it possible for GPUs to have thousands of (simple) cores, while keeping the power consumption and silicon die area under control. Figure 7 illustrates the terminologies that are used in GPU hardware and software.

A GPU die floor plan (right) and execution model (left). CUDA Cores execute threads and perform operations on data. A group of threads, called a warp, is scheduled onto an SM sub-partition (SMSP), where it is executed in a lock-step manner. A thread block, or cooperative thread array, consists of up to 1024 threads (32 warps) and is mapped onto a Streaming Multiprocessor (SM). A GPU kernel may use tens of SMs.
While GPUs can execute thousands of threads, they struggle to handle irregular execution patterns. A group of threads, called a warp, is executed by a single SIMT processor. Due to sharing a large portion of hardware structures, each thread within a warp is executed in a lock-step fashion; i.e., any difference in branch outcomes 111 between threads within a warp must be serialized, reducing GPU execution efficiency.
Tables 6 and 7 show that the total number of CUDA cores has increased from generation to generation, as they are the ultimate workhorse to extract as much parallelism as possible from an application. The hierarchical organization has also changed between generations, either by changing the number of CUDA cores per SM, by adding more features to each SM, or by adding more SMs to a GPU. For the same total number of CUDA cores per GPU, it is generally preferred to have more SMs with fewer CUDA cores per SM. This allows for better resource 112 sharing within an SM, and reduces the complexity of thread scheduling. In addition, each SM can run a different kernel, which is beneficial for applications that cannot utilize all the available SMs on a single GPU.
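To make the hierarchy concrete, the sketch below launches a simple element-wise kernel using Numba’s CUDA interface (a hedged illustration: it assumes the numba package and a CUDA-capable GPU are available; the array size and block size are arbitrary). Each thread handles one element; threads are grouped into 256-thread blocks (8 warps of 32 threads), and the blocks are distributed across the GPU’s SMs by the hardware scheduler.

```python
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)              # global thread index = blockIdx.x * blockDim.x + threadIdx.x
    if i < out.size:              # guard: the grid may be slightly larger than the array
        out[i] = a[i] + b[i]

n = 1 << 20
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)

d_a, d_b = cuda.to_device(a), cuda.to_device(b)
d_out = cuda.device_array_like(a)

threads_per_block = 256                                    # 8 warps of 32 threads
blocks = (n + threads_per_block - 1) // threads_per_block  # enough blocks to cover n elements
vector_add[blocks, threads_per_block](d_a, d_b, d_out)

out = d_out.copy_to_host()
assert np.allclose(out, a + b)
```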
GPU memory system
A GPU’s memory system consists of large on-chip memory and high-bandwidth off-chip memory, in order to keep up with the data demands of its extremely large number of cores.
On-chip memory
A GPU’s on-chip memory consists of a large number of registers (Part II: Registers (Hanindhito et al., 2026)) and various types of caches: texture cache, constant cache, and data caches (Part II: Caches (Hanindhito et al., 2026)). Moreover, GPUs provide user-managed, on-chip memory, which is called shared memory (Part II: User-managed vs. compiler-managed scratchpad memory (Hanindhito et al., 2026)).
The L1 cache started to appear in the Fermi (2011) architecture, in addition to the shared memory. Both were implemented as a unified, on-chip memory per SM, where users have control over how the unified on-chip memory is divided between the L1 cache and the shared memory. This organization was changed in the Maxwell (2015) and Pascal (2016) architectures, where the L1 cache and the texture cache were implemented as unified on-chip memory, whereas the shared memory was implemented as stand-alone (i.e., fixed-size) on-chip memory. The organization of the on-chip memory changed again in the Volta (2018) architecture, where the L1 cache, texture cache, and shared memory were implemented as a unified, on-chip memory. Its successors, Ampere and Hopper, retain this organization. The L2 cache is shared across SMs, and provides a way for SMs to share data; it is directly connected to the off-chip memory controller.
Off-chip memory
Due to limited on-chip memory capacity, GPUs need to access data from off-chip memory. Accessing data from off-chip memory has high latency, which can stall execution and reduce the execution efficiency of GPUs. The GPU tries to hide the memory-access latency through aggressive context switching (Lee and Wu, 2014; Lin et al., 2018): it schedules another thread block when one is stalled waiting for a memory access, keeping the SMs busy. To fulfill the off-chip bandwidth requirements, GPUs tend to use off-chip memory with significantly higher bandwidth, achieved through wider memory interfaces (Kim et al., 2014; Li et al., 2018) and higher memory clock frequencies 113 (Cho et al., 2012; Zheng et al., 2008). This led, in 1997, to the development of a specialized version of such memory dedicated to graphics, called SGRAM (Glaskowsky, 1997; Hitachi, Ltd, 1997), and its successor, GDDR-SDRAM (Foss, 1997; Prince, 1999), which will be discussed in detail in Part II: Graphics DRAM (Hanindhito et al., 2026).
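The peak bandwidth of such memory follows a simple relation: interface width (in bytes) times per-pin data rate. The numbers below are illustrative of GDDR- and HBM-class interfaces rather than the specification of any particular product:

```python
# Peak memory bandwidth ~ (bus width in bits / 8) * per-pin data rate.
def peak_bandwidth_gb_s(bus_width_bits, gbit_per_s_per_pin):
    return bus_width_bits * gbit_per_s_per_pin / 8

print(peak_bandwidth_gb_s(384, 16))    # a 384-bit GDDR-class bus at 16 Gb/s/pin ->  768 GB/s
print(peak_bandwidth_gb_s(5120, 3.2))  # a 5120-bit HBM2e-class stack interface  -> 2048 GB/s
```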
Even with these specialized memory systems, meeting the growth in bandwidth demand due to increased number of cores on a GPU is challenging (Hwu and Patel, 2018). In 2015, HBM found its way to the GPUs to fulfill the needs of even higher memory bandwidth (Hu et al., 2018; Macri, 2015). Since then, HBM and its successors have been integrated into GPUs, especially data-center-class GPUs (Tables 6 and 7). HBM and its competing product, HMC, are discussed in Part II: High-Bandwidth Memory (HBM) and variants (Hanindhito et al., 2026).
Specializations for machine learning
Machine-learning applications have become very popular during the last decade. Many of these applications involve convolution, which translates into matrix operations. Machine-learning’s large market has motivated GPU designers to add more specialized features to improve efficiency across popular applications.
General matrix-matrix multiplication (GEMM) operations are abundant in machine learning and in some high-performance computing applications (Juan et al., 2021; Zhuang et al., 2023b). Starting with the NVIDIA Volta architecture (2018), GPUs have been equipped with Tensor Cores, which are specialized cores for performing GEMM at different levels of precision. The first generation of Tensor Cores was able to perform half-precision GEMM operations to support mixed-precision machine learning training (Micikevicius et al., 2018), with a theoretical peak performance of 125 TFLOP/s, roughly four times the peak performance provided by CUDA Cores for half-precision computation within the same chip (Markidis et al., 2018). Therefore, GEMM operations can be performed more efficiently through (specialized) Tensor Cores, as opposed to (more general) CUDA Cores. This makes GPUs a popular hardware accelerator for many machine-learning applications (Cass, 2020; Jordà et al., 2019).
Subsequent versions of Tensor Cores added more levels of precision and operations. Second generation Tensor Cores in NVIDIA Turing architecture (2019) added 8-bit, 4-bit, and 1-bit integer data-types to support quantized machine learning inference (Kim et al., 2021; Li et al., 2021) and binary neural networks (Li and Su, 2021). Third generation Tensor Cores in NVIDIA Ampere architecture (2020) added new BFloat16 (Wang and Kanwar, 2019) and TensorFloat32 (Choquette et al., 2021) to accelerate machine learning workloads with better dynamic ranges. They also added new FP64 support, which opens the possibility of Tensor Cores being used in HPC and scientific computing applications (Gallet and Gowanlock, 2022; Lee et al., 2022). Support for sparse matrix-matrix multiplication for applications that have abundant sparsity was also added, which improves performance and reduces bandwidth utilization (Anzt et al., 2020; Sun et al., 2022b). Fourth generation Tensor Cores in NVIDIA Hopper architecture (2022) added new quarter precision, FP8, with a built-in transformer engine that can dynamically choose the appropriate FP8 format to improve performance, while maintaining model accuracy for large language models (Zhuang et al., 2023a). The new Tensor Memory Accelerators (TMA) improve asynchronous data staging for the Tensor Cores. Both AMD and Intel followed NVIDIA to integrate their own version of Tensor Cores in their GPUs: AMD introduced Matrix Core with their CDNA GPU architecture in 2020, and Intel introduced XMX Matrix Engine into their Ponte Vecchio GPU architecture in 2022 (Jiang, 2022).
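In practice, these units are reached through vendor libraries rather than programmed directly. A hedged PyTorch sketch (assuming a CUDA-capable GPU of Volta class or newer and a recent PyTorch build; matrix sizes are arbitrary) shows two common ways a GEMM becomes eligible for Tensor Core execution: computing directly in FP16, or wrapping FP32 code in an autocast region:

```python
import torch

# Direct FP16 GEMM: eligible for Tensor Cores on Volta-class or newer GPUs.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
c = torch.matmul(a, b)

# Mixed precision: keep FP32 tensors, let autocast run the matmul in FP16.
x = torch.randn(4096, 4096, device="cuda")
y = torch.randn(4096, 4096, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.float16):
    z = x @ y
```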
Programmability
NVIDIA and AMD provide CUDA (Buck, 2007) and ROCm (Sun et al., 2018), respectively, for programming GPUs. High-level software stacks, such as TensorFlow (Abadi et al., 2015) and PyTorch (Paszke et al., 2019), have been specifically developed for machine learning; these frameworks provide intuitive and practical Python APIs, which efficiently map user-defined network models to the underlying GPU architecture, and therefore, maximize productivity.
Despite the existence of these frameworks, developing efficient GPU code for a new application is challenging, and requires an extensive understanding of the GPU architecture. For instance, developers must be aware of performance degradation caused by several factors, such as: a) thread divergence, which happens when threads in a warp follow different paths in conditional statements; b) register spilling (Part II: Registers (Hanindhito et al., 2026)), which can lead to significant delays and underutilization of the compute units; c) high kernel-launch overhead, which occurs when kernels are not assigned enough work; and d) lack of vectorization, i.e., not taking advantage of the SIMD capabilities of the vector units.
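Thread divergence, the first of these pitfalls, can be illustrated with a small Numba sketch (hedged: it assumes numba and a CUDA GPU; the kernels only write constants and exist purely to show where divergence arises). In the first kernel, even and odd threads of the same 32-thread warp take different branches and are serialized; in the second, the branch granularity matches the warp size, so every warp takes a single path:

```python
import numpy as np
from numba import cuda

@cuda.jit
def divergent(out):
    i = cuda.grid(1)
    if i < out.size:
        if i % 2 == 0:            # neighbouring threads disagree -> intra-warp divergence
            out[i] = 1.0
        else:
            out[i] = -1.0

@cuda.jit
def warp_aligned(out):
    i = cuda.grid(1)
    if i < out.size:
        if (i // 32) % 2 == 0:    # all 32 threads of a warp agree -> no divergence
            out[i] = 1.0
        else:
            out[i] = -1.0

out = cuda.device_array(1 << 20, dtype=np.float32)
divergent[4096, 256](out)
warp_aligned[4096, 256](out)
```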
Specialized and custom hardware
Specialized hardware can be used when desired performance or efficiency metrics cannot be met by using CPUs and GPUs. In essence, hardware accelerators epitomize the art of balancing flexibility (e.g., range of supported applications) and efficiency (e.g., cost, performance, energy consumption) (Sze et al., 2017; Verbauwhede et al., 2004), as illustrated in Figure 2, and discussed in Execution model, architecture, and implementation style. We outline typical challenges that arise during the development of specialized hardware, followed by possibilities for implementing the specialized accelerator. We end this section by providing several examples of specialized hardware that have been used in machine learning and scientific computing.
Challenges in developing specialized hardware accelerators
The development and adoption of specialized accelerators face two major challenges: cost and software support, which we highlight next.
Development and manufacturing of an accelerator are expensive compared to using off-the-shelf products. The total cost of ownership (TCO) (Cui et al., 2017; Martens et al., 2012) of off-the-shelf products, compared to a specialized accelerator, is often used as the key metric for making strategic decisions (Khazraee et al., 2017; Magaki et al., 2016). For large-scale 114 deployments, the cost per chip of a specialized accelerator may be lower than that of corresponding off-the-shelf hardware (Wu and Tsai, 2004; Zahiri, 2003). By contrast, if only a small number of units is expected, the fixed cost of designing an accelerator cannot be amortized across a large enough number of fabricated instances. This implies that the development of specialized accelerators is economically viable only if many of them are going to be produced. Moreover, specialized accelerators often have longer development and implementation cycles compared to off-the-shelf components. A common way of reducing the cost of an accelerator is for it to target a broad class of applications within the same domain, instead of a single application. This is often referred to as a domain-specific architecture (DSA) (Dally et al., 2020; Fujiki et al., 2021; Halawani and Mohammad, 2024; Krishnakumar et al., 2023). We provide several examples of DSAs.
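The amortization argument can be made concrete with a toy calculation (all numbers are hypothetical, chosen only to show the shape of the trade-off): a one-time, non-recurring engineering (NRE) cost is spread over the production volume and added to the per-unit manufacturing cost.

```python
# Hypothetical cost amortization for a custom accelerator.
def cost_per_chip(nre, unit_cost, volume):
    return nre / volume + unit_cost

nre = 50e6                     # one-time design, verification, and mask cost (illustrative)
for volume in (1_000, 100_000, 10_000_000):
    print(f"{volume:>10} units: ${cost_per_chip(nre, 100, volume):,.0f} per chip")
# 1,000 units -> $50,100 per chip; 10,000,000 units -> $105 per chip.
```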
Furthermore, availability of robust software ecosystems 115 that help users map and migrate their existing applications to target specific accelerators (Cong et al., 2016; Koeplinger et al., 2018; Koul et al., 2023) is critical. Without adequate software support, users will face difficulties in migrating their existing workloads to accelerators, limiting their adoption rates (Cascaval et al., 2010; Hawick and Playne, 2014; Ikarashi et al., 2022).
Implementation choices
There are several options for implementing specialized hardware accelerators. The fabric in which a chip architecture can be implemented is discussed in Execution model, architecture, and implementation style. Next, we highlight the key characteristics of popular fabrics, such as FPGAs and CGRAs, as well as implementation as full-custom ASICs.
Field Programmable Gate Arrays (FPGAs) have been used as the substrate to implement many hardware accelerators, including machine learning (Chen et al., 2019b; Roorda et al., 2022), linear algebra (Hu et al., 2021; Matteis et al., 2020), graph processing (Besta et al., 2019; Zhou et al., 2019), cryptography (Chelton and Benaissa, 2008; Chen et al., 2019a; Kumar et al., 2020), data analytics (Hoozemans et al., 2021; Kara et al., 2020), and high-performance scientific computing (Belletti et al., 2009; Tan et al., 2014), among many others (Gandhare and Karthikeyan, 2019; Shahzad et al., 2021). Aside from programmable logic elements, FPGAs can contain memory cells (on-chip static random access memory (SRAM); see Part II: Static random access memory (Hanindhito et al., 2026)) and specialized blocks (e.g., digital signal processing (DSP) blocks) (Langhammer and Pasca, 2015; Ronak and Fahmy, 2016). Recently-released FPGAs are also equipped with High-Bandwidth Memory (HBM) (Part II: High-Bandwidth Memory (HBM) and variants (Hanindhito et al., 2026)) to tackle the bandwidth demands of many applications (Holzinger et al., 2021; Wang et al., 2020).
Intel and AMD, the leading FPGA vendors, 116 have recently developed FPGA architectures enhanced to more effectively support the high computational demands in ML (Boutros et al., 2024). In particular, Intel employs specialized in-fabric processing blocks, known as tensor blocks, comprising multiple dot-product engines (Gribok and Pasca, 2024; Langhammer et al., 2021). On the other hand, AMD introduced the AI engine (AIE) (Ahmad et al., 2019), which is an out-of-fabric array of programmable vector processors, integrated next to the FPGA fabric. Another notable AI-optimized FPGA is Achronix’s Speedster7t, which contains up to 2560 Machine Learning Processor blocks (Cairncross et al., 2023). These devices have been used in many works to accelerate AI workloads, significantly outperforming GPUs and traditional FPGAs (Boutros et al., 2020; Taka et al., 2023, 2024). Moreover, academic researchers have proposed in-fabric blocks that employ a 2D systolic dataflow (i.e., FPGA-version of Tensor Cores; see Section 4.1.4) (Arora et al., 2021; Taka et al., 2025), or even processing-in-memory (Part II: Near-memory processing (NMP) and processing-in-memory (PIM) (Hanindhito et al., 2026)) capabilities (Arora et al., 2022), to provide better performance for popular applications. As a result, both Intel and AMD see opportunity in integrating FPGAs into their general-purpose microprocessors. This gives customers flexibility to implement accelerators for specific use cases, especially in data-center and cloud computing workloads.
Hardware designs targeted for implementation on FPGAs are written using low-level hardware description languages (HDLs), such as Verilog, SystemVerilog, and VHDL. Compared to commonly-used programming languages for CPUs and GPUs, these languages are much more complicated, which leads to longer development and debugging cycles. Major challenges in implementing a hardware design on FPGAs include routing congestion and maximizing the attainable clock frequency. AMD and Intel provide High-Level Synthesis (HLS) tools, namely Vitis HLS and Intel HLS, respectively, which synthesize C/C++ code into HDL. While HLS tools can greatly boost productivity, they often come with severe limitations that affect design choices and attainable performance.
Coarse-Grained Reconfigurable Arrays (CGRAs) are used in many domains, including signal processing (Mei et al., 2008; Park et al., 2009), high-performance scientific computing (Charitopoulos et al., 2021; Käsgen et al., 2018), machine learning (Geng et al., 2020; Wei et al., 2023), and near-data processing (Gao and Kozyrakis, 2016). As highlighted in Execution model, architecture, and implementation style, their main advantages over FPGAs are faster reconfiguration, as well as higher performance and energy efficiency, which bring them closer to ASICs. Mapping applications onto CGRAs is challenging and remains a heavily studied area. Improved tools, compilers, and frameworks are expected to make it easier for users to adopt CGRAs for their applications (Chin et al., 2017; Martin, 2022; Wijerathne et al., 2022).
Finally, Application-Specific Integrated Circuits (ASICs) provide higher performance and energy efficiency than FPGAs and CGRAs, but have higher design and manufacturing costs and longer development times (Wu and Tsai, 2004; Zahiri, 2003). An ASIC is therefore a better choice for implementing a mature accelerator architecture that targets popular workloads. 117 Due to the high cost of developing ASICs, most of them sacrifice some level of efficiency for programmability, in order to support several similar applications within a given domain. 118
Examples of custom and specialized hardware
Specifications of popular server-class custom ASICs (table notes a–i below).
a. Information not yet publicly available.
b. This corresponds to FP16 dense peak performance, while sparse peak performance is 75 PFLOP/s (Lie, 2022).
c. Peak FP8 throughput in a C600 PCIe card comprising a single IPU chip.
d. Refers to peak throughput for BFloat16. This is slightly lower than the 688 TFLOP/s peak of the SN30.
e. Refers to the MemoryX memory unit, comprising 12 MemoryX nodes, primarily used to store model weights and stream them into the WSE-2 for processing. Each MemoryX node contains 1 TB of DRAM and 0.5 TB of Flash memory.
f. DRAM is shared between the host CPU and all 4 IPUs in an M2000 IPU-Machine (Knowles, 2021). In a C600 PCIe card comprising a single IPU, the IPU relies on the PCIe bus for communicating with the host's DRAM.
g. The SN40L comprises 64 GB of HBM and 768 GB of DDR memory, offering 1.6 TB/s and 100 GB/s of peak bandwidth, respectively.
h. Refers to the TDP of each individual IPU chip in an M2000 IPU-Machine (Knowles, 2021). The TDP of a C600 PCIe card is 185 W.
i. Refers to typical inference power per SN40L RDU chip in a Cerulean system comprising 16 SN40L RDU chips.
Google’s tensor processing unit (TPU)
TPU is a machine-learning accelerator DSA that was developed by Google and first introduced in 2016 (Jouppi et al., 2017, 2018). The main advantage of TPUs over GPUs is that they are about an order of magnitude more energy-efficient. Due to the scale of Google's operations, these energy savings can be considerable, reducing the total cost of ownership (TCO).
TPUs are programmable through TensorFlow (Abadi et al., 2015), and are available for public use through the Google Cloud Platform (GCP).
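As a rough illustration of this programming model, the following minimal TensorFlow 2 sketch shows the typical idiom for targeting a Cloud TPU: a cluster resolver locates the TPU, the TPU system is initialized, and variables created under a TPUStrategy scope are placed on, and replicated across, the TPU cores. The model and setup details are purely illustrative; the exact TPU address and runtime configuration depend on the GCP environment.

```python
import tensorflow as tf

# Locate and initialize the Cloud TPU; the address is typically picked up
# from the environment on a GCP TPU VM or hosted notebook runtime.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Variables created under the strategy scope are placed on, and replicated
# across, the TPU cores; Keras then compiles the training step for the TPU.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
# model.fit(dataset) would now execute the training loop on the TPU cores.
```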
Its first generation (TPUv1) was primarily used for inference and was manufactured using 28 nm process node technology. It had matrix-multiply-accumulate units arranged in a systolic-array fashion (Jouppi et al., 2018), which multiply 8-bit integers and accumulate the results in 32 bits. Even with this specialized architecture, the matrix-multiply-accumulate unit – the arithmetic unit that performs the sought-after computations – occupies only 24% of the silicon die area, compared to less than 5% in general-purpose microprocessors (Table 4).
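The wide accumulator is a numerical-safety measure rather than an implementation detail: each 8-bit product can be up to 16 bits wide, and summing many of them quickly overflows narrow registers. The short NumPy sketch below, with matrix sizes of our choosing, emulates this accumulation behavior.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=(64, 256), dtype=np.int8)   # 8-bit activations
b = rng.integers(-128, 128, size=(256, 64), dtype=np.int8)   # 8-bit weights

# Each int8 x int8 product fits in 16 bits, and a dot product over 256 terms
# needs at most about 24 bits, so a 32-bit accumulator never overflows here.
c = a.astype(np.int32) @ b.astype(np.int32)

print("max |c| =", np.abs(c).max())          # far outside the int8 range [-128, 127]
print("fits in int32:", np.abs(c).max() < 2**31)
```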
The second generation of TPU (TPUv2), introduced in 2017, used 16 nm process node technology. It added training support and included High-Bandwidth Memory (HBM) to address the memory bottleneck problems of its predecessor (Jouppi et al., 2020).
The third generation of TPU (TPUv3), introduced in 2018, used 16 nm process node technology. Compared to its predecessor, it has twice the number of matrix-multiply-accumulate units, twice the HBM capacity, higher clock frequency, and higher memory bandwidth (Jouppi et al., 2021).
The fifth generation of TPU (TPUv4), 119 introduced in 2021, uses 7 nm process node technology. It includes optical switches that allow reconfiguration of the inter-chip interconnection topology. It supports 8-bit integers, in addition to BFloat16, and has hardware support for large language models and recommender systems (Jouppi et al., 2023).
The latest generation (TPUv5p), introduced in 2023, is Google's most powerful TPU so far. While its process node technology has not been disclosed, it provides a significantly higher memory bandwidth of 2.76 TB/s and a peak INT8 performance of 918 TOPS, targeting more efficient training and inference of large language models.
Graphcore’s intelligence processing unit (IPU)
The IPU 120 is a machine-learning DSA developed by Graphcore. The fundamental architecture of the IPU is different from that of CPUs and GPUs. In contrast to the SIMD execution of vector units 121 (Hassaballah et al., 2008; Raman et al., 2000) in CPUs, or the SIMT execution of GPUs (Fung and Aamodt, 2011; Habermaier and Knapp, 2012), the Graphcore IPU uses a MIMD execution model (Berg and Siegel, 1991; Flynn and Rudd, 1996), which allows it to efficiently execute a massive number of threads that have distinct code, execution flows, and irregular or sparse data accesses (Section 2.11) (Jia et al., 2019). Unlike GPUs and TPUs, which use High-Bandwidth Memory (HBM) to provide adequate off-chip bandwidth and memory capacity, IPUs use a large amount of on-chip static random access memory (SRAM) (Part II: Static random access memory (Hanindhito et al., 2026)) to provide ultra-high-bandwidth memory, at the cost of consuming more power: the Mk1 has 304 MB of on-chip memory, while the Mk2 has 896 MB of on-chip memory, 122 providing 62 TB/s of on-chip memory bandwidth. IPUs share DRAM with the host CPU, which can be used to provide additional memory capacity to the IPUs (Knowles, 2021).
The IPU’s MIMD architecture allows it to handle massively parallel computations with irregular memory-access patterns more efficiently than GPUs. Moreover, the IPU’s large on-chip memory allows some machine-learning models to fit entirely inside the chip. Such models can exploit the significantly higher bandwidth and lower latency of on-chip memory, compared to the off-chip HBM of GPUs. A larger model can be partitioned across IPUs, and data that is used only once can be streamed from the host CPU through the PCI Express interface.
Graphcore provides the Poplar software stack and development tools (Bohl, 2022), along with integrations with popular machine-learning frameworks, such as TensorFlow (Abadi et al., 2015) and PyTorch (Paszke et al., 2019). Applications that target Graphcore IPUs include machine learning (Bohl, 2022) (e.g., regression (Balewski et al., 2022), text detection (Sumeet et al., 2022), and neuromorphic computing (Sun et al., 2022a)), particle physics (Maddrell-Mander et al., 2021), and cosmology (Arcelin, 2021).
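As a rough sketch of how the PyTorch integration is typically used, the snippet below wraps an ordinary PyTorch model with Graphcore's PopTorch package so that it is compiled by the Poplar stack and executed on an IPU. The model is purely illustrative, and the exact API surface may vary across Poplar SDK versions.

```python
import torch
import poptorch  # Graphcore's PyTorch integration, shipped with the Poplar SDK

# An ordinary PyTorch model; nothing in its definition is IPU-specific.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)
model.eval()

# Wrapping the model compiles it for the IPU: the graph is captured and the
# Poplar stack maps it onto the IPU tiles and their local SRAM.
opts = poptorch.Options()
ipu_model = poptorch.inferenceModel(model, opts)

x = torch.randn(8, 256)
y = ipu_model(x)   # runs on the IPU; subsequent calls reuse the compiled executable
```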
The first generation IPU (Mk1), introduced in 2017 (Trader, 2017), was manufactured via 16 nm process node technology, and had 23 billion transistors. The second generation (Mk2), introduced in 2020, relied on 7 nm process node technology, and had nearly 60 billion transistors (Knowles, 2021).
Cerebras’ Wafer Scale Engine (WSE)
The WSE is a machine-learning accelerator DSA developed by Cerebras. The accelerator occupies an entire silicon wafer, 123 measuring 46,225 mm² (Lauterbach, 2021), which is 56 times larger than the die size of an NVIDIA H100 GPU (Table 7). Key reasons behind this radical design philosophy are: a) machine-learning models are becoming increasingly large, necessitating the use of large compute clusters; this requires the programmer to distribute the model by navigating through a complex system hierarchy, 124 which can be challenging; and b) the conventional centralized memory system is not optimized for many machine-learning applications, due to its high latency and limited bandwidth (Cerebras Systems, 2019). To address these two issues, the WSE provides fast on-chip interconnection between cores and distributed on-chip static random access memory (SRAM) (Part II: Static random access memory (Hanindhito et al., 2026)), at the expense of consuming more energy. It also comes with a software stack that maps machine-learning workloads across the cores of the WSE (Lie, 2022). The chip has also been used for solving partial differential equations, and has shown impressive performance as long as the problem fits into a single WSE (Groeneveld et al., 2021; Lin et al., 2022; Luo et al., 2023).
Cerebras manages the yield problem by implementing spare cores as a redundancy measure to replace defective cores (Lauterbach, 2021). Power delivery and heat removal are handled, respectively, by vertical power delivery, which supplies 20,000 amperes of current, and by large water-cooled copper heat exchangers (Lauterbach, 2021).
The first generation (WSE), introduced in 2019, relies on 16 nm process node technology. It features 400,000 cores, provides 18 GB of on-chip memory, and has 1.2 trillion transistors (Cerebras Systems, 2019). The second generation (WSE-2), introduced in 2021, uses a 7 nm process node. It features 850,000 cores, along with 40 GB of on-chip memory, and uses 2.6 trillion transistors (Lauterbach, 2021; Lie, 2022). The WSE-2 system is also paired with an external memory device named MemoryX (Lie, 2024). MemoryX is physically decoupled from the WSE-2 and is primarily used to store weights and stream them into the WSE-2 for processing. The third generation (WSE-3), released in 2023, uses a 5 nm process node and employs 900,000 cores with 44 GB of on-chip memory. In total, it uses 4 trillion transistors and delivers up to 125 PFLOP/s.
SambaNova’s reconfigurable dataflow unit (RDU)
The RDU, developed by SambaNova Systems, is a machine-learning accelerator implemented as a CGRA. It consists of a network of programmable on-chip compute engines, called Pattern Compute Units (PCUs), as well as distributed on-chip memory systems, called Pattern Memory Units (PMUs). PCUs and PMUs are connected through a programmable on-chip interconnect, called the Tile-level Switch Network (SWN) (Emani et al., 2021; Prabhakar et al., 2022). By providing flexible space- and time-scheduling, along with a flexible memory system and interconnect, this architecture allows the construction of custom dataflow pipelines based on the dataflow graphs of applications. This enables efficient execution and reduced off-chip memory access (Hosseini et al., 2023). A key characteristic of the RDU is its decentralized compute and memory units, as opposed to the von Neumann architecture used in CPUs and GPUs.
SambaNova provides a compiler, called SambaFlow, which captures the dataflow graph of an application, determines its communication patterns, and creates a spatial dataflow to exploit data locality and parallelism (Prabhakar and Jairath, 2021). The PCUs consist of SIMD ALUs that support FP32, BFloat16, 32-bit integer, 16-bit integer, and bitwise operations, whereas the PMUs contain software-managed SRAM (Part II: Static random access memory (Hanindhito et al., 2026)). The first generation of RDU, the SN10 (2019), was manufactured using 7 nm process node technology, with a total of 40 billion transistors, implementing 640 PCUs and 640 PMUs (320 MB of on-chip memory) and achieving a BFloat16 peak performance of 320 TFLOP/s (Prabhakar et al., 2022). The second generation, the SN30 (2022), is manufactured using the same process node, with a total of 86 billion transistors that double the number of PCUs and PMUs (640 MB of on-chip memory), achieving a BFloat16 peak performance of 688 TFLOP/s. The latest generation, the SN40L (2024), is built on a 5 nm process node, with 520 MB of on-chip memory and a peak BFloat16 performance of 638 TFLOP/s. It also incorporates 64 GB of HBM with a peak bandwidth of 1.6 TB/s.
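The performance argument for such spatial dataflow execution can be sketched in a few lines of plain Python/NumPy: instead of running each operator as a separate kernel and materializing every intermediate result in off-chip memory, a fused pipeline streams tiles of data through consecutive operators so that intermediates never leave on-chip buffers. The sketch below is a conceptual illustration only; on an actual RDU, the SambaFlow compiler derives the pipeline from the application's dataflow graph, and the operators, tile size, and fusion choices here are ours.

```python
import numpy as np

def op_by_op(x, w1, w2):
    """Kernel-by-kernel execution: each operator runs to completion and its
    intermediate result is materialized in (off-chip) memory."""
    h = x @ w1                     # intermediate written back to memory
    h = np.maximum(h, 0.0)         # read back, activated, written back again
    return h @ w2

def dataflow_pipeline(x, w1, w2, tile_rows=64):
    """Spatial-dataflow-style execution: a tile of rows flows through the fused
    matmul -> ReLU -> matmul pipeline, so intermediates stay in (on-chip)
    buffers and only the inputs and the final output touch off-chip memory."""
    out = np.empty((x.shape[0], w2.shape[1]), dtype=x.dtype)
    for r in range(0, x.shape[0], tile_rows):
        tile = x[r:r + tile_rows]
        h = np.maximum(tile @ w1, 0.0)       # lives only as long as this tile
        out[r:r + tile_rows] = h @ w2
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((512, 128))
w1 = rng.standard_normal((128, 256))
w2 = rng.standard_normal((256, 64))
assert np.allclose(op_by_op(x, w1, w2), dataflow_pipeline(x, w1, w2))
```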
General NPU trend of semiconductor companies
The aforementioned examples underscore a clear industry-wide trend: major technology companies are increasingly investing in custom AI hardware to improve performance and efficiency while reducing operational costs. Many other companies have developed their own large-scale ASICs to accelerate AI workloads in datacenters, offer rentable cloud services, and reduce dependence on costly third-party solutions such as NVIDIA GPUs. Notable examples include Amazon’s Inferentia2 and Trainium (Fu et al., 2024), Meta’s MTIA (Firoozshahian et al., 2023), Intel’s Gaudi (Kaplan, 2024), and Tesla’s Dojo (Talpes et al., 2022).
NPUs in consumer devices
NPUs are also increasingly integrated with CPUs in consumer devices like desktops, smartphones and tablets. They typically handle tasks such as image processing, voice recognition, and real-time translations, reducing the burden on CPUs and GPUs. Notable examples include Samsung’s Exynos and Qualcomm’s Snapdragon chips (used in laptops, tablets and smartphones), as well as Apple’s M-series (M1–M4, used in MacBooks and iPads) and A16–A18 (used in iPhones), reflecting the growing demand for AI-optimized computing in general-purpose devices.
What comes next?
Continuing the trend shown in Tables 6 and 7, future GPUs will have more transistors and, thus, more compute (e.g., CUDA) cores. Since the die size of the latest NVIDIA GPUs has already reached the reticle limit (see Transistor count and yield, and manufacturability), using chiplets in future GPUs is inevitable. Both AMD and Intel already use chiplets in their latest GPUs. Chiplets also make it easier to manufacture different GPU products; for instance, different classes of GPUs may emerge, each with features optimized for specific use cases.
Another implication of the growing number of (CUDA) cores is increased power consumption. With the power consumption of current high-end GPUs sitting at 700 W, future GPUs are expected to exceed 1000 W. This implies that liquid cooling will become common for GPU-accelerated compute nodes; accordingly, a cluster that expects to host high-end GPUs should be prepared to provide a liquid-based cooling system. Moreover, the associated datacenter needs an improved power delivery system, since the power density of each node and each rack will be significantly higher.
While the compute performance of individual GPUs is expected to grow, memory capacity and bandwidth may not keep up with that pace. Using cutting-edge DRAM technologies (e.g., HBM4 and GDDR7) can only partially fulfill the bandwidth requirements. Therefore, algorithms that reduce data movement can have a considerable impact on attainable performance. With the stagnation of individual GPUs’ memory capacity, these algorithms will become even more important, as multi-GPU, multi-node computing becomes more common to provide the larger aggregate memory capacity that is often needed to handle larger problem sizes. Accordingly, high-end clusters, such as those for machine learning or advanced scientific computing, will use the most advanced inter-node communication technologies (Part II: Inter-node communication (Hanindhito et al., 2026)) to provide inter-node bandwidth that is comparable to off-chip memory bandwidth.
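A first-order roofline estimate illustrates why reducing data movement matters: execution time is bounded below by both the compute time and the memory-transfer time, so a kernel whose arithmetic intensity (FLOPs per byte moved) is too low is limited by bandwidth rather than by peak compute. The device numbers and the two data-movement scenarios in the Python sketch below are hypothetical and chosen only to illustrate the effect.

```python
def roofline_time(flops, bytes_moved, peak_flops, peak_bw):
    """First-order roofline estimate: run time is bounded by the slower of
    the compute and memory subsystems."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

# Hypothetical GPU-like device: 1e15 FLOP/s peak compute, 4 TB/s memory bandwidth.
PEAK_FLOPS = 1.0e15
PEAK_BW = 4.0e12

# FP16 matrix multiply C = A @ B with n x n matrices.
n = 8192
flops = 2 * n**3                  # multiply-adds
ideal_bytes = 3 * n * n * 2       # each of A, B, C moved roughly once (2 bytes/element)
naive_bytes = 32 * ideal_bytes    # illustrative kernel that re-streams operands 32 times

for label, nbytes in [("re-streaming kernel", naive_bytes), ("well-blocked kernel", ideal_bytes)]:
    t = roofline_time(flops, nbytes, PEAK_FLOPS, PEAK_BW)
    print(f"{label}: {flops / nbytes:7.1f} FLOP/byte, estimated time {t * 1e3:.2f} ms")
```

Under these assumed numbers, the well-blocked variant is compute-bound, whereas the re-streaming variant is roughly three times slower despite performing exactly the same arithmetic.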
Summary and remarks
In order to deal with diverse applications, CPUs have a complex design, where the vast majority of the silicon die is tasked with “management” as opposed to performing “actual work”. Many applications may not fully leverage the level of complexity that CPUs offer. For these “simple” applications, the silicon die can be used more effectively, by doing less “management” and performing more “actual work”. Hardware accelerators, such as GPUs, follow this design philosophy.
GPUs were originally developed for processing graphics in gaming applications. Initially, they were fixed-function devices and could not be programmed. As the complexity of graphics processing increased, GPUs became programmable. Machine learning and cryptocurrency mining also created significant markets for GPUs, and GPU architectures evolved to support these markets, for instance, by including matrix-multiplication accelerators. The relative ease of use of the CUDA language for programming NVIDIA GPUs was key to their adoption.
GPUs played a fundamental role in the success of machine learning. Observing this success, and benefiting from the increased availability of GPUs in government laboratories and academia, computational scientists attempted to run some scientific computing applications on GPUs. This is still an ongoing endeavor, as many scientific computing applications in industry and academia do not have GPU-friendly algorithms. Porting a CPU code to GPUs often involves redesigning algorithms and rewriting the code, which can become a multi-year, multi-disciplinary effort. Scientific machine learning (SciML), which relies on machine-learning-based approaches to solve scientific computing problems, is becoming a popular trend in scientific computing. The popularity of SciML approaches may be attributed to: easy-to-use, high-level frameworks, such as TensorFlow and PyTorch, which enable rapid progress and provide functionality such as automatic differentiation; the ability of SciML techniques to seamlessly incorporate observations; the open-source culture fostered by the machine-learning community; and considerable funding opportunities.
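As an example of the automatic differentiation that these frameworks provide, the short PyTorch sketch below differentiates a small neural surrogate twice with respect to its input to form a PDE residual, the core ingredient of physics-informed SciML approaches. The network, equation, and sizes are purely illustrative.

```python
import torch

# A tiny neural surrogate u_theta(x); in SciML settings its derivatives enter
# the PDE residual that forms (part of) the training loss.
u = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)

x = torch.linspace(0.0, 1.0, 64).reshape(-1, 1).requires_grad_(True)
u_x = torch.autograd.grad(u(x).sum(), x, create_graph=True)[0]    # du/dx
u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]    # d2u/dx2

# Residual of the 1D Poisson problem u'' = f with f(x) = -pi^2 sin(pi x);
# minimizing residual.pow(2).mean() (plus boundary terms) trains the surrogate.
f = -torch.pi**2 * torch.sin(torch.pi * x)
loss = (u_xx - f).pow(2).mean()
loss.backward()   # gradients w.r.t. the network parameters, also via autodiff
```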
High-end GPUs typically enjoy high-bandwidth memory, which enables significant performance gains as long as the application fits into the memory, and off-chip communication is minimal. Advances in packaging technology will likely result in larger GPUs with larger on-package, high-bandwidth memory; however, off-chip communication, if considerable, continues to be a bottleneck.
With the end of Moore’s law, one of the very few ways to improve performance (e.g., faster run-time, or reduced energy consumption) is hardware customization. The cost of research and development for a custom hardware is significant. It is often a multi-year effort, which involves a large, multi-disciplinary team of hardware engineers, software engineers, and algorithm designers. Therefore, economic viability of custom hardware relies on a large-enough market for it. This cost may decrease in the future due to automation and availability of open-source tools.
FPGAs are typically used as substrates for prototyping hardware designs, and they enable fine-grained implementation and testing. Configuring them is tedious, and the hardware synthesis process is slow, although this may improve as they become more popular. CGRAs may also be used for custom hardware implementation. Compared to FPGAs, reconfiguring them is faster, since they come with specialized, hard-logic building blocks and only allow coarse-grained reconfiguration; they also enable higher efficiency and performance. ASICs offer the highest hardware performance; however, once built, their design cannot be changed, so they are typically used to implement mature designs.
Hardware-aware algorithms have a fundamental role to fully harness GPUs and custom hardware. In most cases, an algorithm-hardware co-design approach is necessary to maximize performance.
Remarks
In this paper (Part I), we covered background material, and reviewed technology trends in general-purpose processors, as well as hardware accelerators (e.g., GPUs). In Part II of this paper (Hanindhito et al., 2026), we will consider different memory technologies, inter-device communication, system integration and heterogeneous computing, and energy consumption of large computing centers and its implications. We will also offer perspectives on how these technology trends impact scientific computing.
Acknowledgements
Arash Fathi and Dimitar Trenev are grateful to ExxonMobil for supporting this work, and permitting its publication. We would like to thank Laurent White for stimulating conversations and commenting on an earlier draft of this paper, which improved its quality. We are also grateful to Amir Gholami, Arben Jusufi, Ardavan Pedram, Brent Wheelock, Chirath Neranjena, Dakshina Valiveti, Dimitri Papageorgiou, Rahul Sampath, and Wenting Xiao, for insightful conversations and feedback.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by ExxonMobil Technology and Engineering, agreement number EM10480.36, National Science Foundation (NSF) grant numbers 2326894 and 2425655 and Division of Computing and Communication Foundations (grant no. 1763848). Any opinions, findings, conclusions, or recommendations are those of the authors and not of the sponsors.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Author biographies
Bagus Hanindhito was a PhD student at the Laboratory of Computer Architecture, Department of Electrical and Computer Engineering, The University of Texas at Austin. His research interests include workload characterization, performance evaluation, and architecture-aware optimizations of HPC and AI/ML applications on GPU-accelerated computing clusters and emerging accelerators. Since receiving his PhD in 2024, he has worked as a Principal Engineer in the Chief Technology Office of the Infrastructure Solutions Group at Dell Technologies. He received his MSE in Computer Engineering from The University of Texas at Austin in 2020, and earned a BS and an MS in Electrical Engineering from Institut Teknologi Bandung, Indonesia, in 2015 and 2019, respectively.
Arash Fathi is a computational scientist at ExxonMobil with broad interests in computational engineering, mathematics, and HPC. At ExxonMobil Corporate Strategic Research, he led a hardware–algorithm co-design project, developing specialized hardware and tailored algorithms to maximize the computing efficiency of wave simulations. He has also explored the potential of quantum computing for solving PDEs prevalent in the oil and gas industry. Arash has organized several symposiums on novel computational algorithms for future computing platforms and played a key role in projects involving PDE-constrained optimization, uncertainty quantification, scientific machine learning, emission detection, and digital twins.
Dimitrios Gourounas is pursuing his PhD in the System-Level Architecture and Modeling (SLAM) group at the University of Texas at Austin. His research interests lie in reconfigurable architectures targeting high-performance computing and machine learning workloads. His work includes the design of a reconfigurable accelerator for memory-bound, discontinuous-Galerkin–based PDE solvers and automated frameworks for generating high-efficiency matrix-multiplication units on modern AI-optimized FPGA architectures. He holds a BSc in Electrical and Computer Engineering from the National Technical University of Athens.
Dimitar Trenev is a Computational Scientist at ExxonMobil Technology and Engineering Company, focusing on high-performance computing, quantum computing, and numerical analysis. His work centers on developing and applying advanced computational methods to challenging problems in the energy industry. Dimitar has played key roles in projects involving large-scale seismic inversion, reservoir simulation, uncertainty quantification and other compute-intensive workflows. He also led a collaboration between ExxonMobil and IBM exploring the use of quantum computing for energy applications.
Andreas Gerstlauer received the PhD degree in Information and Computer Science (ICS) from the University of California at Irvine (UCI), Irvine, CA, USA, in 2004. He is a Cullen Trust for Higher Education Professor in the Chandra Family Department of Electrical and Computer Engineering at The University of Texas at Austin (UT Austin), Austin, TX, USA. Prior to joining UT Austin in 2008, he was an Assistant Researcher with the Center for Embedded Computer Systems (CECS) at UCI. His research interests include systems-level design automation, system modeling, design languages and methodologies, and embedded hardware and software synthesis. Prof. Gerstlauer’s work was recognized with several best paper awards and nominations from major conferences, such as DAC, DATE, and HOST, as one of the most influential contributions in ten years at DATE in 2008, and as recipient of a 2016–2017 Humboldt Research Fellowship. He serves or has served as an Editor for ACM TECS and TODAES journals, as well as the General or Program Chair for major international conferences such as ESWEEK.
Lizy Kurian John holds the Truchard Foundation Chair in Engineering in the Department of Electrical and Computer Engineering at The University of Texas at Austin. Her research is in the areas of computer architecture, multicore processors, memory systems, performance evaluation and benchmarking, workload characterization, and reconfigurable computing. She has published four books, 300+ refereed journal and conference publications and holds 20 U. S. patents. Prof. John was the Editor-in-Chief of IEEE Micro from 2019-2023. She is an IEEE Fellow, ACM Fellow, AAAS Fellow, and Fellow of the National Academy of Inventors.
