FabScalar

Overview

A heterogeneous multi-core processor (or asymmetric multi-core processor) provides multiple, differently-designed superscalar core types that can streamline the execution of diverse programs and program phases. Providing multiple superscalar core types on a chip is an exciting new direction for increasing performance and reducing power, as conventional technology and microarchitecture scaling slows. This paradigm has an “Achilles’ Heel”, however: design and verification effort is multiplied by the number of different core types. The FabScalar project is the first to bring solutions to bear on this problem, and explores a number of other intriguing questions and challenges with respect to architecting heterogeneous multi-core processors:

Automatic generation of diverse superscalar cores [C6, J1, W3, C9]

The FabScalar toolset boosts designer productivity through the automatic generation of synthesizable register-transfer-level (RTL) designs of arbitrary superscalar cores. Our vision is to enable widespread proliferation of microarchitecturally-diverse cores.

Automatic, fast and cycle-accurate superscalar processor simulation on dense FPGAs [C7, BHD’11]

We have developed a tool, FPGA-Sim, for mapping FabScalar-generated superscalar cores to dense FPGAs. FPGA-Sim is a configurable, automatically FPGA-synthesizable, register-transfer-level (RTL) model of an out-of-order superscalar processor. Consequently, FPGA-Sim provides the convenience of a software model, the speed of an FPGA model, and the experience of a prototype.

Fast design space exploration of superscalar processors [C5]

We are developing intelligent search techniques that very quickly converge on the best core design for a program phase. Longer term, we are interested in developing a closed-form analytical model that outputs the best core design.
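As a rough illustration of why guided search can beat exhaustive enumeration, the sketch below runs a plain greedy hill climb over a tiny design space. It is not the criticality-driven technique of [C5]. The parameter names, their value ranges, and the `toy_score` function (a stand-in for a simulator) are all invented for this example.

```python
# Hypothetical design parameters; names and ranges are illustrative only.
SPACE = {
    "fetch_width": [1, 2, 4, 8],
    "issue_queue": [16, 32, 64],
    "rob_size": [64, 128, 256],
}

def toy_score(cfg):
    """Made-up stand-in for a simulator run; higher is better."""
    return -(abs(cfg["fetch_width"] - 4)
             + abs(cfg["issue_queue"] - 32) / 16
             + abs(cfg["rob_size"] - 128) / 64)

def neighbors(cfg):
    """Yield designs that differ from cfg by one step in one parameter."""
    for name, values in SPACE.items():
        i = values.index(cfg[name])
        for j in (i - 1, i + 1):
            if 0 <= j < len(values):
                yield {**cfg, name: values[j]}

def hill_climb(start, score):
    """Greedy local search; returns (best design found, evaluations used)."""
    current, best = start, score(start)
    evaluations = 1
    improved = True
    while improved:
        improved = False
        for candidate in neighbors(current):
            value = score(candidate)
            evaluations += 1
            if value > best:
                current, best, improved = candidate, value, True
                break  # restart the scan from the new design point
    return current, evaluations

start = {"fetch_width": 1, "issue_queue": 16, "rob_size": 64}
design, used = hill_climb(start, toy_score)
print(design, "found with", used, "evaluations out of 36 possible designs")
```

A greedy climb like this only finds a local optimum in general. It reaches the global optimum here because the toy score has a single peak, which real design spaces do not guarantee.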

Architecting heterogeneous multi-core processors [C11, SSN’12, SVW’12, C6, C1, C2, C3, C4, C10]

[C6] includes a preliminary study of a “generic” heterogeneous multi-core processor (G21) for maximizing single-thread performance. The problem of choosing a set of diverse cores to maximize single-thread performance on arbitrary applications deserves a more comprehensive treatment and should also consider power. This research is underway [C11, SSN’12, SVW’12].

Multi-threaded applications favor homogeneous cores, whereas multi-programmed workloads favor heterogeneous cores. Moreover, different multi-threaded applications favor different homogeneous multi-core processors, and different multi-programmed workloads favor different heterogeneous multi-core processors. [C3] describes a processor that can be “configured” into different homogeneous and heterogeneous multi-core processors. [C4] explains how task arrival characteristics influence the best configuration.

[C2] explores the problem of providing the best instantaneous architectural configuration for a running program. We find that the best configuration varies at a fine granularity, too fine to be exploited by conventional adaptive pipelines or heterogeneous multi-core processors. Architectural contesting is a novel approach that automatically and rapidly switches effective execution to the best core for the instantaneous workload behavior. This study is a specific instance of a broader research topic: how to schedule tasks on a heterogeneous multi-core processor.

[C1] explores the question of how to divide up the workload space for customizing a limited number of core types to subsets of the space. A case is made for characterizing and subsetting workloads based on their performance on each other’s customized cores rather than raw workload characteristics.
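The idea of comparing workloads by how they perform on each other's customized cores can be sketched as follows. This is a minimal illustration, not the method of [C1]: the performance table, the `cross_slowdown` metric, the greedy grouping, and the 10% threshold are all assumptions made for the example.

```python
# Hypothetical performance of each workload (outer key) on the core
# customized for each workload (inner key). Numbers are invented.
perf = {
    "A": {"A": 2.0, "B": 1.9, "C": 1.0},
    "B": {"A": 1.7, "B": 1.8, "C": 0.9},
    "C": {"A": 0.6, "B": 0.7, "C": 1.2},
}

def cross_slowdown(perf, x, y):
    """Worst relative loss when x and y run on each other's customized core."""
    loss_x = 1 - perf[x][y] / perf[x][x]
    loss_y = 1 - perf[y][x] / perf[y][y]
    return max(loss_x, loss_y)

def group(perf, threshold):
    """Greedily group workloads whose mutual slowdown stays within threshold."""
    groups = []
    for w in perf:
        for g in groups:
            if all(cross_slowdown(perf, w, m) <= threshold for m in g):
                g.append(w)
                break
        else:
            groups.append([w])
    return groups

print(group(perf, threshold=0.10))
```

In this toy table, A and B lose little on each other's cores and could share one core type, while C loses about half its performance on either and would need its own.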

Mapping program phases to best cores at run-time [C11, SSN’12]

We are researching off-line, on-line [C11, SSN’12], and combined techniques for mapping program phases to best cores (best for performance, power, energy-delay, etc.) in a heterogeneous multi-core processor.
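To make the notion of a "best core" concrete, the sketch below picks a core type for each phase from an off-line profile under different objectives. It illustrates only the selection step under assumed data and is not the technique of [C11]. The phase names, core names, measurements, and the `map_phases` helper are hypothetical.

```python
# Hypothetical off-line profile:
# phase -> core type -> (execution time in seconds, energy in joules).
profile = {
    "phase0": {"wide": (1.0, 2.5), "narrow": (1.5, 2.0)},
    "phase1": {"wide": (2.0, 6.0), "narrow": (2.1, 3.0)},
}

# Each objective maps (time, energy) to a cost; lower is better.
OBJECTIVES = {
    "performance": lambda t, e: t,
    "energy": lambda t, e: e,
    "energy_delay": lambda t, e: t * e,
}

def map_phases(profile, objective):
    """Return the lowest-cost core type for every phase."""
    cost = OBJECTIVES[objective]
    return {
        phase: min(cores, key=lambda core: cost(*cores[core]))
        for phase, cores in profile.items()
    }

for name in OBJECTIVES:
    print(name, map_phases(profile, name))
```

With these made-up numbers the mapping depends on the objective: both phases go to the wide core for performance, but phase1 moves to the narrow core under energy-delay.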

Reconsidering microarchitecture techniques in heterogeneous multi-core processors

The 1990s were a golden age of microarchitecture research: many microarchitecture optimizations were proposed during that time, and many of them have not been put into practice. One plausible reason is that many techniques are not universally beneficial, i.e., they provide significant benefit only in limited circumstances (and may even degrade performance in other circumstances). There is good reason to reconsider previously proposed microarchitecture optimizations in the context of heterogeneous multi-core processors, because the requirement of universal applicability is relaxed in these processors by design. Techniques such as trace caches, value prediction, instruction/trace/computation reuse, speculative multithreading, control independence, etc., may be highly beneficial under certain workload behaviors and very cost-effective (area and power efficient) in the context of whole-pipeline customization.

Diverse superscalar cores versus adaptive superscalar cores

Can a fully reconfigurable core approximate the performance and power of arbitrary customized cores? We are designing a fully reconfigurable core (a valuable contribution in its own right) and will compare its configurations with corresponding customized cores.

Better baselines in microarchitecture research

Choosing a baseline microarchitecture for evaluating a proposed microarchitecture enhancement is tricky. Intrinsic imbalances in the baseline configuration can influence the perceived speedup of the enhancement, either exaggerating or obscuring it. Customization is a solution to the baseline problem. In one approach, the baseline microarchitecture is the customized core for the benchmark (which is unusual in that different benchmarks use different baselines), and the enhancement is added to that core. In a second approach, the baseline microarchitecture is again the customized core for the benchmark, but it is compared to a core that has been recustomized with the microarchitecture enhancement. Recustomizing the core with the enhancement accounts for the fact that the enhancement introduces a new dynamic and therefore a new global optimum.
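The difference between these baseline choices can be shown with a small calculation. The sketch below is illustrative only: the three configurations, the performance numbers, and the helper names `customize` and `speedups` are invented. "Customizing" here simply means picking the best entry from a made-up table.

```python
# Hypothetical performance of one benchmark, keyed by
# (core configuration, enhancement present?). Numbers are invented.
perf = {
    ("small", False): 1.0,  ("small", True): 1.8,
    ("medium", False): 1.6, ("medium", True): 1.75,
    ("large", False): 1.5,  ("large", True): 1.55,
}

def customize(perf, enhanced):
    """Return the best configuration with or without the enhancement."""
    configs = {cfg for cfg, enh in perf if enh == enhanced}
    return max(configs, key=lambda cfg: perf[(cfg, enhanced)])

def speedups(perf, arbitrary_cfg):
    """Perceived speedup of the enhancement under three baseline choices."""
    base_cfg = customize(perf, enhanced=False)
    enh_cfg = customize(perf, enhanced=True)
    base = perf[(base_cfg, False)]
    return {
        # Arbitrary baseline: enhancement added to a fixed, unbalanced core.
        "arbitrary_baseline": perf[(arbitrary_cfg, True)] / perf[(arbitrary_cfg, False)],
        # First approach: enhancement added to the customized core.
        "customized_baseline": perf[(base_cfg, True)] / base,
        # Second approach: customized core versus recustomized enhanced core.
        "recustomized": perf[(enh_cfg, True)] / base,
    }

for name, value in speedups(perf, arbitrary_cfg="small").items():
    print(f"{name}: {value:.3f}")
```

In this toy table the unbalanced "small" baseline makes the enhancement look far better than it does against the customized core, and recustomizing shifts the best enhanced design to a different configuration.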

Tools

FabScalar [C6, J1, W3]

Publications

Conference and Journal Papers

[C11] S. Navada, N. K. Choudhary, S. V. Wadhavkar, and E. Rotenberg. A Unified View of Non-monotonic Core Selection and Application Steering in Heterogeneous Chip Multiprocessors. Proceedings of the 22nd IEEE/ACM International Conference on Parallel Architectures and Compilation Techniques (PACT-22), pp. 133-144, September 2013. [pdf]

[C10] S. Priyadarshi, N. K. Choudhary, B. Dwiel, A. Upreti, E. Rotenberg, R. Davis, and P. Franzon. Hetero² 3D Integration: A Scheme for Optimizing Efficiency/Cost of Chip Multiprocessors. Proceedings of the 14th IEEE International Symposium on Quality Electronic Design (ISQED), pp. 1-7, March 2013. [pdf]

[C9] N. K. Choudhary, B. H. Dwiel, and E. Rotenberg. A Physical Design Study of FabScalar-generated Superscalar Cores. Proceedings of the 2012 IEEE/IFIP 20th International Conference on VLSI and System-on-Chip (VLSI-SoC), pp. 165-170, October 2012. [pdf]

[J1] N. K. Choudhary, S. V. Wadhavkar, T. A. Shah, H. Mayukh, J. Gandhi, B. H. Dwiel, S. Navada, H. H. Najaf-abadi, and E. Rotenberg. FabScalar: Automating Superscalar Core Design. IEEE Micro, Special Issue: Micro’s Top Picks from the Computer Architecture Conferences, 32(3):48-59, May-June 2012. [pdf]

[C8] T. Nakabayashi, T. Sasaki, E. Rotenberg, K. Ohno, and T. Kondo. Research for Transporting Alpha ISA and Adopting Multi-processor to FabScalar. Proceedings of the Symposium on Advanced Computing Systems and Infrastructures 2012 (SACSIS2012), pp. 374-381, May 2012. (in Japanese)

[C7] B. H. Dwiel, N. K. Choudhary, and E. Rotenberg. FPGA Modeling of Diverse Superscalar Processors. Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’12), pp. 188-199, April 2012. [pdf]

[C6] N. K. Choudhary, S. V. Wadhavkar, T. A. Shah, H. Mayukh, J. Gandhi, B. H. Dwiel, S. Navada, H. H. Najaf-abadi, and E. Rotenberg. FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores within a Canonical Superscalar Template. Proceedings of the 38th IEEE/ACM International Symposium on Computer Architecture (ISCA-38), pp. 11-22, June 2011. [pdf]

[C5] S. Navada, N. K. Choudhary, and E. Rotenberg. Criticality-driven Superscalar Design Space Exploration. Proceedings of the 19th IEEE/ACM International Conference on Parallel Architectures and Compilation Techniques (PACT-19), pp. 261-272, September 2010. [pdf]

[C4] H. Hashemi Najaf-abadi and E. Rotenberg. The Importance of Accurate Task Arrival Characterization in the Design of Processing Cores. Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC’09), pp. 75-85, October 2009. [pdf]

[C3] H. Hashemi Najaf-abadi, N. K. Choudhary, and E. Rotenberg. Core-Selectability in Chip Multiprocessors. Proceedings of the 18th IEEE/ACM International Conference on Parallel Architectures and Compilation Techniques (PACT-18), pp. 113-122, September 2009. [pdf]

[C2] H. Hashemi Najaf-abadi and E. Rotenberg. Architectural Contesting. Proceedings of the 15th IEEE International Symposium on High-Performance Computer Architecture (HPCA-15), pp. 189-200, February 2009. [pdf]

[C1] H. Hashemi Najaf-abadi and E. Rotenberg. Configurational Workload Characterization. Proceedings of the 2008 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’08), pp. 147-156, April 2008. [pdf]

Workshop Papers

[W3] N. K. Choudhary, S. V. Wadhavkar, T. A. Shah, S. S. Navada, H. Hashemi Najaf-abadi, and E. Rotenberg. FabScalar. 4th Workshop on Architectural Research Prototyping (WARP’09), in conjunction with ISCA-36, June 2009. [pdf]

[W2] H. Hashemi Najaf-abadi and E. Rotenberg. Exploiting Detachability: A Non-Silicon Approach to Polymorphism. 4th Workshop on Non-Silicon Computing (NSC-4), in conjunction with ISCA-34, June 2007. [pdf]

[W1] H. Hashemi Najaf-abadi and E. Rotenberg. Architectural Contesting: Exposing and Exploiting Temperamental Behavior. Reconfigurable and Adaptive Architecture Workshop (RAAW), in conjunction with MICRO-39, December 2006. Also appears in ACM SIGARCH Computer Architecture News (CAN), 35(3):28-35, June 2007. [pdf]

Student Theses

S. V. Wadhavkar. Architecting a Workload-agnostic Heterogeneous Multi-core Processor. Ph.D. Thesis, Department of Electrical and Computer Engineering, North Carolina State University, September 2012. [NCSU library: on-line thesis]

A. V. Shastri. Microarchitectural Implementation of the MIPS Floating-point ISA in FabScalar-generated Superscalar Cores. M.S. Thesis, Department of Electrical and Computer Engineering, North Carolina State University, August 2012. [NCSU library: on-line thesis]

S. S. Navada. A Unified View of Core Selection and Application Steering in Heterogeneous Chip Multiprocessors. Ph.D. Thesis, Department of Electrical and Computer Engineering, North Carolina State University, June 2012. [NCSU library: on-line thesis]

N. K. Choudhary. FabScalar: Automating the Design of Superscalar Processors. Ph.D. Thesis, Department of Electrical and Computer Engineering, North Carolina State University, May 2012. [NCSU library: on-line thesis]

B. H. Dwiel. FPGA Modeling of Diverse Superscalar Processors. M.S. Thesis, Department of Electrical and Computer Engineering, North Carolina State University, November 2011. [NCSU library: on-line thesis]

H. Hashemi Najaf-abadi. Core-Selectable Chip Multiprocessor Design. Ph.D. Thesis, Department of Electrical and Computer Engineering, North Carolina State University, December 2010. [NCSU library: on-line thesis]

J. Gandhi. FabFetch: A Synthesizable RTL Model of a Pipelined Instruction Fetch Unit for Superscalar Processors. M.S. Thesis, Department of Electrical and Computer Engineering, North Carolina State University, June 2010. [NCSU library: on-line thesis]

H. Mayukh. FabIssue: Automatic RTL Generation of Issue Logic in Superscalar Processors for Core Customization. M.S. Thesis, Department of Electrical and Computer Engineering, North Carolina State University, June 2010. [NCSU library: on-line thesis]

T. A. Shah. FabMem: A Multiported RAM and CAM Compiler for Superscalar Design Space Exploration. M.S. Thesis, Department of Electrical and Computer Engineering, North Carolina State University, May 2010. [NCSU library: on-line thesis]

N. K. Choudhary. A Synthesizable HDL Model for Out-of-Order Superscalar Processors. M.S. Thesis, Department of Electrical and Computer Engineering, North Carolina State University, August 2009. [NCSU library: on-line thesis]

Talks

FabScalar. Presented at WARP-2009 (held in conjunction with ISCA-36) by E. Rotenberg. [pps]

Funding

This project is supported by NSF grant No. CCF-0811707 (CPA-CSA: FabScalar: A Standard Superscalar Library for Fabricating Heterogeneous Chip Multiprocessors), Intel, and IBM.

Any opinions, findings, and conclusions or recommendations expressed in this website and publications herein are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.