Narrative of Research

 

My research contributions and interests are in the areas of computer aided design of fault-tolerant VLSI systems, application specific programmable processors, design approaches targeting deep-submicron reliability and low power, high-speed architectures for network protocols and encryption, and fault-tolerant Nanoscale systems.

 

A. Computer aided design of fault-tolerant VLSI Systems: As virtually every aspect of society becomes reliant on digital systems, reliable, fault-free operation becomes an issue of principal importance. However, customer affordability and rapid obsolescence of consumer-oriented electronic products necessitates low-cost implementation and rapid design and test turnaround times. Hence, the level at which VLSI systems are being designed has moved from circuit level through Boolean logic level to register transfer (RT) level. RT level synthesis systems take designs described at the RT level and translate them into logic and circuit level designs. In our research, we developed low-cost RT level techniques for concurrent error detection, diagnosis, recovery and reconfiguration. The resulting fault-tolerant systems enable reliable and fault-free operation over the lifetime of a VLSI system.

1.               Concurrent Error Detection (CED): CED is based either on information redundancy or hardware redundancy or time redundancy. CED designs using self-checking (i.e. information redundancy) although very involved can identify faulty modules. Self-checking based CED is suitable for structured components such as PLAs and ROM/RAMs in control paths [ 2 , 3 , 13 ]. In hardware redundancy based CED (that is appropriate for data path designs), computations are duplicated on disjoint hardware and the results compared. Duplication has extensive coverage for faults in one of the duplicated modules. Such a straightforward strategy although easy to design entails > 100% hardware overhead [ 3 , 13 ].

1.     We developed a low cost CED scheme for data paths based on hardware redundancy called fault-secure CED that requires a less than proportional increase in hardware [ 1 , 5 , 6 ]. The fault-secure CED scheme selects some intermediate computations for duplication and comparison. In a fault-secure CED design, the >100% hardware overhead of duplication is supplanted with improved hardware utilization afforded by comparing carefully selected intermediate computations. Specifically, we developed an RT level scheduling algorithm that incorporates fault-security constraints [ 1 , 6 ] and RT level binding and allocation algorithms that incorporate fault-security constraints [ 5 ].

2.     Introspection: We reduced the area overhead of CED even further based on the observation that although a large number of hardware units are employed in high-performance VLSI systems, not all of these hardware resources are used in all clock cycles. We developed Introspection an RT level technique that uses idle computation cycles in the data path and idle data transfer cycles in the interconnection network in a synergistic fashion for CED and diagnosis [ 9 , 10 ]. The idea of utilizing spare capacity to execute redundant tasks was suggested in the context of multiprocessor systems [ 7 , 8 ]. In all these approaches CED was done in software at the system level. CED latency of introspection is one ten-thousandth of these system level CED latencies. The area overhead and performance penalty associated with introspection are < 10%.

3.     RT level time redundancy based CED: We also developed a low cost time redundancy based RT level approach to CED of transient and permanent faults. Previously, logic level time redundancy based CED techniques such as Re-computing with shifted operands (RESO) [ 11 ], Re-computing with swapped operands (RESWO) [ 13 ], Re-computing with rotated operands (RERO) [ 13 ] and Re-computing with duplication with comparison (REDWC) [ 12 ] have been developed. In these time redundancy techniques, logic level operations (add, subtract, multiply, and, nand, etc) are carried out twice - once on the basic input and once on the shifted, swapped or rotated inputs. Results of these two operations are then compared to detect an error. Operators that implement logic level RESO/RESWO/RERO/REDWC techniques can be used to implement RT level time redundancy based CED. However, this entails significant time and area overhead. We developed a RT level time redundancy based CED technique called Algorithmic Re-computing [ 14 , 15 ]. Algorithmic Re-computing does not use logic level RESO/RESWO/RERO operators. Rather, it exploits RT level operation scheduling, functional pipelining, operator chaining and multi-cycling to incorporate user error detection latencies with low area overhead and performance penalty. Further, Algorithmic Re-computing supports hardware overhead vs. performance penalty vs. error detection latency trade-offs.

2.               RT Level reconfiguration: A straightforward approach to dynamic and static reconfiguration is by providing a set of spare modules in addition to the core operational modules. Spares-based reconfiguration has been previously used in regular two-dimensional array structures [ 3 , 13 , 17 ] such as RAMs, ROMs and systolic arrays [ 16 ]. Spares-based reconfiguration in random data path logic has been implemented by using a backup data path component for each primary data path component. However, this is expensive. An RT level reconfiguration technique called BISR (built in self repair) was developed that does not require a dedicated spare for each primary component. It uses one spare for each data path component type [ 18 ]. We developed a low cost RT level reconfiguration technique called phantom redundancy [ 18 , 20 ]. Unlike the spares-based approaches, phantom redundancy uses extra interconnect to render the data path reconfigurable in the presence of single/multiple functional unit failure(s). We developed an RT level synthesis approach to incorporate phantom redundancy constraints by investigating the tight interdependence between reconfigurability of a (faulty) data path, and RT level synthesis tasks of scheduling and allocation. The phantom redundancy approach minimizes the performance degradation of the reconfigurable data path in the presence of any single/multiple faulty functional unit(s).

3.               RT Level Off-line and On-Line Built-In Self-Test (BIST): To meet the stringent quality and reliability requirements of today's complex communication networks, efficient test methodologies are necessary at all levels (system, board, circuit, etc). Conventional test methodologies are being constantly challenged by ever increasing speed and circuit size, which result sin high costs associated with test hardware, test generation, and test application time. Built-in self-test (BIST) offers a test methodology where the test functions are embedded into the circuit itself [ 1 ]. BIST is increasingly being used in production testing of VLSICs.  The advantages of BIST for complex telecommunication systems are numerous. Reduced test development time, low test application time, eliminating the need for high speed hardware testers, provision for at-speed tests, in-field test capability and high fault coverage are some of them. We developed the first comprehensive tutorial on BIST methodology for telecommunication systems [ 24 ].

    In off-line BIST, extra logic is implemented to generate test patterns and compact test responses on chip.  However, this extra logic is used only during the test mode. Typical off-line RT level BIST structures include linear feedback shift register (LFSR) based pattern generators and multiple input signature register (MISR) based test response compactors [ 1 , 3 ]. We developed a RT level Versatile BIST (VBIST) approach that uses off-line BIST circuitry for on-line testing as well [ 21 , 22 , 23 ]. Unlike traditional on-line approaches to self-test, VBIST does not use functional data as test inputs. Rather, VBIST generates test patterns and compacts test responses during the normal mode of operation. Furthermore, VBIST coordinates this generation and application of test patterns and compaction of test responses with the usage profile of the RT level modules in the design. VBIST entails little additional impact on performance and area of the design (vis-a-vis the performance and area of a design with off-line BIST). We validated the proposed approach using the Synopsys Behavioral Compiler as the synthesis framework and by writing synthesis scripts for incorporating VBIST constraints.

4.               RT level recovery from transient faults: The well-known approach of rollback and recovery for tolerating transient faults was first proposed in the context of software [26 ]. Since then several extensions have been made but all of them in the context of software [ 3 , 13 ]. While early techniques for checkpointing application software [ 27 , 28 ] required a large effort on the part of the programmer in terms of program partitioning, determining the execution times of the partitions, and inserting optimal checkpoints, automatic checkpointing of software has received a great deal of attention recently [ 29 , 30 ]. In contrast to all these software-based recovery techniques, we developed one of the first RT level techniques to implement transient fault detection and recovery in VLSI. In a self-recovering VLSI system, intermediate results are compared at regular intervals, and if correct saved in registers (checkpointing). On the other hand, on detecting a fault, the self-recovering datapath rolls back to a previous checkpoint and retries. Whereas the checkpoint insertion algorithm identifies good checkpoints by successively eliminating clock cycle boundaries that either have a high checkpoint overhead or violate the retry period constraint, the novel edge-based schedule, assigns edges to clock cycle boundaries, in addition to scheduling nodes to clock cycles [31 ,33 ,33 ]. Also, checkpoint insertion and edge-based scheduling are intertwined using a flexible RT level synthesis methodology. We additionally show an Integer Linear Programming model for the self-recovering RT level synthesis problem. The resulting ILP formulation can minimize either the number of voters or the overall hardware, subject to constraints on the number of clock cycles the retry period, and the number of checkpoints [32 ,35 ].

 

B. Design Approaches to Application Specific Programmable Processors (ASPP): Modern wireless and multimedia applications require high performance (hundreds of millions of operations per second), low power, and inexpensive multiple functionalities. Conventional architectures fail to meet these requirements of algorithm flexibility and low complexity. While dedicated ASICs can be optimized to execute a given algorithm in real-time with very low complexity, they lack the flexibility to perform a range of algorithms. Although highly flexible, software programmable processors, on the other hand are complex and have low throughput. Consequently, rather than optimizing the architecture for a single algorithm or for all algorithms, we are investigating application specific adaptive architectures targeting classes of applications (for example, a configurable architecture implementing image transform algorithms). Such architectures yield a better tradeoff between low complexity and high algorithm flexibility than either software-programmable processors or dedicated ASICs. In fact, almost all major processor manufacturers including Fujitsu (with their 86k line of programmable processors) [ 40 ], Motorola (with their application specific programmable DSP processors) [ 41 ] and LSI Logic are offering a comprehensive line of application specific programmable processors (ASPPs) that preserve all of the advantages of the special purpose processors while retaining the cost and flexibility afforded by the general purpose processors. Owing to their significance, we are also developing design approaches for such ASPPs.

1.          Implementing a configurable processor for a class of applications requires consideration of issues such as compatibility of the application topologies, resources required and word length characteristics. As a vehicle, digital signal processing algorithms. Digital and video signal processing (DSP) are data path intensive due to their significant digital computation demands. Further, wireless communication (TDMA, CDMA), television (NTSC, PAL, SECAM, HDTV), video compression (JPEG, MPEG), audio compression (AC-3, AC-97), etc. are standardized. Therefore, the functionality of commercial electronic products is limited to a minor variation of these standards. Coarse grain architectures comprising of a static data path with configurable control are an excellent implementation medium for such applications. We are developing ASPP architectures and synthesis techniques that identify classes of applications with similar topologies and resource requirements and compatible precision and word length requirements [35 ,36 ,45 ].

2.          In embedded core-plus-co-processor based applications, generic tasks are carried out on a central processor core, while the specialized tasks are delegated to and carried out by task-specific co-processors. For example, in automotive electronics a central processor communicates with the antilock brake control co-processor, throttle control co-processor, airbag control co-processor and diagnostics co-processor. Implementing dedicated co-processors for each of the applications entails significant area overhead. This is especially wasteful if these applications are invoked infrequently and in a non-overlapping fashion. We developed techniques to bundle multiple non-overlapping and specialized applications on a single dynamically configurable ASPP co-processor [36 ]. These configurable ASPP co-processors are beneficial in cost-conscious automotive and consumer electronics.

3.          An ASPP can implement a primary application and a related debug application that facilitates debugging of the processor. Whereas the primary application is used in normal operational mode, the debug application is used in the system debug mode. The implemented debug application is usually dependent on the primary application and may operate at a somewhat slower speed albeit with a very low area overhead. Based on these observations we are developing methodologies for synthesizing ASPPs that support system maintenance and debugging.

4.          Using the architectural flexibility provided by behavioral synthesis scheduling and resource allocation, we have developed novel approaches for permanent fault-tolerance of ASPPs. In the first technique the fault-tolerance of an ASPP is maximized subject to constraints on the hardware resources [35 ,48 ]. On the other hand, in the second technique the hardware resources are minimized while guaranteeing that the ASPP remains operational in the presence of all possible k-unit faults [35 ,47 ]. The proposed fault-tolerance techniques combine the behavioral synthesis-based flexibility provided by each of multiple functionalities with judicious operator-to-application assignment. The fault-tolerant ASPP synthesis imposes several unique tasks on the synthesis process. We address them and demonstrate the effectiveness of the overall approach, the synthesis algorithms, and software implementations on a number of industrial-strength designs.

5.          Task preemption is a critical enabling mechanism in multi-task VLSI systems such as ASPPs. On preemption, data in the register files must be preserved in order for the task to be resumed. This entails extra memory to save the context and additional clock cycles to restore the context. The IBM 360/91 was one of the earliest systems to support precise and imprecise interrupt handling [ 37 ]. More recently, a checkpointing approach to handling interrupts was proposed in [ 38 ]. The checkpoints (which incur some penalty in processor performance) are used to divide the sequential instruction stream into smaller units to reduce the cost of resumption. While [ 43 ] integrated the functions of reservation stations and reorder buffers into the register update unit to realize precise interrupts, [ 42 ] presented architectural solutions such as saving the intermediate state of vector instructions and saving a sequence of instructions that must be executed before saving the program counter. Finally, [ 44 ] presented a software-only solution to the synchronization problem in uniprocessors. Their idea was to execute atomic sequences without any hardware protection, and to roll the sequence forward to the end, thereby preserving atomicity. In contrast to these software-based approaches to implementing interrupts, we developed techniques and algorithms to incorporate interrupt constraints in hardware during multi-task ASPP synthesis [ 45 ]. We developed algorithms to insert and refine preemption points in scheduled task graphs subject to preemption latency constraints; techniques to minimize the context switch overhead by considering the dedicated registers required to save the state of a task on preemption and the shared registers required to save the remaining values in the tasks; and a controller based scheme to preclude preemption related performance degradation.

 

C. Design approaches targeting deep-submicron reliability and low power: Although increasing device densities and decreasing power consumption have made it possible to implement complex VLSI systems, they have also accelerated aging related processes such as electromigration and hot electron injection. This increasing unreliability of deep-submicron VLSI technologies has elevated the import