Narrative of Research
My research contributions
and interests are in the areas of computer aided design of fault-tolerant VLSI systems,
application specific programmable processors, design approaches targeting
deep-submicron reliability and low power, high-speed architectures for network
protocols and encryption, and fault-tolerant Nanoscale
systems.
A. Computer aided design of
fault-tolerant VLSI Systems: As
virtually every aspect of society becomes reliant on digital systems, reliable,
fault-free operation becomes an issue of principal importance. However,
customer affordability and rapid obsolescence of consumer-oriented electronic
products necessitates low-cost implementation and rapid design and test
turnaround times. Hence, the level at which VLSI systems are being designed has
moved from circuit level through Boolean logic level to register transfer (RT)
level. RT level synthesis systems take designs described at the RT level and
translate them into logic and circuit level designs. In our research, we
developed low-cost RT level techniques for concurrent error detection,
diagnosis, recovery and reconfiguration. The resulting fault-tolerant systems
enable reliable and fault-free operation over the lifetime of a VLSI system.
1.
Concurrent
Error Detection (CED): CED is based either on information redundancy or hardware
redundancy or time redundancy. CED designs using self-checking (i.e.
information redundancy) although very involved can identify faulty modules.
Self-checking based CED is suitable for structured components such as PLAs and ROM/RAMs in control paths
[ 2 , 3 , 13 ]. In hardware redundancy based CED (that is
appropriate for data path designs), computations are duplicated on disjoint
hardware and the results compared. Duplication has extensive coverage for
faults in one of the duplicated modules. Such a straightforward strategy
although easy to design entails > 100% hardware overhead [ 3 , 13 ].
1.
We developed
a low cost CED scheme for data paths based on hardware redundancy called fault-secure
CED that requires a less than proportional increase in hardware [
1 , 5 , 6 ]. The fault-secure CED scheme selects some
intermediate computations for duplication and comparison. In a fault-secure CED
design, the >100% hardware overhead of duplication is supplanted with
improved hardware utilization afforded by comparing carefully selected
intermediate computations. Specifically, we developed an RT level scheduling
algorithm that incorporates fault-security constraints [ 1 , 6 ] and RT level binding and allocation algorithms
that incorporate fault-security constraints [ 5 ].
2.
Introspection: We reduced the area overhead of CED even further
based on the observation that although a large number of hardware units are
employed in high-performance VLSI systems, not all of these hardware resources
are used in all clock cycles. We developed Introspection an RT level
technique that uses idle computation cycles in the data path and idle data
transfer cycles in the interconnection network in a synergistic fashion for CED
and diagnosis [ 9 , 10 ]. The idea of utilizing spare capacity to execute
redundant tasks was suggested in the context of multiprocessor systems [
7 , 8 ]. In all these approaches CED was done in software
at the system level. CED latency of introspection is
3.
RT level
time redundancy based CED:
We also developed a low cost time redundancy based RT level approach to CED of
transient and permanent faults. Previously, logic level time redundancy based
CED techniques such as Re-computing with shifted operands (RESO) [
11 ], Re-computing with swapped operands (RESWO) [
13 ], Re-computing with rotated operands (RERO) [
13 ] and Re-computing with duplication with comparison
(REDWC) [ 12 ] have been developed. In these time redundancy
techniques, logic level operations (add, subtract, multiply, and, nand, etc) are carried out twice - once on the basic input
and once on the shifted, swapped or rotated inputs. Results of these two
operations are then compared to detect an error. Operators that implement logic
level RESO/RESWO/RERO/REDWC techniques can be used to implement RT level time
redundancy based CED. However, this entails significant time and area overhead.
We developed a RT level time redundancy based CED technique called Algorithmic
Re-computing [ 14 , 15 ]. Algorithmic Re-computing does not use logic
level RESO/RESWO/RERO operators. Rather, it exploits RT level operation
scheduling, functional pipelining, operator chaining and multi-cycling to
incorporate user error detection latencies with low area overhead and
performance penalty. Further, Algorithmic Re-computing supports hardware
overhead vs. performance penalty vs. error detection latency
trade-offs.
2.
RT Level
reconfiguration: A straightforward
approach to dynamic and static reconfiguration is by providing a set of spare
modules in addition to the core operational modules. Spares-based
reconfiguration has been previously used in regular two-dimensional array
structures [ 3 , 13 , 17 ] such as RAMs, ROMs and
systolic arrays [ 16 ]. Spares-based reconfiguration in random data path
logic has been implemented by using a backup data path component for each
primary data path component. However, this is expensive. An RT level
reconfiguration technique called BISR (built in self repair) was developed that
does not require a dedicated spare for each primary component. It uses one
spare for each data path component type [ 18 ]. We developed a low cost RT level reconfiguration
technique called phantom redundancy [ 18 , 20 ]. Unlike the spares-based approaches, phantom
redundancy uses extra interconnect to render the data path reconfigurable in
the presence of single/multiple functional unit failure(s). We developed an RT
level synthesis approach to incorporate phantom redundancy constraints by
investigating the tight interdependence between reconfigurability
of a (faulty) data path, and RT level synthesis tasks of scheduling and
allocation. The phantom redundancy approach minimizes the performance
degradation of the reconfigurable data path in the presence of any
single/multiple faulty functional unit(s).
3.
RT Level
Off-line and On-Line Built-In Self-Test (BIST): To meet the stringent quality and reliability requirements of
today's complex communication networks, efficient test methodologies are
necessary at all levels (system, board, circuit, etc). Conventional test
methodologies are being constantly challenged by ever increasing speed and
circuit size, which result sin high costs associated with test hardware, test
generation, and test application time. Built-in self-test (BIST) offers a test
methodology where the test functions are embedded into the circuit itself [
1 ]. BIST is increasingly being used in production
testing of VLSICs. The advantages of BIST for complex
telecommunication systems are numerous. Reduced test development time, low test
application time, eliminating the need for high speed hardware testers,
provision for at-speed tests, in-field test capability and high fault coverage
are some of them. We developed the first comprehensive tutorial on BIST
methodology for telecommunication systems [ 24 ].
In off-line BIST, extra
logic is implemented to generate test patterns and compact test responses on
chip. However, this extra logic is
used only during the test mode. Typical off-line RT level BIST structures
include linear feedback shift register (LFSR) based pattern generators and
multiple input signature register (MISR) based test response compactors [
1 , 3 ]. We developed a RT level Versatile BIST
(VBIST) approach that uses off-line BIST circuitry for on-line testing as
well [ 21 , 22 , 23 ]. Unlike traditional on-line approaches to
self-test, VBIST does not use functional data as test inputs. Rather, VBIST
generates test patterns and compacts test responses during the normal mode of
operation. Furthermore, VBIST coordinates this generation and application of
test patterns and compaction of test responses with the usage profile of the RT
level modules in the design. VBIST entails little additional impact on
performance and area of the design (vis-a-vis the
performance and area of a design with off-line BIST). We validated the proposed
approach using the Synopsys Behavioral Compiler as
the synthesis framework and by writing synthesis scripts for incorporating
VBIST constraints.
4.
RT level
recovery from transient faults: The well-known approach of rollback and recovery for
tolerating transient faults was first proposed in the context of software [26 ]. Since then several extensions have been made but
all of them in the context of software [ 3 , 13 ]. While early techniques for checkpointing
application software [ 27 , 28 ] required a large effort on the part of the
programmer in terms of program partitioning, determining the execution times of
the partitions, and inserting optimal checkpoints, automatic checkpointing of software has received a great deal of
attention recently [ 29 , 30 ]. In contrast to all these software-based recovery
techniques, we developed one of the first RT level techniques to implement
transient fault detection and recovery in VLSI. In a self-recovering VLSI
system, intermediate results are compared at regular intervals, and if correct
saved in registers (checkpointing). On the other
hand, on detecting a fault, the self-recovering datapath
rolls back to a previous checkpoint and retries. Whereas the checkpoint
insertion algorithm identifies good checkpoints by successively eliminating
clock cycle boundaries that either have a high checkpoint overhead or violate
the retry period constraint, the novel edge-based schedule, assigns edges to
clock cycle boundaries, in addition to scheduling nodes to clock cycles [31 ,33 ,33
]. Also, checkpoint insertion and edge-based
scheduling are intertwined using a flexible RT level synthesis methodology. We
additionally show an Integer Linear Programming model for the self-recovering
RT level synthesis problem. The resulting ILP formulation can minimize either
the number of voters or the overall hardware, subject to constraints on the
number of clock cycles the retry period, and the number of checkpoints [32 ,35 ].
B. Design Approaches to
Application Specific Programmable Processors (ASPP): Modern wireless and multimedia applications
require high performance (hundreds of millions of operations per second), low
power, and inexpensive multiple functionalities. Conventional architectures
fail to meet these requirements of algorithm flexibility and low complexity.
While dedicated ASICs can be optimized to execute a
given algorithm in real-time with very low complexity, they lack the
flexibility to perform a range of algorithms. Although highly flexible,
software programmable processors, on the other hand
are complex and have low throughput. Consequently, rather than optimizing the
architecture for a single algorithm or for all algorithms, we are investigating
application specific adaptive architectures targeting classes of applications
(for example, a configurable architecture implementing image transform algorithms).
Such architectures yield a better tradeoff between low complexity and high
algorithm flexibility than either software-programmable processors or dedicated
ASICs. In fact, almost all major processor
manufacturers including Fujitsu (with their 86k line of programmable
processors) [ 40 ], Motorola (with their application specific
programmable DSP processors) [ 41 ] and LSI Logic are offering a comprehensive line
of application specific programmable processors (ASPPs)
that preserve all of the advantages of the special purpose processors while
retaining the cost and flexibility afforded by the general purpose processors.
Owing to their significance, we are also developing design approaches for such ASPPs.
1.
Implementing
a configurable processor for a class of applications requires consideration of
issues such as compatibility of the application topologies,
resources required and word length characteristics. As a vehicle, digital
signal processing algorithms. Digital and video signal processing (DSP) are
data path intensive due to their significant digital computation demands.
Further, wireless communication (TDMA, CDMA), television (NTSC, PAL, SECAM, HDTV), video compression (JPEG, MPEG), audio compression
(AC-3, AC-97), etc. are standardized. Therefore, the functionality of
commercial electronic products is limited to a minor variation of these
standards. Coarse grain architectures comprising of a static data path with
configurable control are an excellent implementation medium for such
applications. We are developing ASPP architectures and synthesis techniques
that identify classes of applications with similar topologies and resource
requirements and compatible precision and word length requirements [35 ,36 ,45
].
2.
In
embedded core-plus-co-processor based applications, generic tasks are carried
out on a central processor core, while the specialized tasks are delegated to
and carried out by task-specific co-processors. For example, in automotive
electronics a central processor communicates with the antilock brake control
co-processor, throttle control co-processor, airbag control co-processor and
diagnostics co-processor. Implementing dedicated co-processors for each of the
applications entails significant area overhead. This is especially wasteful if
these applications are invoked infrequently and in a non-overlapping fashion. We
developed techniques to bundle multiple non-overlapping and specialized
applications on a single dynamically configurable ASPP co-processor [36 ]. These configurable ASPP co-processors are
beneficial in cost-conscious automotive and consumer electronics.
3.
An ASPP
can implement a primary application and a related debug application that
facilitates debugging of the processor. Whereas the primary application is used
in normal operational mode, the debug application is used in the system debug
mode. The implemented debug application is usually dependent on the primary
application and may operate at a somewhat slower speed albeit with a very low
area overhead. Based on these observations we are developing methodologies for
synthesizing ASPPs that support system maintenance
and debugging.
4.
Using
the architectural flexibility provided by behavioral synthesis scheduling and
resource allocation, we have developed novel approaches for permanent
fault-tolerance of ASPPs. In the first technique the
fault-tolerance of an ASPP is maximized subject to constraints on the hardware
resources [35 ,48 ]. On the other hand, in the second technique the
hardware resources are minimized while guaranteeing that the ASPP remains
operational in the presence of all possible k-unit faults [35 ,47 ]. The proposed fault-tolerance techniques combine the
behavioral synthesis-based flexibility provided by each of multiple
functionalities with judicious operator-to-application assignment. The
fault-tolerant ASPP synthesis imposes several unique tasks on the synthesis
process. We address them and demonstrate the effectiveness of the overall
approach, the synthesis algorithms, and software implementations on a number of
industrial-strength designs.
5.
Task
preemption is a critical enabling mechanism in multi-task VLSI systems such as ASPPs. On preemption, data in the register files must be
preserved in order for the task to be resumed. This entails extra memory to
save the context and additional clock cycles to restore the context. The IBM
360/91 was one of the earliest systems to support precise and imprecise
interrupt handling [ 37 ]. More recently, a checkpointing
approach to handling interrupts was proposed in [ 38 ]. The checkpoints (which incur some penalty in
processor performance) are used to divide the sequential instruction stream
into smaller units to reduce the cost of resumption. While [ 43 ] integrated the functions of reservation stations
and reorder buffers into the register update unit to realize precise
interrupts, [ 42 ] presented architectural solutions such as saving
the intermediate state of vector instructions and saving a sequence of
instructions that must be executed before saving the program counter. Finally, [
44 ] presented a software-only solution to the
synchronization problem in uniprocessors. Their idea
was to execute atomic sequences without any hardware protection, and to roll
the sequence forward to the end, thereby preserving atomicity. In contrast to
these software-based approaches to implementing interrupts, we developed
techniques and algorithms to incorporate interrupt constraints in hardware
during multi-task ASPP synthesis [ 45 ]. We developed algorithms to insert and refine
preemption points in scheduled task graphs subject to preemption latency
constraints; techniques to minimize the context switch overhead by considering
the dedicated registers required to save the state of a task on preemption and
the shared registers required to save the remaining values in the tasks; and a
controller based scheme to preclude preemption related performance degradation.
C. Design approaches targeting deep-submicron reliability and low power: Although increasing device densities and decreasing power consumption have made it possible to implement complex VLSI systems, they have also accelerated aging related processes such as electromigration and hot electron injection. This increasing unreliability of deep-submicron VLSI technologies has elevated the import