NECST Friday Talk
Guaranteed k-Fault Tolerance in Scalable Systems
Wenjing Rao
Associate Professor in the ECE Department, University of Illinois at Chicago
DEIB - NECST Meeting Room (Building 20, basement floor)
April 7th, 2017
12.00 pm
Contact:
Marco Santambrogio
Research Line:
System architectures
Abstract
This talk will cover some of our work on the properties of systems that can be proven to tolerate k faults. We assume a system consisting of local interconnects and of regular and spare Processing Elements (PEs), in which faulty PEs are replaced by spares. A repair is carried out through a “replacement chain”: starting from a spare, each PE in the chain takes over the task of the next one, until the chain reaches the faulty PE. Based on a Task-PE relationship model, the dynamics of replacement-chain usage can be fully analyzed and predicted. This makes it possible to calculate precisely how a repair will affect all other potential future repairs, and to determine whether the system remains repairable for subsequent faults. Overall, two equivalent conditions, each both necessary and sufficient, are proven for such a system to be guaranteed k-fault tolerant (k-FT), supporting on-the-fly repair after every fault occurrence.
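As a rough illustration of the replacement-chain idea, here is a small sketch that searches for a chain with a breadth-first search over a toy Task-PE compatibility map. It is not the speaker's method: the names `can_run`, `find_replacement_chain` and `apply_chain`, the 1-D array with a single spare, and the rule that each task may run on its own PE or its right-hand neighbor are all hypothetical. The search is a standard augmenting-path search (as used in bipartite matching) and does not implement the k-FT conditions proven in the talk.

```python
from collections import deque


def find_replacement_chain(can_run, assignment, faulty_pe, failed):
    """Search for a replacement chain after `faulty_pe` fails (illustrative sketch).

    can_run:    task -> set of PEs able to run it (hypothetical neighborhood model)
    assignment: task -> PE currently running it
    failed:     PEs already known to be faulty
    Returns a list of (task, new_pe) moves ordered from the spare back to the
    faulty PE's task, [] if the faulty PE was an idle spare, or None if no
    chain reaches a healthy spare.
    """
    dead = set(failed) | {faulty_pe}
    pe_to_task = {pe: task for task, pe in assignment.items()}
    start_task = pe_to_task.get(faulty_pe)
    if start_task is None:
        return []  # an idle spare failed: no task needs to move

    parent = {}  # pe -> task that would move onto that pe
    queue = deque([start_task])
    while queue:
        task = queue.popleft()
        for pe in sorted(can_run[task]):  # sorted only for deterministic order
            if pe in dead or pe in parent:
                continue
            parent[pe] = task
            occupant = pe_to_task.get(pe)
            if occupant is None:
                # Healthy spare found: walk back from the spare to the faulty PE.
                moves, cur = [], pe
                while True:
                    moved = parent[cur]
                    moves.append((moved, cur))
                    if moved == start_task:
                        return moves
                    cur = assignment[moved]  # PE vacated by `moved`
            queue.append(occupant)  # occupant must shift further along the chain
    return None


def apply_chain(assignment, moves):
    """Commit a chain: every task on it takes over its new PE."""
    for task, pe in moves:
        assignment[task] = pe


if __name__ == "__main__":
    # Hypothetical 1-D array: PEs 0..4, PE 4 is the only spare;
    # task Ti may run on PE i or PE i+1.
    can_run = {f"T{i}": {i, i + 1} for i in range(4)}
    assignment = {f"T{i}": i for i in range(4)}
    moves = find_replacement_chain(can_run, assignment, faulty_pe=1, failed={1})
    print(moves)  # [('T3', 4), ('T2', 3), ('T1', 2)]
    apply_chain(assignment, moves)
    print(find_replacement_chain(can_run, assignment, faulty_pe=3, failed={1, 3}))  # None
```

In this toy example the first repair shifts T1, T2 and T3 one PE to the right and consumes the only spare, so a later fault on PE 3 can no longer be repaired. This is the kind of effect of one repair on future repairs that the Task-PE model is meant to predict. BFS was chosen here simply because it returns a shortest chain.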
We further propose a physical implementation of the system in which each PE is assigned to a Router in its neighborhood. A localized Auxiliary Network provides assignment flexibility between each Router and its peripheral PEs. Faulty PEs are repaired using the spare PEs in the array, and replacement chains are implemented by shifting the assignments between Routers and PEs. Because this architecture is isomorphic to the Task-PE model, it can be designed to deliver a proven level of fault tolerance while remaining scalable in hardware and interconnect overhead.
Biography
Wenjing Rao is currently an Associate Professor in the ECE Department at the University of Illinois at Chicago. She joined UIC in 2008, after receiving her PhD from the CSE Department of the University of California, San Diego. Her research interests include testing and design-for-test of digital systems, fault tolerance for nanoelectronics-based systems, VLSI CAD, and hardware security.
Wenjing has served on the MS thesis committees of many students from PoliMi and PoliTo, and is currently visiting the NECST lab during her sabbatical year. She is enjoying Milan, PoliMi, and NECST tremendously.