Source: School of Computer Science and Technology

2014.4.23 Prof. Yves Robert: Fault-tolerance techniques for high-performance computing. Resilience is a critical issue for large-scale platforms

Source: Department Lecture Highlights. Published: 2014-04-21

Title: Fault-tolerance techniques for high-performance computing. Resilience is a critical issue for large-scale platforms

Time: 1:30 PM, April 23, 2014

Location: Room 133, Information Building

Speaker: Prof. Yves Robert

Host: Associate Prof. Qian Ying (钱莹)

 

Abstract

 This lecture will survey fault-tolerance techniques for high-performance computing. It is organized along the following topics:
(i) A brief overview of failure types (software/hardware, transient/fail-stop) and typical probability distributions (Exponential, Weibull, Log-Normal);
(ii) Checkpoint and rollback recovery protocols: description and analysis (from Young's approximation to Daly's formulas and recent work);
(iii) Extensions: replication, fault prediction, silent errors.
The talk will contain many examples and targets an audience of non-specialists in the area.

Biography

 Yves Robert received his PhD degree from Institut National Polytechnique de Grenoble. He is currently a full professor in the Computer Science Laboratory LIP at ENS Lyon. He is the author of 7 books, 130 papers published in international journals, and 195 papers published in international conferences. He is the editor of 11 book proceedings and 13 journal special issues. He has advised 26 PhD theses.
    His main research interests are scheduling techniques and resilient algorithms for large-scale platforms.
    Yves Robert has served on many editorial boards, including IEEE TPDS. He was the program chair of HiPC'2006 in Bangalore, IPDPS'2008 in Miami, ISPDC'2009 in Lisbon, ICPP'2013 in Lyon and HiPC'2013 in Bangalore. He is a Fellow of the IEEE. He was elected a Senior Member of the Institut Universitaire de France in 2007 and renewed in 2012. He received the 2014 IEEE TCSC Award for Excellence in Scalable Computing. He has held a Visiting Scientist position at the University of Tennessee Knoxville since 2011.