“The greatest ability is dependability.” (Bob Jones, Sr.)

CSE 291 – D00. Dependable Systems [WI21]

Dependability is critically important for today's systems and applications. Computer hardware, software, disks, networks, and configurations are all subject to faults, which can eventually manifest as visible failures and cause damage. This CSE 291 course covers topics ranging from classic fault-tolerant computing and error detection techniques for hardware and software faults to failure diagnosis and recovery in today's data centers. The format includes lectures by me, student presentations and discussions, and a couple of invited speakers from industry (such as Splunk and Google) on relevant topics. In the class project, you will inject various faults, such as network failures and node crashes, into popular open-source software of your choice (e.g., ZooKeeper, HDFS, MySQL) to see how many issues you can expose.

Class hours:   Tue/Thu 2-3:20pm,  Lectures:  ZOOM

Instructor: Prof. Yuanyuan Zhou
Office hours: Tue/Thu 3:30-4:30,  ZOOM      

Graduate Course Assistants: Yudong Wu 

Course Project:  Canvas

Class Schedule and Videos: Canvas

Textbook: None.

The course will use technical conference and journal papers. You are expected to get the papers from IEEE Xplore or the ACM Digital Library.

Reference Books: 

1.    I. Koren and C. Mani Krishna, Fault-tolerant Systems, 1st edition, 2007, Morgan Kaufmann.

2.    D. P. Siewiorek and R. S. Swarz, Reliable Computer Systems: Design and Evaluation, 3rd edition, 1998, A.K. Peters, Limited.

3.    D. K. Pradhan, ed., Fault Tolerant Computer System Design, 1st edition, 1996, Prentice-Hall.

4.    K. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications, 2nd edition, 2001, John Wiley & Sons.

Grade Allocation:

·        Course project: 50%

o   Milestone 1 --- 10%

o   Milestone 2 (In-class proposal presentation) --- 5%

o   Milestone 3 --- 10%

o   Milestone 4 (Final presentation) --- 5%

o   Final report --- 15%

o   Submission of list of issues --- 5%

·        Class paper presentation (30%)

·        Quizzes (20%, 3 quizzes, pick top 2 scores)

Schedule:

Date        Format                        Topics
Jan 5th     Lecture                       Class overview, intro; Motivation: why dependable computing
Jan 7th     Lecture                       Basic Concepts, Taxonomy
Jan 12th    Lecture                       Failure Characteristic Studies
Jan 14th    Lecture                       Error coding and detection
Jan 19th    Lecture                       Human Errors
Jan 21st    Student Presentation          Fault injection techniques and tools
            Student Presentation          Why Do Computers Stop And What Can Be Done About It?
Jan 28th    Student Presentation          SIFT: Design and analysis of a fault-tolerant computer for aircraft control
Feb 2nd     Student Presentation          A Survey of Rollback-Recovery Protocols in Message-Passing Systems
Feb 4th     Student Presentation          Implementing fault-tolerant services using the state machine approach: a tutorial
Feb 9th     Student Presentation          Practical byzantine fault tolerance
Feb 11th    Project Proposal              Project proposals (10min each project)
Feb 16th    Student Presentation          High-availability computer systems
Feb 18th    Student Presentation          Recovery-oriented computing (ROC): Motivation, definition, techniques, and case studies
Feb 23rd    Student Presentation          Enhancing Server Availability and Security Through Failure-Oblivious Computing
Feb 25th    Student Presentation          Paxos Made Moderately Complex
March 2nd   Student Presentation          Systems approaches to tackling configuration errors: A survey
March 4th   TBD                           TBD
March 9th   Project Final presentations
March 11th  Project Final presentations