Fault tolerance in scheduler-driven operations

The video server supports recording and removing movies automatically, driven by a scheduler. All scheduler information is stored in a relational database. The database (or metadata) server has a separate thread that picks up the next scheduler event from a sorted set and executes it. Scheduler events are implemented as polymorphic classes.
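The event loop described above can be sketched as follows. This is a minimal illustration, not the server's actual code: the class names (ScheduledEvent, RecordMovie, RemoveMovie, Scheduler) and the run_due method are hypothetical, and a binary heap stands in for the sorted set. In the real server a dedicated thread would call something like run_due repeatedly.

```python
import heapq
import itertools
from abc import ABC, abstractmethod


class ScheduledEvent(ABC):
    """Base class for scheduler events (hypothetical name)."""

    def __init__(self, due_time, movie):
        self.due_time = due_time
        self.movie = movie

    @abstractmethod
    def execute(self):
        """Perform the event and return a short description of it."""


class RecordMovie(ScheduledEvent):
    def execute(self):
        return f"record {self.movie}"


class RemoveMovie(ScheduledEvent):
    def execute(self):
        return f"remove {self.movie}"


class Scheduler:
    """Keeps events ordered by due time and runs those that are due."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so events with equal due times never get compared.
        self._counter = itertools.count()

    def add(self, event):
        heapq.heappush(self._heap, (event.due_time, next(self._counter), event))

    def run_due(self, now):
        """Execute every event whose due time has passed, earliest first."""
        done = []
        while self._heap and self._heap[0][0] <= now:
            _, _, event = heapq.heappop(self._heap)
            # Polymorphic dispatch: the loop does not know the event type.
            done.append(event.execute())
        return done
```

Because the loop only calls execute, new kinds of scheduled operations can be added without touching the scheduler itself.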

The fault tolerance requirements for scheduled operations are much higher than for interactive ones. In any operation that involves a client, the interactive user acts as a natural brake on repeated failure: when a problem occurs, the user stops working with the system and complains. Automated operations have no such brake, and continuous repetition of the same error may corrupt the database or act like a denial-of-service attack.

The main approach to increasing fault tolerance in scheduled operations is to preserve database consistency with every atomic transaction. The main criterion is that the shutdown of any component of the system, or the failure of any database transaction, should leave the system in a state from which recovery or rollback is easy.
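One way to picture this criterion is a transaction that either records everything about a new movie or nothing at all. The sketch below uses Python's standard sqlite3 module purely for illustration; the schema (movie and movie_file tables) and the function names are assumptions, not the server's real database layout.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE movie ("
        " id INTEGER PRIMARY KEY,"
        " title TEXT NOT NULL UNIQUE,"
        " state TEXT NOT NULL,"
        " expires TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE movie_file ("
        " movie_id INTEGER NOT NULL REFERENCES movie(id),"
        " path TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def start_recording(conn, title, expires, path):
    """Insert the movie and its file row in one atomic transaction."""
    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so a failure in the second insert also undoes the first.
    with conn:
        cur = conn.execute(
            "INSERT INTO movie (title, state, expires)"
            " VALUES (?, 'recording', ?)",
            (title, expires),
        )
        conn.execute(
            "INSERT INTO movie_file (movie_id, path) VALUES (?, ?)",
            (cur.lastrowid, path),
        )


def count_movies(conn):
    return conn.execute("SELECT COUNT(*) FROM movie").fetchone()[0]
```

If the second insert fails, the movie row is rolled back as well, so the database never holds a movie without its file entry.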

When a recording starts, the database entry for the new movie and the new movie files are created. The expiration date is attached immediately. The acquisition server then keeps its connection to the database/metadata server open and informs it of any recording failure. On failure, the database server updates the database to indicate that the recording is completed. If the database server itself crashes, which is very unlikely, database consistency is still preserved. On start-up, the database server reloads all state from the database, compares it with the schedule, and fixes any problems it finds.
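The start-up check can be sketched as a pure function that compares the reloaded state with the schedule and returns the fixes to apply. The text does not spell out the actual rules, so the two rules here are assumptions chosen for illustration: a movie past its expiration date is queued for removal, and a movie still marked as recording after its scheduled end (or with no schedule entry at all) is marked completed. The names MovieState and reconcile are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MovieState:
    title: str
    state: str    # "recording" or "completed"
    expires: int  # expiration time, same clock as "now"


def reconcile(movies, schedule, now):
    """Return a list of (action, title) fixes.

    movies:   states reloaded from the database
    schedule: maps title -> scheduled end time of the recording
    """
    fixes = []
    for movie in movies:
        if movie.expires <= now:
            # Assumed rule: expired movies should have been removed.
            fixes.append(("remove", movie.title))
        elif movie.state == "recording" and schedule.get(movie.title, now) <= now:
            # Assumed rule: the recording window is over (or unknown),
            # so the stale "recording" state is closed out.
            fixes.append(("mark_completed", movie.title))
    return fixes
```

Returning the fixes instead of applying them directly keeps the comparison easy to test; each fix can then be applied in its own atomic transaction.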

Every component of the system logs its activity. The verbosity level of the logging is configurable.
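As an illustration of per-component logging with adjustable verbosity, the sketch below uses Python's standard logging module. The function name component_logger and the mapping of verbosity numbers to levels (0 for warnings only, 1 for informational messages, 2 for debug output) are assumptions, not the system's actual settings.

```python
import logging

# Assumed mapping from a numeric verbosity setting to a logging level.
_LEVELS = [logging.WARNING, logging.INFO, logging.DEBUG]


def component_logger(name, verbosity):
    """Return a logger for one component with the requested verbosity."""
    logger = logging.getLogger(name)
    # Clamp out-of-range values instead of failing on a bad setting.
    logger.setLevel(_LEVELS[max(0, min(verbosity, len(_LEVELS) - 1))])
    if not logger.handlers:
        # Attach a handler only once, even if called repeatedly.
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

A component would call, for example, component_logger("acquisition", 1) at start-up and then use the returned logger for all of its messages.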