
Checkpoint question

jason williams

Joined: Nov 17, 2004
Posts: 14
I am learning to build a system that must survive process crashes in a cluster environment. After reading and searching for papers on the internet, I vaguely understand that the program needs to take checkpoints so that its state can be saved to stable (replicated) storage and recovered from there later. I understand that achieving fault tolerance requires other components as well, e.g. a failure detector, but for the moment I want to better understand checkpointing.

However, most of the papers stay at the level of abstraction. For instance, `Design Patterns for Checkpoint-Based Rollback Recovery' explains that communication-induced checkpointing can prevent the domino effect, and it provides diagrams of the interaction between components such as the failure detector and the checkpointer. My problem is: how can I checkpoint to stable storage and recover seamlessly? For instance, if I checkpoint a running program to storage such as Hadoop HDFS, how can I ensure that, on recovery, the program resumes and continues executing as if nothing had happened?
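
To make the question concrete, here is a minimal sketch of the only kind of checkpointing I can picture so far. It rests on several assumptions: the state is an explicit Serializable object (the State class, its fields, and the method names are all made up for illustration), a local file stands in for the stable storage (this is not the HDFS API), and every step is deterministic. The checkpoint is written to a temporary file and then renamed, so a crash in the middle of a write should not leave a half-written checkpoint behind.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class CheckpointSketch {

    /** Hypothetical application state: everything needed to resume the loop. */
    static class State implements Serializable {
        private static final long serialVersionUID = 1L;
        int nextStep;     // first step that has NOT been applied yet
        long runningSum;  // result accumulated by steps 0 .. nextStep-1
    }

    /** Saves the state by writing a temp file and renaming it over the old checkpoint. */
    static void save(State s, Path file) throws IOException {
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(tmp))) {
            out.writeObject(s);
        }
        // Whether an atomic move replaces an existing target is implementation
        // specific; on POSIX file systems the rename replaces it.
        Files.move(tmp, file, StandardCopyOption.ATOMIC_MOVE);
    }

    /** Loads the last checkpoint, or returns a fresh state if none exists yet. */
    static State restore(Path file) throws IOException, ClassNotFoundException {
        if (!Files.exists(file)) {
            return new State();
        }
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (State) in.readObject();
        }
    }

    /** Sums 0 .. totalSteps-1, checkpointing after each step; crashAt simulates a crash. */
    static long run(Path file, int totalSteps, int crashAt)
            throws IOException, ClassNotFoundException {
        State s = restore(file);  // resume from the last checkpoint, if any
        while (s.nextStep < totalSteps) {
            if (s.nextStep == crashAt) {
                throw new IllegalStateException("simulated crash");
            }
            s.runningSum += s.nextStep;
            s.nextStep++;
            save(s, file);
        }
        return s.runningSum;
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("checkpoint", ".bin");
        Files.delete(file);  // start without a checkpoint
        try {
            run(file, 10, 5);  // first run dies before applying step 5
        } catch (IllegalStateException e) {
            System.out.println("crashed: " + e.getMessage());
        }
        System.out.println(run(file, 10, -1));  // second run resumes and prints 45
    }
}
```

This resumes correctly only because all of the state lives in one object that I save myself and because replaying from nextStep has no side effects. What I cannot see is how to get the same effect for a real program with threads, open connections, and messages in flight between processes.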

I would appreciate any suggestions.

Many thanks.
