posted 14 years ago
We faced a similar problem. The first thing we did was hide all file I/O behind an interface with two methods: save and load. It's important that none of your code accesses the filesystem directly, because you want this kind of file replication to be handled consistently in one place.
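A minimal sketch of what that interface could look like (the name FileStore and the byte[] signatures are my assumptions, not from the original):

```java
import java.io.IOException;

// Hypothetical facade for all file I/O: nothing else in the
// codebase should touch the filesystem directly.
public interface FileStore {
    void save(String name, byte[] data) throws IOException;
    byte[] load(String name) throws IOException;
}
```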
The first implementation of this service will probably just do whatever you are doing now, namely saving to and reading from local files. This step is basically just a refactoring of your existing code.
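That first local-only implementation could be as simple as this (class name and base-directory constructor are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the "step one" implementation: plain local-file save/load,
// equivalent to what the code was doing before the refactoring.
public class LocalFileStore {
    private final Path baseDir;

    public LocalFileStore(Path baseDir) {
        this.baseDir = baseDir;
    }

    public void save(String name, byte[] data) throws IOException {
        Files.createDirectories(baseDir);          // make sure the directory exists
        Files.write(baseDir.resolve(name), data);  // plain local write
    }

    public byte[] load(String name) throws IOException {
        return Files.readAllBytes(baseDir.resolve(name));
    }
}
```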
The next step is to write another implementation that will handle the file replication for you. In our code, the save method did the following:
1) Save the data to the local file system.
2) Write the file's checksum into a database somewhere (I'll explain later).
3) Attempt to copy (in parallel) the file to other servers in the cluster. We defined which servers belonged in each cluster through a property file.
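The three steps above could be sketched roughly like this. Everything here is an assumption for illustration: the replica directories stand in for network shares read from the property file, and the checksum "database" is just an in-memory map:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the replicating save: write locally, record a checksum,
// then push copies to the other servers in parallel.
public class ReplicatingSave {
    static String checksum(byte[] data) throws Exception {
        byte[] d = MessageDigest.getInstance("SHA-256").digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : d) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    static void save(Path localDir, List<Path> replicaDirs,
                     Map<String, String> checksumDb,
                     String name, byte[] data) throws Exception {
        // 1) Save the data to the local file system.
        Files.createDirectories(localDir);
        Files.write(localDir.resolve(name), data);
        // 2) Record the file's checksum (a real system would use a database).
        checksumDb.put(name, checksum(data));
        // 3) Copy the file to each replica in parallel.
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, replicaDirs.size()));
        for (Path r : replicaDirs) {
            pool.submit(() -> {
                try {
                    Files.createDirectories(r);
                    Files.write(r.resolve(name), data);
                } catch (IOException e) {
                    // A failed replica copy is logged, not fatal;
                    // load() can still fall back to other servers.
                    System.err.println("copy to " + r + " failed: " + e);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```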
On Windows, you can access another computer's files through a UNC path, something like: \\remote_machine\folder1\file.txt. On Linux, the usual equivalent is to mount the remote share (NFS or Samba) so it shows up under an ordinary local path.
Finally, whenever a piece of code needs to read a file, it calls the load method. This method first checks the local file system. If the file is there, great; if not, it attempts to find it on one of the other servers in the cluster. We used the checksum here to make sure the file is valid, since it's possible a copy is corrupted. If a copy is corrupted, we move on to the next server. If no valid copy of the file exists anywhere, you can throw a FileNotFoundException or something similar.
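The lookup-with-fallback logic could be sketched like this (again, the directory list standing in for cluster servers and the in-memory checksum map are my assumptions):

```java
import java.io.FileNotFoundException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of the load path: try the local copy first, then each server
// in the cluster, skipping any copy whose checksum doesn't match.
public class ChecksumLoad {
    static String checksum(byte[] data) throws Exception {
        byte[] d = MessageDigest.getInstance("SHA-256").digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : d) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    static byte[] load(Path localDir, List<Path> replicaDirs,
                       Map<String, String> checksumDb, String name) throws Exception {
        String expected = checksumDb.get(name);
        List<Path> dirs = new ArrayList<>();
        dirs.add(localDir);          // local file system first
        dirs.addAll(replicaDirs);    // then the other servers in the cluster
        for (Path dir : dirs) {
            Path p = dir.resolve(name);
            if (!Files.exists(p)) continue;    // not here, try the next server
            byte[] data = Files.readAllBytes(p);
            if (expected != null && !expected.equals(checksum(data)))
                continue;                      // corrupted copy, move on
            return data;
        }
        throw new FileNotFoundException("no valid copy of " + name);
    }
}
```

Note how a corrupted local copy is silently skipped: the checksum check is what lets the fallback chain keep going instead of returning bad data.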