I need to scp multiple files to an SSH server concurrently. My question is: how many files can I transfer concurrently in an efficient way? (I know it also depends on the network, but let's ignore that factor for now, because we have a gigabit network.)
For example, suppose I need to transfer 10,000 (plus) files. Obviously scp'ing them sequentially is unacceptable (that means waiting for one transmission to finish before starting the next). So I decided to use multiple processes to transfer files concurrently. Can anybody tell me the right approach to deal with such a case?
Originally posted by Jiafan Zhou: Using SSH server is a requirement, so bad luck, rsync secure is out of consideration.
From the man pages for rsync:
For remote transfers, a modern rsync uses ssh for its communications, but it may have been configured to use a different remote shell by default, such as rsh or remsh.
My thoughts are that trying to transfer them in parallel will result in the bandwidth being the bottleneck. Assuming the sum total of all the files is larger than a gigabit, it doesn't matter how many concurrent transfers you attempt; you can never get more than a gigabit of data running through your tubes at any given time.
That being the case, I would probably just set up an rsync with wildcards, and come back when it is done.
An alternative for that case: if the data is large enough (TB?), it might be worth considering sneaker-net. Have someone with physical access to the computers copy the files to a large portable hard drive and transfer them that way.
Actually, Your Mileage May Vary, but I learned to respect SANs when someone pointed out that the disk-to-RAM data transfer rate on a local hard drive is typically MUCH slower than Gigabit Ethernet. So the theoretical penalties are not as bad as they seem, assuming low levels of contention from other parties sharing the network between source and destination.
In the case of running many scp's in parallel, a bigger issue would be, as noted, the latency of the source disk, especially if it's a single disk and not a tuned array.
Of course, as is typical in today's complicated world, simply "knowing" isn't enough - too many variables apply. The only real way to tell is to benchmark and tune.
Thank you all for sharing this invaluable knowledge with me. A couple of good ideas are outlined below:
1. I agree that the "rsync" command (or pssh) is a better replacement for scp; however, I cannot use them, for a couple of reasons (believe it or not). Plus, "rsync" still transmits one file at a time, so it does not use the full bandwidth of our network.
2. Again, I agree that reading from the hard disk to RAM is definitely a bottleneck, without doubt. But I am not too concerned about this right now, because the biggest issue at the moment is how to use the maximum bandwidth of the network. i.e. How do I transmit multiple files using scp in the minimum time? (I might come back to solve the slow disk-to-RAM reading, but not at the moment.)
3. I don't agree with using a single thread/process (rsync or scp) to transmit the files. (I am not yet convinced.) I forgot to mention that the files being transferred to the server are relatively small (probably less than 1 MB each). That motivates the whole idea of performing parallel transfers (i.e. multiple scp processes).
4. Physically accessing the machines and copying the files is a great idea, but technically impossible in my case.
5. I will definitely run some benchmark tests and tune.
This concurrent scp'ing is handled by a Java program, which creates a separate Process for every file it is going to transfer. The original problem I posted is described at the following link (along with my initial code proposal):
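For what it's worth, a bounded thread pool is usually a better fit than one Process per file: you cap how many transfers run at once and let the pool drain the rest of the queue. Here's a minimal sketch of that pattern; the `transfer()` stub and the pool size of 8 are assumptions for illustration (a real version would invoke scp via ProcessBuilder or call an SSH library, and the pool size should come from benchmarking):

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelScp {
    // Cap on concurrent transfers; a handful of workers usually
    // saturates a gigabit link. Tune this via benchmarking.
    private static final int MAX_CONCURRENT = 8;

    // Hypothetical stub standing in for the real transfer, e.g.
    // new ProcessBuilder("scp", file, "user@host:/dest/").start() ...
    static void transfer(String file) {
        // real scp / SSH-library call would go here
    }

    // Submit every file to a fixed-size pool and wait until all are done.
    public static int transferAll(List<String> files) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(MAX_CONCURRENT);
        AtomicInteger done = new AtomicInteger();
        for (String f : files) {
            pool.submit(() -> {
                transfer(f);
                done.incrementAndGet();
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        return done.get();
    }

    public static void main(String[] args) throws InterruptedException {
        int n = transferAll(Arrays.asList("a.txt", "b.txt", "c.txt"));
        System.out.println("transferred " + n + " files");
    }
}
```

The point is that no matter whether you have 10,000 or 100,000 files, only `MAX_CONCURRENT` of them are in flight at any moment, so neither machine gets flooded.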
Hmmm. I hadn't realized that rsync could only parallelize in certain specific cases. Oh well.
The simplest solution would be the pscp utility, which is the parallel scp part of the suite that includes pssh. Taking the Not Invented Here approach and writing your own solution in the Unix world costs you geek points. Adding to the inventory of source code that has to be maintained and kept up to date, when you could take advantage of someone else's work, costs you business points. And using Runtime.exec() just so you can spawn multiple scp commands costs you double geek points, since you'd have less overhead doing this as a shell script and keeping Java for something that needs it. Just saying.
100,000 threads is not realistic. While there is a certain economy to be gained by using sharable code, the core variables for each thread are unique, so you're talking a lot of memory. More importantly, you're going to put a real strain on the thread dispatcher. But the real killer is on the receiving end.
If you did the brute-force approach and did a Runtime.exec() on the scp command, you'd be requiring the receiving machine to process 100,000+ login requests in a very short period of time (in addition to almost the same amount of overhead on the sending machine as it created new shell environments). It would almost certainly buckle under the strain. If it did not, you'd still have the issue that you'd not only need to bump the thread limit on the sending machine, you'd have to do the same on the receiving machine. At best, excess requests would bounce. At worst, you could interfere with other processes and risk crashing the whole system.
There's a certain overhead to setting up and tearing down a file transfer context even without the overhead of setting up a new user environment (login). The most efficient approach is a batched one where multiple files (especially small ones) can be sent within a single transfer request.
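To make that concrete: rather than one login per file, you can partition the file list into batches and move each batch in a single session (on the command line, the classic form of this idea is the `tar cf - files | ssh host 'tar xf -'` pipeline, which moves everything through one SSH connection). A small sketch of the batching helper, with hypothetical names and a batch size chosen purely for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class Batcher {
    // Split the file list into fixed-size batches so that each
    // transfer session (and hence each login) moves many small
    // files at once instead of paying the setup cost per file.
    public static List<List<String>> batches(List<String> files, int batchSize) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < files.size(); i += batchSize) {
            int end = Math.min(i + batchSize, files.size());
            out.add(new ArrayList<>(files.subList(i, end)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> files = new ArrayList<>();
        for (int i = 0; i < 10; i++) files.add("file" + i);
        // With a batch size of 4: [0..3], [4..7], [8..9] -> 3 batches.
        System.out.println(batches(files, 4).size());
    }
}
```

With 10,000 files and batches of, say, 100, you pay for 100 logins instead of 10,000, and you can still run a handful of batch transfers in parallel.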
In some ways, your problem resembles what BitTorrent was designed to handle, although torrents distribute the process among multiple hosts.
You should look at the pure Java SSH clients, instead of trying to run SCP in separate processes. With Trilead SSH you can do multiple SCP sessions over a single SSH connection, to make the most of a single login and TCP connection. Their SCPClient class is thread-safe, and is really easy to use.
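A rough skeleton of that single-connection, multiple-session pattern, using only the standard library; the actual Trilead calls are left as a comment because they require the library jar, and the worker count of 3 in the usage below is just an example:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedConnectionCopy {
    // Several workers drain one shared queue of file names. With a
    // pure Java SSH client you would authenticate one connection up
    // front and let every worker reuse it, so you pay for a single
    // login and TCP connection no matter how many files you send.
    public static int copyAll(Iterable<String> files, int sessions)
            throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        for (String f : files) queue.add(f);
        AtomicInteger sent = new AtomicInteger();

        Thread[] workers = new Thread[sessions];
        for (int i = 0; i < sessions; i++) {
            workers[i] = new Thread(() -> {
                String file;
                while ((file = queue.poll()) != null) {
                    // scpClient.put(file, "/remote/dir");
                    // (thread-safe SCP client call goes here)
                    sent.incrementAndGet();
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) w.join();
        return sent.get();
    }
}
```

This avoids both failure modes discussed above: no login storm on the receiving machine, and no thousands of threads on the sending one.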