JavaRanch » Java Forums » Engineering » Linux / UNIX

scp multiple files concurrently

Jiafan Zhou
Ranch Hand

Joined: Sep 28, 2005
Posts: 192

I need to scp multiple files to an SSH server concurrently.
My question is: how many files can I transfer concurrently in an efficient way? (I know this also depends on the network, but let's ignore that factor for now, since we have a gigabit network.)

For example, suppose I need to transfer 10,000+ files. Obviously, scp-ing them sequentially is unacceptable, since each transfer would have to finish before the next one starts.
So I decided to use multiple processes to transfer the files concurrently. Can anybody tell me the right approach for such a case?
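One common shell-level pattern for bounding concurrency is `xargs -P`. This is a sketch, not from the thread: the host and destination are placeholders, and `echo` stands in for the real `scp` call so the snippet can run anywhere.

```shell
# Create a few demo files to stand in for the 10,000 real ones.
mkdir -p demo && touch demo/a.txt demo/b.txt demo/c.txt

# Run up to 4 transfers at a time. The real invocation would be:
#   scp "demo/{}" user@host:/dest/
# but echo is substituted here so the sketch is runnable offline.
ls demo | xargs -P 4 -n 1 -I {} echo "transfer demo/{} -> user@host:/dest/{}"
```

Raising `-P` past the point where the disk or the link saturates buys nothing, which is why benchmarking (as suggested later in the thread) matters.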

Thanks


SCJP, SCJD, SCWCD, SCBCD, SCEA
Jiafan Zhou
Ranch Hand

Joined: Sep 28, 2005
Posts: 192

To be more specific: what is the maximum number of scp processes it makes sense to run concurrently?
Tim Holloway
Saloon Keeper

Joined: Jun 25, 2001
Posts: 16019
    

Actually, I think I'd look at secure rsync for something like that.


Customer surveys are for companies who didn't pay proper attention to begin with.
Jiafan Zhou
Ranch Hand

Joined: Sep 28, 2005
Posts: 192

Using an SSH server is a requirement, so bad luck: secure rsync is out of consideration.
Stefan Wagner
Ranch Hand

Joined: Jun 02, 2003
Posts: 1923

Did you find out what your bottleneck is?

If your source files are coming from the same hard drive, I wouldn't expect much from threads.
I would ask the same hard-drive question about the second host, too.

The drive's head will jump back and forth between the different threads' files, which may well be slower than a single thread.


http://home.arcor.de/hirnstrom/bewerbung
Jiafan Zhou
Ranch Hand

Joined: Sep 28, 2005
Posts: 192

Originally posted by Stefan Wagner:
Did you find out what your bottleneck is?

If your source files are coming from the same hard drive, I wouldn't expect much from threads.
I would ask the same hard-drive question about the second host, too.

The drive's head will jump back and forth between the different threads' files, which may well be slower than a single thread.

I am not sure I totally understand this. Yes, I would say they come from the same hard drive. (Actually, these files come from the same directory.)

Also, on Linux each scp runs as a separate process, although I have thought about using threads.
Andrew Monkhouse
author and jackaroo
Marshal Commander

Joined: Mar 28, 2003
Posts: 11432
    

Originally posted by Jiafan Zhou:
Using SSH server is a requirement, so bad luck, rsync secure is out of consideration.


From the man pages for rsync:

For remote transfers, a modern rsync uses ssh for its communications, but it may have been configured to use a different remote shell by default, such as rsh or remsh.


My thoughts are that trying to transfer them in parallel will simply make the bandwidth the bottleneck. If the sum total of all the files is large, it doesn't matter how many concurrent transfers you attempt: you can never push more than a gigabit per second through your tubes at any given time.

That being the case, I would probably just set up an rsync with wildcards, and come back when it is done.

As an alternative: if the data is large enough (terabytes?), it might be worth considering sneaker-net. Have someone with physical access to the computers copy the files to a large portable hard drive and transfer them that way.

Andrew


The Sun Certified Java Developer Exam with J2SE 5: paper version from Amazon, PDF from Apress, Online reference: Books 24x7 Personal blog
Tim Holloway
Saloon Keeper

Joined: Jun 25, 2001
Posts: 16019
    

Actually, Your Mileage May Vary, but I learned to respect SANs when someone pointed out that the disk-to-RAM data transfer rate on a local hard drive is typically MUCH slower than Gigabit Ethernet. So the theoretical penalties are not as bad as they seem, assuming low levels of contention from other parties sharing the network between source and destination.

In the case of running many scp's in parallel, a bigger issue would be, as noted, the latency of the source disk, especially if it's a single disk and not a tuned array.

Of course, as is typical in today's complicated world, simply "knowing" isn't enough - too many variables apply. The only real way to tell is to benchmark and tune.
Ernest Friedman-Hill
author and iconoclast
Marshal

Joined: Jul 08, 2003
Posts: 24183
    

I have no experience using these pssh tools, but I've heard of them. I know that pscp is designed to send one file to many machines; maybe it can also send many files to one machine?


[Jess in Action][AskingGoodQuestions]
Jiafan Zhou
Ranch Hand

Joined: Sep 28, 2005
Posts: 192

Thank you all for sharing your invaluable knowledge with me. A couple of good ideas are outlined below:

1. I agree that the "rsync" command (or pssh) is a better replacement for scp; however, I cannot use them, for a couple of reasons (believe it or not). Besides, "rsync" still transmits one file at a time, so it does not use the full bandwidth of our network.

2. I also agree that reading from the hard disk into RAM is definitely a bottleneck. But I'm not too concerned about that right now, because the biggest issue at the moment is how to use the maximum bandwidth of the network, i.e. how to transmit multiple files with scp in the minimum time. (I might come back to the slow disk-to-RAM reading later, but not at the moment.)

3. I don't agree with using a single thread/process (rsync or scp) to transmit the files. (I am not yet convinced.) I forgot to mention that the files being transferred to the server are relatively small (probably less than 1 MB each). That motivates the whole idea of performing parallel transfers, i.e. multiple scp processes.

4. Physically accessing the machines and copying the files is a great idea, but technically impossible in my case.

5. I will definitely run some benchmarks and tune.

This concurrent scp-ing is handled by a Java program, which creates a separate Process for every file it transfers. The original problem I posted is described at the following link (along with my initial code proposal):

http://www.coderanch.com/t/234297/threads/java/Create-new-thread-each-entity

Any suggestions are welcome.

Again, thanks a lot.
Tim Holloway
Saloon Keeper

Joined: Jun 25, 2001
Posts: 16019
    

Hmmm. I hadn't realized that rsync could only parallelize in certain specific cases. Oh well.

The simplest solution would be the pscp utility, which is the parallel-scp part of the suite that includes pssh. Taking the Not Invented Here approach and writing your own solution in the Unix world costs you geek points. Adding to the inventory of source code that has to be maintained and kept up to date, when you could take advantage of someone else's work, costs you business points. And using Runtime.exec() just so you can spawn multiple scp commands costs you double geek points, since you'd have less overhead doing this as a shell script and keeping Java for something that actually needs it. Just saying.

100,000 threads is not realistic. While there is a certain economy to be gained by using sharable code, the core variables for each thread are unique, so you're talking a lot of memory. More importantly, you're going to put a real strain on the thread dispatcher. But the real killer is on the receiving end.

If you did the brute-force approach and did a Runtime.exec() on the scp command, you'd be requiring the receiving machine to process 100,000+ login requests in a very short period of time (in addition to almost the same amount of overhead on the sending machine as it created new shell environments). It would almost certainly buckle under the strain. If it did not, you'd still have the issue that you'd not only need to bump the thread limit on the sending machine, you'd have to do the same on the receiving machine. At best, excess requests would bounce. At worst, you could interfere with other processes and risk crashing the whole system.

There's a certain overhead to setting up and tearing down a file transfer context even without the overhead of setting up a new user environment (login). The most efficient approach is a batched one where multiple files (especially small ones) can be sent within a single transfer request.
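One way to amortize that per-transfer login overhead, assuming a reasonably modern OpenSSH, is connection sharing via ControlMaster in `~/.ssh/config`. With something like the fragment below (host name is a placeholder), repeated scp invocations to the same host reuse one authenticated TCP connection instead of performing a fresh login each time:

```
Host myserver
    HostName server.example.com
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 10m
```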

In some ways, your problem resembles what BitTorrent was designed to handle, although torrents distribute the process among multiple hosts.
Stefan Wagner
Ranch Hand

Joined: Jun 02, 2003
Posts: 1923

Originally posted by Jiafan Zhou:
... since we have a gigabit network.)

For example, suppose I need to transfer 10,000+ files. Obviously, scp-ing them sequentially is unacceptable, since each transfer would have to finish before the next one starts.


Ah, you're afraid of needing 10,000 logins?
Well, no problem.
Tar all 10,000 files together into one single archive and transfer it in one go, instead of scp-ing them one by one. (These suggestions aren't tested.)
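A sketch of the tar-everything-in-one-go idea: the ssh form in the comment is an untested placeholder, while the runnable pipe below batches the files locally in exactly the same way.

```shell
# Remote form (placeholder host and paths, not run here):
#   tar -czf - -C /data files | ssh user@host 'tar -xzf - -C /data'

# Local pipe demonstrating the same single-stream batching:
mkdir -p files dest
echo one > files/f1
echo two > files/f2
tar -cf - -C . files | tar -xf - -C dest
ls dest/files
```

All 10,000 files then ride over one SSH connection and one login, which also avoids the per-file setup/teardown overhead discussed above.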

If your real concern is only the login overhead, you should read up on using ssh with public keys.
I'm sorry, I could only find a German wiki link.
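For the record, a minimal public-key setup might look like this (the `ssh-copy-id` line is a placeholder for a real server, so it is left commented out):

```shell
# Generate an ed25519 key pair with no passphrase, without prompting.
ssh-keygen -t ed25519 -N "" -f ./demo_key -q

# Install the public key on the server (placeholder, not run here):
#   ssh-copy-id -i ./demo_key.pub user@host

ls demo_key demo_key.pub
```

After that, `scp -i ./demo_key ...` authenticates without a password prompt on each connection.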
Carey Evans
Ranch Hand

Joined: May 27, 2008
Posts: 225

You should look at the pure Java SSH clients instead of trying to run scp in separate processes. With Trilead SSH you can run multiple SCP sessions over a single SSH connection, making the most of a single login and TCP connection. Its SCPClient class is thread-safe and really easy to use.
Joshua Galeon
Greenhorn

Joined: Jun 02, 2010
Posts: 1
use pssh

http://code.google.com/p/parallel-ssh/
Stefan Wagner
Ranch Hand

Joined: Jun 02, 2003
Posts: 1923

@Joshua, don't wake the zombies.

Do you expect us to have been waiting almost two years for that answer, which was a tip already given by EFH?
 