Transfer large file >50Gb with DistCp from s3 to cluster

Juan Felipe Morales Castellanos
Greenhorn

Joined: Sep 04, 2012
Posts: 2
Hello guys

I have a problem using DistCp to transfer a large file from S3 to an HDFS cluster. Whenever I try to make the copy, I only see CPU and memory usage on one of the nodes, not on all of them, and I don't know whether this is the expected behaviour or a configuration problem. If I transfer multiple files instead, each node handles a single file at a time. I understood the transfer would be done in parallel, but it doesn't seem to work that way.

I am using the Hadoop 0.20.2 distribution on a cluster of two EC2 instances. I was hoping some of you might know how DistCp works and which properties I could tweak to improve the transfer rate, which is currently about 0.7 GB per minute.
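For reference, this is roughly the invocation I am running (the bucket name, file name, and NameNode address below are placeholders):

hadoop distcp -Ddistcp.bytes.per.map=268435456 -m 16 s3n://mybucket/bigfile hdfs://namenode:9000/data/bigfile

As far as I can tell from the documentation, -m caps the number of simultaneous copies and distcp.bytes.per.map controls how many bytes' worth of files each map task is assigned, but I haven't found any property that splits a single file across map tasks.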

Regards.
Srinivas Mupparapu
Greenhorn

Joined: Feb 12, 2004
Posts: 14

DistCp is for copying large amounts of data to and from Hadoop filesystems in parallel. I haven't heard of anyone using it to copy files from a non-HDFS source to HDFS. I am curious to know whether you have solved your problem.
Juan Felipe Morales Castellanos
Greenhorn

Joined: Sep 04, 2012
Posts: 2
Hello Srinivas

No, I didn't make myself clear. When I talked about transferring from S3, I didn't mean converting from an S3 format to HDFS; I was talking about a file stored in an S3 bucket being transferred to the HDFS cluster running on the EC2 instances.

In the end I found that this can't be done the way I expected. DistCp does make copies in parallel, but only across multiple files; for a single file, one task alone is in charge of the whole transfer. I didn't know that. It seems Facebook managed to get around this by modifying one version of Hadoop (0.20.2, I think) so that DistCp could transfer a single large file in parallel, but I haven't tried that modified version. To fix the issue, I ended up writing a simple MapReduce job that let me transfer the file in parallel.
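In case it is useful to someone, here is a minimal sketch of the idea (the class, property, and path names are mine, not from any library; it assumes you generate a small "ranges" text file beforehand, with one "start,length" line per desired map task, and that the source filesystem supports seek()):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.NLineInputFormat;
import org.apache.hadoop.mapred.lib.NullOutputFormat;

public class RangeCopy extends MapReduceBase
    implements Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private JobConf conf;

  public void configure(JobConf job) {
    this.conf = job;
  }

  // Each input line is one "start,length" byte range of the source file;
  // NLineInputFormat turns every line into its own map task.
  public void map(LongWritable key, Text value,
                  OutputCollector<NullWritable, NullWritable> collector,
                  Reporter reporter) throws IOException {
    Path src = new Path(conf.get("rangecopy.src"));     // e.g. s3n://mybucket/bigfile
    Path dstDir = new Path(conf.get("rangecopy.dst"));  // HDFS staging directory

    String[] range = value.toString().split(",");
    long start = Long.parseLong(range[0].trim());
    long length = Long.parseLong(range[1].trim());

    FileSystem srcFs = src.getFileSystem(conf);
    FileSystem dstFs = dstDir.getFileSystem(conf);
    // Name the part file by its offset so the pieces sort back into order.
    Path part = new Path(dstDir, String.format("part-%020d", start));

    FSDataInputStream in = srcFs.open(src);
    FSDataOutputStream out = dstFs.create(part, true);
    try {
      in.seek(start);  // assumes the source filesystem supports seek()
      byte[] buf = new byte[64 * 1024];
      long remaining = length;
      int n;
      while (remaining > 0
          && (n = in.read(buf, 0, (int) Math.min(buf.length, remaining))) > 0) {
        out.write(buf, 0, n);
        remaining -= n;
        reporter.progress();  // keep a long-running copy from timing out
      }
    } finally {
      out.close();
      in.close();
    }
  }

  public static void main(String[] args) throws Exception {
    // args: <source file URI> <ranges file on HDFS> <HDFS staging directory>
    JobConf job = new JobConf(RangeCopy.class);
    job.setJobName("range-copy");
    job.set("rangecopy.src", args[0]);
    job.set("rangecopy.dst", args[2]);
    job.setInputFormat(NLineInputFormat.class);   // one map task per input line
    FileInputFormat.setInputPaths(job, new Path(args[1]));
    job.setOutputFormat(NullOutputFormat.class);  // the maps write files directly
    job.setMapperClass(RangeCopy.class);
    job.setNumReduceTasks(0);
    job.setMapOutputKeyClass(NullWritable.class);
    job.setMapOutputValueClass(NullWritable.class);
    JobClient.runJob(job);
  }
}

Each map task copies its byte range to a separate part file named after its offset, so the parts sort back into order and can be stitched together afterwards. The number of lines in the ranges file is what controls the parallelism.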

Regards and thanks for the interest.
 