
Transfer large file >50Gb with DistCp from s3 to cluster

Juan Felipe Morales Castellanos
Greenhorn

Joined: Sep 04, 2012
Posts: 2
Hello guys

I have a problem using DistCp to transfer a large file from S3 to an HDFS cluster. Whenever I try the copy, I only see CPU and memory usage on one of the nodes, not on all of them, and I don't know whether this is the expected behaviour or a configuration problem. If I transfer multiple files instead, each node handles a single file at the same time. I understood the transfer would be done in parallel, but it doesn't seem to work that way for a single file.

I am using the Hadoop 0.20.2 distribution on a cluster of two EC2 instances. I was hoping one of you might have an idea of how DistCp works and which properties I could tweak to improve the transfer rate, which is currently about 0.7 GB per minute.
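
For reference, the command I am running looks roughly like this (the bucket name and paths here are just placeholders):

hadoop distcp s3n://my-bucket/path/bigfile hdfs://namenode:9000/data/bigfile

Is the -m option (the number of map tasks) the kind of property I should be looking at, or is it something else?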

Regards.
Srinivas Mupparapu
Greenhorn

Joined: Feb 12, 2004
Posts: 14

DistCp is for copying large amounts of data to and from Hadoop filesystems in parallel. I haven't heard of anyone using it to copy files from a non-HDFS filesystem to HDFS. I am curious to know whether you have solved your problem.
Juan Felipe Morales Castellanos
Greenhorn

Joined: Sep 04, 2012
Posts: 2
Hello Srinivas

No, I didn't make myself clear. When I talked about transferring from S3, I didn't mean converting it from an S3 format to HDFS; I was talking about a file stored in an S3 bucket being transferred to HDFS on the EC2 instances.

In the end I found that this can't be done the way I expected: DistCp makes its copies in parallel, but only across multiple files. For a single file there is only one map task in charge of the transfer, which I didn't know. It seems Facebook managed to do this by modifying a version of Hadoop (0.20.2, I think) so that a single large file can be transferred in parallel with DistCp, but I haven't tried that modified version. To fix the issue I wrote a simple Map-Reduce job that allowed me to transfer the file in parallel.
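
Roughly, the idea is this (what follows is only a simplified sketch of that kind of job, not my exact code; the class names, the s3copy.* property names and the 1 GB chunk size are just placeholders, and it uses the old mapred API from 0.20.2). Each map task gets one "offset,length" line, reads that byte range from the S3 object and writes it to its own part file in HDFS; afterwards the parts have to be stitched back together in offset order, for example with FileUtil.copyMerge.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.NLineInputFormat;
import org.apache.hadoop.mapred.lib.NullOutputFormat;

// Copies one large S3 file to HDFS in parallel: every map task copies one
// byte range of the source into its own part file under the target directory.
public class ParallelS3Copy {

  public static class RangeCopyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private JobConf conf;

    public void configure(JobConf conf) {
      this.conf = conf;
    }

    // Each input line is "offset,length", describing one chunk of the source file.
    public void map(LongWritable key, Text value,
        OutputCollector<NullWritable, NullWritable> output, Reporter reporter)
        throws IOException {
      String[] range = value.toString().split(",");
      long offset = Long.parseLong(range[0]);
      long length = Long.parseLong(range[1]);

      Path src = new Path(conf.get("s3copy.src"));        // e.g. s3n://my-bucket/bigfile
      Path dstDir = new Path(conf.get("s3copy.dstdir"));  // HDFS directory for the parts

      FSDataInputStream in = src.getFileSystem(conf).open(src);
      in.seek(offset);  // on s3n this becomes a ranged GET starting at the offset
      // Zero-padded name so that lexicographic order equals byte order for the merge.
      FSDataOutputStream part = dstDir.getFileSystem(conf)
          .create(new Path(dstDir, String.format("part-%020d", offset)));

      byte[] buf = new byte[64 * 1024];
      long remaining = length;
      while (remaining > 0) {
        int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
        if (n < 0) break;
        part.write(buf, 0, n);
        remaining -= n;
        reporter.progress();  // keep the task from being killed as unresponsive
      }
      in.close();
      part.close();
    }
  }

  public static void main(String[] args) throws IOException {
    String src = args[0];      // s3n://bucket/bigfile
    String dstDir = args[1];   // HDFS directory that will hold the part files
    long chunkSize = 1024L * 1024L * 1024L;  // 1 GB per map task, tune to taste

    JobConf conf = new JobConf(ParallelS3Copy.class);
    conf.set("s3copy.src", src);
    conf.set("s3copy.dstdir", dstDir);

    // One "offset,length" line per chunk; NLineInputFormat hands each line to its own mapper.
    long fileLen = new Path(src).getFileSystem(conf)
        .getFileStatus(new Path(src)).getLen();
    FileSystem fs = FileSystem.get(conf);
    Path ranges = new Path(dstDir + "_ranges.txt");
    FSDataOutputStream w = fs.create(ranges);
    for (long off = 0; off < fileLen; off += chunkSize) {
      w.writeBytes(off + "," + Math.min(chunkSize, fileLen - off) + "\n");
    }
    w.close();

    conf.setMapperClass(RangeCopyMapper.class);
    conf.setNumReduceTasks(0);
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", 1);
    FileInputFormat.setInputPaths(conf, ranges);
    conf.setOutputFormat(NullOutputFormat.class);
    conf.setOutputKeyClass(NullWritable.class);
    conf.setOutputValueClass(NullWritable.class);

    JobClient.runJob(conf);
    // The part files still have to be merged back into one file in offset order,
    // e.g. FileUtil.copyMerge(fs, new Path(dstDir), fs, finalFile, true, conf, null).
  }
}

The S3 credentials (fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey) have to be present in the job configuration so the map tasks can open the s3n:// path.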

Regards and thanks for the interest.
 