I have a problem using DistCp to transfer a large file from S3 to an HDFS cluster. Whenever I run the copy, I see processing and memory usage on only one of the nodes, not on all of them. I don't know whether this is the expected behaviour or a configuration problem. If I transfer multiple files, each node handles one file and they all work at the same time. I understood that the transfer would run in parallel, but for a single file it doesn't seem to.
I am using the 0.20.2 distribution of Hadoop on a cluster of two EC2 instances. I was hoping one of you has an idea of how DistCp works and which properties I could tweak to improve the transfer rate, which is currently 0.7 Gb per minute.
DistCp is for copying large amounts of data to and from Hadoop filesystems in parallel. I haven't heard of anyone using it to copy files from non-HDFS to HDFS. I am curious to know whether you have solved your problem.
Juan Felipe Morales Castellanos
Joined: Sep 04, 2012
No, I didn't make myself clear. When I talked about transferring from S3, I didn't mean transferring from the S3 format to HDFS; I was talking about a file in HDFS (stored in an S3 bucket) being transferred to an EC2 instance.
In the end I found that this can't be done the way I expected. DistCp copies in parallel, but only across multiple files; for a single file, only one thread is in charge of the transfer. I didn't know that. It seems that Facebook managed to do this by modifying one version of Hadoop (0.20.2, I think) so that DistCp can transfer a single large file in parallel, but I haven't tried that Facebook-modified version. To fix the issue, I wrote a simple MapReduce job that let me transfer the file in parallel.
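For anyone who wants to see the idea, here is a minimal sketch of copying a single file in parallel by splitting it into byte ranges. It is not the MapReduce job mentioned above: it uses plain Python threads on local files, and the names `byte_ranges`, `copy_range` and `parallel_copy` are made up for illustration. Note also that HDFS does not allow writing at arbitrary offsets, so a real job would probably have each task write its range to a separate part file and join the parts afterwards.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def byte_ranges(file_size, num_workers):
    """Split [0, file_size) into at most num_workers (offset, length) ranges."""
    if file_size <= 0 or num_workers <= 0:
        return []
    chunk = -(-file_size // num_workers)  # ceiling division
    return [(start, min(chunk, file_size - start))
            for start in range(0, file_size, chunk)]


def copy_range(src_path, dst_path, offset, length, buf_size=1 << 20):
    """Copy one byte range; every worker opens its own file handles."""
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        src.seek(offset)
        dst.seek(offset)
        remaining = length
        while remaining > 0:
            data = src.read(min(buf_size, remaining))
            if not data:
                break
            dst.write(data)
            remaining -= len(data)


def parallel_copy(src_path, dst_path, num_workers=4):
    """Copy src_path to dst_path using one worker per byte range."""
    size = os.path.getsize(src_path)
    # Pre-size the destination so each worker can seek to its own offset.
    with open(dst_path, "wb") as dst:
        dst.truncate(size)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(copy_range, src_path, dst_path, off, length)
                   for off, length in byte_ranges(size, num_workers)]
        for future in futures:
            future.result()  # re-raise any error from a worker
```

In a MapReduce version, each (offset, length) pair would be the input of one map task, so that several nodes read different parts of the same file at once.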