Mapreduce example from Apache Site

Ranch Hand
Posts: 1609
I was following up on this page from Apache.

After the compilation step of word count v1.0 it says

Assuming that:

/user/joe/wordcount/input - input directory in HDFS
/user/joe/wordcount/output - output directory in HDFS

What does "directory in HDFS" mean? Are these already created? And I see that

lists the two files inside the input directory. The normal "ls" command would have done that too, so what is the significance of using bin/hdfs here?
Posts: 9

Please understand that HDFS is a distributed file system. When the system is set up as a cluster, the data is split into multiple segments/chunks and distributed across the cluster. bin/hadoop dfs ---------> means that you are listing the contents of HDFS, not of the ordinary (local) file system.
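To make the difference concrete, here is a minimal sketch of the commands involved. This assumes a running Hadoop installation; the paths are the ones from the tutorial, and file01/file02 follow the tutorial's sample file names. The directories in HDFS are not created for you automatically — you create them and copy the input files in yourself:

```shell
# The ordinary ls lists your LOCAL working directory:
ls

# bin/hadoop fs (or the equivalent bin/hdfs dfs) talks to HDFS instead.
# First create the input directory in HDFS:
bin/hadoop fs -mkdir -p /user/joe/wordcount/input

# Copy the local sample files into that HDFS directory:
bin/hadoop fs -put file01 file02 /user/joe/wordcount/input

# Now listing the HDFS directory shows the two files that the job
# will read -- these live in HDFS, not on your local disk:
bin/hadoop fs -ls /user/joe/wordcount/input
```

So the significance of the bin/hdfs (or bin/hadoop fs) prefix is simply that the listing is served by the distributed file system rather than the local one.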

Hope you understand this.

The input directory says where the input files for processing are available, and the output directory says where the processed output files will be written.

Think of a file that contains the phone number of everyone in country X; the people with a last name starting with A might be stored on server 1, B on server 2, and so on. In a Hadoop world, pieces of this phonebook would be stored across the cluster. To keep data available as components fail, HDFS replicates these smaller pieces onto two additional servers by default. This redundancy offers multiple benefits, the most obvious being higher availability. When you read from HDFS, the pieces from the clustered servers are combined and reconstructed into a single file.
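The WordCount job itself follows the same split-and-combine idea: a map phase emits a (word, 1) pair for each word in its slice of the input, and a reduce phase sums the counts per word. Below is a minimal single-process sketch of that flow in plain Java; the class and method names are illustrative only, and the two input lines follow the sample files from the tutorial — a real Hadoop job would run the map and reduce steps on different nodes.

```java
import java.util.*;

// A single-process sketch of the WordCount map/reduce flow.
public class WordCountSketch {

    // Map phase: split each input line into words and emit (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffle + reduce phase: group the pairs by word and sum the counts.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The two sample input lines from the WordCount v1.0 walkthrough.
        List<String> input = Arrays.asList(
                "Hello World Bye World",
                "Hello Hadoop Goodbye Hadoop");
        Map<String, Integer> counts = reduce(map(input));
        counts.forEach((w, c) -> System.out.println(w + "\t" + c));
        // Prints: Bye 1, Goodbye 1, Hadoop 2, Hello 2, World 2
    }
}
```

In the real job, the input lines come from the files under /user/joe/wordcount/input in HDFS, and the reduced counts are written to /user/joe/wordcount/output.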

Hope this helps you to understand.

Akhilesh Trivedi
Ranch Hand
Posts: 1609
Thanks Arumugarani!

I am able to understand the concepts and am working through them.