BigData: How data is distributed in Hadoop HDFS (Hadoop Distributed File System)

(Last Updated On: July 7, 2011)

The Apache Hadoop framework is based on Google's MapReduce model and on the design of the Google File System. In Hadoop, data is split into chunks and distributed across the nodes in the cluster. This concept is inherited from the Google File System; in Hadoop it is called HDFS (Hadoop Distributed File System). When data is loaded into HDFS, it is distributed to the nodes according to a few parameters. Here we will look at two important parameters to consider for better performance.

1. Chunk size (dfs.block.size, in bytes): for example 64 MB, 128 MB, 256 MB or 512 MB. It is preferable to choose the size based on the size of the input data to be processed and on the power of each node.

2. Replication factor (dfs.replication=3): the default is 3, which means each chunk is stored on 3 nodes, so the data exists 3 times across the cluster. If nodes are likely to fail, it is better to increase the replication factor. Replication is needed because, if a node in the cluster fails and its data exists nowhere else, that data cannot be processed and the job will not produce a complete result.
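To make the two parameters concrete, here is a minimal Python sketch of how a file is cut into chunks and how many chunk copies the cluster ends up holding. It is only an illustration of the arithmetic, not Hadoop code: the function names `split_into_blocks` and `total_replicas` and the 200 MB example file are assumptions made for this example.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size):
    """Return (offset, length) pairs, one per chunk.

    Every chunk is block_size bytes long except possibly the last one,
    which holds whatever is left of the file.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def total_replicas(file_size, block_size, replication):
    """Number of chunk copies stored in the cluster for one file."""
    return len(split_into_blocks(file_size, block_size)) * replication

if __name__ == "__main__":
    # Hypothetical 200 MB file with a 64 MB chunk size and replication 3.
    for offset, length in split_into_blocks(200 * MB, 64 * MB):
        print(f"chunk at offset {offset // MB} MB, length {length // MB} MB")
    print("chunk copies in cluster:", total_replicas(200 * MB, 64 * MB, 3))
```

With these example numbers the file becomes three full 64 MB chunks plus one 8 MB chunk, and replication 3 turns those 4 chunks into 12 stored copies.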

For example, suppose we process 1 TB of data on 1000 nodes. 1 TB (1024 GB) × 3 (replication factor) = 3072 GB of data stored across the 1000-node cluster. We can choose the chunk size based on the capability of each node. If a node has more than 2 GB of memory (RAM), we can specify a 512 MB chunk size, so that the TaskTracker on that node processes one chunk at a time. If the node has a dual-core processor, it can process 2 chunks at the same time, so choose the chunk size based on the memory available on each node. It is recommended not to use the NameNode (master) as a DataNode as well; otherwise that single machine is overloaded with both the master work (NameNode and, typically, JobTracker) and the worker work (DataNode and TaskTracker).
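The numbers in this example can be checked with a few lines of Python. This is a rough sketch under the assumptions stated in the post (one chunk processed per core at a time); the function name `cluster_plan` and its dictionary keys are made up for this illustration.

```python
MB = 1024 ** 2
GB = 1024 ** 3

def cluster_plan(data_bytes, replication, nodes, block_bytes, cores_per_node):
    """Back-of-the-envelope sizing for the example in the post."""
    raw_bytes = data_bytes * replication
    # Ceiling division: the last chunk may be only partly filled.
    blocks = -(-data_bytes // block_bytes)
    return {
        "raw_gb": raw_bytes / GB,                    # stored incl. replicas
        "avg_gb_per_node": raw_bytes / GB / nodes,   # average only, not actual
        "blocks": blocks,
        "block_copies": blocks * replication,
        # Assumption from the post: one chunk per core at a time.
        "chunks_in_parallel": nodes * cores_per_node,
    }

if __name__ == "__main__":
    plan = cluster_plan(
        data_bytes=1024 * GB,    # 1 TB of input
        replication=3,
        nodes=1000,
        block_bytes=512 * MB,
        cores_per_node=2,        # dual-core nodes
    )
    for key, value in plan.items():
        print(f"{key}: {value}")
```

For 1 TB with 512 MB chunks this gives 2048 chunks, 6144 chunk copies and 3072 GB of stored data, which averages out to about 3 GB per node.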

Will the data be distributed equally across the nodes of the Hadoop cluster?

No, it is not distributed evenly, such as 3 GB on each node. Some nodes will have 8 GB of data, others 5 GB, others 1 GB, and so on. However, each node holds complete chunks; a chunk is never split with half on one node and half on another.
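A small simulation shows why the per-node totals come out uneven even though every chunk stays whole. Note the big assumption: the sketch places each chunk copy on randomly chosen distinct nodes, which is a simplification and not the actual placement policy HDFS uses. The function name `place_blocks` and the fixed random seed are choices made only for this example.

```python
import random

def place_blocks(num_blocks, nodes, replication, rng):
    """Return a list with the number of chunk copies held by each node.

    Simplifying assumption: every chunk gets `replication` copies on
    distinct nodes picked uniformly at random. Real HDFS placement
    follows its own policy; this only illustrates the uneven totals.
    """
    per_node = [0] * nodes
    for _ in range(num_blocks):
        # sample() never picks the same node twice for one chunk.
        for node in rng.sample(range(nodes), replication):
            per_node[node] += 1
    return per_node

if __name__ == "__main__":
    rng = random.Random(2011)  # fixed seed so the run is repeatable
    counts = place_blocks(num_blocks=2048, nodes=1000, replication=3, rng=rng)
    block_mb = 512
    print("least loaded node:", min(counts) * block_mb, "MB")
    print("most loaded node: ", max(counts) * block_mb, "MB")
    print("total chunk copies:", sum(counts))
```

Each node's total is always a whole number of chunks, but the number of chunks differs from node to node, which is why the data does not end up as the same amount everywhere.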

In upcoming posts we will look at more Hadoop parameters for improving cluster performance.
