In a typical enterprise application (img 1.1), we have web servers whose job is to store, process, and deliver web pages to end users. Then we have OLTP (applications and databases), where transactions happen. Every organization has an Extract, Transform and Load (ETL) process that it uses in its own way (as per its business needs for analysis), and finally the data is handed to OLAP, where analytical/reporting tools such as MicroStrategy, Business Objects, etc. sit. The diagram below shows what a typical enterprise looks like:
Hadoop is not a replacement for OLTP; rather, it is something that facilitates ETL. Let's see how:
A traditional ETL setup is capable of handling a few terabytes of data. With Hadoop in the picture, your ETL process becomes capable of handling petabytes or more. This is where Hadoop fits in: it can do the Extract, Transform, Load work over all that data and feed the results to OLAP, as shown in img 1.2.
Planning a Hadoop Cluster
The first question is: when do we need a cluster, and how big should it be? Hadoop can run on anything from a single machine to a few thousand machines. Running a single node is fine for development and testing, but it won't get you far, so why would you run Hadoop on a single node in a production environment? Typically, organizations start with a cluster that has a small number of nodes and grow it based on requirements. The decision to grow the cluster is most commonly driven by the amount of storage capacity required.
Let's now take an example similar to how clusters are planned in industry. Say we have a project where the customer delivers 2 TB of data a month, so the data in the cluster grows by 2 TB per month. With the default replication factor of 3, that means we need 6 TB of extra storage every month, plus roughly 30% (about 2 TB) of overhead for everything else the operating system needs: OS files, temporary files, and so on.
6 TB (2 TB of incoming data x replication factor of 3) + 2 TB (~30% overhead for the OS and temporary files) = 8 TB per month
So overall we will need 8 TB of storage per month. If we buy machines with 4 TB of disk, we will need 2 machines per month, which means that by the end of the year we will need around 96 TB of storage, i.e., 24 data nodes (a 24-node cluster). Whenever we plan a cluster we must have a projection of how much data is going to arrive every month or every week (the velocity of the data); based on that we can decide the capacity of the cluster. These are just the industry norms for planning a cluster.
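Under these assumptions, a rough back-of-the-envelope projection for the first year looks like this:
Monthly requirement: (2 TB x 3 replicas) + ~2 TB overhead = 8 TB
Machines per month: 8 TB / 4 TB per machine = 2
After 12 months: 12 x 8 TB = 96 TB of raw storage, i.e., 24 slave nodes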
General guidelines for the hardware
There are two types of nodes: Hadoop has master nodes and slave nodes. In small-scale setups or smaller deployments, the master node services, i.e. the NameNode, JobTracker, and Secondary NameNode, all run on the same machine (the same server); a 5-node cluster is a typical example. If we go slightly bigger than that, the NameNode and JobTracker run on one master server and the Secondary NameNode is configured to run on a separate master server; a 20-node cluster is a typical example.
In large deployments, the NameNode, the JobTracker, and the Secondary NameNode each run on their own dedicated server. The slave nodes run the DataNode and TaskTracker daemons together on every individual machine. By now you should have a sense of the capacity of your cluster based on the data. However, what should you consider when you buy hardware for the master nodes?
Master Node: Hardware Considerations
- Master nodes are a Single Point of Failure (SPOF)
- The NameNode going down would leave HDFS inaccessible
- The JobTracker going down would mean all running jobs fail
- Avoid commodity hardware; use carrier-class hardware
- Always go for dual power supplies, so that if one power supply fails the other is available
- A minimum of 32 GB of RAM
- The hard drives should be RAID-ed
- Always have dual Ethernet cards, so that if one NIC fails you still have the other
These are some considerations to keep in mind while buying master node hardware. Now let's take a look at slave nodes.
Slave Node: Hardware Considerations
A few basic rules for slave nodes:
- Horizontal scaling is better than vertical scaling: a cluster with more nodes will perform better than a cluster with fewer, slightly faster nodes, because the more servers you have, the more parallelism you achieve.
- A typical slave node configuration follows the ratio 1 : 2 : 6-8, i.e., for every hard drive, 2 CPU cores and 6-8 GB of RAM. For example, a slave with 6 drives would be paired with 12 cores and 36-48 GB of RAM.
Other configuration considerations
Operating System: Most often in production, RHEL is used on master nodes and CentOS on slave nodes. There is no recommendation for a specific OS, so choose one that your organization is comfortable administering. SUSE, Ubuntu, Fedora, RHEL, and CentOS are the most common distributions used.
- Java: In production, always use Oracle's Java. Hadoop, being complex, may end up exposing bugs in other Java implementations. Use a recent, stable version of Java, but avoid adopting new releases as soon as they come out.
- Hadoop: available from the Apache Hadoop site. Distributions such as Cloudera and Hortonworks are also good options, and each distribution ships with good Hadoop documentation.
Hadoop Installation Modes
Hadoop can be installed and run in 3 different ways:
- LocalJobRunner mode
- Pseudo-Distributed mode
- Distributed mode
LocalJobRunner mode
LocalJobRunner mode runs on the local file system, not on HDFS. Earlier, when virtual machines were not that popular or mainstream, people would typically go for local mode. Local mode means there is no HDFS, only the local file system, so your MapReduce program runs just like any Java program on your local box; it's as good as running a Java program that adds two numbers. No Hadoop daemons need to run and everything executes in a single JVM, so when you run jps you see only that one JVM. This setup is ideal for testing MapReduce programs during the development phase, with the limitation that only one reducer can be specified.
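Local mode essentially means keeping the Hadoop 1.x defaults: the default file system is the local one and MapReduce uses the LocalJobRunner. A minimal sketch that simply makes those defaults explicit would look like this:
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>
</configuration>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>
  </property>
</configuration>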
Pseudo-Distributed mode
In pseudo-distributed mode, each daemon (NameNode, DataNode, Secondary NameNode, JobTracker, TaskTracker, etc.) runs in its own JVM, independently of the others. It simulates a real Hadoop cluster on a single physical machine and uses HDFS to store data. It is perfect for developing and testing applications, making sure everything works fine before you deploy to a multi-node cluster.
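A minimal pseudo-distributed configuration might look like the sketch below, pointing everything at localhost and dropping the replication factor to 1 since there is only one DataNode (the port numbers shown are commonly used ones, not requirements):
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>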
Distributed mode
Hadoop was made for distributed mode. In real-world Hadoop clusters, a DataNode and a TaskTracker run on every slave node, while the NameNode, JobTracker, and Secondary NameNode run on dedicated machines.
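One practical detail when running distributed mode with the stock Apache start/stop scripts: the master keeps the list of slave hostnames in conf/slaves (one hostname per line), while conf/masters, somewhat confusingly, lists the host that runs the Secondary NameNode. The hostnames below are purely illustrative:
conf/slaves:
slave01
slave02
slave03
conf/masters:
snnhost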
Important configurations
Hadoop Configuration: In a Hadoop cluster, the complete Hadoop package is typically installed on every machine, master as well as slave. Each installation has its own set of configuration parameters, though most of the configuration is nearly identical across the cluster. All the important files live in Hadoop's configuration directory, typically /etc/hadoop/conf. The important configuration files are core-site.xml, hdfs-site.xml, and mapred-site.xml.
core-site.xml: All Hadoop services and clients use this file to locate the NameNode, so it must be copied to every node that is either running a Hadoop service or acting as a client.
Important properties in core-site.xml
fs.default.name: specifies the NameNode (the URI of the default file system)
hdfs-site.xml: HDFS services use this file; it contains a number of important properties:
- HTTP addresses for the NameNode and Secondary NameNode
- Replication for DataNodes
- DataNode block storage location
- NameNode metadata storage location
Important properties in hdfs-site.xml:
dfs.block.size: the size of each data block
dfs.data.dir: the location on disk where the DataNode stores its data blocks
dfs.replication: the number of times each block gets replicated
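Pulling these together, a minimal hdfs-site.xml might look like the sketch below (the block size and paths are illustrative values, not recommendations):
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.block.size</name>
    <!-- 128 MB, specified in bytes -->
    <value>134217728</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/1/dfs/dn,/data/2/dfs/dn</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>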
mapred-site.xml: This configuration file holds the runtime settings for the Hadoop processing component. It contains the configuration for MapReduce, i.e. the JobTracker service details and its HTTP address.
Important properties in mapred-site.xml:
mapred.job.tracker: the host and port of the JobTracker
mapred.local.dir: the location where intermediate MapReduce data is stored
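Similarly, a minimal mapred-site.xml covering these two properties might look like this sketch (the hostname jtmac and the local directory are hypothetical):
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jtmac:8021</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/data/1/mapred/local</value>
  </property>
</configuration>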
File Formats and Precedence
Hadoop uses XML files for configuration management; all of the configuration lives in these XML files. A typical configuration entry looks like this:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://nnmac:8020</value>
  </property>
</configuration>
The same property can be specified in multiple places with different values, and a precedence order determines which one wins; the value with the highest precedence takes priority. From highest to lowest, the precedence is:
- Property values specified in the MapReduce job itself, i.e., in the program code
- *-site.xml configurations set on client machines
- *-site.xml configurations set on slave nodes
A Hadoop administrator can mark a property value in the configuration as final, after which nobody else can change it.
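For example, adding <final>true</final> to a property in a *-site.xml prevents jobs and client-side configurations from overriding it (the value shown is just an example):
<property>
  <name>dfs.replication</name>
  <value>3</value>
  <final>true</final>
</property>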
Important Port Numbers
Core operation
NameNode: 8020
DataNode: 50010
Job Tracker: usually 8021, 9001, or 8012
Task Trackers: usually 8021, 9001, or 8012
Web Access
NameNode: 50070
JobTracker: 50030
Task Tracker: 50060
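These web UI ports come from configurable HTTP address properties. As a sketch, assuming the Hadoop 1.x dfs.http.address and mapred.job.tracker.http.address properties, the NameNode and JobTracker web interfaces can be pinned to their default ports explicitly like this:
<!-- hdfs-site.xml -->
<property>
  <name>dfs.http.address</name>
  <value>0.0.0.0:50070</value>
</property>
<!-- mapred-site.xml -->
<property>
  <name>mapred.job.tracker.http.address</name>
  <value>0.0.0.0:50030</value>
</property>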
Conclusion
Planning a Hadoop cluster is a challenging and critical platform-engineering task, so the design decisions and considerations need to be well planned and understood. A cluster that was planned only for a small scale may collapse or misbehave as it grows to handle more data than originally intended. This blog should help you understand the basics of Hadoop cluster planning and cover most situations, but always consider discussing production cluster designs with your distribution provider or product expert teams.
Get going with Hadoop. Join us for the deep dive learning program.
Stay tuned for more!