The logical division of data is called an input split, and the physical division of data is called an HDFS block.

The TaskTracker periodically sends heartbeat messages to the JobTracker to confirm that it is alive. These messages also tell the JobTracker how many slots are available, so the JobTracker knows where it can schedule tasks.

A heartbeat is a signal sent from a DataNode to the NameNode, and from a TaskTracker to the JobTracker. If the NameNode or JobTracker stops receiving the signal, it assumes that there is a problem with that DataNode or TaskTracker.
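The timeout logic behind a heartbeat can be sketched in a few lines. This is only an illustration of the idea, not Hadoop's implementation: the class name HeartbeatMonitor, the worker ids, and the 30-second timeout are all assumptions made for the example.

```python
class HeartbeatMonitor:
    """Hypothetical helper: tracks the last heartbeat time per worker."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # worker id -> time of the last heartbeat

    def record_heartbeat(self, worker_id, now):
        # Called whenever a worker (DataNode or TaskTracker) reports in.
        self.last_seen[worker_id] = now

    def dead_workers(self, now):
        # A worker is presumed failed once it has been silent for
        # longer than the timeout.
        return sorted(
            worker_id
            for worker_id, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=30)  # assumed timeout
monitor.record_heartbeat("datanode-1", now=0)
monitor.record_heartbeat("datanode-2", now=0)
monitor.record_heartbeat("datanode-1", now=25)
print(monitor.dead_workers(now=40))  # ['datanode-2']
```

Here datanode-2 is reported because its last heartbeat is 40 seconds old, while datanode-1 reported in 15 seconds ago.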

In Hadoop, SequenceFileInputFormat is used to read sequence files. A sequence file is a binary file format of key/value pairs, optionally compressed, that is well suited to passing data from the output of one MapReduce job to the input of another MapReduce job.

The following are the network requirements for using Hadoop:

  • Password-less SSH connection.
  • Secure Shell (SSH) for launching server processes.

Hadoop is a distributed computing platform written in Java. It includes features modeled on the Google File System and Google's MapReduce.

These are the main tasks of JobTracker:

  • To accept jobs from the client.
  • To communicate with the NameNode to determine the location of the data.
  • To locate TaskTracker Nodes with available slots.
  • To submit the work to the chosen TaskTracker node and monitor the progress of each task.

A TaskTracker is a node in the cluster that accepts tasks, such as map, reduce, and shuffle operations, from a JobTracker.

HDFS data blocks are distributed across the local drives of all machines in a cluster, whereas NAS data is stored on dedicated hardware.

Distributed cache is a facility provided by the MapReduce framework to cache files (text, archives, and so on) that a job needs at execution time. The framework copies the necessary files to the slave node before any task of the job executes on that node.

The ‘jps’ command lists the Java processes running on a machine, so it is used to check which Hadoop daemons are running on a node of the cluster.

Hadoop streaming is a utility which allows you to create and run MapReduce jobs. It is a generic API that allows programs written in any language to be used as a Hadoop mapper or reducer.
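As an illustration, a streaming mapper is just a program that reads lines from standard input and writes tab-separated key/value records to standard output. The following is a minimal word-count sketch under that assumption; the function name run_mapper and the sample input are made up for the example, and the demonstration uses in-memory streams so that it runs without a cluster.

```python
import io
import sys


def run_mapper(stream_in, stream_out):
    """Read text lines and write one 'word<TAB>1' record per word."""
    for line in stream_in:
        for word in line.split():
            stream_out.write(f"{word}\t1\n")


# Demonstration with in-memory streams. In a real streaming job the call
# would be run_mapper(sys.stdin, sys.stdout).
demo_out = io.StringIO()
run_mapper(io.StringIO("big data\nbig cluster\n"), demo_out)
sys.stdout.write(demo_out.getvalue())
```

In a real job, a script like this would be handed to the streaming utility through its -mapper option.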

JobTracker is a service within Hadoop which runs MapReduce jobs on the cluster.

There are many ways to debug Hadoop code, but the most popular methods are:

  • By using Counters.
  • By using the web interface provided by the Hadoop framework.

These are the most common input formats defined in Hadoop:

  1. TextInputFormat
  2. KeyValueTextInputFormat
  3. SequenceFileInputFormat

TextInputFormat is the default input format.

Big data can be characterized by the following features:

  • Volume
  • Velocity
  • Variety

Shuffling is the process that sorts the map outputs and transfers them to the reducers as input.
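A rough way to picture the shuffle is as a partition, sort, and group step between the map and reduce phases. The sketch below simulates that in memory; the function name shuffle is hypothetical, and the byte-sum partitioner is only a deterministic stand-in for the hash partitioner a real job would use.

```python
from collections import defaultdict


def shuffle(map_outputs, num_reducers):
    """Partition (key, value) pairs by key, then sort and group per reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_outputs:
        # Stand-in for a hash partitioner: every occurrence of a key
        # lands in the same partition.
        index = sum(key.encode("utf-8")) % num_reducers
        partitions[index][key].append(value)
    # Each reducer receives its keys in sorted order, with all values
    # for a key grouped together.
    return [sorted(partition.items()) for partition in partitions]


map_outputs = [("b", 1), ("a", 1), ("b", 1), ("c", 1)]
for reducer_id, reducer_input in enumerate(shuffle(map_outputs, 2)):
    print(reducer_id, reducer_input)
```

The point to notice is that all values for a given key end up at a single reducer, already sorted by key.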

When a Hadoop job runs, it splits the input files into chunks and assigns each chunk to a mapper for processing. Each such chunk is called an InputSplit.
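The arithmetic of cutting a file into splits can be sketched as follows. This is a simplified illustration: the function name compute_splits is hypothetical, the 300 MB file and 128 MB split size are assumed values, and Hadoop's own input formats apply additional rules that are ignored here.

```python
def compute_splits(file_size, split_size):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    splits = []
    offset = 0
    while offset < file_size:
        # The last split may be shorter than split_size.
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits


MB = 1024 * 1024
for offset, length in compute_splits(300 * MB, 128 * MB):
    print(f"offset={offset // MB} MB, length={length // MB} MB")
```

With these assumed sizes the file yields three splits, so three mappers would be started, the last one processing the remaining 44 MB.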

A Combiner is a mini-reduce process which operates only on data generated by a single Mapper. When the Mapper emits data, the combiner receives it as input, aggregates it locally, and sends its output to a reducer.
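For a word-count style job, the effect of a combiner can be sketched as a local sum over one mapper's output. The function name combine and the sample records are assumptions made for the illustration.

```python
from collections import Counter


def combine(mapper_output):
    """Sum the counts for each key emitted by one mapper."""
    totals = Counter()
    for key, count in mapper_output:
        totals[key] += count
    return sorted(totals.items())


mapper_output = [("hadoop", 1), ("hdfs", 1), ("hadoop", 1), ("hadoop", 1)]
print(combine(mapper_output))  # [('hadoop', 3), ('hdfs', 1)]
```

Four records leave the mapper, but only two are sent on to the reducers, which is the reason for using a combiner: less data crosses the network.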

WebDAV is a set of extensions to HTTP which is used to support editing and uploading files. On most operating systems, WebDAV shares can be mounted as filesystems, so it is possible to access HDFS as a standard filesystem by exposing HDFS over WebDAV.

Yes, it is possible. The input format class provides methods, such as FileInputFormat.addInputPath, to add multiple directories as input to a Hadoop job.

The NameNode is the node where Hadoop stores all the file location information for HDFS (Hadoop Distributed File System). It is the centerpiece of an HDFS file system: it keeps the record of all the files in the file system and tracks where the file data is kept across the cluster.

JobTracker is a service which is used to submit and track MapReduce jobs in Hadoop. Only one JobTracker process runs on any Hadoop cluster, and it runs within its own JVM process.

Functionalities of JobTracker in Hadoop:

  • When client application submits jobs to the JobTracker, the JobTracker talks to the NameNode to find the location of the data.
  • It locates TaskTracker nodes with available slots, at or near the data.
  • It assigns the work to the chosen TaskTracker nodes.
  • The TaskTracker nodes are responsible for notifying the JobTracker when a task fails; the JobTracker then decides what to do. It may resubmit the task on another node, or it may mark the task as one to avoid.
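The slot-based choice described in the list above can be sketched as a small selection function. This is a deliberately simplified illustration, not the JobTracker's scheduler: the function name choose_tracker, the host names, and the tie-breaking by host name are all assumptions.

```python
def choose_tracker(free_slots, block_hosts):
    """Pick a tracker with a free slot, preferring one that holds the data.

    free_slots maps host name -> number of free task slots.
    block_hosts is the set of hosts that store the input data.
    """
    candidates = [host for host, slots in free_slots.items() if slots > 0]
    if not candidates:
        return None  # no capacity anywhere; the task has to wait
    local = [host for host in candidates if host in block_hosts]
    # Prefer a data-local host; fall back to any host with a free slot.
    return sorted(local or candidates)[0]


free_slots = {"node1": 0, "node2": 2, "node3": 1}
print(choose_tracker(free_slots, block_hosts={"node1", "node3"}))  # node3
```

Here node1 holds the data but has no free slot, so the task goes to node3, which both holds the data and has capacity.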

In TextInputFormat, each line in the text file is a record. The value is the content of the line, while the key is the byte offset of the line. For instance, Key: LongWritable, Value: Text.
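The key/value pairs described above can be reproduced with a short simulation. The function name text_records and the sample bytes are assumptions; the sketch only mimics how byte offsets and line contents pair up, not Hadoop's actual record reader.

```python
def text_records(data):
    """Yield (byte_offset, line_text) pairs from raw file bytes."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # Key: offset of the first byte of the line.
        # Value: the line without its terminator.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


data = b"hello\nbig data\nhadoop\n"
for key, value in text_records(data):
    print(key, value)
```

For this input the keys are 0, 6, and 15, because each key counts the bytes of all preceding lines, including their newline characters.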

The following are three main configuration files in Hadoop:

  • core-site.xml
  • mapred-site.xml
  • hdfs-site.xml

Sqoop is a tool used to transfer data between a Relational Database Management System (RDBMS) and Hadoop HDFS. By using Sqoop, you can import data from an RDBMS like MySQL or Oracle into HDFS, as well as export data from HDFS files to an RDBMS.

In Hadoop, a job is divided into multiple smaller parts known as tasks.