Here is a collection of the most popular Big Data interview questions you should be prepared for.
- What do you understand by "Big Data"?
Answer: Big Data refers to data sets so large, fast-moving or varied that a single conventional machine or a traditional database cannot store and process them efficiently, so distributed frameworks are required.
- What are the sources of Big Data in the case of IoT?
Answer: Sensors are the most common source in the case of IoT.
- What do the five V's of Big Data stand for?
Answer: Volume, Velocity, Variety, Veracity and Value.
- How would you define "Hadoop"?
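Answer: Hadoop is an open-source framework for storing and processing big data across clusters of machines using distributed, parallel computing.
- Name the core components of Hadoop.
Answer: The core components of Hadoop are:
- HDFS for storage
- MapReduce for processing, with YARN for resource management.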
- Why is Hadoop needed?
Answer: Hadoop is needed for scalability. It is easy to build a solution for a fixed volume of data, but much harder to build one that keeps working as the data grows. Hadoop scales out by adding more machines to the cluster.
- In what ways are Big Data and Hadoop related?
Answer: Big Data and Hadoop go hand in hand. Hadoop was built specifically to store and process Big Data, and it remains one of the most widely used platforms for doing so. Many other Big Data applications are designed to run on top of Hadoop.
- How does Apache Hadoop resolve the challenge of big data storage?
Answer: Hadoop has its own distributed file system, HDFS. It is schema-less: files are stored as raw bytes, split into large blocks that are spread across the machines of the cluster, and they can optionally be compressed. Each block is also replicated (three copies by default), so the data remains available even when a machine fails.
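To make the idea of redundant block storage concrete, here is a minimal sketch in Python. It is not HDFS code: the function names and node names are hypothetical, and the round-robin placement is a deliberate simplification of what HDFS really does. It only assumes the HDFS defaults of 128 MB blocks and three replicas per block.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size in bytes
REPLICATION = 3                 # HDFS default replication factor


def plan_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, block_length, replica_nodes) tuples."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    plan = []
    offset = 0
    index = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        # Simplified round-robin placement (an assumption of this sketch).
        replicas = [nodes[(index + r) % len(nodes)] for r in range(replication)]
        plan.append((index, length, replicas))
        offset += length
        index += 1
    return plan


def readable_after_failure(plan, failed_node):
    """A file stays readable if every block keeps at least one live replica."""
    return all(
        any(node != failed_node for node in replicas)
        for _, _, replicas in plan
    )
```

For example, a 300 MB file on four nodes becomes three blocks (128 MB, 128 MB and 44 MB), and losing any single node still leaves every block readable.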
- How can big data analysis help in revenue generation?
Answer: Big Data analysis can improve revenue for almost any company. Finance companies can use it to crunch huge amounts of financial data and find loopholes and mistakes. Healthcare providers can use it to detect problems earlier by analysing patient histories and the vitals of past patients. Similarly, any kind of company can use big data analytics to find and correct financial leaks.
- Mention some of the fields where you can apply Hadoop?
Answer: Hadoop can be applied at any place where big data exists. It can be used to:
- Model traffic
- Model a high-frequency trading platform
- Render graphics
- What are the differences between Hadoop and Spark?
Answer: Hadoop has its own storage system (HDFS), whereas Spark has none and relies on external storage such as HDFS. Spark is generally faster than Hadoop MapReduce because it keeps intermediate data in memory, while Hadoop is the more mature ecosystem, with much more support and tooling available.
- In how many modes can one run Hadoop?
Answer: Hadoop can run in three different modes: Standalone (local), Pseudo-distributed and Fully distributed.
- Name the common input formats in Hadoop?
Answer: The commonly used input formats in Hadoop are: TextInputFormat (plain text, the default), KeyValueTextInputFormat (key-value pairs, one per line) and SequenceFileInputFormat (sequence files).
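The difference between the plain text and key-value formats is easiest to see on a small input. The following Python sketch only imitates how records are shaped; it is not the Hadoop API, and both function names are made up for illustration. With plain text input the key is the byte offset of the line and the value is the line itself, while key-value input splits each line at the first separator (a tab by default).

```python
def text_input_records(data: bytes):
    """Imitates plain text input: key = byte offset of the line, value = line."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


def key_value_records(data: bytes, separator="\t"):
    """Imitates key-value text input: split each line at the first separator."""
    for _, line in text_input_records(data):
        # If the separator is absent, the whole line becomes the key.
        key, _, value = line.partition(separator)
        yield key, value
```

Sequence file input is not shown because sequence files are a binary format and need the real Hadoop libraries to read.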
- Define HDFS?
Answer: Hadoop Distributed File System (HDFS) is the core storage solution for Hadoop. It runs on top of the local file systems of the cluster machines, such as ext3 or ext4, where the HDFS blocks are kept as ordinary files.
- Name the components of HDFS.
Answer: The two main components of HDFS are the NameNode (master) and the DataNodes (slaves).
- What is meant by FSCK?
Answer: FSCK stands for File System Check. In HDFS, the fsck command reports problems with files, such as missing, corrupt or under-replicated blocks.
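Conceptually, such a check compares the replicas each block should have with the replicas that are actually reachable. The Python sketch below illustrates that idea only; it is not how the real tool is implemented, and the function name, block IDs and node names are hypothetical.

```python
def check_blocks(block_replicas, live_nodes, target_replication=3):
    """Classify blocks by how many of their replicas sit on live nodes.

    block_replicas: dict mapping block id -> list of nodes holding a replica.
    live_nodes: set of nodes that are currently reachable.
    """
    report = {"healthy": [], "under_replicated": [], "missing": []}
    for block_id, nodes in sorted(block_replicas.items()):
        live = [node for node in nodes if node in live_nodes]
        if not live:
            report["missing"].append(block_id)
        elif len(live) < target_replication:
            report["under_replicated"].append(block_id)
        else:
            report["healthy"].append(block_id)
    return report
```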
- What do you understand by DataNode?
Answer: A DataNode is a slave node. It stores the actual data blocks and reports to the NameNode through periodic heartbeats and block reports.
- Mention the functions of NameNode?
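Answer: The NameNode holds the file system metadata: the directory tree and the mapping of files to blocks and of blocks to DataNodes. It acts as the master that manages the slave DataNodes.
- How is DataNode failure tackled using NameNode?
Answer: The NameNode is responsible for data replication and keeps track of the replicas held by each DataNode. When a DataNode stops sending heartbeats, the NameNode marks it as dead and re-replicates its blocks from the surviving copies onto other DataNodes.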
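The bookkeeping behind this can be sketched in a few lines of Python. This is a toy model under stated assumptions, not NameNode code: the function name is hypothetical, a failure is assumed to have been detected already, and new replica targets are simply taken in list order.

```python
def handle_datanode_failure(block_map, failed, live_nodes, replication=3):
    """Return a new block map after re-replicating blocks of a failed node.

    block_map: dict mapping block id -> list of nodes holding a replica.
    live_nodes: list of nodes that can accept new replicas.
    """
    new_map = {}
    for block_id, holders in block_map.items():
        survivors = [node for node in holders if node != failed]
        if not survivors:
            raise RuntimeError(f"block {block_id} lost: no surviving replica")
        # Nodes that are alive and do not already hold this block.
        candidates = [n for n in live_nodes if n not in survivors and n != failed]
        missing = max(replication - len(survivors), 0)
        new_map[block_id] = survivors + candidates[:missing]
    return new_map
```

As long as one replica of a block survives, it can be copied to another node, which is why losing a single DataNode does not lose data.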
- Is there any problem in using small files in Hadoop?
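Answer: Yes. Hadoop is not made for small files. The default block size for HDFS is 128 MB. The NameNode keeps every file and block as an object in memory, so a large number of small files strains the NameNode. In addition, each small file typically gets its own map task, which adds scheduling overhead and makes the program slow.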
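A quick calculation shows why many small files are costly. The sketch below counts the file and block entries that have to be tracked for a given set of files; the function name is hypothetical and the model is simplified (it ignores directories and assumes no file is empty).

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size in bytes


def namespace_objects(file_sizes, block_size=BLOCK_SIZE):
    """Count file entries plus block entries for the given file sizes."""
    # Ceiling division without floating point: each file needs this many blocks.
    blocks = sum(-(-size // block_size) for size in file_sizes)
    return len(file_sizes) + blocks
```

For the same 10 GB of data, a single file needs 81 entries (1 file and 80 blocks), while 10,240 files of 1 MB each need 20,480 entries, and roughly as many map tasks as there are files.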
- How can you resolve the issue of small files in Hadoop?
Answer: HAR (Hadoop Archive) was built on top of HDFS to tackle small files. It packs many small files into a single archive, which reduces the number of files the NameNode has to track.
- What are the salient features of Pseudo Mode?
Answer: In pseudo-distributed mode all Hadoop daemons run on a single machine, each in its own Java process, so one machine simulates a cluster. It is meant for testing: it lets you see how a program will behave in a distributed setup before deploying it to a real cluster.
- How can one achieve security in Hadoop?
Answer: Kerberos is the de facto standard for authentication in Hadoop.
- State any two limitations associated with Hadoop?
Answer: Two limitations of Hadoop are:
- Does not handle small files well.
- Processing speed is low due to heavy map and reduce operations.
- Mention the role of Job Tracker in Hadoop.
Answer: In Hadoop 1, the JobTracker accepts jobs, manages cluster resources and schedules tasks on the TaskTrackers. It balances the load across nodes and reassigns failed tasks, so that no node becomes the bottleneck.
- How can one debug Hadoop code?
Answer: Hadoop logs every step. By default, the logs can be found in the logs directory under the installation directory.
- How does Reduce Side Join differ from Map Side Join?
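Answer: A map-side join needs structured input: both data sets must be sorted and partitioned in the same way on the join key (or one of them must be small enough to fit in memory). A reduce-side join has no such requirement, because the shuffle phase brings records with the same key together, but it is usually slower for the same reason.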
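The contrast can be illustrated with two small Python functions. This is a sketch of the two strategies on in-memory lists, not Hadoop code, and the function names are hypothetical. The map-side variant is written as a single-pass merge join and therefore assumes both inputs are already sorted by key; the reduce-side variant accepts input in any order because it groups records by key first, which stands in for the shuffle.

```python
from collections import defaultdict


def reduce_side_join(left, right):
    """Group both inputs by key (the 'shuffle'), then join within each group."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return sorted(
        (key, lv, rv)
        for key, (lvs, rvs) in groups.items()
        for lv in lvs
        for rv in rvs
    )


def map_side_merge_join(left, right):
    """Inner join of two lists that are ALREADY sorted by key, in one pass."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side and emit all pairs.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out
```

Both functions return the same joined rows; the difference is that the merge join never has to regroup the data, which is what makes a map-side join cheaper when its precondition holds.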
- Mention one difference between Input Split and HDFS Block?
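Answer: An Input Split is a logical division of the data, computed when a job runs to decide what each mapper processes; it is not stored permanently. An HDFS block is a physical division of the data that is stored permanently on disk.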
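Because splits are computed per job, their size can differ from the block size. For file-based input, Hadoop derives the split size as max(minSize, min(maxSize, blockSize)). The Python sketch below applies that rule; the function names are hypothetical, and the sketch ignores finer details such as the tolerance Hadoop allows on the final split and record boundaries.

```python
def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size rule for file-based input: max(min, min(max, block))."""
    return max(min_size, min(max_size, block_size))


def input_splits(file_size, split_size):
    """Return (offset, length) pairs that logically cover the file."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits
```

With the defaults, a 300 MB file stored in 128 MB blocks yields three splits, one per block. Capping the maximum split size at 64 MB yields five splits over the very same blocks, which shows that the split is only a view over the stored data.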
- What do you know about rack-aware replica placement policy?
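Answer: A node’s physical location matters when replicas are allocated. The rack-aware replica placement policy takes the rack of each node into account and spreads the replicas of a block across racks, so that the data survives the failure of a whole rack while cross-rack traffic stays limited.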
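With the default replication factor of three, HDFS places the first replica on the writer's node, the second on a node in a different rack, and the third on another node in that same remote rack. The Python sketch below follows that rule; the function name and topology are hypothetical, and where HDFS chooses among eligible nodes at random, this sketch simply takes the first one in sorted order so the result is predictable.

```python
def place_replicas(writer, topology):
    """Pick three replica nodes. topology: dict mapping node -> rack name."""
    local_rack = topology[writer]
    remote = sorted(n for n, rack in topology.items() if rack != local_rack)
    if not remote:
        raise ValueError("need at least two racks")
    second = remote[0]
    # Third replica: a different node on the same rack as the second replica.
    peers = [n for n in remote if topology[n] == topology[second] and n != second]
    if not peers:
        raise ValueError("remote rack needs at least two nodes")
    return [writer, second, peers[0]]
```

The three replicas end up on two racks, so losing an entire rack still leaves at least one copy, while only one of the copies has to cross between racks.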
- What is the function of a DataNode block scanner?
Answer: The block scanner runs on each DataNode. It periodically verifies the checksums of the blocks stored there and reports any corrupted blocks to the NameNode.
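Corruption is detected by comparing a checksum computed now with the one recorded when the block was written. The Python sketch below shows the principle only. The function names are hypothetical, and it uses zlib.crc32 over a whole block as a stand-in, whereas HDFS keeps its own checksums for small chunks of each block.

```python
import zlib


def checksum(data: bytes) -> int:
    """Stand-in checksum for this sketch (CRC-32 over the whole block)."""
    return zlib.crc32(data)


def scan_blocks(blocks, expected):
    """Return the ids of blocks whose current checksum differs from the record.

    blocks: dict mapping block id -> bytes currently on disk.
    expected: dict mapping block id -> checksum recorded at write time.
    """
    return sorted(
        block_id
        for block_id, data in blocks.items()
        if checksum(data) != expected[block_id]
    )
```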
- What do you understand by MapReduce?
Answer: MapReduce is the processing model at the heart of Hadoop. It splits the input into appropriately sized chunks, processes them in parallel on different nodes in a map phase, and then combines the intermediate results in a reduce phase.
- Mention two salient features of MapReduce.
Answer: Two salient features of MapReduce are:
- A robust and tested architecture for data distribution and parallelization.
- Achieves good load balancing and scalability.
- List the components of the MapReduce framework.
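The classic illustration of MapReduce is a word count. The Python sketch below runs the three stages (map, shuffle, reduce) in a single process; in Hadoop the map and reduce calls would run on different nodes. All function names here are chosen for illustration only.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Each line can be mapped independently and each key can be reduced independently, which is what allows the work to be spread over many machines.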
Answer: There are four components of the MapReduce framework:
- What is InputFormat?
Answer: InputFormat defines how the input of an MR job is divided into input splits and how records are read from each split.
- When do we use jps command in Hadoop?
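Answer: jps lists the Java processes running on a machine. It is used to check whether the Hadoop daemons, such as NameNode, DataNode, ResourceManager and NodeManager, are up and running.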
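jps prints one line per Java process, with the process ID followed by the class name, so its output is easy to check from a script. The Python sketch below parses such output and reports which daemons are missing. The function names, the set of expected daemons and the sample output are assumptions of this sketch and will differ depending on the node and the Hadoop version.

```python
# Daemons this sketch expects on a single-node setup (an assumption).
EXPECTED_DAEMONS = {"NameNode", "DataNode", "ResourceManager", "NodeManager"}


def running_daemons(jps_output):
    """Extract process names from jps-style lines of the form '<pid> <name>'."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            names.add(parts[1].strip())
    return names


def missing_daemons(jps_output, expected=EXPECTED_DAEMONS):
    """Return the expected daemons that do not appear in the jps output."""
    return sorted(set(expected) - running_daemons(jps_output))
```

On a live machine the text could come from subprocess.run(["jps"], capture_output=True, text=True).stdout, provided jps is on the PATH.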
- What is meant by Speculative Execution?
Answer: Not all machines in a cluster might be the same (or give equal performance). When a task runs noticeably slower than expected, Hadoop launches a duplicate copy of that task on another machine. Whichever copy finishes first is used, and the other one is killed.
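The benefit can be estimated with simple arithmetic. The Python sketch below is a toy timing model, not a scheduler: the function names, speeds and launch time are hypothetical, and it assumes a backup copy of the task is started at a fixed point in time if the original has not finished by then.

```python
def finish_time(work, speed, start=0.0):
    """Time at which an attempt finishes: start + work / speed."""
    return start + work / speed


def run_with_speculation(work, primary_speed, backup_speed, launch_backup_at):
    """Return (winner, completion_time) for a task with one possible backup."""
    primary = finish_time(work, primary_speed)
    if primary <= launch_backup_at:
        # The original attempt finished before a backup was ever launched.
        return "primary", primary
    backup = finish_time(work, backup_speed, start=launch_backup_at)
    if primary <= backup:
        return "primary", primary
    return "backup", backup
```

For 100 units of work, a slow node at speed 1 would need 100 time units. If a backup starts at time 20 on a node with speed 10, the task completes at time 30 instead.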
- Mention the steps involved in the NameNode recovery process?
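Answer: First, a new NameNode is started from the saved file system metadata (FsImage). Next, the DataNodes and clients are configured to acknowledge the new NameNode. Finally, once the new NameNode has loaded the last checkpoint and received enough block reports from the DataNodes, it starts serving clients.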
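The last step, waiting for block reports before serving clients, can be pictured as a simple threshold check. The Python sketch below is an illustration only: the function name is hypothetical, and the threshold of 0.999 is an assumed value for this sketch rather than a statement about any particular cluster's configuration.

```python
def can_serve_clients(known_blocks, reported_blocks, threshold=0.999):
    """Decide whether enough known blocks have been reported by DataNodes.

    known_blocks: set of block ids listed in the loaded metadata.
    reported_blocks: set of block ids seen in block reports so far.
    threshold: required fraction of known blocks (assumed value).
    """
    if not known_blocks:
        return True  # an empty namespace has nothing to wait for
    reported = len(known_blocks & reported_blocks)
    return reported / len(known_blocks) >= threshold
```

Until the check passes, the NameNode knows which blocks should exist but not yet where they are, so it cannot safely answer read requests.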