Hadoop interview questions and answers - Page 1

1. What is Hadoop?

• Apache Hadoop is an open-source software framework that supports data-intensive distributed applications.

• The core Hadoop platform consists of the Hadoop kernel (Hadoop Common), the MapReduce component, and HDFS (the Hadoop Distributed File System).

• Hadoop is written in the Java programming language and is a top-level Apache project being built and used by a global community of contributors.

• Hadoop is the best-known technology used for Big Data.

• Two higher-level languages are closely associated with Hadoop: Pig Latin (Apache Pig) and HiveQL (Apache Hive). Both run on top of Hadoop rather than being part of its core.

• In a Hadoop system, data is distributed across the nodes of a cluster (potentially thousands of them) and processed in parallel.

• Hadoop deals with the complexities of high-volume, high-velocity, and high-variety data.

• Hadoop is oriented toward batch processing, and it handles that workload well.

• Hadoop can store petabytes of data reliably.

• Data remains accessible even if a machine breaks down or drops off the network, because HDFS keeps replicas of each data block on several nodes.

• One can use MapReduce programs to access and manipulate the data. The developer need not worry about where the data is stored; they can reference the data through a single view provided by the master node (the NameNode in HDFS), which stores the metadata for all files stored across the cluster.
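The last two points (replicated storage and a single metadata view held by the master node) can be illustrated with a toy model. This is only a sketch in plain Python, not HDFS code: `ToyCluster`, the 4-byte block size, the round-robin placement, and the node names are all hypothetical choices made for the illustration.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4    # bytes per block; tiny on purpose (an assumption for the demo)
REPLICATION = 3   # number of copies kept of every block (assumed value)


@dataclass
class ToyCluster:
    """Toy stand-in for a master node's metadata plus worker-node storage."""
    node_names: list
    # Master-side metadata: file name -> list of (block_id, [node names]).
    metadata: dict = field(default_factory=dict)
    # Worker-side storage: node name -> {block_id: bytes}.
    storage: dict = field(default_factory=dict)
    # Names of nodes that are currently down.
    failed: set = field(default_factory=set)
    _next_block: int = 0

    def __post_init__(self):
        if len(self.node_names) < REPLICATION:
            raise ValueError("need at least REPLICATION nodes")
        for name in self.node_names:
            self.storage.setdefault(name, {})

    def write(self, filename, data):
        """Split data into blocks and store each block on REPLICATION nodes."""
        blocks = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block_id = self._next_block
            self._next_block += 1
            # Simple round-robin placement onto distinct nodes.
            targets = [
                self.node_names[(block_id + i) % len(self.node_names)]
                for i in range(REPLICATION)
            ]
            for node in targets:
                self.storage[node][block_id] = data[offset:offset + BLOCK_SIZE]
            blocks.append((block_id, targets))
        self.metadata[filename] = blocks

    def read(self, filename):
        """Reassemble a file by name, using any live replica of each block."""
        out = b""
        for block_id, nodes in self.metadata[filename]:
            live = [n for n in nodes if n not in self.failed]
            if not live:
                raise IOError(f"block {block_id} has no live replica")
            out += self.storage[live[0]][block_id]
        return out


cluster = ToyCluster(["n1", "n2", "n3", "n4"])
cluster.write("log.txt", b"hello hadoop")
cluster.failed.add("n1")           # simulate a machine dropping off the network
print(cluster.read("log.txt"))     # the caller only ever names the file
```

The caller never specifies which node holds which block; the lookup goes through the metadata, and a read succeeds as long as at least one replica of every block is on a live node.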

2. What is Big Data?

Big Data is data that is large in quantity, captured at a rapid rate, and structured, unstructured, or some combination of the two. Such data is difficult to capture, mine, and manage using traditional methods; Big Data technologies are designed to handle it. There is so much hype in this space that the definition of Big Data alone could sustain an extended debate.

Big Data technology is not restricted to large volumes. As of 2012, large clusters were in the 100-petabyte range.

Traditional relational databases, like Informix and DB2, provide proven solutions for structured data. Through extensibility features, they can also manage unstructured data. Hadoop brings new and more accessible programming techniques for working on massive data stores that hold both structured and unstructured data.

3. Advantages of Hadoop

Bringing compute and storage together on commodity hardware: The result is blazing speed at low cost.

Price performance: The Hadoop big data technology provides significant cost savings (think a factor of approximately 10) with significant performance improvements (again, think factor of 10). Your mileage may vary. If the existing technology can be so dramatically trounced, it is worth examining if Hadoop can complement or replace aspects of your current architecture.

Linear scalability: Every parallel technology makes claims about scaling. Hadoop's scalability is genuine: at the time of writing, the latest release was expanding the limit on the number of nodes to beyond 4,000.

Full access to unstructured data: Combining a highly scalable data store with a good parallel programming model has been a challenge for the industry for some time. Hadoop's programming model, MapReduce, does not solve all problems, but it is a strong solution for many tasks.
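To make the MapReduce programming model concrete, here is a minimal single-process sketch of the classic word-count job in plain Python. It does not use the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and the in-memory shuffle stands in for the grouping a real cluster performs between the map and reduce steps.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data", "Big cluster"]))
```

Because each call to `map_phase` looks at one line and each call to `reduce_phase` looks at one key, both steps could in principle run on different machines at once, which is the property the model relies on for parallelism.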

4. Definition of Big data

According to Gartner, Big Data can be defined as high-volume, high-velocity, and high-variety information that requires innovative and cost-effective forms of information processing for enhanced decision making.

5. How does Big Data differ from a database?

Datasets that exceed a database's ability to store, analyze, and manage them can be described as Big Data. Big Data technology extracts the required information from very large volumes of data, whereas a traditional database is limited by its storage capacity.