A Review on Storage and Large-Scale Processing of Data-Sets Using Map Reduce, YARN, SPARK, AVRO, MongoDB
7 Pages Posted: 14 Jun 2019
Date Written: April 4, 2019
This paper focuses on the Hadoop Distributed File System (HDFS), a file system designed to store very large data sets reliably and to stream them to user applications at high bandwidth. In a large cluster, thousands of servers both host attached storage and execute application tasks. The Hive data warehouse facilitates querying and managing large data sets that reside in distributed storage. MapReduce moves computation to the data stored in HDFS: tasks are processed on the physical node where the data resides, which reduces network I/O, keeps input and output on local disk, and provides high aggregate read/write bandwidth. HBase is a column-oriented database management system that runs on top of HDFS. Sqoop is a well-known tool designed to transfer data between Hadoop and relational databases. Pig is used to analyze large data sets and provides a high-level language for expressing data analysis programs. Avro, on the other hand, provides an easy way to represent complex data structures within a Hadoop MapReduce job. Spark is a cluster computing framework for data analytics that can run up to 100 times faster than Hadoop MapReduce for certain applications. A single MongoDB deployment can host many databases. YARN extends the power of Hadoop to both incumbent and newer technologies available in the data center.
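As an illustration of the MapReduce model described above, the following is a minimal single-process sketch in Python, not Hadoop's actual API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample input are hypothetical; each inner list stands in for an HDFS input split that a real cluster would process on the node holding that data.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def run_job(splits):
    """Run map, shuffle, and reduce over a list of input splits.

    Assumption: each split is a list of text records. On a real cluster
    the map tasks would run in parallel on the nodes storing each split.
    """
    mapped = [
        pair
        for split in splits
        for record in split
        for pair in map_phase(record)
    ]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample data: two splits, as if stored on two different nodes.
splits = [
    ["hdfs stores blocks", "spark reads hdfs"],
    ["hive queries hdfs"],
]
counts = run_job(splits)
print(counts)
```

The sketch keeps the three phases as separate functions so the data flow is visible; in Hadoop itself the shuffle is handled by the framework and only the map and reduce logic is supplied by the user.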