Performance comparison by running benchmarks on Hadoop, Spark, and HAMR

Liu, Lu

Author(s)	Liu, Lu
Date Accessioned	2016-04-13T15:07:38Z
Date Available	2016-04-13T15:07:38Z
Publication Date	2015
Abstract	Today, Big Data is a hot topic in both industry and academia. Hadoop was developed as a solution for Big Data. It provides reliable, scalable, fault-tolerant, and efficient service for large-scale data processing based on HDFS and MapReduce. HDFS, the Hadoop Distributed File System, provides distributed storage for the system, and MapReduce provides distributed processing. However, MapReduce is not suitable for all classes of applications. An alternative that overcomes this limitation of Hadoop is a new class of in-memory runtime systems such as Spark, which is designed to support applications that reuse a working set of data across multiple parallel operations [31]. The weakness of Spark is that its performance is restricted by memory. HAMR is a new technology that runs faster than Hadoop and Spark with less memory and CPU consumption. At the time I started this thesis, CAPSL did not have a platform that gave students an environment for testing big data applications. The purpose of the thesis is not to perform extensive research but to construct a main ecosystem in which Hadoop and Spark run under the same working conditions. In addition, HAMR has been installed as a test platform in the research ecosystem. I also worked on a selection of big data benchmarks and ran preliminary tests in all three ecosystems. To stress different aspects of the three big data runtimes, we selected and ran the PageRank, WordCount, Sort, TeraSort, K-means, and Naive Bayes benchmarks on the Hadoop and Spark runtime systems, and ran PageRank and WordCount on the HAMR runtime system. We measured running time, maximum and average memory and CPU usage, and throughput to compare the performance differences among these platforms for the six benchmarks. As a result, we found that Spark has outstanding performance on machine learning applications, including K-means and Naive Bayes. For PageRank, Spark runs faster with small input sizes. Spark is faster on WordCount. For Sort and TeraSort, Spark runs faster with large input. However, Spark consumes more memory, and its performance is restricted by memory. HAMR is faster than Hadoop on both benchmarks, with improvements in CPU and memory usage.	en_US
Advisor	Gao, Guang R.
Degree	M.S.
Department	University of Delaware, Department of Electrical and Computer Engineering
Unique Identifier	946585835
URL	http://udspace.udel.edu/handle/19716/17628
Publisher	University of Delaware	en_US
URI	http://search.proquest.com/docview/1767375901?accountid=10457
dc.subject.lcsh	Big data -- Computer programs -- Testing.
Title	Performance comparison by running benchmarks on Hadoop, Spark, and HAMR	en_US
Type	Thesis	en_US

Files

Original bundle

Name: 2015_LiuLu_MS.pdf
Size: 3.61 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.22 KB
Format: Item-specific license agreed upon to submission

Collections

Master's Theses (Fall 2009 to Present)