Performance comparison by running benchmarks on Hadoop, Spark, and HAMR

Author(s): Liu, Lu
Date Accessioned: 2016-04-13T15:07:38Z
Date Available: 2016-04-13T15:07:38Z
Publication Date: 2015
Abstract: Today, Big Data is a hot topic in both industry and academia. Hadoop was developed as a solution for Big Data. It provides reliable, scalable, fault-tolerant, and efficient service for large-scale data processing, based on HDFS and MapReduce. HDFS, the Hadoop Distributed File System, provides distributed storage for the system, and MapReduce provides distributed processing. However, MapReduce is not suitable for all classes of applications. One alternative that overcomes this limitation of Hadoop is a new class of in-memory runtime systems such as Spark, which is designed to support applications that reuse a working set of data across multiple parallel operations [31]. The weakness of Spark is that its performance is restricted by available memory. HAMR is a new technology that runs faster than Hadoop and Spark with less memory and CPU consumption. At the time I started this thesis, CAPSL did not have a platform that gave students an environment for testing big data applications. The purpose of the thesis is not to perform extensive research but to construct a main ecosystem in which Hadoop and Spark run under the same working conditions. In addition, HAMR has been installed as a test platform in the research ecosystem. I also selected a set of big data benchmarks and ran preliminary tests on all three ecosystems. To stress different aspects of the three big data runtimes, we selected and ran the PageRank, WordCount, Sort, TeraSort, K-means, and Naive Bayes benchmarks on the Hadoop and Spark runtime systems, and ran PageRank and WordCount on the HAMR runtime system. We measured running time, maximum and average memory and CPU usage, and throughput to compare the performance differences among these platforms for the six benchmarks. As a result, we found that Spark has outstanding performance on machine learning applications, including K-means and Naive Bayes. For PageRank, Spark runs faster with small input sizes. Spark is faster on WordCount. For Sort and TeraSort, Spark runs faster with large inputs. However, Spark consumes more memory, and its performance is restricted by memory. HAMR is faster than Hadoop on both of the benchmarks run on it, with improvements in CPU and memory usage.
Advisor: Gao, Guang R.
Degree: M.S.
Department: University of Delaware, Department of Electrical and Computer Engineering
Unique Identifier: 946585835
URL: http://udspace.udel.edu/handle/19716/17628
Publisher: University of Delaware
URI: http://search.proquest.com/docview/1767375901?accountid=10457
dc.subject.lcsh: Big data -- Computer programs -- Testing.
Title: Performance comparison by running benchmarks on Hadoop, Spark, and HAMR
Type: Thesis
Files
Original bundle
Name: 2015_LiuLu_MS.pdf
Size: 3.61 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 2.22 KB
Format: Item-specific license agreed upon to submission