Benchmarking BigData
Purpose:
The purpose of this blog is to explain the different types of benchmark tools available for BigData components. We gave a talk on BigData benchmarking at Linaro Connect Las Vegas in 2016. This is my effort to collect that material in one place, with more information.
We have to remember that all the BigData components/benchmarks were developed
- Keeping the x86 architecture in mind.
- So, in the first place, we should make sure that all the relevant benchmark tools compile and run on AArch64.
- Then we should go ahead and try to optimize them for AArch64.
Different types of benchmarks and standards
- Micro benchmarks: evaluate specific lower-level system operations
- E.g. HiBench, HDFS DFSIO, AMPLab Big Data Benchmark, CALDA, Hadoop workload examples (sort, grep, wordcount, TeraSort, GridMix, PigMix)
- Functional/component benchmarks: specific to a low-level function
- E.g. basic SQL: individual SQL operations like select, project, join, order-by
- Application-level benchmarks
- E.g. BigBench, SparkBench
The tables below summarize the different benchmark efforts.
Benchmark Efforts - Microbenchmarks

| Benchmark | Workloads | Software Stacks | Metrics |
|---|---|---|---|
| DFSIO | Generate, read, write, append, and remove data for MapReduce jobs | Hadoop | Execution time, throughput |
| HiBench | Sort, WordCount, TeraSort, PageRank, K-means, Bayes classification, Index | Hadoop and Hive | Execution time, throughput, resource utilization |
| AMPLab benchmark | Part of CALDA workloads (scan, aggregate and join) and PageRank | Hive, Tez | Execution time |
| CALDA | Load, scan, select, aggregate and join data, count URL links | Hadoop, Hive | Execution time |
Benchmark Efforts - TPC

| Benchmark | Workloads | Software Stacks | Metrics |
|---|---|---|---|
| TPCx-HS | HSGen, HSDataCheck, HSSort and HSValidate | Hadoop | Performance, price and energy |
| TPC-H | Data warehousing operations | Hive, Pig | Execution time, throughput |
| TPC-DS | Decision support benchmark: data loading, queries and maintenance | Hive, Pig | Execution time, throughput |
Benchmark Efforts - Synthetic

| Benchmark | Workloads | Software Stacks | Metrics |
|---|---|---|---|
| SWIM | Synthetic user-generated MapReduce jobs of reading, writing, shuffling and sorting | Hadoop | Multiple metrics |
| GridMix | Synthetic and basic operations to stress-test the job scheduler and compression/decompression | Hadoop | Memory, execution time, throughput |
| PigMix | 17 Pig-specific queries | Hadoop, Pig | Execution time |
| MRBench | MapReduce benchmark complementary to TeraSort; data warehouse operations with 22 TPC-H queries | Hadoop | Execution time |
| NNBench | Load testing the NameNode and HDFS I/O with small payloads | Hadoop | I/O |
| SparkBench | CPU-, memory-, shuffle- and I/O-intensive workloads; machine learning, streaming, graph computation and SQL workloads | Spark | Execution time, data process rate |
| BigBench | Interactive-based queries based on synthetic data | Hadoop, Spark | Execution time |
Benchmark Efforts - BigDataBench

| Benchmark | Workloads | Software Stacks | Metrics |
|---|---|---|---|
| BigDataBench | 1. Micro benchmarks (sort, grep, WordCount); 2. Search engine workloads (Index, PageRank); 3. Social network workloads (connected components (CC), K-means and BFS); 4. E-commerce site workloads (relational database queries (select, aggregate and join), collaborative filtering (CF) and Naive Bayes); 5. Multimedia analytics workloads (speech recognition, ray tracing, image segmentation, face detection); 6. Bioinformatics workloads | Hadoop, DBMSs, NoSQL systems, Hive, Impala, HBase, MPI, Libc, and other real-time analytics systems | Throughput; memory; CPU (MIPS, MPKI - misses per kilo-instructions) |
Let's go through each of the benchmarks in detail.
Hadoop benchmark and test tools:
The Hadoop source comes with a number of benchmarks. TestDFSIO, nnbench and mrbench are in the hadoop-*test*.jar file, and TeraGen, TeraSort and TeraValidate are in the hadoop-*examples*.jar file in the Hadoop source code.
You can check this using the commands:
$ cd /usr/local/hadoop
$ bin/hadoop jar hadoop-*test*.jar
$ bin/hadoop jar hadoop-*examples*.jar
While running the benchmarks you might want to use the time command, which measures the elapsed time. This saves you the hassle of navigating to the Hadoop JobTracker interface. The relevant metric is the real value in the first row.
$ time hadoop jar hadoop-*examples*.jar ...
[...]
real 9m15.510s
user 0m7.075s
sys 0m0.584s
TeraGen, TeraSort and TeraValidate
This is the most well-known Hadoop benchmark. TeraSort tries to sort the data as fast as possible, exercising both the HDFS and MapReduce layers of a Hadoop cluster. The benchmark consists of 3 steps: generate input via TeraGen, run TeraSort on the input data, and validate the sorted output data via TeraValidate. We have a wiki page which explains this test suite; you can refer to the Hadoop Build Install And Run Guide.
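As an illustration, a typical three-step run looks like the following; the row count and HDFS paths are placeholders you would adapt to your cluster:
# generate 10 million 100-byte rows (~1 GB) of input data
$ hadoop jar hadoop-*examples*.jar teragen 10000000 /benchmarks/terasort-input
# sort the generated data
$ hadoop jar hadoop-*examples*.jar terasort /benchmarks/terasort-input /benchmarks/terasort-output
# check that the output is globally sorted
$ hadoop jar hadoop-*examples*.jar teravalidate /benchmarks/terasort-output /benchmarks/terasort-validate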
TestDFSIO
It is part of the hadoop-mapreduce-client-jobclient.jar file. It stress-tests the I/O performance (throughput and latency) of a clustered setup, and will shake out the hardware, OS and Hadoop setup of your cluster machines (NameNode/DataNode). The tests are run as a MapReduce job using a 1:1 mapping (one map per file). This test is helpful to discover performance bottlenecks in your network. The write test should be run before the read test. You can use the -write switch for write tests and -read for read tests. The results are stored by default in TestDFSIO_results.log; you can use the -resFile switch to choose a different file name.
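For example, a minimal write-then-read cycle looks like this; the file count and per-file size (in MB) are placeholders:
# write 10 files of 1000 MB each
$ hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
# read the same 10 files back
$ hadoop jar hadoop-*test*.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
# clean up the test data afterwards
$ hadoop jar hadoop-*test*.jar TestDFSIO -clean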
MR (MapReduce) Benchmark: MRBench
The test loops a small job a number of times and checks whether small jobs are responsive and run efficiently on your cluster. It puts the focus on the MapReduce layer, as its impact on the HDFS layer is very limited. The issue with multiple parallel MRBench runs has been resolved, hence you can run it from different boxes.
Test command to run 50 small test jobs
$ hadoop jar hadoop-*test*.jar mrbench -numRuns 50
Exemplary output, which shows the job finished in about 31 seconds:
DataLines Maps Reduces AvgTime (milliseconds)
1 2 1 31414
NN (NameNode) Benchmark for HDFS
This test is useful for load testing the NameNode hardware and configuration. It generates a lot of HDFS-related requests with normally very small payloads, putting high HDFS management stress on the NameNode. The test can be run simultaneously from several machines, e.g. from a set of DataNode boxes, in order to hit the NameNode from multiple locations at the same time.
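A typical invocation (taken from the Michael Noll article listed in the references) creates and writes many small files; the map/reduce counts and file number are values you would tune:
# create_write: generate 1000 small files with 12 mappers and 6 reducers
$ hadoop jar hadoop-*test*.jar nnbench -operation create_write \
    -maps 12 -reduces 6 -blockSize 1 -bytesToWrite 0 -numberOfFiles 1000 \
    -replicationFactorPerFile 3 -readFileAfterOpen true \
    -baseDir /benchmarks/NNBench-$(hostname -s)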
The TPC is a non-profit, vendor-neutral organization with a reputation for providing the most credible performance results to the industry; it plays the role of "consumer reports" for the computing industry. It offers a solid foundation for complete system-level performance, together with a methodology for calculating total system price and price-performance, and a methodology for measuring the energy efficiency of a complete system.
TPC Benchmark
- TPCx-HS
TPC Express benchmarks are easy to implement, run and publish, and are less expensive. The test sponsor is required to use the TPCx-HS kit as provided. The vendor may choose an independent audit or a peer audit, for which a 60-day review/challenge window applies (as per TPC policy). Results are approved by a super majority of the TPC General Council, and all publications must follow the TPC Fair Use Policy.
- TPC-H
- TPC-H benchmark focuses on ad-hoc queries
The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions. The performance metric reported by TPC-H is called the TPC-H Composite Query-per-Hour Performance Metric (QphH@Size), and reflects multiple aspects of the capability of the system to process queries. These aspects include the selected database size against which the queries are executed, the query processing power when queries are submitted by a single stream, and the query throughput when queries are submitted by multiple concurrent users. The TPC-H Price/Performance metric is expressed as $/QphH@Size.
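For reference, the composite metric is derived from the single-stream power test and the multi-stream throughput test as their geometric mean:
QphH@Size = sqrt( Power@Size × Throughput@Size )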
- TPC-DS
- This is the standard benchmark for decision support
The TPC Benchmark DS (TPC-DS) is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general purpose decision support system. A benchmark result measures query response time in single user mode, query throughput in multi user mode and data maintenance performance for a given hardware, operating system, and data processing system configuration under a controlled, complex, multi-user decision support workload. The purpose of TPC benchmarks is to provide relevant, objective performance data to industry users. TPC-DS Version 2 enables emerging technologies, such as Big Data systems, to execute the benchmark.
- TPC-C
- TPC-C is an On-Line Transaction Processing Benchmark
Approved in July of 1992, TPC Benchmark C is an on-line transaction processing (OLTP) benchmark. TPC-C is more complex than previous OLTP benchmarks such as TPC-A because of its multiple transaction types, more complex database and overall execution structure. TPC-C involves a mix of five concurrent transactions of different types and complexity, either executed on-line or queued for deferred execution. The database comprises nine types of tables with a wide range of record and population sizes. TPC-C is measured in transactions per minute (tpmC). While the benchmark portrays the activity of a wholesale supplier, TPC-C is not limited to the activity of any particular business segment, but rather represents any industry that must manage, sell, or distribute a product or service.
TPC vs SPEC models
Here is our comparison between the TPC and SPEC benchmark models:
| TPC model | SPEC model |
|---|---|
| Specification based | Kit based |
| Performance, price, energy in one benchmark | Performance and energy in separate benchmarks |
| End-to-end | Server centric |
| Multiple tests (ACID, Load) | Single test |
| Independent review, full disclosure | Summary disclosure |
| TPC Technology Conference | SPEC Research Group, ICPE (International Conference on Performance Engineering) |
BigBench
BigBench is a joint effort with partners in industry and academia to create a comprehensive and standardized BigData benchmark. One reference reading about BigBench is "BigBench: Toward An Industry Standard Benchmark for Big Data Analytics". BigBench builds upon and borrows elements from existing benchmarking efforts (such as TPCx-HS, GridMix, PigMix, HiBench, Big Data Benchmark, YCSB and TPC-DS). BigBench is a specification-based benchmark with an open-source reference implementation kit; as a specification-based benchmark it is technology-agnostic and provides the necessary formalism and flexibility to support multiple implementations. It is focused on execution time calculation and consists of around 30 queries/workloads (10 of them are from TPC). The drawback is that it is a structured-data-intensive benchmark.
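A sketch of driving the open-source reference kit (the repository is linked in the references); the runBenchmark driver and the -f scale-factor flag are assumptions taken from the kit's README, and the value is a placeholder:
# fetch the reference implementation
$ git clone https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench.git
$ cd Big-Data-Benchmark-for-Big-Bench
# run the full benchmark at scale factor 1 (~1 GB of generated data)
$ ./bin/bigBench runBenchmark -f 1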
Spark Bench for Apache Spark
We are able to build it on ARM64. The setup completed for a single node, but the run scripts are failing. When the spark-bench examples are run, a KILL signal is observed which terminates all workers. This is still under investigation, as there are no useful logs to debug with; the lack of a proper error description and of documentation is a challenge. A ticket has already been filed on the spark-bench GitHub, which is unresolved.
Hive TestBench
It is based on the TPC-H and TPC-DS benchmarks. You can experiment with Apache Hive at any data scale. The benchmark contains a data generator and a set of queries. This is very useful to test basic Hive performance on large data sets. We have a wiki page for Hive TestBench.
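A minimal sketch of a run, assuming the tpcds-build.sh/tpcds-setup.sh script names used by the hive-testbench kit; the scale factor (roughly the data size in GB) is a placeholder:
# build the TPC-DS data generator
$ ./tpcds-build.sh
# generate and load a 10 GB data set into Hive
$ ./tpcds-setup.sh 10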
GridMix
This is a stripped-down version of common MapReduce jobs (sorting text data and SequenceFiles). It is a tool for benchmarking Hadoop clusters: a trace-based benchmark for MapReduce that evaluates MapReduce and HDFS performance. It submits a mix of synthetic jobs, modeling a profile mined from production loads, and attempts to model the resource profiles of production jobs in order to identify bottlenecks.
Basic command line usage:
$ hadoop gridmix [-generate <size>] [-users <users-list>] <iopath> <trace>
Con: it is challenging to explore the performance impact of combining or separating workloads, e.g. through consolidating from many clusters.
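For illustration, a run might look like the following; the paths are placeholders, and the job trace would normally be mined from your JobTracker history logs with a tool such as Rumen:
# generate 1 GB of synthetic input under the I/O path and replay the given trace
$ hadoop gridmix -generate 1g /benchmarks/gridmix /path/to/rumen-trace.json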
PigMix
PigMix is a set of queries used to test Pig performance. There are queries that test latency (how long does it take to run this query?) and queries that test scalability (how many fields or records can Pig handle before it fails?).
Usage: run the below commands from the Pig home directory:
$ ant -Dharness.hadoop.home=$HADOOP_HOME pigmix-deploy    (generate the test dataset)
$ ant -Dharness.hadoop.home=$HADOOP_HOME pigmix    (run the PigMix benchmark)
The documentation can be found at Apache Pig: https://pig.apache.org/docs/
SWIM (Statistical Workload Injector for MapReduce)
This benchmark enables rigorous performance measurement of MapReduce systems. It contains suites of workloads of thousands of jobs, with complex data, arrival and computation patterns, and informs highly targeted, workload-specific optimizations. This tool is highly recommended for MapReduce operators. Performance measurement: https://github.com/SWIMProjectUCB/SWIM/wiki/Performance-measurement-by-executing-synthetic-or-historical-workloads
AMPLab Big Data Benchmark
This BigData benchmark from AMPLab, UC Berkeley provides quantitative and qualitative comparisons of five systems:
- Redshift – a hosted MPP database offered by Amazon.com based on the ParAccel data warehouse
- Hive – a Hadoop-based data warehousing system
- Shark – a Hive-compatible SQL engine which runs on top of the Spark computing framework
- Impala – a Hive-compatible SQL engine with its own MPP-like execution engine
- Stinger/Tez – Tez is a next generation Hadoop execution engine currently in development
This benchmark measures response time on a handful of relational queries (scans, aggregations, joins, and UDFs) across different data sizes.
BigDataBench
This is a specification-based benchmark with two key components: a data model specification and a workload/query specification. It is a comprehensive end-to-end big data benchmark suite; the BigDataBench handbook is linked in the references.
BigDataBench is a benchmark suite for scale-out workloads, different from SPEC CPU (sequential workloads), and PARSEC (multithreaded workloads). Currently, it simulates five typical and important big data applications: search engine, social network, e-commerce, multimedia data analytics, and bioinformatics.
Currently, BigDataBench includes 15 real-world data sets, and 34 big data workloads.
HiBench
This benchmark test suite is for Hadoop. It contains tests across 4 different categories, with 10 workloads and 3 types. It is a solid benchmark, with metrics of time (sec) and throughput (bytes/sec).
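A minimal sketch of running one workload; the directory layout below follows recent HiBench releases and may differ in your version:
# prepare input data for the wordcount micro workload
$ bin/workloads/micro/wordcount/prepare/prepare.sh
# run the Hadoop flavour of the workload; results are summarized under report/
$ bin/workloads/micro/wordcount/hadoop/run.sh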
References
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-21.pdf
TeraSort, TestDFSIO, NNBench, MRBench
https://wiki.linaro.org/LEG/Engineering/BigData
https://wiki.linaro.org/LEG/Engineering/BigData/HadoopTuningGuide
https://wiki.linaro.org/LEG/Engineering/BigData/HadoopBuildInstallAndRunGuide
http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/
GridMix3, PigMix, HiBench, TPCx-HS, SWIM, AMPLab, BigBench
https://hadoop.apache.org/docs/current/hadoop-gridmix/GridMix.html
https://cwiki.apache.org/confluence/display/PIG/PigMix
https://wiki.linaro.org/LEG/Engineering/BigData/HiBench
https://wiki.linaro.org/LEG/Engineering/BigData/TPCxHS
https://github.com/SWIMProjectUCB/SWIM/wiki
https://github.com/amplab
https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench
http://www.academia.edu/15636566/Handbook_of_BigDataBench_Version_3.1_A_Big_Data_Benchmark_Suite
Industry Standard benchmarks
TPC - Transaction Processing Performance Council http://www.tpc.org
SPEC - The Standard Performance Evaluation Corporation https://www.spec.org
CLDS - Center for Largescale Data System Research http://clds.sdsc.edu/bdbc