Benchmarking BigData

Purpose:

The purpose of this blog is try to explain about different types of benchmark tools available for BigData components. We did a talk on BigData benchmark Linaro Connect @LasVegas in 2016. This is one of my effort to collectively put into a one place with more information.

We have to remember that all the BigData/components/benchmarks are developed

Keeping in mind x86 architecture.

So in first place we should make sure that all the relevant benchmark tools compile and run it on AArch64.
Then we should go ahead and try to optimize the same for AArch64.

Different types of benchmarks and standards

Micro benchmarks: To evaluate specific lower-level, system operations

E.g. HiBench, HDFS DFSIO, AMP Lab Big Data Benchmark, CALDA, Hadoop Workload Examples (sort, grep, wordcount and Terasort, Gridmix, Pigmix)

Functional/Component benchmarks: Specific to low level function

E.g. Basic SQL: Individual SQL operations like select, project, join, Order-by..

Application level

Bigbench
Spark bench

The below table explains different types of benchmark

Benchmark Efforts - Microbenchmarks	Workloads	Software Stacks	Metrics
DFSIO	Generate, read, write, append, and remove data for MapReduce jobs	Hadoop	Execution Time, Throughput
HiBench	Sort, WordCount, TeraSort, PageRank, K-means, Bayes classification, Index	Hadoop and Hive	Execution Time, Throughput, resource utilization
AMPLab benchmark	Part of CALDA workloads (scan, aggregate and join) and PageRank	Hive, Tez	Execution Time
CALDA	Load, scan, select, aggregate and join data, count URL links	Hadoop, Hive	Execution Time

Benchmark Efforts - TPC	Workloads	Software Stacks	Metrics
TPCx-HS	HSGen, HSData, Check, HSSort and HSValidate	Hadoop	Performance, price and energy
TPC-H	Datawarehousing operations	Hive, Pig	Execution Time, Throughput
TPC-DS	Decision support benchmark Data loading, queries and maintenance	Hive, Pig	Execution Time, Throughput

Benchmark Efforts - Synthetic	Workloads	Software Stacks	Metrics
SWIM	Synthetic user generated MapReduce jobs of reading, writing, shuffling and sorting	Hadoop	Multiple metrics
GridMix	Synthetic and basic operations to stress test job scheduler and compression and decompression	Hadoop	Memory, Execution Time, Throughput
PigMix	17 Pig specific queries	Hadoop, Pig	Execution Time
MRBench	MapReduce benchmark as a complementary to TeraSort - Datawarehouse operations with 22 TPC-H queries	Hadoop	Execution Time
NNBench	Load testing namenode and HDFS I/O with small payloads	Hadoop	I/O
SparkBench	CPU, memory and shuffle and IO intensive workloads. Machine Learning, Streaming, Graph Computation and SQL Workloads	Spark	Execution Time, Data process rate
BigBench	Interactive-based queries based on synthetic data	Hadoop, Spark	Execution Time

Benchmark Efforts	Workloads	Software Stacks	Metrics
BigDataBench	1. Micro Benchmarks (sort, grep, WordCount); 2. Search engine workloads (index, PageRank); 3. Social network workloads (connected components (CC), K-means and BFS); 4. E-commerce site workloads (Relational database queries (select, aggregate and join), collaborative filtering (CF) and Naive Bayes; 5. Multimedia analytics workloads (Speech Recognition, Ray Tracing, Image Segmentation, Face Detection); 6. Bioinformatics workloads	Hadoop, DBMSs, NoSQL systems, Hive, Impala, Hbase, MPI, Libc, and other real-time analytics systems	Throughput, Memory, CPU (MIPS, MPKI - Misses per instruction)

Let's go through each of the benchmark in detail.

Hadoop benchmark and test tool:

The hadoop source comes with a number of bench marks. The TestDFSIO, nnbench, mrbench are in hadoop-*test*.jar file and the TeraGen, TeraSort, TeraValidate are in hadoop-*examples*.jar file in the source code of hadoop.

You can check it using the command

$ cd /usr/local/hadoop

$ bin/hadoop jar hadoop-*test*.jar

$ bin/hadoop jar hadoop-*examples*.jar

While running the benchmarks you might want to use time command which measure the elapsed time. This saves you the hassle of navigating to the hadoop JobTracker interface. The relevant metric is real value in the first row.

$ time hadoop jar hadoop-*examples*.jar ...

[...]

real 9m15.510s

user 0m7.075s

sys 0m0.584s

TeraGen, TeraSort and TeraValidate

This is a most well known Hadoop benchmark. The TeraSort is to sort the data as fast as possible. This test suite combines HDFS and mapreduce layers of a hadoop cluster. The TeraSort benchmark consists of 3 steps Generate input via TeraGen, Run TeraSort on input data and Validate sorted output data via TeraValidate. We have a wikipage which explains about this test suite. You can refer Hadoop Build Install And Run Guide

TestDFSIO

It is part of hadoop-mapreduce-client-jobclient.jar file. The Stress test I/O performance (throughput and latency) on a clustered setup. This test will shake out the hardware, OS and Hadoop setup on your cluster machines (NameNode/DataNode). The tests are run as a MapReduce job using 1:1 mapping (1 map / file). This test is helpful to discover performance bottlenecks in your network. The benchmark write test follow up with read test. You can use the switch case -write for write tests and -read for read tests. The results are stored by default in TestDFSIO_results.log. You can use following switch case -resFile to choose different file name.

MR(Map Reduce) Benchmark for MR

The test loops a small job in number of times. It checks whether small job runs are responsive and running efficiently on your cluster. It puts focus on MapReduce layer as its impact on the HDFS layer is very limited. The multiple parallel MRBench issue is resolved. Hence you can run it from different boxes.

Test command to run 50 small test jobs

$ hadoop jar hadoop-*test*.jar mrbench -numRuns 50

Exemplary output, which means in 31 sec the job finished

DataLines Maps Reduces AvgTime (milliseconds)

1 2 1 31414

NN (Name Node) Benchmark for HDFS

This test is useful for load testing the NameNode hardware & configuration. The benchmark test generates a lot of HDFS related requests with normally very small payloads. It puts a high HDFS management stress on the NameNode. The test can be simultaneously run from several machines e.g. from a set of DataNode boxes in order to hit the NameNode from multiple locations at the same time.

TPC (Transaction Processing Performance Council)

The TPC is a non-profit, vendor-neutral organization. The reputation of providing the most credible performance results to the industry. The TPC is a role of “consumer reports” for the computing industry. It is a solid foundation for complete system-level performance. The TPC is a methodology for calculating total-system-price and price-performance. This is a methodology for measuring energy efficiency of complete system

TPC Benchmark

TPCx-HS

We have a collaborate page TPCxHS The X: Express, H: Hadoop, S: Sort. The TPCx-HS kit contains TPCx-HS specification documentation, TPCx-HS User's guide documentation, Scripts to run benchmarks and Java code to execute the benchmark load. A valid run consists of 5 separate phases run sequentially with overlap in their execution The benchmark test consists of 2 runs (Run with lower and higher TPCx-HS Performance Metric). There is no configuration or tuning changes or reboot are allowed between the two runs.

TPC Express Benchmark Standard is easy to implement, run and publish, and less expensive. The test sponsor is required to use TPCx-Hs kit as it is provided. The vendor may choose an independent audit or peer audit which is 60 day review/challenge window apply (as per TPC policy). This is approved by super majority of the TPC General Council. All publications must follow the TPC Fair Use Policy.

TPC-H

TPC-H benchmark focuses on ad-hoc queries

The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions. The performance metric reported by TPC-H is called the TPC-H Composite Query-per-Hour Performance Metric (QphH@Size), and reflects multiple aspects of the capability of the system to process queries. These aspects include the selected database size against which the queries are executed, the query processing power when queries are submitted by a single stream, and the query throughput when queries are submitted by multiple concurrent users. The TPC-H Price/Performance metric is expressed as $/QphH@Size.

TPC-DS

This is the standard benchmark for decision support

The TPC Benchmark DS (TPC-DS) is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general purpose decision support system. A benchmark result measures query response time in single user mode, query throughput in multi user mode and data maintenance performance for a given hardware, operating system, and data processing system configuration under a controlled, complex, multi-user decision support workload. The purpose of TPC benchmarks is to provide relevant, objective performance data to industry users. TPC-DS Version 2 enables emerging technologies, such as Big Data systems, to execute the benchmark.

TPC-C

TPC-C is an On-Line Transaction Processing Benchmark

Approved in July of 1992, TPC Benchmark C is an on-line transaction processing (OLTP) benchmark. TPC-C is more complex than previous OLTP benchmarks such as TPC-A because of its multiple transaction types, more complex database and overall execution structure. TPC-C involves a mix of five concurrent transactions of different types and complexity either executed on-line or queued for deferred execution. The database is comprised of nine types of tables with a wide range of record and population sizes. TPC-C is measured in transactions per minute (tpmC). While the benchmark portrays the activity of a wholesale supplier, TPC-C is not limited to the activity of any particular business segment, but, rather represents any industry that must manage, sell, or distribute a product or service.

TPC vs SPEC models

Here is our comparison between TPC Vs SPEC model benchmark

TPC model	SPEC model
Specification based	Kit based
Performance, Price, energy in one benchmark	Performance and energy in separate benchmarks
End-to-End	Server centric
Multiple tests (ACID, Load)	Single test
Independent Review	Summary disclosure
Full disclosure	SPEC research group ICPE
TPC Technology conference	SPEC Research Group, ICPE (International Conference on Performance Engineering)

BigBench

BigBench is a joint effort with partners in industry and academia on creating a comprehensive and standardized BigData benchmark. One of the reference reading about BigBench Toward An Industry Standard Benchmark for BigData Analytics BigBench builds upon and borrows elements from existing benchmarking efforts (such as TPC-xHS, GridMix, PigMix, HiBench, Big Data Benchmark, YCSB and TPC-DS). BigBench is a specification-based benchmark with an open-source reference implementation kit. As a specification-based benchmark, it would be technology-agnostic and provide the necessary formalism and flexibility to support multiple implementations. It is focused around execution time calculation Consists of around 30 queries/workloads (10 of them are from TPC). The drawback is, it is a structured-data-intensive benchmark.

Spark Bench for Apache Spark

We are able to build on ARM64. The setup completed for single node but run scripts are failing. When spark bench examples are run, a KILL signal is observed which terminates all workers. This is still under investigation as there are no useful logs to debug. No proper error description and lack of documentation is a challenge. A ticket is already filed on spark bench git which is unresolved.

Hive TestBench

It is based on TPC-H and TPC-DS benchmarks. You can exeriment Apache Hive at any data scale. The benchmark contains data generator and set of queries. This is very useful to test the basic Hive performance on large data sets. We have a wiki page for Hive TestBench

GridMix

This is a stripped-down version of common Mapreduce jobs. (sorting text data and SequenceFiles). Its a tool for benchmarking Hadoop clusters. This is a trace based benchmark for MapReduce. It

evaluate MapReduce and HDFS performance.

It submits a mix of synthetic jobs , modeling a profile mined from production loads. The benchmark attempt to model the resource profiles of production jobs to identify bottlenecks

Basic command line usage:

$ hadoop gridmix [-generate ] [-users ]

- Destination directory

- Path to a job trace

Con - Challenging to explore the performance impact of combining or separating workloads, e.g., through consolidating from many clusters.

PigMix

The PigMix is a set of queries used test pig component performance. There are queries that test latency (How long it takes to run this query ?). The queries that test scalability (How many fields or records can ping handle before it fails ?).

Usage: Run the below commands from pig home

ant -Dharness.hadoop.home=$HADOOP_HOME pigmix-deploy (generate test dataset)

ant -Dharness.hadoop.home=$HADOOP_HOME pigmix (run the PigMix benchmark)

The documentation can be found at Apache pig - https://pig.apache.org/docs/

SWIM(Statistical Workload Injector for MapReduce)

This benchmark enables rigorous performance measurement of MapReduce systems. The benchmark contains suites of workloads of thousands of jobs, with complex data, arrival, and computation patterns. Informs both highly targeted, workload specific optimizations. This tool is highly recommended for MapReduce operators The performance measurement - https://github.com/SWIMProjectUCB/SWIM/wiki/Performance-measurement-by-executing-synthetic-or-historical-workloads

AmpLab

This is a BigData Benchmark from AMPLab, UC Berkeley provides quantitative and qualitative comparisons of five systems

Redshift – a hosted MPP database offered by Amazon.com based on the ParAccel data warehouse
Hive – a Hadoop-based data warehousing system
Shark – a Hive-compatible SQL engine which runs on top of the Spark computing framework
Impala – a Hive-compatible* SQL engine with its own MPP-like execution engine
Stinger/Tez – Tez is a next generation Hadoop execution engine currently in development

This benchmark measures response time on a handful of relational queries: scans, aggregations, joins, and UDF’s, across different data sizes.

BigDataBench

This is a specification based benchmark. The two key components: A data model specification and a workload/query specification. It's a comprehensive end-to-end big data benchmark suite. The git hub for BigDataBenchmark

BigDataBench is a benchmark suite for scale-out workloads, diﬀerent from SPEC CPU (sequential workloads), and PARSEC (multithreaded workloads). Currently, it simulates ﬁve typical and important big data applications: search engine, social network, e-commerce, multimedia data analytics, and bioinformatics.

Currently, BigDataBench includes 15 real-world data sets, and 34 big data workloads.

HiBench

This benchmark test suite is for Hadoop. It contains 4 different categories tests, 10 workloads and 3 types. This is a best benchmark with metrics: Time (sec) & Throughput (Bytes/Sec)

References

https://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-21.pdf

Terasort, TestDFSIO, NNBench, MRBench

https://wiki.linaro.org/LEG/Engineering/BigData

https://wiki.linaro.org/LEG/Engineering/BigData/HadoopTuningGuide

https://wiki.linaro.org/LEG/Engineering/BigData/HadoopBuildInstallAndRunGuide

http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/

GridMix3, PigMix, HiBench, TPCx-HS, SWIM, AMPLab, BigBench

https://hadoop.apache.org/docs/current/hadoop-gridmix/GridMix.html

https://cwiki.apache.org/confluence/display/PIG/PigMix

https://wiki.linaro.org/LEG/Engineering/BigData/HiBench

https://wiki.linaro.org/LEG/Engineering/BigData/TPCxHS

https://github.com/SWIMProjectUCB/SWIM/wiki

https://github.com/amplab

https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench

http://www.academia.edu/15636566/Handbook_of_BigDataBench_Version_3.1_A_Big_Data_Benchmark_Suite

Industry Standard benchmarks

TPC - Transaction Processing Performance Council http://www.tpc.org

SPEC - The Standard Performance Evaluation Corporation https://www.spec.org

CLDS - Center for Largescale Data System Research http://clds.sdsc.edu/bdbc

Search This Blog

Technical

Benchmarking BigData

Comments

Post a Comment

Popular posts from this blog

mockbuild warning in CentOS

Apache Ambari on ARM64