Benchmarking BigData


The purpose of this blog is to explain the different types of benchmark tools available for BigData components. We gave a talk on BigData benchmarking at Linaro Connect Las Vegas in 2016. This post is my effort to collect that material in one place, with more information.

We have to remember that all the BigData components/benchmarks were developed
  • keeping the x86 architecture in mind.
    • So, in the first place, we should make sure that all the relevant benchmark tools compile and run on AArch64.
    • Then we should go ahead and optimize them for AArch64.
Different types of benchmarks and standards
  • Micro benchmarks: evaluate specific low-level system operations
    • E.g. HiBench, HDFS DFSIO, AMPLab Big Data Benchmark, CALDA, Hadoop workload examples (sort, grep, wordcount, TeraSort, GridMix, PigMix)
  • Functional/Component benchmarks: specific to a low-level function
    • E.g. basic SQL: individual SQL operations like select, project, join, order-by
  • Application level
    • Bigbench
    • Spark bench
The tables below summarize the different benchmark efforts.
Benchmark Efforts - Microbenchmarks

Benchmark | Workloads | Software Stacks | Metrics
DFSIO | Generate, read, write, append, and remove data for MapReduce jobs | Hadoop | Execution Time, Throughput
HiBench | Sort, WordCount, TeraSort, PageRank, K-means, Bayes classification, Index | Hadoop and Hive | Execution Time, Throughput, resource utilization
AMPLab benchmark | Part of CALDA workloads (scan, aggregate and join) and PageRank | Hive, Tez | Execution Time
CALDA | Load, scan, select, aggregate and join data, count URL links | Hadoop, Hive | Execution Time

Benchmark Efforts - TPC

Benchmark | Workloads | Software Stacks | Metrics
TPCx-HS | HSGen, HSDataCheck, HSSort and HSValidate | Hadoop | Performance, price and energy
TPC-H | Data warehousing operations | Hive, Pig | Execution Time, Throughput
TPC-DS | Decision support benchmark: data loading, queries and maintenance | Hive, Pig | Execution Time, Throughput

Benchmark Efforts - Synthetic

Benchmark | Workloads | Software Stacks | Metrics
SWIM | Synthetic user-generated MapReduce jobs of reading, writing, shuffling and sorting | Hadoop | Multiple metrics
GridMix | Synthetic and basic operations to stress test the job scheduler, compression and decompression | Hadoop | Memory, Execution Time, Throughput
PigMix | 17 Pig-specific queries | Hadoop, Pig | Execution Time
MRBench | MapReduce benchmark complementary to TeraSort: data warehouse operations with 22 TPC-H queries | Hadoop | Execution Time
NNBench | Load testing the NameNode and HDFS I/O with small payloads | Hadoop | -
Spark-Bench | CPU, memory, shuffle and IO intensive workloads: machine learning, streaming, graph computation and SQL workloads | Spark | Execution Time, Data process rate
- | Interactive-based queries on synthetic data | Hadoop, Spark | Execution Time

Benchmark Efforts - BigDataBench

Workloads:
  1. Micro benchmarks (sort, grep, WordCount)
  2. Search engine workloads (index, PageRank)
  3. Social network workloads (connected components (CC), K-means and BFS)
  4. E-commerce site workloads (relational database queries (select, aggregate and join), collaborative filtering (CF) and Naive Bayes)
  5. Multimedia analytics workloads (speech recognition, ray tracing, image segmentation, face detection)
  6. Bioinformatics workloads
Software Stacks: Hadoop, DBMSs, NoSQL systems, Hive, Impala, HBase, MPI, Libc, and other real-time analytics systems
Metrics: Memory, CPU (MIPS, MPKI - misses per kilo-instructions)

Let's go through each of the benchmarks in detail.

Hadoop benchmarks and test tools:

The Hadoop source comes with a number of benchmarks. TestDFSIO, nnbench and mrbench are in the hadoop-*test*.jar file, and TeraGen, TeraSort and TeraValidate are in the hadoop-*examples*.jar file in the Hadoop source code.

You can check them using the commands:

       $ cd /usr/local/hadoop
       $ bin/hadoop jar hadoop-*test*.jar
       $ bin/hadoop jar hadoop-*examples*.jar

While running the benchmarks you might want to use the time command, which measures the elapsed time. This saves you the hassle of navigating to the Hadoop JobTracker web interface. The relevant metric is the real value in the first row.

      $ time hadoop jar hadoop-*examples*.jar ...
      real    9m15.510s
      user    0m7.075s
      sys     0m0.584s

TeraGen, TeraSort and TeraValidate

This is the most well known Hadoop benchmark. TeraSort sorts the data as fast as possible, exercising both the HDFS and MapReduce layers of a Hadoop cluster. The TeraSort benchmark consists of 3 steps: generate input via TeraGen, run TeraSort on the input data, and validate the sorted output data via TeraValidate. We have a wiki page which explains this test suite; you can refer to the Hadoop Build, Install And Run Guide.
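The three steps can be sketched as shell commands. This is a minimal sketch: the jar name, row count and HDFS paths are illustrative assumptions, and the commands only print themselves when no hadoop binary is on the PATH.

```shell
# Sketch of the three TeraSort phases. The jar name, row count and HDFS
# paths are illustrative assumptions - adjust them to your installation.
EXAMPLES_JAR="hadoop-*examples*.jar"
ROWS=10000000   # TeraGen rows are 100 bytes each, so 10M rows is ~1 GB

run() {
  # Execute for real when a hadoop binary exists; otherwise just print
  # the command so the phase sequence is still visible.
  if command -v hadoop >/dev/null 2>&1; then "$@"; else echo "$*"; fi
}

run hadoop jar "$EXAMPLES_JAR" teragen "$ROWS" /bench/tera-in                   # 1. generate input
run hadoop jar "$EXAMPLES_JAR" terasort /bench/tera-in /bench/tera-out          # 2. sort
run hadoop jar "$EXAMPLES_JAR" teravalidate /bench/tera-out /bench/tera-report  # 3. validate
```

Wrapping each command with the time command, as shown earlier, gives the per-phase elapsed time.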


TestDFSIO

It is part of the hadoop-mapreduce-client-jobclient.jar file. It stress tests I/O performance (throughput and latency) on a clustered setup. This test will shake out the hardware, OS and Hadoop setup on your cluster machines (NameNode/DataNode). The tests are run as a MapReduce job using a 1:1 mapping (1 map per file). This test is helpful to discover performance bottlenecks in your network. The write test should be run before the read test, since the read test consumes the files the write test creates. Use the -write switch for write tests and -read for read tests. The results are stored by default in TestDFSIO_results.log; you can use the -resFile switch to choose a different file name.
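A typical write-then-read cycle looks like the following sketch. The file count, file size and jar path are illustrative assumptions, and the commands only print themselves when hadoop is not installed.

```shell
# Sketch of a TestDFSIO write-then-read cycle. The nrFiles/fileSize values
# and the jar path are illustrative assumptions.
JOBCLIENT_JAR="hadoop-mapreduce-client-jobclient-tests.jar"

run() {
  # Execute when hadoop is available; otherwise print the command.
  if command -v hadoop >/dev/null 2>&1; then "$@"; else echo "$*"; fi
}

# The write test must come first: it creates the files the read test uses.
run hadoop jar "$JOBCLIENT_JAR" TestDFSIO -write -nrFiles 10 -fileSize 1000
run hadoop jar "$JOBCLIENT_JAR" TestDFSIO -read -nrFiles 10 -fileSize 1000
# Redirect the summary away from the default TestDFSIO_results.log:
run hadoop jar "$JOBCLIENT_JAR" TestDFSIO -write -nrFiles 10 -fileSize 1000 -resFile my_dfsio.log
```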

MR (MapReduce) Benchmark

The test loops a small job a number of times. It checks whether small job runs are responsive and running efficiently on your cluster. It puts its focus on the MapReduce layer, as its impact on the HDFS layer is very limited. Multiple MRBench instances can run in parallel, hence you can run it from different boxes.

Test command to run 50 small test jobs
      $ hadoop jar hadoop-*test*.jar mrbench -numRuns 50

Exemplary output, which means the job finished in about 31 seconds on average:
      DataLines       Maps    Reduces AvgTime (milliseconds)
      1               2       1       31414

NN (Name Node) Benchmark for HDFS

This test is useful for load testing the NameNode hardware and configuration. It generates a lot of HDFS-related requests with normally very small payloads, putting high HDFS management stress on the NameNode. The test can be run simultaneously from several machines, e.g. from a set of DataNode boxes, in order to hit the NameNode from multiple locations at the same time.
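For example, a create_write run from one box might look like the sketch below. The map/reduce counts and file numbers are illustrative assumptions, and the command only prints itself when hadoop is absent.

```shell
# Sketch: stress the NameNode with many small create/write operations.
# The maps/reduces/file counts below are illustrative assumptions.
TEST_JAR="hadoop-*test*.jar"

run() {
  # Execute when hadoop is available; otherwise print the command.
  if command -v hadoop >/dev/null 2>&1; then "$@"; else echo "$*"; fi
}

run hadoop jar "$TEST_JAR" nnbench -operation create_write \
    -maps 12 -reduces 6 -numberOfFiles 1000 -bytesToWrite 0 \
    -baseDir /benchmarks/NNBench
```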

The TPC is a non-profit, vendor-neutral organization with a reputation for providing the most credible performance results to the industry. The TPC plays the role of “consumer reports” for the computing industry. Its benchmarks are a solid foundation for complete system-level performance, with a methodology for calculating total system price and price-performance, and a methodology for measuring the energy efficiency of a complete system.

TPC Benchmark 
  • TPCx-HS
We have a collaboration page, TPCxHS. The name stands for X: Express, H: Hadoop, S: Sort. The TPCx-HS kit contains the TPCx-HS specification documentation, the TPCx-HS user's guide, scripts to run the benchmark, and Java code to execute the benchmark load. A valid run consists of 5 separate phases run sequentially with no overlap in their execution. The benchmark test consists of 2 runs (the runs with the lower and the higher TPCx-HS Performance Metric). No configuration or tuning changes or reboots are allowed between the two runs.

The TPC Express benchmark standard is easy to implement, run and publish, and less expensive. The test sponsor is required to use the TPCx-HS kit as provided. The vendor may choose an independent audit or a peer audit, to which a 60-day review/challenge window applies (as per TPC policy). Results are approved by a super majority of the TPC General Council. All publications must follow the TPC Fair Use Policy.
  • TPC-H
    • TPC-H benchmark focuses on ad-hoc queries
The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions. The performance metric reported by TPC-H is called the TPC-H Composite Query-per-Hour Performance Metric (QphH@Size), and reflects multiple aspects of the capability of the system to process queries. These aspects include the selected database size against which the queries are executed, the query processing power when queries are submitted by a single stream, and the query throughput when queries are submitted by multiple concurrent users. The TPC-H Price/Performance metric is expressed as $/QphH@Size.
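For reference, the composite metric is the geometric mean of the power test (single query stream) and the throughput test (concurrent query streams) results at the chosen scale factor:

```latex
\mathrm{QphH@Size} = \sqrt{\mathrm{Power@Size} \times \mathrm{Throughput@Size}}
```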
  • TPC-DS
    • This is the standard benchmark for decision support
The TPC Benchmark DS (TPC-DS) is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general purpose decision support system. A benchmark result measures query response time in single user mode, query throughput in multi user mode and data maintenance performance for a given hardware, operating system, and data processing system configuration under a controlled, complex, multi-user decision support workload. The purpose of TPC benchmarks is to provide relevant, objective performance data to industry users. TPC-DS Version 2 enables emerging technologies, such as Big Data systems, to execute the benchmark.
  • TPC-C
    • TPC-C is an On-Line Transaction Processing Benchmark

Approved in July of 1992, TPC Benchmark C is an on-line transaction processing (OLTP) benchmark. TPC-C is more complex than previous OLTP benchmarks such as TPC-A because of its multiple transaction types, more complex database and overall execution structure. TPC-C involves a mix of five concurrent transactions of different types and complexity either executed on-line or queued for deferred execution. The database is comprised of nine types of tables with a wide range of record and population sizes. TPC-C is measured in transactions per minute (tpmC). While the benchmark portrays the activity of a wholesale supplier, TPC-C is not limited to the activity of any particular business segment, but, rather represents any industry that must manage, sell, or distribute a product or service.

TPC vs SPEC models

Here is our comparison between the TPC and SPEC benchmark models:

TPC model                                   | SPEC model
Specification based                         | Kit based
Performance, price, energy in one benchmark | Performance and energy in separate benchmarks
End-to-end                                  | Server centric
Multiple tests (ACID, load)                 | Single test
Independent review, full disclosure         | Summary disclosure
TPC Technology Conference                   | SPEC Research Group, ICPE (International Conference on Performance Engineering)

BigBench

BigBench is a joint effort with partners in industry and academia to create a comprehensive and standardized BigData benchmark. One reference reading about BigBench is "BigBench: Toward An Industry Standard Benchmark for Big Data Analytics". BigBench builds upon and borrows elements from existing benchmarking efforts (such as TPCx-HS, GridMix, PigMix, HiBench, Big Data Benchmark, YCSB and TPC-DS). BigBench is a specification-based benchmark with an open-source reference implementation kit. As a specification-based benchmark, it is technology-agnostic and provides the necessary formalism and flexibility to support multiple implementations. It is focused on execution time and consists of around 30 queries/workloads (10 of them are from TPC). The drawback is that it is a structured-data-intensive benchmark.

Spark Bench for Apache Spark

We are able to build it on ARM64. The setup is complete for a single node, but the run scripts are failing. When the Spark-Bench examples are run, a KILL signal is observed which terminates all workers. This is still under investigation, as there are no useful logs to debug; the lack of a proper error description and of documentation is a challenge. A ticket has been filed on the Spark-Bench git repository and remains unresolved.

Hive TestBench

It is based on the TPC-H and TPC-DS benchmarks. You can experiment with Apache Hive at any data scale. The benchmark contains a data generator and a set of queries. It is very useful for testing basic Hive performance on large data sets. We have a wiki page for Hive TestBench.
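A typical session can be sketched as below, assuming the hive-testbench kit layout; the script names, scale factor and query file are assumptions based on that kit, and the commands only print themselves unless run inside a checkout with Hive installed.

```shell
# Sketch of a hive-testbench session. Script names follow the hive-testbench
# kit; the scale factor (in GB) and query file are illustrative assumptions.
run() {
  # Execute inside a real checkout with Hive; otherwise print the command.
  if command -v hive >/dev/null 2>&1 && [ -x ./tpcds-build.sh ]; then "$@"; else echo "$*"; fi
}

run ./tpcds-build.sh       # compile the TPC-DS data generator
run ./tpcds-setup.sh 1000  # generate and load a ~1 TB data set into Hive
run hive -f sample-queries-tpcds/query55.sql   # run one of the bundled queries
```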

GridMix

This is a stripped-down version of common MapReduce jobs (sorting text data and SequenceFiles). It is a tool for benchmarking Hadoop clusters: a trace-based benchmark for MapReduce that evaluates MapReduce and HDFS performance.

It submits a mix of synthetic jobs, modeling a profile mined from production loads. The benchmark attempts to model the resource profiles of production jobs to identify bottlenecks.

Basic command line usage:

 $ hadoop gridmix [-generate <size>] [-users <users-list>] <iopath> <trace>
                <iopath> - destination directory
                <trace>  - path to a job trace

Con - it is challenging to explore the performance impact of combining or separating workloads, e.g. through consolidating from many clusters.

PigMix

PigMix is a set of queries used to test Pig performance. There are queries that test latency (how long does it take to run this query?) and queries that test scalability (how many fields or records can Pig handle before it fails?).

Usage: run the below commands from the Pig home directory

 $ ant -Dharness.hadoop.home=$HADOOP_HOME pigmix-deploy   (generate the test dataset)
 $ ant -Dharness.hadoop.home=$HADOOP_HOME pigmix          (run the PigMix benchmark)

The documentation can be found on the Apache Pig wiki.

SWIM

This benchmark enables rigorous performance measurement of MapReduce systems. It contains suites of workloads of thousands of jobs, with complex data, arrival, and computation patterns, and informs highly targeted, workload-specific optimizations. This tool is highly recommended for MapReduce operators.

This is a BigData benchmark from AMPLab, UC Berkeley that provides quantitative and qualitative comparisons of five systems:
  • Redshift - a hosted MPP database offered by Amazon.com, based on the ParAccel data warehouse
  • Hive - a Hadoop-based data warehousing system
  • Shark - a Hive-compatible SQL engine which runs on top of the Spark computing framework
  • Impala - a Hive-compatible* SQL engine with its own MPP-like execution engine
  • Stinger/Tez - Tez is a next-generation Hadoop execution engine currently in development
This benchmark measures response time on a handful of relational queries: scans, aggregations, joins, and UDFs, across different data sizes.

BigDataBench

This is a specification-based benchmark with two key components: a data model specification and a workload/query specification. It is a comprehensive end-to-end big data benchmark suite. See the GitHub page for BigDataBench.

BigDataBench is a benchmark suite for scale-out workloads, different from SPEC CPU (sequential workloads), and PARSEC (multithreaded workloads). Currently, it simulates five typical and important big data applications: search engine, social network, e-commerce, multimedia data analytics, and bioinformatics.

Currently, BigDataBench includes 15 real-world data sets, and 34 big data workloads.

HiBench

This benchmark test suite is for Hadoop. It contains 4 different categories of tests, 10 workloads and 3 types. It is one of the best known benchmarks, with metrics: time (sec) and throughput (bytes/sec).
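A single workload run can be sketched as follows. The directory layout follows recent HiBench releases and should be treated as an assumption; the commands only print themselves outside a configured HiBench checkout.

```shell
# Sketch: prepare and run the HiBench wordcount micro workload. The paths
# below follow recent HiBench releases and are assumptions.
PREPARE=bin/workloads/micro/wordcount/prepare/prepare.sh
RUN=bin/workloads/micro/wordcount/hadoop/run.sh

run() {
  # Execute inside a configured HiBench checkout; otherwise print the command.
  if [ -x "$1" ]; then "$@"; else echo "$*"; fi
}

run "$PREPARE"   # generate the input data set
run "$RUN"       # run the Hadoop version of wordcount
# Time and throughput for each run are appended to report/hibench.report
```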



Terasort, TestDFSIO, NNBench, MRBench 

GridMix3, PigMix, HiBench, TPCx-HS, SWIM, AMPLab, BigBench 

Industry Standard benchmarks

TPC - Transaction Processing Performance Council 
SPEC - The Standard Performance Evaluation Corporation 
CLDS - Center for Large-scale Data System Research 


