Apache Drill on ARM64

Apache Drill on ARM64

What is Drill ?
Apache Drill is a distributed MPP query layer that supports SQL and alternative query languages against NoSQL and Hadoop data storage systems. It was inspired in part by Google's Dremel.  Apache Drill is an Apache Foundation project.

Query any non-relational datastore
With the exponential growth of data in recent years, and the shift towards rapid application development, new data is increasingly being stored in non-relational datastores including Hadoop, NoSQL and cloud storage. Apache Drill enables analysts, business users, data scientists and developers to explore and analyze this data without sacrificing the flexibility and agility offered by these datastores.  

Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files. A single query can join data from multiple datastores. For example, you can join a user profile collection in MongoDB with a directory of event logs in Hadoop.

Drill's datastore-aware optimizer automatically restructures a query plan to leverage the datastore's internal processing capabilities. In addition, Drill supports data locality, so it's a good idea to co-locate Drill and the datastore on the same nodes.

Apache Drill includes a distributed execution environment, purpose built for large-scale data processing. It doesn’t use a general purpose execution engine like MapReduce, Tez or Spark. As a result, Drill is flexible (schema-free JSON model) and performant. Drill’s optimizer leverages rule- and cost-based techniques, as well as data locality and operator push-down, which is the capability to push down query fragments into the back-end data sources. 


Apache Drill is built to achieve high throughput and low latency. It provides the following capabilities.
  • Distributed query optimization and execution: Drill is designed to scale from a single node (your laptop) to large clusters with thousands of servers.
  • Columnar execution: Drill is the world's only columnar execution engine that supports complex data and schema-free data. It uses a shredded, in-memory, columnar data representation.
  • Runtime compilation and code generation: Drill is the world's only query engine that compiles and re-compiles queries at runtime. This allows Drill to achieve high performance without knowing the structure of the data in advance. Drill leverages multiple compilers as well as ASM-based bytecode rewriting to optimize the code.
  • Vectorization: Drill takes advantage of the latest SIMD instructions available in modern processors.
  • Optimistic/pipelined execution: Drill is able to stream data in memory between operators. Drill minimizes the use of disks unless needed to complete the query.
Drill is the only columnar query engine that supports complex data. It features an in-memory shredded columnar representation for complex data which allows Drill to achieve columnar speed with the flexibility of an internal JSON document model.
Runtime compilation enables faster execution than interpreted execution. Drill generates highly efficient custom code for every single query.

Top 10 Reasons to use Apache Drill

1. Get started in minutes
It takes just a few minutes to get started with Drill. Untar the Drill software on your Linux, Mac, or Windows laptop and run a query on a local file. No need to set up any infrastructure or to define schemas. Just point to the data, such as data in a file, directory, HBase table, and drill.

2. Schema-free JSON model
Drill is the world's first and only distributed SQL engine that doesn't require schemas. It shares the same schema-free JSON model as MongoDB and Elasticsearch. No need to define and maintain schemas or transform data (ETL). Drill automatically understands the structure of the data.

3. Query complex, semi-structured data in-situ
Using Drill's schema-free JSON model, you can query complex, semi-structured data in situ. No need to flatten or transform the data prior to or during query execution. Drill also provides intuitive extensions to SQL to work with nested data. 

4. Real SQL -- not "SQL-like"
Drill supports the standard SQL:2003 syntax. No need to learn a new "SQL-like" language or struggle with a semi-functional BI tool. Drill supports many data types including DATE, INTERVAL, TIMESTAMP, and VARCHAR, as well as complex query constructs such as correlated sub-queries and joins in WHERE clauses. 

5. Leverage standard BI tools
Drill works with standard BI tools. You can use your existing tools, such as Tableau, MicroStrategy, QlikView and Excel.

6. Interactive queries on Hive tables
Apache Drill lets you leverage your investments in Hive. You can run interactive queries with Drill on your Hive tables and access all Hive input/output formats (including custom SerDes). You can join tables associated with different Hive metastores, and you can join a Hive table with an HBase table or a directory of log files. 

7. Access multiple data sources
Drill is extensible. You can connect Drill out-of-the-box to file systems (local or distributed, such as S3 and HDFS), HBase and Hive. You can implement a storage plugin to make Drill work with any other data source. Drill can combine data from multiple data sources on the fly in a single query, with no centralized metadata definitions. 

8. User-Defined Functions (UDFs) for Drill and Hive
Drill exposes a simple, high-performance Java API to build custom user-defined functions (UDFs) for adding your own business logic to Drill. Drill also supports Hive UDFs. If you have already built UDFs in Hive, you can reuse them with Drill with no modifications.

9. High performance
Drill is designed from the ground up for high throughput and low latency. It doesn't use a general purpose execution engine like MapReduce, Tez or Spark. As a result, Drill is flexible (schema-free JSON model) and performant. Drill's optimizer leverages rule- and cost-based techniques, as well as data locality and operator push-down, which is the capability to push down query fragments into the back-end data sources. Drill also provides a columnar and vectorized execution engine, resulting in higher memory and CPU efficiency.

10. Scales from a single laptop to a 1000-node cluster
Drill is available as a simple download you can run on your laptop. When you're ready to analyze larger datasets, deploy Drill on your Hadoop cluster (up to 1000 commodity servers). Drill leverages the aggregate memory in the cluster to execute queries using an optimistic pipelined model, and automatically spills to disk when the working set doesn't fit in memory.

The flow of a Drill query

  • The Drill client issues a query. A Drill client is a JDBC, ODBC, command line interface or a REST API. Any Drillbit in the cluster can accept queries from the clients. There is no master-slave concept.
  • The Drillbit then parses the query, optimizes it, and generates a distributed query plan that is optimized for fast and efficient execution.
  • The Drillbit that accepts the query becomes the driving Drillbit node for the request. It gets a list of available Drillbit nodes in the cluster from ZooKeeper. The driving Drillbit determines the appropriate nodes to execute various query plan fragments to maximize data locality.
  • The Drillbit schedules the execution of query fragments on individual nodes according to the execution plan.
  • The individual nodes finish their execution and return data to the driving Drillbit.
  • The driving Drillbit streams results back to the client.

Goals on ARM64

  • Create .deb and rpm packages for Apache Drill for AArch64.
  • Install Drill packages along with the dependency.
  • Do basic workload testing

  • OpenJDK8
  • Zookeeper
  • git
  • maven@v3.3.9

Efforts from Linaro BigData team
  • Implement and upstream DEB/RPM support on Apache Drill
  • Document the following installation steps in collaborate page.
    • Define prerequisites
      • Install HDFS aarch64 bits from debian repo
      • Install YARN aarch64 bits from debian repo
      • Install zookeeper aarch64 bits from debian repo
    • Check YARN and zookeeper versions
    • Setup HDFS in distributed mode
    • Setup YARN in distributed mode
    • Update Hosts files
    • Configure HDFS, YARN and Zookeeper with nodes information.
    • Point Drill to zookeeper quorum
  • Configure Drill to run on YARN distributed mode. This might cause issues, if drill is installed prior to YARN. If so, need to uninstall drill and redo. 
  • Check if drill is running on YARN 
  • Configure drill dfs (hdfs) storage plugin
  • Start drill daemon in each node
  • Start drill bit in distributed mode drillbit.sh 
  • Test basic data import
  • Double check and Re-configure zookeeper
  • Update drill-env.sh settings
  • Download and import github data as json files into HDFS
  • Build drill query
  • Check if the data shows up in drill
  • Configure drill memory and check for optimization
  • Check on caching in drill (Optimistic/pipelined execution)
  • Research on Integrating Zeppelin/Jupyter if possible for drill query
Build/Setup and Run Apache Drill

git clone https://github.com/apache/drill.git

cd drill
mvn clean package -DskipTests

Test drill-embedded

You can launch the drill embedded as below and query sample file or JSON file.  You only need to provide absolute path while doing querry.

linaro@debian:~$ drill-embedded
Apache Drill 1.15.0-SNAPSHOT
"Drill must go on."
0: jdbc:drill:zk=local>
0: jdbc:drill:zk=local> SELECT * FROM dfs.`/home/linaro/Apache-components-build/drill/distribution/target/apache-drill-1.15.0-SNAPSHOT/apache-drill-1.15.0-SNAPSHOT/sample-data/region.parquet`;
0 AFRICA lar deposits. blithe
1 AMERICA hs use ironic, even
2 ASIA ges. thinly even pin
3 EUROPE ly final courts cajo
4 MIDDLE EAST uickly special accou
5 rows selected (1.025 seconds)
0: jdbc:drill:zk=local>

 0: jdbc:drill:zk=local> !quit
Closing: org.apache.drill.jdbc.impl.DrillConnectionImpl

Setup and test drill in clustered mode

  • Edit drill-override.conf to provide zookeeper location
  • Start the drillbit using bin/drillbit.sh start
  • Repeat on other nodes
  • Connect with sqlline by using bin/sqlline -u "jdbc:drill:zk=[zk_host:port]"
  • Run a query (below).
Now we will see one by one in details,

Install OpenJDK

    $ sudo apt-get install openjdk-8-jdk

Make sure you have the right OpenJDK version

    $ java -version

It should display 1.8.0_111


    $ export JAVA_HOME=`readlink -f /usr/bin/java | sed "s:jre/bin/java::"`

Building Apache Zookeeper

Some distributions like Ubuntu/Debian comes with latest zookeeper.  Hence you can just install using apt-get command "sudo apt-get install zookeeper".  If your distribution does not come with zookeeper then just go for latest download and unzip the Zookeeper package from Official Apache archive in all machines that will be used for zookeeper quorum as shown below:

    $ wget https://www-us.apache.org/dist/zookeeper/stable/zookeeper-3.4.12.tar.gz
    $ tar -xzvf zookeeper-3.4.12.tar.gz

Edit the /etc/hosts file across all the nodes and add the ipaddress and hostname (nodenames). If the hostnames are not right, change them in /etc/hosts file


Create zookeeper user

You can create a new user or you can also configure the zookeeper for any existing user.    You can just use any other existing user name instead of zookeeper e.g. ubuntu, centos or debian..etc

    $ sudo adduser zookeeper

Configure zookeeper user or any already existing user

To make an ensemble with Master-slave architecture,  we needed to have odd number of zookeeper server .i.e.{1, 3 ,5,7....etc}.

Now, Create the directory zookeeper under /var/lib folder which will serve as Zookeeper data directory and create another zookeeper directory under /var/log where all the Zookeeper logs will be captured. Both of the directory ownership need to be changed as zookeeper.

    $ sudo mkdir /var/lib/zookeeper
    $ cd /var/lib
    $ sudo chown zookeeper:zookeeper zookeeper/
    $ sudo mkdir /var/log/zookeeper
    $ cd /var/log
    $ sudo chown zookeeper:zookeeper zookeeper/

Note: While running the zookeeper if you get a message something like below you may need to check/change for permissions of the files under /var/lib/zookeeper and /var/log/zookeeper.

Since I have loged-in as linaro and running zookeeper.  I have changed the permission to linaro user.

    linaro@node1:~/drill-setup/zookeeper-3.4.12$ ./bin/zkServer.sh start
    ZooKeeper JMX enabled by default
    Using config: /home/linaro/drill-setup/zookeeper-3.4.12/bin/../conf/zoo.cfg
 Starting zookeeper ... ./bin/zkServer.sh: line 149: /var/lib/zookeeper/zookeeper_server.pid: Permission denied

Edit the bashrc for the zookeeper user via setting up the following Zookeeper environment variables.

    $ export ZOO_LOG_DIR=/var/log/zookeeper

Source the .bashrc in current login session:

    $ source ~/.bashrc

Create the server id for the ensemble. Each zookeeper server should have a unique number in the myid file within the ensemble and should have a value between 1 and 255.

In Node1

    $ sudo sh -c "echo '1' > /var/lib/zookeeper/myid"

In Node2

    $ sudo sh -c "echo '2' > /var/lib/zookeeper/myid"

In Node3

    $ sudo sh -c "echo '3' > /var/lib/zookeeper/myid"

Now, go to the conf folder under the Zookeeper home directory (location of the Zookeeper directory after Archive has been unzipped/extracted).

    $ cd /home/zookeeper/zookeeper-3.4.13/conf/

By default, a sample conf file with name zoo_sample.cfg will be present in conf directory. Make a copy of it with name zoo.cfg as shown below, and edit new zoo.cfg as described across all the nodes.

    $ cp zoo_sample.cfg zoo.cfg

Edit zoo.cfg and the below

    $ vi zoo.cfg


Now, do the below changes in log4.properties file as follows.

    $ vi log4j.properties

    log4j.rootLogger=INFO, CONSOLE, ROLLINGFILE

After the configuration has been done in zoo.cfg file in all three nodes, start zookeeper in all the nodes one by one, using following command:

    $ /home/zookeeper/zookeeper-3.4.12/bin/zkServer.sh start

Zookeeper Service Start on all the Nodes.

    ZooKeeper JMX enabled by default
    Using config: /home/ubuntu/zookeeper-3.4.12/bin/../conf/zoo.cfg
    Starting zookeeper ... STARTED

The log file will be created in /var/log/zookeeper of zookeeper named zookeeper.log, tail the file to see logs for any errors.

    $ tail -f /var/log/zookeeper/zookeeper.log

Verify the Zookeeper Cluster and Ensemble

In Zookeeper ensemble out of three servers, one will be in leader mode and other two will be in follower mode. You can check the status by running the following commands.

    $ /home/zookeeper/zookeeper-3.4.13/bin/zkServer.sh status

Zookeeper Service Status Check.

In Zookeeper ensemble If you have 3 nodes, out of them, one will be in leader mode and other two will be in follower mode. You can check the status by running the following commands. If you have just one then it will be standalone.

With three nodes:


    ZooKeeper JMX enabled by default
    Using config: /home/zookeeper/zookeeper-3.4.12/bin/../conf/zoo.cfg
    Mode: leader


    ZooKeeper JMX enabled by default
    Using config: /home/zookeeper/zookeeper-3.4.12/bin/../conf/zoo.cfg
    Mode: follower


    ZooKeeper JMX enabled by default
    Using config: /home/zookeeper/zookeeper-3.4.12/bin/../conf/zoo.cfg
    Mode: follower


    ZooKeeper JMX enabled by default
    Using config: /home/zookeeper/zookeeper-3.4.12/bin/../conf/zoo.cfg
    Mode: standalone

    $ echo stat | nc node1 2181

Lists brief details for the server and connected clients.

     $ echo mntr | nc node1 2181

Zookeeper list of variables for cluster health monitoring.
       $ echo srvr | nc localhost 2181

Lists full details for the Zookeeper server.
If you need to check and see the znode, you can connect by using the below command on any of the zookeeper node:

    $ /home/zookeeper/zookeeper-3.4.12/bin/zkCli.sh -server `hostname -f`:2181

Connect to Zookeeper data node and lists the contents.

Install Pre-requisites for Build

    $ sudo apt-get install git

Setup environment

Add environment variables to profile file

# setup environments
export LANG="en_US.UTF-8"
export PATH=${HOME}/gradle/bin:$PATH
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64
export JAVA_TOOL_OPTIONS="-Dfile.encoding=UTF8"

$ source ~/.bashrc

Hooking up upstream Maven 3.6.0 (for Debian Jessie only)

 $ wget http://mirrors.gigenet.com/apache/maven/maven-3/3.6.0/binaries/apache-maven-3.6.0-bin.tar.gz
    $ tar xvf apache-maven-3.6.0-bin.tar.gz
    $ cd apache-maven-3.6.0/bin
    $ export PATH=$PWD:$PATH
    $ mvn --version # should list the version as 3.6.0

Clone and Build Apache Drill

    $ git clone https://gitbox.apache.org/repos/asf/drill.git
    $ cd drill
    $ git branch v1.15.0 origin/1.15.0
    $ git checkout v1.15.0

To build .deb package 

    $ mvn clean -X package -Pdeb -DskipTests

To build .rpm package 

    $ mvn clean -X package -Prpm -DskipTests

After successful compilation. Edit your computer /etc/hosts file and make sure that the loopback is commented. e.g. and replace with your host <IP-Address>

    $ cd distribution/target/apache-drill-1.15.0/apache-drill-1.15.0

    # localhost
    # ubuntu
    <IP-address> ubuntu
    <IP-address> localhost

Because in distributed mode the loopback IP cannot be binded reference https://stackoverflow.com/questions/40506221/how-to-start-drillbit-locally-in-distributed-mode

Next you need to edit the conf/drill-override.conf and change the zookeeper cluster ID e.g. as below


    { cluster-id: "1", zk.connect: "<IP-address>:2181" }

Now you can run the drillbit and watchout the log. To play more with drillbit you can refer drill-override-example.conf file.

    $ apache-drill-1.15.0$ ./bin/drillbit.sh help
    Usage: drillbit.sh [--config|--site <site-dir>] (start|stop|status|restart|run|graceful_stop) [args]

In one of the terminal switch on the logs with the tail command

    $ apache-drill-1.15.0$ tail -f log/drillbit.log
    $ apache-drill-1.15.0$ ./bin/drillbit.sh start
    $ apache-drill-1.15.0$ ./bin/drillbit.sh status

    drillbit is running.

    $ apache-drill-1.15.0$ ./bin/drillbit.sh graceful_stop
    Stopping drillbit

You can either stop or do a graceful stop. We can repeat the same steps on more than one machines (nodes).

I could able to run the Drill and access the http://IP-Address:8047 and run a sample querry in distributed mode. So In order to do in a distributed mode. I just need to do a similar setup on multiple machines (nodes). Reference - https://drill.apache.org/docs/starting-the-web-ui/

If you are using the CentOS 7   you should be little careful because the connection errors may be caused because of the firewall issues. I have used below set of commands to disable the firewall.

    $ sudo systemctl stop firewalld

    $ sudo firewall-cmd --zone=public --add-port=2181/udp --add-port=2181/tcp --permanent
    [sudo] password for centos:

    $ sudo firewall-cmd --reload

    $ zkServer.sh restart
    ZooKeeper JMX enabled by default
    Using config: /home/centos/zookeeper-3.4.12/bin/../conf/zoo.cfg
    ZooKeeper JMX enabled by default
    Using config: /home/centos/zookeeper-3.4.12/bin/../conf/zoo.cfg
    Stopping zookeeper ... STOPPED
    ZooKeeper JMX enabled by default
    Using config: /home/centos/zookeeper-3.4.12/bin/../conf/zoo.cfg
    Starting zookeeper ... STARTED


Official web page: http://drill.apache.org 


Popular posts from this blog

Benchmarking BigData

usermod/groupmod tools - Rename username and usergroup in Ubuntu