Apache Drill on ARM64
What is Drill?
Apache Drill is a distributed MPP query layer that supports SQL and alternative query languages against NoSQL and Hadoop data storage systems. It was inspired in part by Google's Dremel. Apache Drill is an Apache Foundation project.
Query any non-relational datastore
With the exponential growth of data in recent years, and the shift towards rapid application development, new data is increasingly being stored in non-relational datastores including Hadoop, NoSQL and cloud storage. Apache Drill enables analysts, business users, data scientists and developers to explore and analyze this data without sacrificing the flexibility and agility offered by these datastores.
Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files. A single query can join data from multiple datastores. For example, you can join a user profile collection in MongoDB with a directory of event logs in Hadoop.
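As a rough illustration of such a cross-datastore join (the plugin, database, collection, and path names below are hypothetical, assuming a MongoDB storage plugin registered as mongo and JSON event logs reachable through the dfs plugin):
SELECT u.name, COUNT(*) AS event_count
FROM mongo.app.`users` u
JOIN dfs.`/data/logs/events` e ON u.user_id = e.user_id
GROUP BY u.name;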
Drill's datastore-aware optimizer automatically restructures a query plan to leverage the datastore's internal processing capabilities. In addition, Drill supports data locality, so it's a good idea to co-locate Drill and the datastore on the same nodes.
Apache Drill includes a distributed execution environment, purpose built for large-scale data processing. It doesn't use a general-purpose execution engine like MapReduce, Tez or Spark. As a result, Drill is flexible (schema-free JSON model) and performant. Drill's optimizer leverages rule- and cost-based techniques, as well as data locality and operator push-down, which is the capability to push down query fragments into the back-end data sources.
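If you want to see what the optimizer produces for a given query, Drill can print the plan with EXPLAIN. A minimal sketch against the sample Parquet file bundled with the distribution (the path is a placeholder for your install location):
EXPLAIN PLAN FOR
SELECT * FROM dfs.`/path/to/apache-drill/sample-data/region.parquet`;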
Capabilities:
Apache Drill is built to achieve high throughput and low latency. It provides the following capabilities.
- Distributed query optimization and execution: Drill is designed to scale from a single node (your laptop) to large clusters with thousands of servers.
- Columnar execution: Drill is the world's only columnar execution engine that supports complex data and schema-free data. It uses a shredded, in-memory, columnar data representation.
- Runtime compilation and code generation: Drill is the world's only query engine that compiles and re-compiles queries at runtime. This allows Drill to achieve high performance without knowing the structure of the data in advance. Drill leverages multiple compilers as well as ASM-based bytecode rewriting to optimize the code.
- Vectorization: Drill takes advantage of the latest SIMD instructions available in modern processors.
- Optimistic/pipelined execution: Drill is able to stream data in memory between operators. Drill minimizes the use of disks unless needed to complete the query.
Drill is the only columnar query engine that supports complex data. It features an in-memory shredded columnar representation for complex data, which allows Drill to achieve columnar speed with the flexibility of an internal JSON document model.
Runtime compilation enables faster execution than interpreted execution. Drill generates highly efficient custom code for every single query.
Top 10 Reasons to use Apache Drill
1. Get started in minutes
It takes just a few minutes to get started with Drill. Untar the Drill software on your Linux, Mac, or Windows laptop and run a query on a local file. No need to set up any infrastructure or to define schemas. Just point Drill at the data, such as a file, a directory, or an HBase table, and start drilling.
2. Schema-free JSON model
Drill is the world's first and only distributed SQL engine that doesn't require schemas. It shares the same schema-free JSON model as MongoDB and Elasticsearch. No need to define and maintain schemas or transform data (ETL). Drill automatically understands the structure of the data.
3. Query complex, semi-structured data in-situ
Using Drill's schema-free JSON model, you can query complex, semi-structured data in situ. No need to flatten or transform the data prior to or during query execution. Drill also provides intuitive extensions to SQL to work with nested data.
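As a hedged sketch of those extensions, assuming a JSON file whose records contain a nested address map and a repeated orders array (the file name and field names are made up for illustration):
SELECT t.customer_id, t.address.city AS city, FLATTEN(t.orders) AS order_rec
FROM dfs.`/data/customers.json` t;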
4. Real SQL -- not "SQL-like"
Drill supports the standard SQL:2003 syntax. No need to learn a new "SQL-like" language or struggle with a semi-functional BI tool. Drill supports many data types including DATE, INTERVAL, TIMESTAMP, and VARCHAR, as well as complex query constructs such as correlated sub-queries and joins in WHERE clauses.
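For instance, a join expressed in the WHERE clause against the sample Parquet files shipped with the Drill distribution (the path prefix is a placeholder for your install location):
SELECT n.N_NAME, r.R_NAME
FROM dfs.`/path/to/apache-drill/sample-data/nation.parquet` n,
     dfs.`/path/to/apache-drill/sample-data/region.parquet` r
WHERE n.N_REGIONKEY = r.R_REGIONKEY;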
5. Leverage standard BI tools
Drill works with standard BI tools. You can use your existing tools, such as Tableau, MicroStrategy, QlikView and Excel.
6. Interactive queries on Hive tables
Apache Drill lets you leverage your investments in Hive. You can run interactive queries with Drill on your Hive tables and access all Hive input/output formats (including custom SerDes). You can join tables associated with different Hive metastores, and you can join a Hive table with an HBase table or a directory of log files.
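A hedged sketch of such a cross-source join, assuming a Hive storage plugin registered as hive with an orders table and a directory of JSON log files reachable through the dfs plugin (all names are hypothetical):
SELECT o.order_id, l.status
FROM hive.orders o
JOIN dfs.`/logs/orders/` l ON o.order_id = l.order_id;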
7. Access multiple data sources
Drill is extensible. You can connect Drill out-of-the-box to file systems (local or distributed, such as S3 and HDFS), HBase and Hive. You can implement a storage plugin to make Drill work with any other data source. Drill can combine data from multiple data sources on the fly in a single query, with no centralized metadata definitions.
8. User-Defined Functions (UDFs) for Drill and Hive
Drill exposes a simple, high-performance Java API to build custom user-defined functions (UDFs) for adding your own business logic to Drill. Drill also supports Hive UDFs. If you have already built UDFs in Hive, you can reuse them with Drill with no modifications.
9. High performance
Drill is designed from the ground up for high throughput and low latency. It doesn't use a general purpose execution engine like MapReduce, Tez or Spark. As a result, Drill is flexible (schema-free JSON model) and performant. Drill's optimizer leverages rule- and cost-based techniques, as well as data locality and operator push-down, which is the capability to push down query fragments into the back-end data sources. Drill also provides a columnar and vectorized execution engine, resulting in higher memory and CPU efficiency.
10. Scales from a single laptop to a 1000-node cluster
Drill is available as a simple download you can run on your laptop. When you're ready to analyze larger datasets, deploy Drill on your Hadoop cluster (up to 1000 commodity servers). Drill leverages the aggregate memory in the cluster to execute queries using an optimistic pipelined model, and automatically spills to disk when the working set doesn't fit in memory.
The flow of a Drill query
- The Drill client issues a query. A client can be the JDBC driver, the ODBC driver, the command-line interface, or the REST API. Any Drillbit in the cluster can accept queries from clients; there is no master-slave concept.
- The Drillbit then parses the query, optimizes it, and generates a distributed query plan that is optimized for fast and efficient execution.
- The Drillbit that accepts the query becomes the driving Drillbit node for the request. It gets a list of available Drillbit nodes in the cluster from ZooKeeper. The driving Drillbit determines the appropriate nodes to execute various query plan fragments to maximize data locality.
- The Drillbit schedules the execution of query fragments on individual nodes according to the execution plan.
- The individual nodes finish their execution and return data to the driving Drillbit.
- The driving Drillbit streams results back to the client.
Goals on ARM64
- Create .deb and .rpm packages of Apache Drill for AArch64.
- Install the Drill packages along with their dependencies.
- Do basic workload testing
Prerequisites
- OpenJDK8
- Zookeeper
- git
- maven@v3.3.9
Efforts from Linaro BigData team
- Implement and upstream DEB/RPM support on Apache Drill
- Document the following installation steps on the Collaborate page.
- Define prerequisites
- Install HDFS aarch64 bits from debian repo
- Install YARN aarch64 bits from debian repo
- Install zookeeper aarch64 bits from debian repo
- Check YARN and zookeeper versions
- Setup HDFS in distributed mode
- Setup YARN in distributed mode
- Update Hosts files
- Configure HDFS, YARN and Zookeeper with nodes information.
- Point Drill to zookeeper quorum
- Configure Drill to run on YARN in distributed mode. This can cause issues if Drill was installed before YARN; in that case, uninstall Drill and redo the installation.
- Check if drill is running on YARN
- Configure the drill dfs (HDFS) storage plugin (see the configuration sketch after this list)
- Start drill daemon in each node
- Start the drillbit in distributed mode using drillbit.sh
- Test basic data import
- Double check and Re-configure zookeeper
- Update drill-env.sh settings
- Download and import github data as json files into HDFS
- Build drill query
- Check if the data shows up in drill
- Configure drill memory and check for optimization
- Check on caching in drill (Optimistic/pipelined execution)
- Research integrating Zeppelin/Jupyter with Drill queries, if possible
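For the dfs (HDFS) storage plugin step above, the plugin can be edited in the Drill Web UI under the Storage tab. A minimal sketch of the configuration, assuming the HDFS namenode runs on a host called namenode with the default port (adjust the host, port, workspaces, and formats for your cluster):
{
  "type": "file",
  "connection": "hdfs://namenode:8020/",
  "workspaces": {
    "root": { "location": "/", "writable": false, "defaultInputFormat": null }
  },
  "formats": {
    "json": { "type": "json", "extensions": ["json"] },
    "parquet": { "type": "parquet" }
  }
}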
Build/Setup and Run Apache Drill
git clone https://github.com/apache/drill.git
cd drill
mvn clean package -DskipTests
Test drill-embedded
You can launch Drill in embedded mode as shown below and query a sample file or a JSON file. Note that you need to provide the absolute path to the file in the query.
linaro@debian:~$ drill-embedded
Apache Drill 1.15.0-SNAPSHOT
"Drill must go on."
0: jdbc:drill:zk=local>
0: jdbc:drill:zk=local> SELECT * FROM dfs.`/home/linaro/Apache-components-build/drill/distribution/target/apache-drill-1.15.0-SNAPSHOT/apache-drill-1.15.0-SNAPSHOT/sample-data/region.parquet`;
+--------------+--------------+-----------------------+
| R_REGIONKEY  |    R_NAME    |       R_COMMENT       |
+--------------+--------------+-----------------------+
| 0            | AFRICA       | lar deposits. blithe  |
| 1            | AMERICA      | hs use ironic, even   |
| 2            | ASIA         | ges. thinly even pin  |
| 3            | EUROPE       | ly final courts cajo  |
| 4            | MIDDLE EAST  | uickly special accou  |
+--------------+--------------+-----------------------+
5 rows selected (1.025 seconds)
0: jdbc:drill:zk=local>
0: jdbc:drill:zk=local> !quit
Closing: org.apache.drill.jdbc.impl.DrillConnectionImpl
linaro@debian:~$
Set up and test Drill in clustered mode
Now we will go through these steps one by one in detail:
- Edit drill-override.conf to provide the ZooKeeper location
- Start the drillbit using bin/drillbit.sh start
- Repeat on other nodes
- Connect with sqlline by using bin/sqlline -u "jdbc:drill:zk=[zk_host:port]"
- Run a query (below).
Install OpenJDK
$ sudo apt-get install openjdk-8-jdk
Make sure you have the right OpenJDK version
$ java -version
It should display a 1.8.0 version, for example 1.8.0_111
Set JAVA_HOME
$ export JAVA_HOME=`readlink -f /usr/bin/java | sed "s:jre/bin/java::"`
Setting up Apache Zookeeper
Some distributions, such as Ubuntu and Debian, ship a recent ZooKeeper package, so you can simply install it with "sudo apt-get install zookeeper". If your distribution does not provide ZooKeeper, download and extract the latest stable release from the official Apache archive on every machine that will be part of the ZooKeeper quorum, as shown below:
$ wget https://www-us.apache.org/dist/zookeeper/stable/zookeeper-3.4.12.tar.gz
$ tar -xzvf zookeeper-3.4.12.tar.gz
Edit the /etc/hosts file on all the nodes and add the IP address and hostname (node name) of each node. If the hostnames are not correct, change them in the /etc/hosts file.
Example:
192.168.1.102 node1
192.168.1.103 node2
192.168.1.105 node3
Create zookeeper user
You can create a new user, or configure ZooKeeper for an existing user such as ubuntu, centos, or debian.
$ sudo adduser zookeeper
Configure the zookeeper user (or any existing user)
To form an ensemble with a leader-follower architecture, you need an odd number of ZooKeeper servers (1, 3, 5, 7, and so on).
Now create a zookeeper directory under /var/lib, which will serve as the ZooKeeper data directory, and another zookeeper directory under /var/log, where all the ZooKeeper logs will be captured. The ownership of both directories needs to be changed to the zookeeper user.
$ sudo mkdir /var/lib/zookeeper
$ cd /var/lib
$ sudo chown zookeeper:zookeeper zookeeper/
$ sudo mkdir /var/log/zookeeper
$ cd /var/log
$ sudo chown zookeeper:zookeeper zookeeper/
Note: If you get a message like the one below while starting ZooKeeper, check and adjust the permissions of the files under /var/lib/zookeeper and /var/log/zookeeper.
Since I am logged in as linaro and running ZooKeeper as that user, I changed the ownership to the linaro user.
linaro@node1:~/drill-setup/zookeeper-3.4.12$ ./bin/zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /home/linaro/drill-setup/zookeeper-3.4.12/bin/../conf/zoo.cfg
Starting zookeeper ... ./bin/zkServer.sh: line 149: /var/lib/zookeeper/zookeeper_server.pid: Permission denied
FAILED TO WRITE PID
Edit the .bashrc of the zookeeper user and add the following ZooKeeper environment variable:
$ export ZOO_LOG_DIR=/var/log/zookeeper
Source the .bashrc in current login session:
$ source ~/.bashrc
Create the server ID for the ensemble. Each ZooKeeper server in the ensemble must have a unique number between 1 and 255 in its myid file.
In Node1
$ sudo sh -c "echo '1' > /var/lib/zookeeper/myid"
In Node2
$ sudo sh -c "echo '2' > /var/lib/zookeeper/myid"
In Node3
$ sudo sh -c "echo '3' > /var/lib/zookeeper/myid"
Now go to the conf folder under the ZooKeeper home directory (the location where the archive was extracted).
$ cd /home/zookeeper/zookeeper-3.4.12/conf/
By default, a sample configuration file named zoo_sample.cfg is present in the conf directory. Make a copy of it named zoo.cfg as shown below, and edit the new zoo.cfg as described below on all the nodes.
$ cp zoo_sample.cfg zoo.cfg
Edit zoo.cfg and add the entries below:
$ vi zoo.cfg
dataDir=/var/lib/zookeeper
server.1=node1:2888:3888
server.2=node2:2888:3888
server.3=node3:2888:3888
Now make the following changes in the log4j.properties file:
$ vi log4j.properties
zookeeper.log.dir=/var/log/zookeeper
zookeeper.tracelog.dir=/var/log/zookeeper
log4j.rootLogger=INFO, CONSOLE, ROLLINGFILE
After zoo.cfg has been configured on all three nodes, start ZooKeeper on each node one by one using the following command:
$ /home/zookeeper/zookeeper-3.4.12/bin/zkServer.sh start
The ZooKeeper service starts on each node:
ZooKeeper JMX enabled by default
Using config: /home/ubuntu/zookeeper-3.4.12/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
A log file named zookeeper.log will be created in /var/log/zookeeper; tail it to check for any errors.
$ tail -f /var/log/zookeeper/zookeeper.log
Verify the Zookeeper Cluster and Ensemble
Check the ensemble status by running the following command on each node:
$ /home/zookeeper/zookeeper-3.4.12/bin/zkServer.sh status
In a three-node ensemble, one server will be in leader mode and the other two in follower mode. If you run only a single server, it will report standalone mode.
With three nodes:
node1
ZooKeeper JMX enabled by default
Using config: /home/zookeeper/zookeeper-3.4.12/bin/../conf/zoo.cfg
Mode: leader
node2
ZooKeeper JMX enabled by default
Using config: /home/zookeeper/zookeeper-3.4.12/bin/../conf/zoo.cfg
Mode: follower
node3
ZooKeeper JMX enabled by default
Using config: /home/zookeeper/zookeeper-3.4.12/bin/../conf/zoo.cfg
Mode: follower
standalone
ZooKeeper JMX enabled by default
Using config: /home/zookeeper/zookeeper-3.4.12/bin/../conf/zoo.cfg
Mode: standalone
$ echo stat | nc node1 2181
Lists brief details for the server and connected clients.
$ echo mntr | nc node1 2181
Lists ZooKeeper variables useful for monitoring cluster health.
$ echo srvr | nc localhost 2181
Lists full details for the Zookeeper server.
If you want to inspect the znodes, you can connect using the command below on any of the ZooKeeper nodes:
$ /home/zookeeper/zookeeper-3.4.12/bin/zkCli.sh -server `hostname -f`:2181
This connects to the ZooKeeper node, where you can list the znode contents.
Install Prerequisites for the Build
$ sudo apt-get install git
Setup environment
Add the following environment variables to your profile file (for example, ~/.bashrc):
# setup environments
export LANG="en_US.UTF-8"
export PATH=${HOME}/gradle/bin:$PATH
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64
export JAVA_TOOL_OPTIONS="-Dfile.encoding=UTF8"
$ source ~/.bashrc
Hooking up upstream Maven 3.6.0 (for Debian Jessie only)
$ wget http://mirrors.gigenet.com/apache/maven/maven-3/3.6.0/binaries/apache-maven-3.6.0-bin.tar.gz
$ tar xvf apache-maven-3.6.0-bin.tar.gz
$ cd apache-maven-3.6.0/bin
$ export PATH=$PWD:$PATH
$ mvn --version # should list the version as 3.6.0
Clone and Build Apache Drill
$ git clone https://gitbox.apache.org/repos/asf/drill.git
$ cd drill
$ git branch v1.15.0 origin/1.15.0
$ git checkout v1.15.0
To build .deb package
$ mvn clean -X package -Pdeb -DskipTests
To build .rpm package
$ mvn clean -X package -Prpm -DskipTests
After a successful build, change into the distribution directory:
$ cd distribution/target/apache-drill-1.15.0/apache-drill-1.15.0
Then edit your /etc/hosts file and make sure the loopback entries are commented out and replaced with your host's <IP-address>, for example:
#127.0.0.1 localhost
#127.0.1.1 ubuntu
<IP-address> ubuntu
<IP-address> localhost
This is required because in distributed mode Drill cannot bind to the loopback IP 127.0.1.1 (see https://stackoverflow.com/questions/40506221/how-to-start-drillbit-locally-in-distributed-mode).
Next, edit conf/drill-override.conf and set the cluster ID and the ZooKeeper connection string, for example:
drill.exec:
{ cluster-id: "1", zk.connect: "<IP-address>:2181" }
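The example above points Drill at a single ZooKeeper server. If you are using the three-node quorum set up earlier, you can list all three servers in zk.connect instead (a sketch, reusing the node1/node2/node3 hostnames from /etc/hosts):
drill.exec:
{ cluster-id: "1", zk.connect: "node1:2181,node2:2181,node3:2181" }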
Now you can run the drillbit and watch the log. To experiment further with drillbit options, refer to the drill-override-example.conf file.
$ ./bin/drillbit.sh help
Usage: drillbit.sh [--config|--site <site-dir>] (start|stop|status|restart|run|graceful_stop) [args]
In another terminal, follow the log with the tail command:
$ tail -f log/drillbit.log
$ ./bin/drillbit.sh start
$ ./bin/drillbit.sh status
drillbit is running.
$ ./bin/drillbit.sh graceful_stop
Stopping drillbit
...
You can either stop the drillbit or do a graceful stop. Repeat the same steps on the other machines (nodes).
With this setup, I was able to run Drill, access the Web UI at http://<IP-Address>:8047, and run a sample query in distributed mode. To run in distributed mode, you just need to repeat the same setup on multiple machines (nodes). Reference: https://drill.apache.org/docs/starting-the-web-ui/
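As a quick sanity check from the command line, you can also connect sqlline to the cluster through ZooKeeper and run a query against the employee.json sample bundled on Drill's classpath (a sketch, reusing the <IP-address> placeholder from drill-override.conf):
$ ./bin/sqlline -u "jdbc:drill:zk=<IP-address>:2181"
SELECT * FROM cp.`employee.json` LIMIT 3;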
If you are using CentOS 7, be a little careful: connection errors may be caused by the firewall. I used the following set of commands to open the ZooKeeper port (or stop the firewall entirely):
$ sudo systemctl stop firewalld
$ sudo firewall-cmd --zone=public --add-port=2181/udp --add-port=2181/tcp --permanent
[sudo] password for centos:
success
$ sudo firewall-cmd --reload
success
$ zkServer.sh restart
ZooKeeper JMX enabled by default
Using config: /home/centos/zookeeper-3.4.12/bin/../conf/zoo.cfg
ZooKeeper JMX enabled by default
Using config: /home/centos/zookeeper-3.4.12/bin/../conf/zoo.cfg
Stopping zookeeper ... STOPPED
ZooKeeper JMX enabled by default
Using config: /home/centos/zookeeper-3.4.12/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
REFERENCE:
Git Repository: https://github.com/apache/drill
Official web page: http://drill.apache.org
https://drill.apache.org/docs/launch-drill-under-yarn
https://drill.apache.org/docs/installing-drill-in-distributed-mode
https://drill.apache.org/docs/configuring-storage-plugins
https://drill.apache.org/docs/query-data-introduction
https://drill.apache.org/docs/starting-drill-in-distributed-mode
https://drill.apache.org/docs/json-data-model
https://drill.apache.org/docs/querying-json-files
https://drill.apache.org/docs/query-plans
https://drill.apache.org/docs/drill-query-execution
https://drill.apache.org/docs/sql-reference
https://drill.apache.org/docs/configuring-the-drill-shell
https://stackoverflow.com/questions/13316776/zookeeper-connection-error
https://www.tutorialspoint.com/zookeeper/index.htm
https://blog.redbranch.net/2018/04/19/zookeeper-install-on-centos-7/
https://drill.apache.org/docs/distributed-mode-prerequisites/