The storage handler is built as an independent module, hive-hbase-handler-x.y.z.jar, which must be available on the Hive client auxpath, along with the HBase, Guava, and ZooKeeper jars. It also requires the correct configuration property to be set so that it can connect to the right HBase master. See the HBase documentation for how to set up an HBase cluster.
Here's an example using the CLI from a source build environment, targeting a single-node HBase server. (Note that the jar locations and names changed in Hive 0.9.0, so the command below needs some adjustment for earlier releases.)
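A minimal sketch of such an invocation, assuming a source tree at $HIVE_SRC and a hypothetical master host hbase.example.com; the exact jar names (in particular the Guava jar) depend on your build:

    # point the Hive CLI at the handler and client jars, plus the HBase master host:port
    $HIVE_SRC/build/dist/bin/hive \
      --auxpath $HIVE_SRC/build/dist/lib/hive-hbase-handler-0.9.0.jar,$HIVE_SRC/build/dist/lib/hbase-0.92.0.jar,$HIVE_SRC/build/dist/lib/zookeeper-3.3.4.jar,$HIVE_SRC/build/dist/lib/guava-r09.jar \
      --hiveconf hbase.master=hbase.example.com:60000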
Here's an example which instead targets a distributed HBase cluster, where a quorum of 3 ZooKeeper servers is used to elect the HBase master:
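A corresponding sketch, with hypothetical ZooKeeper host names; here hbase.zookeeper.quorum replaces hbase.master:

    # same jars as above, but locate HBase via its ZooKeeper quorum
    $HIVE_SRC/build/dist/bin/hive \
      --auxpath $HIVE_SRC/build/dist/lib/hive-hbase-handler-0.9.0.jar,$HIVE_SRC/build/dist/lib/hbase-0.92.0.jar,$HIVE_SRC/build/dist/lib/zookeeper-3.3.4.jar,$HIVE_SRC/build/dist/lib/guava-r09.jar \
      --hiveconf hbase.zookeeper.quorum=zk1.example.com,zk2.example.com,zk3.example.com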
The handler requires Hadoop 0.20 or higher, and has only been tested with dependency versions hadoop-0.20.x, hbase-0.92.0, and zookeeper-3.3.4. If you are not using hbase-0.92.0, you will need to rebuild the handler against the HBase jar matching your version, and change the --auxpath above accordingly. Using mismatched versions will lead to misleading connection failures such as MasterNotRunningException, since the HBase RPC protocol changes often.
In order to create a new HBase table which is to be managed by Hive, use the STORED BY clause on CREATE TABLE:
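For example, a sketch of such a statement (the names used here, hbase_table_1, xyz, cf1, and val, are the ones referred to in the rest of this section):

    CREATE TABLE hbase_table_1(key int, value string)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
    TBLPROPERTIES ("hbase.table.name" = "xyz");

The :key entry in the mapping binds the Hive key column to the HBase row key, and cf1:val binds the value column to the val column in column family cf1.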
The hbase.columns.mapping property is required and will be explained in the next section. The hbase.table.name property is optional; it controls the name of the table as known by HBase, and allows the Hive table to have a different name. In this example, the table is known as hbase_table_1 within Hive, and as xyz within HBase. If not specified, then the Hive and HBase table names will be identical.
After executing the command above, you should be able to see the new (empty) table in the HBase shell:
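For example (prompts and output formatting vary by HBase version; list should show the xyz table, and describe should show a single column family, cf1):

    $ hbase shell
    hbase(main):001:0> list
    hbase(main):002:0> describe "xyz"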
Notice that even though a column name "val" is specified in the mapping, only the column family name "cf1" appears in the DESCRIBE output in the HBase shell. This is because in HBase, only column families (not columns) are known in the table-level metadata; column names within a column family are only present at the per-row level.
Here's how to move data from Hive into the HBase table (see the Hive GettingStarted guide for how to create the example table pokes in Hive first):
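A sketch, assuming pokes has the usual (foo int, bar string) schema; the WHERE clause just keeps the example down to a single row:

    INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM pokes WHERE foo=98;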
Use HBase shell to verify that the data actually got loaded:
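For example, a scan should now show one row with row key 98 and a cf1:val cell:

    hbase(main):003:0> scan "xyz"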
And then query it back via Hive:
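For example:

    SELECT * FROM hbase_table_1;

This should return the row inserted above, with the HBase row key surfaced as the Hive key column.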
Inserting large amounts of data may be slow due to write-ahead log (WAL) overhead; if you would like to disable WAL for the insert, make sure your Hive build includes HIVE-1383 (as of Hive 0.6), and then issue this command before the INSERT:
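The relevant session setting is hive.hbase.wal.enabled (added by HIVE-1383):

    set hive.hbase.wal.enabled=false;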
Warning: disabling WAL may lead to data loss if an HBase failure occurs, so only use this if you have some other recovery strategy available.
If you want to give Hive access to an existing HBase table, use CREATE EXTERNAL TABLE:
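A sketch, where hbase_table_2 and some_existing_table stand in for the new Hive table name and the existing HBase table name:

    CREATE EXTERNAL TABLE hbase_table_2(key int, value string)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
    TBLPROPERTIES ("hbase.table.name" = "some_existing_table");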
Again, hbase.columns.mapping is required (and will be validated against the existing HBase table's column families), whereas hbase.table.name is optional.