Cloudera Enterprise 6.0 Beta | Other versions

Query Vectorization for Apache Hive in CDH

By default, the Hive query execution engine processes one row of a table at a time. The single row of data goes through all the operators in the query before the next row is processed, resulting in very inefficient CPU usage. In vectorized query execution, data rows are batched together and represented as a set of column vectors. The basic idea of vectorized query execution is to process a batch of rows as an array of column vectors:

Figure 1. How Hive Query Vectorization Works

When query vectorization is enabled, the query engine processes vectors of columns, which greatly improves CPU utilization for typical query operations like scans, filters, aggregates, and joins.

Continue reading:

Enabling Hive Query Vectorization

Hive query vectorization is enabled by setting the hive.vectorized.execution.enabled property to true. In both CDH 5 and CDH 6, this property is set to true by default. But in CDH 5, vectorized query execution in Hive is only possible on ORC-formatted tables, which Cloudera recommends that you do not use for overall compatibility with the CDH platform. Instead, Cloudera recommends that you use tables in the Parquet format because all CDH components support this format and can be consumed by all CDH components. In CDH 6, query vectorization is supported for Parquet tables in Hive.

Using Cloudera Manager to Enable Query Vectorization on a Server-wide Basis

For managed clusters, open the Cloudera Manager Admin Console and perform the following steps:

  1. Select the Hive service.
  2. Click the Configuration tab.
  3. Search for "enable vectorization."
  4. Select the Enable Vectorization Optimization option if it is not already selected. This option is enabled by default in CDH.



  5. Click Save Changes.
  6. Click the Instances tab, and then click the Restart the service (or the instance) for the changes to take effect:



Manually Enabling Query Vectorization on a Server-Wide Basis

For unmanaged clusters, set the hive.vectorized.execution.enabled property manually in the hive-site.xml file to enable query vectorization on a server-wide basis:

<property>
     <name>hive.vectorized.execution.enabled</name<
     <value>true</value>
     <description>Enables query vectorization.</description>
</property>
      

Enabling Hive Query Vectorization on a Per-Session Basis

To enable query vectorization on an individual session only, use the Hive SET command:

SET hive.vectorized.execution.enabled=true;
        

Using the SET command to enabled query vectorization on a per-session basis is useful to test the effects of vectorization on execution for specific sets of queries.

Tuning Hive Query Vectorization

When query vectorization is enabled, there are additional properties you can set to tune how your queries are vectorized. These properties can be set in Cloudera Manager, can be set manually in the hive-site.xml file, or can be set on a per-query basis using the Hive SET command. Use the same general steps listed in the previous section to configure these properties in Cloudera Manager or manually.

hive.vectorized.groupby.checkinterval
Description: For vectorized GROUP BY operations, specifies the number of row entries added to the hash table before rechecking the average variable size when estimating memory usage.
Recommendations:
Default Setting: 4096
hive.vectorized.groupby.flush.percent
Description: Sets the percentage between 0.0 and 1.0 percent of entries in the vectorized GROUP BY aggregation hash that is flushed when the memory threshold is exceeded.
Recommendations:
Default Setting: 0.1
hive.vectorized.execution.reduce.enabled
Description: Turns on or off vectorization for the reduce-side of query execution. Applies only when the execution engine is set to Spark.
Recommendations: Enable this property by setting it to true if you are using Hive on Spark. Otherwise, do not enable this property.
Default Setting: true
hive.vectorized.use.vectorized.input.format
Description: Enables Hive to take advantage of input formats that support vectorization when they are available.
Recommendations: Enable this property by setting it to true if you have Parquet or ORC workloads that you want to be vectorized.
Default Setting: true
hive.vectorized.use.vector.serde.deserialize
Description: Enables Hive to use built-in vector SerDes to process text and SequenceFile tables for vectorized query execution.
Recommendations:
Default Setting: false
hive.vectorized.input.format.excludes
Description: Specifies certain input formats to exclude from vectorized query execution. You can select Parquet, ORC, or both. Vectorized execution can still occur for an excluded input format based on whether row SerDes or vector SerDes are enabled.
Recommendations: Use this property to automatically disable certain file formats from vectorized execution. Cloudera recommends that you do not set this property unless you have a problem and which to disable Parquet query vectorization on certain file formats.
Default Setting: Not Applicable

Supported/Unsupported Data Types and Functions

Most common data types and functions are supported by Hive query vectorization on Parquet tables in CDH. The following subsections provide more details about data type and function support.

Supported/Unsupported Data Types

Currently, some complex data types, such as map and list are not supported for Hive query vectorization on Parquet tables in CDH. The following data types are supported.

Table 1. Supported Data Types for Hive Query Vectorization on Parquet Tables
int smallint tinyint
bigint integer long
short timestamp interval_year_month
boolean binary string
byte float double
void struct

Supported/Unsupported Functions

Common arithmetic, boolean (for example AND, OR), comparison, mathematical (for example SIN, COS, LOG), date, and type-cast functions are supported. Also common aggregate functions such as MIN, MAX, COUNT, AVG, and SUM are also supported. If a function is not supported, the vectorizer attempts to vectorize the function based on the configuration value specified for hive.vectorized.adaptor.usage.mode. You can set this property to none, chosen, or all. To set this property in Cloudera Manager, use the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml. Set it manually in the hive-site.xml file for server-wide scope. To set it on a per-query basis, use the Hive SET command as described above.

Verifying a Query is Vectorized

To verify that a query is vectorized, use the EXPLAIN VECTORIZATION statement. This statement returns a query plan that shows how the Hive query execution engine processes your query and whether vectorization is being triggered.

Example of Verifying that Query Vectorization is Triggered for Your Query

This example uses the Hive table p_clients, which uses the Parquet format and contains the following columns and data types:

DESCRIBE p_clients;
⋮
+------------------+------------+----------+
|     col_name     | data_type  | comment  |
+------------------+------------+----------+
| name             | string     |          |
| symbol           | string     |          |
| lastsale         | double     |          |
| marketlabel      | string     |          |
| marketamount     | bigint     |          |
| ipoyear          | int        |          |
| segment          | string     |          |
| business         | string     |          |
| quote            | string     |          |
+------------------+------------+----------+
        

To get the query execution plan for a query, enter the following commands in a Beeline session:

EXPLAIN VECTORIZATION SELECT COUNT(*) FROM p_clients WHERE ipoyear = 2009;
        

This command returns the following query execution plan:

Figure 2. EXPLAIN VECTORIZATION Query Execution Plan for Hive Table Using the Parquet Format

Vectorization is explained in several parts of this query plan:

  1. Immediately, the query plan shows that the hive.vectorized.execution.enabled property is set to true, so vectorization is enabled on this system .
  2. Then in the STAGE PLANS section, the plans shows that the query was executed in the vectorization mode overall.
  3. This sub-section of the plan shows that the map-side of query execution was vectorized because the hive.vectorized.use.vectorized.input.format property is set to true. This section also contains details about input file format and adaptor settings used in the map-side of query execution.
  4. This sub-section of the query plan shows that the reduce-side of query execution was not vectorized because the hive.vectorized.execution.reduce.enabled property is set to false. If you want the reduce-side of query execution to be vectorized, the execution plan indicates what property you need to set to enable that.

By using the EXPLAIN VECTORIZATION statement with your queries, you can find out before you deploy them whether vectorization will be triggered and what properties you must set to enable it.

Page generated March 7, 2018.