Upgrading Impala

Upgrading Impala involves stopping Impala services, using your operating system's package management tool to upgrade Impala to the latest version, and then restarting Impala services.

  Note:
  • Each version of CDH has an associated version of Impala.
  • When you upgrade Impala, also upgrade Cloudera Manager if necessary. Cloudera Manager is continually updated with configuration settings for features introduced in the latest Impala releases.
  • Make sure you are using the appropriate CDH repositories shown on the CDH version and packaging page, then follow the procedures throughout the rest of this section.
  • Every time you upgrade to a new major or minor Impala release, see Apache Impala Incompatible Changes in the Release Notes for any changes needed in your source code, startup scripts, and so on.
  • Also check Apache Impala Known Issues in the Release Notes for any issues or limitations that require workarounds.

Upgrading Impala through Cloudera Manager - Parcels

Parcels are an alternative binary distribution format available in Cloudera Manager 4.5 and higher.

  Important: In CDH 5 and higher, there is not a separate Impala parcel; Impala is part of the main CDH parcel. Each level of CDH has a corresponding version of Impala, and you upgrade Impala by upgrading CDH. See the CDH upgrade instructions and choose the instructions for parcels. The remainder of this section only covers parcel upgrades for Impala under CDH 4.

To upgrade Impala for CDH 4 in a Cloudera Manager environment using parcels:

  1. If you originally installed using packages and now are switching to parcels, remove all the Impala-related packages first. You can check which packages are installed using one of the following commands, depending on your operating system:

    rpm -qa | grep impala               # RHEL, Oracle Linux, CentOS, SLES
    dpkg --get-selections | grep impala # Ubuntu, Debian

    Then remove the packages using one of the following commands:

    sudo yum remove pkg_names    # RHEL, Oracle Linux, CentOS
    sudo zypper remove pkg_names # SLES
    sudo apt-get purge pkg_names # Ubuntu, Debian
  2. Connect to the Cloudera Manager Admin Console.

  3. Go to the Hosts > Parcels tab. You should see a parcel with a newer version of Impala that you can upgrade to.

  4. Click Download, then Distribute. (The button changes as each step completes.)

  5. Click Activate.

  6. When prompted, click Restart to restart the Impala service.

Upgrading Impala through Cloudera Manager - Packages

To upgrade Impala in a Cloudera Manager environment using packages:

  1. Connect to the Cloudera Manager Admin Console.
  2. In the Services tab, click the Impala service.
  3. Click Actions and click Stop.
  4. Use one of the following sets of commands to update Impala on each Impala node in your cluster:

    For RHEL, Oracle Linux, or CentOS systems:

    $ sudo yum update impala
    $ sudo yum update hadoop-lzo-cdh4 # Optional; if this package is already installed
    

    For SUSE systems:

    $ sudo zypper update impala
    $ sudo zypper update hadoop-lzo-cdh4 # Optional; if this package is already installed
    

    For Debian or Ubuntu systems:

    $ sudo apt-get install impala
    $ sudo apt-get install hadoop-lzo-cdh4 # Optional; if this package is already installed
    
  5. Use one of the following sets of commands to update Impala shell on each node on which it is installed:

    For RHEL, Oracle Linux, or CentOS systems:

    $ sudo yum update impala-shell

    For SUSE systems:

    $ sudo zypper update impala-shell

    For Debian or Ubuntu systems:

    $ sudo apt-get install impala-shell
  6. Connect to the Cloudera Manager Admin Console.
  7. In the Services tab, click the Impala service.
  8. Click Actions and click Start.

Upgrading Impala from the Command Line

To upgrade Impala on a cluster by using the command line, run the following Linux commands on the appropriate hosts in your cluster:

  1. Stop Impala services.
    1. Stop impalad on each Impala node in your cluster:
      $ sudo service impala-server stop
    2. Stop any instances of the state store in your cluster:
      $ sudo service impala-state-store stop
    3. Stop any instances of the catalog service in your cluster:
      $ sudo service impala-catalog stop
  2. Check if there are new recommended or required configuration settings to put into place in the configuration files, typically under /etc/impala/conf. See Post-Installation Configuration for Impala for settings related to performance and scalability.
  3. Use one of the following sets of commands to update Impala on each Impala node in your cluster:

    For RHEL, Oracle Linux, or CentOS systems:

    $ sudo yum update impala-server
    $ sudo yum update hadoop-lzo-cdh4 # Optional; if this package is already installed
    $ sudo yum update impala-catalog # New in Impala 1.2; do yum install when upgrading from 1.1.
    

    For SUSE systems:

    $ sudo zypper update impala-server
    $ sudo zypper update hadoop-lzo-cdh4 # Optional; if this package is already installed
    $ sudo zypper update impala-catalog # New in Impala 1.2; do zypper install when upgrading from 1.1.
    

    For Debian or Ubuntu systems:

    $ sudo apt-get install impala-server
    $ sudo apt-get install hadoop-lzo-cdh4 # Optional; if this package is already installed
    $ sudo apt-get install impala-catalog # New in Impala 1.2.
    
  4. Use one of the following sets of commands to update Impala shell on each node on which it is installed:

    For RHEL, Oracle Linux, or CentOS systems:

    $ sudo yum update impala-shell

    For SUSE systems:

    $ sudo zypper update impala-shell

    For Debian or Ubuntu systems:

    $ sudo apt-get install impala-shell
  5. Depending on which release of Impala you are upgrading from, you might find that the symbolic links /etc/impala/conf and /usr/lib/impala/sbin are missing. If so, see Apache Impala Known Issues for the procedure to work around this problem.
  6. Restart Impala services:
    1. Restart the Impala state store service on the desired nodes in your cluster. Expect to see a process named statestored if the service started successfully.
      $ sudo service impala-state-store start
      $ ps ax | grep [s]tatestored
       6819 ?        Sl     0:07 /usr/lib/impala/sbin/statestored -log_dir=/var/log/impala -state_store_port=24000
      

      Restart the state store service before the Impala server service to avoid "Not connected" errors when you run impala-shell.

    2. Restart the Impala catalog service on whichever host it runs on in your cluster. Expect to see a process named catalogd if the service started successfully.
      $ sudo service impala-catalog restart
      $ ps ax | grep [c]atalogd
       6068 ?        Sl     4:06 /usr/lib/impala/sbin/catalogd
      
    3. Restart the Impala daemon service on each node in your cluster. Expect to see a process named impalad if the service started successfully.
      $ sudo service impala-server start
      $ ps ax | grep [i]mpalad
       7936 ?        Sl     0:12 /usr/lib/impala/sbin/impalad -log_dir=/var/log/impala -state_store_port=24000
       -state_store_host=127.0.0.1 -be_port=22000
      
  Note: If the services did not start successfully (even though the sudo service command might display [OK]), check for errors in the Impala log file, typically in /var/log/impala.

Impala Upgrade Considerations

Converting Legacy UDFs During Upgrade to CDH 5.12 or Higher

In CDH 5.7 / Impala 2.5 and higher, new syntax is available for creating Java-based UDFs. UDFs created with the new syntax persist across Impala restarts, and are more compatible with Hive UDFs. Because the replication features in CDH 5.12 and higher only work with the new-style syntax, convert any older Java UDFs to use the new syntax at the same time you upgrade to CDH 5.12 or higher.

Follow these steps to convert old-style Java UDFs to the new persistent kind:

  1. Use SHOW FUNCTIONS to identify all UDFs and UDAs.
  2. For each function, use SHOW CREATE FUNCTION and save the statement in a script file.
  3. For each Java UDF, change the output of SHOW CREATE FUNCTION to use the new CREATE FUNCTION syntax (without argument types), which makes the UDF persistent.
  4. For each function, drop it and re-create it, using the new CREATE FUNCTION syntax for all Java UDFs.
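For example, the conversion of a single Java UDF might look like the following sketch. The function name, JAR path, and class name are hypothetical; substitute the values from your own SHOW CREATE FUNCTION output.

    -- Old-style definition, tied to one signature and not persistent across restarts:
    --   CREATE FUNCTION my_upper(STRING) RETURNS STRING
    --     LOCATION '/user/impala/udfs/udfs.jar' SYMBOL='com.example.MyToUpper';

    -- Drop the old function (old-style UDFs are dropped by signature):
    DROP FUNCTION IF EXISTS my_upper(STRING);

    -- Re-create it with the new syntax (no argument types), making it persistent:
    CREATE FUNCTION my_upper
      LOCATION '/user/impala/udfs/udfs.jar' SYMBOL='com.example.MyToUpper';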

Handling Large Rows During Upgrade to CDH 5.13 / Impala 2.10 or Higher

In CDH 5.13 / Impala 2.10 and higher, memory management for large column values is handled differently than in previous releases. Some queries that succeeded previously might now fail immediately with an error message. The --read_size startup option no longer needs to be increased from its default of 8 MB for queries against tables with huge column values. Instead, the MAX_ROW_SIZE query option lets you fine-tune this value at the level of individual queries or sessions. The default for MAX_ROW_SIZE is 512 KB. If your queries process rows with column values totaling more than 512 KB, you might need to take action to avoid problems after upgrading.

Follow these steps to verify whether your deployment needs any special setup to handle the new treatment of large rows:

  1. Check if your impalad daemons are already running with a larger-than-normal value for the --read_size configuration setting.
  2. Examine all tables to find whether any have STRING values that are hundreds of kilobytes or more in length. This information is available under the Max Size column in the output from the SHOW COLUMN STATS statement, after the COMPUTE STATS statement has been run on the table. In the following example, the s1 column with a maximum length of 700006 could cause an issue by itself, or if a combination of values from the s1, s2, and s3 columns exceeded the 512 KB MAX_ROW_SIZE value.
    show column stats big_strings;
    +--------+--------+------------------+--------+----------+-------------------+
    | Column | Type   | #Distinct Values | #Nulls | Max Size | Avg Size          |
    +--------+--------+------------------+--------+----------+-------------------+
    | x      | BIGINT | 30000            | -1     | 8        | 8                 |
    | s1     | STRING | 30000            | -1     | 700006   | 392625            |
    | s2     | STRING | 30000            | -1     | 10532    | 9232.6669921875   |
    | s3     | STRING | 30000            | -1     | 103      | 87.66670227050781 |
    +--------+--------+------------------+--------+----------+-------------------+
    
  3. For each candidate table, run a query to materialize the largest string values from the largest columns all at once. Check if the query fails with a message suggesting to set the MAX_ROW_SIZE query option.
    select count(distinct s1, s2, s3) from little_strings;
    +----------------------------+
    | count(distinct s1, s2, s3) |
    +----------------------------+
    | 30000                      |
    +----------------------------+
    
    select count(distinct s1, s2, s3) from big_strings;
    WARNINGS: Row of size 692.13 KB could not be materialized in plan node with id 1.
      Increase the max_row_size query option (currently 512.00 KB) to process larger rows.
    

If any of your tables are affected, make sure the MAX_ROW_SIZE is set large enough to allow all queries against the affected tables to deal with the large column values:

  • In SQL scripts run by impala-shell with the -q or -f options, or in interactive impala-shell sessions, issue a statement SET MAX_ROW_SIZE=large_enough_size before the relevant queries (a script-file variant is sketched after this list):

    $ impala-shell -i localhost -q \
      'set max_row_size=1mb; select count(distinct s1, s2, s3) from big_strings'
    
  • If large column values are common to many of your tables and it is not practical to set MAX_ROW_SIZE only for a limited number of queries or scripts, use the --default_query_options configuration setting for all your impalad daemons, and include the larger MAX_ROW_SIZE setting as part of the argument to that setting. For example:

    impalad --default_query_options='max_row_size=1gb;appx_count_distinct=true'
    
  • If your deployment uses a non-default value for the --read_size configuration setting, remove that setting and let Impala use the default. A high value for --read_size could cause higher memory consumption in CDH 5.13 / Impala 2.10 and higher than in previous versions. The --read_size setting still controls the HDFS I/O read size (which is rarely if ever necessary to change), but no longer affects the spill-to-disk buffer size.
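
For scripts run with the -f option, the same approach applies: put the SET statement at the top of the script file. A minimal sketch, using hypothetical file and table names:

    $ cat /tmp/big_strings.sql
    set max_row_size=1mb;
    select count(distinct s1, s2, s3) from big_strings;

    $ impala-shell -i localhost -f /tmp/big_strings.sql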

Change Impala catalogd Heap when Upgrading from CDH 5.6 or Lower

The default heap size for Impala catalogd has changed in CDH 5.7 / Impala 2.5 and higher:

  • Before CDH 5.7, catalogd used the JVM default maximum heap size, which is the smaller of 1/4 of the physical memory or 32 GB.
  • Starting with CDH 5.7.0, the default catalogd heap size is 4 GB.

For example, on a host with 128 GB of physical memory, the default catalogd heap decreases from 32 GB to 4 GB.

For schemas with large numbers of tables, partitions, and data files, the catalogd daemon might encounter an out-of-memory error. To prevent the error, increase the memory limit for the catalogd daemon:
  1. Check current memory usage for the catalogd daemon by running the following commands on the host where that daemon runs on your cluster:

      jcmd catalogd_pid VM.flags   # JVM flags, including the current maximum heap size
      jmap -heap catalogd_pid      # Heap configuration and current usage
      
  2. Decide on a large enough value for the catalogd heap. You express it as an environment variable value as follows:

      JAVA_TOOL_OPTIONS="-Xmx8g"
      
  3. On systems managed by Cloudera Manager, include this value in the configuration field Java Heap Size of Catalog Server in Bytes (Cloudera Manager 5.7 and higher), or Impala Catalog Server Environment Advanced Configuration Snippet (Safety Valve) (prior to Cloudera Manager 5.7). Then restart the Impala service.

  4. On systems not managed by Cloudera Manager, put this environment variable setting into the startup script for the catalogd daemon, then restart the catalogd daemon (see the sketch after these steps).

  5. Use the same jcmd and jmap commands as earlier to verify that the new settings are in effect.
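
A minimal sketch of step 4 for a package-based installation, assuming the init scripts source environment settings from /etc/default/impala (adjust the path if your distribution uses a different defaults file):

    # Append the heap setting to the defaults file read by the catalogd init script:
    $ echo 'export JAVA_TOOL_OPTIONS="-Xmx8g"' | sudo tee -a /etc/default/impala
    $ sudo service impala-catalog restart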

List of Reserved Words Updated in CDH 6.0

The list of reserved words in Impala was updated in CDH 6.0. If you need to use a reserved word as an identifier, for example as a table name, enclose the word in back-ticks (``).

If you need to use the reserved words from previous versions of CDH, set the impalad and catalogd startup option reserved_words_version to "2.11.0".
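
For example, a table created in an earlier release under a name that is reserved in CDH 6.0 (the name here is hypothetical) remains accessible when the identifier is enclosed in back-ticks:

    SELECT * FROM `position`;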

Decimal V2 Used by Default in CDH 6.0

Impala supports two different implementations of the DECIMAL type. In CDH 6.0, DECIMAL V2 is used by default. See DECIMAL Type for detailed information.

If you need to continue using the first version of the DECIMAL type for backward compatibility with your existing queries, set the DECIMAL_V2 query option to 0.
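
For example, to revert to the original DECIMAL semantics for the current session:

    SET DECIMAL_V2=0;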

Behavior of Column Aliases Changed in CDH 6.0

To conform to the SQL standard, Impala no longer performs alias substitution in the subexpressions of GROUP BY, HAVING, and ORDER BY.

For example, the following statements result in syntax errors:

    SELECT int_col / 2 AS x
    FROM functional.alltypes
    GROUP BY x / 2;

    SELECT int_col / 2 AS x
    FROM functional.alltypes
    ORDER BY -x;

    SELECT int_col / 2 AS x
    FROM functional.alltypes
    GROUP BY x
    HAVING x > 3;

The following uses of aliases are supported:

    SELECT int_col / 2 AS x
    FROM functional.alltypes
    GROUP BY x;

    SELECT int_col / 2 AS x
    FROM functional.alltypes
    ORDER BY x;

    SELECT NOT bool_col AS nb
    FROM functional.alltypes
    GROUP BY nb
    HAVING nb;

Default PARQUET_ARRAY_RESOLUTION Changed in CDH 6.0

The default value for the PARQUET_ARRAY_RESOLUTION query option was changed to THREE_LEVEL in CDH 6.0.

The PARQUET_ARRAY_RESOLUTION setting controls the path-resolution behavior for Parquet files with nested arrays. The accepted values are:

  • THREE_LEVEL

    Assumes arrays are encoded with the 3-level representation.

    Also resolves arrays encoded with a single level.

    Does not attempt a 2-level resolution.

  • TWO_LEVEL

    Assumes arrays are encoded with the 2-level representation.

    Also resolves arrays encoded with a single level.

    Does not attempt a 3-level resolution.

  • TWO_LEVEL_THEN_THREE_LEVEL

    First tries to resolve assuming the 2-level representation, and if unsuccessful, tries the 3-level representation.

    Also resolves arrays encoded with a single level.
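
For example, to query Parquet files known to use the older 2-level encoding, you can switch the resolution for the current session:

    SET PARQUET_ARRAY_RESOLUTION=TWO_LEVEL;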

Enable Clustering Hint for Inserts

In CDH 6.0, the clustered hint is enabled by default. The hint adds a local sort by the partitioning columns to the query plan before the insert.

The clustered hint is only effective for HDFS and Kudu tables.

As in previous versions, the noclustered hint prevents clustering. If a table has ordering columns defined, the noclustered hint is ignored with a warning.
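
For example, to override the default for a single statement, place a noclustered hint immediately before the SELECT keyword of an INSERT ... SELECT statement. The table names here are hypothetical:

    INSERT INTO sales_part PARTITION (year, month)
      /* +noclustered */
      SELECT c1, c2, year, month FROM sales_staging;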

Deprecated Query Options Removed in CDH 6.0

The following query options were deprecated for several releases and have been removed in CDH 6.0: DEFAULT_ORDER_BY_LIMIT, ABORT_ON_DEFAULT_LIMIT_EXCEEDED, V_CPU_CORES, RESERVATION_REQUEST_TIMEOUT, RM_INITIAL_MEM, SCAN_NODE_CODEGEN_THRESHOLD, MAX_IO_BUFFERS, DISABLE_CACHED_READS.

refresh_after_connect Impala Shell Option Removed in CDH 6.0

The deprecated refresh_after_connect option was removed from Impala Shell in CDH 6.0.
