Cloudera Enterprise 6.0 Beta

Spark Guide

Apache Spark is a general framework for distributed computing that offers high performance for both batch and interactive processing. It exposes APIs for Java, Python, and Scala and consists of Spark core and several related projects:

  • Spark SQL - Module for working with structured data. Allows you to seamlessly mix SQL queries with Spark programs.
  • Spark Streaming - API that allows you to build scalable fault-tolerant streaming applications.
  • MLlib - API that implements common machine learning algorithms.
  • GraphX - API for graphs and graph-parallel computation.

You can run Spark applications locally or distributed across a cluster, either by using an interactive shell or by submitting an application. Spark applications are commonly run interactively during the data-exploration phase and for ad hoc analysis.

To run applications distributed across a cluster, Spark requires a cluster manager. In CDH 6, Cloudera supports only the YARN cluster manager. When run on YARN, Spark application processes are managed by the YARN ResourceManager and NodeManager roles. Spark Standalone is no longer supported.
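
As a rough illustration of the local and YARN options described above, the following Python sketch assembles a spark-submit command line for either target. The --master and --deploy-mode options are standard spark-submit options. The helper name build_submit_command, the application name my_app.py, and the choice of local[*] for a local run are assumptions made for this example only. The sketch builds the command and does not run it.

```python
import shlex


def build_submit_command(app_path, on_yarn=True, deploy_mode="client"):
    """Return the argv list for a spark-submit invocation.

    Hypothetical helper for illustration; it only builds the command
    and never executes it.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")

    if on_yarn:
        # YARN is the only cluster manager supported in CDH 6.
        args = ["spark-submit", "--master", "yarn",
                "--deploy-mode", deploy_mode]
    else:
        # local[*] runs Spark in a single process using all available cores.
        args = ["spark-submit", "--master", "local[*]"]

    return args + [app_path]


if __name__ == "__main__":
    # my_app.py is a placeholder application name.
    print(shlex.join(build_submit_command("my_app.py", deploy_mode="cluster")))
    print(shlex.join(build_submit_command("my_app.py", on_yarn=False)))
```

Returning a list rather than a single string keeps each argument intact if the command is later passed to a process launcher, and shlex.join quotes values such as local[*] so that a shell does not expand them.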

  Note:

This page contains information related to Spark 2.x, which is included with CDH beginning with CDH 6. This information supersedes the documentation for the separately available parcel for Cloudera Distribution of Apache Spark 2.

Unsupported Features

The following Spark features are not supported:

  • Spark SQL:
    • Thrift JDBC/ODBC server
    • Spark SQL CLI
  • Spark Dataset API
  • SparkR
  • GraphX
  • Spark on Scala 2.11
  • Mesos cluster manager

Consult Spark 2 Known Issues for a comprehensive list of features that are not supported with the Cloudera Distribution of Apache Spark 2.

  Note:

This documentation refers to the Cloudera Distribution of Apache Spark 2.2 release 1. This component is generally available and is supported on CDH 5.7 through CDH 5.12.

A Hive compatibility issue in Cloudera Distribution of Apache Spark 2.0 release 1 affects CDH 5.10.1 and higher, CDH 5.9.2 and higher, CDH 5.8.5 and higher, and CDH 5.7.6 and higher. If you are using one of these CDH versions, you must upgrade to the Spark 2.0 release 2 or higher parcel to avoid Spark 2 job failures when using Hive functionality.
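
The version ranges in the preceding note can be checked mechanically. The following Python sketch is one possible reading of them, not an official compatibility test. It assumes that "5.10.1 and higher" also covers later minor lines such as 5.11 and 5.12, and that the other thresholds apply only within their own maintenance line. The function name is_affected is hypothetical.

```python
# Earliest affected maintenance release in each CDH 5 minor line,
# taken from the note above: 5.7.6, 5.8.5, 5.9.2, 5.10.1.
FIRST_AFFECTED_PATCH = {7: 6, 8: 5, 9: 2, 10: 1}


def is_affected(cdh_version):
    """Return True if a CDH 5 version falls in an affected range.

    Expects a "major.minor.patch" string such as "5.9.2".
    Assumption: minor lines after 5.10 are treated as affected, and
    minor lines before 5.7 as unaffected.
    """
    major, minor, patch = (int(part) for part in cdh_version.split("."))
    if major != 5:
        raise ValueError("this sketch only covers CDH 5.x versions")
    if minor > 10:
        return True
    first_patch = FIRST_AFFECTED_PATCH.get(minor)
    return first_patch is not None and patch >= first_patch


if __name__ == "__main__":
    for version in ("5.8.4", "5.8.5", "5.10.0", "5.12.0"):
        print(version, is_affected(version))
```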

Apache Spark is a general framework for distributed computing that offers high performance for both batch and interactive processing. It exposes APIs for Java, Python, and Scala.

For detailed API information, see the Apache Spark project site.
  Note: Although this document makes some references to the external Spark site, not all the features, components, and recommendations described there apply to Spark when used on CDH. Always check the Cloudera documentation before relying on any aspect of Spark that might not be supported or recommended by Cloudera. In particular, see Spark 2 Known Issues for components and features to avoid.

The Cloudera distribution of Apache Spark 2 consists of Spark core and several related projects:

  • Spark SQL - Module for working with structured data. Allows you to seamlessly mix SQL queries with Spark programs.
  • Spark Streaming - API that allows you to build scalable fault-tolerant streaming applications.
  • MLlib - API that implements common machine learning algorithms.

The Cloudera Enterprise product includes the Cloudera Distribution of Apache Spark 2.2. Spark 2.x support previously shipped as its own parcel, separate from CDH.

In CDH 6, the Spark 1.6 service does not exist. The Spark History Server uses port 18088, the same port that was used with Spark 1.6. This is a change from port 18089, which was formerly used for the Spark 2 parcel.
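
To show where the port number matters in practice, the following Python sketch builds Spark History Server URLs using port 18088. The host name hs.example.com and the helper name history_server_url are placeholders, and plain HTTP is assumed; a TLS-enabled deployment would need a different scheme. The /api/v1/applications path is part of Spark's monitoring REST API.

```python
from urllib.parse import urlunsplit

# Spark History Server port in CDH 6, per the text above.
SPARK_HISTORY_SERVER_PORT = 18088


def history_server_url(host, path="/"):
    """Build an HTTP URL for the Spark History Server on the given host.

    Hypothetical helper; assumes an unencrypted (http) endpoint.
    """
    netloc = f"{host}:{SPARK_HISTORY_SERVER_PORT}"
    return urlunsplit(("http", netloc, path, "", ""))


if __name__ == "__main__":
    # hs.example.com is a placeholder host name.
    print(history_server_url("hs.example.com"))
    print(history_server_url("hs.example.com", "/api/v1/applications"))
```

Scripts or bookmarks written against the Spark 2 parcel would need the same change from 18089 to 18088.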

Page generated March 7, 2018.