Cloudera Enterprise 6.0 Beta

Deploy Erasure Coding

Before You Begin

  Warning: HDFS Erasure Coding is an experimental feature that is not supported.

Before you can deploy Erasure Coding (EC), perform the following tasks:

  • Verify that the clusters run CDH 6.0 or higher.
  • Determine which EC policy you want to use.
  • Determine whether you want to use EC for existing data or new data.

Understanding Erasure Coding Policies

The EC policy determines how data is encoded and decoded. An EC policy name consists of the following parts, in this order: codec-number of data blocks-number of parity blocks-cell size.

  • Codec: The erasure codec that the policy uses. It can be XOR or Reed-Solomon (RS).
  • Number of Data Blocks: The number of data blocks per stripe. The higher this number, the more nodes must be read when accessing data.
  • Number of Parity Blocks: The number of parity blocks per stripe.
  • Cell Size: The size of one basic unit of striped data.

For example, an RS-10-4-1024k policy has the following attributes:

  • Codec: Reed-Solomon
  • Number of Data Blocks: 10
  • Number of Parity Blocks: 4
  • Cell Size: 1024k
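The naming scheme above can be sketched as a small parser. This is a minimal illustration, not part of any Cloudera or Hadoop tooling: the `ECPolicy` class and `parse_policy` function are hypothetical names, and the cell size is kept as text rather than converted to bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ECPolicy:
    codec: str          # erasure codec, for example "RS" or "XOR"
    data_blocks: int    # number of data blocks per stripe
    parity_blocks: int  # number of parity blocks per stripe
    cell_size: str      # kept as text, for example "1024k"


def parse_policy(name: str) -> ECPolicy:
    """Split a name of the form codec-data-parity-cellsize into its parts."""
    # Split from the right so that a codec name containing a hyphen stays whole.
    parts = name.rsplit("-", 3)
    if len(parts) != 4:
        raise ValueError(f"not a codec-data-parity-cellsize name: {name!r}")
    codec, data, parity, cell = parts
    return ECPolicy(codec, int(data), int(parity), cell)


policy = parse_policy("RS-10-4-1024k")
print(policy.codec, policy.data_blocks, policy.parity_blocks, policy.cell_size)
```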

The sum of the number of data blocks and parity blocks is the data stripe width. For the data to survive rack failures, the number of racks should be at least equal to the data stripe width. Ideally, the number of racks exceeds the data stripe width to account for downtime and outages. If there are fewer racks than the data stripe width, HDFS spreads data across multiple nodes to maintain fault tolerance at the node level. To achieve node-level fault tolerance, the number of nodes must be at least equal to the data stripe width.

For example, for an RS-10-4-1024k policy to be rack-failure tolerant, you must have at least 14 racks: 10 racks for the data blocks and 4 racks for the parity blocks. This comes with a tradeoff: data must be read from 10 blocks, which increases the read time. Therefore, the larger the cluster and the colder the data, the more appropriate it is to use EC policies with large data stripe widths.
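The sizing rule above reduces to simple arithmetic. The following sketch only restates that rule; the function names and the "insufficient" label are assumptions made for illustration, and the rack and node counts in the usage lines are invented.

```python
def stripe_width(data_blocks: int, parity_blocks: int) -> int:
    """Data stripe width: data blocks plus parity blocks."""
    return data_blocks + parity_blocks


def fault_tolerance(data_blocks: int, parity_blocks: int,
                    racks: int, nodes: int) -> str:
    """Return the tolerance level the cluster layout can support for a policy."""
    width = stripe_width(data_blocks, parity_blocks)
    if racks >= width:
        return "rack"          # one block of a stripe per rack is possible
    if nodes >= width:
        return "node"          # too few racks, but one block per node is possible
    return "insufficient"      # hypothetical label: fewer nodes than the stripe width


# RS-10-4: 10 data blocks + 4 parity blocks
print(stripe_width(10, 4))                        # 14
print(fault_tolerance(10, 4, racks=14, nodes=28))  # rack
print(fault_tolerance(10, 4, racks=5, nodes=20))   # node
```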

Enabling Erasure Coding

Enable EC using the Cloudera Manager Admin Console:

  1. Select Clusters and choose the HDFS cluster you want to enable EC for.
  2. Navigate to the Configuration tab and select the Erasure Coding category.
  3. Configure the EC properties:
    • DataNode Striped Read Timeout
    • DataNode Striped Read Threads
    • Erasure Coding Reconstruction Weight
    • Fallback Erasure Coding Policy: Select the EC policy that HDFS uses if no policy is specified when the CLI command to apply EC to a directory is run.
    • Erasure Coding Enabled: Select this option to enable EC.
  4. Run the following command:
    hdfs ec -setPolicy -path directory [-policy policyName]
    • path. Required. Specify the HDFS directory you want to apply the EC policy to.
    • policy. Optional. The EC policy you want to use for the directory you specified. If you do not provide this parameter, the EC policy you specified in step 3 for the Fallback Erasure Coding Policy is used.
    This command applies the EC policy to data written after the command is run. It does not apply EC policies to existing data. See Using Erasure Coding for Existing Data.
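As an illustration of step 4, the command line can be assembled in a script. This is a sketch under assumptions: the helper name `set_policy_command` and the directory `/data/cold` are hypothetical, and only the `-path` and `-policy` parameters described above are used.

```python
from typing import List, Optional


def set_policy_command(directory: str, policy: Optional[str] = None) -> List[str]:
    """Build the argument list for 'hdfs ec -setPolicy'."""
    cmd = ["hdfs", "ec", "-setPolicy", "-path", directory]
    if policy is not None:
        # Without -policy, the Fallback Erasure Coding Policy from step 3 applies.
        cmd += ["-policy", policy]
    return cmd


# Explicit policy for a hypothetical directory
print(" ".join(set_policy_command("/data/cold", "RS-10-4-1024k")))
# No policy given, so the fallback policy is used
print(" ".join(set_policy_command("/data/cold")))
```

On a host with the HDFS client configured, the returned list could be passed to `subprocess.run(cmd, check=True)` to execute the command.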

Using Erasure Coding for Existing Data

To use EC with existing data, copy that data into a directory that has EC enabled. To copy the data, you can use distcp or Cloudera Manager's Backup and Disaster Recovery (BDR).
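One possible sequence for the distcp route is sketched below: create the target directory, apply an EC policy to it, then copy the existing data into it. The function name `copy_plan` and both paths are hypothetical, and distcp options (for example, checksum handling or preserving attributes) are deliberately left out, so check the distcp documentation before running a real copy.

```python
from typing import List


def copy_plan(source: str, ec_dir: str, policy: str) -> List[List[str]]:
    """Return the commands, in order, for moving existing data under an EC policy."""
    return [
        # 1. Create the destination directory.
        ["hdfs", "dfs", "-mkdir", "-p", ec_dir],
        # 2. Apply the EC policy before any data is written to it.
        ["hdfs", "ec", "-setPolicy", "-path", ec_dir, "-policy", policy],
        # 3. Copy the existing data; files written here are erasure coded.
        ["hadoop", "distcp", source, ec_dir],
    ]


for command in copy_plan("/data/archive", "/data/archive_ec", "RS-10-4-1024k"):
    print(" ".join(command))
```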

Using Erasure Coding for New Data

To use EC with new data, set the destination for the data to a directory with EC enabled. No further action is required. Data written to the directory is erasure coded according to the policy you set.

Page generated March 7, 2018.