Cloudera Enterprise 6.0 Beta

Data Durability

Overview

  Warning: HDFS Erasure Coding is an experimental feature that is not supported.

CDH provides two options for data durability when data is stored in HDFS: replication and Erasure Coding (EC). By default, HDFS creates two replicas of each block, resulting in three copies of the data in total. These copies are stored on separate DataNodes to guard against data loss. EC is an alternative to this default replication scheme.

When an HDFS cluster uses EC, no additional copies of the data are generated. Instead, data is striped into blocks and encoded to generate parity blocks. If data is missing or corrupt, the remaining data and the parity blocks are used to reconstruct it. This process provides a level of data durability similar to replication, but at a lower storage cost.
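
The reconstruction idea can be illustrated with a deliberately simplified sketch. It uses a single XOR parity cell rather than the Reed-Solomon codes used by policies such as RS(6,3), so it can repair only one missing cell. The function names `xor_parity` and `reconstruct` are hypothetical and are not part of any HDFS API.

```python
from functools import reduce

def xor_parity(cells):
    """Return the byte-wise XOR of equally sized cells."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*cells))

def reconstruct(cells, parity):
    """Rebuild the single missing cell (marked None) from the survivors and the parity cell."""
    missing = [i for i, cell in enumerate(cells) if cell is None]
    if len(missing) != 1:
        raise ValueError("single-parity XOR can repair exactly one missing cell")
    survivors = [cell for cell in cells if cell is not None]
    repaired = list(cells)
    # XOR of all surviving cells and the parity cell yields the lost cell.
    repaired[missing[0]] = xor_parity(survivors + [parity])
    return repaired

# Three data cells striped across three (hypothetical) DataNodes.
data = [b"HDFS", b"data", b"cell"]
parity = xor_parity(data)

# Simulate losing the second cell, then repair it.
damaged = [data[0], None, data[2]]
restored = reconstruct(damaged, parity)
```

Storage here is four cells for three cells of data, compared with nine cells under 3x replication. Reed-Solomon generalizes the same idea so that several cells can be lost at once.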

EC can be the only data protection policy in effect, or it can be used in conjunction with data replication in a hybrid deployment. Setting an EC policy on a directory does not convert data that is already stored there. To protect data with EC, you must either copy existing data into a directory that has an EC policy set or write new data to such a directory.

    Comparing Replication and Erasure Coding

    Consider the following factors when you examine which data protection scheme to use:

    Data Temperature
    Data temperature refers to how often data is accessed. EC works best with cool data, which is accessed and modified infrequently. Replication is more suitable for hot data, which is accessed and modified frequently.
    I/O Cost
    EC has higher I/O costs than replication for the following reasons:
    • EC spreads data across nodes and racks, which means reading and writing data comes at a higher cost.
    • Parity blocks are computed when data is written, which reduces write speed.
    • If data is missing or corrupt, a DataNode must read the remaining data and parity blocks to reconstruct the data. This process requires CPU and network resources.

    Cloudera recommends at least a 10 Gb/s network connection if you want to use EC.

    Storage Cost
    EC has a lower storage overhead than replication because multiple copies of data are not maintained. Instead, a number of parity blocks are generated based on the EC policy. Compared to standard 3x replication, EC can reduce storage costs by up to 50%.
    File Size
    Erasure coding works best with larger files. For example, a 10 GB file (10 x 1 GB) written with RS(10,4) results in one group of 14 x 1 GB EC blocks: 10 data blocks plus 4 parity blocks.
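
The storage and file-size trade-offs above can be checked with a small calculation. This is a rough sketch under stated assumptions, not a description of HDFS internals: it assumes a striped layout with 1 MiB cells, and it assumes that in a partially filled stripe each parity cell is as long as the largest data cell. The function names are hypothetical, and the model counts bytes only, not NameNode block objects.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

def replication_raw_bytes(file_size, replicas=3):
    """Raw bytes stored when every byte is kept `replicas` times."""
    return file_size * replicas

def ec_raw_bytes(file_size, data_units, parity_units, cell_size=MIB):
    """Raw bytes stored under an assumed RS(data_units, parity_units) striped layout."""
    stripe_bytes = data_units * cell_size
    full_stripes, remainder = divmod(file_size, stripe_bytes)
    # Each full stripe adds parity_units parity cells of cell_size bytes.
    parity = full_stripes * parity_units * cell_size
    if remainder:
        # Assumption: parity cells in the last stripe match its largest data cell.
        parity += parity_units * min(cell_size, remainder)
    return file_size + parity

# The 10 GB example with RS(10,4): 10 GiB of data plus 4 GiB of parity.
large_ec = ec_raw_bytes(10 * GIB, 10, 4)          # 14 GiB
large_rep = replication_raw_bytes(10 * GIB)       # 30 GiB

# RS(6,3) on a large file needs half the raw storage of 3x replication.
rs63_ec = ec_raw_bytes(6 * GIB, 6, 3)             # 9 GiB
rs63_rep = replication_raw_bytes(6 * GIB)         # 18 GiB

# A 1 KiB file under RS(6,3) still needs three 1 KiB parity cells.
small_ec = ec_raw_bytes(1024, 6, 3)               # 4096 bytes
small_rep = replication_raw_bytes(1024)           # 3072 bytes
```

Under these assumptions the saving appears only once a file fills whole stripes. For a file smaller than one cell, EC stores more raw bytes than 3x replication, which is one reason EC favors larger files.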

Page generated March 7, 2018.