Understanding HDFS Recovery Processes (Part 2)

1510阅读 0评论2019-07-07 speckle
分类:HADOOP

本文转自:https://blog.cloudera.com/blog/2015/03/understanding-hdfs-recovery-processes-part-2/

Understanding HDFS Recovery Processes (Part 2)

March 11, 2015 By Yongjun Zhang 3 Comments
Categories: Hadoop HDFS

Having a good grasp of HDFS recovery processes is important when running or moving toward production-ready Apache Hadoop. In the conclusion to this two-part post, pipeline recovery is explained.

An important design requirement of HDFS is to ensure continuous and correct operations that support production deployments. For that reason, it’s important for operators to understand how HDFS recovery processes work. In Part 1 of this post, we looked at lease recovery and block recovery. Now, in Part 2, we explore pipeline recovery.

All three recovery processes are essential for HDFS fault tolerance. Together, they help to ensure that writes are durable and consistent in HDFS, even in the presence of network and node failures.

Recap

In HDFS, files are divided into blocks, and file access follows multi-reader, single-writer semantics. To meet the fault-tolerance requirement, multiple replicas of a block are stored on different DataNodes. The number of replicas is called the replication factor. When a new file block is created, or an existing file is opened for append, the HDFS write operation creates a pipeline of DataNodes to receive and store the replicas (the replication factor generally determines the number of DataNodes in the pipeline). Subsequent writes to that block go through the pipeline (Figure 1).

Figure 1. HDFS Write Pipeline

For read operations the client chooses one of the DataNodes holding copies of the block and requests a data transfer from it.

For a deeper dive into this background information, read Part 1 of this post.

Pipeline Recovery

The Write Pipeline

When an HDFS client writes to file, the data is written as sequential blocks. To write or construct a block, HDFS breaks the block into packets (not actually network packets but rather messages; the term packets refers to the class which embodies these messages), and propagates them to the DataNodes in the write pipeline, as shown in Figure 2.

Figure 2. HDFS Write Pipeline Stages

There are three stages of a write pipeline:

  1. Pipeline setup. The client sends a Write_Block request along the pipeline and the last DataNode sends an acknowledgement back. After receiving the acknowledgement, the pipeline is ready for writing.
  2. Data streaming. The data is sent through the pipeline in packets. The client buffers the data until a packet is filled up, and then sends the packet to the pipeline. If the client calls hflush(), then even if a packet is not full, it will nevertheless be sent to the pipeline and the next packet will not be sent until the acknowledgement of the previous hflush’ed packet is received by the client.
  3. Close (finalize the replica and shutdown the pipeline). The client waits until all packets have been acknowledged and then sends a close request. All DataNodes in the pipeline change the corresponding replica into the FINALIZED state and report back to the NameNode. The NameNode then changes the block’s state to COMPLETE if at least the configured minimum replication number of DataNodes reported a FINALIZED state of their corresponding replicas.
Pipeline Recovery

Pipeline recovery is initiated when one or more DataNodes in the pipeline encounter an error in any of the three stages while a block is being written.

Recovery from Pipeline Setup Failure

  1. If the pipeline was created for a new block, the client abandons the block and asks the NameNode for a new block and a new list of DataNodes. The pipeline is reinitialized for the new block.
  2. If the pipeline was created to append to a block, the client rebuilds the pipeline with the remaining DataNodes and increments the block’s generation stamp.

Recovery from Data Streaming Failure

  1. When a DataNode in the pipeline detects an error (for example, a checksum error or a failure to write to disk), that DataNode takes itself out of the pipeline by closing up all TCP/IP connections. If the data is deemed not corrupted, it also writes buffered data to the relevant block and checksum (METADATA) files.
  2. When the client detects the failure, it stops sending data to the pipeline, and reconstructs a new pipeline using the remaining good DataNodes. As a result, all replicas of the block are bumped up to a new GS.
  3. The client resumes sending data packets with this new GS. If the data sent has already been received by some of the DataNodes, they just ignore the packet and pass it downstream in the pipeline.

Recovery from Close Failure

  1. When the client detects a failure in the close state, it rebuilds the pipeline with the remaining DataNodes. Each DataNode bumps up the block’s GS and finalizes the replica if it’s not finalized yet.

When one DataNode is bad, it removes itself from the pipeline. During the pipeline recovery process, the client may need to rebuild a new pipeline with the remaining DataNodes. (It may or may not replace bad DataNodes with new DataNodes, depending on the DataNode replacement policy described in the next section.) The replication monitor will take care of replicating the block to satisfy the configured replication factor.

DataNode Replacement Policy upon Failure

There are four configurable policies regarding whether to add additional DataNodes to replace the bad ones when setting up a pipeline for recovery with the remaining DataNodes:

  1. DISABLE: Disables DataNode replacement and throws an error (at the server); this acts like NEVER at the client.
  2. NEVER: Never replace a DataNode when a pipeline fails (generally not a desirable action).
  3. DEFAULT:  Replace based on the following conditions:
    1. Let r be the configured replication number.
    2. Let n be the number of existing replica datanodes.
    3. Add a new DataNode only if r >= 3 and EITHER
      •   floor(r/2) >= n; OR
      •   r > n and the block is hflushed/appended.
  4. ALWAYS: Always add a new DataNode when an existing DataNode failed. This fails if a DataNode can’t be replaced.

To disable using any of these policies, you can set the following configuration property to false (the default is true):

When enabled, the default policy is DEFAULT. The following config property changes the policy:

When using DEFAULT or ALWAYS, if only one DataNode succeeds in the pipeline, the recovery will never succeed and client will not be able to perform the write. This problem is addressed with this configuration property:

which defaults to false. With the default setting, the client will keep trying until the specified policy is satisfied. When this property is set to true, even if the specified policy can’t be satisfied (for example, there is only one DataNode that succeeds in the pipeline, which is less than the policy requirement), the client will still be allowed to continue to write.

Some Solved Issues

Known Open Issues

Conclusion

Lease recovery, block recovery, and pipeline recovery are all essential for HDFS fault tolerance. Together, they insure that writes are durable and consistent in HDFS, even in the presence of network and node failures.

Hopefully, after reading these posts, you have a better understanding of when and why these processes are invoked, and what they do. If you are interested in learning more, you can read through some of the links including the design specification, JIRAs referenced here, or the relevant code.

Yongjun Zhang is a Software Engineer at Cloudera, and a Hadoop committer.

上一篇:Understanding HDFS Recovery Processes (Part 1)
下一篇:没有了