This section explains how to replace nodes that have reached retirement or end of life by performing a two-stage transparent platform refresh on clusters that run Qumulo Core 6.1.0.3 (and higher).

How Transparent Platform Refresh Works

Transparent platform refresh comprises two stages. For help with your node replacement plan, contact the Qumulo Care team.

Stage 1: Register a Node Replacement Plan

In this stage, you register a node replacement plan with your cluster. The plan includes information about the nodes to replace and the data protection configuration.

In the following example, we use a four-node cluster and:

  • Replace nodes 1-4 with five new nodes in a single step

  • Change the data protection configuration to the 8.6 stripe configuration with 1-node fault tolerance

Stage 2: Execute the Node Replacement Plan Steps

In this stage, you execute the node replacement plan’s steps and Qumulo Core performs data protection reconfiguration.

There are two node replacement plan types:

  • Single-Step Node Replacement: Qumulo Core adds all new nodes and removes all nodes marked for replacement in a single step. Use this approach when replacement speed is the priority.

  • Multi-Step Node Replacement: Each step of the plan adds some of the new nodes and removes some of the nodes marked for replacement. Use this approach when rack space or switch port capacity in your data center is limited.

Cluster Properties During Node Replacement

  • When a replacement step begins, Qumulo Core distributes floating IP addresses among the nodes in the combined cluster. After Qumulo Core removes nodes marked for replacement, it redistributes any client connections that use floating IP addresses among the nodes that remain in the cluster.

  • While a node replacement step is in progress, both new nodes and nodes marked for replacement appear on the Cluster page of the Qumulo Core Web UI, and clients can connect to any node in the combined cluster.

  • When a node replacement step is complete, the reassignment of static IP addresses differs between versions of Qumulo Core:

    • In Qumulo Core 6.3.0.1 (and higher), the static IP addresses assigned to nodes remain unchanged and Qumulo Core removes only the static IP addresses for nodes removed from the cluster.

    • In Qumulo Core versions lower than 6.3.0.1, Qumulo Core reassigns static IP addresses to different nodes. To view the reassigned IP addresses in the Qumulo Core Web UI, click Cluster > Network Configuration.

  • When Qumulo Core adds nodes to a cluster, it assigns node IDs sequentially, without reusing or changing IDs.

    For example, if you have a four-node cluster with node IDs 1-4, and you replace node IDs 2 and 3 with two new nodes, after node replacement the cluster contains node IDs 1, 4, 5, and 6. If you add another node, it has the ID 7.

  • A cluster’s usable capacity doesn’t increase until both of the following conditions are true:

    • Any data protection reconfiguration is complete

    • The last step of the node replacement plan is in progress

    For example, if you replace nodes in a single step without data protection reconfiguration, usable capacity increases as soon as Qumulo Core begins the step.
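
The node ID rule described above can be modeled in a few lines. This is a minimal planning sketch, not Qumulo code: the function name ids_after_replacement and its parameters are hypothetical, and it assumes only what the text states, namely that new nodes receive the next sequential IDs and that IDs are never reused or changed.

```python
def ids_after_replacement(current_ids, replaced_ids, new_node_count,
                          highest_assigned_id=None):
    """Return the node IDs present in the cluster after a replacement step.

    highest_assigned_id is the highest ID the cluster has ever assigned.
    If omitted, the sketch assumes it is the highest ID currently present.
    """
    if highest_assigned_id is None:
        highest_assigned_id = max(current_ids)
    removed = set(replaced_ids)
    # Nodes that stay in the cluster keep their IDs unchanged.
    remaining = [node_id for node_id in current_ids if node_id not in removed]
    # New nodes receive the next unused IDs in sequence.
    first_new = highest_assigned_id + 1
    new_ids = list(range(first_new, first_new + new_node_count))
    return remaining + new_ids


# Replace node IDs 2 and 3 in a four-node cluster with two new nodes.
after_replacement = ids_after_replacement([1, 2, 3, 4], [2, 3], 2)
print(after_replacement)  # [1, 4, 5, 6]

# Add one more node without replacing any.
print(ids_after_replacement(after_replacement, [], 1))  # [1, 4, 5, 6, 7]
```

The optional highest_assigned_id argument matters when the node with the highest ID is among the nodes being removed, because its ID must still not be reused.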

Prerequisites

Ensure that the number of static and floating IP addresses is equal to or greater than the number of nodes in the combined cluster.
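
As a quick arithmetic check of this prerequisite, the following sketch compares address counts against the size of the combined cluster. The function name has_enough_ips is hypothetical, and the sketch assumes that the requirement applies to static and floating IP addresses separately; confirm the exact requirement with the Qumulo Care team if your address pool is tight.

```python
def has_enough_ips(existing_nodes, new_nodes, static_ips, floating_ips):
    """Check the IP address prerequisite for the combined cluster.

    Assumption: each address type (static and floating) must provide at
    least one address per node in the combined cluster.
    """
    combined_nodes = existing_nodes + new_nodes
    return static_ips >= combined_nodes and floating_ips >= combined_nodes


# A four-node cluster that gains five new nodes has nine nodes
# while the replacement step is in progress.
print(has_enough_ips(existing_nodes=4, new_nodes=5, static_ips=9, floating_ips=9))  # True
print(has_enough_ips(existing_nodes=4, new_nodes=5, static_ips=8, floating_ips=9))  # False
```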

Step 1: Register a Node Replacement Plan by Using the qq CLI

  1. Run the qq replace_nodes register_plan command. Use the --nodes-to-be-replaced flag to specify the nodes to replace and the --target-stripe-config flag to specify the stripe configuration. For example:

    qq replace_nodes register_plan \
      --nodes-to-be-replaced 1 2 3 4 \
      --target-stripe-config 8 6
    

    Qumulo Core stores the node replacement plan on your cluster.

  2. Rack and wire your new nodes and then power them on.

  3. To determine the UUIDs of the nodes to add to your cluster, run the qq unconfigured_nodes_list command.

  4. Write down the UUIDs of the nodes that you want to add to the cluster, in the order that you want to add them.
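
Because the order and spelling of the UUIDs matter in the next step, it can help to check your list before you use it. The following is a minimal sketch that uses only the Python standard library. The helper name ordered_unique_uuids is hypothetical, and the sketch assumes that the qq unconfigured_nodes_list command reports UUIDs in the standard hyphenated hexadecimal form.

```python
import uuid


def ordered_unique_uuids(raw_uuids):
    """Validate node UUIDs and return them in the order given.

    Raises ValueError for a malformed or duplicated UUID.
    """
    seen = set()
    result = []
    for raw in raw_uuids:
        # uuid.UUID raises ValueError if the string is not a valid UUID.
        normalized = str(uuid.UUID(raw.strip()))
        if normalized in seen:
            raise ValueError(f"duplicate node UUID: {normalized}")
        seen.add(normalized)
        result.append(normalized)
    return result


# The placeholder value below is for illustration only.
print(ordered_unique_uuids(["12345a6b-7c89-0d12-3456-78fe9012f345"]))
```

A string that contains characters outside 0-9 and a-f fails this check, which catches most transcription mistakes before you run the replacement command.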

Step 2: Execute the Node Replacement Plan Steps by Using the qq CLI

  1. To initiate each step, run the qq replace_nodes add_nodes_and_replace command. Use the --nodes-being-replaced flag to specify the nodes to replace and the --node-uuids flag to specify the nodes to add during the current step.

    If your plan includes data protection reconfiguration, use the --reconfigure-data-protection and --target-max-node-failures flags to initiate the reconfiguration during the current step. For example:

    qq replace_nodes add_nodes_and_replace \
      --nodes-being-replaced 1 2 3 4 \
      --node-uuids 12345a6b-7c89-0d12-3456-78fe9012f345 abcde1f2-g3hi-j4kl-mnop-qr56stuv7wxy \
      --reconfigure-data-protection \
      --target-max-node-failures 1
    

    The following is example output.

    Current cluster:
        Usable capacity: 200 TB
        Node fault tolerance level: 1 node
    With the selected node replacement step:
        Usable capacity: 220 TB
        Node fault tolerance level: 1 node
    
  2. To confirm the selected node replacement and data protection reconfiguration operations, enter yes.

    For more information, see Monitoring the Data Protection Reconfiguration Process.

  3. Wait for the node replacement step to complete.

    After each node replacement step, Qumulo Core begins to migrate data from existing nodes in the background.

  4. Unrack the removed nodes from your data center.

  5. Initiate the next node replacement step.
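
For multi-step plans, you repeat the same command with different arguments for each step. The following sketch assembles the command line for one step so that you can review it before you run it. It only builds a string and doesn't contact the cluster. The function name add_nodes_and_replace_command is hypothetical; the subcommand and flags are the ones shown in the example above, and the sketch assumes that the two reconfiguration flags are always passed together.

```python
import shlex


def add_nodes_and_replace_command(nodes_being_replaced, node_uuids,
                                  target_max_node_failures=None):
    """Build the qq command line for one node replacement step.

    Pass target_max_node_failures only for a step that also initiates
    data protection reconfiguration.
    """
    argv = ["qq", "replace_nodes", "add_nodes_and_replace",
            "--nodes-being-replaced", *(str(n) for n in nodes_being_replaced),
            "--node-uuids", *node_uuids]
    if target_max_node_failures is not None:
        argv += ["--reconfigure-data-protection",
                 "--target-max-node-failures", str(target_max_node_failures)]
    # shlex.join quotes any argument that needs quoting in a POSIX shell.
    return shlex.join(argv)


# The UUID below is a placeholder for illustration only.
print(add_nodes_and_replace_command(
    nodes_being_replaced=[1, 2],
    node_uuids=["12345a6b-7c89-0d12-3456-78fe9012f345"],
    target_max_node_failures=1,
))
```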

Viewing, Editing, and Canceling the Node Replacement Plan

  • To view the current node replacement plan, run the qq replace_nodes command with the get_plan subcommand.

    If a node replacement step is in progress, the command lists the nodes that are being replaced during the current step.

  • To edit the node replacement plan after you register it with your cluster, run the qq replace_nodes command with the register_plan subcommand and a new node replacement plan.

  • To cancel the current node replacement plan, run the qq replace_nodes command with the cancel_plan subcommand.

Monitoring the Data Protection Reconfiguration Process

To view the progress of the three stages of the data protection reconfiguration process, log in to the Qumulo Core Web UI and click Cluster.

  1. Qumulo Core begins to move data to new nodes in the cluster and the Qumulo Core Web UI displays the message Rebalancing for data protection reconfiguration.

  2. Qumulo Core reencodes all data on your cluster and the Qumulo Core Web UI displays the message Reconfiguring data protection.

    When this stage is complete, your data is protected according to the cluster’s new configuration and the system begins to use the new drive and node fault tolerance levels.

  3. Qumulo Core adds new capacity to your cluster and the Qumulo Core Web UI displays the message Rebalancing.

    If you initiated the reconfiguration process as part of a node replacement step, the system migrates data from the existing nodes in the cluster.

Cluster Availability During the Reconfiguration Process

Your cluster remains available throughout the data protection reconfiguration process.

  • You can upgrade Qumulo Core.

  • Your cluster maintains the ability to recover from node and drive failure automatically.

    During the reconfiguration process, the drive and node fault tolerance levels each remain at the lower of the two levels that the existing and new configurations specify. For example, if your existing cluster has 2-node and 2-drive fault tolerance, and you initiate reconfiguration where the new configuration has 1-node and 3-drive fault tolerance, your cluster has 1-node and 2-drive fault tolerance during the reconfiguration process.
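
The fault tolerance rule during reconfiguration amounts to taking the minimum of each level independently. The following sketch reproduces the example above; the FaultTolerance type and the function name are hypothetical and exist only to illustrate the arithmetic.

```python
from typing import NamedTuple


class FaultTolerance(NamedTuple):
    nodes: int   # number of node failures the cluster can tolerate
    drives: int  # number of drive failures the cluster can tolerate


def tolerance_during_reconfiguration(existing, new):
    """Return the levels in effect while reconfiguration is in progress.

    Each level is the minimum of the existing and new configurations.
    """
    return FaultTolerance(
        nodes=min(existing.nodes, new.nodes),
        drives=min(existing.drives, new.drives),
    )


existing = FaultTolerance(nodes=2, drives=2)
new = FaultTolerance(nodes=1, drives=3)
print(tolerance_during_reconfiguration(existing, new))
# FaultTolerance(nodes=1, drives=2)
```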