Replacing Nodes in a Qumulo Cluster by Performing a Transparent Platform Refresh

This section explains how to replace nodes that have reached retirement or end of life by performing a two-stage transparent platform refresh on clusters that run Qumulo Core 6.1.0.3 (and higher).

Important

Qumulo Core doesn't support replacing nodes in clusters with more than 100 nodes.
The total capacity of the planned cluster configuration can't be less than the total capacity of the current cluster configuration.
In Qumulo Core 6.1.2.2 (and higher), you can use the qq CLI to replace nodes. To replace nodes on a lower version of Qumulo Core, contact the Qumulo Care team.
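
The first two limits above can be checked before you draft a plan. The following is a minimal sketch, not part of the qq CLI: the function name check_replacement_plan and its arguments are hypothetical, and it assumes that you supply the node count and the capacities (as whole terabytes) yourself.

```shell
# Hypothetical pre-flight check (not a qq command).
# Usage: check_replacement_plan NODE_COUNT CURRENT_TB PLANNED_TB
check_replacement_plan() {
  node_count=$1
  current_tb=$2
  planned_tb=$3

  # Node replacement isn't supported on clusters with more than 100 nodes.
  if [ "$node_count" -gt 100 ]; then
    echo "unsupported: cluster has more than 100 nodes"
    return 1
  fi

  # The planned configuration can't have less total capacity than the current one.
  if [ "$planned_tb" -lt "$current_tb" ]; then
    echo "unsupported: planned capacity is less than current capacity"
    return 1
  fi

  echo "ok"
}

# Example values only: a 4-node cluster that grows from 200 TB to 220 TB.
check_replacement_plan 4 200 220
```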

How Transparent Platform Refresh Works

Transparent platform refresh comprises two stages. For help with your node replacement plan, contact the Qumulo Care team.

Stage 1: Register a Node Replacement Plan

In this stage, you register a node replacement plan with your cluster. The plan includes information about the nodes to replace and the data protection configuration.

In the following example, we use a four-node cluster and:

Replace nodes 1-4 with five new nodes in a single step
Change the data protection configuration to the 8.6 stripe configuration with 1-node fault tolerance

Stage 2: Execute the Node Replacement Plan Steps

In this stage, you execute the node replacement plan’s steps and Qumulo Core performs data protection reconfiguration.

Note
It isn’t possible to add nodes or begin another node replacement step while a node replacement step is already in progress.

There are two node replacement plan types:

Single-Step Node Replacement: Qumulo Core adds all new nodes and removes all nodes marked for replacement in a single step. Use this approach when the node replacement speed is a priority.
Multi-Step Node Replacement: Each step of the plan adds some of the new nodes and removes some of the nodes marked for replacement. Use this approach when rack space or switch port capacity in your data center is limited.

Cluster Properties During Node Replacement

When a replacement step begins, Qumulo Core distributes floating IP addresses among the nodes in the combined cluster. After Qumulo Core removes nodes marked for replacement, it redistributes any client connections that use floating IP addresses among the nodes that remain in the cluster.
While a node replacement step is in progress, both new nodes and nodes marked for replacement appear on the Cluster page of the Qumulo Core Web UI, and clients can connect to any node in the combined cluster.
When a node replacement step is complete, the reassignment of static IP addresses differs between versions of Qumulo Core:
- In Qumulo Core 6.3.0.1 (and higher), the static IP addresses assigned to nodes remain unchanged and Qumulo Core removes only the static IP addresses for nodes removed from the cluster.
- In Qumulo Core versions lower than 6.3.0.1, Qumulo Core reassigns static IP addresses to different nodes. To view the reassigned IP addresses in the Qumulo Core Web UI, click Cluster > Network Configuration.
When Qumulo Core adds nodes to a cluster, it assigns node IDs sequentially, without reusing or changing IDs.

For example, if you have a four-node cluster with node IDs 1-4, and you replace node IDs 2 and 3 with two new nodes, after node replacement the cluster contains node IDs 1, 4, 5, and 6. If you add another node, it has the ID 7.
A cluster’s usable capacity doesn’t increase until both of the following conditions are met:
- Any data protection reconfiguration is complete
- The last step of the node replacement plan is in progress
For example, if you replace nodes in a single step without data protection reconfiguration, usable capacity increases as soon as Qumulo Core begins the step.
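
The node ID rule described above (sequential assignment, with no reuse of IDs) can be illustrated with a short sketch. The function ids_after_replacement is hypothetical and only models the rule. It assumes that the highest ID currently in the cluster is also the highest ID that the cluster has ever assigned.

```shell
# Hypothetical model of node ID assignment (not a qq command).
# Usage: ids_after_replacement "CURRENT_IDS" "REPLACED_IDS" NEW_NODE_COUNT
ids_after_replacement() {
  current_ids=$1
  replaced_ids=$2
  new_count=$3
  highest=0
  result=""

  for id in $current_ids; do
    # Track the highest ID so that new nodes continue the sequence.
    if [ "$id" -gt "$highest" ]; then
      highest=$id
    fi
    # Keep only the nodes that aren't marked for replacement.
    case " $replaced_ids " in
      *" $id "*) ;;
      *) result="$result $id" ;;
    esac
  done

  # New nodes receive the next unused IDs. Removed IDs are never reused.
  i=1
  while [ "$i" -le "$new_count" ]; do
    result="$result $((highest + i))"
    i=$((i + 1))
  done

  echo "${result# }"
}

ids_after_replacement "1 2 3 4" "2 3" 2   # prints: 1 4 5 6
ids_after_replacement "1 4 5 6" "" 1      # prints: 1 4 5 6 7
```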

Prerequisites

Ensure that the number of static and floating IP addresses is equal to or greater than the number of nodes in the combined cluster.
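
As a rough illustration of this prerequisite, the following sketch compares IP address counts against the size of the combined cluster (existing nodes plus the new nodes added in a step). The function check_ip_counts is hypothetical, and the sketch assumes that the static and floating address counts are each compared to the node count separately. Confirm the exact requirement for your network configuration with the Qumulo Care team.

```shell
# Hypothetical prerequisite check (not a qq command).
# Usage: check_ip_counts EXISTING_NODES NEW_NODES STATIC_IPS FLOATING_IPS
check_ip_counts() {
  combined=$(( $1 + $2 ))
  static_ips=$3
  floating_ips=$4

  # Assumption: each address pool must cover every node in the combined cluster.
  if [ "$static_ips" -lt "$combined" ]; then
    echo "need at least $combined static IP addresses, have $static_ips"
    return 1
  fi
  if [ "$floating_ips" -lt "$combined" ]; then
    echo "need at least $combined floating IP addresses, have $floating_ips"
    return 1
  fi

  echo "ok: $combined nodes in the combined cluster"
}

# Example values only: 4 existing nodes and 5 new nodes in a single step.
check_ip_counts 4 5 9 9
```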

Step 1: Register a Node Replacement Plan by Using the qq CLI

Run the qq replace_nodes register_plan command. Use the --nodes-to-be-replaced flag to specify the nodes to replace and the --target-stripe-config flag to specify the stripe configuration. For example:
```
qq replace_nodes register_plan \
  --nodes-to-be-replaced 1 2 3 4 \
  --target-stripe-config 8 6
```
Qumulo Core stores the node replacement plan on your cluster.
Note
- If your plan includes data protection reconfiguration, Qumulo Core records only the stripe configuration. You specify the node fault tolerance when you execute the plan steps.
- If your plan doesn't include data protection reconfiguration, you can omit the --target-stripe-config flag.
- To replace all nodes in the cluster, use the --replace-all flag instead of the --nodes-to-be-replaced flag.
Rack and wire your new nodes and then power them on.
To determine the UUIDs of the nodes to add to your cluster, run the qq unconfigured_nodes_list command.
Write down the UUIDs of the nodes that you want to add to the cluster, in the order that you want to add them.
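
The registration variants described in this step can be rehearsed with a dry run. The following sketch uses only the commands and flags named above. The QQ variable and the register_plan wrapper function are illustrative conventions, not part of the qq CLI, and the sketch assumes that the --replace-all flag can be used without the --target-stripe-config flag when the plan has no data protection reconfiguration.

```shell
# QQ defaults to "echo qq", so the script only prints each command.
# Set QQ=qq in the environment to run the commands against a cluster.
QQ="${QQ:-echo qq}"

# Thin wrapper (hypothetical helper) around the register_plan subcommand.
register_plan() {
  $QQ replace_nodes register_plan "$@"
}

# Replace nodes 1-4 and change the stripe configuration.
register_plan --nodes-to-be-replaced 1 2 3 4 --target-stripe-config 8 6

# Replace every node without data protection reconfiguration.
register_plan --replace-all

# List the unconfigured nodes to find the UUIDs of the new nodes.
$QQ unconfigured_nodes_list
```

In dry-run mode, each line of output is the exact command that would run, which makes it easy to review the plan before you register it.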

Step 2: Execute the Node Replacement Plan Steps by Using the qq CLI

To initiate each step, run the qq replace_nodes add_nodes_and_replace command. Use the --nodes-being-replaced flag to specify the nodes to replace and the --node-uuids flag to specify the nodes to add during the current step.

Important
Qumulo Core adds nodes to the cluster in the order in which you list their UUIDs after the --node-uuids flag. After you begin the node replacement step, it isn’t possible to revert the operation or to reorder the nodes that you added to the cluster.

If your plan includes data protection reconfiguration, use the --reconfigure-data-protection and --target-max-node-failures flags to initiate the reconfiguration during the current step. For example:
```
qq replace_nodes add_nodes_and_replace \
  --nodes-being-replaced 1 2 3 4 \
  --node-uuids 12345a6b-7c89-0d12-3456-78fe9012f345 abcde1f2-g3hi-j4kl-mnop-qr56stuv7wxy \
  --reconfigure-data-protection \
  --target-max-node-failures 1
```
The following is example output.
```
Current cluster:
    Usable capacity: 200 TB
    Node fault tolerance level: 1 node
With the selected node replacement step:
    Usable capacity: 220 TB
    Node fault tolerance level: 1 node
```
Note
To replace all nodes in the cluster, use the --replace-all flag instead of the --nodes-being-replaced flag.
To confirm the selected node replacement and data protection reconfiguration operations, enter yes.

For more information, see Monitoring the Data Protection Reconfiguration Process.
Wait for the node replacement step to complete.

After each node replacement step, Qumulo Core begins to migrate data from existing nodes in the background.

Note
This process can take days or weeks. When the data migration is complete, Qumulo Core removes the nodes marked for replacement from the cluster. These nodes no longer appear on the Cluster page of the Qumulo Core Web UI.
Unrack the removed nodes from your data center.
Initiate the next node replacement step.
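
Because the order of the UUIDs can't be changed after a step begins, it can help to print each command and review it before you run it. The following is a minimal sketch that uses only the flags described in this section. The QQ variable, the run_step function, and the UUID placeholders are hypothetical.

```shell
# QQ defaults to "echo qq", so the script only prints each command.
# Set QQ=qq in the environment to run the commands against a cluster.
QQ="${QQ:-echo qq}"

# Hypothetical helper (not a qq command).
# Usage: run_step "NODE_IDS" "UUIDS_IN_ORDER" [MAX_NODE_FAILURES]
# Pass MAX_NODE_FAILURES only when the step starts data protection reconfiguration.
run_step() {
  node_ids=$1
  uuids=$2
  max_failures=${3:-}

  # $node_ids and $uuids are left unquoted so that each ID becomes its own argument.
  if [ -n "$max_failures" ]; then
    $QQ replace_nodes add_nodes_and_replace \
      --nodes-being-replaced $node_ids \
      --node-uuids $uuids \
      --reconfigure-data-protection \
      --target-max-node-failures "$max_failures"
  else
    $QQ replace_nodes add_nodes_and_replace \
      --nodes-being-replaced $node_ids \
      --node-uuids $uuids
  fi
}

# Placeholders only: substitute the UUIDs from qq unconfigured_nodes_list,
# in the order in which the nodes should join the cluster.
run_step "1 2" "UUID_OF_FIRST_NODE UUID_OF_SECOND_NODE"
run_step "3 4" "UUID_OF_THIRD_NODE UUID_OF_FOURTH_NODE" 1
```

When QQ=qq, the command still asks for confirmation before the step begins, so the dry run complements that prompt rather than replacing it.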

Viewing, Editing, and Canceling the Node Replacement Plan

To view the current node replacement plan, run the qq replace_nodes command with the get_plan subcommand.

If a node replacement step is in progress, the command shows the list of nodes that are in the process of being replaced during the current step.
To edit the node replacement plan after you register it with your cluster, run the qq replace_nodes command with the register_plan subcommand and a new node replacement plan.
To cancel the current node replacement plan, run the qq replace_nodes command with the cancel_plan subcommand.

Important
Canceling a node replacement plan after executing one or more steps might make it impossible to reregister and complete the plan.
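
Because canceling a plan can be difficult to undo, a small guard around the cancel_plan subcommand can prevent accidental use in scripts. The following is a sketch under stated assumptions: the QQ variable and the show_plan and cancel_plan functions are hypothetical wrappers, and only the get_plan and cancel_plan subcommands come from this section.

```shell
# QQ defaults to "echo qq", so the script only prints each command.
# Set QQ=qq in the environment to run the commands against a cluster.
QQ="${QQ:-echo qq}"

# Show the registered plan (hypothetical wrapper).
show_plan() {
  $QQ replace_nodes get_plan
}

# Cancel the plan only when the caller passes the literal word "confirm".
cancel_plan() {
  if [ "${1:-}" != "confirm" ]; then
    echo "refusing to cancel: pass 'confirm' to proceed"
    return 1
  fi
  $QQ replace_nodes cancel_plan
}

show_plan
```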

Monitoring the Data Protection Reconfiguration Process

To view the progress of the three stages of the data protection reconfiguration process, log in to the Qumulo Core Web UI and click Cluster.

Qumulo Core begins to move data to new nodes in the cluster and the Qumulo Core Web UI displays the message Rebalancing for data protection reconfiguration.
Qumulo Core reencodes all data on your cluster and the Qumulo Core Web UI displays the message Reconfiguring data protection.

Note
In certain scenarios, this stage might appear to pause while the system performs preparatory work on the cluster.

When this stage is complete, your data is protected according to the cluster’s new configuration and the system begins to use the new drive and node fault tolerance levels.
Qumulo Core adds new capacity to your cluster and the Qumulo Core Web UI displays the message Rebalancing.

If you initiated the reconfiguration process as part of a node replacement step, the system migrates data from the existing nodes in the cluster.

Cluster Availability During the Reconfiguration Process

Your cluster remains available throughout the data protection reconfiguration process.

You can upgrade Qumulo Core.
Your cluster maintains the ability to recover from node and drive failure automatically.

During the reconfiguration process, the drive and node fault tolerance levels are each the lower of the levels that the existing and new configurations specify. For example, if your existing cluster has 2-node and 2-drive fault tolerance, and you initiate reconfiguration where the new configuration has 1-node and 3-drive fault tolerance, your cluster has 1-node and 2-drive fault tolerance during the reconfiguration process.
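
The rule in the preceding paragraph amounts to taking the lower value for each fault tolerance level. The following sketch reproduces the example. The function names are hypothetical and the code is only a model of the rule, not a qq command.

```shell
# Print the smaller of two integers.
min() {
  if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}

# Hypothetical model of the fault tolerance in effect during reconfiguration.
# Usage: tolerance_during_reconfig OLD_NODE OLD_DRIVE NEW_NODE NEW_DRIVE
tolerance_during_reconfig() {
  echo "$(min "$1" "$3")-node, $(min "$2" "$4")-drive"
}

# Existing: 2-node, 2-drive. New: 1-node, 3-drive.
tolerance_during_reconfig 2 2 1 3   # prints: 1-node, 2-drive
```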

Note

To avoid impact to front-end workloads, Qumulo Core slows down the reconfiguration process automatically.
When Qumulo Core finds missing nodes or drives, it pauses the reconfiguration process. When you replace the missing nodes or drives, or bring them back online, the reconfiguration process continues.
It isn't possible to add or replace nodes during the reconfiguration process.