This section explains how to replace nodes that have reached retirement or end of life by performing a two-stage transparent platform refresh on clusters that run Qumulo Core 18.104.22.168 (and higher).
- Qumulo Core doesn't support replacing nodes in clusters with more than 100 nodes.
- In Qumulo Core 22.214.171.124 (and higher), you can use the
How Transparent Platform Refresh Works
Transparent platform refresh comprises two stages. For help with your node replacement plan, contact the Qumulo Care team.
Stage 1: Register a Node Replacement Plan
In this stage, you register a node replacement plan with your cluster. The plan includes information about the nodes to replace and the data protection configuration.
In the following example, we use a four-node cluster and:
Replace nodes 1-4 with five new nodes in a single step
Change the data protection configuration to the 8.6 stripe configuration with 1-node fault tolerance
Stage 2: Execute the Node Replacement Plan Steps
In this stage, you execute the node replacement plan’s steps and Qumulo Core performs data protection reconfiguration.
It isn’t possible to add nodes or begin another node replacement step while a node replacement step is already in progress.
There are two node replacement plan types:
Single-Step Node Replacement: Qumulo Core adds all new nodes and removes all nodes marked for replacement in a single step. Use this approach when the node replacement speed is a priority.
Multi-Step Node Replacement: Each step of the plan adds some of the new nodes and removes some of the nodes marked for replacement. Use this approach when rack space or switch port capacity in your data center is limited.
Cluster Properties During Node Replacement
When a replacement step begins, Qumulo Core distributes floating IP addresses among the nodes in the combined cluster. After Qumulo Core removes nodes marked for replacement, it redistributes any client connections that use floating IP addresses among the nodes that remain in the cluster.
While a node replacement step is in progress, both new nodes and nodes marked for replacement appear on the Cluster page of the Web UI, and clients can connect to any node in the combined cluster.
When a node replacement step is complete, the reassignment of static IP addresses differs between versions of Qumulo Core:
In Qumulo Core 126.96.36.199 (and higher), the static IP addresses assigned to nodes remain unchanged and Qumulo Core removes only the static IP addresses for nodes removed from the cluster.
In Qumulo Core versions lower than 188.8.131.52, Qumulo Core reassigns static IP addresses to different nodes. To view the reassigned IP addresses in the Web UI, click Cluster > Network Configuration.
When Qumulo Core adds nodes to a cluster, it assigns node IDs sequentially, without reusing or changing IDs.
For example, if you have a four-node cluster with node IDs 1-4, and you replace node IDs 2 and 3 with two new nodes, after node replacement the cluster contains node IDs 1, 4, 5, and 6. If you add another node, it has the ID 7.
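The ID assignment rule above can be sketched in a few lines of Python. This is only an illustrative model of the behavior described in the text, not Qumulo code; the function name replace_nodes and its parameters are hypothetical.

```python
def replace_nodes(current_ids, replaced_ids, new_node_count, next_id):
    """Model one replacement: return (remaining_ids, next_unused_id).

    New nodes receive the next sequential IDs, and the IDs of removed
    nodes are never reused (hypothetical helper for illustration).
    """
    removed = set(replaced_ids)
    new_ids = list(range(next_id, next_id + new_node_count))
    remaining = [i for i in current_ids if i not in removed] + new_ids
    return remaining, next_id + new_node_count


# Four-node cluster with IDs 1-4; the next unused ID is 5.
ids, next_id = replace_nodes([1, 2, 3, 4], replaced_ids=[2, 3],
                             new_node_count=2, next_id=5)
print(ids)  # [1, 4, 5, 6]

# Adding one more node without replacing anything yields ID 7.
ids, next_id = replace_nodes(ids, replaced_ids=[], new_node_count=1,
                             next_id=next_id)
print(ids)  # [1, 4, 5, 6, 7]
```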
A cluster’s usable capacity doesn’t increase until:
Any data protection reconfiguration is complete
The last step of the node replacement plan is in progress
For example, if you replace nodes in a single step without data protection reconfiguration, usable capacity increases as soon as Qumulo Core begins the step.
Ensure that the number of static and floating IP addresses is equal to or greater than the number of nodes in the combined cluster.
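Before you begin a step, a quick arithmetic check like the following can confirm the IP address requirement. This is a minimal sketch: the function name is hypothetical, and it assumes that the requirement applies to the static pool and the floating pool separately, which is one reading of the sentence above.

```python
def has_enough_ips(existing_nodes, new_nodes, static_ips, floating_ips):
    """Return True if each IP pool covers every node in the combined cluster.

    Assumption: the static and floating pools are checked independently.
    """
    combined = existing_nodes + new_nodes
    return static_ips >= combined and floating_ips >= combined


# Replacing 4 nodes with 5 new nodes means 9 nodes coexist during the step.
print(has_enough_ips(existing_nodes=4, new_nodes=5,
                     static_ips=9, floating_ips=8))  # False
```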
Step 1: Register a Node Replacement Plan by Using the qq CLI
Use the qq replace_nodes register_plan command and the --nodes-to-be-replaced flag to specify the nodes to replace and the --target-stripe-config flag to specify the stripe configuration. For example:
qq replace_nodes register_plan \
  --nodes-to-be-replaced 1 2 3 4 \
  --target-stripe-config 8 6
Qumulo Core stores the node replacement plan on your cluster.
Note
- If your plan includes data protection reconfiguration, Qumulo Core records only the stripe configuration. You specify the node fault tolerance when you execute the plan steps.
- If your plan doesn't include data protection reconfiguration, you can omit the
- To replace all nodes in the cluster, use the --replace-all flag instead of the
Rack and wire your new nodes and then power them on.
To determine the UUIDs of the nodes to add to your cluster, use the
Write down the UUIDs of the nodes that you want to add to the cluster, in the order that you want to add them.
Step 2: Execute the Node Replacement Plan Steps by Using the qq CLI
Use the qq replace_nodes add_nodes_and_replace command to initiate each step, the --nodes-being-replaced flag to specify the nodes to replace, and the --node-uuids flag to specify the nodes to add during the current step.
Important
Qumulo Core adds nodes to the cluster in the order in which you list their UUIDs after the --node-uuids flag. After you begin the node replacement step, it isn’t possible to revert the operation or to reorder the nodes that you added to the cluster.
If your plan includes data protection reconfiguration, use the --reconfigure-data-protection and --target-max-node-failures flags to initiate the reconfiguration during the current step. For example:
qq replace_nodes add_nodes_and_replace \
  --nodes-being-replaced 1 2 3 4 \
  --node-uuids 12345a6b-7c89-0d12-3456-78fe9012f345 abcde1f2-g3hi-j4kl-mnop-qr56stuv7wxy \
  --reconfigure-data-protection \
  --target-max-node-failures 1
The following is example output from the command:
Current cluster:
  Usable capacity: 200 TB
  Node fault tolerance level: 1 node
With the selected node replacement step:
  Usable capacity: 220 TB
  Node fault tolerance level: 1 node
Note
To replace all nodes in the cluster, use the --replace-all flag instead of the
To confirm the reconfiguration with the selected node-replace and data protection configuration operations, enter
For more information, see Monitoring the Data Protection Reconfiguration Process.
Wait for the node replacement step to complete.
After each node replacement step, Qumulo Core begins to migrate data from existing nodes in the background.
Note
This process can take days or weeks. When the data migration is complete, Qumulo Core removes the nodes marked for replacement from the cluster. These nodes no longer appear on the Cluster page of the Web UI.
Unrack the removed nodes from your data center.
Initiate the next node replacement step.
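Because the order of the UUIDs determines the order in which nodes join the cluster and can't be changed afterwards, it can help to assemble the command for each step programmatically and review it before running it. The following Python sketch uses only the command and flags shown in this section; the helper name build_replace_step is hypothetical, and the script only prints the command rather than executing it.

```python
import shlex


def build_replace_step(nodes_being_replaced, node_uuids,
                       target_max_node_failures=None):
    """Build the argument list for one node replacement step.

    UUIDs are emitted in exactly the order given, because that is the
    order in which the nodes are added to the cluster.
    """
    if len(set(node_uuids)) != len(node_uuids):
        raise ValueError("duplicate UUID in node_uuids")
    argv = ["qq", "replace_nodes", "add_nodes_and_replace",
            "--nodes-being-replaced"]
    argv += [str(n) for n in nodes_being_replaced]
    argv += ["--node-uuids", *node_uuids]
    # Only request reconfiguration when a fault tolerance target is given.
    if target_max_node_failures is not None:
        argv += ["--reconfigure-data-protection",
                 "--target-max-node-failures", str(target_max_node_failures)]
    return argv


argv = build_replace_step(
    nodes_being_replaced=[1, 2, 3, 4],
    node_uuids=["12345a6b-7c89-0d12-3456-78fe9012f345",
                "abcde1f2-g3hi-j4kl-mnop-qr56stuv7wxy"],
    target_max_node_failures=1,
)
# Print the command for review; nothing is executed here.
print(shlex.join(argv))
```

Keeping the UUIDs in a list that you review first makes the join order explicit before you commit to a step that can't be reverted.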
Viewing, Editing, and Cancelling the Node Replacement Plan
To view the current node replacement plan, use the qq replace_nodes get_plan command.
If a node replacement step is in progress, the command lists the nodes that are being replaced during the current step.
To edit the node replacement plan after you register it with your cluster, use the qq replace_nodes register_plan command with a new node replacement plan.
To cancel the current node replacement plan, use the qq replace_nodes cancel_plan command.
Important
Cancelling a node replacement plan after executing one or more steps might make it impossible to reregister and complete the plan.
Monitoring the Data Protection Reconfiguration Process
To view the progress of the three stages of the data protection reconfiguration process, log in to the Qumulo Core Web UI and click Cluster.
Qumulo Core begins to move data to new nodes in the cluster and the Web UI displays the message Rebalancing for data protection reconfiguration.
Qumulo Core reencodes all data on your cluster and the Web UI displays the message Reconfiguring data protection.
Note
In certain scenarios, this stage might appear to pause while the system performs preparatory work on the cluster.
When this stage is complete, your data is protected according to the cluster’s new configuration and the system begins to use the new drive and node fault tolerance levels.
Qumulo Core adds new capacity to your cluster and the Web UI displays the message Rebalancing.
If you initiated the reconfiguration process as part of a node replacement step, the system migrates data from the existing nodes in the cluster.
Cluster Availability During the Reconfiguration Process
Your cluster remains available throughout the data protection reconfiguration process.
You can upgrade Qumulo Core.
Your cluster maintains the ability to recover from node and drive failure automatically.
During the reconfiguration process, the drive and node fault tolerance levels are each the lower of the levels that the existing and new configurations specify. For example, if your existing cluster has 2-node and 2-drive fault tolerance, and you initiate reconfiguration where the new configuration has 1-node and 3-drive fault tolerance, your cluster has 1-node and 2-drive fault tolerance during the reconfiguration process.
- To avoid impact to frontend workloads, Qumulo Core slows down the reconfiguration process automatically.
- When Qumulo Core finds missing nodes or drives, it pauses the reconfiguration process. When you replace or bring the nodes or drives online, the reconfiguration process continues.
- It isn't possible to add or replace nodes during the reconfiguration process.
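The fault tolerance rule described in this section (each level stays at the minimum of the existing and new configurations while reconfiguration runs) can be expressed as a short Python sketch. The FaultTolerance type and the function name are hypothetical and serve only to illustrate the arithmetic.

```python
from typing import NamedTuple


class FaultTolerance(NamedTuple):
    nodes: int
    drives: int


def during_reconfiguration(existing, new):
    """Return the effective levels while reconfiguration is in progress.

    Each level is the minimum of the existing and new configurations.
    """
    return FaultTolerance(
        nodes=min(existing.nodes, new.nodes),
        drives=min(existing.drives, new.drives),
    )


# Existing: 2-node, 2-drive. New: 1-node, 3-drive.
print(during_reconfiguration(FaultTolerance(2, 2), FaultTolerance(1, 3)))
# FaultTolerance(nodes=1, drives=2)
```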