This section explains how to safely replace hardware components on Arrow appliance platforms running Qumulo Core.
Component Replacement Overview
| Component | Hot-Swappable | Node Offline Required |
|---|---|---|
| Drives (NVMe, SSD, HDD) | Yes | No |
| Boot Drive | No | Yes (special procedure) |
| Power Supplies | Yes | No |
| DIMMs (Memory) | No | Yes |
| CPUs | No | Yes |
| Motherboard | No | Yes |
| NICs | No | Yes |
For components that require the node to be taken offline, you must follow the **Node Recuse and Reintroduction** procedure below to ensure data protection.
Hot-Swappable Components
Drive Replacement
Drives in Qumulo clusters are hot-swappable. You can replace a failed drive without taking the node offline.
- Identify the failed drive using the Qumulo Web UI or the `qq cluster_slots_status` command.
- Locate the physical drive using the Drive Bay Mapping.
- Remove the failed drive.
- Insert the replacement drive.
- Qumulo Core automatically detects and incorporates the new drive.
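The identification step above can be partly scripted. The following is a minimal sketch, assuming that `qq cluster_slots_status` prints JSON in which each slot carries a `"state"` field whose value is `"healthy"` for a working drive. That output shape is an assumption, not something this guide confirms, and `count_unhealthy_slots` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: count drive slots whose state is anything other than "healthy".
# ASSUMPTION: the slot status JSON contains entries such as  "state": "healthy".
# count_unhealthy_slots is a hypothetical helper, not a Qumulo Core command.

count_unhealthy_slots() {
    # Pull out every "state": "<value>" pair, one per line, then count
    # the values that differ from "healthy". Prints 0 when none differ.
    grep -o '"state"[[:space:]]*:[[:space:]]*"[^"]*"' |
        awk -F'"' '$4 != "healthy" { n++ } END { print n + 0 }'
}

# Usage on a node (not run here):
#   qq cluster_slots_status | count_unhealthy_slots
```

A nonzero count only indicates that some slot needs attention. The Web UI or the full command output is still needed to find which drive bay is affected.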
Power Supply Replacement
Power supplies are hot-swappable when the server has redundant power supplies.
- Verify the server has redundant power supplies and the remaining PSU is functioning.
- Remove the failed power supply.
- Insert the replacement power supply.
Boot Drive Replacement
After you replace the boot drive, you must initialize the replacement boot drive by using the Qumulo Core Installer and then rebuild the replacement boot drive by using a script on the node in your cluster.
Step 1: Initialize the Replacement Boot Drive
To get the correct version of the Qumulo Core Installer for the node in your cluster, contact the Qumulo Care Team.
- Power on your node, enter the boot menu, and select your USB drive.
  The Qumulo Core Installer begins to run automatically.
- When prompted, take the following steps:
  - Select `[x] Perform maintenance`.
  - Select `[1] Boot drive reset` and then follow the prompts.
  The Qumulo Core Installer initializes the boot drive.
- When the process is complete, the node powers down automatically.
Step 2: Rebuild the Replacement Boot Drive
- Power on your node and log in to the node by using the `qq` CLI.
- To get `root` privileges, run the `sudo qsh` command.
- To stop the Qumulo Networking Services, run the `service qumulo-networking stop` command.
- To configure the IP address for the node, run the `ip addr add` command and specify the node's IP address. For example (replace `CIDR` with the network prefix length): `ip addr add 203.0.113.0/CIDR dev bond0`
- Ensure that the node can ping other nodes in the cluster.
- Run the `rebuild_boot_drive.py` script and specify the IP address of another node in the cluster, the ID of the node whose boot drive has been replaced, and the password of the administrative account of the cluster. For example: `/opt/qumulo/rebuild_boot_drive.py --address 203.0.113.1 --node-id 2 --username admin --password my\(Special\*Password`
  Note: If your password includes special characters such as the parenthesis (`(`) or the asterisk (`*`), use the backslash (`\`) to escape these characters.
  Follow the prompts.
- When the process is complete, reboot the node.
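The escaping rule in the note about special characters is standard shell behavior, so you can check it before typing a real password. The following is a minimal sketch. `show_arg` is a hypothetical stand-in for `rebuild_boot_drive.py` that only prints the argument it receives. Single quoting is shown as an alternative that is general shell practice rather than something this guide prescribes.

```shell
#!/bin/sh
# Sketch: show what a command actually receives when a password is escaped.
# show_arg is a hypothetical stand-in for the real script; it prints its first argument.
show_arg() {
    printf '%s\n' "$1"
}

# Backslash-escaped, as in the note: the shell removes the backslashes
# and passes the literal characters to the command.
escaped=$(show_arg my\(Special\*Password)

# Single-quoted: the shell passes every character through unchanged.
quoted=$(show_arg 'my(Special*Password')

echo "$escaped"   # prints my(Special*Password
echo "$quoted"    # prints my(Special*Password
```

Single quotes cannot contain a single quote character themselves, so a password that includes one still needs backslash escaping.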
Components Requiring Node Offline
For DIMM, CPU, motherboard, or NIC replacements, the node must be safely taken offline using the following procedure.
Before You Begin
Before taking any node offline, you **must** verify that cluster protection allows for a node to be removed. Failing to verify this could result in data loss or cluster unavailability.
Step 1: Verify Cluster Protection Status
You can verify cluster protection status using either the Web UI or the CLI.
Using the Qumulo Web UI
- Log in to the Qumulo Web UI.
- Navigate to Cluster > Cluster Overview.
- Under Data Protection, verify that the cluster can tolerate a node failure.
Using the qq CLI
SSH into any node and run: `qq protection_status_get`
If `remaining_node_failures` is **greater than or equal to 1**, you can safely proceed with taking a node offline. If `remaining_node_failures` is **0**, do **not** proceed. Contact the Qumulo Care Team for guidance.
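The decision rule above can be scripted as a gate that runs before any recuse step. The following is a minimal sketch, assuming that `qq protection_status_get` prints JSON that includes a numeric `remaining_node_failures` field. The function names are hypothetical and are not part of Qumulo Core.

```shell
#!/bin/sh
# Sketch: decide whether a node can be taken offline, based on the
# remaining_node_failures value in the protection status output.
# ASSUMPTION: the output is JSON containing  "remaining_node_failures": <integer>.

# Print the integer value of remaining_node_failures read from stdin.
# Prints nothing if the field is absent.
remaining_node_failures() {
    sed -n 's/.*"remaining_node_failures"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' |
        head -n 1
}

# Succeed (exit status 0) only when the cluster tolerates at least one more
# node failure. A missing field counts as "not safe".
safe_to_take_node_offline() {
    remaining=$(remaining_node_failures)
    [ -n "$remaining" ] && [ "$remaining" -ge 1 ]
}

# Usage on a node (not run here):
#   qq protection_status_get | safe_to_take_node_offline && echo "OK to proceed"
```

Treating a missing or unparseable field as unsafe is a deliberate choice in this sketch: if the output cannot be read, the gate fails closed and nothing proceeds.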
Step 2: Recuse the Node
Recusing a node safely removes it from the cluster quorum.
- SSH into the node that requires maintenance.
- Run the recuse command: `/opt/qumulo/recuse_node.py --reason "Component replacement"`
- After you run this command, a red banner appears in the Qumulo Web UI stating: `Unable to communicate with node X`
This confirms the node was successfully recused from the cluster.
Step 3: Power Off the Node
After the node is recused, power it off using IPMI, the CLI, or the physical power button.
Step 4: Replace the Component
Perform the hardware replacement. For detailed procedures, contact the Qumulo Care Team.
DIMM Replacement Notes
When replacing DIMMs:
- Follow proper DIMM population guidelines.
- Ensure replacement DIMMs match the specifications of the original.
- Use proper ESD precautions.
Step 5: Power On the Node
After completing the component replacement:
- Power on the node.
- Wait for the node to complete POST and boot Qumulo Core.
- Verify the component is functioning.
Step 6: Reintroduce the Node
After verifying the component replacement was successful:
- SSH into the node that was serviced.
- Run the reintroduce command: `/opt/qumulo/sbin/reintroduce_node.sh`
- Monitor the Qumulo Web UI to verify:
  - The red banner disappears
  - The node appears healthy in the cluster overview
  - Data reprotection begins (if applicable)
Step 7: Verify Cluster Health
After reintroduction:
- Check cluster protection status: `qq protection_status_get`
- Verify all nodes are healthy: `qq cluster_slots_status`
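Because reprotection can take time after reintroduction, the protection check may need to be repeated. The following is a minimal sketch of a polling loop, assuming that `qq protection_status_get` prints JSON with a numeric `remaining_node_failures` field. `PROTECTION_CMD`, `wait_for_protection`, the attempt count, and the delay are hypothetical names and values chosen for illustration.

```shell
#!/bin/sh
# Sketch: poll until the cluster again tolerates at least one node failure.
# ASSUMPTIONS: JSON output with  "remaining_node_failures": <integer>;
# PROTECTION_CMD and wait_for_protection are hypothetical, not Qumulo Core features.

# Command used to fetch protection status; overridable for testing.
PROTECTION_CMD=${PROTECTION_CMD:-"qq protection_status_get"}

# wait_for_protection ATTEMPTS DELAY_SECONDS
# Returns 0 as soon as remaining_node_failures >= 1, or 1 after ATTEMPTS checks.
wait_for_protection() {
    attempts=$1
    delay=$2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        remaining=$($PROTECTION_CMD 2>/dev/null |
            sed -n 's/.*"remaining_node_failures"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' |
            head -n 1)
        if [ -n "$remaining" ] && [ "$remaining" -ge 1 ]; then
            echo "protection restored: remaining_node_failures=$remaining"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "protection not restored after $attempts checks" >&2
    return 1
}

# Usage on a node (not run here): check every 60 seconds, up to 30 times.
#   wait_for_protection 30 60
```

If the loop times out, treat the cluster as still unprotected and do not take another node offline.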
Getting Help
If you have any questions or encounter issues during component replacement, contact the Qumulo Care Team by Slack, email, or phone.