This section provides guidance for safely replacing hardware components on Supermicro platforms running Qumulo Core. It covers procedures for both hot-swappable components (drives, power supplies) and components that require the node to be taken offline (DIMMs, fans, CPUs, motherboards, NICs).
Component Replacement Overview
| Component | Hot-Swappable | Node Offline Required |
|---|---|---|
| Drives (NVMe, HDD) | Yes | No |
| Power Supplies | Yes | No |
| Fans | No | Yes |
| DIMMs (Memory) | No | Yes |
| CPUs | No | Yes |
| Motherboard | No | Yes |
| NICs | No | Yes |
For components that require the node to be taken offline, you must follow the Components Requiring Node Offline procedure below to ensure data protection.
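If you script your maintenance runbooks, the table above can be encoded as a small lookup that selects the correct procedure. The following Python sketch is illustrative only: the names `HOT_SWAPPABLE` and `requires_node_offline` are hypothetical and are not part of Qumulo Core.

```python
# Hypothetical helper that mirrors the Component Replacement Overview table.
# True means the component can be replaced while the node stays online.
HOT_SWAPPABLE = {
    "drive": True,
    "power supply": True,
    "fan": False,
    "dimm": False,
    "cpu": False,
    "motherboard": False,
    "nic": False,
}


def requires_node_offline(component: str) -> bool:
    """Return True if the node must be taken offline to replace the component."""
    key = component.strip().lower()
    if key not in HOT_SWAPPABLE:
        raise ValueError(f"Unknown component: {component!r}")
    return not HOT_SWAPPABLE[key]
```

For example, `requires_node_offline("DIMM")` returns `True`, while `requires_node_offline("drive")` returns `False`.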
Hot-Swappable Components
Drive Replacement
For the drive replacement procedure, see Drive Replacement.
Platform-Specific Notes
GrandTwin (AS-2115GT-HNTR)
Each node in the GrandTwin chassis has dedicated drive bays. Ensure you identify the correct node before replacing drives.
Hybrid Models (SH-series)
The SH-series models have separate NVMe cache bays and HDD storage bays. Verify you are replacing the correct drive type.
Power Supply Replacement
All Supermicro platforms use redundant power supplies, allowing hot-swap replacement.
- Verify redundancy: ensure both PSUs are connected and at least one is functioning.
- Identify the failed PSU by checking the PSU LED (amber indicates failure).
- Remove the failed PSU:
  - Disconnect the power cord.
  - Press the release latch.
  - Slide the PSU out of the chassis.
- Install the replacement PSU:
  - Slide the new PSU into the bay until it clicks.
  - Connect the power cord.
- Verify the PSU LED shows green.
Components Requiring Node Offline
For fan, DIMM, CPU, motherboard, or NIC replacements, you must safely take the node offline by using the following procedure.
Before You Begin
Before taking any node offline, you must verify that cluster protection allows for a node to be removed. Failing to verify this could result in data loss or cluster unavailability.
Step 1: Verify Cluster Protection Status
You can verify cluster protection status using either the Web UI or the CLI.
Using the Qumulo Web UI
- Log in to the Qumulo Web UI.
- Navigate to Cluster > Cluster Overview.
- Under Data Protection, verify that the cluster can tolerate a node failure.
Using the qq CLI
SSH into any node and run:
qq protection_status_get
You will see output similar to:
{
"remaining_drive_failures": 1,
"remaining_node_failures": 2
}
If remaining_node_failures is greater than or equal to 1, you can safely proceed with taking a node offline.
If remaining_node_failures is 0, do not proceed and contact the Qumulo Care Team for guidance.
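The check above can be automated. The following is a minimal Python sketch that assumes you have captured the JSON output of `qq protection_status_get` as text and that it contains the `remaining_node_failures` field shown in the sample output. The function name `can_take_node_offline` is hypothetical.

```python
import json


def can_take_node_offline(status_json: str) -> bool:
    """Return True if the cluster can tolerate at least one more node failure.

    status_json is the text printed by `qq protection_status_get`
    (assumed to match the sample output in this section).
    """
    status = json.loads(status_json)
    return int(status["remaining_node_failures"]) >= 1


# Sample output copied from this section.
sample = '{"remaining_drive_failures": 1, "remaining_node_failures": 2}'
print(can_take_node_offline(sample))  # True
```

If the function returns `False`, stop and contact the Qumulo Care Team rather than continuing with the procedure.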
Step 2: Recuse the Node
Recusing a node safely removes it from the cluster quorum, allowing data protection mechanisms to account for the missing node.
- SSH into the node that requires maintenance.
- Run the recuse command:
  /opt/qumulo/recuse_node.py --reason "Component replacement"
- After running this command, a red banner appears in the Qumulo Web UI stating:
  Unable to communicate with node X
  This banner confirms that the node was successfully recused from the cluster.
Step 3: Power Off the Node
After the node is recused, power it off:
Using IPMI
- Log in to the node’s IPMI interface.
- Navigate to Remote Control > Power Control.
- Click Power Off.
Using the CLI
From another node or management system:
ipmitool -I lanplus -H <IPMI-IP> -U <your_user> -P <password> power off
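If you drive this step from a script, passing the arguments as a list avoids shell-quoting problems with passwords that contain special characters. The following Python sketch builds the same `ipmitool` invocation shown above. The function names are hypothetical, and the `on` and `status` actions are included on the assumption that your `ipmitool` build supports the standard `power` subcommands.

```python
import subprocess
from typing import List


def ipmi_power_command(host: str, user: str, password: str,
                       action: str = "off") -> List[str]:
    """Build the ipmitool argument list for a chassis power action."""
    if action not in {"off", "on", "status"}:
        raise ValueError(f"Unsupported power action: {action!r}")
    return [
        "ipmitool", "-I", "lanplus",
        "-H", host, "-U", user, "-P", password,
        "power", action,
    ]


def run_ipmi_power(host: str, user: str, password: str,
                   action: str = "off") -> str:
    """Run the command and return its output; raises if ipmitool fails."""
    cmd = ipmi_power_command(host, user, password, action)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

Run this from another node or a management system that can reach the node's IPMI address, not from the node being powered off.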
Using the Physical Power Button
Press and hold the power button on the front panel until the server powers off.
Step 4: Replace the Component
Perform the hardware replacement according to your hardware vendor’s documentation:
- AS-1115HS-TNR User’s Manual (SF-46TB, SF-92TB, SF-184TB)
- AS-2115GT-HNTR User’s Manual (GrandTwin 62T, 185T, 369T)
- ASG-2015S-E1CR24L User’s Manual (SH-48TB, SH-96TB, SH-240TB, SH-576TB, SH-720TB)
- AS-2115HS-TNR User’s Manual (SHF-983T)
- ASG-2115S-NE332R User’s Manual (SHF-1475T)
DIMM Replacement Notes
When replacing DIMMs:
- Follow your vendor’s DIMM population guidelines.
- Ensure replacement DIMMs match the specifications of the original.
- Use proper ESD precautions.
Step 5: Power On the Node
After completing the component replacement:
- Power on the node using the physical power button, the IPMI interface, or the ipmitool command.
- Wait for the node to complete POST and boot Qumulo Core.
- Verify the component is functioning:
  - For DIMMs: Check total memory in IPMI or using qq node_status
  - For NICs: Check network connectivity
  - For fans: Check IPMI sensor readings
Step 6: Reintroduce the Node
After verifying the component replacement was successful, reintroduce the node to the cluster:
- SSH into the node that was serviced.
- Run the reintroduce command:
  /opt/qumulo/sbin/reintroduce_node.sh
- Monitor the Qumulo Web UI to verify:
  - The red banner disappears
  - The node appears healthy in the cluster overview
  - Data reprotection begins (if applicable)
Step 7: Verify Cluster Health
After reintroduction:
- Check cluster protection status:
  qq protection_status_get
- Verify all nodes are healthy:
  qq cluster_slots
- Monitor the Web UI for any alerts or warnings.
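One way to confirm that protection has recovered is to compare the output of `qq protection_status_get` captured before maintenance with the output captured afterward. The Python sketch below assumes both outputs contain the two fields shown in the Step 1 sample. The function name `protection_restored` is hypothetical, and the sketch assumes that the values might remain lower than the baseline while reprotection is still running.

```python
import json

# Fields taken from the sample output of `qq protection_status_get`.
PROTECTION_KEYS = ("remaining_drive_failures", "remaining_node_failures")


def protection_restored(before_json: str, after_json: str) -> bool:
    """Return True if every protection counter is back at or above its baseline."""
    before = json.loads(before_json)
    after = json.loads(after_json)
    return all(after[key] >= before[key] for key in PROTECTION_KEYS)
```

If the function returns `False` after reprotection has finished, treat it as a reason to review the Web UI alerts before closing out the maintenance.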
Getting Help
If you have any questions or encounter issues during component replacement, contact the Qumulo Care Team through Slack, email, or by phone.
When contacting support, provide:
- Cluster name and node serial number
- Description of the issue
- Error messages from the Qumulo Web UI or CLI
- IPMI event logs (if available)