I am hoping someone can give me some information on what might be going on. I have an HL-15 with the X11SPH-nCTPF motherboard. It's been about a year since I bought it, and everything worked great. I was messing around and had to reinstall. I have installed Rocky 8 and TrueNAS, and for some reason my drives are coming up with errors.
I ordered more drives from Amazon and they show as bad too. The array will be good for about a day, then crash. I keep swapping out drives and they keep crashing. The only thing I can think of is that the RAID controller on the motherboard is going bad. It's just hard to believe that my drives all went bad at the same time and that the new ones I get are also bad.
What happened to prompt the reinstall? Are there other hardware changes? Reinstalling an OS shouldn’t cause drives to start failing. Was the unit moved? Could a cable be loose?
Provide a bit more information about the type of drives (e.g., SAS or SATA, brand(s), model(s)), the pool layout, which HL15 slots the drives are in, and which slots are showing failing drives. On the X11SPH, half of the drives are connected to the onboard HBA controller, and half to a SATA controller. By default I think slots 1 to 8 go to SATA and 9 to 15 to the HBA.
What symptoms/errors do you get when the array "crashes"? Does one disk show as failed or more than one? Which slot(s) are the disks in when they fail? Have you run smartctl or CrystalDiskInfo on the “failed” drives to interrogate their SMART data?
Although probably not your issue, it would not be unexpected for a set of similar drives purchased and run together to start failing around the same time, as they will all have similar in-use characteristics. If the original failing drives (not the replacements) have substantially differing manufacture dates and/or brand/model, that is different.
I tried upgrading from Ubuntu 20 to 22 and messed everything up. At that point I installed TrueNAS, then went to Rocky 8 and back to TrueNAS. On each system I would build an array of 5 drives, and 1 to 2 of them would take the pool offline because they had errors. I would create a new pool with different drives and the same thing would happen. No matter what drives I put in, 1 to 2 always crash the pool.
I have tried putting the disks in starting at slot 1 and I have tried starting at slot 15. I have 13 total 12TB SATA drives and I can't get past creating 1 small pool without the drives showing errors and taking the pool offline. I did find that if you pull the drive and put it back in, the pool will come back online for a while and then go off again.
Yesterday I got 2 drives from Amazon and put one in. A drive that was already in a pool failed while I was adding the new drive to the pool, and the new drive also reported as failed.
I would look at the SMART data and confirm whether the drive(s) are actually showing errors. I’d also look in the system logs (dmesg or whatever) for error messages about the controller or communication with the drives.
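As a rough illustration of the SMART check suggested here, the sketch below summarises the JSON report that smartctl can emit (the `--json` option, smartmontools 7.0 or later) for a SATA drive. The choice of attributes to watch, the "media" versus "link" labels, the function names, and the `/dev/sda` device path are assumptions made for this example, not something from the thread. The idea is that nonzero reallocated or pending sector counts suggest the disk itself, while a nonzero UDMA CRC error count counts link-level transfer errors and so more often points at cabling or a backplane.

```python
import json
import subprocess

# SMART attribute IDs watched in this sketch (an illustrative selection).
# "media" attributes suggest a problem on the disk surface; the "link"
# attribute counts interface CRC errors between the drive and the controller.
WATCHED = {
    5: "media",    # Reallocated_Sector_Ct
    197: "media",  # Current_Pending_Sector
    198: "media",  # Offline_Uncorrectable
    199: "link",   # UDMA_CRC_Error_Count
}


def triage(report: dict) -> dict:
    """Summarise a parsed `smartctl --json -a` report for an ATA/SATA drive."""
    table = report.get("ata_smart_attributes", {}).get("table", [])
    nonzero = {"media": [], "link": []}
    for attr in table:
        kind = WATCHED.get(attr.get("id"))
        raw = attr.get("raw", {}).get("value", 0)
        if kind and raw > 0:
            nonzero[kind].append((attr.get("name"), raw))
    return {
        # Overall health verdict reported by the drive (None if absent).
        "passed": report.get("smart_status", {}).get("passed"),
        "media": nonzero["media"],
        "link": nonzero["link"],
    }


def triage_device(device: str) -> dict:
    """Run smartctl on one device and summarise it (usually needs root).

    smartctl can exit nonzero even when it produced a report, so the exit
    status is deliberately not treated as a failure here.
    """
    result = subprocess.run(
        ["smartctl", "--json", "-a", device],
        capture_output=True,
        text=True,
    )
    return triage(json.loads(result.stdout))
```

For example, `triage_device("/dev/sda")` run against each "failed" drive would show whether the drive reports media errors at all. A drive with an empty "media" list but entries under "link" would fit a cable or backplane problem better than a dying disk, though this is only a first-pass hint.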
Since you are having trouble with all slots, I’m not sure I’d suspect the motherboard first, because as I said slot 1 and 15 will be on different controllers. I’d normally suggest checking the cables are all plugged in securely, and that couldn’t hurt, but it doesn’t sound like you had the case open making hardware changes.
See what help you get here, but I’d also send a note to info@45homelab.com. Although uncommon, this seems like it may be a backplane issue. I don’t know how to diagnose that and rule it in or out, but they will.
Hi @herc5854 ,
You don’t say what type of drives you are using. Are they SAS or SATA?
Several of us had some issues with errors when using SAS drives. It was a cable issue.
There are plenty of cut-and-paste commands in that thread that you can use. If this is the issue, you will be able to drop the transfer speed to SAS2 for the drives and the errors will go away. Then you can replace the cables and re-enable full-speed SAS3.
Are there any consistencies in which slots are erroring out? Our drives are normally in groups of 4, so if the failing slots are located in the same area there is a chance it could be a bad cable.
Somewhat: slots 5 and 6. I haven't added in the rest of the drives. If I remember, it was also slot 13 and one other that I don't remember. Yesterday everything said it was good. I went to add a drive to the array and that is when slot 5 failed, along with the new drive in slot 6.
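To see whether failing slots cluster on one cable or one controller, the slot numbers can be tallied as in the sketch below. It rests on assumptions worth checking against the actual chassis: that the "groups of 4" are consecutive slot numbers (1 to 4, 5 to 8, 9 to 12, 13 to 15), and that slots 1 to 8 go to the SATA controller and 9 to 15 to the HBA, which was only offered earlier in the thread as a tentative default. The function names are made up for this example.

```python
from collections import Counter

TOTAL_SLOTS = 15
CABLE_GROUP_SIZE = 4  # "groups of 4"; consecutive numbering is an assumption
LAST_SATA_SLOT = 8    # tentative default from the thread: 1-8 SATA, 9-15 HBA


def cable_group(slot: int) -> int:
    """Return the 1-based cable group a slot would belong to."""
    if not 1 <= slot <= TOTAL_SLOTS:
        raise ValueError(f"slot must be between 1 and {TOTAL_SLOTS}")
    return (slot - 1) // CABLE_GROUP_SIZE + 1


def controller(slot: int) -> str:
    """Return which onboard controller a slot would be wired to."""
    if not 1 <= slot <= TOTAL_SLOTS:
        raise ValueError(f"slot must be between 1 and {TOTAL_SLOTS}")
    return "SATA" if slot <= LAST_SATA_SLOT else "HBA"


def tally(failed_slots):
    """Count failures per cable group and per controller."""
    groups = Counter(cable_group(s) for s in failed_slots)
    controllers = Counter(controller(s) for s in failed_slots)
    return groups, controllers


# Slots reported as failing so far in this thread.
groups, controllers = tally([5, 6, 13])
print(dict(groups))       # failures per cable group
print(dict(controllers))  # failures per controller
```

Under these assumptions, slots 5 and 6 share a cable group while slot 13 sits in a different group and on the other controller, so the failures would not be explained by a single cable or a single controller alone.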
Hey @herc5854
Hmm okay! It could be a backplane issue. Do you have the ability to try removing the drives and seeing if errors pop up on the pool with those drives gone? If there is a lot of data it will be harder. However, if the slots stay consistent, that is most likely a backplane issue, if I had to make an educated guess.
It may be beneficial to put in a support ticket and let one of our specialists take a look, just in case we have to send some new hardware!
Currently I am running TrueNAS and it will remove the drive. So I can pull the drive and put it right back in, and it will come up and the pool will be good again. It just seems to be when I start putting stress on the system, by adding a drive or copying over a large amount of data, that it errors out. Thanks, I will add more drives to see if it is certain slots and will put in a ticket so I have more information to give them.
Just wanted to give an update on this. I opened a support ticket and within 3 hours I had a new backplane shipped. I have installed the backplane and have not had one issue with any of my drives. This company is awesome in how fast they corrected my issue and helped me get my system back online.