RESOLVED: ZFS Write Errors with HL15 Full Build and SAS Drives

SM support agreed that stepping down the link speed made sense, but didn’t explain why. They mentioned the age of my drives (2014) and the firmware, but I pointed out that you were running different drives from a different manufacturer with the same problem.

I asked them if an LSIget dump from you would be helpful and am waiting for a response.

I also re-emphasized that these drives appeared to be fine at SAS3/12Gbps when connected to the 9300-8i on the same firmware.

Using LSIUTIL to set SAS3008 Controller to SAS-2 6Gbps Max Speed

  1. Download a copy of lsiutil. I used lsiutil version 1.7, archived on GitHub thanks to Thomas Lovell. I’m running Ubuntu 22.04, so I used wget to download it and chmod to make it executable.
wget https://raw.githubusercontent.com/thomaslovell/LSIUtil/master/Binaries/LSIutil_1.70_release_binaries/linux/lsiutil.x86_64 && chmod +x lsiutil.x86_64
  2. Execute lsiutil with elevated privileges or as root.
sudo ./lsiutil.x86_64
  3. Now in lsiutil, select the controller by entering its number. There’s probably only one, so enter 1.
Step 3 Results
LSI Logic MPT Configuration Utility, Version 1.71, Sep 18, 2013

1 MPT Port found

     Port Name         Chip Vendor/Type/Rev    MPT Rev  Firmware Rev  IOC
 1.  ioc0              LSI Logic SAS3008 C0      205      10000a00     0

Select a device:  [1-1 or 0 to quit] 1
  4. Verify the SAS controller settings using option 68. You should see SAS3008's links are 12.0 G, ....
Step 4 Results
 Main menu, select an option:  [1-99 or e/p/w or 0 to quit] 68 

Current Port State
------------------
SAS3008's links are 12.0 G, 12.0 G, 12.0 G, 12.0 G, 12.0 G, 12.0 G, 12.0 G, down

Software Version Information
----------------------------
Current active firmware version is 10000a00 (16.00.10)
Firmware image's version is MPTFW-16.00.10.00-IT
  LSI Logic
  Not Packaged Yet
x86 BIOS image's version is MPT3BIOS-8.37.00.00 (2018.04.04)
EFI BIOS image's version is 18.00.00.00

Firmware Settings
-----------------
SAS WWID:                       5003048xxxxxxxxx
Multi-pathing:                  Disabled
SATA Native Command Queuing:    Enabled
SATA Write Caching:             Enabled
SATA Maximum Queue Depth:       128
SAS Max Queue Depth, Narrow:    256
SAS Max Queue Depth, Wide:      256
Device Missing Report Delay:    0 seconds
Device Missing I/O Delay:       0 seconds
Phy Parameters for Phynum:      0    1    2    3    4    5    6    7    
  Link Enabled:                 Yes  Yes  Yes  Yes  Yes  Yes  Yes  Yes  
  Link Min Rate:                3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  
  Link Max Rate:                12.0 12.0 12.0 12.0 12.0 12.0 12.0 12.0 
  SSP Initiator Enabled:        Yes  Yes  Yes  Yes  Yes  Yes  Yes  Yes  
  SSP Target Enabled:           No   No   No   No   No   No   No   No   
  Port Configuration:           Auto Auto Auto Auto Auto Auto Auto Auto 
Interrupt Coalescing:           Enabled, timeout is 10 us, depth is 4
  5. OPTIONAL: Reset to defaults for good measure using option 61. You can also check whether anything differs from the defaults using option 60.
Step 5 Results
Main menu, select an option:  [1-99 or e/p/w or 0 to quit] 61

Restoring default persistent settings...

IO Unit Page 1 is persistent

IO Unit Page 8 is persistent

IOC Page 1 is persistent
Defaults restored

IOC Page 8 is persistent

BIOS Page 1 is persistent

BIOS Page 3 is persistent
Defaults restored

BIOS Page 4 is persistent

SAS IO Unit Page 1 is persistent

SAS IO Unit Page 4 is persistent

SAS IO Unit Page 5 is persistent
Defaults restored

SAS IO Unit Page 7 is persistent
Defaults restored

SAS Phy Page 3 is persistent

Type 17h Page 0 is persistent

Type 19h Page 1 is persistent
  6. Change the maximum SAS link rate to 6 Gbps using option 13 on the main menu. Press enter to accept the defaults for the first five prompts.
Step 6 Results
Main menu, select an option:  [1-99 or e/p/w or 0 to quit] 13
SATA Maximum Queue Depth:  [0 to 255, default is 128] 
SAS Max Queue Depth, Narrow:  [0 to 65535, default is 256] 
SAS Max Queue Depth, Wide:  [0 to 65535, default is 256] 
Device Missing Report Delay:  [0 to 2047, default is 0] 
Device Missing I/O Delay:  [0 to 255, default is 0] 

PhyNum  Link      MinRate  MaxRate  Initiator  Target    Port
   0    Enabled     3.0      12.0    Enabled    Disabled  Auto
   1    Enabled     3.0      12.0    Enabled    Disabled  Auto
   2    Enabled     3.0      12.0    Enabled    Disabled  Auto
   3    Enabled     3.0      12.0    Enabled    Disabled  Auto
   4    Enabled     3.0      12.0    Enabled    Disabled  Auto
   5    Enabled     3.0      12.0    Enabled    Disabled  Auto
   6    Enabled     3.0      12.0    Enabled    Disabled  Auto
   7    Enabled     3.0      12.0    Enabled    Disabled  Auto
  7. Enter 8 when prompted for Select a Phy: [0-7, 8=AllPhys, RETURN to quit] to change settings on all the SAS links.
Step 7 Results
PhyNum  Link      MinRate  MaxRate  Initiator  Target    Port
   0    Enabled     3.0      12.0    Enabled    Disabled  Auto
   1    Enabled     3.0      12.0    Enabled    Disabled  Auto
   2    Enabled     3.0      12.0    Enabled    Disabled  Auto
   3    Enabled     3.0      12.0    Enabled    Disabled  Auto
   4    Enabled     3.0      12.0    Enabled    Disabled  Auto
   5    Enabled     3.0      12.0    Enabled    Disabled  Auto
   6    Enabled     3.0      12.0    Enabled    Disabled  Auto
   7    Enabled     3.0      12.0    Enabled    Disabled  Auto

Select a Phy:  [0-7, 8=AllPhys, RETURN to quit] 8
  8. Press enter for the next two prompts. When asked MaxRate: [0=1.5 Gbps, 1=3.0 Gbps, 2=6.0 Gbps, 3=12.0 Gpbs, or RETURN to not change] enter a value of 2 to set the max speed to 6.0 Gbps.
Step 8 Results
Link:  [0=Disabled, 1=Enabled, or RETURN to not change] 
MinRate:  [0=1.5 Gbps, 1=3.0 Gbps, 2=6.0 Gbps, 3=12.0 Gbps, or RETURN to not change]
MaxRate:  [0=1.5 Gbps, 1=3.0 Gbps, 2=6.0 Gbps, 3=12.0 Gpbs, or RETURN to not change] 2
Initiator:  [0=Disabled, 1=Enabled, or RETURN to not change]  
Target:  [0=Disabled, 1=Enabled, or RETURN to not change] 
Port configuration:  [1=Auto, 2=Narrow, 3=Wide, or RETURN to not change] 

PhyNum  Link      MinRate  MaxRate  Initiator  Target    Port
   0    Enabled     3.0      6.0    Enabled    Disabled  Auto
   1    Enabled     3.0      6.0    Enabled    Disabled  Auto
   2    Enabled     3.0      6.0    Enabled    Disabled  Auto
   3    Enabled     3.0      6.0    Enabled    Disabled  Auto
   4    Enabled     3.0      6.0    Enabled    Disabled  Auto
   5    Enabled     3.0      6.0    Enabled    Disabled  Auto
   6    Enabled     3.0      6.0    Enabled    Disabled  Auto
   7    Enabled     3.0      6.0    Enabled    Disabled  Auto

Select a Phy:  [0-7, 8=AllPhys, RETURN to quit] 
  9. Press enter to return to the main menu.

  10. OPTIONAL: Try option 99 to reset the links to pick up the changes. NOTE: This will cause drives to disconnect momentarily.

  11. Enter 0 and 0 again at the next prompt to exit lsiutil.

  12. Power off or reboot the computer to pick up the changes (assuming step 10 wasn’t executed or didn’t work).

  13. Use sudo smartctl -x /dev/sgX to see the drive link speed reported as 6 Gbps in the Protocol Specific port log page for SAS SSP section.

Step 13 Results
Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 0
  number of phys = 1
  phy identifier = 0
    attached device type: SAS or SATA device
    attached reason: unknown
    reason: unknown
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=1 stp=1 smp=1
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000c50084dfd3b1
    attached SAS address = 0x500304801d01d40e
    attached phy identifier = 2
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 2
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 2
     Phy reset problem count: 0
relative target port id = 2
  generation code = 0
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: unknown
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000c50084dfd3b2
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0
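
For anyone checking several drives, the negotiated rate can be pulled out of the smartctl output rather than read by eye. This is a minimal sketch, assuming the output looks like the Step 13 results above; the function name negotiated_link_rates is made up for illustration and is not part of smartmontools.

```shell
# Read `smartctl -x` output on stdin and print one line per phy:
# "<phy identifier> <negotiated logical link rate>".
# Assumes the "Protocol Specific port log page for SAS SSP" layout shown above.
negotiated_link_rates() {
    awk '
        # Matches "phy identifier = N" but not "attached phy identifier = N".
        /^[ \t]*phy identifier = / { phy = $NF }
        /negotiated logical link rate: / {
            sub(/^.*negotiated logical link rate: /, "")
            print phy, $0
        }
    '
}
```

Usage would be something like sudo smartctl -x /dev/sgX | negotiated_link_rates, where a drive pinned as described should report 6 Gbps on its connected phy.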

Check out this LSI User Guide for more information:

Found some newer firmware (AD05) and updated all 8 drives, reset them to SAS3/12Gbps, and will run some more tests.

SM recommended a firmware update, so this is to satisfy them. Will report back.

EDIT: lmfao, took about 8 seconds to clock write errors on the usual culprits. So much for that.

The Seagates I’m using now were made in 2016 and are all on the latest E004 firmware. Feel free to pass that along if you think it’ll help.

Fantastic detective work here fellas. Obviously sucks that you had to go down this rabbit hole, but if y’all are anything like me - the times I’ve learned the most are the times I’ve gone down the rabbit hole and come out the other side :joy:. You are right that most of leadership (myself included) are on holidays this week. We will be back in full force with tons of gusto to hit the ground running come Tuesday!

Apologies if y’all felt left on your own over the last few days. I’ll talk to the team and see what we can do for you guys. Great little guide you put together there too @rymandle05 !

Happy new year guys! I’ll check in from the office Tuesday to see if it did in fact completely fix the errors for you.

The recognition is much appreciated! I’m a problem solver at heart so there’s a lot of personal reward in figuring this out. I’m also off work all this week so, of course, that means I have the most time to tinker, whereas I’m usually pretty limited. :joy:

FYI - if you run into similar issues with lsiutil like I did, you can set up symlinks between the newer version of ncurses and the old filenames, like so:

ln -s /usr/lib64/libncursesw.so.6.1 /usr/lib64/libncurses.so.5
ln -s /usr/lib64/libncursesw.so.6.1 /usr/lib64/libtinfo.so.5

It’s worked fine … so far.
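
A slightly more defensive take on those symlinks, as a rough sketch: it only links when the newer library actually exists and the old name isn't already taken. The helper name link_compat_lib is hypothetical, and the library paths will differ between distros and versions.

```shell
# Create a compatibility symlink only if the target exists and the
# link name is not already in use. Returns non-zero if the target is missing.
link_compat_lib() {
    lc_target="$1"
    lc_link="$2"
    if [ ! -e "$lc_target" ]; then
        echo "skip: $lc_target not found" >&2
        return 1
    fi
    if [ -e "$lc_link" ] || [ -L "$lc_link" ]; then
        echo "skip: $lc_link already exists" >&2
        return 0
    fi
    ln -s "$lc_target" "$lc_link"
}
```

For example, run as root: link_compat_lib /usr/lib64/libncursesw.so.6.1 /usr/lib64/libncurses.so.5, and the same again for libtinfo.so.5.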

I haven’t pushed this to GitHub, yet, but here’s what I wrote up for the HGST drives so far.

Note this is for the HUH728080AL5200 - I don’t have other HGST drives to test. YMMV. Also, note that HUGO is a WD/HGST tool and probably won’t work with other drives.

HGST

Configure Link Speed

Set to SAS2/6Gbps

sdparm -p pcd --set=PMALR=10 -t sas -S /dev/sdX
sdparm -p pcd --set=PMALR.1=10 -t sas -S /dev/sdX

Set to SAS3/12Gbps

sdparm -p pcd --set=PMALR=11 -t sas -S /dev/sdX
sdparm -p pcd --set=PMALR.1=11 -t sas -S /dev/sdX

Breakdown:

  • -p pcd is the “page”, in this case it’s the PHY Control and Discovery, or PCD. Equivalent to -p 0x19,0x01 in hex.
  • --set=PMALR=XX sets the Programmed Maximum Link Rate (PMALR). I believe there are two SAS channels, so there’s a PMALR.1 as well. Equivalent to --set=0x29:4:1=XX and --set=0x59:4:1=XX respectively.
  • -t sas defines the transport protocol.
  • -S tells sdparm to save the change so it persists across the next drive or system power cycle, rather than only editing the current setting.
  • /dev/sdX is the device being modified.
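
With a shelf full of drives, it can help to generate the commands first and review them before running anything. The sketch below only prints the sdparm lines; it doesn't touch any drive. The function names are made up for illustration. Codes 10 and 11 are the ones used above; 8 and 9 (1.5 and 3 Gbps) are my understanding of the standard SAS link rate codes and haven't been tested on these drives.

```shell
# Map a link speed in Gbps to the PMALR link rate code.
# 10 and 11 match the commands above; 8 and 9 are untested assumptions.
pmalr_code() {
    case "$1" in
        1.5) echo 8 ;;
        3)   echo 9 ;;
        6)   echo 10 ;;
        12)  echo 11 ;;
        *)   echo "unsupported speed: $1" >&2; return 1 ;;
    esac
}

# Print (do not run) the sdparm commands that set both ports on each drive.
# Usage: print_pmalr_commands <speed in Gbps> <device> [<device> ...]
print_pmalr_commands() {
    pm_code="$(pmalr_code "$1")" || return 1
    shift
    for pm_dev in "$@"; do
        echo "sdparm -p pcd --set=PMALR=$pm_code -t sas -S $pm_dev"
        echo "sdparm -p pcd --set=PMALR.1=$pm_code -t sas -S $pm_dev"
    done
}
```

For example, print_pmalr_commands 6 /dev/sda /dev/sdb prints four lines that can be checked and then pasted into a root shell.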

Check PMALR values:

# sdparm -g PMALR -l /dev/sda
    /dev/sda: HGST      HUH728080AL5200   A515
PMALR         10  [cha: y, def: 10, sav: 10]  Programmed maximum link rate
# sdparm -g PMALR.1 -l /dev/sda
    /dev/sda: HGST      HUH728080AL5200   A515
PMALR.1       10  [cha: y, def: 11, sav: 10]  Programmed maximum link rate
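
To compare the current, default, and saved values across drives, the PMALR lines can be boiled down to one short line each. A minimal sketch, assuming the sdparm -g ... -l output format shown above; pmalr_summary is a hypothetical helper name.

```shell
# Read `sdparm -g PMALR -l` (or PMALR.1) output on stdin and print
# "<field> cur=<n> def=<n> sav=<n>" for each PMALR line.
pmalr_summary() {
    awk '/^PMALR/ {
        cur = $2
        dflt = ""; sav = ""
        for (i = 3; i <= NF; i++) {
            if ($i == "def:") { dflt = $(i + 1) }
            if ($i == "sav:") { sav = $(i + 1) }
        }
        # Strip the trailing "," or "]" left over from the bracketed list.
        gsub(/[^0-9-]/, "", dflt)
        gsub(/[^0-9-]/, "", sav)
        print $1, "cur=" cur, "def=" dflt, "sav=" sav
    }'
}
```

Usage would be along the lines of sdparm -g PMALR.1 -l /dev/sda | pmalr_summary. A saved value of 10 means the 6 Gbps pin should survive a power cycle.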

Firmware Update

1. Download HUGO.

Download HUGO 7.4.5 from the TrueNAS community resources library. Extract the contents of the .zip file.

2. Download new firmware.

Download firmware from HDDGuru. Extract the firmware .bin file from the .zip file.

Model            Firmware
HUH728080AL5200  A4GNAD05.zip

3. Install ncurses.

Install the latest version of ncurses-devel.

dnf install ncurses-devel -y

4. Create symlinks.

HUGO 7.4.5 requires libncurses.so.5 and libtinfo.so.5, but your OS repo may include a newer version. We can symlink the old library names to the new libraries installed.

ln -s /usr/lib64/libncursesw.so.6.1 /usr/lib64/libncurses.so.5
ln -s /usr/lib64/libncursesw.so.6.1 /usr/lib64/libtinfo.so.5

5. Update firmware.

Update drive firmware. This command updates one drive at a time. See hugo help for more options.

/path/to/hugo update -g /dev/sdX -f /path/to/A4GNAD05.bin
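
Since the command handles one drive at a time, here's a rough sketch that prints one update command per drive so they can be reviewed and run one by one. It deliberately doesn't execute anything. print_hugo_updates is a hypothetical helper, and it only reuses the update -g ... -f ... form shown above.

```shell
# Print (do not run) one hugo update command per drive.
# Usage: print_hugo_updates <path to hugo> <firmware .bin> <device> [<device> ...]
print_hugo_updates() {
    if [ "$#" -lt 3 ]; then
        echo "usage: print_hugo_updates <hugo> <firmware.bin> <device>..." >&2
        return 1
    fi
    hu_bin="$1"
    hu_fw="$2"
    shift 2
    for hu_dev in "$@"; do
        echo "$hu_bin update -g $hu_dev -f $hu_fw"
    done
}
```

For example: print_hugo_updates /path/to/hugo /path/to/A4GNAD05.bin /dev/sda /dev/sdb. I'd still flash one drive, confirm it comes back healthy, and only then move on to the next.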

Same as @rymandle05 - off this week, so a lot of time to tinker. And a borderline obsessive need to solve problems once discovered.

I plan to continue to research the SAS2 vs. SAS3 issue with the SAS3008 onboard controller, and both of our Supermicro tickets are still open and have been cross-referenced with each other.

I haven’t decided if I want to move forward with the SAS2/6Gbps solution and finish the build, or if I want to keep it all half torn apart to find out what the real problem is.

If I’ve followed the tests, and the 9300-8i worked on the same firmware with the 45Drives-provided SFF8643 cables at 12G, then I think it’s a 45Drives problem. Maybe it’s a batch of cables with bad shielding, maybe it’s something electrical with the SAS 3008 implementation on the X11SPH motherboard in general (that’s why I asked about the rev), or maybe it’s something with a specific batch of motherboards that you and rymandle05 received.

What percentage of the people who purchased the full build are running SAS drives? No one has chimed in to say, “I’ve got the full build and no issues with my SAS drives at 12G”.
I think you said you have, or have on order, cables that you know work properly with the onboard HBA? I’d RMA the 45Drives cables and get some sort of refund or credit off the purchase and let them figure out which component is actually not playing nicely.

I have no idea what the burn in tests for the full build are, but it seems like they should include a test of the (seven?) SAS drive slots specifically with SAS drives, perhaps based on one of the tests you all ran above. The thing is, this is homelab, so although current users might only have SATA, they may start experimenting with SAS drives at some point and then boom, they’re broadsided by this issue and may not have all the technical knowledge or time to diagnose it or find this thread.

I would be interested to see what other users have experienced with SAS3-capable drives on the onboard SAS3008.

I’ve wondered the same about SAS vs. SATA drive use. I happen to have a ton of them from an old environment tear down. Used Enterprise SAS drives are common on eBay, so I assume there are other HL15 users out there who’ll run into the same problem. However, I’m sure there are just as many that’ll use nothing but SATA drives and never run into this issue.

I’m sure the full build was tested, but we have to keep in mind that even if SAS drives were tested, not all SAS drives would have or could have been tested. It could still be an issue unique to something in common with these drives. Maybe 45D will run some additional tests with known good SAS3 disks to validate.

Per SM:

As you and the other user are both experiencing the same issue, I’ve escalated to our RD in charge of the onboard SAS.

Nice.

FYI - confirmed temperature isn’t a factor. With how hot these HBA chips can run, I wanted to make sure this wasn’t some manifestation of an overheat.

Speed    Fluke  FLIR  IR
6Gbps    55C    ?     ?
6Gbps    54C    62C   ?
6Gbps    56C    64C   ?
6Gbps    55C    64C   ?
6Gbps    54C    59C   54C
6Gbps    55C    60C   54C
6Gbps    56C    61C   56C
6Gbps    56C    61C   55C
12Gbps   57C    63C   57C
12Gbps   57C    63C   56C
  • 23C/73F room temp.
  • 26C/78F @ ~1.5" above the SAS3008 heat sink.

I didn’t test @ 12Gbps as much because the first transfer clocked write errors and took out four disks instead of the normal two, so I figured there wasn’t a point in testing further.

If it was temperature related, I would have expected a sharp jump and not … 1C difference. Plus, I checked the 9300-8i and the idiot is at about the same temp. idle as the onboard SAS3008 is while moving data, so …

On the SAS3008:

Operating: 0°C to 55°C (ambient)
Storage: -45°C to 105°C (ambient)
5 to 90% non-condensing
Airflow: 200 LFM

The actual IC can hit much higher temps, 100C+.

Source(s): AOC-S3008L-L8e / AOC-S3008L-L8e+ | Add-on Cards | Accessories | Products - Super Micro Computer, Inc., https://docs.broadcom.com/doc/12352000

As an aside, it looks like either my Fluke and IR or FLIR camera are huffing paint. Bothers me they’re so far off from each other.

No surprise here: I received the same response. :grinning:

What is a pleasant surprise is SuperMicro’s attentiveness to date through their support channel. I was skeptical SM would give us much more than the time of day without some big company muscle like 45Drives. We will see what the RD of onboard SAS has to say.

@pxpunx did you receive the 10gtek SAS cables the other day? I have some coming today to test with at SAS-3 speeds.

I haven’t tried them. Kind of tired of tearing out the backplane, and didn’t seem necessary when faced with the onboard vs. PCIe SAS3008 results. It’s still an option.

The few times I’ve interacted with SM support they’ve been great, so I had high hopes this interaction would follow suit.

The last time I worked with them was on an m-ITX board I’d bought from a reseller that was years old; they still RMA’d it. Granted, they didn’t find a problem and it suddenly worked once returned, so … go figure.

I replaced mine today and set it back to SAS-3. We will see…

I just did the same … got bored, and re-did the power wiring so it was easier to deal with. Testing now.

I’m making the following statement exclusively to jinx this test run to make the errors re-occur…

“After replacing the cables, I see no errors on SAS3/12Gbps to the onboard SAS3008. About 600GB transferred so far.”

I’ve written a few TB so far and no errors @ 12Gbps on the new set of cables. So, to sum up:

  • Errors observed when 45D cables are used w/ onboard SAS3008 at 12Gbps.
  • No errors observed at 12Gbps when I swapped to a PCIe SAS3008.
  • No errors observed at 12Gbps when I swapped one of the 45D SAS cables out for a Supermicro SAS cable with the onboard SAS3008.
  • Back to 45D cables and onboard SAS3008 where errors are observed at 12Gbps.
  • Pinned drives to 6Gbps, no errors observed w/ 45D cables and onboard SAS3008.
  • Swapped to 10GTek SAS cables, still with onboard SAS3008, no errors observed at 12Gbps.

I’m a little tired of swapping parts and cables around at this point. :smiley:

The outlier is the 9300-8i … other than, perhaps, a difference in design that accounted for better error correction of some kind.

Let’s see what @rymandle05 comes back with.
