ZFS causing kernel panic on resume


#1

It appears there’s timing variation that puts zfs latency over some threshold (or zfs isn’t up by some event) which causes a kernel panic on resume. Any clues on how to fix this?

During normal boot I’ve noticed errors about ZFS failing to start, but something waits/retries till it’s up (which is after trueosinit).

During resume, it’s not always up in time. Here’s a resume attempt ending in kernel panic in log/messages [note that at first I thought this was SATA latency but ada0 is brought up]:

Apr 10 13:36:12  kernel: pci0: set ACPI power state D0 on \_SB_.PCI0.PE23
Apr 10 13:36:12  kernel: hdaa0: Pin sense: nid=3 sense=0x7fffffff (disconnected, ELD valid)
Apr 10 13:36:12  kernel: pci0: set ACPI power state D0 on \_SB_.PCI0.SATA
Apr 10 13:36:12  kernel: ahcich0: AHCI reset...
Apr 10 13:36:12  kernel: ahcich0: SATA connect time=100us status=00000123
Apr 10 13:36:12  kernel: ahcich0: AHCI reset: device found
Apr 10 13:36:12  kernel: ahcich1: AHCI reset...
Apr 10 13:36:12  kernel: ahcich1: SATA connect time=1000us status=00000113
Apr 10 13:36:12  kernel: ahcich1: AHCI reset: device found
Apr 10 13:36:12  kernel: ahcich1: AHCI reset: device ready after 1ms
Apr 10 13:36:12  kernel: ahcich2: AHCI reset...
Apr 10 13:36:12  kernel: ahcich2: SATA connect timeout time=10000us status=00000000
Apr 10 13:36:12  kernel: ahcich2: AHCI reset: device not found
Apr 10 13:36:12  kernel: ahcich3: AHCI reset...
Apr 10 13:36:12  kernel: ahcich3: SATA connect timeout time=10000us status=00000000
Apr 10 13:36:12  kernel: ahcich3: AHCI reset: device not found
Apr 10 13:36:12  kernel: ahcich4: AHCI reset...
Apr 10 13:36:12  kernel: ahcich4: SATA connect timeout time=10000us status=00000000
Apr 10 13:36:12  kernel: ahcich4: AHCI reset: device not found
Apr 10 13:36:12  kernel: ahcich5: AHCI reset...
Apr 10 13:36:12  kernel: ahcich5: SATA connect timeout time=10000us status=00000000
Apr 10 13:36:12  kernel: ahcich5: AHCI reset: device not found
Apr 10 13:36:12  kernel: pci0: set ACPI power state D0 on \_SB_.PCI0.UHC1
Apr 10 13:36:12  kernel: pci0: set ACPI power state D0 on \_SB_.PCI0.UHC2
Apr 10 13:36:12  kernel: pci0:0:18:2: Transition from D3 to D0
Apr 10 13:36:12  kernel: pci0: set ACPI power state D0 on \_SB_.PCI0.USB3
Apr 10 13:36:12  kernel: pci0: set ACPI power state D0 on \_SB_.PCI0.UHC4
Apr 10 13:36:12  kernel: pci0:0:19:2: Transition from D3 to D0
Apr 10 13:36:12  kernel: pci0: set ACPI power state D0 on \_SB_.PCI0.SBAZ
Apr 10 13:36:12  kernel: hdaa1: Headphones redirection for association 0 nid=33 using unsolicited responses.
Apr 10 13:36:12  kernel: pci0: set ACPI power state D0 on \_SB_.PCI0.UHC7
Apr 10 13:36:12  kernel: atkbd: the current kbd controller command byte 0047
Apr 10 13:36:12  kernel: atkbd: keyboard ID 0x41ab (2)
Apr 10 13:36:12  kernel: kbdc: RESET_KBD return code:ffffffff
Apr 10 13:36:12  last message repeated 2 times
Apr 10 13:36:12  kernel: kbdc: DIAGNOSE status:0055
Apr 10 13:36:12  kernel: kbdc: TEST_KBD_PORT status:0000
Apr 10 13:36:12  kernel: atkbd: failed to reset the keyboard.
Apr 10 13:36:12  kernel: kbdc: TEST_AUX_PORT status:0000
Apr 10 13:36:12  kernel: kbdc: RESET_AUX return code:00fa
Apr 10 13:36:12  kernel: kbdc: RESET_AUX status:00aa
Apr 10 13:36:12  kernel: kbdc: RESET_AUX ID:0000
Apr 10 13:36:12  kernel: pcib3: HotPlug interrupt: 0x148
Apr 10 13:36:12  kernel: pcib3: Presence Detect Changed to card present
Apr 10 13:36:12  kernel: pcib3: Data Link Layer State Changed to active
Apr 10 13:36:12  kernel: pcib5: HotPlug interrupt: 0x148
Apr 10 13:36:12  kernel: pcib5: Presence Detect Changed to card present
Apr 10 13:36:12  kernel: pcib5: Data Link Layer State Changed to active
Apr 10 13:36:12  kernel: ahcich0: AHCI reset: device ready after 100ms
Apr 10 13:36:12  kernel: battery0: battery initialization start
Apr 10 13:36:12  kernel: panic: I/O to pool 'tank' appears to be hung on vdev guid 1561076400391828821 at '/dev/ada0p2.eli'.
Apr 10 13:36:12  kernel: cpuid = 1
Apr 10 13:36:12  kernel: time = 1523393372
Apr 10 13:36:12  kernel: KDB: stack backtrace:
Apr 10 13:36:12  kernel: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe002429e990
Apr 10 13:36:12  kernel: vpanic() at vpanic+0x18d/frame 0xfffffe002429e9f0
Apr 10 13:36:12  kernel: panic() at panic+0x43/frame 0xfffffe002429ea50
Apr 10 13:36:12  kernel: vdev_deadman() at vdev_deadman+0x128/frame 0xfffffe002429eaa0
Apr 10 13:36:12  kernel: vdev_deadman() at vdev_deadman+0x40/frame 0xfffffe002429eaf0
Apr 10 13:36:12  kernel: spa_deadman() at spa_deadman+0x87/frame 0xfffffe002429eb20
Apr 10 13:36:12  kernel: taskqueue_run_locked() at taskqueue_run_locked+0x154/frame 0xfffffe002429eb80
Apr 10 13:36:12  kernel: taskqueue_thread_loop() at taskqueue_thread_loop+0x98/frame 0xfffffe002429ebb0
Apr 10 13:36:12  kernel: fork_exit() at fork_exit+0x83/frame 0xfffffe002429ebf0
Apr 10 13:36:12  kernel: fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe002429ebf0
Apr 10 13:36:12  kernel: --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Apr 10 13:36:12  kernel: KDB: enter: panic
Apr 10 13:36:12  kernel: Table 'FACP' at 0xafb46bd0
Apr 10 13:36:12  kernel: Table 'APIC' at 0xafb46cc8
Apr 10 13:36:12  kernel: APIC: Found table at 0xafb46cc8
Apr 10 13:36:12  kernel: APIC: Using the MADT enumerator.
Apr 10 13:36:12  kernel: Copyright (c) 1992-2018 The FreeBSD Project.
Apr 10 13:36:12  kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994

#2

During normal boot are you seeing similar AHCI/SATA messages?
What hardware, specifically what is the device that the zpool is on?
Have you setup and run smartctl stuff against the device?

To me, those messages are saying the device is not “spinning up” or becoming available on the bus and gets into a reset/retry situation.


#3

@mer:
tank is on /dev/ada0, which is ahcich0, put on scbus0 target 0 lun 0 at boot.

I already ran smartctl, which reports it healthy (though the drive is new, I checked two days ago as I started to suspect this was the issue).

but there’s no assignment to the scsi address before the panic (should there be?) and no

`kernel: (ada0:achich0:0:0:0):  resume`
`acpi: resumed [...]`

message that seems to be part of successful resumes. In all cases in my current log/messages achich0 connects in 100usec; it’s also consistently

`kernel: ahcich0: AHCI reset: device ready after 100ms`

whether or not there’s a successful resume or panic and during normal bootup.