Process hang in state "zilog->zl_writer_lock" on Unstable


#4

Ok, then the only other idea I have is enter something over in FreeBSD bug tracker. It could be something in FireFox, but I doubt that. zilog is probably the ZFS Intent Log, zl_writer_lock is kind of obvious. If you go back a week or so in svn-src-head mailing list, there was a bunch of stuff around a VM change that seemed to cause problems for some folks. I don’t recall specifics or if that is in TrueOS tree, but it’s a place to start.

Did you do anything special with your ZFS configuration, like separate devices for L2Arc, intent log, etc?


#5

No special config except: vfs.zfs.arc_max=“1024M” in loader.conf.
Just single 4TB disk.

Only thing search has turned up so far is: https://github.com/freebsd/freebsd/commit/16b20104c426b76ed302fa467895eec5fc045765

which does show change relating to zl_writer_lock

I only had 2 days of uptime so maybe I can replicate and get a bit more info for a bug report, but will likely have to wait until after Christmas.


#6

The problem happened again, after around two days of use (firefox,corebird,claws were locked processes) the total uptime was 7 days, but machine was idle, not logged in for first 5. The lockup coincided with using corebird and occurred about the time of gstreamer error (corebird cannot play video beacause of gstreamer codec bug on FreeBSD).
I use ssh to login and tried lldb to attach to the hung processes, but nothing; I suppose the lock is in system state so need to get into kernel debugger to get any more info. None of the processes respond to signals but sigquit did terminate lldb. A shutdown now did not get to single user mode, though I only waited 15 min. Hard reset to reboot.
I have done some updates, but need to find out how to get to kernel debugger; no serial port on this machine.

(this Just to document my testing)

updated kernel is now: FreeBSD 12.0-CURRENT #11 81fec14cc(trueos-master): Wed Dec 20 14:49:06 UTC 2017 root@chimera:/usr/obj/usr/src/amd64.amd64/sys/GENERIC


#7

Happened again, but unable to get to kernel debug.
I think Ctrl-Alt-Esc was detected, but there was no display - perhaps EFI framebuffer is the problem.
The machine does have a COM header, so I need to get the connecte + USB serial, to get DDB access.


#8

Happened several more times, but not able to get DDB either on console or serial port.


#9

After today’s update ae187cb00 2018-01-26 videos now play in corebird so I was able to get locup much faster and it is now renamed to: zilog->zl_issuer_lock Also dconf-service is 100% and also unkillable.


#10

“fantastic” only because that matches what I ran into.

Here’s what I did that appeared to help:
cd /usr/local/bin
ls -ltr | grep dconf

find all the dconf executables and rename them, do the same for dbus-launch. You may lose a little functionality in some things (not sure what exactly), but I have not experienced any problems running other things.

After that, see if you still get the dconf at 100% and a lockup. The lockup could be due to dconf spinning, or dconf spinning could be due to the lock. Based on what I’ve seen, I’m guessing the lock is due to dconf spinning. ZFS has a few processes that run periodically, it’s possible that the process running in the lock is getting starved.


#11

After I saw Unstable update 2018-01-30 problems by @mer

I did the following tests:

  • wait for second machine to finish update (success)
  • install corebird and start it
  • click on “kitten” animated gif a few times (video runs now ffmpeg is updated)
  • system locked and power cycle
  • mv dconf.service dconf.service-DISABLED
  • run corebird and test video playback - no lock so far

#12

Interesting and thanks for the update. So it appears that something in dconf is a trigger .
Edit:
If you have a chance could you see what happens if:
mv dconf.service-DISABLED back
in a term window, start top
then start corebird and click on kitten

What I’m looking for is in top, see if dconf goes to 100%. If it does then it’s similar to what I saw and if the keyboard responds, trying entering the command
truss -p PID
where PID is the PID of the process spinning at 100%. That should tell us what it’s doing as far as syscalls, may give a clue as to why it spins.


#13

No output from truss; kill truss just gives unexpected stop in waitpid.

Also tried lldb -p but does not get to a prompt, no response to Ctrl-C.

according to fstat it only has one file in my home dir in .config and attempting to access that file
seem to permanently hang that process (like ls or stat) leaviing those processes in “zfs” state.
(only corebird is in the locked state). the dconf-service is using a lot of CPU often > 100% (has three threads)

Attempt to enter KDB just total lockup and need to power off (not sure if KDB can use EFI FB)


#14

Thanks for the info.
So there’s a file in $HOME/.config that any access causes a problem? Wow. Do you have a problem trying to access it as root? Basically:
sudo ls -ltr ~username/.config (substitute your username for “username”)

If that works, maybe as root chown and chmod the file or simply delete the file.

A possible test:
mv $HOME/.config $HOME/.config.old
mkdir .config
chown -R user:user .config (put your username in for user)

That would give you a new .config directory with nothing in it so you’d lose all other config, but if nothing hangs, then you could selectively move things from the .config.old to the new .config and then see what happens. I’d do the chown -R after every directory I moved.

Edit:
another way of testing this would be to simply create a new user with a new directory, log in as that user and do the same corebird test.


#15

I think, ‘mer’ might be totally on the right track.

You always have to remember one very important fact:

If You do “sudo …”, You keep You previous $HOME by default (this can be reconfigured).

If You do “su -l”, You get the ‘root’-user’s $HOME.

Conclusion: Don’t use wrong applications while using “sudo …”. It will clobber Your $HOME/.config/… .


#16

Some more testing using truss to monitor dconf-session, before lockup.

dconf is creating a file: .config/dconf/user.RANDOM then later renames it to .config/dconf/user`.
At the point of the lock-up it look like the next call would be a call to fsync as the log ends with:

openat(AT_FDCWD,".config/dconf/user.1L3YDZ",O_RDWR|O_CREAT|O_EXCL,0666) = 10 (0xa)              
write(10,"GVariant\0\0\0\0\0\0\0\0\^X\0\0"...,400) = 400 (0x190)
fstatat(AT_FDCWD,".config/dconf/user",{ mode=-rw-r--r-- ,inode=268997,size=400,blksize=4096 },AT_SYMLINK_NOFOLLOW) = 0 (0x0)              

One other thing is that ls,fstat or any process trying to access the user.1L3YD file hangs in zfs state, even for root user. However the sync command hangs in zfs_issue_lock; as shown by Ctrl-T.

For reference some typical messages: (shows fsync call)

openat(AT_FDCWD,".config/dconf/user.ZN8YDZ",O_RDWR|O_CREAT|O_EXCL,0666) = 10 (0xa)              
sendmsg(5,{NULL,0,[{"l\^D\^A\^AG\0\0\0\M-4\v\0\0f\0\0"...,191}],1,{},0,0},MSG_NOSIGNAL) = 191 (0xbf)
poll({ 8/POLLIN },1,0)                           = 0 (0x0)
write(10,"GVariant\0\0\0\0\0\0\0\0\^X\0\0"...,400) = 400 (0x190)
fstatat(AT_FDCWD,".config/dconf/user",{ mode=-rw-r--r-- ,inode=268993,size=400,blksize=4096 },AT_SYMLINK_NOFOLLOW) = 0 (0x0)              
fsync(0xa)                                       = 0 (0x0)
close(10)                                        = 0 (0x0)
write(4,"\^A",1)                                 = 1 (0x1)
rename(".config/dconf/user.ZN8YDZ",".config/dconf/user") = 0 (0x0)                            
openat(AT_FDCWD,".cache/dconf/user",O_RDWR,00) = 10 (0xa)               
poll({ 5/POLLIN },1,0)                           = 0 (0x0)
pwrite(10,"\0",1,0x1)                            = 1 (0x1)
mmap(0x0,1,PROT_READ|PROT_WRITE,MAP_SHARED,10,0x0) = 34399133696 (0x802592000)
munmap(0x802592000,1)                            = 0 (0x0)
close(10)                                        = 0 (0x0)

#17

good sleuthing. Not exactly sure what to make of it yet, but that’s good info.
edit:

I looked on my system there is a $HOME/.config/dconf/user, the command “file user” says “GVariant Database file, version 0”.


#18

À propos locking: Maybe, this is related, maybe it isn’t.

On TrueOS STABLE, I just now had my very own total freezing event. I had to power off the machine.

Symptoms: After moving mouse on graphical desktop:
-Colormap changes: Psychedelic colors
-Picture and Pointer stay visible.
-No keyboard input, no mouse input
-No answer to ssh from other host
-No answer to ping from other host.

And then, there’s this:

% cat /var/log/Xorg.0.log.old | grep -A 1000 ‘EE’
[302061.180] nvLock: client timed out, taking the lock
(EE) [mi] EQ overflowing. Additional events will be discarded until existing events are processed.
(EE)
(EE) Backtrace:
(EE) 0: /usr/local/bin/X (?+0x0) [0x586310]
(EE) 1: /usr/local/bin/X (?+0x0) [0x44dd40]
(EE) 2: /usr/local/bin/X (?+0x0) [0x488f90]
(EE) 3: /usr/local/lib/xorg/modules/input/mouse_drv.so (?+0x0) [0x839b2bf80]
(EE) 4: /usr/local/lib/xorg/modules/input/mouse_drv.so (?+0x0) [0x839b29420]
(EE) 5: /usr/local/lib/xorg/modules/input/mouse_drv.so (?+0x0) [0x839b28630]
(EE) 6: /usr/local/bin/X (?+0x0) [0x47abb0]
(EE) 7: /usr/local/bin/X (?+0x0) [0x49e540]
(EE) 8: /lib/libthr.so.3 (_pthread_sigmask+0x536) [0x802399da6]
(EE) 9: /lib/libthr.so.3 (_pthread_getspecific+0xe2f) [0x802399bdf]
(EE) 10: ? (?+0xe2f) [0x7fffffffffc2]
(EE) 11: /lib/libc.so.7 (__sys_sigprocmask+0xa) [0x8026c91a4]
(EE) 12: /lib/libthr.so.3 (_pthread_sigmask+0x51c) [0x802399d8c]
(EE) 13: /lib/libthr.so.3 (_pthread_getspecific+0xe2f) [0x802399bdf]
(EE) 14: ? (?+0xe2f) [0x7fffffffffc2]
(EE) 15: /lib/libc.so.7 (__sys_nanosleep+0xa) [0x8026c9284]
(EE) 16: /lib/libthr.so.3 (_pthread_suspend_all_np+0x128c) [0x8023978ac]
(EE) 17: /lib/libc.so.7 (__usleep+0x46) [0x802720186]
(EE) 18: /usr/local/lib/xorg/modules/drivers/nvidia_drv.so (?+0x46) [0x80843819f]
(EE) 19: /usr/local/lib/xorg/modules/drivers/nvidia_drv.so (?+0x46) [0x808438388]
(EE) 20: /usr/local/lib/xorg/modules/drivers/nvidia_drv.so (?+0x46) [0x8084383ee]
(EE)
(EE) [mi] These backtraces from mieqEnqueue may point to a culprit higher up the stack.
(EE) [mi] mieq is NOT the cause. It is a victim.
(EE) [mi] EQ overflow continuing. 100 events have been dropped.
(EE)
(EE) Backtrace:
(EE) 0: /usr/local/bin/X (?+0x0) [0x44dd40]
(EE) 1: /usr/local/bin/X (?+0x0) [0x488f90]
(EE) 2: /usr/local/lib/xorg/modules/input/mouse_drv.so (?+0x0) [0x839b2bf80]
(EE) 3: /usr/local/lib/xorg/modules/input/mouse_drv.so (?+0x0) [0x839b29420]
(EE) 4: /usr/local/lib/xorg/modules/input/mouse_drv.so (?+0x0) [0x839b28630]
(EE) 5: /usr/local/bin/X (?+0x0) [0x47abb0]
(EE) 6: /usr/local/bin/X (?+0x0) [0x49e540]
(EE) 7: /lib/libthr.so.3 (_pthread_sigmask+0x536) [0x802399da6]
(EE) 8: /lib/libthr.so.3 (_pthread_getspecific+0xe2f) [0x802399bdf]
(EE) 9: ? (?+0xe2f) [0x7fffffffffc2]
(EE) 10: /lib/libc.so.7 (__sys_sigprocmask+0xa) [0x8026c91a4]
(EE) 11: /lib/libthr.so.3 (_pthread_sigmask+0x51c) [0x802399d8c]
(EE) 12: /lib/libthr.so.3 (_pthread_getspecific+0xe2f) [0x802399bdf]
(EE) 13: ? (?+0xe2f) [0x7fffffffffc2]
(EE) 14: /lib/libc.so.7 (__sys_nanosleep+0xa) [0x8026c9284]
(EE) 15: /lib/libthr.so.3 (_pthread_suspend_all_np+0x128c) [0x8023978ac]
(EE) 16: /lib/libc.so.7 (__usleep+0x46) [0x802720186]
(EE) 17: /usr/local/lib/xorg/modules/drivers/nvidia_drv.so (?+0x46) [0x80843819f]
(EE) 18: /usr/local/lib/xorg/modules/drivers/nvidia_drv.so (?+0x46) [0x808438388]
(EE) 19: /usr/local/lib/xorg/modules/drivers/nvidia_drv.so (?+0x46) [0x8084383ee]
(EE)
(EE) [mi] EQ overflow continuing. 200 events have been dropped.
(EE)
(EE) Backtrace:
(EE) 0: /usr/local/bin/X (?+0x0) [0x44dd40]
(EE) 1: /usr/local/bin/X (?+0x0) [0x488f90]
(EE) 2: /usr/local/lib/xorg/modules/input/mouse_drv.so (?+0x0) [0x839b2bf80]
(EE) 3: /usr/local/lib/xorg/modules/input/mouse_drv.so (?+0x0) [0x839b29420]
(EE) 4: /usr/local/lib/xorg/modules/input/mouse_drv.so (?+0x0) [0x839b28630]
(EE) 5: /usr/local/bin/X (?+0x0) [0x47abb0]
(EE) 6: /usr/local/bin/X (?+0x0) [0x49e540]
(EE) 7: /lib/libthr.so.3 (_pthread_sigmask+0x536) [0x802399da6]
(EE) 8: /lib/libthr.so.3 (_pthread_getspecific+0xe2f) [0x802399bdf]
(EE) 9: ? (?+0xe2f) [0x7fffffffffc2]
(EE) 10: /lib/libc.so.7 (__sys_sigprocmask+0xa) [0x8026c91a4]
(EE) 11: /lib/libthr.so.3 (_pthread_sigmask+0x51c) [0x802399d8c]
(EE) 12: /lib/libthr.so.3 (_pthread_getspecific+0xe2f) [0x802399bdf]
(EE) 13: ? (?+0xe2f) [0x7fffffffffc2]
(EE) 14: /lib/libc.so.7 (__sys_nanosleep+0xa) [0x8026c9284]
(EE) 15: /lib/libthr.so.3 (_pthread_suspend_all_np+0x128c) [0x8023978ac]
(EE) 16: /lib/libc.so.7 (__usleep+0x46) [0x802720186]
(EE) 17: /usr/local/lib/xorg/modules/drivers/nvidia_drv.so (?+0x46) [0x80843819f]
(EE) 18: /usr/local/lib/xorg/modules/drivers/nvidia_drv.so (?+0x46) [0x808438388]
(EE) 19: /usr/local/lib/xorg/modules/drivers/nvidia_drv.so (?+0x46) [0x8084383ee]
(EE)
(EE) [mi] EQ overflow continuing. 300 events have been dropped.
(EE)
(EE) Backtrace:
(EE) 0: /usr/local/bin/X (?+0x0) [0x44dd40]
(EE) 1: /usr/local/bin/X (?+0x0) [0x488f90]
(EE) 2: /usr/local/lib/xorg/modules/input/mouse_drv.so (?+0x0) [0x839b2bf80]
(EE) 3: /usr/local/lib/xorg/modules/input/mouse_drv.so (?+0x0) [0x839b29420]
(EE) 4: /usr/local/lib/xorg/modules/input/mouse_drv.so (?+0x0) [0x839b28630]
(EE) 5: /usr/local/bin/X (?+0x0) [0x47abb0]
(EE) 6: /usr/local/bin/X (?+0x0) [0x49e540]
(EE) 7: /lib/libthr.so.3 (_pthread_sigmask+0x536) [0x802399da6]
(EE) 8: /lib/libthr.so.3 (_pthread_getspecific+0xe2f) [0x802399bdf]
(EE) 9: ? (?+0xe2f) [0x7fffffffffc2]
(EE) 10: /lib/libc.so.7 (__sys_sigprocmask+0xa) [0x8026c91a4]
(EE) 11: /lib/libthr.so.3 (_pthread_sigmask+0x51c) [0x802399d8c]
(EE) 12: /lib/libthr.so.3 (_pthread_getspecific+0xe2f) [0x802399bdf]
(EE) 13: ? (?+0xe2f) [0x7fffffffffc2]
(EE) 14: /lib/libc.so.7 (__sys_nanosleep+0xa) [0x8026c9284]
(EE) 15: /lib/libthr.so.3 (_pthread_suspend_all_np+0x128c) [0x8023978ac]
(EE) 16: /lib/libc.so.7 (__usleep+0x46) [0x802720186]
(EE) 17: /usr/local/lib/xorg/modules/drivers/nvidia_drv.so (?+0x46) [0x80843819f]
(EE) 18: /usr/local/lib/xorg/modules/drivers/nvidia_drv.so (?+0x46) [0x808438388]
(EE) 19: /usr/local/lib/xorg/modules/drivers/nvidia_drv.so (?+0x46) [0x8084383ee]
(EE)
[514815.694] nvLock: client timed out, taking the lock
[514815.696] [mi] Increasing EQ size to 1024 to prevent dropped events.
[514815.698] [mi] EQ processing has resumed after 333 dropped events.
[514815.698] [mi] This may be caused by a misbehaving driver monopolizing the server’s resources.

Edit: For completeness:

This is /var/log/messages:

Feb 3 18:05:18 trueos kernel: pid 6683 (Xorg), uid 0: exited on signal 10
Feb 3 18:05:18 trueos kernel: vgapci0: child nvidia0 requested pci_enable_io
Feb 3 18:05:18 trueos kernel: NVRM: GPU at 0000:02:00.0 has fallen off the bus.
Feb 3 18:05:18 trueos kernel: vgapci0: child nvidia0 requested pci_enable_io
Feb 3 18:05:18 trueos last message repeated 343 times
Feb 3 18:05:18 trueos kernel: pid 6729 (fluxbox), uid 1001: exited on signal 10
Feb 3 18:05:18 trueos kernel: vgapci0: child nvidia0 requested pci_enable_io


#19

I wrote a short test to write a byte the fsync and loop. it fails, if run on the same zfs dataset:

% ~/tl
W
load: 3.19  cmd: tl 35723 [zilog->zl_issuer_lock] 5.13r 0.00u 0.00s 1% 1868k

It will run on a different dataset (e.g. /tmp) during the failed state, but since HOME is a single dataset any program that does fsync will lockup. Running this test a few times soon deadlocked the whole system and not even terminal would respond.

So it seems as though dconf-service is getting stuck in its fsync and then any program trying to fsync ends up blocked. I think I have to find a way to get to KDB - on all 3 machines Ctrl-Alt-Esc simply hangs, no idea if KDB actually starts. The machine with a serial port shows boot messages, but hangs on a Break signal no KDB :frowning:


#20

There is a dataset property called sync that controls the use of the ZFS Intent Log, similar to mounting UFS filesystems async. The default value of sync is “standard” which means for synchronous requests, the ZIL is used (ZIL is on device storage that synchronous writes are pushed to, the kernel can ack back to the program, then the writes go from the ZIL to the dataset later). The value “always” means make all writes synchronous, “disabled” means pretend synchronous writes are done even if they’re not.

zfs get sync datasetname

to see what it’s set to (probably standard)

zfs set sync=disabled datasetname

to disable it (which may be ok on $HOME, but not in general)

The ZIL is on a device, you can make a dataset use a separate device (never done this, going by info in Michael W Lucas books) which may help. So fsync is likely the trigger, this is likely an upstream issue, if your test prog is simple (sounds like it is), you may want to see if the ixSystems folks (Ken, Kris, etc) can work with you to ask upstream. Or simply file a bug upstream.


#21

Sorry for long delay, not had much time to work on this, so just to summarise the tests I managed to carry out so far:

  1. corebird on FreeBSD 11.1 usning ssh X forwarding - no problem
  2. add xorg,lumina,… to the FreeBSD 11.1 install - no problems
  3. boot FreeBSD 12 snapshot kldload video and chroot into FreeBSD11.1 - no problems
  4. install FreeBSD 12 snapshot lumina,xork,corebird,dconf - lock after 40 min

The first two tests ran a few days and the chroot test has 7 days uptime.
It is interesting that the chroot has not failed.

The FreeBSD 12 snapshot install is able to get to kernel debugger, Just before I tried to umount the memory stick that contained the initial dot files to set up my home directory, it was MSDOSFS mounted RO and the umount command locked in zfs.
The backtrace of umount and corebird look like they are both stuck in some kind of lock:

_sx_xlock_hard
_sx_xlock
zil_commit_impl

dcon-service is in:

dnode_hold_impl
_xs_xlock
witness_lock

I will try to get some photos as no serial port on that machine :frowning:. I will likely just post those logs to FreeBSD bug tracker and post a link here.


#22

Ahh witness locks. That definitely sounds like upstream issue. Witness locks are debugging tools trying to help determine any lock order issues; in “release” they should be turned off but since this is 12-CURRENT, it’s turned on (maybe adding performance hit)
I glanced at some dconf code/documents a while back and I think they are basically optimized for read performance; I’m pretty sure they mmap the file. That is a perfectly legal operation, but starts to involve more interactions between the VM processing and FS (ZFS) than normal.


#23

Maybe this is fixed now by: FreeBSD-EN-18:18.zfs
this fix notes: "Processes may hang on the waitchan “zilog->zl_writer_lock”