amarsh04 | git bisecting a problem with hardware MIDI playback that started with kernel 6.5-rc1 kernels | 02:59 |
---|---|---|
systemdlete | I have a daedalus VM with a frozen desktop. I can ssh into the VM from the host, and I can run htop, etc. The desktop in the VM simply doesn't take input. The mouse, however, does move normally. What should I look at first? | 22:21 |
systemdlete | I have another daedalus VM on a different host that does not freeze. | 22:22 |
systemdlete | Both hosts have 32GB ram, but one is a FX8350 and the other has an Athlon II X6. | 22:23 |
systemdlete | The host does not seem to have any noticeable problems with its own desktop. The VM's desktop will run for many hours, then freeze. | 22:24 |
systemdlete | While ssh'd into the daedalus VM, df shows me there is plenty of disk space. And htop does not seem to be screaming any pain. | 22:25 |
gnarface | the frozen xorg instance isn't taking up any cpu? | 22:26 |
systemdlete | gnarface, htop shows xorg using 2.0 MB and 0 cpu | 22:28 |
gnarface | systemdlete: what about its children? | 22:28 |
gnarface | not sure but i suspect something running inside that xorg instance is hung and needs to be killed, then the desktop will unfreeze | 22:29 |
systemdlete | one child, using same cpu and mem, child marked D | 22:29 |
gnarface | hmmm | 22:30 |
systemdlete | maybe restart xorg? | 22:30 |
gnarface | yea that doesn't seem like enough children | 22:30 |
systemdlete | or will that cause widespread mayhem | 22:30 |
gnarface | usually you should have a window manager and some other stuff... | 22:30 |
gnarface | mine has one child that's just another Xorg instance | 22:31 |
systemdlete | xfwm4 is running | 22:31 |
gnarface | but nothing inside it? | 22:32 |
gnarface | no actual programs just an empty desktop? | 22:32 |
systemdlete | looks like xfwm4 has 11 children | 22:33 |
systemdlete | I have several shell windows open, but only 1 is max'd the others are minimized | 22:34 |
gnarface | it's gotta be one of them, not necessarily a maxxed one | 22:34 |
systemdlete | btw, I did that specifically to run a test on this problem. I wanted to eliminate things like browsers etc that can mess things up | 22:34 |
systemdlete | so maybe try killing one at a time? | 22:34 |
gnarface | yea | 22:35 |
systemdlete | I will start with the highest numbered pid and work backwards | 22:36 |
systemdlete | whoa. | 22:37 |
systemdlete | SIGTERM didn't work, so I tried SIGHUP. That didn't work either. | 22:37 |
systemdlete | So I sent that pid SIGKILL, but it killed off all of them apparently. | 22:38 |
systemdlete | VM desktop is still frozen though | 22:38 |
gnarface | brutal | 22:39 |
systemdlete | I wonder, gnarface, if maybe I should try killing off the windows themselves | 22:39 |
gnarface | maybe yea | 22:39 |
systemdlete | stand by have to install wmctrl | 22:40 |
gnarface | you killed the processes, the processes are all gone, and the windows from them are still there? | 22:40 |
gnarface | that's not right... | 22:41 |
gnarface | that suggests xfce itself froze | 22:41 |
gnarface | or went out to lunch somehow | 22:41 |
systemdlete | well, the desktop still has not refreshed | 22:41 |
systemdlete | so check panera bread? | 22:41 |
systemdlete | :D | 22:41 |
systemdlete | (sorry, getting punchy here. Been scratching my head for days over this.) | 22:42 |
gnarface | try perf top and radeontop? | 22:42 |
systemdlete | not familiar with those, but I can install them | 22:42 |
gnarface | it's amd gpu too right? | 22:43 |
gnarface | or is it nvidia? | 22:43 |
gnarface | radeontop won't help for nvidia | 22:43 |
systemdlete | hold on... | 22:43 |
gnarface | but perf might still | 22:43 |
systemdlete | M5A78L/USB3 board iirc | 22:44 |
systemdlete | and no external graphics card, so its all amd | 22:44 |
systemdlete | (yes) | 22:44 |
* systemdlete | looking through system logs inside VM for clues... | 22:46 |
gnarface | check what size the xorg log is | 22:46 |
gnarface | see if it's got a lot of repeating warnings | 22:46 |
gnarface | (or errors) | 22:47 |
systemdlete | uh-oh... | 22:47 |
systemdlete | .xsession-errors is dated Mar 3 and xorg log Mar 4. I've had to do the nasty to reboot the VM, so probably those files were not updated | 22:47 |
systemdlete | which makes me wonder | 22:48 |
* systemdlete | checks to see if file systems are mounted ro instead of rw | 22:48 |
systemdlete | nope. FS are all mounted rw. | 22:48 |
systemdlete | (was just a thought) | 22:48 |
gnarface | you're looking in ~/.local/share/xorg/ ? | 22:48 |
systemdlete | oh, sorry no. Was looking at /var/log/Xorg... | 22:49 |
systemdlete | ah, now that one is dated Mar 14 | 22:49 |
systemdlete | uptime is 8 days | 22:50 |
gnarface | being not nvidia, i would assume it would have migrated to running as the user instead of suid root, which would have moved the log to ~/.local/share/xorg/ | 22:50 |
gnarface | this is a new change | 22:50 |
systemdlete | ? | 22:50 |
gnarface | pretty much everything except nvidia runs xorg as the user now | 22:50 |
gnarface | so the logs go in the home dir instead of /var/log/ | 22:51 |
systemdlete | I launch xfce from command line after log in. | 22:51 |
gnarface | even if you run startx, nvidia drivers are still wired to start as suid root, afaik | 22:51 |
systemdlete | Was having a lot of problems with WMs | 22:51 |
systemdlete | well, apparently that is not the problem here, as you said | 22:51 |
gnarface | yes | 22:52 |
gnarface | is that Xorg log very large? | 22:52 |
systemdlete | All I meant is that xorg is running as user | 22:52 |
systemdlete | the one in ~/.local/share/xorg is 27132 bytes | 22:52 |
gnarface | nah that's not a problem then | 22:52 |
systemdlete | last 2 messages are (EE) No surface to present from. | 22:53 |
gnarface | oh, that's a problem | 22:53 |
gnarface | are there more errors before that? | 22:53 |
systemdlete | lots of messages, one other error earlier: "(EE) open /dev/fb0: Permission denied" | 22:54 |
systemdlete | but that was at 52 secs or so, and the desktop had been working fine for a few days | 22:55 |
gnarface | a couple errors might be normal while auto-detect brute-force fails its way through every driver until it finds one that works | 22:55 |
gnarface | the last one where it says your surface has gone missing though, that seems like a smoking gun | 22:55 |
gnarface | this makes it seem more like a driver issue | 22:56 |
systemdlete | Those are at 60 secs in | 22:56 |
systemdlete | oooh | 22:56 |
systemdlete | maybe I need to install some specific FW for this board? | 22:57 |
systemdlete | (main board) | 22:57 |
gnarface | possible, or maybe just make it use a different xorg driver | 22:57 |
systemdlete | hmmm. I never installed firmware-amd-graphics | 22:59 |
systemdlete | do I need that in a VM? | 22:59 |
gnarface | no, i doubt that | 22:59 |
gnarface | but you might need some vm drivers | 23:00 |
systemdlete | The VM drivers are installed | 23:00 |
gnarface | virtio or something like that | 23:00 |
rustyaxe | ya fbdev is quite old, do any modern drivers use that? | 23:00 |
gnarface | arm hardware still maybe | 23:00 |
rustyaxe | I think thats just xorg probing for the display and trying an old device (fbdev still exists after all but you probably dont want to use it instead of a more optimized driver) | 23:00 |
systemdlete | rustyaxe, gnarface: Keep in mind this is 15+ year old tech. AM3 platforms | 23:01 |
systemdlete | using a built-in video fw | 23:01 |
rustyaxe | fbdev still predates that | 23:01 |
rustyaxe | you're passing the video through to the vm? or using emulated video? | 23:02 |
rustyaxe | That'll decide which driver the guest needs | 23:02 |
systemdlete | well, I'm wondering if parts of the system are starting to drop support for "older" hardware, esp. if the drivers from them might not be quite up to the standard for u-know-what | 23:02 |
systemdlete | virtualbox, using kvm virtualization | 23:03 |
rustyaxe | we still have phenom ii machines running fine | 23:03 |
rustyaxe | so no the guest shouldnt need the amd graphics stuff as it wont be talking to it, but rather the emulated video card | 23:03 |
systemdlete | btw, there is just the host and two VMs, and one of the VMs is small (under 1MB) | 23:04 |
systemdlete | cool. | 23:04 |
gnarface | if it's like qemu, you'll want to make sure you're loading the virtual driver modules inside the guest | 23:04 |
gnarface | i forget if you'll need to set xorg.conf too | 23:04 |
systemdlete | so maybe gnarface's suggestion to switch to a different xorg driver? | 23:04 |
rustyaxe | yea you can likely select which video card is emulated which will change which driver in the guest you need | 23:04 |
rustyaxe | I dont use virtualbox, rather proxmox and virt-manager where needful, but im sure its similar to them | 23:05 |
systemdlete | I have the VM set to use VMSVGA, which is the one recommended for most VMs | 23:05 |
systemdlete | video memory is 128K | 23:06 |
systemdlete | ooops | 23:06 |
systemdlete | 128M | 23:06 |
systemdlete | and the VM has 8GB RAM | 23:06 |
gnarface | does it have its own system clock or do they all use the host system clock? | 23:07 |
systemdlete | I have all of my VMs and hosts using one NTP server, which in turn uses an upstream NTP server | 23:08 |
gnarface | grasping at straws here, but maybe emulated clock drift could be destabilizing it? | 23:08 |
systemdlete | good point | 23:08 |
systemdlete | let me see if it is off | 23:08 |
systemdlete | no, not by more than a second or so | 23:08 |
systemdlete | but good idea to check that | 23:08 |
systemdlete | time sync can be a hazard, esp for network communications | 23:09 |
gnarface | another daedalus change was the forced migration to ntpsec from ntp, and in the merge of the new example ntp.conf, you might have, like me, accidentally inherited a "...minsec 3" line which, if you're using just one ntp server, will cause it to ignore that server | 23:09 |
gnarface | and then it will drift if it doesn't have a real clock | 23:09 |
systemdlete | Vbox does provide a clock, but as far as I know, I don't use that (except for sync'ing up at VM boot). | 23:10 |
gnarface | ah | 23:10 |
systemdlete | right | 23:10 |
gnarface | with qemu i tell it to use the host's clock because when left to its own devices it screws up | 23:10 |
systemdlete | do you mean minsane? I have that set to 1 | 23:10 |
gnarface | yea, meant minsane, sorry | 23:11 |
systemdlete | np. I knew what you meant. | 23:11 |
gnarface | yes, it should be 1 | 23:11 |
systemdlete | of course, I'm no longer sure just how much any of this means now that browsers and maybe other programs are using NTP over HTTP or something | 23:11 |
systemdlete | at any rate, I am not noticing any huge amount of drift, at least not in this case | 23:13 |
systemdlete | although, maybe at the very moment that the freeze begins, there might be a lag. The only problem with that theory, is that I have other VMs that do not have desktops freezing intermittently | 23:14 |
systemdlete | I have developed an extensive checklist of gotchas for new VMs and hosts, exactly for this reason. Updating the minsane value is just one of dozens | 23:16 |
gnarface | do you have any shared mounts with the VMs? | 23:17 |
systemdlete | yes! | 23:17 |
gnarface | i wonder if it could be file contention in a shared mount | 23:17 |
systemdlete | I have a LAN server and just about every host and VM is normally mounted to it. | 23:17 |
gnarface | cache directory or something maybe...? | 23:17 |
systemdlete | I use that as a sort of "clipboard" for passing files and data back and forth | 23:18 |
rustyaxe | A hazard? | 23:18 |
rustyaxe | You mean a must | 23:18 |
systemdlete | I'm not using it to boot from, nor for any ongoing file operations. Just to transfer files around. | 23:18 |
rustyaxe | Many network protocols wont work without good time sync | 23:19 |
systemdlete | rustyaxe, yes. I have noticed that! | 23:19 |
rustyaxe | Generally just throw chrony in them and it'll do the right thing | 23:19 |
systemdlete | N.B.: elogind-daemon is in "D" state and does not seem to ever change. | 23:23 |
gnarface | do you actually need elogind if you're starting Xorg with startx? | 23:24 |
systemdlete | no, probably not. I think it is an artifact of when I used to login using a WM | 23:25 |
systemdlete | ok, I have disabled elogind | 23:26 |
systemdlete | I could kill off the remaining processes, one by one, hoping to narrow it down a bit. | 23:28 |
systemdlete | But if I kill them in the wrong order, it could kill dependent processes off also, so I won't get an accurate fix. | 23:28 |
systemdlete | now this is whacked. "service elogind stop" but elogind-daemon is still running | 23:28 |
gnarface | suspicious | 23:28 |
systemdlete | it won't die | 23:29 |
systemdlete | kill -KILL 1885 (pid of elogind daemon) doesn't do anything | 23:29 |
gnarface | give it a few | 23:29 |
systemdlete | its parent is PID 1 | 23:29 |
gnarface | maybe Xorg is gonna have to be killed | 23:29 |
gnarface | i would still try to kill the running programs under xfce first though | 23:30 |
systemdlete | there is still a xfce4-terminal running | 23:31 |
systemdlete | I thought I'd killed that with the shells | 23:31 |
systemdlete | stopped dbus, but dbus-launch still running--is that right? | 23:35 |
systemdlete | desktop is still frozen at this point | 23:35 |
systemdlete | even though I've killed off xfce4-* processes | 23:35 |
systemdlete | (I am shelled into the VM as root, incidentally) | 23:36 |
systemdlete | killed off at-spi-bus-launcher and now dbus-daemon is gone | 23:37 |
systemdlete | and desktop still frozen | 23:37 |
gnarface | is the window manager process still there? | 23:39 |
systemdlete | ps -ef |grep wm shows nothing | 23:40 |
systemdlete | ok, killed off all the vbox client processes, still frozen | 23:40 |
systemdlete | xinit, Xorg, xfsettingsd, and xfdesktop still running | 23:43 |
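
Note on the "D" state seen above: in htop and ps, "D" marks uninterruptible sleep, usually a process blocked inside the kernel waiting on I/O. Signals, including SIGKILL, are not acted on until that wait returns, which is consistent with `kill -KILL 1885` appearing to do nothing. A minimal sketch for listing every such process over ssh, assuming a Linux guest with /proc mounted; the function names `parse_stat` and `uninterruptible_processes` are made up for illustration and are not part of any tool mentioned in the log:

```python
import os


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state, ppid)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST closing parenthesis.
    head, _, tail = text.rpartition(")")
    pid_str, _, comm = head.partition("(")
    fields = tail.split()
    # After comm, the first field is the state letter and the second is the parent PID.
    return int(pid_str), comm, fields[0], int(fields[1])


def uninterruptible_processes(proc_root="/proc"):
    """Return (pid, comm, ppid) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state, ppid = parse_stat(fh.read())
        except OSError:
            continue  # process exited in the meantime, or is not readable
        if state == "D":
            found.append((pid, comm, ppid))
    return found


if __name__ == "__main__":
    for pid, comm, ppid in uninterruptible_processes():
        print(f"{pid}\t{comm}\tparent={ppid}")
```

This only reports the stuck processes; it cannot release them. Running it a few times some seconds apart may help separate a process that is briefly waiting on disk from one that never leaves D.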