Since linux-6.9.1.arch1-1, I've been experiencing system freezes that require a hard reset. These usually occur right after resuming from S3 (suspend), and quite consistently too - 1-2 S3->resume cycles are all that is needed to cause the freeze. More rarely, they occur seemingly at random. In either case the logs tend to follow the same pattern seen in the attached portions of my systemd journal (I had to use magic SysRq to hard reset without dropping logs).
Many Arch users have reported their systems freezing on suspend - my searching yielded 1, 2, 3 - but these tend to differ in the contents of the logs and in the system details (I'm not using AMD or Nvidia drivers).
I have confirmed, via a build of linux-git (v6.10-rc3), that the issue is not caused by Arch's patchset.
I have begun a git bisection, but it'll probably take a while. I'm mostly posting here to confirm that this should be reported on https://bugzilla.kernel.org (or should I send it to one of the mailing lists? Guidance would be much appreciated...). I'd also like to confirm the bisection range - should I bisect the mainline kernel tree with v6.8 as 'good' and v6.9 as 'bad'?
Thanks for reporting this issue! This very much looks like a kernel regression, which should be bisected and reported upstream to the kernel developers and the regressions mailing list.
Are you comfortable doing the bisection on your own, or do you need some help?
If you want, we could also provide prebuilt kernel images for you to test.
Generally, kernel developers check Bugzilla less often than the mailing lists, so once you're done with the bisection we can also look at where to send the bug report.
Thanks for the offer, that's quite kind of you. But do you already have prebuilt images for all the commits on mainline I'd have to test? If you have to compile them on your end depending on my test results, the delays due to communication might make it slower than me doing the compilation myself... The compilation takes me ~50min each time.
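For a rough sense of the compile time involved, here is a minimal Python sketch. It assumes that bisection roughly halves the candidate range at every step and uses the ~50 minutes per build mentioned above. The commit count is a hypothetical placeholder, not a measured figure; the real number for a range can be checked with `git rev-list --count v6.8..v6.9`.

```python
def bisect_steps(n_commits: int) -> int:
    """Approximate number of builds needed to bisect n_commits candidates.

    Assumes each step halves the remaining range, i.e. ceil(log2(n)).
    """
    if n_commits < 1:
        raise ValueError("need at least one candidate commit")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without float error.
    return (n_commits - 1).bit_length()


def compile_hours(n_commits: int, minutes_per_build: float = 50) -> float:
    """Total compile time in hours, ignoring the time spent testing each build."""
    return bisect_steps(n_commits) * minutes_per_build / 60


# Hypothetical range size (an assumption for illustration only).
ASSUMED_COMMITS = 15000

print(bisect_steps(ASSUMED_COMMITS), "builds")
print(round(compile_hours(ASSUMED_COMMITS), 1), "hours of compile time")
```

git's own step estimate can differ slightly on a history with many merges, and the time spent trying to reproduce the freeze on each build comes on top of this.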
The 'tainted' messages start showing up around the time of the freeze. The logs I've attached contain their earliest occurrences in that boot. I guess I should have mentioned that...
While doing the bisection, I would occasionally notice that some kernel modules related to virtualbox were failing to be installed (this error) into the built kernels. I uninstalled virtualbox and now the main issue (of freeze on resume) seems to have gone away on linux-6.9.4.arch1-1 as well. This may also have been the cause of the 'taint' messages.
I'll need to wait for a while to ensure that the issue has indeed gone away. Here is the bisection log from when virtualbox was installed (I didn't finish the bisection):
git bisect start
# status: waiting for both good and bad commits
# good: [e8f897f4afef0031fe618a8e94127a0934896aba] Linux 6.8
git bisect good e8f897f4afef0031fe618a8e94127a0934896aba
# bad: [83a7eefedc9b56fe7bfeff13b6c7356688ffa670] Linux 6.10-rc3
git bisect bad 83a7eefedc9b56fe7bfeff13b6c7356688ffa670
# bad: [445e60303883950161f67e18b9f048b18d7fb706] Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
git bisect bad 445e60303883950161f67e18b9f048b18d7fb706
# bad: [e5e038b7ae9da96b93974bf072ca1876899a01a3] Merge tag 'fs_for_v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
git bisect bad e5e038b7ae9da96b93974bf072ca1876899a01a3
# good: [1f440397665f4241346e4cc6d93f8b73880815d1] Merge tag 'docs-6.9' of git://git.lwn.net/linux
git bisect good 1f440397665f4241346e4cc6d93f8b73880815d1
# good: [a2f24c8a955c8f941d6ac08dd7f401f54eef4627] Merge branch 'mptcp-some-clean-up-patches'
git bisect good a2f24c8a955c8f941d6ac08dd7f401f54eef4627
# bad: [aa7d6513d68bad539142f9d6c3e2faa629bc27d8] Merge tag 'tag-chrome-platform-firmware-for-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux
git bisect bad aa7d6513d68bad539142f9d6c3e2faa629bc27d8
# bad: [f095fefacdd35b4ea97dc6d88d054f2749a73d07] ptp: Move from simple ida to xarray
git bisect bad f095fefacdd35b4ea97dc6d88d054f2749a73d07
# bad: [75c2946db360e625f1447a37f47dbbb38b1dd478] Merge tag 'wireless-next-2024-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
git bisect bad 75c2946db360e625f1447a37f47dbbb38b1dd478
# good: [a12c1e7a6449c39b3dd6ae12bf410281ea79a9ad] ionic: remove unnecessary NULL test
git bisect good a12c1e7a6449c39b3dd6ae12bf410281ea79a9ad
# bad: [5fcc7c51f9e72d1e62991f8b32be4a5adf44d556] wifi: mac80211: handle netif carrier up/down with link AP during MLO
git bisect bad 5fcc7c51f9e72d1e62991f8b32be4a5adf44d556
...and, RIGHT after I posted the above comment, I suspended my system and got a freeze with a blinking caps lock LED (HP laptop), which has happened on occasion during previous freezes.
I no longer have any idea what's going on here. Could the issue be affected by the Wi-Fi network I'm connected to? Could there be multiple separate issues?
It really could be two problems we're looking at. Please try to find the culprit without the VBox module, as the upstream kernel devs don't care about out-of-tree modules.
Also, how reproducible is the issue without the module loaded?
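Reproducibility matters for the bisection itself: with an intermittent freeze, a commit can only be called 'good' after enough clean suspend/resume cycles. The Python sketch below shows one way to put a number on that. It assumes every cycle freezes independently with the same probability on a bad kernel, which is a simplification, and the per-cycle probabilities used are hypothetical values, not measurements from this report.

```python
import math


def cycles_needed(p_freeze: float, confidence: float = 0.95) -> int:
    """Clean cycles to run before marking a commit 'good'.

    If a bad kernel freezes with probability p_freeze on each cycle
    (assumed independent), the chance of seeing no freeze in n cycles is
    (1 - p_freeze) ** n. This returns the smallest n for which that chance
    is at most 1 - confidence.
    """
    if not 0 < p_freeze < 1:
        raise ValueError("p_freeze must be strictly between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_freeze))


# Hypothetical per-cycle freeze probabilities (assumptions for illustration).
for p in (0.5, 0.1):
    print(f"p={p}: {cycles_needed(p)} clean cycles for 95% confidence")
```

The lower the per-cycle freeze rate, the more cycles each 'good' verdict needs, so a rarely triggered bug makes every bisection step much more expensive to test.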
Alright, I redid the entire bisection after uninstalling VirtualBox. The issue is very much still present; it just usually takes more suspend+resume cycles to show up, and the crash is immediate without any window to use SysRq (I had to set up kdumpst instead). Accordingly, I've rewritten my original post below, as a draft email for whichever mailing list:
I've been experiencing system freezes that require a hard reset with the power button. These usually occur right after resuming from S3 (suspend), typically after the session has been suspended and resumed a few times. They have also occurred (seemingly) at random, although this is quite rare. They do seem to be more likely to occur after the system has been hibernated, although I don't really have any way to confirm this.
Since a hard reset causes the systemd journal to be dropped, I set up kdumpst so that I could record dmesg output at the time of the freeze. I've attached the crash logs produced by that tool for every bad commit in the bisection, as well as a crash log triggered by SysRq (Alt+SysRq+c) on a good commit for comparison.
I've already submitted this as an Arch Linux bug report: #61 (closed)
Git bisect log:
git bisect start
# status: waiting for both good and bad commits
# bad: [83a7eefedc9b56fe7bfeff13b6c7356688ffa670] Linux 6.10-rc3
git bisect bad 83a7eefedc9b56fe7bfeff13b6c7356688ffa670
# good: [e8f897f4afef0031fe618a8e94127a0934896aba] Linux 6.8
git bisect good e8f897f4afef0031fe618a8e94127a0934896aba
# bad: [445e60303883950161f67e18b9f048b18d7fb706] Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
git bisect bad 445e60303883950161f67e18b9f048b18d7fb706
# bad: [e5e038b7ae9da96b93974bf072ca1876899a01a3] Merge tag 'fs_for_v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
# this one froze on the first suspend/resume
git bisect bad e5e038b7ae9da96b93974bf072ca1876899a01a3
# good: [1f440397665f4241346e4cc6d93f8b73880815d1] Merge tag 'docs-6.9' of git://git.lwn.net/linux
git bisect good 1f440397665f4241346e4cc6d93f8b73880815d1
# bad: [a2f24c8a955c8f941d6ac08dd7f401f54eef4627] Merge branch 'mptcp-some-clean-up-patches'
# this one was really hard to reproduce; needed to hibernate, and after that it took till like the 5th cycle
git bisect bad a2f24c8a955c8f941d6ac08dd7f401f54eef4627
# bad: [26f4dac11775a1ca24e2605cb30e828d4dbdea93] netfilter: x_tables: Use unsafe_memcpy() for 0-sized destination
# EVEN harder to reproduce - I had to hibernate TWICE, and it must have taken 25-30 cycles total
git bisect bad 26f4dac11775a1ca24e2605cb30e828d4dbdea93
# good: [4f5e5092fdbf5cec6bedc19fbe69cce4f5f08372] Merge tag 'net-6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
git bisect good 4f5e5092fdbf5cec6bedc19fbe69cce4f5f08372
# good: [b4e8ae5c8c41355791a99fdf2fcac16deace1e79] net: add napi_busy_loop_rcu()
git bisect good b4e8ae5c8c41355791a99fdf2fcac16deace1e79
# bad: [20ea9327c2fd545d6b96e998727bcd724290694d] net: dccp: Simplify the allocation of slab caches in dccp_ackvec_init
git bisect bad 20ea9327c2fd545d6b96e998727bcd724290694d
# bad: [92046e83c07b064ca65ac4ae7660a540016bdfc1] Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
git bisect bad 92046e83c07b064ca65ac4ae7660a540016bdfc1
# bad: [b54846da45942bbe4e5ebc59d497e4a48525ba5a] Merge tag 'wireless-next-2024-01-25' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
git bisect bad b54846da45942bbe4e5ebc59d497e4a48525ba5a
# good: [3832a9c40b356500c5b85a6fdf9577c590fcd637] wifi: rtw89: fw: extend JOIN H2C command to support WiFi 7 chips
git bisect good 3832a9c40b356500c5b85a6fdf9577c590fcd637
# bad: [5ba45ba77616637e554d66a57ef0334e5cc2efe4] wifi: rtw89: fix disabling concurrent mode TX hang issue
# I see the same kernel warnings here as with the other bad commits. However,
# there is no crash after a few suspend/resume cycles - the system simply
# restarts when I trigger the nth suspend. I'm not able to trigger a crash with
# SysRq either (Alt+SysRq+c). This is the case with all bad commits after this.
git bisect bad 5ba45ba77616637e554d66a57ef0334e5cc2efe4
# good: [85da8f71aaa7b83ea7ef0e89182e0cd47e16d465] wifi: brcmfmac: Demote vendor-specific attach/detach messages to info
git bisect good 85da8f71aaa7b83ea7ef0e89182e0cd47e16d465
# good: [295304040d9f6f350b68652acd99650c7e16d0a8] wifi: rtw89: 8922a: add TX power related ops
git bisect good 295304040d9f6f350b68652acd99650c7e16d0a8
# good: [7cf6b6764b2f665d317ba0f91c247437019a2f4c] wifi: rtw89: Set default CQM config if not present
git bisect good 7cf6b6764b2f665d317ba0f91c247437019a2f4c
# good: [7e11a2966f51695c0af0b1f976a32d64dee243b2] wifi: rtw89: fix null pointer access when abort scan
git bisect good 7e11a2966f51695c0af0b1f976a32d64dee243b2
# bad: [f59a98c82534e986b06615ba94e060aa3129b08b] wifi: rtw89: fix HW scan timeout due to TSF sync issue
git bisect bad f59a98c82534e986b06615ba94e060aa3129b08b
# bad: [bcbefbd032df6bfe925e6afeca82eb9d2cc0cb23] wifi: rtw89: add wait/completion for abort scan
git bisect bad bcbefbd032df6bfe925e6afeca82eb9d2cc0cb23
# first bad commit: [bcbefbd032df6bfe925e6afeca82eb9d2cc0cb23] wifi: rtw89: add wait/completion for abort scan
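Before sending a log like this upstream, it may be worth checking that it survived copy and paste. The Python sketch below tallies the good/bad verdicts in `git bisect log` output and flags any `git bisect good|bad <rev>` command that disagrees with the `# good:` / `# bad:` comment before it. The helper name `check_log` is made up for this illustration, and the sample is just the first lines of the log above; the real way to reuse a log is `git bisect replay <logfile>`.

```python
import re

# "# good: [<40-hex hash>] subject" comments written by git bisect.
COMMENT = re.compile(r"^# (good|bad): \[([0-9a-f]{40})\]")
# The command lines that `git bisect replay` would execute.
COMMAND = re.compile(r"^git bisect (good|bad) (\S+)$")


def check_log(log_text):
    """Return (counts, problems) for the text of a `git bisect log` dump.

    counts tallies the verdicts; problems lists (line number, revision)
    for every command that does not match the verdict comment before it.
    """
    counts = {"good": 0, "bad": 0}
    problems = []
    expected = None  # (verdict, hash) from the most recent verdict comment
    for lineno, raw in enumerate(log_text.splitlines(), start=1):
        line = raw.strip()
        comment = COMMENT.match(line)
        if comment:
            expected = (comment.group(1), comment.group(2))
            continue
        command = COMMAND.match(line)
        if not command:
            continue  # free-form notes, "git bisect start", blank lines
        verdict, rev = command.groups()
        counts[verdict] += 1
        if expected is not None and (verdict, rev) != expected:
            problems.append((lineno, rev))
        expected = None
    return counts, problems


SAMPLE = """\
git bisect start
# status: waiting for both good and bad commits
# bad: [83a7eefedc9b56fe7bfeff13b6c7356688ffa670] Linux 6.10-rc3
git bisect bad 83a7eefedc9b56fe7bfeff13b6c7356688ffa670
# good: [e8f897f4afef0031fe618a8e94127a0934896aba] Linux 6.8
git bisect good e8f897f4afef0031fe618a8e94127a0934896aba
"""

counts, problems = check_log(SAMPLE)
print(counts, problems)
```

Free-form notes between a verdict comment and its command are skipped, so an annotated log can be checked as well.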
That's a very nice offer, but I also want all of the kernels there to be built & signed by me. But good job on the bisection and all the work that went into it!
I have been running into this with a Lenovo P50. I discovered that the latest nvidia-dkms (555.58.02-1 as of this writing) as opposed to the non-DKMS nvidia driver (provided by the nvidia package) seems to work around this issue.
In addition, nvidia-dkms 555.58.02-1 appears to "freeze" when coming out of suspend, but while in a graphical environment. I found that downgrading to nvidia-dkms 550.78-3 seems to restore stability.
Edit: After 8 days of using 550.78-3, I haven't experienced one restore-from-suspend or graphical "freeze" crash. I have been doing system updates about every other day without issues.
Thanks for your contribution. I will consider downgrading to nvidia-dkms 550.78-3 if it is stable and appropriate for my rig. I use S3 suspend many times throughout the day and cannot accept not having it.
It seems that Wine and Windows games depend too heavily on the newest nvidia-dkms, so downgrading is not an option. I'll have to wait for a proper fix for the current issue.
Jan Alexander Steffens (heftig) changed title from Freeze on resume from s3 (suspend) to Freeze on resume from s3 (suspend) while running hostapd hotspot
Jan Alexander Steffens (heftig) changed title from Freeze on resume from s3 (suspend) while running hostapd hotspot to Freeze on resume from s3 while running hostapd hotspot
The problem is still present on Arch Linux kernel version 6.10.3-arch1-2 with nvidia-dkms driver version 555.58.02. I am not running hostapd, but I still hit the problem, and very consistently.
So you mean I need to create a new bug report on this issue?
To me, the issues seem related, and the S3 suspend/resume behavior is very similar, which could indicate that the problem is not only related to hostapd and thus needs a broader investigation?
This one seems to have run its course. No recent response from @45mg. If still happening with latest kernels, please follow up with upstream (because it's not an Arch packaging issue).