NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s

(This is a duplicate post I'm making as seemingly my old forum post was never approved by moderation)

We're running ubilinux 4 on most of our Upboards. On some of them we're running a custom kernel with the librealsense patches applied (see https://github.com/Delicode/ubilinux-kernel/commit/9ce517fa5d907928752c2285e75c54cb96d11ed7).

On some boards, stock ubilinux 4 or custom kernel, we sometimes see ubilinux not boot properly. This is the error shown over UART cable + minicom after BIOS + GRUB screens have passed:

Loading Linux 4.9.45-upboard ...
Loading initial ramdisk ...
error: no suitable video mode found.
Booting in blind mode
[    0.000000] efi: requested map not found.
[    0.000000] esrt: ESRT header is not in the memory map.
[    1.551913] Failed to find cpu0 device node
[    2.087959] i8042: Can't read CTR while initializing i8042
[   28.360040] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [systemd-udevd:92]
[   56.359127] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [systemd-udevd:92]
[   84.358218] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [systemd-udevd:92]
[  112.357308] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [systemd-udevd:92]
[  140.356399] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [systemd-udevd:92]
[  169.584984] NMI watchdog: Watchdog detected hard LOCKUP on cpu 3dModules linked in:c dwc3(E)c udc_core(E)c crc32c_intel(E)c xhci_pci(E)c xhci_hcd(E)c r8169(E)c mii(E)c usbcore(E)c         c

Hardware watchdog is enabled in /etc/systemd/system.conf with the following settings:

RuntimeWatchdogSec=300
ShutdownWatchdogSec=10min

However, on one of our problematic Ups, after waiting for 16 minutes the device has not booted itself. Testing hardware watchdog using echo c > /proc/sysrq-trigger correctly boots the board after 5 or so minutes.

Comments

  • Jesse Kaukonen
    Jesse Kaukonen New Member Posts: 42 ✭✭

    I tried to post an added comment yesterday, but it too went into moderation limbo. I'm posting it again:

    Another boot hang happens with this log:

    Loading Linux 4.9.45-upboard ...
    Loading initial ramdisk ...
    error: no suitable video mode found.
    Booting in blind mode
    [    0.000000] efi: requested map not found.
    [    0.000000] esrt: ESRT header is not in the memory map.
    [    1.553924] Failed to find cpu0 device node
    [    2.089942] i8042: Can't read CTR while initializing i8042
    

    No watchdog things are printed. I've included kern.log for this failed boot, with a successful boot following it. The log file contains the following:

    line 1: The boot that ends up hanging begins
    line 615: Final line of this boot. Hang happens here. I cut the power to the Up after about 17 minutes. This produces 4695 0x0 bytes written in the kern.log. On the same line, the first line of the next boot begins, continuing until the end of the file to a running ubilinux.

    I've attempted to reproduce this for all of yesterday and I seem to have found a few patterns:

    • This appears to only happen if I have a 4G Huawei USB modem connected.
    • This appears to happen more often on a cold system. Maybe some capacitor is not getting charged fast enough? I don't know.
    • This sometimes happen on a reboot, sometimes when I replug the power cable.

    I'm using 5V / 4A Up power supply. No other USB peripherals connected. Using UART terminal.

  • Jesse Kaukonen
    Jesse Kaukonen New Member Posts: 42 ✭✭

    After more testing, I've done the following tests:

    • This happens on two of our 2GB UpBoards with fan. Fanless not yet tested
    • This does not happen with stock ubilinux 3. I booted a board 30 times issuing reboots.
    • This happens on stock Ubilinux 4. I booted the board 13 times, and on 14th boot this problem happened.
    • This problem happens with 2 different PSUs as well as a POE hat.
    • This problem happens with nothing except PSU + UART connected - no USB or network cables connected
    • Issuing a normal reboot occasionally produces this, but the problem also seems to happen if the device is offline for a while.
    • Doing a clean shutdown vs losing power by unplugging the PSU does not seem to be a variable.
  • Filippo Bellini
    Filippo Bellini Emutex Posts: 10 mod

    Hello Jesse,

    Thanks for your information. Could you please run this script (see link at the end of the post) on the boards were you are experiencing the problem as well as on one that is working correctly, and attach the logs in your reply? That will help us to reproduce the issue and investigate it.

    https://wiki.up-community.org/Software#Requesting_support_for_Linux_OS

    Many thanks,
    Kind regards

  • Jesse Kaukonen
    Jesse Kaukonen New Member Posts: 42 ✭✭

    Here's the debug info from a fresh UpBoard with a clean install of Ubilinux 4. After 12 reboots, this problem manifested.

    I'll install ubilinux 3 on this (once I find the installer somewhere) and include the debug log for that later.

  • Jesse Kaukonen
    Jesse Kaukonen New Member Posts: 42 ✭✭

    I installed ubilinux 3 on the same device and booted it 20 times - no problems. Here's the sysinfo.

  • Jesse Kaukonen
    Jesse Kaukonen New Member Posts: 42 ✭✭

    As an added test result, I reinstalled ubilinux 4 and booted using a VGA terminal (keyboard + monitor) as opposed to UART + minicom over serial cable. The same problem happened on 4th boot.

  • Filippo Bellini
    Filippo Bellini Emutex Posts: 10 mod

    Hi Jesse,
    thank you for your reports. I hope to get back to you as soon as possible.

    Kind regards
    F.

  • Javier Arteaga
    Javier Arteaga Emutex Posts: 163 mod

    Hi Jesse,

    Thanks for the detailed report. From what I can see the actual issue is that udev (device manager) sometimes hangs when this Huawei USB modem is connected. The hardware watchdog itself, and your settings, look OK. "NMI watchdog" is just a kernel component involved in discovering and reporting this kind of lockup.

    I don't know yet what causes this bad interaction with ubilinux 4 only - here are some followup questions:

    • It seems you ran the support script while the USB modem was unplugged or otherwise unavailable. Could you please run it (preferably on ubilinux 4) on a session with a working modem?
    • What are the modem details (model, hardware revision)?
  • Jesse Kaukonen
    Jesse Kaukonen New Member Posts: 42 ✭✭

    Apologies if I was unclear: This happens even without the 4G Huawei USB modem connected. Originally I thought this was the cause, but further tests showed this problem happens even with only the power + UART cable connected. However, here's the sysinfo for a clean Ubilinux 4 install with the modem connected in case you want it.

  • Javier Arteaga
    Javier Arteaga Emutex Posts: 163 mod

    No apologies needed :) I did miss this point:

    @Jesse Kaukonen said:

    • This problem happens with nothing except PSU + UART connected - no USB or network cables connected

    If it's not a particular device, I wonder what makes these two failing UP boards special then. If I may trouble you again, could you also send the same output for a UP+ubilinux4 system where this never happens?

    Thanks again!

  • Jesse Kaukonen
    Jesse Kaukonen New Member Posts: 42 ✭✭

    upboard "u7009" + stock ubilinux4 + heatsink
    14 reboots: hang, see u7009_error.txt
    I've also included the kernel logs for u7009 showing this particular boot followed by a successful boot.

    upboard "u7023" + custom kernel + fan (system was installed via clonezilla and existing custom image)
    15 reboots - hangs on boot, no messages printed
    ubilinux4 install tried, failed, BIOS updated, ubilinux4 install still failed after cloning when growing the partition:

    current block:       7872, total block:    1048576, Complete: 100.00%
    Total Time: 00:00:02, Ave. Rate:   39.8MB/min, 100.00% completed!
    Syncing... OK!
    Partclone successfully restored the image (-) to the device (/dev/disk/by-partlabel/ESP)
    Cloned successfully.
    Partclone v0.2.89 http://partclone.org
    Starting to restore image (-) to device (/dev/disk/by-partlabel/root)
    Calculating bitmap... Please wait... done!
    File system:  EXTFS
    Device size:    5.0 GB = 1220608 Blocks
    Space in use:   3.4 GB = 824482 Blocks
    Free Space:     1.6 GB = 396126 Blocks
    Block size:   4096 Byte
    Elapsed: 00:02:24, Remaining: 00:00:00, Completed: 100.00%, Rate:   1.41GB/min,
    current block:    1154391, total block:    1220608, Complete: 100.00%
    Total Time: 00:02:24, Ave. Rate:    1.4GB/min, 100.00% completed!
    Syncing... OK!
    Partclone successfully restored the image (-) to the device (/dev/disk/by-partlabel/root)
    Cloned successfully.
    - Growing root partition...
    e2fsck: No such file or directory while trying to open /dev/disk/by-partlabel/root
    ~/.automated_script.sh  91.62s user 58.69s system 87% cpu 2:51.63 total
    8 root@ubilinux4-installer ~ #                                               :(
    

    I manually executed resize2fs after booting to the newly installed ubilinux4. After 50 boots the device hadn't experienced this problem. After this I did apt-get update and apt-get install lshw tree and generated the sysinfo. Now I issued one more reboot, and for whatever reason I now got some reboot messages in my serial terminal that previously didn't happen ("Stopping User Manger for UID 113..." etc) - something in installing lshw and tree started producing these messages. After doing 9 more boots, the problem manifested. See u7023_error.txt

    upboard "u7024" + ubilinux4 with opencl tests + fan (syslog_huawei from above post)
    51 reboots, no problems
    This device was used to make the OpenCL + USB tests in this thread: https://forum.up-community.org/discussion/2919/upboard-crashes-on-opencl-and-usb-device-use
    The device is essentially stock ubilinux4 with OpenCL stuff installed

    upboard "u7024" + stock ubilinux4, sysinfo_u7024 + fan
    17 reboots - hangs on boot, NMI watchdog messages printed
    Oddly, reinstalling ubilinux4 on this device causes the problems. Is this simply random chance (51 successful boots earlier) or is there some relation?

    upboard "u7006", ubilinux4, fan
    Hangs on first boot after installing ubilinux4, NMI watchdog messages printed

    That's all the Upboards I can test for now - the rest are at customer locations and going there soon. This means all our currently tested devices show this problem in one way or another.

  • Javier Arteaga
    Javier Arteaga Emutex Posts: 163 mod

    Hi Jesse,

    Thanks again for the detailed information. From what I can see now, there is more to this than just udev hanging. Going through your last post,

    @Jesse Kaukonen said:
    upboard "u7009" + stock ubilinux4 + heatsink
    14 reboots: hang, see u7009_error.txt

    Here, the failure appears to come from the graphics driver (i915). There's no obvious signs of a CPU hang like on previous reports.

    upboard "u7023" + custom kernel + fan (system was installed via clonezilla and existing custom image)
    ubilinux4 install still failed after cloning when growing the partition:
    [...]
    I manually executed resize2fs after booting to the newly installed ubilinux4.

    Separate, software bug but yes - the ubilinux 4 installer sometimes fails and needs to be re-run.

    After doing 9 more boots, the problem manifested. See u7023_error.txt

    Same GPU failure symptoms as u7009, plus a failure to communicate with the on-board CPLD:

    [    3.694933] CPLD verify error (expected: 7E5F6CB4, actual: 0)
    

    That last part is unusual and also hints at some lower-level root cause.

    Suggested steps for further debugging:

    • BIOS: latest release is UPC1DM07. From your logs, all these boards are on older factory-flashed releases. Upgrading before testing will rule out any SoC issues that may have already been fixed in firmware and/or CPU microcode.
    • Power supply: your video indicates you're using a 5A/6V PSU. Which one specifically? Is it the same one for all tests?
    • Kernel: re-running the tests on different kernels may put us on the right track. For example, 4.13 from Ubuntu 17.10 for a newer kernel. Confirming whether this never happens on ubilinux 3 across a few more boards would also be of help.
  • Jesse Kaukonen
    Jesse Kaukonen New Member Posts: 42 ✭✭

    BIOS: latest release is UPC1DM07. From your logs, all these boards are on older factory-flashed releases. Upgrading before testing will rule out any SoC issues that may have already been fixed in firmware and/or CPU microcode.

    The Up wiki link was outdated (and now dead? https://wiki.up-community.org/Bios_Update), so I seem to have used an older BIOS earlier. Here's some stats after BIOS update to UPC1DM07, no other test conditions changed (same PSU):

    u7009: ubilinux4 not reinstalled. boot #4 and boot #6: hangs. See u7009_fail.txt, u7009_kern.log, sysinfo_u7009_UPC1DM07.txt, kernel log again showing the failed boot followed by successful
    u7024: ubilinux4 not reinstalled. boot #9: hangs. See u7024_fail.txt u7024_kern.log, sysinfo_u7024_UPC1DM07.txt
    u7023: ubilinux4 reinstalled. Booted 70 times and no hangs. sysinfo_u7023_UPC1DM07.txt, u7023_kern.log
    u7022: ubilinux4 installed. Hangs on boot #29. Forgot to store the error message, but it's the same as u7009 and u7024. sysinfo_u7022_UPC1DM07.txt, u7022_kern.log. This particular Up had been running almost uninterrupted for probably 6 months with ubilinux 3 on our test platform, with a POE bridge, with no problems at all.
    u7006: ubilinux4 reinstalled. Hangs on boot #51. u7006_error.txt, u7006_kern.log, sysinfo_u7006_UPC1DM07.txt

    Btw, excellent work on enabling the LEDs on system power up. This gives the Up at least some measure of communication at early boot stages.

    Power supply: your video indicates you're using a 5A/6V PSU. Which one specifically? Is it the same one for all tests?

    Same 5V/6A PSU on all Ups, bought from Mouser. It's the official AAEON PSU. https://www.mouser.fi/ProductDetail/AAEON-UP/EP-PS5V6A65WUPS?qs=sGAEpiMZZMve4/bfQkoj%2bPo0V0IgBFDNrLMM9H84DIc=

    We originally suspected the Ups ran out of current with the 5V/4A PSU when OpenCL + USB devices were in use, so we upgraded the PSUs to the 6A model. We have several new ones available and can do the tests on another one.

    Kernel: re-running the tests on different kernels may put us on the right track. For example, 4.13 from Ubuntu 17.10 for a newer kernel. Confirming whether this never happens on ubilinux 3 across a few more boards would also be of help.

    u7006 with Ubuntu 17: booted 100 times with no hangs, so maybe it's safe to say a newer kernel works. Tests done with mouse + monitor as I couldn't be bothered to figure out how UART is enabled on Ubuntu. Ubuntu was installed using the official Ubuntu release followed by apt-get update + upgrade. I removed "splash quiet" from grub boot params.

    I didn't have time to test more Ups with Ubuntu today as installing it can take 30+ minutes. I suppose Ubuntu Server or something could be lighter.

    I'll have the ubilinux 3 tests tomorrow.

    Do you have plans to update ubilinux to a new kernel version?

  • Jesse Kaukonen
    Jesse Kaukonen New Member Posts: 42 ✭✭

    After installing ubilinux 3, BIOS UPC1DM07

    u7006:

    Successfully booted 40 times.
    On boot #21 and #38: [ 2.813732] CPLD verify error (expected: 7E5F6CB4, actual: 0)
    Boot was still successful. See u7006_ubilinux3_cpld.log

    There's another post on these forums about this error from 1.5 years ago, so I guess it was on ubilinux 3: https://forum.up-community.org/discussion/778/cpld-verify-error-on-boot

    u7022
    40 boots, no problems, no CPLD verify errors in kern.log

    u7024
    Successfully booted 40 times.
    On boot #26: [ 2.762457] CPLD verify error (expected: 7E5F6CB4, actual: 0)
    Boot still successful.

  • Javier Arteaga
    Javier Arteaga Emutex Posts: 163 mod

    Thanks again Jesse.

    At this point it seems safe to assume the issue is probably in the ubilinux 4 kernel. I'll look into reproducing it on my own - or at least throwing a few updated kernels your way if I can't.

    CPLD I/O errors appear to be a different problem altogether, but it's still important to debug. I'll try to reproduce these too. (That old ubilinux 3 report may or may not be related, given that the CPLD/FPGA initialization stage from firmware has changed since.)