UP Squared Freezes Randomly (Linux)

fprnd
fprnd New Member Posts: 6

We run UP Squared boards with Debian Linux 10 (buster) at various installation sites.

At some sites the UP Squared randomly freezes after days or weeks of operation. When this happens, the board can no longer be accessed or interacted with: pings to its network interfaces get no reply, and an attached display shows nothing (black screen). Only the LEDs on the network ports still indicate activity, and the power LED stays on.

These freezes occur randomly on some of the UP Squared systems. We can neither reproduce them on demand nor find any cause in the surrounding environment.
The Linux logs on the UP Squared contain nothing from the moment of failure onward. Even when a network cable is unplugged and plugged back in during a freeze, nothing appears in the Linux logs.

The BIOS version of the UP Squared systems differs; one example is UPA1AM40. We checked the temperature, and it is within the specified limits.

Has anybody experienced something like this and could provide a solution, or a hint toward one?

Comments

  • camillus
    camillus Administrator, Moderator, AAEON Posts: 95 admin

    Hi @fprnd ,

    Unfortunately, it's not possible to tell what's happening without checking the logs. However, I see you mentioned that the device is running BIOS UPA1AM40, which is quite old. Can you update the affected devices to the latest BIOS (you can download it from here) and monitor them? If it happens again, we can take the next step.

    Best Regards,

  • fprnd
    fprnd New Member Posts: 6
    edited November 25

    Hi @camillus,

    thanks for your response.

    Which logs would you advise us to check? What would be your advice on how to »monitor« it properly to determine the cause? As I said, the Linux logs did not contain any message indicating a failure, and no further log entries were created while the system was frozen.

    Regards.

  • camillus
    camillus Administrator, Moderator, AAEON Posts: 95 admin
    edited November 25

    Hi @fprnd ,

    I suggest the following: update the BIOS on the device, install stress-ng and openssh on it, and run a stress test, e.g. stress-ng --cpu 3 --io 4 --vm 3 --vm-bytes 3160M --fork 4 --timeout 172800 (that is 48 hours). Then access the device via ssh from another machine and print the dmesg log to the screen with the command dmesg -wH. That way you should be able to see the logs externally.

    Best Regards,
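
A possible refinement of the remote dmesg idea above, given that the exact moment of a freeze is hard to determine: stamp each streamed line with the clock of the observing machine, so that the last line written shows when the board went silent. This is only a minimal sketch, assuming a POSIX shell on the observing machine. The function name stamp_lines, the host user@up-squared, and the file name up2-dmesg.log are hypothetical.

```shell
#!/bin/sh
# Prefix every line read from stdin with the local wall-clock time.
# The time is taken on the observing machine, not on the UP Squared,
# so it does not depend on the state of the board being watched.
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done
}

# Usage (hypothetical host and file name), run on the observing machine:
#   ssh user@up-squared 'dmesg -w' | stamp_lines | tee -a up2-dmesg.log
```

If the board freezes, the stream simply stops, and the timestamp on the last line in the file gives a lower bound for the moment of failure.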

  • fprnd
    fprnd New Member Posts: 6

    Hi @camillus

    thanks for your reply. We'll now run the suggested stress test on one of our test systems. The problem is that we do not encounter this issue on all our systems; only some of them are affected. Unfortunately, the affected systems are not our test systems but production systems. That's why I did some further investigation in parallel:

    Following your suggestion to look at the logs, I checked /var/log/syslog on one of the UP Squared boxes where the issue occurred again. The system froze on Nov 22 at some point between 00:17 and 07:14. Here is the content of the syslog file for the relevant period:

    Nov 22 00:04:37 customerbox kernel: [ 1946.573655] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Nov 22 00:04:37 customerbox kernel: [ 1946.573658] CR2: 00007f427dc50160 CR3: 0000000176e46000 CR4: 00000000003406e0
    Nov 22 00:17:01 customerbox CRON[2063]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
    ===== here the system was frozen ====
    Nov 22 07:14:00 customerbox systemd[1]: Starting Flush Journal to Persistent Storage...
    Nov 22 07:14:00 customerbox kernel: [ 0.000000] Linux version 4.19.0-17-amd64 (debian-kernel@lists.debian.org) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP Debian 4.19.194-3 (2021-07-18)
    Nov 22 07:14:00 customerbox kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.19.0-17-amd64 root=UUID=e7a60afd-1992-4fff-8bfd-b23b26ba47f3 ro quiet
    Nov 22 07:14:00 customerbox kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'

    So as you can see, nothing was logged during the relevant period. Therefore I am not confident that the stress test will produce a useful result. As soon as the stress test has completed (we are supposed to run it for 48 hours), I will update the ticket here. But maybe there are other ideas based on the information I have provided?

    Regards
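
To locate such silent periods across several production boxes without reading each syslog by hand, something like the following could be used. It is a rough sketch, not a tested tool: it assumes every line starts with the traditional "Mon DD HH:MM:SS" syslog timestamp, it skips gaps that span a month boundary, and the function name find_log_gaps and the default threshold of 3600 seconds are arbitrary choices made for illustration.

```shell
#!/bin/sh
# Report silent periods in a syslog-style file.
#   $1 = log file, $2 = minimum gap in seconds (default 3600, arbitrary)
# Assumes every line starts with "Mon DD HH:MM:SS"; gaps across a month
# boundary are skipped because the day counter restarts there.
find_log_gaps() {
    awk -v min="${2:-3600}" '
        {
            split($3, t, ":")
            now = (($2 * 24 + t[1]) * 60 + t[2]) * 60 + t[3]
            if (NR > 1 && $1 == prev_month && now - prev >= min) {
                printf "%d s gap\n", now - prev
                printf "  last line before: %s\n", prev_line
                printf "  first line after: %s\n", $0
            }
            prev = now; prev_month = $1; prev_line = $0
        }
    ' "$1"
}

# Usage: find_log_gaps /var/log/syslog 1800
```

For the timestamps quoted above (00:04:37, 00:17:01, 07:14:00) this reports a single gap of 25019 seconds, between the CRON entry and the first boot message. A gap only shows that logging stopped; it does not say why.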

  • camillus
    camillus Administrator, Moderator, AAEON Posts: 95 admin

    Hi @fprnd,

    Can you also show the dmesg log from the affected system? I see that you were able to provide /var/log/syslog. Also, out of curiosity, I would like to know the BIOS versions of an affected system versus a non-affected system.

    Best Regards,

  • fprnd
    fprnd New Member Posts: 6

    Hi @camillus

    thanks for your reply. Attached to this post you will find part of the kern.log (which should be identical to dmesg, right?). We cannot determine the exact moment of failure; it was somewhere between midnight and 07:14.

    There are systems with BIOS version UPA1AM40 which are affected and there are systems with the same BIOS version which are not.

    Regards

  • camillus
    camillus Administrator, Moderator, AAEON Posts: 95 admin
    edited November 29

    Hi @fprnd,

    Thanks. This log is useful. I will look at it and get back to you.

    Best Regards,

  • camillus
    camillus Administrator, Moderator, AAEON Posts: 95 admin

    Hi @fprnd ,

    I have checked the log and can see that something is going on with the networking. Are you running a packet sniffer? Also, what was the outcome of the stress-ng test?

  • fprnd
    fprnd New Member Posts: 6

    Hi @camillus,

    thanks for your reply. Which lines are you referring to when you say that something is going on with the networking? We are not running a packet sniffer. The outcome of the stress test can be found in the attached image.

    Regards

  • camillus
    camillus Administrator, Moderator, AAEON Posts: 95 admin
    edited November 29

    Hi @fprnd ,

    Thanks, the stress-ng result looks good. You can find the relevant line highlighted. Was the stress test performed on an affected system, and on the same network as before?

  • fprnd
    fprnd New Member Posts: 6

    Hi @camillus,

    the stress test was performed on an identical system in our lab.

    Regarding your findings in the log: the promiscuous mode you mention is expected for our usage, because one of the network interfaces is set up this way so that it can be bridged to a Docker container running on the device.

    We are looking for more ideas and suggestions. Regards
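
To confirm on a given box that only the intended, bridged interface is in promiscuous mode, the interface flags exposed under /sys/class/net can be inspected. The sketch below assumes a Linux system with sysfs mounted and relies on IFF_PROMISC being bit 0x100 of the interface flags; the function names flags_have_promisc and list_promisc_interfaces are hypothetical.

```shell
#!/bin/sh
# IFF_PROMISC is bit 0x100 of the interface flags (linux/if.h).
# $1 = flags value as printed by /sys/class/net/<iface>/flags, e.g. 0x1103
flags_have_promisc() {
    [ $(( $1 & 0x100 )) -ne 0 ]
}

# Print every interface whose sysfs flags have the promiscuous bit set.
list_promisc_interfaces() {
    for dir in /sys/class/net/*; do
        [ -r "$dir/flags" ] || continue
        if flags_have_promisc "$(cat "$dir/flags")"; then
            echo "${dir##*/}"
        fi
    done
}

# Usage: list_promisc_interfaces
```

If the output lists interfaces other than the one bridged to the container, that would be worth mentioning in the thread.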