UP Squared Freezes Randomly (Linux)

fprnd New Member Posts: 7

We run UP Squared boards with Debian Linux 10 (buster) at various installation sites.

At some sites the UP Squared randomly freezes after days or weeks of operation. When this happens, the board can no longer be accessed or interacted with: the network interfaces do not reply to ping, and an attached display shows nothing (black screen). Only the LEDs on the network ports still indicate activity, and the power LED stays on.

These freezes occur randomly on some of the UP Squared systems. We can neither reproduce them on demand nor find any cause in the surrounding environment.
The Linux logs on the UP Squared contain nothing from the moment of failure onward. Even if a network cable is unplugged and plugged back in while the system is frozen, nothing shows up in the Linux logs.

The BIOS version of the UP Squared systems varies; one example is UPA1AM40. We checked the temperature and found it to be within the specified limits.

Has anybody experienced something like this and could provide a solution, or a hint toward one?

Comments

  • camillus
    camillus Administrator, Moderator, AAEON Posts: 177 admin

    Hi @fprnd ,

    Unfortunately, it's not possible to tell what is happening without checking the logs. However, you mentioned that the device is running BIOS UPA1AM40, which is quite old. Can you update the affected devices to the latest BIOS (you can download it from here) and monitor them? If it happens again, we can take the next step.

    Best Regards,

  • fprnd
    fprnd New Member Posts: 7
    edited November 2021

    Hi @camillus,

    thanks for your response.

    Which logs would you advise us to check? And how would you advise us to »monitor« it properly to determine the cause? As said, the Linux logs did not contain any message that would indicate a failure, and no further log entries were created while the system was frozen.

    Regards.

  • camillus
    camillus Administrator, Moderator, AAEON Posts: 177 admin
    edited November 2021

    Hi @fprnd ,

    I suggest the following: update the BIOS on the device, install stress-ng and openssh on it, and run a stress test, e.g. stress-ng --cpu 3 --io 4 --vm 3 --vm-bytes 3160M --fork 4 --timeout 172800. Then access the device via ssh from another device and print the dmesg log to the screen with the command dmesg -wH. That way you should be able to see the logs externally.
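A minimal sketch of the "watch dmesg from another device" idea, meant to run on a second machine rather than on the UP Squared. The `TARGET` address, the output file name, and both function names are assumptions made for illustration; only `dmesg -wH` comes from the post above. Depending on the kernel settings, reading the kernel log on the target may require root.

```shell
#!/bin/sh
# Sketch: run on a SECOND machine, not on the UP Squared itself.
# TARGET is a hypothetical ssh destination; replace it with the real one.
TARGET="${TARGET:-user@up-squared.example}"

# Prefix every line read from stdin with the local wall-clock time,
# so each kernel message carries the time at which it was received.
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%F %T')" "$line"
    done
}

# Follow the kernel log of the target over ssh and append it to a local file.
# The file lives on this machine, so it stays readable when the target freezes.
follow_kernel_log() {
    out=${1:-upsquared-dmesg.log}
    ssh "$TARGET" 'dmesg -wH' | stamp_lines >> "$out"
}
```

Starting `follow_kernel_log` before the stress test keeps a copy of the kernel messages off the board. After a freeze, the last stamped line shows how far the kernel log got and when that line arrived.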

    Best Regards,

  • fprnd
    fprnd New Member Posts: 7

    Hi @camillus

    thanks for your reply. We'll run the suggested stress test now on one of our test systems. The problem, however, is that we do not encounter this issue on all our systems; only some of them are affected. Unfortunately, the affected systems are not our test systems but production systems. That's why I did some further investigation in parallel:

    Following your suggestion to look at the syslog, I checked /var/log/syslog on one of the UP Squared boxes where the issue occurred again. The system froze on Nov 22nd, sometime between 00:17 and 07:14. Here is the content of the syslog file for the relevant period:

    Nov 22 00:04:37 customerbox kernel: [ 1946.573655] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Nov 22 00:04:37 customerbox kernel: [ 1946.573658] CR2: 00007f427dc50160 CR3: 0000000176e46000 CR4: 00000000003406e0
    Nov 22 00:17:01 customerbox CRON[2063]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
    ===== here the system was frozen ====
    Nov 22 07:14:00 customerbox systemd[1]: Starting Flush Journal to Persistent Storage...
    Nov 22 07:14:00 customerbox kernel: [ 0.000000] Linux version 4.19.0-17-amd64 (debian-kernel@lists.debian.org) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP Debian 4.19.194-3 (2021-07-18)
    Nov 22 07:14:00 customerbox kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.19.0-17-amd64 root=UUID=e7a60afd-1992-4fff-8bfd-b23b26ba47f3 ro quiet
    Nov 22 07:14:00 customerbox kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'

    So, as you can see, nothing was logged during the relevant period. Therefore, I am not confident that the stress test will produce a useful result. As soon as the stress test is completed (we are supposed to run it for 48 hours), I will update the ticket here. But maybe there are other ideas based on the information I have provided?
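The silent window in an excerpt like the one above can also be located mechanically, which helps when comparing several affected boxes. The following is a sketch only: `largest_gap` is a hypothetical helper, it assumes the traditional "Mon DD HH:MM:SS" syslog prefix shown in the excerpt, and it ignores years and leap days, so it is only meaningful within a single log file from one year.

```shell
#!/bin/sh
# Sketch: report the longest pause between consecutive syslog entries.
# Reads the files given as arguments, or stdin when none are given.
largest_gap() {
    awk '
    BEGIN {
        # Days elapsed before the first of each month (non-leap year).
        split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
        split("0 31 59 90 120 151 181 212 243 273 304 334", offset, " ")
        for (i = 1; i <= 12; i++) days_before[names[i]] = offset[i]
    }
    # Only lines that start with "Mon DD HH:MM:SS" are considered.
    $1 in days_before && $3 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ {
        split($3, t, ":")
        now = ((days_before[$1] + $2) * 24 + t[1]) * 3600 + t[2] * 60 + t[3]
        stamp = $1 " " $2 " " $3
        if (seen && now - prev > max) {
            max = now - prev
            from = prev_stamp
            to = stamp
        }
        prev = now
        prev_stamp = stamp
        seen = 1
    }
    END {
        if (max > 0)
            printf "%d seconds without log entries between %s and %s\n", max, from, to
    }
    ' "$@"
}
```

Called as `largest_gap /var/log/syslog`, it prints one line naming the two timestamps that bracket the longest pause. For the excerpt above that is the window from Nov 22 00:17:01 to Nov 22 07:14:00, i.e. 25019 seconds.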

    Regards

  • camillus
    camillus Administrator, Moderator, AAEON Posts: 177 admin

    Hi @fprnd,

    Can you also show the dmesg log from the affected system? I see that you were able to show /var/log/syslog. Also, out of curiosity, I would like to know the BIOS versions of an affected system versus a non-affected system.

    Best Regards,

  • fprnd
    fprnd New Member Posts: 7

    Hi @camillus

    thanks for your reply. Attached to this post is part of the kern.log (which should be identical to dmesg, right?). We cannot determine the exact moment of failure; it was sometime between midnight and 07:14.
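One way to narrow down the moment of failure on a production box is a small heartbeat that writes a timestamp to disk at a fixed interval, so that the last line before the gap bounds the freeze time. The sketch below also records temperatures. It assumes the board exposes thermal zones under /sys/class/thermal, which is not confirmed in this thread, and `heartbeat_line`, `heartbeat_loop`, and the output path are hypothetical names.

```shell
#!/bin/sh
# Sketch: periodic heartbeat with temperature readings.
# THERMAL_ROOT is an assumption; zones may be missing or named differently.
THERMAL_ROOT="${THERMAL_ROOT:-/sys/class/thermal}"

# Print one line: timestamp followed by each readable zone in whole degrees C.
heartbeat_line() {
    line=$(date '+%F %T')
    for zone in "$THERMAL_ROOT"/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        millideg=$(cat "$zone/temp" 2>/dev/null) || continue
        # Skip anything that is not a plain integer (values are millidegrees).
        case $millideg in
            ''|*[!0-9-]*) continue ;;
        esac
        line="$line ${zone##*/}=$((millideg / 1000))C"
    done
    printf '%s\n' "$line"
}

# Append a heartbeat every INTERVAL seconds and flush it to disk each time,
# so the last line is not lost in the page cache when the system freezes.
heartbeat_loop() {
    out=$1
    interval=${2:-60}
    while :; do
        heartbeat_line >> "$out"
        sync
        sleep "$interval"
    done
}
```

Run as `heartbeat_loop /var/log/heartbeat.log 60 &`, this leaves at most one interval of uncertainty about when the system stopped, and shows whether the temperature was rising beforehand.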

    There are systems with BIOS version UPA1AM40 which are affected and there are systems with the same BIOS version which are not.

    Regards

  • camillus
    camillus Administrator, Moderator, AAEON Posts: 177 admin
    edited November 2021

    Hi @fprnd,

    Thanks. This log is useful. Will look at it and get back to you.

    Best Regards,

  • camillus
    camillus Administrator, Moderator, AAEON Posts: 177 admin

    Hi @fprnd ,

    I have checked the log and can see there is something going on with the networking. Are you running a packet sniffer? Also, what was the outcome of the stress-ng test?

  • fprnd
    fprnd New Member Posts: 7

    Hi @camillus,

    thanks for your reply. Which lines are you referring to when you say that something is going on with the networking? We are not running a packet sniffer. The outcome of the stress test can be found in the attached image.

    Regards

  • camillus
    camillus Administrator, Moderator, AAEON Posts: 177 admin
    edited November 2021

    Hi @fprnd ,

    Thanks, the stress-ng result looks good. You can find the line highlighted. Was the stress test performed on an affected system, and on the same network as before?

  • fprnd
    fprnd New Member Posts: 7

    Hi @camillus,

    the stress test was performed on an identical system in our lab.

    Regarding your findings in the log: the promiscuous mode you mention is expected in our setup, because one of the network interfaces is configured this way so that it can be bridged to a Docker container running on the device.

    Looking for more ideas / suggestions ... regards

  • camillus
    camillus Administrator, Moderator, AAEON Posts: 177 admin
    edited December 2021

    Hi @fprnd,

    I took a look at the log again and I have two suggestions to try:

    Let us know if it works.

    Best Regards,

  • fprnd
    fprnd New Member Posts: 7

    Hi @camillus,

    thanks for your reply.

    However, we have to admit that we do not expect the hints from your last post to solve the issue. We checked the logs on other (still working) systems as well, and the IPv6 messages appear there too, without locking up the system. Looking at the linked forum thread, it is also obvious that it describes a different situation.

    Regarding your hint to upgrade the kernel, it would be interesting to know how you came to this conclusion.

    Overall, as said, we have already looked through the log files quite thoroughly. To be honest, we expect that we have to search somewhere else on or in the UP2, which is why we opened this forum discussion. So could we ask you for hints, suggestions, or ideas beyond the log files?

    Regards

  • DCleri
    DCleri Administrator, AAEON Posts: 1,212 admin

    Hello @fprnd

    First of all, I would like to mention that the configuration (distribution and kernel version) you are using has not been tested, and we cannot offer direct support for it via the forums.

    What you can do to help us identify whether there is a hardware issue on some of your systems is to use a validated configuration and provide a way for us to reproduce the issue. Unfortunately, we cannot help further otherwise.

    It is also important that you run the tests on the systems affected by the freeze, as the issue could be specific to those. We cannot get a full picture if you cannot test those units.

    Testing:
    Please make sure you are using our boards with a proper power supply: the UP Squared requires a 5V 6A supply to function properly.

    Make sure you have the latest BIOS installed, current BIOS is available here: https://downloads.up-community.org/download/up-squared-uefi-bios-v6-1/

    Install a validated configuration on your board: https://github.com/up-board/up-community/wiki/Ubuntu_20.04

    Run isolated test cases to check where the freezes happen when stressing a particular item:

    If none of the above causes a recurring freeze, you will need to check the specific application running on your system and its related software.
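The isolated test runs mentioned above could be scripted roughly as follows. This is a sketch under assumptions, not the list from the original post: it splits the combined stress-ng command quoted earlier in this thread into one run per stressor, the one-hour timeout is arbitrary, and `run_isolated`, `run_all_isolated`, and `LOG_FILE` are hypothetical names.

```shell
#!/bin/sh
# Sketch: exercise one subsystem at a time and leave a trail on disk.
# LOG_FILE is a hypothetical path; pick one on persistent storage.
LOG_FILE="${LOG_FILE:-./isolated-stress.log}"

# Record a START marker, flush it to disk, run the command,
# then record an END marker with the exit code of the command.
run_isolated() {
    name=$1
    shift
    printf '%s START %s\n' "$(date '+%F %T')" "$name" >> "$LOG_FILE"
    sync    # make sure the marker survives a hard freeze
    rc=0
    "$@" || rc=$?
    printf '%s END %s rc=%s\n' "$(date '+%F %T')" "$name" "$rc" >> "$LOG_FILE"
    return "$rc"
}

# One hour per stressor, using the worker counts from the combined
# stress-ng command quoted earlier in this thread.
run_all_isolated() {
    run_isolated cpu  stress-ng --cpu 3 --timeout 3600
    run_isolated io   stress-ng --io 4 --timeout 3600
    run_isolated vm   stress-ng --vm 3 --vm-bytes 3160M --timeout 3600
    run_isolated fork stress-ng --fork 4 --timeout 3600
}
```

After a freeze and reboot, a START line without a matching END line in the log file names the stressor that was running when the board stopped.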