
Memory leak linux-6.16.8.arch3-1 in ESXi VMs?

Description:

I have three HashiCorp Vault VMs. They're ESXi guests I manage through vSphere.

They're thinly provisioned and work well as a lightweight Vault cluster on very modest resources.

After the most recent updates, all three started showing 100% CPU utilisation in ESXi. I couldn't SSH into them, and they stopped responding to my salt-master as well. It takes hours, but they eventually exhaust their memory and lock up trying to cope.

I did eventually pry my way into one of them. kswapd0 was using 100% CPU, and dmesg showed the OOM killer had already killed off just about every single process it was allowed to kill and had nothing left to target, leaving the guest cornered with kswapd0 maxing out a CPU trying to save the system.
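For anyone checking for the same state, these are the plain util-linux/procps commands I used from the console once I got in; nothing here is specific to my setup:

# dmesg -T | grep -iE 'out of memory|oom-kill|killed process'
# ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head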

This has been happening ever since I updated them the other week, and I note 6.16.8.arch3-1 was added to the repos on Tue 23 Sep 2025 08:08:35 AM AEST.

I tried upping them from 2 GB of memory to 4 GB, thinking one of our metrics-collection services was hitting them too hard, and then from 4 GB to 8 GB.

htop showed vault and elastic-agent (which I run on these) using 0.5% and 0.2% of total memory respectively; nothing else in the list was even close to that.

After 12+ hours of uptime, the memory the system reported as used did not appear to be allocated to any particular process running on these guests.
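Since nothing in userspace accounted for the usage, I started looking at kernel-side allocations. A quick way to check (plain procps tools, run as root; nothing specific to these guests) is to compare the slab and per-CPU counters in /proc/meminfo and list the largest slab caches:

# grep -E 'Slab|SReclaimable|SUnreclaim|Percpu' /proc/meminfo
# slabtop -o -s c | head -n 20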

Switching the third VM to linux-lts seemed to stop this from happening, so I applied the same change to the other two as well.

Additional info:

  • package version(s): 6.16.8.arch3-1
  • config and/or log files: They run vault and elastic-agent and practically nothing else beyond a stock archinstall. Restarting every service I can see (commands sketched below) does not relieve the memory pressure.
  • link to upstream bug report, if any: None yet.
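For the restart test mentioned in the list above, this is roughly what I ran. I'm assuming the stock unit names vault.service and elastic-agent.service here; adjust for your install:

# systemctl list-units --type=service --state=running
# systemctl restart vault.service elastic-agent.service

Neither made any dent in the reported memory usage.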

Steps to reproduce:

  1. Upgrade the system with pacman -Syu, making sure core/linux 6.16.8.arch3-1 gets installed and is in use.
  2. Reboot into the new kernel.
  3. Depending on the memory allocated to the guest, wait ten or so hours for memory to run out and the OOM killer to start desperately killing off processes.
  4. Eventually there's nothing left to kill and the system locks up hard, with kswapd0 maxing out a CPU thread (a simple way to watch the build-up is sketched below).
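If you want to watch the leak build up rather than wait for the lockup, a periodic check of the kernel counters is enough; the 60-second interval here is arbitrary:

# watch -n 60 "grep -E 'MemAvailable|SUnreclaim|Percpu' /proc/meminfo"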

How I worked around it:

Switching to linux-lts (pacman -S linux-lts) and changing the boot config to use that kernel and its initramfs images instead. All three VMs seem stable now.
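For reference, the boot change itself. I'm assuming systemd-boot below, so the entry file name and the root= line are placeholders rather than the exact config from these guests; GRUB users would instead rerun grub-mkconfig -o /boot/grub/grub.cfg after installing the package:

# pacman -S linux-lts
# /boot/loader/entries/arch-lts.conf (placeholder entry)
title   Arch Linux (LTS kernel)
linux   /vmlinuz-linux-lts
initrd  /initramfs-linux-lts.img
options root=UUID=<root-uuid> rw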

Additional diagnostics:

Here's the lspci -nn output for one of the impacted guests, in case the real cause is related to some VMware-specific virtual hardware.

00:00.0 Host bridge [0600]: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge [8086:7190] (rev 01)
00:01.0 PCI bridge [0604]: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge [8086:7191] (rev 01)
00:07.0 ISA bridge [0601]: Intel Corporation 82371AB/EB/MB PIIX4 ISA [8086:7110] (rev 08)
00:07.1 IDE interface [0101]: Intel Corporation 82371AB/EB/MB PIIX4 IDE [8086:7111] (rev 01)
00:07.3 Bridge [0680]: Intel Corporation 82371AB/EB/MB PIIX4 ACPI [8086:7113] (rev 08)
00:07.7 System peripheral [0880]: VMware Virtual Machine Communication Interface [15ad:0740] (rev 10)
00:0f.0 VGA compatible controller [0300]: VMware SVGA II Adapter [15ad:0405]
00:11.0 PCI bridge [0604]: VMware PCI bridge [15ad:0790] (rev 02)
00:15.0 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:15.1 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:15.2 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:15.3 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:15.4 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:15.5 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:15.6 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:15.7 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:16.0 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:16.1 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:16.2 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:16.3 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:16.4 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:16.5 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:16.6 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:16.7 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:17.0 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:17.1 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:17.2 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:17.3 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:17.4 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:17.5 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:17.6 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:17.7 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:18.0 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:18.1 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:18.2 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:18.3 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:18.4 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:18.5 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:18.6 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
00:18.7 PCI bridge [0604]: VMware PCI Express Root Port [15ad:07a0] (rev 01)
03:00.0 Serial Attached SCSI controller [0107]: VMware PVSCSI SCSI Controller [15ad:07c0] (rev 02)
0b:00.0 Ethernet controller [0200]: VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)

I set the servers up to dump a few commands into a log file every hour while this was going on (the collection script itself is sketched after the log output). Here's the memory info from the last log file written before vault01 crashed, at 3:46 AM this morning:

### Last log file before hard lockup at 3:46AM this morning: # meminfo.1759167998.txt

# uname -a
Linux vault01.us.local 6.16.8-arch3-1 #1 SMP PREEMPT_DYNAMIC Mon, 22 Sep 2025 22:08:35 +0000 x86_64 GNU/Linux

# cat /proc/meminfo
MemTotal:        8088980 kB
MemFree:          124496 kB
MemAvailable:      31472 kB
Buffers:            2528 kB
Cached:            58156 kB
SwapCached:            0 kB
Active:           233080 kB
Inactive:         179100 kB
Active(anon):     207036 kB
Inactive(anon):   145580 kB
Active(file):      26044 kB
Inactive(file):    33520 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Zswap:                 0 kB
Zswapped:              0 kB
Dirty:              5660 kB
Writeback:             0 kB
AnonPages:        351336 kB
Mapped:            49492 kB
Shmem:              1624 kB
KReclaimable:      26588 kB
Slab:            3516888 kB
SReclaimable:      26588 kB
SUnreclaim:      3490300 kB
KernelStack:        6992 kB
PageTables:        25996 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     4044488 kB
Committed_AS:     958200 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       86352 kB
VmallocChunk:          0 kB
Percpu:          3880960 kB
HardwareCorrupted:     0 kB
AnonHugePages:     20480 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
Unaccepted:            0 kB
Balloon:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      109992 kB
DirectMap2M:     4083712 kB
DirectMap1G:     6291456 kB

# vmstat -s -S M
         7899 M total memory
         7858 M used memory
          183 M active memory
          207 M inactive memory
          132 M free memory
            3 M buffer memory
           82 M swap cache
            0 M total swap
            0 M used swap
            0 M free swap
        65765 non-nice user cpu ticks
           26 nice user cpu ticks
      5341110 system cpu ticks
     22484657 idle cpu ticks
        39761 IO-wait cpu ticks
        23768 IRQ cpu ticks
         7241 softirq cpu ticks
            0 stolen cpu ticks
            0 non-nice guest cpu ticks
            0 nice guest cpu ticks
     85238580 K paged in
     13040549 K paged out
            0 pages swapped in
            0 pages swapped out
          256 pages alloc in dma
      8178784 pages alloc in dma32
            0 pages alloc in high
            0 pages alloc in movable
     42339215 pages alloc in normal
     52790786 pages free
     87472549 interrupts
     68867892 CPU context switches
   1759097965 boot time
        63668 forks
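The hourly collection mentioned above is nothing fancy. A minimal sketch of what each guest runs is below; the script path and cron entry are illustrative rather than the exact files on my hosts:

#!/bin/sh
# Dump kernel version, /proc/meminfo and vmstat output into a timestamped log file.
out="/var/log/meminfo.$(date +%s).txt"
{
    echo "# uname -a";          uname -a
    echo
    echo "# cat /proc/meminfo"; cat /proc/meminfo
    echo
    echo "# vmstat -s -S M";    vmstat -s -S M
} > "$out"

# crontab entry (illustrative): 0 * * * * /usr/local/bin/meminfo-snapshot.sh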