This article is in my category directory ‘sfreview’, software review, although it is actually about hardware and software. I encountered the problem, not very severely, in Linux Mint 18.3 Sylvia, more severely in Mint 19, less so in 20.1, and most severely, to the point of blocking the installation, when trying to install Linux Mint 21.1 Vera. Why there were those differences, I don’t know.
Bodhi Linux (like Mint, based on Ubuntu) was also uninstallable on the same hardware, due to this. I also noticed the problem in Alpine Linux 3.17, but there it was quite bearable.
So the issue isn’t specific to Mint. It is specific to Linux, in the sense that it doesn’t seem to occur with Windows. And it is specific to certain products by HP (Hewlett-Packard). In my case: an HP Pavilion x360 Convertible, bought 2 July 2016, when I urgently needed a computer because another one had suddenly broken down. It is convertible in the sense that it can be used as a tablet, and as a laptop. I only ever used it as a laptop computer.
HP doesn’t support Linux, but only Windows. So when a problem arises, you’re basically on your own. Unless someone somewhere on the internet knows a solution.
So what is the problem? And what solutions have I found?
I started the HP computer from a USB-stick containing an installable ISO image for Linux Mint 21.1 Vera, with Cinnamon as the desktop. Later I used Ventoy, so more ISOs fit on one stick. Ventoy or the ISO starts a live session, in which you can try out Mint, and optionally you can install the system to hard disk from there.
There were three processes that together sucked up a lot of computer resources, both in processor power and in disk IO. Summarised output of the command top:
PR | NI | %CPU | COMMAND |
---|---|---|---|
−51 | 0 | 93,4 | irq/123-aerdrv |
19 | −1 | 93,4 | systemd-journal |
20 | 0 | 81,4 | rsyslogd |
We see that the IRQ process has a priority (PR) of minus 51, which is actually very high. A higher number means a lower priority, 20 is normal, negative is exceptionally high. That makes sense, because an IRQ, an interrupt request, is urgent: it asks the processor to interrupt its current activity and serve the IRQ first.
The nice value (NI) is usually zero, which is normal, but minus 1,
i.e. slightly higher, for the systemd-journal
process. In this case that results in a priority that is also
slightly higher than usual, 19 instead of 20.
The three processes together use up almost 300% of available processor power. The actual numbers vary somewhat, between 80 and 105% each. This model has a chip (Intel i3, generation 6) with four logical processors: two physical cores, each presented to the operating system as two through hyper-threading.
This heavy load by itself doesn’t make the computer unresponsive, although it is slow. That’s probably because a fourth processor core is still free, to serve keyboard or mouse input, process it, and send results to the graphics processor, so you see things happening on the screen. And the other processors too can devote some time to other processes in sequential multitasking, giving them time slices.
When I noticed the same problem in Alpine Linux instead of Linux Mint, there was only one process that took up about 10% of available processor core capacity. Not nice, but not a problem. Probably under Alpine, the priority isn’t as high? Or the interrupt is handled more efficiently?
There was only one process then, which means there was no logging.
And that’s where it bites under Linux Mint, in 21.1 and perhaps other versions too. These two logging processes write to disk files, /var/log/kern.log and /var/log/syslog. That has a clear impact on overall performance, even with the powerful and very fast hardware of modern computers.
When later the problem occurred after installation to hard disk, after just a few minutes the log files had a size of 16 gigabytes each! Not funny, although the disk is large enough to cope with it. This explains why in a live session these processes are so destructive, and soon make the computer completely unusable: a live session has limited ‘disk’ space, because everything written on top of the read-only squashfs file system actually resides in RAM. Very fast, but not as spacious as a magnetic hard disk or an SSD.
So after a few minutes: disk full, nothing else worked. With zero bytes of free storage, hardly anything can still function. There was not enough time to install the system. So brick the hardware? That would be a shame. This HP Pavilion has a fine screen with very warm colours. And I think almost seven years isn’t too old for a computer. I want to use it for another five years if possible, although perhaps only as a back-up computer in case the primary one fails.
(Side note: Shouldn’t log rotation have kicked in to handle those humongous log files? I thought I had it enabled. Perhaps it only checks every hour or so, and didn’t find the time yet. Anyway, clearing the log manually cannot be done using something like sudo cat > syslog, because it is the shell (bash, for example) that opens the log file for the redirection: cat has obtained root capabilities, but bash has not. Workaround: sudo tee syslog, then press ctrl-d to end input.)
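The workaround from the side note can be sketched as a small shell function. This is only an illustration under assumptions: the function name clear_log and the SUDO variable are made up here, and feeding /dev/null to tee simply replaces pressing ctrl-d by hand.

```shell
# Empty a log file in place (hypothetical helper, not part of any package).
# Set SUDO=sudo when the file is owned by root; leave it unset otherwise.
clear_log() {
    # tee opens the target file itself, so with sudo the open happens
    # with root rights. Its input is empty, so the file ends up empty.
    ${SUDO:-} tee "$1" < /dev/null > /dev/null
}
```

On the installed system this could be called as SUDO=sudo clear_log /var/log/syslog.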
Below I list a typical set of log file entries, taken from /var/log/kern.log.
There are some 20 thousand such report blocks for each second! Some
variation occurs, sometimes there are fewer lines per set, sometimes
more lines are repeated. I added some line breaks and tabs for readability,
where in reality everything was in one long line for each timestamp.
Mar 19 18:51:03 rudhar-HP-Pavilion-x360-Convertible kernel: [ 41.417238] pcieport 0000:00:1c.4: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar 19 18:51:03 rudhar-HP-Pavilion-x360-Convertible kernel: [ 41.417240] pcieport 0000:00:1c.4: device [8086:9d14] error status/mask=00000001/00002000
Mar 19 18:51:03 rudhar-HP-Pavilion-x360-Convertible kernel: [ 41.417247] pcieport 0000:00:1c.4: [ 0] RxErr
Mar 19 18:51:03 rudhar-HP-Pavilion-x360-Convertible kernel: [ 41.417257] pcieport 0000:00:1c.4: AER: Corrected error received: 0000:00:1c.4
Mar 19 18:51:03 rudhar-HP-Pavilion-x360-Convertible kernel: [ 41.417264] pcieport 0000:00:1c.4: AER: can't find device of ID00e4
Mar 19 18:51:03 rudhar-HP-Pavilion-x360-Convertible kernel: [ 41.417266] pcieport 0000:00:1c.4: AER: Corrected error received: 0000:00:1c.4
Mar 19 18:51:03 rudhar-HP-Pavilion-x360-Convertible kernel: [ 41.417281] pcieport 0000:00:1c.4: AER: can't find device of ID00e4
Mar 19 18:51:03 rudhar-HP-Pavilion-x360-Convertible kernel: [ 41.417283] pcieport 0000:00:1c.4: AER: Corrected error received: 0000:00:1c.4
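To get a feel for how fast such entries pile up, one could count them per second. The following is a rough sketch, not something I ran at the time: the function name aer_per_second is invented for this example, and it assumes the classic syslog layout shown above, where the first three fields are month, day and time.

```shell
# Count "Corrected error received" lines per timestamp in a log file.
# Assumes lines start with "Mon DD HH:MM:SS", as in /var/log/kern.log above.
aer_per_second() {
    grep -F 'AER: Corrected error received' "$1" |
        awk '{ count[$1 " " $2 " " $3]++ }
             END { for (t in count) print t, count[t] }' |
        sort
}
```

Called as aer_per_second /var/log/kern.log, it prints one line per second with the number of matching entries.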
Difficult to say whose fault this is, and who might have fixed it. Is it HP, who used faulty chips that cause errors that shouldn’t have happened? Or Intel, supplying the chipset? Is it something in PCI or PCIe (Peripheral Component Interconnect Express)? In MSI (Message Signalled Interrupts)? Or in AER maybe (Advanced Error Reporting), that shouldn’t report, or shouldn’t log, errors of severity “corrected”? Or should only report them three times and then stop, not the thousands or millions of times they occur?
I know far too little of this area of technology to be able to judge that.
I did see someone mention the command lspci somewhere, which lists all PCI devices. It reveals that the mentioned device 00:1c.4 is a PCI bridge. There are four of them in the system:
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #1 (rev f1)
00:1c.4 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #5 (rev f1)
00:1c.5 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #6 (rev f1)
00:1d.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #9 (rev f1)
The command sudo lspci -v gives me some more info. I see no difference between this bridge and the three others. There is this line:
Interrupt: pin A routed to IRQ 123
that mentions the same interrupt number I also saw in the process name in top, the process reporting program: irq/123-aerdrv.
If perhaps Hewlett-Packard could have fixed the problem, by a firmware update or something, it seems they don’t even try. They don’t take the problem seriously. In an HP Community forum, I found this reaction by an HP employee nicknamed A4Apollo. Quote:
“HP does not support dual boot options unless
the unit has been shipped with two operating systems.
You have to contact Linux support for more assistance.”
Solutions are described in various places on the internet (like here, for example), basically all variants of the same thing. Not really solutions maybe, but workarounds. Those helped me in the past. They didn’t help me this time, but see the next chapter. I deliberately numbered that one number 1, because chronologically I had to apply it first, although I learned about it last.
Those solutions I found on the internet entail that you edit the file /etc/default/grub. Where there is a line like:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
you change it to:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=noaer"
Or if it isn’t there, add it. Then you run update-grub (which is sometimes called update-grub2; and Alpine Linux doesn’t have this), which, using grub-mkconfig and os-prober, generates a new /boot/grub/grub.cfg. That controls the menu that the grub bootloader presents to the user, and the parameter pci=noaer will be passed to the Linux kernel.
This setting noaer means ‘No Advanced Error Reporting’. It doesn’t really solve the problem: the errors (which are automatically corrected) are still happening, only they are no longer reported, so no longer logged, and no disk space is eaten up.
Some also suggest the setting pci=nomsi instead, which disables MSI, or Message Signalled Interrupts. Does it have consequences? Does it mean interrupts are then handled in a more traditional and old-fashioned manner? Does that always work correctly with all modern hardware? I don’t know.
Some also mentioned pci=nommconf to disable Memory-Mapped PCI Configuration Space. No idea what that does.
Just pci=noaer worked for me.
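The edit of /etc/default/grub can also be scripted. Below is a minimal sketch, assuming the GRUB_CMDLINE_LINUX_DEFAULT value is enclosed in double quotes as in the example above. The function name add_noaer is hypothetical, and sed -i.bak leaves a backup copy next to the file.

```shell
# Add pci=noaer to GRUB_CMDLINE_LINUX_DEFAULT in the given file.
# Hypothetical helper: does nothing if the parameter is already there.
add_noaer() {
    file=$1
    if grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=.*pci=noaer' "$file"; then
        return 0
    fi
    if grep -q '^GRUB_CMDLINE_LINUX_DEFAULT="' "$file"; then
        # Insert the parameter just before the closing double quote.
        sed -i.bak 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 pci=noaer"/' "$file"
    else
        # The line is missing altogether, so add it.
        printf '%s\n' 'GRUB_CMDLINE_LINUX_DEFAULT="pci=noaer"' >> "$file"
    fi
}
```

On the real file this has to run with root rights, and sudo update-grub is still needed afterwards to regenerate /boot/grub/grub.cfg.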
The previous chapter, which I numbered number 2, may help to understand this one.
The problem with the grub file edits is that they become effective only after a reboot. And when you reboot a Linux live session (i.e. without having installed anything to hard disk yet), the live stick will reinstate its own settings, so you have the interrupt looping problem yet again.
The solution I suddenly thought of (the evening of 19 March, I wish the idea had come up earlier!): In the grub menu that the booted ISO, or Ventoy, or wherever it comes from, presents to you, press the letter e! That's e for edit. Then you can add pci=noaer there, just before the live session starts! And the looping and logging problem will be gone!
This allowed me to finally install Linux Mint 21.1 Vera on my HP Pavilion x360 laptop. Then I still had to edit /etc/default/grub, run update-grub, and reboot, as described earlier, to make the solution permanent. I felt so relieved!
In hindsight, the answer was already here, but I had overlooked it.
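Whether the parameter really reached the kernel, in the live session or after the reboot, can be checked in /proc/cmdline. A small sketch follows. The function name has_noaer is invented here, and it only recognises the exact word pci=noaer, not a combined form such as pci=noaer,nomsi.

```shell
# Succeed if the kernel command line contains the word pci=noaer.
# The optional argument (a file to read instead of /proc/cmdline) is a
# hypothetical convenience for trying the function out.
has_noaer() {
    # Put each parameter on its own line, then look for an exact match.
    tr ' ' '\n' < "${1:-/proc/cmdline}" | grep -qx 'pci=noaer'
}
```

Usage could be: has_noaer && echo "AER reporting is switched off".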
(This chapter added 3 April 2023)
Later on I found some notes on how, in late February 2021, I managed to install Linux Mint 20.1 Ulyssa, also with Cinnamon if that makes any difference, on that same machine, an HP Pavilion x360 Convertible. That time I solved the looping problem, more or less, in a different way: by renicing.
This can be done in the command line (sudo renice nice-value process-number) or in top with the r command. A nice value 19, if I understand process scheduling correctly, means the process only gets a time slice when nothing else wants to run, or if other processes have already used up a lot of CPU time.
I reniced the processes that top reported with the names systemd-journal and rsyslogd. The process irq/123-aerdrv however could not be reniced.
The system was still quite slow, but it helped, and I managed to start the installation of Linux Mint from the live session run from a USB stick. During the installation I used sudo top to find active processes, which I reniced to minus 10, to give them a high scheduling priority. I had to repeat this several times, because not all processes continued to run during the whole installation.
The result of these measures was that the computer remained overburdened and slow, but the installation did continue, and finally made it to the end.
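The renicing described above could be wrapped in a short function, so it is easier to repeat. This is a sketch under assumptions, not what I typed back then: the name renice_pids is made up, and looking up the process numbers with pgrep -x assumes the names match what top shows.

```shell
# Give all listed process numbers the same nice value.
# Lowering a nice value (e.g. to -10), or touching processes of other
# users, requires root; raising your own processes to 19 does not.
renice_pids() {
    nice_value=$1
    shift
    for pid in "$@"; do
        renice -n "$nice_value" -p "$pid"
    done
}

# Example, to be run as root (process names as reported by top):
#   renice_pids 19 $(pgrep -x rsyslogd) $(pgrep -x systemd-journal)
```

The same function with a value of -10 would cover the installer processes found with sudo top.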
Earlier tests were all in OSes that use Linux kernels 5.4 and 5.15.
Today, 5 April 2023, following a question in StackExchange, I tested in a live (uninstalled) session of Manjaro 22.0 Sikaris with Xfce, which has kernel 6.1, using the same Hewlett-Packard hardware as before. Output of uname -a was:
Linux manjaro 6.1.19-1-MANJARO #1 SMP PREEMPT_DYNAMIC Mon Mar 13 12:59:35 UTC 2023 x86_64 GNU/Linux
Result, seen in top: process systemd-journal used 90 to 100% of a CPU, and irq/123-aerdrv took about 25%. The system remained quite responsive, though. I noticed no rapidly growing logfile, disk usage remained stable.
Later that day, following instructions found on the internet, I tried to compile the latest stable kernel, 6.2.9, downloaded as a compressed tarball. That was under Linux Mint 21.1 Vera. It went well at first, but took a long time. Eventually, in the linking (ld) phase of vmlinux.o, the procedure was killed by the system for lack of memory, despite having 4 GB of RAM and 2 GiB of swap. That wasn’t enough. Strange. And a pity.
The next day I tried again on other hardware, now under Linux Mint 20.3 Una, and with 8 GB RAM and 2 GiB of swap space. That too was not enough. I now noticed in top that the offending process was not ld, but rather objtool, which had 6,7g of resident (not virtual) memory allocated (is that GB or GiB?). That should be possible, because I had purposely stopped Firefox, another memory hog. But the make process was killed with error code 137, which, I find, means the process was killed with signal 9 (137 = 128 + 9), as happens when the system is out of memory.
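The meaning of such an error code follows from a shell convention: a status above 128 stands for ‘ended by signal number status minus 128’. A minimal sketch of that arithmetic, with a function name (signal_from_status) invented for the occasion:

```shell
# Print the signal number encoded in an exit status, or 0 if the status
# does not indicate death by signal (i.e. it is 128 or lower).
signal_from_status() {
    if [ "$1" -gt 128 ]; then
        echo $(( $1 - 128 ))
    else
        echo 0
    fi
}
```

For 137 this gives 9, the number of SIGKILL, which is the signal the kernel sends to a process when it runs out of memory.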
I think there’s something wrong here. vmlinuz-5.15.0-69-generic is only about 11 MB (11468936 bytes, to be exact), so why should compiling and linking it require several hundred times more than that in RAM? OK, that file is compressed. But still.
Copyright © 2023 by R. Harmsen, all rights reserved.