Last night I was mulling over an overheating laptop issue that was reported by a user that turned out to be fluff and dust clogging up the fan rather than the intel_pstate driver being broken.
While it is a relief that the kernel driver is not at fault, it still bothered me that this kind of issue should be very simple to diagnose but I overlooked the obvious. When solving these issues it is very easy to doubt that the complex part of a system is working correctly (e.g. a kernel driver) rather than the simpler part (e.g. the fan not working efficiently). Normally, I try to apply
Occam's Razor which in the simplest form can be phrased as:
"when you have two competing theories that make exactly the same predictions, the simpler one is the better."
..e.g. in this case, the fan is clogged up.
Fortunately, laptops invariably provide
Thermal Zone information that can be monitored and hence one can correlate CPU activity with the temperature of various components of a laptop. So last night I added Thermal Zone sampling to powerstat 0.02.00 which is enabled with the new -t option.
powerstat -tfR 0.5
Running for 60.0 seconds (120 samples at 0.5 second intervals).
Power measurements will start in 0 seconds time.
Time User Nice Sys Idle IO Run Ctxt/s IRQ/s Watts x86_pk acpitz CPU Freq
11:13:15 5.1 0.0 2.1 92.8 0.0 1 7902 1152 7.97 62.00 63.00 1.93 GHz
11:13:16 3.9 0.0 2.5 93.1 0.5 1 7168 960 7.64 63.00 63.00 2.73 GHz
11:13:16 1.0 0.0 2.0 96.9 0.0 1 7014 950 7.20 63.00 63.00 2.61 GHz
11:13:17 2.0 0.0 3.0 94.5 0.5 1 6950 960 6.76 64.00 63.00 2.60 GHz
11:13:17 3.0 0.0 3.0 93.9 0.0 1 6738 994 6.21 63.00 63.00 1.68 GHz
11:13:18 3.5 0.0 2.5 93.6 0.5 1 6976 948 7.08 64.00 63.00 2.29 GHz
....
..the -t option now shows x86_pk (x86 CPU package temperature) and acpitz (APCI thermal zone) temperature readings in degrees Celsius.
Now this is where the fun begins. I ran powerstat for 60 seconds at 2 samples per second and then imported the data into LibreOffice. To easily show corrleations between CPU load, power consumption, temperature and CPU frequency I normalized the data so that the lowest values were 0.0 and the highest were 1.0 and produced the following graph:
One can see that the CPU frequency (green) scales with the the CPU load (blue) and so does the CPU power (orange). CPU temperature (yellow) jumps up quickly when the CPU is loaded and then steadily increases. Meanwhile, the ACPI thermal zone (purple) trails the CPU load because it takes time for the machine to warm up and then cool down (it takes time for a fan to pump out the heat from the machine).
So, next time a laptop runs hot, running powerstat will capture the activity and correlating temperature with CPU activity should allow one to see if the overheating is related to a real CPU frequency scaling issue or a clogged up fan (or broken heat pipe!).