There is a lot going on, of course, but a higher pixel density means more information moving through smaller spaces, which makes heat build up faster. The pixel size on many high-density sensors is approaching the wavelength of certain bands of light. Pixels that can fit an entire wavelength are more efficient; pixels that can't are less so. Microlenses are not perfectly efficient either. Then there is the size-of-the-bucket argument. If you have a 1 gal bucket and a 5 gal bucket, there is no difference if you are measuring 0.5 gal of water. Both work great. But try to measure 3 gal of water: it overflows the smaller bucket but can still be quantified with the larger one.
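To make the bucket argument concrete, here is a tiny Python sketch. The `measure` function and the capacities are just the bucket analogy written out, not real sensor figures: anything above the bucket's capacity is lost, so the reading clips.

```python
def measure(water_gal, bucket_capacity_gal):
    """Return what the bucket can actually report.

    Illustrative helper for the bucket analogy only: any water
    beyond the bucket's capacity overflows and is not counted.
    """
    return min(water_gal, bucket_capacity_gal)


SMALL_BUCKET = 1.0  # gallons, stands in for a small pixel
LARGE_BUCKET = 5.0  # gallons, stands in for a large pixel

for water in (0.5, 3.0):
    small = measure(water, SMALL_BUCKET)
    large = measure(water, LARGE_BUCKET)
    print(f"{water} gal poured: small bucket reads {small}, large bucket reads {large}")
```

With 0.5 gal both buckets report the same thing. With 3 gal the small bucket is stuck at its capacity, which is the lost highlight headroom in photographic terms.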
Taking a step back, I actually agree with most of what you are saying. To a very large extent, aperture can be used to offset sensor size, and smaller-sensor cameras can have faster lenses. This is literally one of the reasons why I bought the G7X II. It is also supported by the articles below. As I said earlier, I think Clarkvision has some of the more interesting write-ups on this subject. You might like these:
http://clarkvision.com/articles/does.pixel.size.matter/
and he continues it here:
http://clarkvision.com/articles/does.pixel.size.matter2/
Basically, the sensors on these two (albeit older) cameras are different sizes but give very similar S/N ratios once normalized to size. So, they test out similarly (hence the DXO test results). Yet the smaller sensor gave visibly noisier results. Why? If you read the articles, it is because the smaller camera's lens has a smaller aperture diameter, which, when set to the same f-stop, lets in less overall light.
So, another way to state this: if you set the aperture diameter to be the same, and the sensor behaves ideally, then sensor size becomes irrelevant.
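To put rough numbers on that, here is a quick Python sketch. It relies on the standard relation that aperture diameter is focal length divided by f-number. The 50 mm lens, the f/2.8 setting, and the crop factor of about 2.7 for a 1-inch sensor are my own illustrative assumptions, not figures from the Clarkvision articles.

```python
def aperture_diameter(focal_length_mm, f_number):
    """Aperture (entrance pupil) diameter in mm: focal length / f-number."""
    return focal_length_mm / f_number


def equivalent_focal_length(full_frame_focal_mm, crop_factor):
    """Focal length giving the same field of view on the smaller sensor."""
    return full_frame_focal_mm / crop_factor


def matching_f_number(focal_length_mm, target_diameter_mm):
    """F-number needed for this focal length to reach a target diameter."""
    return focal_length_mm / target_diameter_mm


CROP = 2.7        # assumed crop factor for a 1-inch sensor (illustrative)
BIG_FOCAL = 50.0  # assumed full-frame focal length in mm (illustrative)
F_STOP = 2.8      # same f-stop set on both cameras

small_focal = equivalent_focal_length(BIG_FOCAL, CROP)
big_d = aperture_diameter(BIG_FOCAL, F_STOP)
small_d = aperture_diameter(small_focal, F_STOP)

# Light gathered scales with aperture area, i.e. diameter squared.
light_ratio = (big_d / small_d) ** 2

# F-stop the small camera would need to match the big lens's diameter.
needed_f = matching_f_number(small_focal, big_d)

print(f"Aperture diameters: {big_d:.1f} mm vs {small_d:.1f} mm")
print(f"Larger lens gathers about {light_ratio:.1f}x the light at the same f-stop")
print(f"Small camera would need about f/{needed_f:.2f} to match")
```

At the same f-stop and framing, the light ratio works out to the crop factor squared, which is why the smaller camera needs a much faster lens to close the gap.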
But to think that sensors behave ideally, or even the same, is incorrect. There are minor differences. The one I see most often when going through photos is that I simply have less headroom with the G7X II. I attribute that to the "bucket" size being exceeded with smaller pixels. But in the midtones, I see very similar results.