I am talking about something like this:

Another way to put that would be Frequency (low to high) on the X axis, and Amplitude (high to low) on the Y axis.
Contrast is simply the amplitude of the frequency wave.
.../cut/...
No. Using the word "amplitude" as a straight replacement for the word "contrast" (red-marked text) is actually very misleading.
The amplitude is not equal to contrast in optics, and especially not when you're talking about visual contrast. Contrast, as normal people speak of it, is in most cases closely related to [amplitude divided by average level]. And so are MTF figures - this is not a coincidence.
An amplitude of +/-10 is a relatively large contrast if the average level is 20: the absolute swing is from 10 to 30, which gives an MTF of 10/20 = 0.5.
But if the average level is 100, the swing is 90 to 110 and the MTF is only 10/100 = 0.1. That's a very much lower contrast, and a lot harder to see or accurately reproduce.
Contrast is what we "see", not amplitude swing.
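To put numbers on the "amplitude divided by average level" idea, here is a minimal Python sketch. The function name `modulation` is my own, and I'm assuming the usual (max - min) / (max + min) definition of modulation, which works out to amplitude / mean for a symmetric swing.

```python
def modulation(mean_level, amplitude):
    """Modulation of a swing of +/-amplitude around mean_level.

    (max - min) / (max + min) reduces to amplitude / mean_level.
    """
    hi = mean_level + amplitude
    lo = mean_level - amplitude
    return (hi - lo) / (hi + lo)


# Same +/-10 amplitude, two different average levels.
print(modulation(20, 10))   # swing 10..30
print(modulation(100, 10))  # swing 90..110
```

The amplitude is identical in both calls; only the average level changes, and that alone moves the result from 0.5 to 0.1.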
And no, noise in general is not disjoint from MTF... Patterned noise is separable from image detail in an FFT, and you can eliminate most of it without disturbing the underlying material. Poisson noise, or any other non-patterned noise, on the other hand isn't separable by any known algorithm. And since the FFT of Poisson is basically a Gauss bell curve, you remove Poisson noise by applying a Gaussian blur... Any attempt to reconstruct the actual underlying material will be - at worst - a wild guess, and - at best - an educated guess. The educated guess is still a guess, and the reliability of the result is highly dependent on non-local surrounds.
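As a rough illustration of the patterned-noise case only, the sketch below uses NumPy on one column of pixels crossing horizontal bands. It is deliberately idealized: the signal, the band frequency (bin 50) and the fact that we know that frequency in advance are all assumptions made for the example, not a description of any real denoiser.

```python
import numpy as np

n = 256
y = np.arange(n)

# Hypothetical column: smooth "image detail" plus periodic banding.
detail = 100 + 30 * np.sin(2 * np.pi * 3 * y / n)   # 3 cycles per column
banding = 5 * np.sin(2 * np.pi * 50 * y / n)        # 50 cycles per column
column = detail + banding

# The banding lands in a single frequency bin, so zeroing that bin
# removes it without touching the detail at bin 3.
spectrum = np.fft.rfft(column)
spectrum[50] = 0
cleaned = np.fft.irfft(spectrum, n=n)

print(np.abs(column - detail).max())   # error before the notch
print(np.abs(cleaned - detail).max())  # error after the notch
```

Non-patterned noise has no such isolated bin to zero out; it overlaps the detail at every frequency, which is the distinction being drawn above.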
The Gaussian blur radius you need to apply to dampen non-patterned noise by a factor "X" is (again, not by coincidence!) almost exactly the same as the amount of downwards shift in MTF that you get.
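If you want to measure that trade-off yourself, a sketch along these lines would do it. It only compares the two numbers; it does not prove they track each other "almost exactly". The kernel truncation at 4 sigma, the Poisson mean of 100 and the 8-pixel test period are arbitrary choices of mine.

```python
import numpy as np


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel, truncated at 4 sigma."""
    radius = int(np.ceil(4 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()


def blur(signal, sigma):
    return np.convolve(signal, gaussian_kernel(sigma), mode="same")


rng = np.random.default_rng(0)
n = 4096
x = np.arange(n)
noise = rng.poisson(100, n).astype(float)          # non-patterned noise
detail = 100 + 20 * np.sin(2 * np.pi * x / 8.0)    # fine detail, 8 px per cycle
core = slice(64, n - 64)                           # skip zero-padded edges


def blur_report(sigma):
    """Return (noise std ratio, detail swing ratio) after blurring."""
    noise_gain = blur(noise, sigma)[core].std() / noise[core].std()
    blurred = blur(detail, sigma)[core]
    swing_before = detail[core].max() - detail[core].min()
    detail_gain = (blurred.max() - blurred.min()) / swing_before
    return noise_gain, detail_gain


for sigma in (1.0, 2.0, 3.0):
    print(sigma, blur_report(sigma))
```

Both ratios fall as sigma grows, so any blur strong enough to suppress the noise also costs contrast in fine detail.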
As noise suppression algorithms get smarter and smarter, the number of correct guesses/estimates in a certain image with a certain amount of noise present will continue to increase (correlation to reality will get better and better) - but they're still guesses. But that's good enough for most commercial use. What we're doing today in commercial post-processing regarding noise reduction is mostly adapting to psycho-visuals. We find ways to make the viewer THINK: "Ah... That looks good, that must be right" - by finding which types of noise patterns humans react strongly to, and then trying to avoid creating those patterns when blurring the image (all noise suppression is blurring!) and making/estimating new sharp edges.
Well, I can't speak directly to optics specifically.
I was thinking more in the context of the image itself, as recorded by the sensor. The image is a digital signal. There is more than one way to "think about" an image, and in one sense any image can be logically decomposed into discrete waves. Any row or column of pixels, block of pixels, however you want to decompose it, could be treated as a Fourier series. The whole image can even be projected into a three-dimensional surface shaped by a composition of waves in the X and Y axes, with amplitude defining the Z axis.
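To make the "row of pixels as a Fourier series" idea concrete, here is a small NumPy sketch. The row is synthetic (a mean level plus two sine waves I picked for the example), so treat the specific numbers as assumptions.

```python
import numpy as np

n = 64
x = np.arange(n)

# Hypothetical row of pixels: mean level 128 plus two waves.
row = (128
       + 40 * np.sin(2 * np.pi * 4 * x / n)     # 4 cycles across the row
       + 10 * np.sin(2 * np.pi * 10 * x / n))   # 10 cycles across the row

spectrum = np.fft.rfft(row)

# Convert FFT magnitudes to per-wave amplitudes.
mean_level = spectrum[0].real / n
amplitudes = 2 * np.abs(spectrum) / n   # valid for bins other than 0 and n/2

print(mean_level, amplitudes[4], amplitudes[10])

# The decomposition is lossless: summing the waves gives the row back.
reconstructed = np.fft.irfft(spectrum, n=n)
print(np.allclose(reconstructed, row))
```

The same thing extends to the whole image with `np.fft.fft2`, which gives the waves along both the X and Y axes.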
Performing such a decomposition is very complex, I won't deny that. Sure, a certain amount of guesswork is involved, and it is not perfect. Some algorithms are blind, and use multiple passes to guess the right functions for deconvolution, choosing the one that produces the best result. It is possible, however, to closely reproduce the inverse of the Poisson noise signal, apply it to the series, and largely eliminate that noise...with minimal impact to the rest of the image. Banding noise can be removed the same way. The process of doing so accurately is intense, and requires a considerable amount of computing power. And since a certain amount of guesswork IS involved, it can't be done perfectly without affecting the rest of the image at all. But it can be done fairly accurately with minimal blurring or other impact.
Assuming the image is just a digital signal, which in turn is just a composition of discrete waveforms, opens up a lot of possibilities. It would also mean that, assuming we generate a wave for just the bottom row of pixels in the sample image (the one without noise)...we have a modulated signal of high amplitude and increasing frequency. The "contrast" of each line pair in that wave is fundamentally determined by the amplitude of the wavelet. The row half-way up the image would have half the amplitude...which leads to what we would perceive as less contrast.
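For anyone who wants to reproduce a chart of this kind, a sketch like the following should be close. It is my own reconstruction from the description (frequency rising left to right, amplitude shrinking linearly from the bottom row to the top, all around middle gray), not the code used for the original image; `sweep_chart` and its parameters are hypothetical.

```python
import numpy as np


def sweep_chart(width=512, height=257, mean=0.5, max_amp=0.5,
                f0=0.005, f1=0.25):
    """Frequency sweep along X, amplitude falling off toward the top.

    f0 and f1 are the start and end frequencies in cycles per pixel.
    Row 0 is the top of the image, the last row is the bottom.
    """
    x = np.arange(width)
    # Linear chirp: instantaneous frequency goes from f0 to f1.
    phase = 2 * np.pi * (f0 * x + (f1 - f0) * x ** 2 / (2 * (width - 1)))
    wave = np.sin(phase)
    # Amplitude is 0 at the top row and max_amp at the bottom row.
    amp = np.linspace(0.0, max_amp, height)[:, None]
    return mean + amp * wave[None, :]


chart = sweep_chart()
bottom = chart[-1]
middle = chart[chart.shape[0] // 2]
print(bottom.max() - bottom.min())  # swing of the bottom row
print(middle.max() - middle.min())  # swing of the row half-way up
```

Every row has the same mean level, so here the amplitude and the contrast shrink together; that only holds because the average is held fixed.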
Perhaps it is incorrect to say that amplitude itself IS contrast, I guess I wouldn't dispute that. A shrinking amplitude around the middle gray tone of the image as a whole does directly lead to less contrast as you move up from the bottom row of pixels to the top in that image. Amplitude divided by average level sounds like a good way to describe it then, so again, I don't disagree. I apologize for being misleading.
I'd also offer that there is contrast on multiple levels. There is the overall contrast of the image (or an area of the image), as well as "microcontrast". If we use the noisy version of the image I created, the bottom row could not be represented as a single smoothly modulated wave. It is the combination of the base waveform of increasing frequency, as well as a separate waveform that represents the noise. The noise increases contrast on a per-pixel level, without hugely affecting the contrast of the image overall.
Perhaps this is an incorrect way of thinking about real light passing through a real lens in analog form. I know far less about optics. I do believe Hjulenissen was talking about algorithms processing a digital image on a computer, in which case discussing spatial frequencies of a digital signal seemed more appropriate. And in that context, a white/black line pair's contrast is directly affected by amplitude (again, sorry for the misleading notion that amplitude IS contrast...I agree that is incorrect.)