Having worked in the semiconductor industry, I would guess the design is (relatively) the easy part; there is no real magic there. The complexity certainly increases with each successive generation, but manufacturing is where the real problem lies. A higher-pixel-density sensor on a larger die is a huge capital investment: everything, from the clean room to the manufacturing equipment, is orders of magnitude more expensive. And because a full-frame sensor is so much larger than, say, a point-and-shoot sensor, yields plummet.
The result is a much more expensive sensor.
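The yield effect can be sketched with the standard Poisson yield model, where the fraction of defect-free dies falls off exponentially with die area. The defect density below is a made-up illustrative number (real fab figures are proprietary), but the sensor areas are the actual dimensions of a 1/2.3" compact sensor versus a 36 mm x 24 mm full-frame sensor:

```python
import math

def poisson_yield(defect_density_per_cm2: float, die_area_cm2: float) -> float:
    """Fraction of dies expected to be defect-free (Poisson yield model)."""
    return math.exp(-defect_density_per_cm2 * die_area_cm2)

# Hypothetical defect density for illustration only.
D0 = 0.2  # defects per cm^2

# Approximate sensor die areas, in cm^2.
compact = 0.617 * 0.455   # 1/2.3" point-and-shoot sensor, ~0.28 cm^2
full_frame = 3.6 * 2.4    # full frame, 36 mm x 24 mm, ~8.64 cm^2

print(f"compact yield:    {poisson_yield(D0, compact):.1%}")
print(f"full-frame yield: {poisson_yield(D0, full_frame):.1%}")
```

Even at the same defect density, the full-frame die is roughly 30 times larger, so a fab that loses only a few percent of its small sensors can lose the large majority of its full-frame parts, and every lost die on a big wafer is expensive.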
And if the manufacturer is pushing the limits of its existing equipment and facilities, quality issues start to creep into the final product. These are much more pernicious because they cannot be designed out: for instance, if the engineers were told to produce a 32MP sensor on a line built for 24MP parts. A good analogy is high-precision machining, where the accuracy of the final product depends directly on the accuracy of the equipment. You can't simply tell people to machine more accurately if the machines can't hold the tolerances. I have seen electron-microscope photos that clearly show the difference between a well-manufactured chip and a poorly made one; in that case it was microprocessors.
When siting a new facility, something like a rail line within a few miles of the plant has to be taken into consideration: vibration transmitted through the ground can be enough to cause major problems in the manufacturing process. That gives you an idea of the tolerances involved.
A new facility designed to produce state-of-the-art chips (with tolerances measured in angstroms, in the case of microprocessors) costs multiple billions of dollars. If those kinds of investments are involved in camera sensors, it would explain a lot about pixel density, noise performance, and the cost of the camera.
Just my 2 cents.