There is a reason why statistical texts spill copious amounts of ink on data sampling and survey design. Historically, data was laborious to collect and, with limited computational resources, laborious to analyze. This led to the development of mathematical modeling techniques that could work from small amounts of information. These models are simplistic and replete with assumptions made to keep the math tractable. The paucity of data and the simplifying assumptions it forced meant sacrificing significant predictive power. Well, not really a sacrifice, as there was no alternative!
Nowadays we are awash in data. Easy access to data and to huge amounts of computational power is completely changing our relationship with data and our approach to mathematical modeling. Generalizing from small amounts of data required making careful and tedious assumptions about model behavior. That is no longer needed, nor is the tractability of the underlying math all that important. What is all-important is what has been all along: how useful is the model? Whether the model can be found in an old statistical textbook is not only irrelevant, it can shackle innovative thinking. The only affirmation of a model is its performance, not its pedigree.
I expect that these views will not be welcomed by all, particularly the purists who believe in the frequentist/Bayesian paradigms that have dominated historical work on predictive modeling. I believe that with the data available today to build and test models, it is irrelevant whether the modeling is ‘sound’ by some arcane definition, or whether it can be interpreted in some academically narrow or sanctioned way. What is relevant is whether the model facilitates some useful outcome, and that it does so in a way that is robust to new data.
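The idea that a model is affirmed by its performance on new data, rather than by its pedigree, can be sketched concretely. The snippet below is a minimal illustration, not any particular library's method: it fits a deliberately naive model (a least-squares slope through the origin) on synthetic data, then judges it solely by its error on a held-out set the model never saw during fitting. All names and the data-generating process here are invented for the example.

```python
import random

random.seed(0)

# Synthetic data from a known process: y = 3x + Gaussian noise.
data = [(x, 3.0 * x + random.gauss(0.0, 1.0)) for x in range(100)]
random.shuffle(data)

# Hold out 20% of the data; the model never sees it during fitting.
train, test = data[:80], data[80:]

# A deliberately simple model: least-squares slope through the origin.
slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

# The only verdict that matters here: error on data the model has not seen.
test_mse = sum((y - slope * x) ** 2 for x, y in test) / len(test)

print(f"fitted slope: {slope:.3f}")
print(f"held-out mean squared error: {test_mse:.3f}")
```

Nothing in this evaluation asks whether the model is theoretically ‘sound’; the held-out error alone decides whether it is useful and robust.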
To be clear, in some applications the transparency or interpretability of a model is a valid concern, and we make performance trade-offs to accommodate those features. I suspect that with the vast new neural nets being designed and built for ubiquitous applications (e.g. organizing our photographs, reading diagnostic charts, etc.), whether a human understands the model will be an increasingly marginal concern in the future.