Garbage In, Garbage Out: Data-Centricity and Theory-Driven AI/ML
Andrew Ng recently led a shift in thinking about AI model performance (see MLOps: From Model-centric to Data-centric AI). Given how far AI/ML algorithms have advanced, he argued that the major limiting factor for model performance is now data quality. Traditionally, large datasets have been employed to train models, and quality labelling is expensive. Dr. Ng asserted, however, that small training datasets can yield high model performance if they are carefully engineered and attention is focused on improving the small percentage of lower-quality cases.
The focus on data quality in analytical work is age-old. Labelling training data for supervised AI/ML is by its very nature a measurement problem. Hence, the principles of sampling and measurement theory apply to controlling the reliability and validity of training data in relation to the “real world”. Data labels provide detailed information about the content and structure of a dataset (e.g., data types, measurement units, time periods) and the to-be-predicted outcomes. Since labelling is often performed by human raters, the labels may be inaccurate or unreliable.
Common best practice is to obtain judgments from multiple raters and reconcile the inconsistencies. This improves the reliability of the labels, but it does not ensure that what the raters agree on is also accurate: labels can be consistent yet still wrong. Bad labelling may also stem from poor instructions that do not adequately convey how to label (all) the cases. It helps for raters to have contextual knowledge of the intended use of the labels and the actions to be taken based on the model’s results. That is to say, domain expertise is important, and proceeding without domain knowledge can lead to quantitatively and qualitatively incorrect models. Domain expertise guides the development of an effective taxonomy of features to be labelled and the appropriate application of those labels. The expression of this domain expertise is based in the mental models and expectations held by the individuals involved – theories of how things should work.
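To make the reliability point concrete, here is a minimal sketch of checking inter-rater agreement with Cohen’s kappa via scikit-learn. The rater arrays and label names are hypothetical; in practice you would pull them from your labelling tool.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two raters for the same 10 cases
rater_a = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "spam", "ham", "ham"]
rater_b = ["spam", "ham",  "ham", "ham", "spam", "ham", "spam", "spam", "ham", "ham"]

# Cohen's kappa corrects raw agreement for agreement expected by chance:
# 1.0 = perfect agreement, 0.0 = chance-level agreement
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

Note that a high kappa only shows the raters are consistent with one another; establishing accuracy still requires comparison against a gold-standard set or domain-expert review.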
The discussion of theory is heavily tied up with the issue of model interpretability. Understanding why the AI produced a certain decision can be critical to the utility of the model. But it’s also ultimately critical to training AI models if we want them to achieve higher levels of performance and complexity. A systematic, theory-driven approach to defining the relevant features underlying an accurate decision is ultimately where we need to be. Feature engineering is one of the main ways in machine learning to increase the accuracy of a predictive model. It involves providing an algorithm with added information by transforming the input data. Often the features don’t exist in raw form and need to be derived. Knowing which features to derive, and in what form, rests on theory, domain expertise, and our understanding of the world. Thus, improving data quality involves not only improving the targets/labels but also improving the nature of the input data.
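As a small illustration, here is a sketch of theory-informed feature engineering in pandas. The dataset and column names are hypothetical; the point is that the derived features encode domain knowledge the raw columns don’t express directly.

```python
import pandas as pd

# Hypothetical raw transaction data
df = pd.DataFrame({
    "amount": [25.0, 980.0, 15.5, 430.0],
    "account_balance": [500.0, 1000.0, 20.0, 5000.0],
    "timestamp": pd.to_datetime([
        "2023-01-03 09:15", "2023-01-03 02:40",
        "2023-01-04 13:05", "2023-01-07 23:55",
    ]),
})

# Domain theory: fraud risk relates to spend relative to balance,
# and to transactions at unusual hours -- not to raw amounts alone.
df["amount_to_balance"] = df["amount"] / df["account_balance"]
df["is_night"] = df["timestamp"].dt.hour.isin(range(0, 6)).astype(int)
df["day_of_week"] = df["timestamp"].dt.dayofweek  # 0 = Monday

print(df[["amount_to_balance", "is_night", "day_of_week"]])
```

An algorithm given only the raw columns would have to rediscover these relationships on its own, if it could at all; the derived features hand it the theory directly.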
Where the interpretability issue gets interesting is the question of what needs to be known about the internal workings of AI systems themselves. Can we continue to advance these systems beyond the simple emulation of behavior without understanding what’s going on inside? This is akin to reductionistic behaviorism – a largely dismissed school of psychology that treats humans as a black box. The behaviorist perspective is to study inputs and outputs (data), and training AI systems (and rats) is viewed as a matter of operant conditioning involving positive and/or negative reinforcement. In contrast, the constructivist perspective is that learners build knowledge schemas, and that intelligence involves actively incorporating new information and creating new models in problem-solving. The fields of cognitive science and computer science still have a lot to learn from one another.
In the “new” data-centric world of AI/ML, domain experts tag and annotate the data, while data engineers use analysis programs to identify errors for the purpose of improving the data. Data engineers continue to be looked to for training data collection, selection, and augmentation, as well as for fixing inaccuracies and inconsistencies. In that context, let’s make sure our data engineers have a very good sense of the business they support, operate on well-integrated teams that include domain experts, and engage regularly with users.
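One common error-finding tactic, sketched below with scikit-learn, is to flag cases whose out-of-fold predicted probability for their assigned label is low and route them back to domain experts for re-review. This is an illustration under assumed conditions, not a full pipeline: the synthetic dataset stands in for real labelled data, and the 0.2 cutoff is an arbitrary assumption to tune in practice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Hypothetical labelled dataset; X is features, y is (possibly noisy) labels
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Out-of-fold probabilities: each case is scored by a model
# that never saw it during training
probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y,
    cv=5, method="predict_proba",
)

# Probability the model assigns to each case's *given* label
given_label_prob = probs[np.arange(len(y)), y]

# Flag likely label errors for expert re-review (0.2 is an arbitrary cutoff)
suspect = np.where(given_label_prob < 0.2)[0]
print(f"{len(suspect)} cases flagged for re-review:", suspect[:10])
```

The division of labor matters here: the program surfaces the suspect cases cheaply, but deciding whether a flagged label is actually wrong is exactly where the domain experts on the team earn their keep.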