AI should be built on rigorous knowledge…
Ali Rahimi 
Note: This is a follow-up to an earlier article on causal machine learning, “AI Needs More Why”.
There’s much to be excited about with artificial intelligence (AI) in healthcare: Google AI is improving the workflow of clinicians with predictive models for diabetic retinopathy, many new approaches are achieving expert-level performance in tasks such as classification of skin cancer, and others are surpassing the capabilities of doctors — notably the recent report of DeepMind’s AI for predicting acute kidney injury, capable of detecting potentially fatal kidney injuries 48 hours before symptoms are recognized by doctors.
Yet medical practitioners and researchers at the intersection of machine learning (ML) and medicine are quick to point out that these successes are not representative of the more nuanced, non-trivial challenges presented by medical research and clinical applications. These ML success stories (notably, all deep learning) are disease prediction problems: learning patterns that map well-defined inputs to well-labeled outputs.
Domains where instinctive pattern recognition works powerfully are what psychologist Robin Hogarth termed “kind learning environments”. Patterns repeat over and over, and feedback is usually rapid and accurate. Exemplary domains are chess and Go, where pieces are moved in a discrete sequence with defined rules and boundaries. AI has dominated these domains: Deep Blue defeated world chess champion Garry Kasparov in 1997, and AlphaGo beat Go champion Lee Sedol in 2016.
Kind learning environments are where AI in medicine has shown successes. Datasets are relatively structured and isolated, and tasks are clear and well-defined. Even so, these domains are too difficult and complex for standard statistical methods. AI muscle (read: deep learning) is able to parse data for structure and patterns better than human experts ever could. Models can effectively address questions like what is the likelihood that this patient will die within six months?
Yet not all medical domains are so kind. The majority of medical applications ask questions like what are comorbid conditions that could complicate this treatment? and what would’ve happened if the patient had taken drug Y instead of X?
Hogarth calls these domains “wicked”. Herein lie the thorny real-world problems in medicine. The rules are often unclear or incomplete. Feedback is often delayed and inaccurate. There may or may not be repetitive patterns, and they may not be learnable. In the most wicked of environments, experience will reward the exact wrong actions. Here’s a fun example:
In a boat-racing game used by OpenAI as a training environment, the deep reinforcement learning agent finds an isolated corner where it can repeatedly knock over several targets, timed perfectly to accumulate points as the targets repopulate. Despite repeatedly catching on fire, crashing into other boats, and going the wrong way on the track, the agent scores higher with this strategy than is possible by completing the course in normal fashion.
Less fun is when rewarding the exact wrong action happens in the real world: a risk-estimation model for patients hospitalized with pneumonia, trained on real-world data, learned that asthmatics are less likely to die from pneumonia. Unbeknownst to the data-centric model, something underlying the dataset was causing erroneous associations: the researchers traced the strange result back to an existing policy under which asthmatics with pneumonia were admitted directly to the intensive care unit (ICU), where they received more aggressive treatment and were thus less likely to die than patients not given the same attention. The model learned that asthmatics should not be recommended to the ICU!
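A toy simulation makes the trap concrete (all risk numbers here are invented for illustration, not taken from the Caruana et al. study): asthma genuinely raises pneumonia mortality, but a policy of routing asthmatics straight to the ICU lowers their observed death rate, so a model fit to the recorded outcomes learns the association backwards.

```python
import random

random.seed(0)

def simulate(n=100_000, icu_policy=True):
    """Simulate pneumonia patients. Asthma raises true mortality risk,
    but the hospital policy sends asthmatics straight to the ICU,
    where aggressive care more than offsets that extra risk."""
    deaths = {True: 0, False: 0}
    counts = {True: 0, False: 0}
    for _ in range(n):
        asthma = random.random() < 0.2
        risk = 0.15 + (0.10 if asthma else 0.0)  # asthma is truly riskier
        if icu_policy and asthma:                # the hidden policy
            risk -= 0.12                         # aggressive ICU care helps
        counts[asthma] += 1
        deaths[asthma] += random.random() < risk
    return {k: deaths[k] / counts[k] for k in deaths}

observed = simulate(icu_policy=True)    # what the dataset records
no_policy = simulate(icu_policy=False)  # the world without the policy

# In the recorded data, asthmatics die *less* often: the association a
# naive supervised model would learn and turn into bad ICU advice.
print(observed[True] < observed[False])    # True
# Remove the policy and the true risk ordering reappears.
print(no_policy[True] > no_policy[False])  # True
```

The observed data and the policy-free data disagree about who is high-risk, which is exactly why a purely pattern-matching model cannot be trusted to recommend interventions.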
In general, learned risk estimates are highly susceptible to providers’ practice patterns. David Sontag, a lead researcher in causal machine learning for medicine, takes care to point out that an unstructured model learning from clinical data can only hope to do as well as the doctors, who can be an unreliable source of information and often make poor decisions. Calling a model “unstructured” in this sense means it lacks explicit structure, encoded by the engineer or scientist, representing causal links between variables.
Concrete causal dilemmas with cholesterol
Examples of Simpson’s paradox can illuminate the effects of causal variables in medicine; consider a study of exercise effects on cholesterol across age groups. The paradox characterizes a reversal or cancellation of a global association between two variables when conditioned upon a third: here, the association between exercise and cholesterol reverses when conditioned on age. Sure, this is an obvious example, but it elucidates the power of “confounders” in healthcare. In causal machine learning, by writing the causal graph or expressing causal logic via do-calculus, the scientist or engineer explicitly models these variables, or makes assumptions that constrain the power of the resulting models. Without these causal formalisms, hidden confounders are left unchecked.
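The reversal is easy to reproduce in a few lines. This sketch uses invented numbers: within each age group, exercise lowers cholesterol, but older people both exercise more and start from a higher cholesterol baseline, so the pooled correlation flips sign.

```python
import random

random.seed(0)

# Toy dataset: (is_old, weekly exercise hours, cholesterol).
rows = []
for _ in range(5000):
    old = random.random() < 0.5
    exercise = random.gauss(6 if old else 2, 1)   # older people exercise more
    # Higher baseline for the old; within each group, exercise lowers it.
    chol = (240 if old else 180) - 3 * exercise + random.gauss(0, 5)
    rows.append((old, exercise, chol))

def corr(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    xs, ys = zip(*pairs)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

pooled = corr([(e, c) for _, e, c in rows])
by_age = [corr([(e, c) for o, e, c in rows if o == old])
          for old in (False, True)]

print(pooled > 0)                  # pooled: exercise "raises" cholesterol
print(all(r < 0 for r in by_age))  # within each age group: it lowers it
```

Condition on age and the true (negative) relationship reappears, which is precisely what writing age into the causal graph accomplishes.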
Consider this study of cholesterol mediation on a multiple sclerosis drug (by Eshaghi et al.):
In this multiple sclerosis case study, the authors applied structural equation models to data from randomized controlled trials (RCTs) to investigate the causal associations underlying treatment effects. They specifically modeled whether cholesterol, hypothesized to have associative or causal effects across central nervous system disorders, confounds the observed drug effects. TL;DR: the “results suggest that beneficial effects of simvastatin on reducing the rate of brain atrophy and slowing the deterioration of disability are independent of serum cholesterol reduction. Our work demonstrates that structural models can elucidate the statistical pathways underlying treatment effects in clinical trials of poorly understood neurodegenerative disorders, such as progressive multiple sclerosis”.
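The flavor of that analysis can be sketched with a toy mediation check. All effect sizes below are invented, and a plain regression adjustment stands in for the paper’s full structural equation models: simulate a drug that lowers both cholesterol and brain atrophy, with no pathway from cholesterol to atrophy, and verify that the estimated treatment effect on atrophy survives adjustment for cholesterol.

```python
import random

random.seed(0)

n = 20_000
treat = [random.random() < 0.5 for _ in range(n)]
# The drug lowers serum cholesterol (the candidate mediator)...
chol = [5.5 - (1.0 if t else 0.0) + random.gauss(0, 0.5) for t in treat]
# ...but atrophy here depends on treatment directly, not on cholesterol.
atrophy = [1.2 - (0.4 if t else 0.0) + random.gauss(0, 0.3) for t in treat]

def slope(x, y):
    """OLS slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

def resid(x, y):
    """Residuals of y after regressing out x."""
    b, mx, my = slope(x, y), sum(x) / len(x), sum(y) / len(y)
    return [yi - my - b * (xi - mx) for xi, yi in zip(x, y)]

t = [float(v) for v in treat]
total = slope(t, atrophy)  # unadjusted treatment effect on atrophy
# Frisch-Waugh: the treatment coefficient after controlling for cholesterol.
direct = slope(resid(chol, t), resid(chol, atrophy))

# The effect barely moves after adjustment: not mediated by cholesterol.
print(abs(direct - total) < 0.05)  # True
```

Had atrophy depended on cholesterol rather than on treatment directly, the adjusted coefficient would collapse toward zero, flagging the cholesterol pathway as the mediator.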
In neurological diseases it is common for clinical trials to use outcome measures that don’t directly relate to the mechanism of action of the medication — e.g., using simple cognitive and motor assessments as clinical endpoints in Alzheimer’s trials. Without modeling the causal structures representing pathological mechanisms, the outcome measures of clinical trials can be insignificant or misleading.
Causal AI can be wicked smart
Causal inference *is* a part of AI and machine learning. And not surprisingly, some of the best research in causal inference & ML is being done by researchers in medical AI such as @suchisaria.
— Thomas G. Dietterich (@tdietterich) May 4, 2019
One powerful example of Professor Saria’s work is in developing reliable decision-support algorithms with counterfactuals. Counterfactual reasoning is the task of estimating the disease course or outcome under different scenarios, where only one (or none) of the scenarios is actually observed. Decision makers faced with questions like is this patient likely to die if I do not intervene? or what if I give this patient the red pill vs. the blue pill? may look to predictive ML models for answers. Yet as in the ICU example earlier, these supervised learning algorithms are highly sensitive to the policy used to choose actions in the training data. Causal inference can be leveraged to reason explicitly about the actions and effects underlying observational data. Saria and colleague Peter Schulam accomplish this by encoding causal reasoning into the learning process (more specifically, encoding the problem in a potential outcomes framework to obtain cause-effect estimates from observed data, and training on counterfactual learning objectives); see the paper for details. The result is a safer, more reliable decision support system developed with the tools of causal inference.
More great work in this direction comes from Professor Mihaela van der Schaar’s group at the Turing Institute. Critically, the work from her lab goes beyond developing novel AI techniques on medical data: a priority is placed on developing ML systems that integrate into clinical workflows, address real-world pain points, and emphasize interpretability (i.e., the implementation challenge for AI in medicine).
Embracing the machine
Beyond clinical trial analytics and augmenting physician workflows, a driving goal for AI engineers and scientists is to bring a transformative technology to healthcare: in silico trials. The combination of real-world data and the right computational tools can help evaluate new treatments when randomization to placebo for clinical trials may be impossible, impractical, or unethical. Straight from the FDA:
Analyses of RWD [real world data], using appropriate methods, may in some cases provide similar information with comparable or even superior characteristics to information collected and analyzed through a traditional clinical trial.
What are the right computational tools and appropriate methods? RCTs (despite their flaws) are considered the gold standard for evaluating the performance of new medical therapies because of their scientific rigor. Replacing RCTs calls for an AI platform that maintains or surpasses that rigor: models and algorithms that explain the cause-and-effect relationships underlying clinical and real-world data, explaining the “why” in a transparent way. Advances in causal machine learning could yield trials conducted entirely within the confines of a computer.
If we can codify it and pass it to computers, they will do it better.
Famous for falling to IBM’s Deep Blue, chess grandmaster Garry Kasparov alludes to the prowess of AI beyond chess and games. Indeed, the healthcare ecosystem is gradually embracing the capabilities of AI. But it is crucial for machine learning scientists to “codify” with the language of causality, lest the wicked problems in medicine remain unsolved.
Notes & references
 “… not on alchemy,” the rest of the quote goes, from Ali Rahimi’s talk at the 2017 Conference on Neural Information Processing Systems (NeurIPS). Receiving the Test of Time award, his talk on the newfound lack of scientific rigor in deep learning was engaging and exciting. See the post “Reflections on Random Kitchen Sinks” by Rahimi and Ben Recht, and the talk video is here.
FWIW here’s a version more relevant to the material discussed here:
AI should be built atop rigorous science. – Alexander Lavin
 Google AI blog post (with papers linked): “Improving the Effectiveness of Diabetic Retinopathy Models”
 Esteva, Andre et al. “Dermatologist-level classification of skin cancer with deep neural networks.” Nature 542 (2017): 115-118.
Tomašev, Nenad et al. “A clinically applicable approach to continuous prediction of future acute kidney injury.” Nature 572 (2019): 116-119.
 Not to belittle the accomplishments. These are novel data-driven solutions to challenging problems, with real clinical impact. Professor Suchi Saria and colleagues present good discussion in “Better medicine through machine learning: What’s real, and what’s artificial?”.
 Hogarth, Robin M. et al. “The Two Settings of Kind and Wicked Learning Environments.” (2015).
 OpenAI blog post “Faulty Reward Functions in the Wild”. FWIW this is a comical example, and OpenAI has some of the more robust and rigorous development in AI research.
 Caruana, Rich et al. “Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission.” 2015 ACM SIGKDD Int’l Conference on Knowledge Discovery and Data Mining (KDD ’15). 1721-1730.
 David Sontag and Fredrik Johansson lecture on “AI for health needs causality” at the Broad Institute in 2018.
 The do-calculus is a formalism for causal logic by Judea Pearl. It’s beyond the scope of this article, so the curious, ML-minded reader may dive deeper with Ferenc Huszar’s series of posts on causal inference.
 Eshaghi, Arman et al. “Applying causal models to explore the mechanism of action of simvastatin in progressive multiple sclerosis.” Proceedings of the National Academy of Sciences of the United States of America (2019).
 Schulam, Peter and Suchi Saria. “What-If Reasoning with Counterfactual Gaussian Processes.” ArXiv abs/1703.10651 (2017)
 See this recent lecture of her work: “Turing Lecture: Transforming medicine through AI-enabled healthcare pathways”. And dive in to more projects and papers at her lab’s website: ML-AIM.
 Discussed by David Shaywitz in “Winning Health Tech Entrepreneurs Will Focus On Implementation, Not Fetishize Invention”
 Another great write-up by David Shaywitz “Will Real World Performance Replace RCTs As Healthcare’s Most Important Standard?”
 See the interesting piece by medical historian Laura Bothwell and colleagues, “Assessing the Gold Standard — Lessons from the History of RCTs”. And check out some enjoyable commentary from Derek Lowe, “Making Excuses, the Modern Way”.