1  The Golem of Prague

“Animated by truth, but lacking free will, a golem always does exactly what it is told.” (1)

1.1 Models are robots

Statistical models are (or can be) like golems in the sense that they are concerned with modeling the world (the truth), but they have no sense of morality on their own. They only do what they are told, regardless of consequences, implications, values, trade-offs, etc. It is up to us to direct the models towards what is useful, and to use them in a way that makes sense–through logic, reasoning, common sense, etc. Therefore, we must separate the mathematical procedure of statistical modeling from the interpretation or implication that it carries.

“Sometimes their underlying logic reveals implications previously hidden to their designers. These implications can be priceless discoveries. Or they may produce silly and dangerous behavior…there is no wisdom in the golem. It doesn’t discern when the context is inappropriate for its answers. It just knows its own procedures, nothing else. It just does as it’s told.” (2)

It’s often the case that statistics is taught straight from this golem point of view: people treat statistics as an objective, deterministic checklist for doing science. So they follow diagrams (like the test decision tree shown on page 2) to choose the “right” test, without considering the consequences of the golem, which can only be appreciated through “wisdom”.

“So students and many scientists tend to use charts like [the testing decision tree] without much thought to their underlying structure, without much awareness of the models that each procedure embodies, and without any framework to help them make the inevitable compromises required by real research. It’s not their fault.” (3)

McElreath compares this to plumbers doing their jobs without knowing fluid dynamics, implying it can be “fine” in certain contexts. The trouble comes when the user steps outside of the intended use by extrapolating: “It’s as if we got our hydraulic engineers by promoting plumbers” (3). Now I get the point there, but I think there probably is inherent knowledge that plumbers could contribute to the hydraulic engineering process, which seems to be an afterthought here.

1.1.1 Good to play in everyone’s yard?

One common perk of statistics is that you can “play in everyone’s backyard”. It’s cool. But I’ve also more recently thought this to be a somewhat negative thing for producing truly useful statistical results. McElreath hints at that here as well, first in describing the inflexibility and fragility of classical statistical methods and how they are used so frequently, yet are rarely truly appropriate. More pointedly, he makes the case:

“The point is that classical tools are not diverse enough to handle many common research questions. Every active area of science contends with unique difficulties of measurement and interpretation, converses with idiosyncratic theories in a dialect barely understood by other scientists from other tribes. Statistical experts outside the discipline can help, but they are limited by lack of fluency in the empirical and theoretical concerns of the discipline.” (3)

This is exactly right. Domain knowledge and intuition are king, and oftentimes statisticians (myself included) don’t truly have those things when approached for statistical assistance, limiting their ability to have optimal impact (though you may still be able to get the job done, if the goal is, say, to just get a publication). I always say that a domain expert with some statistical chops is more effective than someone with tons of technical skills. That’s just the nature of it, so it’s about finding ways to collaborate with synergy to bring out what is needed from both sides.

1.2 A new way to think

He sets the stage for the book’s thesis quite clearly:

“So there are benefits in rethinking statistical inference as a set of strategies, instead of a set of pre-made tools.” (4)

It’s about statistical practice, of course, but also about how we can teach statistics differently, even at an introductory level, so it’s not just based on objective, pre-made tools. This would be glorious. Instead of teaching students to systematically look up z-scores in the back of the textbook and write out “significance” statements, teach them to solve problems with their brain. I’ve thought one simple way to start would be to use varying (or random) confidence/significance levels on different problems, which at the very least would elicit some questioning and skepticism about what those numbers mean.

1.2.1 Flaws in falsifiability

He discusses Karl Popper and the perceived objective of science being to falsify hypotheses. I agree that in principle this makes sense: there’s never really a right answer, but we can move closer and closer to the truth by finding evidence of what it’s not. But I think one thing that’s overlooked is how this plays out in the real world (and what I deduce as an undertone from McElreath): the null hypotheses that we test against are arbitrary and blindly applied. We test default hypotheses that we don’t even believe to be true ourselves (i.e., null effects). It’s just part of the statistical procedure (on page 5, in the purple side box, he actually goes into this explicitly). So I think this is a component of it, but he then goes into other reasons why falsification is flawed:

  • Hypotheses are not models: This gets at the idea that each hypothesis tested is treated as its own thing that moves us in some direction, when these are often just different subcomponents of the same phenomenon. Consider a model of the world for something: the angles from which it is studied (e.g., by different researchers), what is actually tested, how things are measured, etc. There is so much variable overlap that it’s impossible to make any conclusive statements about the true model that we believe exists.

  • Measurement matters: This is a huge point. Our study may suggest something, but if a different group of people conducts the same study, there will be many nuances that likely differ. These things can always be critiqued, and they cause real inconsistencies and variation–and rightly so. “They don’t trust the data. Sometimes they are right”. To me, this is one of the major flaws in current academic research when trying to draw implications from an individual study. There are literally infinite ways to conduct a study on the same topic.

Models are always wrong, as he states, so how can we falsify one when we know whatever we choose next is wrong too? To me, this is a segue into Bayesian thinking. Instead of just looking at hypotheses in the traditional sense, we want to know what is most likely to have created the data (out of all possibilities).

“These hypotheses have vague boundaries, because they begin as verbal conjectures, not precise models. There are hundreds of possible potential processes that [describe the null hypothesis].” (5)

1.3 Levels of hypotheses

Digging further into the “hypotheses are not models” point, he describes layers to framing the model:

  1. Hypotheses: These are the vague things stated above
  2. Process models: These are more specific possibilities for the actual data-generating processes (and causal structure) that could fit into those vague bins: the conceptual, time-dependent (i.e., cause-effect ordering) causal structures of what may explain a phenomenon.
  3. Statistical models: We need to translate the causal model into some (associative) statistical model to be able to estimate things. The problem is that these models can be a result of multiple process models, even from different hypotheses.

A key point here is that the statistical model is what the statistical procedure and its estimates actually represent, but that representation can be identical across different data-generating processes, because the same data and predictions can be generated by completely different processes. Discussing typical hypothesis testing:

“If we reject the null, we can’t really conclude that selection matters, because there are other neutral models that predict different distributions of alleles. And if we fail to reject the null, we can’t really conclude that evolution is neutral, because some selection models expect the same frequency distribution.” (7)
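
To make that ambiguity concrete, here is a small Python sketch I put together (a toy example of my own, not from the book) in which two entirely different generative processes produce data that a distribution-level test cannot tell apart:

```python
# Toy illustration (not from the book): two different generative
# processes whose outputs look the same to a distribution-level test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
n = 10_000

# Process A: draw directly from a standard normal distribution.
process_a = rng.normal(0.0, 1.0, size=n)

# Process B: sum twelve independent Uniform(-0.5, 0.5) draws, which by
# the central limit theorem is also approximately standard normal.
process_b = rng.uniform(-0.5, 0.5, size=(n, 12)).sum(axis=1)

# A two-sample Kolmogorov-Smirnov test typically finds no meaningful
# difference, even though the mechanisms are entirely different.
stat, p_value = ks_2samp(process_a, process_b)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3f}")
```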

1.3.1 What you should think about

“If it turns out that all of the process models of interest make very similar predictions, then you know to search for a different description of the evidence, a description under which the processes look different.” (7)

This means that the way we represent, estimate, or quantify the process should be changed, because the current description leads to ambiguity. So find different metrics, different representations, etc. Also, focus on the data-generating process:

“Process models allow us to design statistical models with [e.g., unobserved variables and sampling bias] in mind. The statistical model alone is not enough.” (7)

1.4 Inching towards Bayes

The first part of section 1.2.2 (and everything leading up to it) hints at a Bayesian point of view.

“In contrast, finding D tells us nothing certain about H, because other hypotheses might also predict D.” (7)

The idea is that many states of the world could very conceivably produce some observation, but that doesn’t mean any given state of the world is therefore likely to be the one that did. This is very much related to the fallacy of the transposed conditional, which is also a central point of one of my favorite books, The Cult of Statistical Significance, and something I wrote about in a prior blog post.
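
A quick numerical sketch of that fallacy (my own toy numbers, not the book’s): a test that is very likely to come back positive given the hypothesis can still leave the hypothesis unlikely given a positive result.

```python
# Fallacy of the transposed conditional (toy numbers):
# P(data | hypothesis) is not the same thing as P(hypothesis | data).
p_disease = 0.001            # prior probability of the hypothesis
p_pos_given_disease = 0.99   # P(data | hypothesis): test sensitivity
p_pos_given_healthy = 0.05   # false-positive rate

# Bayes' rule for P(disease | positive).
p_positive = (p_pos_given_disease * p_disease
              + p_pos_given_healthy * (1 - p_disease))
p_disease_given_pos = p_pos_given_disease * p_disease / p_positive

print(f"P(positive | disease) = {p_pos_given_disease:.2f}")
print(f"P(disease | positive) = {p_disease_given_pos:.3f}")  # about 0.019
```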

1.4.1 Measuring things

The “all swans are white” null example is useful as a reference point to compare with other important scientific hypotheses we are truly after. In this case, with such a “big” statement (applying to ALL swans), you just need to see one black one and it’s over and done with. But the problem, at least in part, is:

  1. Not all scientific inquiries are this simple, objective, and clear
  2. Not all measurements about the thing are clear either.

For example, what if the “black” swan you saw was actually a white swan that had a bad encounter with black paint or tar? Would you know? Maybe in that circumstance you would, but across all scientific inquiries, these nuances and possibilities for measurement error, bias, etc. are infinite.

“At the edges of scientific knowledge, the ability to measure a hypothetical phenomenon is often in question as much as the phenomenon itself.” (8)

The example about neutrinos and the speed of light is a really great, intuitive example of the complexities and biases inherent to measurement. There are people’s preconceived notions about what level of evidence is sufficient, mixed with the phenomenon being measured, the conditions under which it is measured, and the fact that, despite how “objective” we think a measurement device is, it always gives an estimate of the real thing. Great section on page 8.

“But the probabilistic nature of evidence rarely appears when practicing scientists discuss the philosophy and practice of falsification. My reading of the history of science is that these sorts of measurement problems are the norm, not the exception.” (9)

In discussing more opaque hypotheses,

“Now, nearly everyone agrees that it is a good practice to design experiments and observations that can differentiate competing hypotheses. But in many cases, the comparison must be probabilistic, a matter of degree, not kind.” (9)

Both of these hint at Bayesian thinking.

The last paragraph in section 1.2 is very important:

“But falsification is always consensual, not logical. In light of real problems of measurement error and the continuous nature of natural phenomena, scientific communities argue towards consensus about the meaning of evidence.” (9)

It’s worth reading the rest of it. But the point is that we cannot objectively and logically falsify something completely. It’s more that, based on the evidence, a community of people comes to a consensus that it is false. And the last sentence is key: “And it may hurt the public, by exaggerating the definitiveness of scientific knowledge.”

1.5 Model the world

In order to move beyond falsification (and beyond conforming pre-existing golems to every situation), we are to re-think everything in terms of models. Instead of applying a statistical test, we consider the model for how the data was generated–the data-generating process.

“But if you are a good golem engineer, at least you’ll notice the destruction.” (10)

The point of this quote is that by focusing on the model, you focus on mechanism. You aren’t using the golem (statistical test/model) as a brute-force, objective procedure to “apply” to your data; rather, you tailor a unique golem for the specific purpose of answering your research question. Thus it has to do with constructing the right model, informed by common sense, domain knowledge, causal pathways, etc., so that the interpretation side (which the golem doesn’t have) is better positioned to answer your question. This means you have to dig deeper than is traditionally done, because this is how true science is done. You have to think. You have to know your model. You have to reason. None of this is about mechanistically, thoughtlessly throwing data into modeling software, getting a value, following some rule, and writing a blanket statement about the results. Basically, you need to feel the result in your bones.

1.6 The tools to get there

The book’s primary focus is causal modeling through Bayesian statistics. An important thing to note is that he doesn’t use Bayesian modeling for ideological reasons, per se; it’s more that it just happens to be the framework that makes sense for modeling from a generative perspective.

1.6.1 Bayesian vs. frequentist

He simplifies Bayesian statistics as a way to count the ways things could happen, and then see what is most likely. And that’s a good way to think about it: we just want to know what is the most likely explanation for generating our observation, not how likely our data is under some hypothetical truth.
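
Here is a minimal sketch of that counting logic (my own construction, borrowing the familiar marbles-in-a-bag setup): a bag holds four marbles, each blue or white, and we observe draws of blue, white, blue with replacement. We count the ways each possible composition of the bag could have produced that sequence.

```python
# Count the ways each conjecture (bag composition) could produce the
# observed draws; more ways = more plausible.
observed = ["blue", "white", "blue"]

for n_blue in range(5):          # candidate compositions of a 4-marble bag
    n_white = 4 - n_blue
    ways = 1
    for draw in observed:        # ways multiply across independent draws
        ways *= n_blue if draw == "blue" else n_white
    print(f"{n_blue} blue / {n_white} white: {ways} ways")
```

Normalizing those counts (0, 3, 8, 9, 0 out of 20) turns them into plausibilities, which is essentially all a posterior is in this simple setting.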

It’s also interesting that he discusses frequentist statistics as a special case of Bayesian probability. He emphasizes that frequentism is based on the idea of imaginary resampling of data, and points out, deservedly so, that while this may be a reasonable way to think about a controlled lab experiment where you could conceivably keep sampling over and over, for many other things it just doesn’t make sense conceptually.

“This resampling is never done, and in general it doesn’t even make sense–it is absurd to consider repeat sampling of the diversification of song birds in the Andes. As Sir Ronald Fisher, one of the most important frequentist statisticians of the twentieth century, put it: ‘the only populations that can be referred to in a test of significance have no objective reality, being exclusively the product of the statistician’s imagination.’” (11)

Quite the interesting closing paragraph of page 11, stating that “Bayesian golems treat ‘randomness’ as a property of information, not of the world.”

“Nothing in the real world–excepting controversial interpretations of quantum physics–is actually random. Presumably, if we had more information, we could exactly predict everything. We just use randomness to describe our uncertainty in the face of incomplete knowledge. From the perspective of our golem, the coin toss is ‘random’, but it’s really the golem that is random, not the coin.” (11)

He acknowledges the quantum point here. But that is precisely one of the main objections to the idea of “everything being predictable”. This is a potentially useful way to view a lot of things that are studied, because it makes a lot of sense, but probability fields in quantum physics seem to point to something deeper at a more fundamental level. This is explicitly discussed in the book, Light of the Mind, Light of the World: Illuminating Science Through Faith. It’s a central theme in the book, but Chapter 3 in particular discusses the “small gods” as being those who view the world from this mechanical philosophy: “a conviction that the world was made of bodies, whose movements and collisions were as sharply regulated as those displayed by an intricately calibrated machine…if they could be described using numbers alone, atoms might reveal the truth underlying all things” (page 69).

1.6.2 Bayesianing, not Bayesianism

He further solidifies his stance that he’s not using Bayesian modeling on ideological grounds.

“Note that the preceding description doesn’t invoke anyone’s ‘beliefs’ or subjective opinions. Bayesian data analysis is just a logical procedure for processing information. There is a tradition of using this procedure as a normative description of rational belief, a tradition called Bayesianism. But this book neither describes nor advocates it. In fact, I’ll argue that no statistical approach, Bayesian or otherwise, is by itself sufficient.” (12)

He then goes on to describe the advantage of Bayesian analysis–it is more intuitive. People can grasp the practical implications more easily, and it just makes sense. Otherwise you are just trying to remember some steps or criteria or what boilerplate sentence to write about a p-value, without actually thinking through what it means. The interesting case he makes is how often a p-value (or confidence interval) is interpreted like a posterior probability (go back up the page, this is related to the fallacy of the transposed conditional as well), but the reverse is rarely done.

“The opposite pattern of mistake–interpreting a posterior probability as a p-value–seems to happen only rarely. None of this ensures Bayesian analysis will be more correct…the scientist’s intuition will less commonly be at odds with actual logic of the framework.” (12)

1.7 Centering the model

As I’ve been learning about statistical history, philosophy, and Bayesian thinking, one huge differentiator I’ve realized is the basic idea that in the Bayesian setting the model becomes the center of attention, and the data is just there to supplement it. To me, this changes the entire view of the problem. I’ve talked about this before in a prior blog post and a supporting video.

I bring this up because the first sentence of section 1.3.2, although stated casually, is very profound:

“Bayesian data analysis provides a way for models to learn from data.” (13)

The emphasis is that our model can learn from data, meaning our model is the persistent object: a living, breathing description of the world. We only use data, as it comes and goes, to help inform that model as needed.
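
As a minimal sketch of what “learning from data” looks like mechanically (my example, a simple grid approximation over a coin’s bias, not code from the book):

```python
# Grid-approximation Bayesian updating: the model (a distribution over the
# coin's bias p) is revised as each new observation arrives.
import numpy as np

grid = np.linspace(0.0, 1.0, 101)    # candidate values of p
posterior = np.ones_like(grid)       # flat prior
posterior /= posterior.sum()

observations = [1, 0, 1, 1, 1, 0, 1]  # 1 = heads, 0 = tails

for obs in observations:
    likelihood = grid if obs == 1 else 1.0 - grid
    posterior = posterior * likelihood   # Bayes' rule, unnormalized
    posterior /= posterior.sum()         # renormalize

# After 5 heads and 2 tails the posterior mean is about 0.67.
print(f"posterior mean of p: {np.average(grid, weights=posterior):.3f}")
```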

1.7.1 Beyond accuracy

He talks about the concept of overfitting, the need for cross-validation, etc. But one related attribute of Bayesian thinking is the fact that multiple models can produce the same predictions, and choosing the simpler one is in large part what Bayesian thinking does. So beyond getting good predictions, we’re choosing the model/mechanism that more likely generated the data. There is a fascinating paper related to this topic that discusses how Bayesian modeling is essentially the realization of Occam’s Razor–definitely worth the read.
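
A toy illustration of that overfitting risk (my example, not the book’s): a flexible polynomial fits the training sample more tightly than a line, yet usually predicts held-out points worse.

```python
# Overfitting demo: better training fit, usually worse out-of-sample fit.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = 2.0 * x + rng.normal(0, 0.3, size=x.size)   # the true process is linear

x_train, y_train = x[::2], y[::2]   # even-indexed points used for fitting
x_test, y_test = x[1::2], y[1::2]   # odd-indexed points held out

for degree in (1, 9):
    coefs = np.polyfit(x_train, y_train, deg=degree)
    rmse_train = np.sqrt(np.mean((np.polyval(coefs, x_train) - y_train) ** 2))
    rmse_test = np.sqrt(np.mean((np.polyval(coefs, x_test) - y_test) ** 2))
    print(f"degree {degree}: train RMSE {rmse_train:.2f}, test RMSE {rmse_test:.2f}")
```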

1.7.2 A multilevel approach

He implicitly describes multilevel models as a key concept in this book. Yes, these are the models we think of: individual students in classrooms, classrooms in schools, schools in districts, etc. But the effect of using these models is profound:

“Cross-validation and information criteria measure overfitting risk and help us to recognize it. Multilevel models actually do something about it.” (14)

Partial pooling is what allows us to structure models and share information where it is necessary, while preserving variation and other important components of the data-generating process. In this sense, he extends this concept more generally to how we can think about multilevel modeling, in the context of missing data, factor analysis, etc. The concept that “each parameter can be thought of as a placeholder for a missing model” makes you realize the hierarchical structure of every modeling process.
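
Here is a stripped-down numeric sketch of partial pooling (my own toy version with assumed variance components, not the book’s multilevel machinery): each group’s mean is pulled toward the grand mean, and groups with less data are pulled harder.

```python
# Partial pooling, reduced to its core: shrink each group's estimate toward
# the grand mean in proportion to how noisy that group's own estimate is.
import numpy as np

group_means = np.array([72.0, 55.0, 81.0, 60.0])   # observed group means
group_sizes = np.array([50, 5, 8, 100])             # observations per group
sigma_within = 15.0   # assumed within-group SD
tau_between = 5.0     # assumed between-group SD (estimated in a real model)

grand_mean = np.average(group_means, weights=group_sizes)

# Weight on each group's own mean: high when the group has plenty of data.
weight = tau_between**2 / (tau_between**2 + sigma_within**2 / group_sizes)
pooled = weight * group_means + (1 - weight) * grand_mean

for n, raw, est in zip(group_sizes, group_means, pooled):
    print(f"n = {n:3d}: raw mean {raw:5.1f} -> partially pooled {est:5.1f}")
```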

He then makes a big claim: “multilevel regression deserves to be the default form of regression”, stating that papers should have to justify not doing this, rather than the reverse, as is current practice. Intuitively, I think this makes sense. Part of the reason this is easy to critique in academic papers is that there are a lot of things worth critiquing, stemming from the cult of statistical significance, poor publication standards, reliance on “positive” p-values, and an overall lack of accountability for anything that happens with the results beyond them being printed on a page, leaving open an infinite number of ways to “do” a study and get it published…but I digress.

“Fitting and interpreting multilevel models can be considerably harder than fitting and interpreting a traditional regression model.” (15)

Exactly. It’s harder. And there’s no incentive to do it, because they’ll still get the funding and publication, so same end result.

1.8 Mechanism and causality

“Facts outside the data are needed to decide which explanation is correct.” (16)

He makes a point here similar to the one described above: models can have similar predictions but not be the true causal mechanism. This has to do with overfitting (think the geocentric vs. heliocentric models), and with inferring what makes sense logically regardless of the outputs of the model.

The concept of identification is very important here, because it distinguishes the simple act of predicting what will happen (I can predict that the wind is blowing when I see tree branches swaying) from intervening on a process and inferring what will happen as a result (I can’t then go and move the branches around to make the wind start blowing, aside from the small force generated by the branches themselves, which is a different thing).
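
A tiny simulation of that distinction (my example): branch movement predicts wind very well, but intervening on the branches does nothing to the wind, because the arrow runs from wind to branches.

```python
# Prediction vs. intervention: branches carry information about wind, but
# forcing the branches to move does not create wind.
import numpy as np

rng = np.random.default_rng(7)

def simulate(n, do_branches=None):
    """Generate (wind, branches); optionally intervene on branches."""
    wind = rng.normal(size=n)
    if do_branches is None:
        branches = 0.8 * wind + rng.normal(scale=0.3, size=n)  # wind -> branches
    else:
        branches = np.full(n, do_branches)                     # do(branches = value)
    return wind, branches

wind, branches = simulate(100_000)
print(f"corr(wind, branches) = {np.corrcoef(wind, branches)[0, 1]:.2f}")  # ~0.94

# Under the intervention, the wind's distribution is unchanged: nothing in
# the causal model flows from branches back to wind.
wind_do, _ = simulate(100_000, do_branches=2.0)
print(f"mean wind under do(branches = 2.0): {wind_do.mean():.2f}")  # ~0.00
```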

1.8.1 The diagram

We will use directed acyclic graphs (DAGs) as the graphical causal model of choice (this isn’t the only kind). They are not models themselves, but diagrams that help us think through which model and statistical procedures are most appropriate for the research question.

In the “Rethinking” box he makes the point about causal salad: the idea that we can do causal inference by “adjusting for all confounders”, the very common practice in science of throwing everything we can think of into a regression model and saying we “controlled” for stuff. It’s curious that every one of those papers has a “limitations” section stating that causal inference can’t be assumed, like a disclaimer. It tells you that the people doing this probably don’t believe the results themselves; otherwise they’d be prepared to stand behind them without such a statement.
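
A small simulation (mine, not the book’s) of why throwing everything into a regression can backfire: “controlling” for a collider that is caused by both X and Y wipes out an estimate that was fine to begin with.

```python
# Why "causal salad" can backfire: adjusting for a collider C (caused by
# both X and Y) badly biases the estimated effect of X on Y.
import numpy as np

rng = np.random.default_rng(42)
n = 50_000
true_effect = 1.0

x = rng.normal(size=n)
y = true_effect * x + rng.normal(size=n)   # X causally affects Y
c = x + y + rng.normal(size=n)             # collider: X -> C <- Y

def coef_on_x(*predictors):
    """OLS coefficient on X (the first predictor), with an intercept."""
    design = np.column_stack((np.ones(n),) + predictors)
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs[1]

print(f"Y ~ X     : {coef_on_x(x):.2f}")      # close to the true 1.0
print(f"Y ~ X + C : {coef_on_x(x, c):.2f}")   # collapses toward 0
```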


Video Lecture From the Author