Noise: Book Review

Noise: A Flaw in Human Judgment (2021), Daniel Kahneman, Olivier Sibony and Cass Sunstein. Several books on decision making and psychology have a one-word title (Grit, Nudge), which are relatively simple with a lot of storytelling to make a fun read. This is closer to a textbook, with a large number of concepts and limited storytelling. It’s still worthwhile because of the content. Noise is a relatively simple concept, describing diverse judgments that are widely scattered. The authors are looking at human error, which is made up of bias (systematic deviation) and noise (random scatter). Medicine is noisy, as are forecasts, personnel decisions, forensic science, bail judgments, and patent decisions.


Part I: Finding Noise. They start with criminal justice, unique business opportunities, dealing with a pandemic; noise is unwanted variability: “wherever there is judgment, there is noise—and more of it than you think” (p. 16).


Chapter 1: Crime and Noisy Punishment. People convicted of similar crimes receive different penalties when different judges impose them (“absence of consensus was the norm”). The same judge also gives different penalties at different times (at the beginning versus the end of the day, and so on). Rules reduce this, but is that an improvement? Congress passed the Sentencing Reform Act in 1984 to reduce discretion; the resulting guidelines cut noise but met resistance from judges. The Supreme Court struck the law down (the rules became advisory), which raises the question of rules versus guidelines. System noise is unwanted variability in judgments of similar cases.


Chapter 2: A Noisy System. A noise audit of an insurance company showed considerable noise when claims adjusters calculated a premium, where little was expected (55% versus less than 10% expected; an illusion of agreement). Note that noise is unwanted, although variability in judgment is not always unwanted. “Noise … was tolerated not because it was thought acceptable but because it had remained unnoticed” (p. 32); partly because judgments are informal. Systems in place try to minimize disagreements.


Chapter 3: Singular Decisions. A specific crisis such as an Ebola epidemic would be a singular decision, like the 2014 crisis that Obama responded to (he sent health workers and soldiers to West Africa). Singular decisions are those not made recurrently. Recurrent decisions can be examined for patterns and tested statistically. A pandemic is considered a singular event, but governments around the world handle it differently—providing evidence of noise.


Part II. Your Mind is a Measuring Instrument. Measurement is using an instrument to assign a value on a scale to an object or event, with the goal of accuracy. “Judgment is a conclusion that can be summarized in a word or phrase. … Judgment, like measurement, refers both to the mental activity of making a judgment and its product” (p. 41). Standard deviation is a common measure of variability (in a normal distribution, one standard deviation on either side of the mean covers about two-thirds of the observations).


Chapter 4: Matters of Judgment. “A matter of judgment is one with some uncertainty about the answer” (p. 44); somewhere between fact and opinion with “expectation of bounded disagreement” (p. 44). “Two ways of evaluating a judgment by comparing it to an outcome and by assessing the quality of the process that led to it” (p. 50). There are predictive judgments (e.g., forecasts) and evaluative judgments (a criminal sentence). “System noise is inconsistency, and inconsistency damages the credibility of the system” (p. 53), associated with multiple judgments of the same problem (which can be without a true value).


Chapter 5: Measuring Error. Overall error includes bias and noise. The most common way of measuring error is the method of least squares (developed by Carl Friedrich Gauss in 1795): overall error is the mean of the squared errors of measurement (MSE), and the best estimate is the value that minimizes it; note that squaring gives larger errors a larger weight. Bias is the average error, and the residual for each measurement can be considered “noisy error” (positive when that error is larger than the bias, negative when it is smaller). Therefore, overall error (MSE) = Bias² + Noise². (This has the same form as the Pythagorean theorem.) MSE can be minimized by avoiding large errors, but avoid mixing values with facts.
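
The error equation can be checked with a few lines of arithmetic. This is a minimal sketch, not anything taken from the book: the five error values are made up, bias is taken as the mean error, and noise as the (population) standard deviation of the errors.

```python
import math
from statistics import fmean, pstdev

# Hypothetical errors (judgment minus true value) from five measurements.
errors = [3.0, 1.0, 4.0, 0.0, 2.0]

bias = fmean(errors)                       # the average error
noise = pstdev(errors)                     # scatter of the errors around the bias
mse = fmean([e * e for e in errors])       # mean squared error

print(f"bias^2 = {bias ** 2:.2f}, noise^2 = {noise ** 2:.2f}, MSE = {mse:.2f}")

# The two components add up to the overall error.
assert math.isclose(mse, bias ** 2 + noise ** 2)
```

Because the two terms enter the sum in the same way, cutting noise by a given amount lowers MSE exactly as much as cutting bias by that amount.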


Chapter 6: The Analysis of Noise. Judgment variability can be intentional. The analysis of judicial sentencing assumed the mean sentence was the correct one. The average was 7 years, but with a standard deviation of 3.4 years (the system noise). Some judges have a reputation as harsh (e.g., judges in the South), others as lenient. These are level errors (differences between judges in their average sentence), which could reflect different views on incapacitation (removing criminals from society), rehabilitation, or deterrence. Level noise was 2.4 years. The remainder is pattern noise (the judge × case interaction: how a particular judge responds to a particular case). In summary: system noise² = level noise² + pattern noise². Occasion noise relates to transient effects.
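
To make the decomposition concrete, here is a small sketch that splits system noise into level noise and pattern noise for a table of sentences. The sentences are invented (three hypothetical judges, three hypothetical cases), occasion noise is ignored, and the variable names are mine rather than the authors'.

```python
import math
from statistics import fmean, pvariance

# Hypothetical sentences in years: one row per judge, one column per case.
sentences = [
    [2.0, 6.0, 10.0],   # judge A
    [4.0, 9.0, 14.0],   # judge B (harsher on average)
    [3.0, 3.0, 9.0],    # judge C (more lenient on average)
]
n_judges, n_cases = len(sentences), len(sentences[0])

grand_mean = fmean([x for row in sentences for x in row])
judge_means = [fmean(row) for row in sentences]
case_means = [fmean([row[c] for row in sentences]) for c in range(n_cases)]

# System noise^2: variance across judges for the same case, averaged over cases.
system_var = fmean([pvariance([row[c] for row in sentences]) for c in range(n_cases)])

# Level noise^2: variance of the judges' average sentences.
level_var = pvariance(judge_means)

# Pattern noise^2: what is left after removing judge and case averages
# (the judge x case interaction).
pattern_var = fmean([
    (sentences[j][c] - judge_means[j] - case_means[c] + grand_mean) ** 2
    for j in range(n_judges)
    for c in range(n_cases)
])

print(f"system^2 = {system_var:.3f}, level^2 = {level_var:.3f}, "
      f"pattern^2 = {pattern_var:.3f}")
assert math.isclose(system_var, level_var + pattern_var)
```

The identity holds exactly here because every judge rates every case; with missing cells it would only hold approximately.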


Chapter 7: Occasion Noise. One test of variability is test-retest reliability. The wisdom-of-crowds effect comes from averaging the independent judgments of multiple people (going back to Francis Galton in 1907 on the weight of an ox). Averaging judgments is less noisy, but not less biased. Or ask the same person twice at different times (the crowd within). Dialectical bootstrapping is an example: the second estimate is made with new assumptions and considerations. Occasion noise can be caused by mood. People in a good mood are more gullible in general. The trolley problem contrasts utilitarian calculation (the greater good) with deontological ethics (Kant), where killing is prohibited. Stress and fatigue limit judgment. Preceding cases influence current cases (sequence effects). Gambler’s fallacy: people underestimate the likelihood that streaks happen by chance.
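
A short simulation illustrates why averaging helps with noise but not with bias. It is only a sketch under assumed numbers: the true value, the shared bias, and the noise level are hypothetical, and judgments are assumed to be independent and normally distributed.

```python
import random
from statistics import fmean, pstdev

random.seed(0)  # fixed seed so the run is repeatable

TRUE_VALUE = 1200.0  # hypothetical true weight of the ox, in pounds
SHARED_BIAS = 50.0   # hypothetical error that every judge shares
NOISE_SD = 100.0     # hypothetical scatter of individual judgments


def one_judgment() -> float:
    """A single judgment: truth plus the shared bias plus random noise."""
    return TRUE_VALUE + SHARED_BIAS + random.gauss(0, NOISE_SD)


def crowd_average(size: int) -> float:
    """The average of `size` independent judgments."""
    return fmean([one_judgment() for _ in range(size)])


singles = [one_judgment() for _ in range(2000)]
crowds = [crowd_average(25) for _ in range(2000)]

for label, values in (("single judges", singles), ("crowds of 25", crowds)):
    print(f"{label}: mean error = {fmean(values) - TRUE_VALUE:6.1f}, "
          f"scatter = {pstdev(values):6.1f}")
```

In a run like this the scatter of the crowd averages should be roughly a fifth of the scatter of single judgments (one over the square root of 25), while the mean error stays near the shared bias.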


Chapter 8: How Groups Amplify Noise. There can be wise crowds or crowds that “follow tyrants, that fuel market bubbles, that believe in magic, or that are under the sway of a shared illusion” (p. 92). The test is for social influence, the group dynamics that influence decisions and cause noise across groups. Group outcomes can be manipulated. Political positions can depend on initial popularity. “Independence is a prerequisite for the wisdom of crowds” (p. 96). Information cascades: groups can go in multiple directions, often caused by small changes. The bandwagon effect is similar. Group polarization: when people speak with each other they may end up at a more extreme point: “Internal discussions often create greater confidence, greater unity, and greater extremism, frequently in the form of increased enthusiasm” (p. 100). Deliberation can increase noise. Shifts are often made toward the dominant tendency; the effect is that the group becomes more unified, confident, and extreme.


Part III. Noise in Predictive Judgments. Percent concordant (PC) is an alternative measure to correlation (r). PC goes from 50% to 100% as r goes from 0 to 1. Objective ignorance means the future cannot be known.
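
For readers who want to convert between the two measures, the sketch below uses the standard result for two jointly normal variables, PC = 1/2 + arcsin(r)/π. The normality assumption and the function name are mine; the review itself only gives the endpoints.

```python
import math


def percent_concordant(r: float) -> float:
    """Probability that the case with the higher prediction also has the
    higher outcome, assuming both variables are jointly normal."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return 0.5 + math.asin(r) / math.pi


for r in (0.0, 0.28, 0.6, 1.0):
    print(f"r = {r:.2f} -> PC = {percent_concordant(r):.1%}")
```

The conversion is far from linear: a modest correlation moves PC only a little above the 50% of a coin flip.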


Chapter 9: Judgments and Models. Clinical judgment is an informal approach to problems. Multiple regression (an example of a mechanical prediction) is a model-based statistical technique that yields the explanatory power of the model and the significance and direction of each independent variable. Judges make clinical decisions; a rule produces mechanical predictions. Simple mechanical rules usually beat humans. Lewis Goldberg and big five personality traits: agreeableness, conscientiousness, extraversion, openness to experience, and neuroticism. A model eliminates pattern noise.


Chapter 10: Noiseless Rules. Importance of artificial intelligence in decision making. An algorithm is “a process or set of rules to be followed in calculations or other problem-solving operations. … mechanical approaches are noise-free” (p. 118). Robyn Dawes introduced improper linear models, giving all predictors equal weights. Multiple regression optimizes weights to minimize squared errors, but it minimizes error in the original data, leaving the question of how well the model does on new, out-of-sample data (where the weights are no longer optimal). “The correct measure of a model’s predictive accuracy is its performance in a new sample, called its cross-validated correlation” (p. 120). A major problem is small sample size. The key point in choosing predictors (independent variables) is that they are correlated with the outcome.
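
As a rough sketch of why the new-sample correlation matters, the following compares regression weights fitted on a small sample with Dawes-style equal weights. The data are simulated; the true weights (1.0 and 0.6), the sample sizes, and all function names are assumptions for illustration, and both predictors are drawn on the same scale so that equal weighting needs no extra standardization.

```python
import random
from statistics import fmean


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def fit_ols(x1, x2, y):
    """Least-squares weights (b1, b2) for y ~ intercept + b1*x1 + b2*x2."""
    m1, m2, my = fmean(x1), fmean(x2), fmean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (t - my) for a, t in zip(x1, y))
    s2y = sum((b - m2) * (t - my) for b, t in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det


def simulate(n, rng):
    """Hypothetical data: two predictors and a noisy outcome."""
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    x2 = [rng.gauss(0, 1) for _ in range(n)]
    y = [1.0 * a + 0.6 * b + rng.gauss(0, 2) for a, b in zip(x1, x2)]
    return x1, x2, y


def score(data, w1, w2):
    """Correlation between the weighted sum of predictors and the outcome."""
    x1, x2, y = data
    return pearson([w1 * a + w2 * b for a, b in zip(x1, x2)], y)


rng = random.Random(42)
train = simulate(15, rng)      # small original sample
test = simulate(5000, rng)     # large new sample

b1, b2 = fit_ols(*train)
r_ols_train, r_equal_train = score(train, b1, b2), score(train, 1.0, 1.0)
r_ols_test, r_equal_test = score(test, b1, b2), score(test, 1.0, 1.0)

print(f"original sample: regression r = {r_ols_train:.2f}, equal weights r = {r_equal_train:.2f}")
print(f"new sample:      regression r = {r_ols_test:.2f}, equal weights r = {r_equal_test:.2f}")
```

In the original sample the regression can never do worse than equal weights, since its weights were tuned to that sample. The new-sample row is the fair comparison, and with a training sample this small it is worth trying several seeds rather than reading much into a single run.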


Even easier are frugal models or simple rules. “Predictors are almost always correlated to one another, this statistical fact supports the use of frugal approaches to prediction, which use a small number of predictors” (p. 121)—(e.g., out of 137 possible variables). To predict jail jumpers, two variables work okay: defendant’s age (older people are low flight risks) and number of past court dates missed. A risk score can be computed; this does better than human bail judges. Similar approaches can work for credit score or diagnosing heart disease.


Artificial intelligence (AI) is more sophisticated: “use many more predictors, gather much more data about each of them, spot relationship patterns that no human could detect, and model these patterns to achieve better predictions” (p. 123). Then the key is knowing when to override the model. AI was applied to bail decisions, based on over 750,000 cases: 74% released, 15% failed to appear. The AI model produced a numerical score. AI worked well with high flight risk candidates. The model also discovered considerable system noise across judges. It worked in Moneyball.


Chapter 11: Objective Ignorance. How good could a predictive judgment get? This limit they call objective ignorance.

Executives asked about their ability to pick the high achiever (of two candidates) thought they were about 80% correct. Their average predictive correlation was 28% (r = .28). “Both intractable uncertainty (what cannot possibly be known) and imperfect information (what can be known but isn’t) make perfect prediction impossible” (p. 132). They replace uncertainty with ignorance. Overconfidence is a well-documented cognitive bias, a denial of ignorance. Philip Tetlock focused on the superforecasters (more doable for yes/no short-term forecasts). Formulas had better hit rates than professionals.


Chapter 12: The Valley of the Normal. Causal chains are common in some areas like disease diagnosis. “The ability to make a prediction is a measure of whether such a causal chain has been identified” (p. 144). Causal thinking creates stories of events, people and so on. Thanks to hindsight bias, events seem normal, but could not have been predicted. Statistical thinking requires system 2, deliberate thinking, plus special training—like looking at individual cases from broad categories. “In the valley of the normal, events are neither expected nor surprising—they just explain themselves” (p. 149).


Part IV. How Noise Happens.


Chapter 13: Heuristics, Biases, and Noise. Heuristics (rules of thumb) are used to tackle difficult questions, which are simplified with System 1 thinking. “Psychological biases create statistical bias when they are broadly shared … and system noise when judges are biased in different ways” (p. 151). The planning fallacy includes underestimating the time needed to complete a project. Scope insensitivity means some factors are ignored when making judgments (the probability of a person staying on a job for two versus three years is different, yet people usually give the same probability).


“Errors are bound to occur when a judgment of similarity is substituted for a judgment of probability, because probability is constrained by a special logic. Venn diagrams apply only to probability, not to similarity” (p. 155). CEO turnover in the US is about 15% a year (the base rate). Ignoring base rates is called base-rate neglect. The availability heuristic substitutes an easily recalled example for an assessment of frequency. Conclusion bias is prejudgment: starting from a particular conclusion (considered System 1). This is similar to confirmation bias and desirability bias, which involve collecting and interpreting evidence selectively in favor of what is believed or wished to be true. The affect heuristic means determining what we think based on our feelings (e.g., about individual politicians); it is also why companies work hard at maintaining a positive brand. Anchoring is related to conclusion bias. For example, thinking of the last two digits of a social security number will “anchor” the price a person offers for, say, a bottle of wine. Confirmation bias: looking for information that agrees with prejudgments and disregarding conflicting evidence; it is related to excessive coherence, being slow to change initial impressions. Halo effect: a positive first impression, then slow to change opinions (“we jump to conclusions, then stick to them,” p. 161).


“Three types of biases operate in different ways: substitution biases, which lead to a mis-weighting of the evidence; conclusion biases, which lead us either to bypass the evidence or to consider it in a distorted way; and excessive coherence, which magnifies the effect of initial impressions and reduces the impact of contradictory information” (p. 162). All produce statistical bias and can produce noise.


Chapter 14: The Matching Operation. Matching is using a value on a judgment scale, say from one to ten; many are qualitative. “Our ability to compare cases (e.g., which of two is better) is much better than our ability to place them on a scale” (p. 172).


Chapter 15: Scales. Apparently, the response scale is a noise source. The authors examined jury-awarded punitive damages, where juries have no relevant information like base rates. Punishment is related to the intensity of outrage. Variance of judgments = variance of just punishment + level noise² + pattern noise². They estimate that 94% of the variance in judgments is noise. “The law explicitly prohibits any communication to the jury of the size of punitive awards in other cases. The assumption implicit in the law is that jurors’ sense of justice will lead them directly from a consideration of an offense to the correct punishment. This assumption is psychological nonsense” (p. 186).


Chapter 16: Patterns. Illusion of agreement: when people “cannot imagine possible alternatives to their conclusions, they will naturally assume that other observers must reach the same conclusion, too” (p. 189). “Pattern noise [can arise] from systematic differences in the ability to make valid judgments about different dimensions of a case” (p. 192), such as what matters in picking players for a sports team (e.g., batting average, speed). The most accepted model of personality has five traits: extraversion, agreeableness, conscientiousness, openness to experience, and neuroticism. These don’t work that well in predicting specific behaviors (r = .3) and are affected by situations.


Chapter 17: The Sources of Noise. “Noise is mostly a product not of level differences but of interactions: how different judges deal with particular defendants, how different teachers deal with particular students … Noise is mostly a by-product of our uniqueness. … Average of errors (the bias) and the variability of errors (the noise) play equivalent roles in the error equation. … We easily make sense of events in hindsight, although we could not have predicted them before they happened. … We do feel a need to explain abnormal outcomes” (pp. 202-3). Fundamental attribution error: “A strong tendency to assign blame or credit to agents for actions and outcomes that are better explained by luck or by objective circumstances. Another bias, hindsight bias, distorts judgments so the outcomes that could not have been anticipated appear easily foreseeable in retrospect” (p. 203). “A psychological bias is a legitimate causal explanation of a judgment error if the bias could have been predicted in advance or detected in real time” (p. 203). Noise, by contrast, is a statistical notion: the variability of errors.


Part V. Improving Judgments. Consider replacing judgment with rules or algorithms. The authors talk about decision observers, a bias checklist, and decision hygiene.


Chapter 18: Better Judges for Better Judgments. “Good judgments depend on what you know, how well you think, and how you think. Good judges tend to be experienced and smart, but they also tend to be actively open-minded and willing to learn from new information. … highly skilled people are less noisy, and they also show less bias” (p. 210). General mental ability: “predicts both occupational level attained and performance within one’s chosen occupation and does so better than any other ability, trait, or disposition and better than job experience” (p. 213). Personality traits like grit and conscientiousness are important; so is fluid intelligence, the ability to solve novel problems. These become more important as job complexity increases. There are differences in cognitive style, that is, how people approach making judgments. The cognitive reflection test (CRT) measures reflective (System 2) versus impulsive thinking. Critical thinking measures such as the Halpern Critical Thinking Assessment are mentioned. Actively open-minded thinking means looking for data that contradicts your preexisting hypothesis.


Chapter 19: Debiasing and Decision Hygiene. Ex post versus ex ante debiasing; e.g., correcting judgments after the fact versus intervening before decisions/judgments. Ex ante debiasing can be modifying the environment (including nudges, like automatic enrollment in pensions). An alternative is training to recognize biases and correct for them (examples include instructional videos and role playing). Debiasing targets specific biases, which may be the wrong ones. Overconfidence may be an investment bias, but so is loss aversion or status quo bias (all have different effects on investment decisions); the planning fallacy is related to overconfidence, like being optimistic on completion times. Being unaware of these is a bias blind spot. A decision observer who watches the process may diagnose biases. Checklists can be useful to improve decisions, especially to avoid repeating past errors. The federal government uses regulatory impact analysis before formal regulations are issued on cost/benefit, alternatives and so on (OMB Circular A-4). Bias is directional; noise is unpredictable.

Chapter 20: Sequencing Information in Forensic Science. Forensics starts with fingerprints for identification, leading to forensic confirmation bias and bias cascades, where an initial error influences continued errors, also related to bias blind spots. The authors suggest documenting judgments at each step and sequencing information (noise can have any number of triggers). Independent analysis is important to prevent confirmation and other biases.


Chapter 21: Selection and Aggregation in Forecasting. “Analysis of forecasting—of when it goes wrong and why—make a sharp distinction between bias and noise (also called inconsistency or unreliability)” (p. 241). Budget forecasters show overoptimism, like predicting high economic growth and low spending, leading to low deficits. Aggregating independent estimates is useful to overcome this. “Wisdom of crowds works best when judgments are independent. … A select-crowd strategy selects the best judges according to the accuracy of their recent judgments and averages the judgments of a small number of judges” (p. 242). The Delphi method is an older strategy that uses multiple rounds: participants submit anonymous estimates, include their reasons, and respond to others, then continue with the next round. Kahneman et al. suggest a mini-Delphi called estimate-talk-estimate.


Tetlock’s Good Judgment Project asked thousands of volunteers to make short-term (basically yes/no) forecasts of dozens of potential items (e.g., who will win an election, chance of war between two countries, specific economic forecast); these are stated as probabilities, with adjustments as new information becomes available (“perpetual beta”). Brier scores measured accuracy. About 2% of the forecasters stood out. Their analysis suggested these superforecasters were good with numbers, could think analytically and probabilistically, and could structure and disaggregate problems. Base rates were important. A key trait was “active open-mindedness,” which included looking at evidence that opposes your beliefs. In summary: “the hard work of research, the careful thought and self-criticism, the gathering and synthesizing of other perspectives, the granular judgments and relentless up-dating” (p. 249). Three strategies: (1) training with probabilistic reasoning, considering biases (base-rate neglect, overconfidence, confirmation), looking at multiple predictions, (2) teaming with others, (3) selection. Consider systematic forecast errors favoring change or stability.
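
For readers unfamiliar with Brier scores, here is a minimal sketch of the common yes/no version: the mean squared difference between the stated probability and what happened (1 if the event occurred, 0 if not), so lower is better. The forecasts are invented, and the original formulation, which sums over both outcomes and so runs from 0 to 2, may be the one used in the project.

```python
def brier_score(forecasts, outcomes):
    """Mean squared gap between probability forecasts and 0/1 outcomes."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need one outcome per forecast, and at least one forecast")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical example: three questions, only the first event happened.
outcomes = [1, 0, 0]
confident = [0.9, 0.2, 0.7]   # mostly right, one confident miss
hedger = [0.5, 0.5, 0.5]      # always says 50%

print(f"confident forecaster: {brier_score(confident, outcomes):.2f}")
print(f"always 50%:           {brier_score(hedger, outcomes):.2f}")
```

Always answering 50% scores 0.25 on this scale, which gives a useful benchmark: a forecaster has to beat it to show any skill.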


Chapter 22: Guidelines in Medicine. Diagnostic guidelines have variations, about 44% related to skill level. Between-person agreement is measured with the kappa statistic (1 means perfect agreement and 0 means agreement no better than chance; higher kappa means less noise). Suggestions include training and algorithms (e.g., Apgar score for newborn distress). There are big disagreements among psychiatrists, reflecting different theories, training, experience, and interview styles. Criteria for disorders are vague.
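
The kappa statistic can be sketched for the simplest case of two raters. The version below is Cohen's kappa, which I am assuming is the variant meant here; the diagnoses are invented, and the function name is mine.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    if not rater_a or len(rater_a) != len(rater_b):
        raise ValueError("both raters must label the same, non-empty set of cases")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if each rater labeled at random with their own rates.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n ** 2
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (observed - expected) / (1.0 - expected)


# Hypothetical diagnoses of the same ten patients by two physicians.
doctor_a = ["disease"] * 4 + ["healthy"] * 6
doctor_b = ["disease"] * 3 + ["healthy"] * 6 + ["disease"]

print(f"kappa = {cohens_kappa(doctor_a, doctor_b):.2f}")
```

Here the two physicians agree on 8 of 10 patients, but 52% agreement would be expected by chance given how often each says "disease", so kappa is well below 0.8.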


Chapter 23: Defining the Scale in Performance Ratings. Rating people on kindness (collegiality), intelligence and diligence. Businesses rate performance, which people normally hate. Appraisals are noisy (level, pattern, and occasion noise). Aggregation of ratings helps reduce noise. Rise of 360-degree feedback to predict measurable performance. One technique is forced ranking (rankings are less noisy than ratings). Rating people on one dimension at a time reduces noise. Structuring complex decisions into multiple dimensions limits the halo effect. Ratings should be anchored on specific descriptions (associated with frame-of-reference).


Chapter 24: Structure in Hiring. In terms of determining who will succeed in a job and who won’t, standard (unstructured) interviews aren’t that useful (r = .28), with “a minefield of psychological biases” (p. 281): culturally similar candidates, first impression, physical appearance, other idiosyncratic reactions. Formulas and clinical aggregation are preferred. They talk about structuring complex judgments (decomposition, independence, and delayed holistic judgment). Structured behavioral interviews can be useful, with scores then assigned on a predetermined rating scale. Ask for work sample tests.


Chapter 25: The Mediating Assessment Protocol. “Using a structured approach will force us to postpone the goal of reaching a decision until we have made all the assessments” (p. 292). Important considerations include outside perspectives, base rates for a reference class, and the estimate-talk-estimate method. A summary table is presented on p. 299.


Part VI. Optimal Noise.


Chapter 26: The Costs of Noise Reduction. Objections to reform: it has perverse effects, it is futile, or it puts important values in jeopardy. Another concern is “algorithm bias,” which is a possibility if the data set is biased.


Chapter 27: Dignity. Case-by-case decisions can have cultural roots; e.g., flexibility, adaptability, mercy. Rules can be considered unacceptable, with the tax code leading the list. People in authority want discretion.


Chapter 28: Rules or Standards? Standards normally ask for “prudence” or reasonableness, leaving discretion and delegating power. Rules eliminate discretion and reduce noise, “bureaucratic justice.” Algorithms work as rules. Consider Facebook standards, for which they receive considerable flak. A key point is the cost of decisions versus the cost of errors. “Principals need to impose rules when they have reason to distrust their agents” (p. 332).


Review and Conclusions: Taking Noise Seriously. Judgments can be predictive or evaluative. “Matters of judgment are characterized by an expectation of bounded disagreement” (p. 337), that is, disagreement within limits. System noise: “noise observed in organizations that employ interchangeable professionals” (p. 337). Mean square error “yields the sample mean as an unbiased estimate of the population mean, treats positive and negative errors equally, and disproportionately penalizes large errors” (p. 338). System noise includes level noise (variability of average judgments by different people) and pattern noise (which produces a different ranking of cases based on the idiosyncratic responses of judges, a statistical interaction). Pattern noise can be stable or occasional. Exaggerated confidence in their decision ability leads people to understate their objective ignorance. Simple models consistently outperform humans.


Debiasing includes six principles: “The goal of judgment is accuracy, not individual expression; think statistically, and take the outside view of the case (anchor on similar cases); structure judgments into several independent tasks; resist premature intuitions (sequence the information); obtain independent judgments from multiple judges, then consider aggregating those judgments; favor relative judgments and relative scales” (p. 348).


Epilogue: A Less Noisy World. Appendixes: How to conduct a noise audit; a checklist for a decision observer, including a bias observation checklist (p. 358); correcting predictions.