🧠

Algorithms To Live By

🔖How would I describe this book in 1 sentence?

Powerful implications of computer science for psychology and everyday life.

🗺️What was the role of this book in my journey?

I found this book recommended by many sources. The main reason I decided to read it is that I'd been working on my internal Notion systems, and I hoped that insights from this book would help me make certain decisions about classifications and the directions to take.

But I found something more in this book. Algorithms To Live By told me that I probably shouldn't worry about the classifications at all. The book offers great frameworks for decision-making and productivity, many of which I had already been following intuitively. Now I feel even more confident in my approach.

This book also gave me some knowledge about machine learning and modeling that could be quite useful in the future.

💡Key Insights

  1. Life is full of problems that are, quite simply, hard. And the mistakes made by people often say more about the intrinsic difficulties of the problem than about the fallibility of human brains. Thinking algorithmically about the world, learning about the fundamental structures of the problems we face and about the properties of their solutions, can help us see how good we actually are, and better understand the errors that we make.
  2. The optimal solution takes the form of what we’ll call the Look-Then-Leap Rule: You set a predetermined amount of time for “looking”—that is, exploring your options, gathering data—in which you categorically don’t choose anyone, no matter how impressive. After that point, you enter the “leap” phase, prepared to instantly commit to anyone who outshines the best applicant you saw in the look phase.
  3. When balancing favorite experiences and new ones, nothing matters as much as the interval over which we plan to enjoy them.
  4. In 2007, Google product manager Dan Siroker took a leave of absence to join the presidential campaign of then senator Barack Obama in Chicago. Heading the “New Media Analytics” team, Siroker brought one of Google’s web practices to bear on the campaign’s bright-red DONATE button. The result was nothing short of astonishing: $57 million of additional donations were raised as a direct result of his work. What exactly did he do to that button? He A/B tested it.
  5. To live in a restless world requires a certain restlessness in oneself. So long as things continue to change, you must never fully cease exploring.
  6. This is the first and most fundamental insight of sorting theory. Scale hurts.
  7. Computer science, as undergraduates are taught, is all about tradeoffs. And one of the most central tradeoffs is between sorting and searching. The basic principle is this: the effort expended on sorting materials is just a preemptive strike against the effort it’ll take to search through them later.
  8. The verdict is clear: ordering your bookshelf will take more time and energy than scanning through it ever will.
  9. Computer science shows that the hazards of mess and the hazards of order are quantifiable and that their costs can be measured in the same currency: time. Leaving something unsorted might be thought of as an act of procrastination—passing the buck to one’s future self, who’ll have to pay off with interest what we chose not to pay up front. But the whole story is subtler than that. Sometimes mess is more than just the easy choice. It’s the optimal choice.
  10. In some ways the most important skill as a professional poker player is to be able to evaluate how good you are. If you’re anything short of the very best poker player in the world, you can be pretty much assured of going broke if you are endlessly willing to play people better than you. — Isaac Haxton
  11. LRU teaches us that the next thing we can expect to need is the last one we needed, while the thing we’ll need after that is probably the second-most-recent one. And the last thing we can expect to need is the one we’ve already gone longest without.
  12. In short, the mathematics of self-organizing lists suggests something radical: the big pile of papers on your desk, far from being a guilt-inducing fester of chaos, is actually one of the most well-designed and efficient structures available. What might appear to others to be an unorganized mess is, in fact, a self-organizing mess. Tossing things back on the top of the pile is the very best you can do, shy of knowing the future. Leaving something unsorted is more efficient than taking the time to sort everything; here, however, there’s a very different reason why you don’t need to organize it. You already have.
  13. The mind has essentially infinite capacity for memories, but we have only a finite amount of time in which to search for them.
  14. Through a series of simulations, the researchers showed that simply knowing more makes things harder when it comes to recognizing words, names, and even letters. No matter how good your organization scheme is, having to search through more things will inevitably take longer. It’s not that we’re forgetting; it’s that we’re remembering. We’re becoming archives.
  15. This is a sufficiently fundamental and counterintuitive point that it’s worth repeating. If you have only a single machine, and you’re going to do all of your tasks, then any ordering of the tasks will take you the same amount of time.
  16. Thus we encounter the first lesson in single-machine scheduling literally before we even begin: make your goals explicit. We can’t declare some schedule a winner until we know how we’re keeping score. This is something of a theme in computer science: before you can have a plan, you must first choose a metric. And as it turns out, which metric we pick here will directly affect which scheduling approaches fare best.
  17. Only prioritize a task that takes twice as long if it’s twice as important.
  18. A simple prescription for time management: each time a new piece of work comes in, divide its importance by the amount of time it will take to complete. If that figure is higher than for the task you’re currently doing, switch to the new one; otherwise stick with the current task. This algorithm is the closest thing that scheduling theory has to a skeleton key or Swiss Army knife, the optimal strategy not just for one flavor of problem but for many. (A code sketch of this rule appears just after this list.)
  19. When the future is foggy, it turns out you don’t need a calendar—just a to-do list.
  20. None of this switching back and forth is “real work”—that is, none of it actually advances the state of any of the various programs the computer is switching between. It’s metawork. Every context switch is wasted time.
  21. Anyone you interrupt more than a few times an hour is in danger of doing no work at all.
  22. You should try to stay on a single task as long as possible without decreasing your responsiveness below the minimum acceptable limit. Decide how responsive you need to be—and then, if you want to get things done, be no more responsive than that.
  23. In the mid-twentieth century, the Bayesian statistician Harold Jeffreys had looked into determining the number of tramcars in a city given the serial number on just one tramcar, and came up with the same answer: double the serial number. And a similar problem had arisen even earlier, during World War II, when the Allies sought to estimate the number of tanks being produced by Germany. Purely mathematical estimates based on captured tanks’ serial numbers predicted that the Germans were producing 246 tanks every month, while estimates obtained by extensive (and highly risky) aerial reconnaissance suggested the figure was more like 1,400. After the war, German records revealed the true figure: 245.
  24. Knowing what distribution you’re up against can make all the difference.
  25. Small data is big data in disguise. The reason we can often make good predictions from a small number of observations—or just a single one—is that our priors are so rich. Whether we know it or not, we appear to carry around in our heads surprisingly accurate priors about movie grosses and running times, poem lengths, and political terms of office, not to mention human life spans. We don’t need to gather them explicitly; we absorb them from the world.
  26. Our judgments betray our expectations, and our expectations betray our experience. What we project about the future reveals a lot—about the world we live in, and about our own past.
  27. The best way to make good predictions, as Bayes’s Rule shows us, is to be accurately informed about the things you’re predicting.
  28. Giving yourself more time to decide about something does not necessarily mean that you’ll make a better decision. But it does guarantee that you’ll end up considering more factors, more hypotheticals, more pros and cons, and thus risk overfitting.
  29. The effectiveness of regularization in all kinds of machine-learning tasks suggests that we can make better decisions by deliberately thinking and doing less.
  30. Going with our first instinct can be the rational solution. The more complex, unstable, and uncertain the decision, the more rational an approach that is.
  31. If we’re willing to accept solutions that are close enough, then even some of the hairiest problems around can be tamed with the right techniques.
  32. When it comes to stimulating creativity, a common technique is introducing a random element, such as a word that people have to form associations with. For example, musician Brian Eno and artist Peter Schmidt created a deck of cards known as Oblique Strategies for solving creative problems. Pick a card, any card, and you will get a random new perspective on your project. (And if that sounds like too much work, you can now download an app that will pick a card for you.)
  33. The lesson of the TCP sawtooth is that in an unpredictable and changing environment, pushing things to the point of failure is indeed sometimes the best (or the only) way to use all the resources to their fullest. What matters is making sure that the response to failure is both sharp and resilient. Under AIMD, every connection that isn’t dropping the ball is accelerated until it is—and then it’s cut in half, and immediately begins accelerating again. And though it would violate almost every norm of current corporate culture, one can imagine a corporation in which, annually, every employee is always either promoted a single step up the org chart or sent part of the way back down.
  34. We use the idiom of “dropped balls” almost exclusively in a derogatory sense, implying that the person in question was lazy, complacent, or forgetful. But the tactical dropping of balls is a critical part of getting things done under overload.
  35. The most prevalent critique of modern communications is that we are “always connected.” But the problem isn’t that we’re always connected; we’re not. The problem is that we’re always buffered. The difference is enormous.
  36. The object of study in mathematics is truth; the object of study in computer science is complexity.
  37. At the present moment, the Bay Area (where the two of us live) is attempting to remedy this sorry state of affairs by going through a radical paradigm shift when it comes to vacation policy—a shift that is very well meaning and completely, apocalyptically doomed. The premise sounds innocent enough: instead of metering out some fixed arbitrary number of days for each employee, then wasting HR man-hours making sure no one goes over their limit, why not just let your employees free? Why not simply allow them unlimited vacation? Anecdotal reports thus far are mixed—but from a game-theoretic perspective, this approach is a nightmare. All employees want, in theory, to take as much vacation as possible. But they also all want to take just slightly less vacation than each other, to be perceived as more loyal, more committed, and more dedicated (hence more promotion-worthy). Everyone looks to the others for a baseline, and will take just slightly less than that. The Nash equilibrium of this game is zero. As the CEO of software company Travis CI, Mathias Meyer, writes, “People will hesitate to take a vacation as they don’t want to seem like that person who’s taking the most vacation days. It’s a race to the bottom.”
  38. An interesting aspect of the 2007–2009 mortgage crisis is that everybody involved seemed to feel like they were unfairly punished for simply doing what they were supposed to. A generation of Americans who grew up believing that houses were fail-safe investments, and who saw everyone around them buying houses despite (or because of) rapidly rising prices, were badly burned when those prices finally started to tumble. Bankers, meanwhile, felt they were unfairly blamed for doing what they had always done—offering opportunities, which their clients could accept or decline.
  39. Imagine there are ten companies that might bid on the rights for a given tract. One of them has a geological survey suggesting the tract is rich with oil; another’s survey is inconclusive; the reconnaissance of the other eight suggests it’s barren. But being competitors, of course, the companies do not share their survey results with each other, and instead can only watch each other’s actions. When the auction begins, the first company, with the promising report, makes a high initial bid. The second company, encouraged by this bid to take an optimistic view of their own ambiguous survey, bids even higher. The third company has a weak survey but now doesn’t trust it in light of what they take to be two independent surveys that suggest it’s a gold mine, so they make a new high bid. The fourth company, which also has a lackluster survey, is now even more strongly inclined to disregard it, as it seems like three of their competitors all think it’s a winner. So they bid too. The “consensus” unglues from reality. A cascade has formed. No single bidder has acted irrationally, yet the net result is catastrophe.
  40. Be wary of cases where public information seems to exceed private information, where you know more about what people are doing than why they’re doing it, where you’re more concerned with your judgments fitting the consensus than fitting the facts. When you’re mostly looking to others to set a course, they may well be looking right back at you to do the same.
  41. If you’re the kind of person who always does what you think is right, no matter how crazy others think it is, take heart. The bad news is that you will be wrong more often than the herd followers. The good news is that sticking to your convictions creates a positive externality, letting people make accurate inferences from your behavior. There may come a time when you will save the entire herd from disaster.
  42. One of the implicit principles of computer science, as odd as it may sound, is that computation is bad: the underlying directive of any good algorithm is to minimize the labor of thought.
  43. The intuitive standard for rational decision-making is carefully considering all available options and taking the best one. At first glance, computers look like the paragons of this approach, grinding their way through complex computations for as long as it takes to get perfect answers. But as we’ve seen, that is an outdated picture of what computers do: it’s a luxury afforded by an easy problem. In the hard cases, the best algorithms are all about doing what makes the most sense in the least amount of time, which by no means involves giving careful consideration to every factor and pursuing every computation to the end. Life is just too complicated for that.
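
Insight 18's divide-importance-by-time prescription is what scheduling theory calls the weighted-shortest-processing-time rule. Here is a minimal Python sketch; the task names and numbers are invented for illustration:

```python
# Weighted shortest processing time: do tasks in order of importance per
# hour. On a single machine this minimizes the sum of weighted completion
# times — the "weighty things get done sooner" effect described above.

tasks = [
    {"name": "write report", "importance": 8, "hours": 4},
    {"name": "answer email", "importance": 2, "hours": 0.5},
    {"name": "plan quarter", "importance": 9, "hours": 3},
]

# Higher importance-per-hour goes first.
schedule = sorted(tasks, key=lambda t: t["importance"] / t["hours"], reverse=True)

for t in schedule:
    print(f'{t["name"]}: {t["importance"] / t["hours"]:.1f} importance/hour')
# answer email: 4.0, plan quarter: 3.0, write report: 2.0
```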

🦅Key Principles

  1. Don't compromise, don't second-guess, and don't look back
  2. Look-Then-Leap. Use the first 37% of your time screening and establishing the benchmark. Afterward, say "Yes" to the first candidate/offer that beats everything you've seen so far
  3. No matter what, never hire someone who’s below average unless you’re totally out of options.
  4. Always be stopping. Time spent on decision-making has a huge impact
  5. Never stop exploring
  6. A/B test. Everywhere
  7. Explore/exploit. Balance out new experiences with the proven ones. Don't forget to utilize the information you already have
  8. Err on the side of messiness.
  9. Get rid of items using the "Least Recently Used" (LRU) principle (see the cache sketch after this list)
  10. Exploit geography. Make sure things are in whatever cache is closest to the place where they’re typically used.
  11. Having a cache is efficient, but having multiple levels of caches—from smallest and fastest to largest and slowest—can be even better.
  12. Unorganized mess = self-organized mess. Put things on the top of the pile.
  13. For a good memory, predict which items you are most likely to need in the future. Then practice them
  14. Keep the most important things closest to hand.
  15. Make your goals explicit
  16. Stay focused not just on getting things done but on getting weighty things done. Do the most important work you can at every moment.
  17. Inherit priorities. If a low-priority task is blocking a high-priority one, the low-priority task should temporarily become the highest-priority thing
  18. Don't try to create a "perfect" schedule. React to things on the fly. Build your to-do list, not the calendar.
  19. When you feel overwhelmed (thrashing), work dumber — just start doing things without thinking about their priority. Even doing tasks in the wrong order is better than doing nothing at all
  20. Stay on a single task as long as possible without decreasing your responsiveness below the minimum acceptable limit.
  21. Interrupt coalesce. Batch process things that take a lot of context switching
  22. To predict an unfamiliar event, assume that you arrived precisely halfway into something’s duration
  23. Double the serial number. To estimate the total number of items in a dataset from a single sampled item, multiply its identifier by 2.
  24. Know what distribution you're up against
  25. Small data is big data in disguise. We often make good predictions from a small number of observations because our priors are so rich.
  26. Protect your priors. Turn off the news — make better predictions
  27. It’s not always better to use a more complex model, one that takes a greater number of factors into account. Including more factors in a model will always, by definition, make it a better fit for the data we have already. But a better fit for the available data does not necessarily mean a better prediction
  28. Think less. Avoid overfitting. We can make better decisions by deliberately thinking and doing less.
  29. If you have all the facts, they’re free of all error and uncertainty, and you can directly assess whatever is important to you, then think long and hard
  30. If you have high uncertainty and limited data, then stop thinking early and make a decision quick
  31. Let it slide. Accept the solution that is close enough. Often perfect is just slightly better than good but takes much longer to come up with.
  32. Shake things up a bit. If it looks like you're stuck, mix things a little.
  33. Avoid the local maxima. Introduce the randomness factor into your choice
  34. To stimulate creativity, introduce a random element into your process that you will interact with (word, phrase, melody, etc.)
  35. Back off exponentially
  36. Push things to the point of failure, but make sure your response to failure is both sharp and resilient
  37. Under overload, don't be afraid to drop the ball
  38. Only play one level above your opponent
  39. Be wary of cases where public information seems to exceed private information, where you know more about what people are doing than why they’re doing it, where you’re more concerned with your judgments fitting the consensus than fitting the facts.
  40. Actions are not beliefs; cascades get caused in part when we misinterpret what others think based on what they do.
  41. Sometimes a game can have irredeemably lousy rules. Avoid such a game in the first place
  42. Seek the games where honesty is the dominant strategy. Then just be yourself
  43. Politely assert your preferences (“Personally, I’m inclined toward x. What do you think?”) and try to reduce, rather than maximize, the number of options that you give other people
  44. Live by "computational kindness." Protect people from unnecessary tension, friction, and mental labor. Make it as easy as possible for them to make a decision or do something
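
Principle 9's eviction rule is what computer science calls an LRU cache. A minimal sketch, assuming a toy capacity of three items:

```python
from collections import OrderedDict

# Least Recently Used cache: when full, evict whatever has gone unused
# the longest — the item we can expect to need furthest in the future.

class LRUCache:
    def __init__(self, capacity=3):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used

cache = LRUCache()
for doc in ["taxes", "notes", "recipe", "taxes", "budget"]:
    cache.put(doc, f"contents of {doc}")
print(list(cache.items))   # ['recipe', 'taxes', 'budget'] — 'notes' evicted
```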


✍️Notes

Introduction

  • If you want the best odds of getting the best apartment, spend 37% of your apartment hunt (eleven days, if you’ve given yourself a month for the search) noncommittally exploring options. Leave the checkbook at home; you’re just calibrating. But after that point, be prepared to immediately commit—deposit and all—to the very first place you see that beats whatever you’ve already seen. This is not merely an intuitively satisfying compromise between looking and leaping. It is the provably optimal solution.
  • We think about computers, we think about coldly mechanical, deterministic systems: machines applying rigid deductive logic, making decisions by exhaustively enumerating the options, and grinding out the exact right answer no matter how long and hard they have to think. Indeed, the person who first imagined computers had something essentially like this in mind. Alan Turing defined the very notion of computation by an analogy to a human mathematician who carefully works through the steps of a lengthy calculation, yielding an unmistakably right answer.
  • So it might come as a surprise that this is not what modern computers are actually doing when they face a difficult problem. Straightforward arithmetic, of course, isn’t particularly challenging for a modern computer. Rather, it’s tasks like conversing with people, fixing a corrupted file, or winning a game of Go—problems where the rules aren’t clear, some of the required information is missing, or finding exactly the right answer would require considering an astronomical number of possibilities—that now pose the biggest challenges in computer science. And the algorithms that researchers have developed to solve the hardest classes of problems have moved computers away from an extreme reliance on exhaustive calculation. Instead, tackling real-world tasks requires being comfortable with chance, trading off time with accuracy, and using approximations.
  • Over the past decade or two, behavioral economics has told a very particular story about human beings: that we are irrational and error-prone, owing in large part to the buggy, idiosyncratic hardware of the brain. This self-deprecating story has become increasingly familiar, but certain questions remain vexing. Why are four-year-olds, for instance, still better than million-dollar supercomputers at a host of cognitive tasks, including vision, language, and causal reasoning?
  • Life is full of problems that are, quite simply, hard. And the mistakes made by people often say more about the intrinsic difficulties of the problem than about the fallibility of human brains. Thinking algorithmically about the world, learning about the fundamental structures of the problems we face and about the properties of their solutions, can help us see how good we actually are, and better understand the errors that we make.

1. Optimal Stopping: When to Stop Looking

  • The 37% Rule* derives from optimal stopping’s most famous puzzle, which has come to be known as the “secretary problem.” Its setup is much like the apartment hunter’s dilemma that we considered earlier. Imagine you’re interviewing a set of applicants for a position as a secretary, and your goal is to maximize the chance of hiring the single best applicant in the pool. While you have no idea how to assign scores to individual applicants, you can easily judge which one you prefer. (A mathematician might say you have access only to the ordinal numbers—the relative ranks of the applicants compared to each other—but not to the cardinal numbers, their ratings on some kind of general scale.) You interview the applicants in random order, one at a time. You can decide to offer the job to an applicant at any point and they are guaranteed to accept, terminating the search. But if you pass over an applicant, deciding not to hire them, they are gone forever.
  • In your search for a secretary, there are two ways you can fail: stopping early and stopping late. When you stop too early, you leave the best applicant undiscovered. When you stop too late, you hold out for a better applicant who doesn’t exist. The optimal strategy will clearly require finding the right balance between the two, walking the tightrope between looking too much and not enough.
  • The optimal solution takes the form of what we’ll call the Look-Then-Leap Rule: You set a predetermined amount of time for “looking”—that is, exploring your options, gathering data—in which you categorically don’t choose anyone, no matter how impressive. After that point, you enter the “leap” phase, prepared to instantly commit to anyone who outshines the best applicant you saw in the look phase.
  • A 63% failure rate, when following the best possible strategy, is a sobering fact. Even when we act optimally in the secretary problem, we will still fail most of the time—that is, we won’t end up with the single best applicant in the pool.
  • In the decades since the secretary problem was first introduced, a wide range of variants on the scenario have been studied, with strategies for optimal stopping worked out under a number of different conditions. The possibility of rejection, for instance, has a straightforward mathematical solution: propose early and often. If you have, say, a 50/50 chance of being rejected, then the same kind of mathematical analysis that yielded the 37% Rule says you should start making offers after just a quarter of your search. If turned down, keep making offers to every best-yet person you see until somebody accepts. With such a strategy, your chance of overall success—that is, proposing and being accepted by the best applicant in the pool—will also be 25%. Not such terrible odds, perhaps, for a scenario that combines the obstacle of rejection with the general difficulty of establishing one’s standards in the first place.
  • For example, assume an immediate proposal is a sure thing but belated proposals are rejected half the time. Then the math says you should keep looking noncommittally until you’ve seen 61% of applicants, and then only leap if someone in the remaining 39% of the pool proves to be the best yet. If you’re still single after considering all the possibilities then go back to the best one that got away. The symmetry between strategy and outcome holds in this case once again, with your chances of ending up with the best applicant under this second-chances-allowed scenario also being 61%.
  • Full information means that we don’t need to look before we leap. We can instead use the Threshold Rule, where we immediately accept an applicant if she is above a certain percentile.
  • The math shows that when there are a lot of applicants left in the pool, you should pass up even a very good applicant in the hopes of finding someone still better than that—but as your options dwindle, you should be prepared to hire anyone who’s simply better than average. It’s a familiar, if not exactly inspiring, message: in the face of slim pickings, lower your standards.
  • No matter what, never hire someone who’s below average unless you’re totally out of options. (And since you’re still interested only in finding the very best person in the applicant pool, never hire someone who isn’t the best you’ve seen so far.)
  • The chance of ending up with the single best applicant in this full-information version of the secretary problem comes to 58%—still far from a guarantee, but considerably better than the 37% success rate offered by the 37% Rule in the no-information game.
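
A quick way to check the 37% Rule—and the sobering 63% failure rate—is simulation. A sketch, assuming 100 applicants and the no-information setup:

```python
import random

# Secretary problem: look at the first 37% without committing, then leap
# at the first applicant who beats everyone seen during the look phase.

def trial(n=100, look_fraction=0.37):
    ranks = list(range(n))              # 0 is the single best applicant
    random.shuffle(ranks)
    cutoff = int(n * look_fraction)
    benchmark = min(ranks[:cutoff])     # best seen while just looking
    for rank in ranks[cutoff:]:
        if rank < benchmark:            # first to outshine the benchmark
            return rank == 0            # did we land the very best?
    return False                        # the best was in the look phase; we lose

trials = 100_000
wins = sum(trial() for _ in range(trials))
print(f"hired the best applicant {wins / trials:.1%} of the time")  # ~37%
```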

When to Sell

  • Imagine selling a house, for instance. After consulting with several real estate agents, you put your place on the market; a new coat of paint, some landscaping, and then it’s just a matter of waiting for the offers to come in. As each offer arrives, you typically have to decide whether to accept it or turn it down. But turning down an offer comes at a cost—another week (or month) of mortgage payments while you wait for the next offer, which isn’t guaranteed to be any better.
  • This particular mathematical result doesn’t care whether you’re selling a mansion worth millions or a ramshackle shed. The only thing it cares about is the difference between the highest and lowest offers you’re likely to receive.
  • For instance, let’s say the range of offers we’re expecting runs from $400,000 to $500,000. First, if the cost of waiting is trivial, we’re able to be almost infinitely choosy. If the cost of getting another offer is only a dollar, we’ll maximize our earnings by waiting for someone willing to offer us $499,552.79 and not a dime less. If waiting costs $2,000 an offer, we should hold out for an even $480,000. In a slow market where waiting costs $10,000 an offer, we should take anything over $455,279. Finally, if waiting costs half or more of our expected range of offers—in this case, $50,000—then there’s no advantage whatsoever to holding out; we’ll do best by taking the very first offer that comes along and calling it done.
  • Even if it’s possible to reconsider an earlier offer, and even if that offer is guaranteed to still be on the table, you should nonetheless never do so. If it wasn’t above your threshold then, it won’t be above your threshold now. What you’ve paid to keep searching is a sunk cost. Don’t compromise, don’t second-guess. And don’t look back.
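
The dollar figures above can be reproduced with a closed-form threshold for offers drawn uniformly from a known range. The formula below is a reconstruction that matches the book's numbers, not something the book spells out:

```python
from math import sqrt

# Set the threshold where the expected gain from one more offer equals
# the cost of waiting: hold out for high - sqrt(2 * cost * range).

def selling_threshold(low, high, cost):
    spread = high - low
    if cost >= spread / 2:      # waiting is too expensive: take any offer
        return low
    return high - sqrt(2 * cost * spread)

for cost in (1, 2_000, 10_000, 50_000):
    t = selling_threshold(400_000, 500_000, cost)
    print(f"waiting cost ${cost:>6,}: hold out for ${t:,.2f}")
# $499,552.79 / $480,000.00 / $455,278.64 / $400,000.00 (take the first offer)
```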

When to Park

  • When occupancy goes from 90% to 95%, it accommodates only 5% more cars but doubles the length of everyone’s search.
  • If this infinite street has a big-city occupancy rate of 99%, with just 1% of spots vacant, then you should take the first spot you see starting at almost 70 spots—more than a quarter mile—from your destination.
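
One way to back out that almost-70-spots figure: with each spot vacant independently at 1%, find the distance at which the odds of any vacancy closer to the destination drop below even. The 50% cutoff is an assumed decision rule for this sketch:

```python
from math import ceil, log

# Smallest n such that (1 - vacancy_rate)^n < odds — the distance at
# which a closer vacant spot becomes less likely than not.

def start_taking_at(vacancy_rate, odds=0.5):
    return ceil(log(odds) / log(1 - vacancy_rate))

print(start_taking_at(0.01))   # 69 — almost 70 spots out, a quarter mile
```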

When to Quit

  • The “burglar problem.” In this problem, a burglar has the opportunity to carry out a sequence of robberies. Each robbery provides some reward, and there’s a chance of getting away with it each time. But if the burglar is caught, he gets arrested and loses all his accumulated gains. What algorithm should he follow to maximize his expected take?
  • The fact that this problem has a solution is bad news for heist movie screenplays: when the team is trying to lure the old burglar out of retirement for one last job, the canny thief need only crunch the numbers. Moreover, the results are pretty intuitive: the number of robberies you should carry out is roughly equal to the chance you get away, divided by the chance you get caught. If you’re a skilled burglar and have a 90% chance of pulling off each robbery (and a 10% chance of losing it all), then retire after 90/10 = 9 robberies.
  • Surprisingly, not giving up—ever—also makes an appearance in the optimal stopping literature. It might not seem like it from the wide range of problems we have discussed, but there are sequential decision-making problems for which there is no optimal stopping rule. A simple example is the game of “triple or nothing.” Imagine you have $1.00, and can play the following game as many times as you want: bet all your money, and have a 50% chance of receiving triple the amount and a 50% chance of losing your entire stake. How many times should you play? Despite its simplicity, there is no optimal stopping rule for this problem, since each time you play, your average gains are a little higher. Starting with $1.00, you will get $3.00 half the time and $0.00 half the time, so on average you expect to end the first round with $1.50 in your pocket. Then, if you were lucky in the first round, the two possibilities from the $3.00 you’ve just won are $9.00 and $0.00—for an average return of $4.50 from the second bet. The math shows that you should always keep playing. But if you follow this strategy, you will eventually lose everything. Some problems are better avoided than solved.
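
A short simulation makes the triple-or-nothing paradox concrete: the average outcome tracks the ever-growing expected value, yet almost every individual player goes broke.

```python
import random

# Bet everything each round: triple with probability 1/2, lose it all
# otherwise. Expected value grows 1.5x per round — but ruin is the norm.

def play(rounds=10, stake=1.0):
    for _ in range(rounds):
        if random.random() < 0.5:
            stake *= 3
        else:
            return 0.0
    return stake

trials = 100_000
outcomes = [play() for _ in range(trials)]
broke = sum(1 for s in outcomes if s == 0)
print(f"went broke within 10 rounds: {broke / trials:.1%}")   # ~99.9%
print(f"average outcome: {sum(outcomes) / trials:.1f}")       # ~57.7 ≈ 1.5^10
```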

Always Be Stopping

  • About a dozen studies have produced the same result: people tend to stop early, leaving better applicants unseen.
  • Another consideration that isn’t taken into account in the classical secretary problem: the role of time. After all, the whole time you’re searching for a secretary, you don’t have a secretary. What’s more, you’re spending the day conducting interviews instead of getting your own work done.
  • But this doesn’t make optimal stopping problems less important; it actually makes them more important, because the flow of time turns all decision-making into optimal stopping.

2. Explore/Exploit: The Latest vs. the Greatest

  • Every day we are constantly forced to make decisions between options that differ in a very specific dimension: do we try new things or stick with our favorite ones?
  • Computer scientists have been working on finding this balance for more than fifty years. They even have a name for it: the explore/exploit tradeoff.

Explore/Exploit

  • In English, the words “explore” and “exploit” come loaded with completely opposite connotations. But to a computer scientist, these words have much more specific and neutral meanings. Simply put, exploration is gathering information, and exploitation is using the information you have to get a known good result.
  • In computer science, the tension between exploration and exploitation takes its most concrete form in a scenario called the “multi-armed bandit problem.” The odd name comes from the colloquial term for a casino slot machine, the “one-armed bandit.” Imagine walking into a casino full of different slot machines, each one with its own odds of a payoff. The rub, of course, is that you aren’t told those odds in advance: until you start playing, you won’t have any idea which machines are the most lucrative (“loose,” as slot-machine aficionados call it) and which ones are just money sinks.
  • Naturally, you’re interested in maximizing your total winnings. And it’s clear that this is going to involve some combination of pulling the arms on different machines to test them out (exploring), and favoring the most promising machines you’ve found (exploiting).
  • To get a sense for the problem’s subtleties, imagine being faced with only two machines. One you’ve played a total of 15 times; 9 times it paid out, and 6 times it didn’t. The other you’ve played only twice, and it once paid out and once did not. Which is more promising?
  • Simply dividing the wins by the total number of pulls will give you the machine’s “expected value,” and by this method the first machine clearly comes out ahead. Its 9–6 record makes for an expected value of 60%, whereas the second machine’s 1–1 record yields an expected value of only 50%. But there’s more to it than that. After all, just two pulls aren’t really very many. So there’s a sense in which we just don’t yet know how good the second machine might actually be.
  • Decisions are almost never isolated, and expected value isn’t the end of the story. If you’re thinking not just about the next decision, but about all the decisions you are going to make about the same options in the future, the explore/exploit tradeoff is crucial to the process.

Seize the Interval

  • When balancing favorite experiences and new ones, nothing matters as much as the interval over which we plan to enjoy them.
  • A sobering property of trying new things is that the value of exploration, of finding a new favorite, can only go down over time, as the remaining opportunities to savor it dwindle.
  • The flip side is that the value of exploitation can only go up over time.
  • So explore when you will have time to use the resulting knowledge, exploit when you’re ready to cash in. The interval makes the strategy.
  • Take Hollywood, for instance: Among the ten highest-grossing movies of 1981, only two were sequels. In 1991, it was three. In 2001, it was five. And in 2011, eight of the top ten highest-grossing films were sequels. In fact, 2011 set a record for the greatest percentage of sequels among major studio releases. Then 2012 immediately broke that record; the next year would break it again. In December 2012, journalist Nick Allen looked ahead with palpable fatigue to the year to come:
    • Audiences will be given a sixth helping of X-Men plus Fast and Furious 6, Die Hard 5, Scary Movie 5 and Paranormal Activity 5. There will also be Iron Man 3, The Hangover 3, and second outings for The Muppets, The Smurfs, GI Joe and Bad Santa.
  • From a studio’s perspective, a sequel is a movie with a guaranteed fan base: a cash cow, a sure thing, an exploit. And an overload of sure things signals a short-termist approach, as with Stucchio on his way out of town. The sequels are more likely than brand-new movies to be hits this year, but where will the beloved franchises of the future come from? Such a sequel deluge is not only lamentable (certainly critics think so); it’s also somewhat poignant. By entering an almost purely exploit-focused phase, the film industry seems to be signaling a belief that it is near the end of its interval.

Win-Stay

  • Win-Stay, Lose-Shift algorithm: choose an arm at random, and keep pulling it as long as it keeps paying off. If the arm doesn’t pay off after a particular pull, then switch to the other one. Although this simple strategy is far from a complete solution, Robbins proved in 1952 that it performs reliably better than chance.
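
A minimal sketch of Win-Stay, Lose-Shift on a two-armed bandit (the payout odds are invented); as Robbins showed, it reliably beats playing at random:

```python
import random

# Stick with an arm while it pays off; switch the moment it doesn't.

def win_stay_lose_shift(odds=(0.6, 0.4), pulls=100_000):
    arm, wins = random.randrange(2), 0
    for _ in range(pulls):
        if random.random() < odds[arm]:
            wins += 1            # win: stay on this arm
        else:
            arm = 1 - arm        # lose: shift to the other arm
    return wins / pulls

print(f"win rate: {win_stay_lose_shift():.3f}")
# ≈ 0.52 — better than the 0.50 of random play, though short of the
# 0.60 you'd get by always playing the better arm.
```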

The Gittins Index

  • Unlike previous researchers, Gittins approached the multi-armed bandit problem in those terms. He conceived the goal as maximizing payoffs not for a fixed interval of time, but for a future that is endless yet discounted.
  • Gittins, for his part, made the assumption that the value assigned to payoffs decreases geometrically: that is, each restaurant visit you make is worth a constant fraction of the last one. If, let’s say, you believe there is a 1% chance you’ll get hit by a bus on any given day, then you should value tomorrow’s dinner at 99% of the value of tonight’s, if only because you might never get to eat it.
  • For every slot machine we know little or nothing about, there is some guaranteed payout rate which, if offered to us in lieu of that machine, will make us quite content never to pull its handle again. This number—which Gittins called the “dynamic allocation index,” and which the world now knows as the Gittins index—suggests an obvious strategy on the casino floor: always play the arm with the highest index.
  • In fact, the index strategy turned out to be more than a good approximation. It completely solves the multi-armed bandit with geometrically discounted payoffs. The tension between exploration and exploitation resolves into the simpler task of maximizing a single quantity that accounts for both.
  • Looking at the Gittins index values in the table, there are a few other interesting observations. First, you can see the win-stay principle at work: as you go from left to right in any row, the index scores always increase. So if an arm is ever the correct one to pull, and that pull is a winner, then (following the chart to the right) it can only make more sense to pull the same arm again. Second, you can see where lose-shift would get you into trouble. Having nine initial wins followed by a loss gets you an index of 0.8695, which is still higher than most of the other values in the table—so you should probably stay with that arm for at least another pull.
  • But perhaps the most interesting part of the table is the top-left entry. A record of 0–0—an arm that’s a complete unknown—has an expected value of 0.5000 but a Gittins index of 0.7029. In other words, something you have no experience with whatsoever is more attractive than a machine that you know pays out seven times out of ten! As you go down the diagonal, notice that a record of 1–1 yields an index of 0.6346, a record of 2–2 yields 0.6010, and so on. If such 50%-successful performance persists, the index does ultimately converge on 0.5000, as experience confirms that the machine is indeed nothing special and takes away the “bonus” that spurs further exploration. But the convergence happens fairly slowly; the exploration bonus is a powerful force. Indeed, note that even a failure on the very first pull, producing a record of 0–1, makes for a Gittins index that’s still above 50%.
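
The Gittins index itself is laborious to compute, but the exploration bonus it embodies shows up in the simpler upper-confidence-bound family of algorithms. A sketch of UCB1 scoring—a stand-in for illustration, not the Gittins calculation:

```python
from math import log, sqrt

# UCB1: score an arm by its observed win rate plus a bonus that shrinks
# as evidence accumulates. Unknowns get the benefit of the doubt.

def ucb1_score(wins, pulls, total_pulls):
    if pulls == 0:
        return float("inf")      # a complete unknown is always worth a pull
    return wins / pulls + sqrt(2 * log(total_pulls) / pulls)

# A 1-1 record outscores 60 wins out of 100 pulls, because two pulls
# tell us so little about how good the arm might really be:
print(ucb1_score(1, 2, 102))      # ≈ 2.65 — big uncertainty bonus
print(ucb1_score(60, 100, 102))   # ≈ 0.90 — well-measured, small bonus
```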

Regret and Optimism

To try and fail is at least to learn; to fail to try is to suffer the inestimable loss of what might have been.
  • The framework I found, which made the decision incredibly easy, was what I called—which only a nerd would call—a “regret minimization framework.” So I wanted to project myself forward to age 80 and say, “Okay, now I’m looking back on my life. I want to have minimized the number of regrets I have.” I knew that when I was 80 I was not going to regret having tried this. I was not going to regret trying to participate in this thing called the Internet that I thought was going to be a really big deal. I knew that if I failed I wouldn’t regret that, but I knew the one thing I might regret is not ever having tried. I knew that that would haunt me every day, and so, when I thought about it that way it was an incredibly easy decision. — Jeff Bezos
  • Computer science can’t offer you a life with no regret. But it can, potentially, offer you just what Bezos was looking for: a life with minimal regret.
  • Regret is the result of comparing what we actually did with what would have been best in hindsight.
  • In general we can’t realistically expect someday to never have any more regrets. But if we’re following a regret-minimizing algorithm, every year we can expect to have fewer new regrets than we did the year before.

Bandits Online

  • In 2007, Google product manager Dan Siroker took a leave of absence to join the presidential campaign of then senator Barack Obama in Chicago. Heading the “New Media Analytics” team, Siroker brought one of Google’s web practices to bear on the campaign’s bright-red DONATE button. The result was nothing short of astonishing: $57 million of additional donations were raised as a direct result of his work. What exactly did he do to that button? He A/B tested it.
  • If you’ve used the Internet basically at all over the past decade, then you’ve been a part of someone else’s explore/exploit problem.
  • Companies want to discover the things that make them the most money while simultaneously making as much of it as they can—explore, exploit. Big tech firms such as Amazon and Google began carrying out live A/B tests on their users starting in about 2000, and over the following years the Internet has become the world’s largest controlled experiment. What are these companies exploring and exploiting? In a word, you: whatever it is that makes you move your mouse and open your wallet.
  • Within a decade or so after its first tentative use, A/B testing was no longer a secret weapon. It has become such a deeply embedded part of how business and politics are conducted online as to be effectively taken for granted.
  • The process of A/B testing itself has become increasingly refined over time. The most canonical A/B setup—splitting the traffic evenly between two options, running the test for a set period of time, and thereafter giving all the traffic to the winner—might not necessarily be the best algorithm for solving the problem, since it means half the users are stuck getting the inferior option as long as the test continues.
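
The refinement hinted at here is, in one common form, Thompson sampling: draw a plausible conversion rate for each variant from its posterior and show whichever draw is highest, so traffic drifts toward the winner as evidence accumulates. A sketch with invented conversion rates:

```python
import random

# Thompson sampling for two page variants. Each variant keeps a Beta
# posterior over its conversion rate; we sample from both posteriors
# and show the variant whose sample is higher.

true_rates = {"A": 0.04, "B": 0.05}     # unknown to the algorithm
stats = {v: {"wins": 0, "trials": 0} for v in true_rates}

for _ in range(10_000):
    draws = {
        v: random.betavariate(s["wins"] + 1, s["trials"] - s["wins"] + 1)
        for v, s in stats.items()
    }
    shown = max(draws, key=draws.get)
    stats[shown]["trials"] += 1
    stats[shown]["wins"] += random.random() < true_rates[shown]

for v, s in stats.items():
    print(v, s["trials"], "impressions")   # most traffic usually ends on B
```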

The Restless World

  • In general, it seems that people tend to over-explore—to favor the new disproportionately over the best.
  • How much? In one experiment, participants were given 1,000 trials in which to either observe a machine (gaining information but winning nothing) or place a bet. People chose to observe 505 times, on average, placing bets the other 495 times. But the math says they should have started to bet after just 38 observations—leaving 962 chances to cash in.
  • Psychologists Mark Steyvers, Michael Lee, and E.-J. Wagenmakers have run an experiment with a four-armed bandit, asking a group of people to choose which arm to play over a sequence of fifteen opportunities. They then classified the strategies that participants seemed to use. The results suggested that 30% were closest to the optimal strategy, 47% most resembled Win-Stay, Lose-Shift, and 22% seemed to move at random between selecting a new arm and playing the best arm found so far.
  • The standard multi-armed bandit problem assumes that the probabilities with which the arms pay off are fixed over time. But that’s not necessarily true of airlines, restaurants, or other contexts in which people have to make repeated choices. If the probabilities of a payoff on the different arms change over time—what has been termed a “restless bandit”—the problem becomes much harder. (So much harder, in fact, that there’s no tractable algorithm for completely solving it, and it’s believed there never will be.)
  • To live in a restless world requires a certain restlessness in oneself. So long as things continue to change, you must never fully cease exploring.

Explore …

  • “Childhood gives you a period in which you can just explore possibilities, and you don’t have to worry about payoffs because payoffs are being taken care of by the mamas and the papas and the grandmas and the babysitters.”
  • “Children are cognitively deficient in various ways—because if you look at their exploit capacities, they look terrible. They can’t tie their shoes, they’re not good at long-term planning, they’re not good at focused attention. Those are all things that kids are really awful at.” But pressing buttons at random, being very interested in new toys, and jumping quickly from one thing to another are all things that kids are really great at. And those are exactly what they should be doing if their goal is exploration.

… And Exploit

  • The basic pattern is clear: the size of people’s social networks (that is, the number of social relationships they engage in) almost invariably decreases over time.
  • The traditional explanation for the elderly having smaller social networks is that it’s just one example of the decrease in quality of life that comes with aging—the result of diminished ability to contribute to social relationships, greater fragility, and general disengagement from society. But Carstensen has argued that, in fact, the elderly have fewer social relationships by choice. As she puts it, these decreases are “the result of lifelong selection processes by which people strategically and adaptively cultivate their social networks to maximize social and emotional gains and minimize social and emotional risks.”
  • Being sensitive to how much time you have left is exactly what the computer science of the explore/exploit dilemma suggests. We think of the young as stereotypically fickle; the old, stereotypically set in their ways. In fact, both are behaving completely appropriately with respect to their intervals. The deliberate honing of a social network down to the most meaningful relationships is the rational response to having less time to enjoy them.
  • The explore/exploit tradeoff also tells us how to think about advice from our elders. When your grandfather tells you which restaurants are good, you should listen—these are pearls gleaned from decades of searching. But when he only goes to the same restaurant at 5:00 p.m. every day, you should feel free to explore other options, even though they’ll likely be worse.

3. Sorting: Making Order

  • The roommate pulled a sock out of the clean laundry hamper. Next he pulled another sock out at random. If it didn’t match the first one, he tossed it back in. Then he continued this process, pulling out socks one by one and tossing them back until he found a match for the first. With just 10 different pairs of socks, following this method will take on average 19 pulls merely to complete the first pair, and 17 more pulls to complete the second. In total, the roommate can expect to go fishing in the hamper 110 times just to pair 20 socks. It was enough to make any budding computer scientist request a room transfer.
  • Inspired by the punched railway tickets of the time, an inventor by the name of Herman Hollerith devised a system of punched manila cards to store information, and a machine, which he called the Hollerith Machine, to count and sort them. Hollerith was awarded a patent in 1889, and the government adopted the Hollerith Machine for the 1890 census. No one had ever seen anything like it. Wrote one awestruck observer, “The apparatus works as unerringly as the mills of the Gods, but beats them hollow as to speed.”
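
A quick simulation of the roommate's sock-pairing procedure confirms that 110-pull average:

```python
import random

# Pull a sock, then keep pulling (tossing mismatches back) until its
# match turns up; repeat until the hamper is empty.

def pair_all(pairs=10):
    hamper = list(range(pairs)) * 2
    random.shuffle(hamper)
    pulls = 0
    while hamper:
        first = hamper.pop()
        pulls += 1
        while True:
            pick = random.randrange(len(hamper))
            pulls += 1
            if hamper[pick] == first:    # match: pair complete
                hamper.pop(pick)
                break
            # mismatch: toss it back and fish again
    return pulls

trials = 10_000
print(sum(pair_all() for _ in range(trials)) / trials)   # ≈ 110
```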

The Agony of Sorting

  • With sorting, size is a recipe for disaster: perversely, as a sort grows larger, “the unit cost of sorting, instead of falling, rises.” Sorting involves steep diseconomies of scale, violating our normal intuitions about the virtues of doing things in bulk.
  • Cooking for two is typically no harder than cooking for one, and it’s certainly easier than cooking for one person twice. But sorting, say, a shelf of a hundred books will take you longer than sorting two bookshelves of fifty apiece: you have twice as many things to organize, and there are twice as many places each of them could go. The more you take on, the worse it gets.
  • This is the first and most fundamental insight of sorting theory. Scale hurts.
  • The Guinness Book of World Records attributes the record for sorting a deck of cards to the Czech magician Zdeněk Bradáč. On May 15, 2008, Bradáč sorted a 52-card deck in just 36.16 seconds.
  • The fine folks at Guinness care only about best-case performance (and beer). They’re hardly blameworthy, of course: all records in sports reflect the single best performance. Computer science, however, almost never cares about the best case. Instead, computer scientists might want to know the average sort time of someone like Bradáč: get him to sort all of the 80 unvigintillion deck orders, or a reasonably sized sample, and score him on his average speed across all attempts. (You can see why they don’t let computer scientists run these things.)
  • Moreover, a computer scientist would want to know the worst sort time. Worst-case analysis lets us make hard guarantees: that a critical process will finish in time, that deadlines won’t be blown.
  • Imagine you’re hosting a dinner party with n guests. The time required to clean the house for their arrival doesn’t depend on the number of guests at all. This is the rosiest class of problems there is: called “Big-O of one,” written O(1), it is also known as “constant time.”
  • Now, the time required to pass the roast around the table will be “Big-O of n,” written O(n), also known as “linear time”—with twice the guests, you’ll wait twice as long for the dish to come around.
  • What if, as the guests arrived, each one hugged the others in greeting? Your first guest hugs you; your second guest has two hugs to give; your third guest, three. How many hugs will there be in total? This turns out to be “Big-O of n-squared,” written O(n²) and also known as “quadratic time.”
  • It gets worse from there. There’s “exponential time,” O(2ⁿ), where each additional guest doubles your work. Even worse is “factorial time,” O(n!), a class of problems so truly hellish that computer scientists only talk about it when they’re joking—as we were in imagining shuffling a deck until it’s sorted—or when they really, really wish they were.

The Squares: Bubble Sort and Insertion Sort

Bubble Sort

  • Imagine you want to alphabetize your unsorted collection of books. A natural approach would be just to scan across the shelf looking for out-of-order pairs—Wallace followed by Pynchon, for instance—and flipping them around. Put Pynchon ahead of Wallace, then continue your scan, looping around to the beginning of the shelf each time you reach the end. When you make a complete pass without finding any more out-of-order pairs on the entire shelf, then you know the job is done.
  • This is Bubble Sort, and it lands us in quadratic time. There are n books out of order, and each scan through the shelf can move each one at most one position. (We spot a tiny problem, make a tiny fix.) So in the worst case, where the shelf is perfectly backward, at least one book will need to be moved n positions. Thus a maximum of n passes through n books, which gives us O(n²) in the worst case.
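
A direct transcription of that shelf-scanning procedure (a sketch):

```python
# Bubble Sort: sweep for out-of-order neighbors, swap them, and repeat
# until a full pass finds nothing left to fix. Worst case O(n²).

def bubble_sort(shelf):
    done = False
    while not done:
        done = True
        for i in range(len(shelf) - 1):
            if shelf[i] > shelf[i + 1]:
                shelf[i], shelf[i + 1] = shelf[i + 1], shelf[i]
                done = False          # we made a fix, so scan again
    return shelf

print(bubble_sort(["Wallace", "Pynchon", "Dickens"]))
# ['Dickens', 'Pynchon', 'Wallace']
```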

Insertion Sort

  • You might take a different tack—pulling all the books off the shelf and putting them back in place one by one. You’d put the first book in the middle of the shelf, then take the second and compare it to the first, inserting it either to the right or to the left. Picking up the third book, you’d run through the books on the shelf from left to right until you found the right spot to tuck it in. Repeating this process, gradually all of the books would end up sorted on the shelf and you’d be done.
  • Computer scientists call this, appropriately enough, Insertion Sort. The good news is that it’s arguably even more intuitive than Bubble Sort and doesn’t have quite the bad reputation. The bad news is that it’s not actually that much faster. You still have to do one insertion for each book. And each insertion still involves moving past about half the books on the shelf, on average, to find the correct place.
  • Although in practice Insertion Sort does run a bit faster than Bubble Sort, again we land squarely, if you will, in quadratic time.
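
A sketch of Insertion Sort. Even using a fast binary search to find each book's spot, shifting over the books to its right keeps the total work quadratic:

```python
import bisect

# Insertion Sort: place each book into its correct spot among the
# already-sorted ones.

def insertion_sort(books):
    shelf = []
    for book in books:
        bisect.insort(shelf, book)    # find the spot and tuck it in
    return shelf

print(insertion_sort(["Wallace", "Pynchon", "Dickens", "Austen"]))
# ['Austen', 'Dickens', 'Pynchon', 'Wallace']
```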

Breaking the Quadratic Barrier: Divide and Conquer

  • Could we find a constant-time sort, O(1), one that (like cleaning the house before the bevy of guests arrive) can sort a list of any size in the same amount of time? Well, even just confirming that a shelf of n books is sorted cannot be done in constant time, since it requires checking all n of them.
  • What about a linear-time sort, O(n), as efficient as passing a dish around a table, where doubling the number of items to sort merely doubles the work? Thinking about the examples above, it’s tough to imagine how that might work either. The n² in each case comes from the fact that you need to move n books, and the work required in each move scales with n as well. How would we get from n moves of size n down to just n by itself?
  • A linear-time sort means handling each book for constant time regardless of how many others it needs to find its place among. Doesn’t seem likely.
  • So we know that we can do at least as well as quadratic time, but probably not as well as linear time. Perhaps our limit lies somewhere between linear time and quadratic time.
  • The program that John von Neumann wrote in 1945 to demonstrate the power of the stored-program computer took the idea of collating to its beautiful and ultimate conclusion. Sorting two cards is simple: just put the smaller one on top. And given a pair of two-card stacks, both of them sorted, you can easily collate them into an ordered stack of four. Repeating this trick a few times, you’d build bigger and bigger stacks, each one of them already sorted. Soon enough, you could collate yourself a perfectly sorted full deck—with a final climactic merge, like a riffle shuffle’s order-creating twin, producing the desired result.
  • This approach is known today as Mergesort, one of the legendary algorithms in computer science. As a 1997 paper put it, “Mergesort is as important in the history of sorting as sorting in the history of computing.”
  • The power of Mergesort comes from the fact that it indeed ends up with a complexity between linear and quadratic time—specifically, O(n log n), known as “linearithmic” time. Each pass through the cards doubles the size of the sorted stacks, so to completely sort n cards you’ll need to make as many passes as it takes for the number 2, multiplied by itself, to equal n: the base-two logarithm, in other words.
  • The O(n log n) linearithmic time offered by Mergesort is truly the best we can hope to achieve. It’s been proven that if we want to fully sort n items via a series of head-to-head comparisons, there’s just no way to compare them any fewer than O(n log n) times. It’s a fundamental law of the universe, and there are no two ways around it.
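
A sketch of Mergesort's collation idea:

```python
# Mergesort: sort each half, then collate the two sorted halves the way
# you'd merge two sorted stacks of cards. O(n log n).

def merge_sort(cards):
    if len(cards) <= 1:
        return cards
    mid = len(cards) // 2
    left, right = merge_sort(cards[:mid]), merge_sort(cards[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:          # take the smaller top card
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([8, 3, 5, 1, 9, 2]))    # [1, 2, 3, 5, 8, 9]
```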

Bucket Sort

  • Sometimes you don’t need a fully ordered set—and sometimes sorting can be done without any item-to-item comparisons at all. These two principles, taken together, allow for rough practical sorts in faster than linearithmic time. This is beautifully demonstrated by an algorithm known as Bucket Sort.
  • In Bucket Sort, items are grouped together into a number of sorted categories, with no regard for finer, intracategory sorting; that can come later.
  • Here’s the kicker: if you want to group n items into m buckets, the grouping can be done in O(nm) time—that is, the time is simply proportional to the number of items times the number of buckets. And as long as the number of buckets is relatively small compared to the number of items, Big-O notation will round that to O(n), or linear time.
  • The key to actually breaking the linearithmic barrier is knowing the distribution from which the items you’re sorting are drawn. Poorly chosen buckets will leave you little better than when you started; if all the books end up in the same bin, for instance, you haven’t made any progress at all. Well-chosen buckets, however, will divide your items into roughly equal-sized groups, which—given sorting’s fundamental “scale hurts” nature—is a huge step toward a complete sort.
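
A rough sketch of the grouping step; the alphabetical boundary is a made-up example, and the inner scan across bucket boundaries is what gives the O(nm) bound:

```python
def bucket_group(items, boundaries):
    """Group items into len(boundaries) + 1 buckets, with no sorting
    within buckets. boundaries must be in ascending order."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for item in items:                          # n items ...
        for k, bound in enumerate(boundaries):  # ... scanned against m buckets
            if item < bound:
                buckets[k].append(item)
                break
        else:
            buckets[-1].append(item)  # bigger than every boundary
    return buckets

# Hypothetical example: splitting book titles at the letter N.
print(bucket_group(["Moby-Dick", "Emma", "Walden", "Ulysses"], ["N"]))
# [['Moby-Dick', 'Emma'], ['Walden', 'Ulysses']]
```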

Sort Is Prophylaxis for Search

  • If you actually asked a computer scientist to help implement this process, the first question they would ask is whether you should be sorting at all.
  • Computer science, as undergraduates are taught, is all about tradeoffs. And one of the most central tradeoffs is between sorting and searching. The basic principle is this: the effort expended on sorting materials is just a preemptive strike against the effort it’ll take to search through them later.
  • Err on the side of messiness.
  • Sorting something that you will never search is a complete waste; searching something you never sorted is merely inefficient.
  • If you’re Google, you are almost certain that (a) your data will be searched, (b) it will be searched not just once but repeatedly, and (c) the time needed to sort is somehow “less valuable” than the time needed to search. (Here, sorting is done by machines ahead of time, before the results are needed, and searching is done by users for whom time is of the essence.) All of these factors point in favor of tremendous up-front sorting, which is indeed what Google and its fellow search engines do.
  • The verdict is clear: ordering your bookshelf will take more time and energy than scanning through it ever will.
  • In 2011, Steve Whittaker led a study of the searching and sorting habits of email users, resulting in a paper titled “Am I Wasting My Time Organizing Email?” Spoiler alert: the conclusion was an emphatic Yes.
  • Computer science shows that the hazards of mess and the hazards of order are quantifiable and that their costs can be measured in the same currency: time. Leaving something unsorted might be thought of as an act of procrastination—passing the buck to one’s future self, who’ll have to pay off with interest what we chose not to pay up front. But the whole story is subtler than that. Sometimes mess is more than just the easy choice. It’s the optimal choice.

Sorts and Sports

  • One of the most familiar algorithms in sports is the Round-Robin format, where each of n teams eventually plays every one of the other n − 1 teams. While arguably the most comprehensive, it’s also one of the most laborious. Having every team grapple with every other team is like having guests exchange hugs at our dinner party: the dreaded O(n²), quadratic time.
  • Ladder tournaments—popular in sports like badminton, squash, and racquetball—put players in a linear ranking, with each player allowed to issue a direct challenge to the player immediately above them, exchanging places if they prevail. Ladders are the Bubble Sorts of the athletic world and are thus also quadratic, requiring O(n²) games to reach a stable ranking.
  • Perhaps the most prevalent tournament format, however, is a bracket tournament—as in the famous NCAA basketball “March Madness,” among many others. The March Madness tournament progresses from the “Round of 64” and the “Round of 32” to the “Sweet 16,” “Elite Eight,” “Final Four,” and the finals. Each round divides the field in half: does that sound familiarly logarithmic? These tournaments are effectively Mergesort, beginning with unsorted pairs of teams and collating, collating, collating them.
  • Ironically, in Single Elimination no tournament structure is actually necessary at all. Any 63 games will yield a single undefeated champion.

Griping Rights: Noise and Robustness

  • Another, perhaps even more important way of training an algorithmic lens on sports is to ask not what confidence we should have in the silver medal, but what confidence we should have in the gold.
  • In some sports, “for instance baseball, a team is going to lose 30% of their games and a team is going to win 30% of their games practically no matter who they are.” This has disturbing implications for the Single Elimination format. If NCAA basketball games, say, are won by the stronger team 70% of the time, and winning the tournament involves prevailing in 6 straight games, then the best team has only a 0.70⁶—less than 12%—chance of winning the tournament! Put another way, the tournament would crown the league’s truly best team just once a decade.
  • Computer scientists call this phenomenon noise. All of the sorting algorithms that we’ve considered thus far assume perfect, flawless, foolproof comparisons, ones that never mess up and mistakenly judge the lesser of two quantities to be the greater. Once you allow for a “noisy comparator,” some of computer science’s most hallowed algorithms go out the window—and some of its most maligned have their day of redemption.
  • The winner of that particular honor is an algorithm called Comparison Counting Sort. In this algorithm, each item is compared to all the others, generating a tally of how many items it is bigger than. This number can then be used directly as the item’s rank. Since it compares all pairs, Comparison Counting Sort is a quadratic-time algorithm, like Bubble Sort. Thus it’s not a popular choice in traditional computer science applications, but it’s exceptionally fault-tolerant (a quick sketch follows this list).
  • This algorithm’s workings should sound familiar. Comparison Counting Sort operates exactly like a Round-Robin tournament. In other words, it strongly resembles a sports team’s regular season—playing every other team in the division and building up a win-loss record by which they are ranked.
  • That Comparison Counting Sort is the single most robust sorting algorithm known, quadratic or better, should offer something very specific to sports fans: if your team doesn’t make the playoffs, don’t whine
  • Put differently, if your team is eliminated early in the postseason, it’s tough luck. But if your team fails to get to the postseason, it’s tough truth.
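
A quick sketch of Comparison Counting Sort (assuming all items are distinct, for simplicity):

```python
def comparison_counting_sort(items):
    """Rank each item by counting how many others it beats: a full
    round-robin, O(n²) comparisons, but highly robust to noisy ones."""
    ranked = [None] * len(items)
    for item in items:
        wins = sum(item > other for other in items)  # the win-loss record
        ranked[wins] = item  # the win count is exactly the item's rank
    return ranked

print(comparison_counting_sort([40, 10, 30, 20]))  # [10, 20, 30, 40]
```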

Blood Sort: Pecking Orders and Dominance Hierarchies

  • What does sorting look like when it emerges organically, from the bottom up? It might look something like online poker.
In some ways the most important skill as a professional poker player is to be able to evaluate how good you are. If you’re anything short of the very best poker player in the world, you can be pretty much assured of going broke if you are endlessly willing to play people better than you. — Isaac Haxton
  • In multi-handed poker cash games, there will often be one weak player—a wealthy amateur, for instance—feeding a table full of professionals, who then don’t much care who among them is better than whom. In the world of heads-up, it’s different. “There has to be a disagreement between you and them about who’s better—or somebody has to be willingly losing.”
  • “So if you want to play heads-up no-limit, with blinds of fifty and one hundred dollars, there are only ten available tables for that,” says Haxton, “and so only the consensus ten best players who are out right now … sit and wait for someone to show up who wants to play.” And if a superior player arrives and sits down at one of these tables? If the person sitting isn’t willing to ante up, they scram.
  • Displacement happens when an animal uses its knowledge of the hierarchy to determine that a particular confrontation simply isn’t worth it.
I’m one of the top heads-up, no-limit hold ’em players in the world, and in my head I have a fairly specific ranking of who I think the twenty or so best players are, and I think each of them has a similar ranking in their mind. I think there is a pretty high degree of consensus about what the list looks like.

A Race Instead of a Fight

  • There are ways of making order without the costs.
  • Consider the difference between boxers and skiers, between fencers and runners. An Olympic boxer must risk concussion O(log n) times, usually from 4 to 6, to make it to the podium; allowing a greater number of athletes into the games would imperil the health of all. But a skeleton racer or ski jumper or halfpipe specialist needs to make only a constant number of gambles with gravity, no matter the size of the field. A fencer puts herself at her opponent’s mercy O(log n) times, but a marathoner must endure only one race.
  • This move from “ordinal” numbers (which only express rank) to “cardinal” ones (which directly assign a measure to something’s caliber) naturally orders a set without requiring pairwise comparisons. Accordingly, it makes possible dominance hierarchies that don’t require direct head-to-head matchups.
  • The Fortune 500 list, to the extent that it creates a kind of corporate hierarchy, is one of these. To find the most valuable company in the United States, analysts don’t need to perform due diligence comparing Microsoft to General Motors, then General Motors to Chevron, Chevron to Walmart, and so forth. These seemingly apples-to-oranges contests (how many enterprise software installations equal how many oil futures?) become apples-to-apples in the medium of dollars. Having a benchmark—any benchmark—solves the computational problem of scaling up a sort.
  • Operating at industrial scale, with many thousands or millions of individuals sharing the same space, requires a leap beyond. A leap from ordinal to cardinal.
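
In code, a benchmark turns ranking into a single key-based sort, with no pairwise matchups at all (the dollar figures below are hypothetical):

```python
# Hypothetical market values, in billions of dollars.
values = {"Microsoft": 3100, "Walmart": 690, "Chevron": 270, "General Motors": 60}
print(sorted(values, key=values.get, reverse=True))
# ['Microsoft', 'Walmart', 'Chevron', 'General Motors']
```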

4. Caching: Forget About It

In the practical use of our intellect, forgetting is as important a function as remembering. —WILLIAM JAMES
  • You have a problem. Your closet is overflowing, spilling shoes, shirts, and underwear onto the floor. You think, “It’s time to get organized.” Now you have two problems. Specifically, you first need to decide what to keep, and second, how to arrange it.
  • Operating systems encourage us to put our files into folders, like with like, forming hierarchies that branch as their contents become ever more specific. But just as the tidiness of a scholar’s desk may hide the messiness of their mind, so does the apparent tidiness of a computer’s file system obscure the highly engineered chaos of how data is actually being stored underneath the nested-folder veneer. What’s really happening is called caching.
  • Starting roughly around 2008, anyone in the market for a new computer has encountered a particular conundrum when choosing their storage option. They must make a tradeoff between size and speed.
  • A hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible.
  • The basic idea behind a memory hierarchy should be intuitive to anyone who has ever used a library. If you are researching a topic for a paper, let’s say, there are some books you might need to refer to on multiple occasions. Rather than go back to the library each time, you of course check out the relevant books and take them home to your desk, where you can access them more easily.
  • Wilkes’s proposal was implemented in the IBM 360/85 supercomputer later in the 1960s, where it acquired the name of the “cache.” Since then, caches have appeared everywhere in computer science. The idea of keeping around pieces of information that you refer to frequently is so powerful that it is used in every aspect of computation. Processors have caches. Hard drives have caches. Operating systems have caches. Web browsers have caches. And the servers that deliver content to those browsers also have caches.
  • The “Moore’s Law” prediction, made by Intel’s Gordon Moore in 1975, holds that the number of transistors in CPUs doubles every two years. What hasn’t improved at that rate is the performance of memory, which means that relative to processing time, the cost of accessing memory is also increasing exponentially. The faster you can write your papers, for instance, the greater the loss of productivity from each trip to the library.

Eviction and Clairvoyance

Depend upon it there comes a time when for every addition of knowledge you forget something that you knew before. It is of the highest importance, therefore, not to have useless facts elbowing out the useful ones. —SHERLOCK HOLMES
  • When a cache fills up, you are obviously going to need to make room if you want to store anything else, and in computer science this making of room is called “cache replacement” or “cache eviction.”
  • The optimal cache eviction policy—essentially by definition—is, when the cache is full, to evict whichever item we’ll need again the longest from now.
  • We could just try Random Eviction, adding new data to the cache and overwriting old data at random. One of the startling early results in caching theory is that, while far from perfect, this approach is not half bad. As it happens, just having a cache at all makes a system more efficient, regardless of how you maintain it.
  • Another simple strategy is First-In, First-Out (FIFO), where you evict or overwrite whatever has been sitting in the cache the longest (as in Martha Stewart’s question “How long have I had it?”). A third approach is Least Recently Used (LRU): evicting the item that’s gone the longest untouched (Stewart’s “When was the last time I wore it or used it?”).
  • LRU consistently performed the closest to clairvoyance.
  • LRU teaches us that the next thing we can expect to need is the last one we needed, while the thing we’ll need after that is probably the second-most-recent one. And the last thing we can expect to need is the one we’ve already gone longest without.
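
A minimal LRU cache sketch in Python, using an ordered dictionary to track recency:

```python
from collections import OrderedDict

class LRUCache:
    """Evict the least recently used item whenever the cache overflows."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()  # keeps keys in order of last use

    def get(self, key):
        if key not in self.data:
            return None  # a cache miss
        self.data.move_to_end(key)  # touching an item makes it "recent"
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        elif len(self.data) >= self.capacity:
            self.data.popitem(last=False)  # evict the longest-untouched item
        self.data[key] = value

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # "a" becomes the most recently used
cache.put("c", 3)      # cache full: "b", untouched the longest, is evicted
print(cache.get("b"))  # None
```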

The Cloud at the End of the Street

  • A quarter of all Internet traffic at present is handled by a single corporation, one that manages to stay almost entirely out of the headlines. This Massachusetts-based company is called Akamai, and they’re in the caching business.
  • If you can create a cache of webpage content that is physically, geographically closer to the people who want it, you can serve those pages faster. Much of the traffic on the Internet is now handled by “content distribution networks” (CDNs), which have computers around the world that maintain copies of popular websites. This allows users requesting those pages to get their data from a computer that’s nearby, without having to make the long haul across continents to the original server.
  • The largest of these CDNs is managed by Akamai: content providers pay for their websites to be “Akamaized” for better performance.
  • Recently, Amazon was granted a patent for an innovation that pushes this principle one step further. The patent talks about “anticipatory package shipping,” which the press seized upon as though Amazon could somehow mail you something before you bought it. Amazon, like any technology company, would love to have that kind of Bélády-like clairvoyance—but for the next best thing, it turns to caching. Their patent is actually for shipping items that have been recently popular in a given region to a staging warehouse in that region—like having their own CDN for physical goods. Then, when somebody places an order, the item is just down the street. Anticipating the purchases of individuals is challenging, but when predicting the purchases of a few thousand people, the law of large numbers kicks in.
  • It turns out that people love watching movies set where they live. Washingtonians favor Singles, set in Seattle; Louisianans watch The Big Easy, set in New Orleans; Angelenos unsurprisingly enjoy L.A. Story; Alaskans love Braving Alaska; and Montanans, Montana Sky.

Caching on the Home Front

Caching is such an obvious thing because we do it all the time. I mean, the amount of information I get … certain things I have to keep track of right now, a bunch of things I have on my desk, and then other things are filed away, and then eventually filed away into the university archives system where it takes a whole day to get stuff out of it if I wanted. But we use that technique all the time to try to organize our lives. — John Hennessy
  • When you are deciding what to keep and what to throw away, LRU is potentially a good principle to use—much better than FIFO. You shouldn’t necessarily toss that T-shirt from college if you still wear it every now and then. But the plaid pants you haven’t worn in ages? Those can be somebody else’s thrift-store bonanza.
  • Exploit geography. Make sure things are in whatever cache is closest to the place where they’re typically used.
  • Having a cache is efficient, but having multiple levels of caches—from smallest and fastest to largest and slowest—can be even better. Where your belongings are concerned, your closet is one cache level, your basement another, and a self-storage locker a third. (These are in decreasing order of access speed, of course, so you should use the LRU principle as the basis for deciding what gets evicted from each level to the next.) But you might also be able to speed things up by adding yet another level of caching: an even smaller, faster, closer one than your closet.

Filing and Piling

  • We’ve talked about what goes in the closet and where the closet should be, but how should things be arranged inside?
  • The left-side insertion rule has to be followed for old files as well as new ones: every time you pull out a file to use its contents, you must put it back as the leftmost file when you return it to the box. And when you search for a file, you always start from the left-hand side as well. The most recently accessed files are thus the fastest to find.
  • When you pull an item out, where should you put it back to make searching as efficient as possible?
  • It’s the very nature of piles that you search them from top to bottom, and that each time you pull out a document it goes back not where you found it, but on top.
  • In short, the mathematics of self-organizing lists suggests something radical: the big pile of papers on your desk, far from being a guilt-inducing fester of chaos, is actually one of the most well-designed and efficient structures available. What might appear to others to be an unorganized mess is, in fact, a self-organizing mess. Tossing things back on the top of the pile is the very best you can do, shy of knowing the future. Leaving something unsorted is more efficient than taking the time to sort everything; here, however, there’s a very different reason why you don’t need to organize it. You already have.
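
A sketch of the pile’s discipline, known in the self-organizing-list literature as Move-to-Front (the pile’s contents are hypothetical):

```python
def fetch(pile, target):
    """Search the pile from the top down; put the found document back on
    top, so recently used documents stay fastest to find."""
    for depth, doc in enumerate(pile):
        if doc == target:
            pile.pop(depth)      # pull the document out ...
            pile.insert(0, doc)  # ... and toss it back on top
            return doc
    raise KeyError(target)

pile = ["taxes", "insurance", "receipts"]
fetch(pile, "receipts")
print(pile)  # ['receipts', 'taxes', 'insurance']
```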

The Forgetting Curve

  • Practicing a list multiple times makes it persist longer in memory, and the number of items one can accurately recall goes down as time passes.
  • Human memory and human environments. The left panel shows the percentage of nonsense syllables Ebbinghaus correctly recalled from a list, as a function of the number of hours he waited after first memorizing the list. The right panel shows the chance that a word appears in the headlines of the New York Times on a given day, as a function of the time since its previous appearance in print.
  • The mind has essentially infinite capacity for memories, but we have only a finite amount of time in which to search for them.
  • You can fit as many items as you want on that shelf, but the closer something is to the front the faster it will be to find.
  • The key to a good human memory then becomes the same as the key to a good computer cache: predicting which items are most likely to be wanted in the future.
  • In putting the emphasis on time, caching shows us that memory involves unavoidable tradeoffs, and a certain zero-sumness. You can’t have every library book at your desk, every product on display at the front of the store, every headline above the fold, every paper at the top of the pile. And in the same way, you can’t have every fact or face or name at the front of your mind

The Tyranny of Experience

  • Size alone is enough to impair speed:
When you make something bigger, it’s inherently slower, right? If you make a city bigger, it takes longer to get from point A to point B. If you make a library bigger, it takes longer to find a book in the library. If you have a stack of papers on your desk that’s bigger, it takes longer to find the paper you’re looking for, right? Caches are actually a solution to that problem.… For example, right now, if you go to buy a processor, what you’ll get is a Level 1 cache and a Level 2 cache on the chip. The reason that there are—even just on the chip there are two caches!—is that in order to keep up with the cycle rate of the processor, the first-level cache is limited in size.
  • Unavoidably, the larger a memory is, the more time it takes to search for and extract a piece of information from it.
  • If the fundamental challenge of memory really is one of organization rather than storage, perhaps it should change how we think about the impact of aging on our mental abilities.
  • What we call “cognitive decline”—lags and retrieval errors—may not be about the search process slowing or deteriorating, but (at least partly) an unavoidable consequence of the amount of information we have to navigate getting bigger and bigger. Regardless of whatever other challenges aging brings, older brains—which must manage a greater store of memories—are literally solving harder computational problems with every passing day.
  • Through a series of simulations, the researchers showed that simply knowing more makes things harder when it comes to recognizing words, names, and even letters. No matter how good your organization scheme is, having to search through more things will inevitably take longer. It’s not that we’re forgetting; it’s that we’re remembering. We’re becoming archives.
  • Caching gives us the language to understand what’s happening. We say “brain fart” when we should really say “cache miss.” The disproportionate occasional lags in information retrieval are a reminder of just how much we benefit the rest of the time by having what we need at the front of our minds.

5. Scheduling: First Things First

  • What to do, and when, and in what order?
  • Though we always manage to find some way to order the things we do in our days, as a rule we don’t consider ourselves particularly good at it—hence the perennial bestseller status of time-management guides. Unfortunately, the guidance we find in them is frequently divergent and inconsistent. Getting Things Done advocates a policy of immediately doing any task of two minutes or less as soon as it comes to mind. Rival bestseller Eat That Frog! advises beginning with the most difficult task and moving toward easier and easier things. The Now Habit suggests first scheduling one’s social engagements and leisure time and then filling the gaps with work—rather than the other way around, as we so often do. William James, the “father of American psychology,” asserts that “there’s nothing so fatiguing as the eternal hanging on of an uncompleted task,” but Frank Partnoy, in Wait, makes the case for deliberately not doing things right away.
  • Every guru has a different system, and it’s hard to know who to listen to.

Spending Time Becomes a Science

  • Taylor created a planning office, at the heart of which was a bulletin board displaying the shop’s schedule for all to see. The board depicted every machine in the shop, showing the task currently being carried out by that machine and all the tasks waiting for it. This practice would be built upon by Taylor’s colleague Henry Gantt, who in the 1910s developed the Gantt charts that would help organize many of the twentieth century’s most ambitious construction projects, from the Hoover Dam to the Interstate Highway System. A century later, Gantt charts still adorn the walls and screens of project managers at firms like Amazon, IKEA, and SpaceX.
  • You should begin by finding the single step that takes the least amount of time—the load that will wash or dry the quickest. If that shortest step involves the washer, plan to do that load first. If it involves the dryer, plan to do it last. Repeat this process for the remaining loads, working from the two ends of the schedule toward the middle. By having the shortest washing times at the start, and the shortest drying times at the end, you maximize the amount of overlap—when the washer and dryer are running simultaneously. Thus you can keep the total amount of time spent doing laundry to the absolute minimum.
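
Here is a sketch of that washer-dryer procedure in Python (scheduling theory credits it to Selmer Johnson, hence “Johnson’s rule”); the load times are hypothetical:

```python
def johnsons_rule(loads):
    """Order laundry loads to maximize washer/dryer overlap.

    loads: list of (wash_time, dry_time) pairs.
    """
    front, back, remaining = [], [], list(loads)
    while remaining:
        # Find the single shortest step among all remaining loads.
        load = min(remaining, key=min)
        remaining.remove(load)
        if load[0] <= load[1]:
            front.append(load)    # shortest step is a wash: schedule early
        else:
            back.insert(0, load)  # shortest step is a dry: schedule late
    return front + back

# Hypothetical loads as (wash_minutes, dry_minutes).
print(johnsons_rule([(30, 60), (20, 10), (45, 45)]))
# [(30, 60), (45, 45), (20, 10)]
```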

Handling Deadlines

  • This is a sufficiently fundamental and counterintuitive point that it’s worth repeating. If you have only a single machine, and you’re going to do all of your tasks, then any ordering of the tasks will take you the same amount of time.
  • Thus we encounter the first lesson in single-machine scheduling literally before we even begin: make your goals explicit. We can’t declare some schedule a winner until we know how we’re keeping score. This is something of a theme in computer science: before you can have a plan, you must first choose a metric. And as it turns out, which metric we pick here will directly affect which scheduling approaches fare best.
  • If you’re concerned with minimizing maximum lateness, then the best strategy is to start with the task due soonest and work your way toward the task due last. This strategy is known as Earliest Due Date.
  • For instance, consider the refrigerator. If you’re one of the many people who have a community-supported agriculture (CSA) subscription, then every week or two you’ve got a lot of fresh produce coming to your doorstep all at once. Each piece of produce is set to spoil on a different date—so eating them by Earliest Due Date, in order of their spoilage schedule, seems like a reasonable starting point. It’s not, however, the end of the story. Earliest Due Date is optimal for reducing maximum lateness, which means it will minimize the rottenness of the single most rotten thing you’ll have to eat; that may not be the most appetizing metric to eat by.
  • Maybe instead we want to minimize the number of foods that spoil. Here a strategy called Moore’s Algorithm gives us our best plan. Moore’s Algorithm says that we start out just like with Earliest Due Date—by scheduling out our produce in order of spoilage date, earliest first, one item at a time. However, as soon as it looks like we won’t get to eating the next item in time, we pause, look back over the meals we’ve already planned, and throw out the biggest item (that is, the one that would take the most days to consume).
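
A sketch of Moore’s Algorithm applied to the fridge, with made-up produce (durations are days of eating, deadlines are the day each item spoils):

```python
import heapq

def moores_algorithm(items):
    """Minimize the number of items that spoil.

    items: list of (name, days_to_eat, spoils_on_day). Schedule by
    Earliest Due Date; when a deadline would be missed, throw out the
    biggest item scheduled so far.
    """
    items = sorted(items, key=lambda it: it[2])  # Earliest Due Date first
    scheduled, tossed, elapsed = [], [], 0
    for name, days, due in items:
        heapq.heappush(scheduled, (-days, name))  # max-heap by duration
        elapsed += days
        if elapsed > due:  # we'd miss this deadline ...
            neg_days, big = heapq.heappop(scheduled)
            elapsed += neg_days  # ... so drop the biggest item so far
            tossed.append(big)
    return [name for _, name in scheduled], tossed  # kept (unordered), tossed

kept, tossed = moores_algorithm([("berries", 1, 2), ("melon", 3, 3), ("greens", 2, 5)])
print(tossed)  # ['melon']: sacrificing it lets everything else get eaten in time
```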

Getting Things Done

Do the difficult things while they are easy and do the great things while they are small. —LAO TZU
  • Sometimes due dates aren’t our primary concern and we just want to get stuff done: as much stuff, as quickly as possible. It turns out that translating this seemingly simple desire into an explicit scheduling metric is harder than it sounds.
  • Imagine starting on Monday morning with a four-day project and a one-day project on your agenda. If you deliver the bigger project on Thursday afternoon (4 days elapsed) and then the small one on Friday afternoon (5 days elapsed), the clients will have waited a total of 4 + 5 = 9 days. If you reverse the order, however, you can finish the small project on Monday and the big one on Friday, with the clients waiting a total of only 1 + 5 = 6 days. It’s a full workweek for you either way, but now you’ve saved your clients three days of their combined time. Scheduling theorists call this metric the “sum of completion times.”
  • Minimizing the sum of completion times leads to a very simple optimal algorithm called Shortest Processing Time: always do the quickest task you can.
  • Even if you don’t have impatient clients hanging on every job, Shortest Processing Time gets things done. (Perhaps it’s no surprise that it is compatible with the recommendation in Getting Things Done to immediately perform any task that takes less than two minutes.) Again, there’s no way to change the total amount of time your work will take you, but Shortest Processing Time may ease your mind by shrinking the number of outstanding tasks as quickly as possible.
  • Not all unfinished business is created equal. Putting out an actual fire in the kitchen should probably be done before “putting out a fire” with a quick email to a client, even if the former takes a bit longer. In scheduling, this difference of importance is captured in a variable known as weight. When you’re going through your to-do list, this weight might feel literal—the burden you get off your shoulders by finishing each task.
  • The optimal strategy for that goal is a simple modification of Shortest Processing Time: divide the weight of each task by how long it will take to finish, and then work in order from the highest resulting importance-per-unit-time (call it “density” if you like, to continue the weight metaphor) to the lowest.
  • Only prioritize a task that takes twice as long if it’s twice as important.
  • The notion of dividing reward by duration translates, therefore, to assigning each task an hourly rate. (If you’re a consultant or freelancer, that might in effect already be done for you: simply divide each project’s fee by its size, and work your way from the highest hourly rate to the lowest.)
  • When applied to debts rather than incomes, the same principle yields a strategy for getting in the black that’s come to be called the “debt avalanche.” This debt-reduction strategy says to ignore the number and size of your debts entirely, and simply funnel your money toward the debt with the single highest interest rate. This corresponds rather neatly to working through jobs in order of importance per unit time. And it’s the strategy that will reduce the total burden of your debt as quickly as possible.
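
The weighted rule in code, with hypothetical tasks; the same one-liner, fed interest rates as weights, produces the debt avalanche:

```python
def weighted_spt(tasks):
    """Order tasks by importance per unit time (weight divided by duration)."""
    return sorted(tasks, key=lambda t: t[1] / t[2], reverse=True)

# Hypothetical (name, importance, hours) triples.
tasks = [("report", 8, 4), ("client email", 1, 0.1), ("cleanup", 2, 3)]
print([name for name, _, _ in weighted_spt(tasks)])
# ['client email', 'report', 'cleanup']
```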

Picking Our Problems

  • Computer science can offer us optimal algorithms for various metrics available in single-machine scheduling, but choosing the metric we want to follow is up to us. In many cases, we get to decide what problem we want to be solving.
  • This offers a radical way to rethink procrastination, the classic pathology of time management. We typically think of it as a faulty algorithm. What if it’s exactly the opposite? What if it’s an optimal solution to the wrong problem?
  • We typically associate procrastination with laziness or avoidance behavior, but it can just as easily spring up in people who are trying earnestly and enthusiastically to get things done as quickly as possible
  • In a 2014 study led by Penn State’s David Rosenbaum, for example, participants were asked to bring one of two heavy buckets to the opposite end of a hallway. One of the buckets was right next to the participant; the other was partway down the hall. To the experimenters’ surprise, people immediately picked up the bucket near them and lugged it all the way down—passing the other bucket on the way, which they could have carried a fraction of the distance. As the researchers write, “this seemingly irrational choice reflected a tendency to pre-crastinate, a term we introduce to refer to the hastening of subgoal completion, even at the expense of extra physical effort.” Putting off work on a major project by attending instead to various trivial matters can likewise be seen as “the hastening of subgoal completion”—which is another way of saying that procrastinators are acting (optimally!) to reduce as quickly as possible the number of outstanding tasks on their minds. It’s not that they have a bad strategy for getting things done; they have a great strategy for the wrong metric.
  • A modern smartphone user, for instance, is accustomed to seeing “badges” hovering over application icons, ominous numbers in white-on-red signaling exactly how many tasks each particular app expects us to complete. If it’s an email inbox blaring the figure of unread messages, then all messages are implicitly being given equal weight. Can we be blamed, then, for applying the unweighted Shortest Processing Time algorithm to the problem—dealing with all of the easiest emails first and deferring the hardest ones till last—to lower this numeral as quickly as possible?
  • Live by the metric, die by the metric. If all tasks are indeed of equal weight, then that’s exactly what we should be doing. But if we don’t want to become slaves to minutiae, then we need to take measures toward that end. This starts with making sure that the single-machine problem we’re solving is the one we want to be solving.
  • Staying focused not just on getting things done but on getting weighty things done—doing the most important work you can at every moment—sounds like a surefire cure for procrastination.

Priority Inversion and Precedence Constraints

  • It was the summer of 1997, and humanity had a lot to celebrate. For the first time ever, a rover was navigating the surface of Mars. The $150 million Mars Pathfinder spacecraft had accelerated to a speed of 16,000 miles per hour, traveled across 309 million miles of empty space, and landed with space-grade airbags onto the rocky red Martian surface. And now it was procrastinating.
  • Pathfinder’s highest priority task (to move data into and out of its “information bus”) was mysteriously being neglected as the robot whiled away its time on tasks of middling importance. What was going on? Didn’t the robot know any better?
  • The culprit was a classic scheduling hazard called priority inversion. What happens in a priority inversion is that a low-priority task takes possession of a system resource (access to a database, let’s say) to do some work, but is then interrupted partway through that work by a timer, which pauses it and invokes the system scheduler. The scheduler tees up a high-priority task, but it can’t run because the database is occupied. And so the scheduler moves down the priority list, running various unblocked medium-priority tasks instead—rather than the high-priority one (which is blocked), or the low-priority one that’s blocking it (which is stuck in line behind all the medium-priority work).
  • What was the solution they sent flying across the solar system? Priority inheritance. If a low-priority task is found to be blocking a high-priority resource, well, then all of a sudden that low-priority task should momentarily become the highest-priority thing on the system, “inheriting” the priority of the thing it’s blocking (a sketch of the idea follows this list).
  • The comedian Mitch Hedberg recounts a time when “I was at a casino, I was minding my own business, and this guy came up and said, ‘You’re gonna have to move, you’re blocking the fire exit.’ As though if there was a fire, I wasn’t gonna run.” The bouncer’s argument was priority inversion; Hedberg’s rebuttal was priority inheritance.
  • “Things which matter most must never be at the mercy of things which matter least,” Goethe allegedly proclaimed; but while that has the ring of wisdom about it, sometimes it’s just not true. Sometimes that which matters most cannot be done until that which matters least is finished, so there’s no choice but to treat that unimportant thing as being every bit as important as whatever it’s blocking.
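
A toy sketch of priority inheritance; this is just the core idea, nothing like the actual fix uploaded to Pathfinder:

```python
class Task:
    def __init__(self, name, priority):
        self.name = name
        self.priority = self.base_priority = priority

class Resource:
    """A lockable resource whose holder inherits its blockers' priority."""

    def __init__(self):
        self.holder = None

    def acquire(self, task):
        if self.holder is None:
            self.holder = task
            return True
        # Priority inheritance: boost the holder so the scheduler runs it
        # ahead of medium-priority tasks, letting it finish and release.
        self.holder.priority = max(self.holder.priority, task.priority)
        return False

    def release(self):
        self.holder.priority = self.holder.base_priority  # restore
        self.holder = None

bus = Resource()
low, high = Task("weather logging", 1), Task("information bus", 9)
bus.acquire(low)
bus.acquire(high)    # blocked: "weather logging" inherits priority 9
print(low.priority)  # 9, so it now outranks all the medium-priority work
```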

The Speed Bump

  • The Shortest Processing Time algorithm, as we saw, is the optimal policy if you want to cross off as many items as quickly as possible from your to-do list. But if some of your tasks have precedence constraints, there isn’t a simple or obvious tweak to Shortest Processing Time to adjust for that. This problem belongs to a class that most computer scientists believe has no efficient solution—it’s what the field calls “intractable.” Scheduling theory’s first speed bump turned out to be a brick wall.
  • For example, Moore’s Algorithm minimizes the number of late tasks (or rotten fruits) when they’re all of equal value—but if some are more important than others, the problem becomes intractable and no algorithm can readily provide the optimal schedule. Likewise, having to wait until a certain time to start some of your tasks makes nearly all of the scheduling problems for which we otherwise have efficient solutions into intractable problems.
  • Most scheduling problems admit no ready solution. A recent survey showed that the status of about 7% of all scheduling problems is still unknown, scheduling’s terra incognita. Of the 93% of the problems that we do understand, however, the news isn’t great: only 9% can be solved efficiently, and the other 84% have been proven intractable.

Drop Everything: Preemption and Uncertainty

  • There is one twist that can make it easier: being able to stop one task partway through and switch to another. This property, “preemption,” turns out to change the game dramatically.
  • When a task’s starting time comes, compare that task to the one currently under way. If you’re working by Earliest Due Date and the new task is due even sooner than the current one, switch gears; otherwise stay the course. Likewise, if you’re working by Shortest Processing Time, and the new task can be finished faster than the current one, pause to take care of it first; otherwise, continue with what you were doing.
  • It turns out, though, that even if you don’t know when tasks will begin, Earliest Due Date and Shortest Processing Time are still optimal strategies, able to guarantee you (on average) the best possible performance in the face of uncertainty.
  • A simple prescription for time management: each time a new piece of work comes in, divide its importance by the amount of time it will take to complete. If that figure is higher than for the task you’re currently doing, switch to the new one; otherwise stick with the current task. This algorithm is the closest thing that scheduling theory has to a skeleton key or Swiss Army knife, the optimal strategy not just for one flavor of problem but for many (a sketch follows below).
  • Even with complete foreknowledge, finding the perfect schedule might be practically impossible. In contrast, thinking on your feet and reacting as jobs come in won’t give you as perfect a schedule as if you’d seen into the future—but the best you can do is much easier to compute.
  • When the future is foggy, it turns out you don’t need a calendar—just a to-do list.
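
The arrival-time test itself is a single comparison; a sketch with hypothetical tasks:

```python
def should_preempt(current, incoming):
    """Switch tasks only if the newcomer's importance per unit time
    beats that of the task already under way."""
    return incoming["weight"] / incoming["hours"] > current["weight"] / current["hours"]

report = {"weight": 8, "hours": 4}
email = {"weight": 1, "hours": 0.1}
print(should_preempt(report, email))  # True: 10 > 2, handle the email first
```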

Preemption Isn’t Free: The Context Switch

The hurrieder I go / The behinder I get —NEEDLEPOINT SEEN IN BOONVILLE, CA
Programmers don’t talk because they must not be interrupted.… To synchronize with other people (or their representation in telephones, buzzers and doorbells) can only mean interrupting the thought train. Interruptions mean certain bugs. You must not get off the train. —ELLEN ULLMAN
  • The machine that is doing the scheduling and the machine being scheduled are one and the same. Which makes straightening out your to-do list an item on your to-do list—needing, itself, to get prioritized and scheduled.
  • Preemption isn’t free. Every time you switch tasks, you pay a price, known in computer science as a context switch. When a computer processor shifts its attention away from a given program, there’s always a certain amount of necessary overhead. It needs to effectively bookmark its place and put aside all of its information related to that program. Then it needs to figure out which program to run next. Finally it must haul out all the relevant information for that program, find its place in the code, and get in gear.
  • None of this switching back and forth is “real work”—that is, none of it actually advances the state of any of the various programs the computer is switching between. It’s metawork. Every context switch is wasted time.
  • Anyone you interrupt more than a few times an hour is in danger of doing no work at all.
  • We have found that both programming and writing require keeping in mind the state of the entire system, and thus carry inordinately large context-switching costs. A friend of ours who writes software says that the normal workweek isn’t well suited to his workflow, since for him sixteen-hour days are more than twice as productive as eight-hour days. Brian, for his part, thinks of writing as a kind of blacksmithing, where it takes a while just to heat up the metal before it’s malleable. He finds it somewhat useless to block out anything less than ninety minutes for writing, as nothing much happens in the first half hour except loading a giant block of “Now, where was I?” into his head.

Thrashing

  • Computers multitask through a process called “threading,” which you can think of as being like juggling a set of balls. Just as a juggler only hurls one ball at a time into the air but keeps three aloft, a CPU only works on one program at a time, but by swapping between them quickly enough (on the scale of ten-thousandths of a second) it appears to be playing a movie, navigating the web, and alerting you to incoming email all at once.
  • Think again about our image of a juggler. With one ball in the air, there’s enough spare time while that ball is aloft for the juggler to toss some others upward as well. But what if the juggler takes on one more ball than he can handle? He doesn’t drop that ball; he drops everything. The whole system, quite literally, goes down.
  • The whole idea of caches is to keep the “working set” of needed items available for quick access. One way this is done is by keeping the information the computer is currently using in fast memory rather than on the slow hard disk. But if a task requires keeping track of so many things that they won’t all fit into memory, then you might well end up spending more time swapping information in and out of memory than doing the actual work. What’s more, when you switch tasks, the newly active task might make space for its working set by evicting portions of other working sets from memory. The next task, upon reactivation, would then reacquire parts of its working set from the hard disk and muscle them back into memory, again displacing others.
  • At the extreme, a program may run just long enough to swap its needed items into memory, before giving way to another program that runs just long enough to overwrite them in turn.
  • This is thrashing: a system running full-tilt and accomplishing nothing at all.
  • Thrashing is a very recognizable human state. If you’ve ever had a moment where you wanted to stop doing everything just to have the chance to write down everything you were supposed to be doing, but couldn’t spare the time, you’ve thrashed. And the cause is much the same for people as for computers: each task is a draw on our limited cognitive resources. When merely remembering everything we need to be doing occupies our full attention—or prioritizing every task consumes all the time we had to do them—or our train of thought is continually interrupted before those thoughts can translate to action—it feels like panic, like paralysis by way of hyperactivity.
  • An ounce of prevention is worth a pound of cure. The easiest thing to do is simply to get more memory: enough RAM, for instance, to fit the working sets of all the running programs into memory at once and reduce the time taken by a context switch
  • Another way to avert thrashing before it starts is to learn the art of saying no. A system can simply refuse to add a program to its workload if there isn’t enough free memory to hold its working set.
  • In these cases there’s clearly no way to work any harder, but you can work … dumber. Along with considerations of memory, one of the biggest sources of metawork in switching contexts is the very act of choosing what to do next. This, too, can at times swamp the actual doing of the work.
  • Faced with, say, an overflowing inbox of n messages, we know from sorting theory that repeatedly scanning it for the most important one to answer next will take O(n²) operations—n scans of n messages apiece. This means that waking up to an inbox that’s three times as full as usual could take you nine times as long to process. What’s more, scanning through those emails means swapping every message into your mind, one after another, before you respond to any of them: a surefire recipe for memory thrashing.
  • In a thrashing state, you’re making essentially no progress, so even doing tasks in the wrong order is better than doing nothing at all. Instead of answering the most important emails first—which requires an assessment of the whole picture that may take longer than the work itself—maybe you should sidestep that quadratic-time quicksand by just answering the emails in random order, or in whatever order they happen to appear on-screen.

Interrupt Coalescing

  • Part of what makes real-time scheduling so complex and interesting is that it is fundamentally a negotiation between two principles that aren’t fully compatible. These two principles are called responsiveness and throughput: how quickly you can respond to things, and how much you can get done overall.
  • Operating system schedulers typically define a “period” in which every program is guaranteed to run at least a little bit, with the system giving a “slice” of that period to each program. The more programs are running, the smaller those slices become, and the more context switches are happening every period, maintaining responsiveness at the cost of throughput. Left unchecked, however, this policy of guaranteeing each process at least some attention every period could lead to catastrophe. With enough programs running, a task’s slice would shrink to the point that the system was spending the entire slice on context switching, before immediately context-switching again to the next task.
  • The culprit is the hard responsiveness guarantee. So modern operating systems in fact set a minimum length for their slices and will refuse to subdivide the period any more finely.
  • Establishing a minimum amount of time to spend on any one task helps to prevent a commitment to responsiveness from obliterating throughput entirely: if the minimum slice is longer than the time it takes to context-switch, then the system can never get into a state where context switching is the only thing it’s doing. It’s also a principle that is easy to translate into a recommendation for human lives. Methods such as “timeboxing” or “pomodoros,” where you literally set a kitchen timer and commit to doing a single task until it runs out, are one embodiment of this idea.
  • But what slice size should you aim for? Faced with the question of how long to wait between intervals of performing a recurring task, like checking your email, the answer from the perspective of throughput is simple: as long as possible.
  • For your computer, the annoying interruption that it has to check on regularly isn’t email—it’s you. You might not move the mouse for minutes or hours, but when you do, you expect to see the pointer on the screen move immediately, which means the machine expends a lot of effort simply checking in on you. The more frequently it checks on the mouse and keyboard, the quicker it can react when there is input, but the more context switches it has to do. So the rule that computer operating systems follow when deciding how long they can afford to dedicate themselves to some task is simple: as long as possible without seeming jittery or slow to the user.
  • When our machines context-switch into a computation, they must literally return to us before we notice they’re gone. To find this balancing point, operating systems programmers have turned to psychology, mining papers in psychophysics for the exact number of milliseconds of delay it takes for a human brain to register lag or flicker. There is no point in attending to the user any more often than that.
  • Thanks to these efforts, when operating systems are working right you don’t even notice how hard your computer is exerting itself. You continue to be able to move your mouse around the screen fluidly even when your processor is hauling full-tilt. The fluidity is costing you some throughput, but that’s a design tradeoff that has been explicitly made by the system engineers: your system spends as much time as it possibly can away from interacting with you, then gets around to redrawing the mouse just in time.
  • You should try to stay on a single task as long as possible without decreasing your responsiveness below the minimum acceptable limit. Decide how responsive you need to be—and then, if you want to get things done, be no more responsive than that.
  • If you find yourself doing a lot of context switching because you’re tackling a heterogeneous collection of short tasks, you can also employ another idea from computer science: “interrupt coalescing.” If you have five credit card bills, for instance, don’t pay them as they arrive; take care of them all in one go when the fifth bill comes. Likewise, if none of your email correspondents require you to respond in less than twenty-four hours, you can limit yourself to checking your messages once a day.
  • Whatever their drawbacks, regularly scheduled meetings are one of our best defenses against the spontaneous interruption and the unplanned context switch.
  • Perhaps the patron saint of the minimal-context-switching lifestyle is the legendary programmer Donald Knuth. “I do one thing at a time,” he says. “This is what computer scientists call batch processing—the alternative is swapping in and out. I don’t swap in and out.” Knuth isn’t kidding. On January 1, 2014, he embarked on “The TeX Tuneup of 2014,” in which he fixed all of the bugs that had been reported in his TeX typesetting software over the previous six years. His report ends with the cheery sign-off “Stay tuned for The TeX Tuneup of 2021!” Likewise, Knuth has not had an email address since 1990. “Email is a wonderful thing for people whose role in life is to be on top of things. But not for me; my role is to be on the bottom of things. What I do takes long hours of studying and uninterruptible concentration.” He reviews all his postal mail every three months, and all his faxes every six.
  • Our beeping and buzzing devices have “Do Not Disturb” modes, which we could manually toggle on and off throughout the day, but that is too blunt an instrument. Instead, we might agitate for settings that would provide an explicit option for interrupt coalescing—the same thing at a human timescale that the devices are doing internally. Alert me only once every ten minutes, say; then tell me everything.
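
A sketch of what such a setting might look like, batching alerts into one delivery per window (the class and its ten-minute default are my inventions):

```python
import time

class CoalescedAlerts:
    """Buffer interruptions and deliver them in periodic batches."""

    def __init__(self, window_seconds=600):
        self.window = window_seconds
        self.pending = []
        self.last_flush = time.monotonic()

    def notify(self, message):
        self.pending.append(message)  # don't interrupt; just queue it

    def poll(self):
        """Returns the whole batch at most once per window, else nothing."""
        if self.pending and time.monotonic() - self.last_flush >= self.window:
            batch, self.pending = self.pending, []
            self.last_flush = time.monotonic()
            return batch  # "then tell me everything"
        return []
```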

6. Bayes’s Rule: Predicting the Future

  • Our days are full of “small data.” In fact, like Gott standing at the Berlin Wall, we often have to make an inference from the smallest amount of data we could possibly have: a single observation.
  • So how do we do it? And how should we?

Reasoning Backward with the Reverend Bayes

  • Bayes’s critical insight was that trying to use the winning and losing tickets we see to figure out the overall ticket pool that they came from is essentially reasoning backward. And to do that, he argued, we need to first reason forward from hypotheticals. In other words, we need to first determine how probable it is that we would have drawn the tickets we did if various scenarios were true. This probability—known to modern statisticians as the “likelihood”—gives us the information we need to solve the problem.
  • For instance, imagine we bought three tickets and all three were winners. Now, if the raffle was of the particularly generous sort where all the tickets are winners, then our three-for-three experience would of course happen all of the time; it has a 100% chance in that scenario. If, instead, only half of the raffle’s tickets were winners, our three-for-three experience would happen 1⁄2 × 1⁄2 × 1⁄2 of the time, or in other words 1⁄8 of the time. And if the raffle rewarded only one ticket in a thousand, our outcome would have been incredibly unlikely: 1⁄1,000 × 1⁄1,000 × 1⁄1,000, or one in a billion.
  • All things being equal, we should imagine it to be exactly eight times likelier that all the tickets are winners than that half of them are—because the tickets we drew are exactly eight times likelier (100% versus one-in-eight) in that scenario. Likewise, it’s exactly 125 million times likelier that half the raffle tickets are winners than that there’s only one winning ticket per thousand, which we know by comparing one-in-eight to one-in-a-billion.
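
The arithmetic, spelled out:

```python
# Likelihood of drawing three straight winners under each hypothetical
# raffle, and the head-to-head ratios the reasoning relies on.
scenarios = {"all winners": 1.0, "half winners": 0.5, "1 in 1,000": 0.001}
likelihoods = {name: p ** 3 for name, p in scenarios.items()}
print(likelihoods["all winners"] / likelihoods["half winners"])  # 8.0
print(likelihoods["half winners"] / likelihoods["1 in 1,000"])   # ~125,000,000
```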

Laplace’s Law

  • In fact, for any possible drawing of w winning tickets in n attempts, the expectation is simply the number of wins plus one, divided by the number of attempts plus two: (w+1)⁄(n+2).
  • This incredibly simple scheme for estimating probabilities is known as Laplace’s Law, and it is easy to apply in any situation where you need to assess the chances of an event based on its history. If you make ten attempts at something and five of them succeed, Laplace’s Law estimates your overall chances to be 6/12 or 50%, consistent with our intuitions. If you try only once and it works out, Laplace’s estimate of 2/3 is both more reasonable than assuming you’ll win every time, and more actionable than Price’s guidance (which would tell us that there is a 75% metaprobability of a 50% or greater chance of success)
  • Even when we’ve made only a few observations—or only one—it offers us practical guidance. Want to calculate the chance your bus is late? The chance your softball team will win? Count the number of times it has happened in the past plus one, then divide by the number of opportunities plus two.
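
Laplace’s Law in code:

```python
def laplace(wins, attempts):
    """Estimate an event's chances from its history: (w + 1) / (n + 2)."""
    return (wins + 1) / (attempts + 2)

print(laplace(5, 10))  # 0.5    five successes in ten tries
print(laplace(1, 1))   # 0.666… one try, one success
print(laplace(0, 0))   # 0.5    no data at all: call it fifty-fifty
```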

Bayes’s Rule and Prior Beliefs

  • To make things concrete, let’s say a friend shows you two different coins. One is a normal, “fair” coin with a 50–50 chance of heads and tails; the other is a two-headed coin. He drops them into a bag and then pulls one out at random. He flips it once: heads. Which coin do you think your friend flipped?
  • Bayes’s scheme of working backward makes short work of this question. A flip coming up heads happens 50% of the time with a fair coin and 100% of the time with a two-headed coin. Thus we can assert confidently that it’s 100% versus 50%, or exactly twice as probable, that the friend had pulled out the two-headed coin.
  • Now consider the following twist. This time, the friend shows you nine fair coins and one two-headed coin, puts all ten into a bag, draws one at random, and flips it: heads. Now what do you suppose? Is it a fair coin or the two-headed one?
  • Laplace’s work anticipated this wrinkle, and here again the answer is impressively simple. As before, a fair coin is exactly half as likely to come up heads as a two-headed coin. But now, a fair coin is also nine times as likely to have been drawn in the first place. It turns out that we can just take these two different considerations and multiply them together: it is exactly four and a half times more likely that your friend is holding a fair coin than the two-headed one.
  • The mathematical formula that describes this relationship, tying together our previously held ideas and the evidence before our eyes, has come to be known—ironically, as the real heavy lifting was done by Laplace—as Bayes’s Rule. And it gives a remarkably straightforward solution to the problem of how to combine preexisting beliefs with observed evidence: multiply their probabilities together.
  • Bayes’s Rule always needs some prior from you, even if it’s only a guess. How many two-headed coins exist? How easy are they to get? How much of a trickster is your friend, anyway?
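
The ten-coin example, computed by Bayes’s Rule (multiply each prior by its likelihood, then compare):

```python
def posterior_odds(prior_a, likelihood_a, prior_b, likelihood_b):
    """How many times likelier hypothesis A is than hypothesis B."""
    return (prior_a * likelihood_a) / (prior_b * likelihood_b)

# Nine fair coins vs. one two-headed coin; one flip comes up heads.
print(posterior_odds(9 / 10, 0.5, 1 / 10, 1.0))  # 4.5
```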

The Copernican Principle

  • Copernicus four hundred years earlier: Where are we? Where in the universe is the Earth? Copernicus would make the radical paradigm shift of imagining that the Earth was not the bull’s-eye center of the universe—that it was, in fact, nowhere special in particular. Gott decided to take the same step with regard to time.
  • He made the assumption that the moment when he encountered the Berlin Wall wasn’t special—that it was equally likely to be any moment in the wall’s total lifetime. And if any moment was equally likely, then on average his arrival should have come precisely at the halfway point (since it was 50% likely to fall before halfway and 50% likely to fall after). More generally, unless we know better we can expect to have shown up precisely halfway into the duration of any given phenomenon. And if we assume that we’re arriving precisely halfway into something’s duration, the best guess we can make for how long it will last into the future becomes obvious: exactly as long as it’s lasted already.
  • This straightforward reasoning is what Gott named the Copernican Principle: absent any other knowledge, the best prediction is that a phenomenon will continue as long into the future as it has already existed.

Bayes Meets Copernicus

  • The Copernican Principle is exactly what results from applying Bayes’s Rule using what is known as an uninformative prior.
  • In the case of the Berlin Wall, an uninformative prior means saying that we don’t know anything about the time span we’re trying to predict: the wall could equally well come down in the next five minutes or last for five millennia.
  • In the mid-twentieth century, the Bayesian statistician Harold Jeffreys had looked into determining the number of tramcars in a city given the serial number on just one tramcar, and came up with the same answer: double the serial number. And a similar problem had arisen even earlier, during World War II, when the Allies sought to estimate the number of tanks being produced by Germany. Purely mathematical estimates based on captured tanks’ serial numbers predicted that the Germans were producing 246 tanks every month, while estimates obtained by extensive (and highly risky) aerial reconnaissance suggested the figure was more like 1,400. After the war, German records revealed the true figure: 245.
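A quick simulation of the doubling rule, under the assumption that observed serial numbers are drawn uniformly from 1 to the true count (the scenario and numbers here are invented for illustration):

```python
import random

random.seed(0)
TRUE_COUNT = 1000   # the real number of tanks/tramcars, unknown to the guesser
TRIALS = 100_000

# Observe a single random serial number and guess "double it",
# per Jeffreys and the Copernican Principle.
guesses = [2 * random.randint(1, TRUE_COUNT) for _ in range(TRIALS)]
print(sum(guesses) / TRIALS)   # ~1001: on average, doubling lands on the truth
```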

Real-World Priors …

  • In the broadest sense, there are two types of things in the world: things that tend toward (or cluster around) some kind of “natural” value, and things that don’t.
  • Human life spans are clearly in the former category. They roughly follow what’s termed a “normal” distribution—also known as the “Gaussian” distribution, after the German mathematician Carl Friedrich Gauss, and informally called the “bell curve” for its characteristic shape.
  • There are a number of things in the world that don’t look normally distributed, however—not by a long shot. The average population of a town in the United States, for instance, is 8,226. But if you were to make a graph of the number of towns by population, you wouldn’t see anything remotely like a bell curve. There would be way more towns smaller than 8,226 than larger. At the same time, the larger ones would be way bigger than the average. This kind of pattern typifies what are called “power-law distributions.”
  • The power-law distribution characterizes a host of phenomena in everyday life that have the same basic quality as town populations: most things below the mean, and a few enormous ones above it.
  • Two-thirds of the US population make less than the mean income, but the top 1% make almost ten times the mean. And the top 1% of the 1% make ten times more than that.

… and Their Prediction Rules

  • For any power-law distribution, Bayes’s Rule indicates that the appropriate prediction strategy is a Multiplicative Rule: multiply the quantity observed so far by some constant factor. For an uninformative prior, that constant factor happens to be 2, hence the Copernican prediction; in other power-law cases, the multiplier will depend on the exact distribution you’re working with. For the grosses of movies, for instance, it happens to be about 1.4. So if you hear a movie has made $6 million so far, you can guess it will make about $8.4 million overall; if it’s made $90 million, guess it will top out at $126 million.
  • When we apply Bayes’s Rule with a normal distribution as a prior, on the other hand, we obtain a very different kind of guidance. Instead of a multiplicative rule, we get an Average Rule: use the distribution’s “natural” average—its single, specific scale—as your guide. For instance, if somebody is younger than the average life span, then simply predict the average; as their age gets close to and then exceeds the average, predict that they’ll live a few years more. Following this rule gives reasonable predictions for the 90-year-old and the 6-year-old: 94 and 77, respectively. (The 6-year-old gets a tiny edge over the population average of 76 by virtue of having made it through infancy: we know he’s not in the distribution’s left tail.)
  • Between those two extremes, there’s actually a third category of things in life: those that are neither more nor less likely to end just because they’ve gone on for a while. Sometimes things are simply … invariant.
  • These in-between quantities turn out to follow what’s called the Erlang distribution. The shape of this curve differs from both the normal and the power-law: it has a winglike contour, rising to a gentle hump, with a tail that falls off faster than a power-law but more slowly than a normal distribution.
  • The Erlang distribution gives us a third kind of prediction rule, the Additive Rule: always predict that things will go on just a constant amount longer. The familiar refrain of “Just five more minutes!… [five minutes later] Five more minutes!” that so often characterizes human claims regarding, say, one’s readiness to leave the house or office, or the time until the completion of some task, may seem indicative of some chronic failure to make realistic estimates. Well, in the cases where one’s up against an Erlang distribution, anyway, that refrain happens to be correct. (A sketch of all three prediction rules follows at the end of this section.)
  • If a casino card-playing enthusiast tells his impatient spouse, for example, that he’ll quit for the day after hitting one more blackjack (the odds of which are about 20 to 1), he might cheerily predict, “I’ll be done in about twenty more hands!” If, an unlucky twenty hands later, she returns, asking how long he’s going to make her wait now, his answer will be unchanged: “I’ll be done in about twenty more hands!” It sounds like our indefatigable card shark has suffered a short-term memory loss—but, in fact, his prediction is entirely correct. Indeed, distributions that yield the same prediction, no matter their history or current state, are known to statisticians as “memoryless.”
  • So a power-law event is more surprising the longer we’ve been waiting for it—and maximally surprising right before it happens. A nation, corporation, or institution only grows more venerable with each passing year, so it’s always stunning when it collapses.
  • In a normal distribution, events are surprising when they’re early—since we expected them to reach the average—but not when they’re late. Indeed, by that point they seem overdue to happen, so the longer we wait, the more we expect them.
  • And in an Erlang distribution, events by definition are never any more or less surprising no matter when they occur. Any state of affairs is always equally likely to end regardless of how long it’s lasted. No wonder politicians are always thinking about their next election.
  • Knowing what distribution you’re up against can make all the difference.
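A minimal sketch of the three rules side by side. The function names, the “nudge” in the Average Rule, and the example numbers are my own simplifications of the rules described above:

```python
def multiplicative_rule(observed, factor=2.0):
    """Power-law prior: multiply what you've seen so far by a constant.
    factor=2 is the uninformative (Copernican) case; ~1.4 fits movie grosses."""
    return observed * factor

def average_rule(observed_age, mean=76, nudge=4):
    """Normal prior, crudely: predict the natural average; once someone is
    near or past it, predict just a few years more. (The exact Bayesian
    answer conditions the full distribution on the observed age.)"""
    return max(mean, observed_age + nudge)

def additive_rule(observed, constant=20):
    """Erlang prior: always predict a constant amount more, whatever the history."""
    return observed + constant

print(multiplicative_rule(90, 1.4))         # $90M grossed so far -> ~$126M total
print(average_rule(90))                     # a 90-year-old -> predict 94
print(additive_rule(0), additive_rule(20))  # "about twenty more hands" -- always
```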

Small Data and the Mind

  • The three prediction rules—Multiplicative, Average, and Additive—are applicable in a wide range of everyday situations. And in those situations, people in general turn out to be remarkably good at using the right prediction rule.
  • In a study by Tom Griffiths and Josh Tenenbaum, which asked people to predict everyday quantities from a single observation, the predictions that people made were extremely close to those produced by Bayes’s Rule. Intuitively, people made different types of predictions for quantities that followed different distributions—power-law, normal, and Erlang—in the real world. In other words, while you might not know or consciously remember which situation calls for the Multiplicative, Average, or Additive Rule, the predictions you make every day tend to implicitly reflect the different cases where these distributions appear in everyday life, and the different ways they behave.
  • Small data is big data in disguise. The reason we can often make good predictions from a small number of observations—or just a single one—is that our priors are so rich. Whether we know it or not, we appear to carry around in our heads surprisingly accurate priors about movie grosses and running times, poem lengths, and political terms of office, not to mention human life spans. We don’t need to gather them explicitly; we absorb them from the world.
  • In cases where we don’t have good priors, our predictions aren’t good. In Tom and Josh’s study, for instance, there was one domain where people’s predictions systematically diverged from Bayes’s Rule: predicting the length of the reign of Egyptian pharaohs. (As it happens, pharaohs’ reigns follow an Erlang distribution.) People simply didn’t have enough everyday exposure to have an intuitive feel for the range of those values, so their predictions faltered. Good predictions require good priors.
  • Our judgments betray our expectations, and our expectations betray our experience. What we project about the future reveals a lot—about the world we live in, and about our own past.

What Our Predictions Tell Us About Ourselves

  • Each child would be shown a delicious treat, such as a marshmallow, and told that the adult running the experiment was about to leave the room for a while. If they wanted to, they could eat the treat right away. But if they waited until the experimenter came back, they would get two treats.
  • Unable to resist, some of the children ate the treat immediately. And some of them stuck it out for the full fifteen minutes or so until the experimenter returned, and got two treats as promised. But perhaps the most interesting group comprised the ones in between—the ones who managed to wait a little while, but then surrendered and ate the treat.
  • These cases, where children struggled mightily and suffered valiantly, only to give in and lose the extra marshmallow anyway, have been interpreted as suggesting a kind of irrationality. If you’re going to cave, why not just cave immediately, and skip the torture? But it all depends on what kind of situation the children think they are in. As the University of Pennsylvania’s Joe McGuire and Joe Kable have pointed out, if the amount of time it takes for adults to come back is governed by a power-law distribution—with long absences suggesting even longer waits lie ahead—then cutting one’s losses at some point can make perfect sense.
  • Decades after the original marshmallow experiments, Walter Mischel and his colleagues went back and looked at how the participants were faring in life. Astonishingly, they found that children who had waited for two treats grew into young adults who were more successful than the others, even measured by quantitative metrics like their SAT scores. If the marshmallow test is about willpower, this is a powerful testament to the impact that learning self-control can have on one’s life.
  • In a later study, children were first given an art project to work on, and the experimenter promised to return with better supplies, a promise she kept for one group of children and broke for the other. The art project completed, the children went on to the standard marshmallow test. And here, the children who had learned that the experimenter was unreliable were more likely to eat the marshmallow before she came back, losing the opportunity to earn a second treat.
  • Failing the marshmallow test—and being less successful in later life—may not be about lacking willpower. It could be a result of believing that adults are not dependable: that they can’t be trusted to keep their word, that they disappear for intervals of arbitrary length. Learning self-control is important, but it’s equally important to grow up in an environment where adults are consistently present and trustworthy.

Priors in the Age of Mechanical Reproduction

As if someone were to buy several copies of the morning paper to assure himself that what it said was true. —LUDWIG WITTGENSTEIN
He is careful of what he reads, for that is what he will write. He is careful of what he learns, for that is what he will know. —ANNIE DILLARD
  • The best way to make good predictions, as Bayes’s Rule shows us, is to be accurately informed about the things you’re predicting.
  • Everything starts to break down, however, when a species gains language. What we talk about isn’t what we experience—we speak chiefly of interesting things, and those tend to be things that are uncommon.
  • When people talk about what interests them—and offer stories they think their listeners will find interesting—it skews the statistics of our experience. That makes it hard to maintain appropriate prior distributions.
  • Consider how many times you’ve seen either a crashed plane or a crashed car. It’s entirely possible you’ve seen roughly as many of each—yet many of those cars were on the road next to you, whereas the planes were probably on another continent, transmitted to you via the Internet or television. In the United States, for instance, the total number of people who have lost their lives in commercial plane crashes since the year 2000 would not be enough to fill Carnegie Hall even half full. In contrast, the number of people in the United States killed in car accidents over that same time is greater than the entire population of Wyoming.
  • Simply put, the representation of events in the media does not track their frequency in the world. As sociologist Barry Glassner notes, the murder rate in the United States declined by 20% over the course of the 1990s, yet during that time period the presence of gun violence on American news increased by 600%.
  • If you want to be a good intuitive Bayesian—if you want to naturally make good predictions, without having to think about what kind of prediction rule is appropriate—you need to protect your priors. Counterintuitively, that might mean turning off the news.

7. Overfitting: When to Think Less

  • When Charles Darwin was trying to decide whether he should propose to his cousin Emma Wedgwood, he got out a pencil and paper and weighed every possible consequence.
  • The pro-and-con list was already a time-honored algorithm by Darwin’s day, having been endorsed by Benjamin Franklin a century before.
  • When we think about thinking, it’s easy to assume that more is better: that you will make a better decision the more pros and cons you list, make a better prediction about the price of a stock the more relevant factors you identify, and write a better report the more time you spend working on it.
  • The question of how hard to think, and how many factors to consider, is at the heart of a knotty problem that statisticians and machine-learning researchers call “overfitting.” And dealing with that problem reveals that there’s a wisdom to deliberately thinking less.

The Case Against Complexity

  • One possible formula would use just a single factor to predict life satisfaction: the time since marriage. This would create a straight line on the chart. Another possibility is to use two factors, time and time squared; the resulting line would have a parabolic U-shape, letting it capture a potentially more complex relationship between time and happiness. And if we expand the formula to include yet more factors (time cubed and so on), the line will acquire ever more inflection points, getting more and more “bendy” and flexible. By the time we get to a nine-factor formula, we can capture very complex relationships indeed.
  • Mathematically speaking, our two-factor model incorporates all the information that goes into the one-factor model, and has another term it could use as well. Likewise, the nine-factor model leverages all of the information at the disposal of the two-factor model, plus potentially lots more. By this logic, it seems like the nine-factor model ought to always give us the best predictions.
  • It is indeed true that including more factors in a model will always, by definition, make it a better fit for the data we have already. But a better fit for the available data does not necessarily mean a better prediction.
  • Adding small amounts of random “noise” to the data (simulating the effects of repeating the survey with a different group of participants) produces wild undulations in the nine-factor model, while the one- and two-factor models in comparison are much more stable and consistent in their predictions.
  • Granted, a model that’s too simple—for instance, the straight line of the one-factor formula—can fail to capture the essential pattern in the data. If the truth looks like a curve, no straight line can ever get it right. On the other hand, a model that’s too complicated, such as our nine-factor model here, becomes oversensitive to the particular data points that we happened to observe.
  • It’s not always better to use a more complex model, one that takes a greater number of factors into account.
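A small numpy sketch of this instability, with an invented toy dataset standing in for the life-satisfaction survey:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)          # predictor, scaled to 0..1 for a stable fit
true_y = 7 - 5 * x + 4 * x**2      # a gently curved underlying "truth"

for degree in (1, 2, 9):
    # Fit the same model to two independently noisy re-runs of the "survey"...
    fits = [np.polyfit(x, true_y + rng.normal(0, 0.3, x.size), degree)
            for _ in range(2)]
    # ...and measure how far apart the two fitted curves swing between points.
    grid = np.linspace(0, 1, 101)
    gap = np.abs(np.polyval(fits[0], grid) - np.polyval(fits[1], grid)).max()
    print(f"degree {degree}: the two fits differ by up to {gap:.2f}")
# The one- and two-factor fits barely move between noise draws;
# the nine-factor fit swings far more, chasing the noise.
```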

The Idolatry of Data

  • Overfitting poses a danger any time we’re dealing with noise or mismeasurement—and we almost always are. There can be errors in how the data were collected, or in how they were reported.
  • When making a big decision, we can only guess at what will please us later by thinking about the factors important to us right now. (As Harvard’s Daniel Gilbert puts it, our future selves often “pay good money to remove the tattoos that we paid good money to get.”) When making a financial forecast, we can only look at what correlated with the price of a stock in the past, not what might in the future.
  • As a consequence, considering more and more factors and expending more effort to model them can lead us into the error of optimizing for the wrong thing.

Overfitting Everywhere

  • How can it be that the foods that taste best to us are broadly considered to be bad for our health, when the entire function of taste buds, evolutionarily speaking, is to prevent us from eating things that are bad?
  • The answer is that taste is our body’s proxy metric for health. Fat, sugar, and salt are important nutrients, and for a couple hundred thousand years, being drawn to foods containing them was a reasonable measure for a sustaining diet.
  • Beware: when you go to the gym to work off the extra weight from all that sugar, you can also risk overfitting fitness. Certain visible signs of fitness—low body fat and high muscle mass, for example—are easy to measure, and they are related to, say, minimizing the risk of heart disease and other ailments. But these measurable signs are only proxies, and training to optimize the proxy itself can come apart from optimizing actual health.
  • The twenty-first-century shift into real-time analytics has only made the danger of metrics more intense. Avinash Kaushik, digital marketing evangelist at Google, warns that trying to get website users to see as many ads as possible naturally devolves into trying to cram sites with ads: “When you are paid on a [cost per thousand impressions] basis the incentive is to figure out how to show the most possible ads on every page [and] ensure the visitor sees the most possible pages on the site.… That incentive removes a focus from the important entity, your customer, and places it on the secondary entity, your advertiser.” The website might gain a little more money in the short term, but ad-crammed articles, slow-loading multi-page slide shows, and sensationalist clickbait headlines will drive away readers in the long run. Kaushik’s conclusion: “Friends don’t let friends measure Page Views. Ever.”
  • Training can overfit, too: in one particularly dramatic case, an officer instinctively grabbed the gun out of the hands of an assailant and then instinctively handed it right back—just as he had done time and time again with his trainers in practice.

Detecting Overfitting: Cross-Validation

  • Cross-Validation means assessing not only how well a model fits the data it’s given, but how well it generalizes to data it hasn’t seen. Paradoxically, this may involve using less data.
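A minimal hold-out sketch, reusing the toy polynomial data from above (scikit-learn offers fancier k-fold machinery, but the idea fits in a few lines):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 16)
y = 7 - 5 * x + 4 * x**2 + rng.normal(0, 0.3, x.size)

held_out = np.arange(x.size) % 4 == 0   # reserve every fourth point as a test set
for degree in (1, 2, 9):
    coeffs = np.polyfit(x[~held_out], y[~held_out], degree)
    train_err = np.abs(np.polyval(coeffs, x[~held_out]) - y[~held_out]).mean()
    test_err  = np.abs(np.polyval(coeffs, x[held_out]) - y[held_out]).mean()
    print(f"degree {degree}: train error {train_err:.2f}, held-out error {test_err:.2f}")
# The nine-factor model looks best on the points it was fit to,
# and worst on the points it never saw: the signature of overfitting.
```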

How to Combat Overfitting: Penalizing Complexity

If you can’t explain it simply, you don’t understand it well enough.
  • From a statistics viewpoint, overfitting is a symptom of being too sensitive to the actual data we’ve seen. The solution, then, is straightforward: we must balance our desire to find a good fit against the complexity of the models we use to do so.
  • One way to choose among several competing models is the Occam’s razor principle, which suggests that, all things being equal, the simplest possible hypothesis is probably the correct one.
  • Introduce an additional term to your calculations that penalizes more complex solutions. If we introduce a complexity penalty, then more complex models need to do not merely a better job but a significantly better job of explaining the data to justify their greater complexity. Computer scientists refer to this principle—using constraints that penalize models for their complexity—as Regularization.
  • One algorithm, discovered in 1996 by biostatistician Robert Tibshirani, is called the Lasso and uses as its penalty the total weight of the different factors in the model. By putting this downward pressure on the weights of the factors, the Lasso drives as many of them as possible completely to zero. Only the factors that have a big impact on the results remain in the equation—thus potentially transforming, say, an overfitted nine-factor model into a simpler, more robust formula with just a couple of the most critical factors. (A sketch of the Lasso in action follows at the end of this list.)
  • The fact that the human brain burns about a fifth of humans’ total daily caloric intake is a testament to the evolutionary advantages that our intellectual abilities provide us with: the brain’s contributions must somehow more than pay for that sizable fuel bill.
  • On the other hand, we can also infer that a substantially more complex brain probably didn’t provide sufficient dividends, evolutionarily speaking. We’re as brainy as we have needed to be, but not extravagantly more so.
  • Brains try to minimize the number of neurons that are firing at any given moment—implementing the same kind of downward pressure on complexity as the Lasso.
  • Language forms yet another natural Lasso: complexity is punished by the labor of speaking at greater length and the taxing of our listener’s attention span. Business plans get compressed to an elevator pitch; life advice becomes proverbial wisdom only if it is sufficiently concise and catchy. And anything that needs to be remembered has to pass through the inherent Lasso of memory.
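A sketch of the Lasso at work, using scikit-learn on invented data in which only two of nine candidate factors actually matter:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 9))                             # nine candidate factors
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 40)   # only two truly matter

plain = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)   # alpha is the price charged for complexity

print(np.round(plain.coef_, 2))   # stray nonzero weights on all nine factors
print(np.round(lasso.coef_, 2))   # most weights driven exactly to zero
```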

The Upside of Heuristics

  • The economist Harry Markowitz won the 1990 Nobel Prize in Economics for developing modern portfolio theory: his groundbreaking “mean-variance portfolio optimization” showed how an investor could make an optimal allocation among various funds and assets to maximize returns at a given level of risk. So when it came time to invest his own retirement savings, it seems like Markowitz should have been the one person perfectly equipped for the job. What did he decide to do?
    • I should have computed the historical covariances of the asset classes and drawn an efficient frontier. Instead, I visualized my grief if the stock market went way up and I wasn’t in it—or if it went way down and I was completely in it. My intention was to minimize my future regret. So I split my contributions fifty-fifty between bonds and equities.
  • Why in the world would he do that? The story of the Nobel Prize winner and his investment strategy could be presented as an example of human irrationality: faced with the complexity of real life, he abandoned the rational model and followed a simple heuristic. But it’s precisely because of the complexity of real life that a simple heuristic might in fact be the rational solution.
  • When it comes to portfolio management, it turns out that unless you’re highly confident in the information you have about the markets, you may actually be better off ignoring that information altogether.
  • The study of heuristics shows that less information, computation, and time can in fact improve accuracy.

The Weight of History

  • The soy milk market in the United States more than quadrupled from the mid-1990s to 2013. But by the end of 2013, according to news headlines, it already seemed to be a thing of the past, a distant second place to almond milk. As food and beverage researcher Larry Finkel told Bloomberg Businessweek: “Nuts are trendy now. Soy sounds more like old-fashioned health food.” The Silk company, famous for popularizing soy milk (as the name implies), reported in late 2013 that its almond milk products had grown by more than 50% in the previous quarter alone. Meanwhile, in other beverage news, the leading coconut water brand, Vita Coco, reported in 2014 that its sales had doubled since 2011—and had increased an astounding three-hundred-fold since 2004. As the New York Times put it, “coconut water seems to have jumped from invisible to unavoidable without a pause in the realm of the vaguely familiar.” Meanwhile, the kale market grew by 40% in 2013 alone. The biggest purchaser of kale the year before had been Pizza Hut, which put it in their salad bars—as decoration.
  • If some particular study happens to suggest a health benefit from, say, star anise, it can be all over the blogosphere within the week, on television the week after that, and in seemingly every supermarket in six months, with dedicated star anise cookbooks soon rolling off the presses. This breathtaking speed is both a blessing and a curse.
  • In contrast, if we look at the way organisms—including humans—evolve, we notice something intriguing: change happens slowly. This means that the properties of modern-day organisms are shaped not only by their present environments, but also by their history.
  • When it comes to culture, tradition plays the role of the evolutionary constraints. A bit of conservatism, a certain bias in favor of history, can buffer us against the boom-and-bust cycle of fads.
  • In machine learning, the advantages of moving slowly emerge most concretely in a regularization technique known as Early Stopping: many models are fit iteratively, matching the data a little more closely with each pass. What happens if we stop that process early and simply don’t allow a model the time to become too complex?
  • Giving yourself more time to decide about something does not necessarily mean that you’ll make a better decision. But it does guarantee that you’ll end up considering more factors, more hypotheticals, more pros and cons, and thus risk overfitting.
  • The effectiveness of regularization in all kinds of machine-learning tasks suggests that we can make better decisions by deliberately thinking and doing less.
  • If the factors we come up with first are likely to be the most important ones, then beyond a certain point thinking more about a problem is not only going to be a waste of time and effort—it will lead us to worse solutions.

When to Think Less

  • If you have all the facts, they’re free of all error and uncertainty, and you can directly assess whatever is important to you, then don’t stop early. Think long and hard: the complexity and effort are appropriate.
  • If you have high uncertainty and limited data, then do stop early by all means. If you don’t have a clear read on how your work will be evaluated, and by whom, then it’s not worth the extra time to make it perfect with respect to your own (or anyone else’s) idiosyncratic guess at what perfection might be. The greater the uncertainty, the bigger the gap between what you can measure and what matters, the more you should watch out for overfitting—that is, the more you should prefer simplicity, and the earlier you should stop.
  • The upshot of Early Stopping is that sometimes it’s not a matter of choosing between being rational and going with our first instinct. Going with our first instinct can be the rational solution. The more complex, unstable, and uncertain the decision, the more rational an approach that is.
  • Darwin made up his mind exactly when his notes reached the bottom of the diary sheet. He was regularizing to the page. This is reminiscent of both Early Stopping and the Lasso: anything that doesn’t make the page doesn’t make the decision.
  • His mind made up to marry, Darwin immediately went on to overthink the timing. “When? Soon or Late,” he wrote above another list of pros and cons, considering everything from happiness to expenses to “awkwardness” to his long-standing desire to travel in a hot air balloon and/or to Wales. But by the end of the page he resolved to “Never mind, trust to chance”—and the result, within several months’ time, was a proposal to Emma Wedgwood, the start of a fulfilling partnership and a happy family life.

8. Relaxation: Let It Slide

  • Bellows worked out a way to numerically define the strength of the relationships among all the guests. If a particular pair of people didn’t know one another they got a 0, if they did they got a 1, and if they were a couple they got a 50. (The sister of the bride got to give a score of 10 to all the people she wanted to sit with, as a special prerogative.) Bellows then specified a few constraints: the maximum table capacity, and a minimum score necessary for each table, so that no one table became the awkward “miscellaneous” group full of strangers. She also codified the program’s goal: to maximize the relationship scores between the guests and their tablemates.
  • There were 107 people at the wedding and 11 tables, which could accommodate ten people each. This means there were about 11^107 possible seating plans: that’s a 112-digit number, more than 200 billion googols, a figure that dwarfs the (merely 80-digit) number of atoms in the observable universe. Bellows submitted the job to her lab computer on Saturday evening and let it churn. When she came in on Monday morning, it was still running; she had it spit out the best assignment it had found so far.
  • There are entire classes of problems where a perfect solution is essentially unreachable, no matter how fast we make our computers or how cleverly we program them.

The Difficulty of Optimization

  • Traveling Salesman problem: How to visit all the necessary towns while covering as few miles as possible and without going to any town twice?
  • This is an instance of what’s known to mathematicians and computer scientists as a “constrained optimization” problem: how to find the single best arrangement of a set of variables, given particular rules and a scorekeeping measure.
  • In the traveling salesman problem, the question isn’t whether a computer (or a mathematician) could find the shortest route: theoretically, one can simply crank out a list of all the possibilities and measure each one. Rather, the issue is that as the number of towns grows, the list of possible routes connecting them explodes. A route is just an ordering of the towns, so trying them all by brute force is the dreaded O(n!) “factorial time”—the computational equivalent of sorting a deck of cards by throwing them in the air until they happen to land in order.

Defining Difficulty

  • Cobham–Edmonds thesis: an algorithm should be considered “efficient” if it runs in what’s called “polynomial time”—that is, O(n²), O(n³), or in fact n to the power of any number at all. A problem, in turn, is considered “tractable” if we know how to solve it using an efficient algorithm. A problem we don’t know how to solve in polynomial time, on the other hand, is considered “intractable.” And at anything but the smallest scales, intractable problems are beyond the reach of solution by computers, no matter how powerful.
  • This amounts to what is arguably the central insight of computer science. It’s possible to quantify the difficulty of a problem. And some problems are just … hard.

Just Relax

The perfect is the enemy of the good. — VOLTAIRE
  • When somebody tells you to relax, it’s probably because you’re uptight—making a bigger deal of things than you should. When computer scientists are up against a formidable challenge, their minds also turn to relaxation, as they pass around books like An Introduction to Relaxation Methods or Discrete Relaxation Techniques. But they don’t relax themselves; they relax the problem.
  • One of the simplest forms of relaxation in computer science is known as Constraint Relaxation. In this technique, researchers remove some of the problem’s constraints and set about solving the problem they wish they had. Then, after they’ve made a certain amount of headway, they try to add the constraints back in. That is, they make the problem temporarily easier to handle before bringing it back to reality.
  • For instance, you can relax the traveling salesman problem by letting the salesman visit the same town more than once, and letting him retrace his steps for free. Finding the shortest route under these looser rules produces what’s called the “minimum spanning tree.” (If you prefer, you can also think of the minimum spanning tree as the fewest miles of road needed to connect every town to at least one other town. The shortest traveling salesman route and the minimum spanning tree for Lincoln’s judicial circuit are shown below.)
[figure: the shortest traveling salesman route vs. the minimum spanning tree for Lincoln’s judicial circuit]
  • As it turns out, solving this looser problem takes a computer essentially no time at all. And while the minimum spanning tree doesn’t necessarily lead straight to the solution of the real problem, it is quite useful all the same. For one thing, the spanning tree, with its free backtracking, will never be any longer than the real solution, which has to follow all the rules. Therefore, we can use the relaxed problem—the fantasy—as a lower bound on the reality.
  • If we calculate that the spanning tree distance for a particular set of towns is 100 miles, we can be sure the traveling salesman distance will be no less than that. And if we find, say, a 110-mile route, we can be certain it is at most 10% longer than the best solution. Thus we can get a grasp of how close we are to the real answer even without knowing what it is. (A code sketch of this lower-bound idea follows at the end of this section.)
  • Even better, in the traveling salesman problem it turns out that the minimum spanning tree is actually one of the best starting points from which to begin a search for the real solution. Approaches like these have allowed even one of the largest traveling salesman problems imaginable—finding the shortest route that visits every single town on Earth—to be solved to within less than 0.05% of the (unknown) optimal solution.
  • Though most of us haven’t encountered the formal algorithmic version of Constraint Relaxation, its basic message is familiar to almost anyone who’s dreamed big about life questions. What would you do if you weren’t afraid? reads a mantra you might have seen in a guidance counselor’s office or heard at a motivational seminar. What would you do if you could not fail? Similarly, when considering questions of profession or career, we ask questions like What would you do if you won the lottery? or, taking a different tack, What would you do if all jobs paid the same? The idea behind such thought exercises is exactly that of Constraint Relaxation: to make the intractable tractable, to make progress in an idealized world that can be ported back to the real one. If you can’t solve the problem in front of you, solve an easier version of it—and then see if that solution offers you a starting point, or a beacon, in the full-blown problem. Maybe it does.
  • What relaxation cannot do is offer you a guaranteed shortcut to the perfect answer. But computer science can also quantify the tradeoff that relaxation offers between time and solution quality. In many cases, the ratio is dramatic, a no-brainer—for instance, an answer at least half as good as the perfect solution in a quadrillionth of the time.
  • If we’re willing to accept solutions that are close enough, then even some of the hairiest problems around can be tamed with the right techniques.
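A self-contained sketch of the lower-bound trick: build the minimum spanning tree with Prim’s algorithm, then compare it against the true shortest tour (brute-forced here, since eight invented towns are still tractable):

```python
import itertools, math, random

random.seed(4)
towns = [(random.random(), random.random()) for _ in range(8)]
dist = {(a, b): math.dist(towns[a], towns[b])
        for a, b in itertools.permutations(range(8), 2)}

# Prim's algorithm: grow the minimum spanning tree one cheapest edge at a time.
in_tree, mst_weight = {0}, 0.0
while len(in_tree) < len(towns):
    a, b = min(((a, b) for a in in_tree
                for b in range(len(towns)) if b not in in_tree),
               key=lambda edge: dist[edge])
    in_tree.add(b)
    mst_weight += dist[(a, b)]

# The exact shortest tour, by checking all 5,040 orderings of the other towns.
best = math.inf
for perm in itertools.permutations(range(1, len(towns))):
    tour = (0,) + perm + (0,)
    best = min(best, sum(dist[(tour[i], tour[i + 1])] for i in range(len(tour) - 1)))

print(f"MST lower bound: {mst_weight:.3f}   optimal tour: {best:.3f}")
# The relaxed answer is never longer than the real one -- a guaranteed floor.
```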

Uncountably Many Shades of Gray: Continuous Relaxation

  • "Discrete optimization”—that is, there’s no smooth continuum among its solutions. The salesman goes either to this town or to that one; you’re either at table five or at table six. There are no shades of gray in between.
  • Such discrete optimization problems are all around us. In cities, for example, planners try to place fire trucks so that every house can be reached within a fixed amount of time—say, five minutes. Mathematically, each fire truck “covers” whatever houses can be reached within five minutes from its location. The challenge is finding the minimal set of locations such that all houses are covered.
  • The challenge of discrete optimization shows up in social settings, too. Imagine you wanted to throw a party for all your friends and acquaintances, but didn’t want to pay for all the envelopes and stamps that so many invitations would entail. You could instead decide to mail invitations to a few well-connected friends, and tell them to “bring everyone we know.” What you’d ideally want to find, then, is the smallest subgroup of your friends that knows all the rest of your social circle—which would let you lick the fewest envelopes and still get everyone to attend. Granted, this might sound like a lot of work just to save a few bucks on stamps, but it’s exactly the kind of problem that political campaign managers and corporate marketers want to solve to spread their message most effectively.
  • In fact, both the fire truck problem and the party invitation problem are intractable: no general efficient solution for them exists. But, as it turns out, there do exist a number of efficient strategies for solving the continuous versions of these problems, where any fraction or decimal is a possible solution.
  • Researchers facing an intractable discrete problem can try to relax it into a continuous one and see what happens.
  • In the case of the invitation problem, relaxing it from discrete to continuous optimization means that a solution may tell us to send someone a quarter of an invitation, and someone else two-thirds of one. What does that even mean? It obviously can’t be the answer to the original question, but, like the minimum spanning tree, it does give us a place to start
  • One way back to reality is simply to round, for example sending a whole invitation to everyone the relaxed solution assigns some substantial fraction of one. We could also interpret these fractions as probabilities—for instance, flipping a coin for every location where the relaxed solution tells us to put half a fire truck, and actually placing a truck there only if it lands heads. In either case, with these fractions turned back to whole numbers, we’ll have a solution that makes sense in the context of our original, discrete problem. (A sketch of this relax-then-round approach follows this list.)
  • The final step, as with any relaxation, is to ask how good this solution is compared to the actual best solution we might have come up with by exhaustively checking every single possible answer to the original problem
  • Continuous Relaxation is not a magic bullet: it still doesn’t give us an efficient way to get to the truly optimal answers, only to their approximations. But delivering twice as many mailings or inoculations as optimal is still far better than the unoptimized alternatives.
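A sketch of relax-then-round on the invitation problem, using scipy’s linear-programming solver on invented “who knows whom” data:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)
n_friends, n_guests = 8, 20
knows = rng.random((n_friends, n_guests)) < 0.35   # friend j can reach guest i
knows[rng.integers(n_friends, size=n_guests), np.arange(n_guests)] = True  # all reachable

# Continuous relaxation: minimize total (possibly fractional) invitations,
# subject to every guest being covered by at least one invited friend.
result = linprog(c=np.ones(n_friends),
                 A_ub=-knows.T.astype(float), b_ub=-np.ones(n_guests),
                 bounds=[(0, 1)] * n_friends)
print(np.round(result.x, 2))   # e.g., "send this person 0.5 of an invitation"

# Back to reality: randomized rounding. Invite friend j with probability x_j,
# retrying until every guest is actually covered.
while True:
    invited = rng.random(n_friends) < result.x
    if knows[invited].any(axis=0).all():
        break
print(int(invited.sum()), "real invitations vs. a fractional optimum of",
      round(result.x.sum(), 2))
```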

Just a Speeding Ticket: Lagrangian Relaxation

  • One day as a child, Brian was complaining to his mother about all the things he had to do: his homework, his chores.… “Technically, you don’t have to do anything,” his mother replied. “You don’t have to do what your teachers tell you. You don’t have to do what I tell you. You don’t even have to obey the law. There are consequences to everything, and you get to decide whether you want to face those consequences.”
  • Brian’s kid-mind was blown. It was a powerful message, an awakening of a sense of agency, responsibility, moral judgment. It was something else, too: a powerful computational technique called Lagrangian Relaxation. The idea behind Lagrangian Relaxation is simple. An optimization problem has two parts: the rules and the scorekeeping. In Lagrangian Relaxation, we take some of the problem’s rules and move them into the scorekeeping instead: the impossible becomes merely expensive, and the solver may break a constraint as long as it accepts the consequences. (A toy sketch follows at the end of this section.)
  • Michael Trick, whose algorithms have scheduled Major League Baseball and NCAA basketball seasons, notes that in sports the integer constraints—on how many teams play a game, how many games are played in sum, and how many times each team plays every other team—are just too strong. “And so we cannot relax in that direction. We really have got to keep the fundamental [discrete] part of the model.”
  • Unsurprisingly, given all these demands, Trick has found that computing a sports schedule often becomes possible only by softening some of these hard constraints.
Generally, when people first come to us with a sports schedule, they will claim … “We never do x and we never do y.” Then we look at their schedules and we say, “Well, twice you did x and three times you did y last year.” Then “Oh, yeah, well, okay. Then other than that we never do it.” And then we go back the year before.… We generally realize that there are some things they think they never do that people do do. People in baseball believe that the Yankees and the Mets are never at home at the same time. And it’s not true. It’s never been true. They are at home perhaps three games, perhaps six games in a year on the same day. But in the broad season, eighty-one games at home for each of the teams, it’s relatively rare, people forget about them.
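A toy sketch of the idea, not Trick’s actual scheduler: take a hard rule, convert it into a penalty in the score, and watch how the size of the penalty decides whether the rule gets bent:

```python
# Minimize (x - 3)^2 subject to the "rule" x <= 2 -- but instead of forbidding
# violations, charge for them, Lagrangian-style.
def score(x, penalty_weight):
    objective = (x - 3) ** 2        # what we actually want to minimize
    violation = max(0.0, x - 2)     # how badly the rule is broken
    return objective + penalty_weight * violation

candidates = [i / 10 for i in range(51)]   # search the grid 0.0 .. 5.0
for weight in (1, 10):
    best = min(candidates, key=lambda x: score(x, weight))
    print(f"penalty weight {weight}: best x = {best}")
# A light penalty lets the solver break the rule a little (x = 2.5);
# a heavy one effectively restores it (x = 2.0).
```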

Learning to Relax

  • There are many ways to relax a problem, and we’ve seen three of the most important. The first, Constraint Relaxation, simply removes some constraints altogether and makes progress on a looser form of the problem before coming back to reality. The second, Continuous Relaxation, turns discrete or binary choices into continua: when deciding between iced tea and lemonade, first imagine a 50–50 “Arnold Palmer” blend and then round it up or down. The third, Lagrangian Relaxation, turns impossibilities into mere penalties, teaching the art of bending the rules (or breaking them and accepting the consequences).
  • Relaxations offer us a number of advantages. For one, they offer a bound on the quality of the true solution. If we’re trying to pack our calendar, imagining that we can magically teleport across town will instantaneously make it clear that eight one-hour meetings is the most we could possibly expect to fit into a day; such a bound might be useful in setting expectations before we face the full problem. Second, relaxations are designed so that they can indeed be reconciled with reality—and this gives us bounds on the solution from the other direction.
  • Unless we’re willing to spend eons striving for perfection every time we encounter a hitch, hard problems demand that instead of spinning our tires we imagine easier versions and tackle those first. When applied correctly, this is not just wishful thinking, not fantasy or idle daydreaming. It’s one of our best ways of making progress.

9. Randomness: When to Leave It to Chance

  • Randomness seems like the opposite of reason—a form of giving up on a problem, a last resort. Far from it. The surprising and increasingly important role of randomness in computer science shows us that making use of chance can be a deliberate and effective part of approaching the hardest sets of problems. In fact, there are times when nothing else will do.
  • In contrast to the standard “deterministic” algorithms we typically imagine computers using, where one step follows from another in exactly the same way every time, a randomized algorithm uses randomly generated numbers to solve a problem.
  • Recent work in computer science has shown that there are cases where randomized algorithms can produce good approximate answers to difficult questions faster than all known deterministic algorithms.
  • Sometimes the best solution to a problem is to turn to chance rather than trying to fully reason out an answer.

Sampling

  • When we want to know something about a complex quantity, we can estimate its value by sampling from it.
  • What is the probability that a shuffled deck will yield a winnable game?
  • In a game like solitaire, reasoning your way through the space of possibilities gets almost instantly overwhelming. Flip over the first card, and there are fifty-two possible games to keep track of; flip over the second, and there are fifty-one possibilities for each first card. That means we’re already up into thousands of possible games before we’ve even begun to play
  • After trying some elaborate combinatorial calculations of this sort and giving up, the mathematician Stanislaw Ulam landed on a different approach, beautiful in its simplicity: just play the game.
  • I noticed that it may be much more practical to [try] … laying down the cards, or experimenting with the process and merely noticing what proportion comes out successfully, rather than to try to compute all the combinatorial possibilities which are an exponentially increasing number so great that, except in very elementary cases, there is no way to estimate it. This is intellectually surprising, and if not exactly humiliating, it gives one a feeling of modesty about the limits of rational or traditional thinking. In a sufficiently complicated problem, actual sampling is better than an examination of all the chains of possibilities.
  • When he says “better,” note that he doesn’t necessarily mean that sampling will offer you more precise answers than exhaustive analysis: there will always be some error associated with a sampling process, though you can reduce it by ensuring your samples are indeed random and by taking more and more of them. What he means is that sampling is better because it gives you an answer at all, in cases where nothing else will.
  • Metropolis named this approach—replacing exhaustive probability calculations with sample simulations—the Monte Carlo Method, after the Monte Carlo casino in Monaco, a place equally dependent on the vagaries of chance. The Los Alamos team was able to use it to solve key problems in nuclear physics. Today the Monte Carlo Method is one of the cornerstones of scientific computing.
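A toy stand-in for Ulam’s solitaire (simulating the full game is a project in itself): estimate by sampling the probability that a five-card deal contains at least one ace, a question where the exact answer is easy to check:

```python
import random
from math import comb

random.seed(6)
deck = ["ace"] * 4 + ["other"] * 48

TRIALS = 100_000
hits = sum("ace" in random.sample(deck, 5) for _ in range(TRIALS))
print(hits / TRIALS)                   # ~0.341, by Monte Carlo sampling

print(1 - comb(48, 5) / comb(52, 5))   # 0.34116..., the exact combinatorial answer
```

For a five-card question the combinatorics are easy; for a full game of solitaire they aren’t, and the sampling side of this comparison is the only one that still works.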

Randomized Algorithms

  • Michael Rabin would go on to win the Turing Award—the computer science equivalent of a Nobel—for extending theoretical computer science to accommodate “nondeterministic” cases, where a machine isn’t forced to pursue a single option but has multiple paths it might follow.
  • Rabin found a practical application for randomness in one of the oldest problems of them all: how to identify prime numbers.
  • Algorithms for finding prime numbers date back at least as far as ancient Greece, where mathematicians used a straightforward approach known as the Sieve of Eratosthenes. The Sieve of Eratosthenes works as follows: To find all the primes less than n, begin by writing down all the numbers from 1 to n in sequence. Then cross out all the numbers that are multiples of 2, besides 2 itself (4, 6, 8, 10, 12, and so on). Take the next smallest number that hasn’t been crossed out (in this case, 3), and cross out all multiples of that number (6, 9, 12, 15, and so on). Keep going like this, and the numbers that remain at the end are the primes.
  • For millennia, the study of prime numbers was believed to be, as G. H. Hardy put it, “one of the most obviously useless branches” of mathematics. But it lurched into practicality in the twentieth century, becoming pivotal in cryptography and online security.
  • As it happens, it is much easier to multiply primes together than to factor them back out. With big enough primes—say, a thousand digits—the multiplication can be done in a fraction of a second while the factoring could take literally millions of years; this makes for what is known as a “one-way function.”
  • In modern encryption, for instance, secret primes known only to the sender and recipient get multiplied together to create huge composite numbers that can be transmitted publicly without fear, since factoring the product would take any eavesdropper way too long to be worth attempting.
  • If you want to check whether a particular number is prime—known as testing its “primality”—then following the sieve strategy requires trying to divide it by all the primes up to its square root. Checking whether a six-digit number is prime would require dividing by all of the 168 primes less than 1,000—not so bad. But checking a twelve-digit number involves dividing by the 78,498 primes less than 1 million, and all that division quickly starts to get out of control. The primes used in modern cryptography are hundreds of digits long; forget about it.
  • In his PhD thesis, Gary Miller had developed an intriguingly promising, much faster algorithm for testing primality—but there was one small problem: it didn’t always work.
  • Miller had found a set of equations (expressed in terms of two numbers, n and x) that are always true if n is prime, regardless of what values you plug in for x. If they come out false even for a single value of x, then there’s no way n can be prime—in these cases, x is called a “witness” against primality. The problem, though, is false positives: even when n is not prime, the equations will still come out true some of the time.
  • Rabin realized that this was a place where a step outside the usually deterministic world of computer science might be valuable. If the number n is actually nonprime, how many possible values of x would give a false positive and declare it a prime number? The answer, Rabin showed, is no more than one-quarter. So for a random value of x, if Miller’s equations come out true, there’s only a one-in-four chance that n isn’t actually prime. And crucially, each time we sample a new random x and Miller’s equations check out, the probability that n only seems prime, but isn’t really, drops by another multiple of four. Repeat the procedure ten times, and the probability of a false positive is one in four to the tenth power—less than one in a million. Still not enough certainty? Check another five times and you’re down to one in a billion.
  • The Miller-Rabin primality test, as it’s now known, provides a way to quickly identify even gigantic prime numbers with an arbitrary degree of certainty.
  • How certain is certain enough? In practice, modern cryptographic systems, the ones that encrypt Internet connections and digital transactions, are tuned for a false positive rate of less than one in a million billion billion. In other words, that’s a decimal that begins with twenty-four zeros—less than one expected false prime even if you ran a test for every grain of sand on Earth.
  • Though you may have never heard of the Miller-Rabin test, your laptop, tablet, and phone know it well. Several decades after its discovery, it is still the standard method used to find and check primes in many domains. It’s working behind the scenes whenever you use your credit card online, and almost any time secure communications are sent through the air or over wires.
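A compact version of the test itself. This is the standard textbook formulation; the small-prime shortcut at the top is just a convenience:

```python
import random

def is_probably_prime(n: int, rounds: int = 10) -> bool:
    """Miller-Rabin: each random witness that checks out cuts the chance
    of a false positive by at least a factor of four."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7):
        if n % p == 0:
            return n == p
    d, r = n - 1, 0
    while d % 2 == 0:          # write n - 1 as d * 2^r with d odd
        d, r = d // 2, r + 1
    for _ in range(rounds):
        x = pow(random.randrange(2, n - 1), d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False       # found a witness: n is definitely composite
    return True                # no witness in `rounds` tries: almost surely prime

print(is_probably_prime(2**61 - 1))   # True  (a Mersenne prime)
print(is_probably_prime(2**61 + 1))   # False (divisible by 3, among others)
```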

In Praise of Sampling

  • The polynomial identity test asks whether two complicated algebraic expressions are in fact the same function. It shows that sometimes our effort is better spent checking random values—sampling from the two expressions we want to know about and comparing the results—than trying to untangle their inner workings.
  • John Rawls, who set for himself the ambitious task of reconciling two seemingly opposite key ideas in his field: liberty and equality. Is a society more “just” when it’s more free, or more equal? And do the two really have to be mutually exclusive? Rawls offered a way of approaching this set of questions that he called the “veil of ignorance.” Imagine, he said, that you were about to be born, but didn’t know as whom: male or female, rich or poor, urban or rural, sick or healthy. And before learning your status, you had to choose what kind of society you’d live in. What would you want? By evaluating various social arrangements from behind the veil of ignorance, argued Rawls, we’d more readily come to a consensus about what an ideal one would look like.
  • What Rawls’s thought experiment does not take into account, however, is the computational cost of making sense of a society from behind such a veil. How could we, in this hypothetical scenario, possibly hope to hold all of the relevant information in our heads? Set aside grand questions of justice and fairness for a moment and try to apply Rawls’s approach merely to, say, a proposed change in health insurance regulations. Take the probability of being born, perhaps, as someone who grows up to become a town clerk in the Midwest; multiply that by the distribution of the different health care plans available to government employees across various midwestern municipalities; multiply that by actuarial data that offer the probability of, for instance, a fractured tibia; multiply that by the average medical bill for the average procedure for a fractured tibia at a midwestern hospital given the distribution of possible insurance plans.… Okay, so would the proposed insurance revision be “good” or “bad” for the nation? We can barely hope to evaluate a single injured shin this way, let alone the lives of hundreds of millions.
  • Should we be trying, for instance, to maximize mean happiness, median happiness, total happiness, or something else?
  • When we need to make sense of, say, national health care reform—a vast apparatus too complex to be readily understood—our political leaders typically offer us two things: cherry-picked personal anecdotes and aggregate summary statistics. The anecdotes, of course, are rich and vivid, but they’re unrepresentative. Almost any piece of legislation, no matter how enlightened or misguided, will leave someone better off and someone worse off, so carefully selected stories don’t offer any perspective on broader patterns. Aggregate statistics, on the other hand, are the reverse: comprehensive but thin. We might learn, for instance, whether average premiums fell nationwide, but not how that change works out on a more granular level: they might go down for most but, Omelas-style, leave some specific group—undergraduates, or Alaskans, or pregnant women—in dire straits.
  • A close examination of random samples can be one of the most effective means of making sense of something too complex to be comprehended directly. When it comes to handling a qualitatively unmanageable problem, something so thorny and complicated that it can’t be digested whole—solitaire or atomic fission, primality testing or public policy—sampling offers one of the simplest, and also the best, ways of cutting through the difficulties.

The Three-Part Tradeoff

  • To understand the idea behind a Bloom filter, says Harvard computer scientist Michael Mitzenmacher, consider a search engine like Google, trying to crawl the entire web and index every possible URL. The web comprises well over a trillion distinct URLs, and the average URL weighs in at about seventy-seven characters long. When the search engine looks at some URL, how can it check whether that page has already been processed? Just storing a list of all the URLs that have been visited would take a huge amount of space, and repeatedly searching that list (even if it were fully sorted) could prove a nightmare. In fact, it could well be that the cure is worse than the disease: in other words, checking every time to make sure that we’re not reindexing a page might be more time-consuming than just indexing the occasional page twice.
  • But what if we only needed to be mostly sure this URL was new to us? That’s where the Bloom filter comes in. Named for its inventor, Burton H. Bloom, a Bloom filter works much like the Miller-Rabin primality test: the URL is entered into a set of equations that essentially check for “witnesses” to its novelty. (Rather than proclaim “n is not prime,” these equations say “I have not seen n before.”) If you’re willing to tolerate an error rate of just 1% or 2%, storing your findings in a probabilistic data structure like a Bloom filter will save you significant amounts of both time and space.
  • Bloom filters have shipped with a number of recent web browsers to check URLs against a list of known malicious websites, and they are also an important part of cryptocurrencies like Bitcoin.
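A toy Bloom filter, built on nothing but hashlib (real implementations tune the bit-array size and hash count to the desired error rate):

```python
import hashlib

class BloomFilter:
    """Lookups may say "probably seen" for a brand-new item (a false positive),
    but will never say "not seen" for an item that was actually added."""

    def __init__(self, n_bits: int = 10_000, n_hashes: int = 4):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits)

    def _positions(self, item: str):
        # Derive n_hashes independent bit positions by salting one hash function.
        for salt in range(self.n_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))

seen = BloomFilter()
seen.add("https://example.com/some/crawled/page")
print(seen.might_contain("https://example.com/some/crawled/page"))  # True
print(seen.might_contain("https://example.com/a/new/page"))   # almost surely False
```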

Hills, Valleys, and Traps

  • Imagine you’re putting together a globe-trotting ten-city vacation, your own version of the traveling salesman problem: you’ll start and finish in San Francisco and visit Seattle, Los Angeles, New York, Buenos Aires, London, Amsterdam, Copenhagen, Istanbul, Delhi, and Kyoto. You might not be too worried about the total length of the route, but you probably do want to minimize the monetary cost of the trip. The first thing to note here is that even though ten cities hardly sounds like a lot, the number of possible itineraries is ten factorial: more than three and a half million. In other words, there’s no practical way for you to simply check every permutation and pick the lowest price. You have to work smarter than that.
  • For your first attempt at an itinerary, you might look at taking the cheapest flight out of San Francisco (let’s say it’s Seattle), then taking the cheapest flight from there to any of the other remaining cities (call it Los Angeles), then the cheapest from there (say, New York), and so forth until you’re at your tenth city and you fly from there back to San Francisco. This is an example of a so-called greedy algorithm, which you can also think of as a “myopic algorithm”: one that shortsightedly takes the best thing available every step of the way.
  • Once you’ve assembled a baseline itinerary, you might test some alternatives by making slight perturbations to the city sequence and seeing if that makes an improvement. For instance, if we are going first to Seattle, then to Los Angeles, we can try doing those cities in reverse order: L.A. first, then Seattle. For any given itinerary, we can make eleven such two-city flip-flops; let’s say we try them all and then go with the one that gives us the best savings. From here we’ve got a new itinerary to work with, and we can start permuting that one, again looking for the best local improvement. This is an algorithm known as Hill Climbing—since the search through a space of solutions, some better and some worse, is commonly thought of in terms of a landscape with hills and valleys, where your goal is to reach the highest peak. (A code sketch of this procedure follows at the end of this section.)
  • Eventually you will end up with a solution that is better than all of its permutations; no matter which adjacent stops you flip, nothing beats it. It’s here that the hill climbing stops. Does this mean you’ve definitely found the single best possible itinerary, though? Sadly, no. You may have found only a so-called “local maximum,” not the global maximum of all the possibilities. The hill-climbing landscape is a misty one. You can know that you’re standing on a mountaintop because the ground falls away in all directions—but there might be a higher mountain just across the next valley, hidden behind clouds.
  • Even once we’ve found a solution that can’t be improved by any small tweaks, it’s possible that we are still missing the global maximum.
  • The true best itinerary may require a radical overhaul of the trip: doing entire continents in a different order, for instance, or proceeding westward instead of eastward. We may need to temporarily worsen our solution if we want to continue searching for improvements. And randomness provides a strategy—actually, several strategies—for doing just that.
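A sketch of Hill Climbing on the ten-city trip, in Python. The fares are random stand-ins (the book gives no prices), and each move is the adjacent “two-city flip-flop” described above; the search stops at the first itinerary that no single swap improves, a local optimum rather than necessarily the global one:

```python
import itertools, random

random.seed(0)
CITIES = ["Seattle", "Los Angeles", "New York", "Buenos Aires", "London",
          "Amsterdam", "Copenhagen", "Istanbul", "Delhi", "Kyoto"]
# Hypothetical fares standing in for real ticket prices:
FARE = {frozenset(pair): random.randint(100, 1500)
        for pair in itertools.combinations(CITIES + ["San Francisco"], 2)}

def cost(route):
    stops = ["San Francisco"] + route + ["San Francisco"]
    return sum(FARE[frozenset((a, b))] for a, b in zip(stops, stops[1:]))

def hill_climb(route):
    while True:
        # Try every adjacent two-city flip-flop; keep the cheapest variant.
        swaps = [route[:i] + [route[i + 1], route[i]] + route[i + 2:]
                 for i in range(len(route) - 1)]
        best = min(swaps, key=cost)
        if cost(best) >= cost(route):
            return route          # no swap helps: a (local) peak
        route = best

print(cost(CITIES[:]), "->", cost(hill_climb(CITIES[:])))
```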

Out of the Local Maximum

  • One approach is to augment Hill Climbing with what’s known as “jitter”: if it looks like you’re stuck, mix things up a little. Make a few random small changes (even if they are for the worse), then go back to Hill Climbing; see if you end up at a higher peak.
  • Another approach is to completely scramble our solution when we reach a local maximum, and start Hill Climbing anew from this random new starting point. This algorithm is known, appropriately enough, as “Random-Restart Hill Climbing”—or, more colorfully, as “Shotgun Hill Climbing.” It’s a strategy that proves very effective when there are lots of local maxima in a problem.
  • In decryption, having a text that looks somewhat close to sensible English doesn’t necessarily mean that you’re even on the right track. So sometimes it’s best not to get too attached to an initial direction that shows promise, and simply start over from scratch.
  • But there’s also a third approach: instead of turning to full-bore randomness when you’re stuck, use a little bit of randomness every time you make a decision. This technique, developed by the same Los Alamos team that came up with the Monte Carlo Method, is called the Metropolis Algorithm. The Metropolis Algorithm is like Hill Climbing, trying out different small-scale tweaks on a solution, but with one important difference: at any given point, it will potentially accept bad tweaks as well as good ones.
  • Whether it’s jitter, random restarts, or being open to occasional worsening, randomness is incredibly useful for avoiding local maxima. Chance is not just a viable way of dealing with tough optimization problems; in many cases, it’s essential.

Simulated Annealing

  • Growing a single crystal from a melt [is] done by careful annealing, first melting the substance, then lowering the temperature slowly, and spending a long time at temperatures in the vicinity of the freezing point. If this is not done, and the substance is allowed to get out of equilibrium, the resulting crystal will have many defects, or the substance may form a glass, with no crystalline order.
  • In physics, what we call “temperature” is really velocity—random motion at the molecular scale. This was directly analogous, physicist Scott Kirkpatrick reasoned, to the random jitter that can be added to a hill-climbing algorithm to make it sometimes backtrack from better solutions to worse ones. In fact, the Metropolis Algorithm itself had initially been designed to model random behavior in physical systems (in that case, nuclear explosions). So what would happen, Kirkpatrick wondered, if you treated an optimization problem like an annealing problem—if you “heated it up” and then slowly “cooled it off”?
  • Taking the ten-city vacation problem from above, we could start at a “high temperature” by picking our starting itinerary entirely at random, plucking one out of the whole space of possible solutions regardless of price. Then we can start to slowly “cool down” our search by rolling a die whenever we are considering a tweak to the city sequence. Taking a superior variation always makes sense, but we would only take inferior ones when the die shows, say, a 2 or more. After a while, we’d cool it further by only taking a higher-price change if the die shows a 3 or greater—then 4, then 5. Eventually we’d be mostly hill climbing, making the inferior move just occasionally when the die shows a 6. Finally we’d start going only uphill, and stop when we reached the next local max.
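The die-rolling schedule is, in essence, the Metropolis acceptance rule with a falling temperature. A compact sketch (random fares again as stand-ins; the temperature constants are illustrative, not from the book):

```python
import math, random

random.seed(1)
N = 10                                     # ten cities, indexed 0..9
fare = [[0] * N for _ in range(N)]         # hypothetical symmetric prices
for i in range(N):
    for j in range(i + 1, N):
        fare[i][j] = fare[j][i] = random.randint(100, 1500)

def cost(tour):
    return sum(fare[tour[i]][tour[(i + 1) % N]] for i in range(N))

def anneal(tour, temp=1000.0, cooling=0.999, floor=0.01):
    while temp > floor:
        i, j = random.sample(range(N), 2)
        candidate = tour[:]
        candidate[i], candidate[j] = candidate[j], candidate[i]
        delta = cost(candidate) - cost(tour)
        # Metropolis rule: always accept improvements; accept a worsening
        # move with a probability that shrinks as the temperature cools.
        if delta < 0 or random.random() < math.exp(-delta / temp):
            tour = candidate
        temp *= cooling
    return tour

start = list(range(N))                     # "high temperature": random start
print(cost(start), "->", cost(anneal(start)))
```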

Randomness, Evolution, and Creativity

  • Luria realized that if he bred several generations of different lineages of bacteria, then exposed the last generation to a virus, one of two radically different things would happen. If resistance was a response to the virus, he’d expect roughly the same amount of resistant bacteria to appear in every one of his bacterial cultures, regardless of their lineage. On the other hand, if resistance emerged from chance mutations, he’d expect to see something a lot more uneven—just like a slot machine’s payouts. That is, bacteria from most lineages would show no resistance at all; some lineages would have a single “grandchild” culture that had mutated to become resistant; and on rare occasions, if the proper mutation had happened several generations up the “family tree,” there would be a jackpot: all the “grandchildren” in the lineage would be resistant.
  • After several days of tense, restless waiting, Luria returned to the lab to check on his colonies. Jackpot.
  • Luria’s discovery was about the power of chance: about how random, haphazard mutations can produce viral resistance.
  • But it was also, at least in part, due to the power of chance. He was in the right place at the right time, where seeing the slot machine triggered a new idea. Tales of discovery often feature a similar moment: Newton’s (possibly apocryphal) apple, Archimedes’ bathtub “Eureka!,” the neglected petri dish that grew Penicillium mold. Indeed, it’s a common enough phenomenon that a word was invented to capture it: in 1754, Horace Walpole coined the term “serendipity,” based on the fairy tale adventures of The Three Princes of Serendip (Serendip being the archaic name of Sri Lanka), who “were always making discoveries, by accidents and sagacity, of things they were not in quest of.”
  • “New conceptions, emotions, and active tendencies which evolve are originally produced in the shape of random images, fancies, accidental out-births of spontaneous variation in the functional activity of the excessively unstable human brain, which the outer environment simply confirms or refutes, adopts or rejects, preserves or destroys—selects, in short, just as it selects morphological and social variations due to molecular accidents of an analogous sort.” —WILLIAM JAMES
  • “A blind-variation-and-selective-retention process is fundamental to all inductive achievements, to all genuine increases in knowledge, to all increases in fit of system to environment.” —DONALD T. CAMPBELL
  • When it comes to stimulating creativity, a common technique is introducing a random element, such as a word that people have to form associations with. For example, musician Brian Eno and artist Peter Schmidt created a deck of cards known as Oblique Strategies for solving creative problems. Pick a card, any card, and you will get a random new perspective on your project. (And if that sounds like too much work, you can now download an app that will pick a card for you.)
  • Wikipedia, for instance, offers a “Random article” link, and Tom has been using it as his browser’s default homepage for several years, seeing a randomly selected Wikipedia entry each time he opens a new window. While this hasn’t yet resulted in any striking discoveries, he now knows a lot about some obscure topics (such as the kind of knife used by the Chilean armed forces) and he feels that some of these have enriched his life.
  • Book-, wine-, and chocolate-of-the-month clubs are a way to get exposed to intellectual, oenophilic, and gustatory possibilities that you might never have encountered otherwise.
  • If the Dice Man had only had a deeper grasp of computer science, he’d have had some guidance. First, from Hill Climbing: even if you’re in the habit of sometimes acting on bad ideas, you should always act on good ones. Second, from the Metropolis Algorithm: your likelihood of following a bad idea should be inversely proportional to how bad an idea it is. Third, from Simulated Annealing: you should front-load randomness, rapidly cooling out of a totally random state, using ever less and less randomness as time goes on, lingering longest as you approach freezing. Temper yourself—literally.

10. Networking: How We Connect

  • The long-distance telegraph began with a portent—Samuel F. B. Morse, standing in the chambers of the US Supreme Court on May 24, 1844, wiring his assistant Alfred Vail in Baltimore a verse from the Old Testament: “WHAT HATH GOD WROUGHT.” The first thing we ask of any new connection is how it began, and from that origin we can’t help trying to augur its future.
  • The first telephone call in history, made by Alexander Graham Bell to his assistant on March 10, 1876, began with a bit of a paradox. “Mr. Watson, come here; I want to see you”—a simultaneous testament to its ability and inability to overcome physical distance.
  • The cell phone began with a boast—Motorola’s Martin Cooper walking down Sixth Avenue on April 3, 1973, as Manhattan pedestrians gawked, calling his rival Joel Engel at AT&T: “Joel, I’m calling you from a cellular phone. A real cellular phone: a handheld, portable, real cellular phone.” (“I don’t remember exactly what he said,” Cooper recalls, “but it was really quiet for a while. My assumption was that he was grinding his teeth.”)
  • And the text message began, on December 3, 1992, with cheer: Neil Papworth at Sema Group Telecoms wishing Vodafone’s Richard Jarvis an early “Merry Christmas.”
  • The beginnings of the Internet were, somehow fittingly, much humbler and more inauspicious than all of that. It was October 29, 1969, and Charley Kline at UCLA sent to Bill Duvall at the Stanford Research Institute the first message ever transmitted from one computer to another via the ARPANET. The message was “login”—or would have been, had the receiving machine not crashed after “lo.”
  • The foundation of human connection is protocol—a shared convention of procedures and expectations, from handshakes and hellos to etiquette, politesse, and the full gamut of social norms. Machine connection is no different. Protocol is how we get on the same page; in fact, the word is rooted in the Greek protokollon, “first glue,” which referred to the outer page attached to a book or manuscript.
  • In interpersonal affairs, these protocols prove a subtle but perennial source of anxiety. I sent so-and-so a message however many days ago; at what point do I begin to suspect they never received it? It’s now 12:05 p.m. and our call was set for noon; are we both expecting each other to be the one calling? Your answer seems odd; did I mishear you or did you mishear me? Come again?

Packet Switching

  • What we now think of as “the Internet” is actually a collection of many protocols, but the chief among them (so much so that it’s often referred to more or less synonymously with the Internet) is what’s known as Transmission Control Protocol, or TCP.
  • TCP initially used telephone lines, but it’s more appropriately regarded as the evolution of the mail rather than the phone. Phone calls use what’s called “circuit switching”: the system opens a channel between the sender and the receiver, which supplies constant bandwidth between the parties in both directions as long as the call lasts.
  • The telephone companies, for their part, did not seem especially amenable to talk of a fundamental shift in their protocols. Moving away from circuit switching was considered lunatic—“utter heresy,” in the words of networking researcher Van Jacobson. Kleinrock reminisces about his own encounters with the telecommunications industry:
I went to AT&T, the biggest network of the time, and I explained to them, you guys ought to give us good data communications. And their answer was, what are you talking about? The United States is a copper mine, it’s full of telephone wires, use that. I said no, no, you don’t understand. It takes 35 seconds to set up a call, you charge me a minimum of 3 minutes, and I want to send 100 milliseconds of data! And their answer was, “Little boy, go away.” So little boy went away and, with others, developed this technology which ate their lunch.
  • The technology that ate circuit switching’s lunch would become known as packet switching. In a packet-switched network, rather than using a dedicated channel for each connection, senders and receivers atomize their messages into tiny shards known as “packets,” and merge them into the communal flow of data—a bit like postcards moving at the speed of light.

Acknowledgment

  • In TCP, a failure generally leads to retransmission, so it’s considered enough for a session to begin with what’s called a “triple handshake.” The visitor says hello, the server acknowledges the hello and says hello back, the visitor acknowledges that, and if the server receives this third message, then no further confirmation is needed and they’re off to the races.
  • The way that ACKs work is both simple and clever. Behind the scenes of the triple handshake, each machine provides the other with a kind of serial number—and it’s understood that every packet sent after that will increment those serial numbers by one each time, like checks in a checkbook. For instance, if your computer initiates contact with a web server, it might send that server, say, the number 100. The ACK sent by the server will in turn specify the serial number at which the server’s own packets will begin (call it 5,000), and also will say “Ready for 101.” Your machine’s ACK will carry the number 101 and will convey in turn “Ready for 5,001.” (Note that these two numbering schemes are totally independent, and the number that begins each sequence is typically chosen at random.)
  • This mechanism offers a ready way to pinpoint when packets have gone astray. If the server is expecting 101 but instead gets 102, it will send an ACK to packet 102 that still says “Ready for 101.” If it next gets packet 103, it will say, again, “Ready for 101.” Three such redundant ACKs in a row would signal to your machine that 101 isn’t just delayed but hopelessly gone, so it will resend that packet. At that point, the server (which has kept packets 102 and 103) will send an ACK saying “Ready for 104” to signal that the sequence has been restored.
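A toy model of that bookkeeping (not real TCP code; the packet numbers follow the example above). The receiver always answers with the next serial number it needs, so three identical “Ready for 101” replies tell the sender that packet 101 is lost:

```python
def make_receiver(first_expected=101):
    expected = first_expected
    buffered = set()                      # out-of-order packets, held for later

    def on_packet(seq):
        nonlocal expected
        if seq == expected:
            expected += 1
            while expected in buffered:   # gap filled: sequence restored
                buffered.remove(expected)
                expected += 1
        else:
            buffered.add(seq)             # arrived early; keep it
        return f"Ready for {expected}"

    return on_packet

ack = make_receiver()
print(ack(102))   # Ready for 101   (first redundant ACK)
print(ack(103))   # Ready for 101   (second)
print(ack(104))   # Ready for 101   (third: the sender now resends packet 101)
print(ack(101))   # Ready for 105   (packets 102-104 were kept all along)
```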

Exponential Backoff: The Algorithm of Forgiveness

The world’s most difficult word to translate has been identified as “ilunga,” from the Tshiluba language spoken in south-eastern DR Congo.… Ilunga means “a person who is ready to forgive any abuse for the first time, to tolerate it a second time, but never a third time.” —BBC NEWS
  • The biggest hurdle that the ALOHAnet had to overcome was interference. Sometimes two stations would transmit at the same moment, inadvertently jamming one another’s signals. If both stations simply retransmitted right away to try to get their message across, they’d run the risk of getting stuck in perpetual interference.
  • The first thing that the senders need to do here is what’s called “breaking symmetry.” As any sidewalk pedestrian knows, dodging right as an oncoming walker dodges left, and then having both of you simultaneously dodge back the other way, doesn’t solve anything.
  • The breakthrough turned out to be increasing the average delay after every successive failure—specifically, doubling the potential delay before trying to transmit again. So after an initial failure, a sender would randomly retransmit either one or two turns later; after a second failure, it would try again anywhere from one to four turns later; a third failure in a row would mean waiting somewhere between one and eight turns, and so on. This elegant approach allows the network to accommodate potentially any number of competing signals. Since the maximum delay length (2, 4, 8, 16…) forms an exponential progression, it’s become known as Exponential Backoff.
  • Beyond just collision avoidance, Exponential Backoff has become the default way of handling almost all cases of networking failure or unreliability. For instance, when your computer is trying to reach a website that appears to be down, it uses Exponential Backoff—trying again one second later, again a few seconds after that, and so forth. This is good for everyone: it prevents a host server that’s down from getting slammed with requests as soon as it comes back online, and it prevents your own machine from wasting too much effort trying to get blood from a stone. But interestingly, it also does not force (or allow) your machine to ever completely give up.
  • Exponential Backoff is also a critical part of networking security, when successive password failures in logging into an account are punished by an exponentially increasing lockout period. This prevents a hacker from using a “dictionary attack” against an account, cycling through potential password after password until eventually they get lucky. At the same time it also solves another problem: the account’s real owner, no matter how forgetful, is never permanently locked out after some arbitrary cutoff.
  • In human society, we tend to adopt a policy of giving people some finite number of chances in a row, then giving up entirely. Three strikes, you’re out. This pattern prevails by default in almost any situation that requires forgiveness, lenience, or perseverance. Simply put, maybe we’re doing it wrong.
  • Applied to a friend who keeps flaking on plans: Exponential Backoff on the invitation rate. Try to reschedule in a week, then two, then four, then eight. The rate of “retransmission” goes toward zero—yet you never have to completely give up.
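A sketch of that retry rule in Python; the same doubling window serves a radio channel, a downed server, or the flaky friend above. The slot length and the retry cap are illustrative assumptions:

```python
import random, time

def send_with_backoff(attempt, max_failures=8, slot_seconds=0.01):
    """After the nth failure, wait a random 1..2**n slots, then retry."""
    for failures in range(1, max_failures + 1):
        if attempt():
            return True
        time.sleep(random.randint(1, 2 ** failures) * slot_seconds)
    return False

flaky_channel = lambda: random.random() < 0.3   # gets through 30% of the time
print(send_with_backoff(flaky_channel))
```

The randomized wait is also what “breaks symmetry”: two colliding senders will almost never choose the same delay twice in a row.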

Flow Control and Congestion Avoidance

  • One of the biggest differences between circuit switching and packet switching emerges in how they deal with congestion. In circuit switching, the system either approves a channel request, or denies it outright if the request cannot be accommodated. That’s why, if you’ve ever tried using a phone system during some peak time, you may have encountered the “special information tone” and message proclaiming that “all circuits are busy.”
  • Packet switching is radically different. The phone system gets full; the mail system gets slow.
  • At the heart of TCP congestion control is an algorithm called Additive Increase, Multiplicative Decrease, or AIMD. Before AIMD kicks in, a new connection will ramp up its transmission rate aggressively: if the first packet is received successfully it sends out two more, if both of those get through it sends out a batch of four, and so on. But as soon as any packet’s ACK does not come back to the sender, the AIMD algorithm takes over. Under AIMD, any fully received batch of packets causes the number of packets in flight not to double but merely to increase by 1, and dropped packets cause the transmission rate to cut back by half (hence the name Additive Increase, Multiplicative Decrease). Essentially, AIMD takes the form of someone saying, “A little more, a little more, a little more, whoa, too much, cut way back, okay a little more, a little more…”
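The resulting “sawtooth” is easy to see in a trace. A sketch of the rules as described (window in packets; True means the batch was fully acknowledged, False means a drop):

```python
def window_trace(acked_batches, window=1.0):
    trace, slow_start = [], True
    for ok in acked_batches:
        if ok:
            # aggressive doubling at first, then additive increase (+1)
            window = window * 2 if slow_start else window + 1
        else:
            # multiplicative decrease: any drop halves the window
            window, slow_start = max(1.0, window / 2), False
        trace.append(window)
    return trace

print(window_trace([True] * 4 + [False] + [True] * 4 + [False] + [True] * 3))
# -> 2, 4, 8, 16, then 8, 9, 10, 11, 12, then 6, 7, 8, 9: the sawtooth
```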
  • The satirical “Peter Principle,” articulated in the 1960s by education professor Laurence J. Peter, states that “every employee tends to rise to his level of incompetence.” The idea is that in a hierarchical organization, anyone doing a job proficiently will be rewarded with a promotion into a new job that may involve more complex and/or different challenges. When the employee finally reaches a role in which they don’t perform well, their march up the ranks will stall, and they will remain in that role for the rest of their career. Thus it stands to reason, goes the ominous logic of the Peter Principle, that eventually every spot in an organization will come to be filled by someone doing that job badly.
  • Some organizations have attempted to remediate the Peter Principle by simply firing employees who don’t advance. The so-called Cravath System, devised by leading law firm Cravath, Swaine & Moore, involves hiring almost exclusively recent graduates, placing them into the bottom ranks, and then routinely either promoting or firing them over the following years.
  • Is there any alternative, any middle path between the institutional stagnation of the Peter Principle and the draconian severity of the “up or out” system? The AIMD algorithm can offer just such an approach, since it is explicitly designed to handle the demands of a volatile environment. A computer network must manage its own maximum transmission capacity, plus the transmission rates of its clients, all of which may be fluctuating unpredictably. Likewise, in a business setting, a company has a limited pool of funds to pay for its operations, and each worker or vendor has a limited capacity for the amount of work they can do and the amount of responsibility they can handle. Everyone’s needs, capacities, and partnerships are always in flux.
  • The lesson of the TCP sawtooth is that in an unpredictable and changing environment, pushing things to the point of failure is indeed sometimes the best (or the only) way to use all the resources to their fullest. What matters is making sure that the response to failure is both sharp and resilient. Under AIMD, every connection that isn’t dropping the ball is accelerated until it is—and then it’s cut in half, and immediately begins accelerating again. And though it would violate almost every norm of current corporate culture, one can imagine a corporation in which, annually, every employee is always either promoted a single step up the org chart or sent part of the way back down.

Backchannels: Flow Control in Linguistics

  • In TCP, as we’ve seen, there’s no such thing as a one-way transmission: without consistent feedback, the sender will slow down almost immediately.
  • Narrators who told close-call stories to distracted listeners … told them less well overall and particularly poorly at what should have been the dramatic conclusion. Their story endings were abrupt or choppy, or they circled around and retold the ending more than once, and they often justified their story by explaining the obvious close call.
  • In 2014, for instance, UC Santa Cruz’s Jackson Tolins and Jean Fox Tree demonstrated that those inconspicuous “uh-huhs” and “yeahs” and “hmms” and “ohs” that pepper our speech perform distinct, precise roles in regulating the flow of information from speaker to listener—both its rate and level of detail.

Bufferbloat: It’s the Latency, Stupid

  • A buffer is essentially a queue whose function is to smooth out bursts. If you walked into a doughnut shop at roughly the same time as another customer, it wouldn’t do for the momentarily overwhelmed cashier to make one of you leave the store and come back another time. Customers wouldn’t have it, of course, but neither would management: such a policy is virtually guaranteed to underutilize the cashier. Putting the customers in a queue instead ensures that the average throughput of the store approaches its maximum throughput. That’s a good thing.
  • This superior resource utilization comes with a very real cost, however: delay.
  • This is precisely the phenomenon that Jim Gettys was observing in his home cable modem. Because he was uploading a file, his computer was sending the modem as many upstream packets as it could handle. And the modem was pretending to handle a lot more than it actually could, turning none away while building up a massive queue. So when Gettys tried to download something at the same time—to visit a webpage or check email—his ACK packets would get stuck behind the upload, having to wait in line at the modem to leave the house.
  • It was like trying to have a conversation where every time you say “uh-huh” it is delayed by ten or twenty seconds.
  • When a networking buffer fills up, what typically happens is called Tail Drop: an unceremonious way of saying that every packet arriving after that point is simply rejected, and effectively deleted. (Turning new customers away from the crêpe stand once the line gets too long would be a version of Tail Drop in a human context.) Yet it’s precisely such “packet drops” that lead a computer to notice that one of its packets hasn’t been acknowledged, prompting AIMD to start halving the bandwidth.
  • Dropped packets are the Internet’s primary feedback mechanism. A buffer that’s too large—a restaurant taking every order no matter how short-staffed the kitchen, a modem taking every packet that comes in regardless of how long it’ll take to send them on—prevents this moderation from happening as it should.
  • Fundamentally, buffers use delay—known in networking as “latency”—in order to maximize throughput.
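A toy queue makes the tradeoff visible: enlarge the buffer and drops vanish, but every packet now waits. The arrival pattern and sizes below are illustrative, not from the book:

```python
from collections import deque

def run_queue(buffer_size, arrivals, served_per_tick=1):
    queue, delays, drops = deque(), [], 0
    for tick, burst in enumerate(arrivals):
        for _ in range(burst):
            if len(queue) < buffer_size:
                queue.append(tick)        # remember each packet's arrival time
            else:
                drops += 1                # Tail Drop
        for _ in range(served_per_tick):
            if queue:
                delays.append(tick - queue.popleft())
    return sum(delays) / len(delays), drops

bursts = [12, 0, 0, 0, 0] * 40            # senders offering 2.4x the capacity
for size in (3, 30, 300):
    delay, drops = run_queue(size, bursts)
    print(f"buffer={size:>3}: avg delay {delay:6.1f} ticks, {drops} drops")
```

The big buffer wins on throughput and loses badly on latency, which is bufferbloat in miniature.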

Better Never than Late

Take your most basic problem as a single person … someone likes you, you don’t like them back. At one point, that used to be kind of an awkward situation. You had to have a conversation, it was weird. Now what do you do? Someone likes you, you don’t like them back? You just pretend to be busy … forever. —AZIZ ANSARI
Now is better than never. Although never is often better than right now. —THE ZEN OF PYTHON
  • Singer Katy Perry has 107% more Twitter followers than her home state of California has people. The most-followed person on Twitter as of early 2016, she counts some 81.2 million accounts among her fans. This means that even if 99% of her fans never message her at all—and even if the most devoted 1% who do message her do so only once per year—she still gets 2,225 messages a day. Every single day.
  • Imagine if Perry were committed to answering each fan message in the order received. If she could reply to 100 a day, then the fans’ expected wait time for a response would soon be measured in decades. It’s fair to imagine that most fans would prefer a slim chance of getting a reply right away to a guaranteed reply ten or twenty years hence.
  • Note that Perry doesn’t have this problem when she leaves a venue and must run a gauntlet of fans expecting an autograph or a few words. Perry does what she can, moves on, and the lost opportunities dissipate. The body is its own flow control. We can’t be in more than one place at one time. At a crowded party we inevitably participate in less than 5% of the conversation, and cannot read up or catch up on the remainder.
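The arithmetic behind that “decades” estimate, using the stated figures (81.2 million followers, 1% writing once a year, 100 replies a day):

```python
followers = 81_200_000
arriving = followers * 0.01 / 365            # the 1% who write, once a year
print(round(arriving))                       # ~2,225 messages per day

replies_per_day = 100
backlog_after_a_year = (arriving - replies_per_day) * 365
wait_years = backlog_after_a_year / replies_per_day / 365
print(f"a fan writing after year one waits ~{wait_years:.0f} years")  # ~21
```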
  • We use the idiom of “dropped balls” almost exclusively in a derogatory sense, implying that the person in question was lazy, complacent, or forgetful. But the tactical dropping of balls is a critical part of getting things done under overload.
  • The most prevalent critique of modern communications is that we are “always connected.” But the problem isn’t that we’re always connected; we’re not. The problem is that we’re always buffered. The difference is enormous.
  • The feeling that one needs to look at everything on the Internet, or read all possible books, or see all possible shows, is bufferbloat. You miss an episode of your favorite series and watch it an hour, a day, a decade later. You go on vacation and come home to a mountain of correspondence. It used to be that people knocked on your door, got no response, and went away. Now they’re effectively waiting in line when you come home.
  • In other words, we asked for a system that would never turn a sender away, and for better or worse we got one. Indeed, over the past fifteen years, the move from circuit switching to packet switching has played itself out across society. We used to request dedicated circuits with others; now we send them packets and wait expectantly for ACKs. We used to reject; now we defer.
  • The much-lamented “lack of idleness” one reads about is, perversely, the primary feature of buffers: to bring average throughput up to peak throughput. Preventing idleness is what they do. You check email from the road, from vacation, on the toilet, in the middle of the night. You are never, ever bored. This is the mixed blessing of buffers, operating as advertised.
  • But there’s a lot to look forward to about a post-bufferbloat future. With their inherent latency, buffers are bad for most interactive processes. When we speak via Skype, for example, we generally prefer an occasionally staticky signal now to a clear recording of what our caller said three seconds ago. For gamers, even a 50-millisecond lag could be the difference between fragging and being fragged.

11. Game Theory: The Minds of Others

I’m an optimist in the sense that I believe humans are noble and honorable, and some of them are really smart.… I have a somewhat more pessimistic view of people in groups. —STEVE JOBS

Recursion

Successful investing is anticipating the anticipations of others. —JOHN MAYNARD KEYNES
  • Computer science illustrates the fundamental limitations of this kind of reasoning with what’s called the “halting problem.” As Alan Turing proved in 1936, a computer program can never tell you for sure whether another program might end up calculating forever—except by simulating the operation of that program and thus potentially going off the deep end itself.
  • Simply put, any time a system—be it a machine or a mind—simulates the workings of something as complex as itself, it finds its resources totally maxed out, more or less by definition.
  • “In poker, you never play your hand,” James Bond says in Casino Royale; “you play the man across from you.” In fact, what you really play is a theoretically infinite recursion. There’s your own hand and the hand you believe your opponent to have; then the hand you believe your opponent believes you have, and the hand you believe your opponent believes you to believe he has … and on it goes. “I don’t know if this is an actual game-theory term,” says the world’s top-rated poker player, Dan Smith, “but poker players call it ‘leveling.’ Level one is ‘I know.’ Two is ‘you know that I know.’ Three, ‘I know that you know that I know.’ There are situations where it just comes up where you are like, ‘Wow, this is a really silly spot to bluff but if he knows that it is a silly spot to bluff then he won’t call me and that’s where it’s the clever spot to bluff.’ Those things happen.”
  • “There’s a rule that you really only want to play one level above your opponent,” explains poker professional Vanessa Rousso. “If you play too far above your opponent, you’re going to think they have information that they don’t actually have—[and] they won’t be able to glean the information that you want them to glean from your actions.”

Reaching Equilibrium

  • Game theory covers an incredibly broad spectrum of scenarios of cooperation and competition, but the field began with those resembling heads-up poker: two-person contests where one player’s gain is another player’s loss. Mathematicians analyzing these games seek to identify a so-called equilibrium: that is, a set of strategies that both players can follow such that neither player would want to change their own play, given the play of their opponent. It’s called an equilibrium because it’s stable—no amount of further reflection by either player will bring them to different choices.
  • In rock-paper-scissors, for example, the equilibrium tells us, perhaps unexcitingly, to choose one of the eponymous hand gestures completely at random, each roughly a third of the time. What makes this equilibrium stable is that, once both players adopt this 1⁄3 - 1⁄3 - 1⁄3 strategy, there is nothing better for either to do than stick with it. (If we tried playing, say, more rock, our opponent would quickly notice and start playing more paper, which would make us play more scissors, and so forth until we both settled into the 1⁄3 - 1⁄3 - 1⁄3 equilibrium again.)
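A quick numerical check of that equilibrium: against the uniform mix, every pure strategy earns exactly zero in expectation, so no deviation is profitable:

```python
from itertools import product

MOVES = ("rock", "paper", "scissors")
BEATS = {("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")}

def payoff(a, b):                   # +1 win, -1 loss, 0 tie
    return 0 if a == b else (1 if (a, b) in BEATS else -1)

def expected(p, q):                 # p, q: probabilities over MOVES
    return sum(pa * qb * payoff(a, b)
               for (a, pa), (b, qb) in product(zip(MOVES, p), zip(MOVES, q)))

uniform = (1/3, 1/3, 1/3)
for i, move in enumerate(MOVES):
    pure = tuple(float(i == j) for j in range(3))
    print(move, expected(pure, uniform))    # 0.0 for all three moves
```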
  • John Nash proved in 1951 that every two-player game has at least one equilibrium.
  • The object of study in mathematics is truth; the object of study in computer science is complexity.
  • In a game-theory context, knowing that an equilibrium exists doesn’t actually tell us what it is—or how to get there.
  • By the end of the twentieth century, determining whether a game has more than one equilibrium, or an equilibrium that gives a player a certain payoff, or an equilibrium that involves taking a particular action, had all been proved to be intractable problems. Then, from 2005 to 2008, Papadimitriou and his colleagues proved that simply finding Nash equilibria is intractable as well.

Dominant Strategies, for Better or Worse

  • Even when we can reach an equilibrium, just because it’s stable doesn’t make it good. It may seem paradoxical, but the equilibrium strategy—where neither player is willing to change tack—is by no means necessarily the strategy that leads to the best outcomes for the players.

Prisoner's Dilemma

  • The prisoner’s dilemma works as follows. Imagine that you and a co-conspirator have been arrested after robbing a bank, and are being held in separate jail cells. Now you must decide whether to “cooperate” with each other—by remaining silent and admitting nothing—or to “defect” from your partnership by ratting out the other to the police. You know that if you both cooperate with each other and keep silent, the state doesn’t have enough evidence to convict either one of you, so you’ll both walk free, splitting the loot—half a million dollars each, let’s say. If one of you defects and informs on the other, and the other says nothing, the informer goes free and gets the entire million dollars, while the silent one is convicted as the sole perpetrator of the crime and receives a ten-year sentence. If you both inform on each other, then you’ll share the blame and split the sentence: five years each.
  • Here’s the problem. No matter what your accomplice does, it’s always better for you to defect.
  • If your accomplice has ratted you out, ratting them out in turn will give you five years of your life back—you’ll get the shared sentence (five years) rather than serving the whole thing yourself (ten years). And if your accomplice has stayed quiet, turning them in will net you the full million dollars—you won’t have to split it. No matter what, you’re always better off defecting than cooperating, regardless of what your accomplice decides. To do otherwise will always make you worse off, no matter what.
  • In fact, this makes defection not merely the equilibrium strategy but what’s known as a dominant strategy. A dominant strategy avoids recursion altogether, by being the best response to all of your opponent’s possible strategies—so you don’t even need to trouble yourself getting inside their head at all. A dominant strategy is a powerful thing.
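The dominance is easy to verify mechanically. The payoffs below put money and jail time on one crude common scale (an assumption for illustration; any numbers with the same ordering give the same answer):

```python
# (my move, their move) -> my payoff; only the ordering matters, not the units.
PAYOFF = {
    ("cooperate", "cooperate"): 500_000,     # split the loot, walk free
    ("cooperate", "defect"): -10,            # sole conviction: ten years
    ("defect", "cooperate"): 1_000_000,      # whole loot, walk free
    ("defect", "defect"): -5,                # shared sentence: five years
}

for theirs in ("cooperate", "defect"):
    best = max(("cooperate", "defect"), key=lambda mine: PAYOFF[(mine, theirs)])
    print(f"if they {theirs}, my best reply is to {best}")
# Both lines print "defect": it's the best response to either choice.
```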
  • But now we’ve arrived at the paradox. If everyone does the rational thing and follows the dominant strategy, the story ends with both of you serving five years of hard time—which, compared to freedom and a cool half million apiece, is dramatically worse for everyone involved. How could that have happened?
  • This has emerged as one of the major insights of traditional game theory: the equilibrium for a set of players, all acting rationally in their own interest, may not be the outcome that is actually best for those players.
  • Computer scientists quantify this with a measure called “the price of anarchy.” The price of anarchy measures the gap between cooperation (a centrally designed or coordinated solution) and competition (where each participant is independently trying to maximize the outcome for themselves). In a game like the prisoner’s dilemma, this price is effectively infinite: increasing the amount of cash at stake and lengthening the jail sentences can make the gap between possible outcomes arbitrarily wide, even as the dominant strategy stays the same.
  • For instance, consider traffic. Drivers just want to take the fastest route, whatever it is, and routers just want to shuffle along their packets with minimal effort—but in both cases this can result in overcrowding along critical pathways, creating congestion that harms everyone.
  • Tim Roughgarden and Cornell’s Éva Tardos proved in 2002 that the “selfish routing” approach has a price of anarchy that’s a mere 4/3. That is, a free-for-all is only 33% worse than perfect top-down coordination.
  • The good news is that the lack of centralized coordination is making your commute at most only 33% worse. On the other hand, if you’re hoping that networked, self-driving autonomous cars will bring us a future of traffic utopia, it may be disheartening to learn that today’s selfish, uncoordinated drivers are already pretty close to optimal.
  • It’s true that self-driving cars should reduce the number of road accidents and may be able to drive more closely together, both of which would speed up traffic. But from a congestion standpoint, the fact that anarchy is only 4/3 as congested as perfect coordination means that perfectly coordinated commutes will only be 3/4 as congested as they are now.
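The standard illustration behind that 4/3 figure (not spelled out in the book) is Pigou’s two-road network: one road always takes an hour, the other takes x hours when a fraction x of the drivers use it:

```python
def average_commute(x):          # x = fraction of drivers on the variable road
    return (1 - x) * 1.0 + x * x

selfish = average_commute(1.0)   # the variable road never looks worse: all take it
optimal = min(average_commute(k / 1000) for k in range(1001))
print(selfish, optimal, selfish / optimal)   # 1.0, 0.75, 1.333... = 4/3
```

Self-interested drivers all crowd onto the variable road for an hour-long average commute; a planner splitting traffic evenly gets it down to forty-five minutes.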
  • It’s a bit like the famous line by James Branch Cabell: “The optimist proclaims that we live in the best of all possible worlds; and the pessimist fears this is true.” Congestion will always be a problem solvable more by planners and by overall demand than by the decisions of individual drivers, human or computer, selfish or cooperative.

The Tragedy of the Commons

  • A recent study showed that the average worker takes only half of the vacation days granted them, and a stunning 15% take no vacation at all.
  • At the present moment, the Bay Area (where the two of us live) is attempting to remedy this sorry state of affairs by going through a radical paradigm shift when it comes to vacation policy—a shift that is very well meaning and completely, apocalyptically doomed. The premise sounds innocent enough: instead of metering out some fixed arbitrary number of days for each employee, then wasting HR man-hours making sure no one goes over their limit, why not just let your employees free? Why not simply allow them unlimited vacation? Anecdotal reports thus far are mixed—but from a game-theoretic perspective, this approach is a nightmare. All employees want, in theory, to take as much vacation as possible. But they also all want to take just slightly less vacation than each other, to be perceived as more loyal, more committed, and more dedicated (hence more promotion-worthy). Everyone looks to the others for a baseline, and will take just slightly less than that. The Nash equilibrium of this game is zero. As the CEO of software company Travis CI, Mathias Meyer, writes, “People will hesitate to take a vacation as they don’t want to seem like that person who’s taking the most vacation days. It’s a race to the bottom.”
  • This is the tragedy of the commons in full effect. And it’s just as bad between firms as within them. Imagine two shopkeepers in a small town. Each of them can choose either to stay open seven days a week or to be open only six days a week, taking Sunday off to relax with their friends and family. If both of them take a day off, they’ll retain their existing market share and experience less stress. However, if one shopkeeper decides to open his shop seven days a week, he’ll draw extra customers—taking them away from his competitor and threatening his livelihood. The Nash equilibrium, again, is for everyone to work all the time.

Mechanism Design: Change the Game

  • If the rules of the game force a bad strategy, maybe we shouldn’t try to change strategies. Maybe we should try to change the game.
  • While game theory asks what behavior will emerge given a set of rules, mechanism design (sometimes called “reverse game theory”) works in the other direction, asking: what rules will give us the behavior we want to see?
  • Let’s return you and your bank-robbing co-conspirator to the jail cell for another go at the prisoner’s dilemma, with one crucial addition: the Godfather. Now you and your fellow thief are members of a crime syndicate, and the don has made it, shall we say, all too clear that any informants will sleep with the fishes. This alteration of the game’s payoffs has the effect of limiting the actions you can take, yet ironically makes it far more likely that things will end well, both for you and your partner. Since defection is now less attractive (to put it mildly), both prisoners are induced to cooperate, and both will confidently walk away half a million dollars richer. Minus, of course, a nominal tithe to the don.
  • The counterintuitive and powerful thing here is we can worsen every outcome—death on the one hand, taxes on the other—yet make everyone’s lives better by shifting the equilibrium.
  • The CEO of the software firm Evernote, Phil Libin, made headlines with a policy of offering Evernote employees a thousand dollars cash for taking a vacation. This sounds like a reasonable approach to getting more employees to take vacation, but from a game-theoretic perspective it’s actually misguided. Increasing the cash on the table in the prisoner’s dilemma, for instance, misses the point: the change doesn’t do anything to alter the bad equilibrium. If a million-dollar heist ends up with both thieves in jail, so does a ten-million-dollar heist. The problem isn’t that vacations aren’t attractive; the problem is that everyone wants to take slightly less vacation than their peers, producing a game whose only equilibrium is no vacation at all. A thousand bucks sweetens the deal but doesn’t change the principle of the game—which is to take as much vacation as possible while still being perceived as slightly more loyal than the next guy or gal, therefore getting a raise or promotion over them that’s worth many thousands of dollars.

Mechanism Design by Evolution

  • If cooperation really does lead to better outcomes in certain games, then we’d expect that cooperatively minded species would prevail evolutionarily. But then where would the cooperation come from if it’s only rational at the group level, not the individual level? Maybe it would have to come from something that individuals can’t entirely control. Something, for instance, like emotions.
  • Consider two seemingly unrelated scenarios: (1) A man buys a vacuum cleaner, it breaks within a few weeks, and he spends ten minutes online leaving a vindictive review. (2) A woman shopping at a convenience store notices someone steal an elderly man’s wallet and bolt for the door; she tackles the thief and wrestles the wallet free.
  • Though the latter protagonist seems clearly heroic, and the former merely angry, what these vignettes have in common—albeit in very different ways—is involuntary selflessness. The unhappy consumer isn’t trying to get the vacuum cleaner replaced or his money back; he’s after a highly indirect kind of retribution, from which—in a rational, game-theoretic sense—he stands to gain little other than the spiteful satisfaction of writing the review itself. In the convenience store, the heroic woman metes out vigilante justice at enormous personal cost; she risks injury or even death to return, say, $40 to a man who is a total stranger to her. Even if she wanted to help, she could have simply taken two twenties out of her own pocket and given them to him without risking a trip to the ER! In this sense, both protagonists are acting irrationally. On the other hand, their actions are good for their society: we all want to live in a world in which pickpocketing doesn’t pay and in which businesses that sell poor-quality products get a bad reputation.
  • Perhaps each of us, individually, would be better off being the kind of person who can always make a detached, calculated decision in their own best interest, not willing to lose time fuming over a sunk cost, let alone lose a tooth over $40. But all of us are better off living in a society in which such defiant stands are common.
  • So what has acted up in these people, in the absence of an external authority, to make them buck the selfish equilibrium? Anger, for one thing. Whether prompted by a shoddy business or a petty thief, outrage can override rationality. And in these instances, it may be that the hand of evolution has done what it would otherwise have taken an authority outside the game to accomplish.
  • Emotion is mechanism design in the species. Precisely because feelings are involuntary, they enable contracts that need no outside enforcement. Revenge almost never works out in favor of the one who seeks it, and yet someone who will respond with “irrational” vehemence to being taken advantage of is for that very reason more likely to get a fair deal.
If people expect us to respond irrationally to the theft of our property, we will seldom need to, because it will not be in their interests to steal it. Being predisposed to respond irrationally serves much better here than being guided only by material self-interest.
  • In both love and housing, though, we continue to encounter more options even after our optimal-stopping decision is made—so why not be ready to jump ship? Of course, knowing that the other party (be it spouse or landlord) is in turn prepared to jump ship would prevent many of the long-term investments (having children together, or laboriously moving in one’s belongings) that make those agreements worthwhile.
The worry that people will leave relationships because it may later become rational for them to do so is largely erased if it is not rational assessment that binds them in the first place.
  • Love is like organized crime. It changes the structure of the marriage game so that the equilibrium becomes the outcome that works best for everybody.
  • Playwright George Bernard Shaw once wrote of marriage that “If the prisoner is happy, why lock him in? If he is not, why pretend that he is?” Game theory offers a subtle answer to this particular riddle. Happiness is the lock.

Information Cascades: The Tragic Rationality of Bubbles

Whenever you find yourself on the side of the majority, it is time to pause and reflect. —MARK TWAIN
  • Learning from others doesn’t always seem particularly rational. Fads and fashions are the result of following others’ behavior without being anchored to any underlying objective truth about the world.
  • An interesting aspect of the 2007–2009 mortgage crisis is that everybody involved seemed to feel like they were unfairly punished for simply doing what they were supposed to. A generation of Americans who grew up believing that houses were fail-safe investments, and who saw everyone around them buying houses despite (or because of) rapidly rising prices, were badly burned when those prices finally started to tumble. Bankers, meanwhile, felt they were unfairly blamed for doing what they had always done—offering opportunities, which their clients could accept or decline.
  • In the wake of an abrupt market collapse, the temptation is always to assign blame. Here game theory offers a sobering perspective: catastrophes like this can happen even when no one’s at fault.
  • Properly appreciating the mechanics of financial bubbles begins with understanding auctions.
  • One of the simplest auction formats has each participant write down their bid in secret, and the one whose bid is highest wins the item for whatever price they wrote down. This is known as a “sealed-bid first-price auction,” and from an algorithmic game theory perspective there’s a big problem with it—actually, several. For one thing, there’s a sense in which the winner always overpays: if you value an item at $25 and I value it at $10, and we both bid our true valuations ($25 and $10), then you end up buying it for $25 when you could have had it for just a hair over $10. This problem, in turn, leads to another one, which is that in order to bid properly—that is, in order not to overpay—you need to predict the true valuation of the other players in the auction and “shade” your bid accordingly. That’s bad enough—but the other players aren’t going to bid their true valuations either, because they’re shading their bids based on their prediction of yours! We are back in the land of recursion.
  • Another classic auction format, the “Dutch auction” or “descending auction,” gradually lowers an item’s price until someone is willing to buy it. A store marking down its unsold items, and landlords listing apartments at the highest price they think the market will bear, both share its basic quality: the seller is likely to begin optimistically and nudge the price down until a buyer is found. The descending auction resembles the first-price auction in that you’re more likely to win by paying near the top of your range (i.e., you’ll be poised to bid as the price falls to $25), and therefore will want to shade your offer by some complexly strategic amount. Do you buy at $25, or stay your hand and try to wait for a lower price? Every dollar you save risks losing out altogether.
  • The inverse of a Dutch or descending auction is what’s known as an “English auction” or “ascending auction”—the most familiar auction format. In an English auction, bidders alternate raising the price until all but one of them drop out. This seems to offer something closer to what we want: here, if you value an item at $25 and I value it at $10, you’ll win it for just over $10 without either having to go all the way to $25 or disappearing down the strategic rabbit hole.
  • In such a situation, it seems natural to look closely at your opponents’ bids, to augment your own meager private information with the public information. But this public information might not be nearly as informative as it seems. You don’t actually get to know the other bidders’ beliefs—only their actions. And it is entirely possible that their behavior is based on your own, just as your behavior is being influenced by theirs.
  • Imagine there are ten companies that might bid on the rights for a given tract. One of them has a geological survey suggesting the tract is rich with oil; another’s survey is inconclusive; the reconnaissance of the other eight suggests it’s barren. But being competitors, of course, the companies do not share their survey results with each other, and instead can only watch each other’s actions. When the auction begins, the first company, with the promising report, makes a high initial bid. The second company, encouraged by this bid to take an optimistic view of their own ambiguous survey, bids even higher. The third company has a weak survey but now doesn’t trust it in light of what they take to be two independent surveys that suggest it’s a gold mine, so they make a new high bid. The fourth company, which also has a lackluster survey, is now even more strongly inclined to disregard it, as it seems like three of their competitors all think it’s a winner. So they bid too. The “consensus” unglues from reality. A cascade has formed. No single bidder has acted irrationally, yet the net result is catastrophe.
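A bare-bones herding sketch in the spirit of the oil-lease story (a simplified standard cascade model; the threshold of a two-action lead is an assumption). Each bidder sees everyone’s earlier actions but only their own survey, and once the crowd’s lead reaches two, it rationally outweighs any single survey:

```python
def cascade(signals):
    """signals[i]: bidder i's private survey (1 = 'oil', 0 = 'barren')."""
    actions = []
    for signal in signals:
        lead = 2 * sum(actions) - len(actions)   # ones minus zeros so far
        if lead >= 2:
            actions.append(1)       # crowd's lead outweighs one survey
        elif lead <= -2:
            actions.append(0)
        else:
            actions.append(signal)  # crowd inconclusive: trust your survey
    return actions

# Ground truth: barren. Eight of ten surveys say so, but the two
# optimistic surveys happen to come first:
print(cascade([1, 1, 0, 0, 0, 0, 0, 0, 0, 0]))
# -> [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]: everyone rationally joins the error
```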
  • Be wary of cases where public information seems to exceed private information, where you know more about what people are doing than why they’re doing it, where you’re more concerned with your judgments fitting the consensus than fitting the facts. When you’re mostly looking to others to set a course, they may well be looking right back at you to do the same.
  • Second, remember that actions are not beliefs; cascades get caused in part when we misinterpret what others think based on what they do. We should be especially hesitant to overrule our own doubts—and if we do, we might want to find some way to broadcast those doubts even as we move forward, lest others fail to distinguish the reluctance in our minds from the implied enthusiasm in our actions.
  • Last, we should remember from the prisoner’s dilemma that sometimes a game can have irredeemably lousy rules. There may be nothing we can do once we’re in it, but the theory of information cascades may help us to avoid such a game in the first place.
  • If you’re the kind of person who always does what you think is right, no matter how crazy others think it is, take heart. The bad news is that you will be wrong more often than the herd followers. The good news is that sticking to your convictions creates a positive externality, letting people make accurate inferences from your behavior. There may come a time when you will save the entire herd from disaster.

To Thine Own Self Compute

  • Named for Nobel Prize–winning economist William Vickrey, the Vickrey auction, just like the first-price auction, is a “sealed bid” auction process. That is, every participant simply writes down a single number in secret, and the highest bidder wins. However, in a Vickrey auction, the winner ends up paying not the amount of their own bid, but that of the second-place bidder. That is to say, if you bid $25 and I bid $10, you win the item at my price: you only have to pay $10.
  • To a game theorist, a Vickrey auction has a number of attractive properties. And to an algorithmic game theorist in particular, one property especially stands out: the participants are incentivized to be honest. In fact, there is no better strategy than just bidding your “true value” for the item—exactly what you think the item is worth. Bidding any more than your true value is obviously silly, as you might end up stuck buying something for more than you think it’s worth. And bidding any less than your true value (i.e., shading your bid) risks losing the auction for no good reason, since it doesn’t save you any money—because if you win, you’ll only be paying the value of the second-highest bid, regardless of how high your own was. This makes the Vickrey auction what mechanism designers call “strategy-proof,” or just “truthful.” In the Vickrey auction, honesty is literally the best policy.
  • Even better, honesty remains the best policy regardless of whether the other bidders are honest themselves. In the prisoner’s dilemma, we saw how defection turned out to be the “dominant” strategy—the best move no matter whether your partner defected or cooperated. In a Vickrey auction, on the other hand, honesty is the dominant strategy. This is the mechanism designer’s holy grail. You do not need to strategize or recurse.
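A sketch of the mechanism, plus a check that shading your bid is pointless: when you win, your own bid never sets your price.

```python
def vickrey(bids):
    """Sealed-bid second-price auction: highest bid wins, pays second-highest."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[0][0], ranked[1][1]        # (winner, price paid)

print(vickrey({"you": 25, "me": 10}))        # ('you', 10): my bid sets your price

# Bid a shaded 11, your true value 25, or an inflated 40: when you win,
# the outcome is identical, so you may as well just tell the truth.
for your_bid in (11, 25, 40):
    print(your_bid, vickrey({"you": your_bid, "me": 10}))
```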
  • In fact, the lesson here goes far beyond auctions. In a landmark finding called the “revelation principle,” Nobel laureate Roger Myerson proved that any game that requires strategically masking the truth can be transformed into a game that requires nothing but simple honesty. Paul Milgrom, Myerson’s colleague at the time, reflects: “It’s one of those results that as you look at it from different sides, on the one side, it’s just absolutely shocking and amazing, and on the other side, it’s trivial. And that’s totally wonderful, it’s so awesome: that’s how you know you’re looking at one of the best things you can see.”
  • The revelation principle may seem hard to accept on its face, but its proof is actually quite intuitive. Imagine that you have an agent or a lawyer who will be playing the game for you. If you trust them to represent your interests, you’re going to simply tell them exactly what you want, and let them handle all of the strategic bid-shading and the recursive strategizing on your behalf. In the Vickrey auction, the game itself performs this function. And the revelation principle just expands this idea: any game that can be played for you by agents to whom you’ll tell the truth, it says, will become an honesty-is-best game if the behavior you want from your agent is incorporated into the rules of the game itself.
When we think about ourselves, when we try to know ourselves … we use the knowledge of us which other people already have. We judge ourselves with the means other people have and have given us for judging ourselves. Into whatever I say about myself someone else’s judgment always enters. Into whatever I feel within myself someone else’s judgment enters.… But that does not at all mean that one cannot have relations with other people. It simply brings out the capital importance of all other people for each one of us.

Conclusion: Computational Kindness

  • First, there are cases where computer scientists and mathematicians have identified good algorithmic approaches that can simply be transferred over to human problems. The 37% Rule, the Least Recently Used criterion for handling overflowing caches, and the Upper Confidence Bound as a guide to exploration are all examples of this.
  • Second, knowing that you are using an optimal algorithm should be a relief even if you don’t get the results you were looking for. The 37% Rule fails 63% of the time. Maintaining your cache with LRU doesn’t guarantee that you will always find what you’re looking for; in fact, neither would clairvoyance. Using the Upper Confidence Bound approach to the explore/exploit tradeoff doesn’t mean that you will have no regrets, just that those regrets will accumulate ever more slowly as you go through life. Even the best strategy sometimes yields bad results—which is why computer scientists take care to distinguish between “process” and “outcome.” If you followed the best possible process, then you’ve done all you can, and you shouldn’t blame yourself if things didn’t go your way. (A quick simulation after this list illustrates the 37% figure.)
  • Outcomes make news headlines—indeed, they make the world we live in—so it’s easy to become fixated on them. But processes are what we have control over. As Bertrand Russell put it, “it would seem we must take account of probability in judging of objective rightness.… The objectively right act is the one which will probably be most fortunate. I shall define this as the wisest act.” We can hope to be fortunate—but we should strive to be wise. Call it a kind of computational Stoicism.
  • Finally, we can draw a clear line between problems that admit straightforward solutions and problems that don’t. If you wind up stuck in an intractable scenario, remember that heuristics, approximations, and strategic use of randomness can help you find workable solutions. A theme that came up again and again in our interviews with computer scientists was: sometimes “good enough” really is good enough. What’s more, being aware of complexity can help us pick our problems: if we have control over which situations we confront, we should choose the ones that are tractable.
  • Our interviewees were, on average, more likely to be available when we requested a meeting at a specific time, say, “next Tuesday between 1:00 and 2:00 p.m. PST,” than when we asked to meet “at a convenient time this coming week.”
  • It was seemingly less difficult for them to accommodate our preferences and constraints than to compute a better option based on their own. Computer scientists would nod knowingly here, citing the complexity gap between “verification” and “search”—which is about as wide as the gap between knowing a good song when you hear it and writing one on the spot. (The second sketch after this list makes that gap concrete.)
  • One of the implicit principles of computer science, as odd as it may sound, is that computation is bad: the underlying directive of any good algorithm is to minimize the labor of thought.
  • When we interact with other people, we present them with computational problems—not just explicit requests and demands, but implicit challenges such as interpreting our intentions, our beliefs, and our preferences.
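
Two of the claims above are easy to poke at in code. First, a rough Monte Carlo check of the 37% figure; the setup (100 applicants whose distinct ranks arrive in uniformly random order) is the standard toy model of the secretary problem, an assumption of ours rather than anything specified in the notes:

```python
import random

# Rough Monte Carlo check of the 37% Rule (look-then-leap) on the
# classic secretary problem, under an assumed toy setup: n applicants
# with distinct ranks arriving in uniformly random order.

def look_then_leap(n, look_fraction=0.37):
    """Skip the first ~37% of applicants, then commit to the first
    one who beats everyone seen so far. Returns True if the single
    best applicant (rank n - 1) is the one chosen."""
    applicants = random.sample(range(n), n)  # random arrival order
    cutoff = int(n * look_fraction)
    best_seen = max(applicants[:cutoff], default=-1)
    for rank in applicants[cutoff:]:
        if rank > best_seen:
            return rank == n - 1             # leapt: was it the best?
    return applicants[-1] == n - 1           # never leapt: stuck with last

trials = 20_000
wins = sum(look_then_leap(100) for _ in range(trials))
print(f"success rate: {wins / trials:.1%}")  # hovers around 37%
```

Run it and the success rate hovers around 37%: even the optimal rule fails roughly 63% of the time, just as the bullet above says.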
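Second, a toy illustration of the verification/search gap, using subset-sum as a stand-in problem (our choice, not the book’s): checking a proposed answer takes a single pass, while finding one by brute force can mean trying all 2^n subsets:

```python
from itertools import combinations

# Toy illustration of the gap between "verification" and "search":
# checking a proposed answer is one pass; finding one is brute force.

def verify(numbers, subset, target):
    """Linear-time check: is this a valid subset hitting the target sum?"""
    return set(subset) <= set(numbers) and sum(subset) == target

def search(numbers, target):
    """Exponential-time search: try subsets until one hits the target."""
    for r in range(len(numbers) + 1):
        for subset in combinations(numbers, r):
            if sum(subset) == target:
                return subset
    return None

numbers = [3, 34, 4, 12, 5, 2]
print(verify(numbers, (4, 5), 9))  # instant: True
print(search(numbers, 9))          # slow in general: up to 2^n subsets
```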

Where to go for dinner

  • Consider this all-too-common scenario. A group of friends are standing around, trying to figure out where to go for dinner. Each of them clearly has some preferences, albeit potentially weak ones. But none of them wants to state those preferences explicitly, so they politely navigate the social hazards with guesses and half-hints instead.
  • They may well come to a resolution that is satisfying to all. But this procedure can easily go awry. The summer after college, for instance, Brian and two friends took a trip to Spain. They negotiated the trip itinerary on the fly, and at one point it became clear that they wouldn’t have time to go to the bullfight they’d researched and planned. Only then, as each of the three attempted to console the others, did they suddenly discover that in fact none of them had wanted to see the bullfight in the first place. Each had just gamely adopted what they’d perceived to be the others’ level of enthusiasm, thereby producing the level of enthusiasm that the others gamely adopted in turn.
  • Likewise, seemingly innocuous language like “Oh, I’m flexible” or “What do you want to do tonight?” has a dark computational underbelly that should make you think twice. It has the veneer of kindness about it, but it does two deeply alarming things. First, it passes the cognitive buck: “Here’s a problem, you handle it.” Second, by not stating your preferences, it invites the others to simulate or imagine them. And as we have seen, the simulation of the minds of others is one of the biggest computational challenges a mind (or machine) can ever face.
  • In such situations, computational kindness and conventional etiquette diverge. Politely withholding your preferences puts the computational problem of inferring them on the rest of the group. In contrast, politely asserting your preferences (“Personally, I’m inclined toward x. What do you think?”) helps shoulder the cognitive load of moving the group toward resolution.
  • Alternatively, you can try to reduce, rather than maximize, the number of options that you give other people—say, offering a choice between two or three restaurants rather than ten. If each person in the group eliminates their least preferred option, that makes the task easier for everyone. And if you’re inviting somebody out to lunch, or scheduling a meeting, offering one or two concrete proposals that they can accept or decline is a good starting point. (A toy sketch of this elimination procedure appears at the end of this section.)
  • None of these actions is necessarily “polite,” but all of them can significantly lower the computational cost of interaction.
  • One of the chief goals of design ought to be protecting people from unnecessary tension, friction, and mental labor. (This is not just an abstract concern; when mall parking becomes a source of stress, for instance, shoppers may spend less money and return less frequently.)
  • Such subtle acts of computational kindness, like a display telling riders exactly how long until the next bus arrives, could do as much for ridership as subsidizing the fares, if not more: think of it as a cognitive subsidy.
  • The intuitive standard for rational decision-making is carefully considering all available options and taking the best one. At first glance, computers look like the paragons of this approach, grinding their way through complex computations for as long as it takes to get perfect answers. But as we’ve seen, that is an outdated picture of what computers do: it’s a luxury afforded by an easy problem. In the hard cases, the best algorithms are all about doing what makes the most sense in the least amount of time, which by no means involves giving careful consideration to every factor and pursuing every computation to the end. Life is just too complicated for that.
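
As promised in the dinner discussion, here is a toy sketch of the elimination procedure. The restaurants and preference orders are hypothetical, and the round-robin rule is one plausible reading of “each person eliminates their least preferred option”:

```python
# Toy sketch of "each person eliminates their least preferred option".
# Restaurants and preference lists are hypothetical; in each list,
# earlier entries are more preferred.

def narrow_down(options, preferences, keep=2):
    """Cycle through the group, each person removing their least
    preferred remaining option, until only `keep` options remain."""
    remaining = list(options)
    while len(remaining) > keep:
        for prefs in preferences:
            if len(remaining) <= keep:
                break
            worst = max(remaining, key=prefs.index)  # least preferred left
            remaining.remove(worst)
    return remaining

options = ["thai", "pizza", "sushi", "tacos", "ramen"]
preferences = [
    ["sushi", "ramen", "thai", "tacos", "pizza"],
    ["pizza", "tacos", "thai", "sushi", "ramen"],
    ["thai", "sushi", "tacos", "ramen", "pizza"],
]
print(narrow_down(options, preferences))  # e.g. ['thai', 'sushi']
```

Each pass shrinks the set of options everyone else has to reason about, which is precisely the computational kindness at stake.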