Imagine you’re conducting a big study of typing performance. You put thousands of people through a battery of typing tests, then crunch the numbers. The data is clear: faster typing speed is correlated with fewer typos. Therefore, you conclude, the best way to avoid making typos is to type as fast as possible.
It’s easy to see that this is a wrongheaded conclusion. But sports scientists may be inadvertently making this kind of error all the time, according to a recent paper in the International Journal of Sports Physiology and Performance by Niklas Neumann and his colleagues at the University of Groningen in the Netherlands. In fact, some scientists argue that “the vast majority of social and medical science research” may be affected by this mistaken belief that group data can be applied to individuals, a phenomenon dubbed the “ergodicity problem.”
In the typing example, the problem is that better typists are both faster and less typo-prone. So on a group level, high speed and low error rate are correlated. But if you test any given individual repeatedly over time, you’d likely find the opposite pattern: higher speed comes with more errors. The group average can’t be generalized to tell you about individual outcomes. In contrast, rolling one die 100 times should give you (on average) the same outcome as rolling 100 dice once.
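To see how both patterns can hold at once, here is a minimal simulation sketch. Everything in it is invented for illustration: the skill range, the speed and error formulas, and the noise levels are assumptions, not data from any typing study. Each simulated typist has a fixed skill level, and in any given session rushing adds both speed and typos.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_typists(n_people=200, n_sessions=50, seed=0):
    """Return one list of (speed, errors) sessions per hypothetical typist."""
    rng = random.Random(seed)
    people = []
    for _ in range(n_people):
        skill = rng.random()           # assumed: skill is uniform on 0..1
        base_speed = 40 + 40 * skill   # better typists are faster...
        base_errors = 10 - 8 * skill   # ...and make fewer typos
        sessions = []
        for _ in range(n_sessions):
            rush = rng.uniform(-5, 5)  # how much they hurry in this session
            speed = base_speed + rush
            errors = base_errors + 0.3 * rush + rng.gauss(0, 0.3)
            sessions.append((speed, errors))
        people.append(sessions)
    return people


people = simulate_typists()

# Group level: one average point per person, correlated across people.
avg_speed = [mean(s for s, _ in sessions) for sessions in people]
avg_errors = [mean(e for _, e in sessions) for sessions in people]
group_r = pearson(avg_speed, avg_errors)

# Individual level: correlate within each person, then average those values.
individual_r = mean(
    pearson([s for s, _ in sessions], [e for _, e in sessions])
    for sessions in people
)

print(f"across people: r = {group_r:.2f}")
print(f"within a person (average): r = {individual_r:.2f}")
```

Under these assumptions the correlation across people comes out negative while the average within-person correlation comes out positive, which is the typing paradox in miniature.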
In technical terms, the difference between the two situations is that the dice data is ergodic, a term coined in the 1870s by the Austrian physicist Ludwig Boltzmann, whereas the typing data is nonergodic. Ergodicity is a crucial concept in statistical mechanics, which (for example) deduces the behavior of a large volume of gas from the motions of its uncountable individual molecules. In recent years, the concept has spread to other fields: ergodicity economics, for example, acknowledges the differences between 100 people making a bet with a one percent chance of going bust, and one person making such a bet 100 times. What sounds like a pretty good bet on the group level turns out to be a very bad one for the individual.
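The arithmetic behind that bet is short enough to check by hand. This sketch assumes the bets are independent and that going bust once ends the game; the function name is just a label for this example.

```python
def survival_probability(p_bust, n_bets):
    """Chance of getting through n_bets independent bets without going bust."""
    return (1 - p_bust) ** n_bets


p_bust = 0.01

# 100 people each bet once: on average, one of them goes bust.
expected_busts_in_group = 100 * p_bust

# One person bets 100 times in a row: they must survive every single bet.
p_survive_100 = survival_probability(p_bust, 100)

print(f"expected busts among 100 one-time bettors: {expected_busts_in_group:.0f}")
print(f"chance one person survives 100 bets: {p_survive_100:.3f}")  # about 0.366
```

So the repeat bettor has roughly a 63 percent chance of going bust somewhere along the way, even though each individual bet looks 99 percent safe.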
The sports problem that Neumann and his colleagues consider is the relationship between training load and recovery. For endurance sports, in particular, you could view this as the master key to performance. More training increases fitness, but also raises your risk of injury and burnout. Figuring out exactly how much training you can handle, and how quickly you can recover from it, enables you to edge closer to the red line of maximal training. This has led to all sorts of research that attempts to quantify how different training load patterns are linked to performance and injury risk.
But is the link between training load and recovery ergodic? That is, can you measure training load and subsequent recovery in a large group of people, and use those results to predict how any given individual will respond to a sequence of training sessions and recoveries?
To find out, Neumann and his colleagues worked with “a major league football club in The Netherlands,” which from the affiliations of the paper’s authors we can assume is FC Groningen. Over the course of two seasons, they collected daily training and recovery data from 83 members of their under-17, under-19, and under-23 teams. Before each training session, the players had to indicate their perceived recovery on a scale of 6 to 20; after each session, they indicated their perceived effort during the session, again on a scale of 6 to 20, which was then multiplied by the duration of the workout in minutes to obtain the total training load.
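That load calculation is simple multiplication. A minimal sketch, with a hypothetical function name and an invented example session:

```python
def session_load(perceived_effort, duration_minutes):
    """Training load for one session: effort rating (6-20) times minutes."""
    if not 6 <= perceived_effort <= 20:
        raise ValueError("perceived effort must be on the 6-20 scale")
    if duration_minutes < 0:
        raise ValueError("duration cannot be negative")
    return perceived_effort * duration_minutes


# Hypothetical session: a 90-minute practice rated 15 for effort.
example_load = session_load(15, 90)
print(example_load)  # 1350
```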
The simplest version of the training load/recovery question is: Does the total training load in a workout affect how recovered you feel before the next day’s workout? The researchers try to answer this question in two different ways. In the group-level analysis, you calculate an average training load for all athletes on a given day, and compare it to the average recovery rating for all athletes the next day. In the individual-level analysis, you instead look at every pair of workout/recovery scores for a single individual over the course of the two-year data set.
The mathematical analysis gets pretty involved, but here’s the crux. The group analysis looks at just one day (plus recovery the next day), but you can repeat that analysis for every available workout day and average the results. Similarly, the individual analysis can be repeated for every athlete and then averaged. In this way, both approaches are using all the available data. If they produce identical results, then the training and recovery data is ergodic, meaning that we can safely apply the results of group studies to individuals. If they don’t produce identical results, then all bets are off.
Sure enough, the group and individual analyses produced different results. In particular, training loads varied far more for given individuals over time than they did between individuals on a given day. And the correlations between training load and recovery didn’t match up either. How a bunch of people respond to a single workout doesn’t necessarily tell you how you respond to a series of workouts.
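One way to picture the two analyses is as two ways of slicing the same table, with one row per player and one column per day. The sketch below uses made-up load numbers for three hypothetical players, not the study's data, and compares only the spread of training loads. It is a simplified illustration rather than the paper's actual statistical method.

```python
from statistics import mean, pstdev

# Invented daily training loads: one row per player, one column per day.
loads = {
    "player_a": [300, 900, 450, 1200],
    "player_b": [320, 880, 470, 1150],
    "player_c": [280, 940, 430, 1250],
}

# Individual level: spread of each player's loads over time, then averaged.
within_person = mean(pstdev(days) for days in loads.values())

# Group level: spread across players on each day, then averaged.
# zip(*rows) turns the player rows into per-day columns.
between_people = mean(pstdev(day) for day in zip(*loads.values()))

print(f"typical spread within one player over time: {within_person:.0f}")
print(f"typical spread between players on one day: {between_people:.0f}")
```

In this toy table the team trains together, so players look alike on any given day but each player's load swings widely from day to day. If the data were ergodic, the two numbers would be about the same.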
Figuring out what this means in practice is tricky. In the field of medical research, some researchers have pushed back against the idea that nonergodicity is some sort of crisis that invalidates huge swaths of existing research. Tools such as placebo-controlled randomized trials, they argue, help to wash out some of the effects of person-to-person variation. In a sense, the findings simply reinforce a trend that has been gathering strength in sports science journals for at least a decade, which is to always report individual results in addition to group averages. Seeing the individual dots on a graph gives you an immediate sense of whether everyone is clustered close to the average response, or whether a significant number of subjects saw different, or perhaps even opposite, responses compared to the average.
One final caveat: acknowledging the shortcomings of group-level research doesn’t mean ignoring the flaws and pitfalls of self-experimentation. My impression is that, for any research finding that applies to 99 out of 100 people, at least ten will swear that they are the exception. (Make that 20 if we’re talking about stretching.) Meaningful individual-level data has to be collected just as rigorously as any randomized trial, with predefined hypotheses, placebo controls, and measurable outcomes rather than just gut feelings. It may be true, as George Sheehan wrote, that we’re each an experiment of one, but it’s up to us to make sure we’re interpreting the findings properly.