Everyone knows that the Internet has changed how businesses operate, governments function, and people live. But a new, less visible technological trend is just as transformative: “big data.” Big data starts with the fact that there is a lot more information floating around these days than ever before, and it is being put to extraordinary new uses. Big data is distinct from the Internet, although the Web makes it much easier to collect and share data. Big data is about more than just communication: the idea is that we can learn from a large body of information things that we could not comprehend when we used only smaller amounts.
In the third century BC, the Library of Alexandria was believed to house the sum of human knowledge. Today, there is enough information in the world to give every person alive 320 times as much of it as historians think was stored in Alexandria’s entire collection — an estimated 1,200 exabytes’ worth. If all this information were placed on CDs and they were stacked up, the CDs would form five separate piles that would all reach to the moon.
This explosion of data is relatively new. As recently as the year 2000, only one-quarter of all the world’s stored information was digital. The rest was preserved on paper, film, and other analog media. But because the amount of digital data expands so quickly — doubling around every three years — that situation was swiftly inverted. Today, less than two percent of all stored information is nondigital.
We can learn from a large body of information things that we could not comprehend when we used only smaller amounts.
Given this massive scale, it is tempting to understand big data solely in terms of size. But that would be misleading. Big data is also characterized by the ability to render into data many aspects of the world that have never been quantified before; call it “datafication.” For example, location has been datafied, first with the invention of longitude and latitude, and more recently with GPS satellite systems. Words are treated as data when computers mine centuries’ worth of books. Even friendships and “likes” are datafied, via Facebook.
This kind of data is being put to incredible new uses with the assistance of inexpensive computer memory, powerful processors, smart algorithms, clever software, and math that borrows from basic statistics. Instead of trying to “teach” a computer how to do things, such as drive a car or translate between languages, which artificial-intelligence experts have tried unsuccessfully to do for decades, the new approach is to feed enough data into a computer so that it can infer the probability that, say, a traffic light is green and not red or that, in a certain context, lumière is a more appropriate substitute for “light” than léger.
Using great volumes of information in this way requires three profound changes in how we approach data. The first is to collect and use a lot of data rather than settle for small amounts or samples, as statisticians have done for well over a century. The second is to shed our preference for highly curated and pristine data and instead accept messiness: in an increasing number of situations, a bit of inaccuracy can be tolerated, because the benefits of using vastly more data of variable quality outweigh the costs of using smaller amounts of very exact data. Third, in many instances, we will need to give up our quest to discover the cause of things, in return for accepting correlations. With big data, instead of trying to understand precisely why an engine breaks down or why a drug’s side effect disappears, researchers can instead collect and analyze massive quantities of information about such events and everything that is associated with them, looking for patterns that might help predict future occurrences. Big data helps answer what, not why, and often that’s good enough.
The Internet has reshaped how humanity communicates. Big data is different: it marks a transformation in how society processes information. In time, big data might change our way of thinking about the world. As we tap ever more data to understand events and make decisions, we are likely to discover that many aspects of life are probabilistic, rather than certain.
For most of history, people have worked with relatively small amounts of data because the tools for collecting, organizing, storing, and analyzing information were poor. People winnowed the information they relied on to the barest minimum so that they could examine it more easily. This was the genius of modern-day statistics, which first came to the fore in the late nineteenth century and enabled society to understand complex realities even when little data existed. Today, the technical environment has shifted 179 degrees. There still is, and always will be, a constraint on how much data we can manage, but it is far less limiting than it used to be and will become even less so as time goes on.