Big data's ascendancy will bring about three fundamental shifts in the way information is analyzed and then used in society:
Shift 1: Process all data, not just samples
Humans have always looked to data to solve problems, but throughout history this has been very hard to do. Collecting, organizing, and then understanding data has traditionally been difficult for at least two glaringly obvious reasons:
1. The majority of the world's information has tended to be analog rather than digital.
2. Collating and then analyzing analog information is extremely expensive and time consuming.
The 1880 census of the United States is a good example of these difficulties. It took fully eight years to collect and analyze the data, pretty much guaranteeing that any conclusions drawn from the census were out of date by the time they became available. The 1890 census was forecast to take 13 years to collate, but new technology came along (punch cards and tabulating machines) which fortunately cut the analysis down to one year. This mattered because the United States Constitution mandates a national census every ten years to set tax levels and apportion congressional representation.
The usual response to these challenges has been to analyze a random sample of the data and extrapolate from it, rather than attempting to analyze everything. Sampling made big-data problems more manageable, but it always had some inherent problems:
1. You have to make sure you have a genuinely random sample which is representative of the whole.
2. Biases can creep in unnoticed when sampling which results in incorrect predictions.
3. Random sampling cannot capture the preferences of subgroups or other niches of the data. Samples are unhelpful when you want to focus on individual niches.
Sampling always blurs the details, and the really interesting things in life often happen within the very margin of error that sampling inherently carries. Sampling was also a response to the constraints of earlier generations of information technology, which could not handle vast amounts of raw data in multiple formats.
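The point about samples missing niches can be made concrete with a small sketch. The data below is entirely hypothetical: a population in which 1% of customers form a niche segment whose behavior differs sharply from the mainstream. Analyzing all the records recovers the niche; a modest random sample leaves so few niche records that any estimate about them is noisy at best.

```python
import random

# Hypothetical population: 99,000 "mainstream" customers and a 1% niche
# segment with a very different average spend.
random.seed(42)
population = (
    [("mainstream", random.gauss(50, 10)) for _ in range(99_000)]
    + [("niche", random.gauss(500, 50)) for _ in range(1_000)]
)

def niche_mean(records):
    """Average spend of the niche segment, or None if the segment is absent."""
    spends = [spend for segment, spend in records if segment == "niche"]
    return sum(spends) / len(spends) if spends else None

# Using ALL the data, the niche average is estimated from 1,000 records
# and lands close to the true value of 500.
print(niche_mean(population))

# A 1,000-record random sample contains only around 10 niche records,
# so the same estimate becomes far noisier, or impossible if the sample
# happens to contain no niche records at all.
sample = random.sample(population, 1_000)
print(niche_mean(sample))
```

The contrast is the whole argument in miniature: the sample is perfectly adequate for the mainstream average, and nearly useless for the niche.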
"Sampling is an outgrowth of an era of information-processing constraints, when people were measuring the world but lacked the tools to analyze what they collected. As a result, it is a vestige of that era too. The shortcomings in counting and tabulating no longer exist to the same extent. Sensors, cellphone GPS, web clicks, and Twitter collect data passively; computers can crunch the numbers with increasing ease. The concept of sampling no longer makes as much sense when we can harness large amounts of data. The technical tools for handling data have already changed dramatically, but our methods and mindsets have been slower to adapt." — Viktor Mayer-Schönberger and Kenneth Cukier
The simple dynamic is this: the more data you use, the better your predictions become. Taken to its logical conclusion, that means analyzing all the data rather than just a sample will yield superior results. In past generations, analyzing all the available data was prohibitively expensive, but today the cost and complexity of the requisite storage, processing power, and analytical tools have declined rapidly. It is now feasible for almost anyone to do this.
To illustrate, take the example of Oren Etzioni, one of America's foremost computer scientists. When he took a flight from Seattle to Los Angeles in 2003, he found, much to his chagrin, that almost everyone else on the flight had paid less than he had, even though they had purchased their tickets much later. He decided to build a predictive model that would tell people whether an online ticket price was a good deal or not.
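The idea behind such a predictor can be sketched in a few lines. This is not Etzioni's actual method, which mined enormous archives of fare quotes; it is a deliberately simplified toy, with invented data, that captures the core question: given past prices for a route at various days before departure, does the price trend suggest buying now or waiting?

```python
def fit_trend(days_before, prices):
    """Ordinary least-squares slope of price vs. days-before-departure."""
    n = len(prices)
    mean_x = sum(days_before) / n
    mean_y = sum(prices) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(days_before, prices))
    var = sum((x - mean_x) ** 2 for x in days_before)
    return cov / var

def advise(days_before, prices):
    # A positive slope means fares were higher further out from departure,
    # i.e. they have been FALLING as the flight approaches, so waiting
    # may pay off; otherwise buy now.
    return "wait" if fit_trend(days_before, prices) > 0 else "buy now"

# Hypothetical price history for one route (days before departure, fare):
history_days = [30, 25, 20, 15, 10, 5]
history_price = [320, 310, 300, 290, 280, 275]
print(advise(history_days, history_price))  # → wait (fares have been falling)
```

A real system would of course condition on route, airline, season, and far more history, but the design choice is the same: let accumulated data, not intuition, decide whether today's quote is a good deal.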