Big Data - PCC - ebook

Financial Times Business Book of the Year Finalist “Illuminating and very timely . . . a fascinating — and sometimes alarming — survey of Big Data’s growing effect on just about everything: business, government, science and medicine, privacy, and even on the way we think.” —New York Times It seems like “Big Data” is in the news every day, as we read the latest examples of how powerful algorithms are teasing out the hidden connections between seemingly unrelated things. Whether it is used by the NSA to fight terrorism or by online retailers to predict customers’ buying patterns, Big Data is a revolution occurring around us, in the process of forever changing economics, science, culture, and the very way we think. But it also poses new threats, from the end of privacy as we know it to the prospect of being penalized for things we haven’t even done yet, based on Big Data’s ability to predict our future behavior. What we have already seen is just the tip of the iceberg. Big Data is the first major book about this earthshaking subject, with two leading experts explaining what Big Data is, how it will change our lives, and what we can do to protect ourselves from its hazards.  “An optimistic and practical look at the Big Data revolution — just the thing to get your head around the big changes already underway and the bigger changes to come.”

Number of pages: 34

Chapter 1

Three shifts in the way information is analyzed and used

Big data's ascendancy will bring about three fundamental shifts in the way information is analyzed and then used in society:

Shift 1: Process all data, not just samples

Humans have always looked to solve problems by using data, but throughout history this has been very hard to do. Collecting, organizing and then understanding the data has traditionally been difficult for at least two obvious reasons:

1. The majority of the world's information has tended to be analog rather than digital.

2. Collating and then analyzing analog information is extremely expensive and time-consuming.

The 1880 census of the United States is a good example of these difficulties. It took a full eight years to collect and analyze the data, which all but guaranteed that any conclusions drawn from the census were out of date by the time they became available. The 1890 census was forecast to take 13 years to collate, but new technology (punch cards and tabulating machines) arrived in time and cut the analysis to one year. This mattered because the United States Constitution mandates a national census every ten years, and the results are used to set tax levels and determine congressional representation.

The usual response to these challenges has been to analyze a random sample of the data and extrapolate from it, rather than attempting to analyze all the data. Sampling made big-data problems more manageable but has always had some inherent problems:

1. You have to make sure you have a genuinely random sample which is representative of the whole.

2. Biases can creep in unnoticed during sampling, which leads to incorrect predictions.

3. Random sampling cannot capture the preferences of subgroups or other niches of the data. Samples are unhelpful when you want to focus on individual niches.

Sampling always blurs the details, and the really interesting things in life often happen within the margin of error that sampling inherently carries. Sampling was also a response to the constraints of earlier generations of information technology, which could not handle vast amounts of raw data in multiple formats.
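The point about niches can be made concrete with a small simulation. The sketch below is an illustration only and is not taken from the book: the population size, the 0.5% niche share and the 80%/30% preference rates are all assumed numbers, and `build_population` and `niche_stats` are hypothetical helper names. It compares what the full data set and a 1,000-record random sample each reveal about a small subgroup.

```python
import random

def build_population(size, niche_share, rng):
    """Return (in_niche, likes_product) pairs for a hypothetical population."""
    population = []
    for _ in range(size):
        in_niche = rng.random() < niche_share
        # Assumed preference rates: 80% inside the niche, 30% outside it.
        likes = rng.random() < (0.8 if in_niche else 0.3)
        population.append((in_niche, likes))
    return population

def niche_stats(records):
    """Count niche members in `records` and the share of them who like the product."""
    niche = [likes for in_niche, likes in records if in_niche]
    share = sum(niche) / len(niche) if niche else None
    return len(niche), share

rng = random.Random(42)  # fixed seed so repeated runs give the same result
population = build_population(200_000, 0.005, rng)
sample = rng.sample(population, 1_000)  # simple random sample, no replacement

full_count, full_share = niche_stats(population)
sample_count, sample_share = niche_stats(sample)

print(f"full data: {full_count} niche members, share who like it = {full_share:.3f}")
print(f"sample:    {sample_count} niche members, share who like it = {sample_share}")
```

Under these assumptions the full data set contains roughly a thousand niche members, while the sample is expected to catch only about five of them, so any estimate for the niche drawn from the sample rests on a handful of records.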

Key Thoughts

"Sampling is an outgrowth of an era of information-processing constraints, when people were measuring the world but lacked the tools to analyze what they collected. As a result, it is a vestige of that era too. The shortcomings in counting and tabulating no longer exist to the same extent. Sensors, cellphone GPS, web clicks, and Twitter collect data passively; computers can crunch the numbers with increasing ease. The concept of sampling no longer makes as much sense when we can harness large amounts of data. The technical tools for handling data have already changed dramatically, but our methods and mindsets have been slower to adapt." — Viktor Mayer-Schönberger and Kenneth Cukier

The simple dynamic is that the more data you use, the better your predictions become. Taken to its logical conclusion, this means that if you analyze all the data rather than just a sample, you are going to come up with superior results. In past generations, analyzing all the available data was prohibitively expensive, but today the cost and complexity of the requisite storage, processing power and cutting-edge analysis tools have declined rapidly. It is now feasible for everyone to do this.

To illustrate, take the example of Oren Etzioni, one of America's foremost computer scientists. When he took a flight from Seattle to Los Angeles in 2003, he found, much to his chagrin, that almost everyone else on the flight had paid less than he had, even though they had bought their tickets much later. He decided to build a predictive model that would tell people whether an online ticket price was a good deal or not.