In this volume, we conclude with Privacy and Security.
Privacy and Security
For our final examples I want to dig into the notions of privacy and security in big data settings. These are and always will be critical concerns.
We begin with financial information and e-commerce. In the early days of Amazon, there were a significant number of customers who were very concerned about the security implications of entering their credit card numbers online. The specific concerns varied, but they almost always involved the possibility of a gang of nefarious hackers gaining access to credit card numbers and using them to make fraudulent purchases.
In this volume, we discuss Data Mining and The Birthday Paradox.
Data Mining and The Birthday Paradox
We’ve all heard of the Birthday Paradox. Put 23 randomly chosen people in a room and there is a slightly better than 50% chance that two or more share a birthday. Put 57 people in the room, and the chance is 99%. Do those people have anything more in common because they were born on the same day of the year? Astrologers will say yes, but most scientists would say there is no evidence to support that claim.
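To check the 23-person and 57-person figures, here is a minimal sketch that multiplies out the chance that everyone's birthday is distinct and subtracts it from one. It assumes 365 equally likely birthdays and ignores leap years; the function name is my own, chosen for illustration.

```python
def birthday_collision_probability(people: int, days: int = 365) -> float:
    """Probability that at least two of `people` share a birthday.

    Assumes every day is equally likely and birthdays are independent.
    """
    if people > days:
        # Pigeonhole: more people than days guarantees a shared birthday.
        return 1.0
    p_all_distinct = 1.0
    for k in range(people):
        # The (k+1)-th person must avoid the k birthdays already taken.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    for n in (22, 23, 57):
        print(n, birthday_collision_probability(n))
```

Running it shows the probability crossing one half between 22 and 23 people and passing 99% at 57.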
What does this have to do with big data? The answer is that generalizations of the math behind the birthday paradox tell us that we will—not just that we can but that with near 100% certainty we will—draw meaningless conclusions if we just look at enough variables. In fact, we can show that if we generate a large number of streams of completely random data, some of them will look like others.
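The claim about random streams is easy to try for yourself. The sketch below generates independent Gaussian noise streams and reports the strongest Pearson correlation found between any pair. The stream count, stream length, seed, and function names are all assumptions I picked for illustration; nothing here comes from a real data set.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_spurious_correlation(num_streams, length, seed=0):
    """Largest |correlation| between any two independent noise streams."""
    rng = random.Random(seed)
    streams = [
        [rng.gauss(0.0, 1.0) for _ in range(length)]
        for _ in range(num_streams)
    ]
    return max(
        abs(pearson(a, b)) for a, b in itertools.combinations(streams, 2)
    )


if __name__ == "__main__":
    # More streams means more pairs to compare, so the best "match" among
    # pure noise tends to look stronger as num_streams grows.
    for num_streams in (2, 20, 200):
        print(num_streams, strongest_spurious_correlation(num_streams, 20))
```

None of these streams has anything to do with any other, yet with 200 of them there are 19,900 pairs to compare, and the best-looking pair will appear strongly related.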
The problem is, we can easily forget this when we look at big data sets with lots and lots of variables. This is the kind of thing we see in what is called data exhaust. Data exhaust is the vast stream of data gathered and logged by digital devices ranging from mobile phones to engine sensors in cars to video cameras in public spaces to instruments on particle accelerators.
Look at a lot of this data, and you will find spurious correlations. This is what Principle 3 is all about. Statisticians have known about Principle 3 for decades, and have techniques for trying to deal with it. The best technique, however, is and always has been a controlled scientific experiment, as Principle 2 advocates.
In this volume, we discuss Product Recommendations.
Product Recommendations
Product recommendations span far more than just films. We have all seen these recommendations hundreds of times while shopping online. You are looking at one product, and five or ten others are recommended, either as alternatives to consider, or perhaps accessories, based on what shoppers who are in some way like you have done in the past.
Many frame the product recommendation problem as one of prediction. How can I predict which product you are likely to buy next, and recommend it to you? That’s actually a lot better than what contestants were asked to predict for the Netflix Prize. Unfortunately, prediction is not really the problem at all. If I am 100% accurate in my prediction, and you buy exactly what I predicted you would, I haven’t changed anything. I haven’t generated any incremental value for the retailer, or created much immediate value for the shopper, other than perhaps saving them a bit of time.
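One way to make the difference between prediction and incremental value concrete is to compare a group of shoppers who saw recommendations with a control group who did not. The sketch below is only an illustration of that idea under assumed inputs: the function name and all the counts are hypothetical, not figures from any retailer.

```python
def incremental_lift(control_conversions, control_visitors,
                     treatment_conversions, treatment_visitors):
    """Difference in conversion rate between treatment and control groups."""
    control_rate = control_conversions / control_visitors
    treatment_rate = treatment_conversions / treatment_visitors
    return treatment_rate - control_rate


if __name__ == "__main__":
    # Hypothetical case 1: the recommender "predicts" purchases perfectly,
    # but shoppers would have bought those items anyway, so nothing changes.
    print(incremental_lift(50, 1000, 50, 1000))

    # Hypothetical case 2: recommendations prompt purchases that would not
    # otherwise have happened, so the treatment group converts more often.
    print(incremental_lift(50, 1000, 60, 1000))
```

In the first case the lift is zero no matter how accurate the prediction was, which is the point of the paragraph above.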
Big Data has become the subject of Big Hype, much as Social Media and Mobile were recently. Our goal today is to peel back the hype and discover some of the key principles behind Big Data so we can make the best possible decisions about when, where, and how to apply it.
My background with Big Data has predominantly been in retail, as Principal Engineer in Personalization at Amazon, and now Chief Scientist at RichRelevance, so I will use several retail examples. However, the principles behind these examples are without question more broadly applicable. These principles are:
1. Before we look at any data, we have to have a clear and well-defined goal. Otherwise we are likely to find very clever solutions to the wrong problems.
2. Smart data science requires the same fundamental scientific method (hypothesis, experimentation, and analysis) as every other science.
3. Correlation is not causation. We all know this, but in a big data world it is much easier to confuse the two.
4. Data are economic assets. Understanding them as such helps us understand how to motivate all participants in the data economy, from individuals to corporations to governments and non-profits.
The Netflix Prize
The Netflix Prize has done more to bring Big Data and data science in general to the public mind than any other event. This has been great for increasing the visibility of the field but, I’m sad to say, miserable for actual practice. The saddest part is that the winning algorithms are not in use at Netflix today, and are unlikely ever to be.
For those of you who live in a data cave on the West Coast like I do, it may come as a surprise that there was a blizzard this past weekend, and a BIG one. The ‘Nemo’ blizzard, caused by the merging of two low-pressure systems that originated in the central and southeastern US and then migrated to the northeastern seaboard, affected millions of people in the US Northeast with heavy snow and multiple power outages.
The crack about using Big Data to analyze Big Weather almost writes itself here, but it is rather astounding how much external events, such as extreme weather and power outages, can influence online activity. Obviously, it’s no surprise that without power, there’s no e-commerce – yet I am surprised by how closely we can track the storm’s influence (and people’s response to it) using RichRelevance’s dataset.
Three months ago, I deployed with the 101st Combat Aviation Brigade of the 101st Airborne Division (Air Assault) to Afghanistan as the Experimental Test Pilot for Task Force Destiny. After arriving in Bagram in mid-September to a “B-hut” with bare plywood walls, I realized I needed some wall decorations. So, in 21st-century fashion, I posted my address on my Facebook page and “hinted” that I might like to receive a letter or postcard from friends and family to pin up and add some color.