We claim that big data is not about the amount of data or the machines or algorithms. Rather, it is about how we use the data. As an approach, big data is very promising, and it may change our core understanding of the world in many ways. Yet big data should not be exercised in isolation. Human involvement is important.
What is the place for us humans in the big data prediction-making machinery? Are we just statisticians, or should we play an active role in every phase of data crunching?
In fact, we humans have a number of roles within the big data machinery, whether we want it or not:
• Developers. Some of us develop the big data systems themselves.
• Decision makers. Quite a few of us use big data systems daily.
• Subjects. All of us are affected by big-data-based decisions, in one way or another.
• Data providers. All of us provide huge amounts of personal data for the big data machine to process.
Next, we will look at the big data developers and decision makers.
The latter two groups deserve an entire article of their own. (Rest assured, we will get back to them in a future article.)
Learning algorithms are able to find correlations between thousands of factors, something far beyond what is humanly possible. However, it can be argued that big data is never smarter than the human operators mining it.
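To make that scale concrete, the brute-force flavor of such correlation mining can be sketched in a few lines of plain Python. Everything here is synthetic and invented for illustration: the factor matrix is random noise, with one genuinely linked pair planted so the scan has something to find.

```python
import random
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(42)
n_samples, n_factors = 200, 100

# Mostly unrelated noise factors...
factors = [[random.gauss(0, 1) for _ in range(n_samples)]
           for _ in range(n_factors)]
# ...plus one genuinely linked pair: factor 1 echoes factor 0.
factors[1] = [x + random.gauss(0, 0.3) for x in factors[0]]

# Exhaustive scan over every pair of factors: trivial for a machine,
# hopeless by hand once the factors number in the thousands.
best = max(combinations(range(n_factors), 2),
           key=lambda ij: abs(pearson(factors[ij[0]], factors[ij[1]])))
print(best)  # the planted pair stands out from the noise
```

With 100 factors the machine checks 4,950 pairs; with thousands of factors, millions of pairs. Deciding which of the flagged pairs *matter* is still the human's job.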
There are two aspects that essentially define the usefulness of big data:
• What can ultimately be asked?
• How do I figure out what is important?
It takes a lot of human creativity to innovate on which datasets to use, which correlations are important (the big data machinery may identify dozens of correlations that, at the end of the day, have nothing to do with your original quest), and how to test them.
Building a model for finding the desired answers indeed requires human intervention. Big data is not magic. The big data approach does not provide answers if you do not know how to ask.
• Making predictions based on correlations.
The philosophy behind big data is to use all of the data available. This is in strong contrast to the small data approach, where only samples of the whole datasets are analyzed, and statistical methods are used to calculate probabilities.
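The contrast can be made concrete with a toy example in plain Python (the population, sample size, and numbers are all invented): computing a statistic from all the data gives an exact answer, while the small data approach yields an estimate plus a quantified uncertainty.

```python
import random

# "All the data": a full (deterministic, synthetic) population.
population = [i % 100 for i in range(1_000_000)]
true_mean = sum(population) / len(population)   # exact: 49.5

# "Small data": estimate the same quantity from a sample,
# quantifying the uncertainty with a standard error.
random.seed(7)
sample = random.sample(population, 1_000)
n = len(sample)
est = sum(sample) / n
var = sum((x - est) ** 2 for x in sample) / (n - 1)
stderr = (var / n) ** 0.5

print(f"full data: {true_mean}")
print(f"sample:    {est:.2f} +/- {stderr:.2f}")
```

The sampling answer is cheaper but only probabilistic; the all-the-data answer is exact but demands the machinery to crunch everything.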
Obviously, some very serious hard science lies behind the scenes: learning algorithms, artificial intelligence, and the like. These are beyond the scope of this article.
Big data is more than just the methods, though. We think that, more than anything else, big data is an approach to finding answers (or solving problems, if you like).
BIG DATA IS AN APPROACH
One might easily think that big data is just a large amount of data with some complex mathematical and computational operations thrown in. However, the very nature of big data (multiple incoherent data sources, hundreds of factors, untraceable correlations, complex math that only a handful of data scientists on Earth really understand) implies that while the big data machinery may give us a very precise answer on where the flu will hit next, we do not know why.
During the process, the big data approach loses causality. It can answer what, but not why. The single factors that lead to the conclusion are there, but the correlations do not explain causality. The correlations may be completely random. Thus, the underlying human motivations can't be extracted, even when the approach is applied to explain human behavior.
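That randomness is easy to demonstrate: given many factors and few observations, an exhaustive scan will "discover" strong correlations in pure noise. A minimal sketch in plain Python, with entirely synthetic numbers:

```python
import random
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(0)
n_samples, n_factors = 30, 150   # few observations, many factors

# Pure noise: by construction, no factor has anything to do with any other.
noise = [[random.gauss(0, 1) for _ in range(n_samples)]
         for _ in range(n_factors)]

# Yet scanning all ~11,000 pairs still turns up a "strong" correlation.
strongest = max(abs(pearson(noise[i], noise[j]))
                for i, j in combinations(range(n_factors), 2))
print(f"strongest correlation found in pure noise: {strongest:.2f}")
```

This is the classic multiple-comparisons effect: search enough pairs and impressive-looking correlations appear by chance, carrying no causal content at all.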
Then the obvious question, from a human point of view, is: Is it acceptable to act without knowing why? Is it okay to perform surgery or to arrest someone if the big data says so (with a given accuracy)?
There are three severe pitfalls related to losing causality:
• Responsibility. Who is responsible for the decision if it is made based solely on a big data prediction? The big data machinery?
• Learning. When you do not know why, you do not learn. There is no way to develop your actions based on your earlier experiences if, every time you need a decision, you just ask the data.
• Trust. It is much more difficult to trust the results if you can't see the reasons. What you need to do is trust the algorithm. Can you really consider an algorithm an authority?
Even worse, if the prediction appears incorrect (a false positive), it may be next to impossible to take corrective action. For instance, if for some reason Facebook determines that you like to eat fish even though you are lethally allergic to it, you will get fish-related ads till the sun expands, no matter what you do. You can't teach a predictive big data engine by changing just one tiny bit of data. Or, if you can, then how trustworthy is the whole system?
Big data may lead us to "data dictatorship" sooner than we may realize. We strongly oppose such a narrow mindset.
Another characteristic related to big data is imprecision. The big data approach implies using multiple heterogeneous datasets, or several pieces of "small data." In many cases, the data was actually originally collected for some completely different purpose (data reusability), and the datasets are differently formatted and stored with varying accuracy.
Inevitably, these different datasets bring imprecision. And then the question is: Do we accept improved quantity with deteriorated accuracy?
The answer may be yes or no, depending heavily on what the data is used for.
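A minimal sketch of that friction, assuming two invented temperature logs (the sources, date formats, units, and readings are all hypothetical):

```python
from datetime import datetime

# Two hypothetical "small data" sources recording daily temperature,
# collected for different purposes, in different formats and units.
source_a = {"2014-03-01": 20.5, "2014-03-02": 21.0}   # ISO dates, Celsius
source_b = [("03/03/2014", 70), ("03/04/2014", 68)]   # US dates, whole Fahrenheit

def iso(us_date):
    """Normalize MM/DD/YYYY to ISO YYYY-MM-DD."""
    return datetime.strptime(us_date, "%m/%d/%Y").date().isoformat()

def f_to_c(f):
    """Convert Fahrenheit to Celsius."""
    return (f - 32) * 5 / 9

# Merging buys quantity: four days of data instead of two. But source B
# was rounded to whole degrees before conversion, so the merged dataset
# silently mixes two different accuracy levels.
merged = dict(source_a)
merged.update({iso(d): round(f_to_c(f), 1) for d, f in source_b})
print(merged)
```

The merge itself is easy; the hard, human question is whether the coarser half of the merged data is still precise enough for the purpose at hand.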
Yet another aspect is change. Data can extract an approximation of the truth only in the current setting. Should the circumstances change, the analysis needs to be run again. This is especially the case when human behavior is involved: all may change when a new feature or artifact is introduced to people (data providers, in this case).