Big data flaws we need to address

March 29, 2016

In the past few years there has been a lot of discussion surrounding big data. These talks have generally centered on the amazing opportunities that the technology provides. A byproduct of these discussions is the sense that the use of big data can be somewhat scary. While we agree that big data is amazing, like any emerging technology it has some issues. In this article, we’ll take a look at what could potentially go wrong with big data implementations.

No privacy for you!

Generally, when people think about possible issues with big data, the first and often the last thing that comes to mind is privacy.

The name speaks for itself: Big data relies on gathering a lot of information, and the more private this information is, the more efficiently algorithms can reach some non-obvious conclusions. To put it simply, private data is the fairy dust of all that mighty Big Data Magic.

This fairy dust tends to get scattered around and end up stuck in some dark corners. However, there is more to it than privacy: there’s a whole set of less trivial issues, which are tied to each other in complicated ways.

It’s science, baby (not really)

People think of big data solutions as science. The problem, though, is that the algorithms are actually more like engineering. Big difference.

Think of it as physics versus rockets. Physics is science, no question: every piece of it has been researched and proven, both theoretically and experimentally, and then checked by the scientific community, because this is how science works.

Moreover, science is always open; hence everything can be rechecked anytime by anyone who is interested. And if any major flaws are revealed or new theories emerge, they become a matter of discussion for the global scientific community.

Rockets are just engineering structures based on certain physical principles. And as you know perfectly well, with rockets things easily go south if the design is not good enough. Or if the conditions are ‘wrong’, which is basically the same thing, since it means the design is not good enough for those conditions.

You can’t argue with math, can you?

One of the consequences of this misunderstanding is false authority. People have to accept the decisions of big data algorithms as trustworthy and can’t argue with them. The exception is professional mathematicians, who could potentially challenge the validity of this or that big data model or algorithm, if they were able to study it. But are they really able to?

Black box is so black

Even if you are well equipped with knowledge and experience in math and you want to explore how exactly this or that algorithm works, access is rarely granted. This is because the software is commercial, and its source code is proprietary. Researchers are typically turned away with a note that nobody gets to look under the proprietary hood. Kind of like “thank you for your interest, have a good night.”

In her talk called ‘Weapons of Math Destruction,’ mathematician and human rights activist Cathy O’Neil speaks about value-added modeling, an algorithm used for teacher evaluation in the US:

“My friend who runs a high school in New York wanted to understand this [algorithm]. She’s in a math and science high school so she thought she might be able to understand it. She asked her Department of Education contact to send her information about it. They said ‘Oh, you wouldn’t want to know about it, it’s math!’”

“She persisted and finally got a whitepaper and showed it to me. It was too abstract to be useful. So I filed a Freedom of Information Act request to get the source code, which was denied. I later found out that the think tank in Madison, WI, that is in charge of this model, has a licensing contract [which states that] nobody gets to see inside the model.”

“Nobody in the Department of Education of New York City understands that model, no teacher gets to understand their score nor they can improve their score because it’s not told how.”

Something in, whatever out

Since the algorithms are opaque, the input data is opaque as well. An operator of big data software can’t be sure which data was processed by the algorithm and which was not. Therefore, some data can affect the output twice: the first time through the algorithm and the second time through the operator. Or, on the contrary, some significant data can be dropped if the operator mistakenly thinks that it is already included in the result, when in fact the algorithm never considered it at all.

For example, the police enter a crime-ridden neighborhood. Their software warns them that a man in front of them has a 55% chance of being a burglar. The man carries a suspicious suitcase, but the officers don’t know whether the algorithm took it into account or not. They have to decide for themselves whether the suitcase makes the man more or less suspicious.
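To make the double-counting risk concrete, here is a minimal Python sketch. It assumes the officer reasons in Bayesian odds form and that a suspicious suitcase doubles the odds; the factor of 2.0 and all function names are invented for illustration and say nothing about how any real policing software works.

```python
def to_odds(probability):
    """Convert a probability (0..1) to odds."""
    return probability / (1 - probability)


def to_probability(odds):
    """Convert odds back to a probability (0..1)."""
    return odds / (1 + odds)


def update(probability, likelihood_ratio):
    """Bayesian update in odds form: new odds = old odds * likelihood ratio."""
    return to_probability(to_odds(probability) * likelihood_ratio)


model_score = 0.55    # the 55% reported by the software in the example above
suitcase_factor = 2.0  # hypothetical: how much a suitcase multiplies the odds

# Case 1: the algorithm never saw the suitcase, so the officer may add it.
if_not_counted = update(model_score, suitcase_factor)

# Case 2: the algorithm already used the suitcase, so adding it again
# would count the same evidence twice; the score should stay as it is.
if_already_counted = model_score

print(f"suitcase not yet counted: {if_not_counted:.2f}")
print(f"suitcase already counted: {if_already_counted:.2f}")
```

With these made-up numbers, the same suitcase justifies either about 71% or still 55%, and the officer has no way of knowing which one applies.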

Not to mention that input data can simply contain errors, or lack some information that is vitally important for a correct prediction.

Is the glass half empty or half full?

Output information is not very transparent either and can be misinterpreted. Numbers can be subjective, and two different people can interpret the same numbers completely differently. What is a 30% probability, for instance? The interpretation can vary from ‘most likely not’ to ‘probably yes’ depending on lots of factors you can never foresee.

Even worse, this probability score can be used as a means of competition: even when the probability of a person, say, committing some kind of crime is not high enough to be taken seriously, in some circumstances it can still be used to cut off a certain group of people.

For example, such algorithms are used for security clearance in the US, trying to predict how likely a person is to disclose information. And since there are a lot of people competing for jobs, those in charge are pretty comfortable with cutting off some of them on this very basis, even if the likelihood isn’t really significant, but just a bit above average.
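A rough back-of-the-envelope sketch in Python shows why cutting people off at ‘a bit above average’ can hurt mostly harmless applicants. Every number here (the applicant pool, the 1% base rate, how often the score flags each group) is invented for illustration; none of it comes from a real clearance process.

```python
def screening_outcome(applicants, base_rate, hit_rate, false_alarm_rate):
    """Return (number flagged, share of the flagged who are harmless).

    base_rate:        fraction of applicants who really would disclose data
    hit_rate:         fraction of those people the score flags
    false_alarm_rate: fraction of harmless people the score also flags
    """
    risky = applicants * base_rate
    harmless = applicants - risky
    flagged_risky = risky * hit_rate
    flagged_harmless = harmless * false_alarm_rate
    flagged = flagged_risky + flagged_harmless
    return flagged, flagged_harmless / flagged


# Hypothetical pool: 10,000 applicants, 1% of whom would really leak.
# A cutoff just above the average score flags many harmless people too.
flagged, harmless_share = screening_outcome(
    applicants=10_000,
    base_rate=0.01,
    hit_rate=0.60,
    false_alarm_rate=0.40,
)

print(f"applicants cut off: {flagged:.0f}")
print(f"share of them who are harmless: {harmless_share:.1%}")
```

With these invented figures, about 4,020 applicants are cut off and roughly 98.5% of them would never have disclosed anything.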

No bias?

Considering all the issues mentioned above, it is safe to say that one of the most widely promoted advantages of big data, the absence of bias, is not entirely real. A decision made by a human based on a calculation made by an algorithm made by a human is still a decision made by a human. It may be biased or it may not. The problem is that with an obscure algorithm and opaque data you can’t really tell. And you can’t really change it, since it’s hardcoded into the software.

Welcome to the Dark Side, Anakin

Predictive algorithms are also vulnerable to feedback loops and self-fulfilling prophecies. For example, an algorithm used by the Chicago Police Department can mark a kid as potentially dangerous. Then policemen start ‘keeping an eye on him’, paying visits to his home and so on. The kid sees that the police treat him as a criminal even though he hasn’t done anything yet, and starts acting accordingly. And eventually he becomes a gang member, just because he was offended by the police.

Or, as Whitney Merrill put it in her ‘Predicting Crime in a Big Data World’ talk at Chaos Communication Congress 32, “If a police officer goes on duty to an area, and an algorithm says ‘You are 70% likely to find a burglar in this area’, are they gonna find the burglar because they’ve been told ‘You might find a burglar’?”
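The feedback loop can be sketched as a toy simulation in Python. It assumes two areas with exactly the same real crime level, a single patrol per day that always goes where the most crime has been recorded so far, and one recorded incident per patrol. The area names, starting counts and the allocation rule are all hypothetical and are not taken from any real system.

```python
def simulate_patrols(recorded, days, finds_per_patrol=1):
    """Send one patrol per day to the area with the most recorded crime.

    Crime only gets recorded where the patrol is, so every visit adds
    to that area's record and makes the next visit more likely.
    """
    recorded = dict(recorded)  # work on a copy, leave the input untouched
    for _ in range(days):
        target = max(recorded, key=recorded.get)
        recorded[target] += finds_per_patrol
    return recorded


# Hypothetical starting records: nearly identical, north is ahead by one.
start = {"north": 11, "south": 10}
after = simulate_patrols(start, days=100)

print(after)
```

In this sketch a one-incident head start is enough: after 100 days the north area has 111 recorded incidents and the south still has 10, although the real crime level was the same in both.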

No opt-out

If any governmental or commercial organisation employs big data algorithms and you don’t like it, you can’t just say ‘That’s enough for me, I’m quitting’. Not that anyone is going to ask you whether you want to be a subject of big data research or not. Or worse: they won’t necessarily even tell you that you are a subject.

Well, don’t get me wrong: I don’t mean that all the flaws mentioned above are a good reason for humanity to reject advanced predictive algorithms. Obviously, big data is only on the rise and it is definitely here to stay. But perhaps it’s the right time to think about its issues, while it’s not too late to fix them.

We should make the algorithms and input data more transparent and better protected, grant independent researchers access to source code, settle the legislation, and start informing people about what is actually going on with this ‘Math’ thing. And we definitely have to learn from earlier mistakes after all.