{"id":11673,"date":"2016-03-29T09:00:08","date_gmt":"2016-03-29T13:00:08","guid":{"rendered":"https:\/\/www.kaspersky.com\/blog\/?p=11673"},"modified":"2017-09-24T08:07:39","modified_gmt":"2017-09-24T12:07:39","slug":"nine-big-data-issues","status":"publish","type":"post","link":"https:\/\/www.kaspersky.com\/blog\/nine-big-data-issues\/11673\/","title":{"rendered":"Big data flaws we need to address"},"content":{"rendered":"<p>In the past few years there have been a lot of discussions surrounding big data. These talks have generally centered on the amazing opportunities the technology provides. A byproduct of these discussions is that the use of big data can be somewhat scary. While we agree that big data is amazing, much like any emerging technology it has some issues. In this article, we\u2019ll take a look at what could potentially go wrong with big data implementations.<\/p>\n<p><a href=\"https:\/\/media.kasperskydaily.com\/wp-content\/uploads\/sites\/92\/2016\/03\/06022613\/big-data-dangers-FB.png\" rel=\"attachment wp-att-11675\"><img decoding=\"async\" class=\"aligncenter size-full wp-image-11675\" src=\"https:\/\/media.kasperskydaily.com\/wp-content\/uploads\/sites\/92\/2016\/03\/06022613\/big-data-dangers-FB.png\" alt=\"Big data flaws we need to address\" width=\"1280\" height=\"1280\"><\/a><\/p>\n<h3>No privacy for you!<\/h3>\n<p>Generally, when people think about the possible issues of big data, the first and often the last thing that comes to mind is <b>privacy<\/b>.<\/p>\n<p>The name speaks for itself: Big data relies on gathering <em>a lot<\/em> of information, and the more private this information is, the more efficiently algorithms can reach non-obvious conclusions. To put it simply, private data is the fairy dust of all that mighty <em>Big Data Magic<\/em>.<\/p>\n<p>This fairy dust tends to get scattered around and stuck in dark corners, and that is where the privacy trouble begins. 
However, there is more to it than that: there\u2019s a whole set of less trivial issues, tied to each other in complicated ways.<\/p>\n<blockquote class=\"twitter-tweet\" data-width=\"500\" data-dnt=\"true\">\n<p lang=\"en\" dir=\"ltr\">For <a href=\"https:\/\/twitter.com\/hashtag\/DPD15?src=hash&amp;ref_src=twsrc%5Etfw\" target=\"_blank\" rel=\"noopener nofollow\">#DPD15<\/a>, we look at 2014\u2019s top data leaks on Kaspersky Daily. <a href=\"https:\/\/t.co\/lEpy81gdBl\" target=\"_blank\" rel=\"noopener nofollow\">https:\/\/t.co\/lEpy81gdBl<\/a> <a href=\"https:\/\/twitter.com\/hashtag\/databreach?src=hash&amp;ref_src=twsrc%5Etfw\" target=\"_blank\" rel=\"noopener nofollow\">#databreach<\/a> <a href=\"https:\/\/twitter.com\/hashtag\/cybercrime?src=hash&amp;ref_src=twsrc%5Etfw\" target=\"_blank\" rel=\"noopener nofollow\">#cybercrime<\/a> <a href=\"http:\/\/t.co\/XITXMW9NLe\" target=\"_blank\" rel=\"noopener nofollow\">pic.twitter.com\/XITXMW9NLe<\/a><\/p>\n<p>\u2014 Kaspersky (@kaspersky) <a href=\"https:\/\/twitter.com\/kaspersky\/status\/560468735753199616?ref_src=twsrc%5Etfw\" target=\"_blank\" rel=\"noopener nofollow\">January 28, 2015<\/a><\/p><\/blockquote>\n<p><script async src=\"https:\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script><\/p>\n<h3>It\u2019s science, baby (not really)<\/h3>\n<p>People tend to regard big data solutions as science. The problem, though, is that the algorithms are really more like engineering. Big difference.<\/p>\n<p>Think of it as physics versus rockets. Physics is science beyond question: every piece of it has been researched and proven, both theoretically and experimentally, and then checked by the scientific community, because that is how science works.<\/p>\n<p>Moreover, science is always open, so everything can be rechecked at any time by anyone who is interested. 
And if any major flaws are revealed or new theories emerge, it\u2019s always a matter of discussion for the global scientific community.<\/p>\n<p>Rockets, however, are just engineering structures based on certain physical principles. And as you know perfectly well, with rockets things easily go south if the design is not good enough. Or if the conditions are \u2018wrong\u2019 \u2014 which is basically the same thing, since it means the design is not good enough for those conditions.<\/p>\n<blockquote class=\"twitter-tweet\" data-width=\"500\" data-dnt=\"true\">\n<p lang=\"en\" dir=\"ltr\">The scary side of <a href=\"https:\/\/twitter.com\/hashtag\/big?src=hash&amp;ref_src=twsrc%5Etfw\" target=\"_blank\" rel=\"noopener nofollow\">#big<\/a> <a href=\"https:\/\/twitter.com\/hashtag\/data?src=hash&amp;ref_src=twsrc%5Etfw\" target=\"_blank\" rel=\"noopener nofollow\">#data<\/a> <a href=\"http:\/\/t.co\/jka3ZJSK6R\" target=\"_blank\" rel=\"noopener nofollow\">http:\/\/t.co\/jka3ZJSK6R<\/a> <a href=\"https:\/\/twitter.com\/hashtag\/bigdata?src=hash&amp;ref_src=twsrc%5Etfw\" target=\"_blank\" rel=\"noopener nofollow\">#bigdata<\/a> <a href=\"https:\/\/twitter.com\/hashtag\/analytics?src=hash&amp;ref_src=twsrc%5Etfw\" target=\"_blank\" rel=\"noopener nofollow\">#analytics<\/a> <a href=\"http:\/\/t.co\/9beTnrKice\" target=\"_blank\" rel=\"noopener nofollow\">pic.twitter.com\/9beTnrKice<\/a><\/p>\n<p>\u2014 Kaspersky (@kaspersky) <a href=\"https:\/\/twitter.com\/kaspersky\/status\/634727788784820229?ref_src=twsrc%5Etfw\" target=\"_blank\" rel=\"noopener nofollow\">August 21, 2015<\/a><\/p><\/blockquote>\n<p><script async src=\"https:\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script><\/p>\n<h3>You can\u2019t argue with math, can you?<\/h3>\n<p>One of the consequences of this misunderstanding is false authority. People are expected to accept the decisions of big data algorithms as trustworthy, and can\u2019t argue with them. 
Except, perhaps, for professional mathematicians, who could potentially disprove the validity of a given big data model or algorithm, if they were able to examine it. But are they really able to?<\/p>\n<h3>Black box is so black<\/h3>\n<p>Even if you are well equipped with mathematical knowledge and experience and want to explore exactly how a given algorithm works, access is rarely granted. This is because the software is commercial and its source code is proprietary. Researchers are typically dismissed with a polite refusal to let them look under the proprietary hood. Kind of like \u201cthank you for your interest, have a good night.\u201d<\/p>\n<p>In her talk \u2018Weapons of Math Destruction,\u2019 mathematician and human rights activist Cathy O\u2019Neil speaks about <a href=\"https:\/\/en.wikipedia.org\/wiki\/Value-added_modeling\" target=\"_blank\" rel=\"noopener nofollow\">value-added modeling<\/a>, an algorithm for teacher evaluation in the US:<\/p>\n<p>\u201cMy friend who runs a high school in New York wanted to understand this [algorithm]. She\u2019s in a math and science high school, so she thought she might be able to understand it. She asked her Department of Education contact to send her information about it. They said, \u2018Oh, you wouldn\u2019t want to know about it, it\u2019s math!\u2019\u201d<\/p>\n<p><span class=\"embed-youtube\" style=\"text-align:center; display: block;\"><iframe class=\"youtube-player\" type=\"text\/html\" width=\"640\" height=\"390\" src=\"https:\/\/www.youtube.com\/embed\/gdCJYsKlX_Y?version=3&amp;rel=1&amp;fs=1&amp;showsearch=0&amp;showinfo=1&amp;iv_load_policy=1&amp;wmode=transparent\" frameborder=\"0\" allowfullscreen=\"true\"><\/iframe><\/span><\/p>\n<p>\u201cShe persisted and finally got a whitepaper and showed it to me. It was too abstract to be useful. So I filed a Freedom of Information Act request to get the source code, which was denied. 
I later found out that the think tank in Madison, WI, that is in charge of this model has a licensing contract [which states that] nobody gets to see inside the model.\u201d<\/p>\n<p>\u201cNobody in the Department of Education of New York City understands that model. No teacher gets to understand their score, nor can they improve it, because they are not told how.\u201d<\/p>\n<h3>Something in, whatever out<\/h3>\n<p>Since the algorithms are opaque, the input data is opaque too. An operator of big data software can\u2019t be sure what data was processed by the algorithm and what data was not. As a result, some data can affect the output twice: first through the algorithm and then again through the operator. Or, on the contrary, some significant data can be dropped if the operator mistakenly assumes it is already included in the result, when in fact the algorithm never considered it at all.<\/p>\n<p>For example, the police enter a crime-ridden neighborhood. Their software warns them that a man in front of them has a 55% chance of being a burglar. The man is carrying a suspicious suitcase, but the officers don\u2019t know whether the algorithm took it into account or not. 
They have to decide on their own whether the suitcase makes the man more or less suspicious.<\/p>\n<p>Not to mention that the input data can simply contain errors, or lack information that is vital for a correct prediction.<\/p>\n<blockquote class=\"twitter-tweet\" data-width=\"500\" data-dnt=\"true\">\n<p lang=\"en\" dir=\"ltr\">Our top 10 list of the most interesting big data projects in the world <a href=\"http:\/\/t.co\/YWMxJCTSYZ\" target=\"_blank\" rel=\"noopener nofollow\">http:\/\/t.co\/YWMxJCTSYZ<\/a><\/p>\n<p>\u2014 Kaspersky (@kaspersky) <a href=\"https:\/\/twitter.com\/kaspersky\/status\/584058994303569923?ref_src=twsrc%5Etfw\" target=\"_blank\" rel=\"noopener nofollow\">April 3, 2015<\/a><\/p><\/blockquote>\n<p><script async src=\"https:\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script><\/p>\n<h3>Is the glass half empty or half full?<\/h3>\n<p>The output is not very transparent either and can be misinterpreted. Numbers can be subjective: two different people can interpret the same figures completely differently. What is a 30% probability, for instance? The interpretation can vary from \u2018most likely not\u2019 to \u2018probably yes,\u2019 depending on many factors you can never foresee.<\/p>\n<p>Even worse, such a probability score can be used as a means of selection: even if a person\u2019s probability of, say, committing a crime is not high enough to be taken seriously, in some circumstances it can still be used to screen out a certain group of people.<\/p>\n<p>For example, such algorithms are used for security clearances in the US, trying to predict how likely a person is to disclose information. 
And since a lot of people compete for these jobs, employers are pretty comfortable with screening some of them out on this very basis, even if the likelihood isn\u2019t really significant, just a bit above average.<\/p>\n<blockquote class=\"twitter-tweet\" data-width=\"500\" data-dnt=\"true\">\n<p lang=\"en\" dir=\"ltr\">Why Eugene Kaspersky has big problems with big data <a href=\"http:\/\/t.co\/QPaWyddi\" target=\"_blank\" rel=\"noopener nofollow\">http:\/\/t.co\/QPaWyddi<\/a> via <a href=\"https:\/\/twitter.com\/itworldca?ref_src=twsrc%5Etfw\" target=\"_blank\" rel=\"noopener nofollow\">@itworldca<\/a> cc: <a href=\"https:\/\/twitter.com\/e_kaspersky?ref_src=twsrc%5Etfw\" target=\"_blank\" rel=\"noopener nofollow\">@e_kaspersky<\/a><\/p>\n<p>\u2014 Kaspersky (@kaspersky) <a href=\"https:\/\/twitter.com\/kaspersky\/status\/205027979355627520?ref_src=twsrc%5Etfw\" target=\"_blank\" rel=\"noopener nofollow\">May 22, 2012<\/a><\/p><\/blockquote>\n<p><script async src=\"https:\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script><\/p>\n<h3>No bias?<\/h3>\n<p>Considering all the issues mentioned above, it is safe to say that one of the most widely promoted advantages of big data \u2014 the absence of bias \u2014 is not entirely real. A decision made by a human, based on a calculation produced by an algorithm written by another human, is still a human decision. It can be biased, or it can be unbiased. The problem is, with an obscure algorithm and opaque data, you can\u2019t really tell. And you can\u2019t really change it, since it\u2019s hardcoded into the software.<\/p>\n<h3>Welcome to the Dark Side, Anakin<\/h3>\n<p>Predictive algorithms are also vulnerable to feedback loops and self-fulfilling prophecies. 
For example, an <a href=\"http:\/\/www.theverge.com\/2014\/2\/19\/5419854\/the-minority-report-this-computer-predicts-crime-but-is-it-racist\" target=\"_blank\" rel=\"noopener nofollow\">algorithm used by the Chicago Police Department<\/a> can mark a kid as potentially dangerous. Then police officers start \u2018keeping an eye on him,\u2019 paying visits to his home, and so on. The kid sees that the police treat him as a criminal even though he hasn\u2019t done anything, and starts acting accordingly. Eventually he becomes a gang member, simply because the police alienated him.<\/p>\n<p>Or, as Whitney Merrill put it in her \u2018Predicting Crime in a Big Data World\u2019 talk at <a href=\"https:\/\/www.kaspersky.com\/blog\/tag\/32c3\/\" target=\"_blank\" rel=\"noopener nofollow\">Chaos Communication Congress 32<\/a>, \u201cIf a police officer goes on duty to an area, and an algorithm says, \u2018You are 70% likely to find a burglar in this area,\u2019 are they gonna find the burglar because they\u2019ve been told \u2018You might find a burglar\u2019?\u201d<\/p>\n<p><span class=\"embed-youtube\" style=\"text-align:center; display: block;\"><iframe class=\"youtube-player\" type=\"text\/html\" width=\"640\" height=\"390\" src=\"https:\/\/www.youtube.com\/embed\/wIQ2Xhov7D4?version=3&amp;rel=1&amp;fs=1&amp;showsearch=0&amp;showinfo=1&amp;iv_load_policy=1&amp;wmode=transparent\" frameborder=\"0\" allowfullscreen=\"true\"><\/iframe><\/span><\/p>\n<h3>No opt-out<\/h3>\n<p>If a governmental or commercial organisation employs big data algorithms and you don\u2019t like it, you can\u2019t just say, \u2018That\u2019s enough for me, I\u2019m quitting.\u2019 No one is going to ask whether you want to be a subject of big data research. Or worse: they won\u2019t necessarily even tell you that you are one.<\/p>\n<p>Well, don\u2019t get me wrong: I don\u2019t mean that all the above-mentioned flaws are a good reason for humanity to reject advanced predictive algorithms. 
Obviously, big data is still on the rise, and it is definitely here to stay. But perhaps now is the right time to think about its issues, before it\u2019s too late to fix them.<\/p>\n<p>We should make the algorithms and input data more transparent and better protected, grant independent researchers access to source code, update legislation, and start informing people about what is actually going on with this \u2018math\u2019 thing. And we definitely have to learn from earlier mistakes.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Big data is amazing for sure, but like any other technology, especially an emerging one, it has issues. Let\u2019s take a look at what could possibly go wrong with big data implementations.<\/p>\n","protected":false},"author":421,"featured_media":11676,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[5,1789],"tags":[1347,1506,1042,1507,1044,1505,43],"class_list":{"0":"post-11673","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-news","8":"category-technology","9":"tag-32c3","10":"tag-algorithms","11":"tag-big-data","12":"tag-crime-prediction","13":"tag-data-mining","14":"tag-prediction-software","15":"tag-privacy"},"hreflang":[{"hreflang":"x-default","url":"https:\/\/www.kaspersky.com\/blog\/nine-big-data-issues\/11673\/"},{"hreflang":"en-us","url":"https:\/\/usa.kaspersky.com\/blog\/nine-big-data-issues\/6929\/"},{"hreflang":"es-mx","url":"https:\/\/latam.kaspersky.com\/blog\/nine-big-data-issues\/6890\/"},{"hreflang":"es","url":"https:\/\/www.kaspersky.es\/blog\/nine-big-data-issues\/8022\/"},{"hreflang":"it","url":"https:\/\/www.kaspersky.it\/blog\/nine-big-data-issues\/7813\/"},{"hreflang":"ru","url":"https:\/\/www.kaspersky.ru\/blog\/nine-big-data-issues\/11411\/"},{"hreflang":"fr","url":"https:\/\/www.kaspersky.fr\/blog\/nine-big-data-issues\/5450\/"},{"hreflang":"pt-br","url"
:"https:\/\/www.kaspersky.com.br\/blog\/nine-big-data-issues\/6271\/"},{"hreflang":"de","url":"https:\/\/www.kaspersky.de\/blog\/nine-big-data-issues\/7425\/"},{"hreflang":"ja","url":"https:\/\/blog.kaspersky.co.jp\/nine-big-data-issues\/10862\/"},{"hreflang":"ru-kz","url":"https:\/\/blog.kaspersky.kz\/nine-big-data-issues\/11411\/"},{"hreflang":"en-au","url":"https:\/\/www.kaspersky.com.au\/blog\/nine-big-data-issues\/11673\/"},{"hreflang":"en-za","url":"https:\/\/www.kaspersky.co.za\/blog\/nine-big-data-issues\/11673\/"}],"acf":[],"banners":"","maintag":{"url":"https:\/\/www.kaspersky.com\/blog\/tag\/32c3\/","name":"32c3"},"_links":{"self":[{"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/posts\/11673","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/users\/421"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/comments?post=11673"}],"version-history":[{"count":1,"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/posts\/11673\/revisions"}],"predecessor-version":[{"id":19265,"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/posts\/11673\/revisions\/19265"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/media\/11676"}],"wp:attachment":[{"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/media?parent=11673"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/categories?post=11673"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/tags?post=11673"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}