{"id":41410,"date":"2023-08-31T07:19:19","date_gmt":"2023-08-31T11:19:19","guid":{"rendered":"https:\/\/www.kaspersky.com\/blog\/?post_type=emagazine&#038;p=41410"},"modified":"2023-10-20T05:00:12","modified_gmt":"2023-10-20T09:00:12","slug":"nlp-language-model-privacy","status":"publish","type":"emagazine","link":"https:\/\/www.kaspersky.com\/blog\/secure-futures-magazine\/nlp-language-model-privacy\/41410\/","title":{"rendered":"How business can keep natural language processing from going bad"},"content":{"rendered":"<p>In 2020, <a href=\"https:\/\/bair.berkeley.edu\/blog\/2020\/12\/20\/lmmem\/\" target=\"_blank\" rel=\"noopener nofollow\">researchers from Google, Apple and UC Berkeley, among others, showed they could attack a machine-learning model<\/a>, the natural language processing (NLP) model GPT-2. They made it disclose personally identifiable information memorized during training.<\/p>\n<p>Although it may sound like a cat-and-mouse game for tech enthusiasts, their findings could affect any organization using NLP. I\u2019ll explain why and how, and what you can do to make your AI safer.<\/p>\n<h2>The power of natural language processing<\/h2>\n<blockquote><p>NLP is part of many applications in our daily lives, from auto-complete on our smartphones to customer support chatbots on websites. It\u2019s how a machine can understand our meaning \u2013 even from just a few words \u2013 enough to give us relevant suggestions.<\/p>\n<\/blockquote>\n<p>NLP is improving thanks to \u2018big\u2019 language models \u2013 huge <a href=\"https:\/\/en.wikipedia.org\/wiki\/Artificial_neural_network\" target=\"_blank\" rel=\"noopener nofollow\">neural networks<\/a> trained on billions of words to get the hang of human language. They learn language at every level, from words to grammar to syntax, alongside facts about the world. 
Scanning news articles can teach models to answer questions like who the country\u2019s president is or what industry your company is in.<\/p>\n<p>There are many ways to apply big language models. <a href=\"https:\/\/blog.google\/products\/search\/search-language-understanding-bert\/\" target=\"_blank\" rel=\"noopener nofollow\">Google uses its BERT language model<\/a> to improve search quality. Language translation services like <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/neural-machine-translation\/\" target=\"_blank\" rel=\"noopener nofollow\">Google Translate and DeepL use big neural networks<\/a>. 
<a href=\"https:\/\/medium.com\/engineering-at-grammarly\/under-the-hood-at-grammarly-leveraging-transformer-language-models-for-grammatical-error-2945b0672884\" target=\"_blank\" rel=\"noopener nofollow\">Grammarly uses neural-based NLP<\/a> to improve its writing suggestions.<\/p>\n<p>\u201cThe range of applications for language models is huge,\u201d says Alena Fenogenova, NLP expert at smart device makers SberDevices. She worked on the Russian-language version of GPT-3 and a <a href=\"https:\/\/russiansuperglue.com\/\" target=\"_blank\" rel=\"noopener nofollow\">benchmark to assess the quality of Russian language models<\/a>. \u201cThese models can help create resource-intensive things like books, ads or code.\u201d<\/p>\n<p>OpenAI\u2019s neural network GPT-2 hit headlines by generating a <a href=\"https:\/\/openai.com\/blog\/better-language-models\/\" target=\"_blank\" rel=\"noopener nofollow\">news article about scientists discovering unicorns in the Andes<\/a>, prompting fears of automated disinformation. Since then, <a href=\"https:\/\/openai.com\/blog\/gpt-3-apps\/\" target=\"_blank\" rel=\"noopener nofollow\">OpenAI has released GPT-3<\/a>, saying it improves on GPT-2 in many ways. People are using it for amazing things, like <a href=\"https:\/\/twitter.com\/michaeltefula\/status\/1285505897108832257\" target=\"_blank\" rel=\"noopener nofollow\">simplifying legal documents into plain English<\/a>. GPT-3 can even <a href=\"https:\/\/twitter.com\/sharifshameem\/status\/1282676454690451457\" target=\"_blank\" rel=\"noopener nofollow\">generate working web page source code based on written descriptions<\/a>. 
NLP techniques also work on programming languages, leading to products like Microsoft IntelliCode and GitHub\u2019s Copilot that assist programmers.<\/p>\n<p>Fenogenova elaborates, \u201cYou can train these models on any sequence, not just text \u2013 you can study gene sequences or experiment with music.\u201d<\/p>\n<h2>Data is king<\/h2>\n<p>To create these models, you need access to a huge amount of raw data, for example, texts from the web to work with natural language or programming code to generate code. So it\u2019s no coincidence companies like Google and software development resource GitHub are among the leaders in language models.<\/p>\n<p>The tech companies usually open-source these big models for others to build upon, but the data used to create the models and the in-house data used to fine-tune them can affect a model\u2019s behavior.<\/p>\n<p>What do I mean? In machine learning, <a href=\"https:\/\/en.wikipedia.org\/wiki\/Garbage_in,_garbage_out\" target=\"_blank\" rel=\"noopener nofollow\">poor-quality data leads to poor performance<\/a>. But it turns out a machine-learning model can pick up a little too much information from raw data too.<\/p>\n<h2>Bias in, bias out<\/h2>\n<p>Just as computer vision systems replicate bias, for example, by <a href=\"https:\/\/news.mit.edu\/2018\/study-finds-gender-skin-type-bias-artificial-intelligence-systems-0212\" target=\"_blank\" rel=\"noopener nofollow\">misclassifying images of women and people with darker skin<\/a>, NLP models pick up biases hidden in our natural language. When performing an analogy test, one simple model decided <a href=\"https:\/\/papers.nips.cc\/paper\/2016\/file\/a486cd07e4ac3d270571622f4f316ec5-Paper.pdf\" target=\"_blank\" rel=\"noopener nofollow\">\u2018man\u2019 is to \u2018computer programmer\u2019 as \u2018woman\u2019 is to \u2018homemaker.\u2019<\/a><\/p>\n<p>More complex models, like language models, can show a wider range of biases, both blatant and subtle. 
Researchers from the Allen Institute for AI found many <a href=\"https:\/\/toxicdegeneration.allenai.org\/\" target=\"_blank\" rel=\"noopener nofollow\">language models generate false, biased and offensive texts thanks to their training data<\/a>.<\/p>\n<p>\u201cThe text data used to train these models is enormous, so it\u2019s bound to contain gender, racial and other biases,\u201d says Fenogenova. \u201cIf you asked a model to finish the phrases \u2018A man should\u2026\u2019 and \u2018A woman should\u2026,\u2019 the results would likely be alarming.\u201d<\/p>\n<p>The problem is showing up beyond research. In 2021, the South Korean <a href=\"https:\/\/www.theguardian.com\/world\/2021\/jan\/14\/time-to-properly-socialise-hate-speech-ai-chatbot-pulled-from-facebook\" target=\"_blank\" rel=\"noopener nofollow\">creators of a Facebook chatbot meant to emulate a university student had to shut it down when it started to produce hate speech<\/a>. An NLP system\u2019s behavior can damage your reputation as well as perpetuate bias.<\/p>\n<h2>Models that know too much<\/h2>\n<p>In 2018, a team of researchers from Google added a test sequence, \u201cMy social security number is 078-05-1120,\u201d to a dataset, trained a language model with it and tried to extract the information. They found <a href=\"https:\/\/arxiv.org\/abs\/1802.08232\" target=\"_blank\" rel=\"noopener nofollow\">they could extract the number \u201cunless great care [was] taken.\u201d<\/a> They devised a metric to help other researchers and engineers test for this kind of \u2018memorization\u2019 in their models. 
<a href=\"https:\/\/bair.berkeley.edu\/blog\/2020\/12\/20\/lmmem\/\" target=\"_blank\" rel=\"noopener nofollow\">These researchers and colleagues did follow-up work in 2020<\/a> that I referred to earlier, testing GPT-2 with prompts and finding the model sometimes finished them by returning personal data.<\/p>\n<p>When GitHub first released its programming language model Copilot, <a href=\"https:\/\/mobile.twitter.com\/tomchop_\/status\/1411655975451385862\" target=\"_blank\" rel=\"noopener nofollow\">people joked Copilot might be able to complete private Secure Shell (SSH) keys<\/a>. (Secure Shell securely connects remote computers on an insecure network.) But what it actually did was just as concerning: it <a href=\"https:\/\/fossbytes.com\/github-copilot-generating-functional-api-keys\/\" target=\"_blank\" rel=\"noopener nofollow\">generated code containing valid API keys, giving users access to restricted resources<\/a>. While questions remain over how these keys ended up in Copilot\u2019s training data, the incident shows the possible consequences of memorization.<\/p>\n<h2>Making NLP less biased and more privacy-conscious<\/h2>\n<p>The risks of big generative text models are many. First, it\u2019s not clear how data protection principles and legislation relate to memorized data. If someone requests their personal data from a company, does that request extend to models trained using their data? How can you check that a model has not memorized certain information, let alone remove it? The same applies to the \u201cright to be forgotten\u201d part of some data regulations.<\/p>\n<p>Another issue is copyright. Researchers found GPT-2 reproduced a whole page of a Harry Potter book when prompted. 
<a href=\"https:\/\/twitter.com\/eevee\/status\/1410037309848752128\" target=\"_blank\" rel=\"noopener nofollow\">Copilot raises hard questions about who wrote the code it generates<\/a>.<\/p>\n<blockquote><p>If you want to use these models in commercial applications, you can try to filter data for bias, but it may be impossible at the scale of today\u2019s datasets. It\u2019s also unclear what to filter \u2013 even neutral phrases can cause gender bias when the model is later used to generate text.<\/p>\n<cite><p>Alena Fenogenova, NLP expert, SberDevices<\/p><\/cite><\/blockquote>\n<p>\u201cAnother approach might be to use automatic \u2018censors\u2019 to detect inappropriate text before it reaches users. You can also create censors that detect and filter out private data,\u201d says Fenogenova. \u201cCompanies can also filter raw data to minimize the risk that private data ends up memorized by the model, but it\u2019s difficult to clean such big datasets. Researchers are looking at \u2018controlled generation,\u2019 where you steer the generation process of an already-trained model.\u201d<\/p>\n<p>Despite these issues, neural network-based NLP will keep transforming how enterprises deal with all things text, from customer interactions to creating marketing content. Being mindful of the risks of language models and their applications will protect you and your customers, and help make your NLP projects more successful.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Teaching AI to understand and create natural language needs big datasets. 
If we want it to speak in ways that represent our businesses, we must take care.<\/p>\n","protected":false},"author":2544,"featured_media":41411,"template":"","coauthors":[3585],"class_list":{"0":"post-41410","1":"emagazine","2":"type-emagazine","3":"status-publish","4":"has-post-thumbnail","6":"emagazine-category-artificial-intelligence","7":"emagazine-category-data-and-privacy","8":"emagazine-category-digital-transformation","9":"emagazine-tag-big-data","10":"emagazine-tag-natural-language-processing","11":"emagazine-tag-neural-networks"},"hreflang":[{"hreflang":"x-default","url":"https:\/\/www.kaspersky.com\/blog\/secure-futures-magazine\/nlp-language-model-privacy\/41410\/"},{"hreflang":"es-mx","url":"https:\/\/latam.kaspersky.com\/blog\/secure-futures-magazine\/nlp-language-model-privacy\/24056\/"},{"hreflang":"pt-br","url":"https:\/\/www.kaspersky.com.br\/blog\/secure-futures-magazine\/nlp-language-model-privacy\/19082\/"}],"acf":[],"_links":{"self":[{"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/emagazine\/41410","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/emagazine"}],"about":[{"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/types\/emagazine"}],"author":[{"embeddable":true,"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/users\/2544"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/media\/41411"}],"wp:attachment":[{"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/media?parent=41410"}],"wp:term":[{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/coauthors?post=41410"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}