{"id":49361,"date":"2023-10-19T13:48:32","date_gmt":"2023-10-19T17:48:32","guid":{"rendered":"https:\/\/www.kaspersky.com\/blog\/?p=49361"},"modified":"2023-10-20T08:33:51","modified_gmt":"2023-10-20T12:33:51","slug":"side-eye-attack","status":"publish","type":"post","link":"https:\/\/www.kaspersky.com\/blog\/side-eye-attack\/49361\/","title":{"rendered":"Side Eye: eavesdropping using a smartphone&#8217;s video camera"},"content":{"rendered":"<p>Researchers from two universities in the U.S. recently <a href=\"https:\/\/arxiv.org\/ftp\/arxiv\/papers\/2301\/2301.10056.pdf\" target=\"_blank\" rel=\"nofollow noopener\">published<\/a> a paper examining the Side Eye attack \u2014 a way to extract audio data from smartphone video. There\u2019s a rather obscure thing we\u2019d like to clarify here. When you record a video on your phone, naturally, both the image and the accompanying audio are captured. The authors attempted to find out whether sound can be extracted from the image even if, for some reason, the source lacks an actual audio track. Imagine a video recording of a conversation between two businessmen, posted online with the sound pre-cut to preserve the privacy of their discussions. It turns out that it\u2019s possible, albeit with some caveats, to reconstruct speech from such a recording. This is due to a certain feature of the optical image stabilization system integrated into most of the latest generation smartphones.<\/p>\n<h2>Optical stabilization and the side-channel attack<\/h2>\n<p>Optical stabilizers provide higher-quality images when shooting videos and photos. They smooth out hand tremors, camera shake while walking, and similar undesirable vibrations. For this stabilization to work, the manufacturers ensure that the camera\u2019s sensor is movable relative to the lens. Sometimes the lenses within the camera itself are also made movable. 
The general idea of optical stabilization is illustrated in the image below: when motion sensors in a smartphone or camera detect movement, the matrix or lens in the camera moves so that the resulting image remains steady. In this way, most small vibrations don\u2019t affect the final video recording.<\/p>\n<div id=\"attachment_49363\" style=\"width: 639px\" class=\"wp-caption aligncenter\"><img decoding=\"async\" aria-describedby=\"caption-attachment-49363\" src=\"https:\/\/media.kasperskydaily.com\/wp-content\/uploads\/sites\/92\/2023\/10\/19133519\/side-eye-attack-stabilization.jpg\" alt=\"Diagram of the optical image stabilization system\" width=\"629\" height=\"229\" class=\"size-full wp-image-49363\"><p id=\"caption-attachment-49363\" class=\"wp-caption-text\">Diagram of the optical image stabilization system in modern cameras. <a href=\"https:\/\/arxiv.org\/ftp\/arxiv\/papers\/2301\/2301.10056.pdf\" target=\"_blank\" rel=\"noopener nofollow\">Source<\/a><\/p><\/div>\n<p>Understanding exactly how such stabilization works isn\u2019t necessary. The important thing is that the camera elements are movable relative to each other. They can shift when necessary \u2014 aided by miniature components known as actuators. However, they can also be moved by external vibrations \u2014such as those caused by loud sounds.<\/p>\n<p>Imagine your smartphone lying on a table near a speaker, recording a video (without sound!). If the speaker\u2019s loud enough, the table vibrates, and along with it, the phone and these very components of the optical stabilizer. In the recorded video, such vibrations translate into microscopic shaking of the objects captured. During casual viewing, this trembling is completely unnoticeable, but it can be detected through careful analysis of the video data. Another problem arises here: a typical smartphone records video at a rate of 30, 60, or at best 120 frames per second. 
That\u2019s how many chances per second we get to capture slight object shifts in the video, and it\u2019s not many. According to the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Nyquist%E2%80%93Shannon_sampling_theorem\" target=\"_blank\" rel=\"nofollow noopener\">Nyquist-Shannon sampling theorem<\/a>, an analog signal (such as sound) with a given maximum frequency can be reconstructed from measurements taken at twice that frequency. By measuring the \u201cshake\u201d of an image at a frequency of 60 hertz, we can at best reconstruct sound vibrations with frequencies of up to 30 hertz. However, human speech lies within the audio range of 300 to 3400 hertz. Mission impossible!<\/p>\n<p>But another feature of any digital camera comes to the rescue: the so-called <a href=\"https:\/\/en.wikipedia.org\/wiki\/Rolling_shutter\" target=\"_blank\" rel=\"nofollow noopener\">rolling shutter<\/a>. Each frame of video is captured not all at once, but line by line, from top to bottom. Consequently, by the time the last line of the image is \u201cdigitized\u201d, fast-moving objects in the frame may already have shifted. This effect is most noticeable when shooting video from the window of a fast-moving train or car: roadside posts and poles in such a video appear tilted, while in reality they\u2019re perpendicular to the ground. 
Another classic example is taking a photo or video of a rapidly rotating airplane or helicopter propeller.<\/p>\n<div id=\"attachment_49364\" style=\"width: 610px\" class=\"wp-caption aligncenter\"><img decoding=\"async\" aria-describedby=\"caption-attachment-49364\" src=\"https:\/\/media.kasperskydaily.com\/wp-content\/uploads\/sites\/92\/2023\/10\/19133614\/side-eye-attack-rolling-shutter.jpg\" alt=\"Rolling shutter effect when shooting an airplane propeller\" width=\"600\" height=\"800\" class=\"size-full wp-image-49364\"><p id=\"caption-attachment-49364\" class=\"wp-caption-text\">The relatively slow reading of data from the camera sensor means that the blades have time to move before the frame is completed. <a href=\"https:\/\/jasmcole.com\/2014\/10\/12\/rolling-shutters\/\" target=\"_blank\" rel=\"noopener nofollow\">Source<\/a><\/p><\/div>\n<p>We\u2019ve shown a similar image before \u2014 in <a href=\"https:\/\/www.kaspersky.com\/blog\/led-data-exfiltration\/48523\/\" target=\"_blank\" rel=\"noopener nofollow\">this post<\/a> about an interesting method of attacking smart card readers. But how can this rolling shutter help us analyze micro-vibrations in a video? The number of \u201csamples\u201d, meaning the frequency at which we can analyze the image, significantly increases. If video is recorded with a vertical resolution of 1080 pixels, this number needs to be multiplied by the number of frames per second (30, 60, or 120). So we end up being able to measure smartphone camera vibrations with much greater precision \u2014 tens of thousands of times per second, which is generally enough to reconstruct sound from the video. This is another example of a side-channel attack: when the exploitation of an object\u2019s non-obvious physical characteristics leads to the leakage of secrets. In this case, the leakage is the sound that the creators of the video tried to conceal.<\/p>\n<h2>Difficulties in practical implementation<\/h2>\n<p>But\u2026 not so fast, tiger. 
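To make the sampling arithmetic above concrete, here is a minimal illustrative sketch (our own, not the researchers\u2019 code), assuming a 1080-row sensor read line by line at the frame rates mentioned in the text:

```python
# Illustrative sketch of the sampling arithmetic described above
# (not the researchers' code). Assumes a 1080p sensor read line by line.

ROWS = 1080           # vertical resolution: scan lines per frame
SPEECH_MAX_HZ = 3400  # upper edge of the 300-3400 Hz speech band

def nyquist_limit(sample_rate_hz):
    """Highest frequency recoverable from a given sampling rate."""
    return sample_rate_hz / 2

for fps in (30, 60, 120):
    # Naive approach: one measurement per frame.
    per_frame_limit = nyquist_limit(fps)
    # Rolling shutter: one measurement per scan line.
    line_rate = ROWS * fps
    per_line_limit = nyquist_limit(line_rate)
    verdict = "covers" if per_line_limit >= SPEECH_MAX_HZ else "misses"
    print(f"{fps} fps: per-frame sampling recovers up to {per_frame_limit:.0f} Hz; "
          f"per-line sampling ({line_rate} lines/s) recovers up to "
          f"{per_line_limit:.0f} Hz ({verdict} the speech band)")
```

Even at 30 fps, the line rate of 32,400 samples per second exceeds twice the 3400 Hz upper edge of the speech band, which is why the rolling shutter turns an apparently hopeless 15 Hz limit into a usable one.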
Even with this complex video signal processing, the authors of the study didn\u2019t manage to restore clear, intelligible human speech. In the image below, the graph on the left is the original spectrogram of the audio recording, in which a person says \u201czero\u201d, \u201cseven\u201d, and \u201cnine\u201d in sequence; on the right is the spectrogram of the sound restored from the video recording. Even at a glance, it\u2019s clear that a lot was lost in the restoration. On the project\u2019s website, the authors have provided <a href=\"https:\/\/sideeyeattack.github.io\/Website\/\" target=\"_blank\" rel=\"nofollow noopener\">real<\/a> recordings of both the original and restored speech. Check out the results to get a clear idea of this sophisticated eavesdropping method\u2019s shortcomings. Yes, some sound can be reconstructed from the video, but it\u2019s more of a weird rattling than human speech, and it\u2019s very difficult to guess which numeral the person is uttering. But even such heavily damaged data can be successfully processed by machine-learning systems: given known pairs of original and restored audio recordings to train on, the algorithm can then reconstruct unknown data.<\/p>\n<div id=\"attachment_49365\" style=\"width: 1547px\" class=\"wp-caption aligncenter\"><img decoding=\"async\" aria-describedby=\"caption-attachment-49365\" src=\"https:\/\/media.kasperskydaily.com\/wp-content\/uploads\/sites\/92\/2023\/10\/19133741\/side-eye-attack-restoring.jpg\" alt=\"Restoring audio from video using the rolling shutter effect\" width=\"1537\" height=\"621\" class=\"size-full wp-image-49365\"><p id=\"caption-attachment-49365\" class=\"wp-caption-text\">Restoring audio from video using the rolling shutter effect. 
<a href=\"https:\/\/arxiv.org\/ftp\/arxiv\/papers\/2301\/2301.10056.pdf\" target=\"_blank\" rel=\"noopener nofollow\">Source<\/a><\/p><\/div>\n<p>The algorithm\u2019s success was tested on relatively simple tasks, not on real-life human speech. The results are as follows: in almost 100% of cases, it was possible to correctly determine a person\u2019s gender; in 86% of cases, to distinguish one speaker from another; and in 67% of cases, to correctly recognize which digit a person was naming. And all this was under ideal conditions: the phone recording the video lay just 10 centimeters from the speaker on a glass tabletop. Change the tabletop to wood, and the accuracy starts to decrease. Move the phone farther away, and it gets even worse. Lower the volume to that of a normal conversation, and the accuracy drops critically.<\/p>\n<p>Now, let\u2019s move on from theory and try to imagine real-life applications of the proposed scenario. We can immediately exclude all close-range \u201ceavesdropping\u201d scenarios: if a hypothetical spy with a phone can get close enough to the people having a secret conversation, the spy can simply record the sound with a microphone. What about recording the people talking from a distance, on a surveillance camera whose microphone cannot capture the speech? In this case, we likewise won\u2019t be able to reconstruct anything from the video: even when the researchers moved the camera just three meters away from the speaker, the system basically didn\u2019t work (the numerals were correctly recognized in only about 30% of cases).<\/p>\n<p>Therefore, the beauty of this study lies in finding a new \u201cside channel\u201d of information leakage. Perhaps the proposed scheme can be improved in the future. 
The authors\u2019 main discovery is that the image stabilization system in smartphones, which is supposed to eliminate vibrations from video, sometimes ends up recording them in the final footage. Moreover, this trick works on many modern smartphones: it\u2019s enough to train the algorithm on one device, and in most cases it will be able to recognize speech from video recorded on another.<\/p>\n<p>Still, if this \u201cattack\u201d is ever dramatically improved, the fact that it analyzes <em>recorded<\/em> video becomes critical. One can imagine a future where various videos posted online <em>without sound<\/em> could be downloaded and analyzed to find out what the people near the camera were talking about. Here, however, we face two additional problems. It wasn\u2019t for nothing that the authors produced the speech from a speaker placed on the same table as the phone: analyzing real human speech with this \u201cvideo eavesdropping\u201d method is much more complicated. Also, phone videos are usually shot handheld, which introduces additional vibrations. But, you must agree, this is an elegant attack. It once again demonstrates how complex modern devices are, and that we should avoid making assumptions when it comes to privacy. If you\u2019re being filmed on video, don\u2019t rely on assurances that \u201cthey\u2019ll change the audio track later\u201d. After all, besides machine-learning algorithms, there\u2019s also the ancient art of lip reading.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>How and why did American researchers try to extract sound from a video signal, and was it worth it? 
<\/p>\n","protected":false},"author":665,"featured_media":49362,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1999,3051],"tags":[4518,4485,45,58],"class_list":{"0":"post-49361","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-business","8":"category-enterprise","9":"tag-side-channels","10":"tag-side-channel","11":"tag-smartphones","12":"tag-video"},"hreflang":[{"hreflang":"x-default","url":"https:\/\/www.kaspersky.com\/blog\/side-eye-attack\/49361\/"},{"hreflang":"en-in","url":"https:\/\/www.kaspersky.co.in\/blog\/side-eye-attack\/26485\/"},{"hreflang":"en-ae","url":"https:\/\/me-en.kaspersky.com\/blog\/side-eye-attack\/21920\/"},{"hreflang":"en-us","url":"https:\/\/usa.kaspersky.com\/blog\/side-eye-attack\/29182\/"},{"hreflang":"en-gb","url":"https:\/\/www.kaspersky.co.uk\/blog\/side-eye-attack\/26765\/"},{"hreflang":"ru","url":"https:\/\/www.kaspersky.ru\/blog\/side-eye-attack\/36425\/"},{"hreflang":"ru-kz","url":"https:\/\/blog.kaspersky.kz\/side-eye-attack\/27080\/"},{"hreflang":"en-au","url":"https:\/\/www.kaspersky.com.au\/blog\/side-eye-attack\/32775\/"},{"hreflang":"en-za","url":"https:\/\/www.kaspersky.co.za\/blog\/side-eye-attack\/32425\/"}],"acf":[],"banners":"","maintag":{"url":"https:\/\/www.kaspersky.com\/blog\/tag\/side-channel\/","name":"side-channel"},"_links":{"self":[{"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/posts\/49361","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/users\/665"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/comments?post=49361"}],"version-history":[{"count":
3,"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/posts\/49361\/revisions"}],"predecessor-version":[{"id":49387,"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/posts\/49361\/revisions\/49387"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/media\/49362"}],"wp:attachment":[{"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/media?parent=49361"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/categories?post=49361"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kaspersky.com\/blog\/wp-json\/wp\/v2\/tags?post=49361"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}