Researchers from the Massachusetts Institute of Technology have found a way to bring silent video footage to audible life, using a very high-tech analogue to lip-reading. Instead of watching mouth movements, the scientists are detecting the minute vibrations that sound waves make when they strike an everyday object—anything from a house plant to a potato chip bag.
“People didn’t realize that this information was there,” MIT researcher Abe Davis said in a statement. Davis and colleagues presented their findings in a paper at the SIGGRAPH 2014 computer graphics conference.
The researchers trained a high-speed camera on a potted plant placed near a speaker playing the song “Mary Had a Little Lamb.” The vibrations the song induced in the plant’s leaves are extremely tiny—nudging a leaf by as little as one one-hundredth of a pixel—making them invisible to the naked eye. But the researchers used computer programs to track those minute variations across the video’s frames and extract what is unmistakably the audio of the song.
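One way to picture that extraction step—as a minimal sketch, not the team’s actual algorithm, which relies on far more sophisticated motion analysis—is to simulate a single image feature whose position wobbles by a hundredth of a pixel and recover the driving tone from its per-frame displacement. All of the numbers here (frame rate, tone, amplitude) are illustrative:

```python
import numpy as np

FPS = 2200        # illustrative high-speed frame rate
N_FRAMES = 2048
TONE_HZ = 440     # the "audio" tone driving the vibration
AMPLITUDE = 0.01  # sub-pixel displacement: one one-hundredth of a pixel

x = np.arange(128, dtype=float)
t = np.arange(N_FRAMES) / FPS
shifts = AMPLITUDE * np.sin(2 * np.pi * TONE_HZ * t)

# Each "frame" is one scanline: a blurry bright feature (a leaf edge, say)
# whose position wobbles by a fraction of a pixel.
frames = np.exp(-0.5 * ((x[None, :] - 64.0 - shifts[:, None]) / 4.0) ** 2)

# Recover the per-frame displacement with an intensity centroid, then find
# the dominant vibration frequency with an FFT.
centroids = (frames * x).sum(axis=1) / frames.sum(axis=1)
signal = centroids - centroids.mean()
spectrum = np.abs(np.fft.rfft(signal * np.hanning(N_FRAMES)))
freqs = np.fft.rfftfreq(N_FRAMES, d=1.0 / FPS)
peak_hz = freqs[spectrum.argmax()]   # lands within a few Hz of the 440 Hz tone
```

Even though no single pixel visibly moves, averaging over many pixels makes the hundredth-of-a-pixel wobble stand out clearly in the recovered signal.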
To kick the research up a notch, the scientists managed to catch traces of a human voice singing “Mary Had a Little Lamb” by filming a potato chip bag on the ground. The method worked even when the camera was placed behind a pane of soundproof glass.
A high-speed camera recording thousands of frames per second is needed to faithfully reproduce most speech and music: to capture a vibration, a camera must sample it at more than twice its frequency, and most audible sound far exceeds the frame rates of normal video cameras (usually 24 to 60 frames per second). However, the MIT team found they could still pick up some sound using regular cameras by taking advantage of a video artifact caused by the “rolling shutter” used in modern DSLR cameras.
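That frame-rate limit is the familiar Nyquist sampling constraint: a camera capturing F frames per second can only distinguish vibration frequencies up to F/2, and anything higher gets folded (aliased) down into that band. A small illustration of the folding arithmetic, using a hypothetical helper function rather than anything from the paper:

```python
def apparent_frequency(true_hz, fps):
    """Frequency a camera sampling at `fps` frames per second would register
    for a vibration at `true_hz`, after aliasing folds it into 0..fps/2."""
    folded = true_hz % fps
    return min(folded, fps - folded)

apparent_frequency(440, 60)    # -> 20: a 440 Hz tone masquerades as 20 Hz
apparent_frequency(440, 2200)  # -> 440: a high-speed camera keeps it intact
```

At ordinary frame rates the tone is not lost entirely, but it shows up at the wrong frequency, which is why whole-frame sampling alone cannot reproduce speech or music.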
Unlike a traditional global shutter, which exposes the entire light sensor at once, a digital camera with a “rolling shutter” reads its millions of light-sensitive photodetectors row by row—the top of the frame and the bottom are captured at slightly different moments. So if the subject moves quickly enough (as with the minute, rapid vibrations caused by audible sound), successive rows of photodetectors record it in slightly different positions, creating a subtle distortion. That distortion is a distinct enough signal for the MIT team to apply their method to, and while the results are a bit murkier, you can still recover some identifiable sound from ordinary video.
The team’s choice of “Mary Had a Little Lamb” to test their method is rooted in history—an 1878 recording of the song made by Thomas Edison is one of the earliest bits of recorded sound. But Edison’s recording was not the absolute earliest: French researcher Édouard-Léon Scott de Martinville was recording images of sound waves back in the 1850s with his phonautograph, which traced a line representing a sound wave onto a rotating cylinder. Scientists have since managed to reconstruct some of these recordings, and a wobbly rendition of someone (probably Scott de Martinville himself) singing the folk song “Au Clair de la Lune” is the earliest human voice recording yet discovered.
And researchers are looking to extract sound in other ways too: In 2012, scientists at the University of California, Berkeley reconstructed the sounds of words a person heard by analyzing that person’s brain activity. For an even deeper dive into the science of sound, check out the World Science Festival program “Good Vibrations.”