The Metaverse, as Meta CEO Mark Zuckerberg envisions it, will be a fully immersive virtual experience that rivals reality, at least from the waist up. But the visuals are only part of the overall Metaverse experience.
“Getting the right spatial sound is key to delivering a realistic sense of presence in the metaverse,” Zuckerberg wrote in a blog post Friday. “When you’re at a concert, or just chatting with friends at a virtual table, having a realistic sense of where the sound is coming from makes you feel like you’re really there.”
That concert, the blog post notes, will sound very different when performed in a large concert hall than in a high school auditorium because of the differences between their physical spaces and acoustics. As such, Meta’s AI and Reality Lab (MAIR, formerly FAIR) is working with researchers from UT Austin to develop a trio of open source audio “understanding tasks” that will help developers build more immersive AR and VR experiences with more lifelike audio.
The first is MAIR’s Visual Acoustic Matching model, which can adapt a sample audio clip to a particular environment using only an image of the room. Want to hear what the NY Philharmonic would sound like in San Francisco’s Boom Boom Room? Now you can. Previous simulation models could recreate a room’s acoustics either from its layout, but only if the precise geometry and material properties were already known, or from audio sampled within the space. Neither approach yielded particularly accurate results.
MAIR’s solution is its Visual Acoustic Matching model, called AViTAR, which “learns acoustic matching from in-the-wild web videos, despite their lack of acoustically mismatched audio and untagged data,” the post said.
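For context, when a room's impulse response is already known, the conventional way to make a clip sound as if it were played in that room is to convolve the two signals. The sketch below illustrates only that baseline, not AViTAR, which infers the acoustics from an image. The function names, the decaying-noise model of the room, and the reverberation times are assumptions chosen for illustration.

```python
import numpy as np


def synthetic_impulse_response(rt60_s, sample_rate=16000, seed=0):
    """Toy room impulse response: white noise under an exponential decay.

    rt60_s is the time it takes the reverberant level to fall by 60 dB.
    This is a hypothetical stand-in for a measured or simulated response.
    """
    rng = np.random.default_rng(seed)
    n = int(round(rt60_s * sample_rate))
    t = np.arange(n) / sample_rate
    # Amplitude drops by 60 dB (a factor of 10**-3) at t = rt60_s.
    envelope = 10.0 ** (-3.0 * t / rt60_s)
    ir = rng.standard_normal(n) * envelope
    ir[0] = 1.0  # the direct sound arrives first, at full strength
    return ir


def add_reverb(dry, impulse_response):
    """Convolve a dry signal with a room impulse response."""
    return np.convolve(dry, impulse_response)


sample_rate = 16000
dry = np.zeros(sample_rate // 10)
dry[0] = 1.0  # a single click stands in for the dry recording

# Assumed reverberation times: short for a small room, long for a hall.
small_room = add_reverb(dry, synthetic_impulse_response(0.3, sample_rate))
concert_hall = add_reverb(dry, synthetic_impulse_response(2.0, sample_rate))

# The same click rings for much longer in the larger space.
print(len(small_room) / sample_rate, len(concert_hall) / sample_rate)
```

The limitation the article describes shows up directly: this only works if the impulse response is available, which in practice means knowing the room's geometry and materials or recording inside it.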
“One future use we’re interested in is reliving past memories,” Zuckerberg wrote, betting on nostalgia. “Imagine being able to put on AR glasses and see an object with the ability to play back a memory associated with it, such as picking up a tutu and seeing a hologram of your child’s ballet recital. The audio removes reverberation and makes the memory sound like the time you lived it, sitting in your exact seat in the audience.”
MAIR’s Visually-Informed Dereverberation of Audio model (VIDA), on the other hand, strips the echo from an instrument played in a large, open space such as a subway station or cathedral. You hear only the violin, not its reverberation bouncing off distant surfaces. Specifically, “it learns to remove reverberation based on both the perceived sounds and visual stream, which reveal clues about room geometry, materials and speaker locations,” the post explained. This technology can be used to more effectively isolate vocals and spoken commands, making them easier to understand for both humans and machines.
VisualVoice does the same as VIDA, but for voices. It uses both visual and audio cues to learn how to separate voices from background noise during its self-supervised training sessions. Meta expects this model to get a lot of work in machine understanding applications and in improving accessibility. Think more accurate captions, Siri understanding your request even when the room isn’t dead quiet, or the acoustics in a virtual chat room shifting as the people speaking move around the digital room. Again, just ignore the lack of legs.
“We envision a future where people can put on AR glasses and relive a holographic memory that looks and sounds like they experienced it from their point of view, or feel immersed by not only the graphics, but also the sounds as they play games in a virtual world,” Zuckerberg wrote, noting that AViTAR and VIDA can only apply their tasks to the one photo they were trained for and need a lot more development before being released publicly. “These models bring us even closer to the multimodal, immersive experiences we want to build in the future.”