DSC Capstone · UCSD · Winter Quarter 2026
Scrolling Through Sound with Multiple User Interfaces
An intelligent audio interface that adapts to how you listen — using your face, hands, and touch to give you natural, precise control over any podcast or audiobook.
About the Project
A capstone project built at UC San Diego.
Scrolling Through Sound is a DSC Capstone project developed by a six-person team at UCSD, advised by Victor Minces and Virginia De Sa. The goal: make audio as skimmable, navigable, and engaging as reading.
Our interface is a fully browser-based audio player that uses your camera and touchscreen as expressive inputs, with no special hardware or installation required.
"The same intuitive seeking that able-sighted readers enjoy when parsing text — for audio."
The Team
- Ryan Ly: Audio Playback, Transcription & Semantic Weighting
- Devesh Panda: Semantic Word Navigation & Transcription
- Tim Mao: Hand Gesture Navigation
- Wan-Rong (Emma) Leung: Facial Recognition & UI
- Edgar Cisneros Barron: Touch Controls & Mobile
- Justin Lu: YouTube Integration & Extension
The Problem
Audio is hard to navigate.
We set out to fix that.
The Challenge
Audiobooks and podcasts have become a primary way people consume information — but the tools for navigating them haven't kept up. Unlike reading, where you can skim, backtrack, or jump to a chapter in seconds, audio forces you to scrub through a timeline with limited precision.
Moments of confusion, distraction, or cognitive overload are common during long listening sessions, and most players offer no way to detect or respond to them.
Our Approach
Rather than redesigning the slider, we replaced it. Our interface uses computer vision, touch, and transcription to give listeners natural, expressive control over playback — responding not just to button presses, but to gestures, expressions, and engagement patterns.
The result is a player that adapts to you: slowing down when you look confused, scrubbing intelligently when you speed up, and responding to a wave of your hand.
Features
Five ways to listen better.
Each feature works independently or together as one unified system.
Confusion Detection
Your camera watches for subtle facial cues — brow furrows, eye squints — that signal confusion or cognitive overload. When detected, playback automatically slows so you can catch up, then returns to normal speed once you're back on track. No buttons needed.
Facial Recognition
Hand Gesture Control
Rotate your hand clockwise to speed up, counter-clockwise to rewind. Hold an open palm to pause. Complete two rotations to play. Raise a finger on your other hand to adjust volume. Full hands-free control over your audio — no screen touches required.
MediaPipe · Computer Vision
Smart Scrub
Push past 2× speed and our interface stops blindly fast-forwarding. Instead, it identifies the most important words in each segment and plays only those — giving you an intelligible audio summary at any speed. It's like skimming a text, but for audio.
Semantic NLP · Transcription
Touch & Mobile Controls
Draw circles or swipe to control playback on any touchscreen. The mobile UI is designed for single-handed use — touch pad at the bottom, player info at the top, everything accessible without looking away from your content.
Touch Gestures · Mobile UI
YouTube Integration & Chrome Extension
Paste any YouTube link into our interface and watch it with the rest of the feature set active: gesture control, confusion detection, smart scrubbing, and touch controls. A Chrome Extension brings the same controls directly to YouTube's native interface, so you never have to leave the site you're already on.
YouTube · Chrome Extension · Express
Results
Gesture Control Accuracy
The rotation detection algorithm reliably distinguishes clockwise from counter-clockwise movement, and the layered smoothing filters out jitter without making the controls feel sluggish. Both Distance Mode and Speed Mode are functional and produce stable, usable playback control.
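The page does not spell out how rotation is detected, so the following is only a minimal sketch of one common approach: take the signed angle between successive positions of a tracked point around a pivot, then smooth it exponentially. The names `signed_angle` and `RotationTracker`, the choice of pivot, and the smoothing factor `alpha` are all assumptions for illustration, not the project's actual code.

```python
import math


def signed_angle(prev, curr, center):
    """Signed angle (radians) swept from prev to curr around center.

    Points are (x, y) in image coordinates, where y grows downward, so a
    positive result means clockwise motion as seen on screen.
    """
    ax, ay = prev[0] - center[0], prev[1] - center[1]
    bx, by = curr[0] - center[0], curr[1] - center[1]
    cross = ax * by - ay * bx
    dot = ax * bx + ay * by
    return math.atan2(cross, dot)


class RotationTracker:
    """Accumulates a smoothed rotation angle from a stream of points."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha    # hypothetical smoothing factor in (0, 1]
        self.prev = None      # last point seen
        self.smoothed = 0.0   # smoothed per-frame angle, radians
        self.total = 0.0      # accumulated smoothed angle, radians

    def update(self, point, center):
        """Feed one tracked point; returns the accumulated angle."""
        if self.prev is not None:
            delta = signed_angle(self.prev, point, center)
            # Exponential moving average damps frame-to-frame jitter.
            self.smoothed = self.alpha * delta + (1 - self.alpha) * self.smoothed
            self.total += self.smoothed
        self.prev = point
        return self.total

    def direction(self):
        if self.total > 0:
            return "clockwise"
        if self.total < 0:
            return "counter-clockwise"
        return "none"
```

A lower `alpha` filters more jitter at the cost of a slower response, which is the trade-off the smoothing described above has to balance.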
Discrete Gesture Commands
Play, pause, and volume gestures each have deliberate activation requirements (two full rotations to play, a 300 ms open-palm hold to pause, a raised index finger on the secondary hand for volume) that prevent accidental triggers while remaining natural to perform.
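The activation requirements above can be sketched as two small triggers. This is an illustrative sketch only: the 300 ms hold and the two-rotation requirement come from the text, while the class names, the millisecond timestamps passed in by the caller, and the choice to count net rotation in either direction are assumptions.

```python
import math

PAUSE_HOLD_MS = 300                # from the text: 300 ms open palm to pause
PLAY_ROTATION = 2 * 2 * math.pi    # from the text: two full rotations to play


class HoldDetector:
    """Fires once when a pose has been held continuously for hold_ms."""

    def __init__(self, hold_ms=PAUSE_HOLD_MS):
        self.hold_ms = hold_ms
        self.started_at = None   # timestamp when the pose first appeared
        self.fired = False       # prevents repeat firing during one hold

    def update(self, pose_visible, now_ms):
        if not pose_visible:
            # Losing the pose resets the timer, so brief flashes never fire.
            self.started_at = None
            self.fired = False
            return False
        if self.started_at is None:
            self.started_at = now_ms
        if not self.fired and now_ms - self.started_at >= self.hold_ms:
            self.fired = True
            return True
        return False


class RotationTrigger:
    """Fires when the net rotation reaches the required angle."""

    def __init__(self, required=PLAY_ROTATION):
        self.required = required
        self.accumulated = 0.0

    def update(self, delta_radians):
        self.accumulated += delta_radians
        # Assumption: direction does not matter for the play gesture.
        if abs(self.accumulated) >= self.required:
            self.accumulated = 0.0
            return True
        return False
```

Both triggers fire once per completed gesture and then reset, which is what keeps a held palm or a continued rotation from re-triggering the same command on every frame.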
Smart Scrub at High Speeds
Above 2× speed, the system seamlessly transitions to semantic word selection. The result is an intelligible audio summary rather than distorted fast-forward audio, and it reverts cleanly to normal playback when speed drops back below the threshold.
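One way to picture the semantic word selection is as a budgeted pick over a transcript segment. The sketch below is hypothetical throughout: the `Word` fields, the `weight` score, and the budget rule (keep a fraction `threshold / speed` of the spoken time) are assumptions made for illustration, and only the 2× threshold comes from the text.

```python
from dataclasses import dataclass

SCRUB_THRESHOLD = 2.0  # from the text: smart scrub engages above 2x speed


@dataclass
class Word:
    text: str
    start: float   # seconds into the audio
    end: float     # seconds into the audio
    weight: float  # hypothetical importance score, higher = more important


def select_words(words, speed, threshold=SCRUB_THRESHOLD):
    """Return the words to play for one segment at the given speed.

    At or below the threshold every word is kept. Above it, the most
    heavily weighted words are kept until the time budget is used up,
    then returned in their original chronological order.
    """
    if speed <= threshold or not words:
        return list(words)
    spoken = sum(w.end - w.start for w in words)
    budget = spoken * threshold / speed  # assumed budget rule
    chosen = []
    used = 0.0
    for w in sorted(words, key=lambda w: w.weight, reverse=True):
        length = w.end - w.start
        if used + length <= budget:
            chosen.append(w)
            used += length
    return sorted(chosen, key=lambda w: w.start)


segment = [
    Word("the", 0.0, 0.25, 0.1),
    Word("quick", 0.25, 0.75, 0.6),
    Word("brown", 0.75, 1.25, 0.5),
    Word("fox", 1.25, 1.75, 0.8),
    Word("jumps", 1.75, 2.5, 0.9),
]
kept = [w.text for w in select_words(segment, speed=4.0)]
```

Because the function returns the full word list at or below the threshold, dropping back under 2× restores normal playback without any separate mode switch.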
Touchscreen Controls Across Devices
Circle and swipe gestures both map reliably to playback speed on desktop and mobile. The mobile layout holds up in real use: single-handed operation works as intended, with the touch pad and player controls staying out of each other's way.
YouTube Integration
Pasting a YouTube link into the interface successfully fetches and syncs the audio and video streams, bringing all playback features to content users already want to watch. The Chrome Extension extends this to YouTube's native interface without requiring users to leave the site.
Confusion Detection Response
The facial recognition system successfully detects brow and squint signals and maps them to adaptive playback slowdown with hysteresis to prevent jitter. The system hands off gracefully to gesture control when the user's hand enters frame.
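Hysteresis here means using two thresholds instead of one, so the player does not flip between speeds when the signal hovers near a single cutoff. A minimal sketch follows, assuming a confusion score between 0 and 1 derived from the brow and squint signals. The class name, both thresholds, and the slowed rate are hypothetical values, not the project's settings.

```python
class ConfusionSpeedController:
    """Maps a confusion score to a playback rate using two thresholds."""

    def __init__(self, enter_at=0.6, exit_at=0.4,
                 normal_rate=1.0, slow_rate=0.75):
        if exit_at >= enter_at:
            raise ValueError("exit_at must be below enter_at")
        self.enter_at = enter_at      # score needed to start slowing down
        self.exit_at = exit_at        # score needed to return to normal
        self.normal_rate = normal_rate
        self.slow_rate = slow_rate
        self.slowed = False

    def update(self, score):
        """Feed one confusion score; returns the playback rate to use."""
        if self.slowed:
            # Stay slowed until the score falls clearly below the exit level.
            if score <= self.exit_at:
                self.slowed = False
        elif score >= self.enter_at:
            self.slowed = True
        return self.slow_rate if self.slowed else self.normal_rate


controller = ConfusionSpeedController()
rates = [controller.update(s) for s in [0.2, 0.5, 0.65, 0.5, 0.45, 0.35, 0.5]]
```

In this run the score of 0.5 gives normal speed before the slowdown starts and slowed speed while it is active, which is exactly the gap between the two thresholds that suppresses jitter.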
Conclusion
Audio navigation that finally keeps up with how people actually listen.
Our interface demonstrates that the gap between reading and listening doesn't have to exist. By combining facial recognition, hand gestures, touch controls, and semantic transcription into a single unified interface, we've shown that audio can be as skimmable, navigable, and engaging as text.
The system has clear paths forward: expanded gesture vocabularies, smarter semantic summaries, and integration with platforms like Spotify and Audible. But as a proof of concept, it already delivers on the core promise — a listener who is confused gets help, a listener who is bored can skim intelligently, and a listener with their hands full never has to touch their screen.