DSC Capstone · UCSD · Winter Quarter 2026

Scrolling Through Sound with Multiple User Interfaces.

An intelligent audio interface that adapts to how you listen — using your face, hands, and touch to give you natural, precise control over any podcast or audiobook.


Ryan Ly · Devesh Panda · Tim Mao · Wan-Rong (Emma) Leung · Edgar Cisneros · Justin Lu

Distinct Input Modalities

Unified Adaptive Interface

About the Project

A capstone project built at UC San Diego.

Scrolling Through Sound is a DSC Capstone project developed by a six-person team at UCSD, advised by Victor Minces and Virginia De Sa. The goal: make audio as skimmable, navigable, and engaging as reading.

Our interface is a fully browser-based audio player that uses your camera and touchscreen as expressive inputs, with no special hardware or installation required.

"The same intuitive seeking that able-sighted readers enjoy when parsing text — for audio."

The Team

Ryan Ly: Audio Playback, Transcription & Semantic Weighting
Devesh Panda: Semantic Word Navigation & Transcription
Tim Mao: Hand Gesture Navigation
Wan-Rong (Emma) Leung: Facial Recognition & UI
Edgar Cisneros Barron: Touch Controls & Mobile
Justin Lu: YouTube Integration & Extension

The Problem

Audio is hard to navigate.
We set out to fix that.

The Challenge

Audiobooks and podcasts have become a primary way people consume information — but the tools for navigating them haven't kept up. Unlike reading, where you can skim, backtrack, or jump to a chapter in seconds, audio forces you to scrub through a timeline with limited precision.

Moments of confusion, distraction, or cognitive overload are common during long listening sessions, and most players offer no way to detect or respond to them.

Our Approach

Rather than redesigning the slider, we replaced it. Our interface uses computer vision, touch, and transcription to give listeners natural, expressive control over playback — responding not just to button presses, but to gestures, expressions, and engagement patterns.

The result is a player that adapts to you: slowing down when you look confused, scrubbing intelligently when you speed up, and responding to a wave of your hand.

Features

Five ways to listen better.

Each feature works independently or together as one unified system.

Confusion Detection

Your camera watches for subtle facial cues — brow furrows, eye squints — that signal confusion or cognitive overload. When detected, playback automatically slows so you can catch up, then returns to normal speed once you're back on track. No buttons needed.

Facial Recognition

Hand Gesture Control

Rotate your hand clockwise to speed up, counter-clockwise to rewind. Hold an open palm to pause. Complete two rotations to play. Raise a finger on your other hand to adjust volume. Full hands-free control over your audio — no screen touches required.

MediaPipe · Computer Vision

Smart Scrub

Push past 2× speed and our interface stops blindly fast-forwarding. Instead, it identifies the most important words in each segment and plays only those — giving you an intelligible audio summary at any speed. It's like skimming a text, but for audio.

Semantic NLP · Transcription

Touch & Mobile Controls

Draw circles or swipe to control playback on any touchscreen. The mobile UI is designed for single-handed use — touch pad at the bottom, player info at the top, everything accessible without looking away from your content.

Touch Gestures · Mobile UI

YouTube Integration & Chrome Extension

Paste any YouTube link into our interface and watch it with the other features active: gesture control, confusion detection, smart scrubbing, and touch controls. A Chrome Extension brings the same controls directly to YouTube's native interface, so you never have to leave the site you're already on.

YouTube · Chrome Extension · Express

Results

i.

Gesture Control Accuracy

The rotation detection algorithm reliably distinguishes clockwise from counter-clockwise movement, and the layered smoothing filters out jitter without making the controls feel sluggish. Both Distance Mode and Speed Mode are functional and produce stable, usable playback control.
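As a rough illustration of how clockwise and counter-clockwise motion could be told apart, the sketch below sums the signed angle that a tracked point sweeps around the centroid of its recent positions, and smooths the per-frame angular changes with an exponential moving average. It is a minimal sketch under assumptions, not the project's actual algorithm: the function names, the quarter-turn minimum, the smoothing factor, and the y-up coordinate convention are all hypothetical.

```python
import math


def angular_deltas(points):
    """Per-frame signed angle changes (radians) around the centroid.

    `points` is a list of (x, y) positions, oldest first. Positive values
    mean counter-clockwise motion in a y-up coordinate system; in image
    coordinates (y pointing down) the sign is visually reversed.
    """
    if len(points) < 2:
        return []
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    angles = [math.atan2(y - cy, x - cx) for x, y in points]
    deltas = []
    for prev, cur in zip(angles, angles[1:]):
        delta = cur - prev
        # Unwrap jumps across the -pi/pi boundary.
        if delta > math.pi:
            delta -= 2 * math.pi
        elif delta < -math.pi:
            delta += 2 * math.pi
        deltas.append(delta)
    return deltas


def smooth(values, alpha=0.3):
    """Exponential moving average; alpha is a hypothetical smoothing factor."""
    out = []
    state = None
    for v in values:
        state = v if state is None else alpha * v + (1 - alpha) * state
        out.append(state)
    return out


def direction(points, min_angle=math.pi / 2):
    """Classify the motion, ignoring sweeps smaller than `min_angle`."""
    total = sum(angular_deltas(points))
    if abs(total) < min_angle:
        return "none"
    return "counter-clockwise" if total > 0 else "clockwise"
```

The smoothed deltas could serve as a jitter-resistant rotation speed signal, while the unsmoothed sum decides direction. A lower `alpha` filters more jitter at the cost of more lag, which is the trade-off the paragraph above describes.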

ii.

Discrete Gesture Commands

Play, pause, and volume gestures each have a deliberate activation requirement that prevents accidental triggers while remaining natural to perform: two full rotations to play, an open palm held for 300 ms to pause, and a raised index finger on the secondary hand for volume.
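The activation requirements can be pictured as small gates in front of each command. The sketch below is one hypothetical way to express the 300 ms palm hold and the two-rotation play trigger. The class and function names are illustrative assumptions, and the sketch assumes the caller supplies a per-frame open-palm flag, a timestamp in milliseconds, and an accumulated rotation angle in radians.

```python
import math


class HoldDetector:
    """Fires once when a pose has been held continuously for `hold_ms`."""

    def __init__(self, hold_ms=300):
        self.hold_ms = hold_ms
        self._start = None   # timestamp when the current hold began
        self._fired = False  # whether this hold already triggered

    def update(self, active, now_ms):
        """Feed one frame; returns True only on the frame the hold completes."""
        if not active:
            # Pose released: reset so the next hold starts from zero.
            self._start = None
            self._fired = False
            return False
        if self._start is None:
            self._start = now_ms
        if not self._fired and now_ms - self._start >= self.hold_ms:
            self._fired = True
            return True
        return False


def should_play(total_angle, required_rotations=2):
    """True once the accumulated rotation reaches the required full turns."""
    return abs(total_angle) >= required_rotations * 2 * math.pi
```

Resetting on release means a brief flash of an open palm never pauses playback, and firing only once per hold avoids repeated pause commands while the palm stays up.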

iii.

Smart Scrub at High Speeds

Above 2× speed, the system seamlessly transitions to semantic word selection. The result is an intelligible audio summary rather than distorted fast-forward audio, and it reverts cleanly to normal playback when speed drops back below the threshold.
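To make the threshold behavior concrete, here is a minimal sketch of semantic word selection. It assumes each transcribed word already carries a start time and an importance weight. The `Word` type, the weights, and the rule of keeping roughly one word in every `speed` words are hypothetical choices for illustration, not the project's actual weighting scheme.

```python
import math
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start_s: float  # start time in the audio, in seconds
    weight: float   # assumed importance score, higher is more important


def select_words(words, speed, threshold=2.0):
    """Return the words to play for one segment at the given speed.

    At or below the threshold every word is kept (normal playback).
    Above it, only the highest-weighted words are kept, in spoken order.
    """
    if speed <= threshold:
        return list(words)
    keep = max(1, math.ceil(len(words) / speed))
    top = sorted(words, key=lambda w: w.weight, reverse=True)[:keep]
    return sorted(top, key=lambda w: w.start_s)
```

For example, a segment reading "the mitochondria is the powerhouse of the cell" with high weights on the three content words would play back as "mitochondria powerhouse cell" at 3× speed, and in full at 2× or below.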

iv.

Touchscreen Controls Across Devices

Circle and swipe gestures map reliably to playback speed on both desktop and mobile. The mobile layout holds up in real use: single-handed operation works as intended, with the touch pad and player controls staying out of each other's way.

v.

YouTube Integration

Pasting a YouTube link into the interface successfully fetches and syncs the audio and video streams, bringing all playback features to content users already want to watch. The Chrome Extension extends this to YouTube's native interface without requiring users to leave the site.

vi.

Confusion Detection Response

The facial recognition system successfully detects brow and squint signals and maps them to adaptive playback slowdown with hysteresis to prevent jitter. The system hands off gracefully to gesture control when the user's hand enters frame.
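Hysteresis here means using two thresholds instead of one, so a score hovering near a single cutoff does not flip the playback rate back and forth. The sketch below shows the idea under assumptions: it takes a per-frame confusion score between 0 and 1, and the class name, both threshold values, and the slowed rate of 0.75× are hypothetical placeholders rather than the project's tuned settings.

```python
class ConfusionGate:
    """Maps a confusion score to a playback rate using two thresholds."""

    def __init__(self, enter_threshold=0.6, exit_threshold=0.4,
                 slow_rate=0.75, normal_rate=1.0):
        self.enter_threshold = enter_threshold  # score needed to start slowing
        self.exit_threshold = exit_threshold    # score needed to stop slowing
        self.slow_rate = slow_rate
        self.normal_rate = normal_rate
        self.confused = False

    def update(self, score):
        """Feed one score; returns the playback rate to apply."""
        if self.confused:
            # Stay slowed until the score drops clearly below the exit level.
            if score < self.exit_threshold:
                self.confused = False
        elif score > self.enter_threshold:
            self.confused = True
        return self.slow_rate if self.confused else self.normal_rate
```

Scores between the two thresholds leave the current state unchanged, which is what keeps the rate from jittering when the listener's expression sits near the boundary.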

Conclusion

Audio navigation that finally keeps up with how people actually listen.

Our interface demonstrates that the gap between reading and listening doesn't have to exist. By combining facial recognition, hand gestures, touch controls, and semantic transcription into a single unified interface, we've shown that audio can be as skimmable, navigable, and engaging as text.

The system has clear paths forward: expanded gesture vocabularies, smarter semantic summaries, and integration with platforms like Spotify and Audible. But as a proof of concept, it already delivers on the core promise — a listener who is confused gets help, a listener who is bored can skim intelligently, and a listener with their hands full never has to touch their screen.

