Audio Engineering Basics: PCM, Sample Rate, Bit Depth
After working on Zero-Shot TTS (cascading), Sesame (multimodal), and a host of other pipelines, I've learned there are a few primitives you have to commit to muscle memory. When you can picture the inputs and outputs, down to the units and shapes, you can sanity-check a model's behavior in seconds.
1. Mono vs Stereo
Think of channels as lanes on an audio freeway: the more lanes, the more parallel signals you have to juggle.
- Mono is a single channel; on playback, the same signal goes to every speaker.
- Stereo keeps left and right independent, each at the full sample rate.
- 5.1 (or any surround format) simply adds more lanes.
Quick gut check: 48 kHz stereo means each ear still gets 48,000 samples every second.
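To make the lane picture concrete, here is a minimal sketch that splits an interleaved stereo buffer into per-channel lists. It assumes the samples are interleaved as L, R, L, R (a common layout, though not the only one), and `deinterleave` is a hypothetical helper name, not part of any library.

```python
def deinterleave(samples, channels):
    """Split an interleaved sample sequence into one list per channel.

    Assumes an interleaved layout: [ch0, ch1, ..., ch0, ch1, ...].
    """
    if channels < 1:
        raise ValueError("channels must be >= 1")
    if len(samples) % channels != 0:
        raise ValueError("sample count is not a multiple of the channel count")
    # Every `channels`-th value, starting at the channel's offset, belongs to that channel.
    return [list(samples[ch::channels]) for ch in range(channels)]


# One second of 48 kHz stereo: 96,000 interleaved values in total.
sample_rate = 48_000
interleaved = [0.0] * (sample_rate * 2)
left, right = deinterleave(interleaved, channels=2)
print(len(left), len(right))  # each channel holds 48000 samples
```

The total value count doubles with stereo, but the per-channel count stays equal to the sample rate.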
2. Sample Rate
Sample rate is the metronome driving how often you capture amplitude. Speech stacks usually sit at 16 kHz; music master chains often ride at 44.1 or 48 kHz, and experimental rigs run higher when latency is a luxury.
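As a rough illustration of the metronome idea, the sketch below captures the same 10 ms of a 440 Hz sine at a speech rate and a music rate. The tone, the duration, and the helper name `sample_sine` are all assumptions chosen for the example.

```python
import math


def sample_sine(freq_hz, sample_rate, duration_ms):
    """Return amplitude samples of a sine wave captured at `sample_rate`."""
    n = (sample_rate * duration_ms) // 1000  # number of capture instants
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


speech = sample_sine(440, 16_000, 10)
music = sample_sine(440, 48_000, 10)
print(len(speech), len(music))  # 160 480
```

Same signal, same duration; the higher rate simply takes three times as many snapshots.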
3. Bit Depth
Bit depth is how many "clicks" your amplitude knob has.
- 16-bit PCM (int16) lives in integer land: -32768 to 32767, two bytes per sample.
- 32-bit Float PCM (float32) keeps everything normalized between -1.0 and +1.0, with more headroom for DSP math.
Translation: more bits = finer amplitude resolution, so transient details survive mixdowns.
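Moving between the two representations is a scale and a clip. The sketch below uses one common convention (divide by 32768 on the way in, multiply by 32767 on the way out); other pipelines pick slightly different scale factors, so treat the constants and the helper names as assumptions.

```python
def int16_to_float(x):
    """Map an int16 sample into [-1.0, 1.0) by dividing by 32768."""
    return x / 32768.0


def float_to_int16(x):
    """Clip to [-1.0, 1.0], scale by 32767, and round to the nearest integer."""
    clipped = max(-1.0, min(1.0, x))
    return int(round(clipped * 32767))


print(int16_to_float(-32768))  # -1.0
print(float_to_int16(1.0))     # 32767
print(float_to_int16(1.5))     # 32767 (clipped instead of wrapping around)
```

Clipping before the cast matters: an out-of-range float that is cast without clipping can wrap around to the opposite sign in int16.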
4. Core Formulas
These are the sanity checks I reach for when a tensor feels off.
duration_ms = (samples / sample_rate) * 1000
samples = (sample_rate * duration_ms) / 1000
chunk_duration_ms = (480 / 24000) * 1000 # 480-sample chunk at 24 kHz → 20 ms
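The formulas above can be wrapped into two small helpers. This is a sketch with hypothetical function names; rounding to the nearest whole sample in `ms_to_samples` is an assumption, since some pipelines truncate instead.

```python
def samples_to_ms(samples, sample_rate):
    """Duration in milliseconds of `samples` samples per channel."""
    return samples * 1000 / sample_rate


def ms_to_samples(duration_ms, sample_rate):
    """Samples per channel needed to cover `duration_ms`, rounded to a whole sample."""
    return round(sample_rate * duration_ms / 1000)


print(samples_to_ms(480, 24_000))  # 20.0
print(ms_to_samples(20, 24_000))   # 480
```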
5. Data Rate Examples (Uncompressed PCM)
Once you know the frame composition, bandwidth math is just multiplication.
Sample Rate | Bit Depth | Channels | Bytes/sec | kB/sec | MB/min |
---|---|---|---|---|---|
48 kHz | 16-bit | Mono | 96,000 | 96 | 5.76 |
48 kHz | 16-bit | Stereo | 192,000 | 192 | 11.52 |
48 kHz | Float32 | Mono | 192,000 | 192 | 11.52 |
48 kHz | Float32 | Stereo | 384,000 | 384 | 23.04 |
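
The table rows can be reproduced with a single multiplication. The sketch below assumes decimal units (1 kB = 1,000 bytes, 1 MB = 1,000,000 bytes), which is what the table's numbers imply; `pcm_bytes_per_second` is a hypothetical helper name.

```python
def pcm_bytes_per_second(sample_rate, bytes_per_sample, channels):
    """Uncompressed PCM data rate in bytes per second."""
    return sample_rate * bytes_per_sample * channels


rows = [
    ("16-bit mono", 2, 1),
    ("16-bit stereo", 2, 2),
    ("float32 mono", 4, 1),
    ("float32 stereo", 4, 2),
]
for label, width, channels in rows:
    rate = pcm_bytes_per_second(48_000, width, channels)
    kb_per_sec = rate / 1000            # decimal kilobytes
    mb_per_min = rate * 60 / 1_000_000  # decimal megabytes
    print(f"{label}: {rate} B/s, {kb_per_sec:g} kB/s, {mb_per_min:g} MB/min")
```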
6. Frames in PCM
In PCM audio, a frame is the snapshot across every channel at a single instant.
- Frame = one sample from each channel.
- Sample = the amplitude value for one channel at that instant.
So:
- Mono → 1 channel → 1 sample per frame.
- Stereo → 2 channels → left + right samples per frame.
- 5.1 surround → 6 channels → six samples per frame.
Frames keep multi-channel tensors aligned. One missing channel value and the whole frame is corrupt.
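Here is a minimal sketch of frame alignment in practice, using Python's standard `struct` module to decode raw bytes into frames. It assumes little-endian int16 samples in interleaved order; `unpack_frames` is a hypothetical helper, and a real decoder would read the format from the stream's header rather than hard-coding it.

```python
import struct


def unpack_frames(buf, channels):
    """Decode little-endian int16 PCM bytes into a list of frames (tuples)."""
    frame_size = channels * 2  # 2 bytes per int16 sample
    if len(buf) % frame_size != 0:
        # A trailing partial frame means at least one channel value is missing.
        raise ValueError("buffer ends mid-frame")
    fmt = "<" + "h" * channels  # one int16 per channel
    return list(struct.iter_unpack(fmt, buf))


# Two stereo frames: (left, right), (left, right).
buf = struct.pack("<4h", 100, -100, 200, -200)
frames = unpack_frames(buf, channels=2)
print(frames)  # [(100, -100), (200, -200)]
```

Dropping the last two bytes of `buf` leaves the second frame without its right-channel sample, and the length check rejects the buffer instead of silently shifting channels.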
7. Frame Size Example
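Let's walk the math for a frame you actually ship: 48 kHz, stereo, float32, 20 ms. Note that "frame" here means a 20 ms chunk of audio, as in codec and streaming terminology, not the single-instant PCM frame from section 6.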
samples_per_channel = (48000 * 20) / 1000 # 960 samples
total_samples = samples_per_channel * 2 # stereo → 1920 samples
total_bytes = total_samples * 4 # float32 → 7680 bytes
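The same arithmetic generalizes to any chunk configuration. The sketch below is one way to write it; `chunk_size_bytes` is a hypothetical name, and the integer division assumes the duration maps to a whole number of samples.

```python
def chunk_size_bytes(sample_rate, channels, bytes_per_sample, duration_ms):
    """Size in bytes of one uncompressed PCM chunk of `duration_ms`."""
    samples_per_channel = sample_rate * duration_ms // 1000
    return samples_per_channel * channels * bytes_per_sample


print(chunk_size_bytes(48_000, 2, 4, 20))  # 7680
print(chunk_size_bytes(16_000, 1, 2, 20))  # 640
```

As a cross-check, fifty 20 ms chunks make one second, and 7,680 bytes times 50 gives 384,000 bytes/sec, matching the float32 stereo row in the data rate table.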
Summary
- Sample rate = time resolution; bit depth = amplitude resolution.
- Frames bundle every channel's sample at a moment so tensors stay synchronized.
- Keep those formulas in reach and you'll know instantly whether a model output is reasonable or about to blow up your real-time budget.