P2P Video Streaming in the Browser
How I got video conversion and streaming working inside a browser, and why it was harder than expected.
Published January 20, 2026 · Updated March 1, 2026
How It Started
SpectrShare started as a video streaming tool. I built it more as a tech demo than anything else - I thought it would be possible to stream a video file via WebRTC instead of just streaming your webcam for video conferencing.
It turns out this is very much possible, but it was definitely more work than I expected - months spent on codec edge cases, memory management, seeking accuracy, and a long list of problems that didn't surface until I threw real video files at it.
File sharing came later. Once the P2P data channel infrastructure existed for video, adding general file transfer on top was straightforward, and it seemed like an obvious extension. Today most people use SpectrShare for file sharing and video conversion is optional, but the video system is still one of the parts I'm most proud of, and as far as I'm aware it's unique to this site.
How Conversion Works
When you tick "Convert video files for in-browser viewing," SpectrShare runs a video processing pipeline entirely inside your browser.
First it probes the source file to figure out what codecs and container format it uses. If the browser supports hardware-accelerated encoding via WebCodecs, it uses MediaBunny, a TypeScript media toolkit that decodes and re-encodes video using the GPU. This is fast: depending on the source, a 1-hour video can be processed in a few minutes.
If WebCodecs isn't available (older browsers, or if hardware encoding fails), it falls back to FFmpeg compiled to WebAssembly. This works on nearly everything, but it's orders of magnitude slower: CPU-bound, single-threaded, no GPU.
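The path selection boils down to a probe-and-fall-back decision. Here's a minimal sketch; in the browser the probe would wrap WebCodecs capability checks like VideoEncoder.isConfigSupported(), but it's injected here so the logic runs anywhere. The function name and shape are illustrative, not SpectrShare's actual code.

```typescript
type EncodePath = "webcodecs" | "ffmpeg-wasm";

// Pick the encode pipeline: hardware via WebCodecs when the probe succeeds,
// otherwise the universal (but much slower) WASM FFmpeg fallback.
function chooseEncodePath(probeHardwareEncode: () => boolean): EncodePath {
  try {
    // Any failure - API missing, config rejected - falls through to FFmpeg.
    return probeHardwareEncode() ? "webcodecs" : "ffmpeg-wasm";
  } catch {
    return "ffmpeg-wasm";
  }
}

// Example: a browser without WebCodecs falls back to FFmpeg.
const path = chooseEncodePath(() => false); // → "ffmpeg-wasm"
```

Treating "probe threw" the same as "probe said no" matters: on older browsers the WebCodecs objects don't exist at all, so the check itself can fail.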
The output is always H.264 Baseline Profile with AAC audio. Every modern browser and device can play this, which is the whole point. The recipient shouldn't have to think about codecs.
The encoded output is split into 3-second segments, each starting with a keyframe. The viewer requests these one at a time over the P2P data channel, and can watch them immediately without having to download the full file.
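Because segments must start on a keyframe, the cut points follow keyframe placement rather than an exact 3-second grid. A sketch of that boundary selection, under the assumption that frames carry a timestamp and a keyframe flag (the types here are illustrative):

```typescript
interface Frame {
  timeSec: number;
  isKeyframe: boolean;
}

// Cut a new segment at the first keyframe that lands at or after the
// target duration since the previous cut. Segment durations therefore
// vary with keyframe placement - they are close to, not exactly, 3s.
function segmentBoundaries(frames: Frame[], targetSec = 3): number[] {
  const starts: number[] = [];
  let segStart = -Infinity; // forces the first keyframe to open segment 0
  for (const f of frames) {
    if (f.isKeyframe && f.timeSec - segStart >= targetSec) {
      starts.push(f.timeSec);
      segStart = f.timeSec;
    }
  }
  return starts;
}
```

With keyframes at 0s, 2.5s, 3.2s, and 6.1s, the cuts land at 0s and 3.2s - the 2.5s keyframe is too early, and 6.1s is only 2.9s into the second segment.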
The Hard Problems
Format Support
People share videos in every format: MKV with DTS audio, MOV with ProRes, AVI from 2005, screen recordings with unusual frame rates. I have to detect what we're dealing with and find a path to browser-playable output.
The first step is probing. MediaBunny opens the file and extracts the codec, sample rate, channel count, frame rate, and rotation metadata. Some containers never make it this far - AVI, FLV, WMV, and MPEG transport streams bypass MediaBunny entirely and go straight to FFmpeg, because MediaBunny can't demux them.
The video track is always re-encoded to H.264 Baseline. There's no passthrough path for video, even if the source is already H.264, because the output needs to be segmented with keyframes at precise intervals for seeking to work, and I want to make sure the final stream is viewable in any browser.
Audio is where it gets interesting, and where most of the format complexity lives. If the source audio is already AAC-LC at a reasonable sample rate and in stereo, the encoded packets get copied directly into the output - no decode/re-encode cycle, no quality loss. Common web formats like Opus, MP3, FLAC, and Vorbis can be decoded and re-encoded to AAC via the browser's WebCodecs hardware encoder. But some codecs - E-AC3, AC-3, DTS, TrueHD - can't be decoded by WebCodecs at all. For these, I use a hybrid approach: FFmpeg transcodes just the audio to AAC and remuxes it with the original video stream (stream copy, no video re-encode), then feeds that preprocessed file back to MediaBunny for hardware video encoding. This avoids falling back to full software transcoding just because the audio codec is exotic, which would be 100x slower.
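That three-way routing can be sketched as a small decision function. The codec names and the "reasonable" thresholds below are assumptions for the example, not the exact production rules:

```typescript
type AudioPath = "passthrough" | "webcodecs-reencode" | "ffmpeg-audio-hybrid";

// Codecs WebCodecs cannot decode, per the paragraph above (names assumed).
const FFMPEG_ONLY = new Set(["eac3", "ac3", "dts", "truehd"]);

function chooseAudioPath(codec: string, sampleRate: number, channels: number): AudioPath {
  // Already AAC-LC, stereo, sane rate: copy packets, no quality loss.
  if (codec === "aac" && channels <= 2 && sampleRate <= 48000) {
    return "passthrough";
  }
  // Exotic codecs: FFmpeg transcodes only the audio (video stream-copied),
  // then hardware video encoding proceeds on the preprocessed file.
  if (FFMPEG_ONLY.has(codec)) {
    return "ffmpeg-audio-hybrid";
  }
  // Everything else (Opus, MP3, FLAC, Vorbis, multichannel AAC, ...):
  // decode and re-encode to AAC via WebCodecs.
  return "webcodecs-reencode";
}
```

The key design point is that only the audio decision ever forces FFmpeg in, and even then just for the audio track - the video keeps its fast hardware path.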
HEVC is increasingly common, especially from iPhones. I check WebCodecs decode support per-profile - 8-bit and 10-bit Main 10 have different capability requirements, and not all browsers support both. HDR and 10-bit content gets tone-mapped to SDR through a per-frame canvas bridge running at sRGB color space - the browser's GPU compositor handles the heavy lifting. iPhone rotation metadata (portrait video shot at 90/180/270 degrees) gets baked directly into the output frames during encoding rather than stored as a metadata flag, since not all players respect rotation metadata consistently.
As a safety net, if MediaBunny produces output that fails init segment validation, the entire job automatically retries with FFmpeg. Getting this right for the long tail of real files took months of iteration, and I know it's still not perfect. If you find a file that doesn't convert properly, let me know.
Seeking and Timestamps
This was actually the hardest problem, and it has layers.
Segment durations aren't exactly 3 seconds. They vary depending on where keyframes land. For a 5-minute video, being off by a fraction of a second per segment doesn't matter. For a 2-hour movie, those errors accumulate and seeking becomes wildly inaccurate. I maintain a precise mapping of each segment's actual start time and use binary search to find the right segment for any target time. Sounds simple, but getting the timestamps right through the full encode/segment/reassemble pipeline was anything but.
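The lookup itself is the easy half: a binary search over each segment's measured start time. A minimal sketch, assuming the manifest stores start times in seconds:

```typescript
// Return the index of the last segment starting at or before the target
// time. The table holds measured starts, not index * 3 - durations vary.
function findSegment(startTimes: number[], targetSec: number): number {
  let lo = 0;
  let hi = startTimes.length - 1;
  let ans = 0;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (startTimes[mid] <= targetSec) {
      ans = mid;       // candidate: starts before the target
      lo = mid + 1;    // look for a later one
    } else {
      hi = mid - 1;
    }
  }
  return ans;
}

// Seeking to 7.0s lands in the segment that starts at 6.11s.
findSegment([0, 3.04, 6.11, 9.13], 7.0); // → 2
```

The hard half, as described below, is making sure those start times are actually true after encoding and segmentation.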
The root issue is MP4 timescales. All timing in MP4 is expressed in "ticks" relative to a timescale - the number of ticks per second. The standard video timescale is 90,000 Hz (from the MPEG-TS era), and audio typically uses 48,000 Hz (matching the sample rate). But MediaBunny outputs video at its own internal timescale of 57,600 Hz. The browser's MediaSource Extensions API reads the timescale from the init segment's mdhd box and uses it to interpret every duration and timestamp that follows. If the segment data uses a different timescale than what mdhd declares, the browser calculates wrong durations and seeking breaks silently - video plays at the wrong speed, or seeks land in the wrong place.
The fix is a timescale correction pipeline. I rewrite the init segment's video mdhd timescale from 57,600 to 90,000, then scale every per-sample duration in the video trun boxes by 90,000/57,600 (1.5625x) to match. A sample duration of 2,400 ticks at 57,600 Hz becomes 3,750 ticks at 90,000 Hz - both represent the same real time (~41.7ms), just in different units. Audio stays at its native 48,000 timescale, untouched.
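The per-sample rescale is just a ratio: the tick count changes, the real time it represents does not. A sketch matching the 57,600 → 90,000 rewrite above:

```typescript
// Convert a tick count from one MP4 timescale to another, preserving the
// real duration it represents.
function rescaleTicks(ticks: number, fromTimescale: number, toTimescale: number): number {
  return Math.round((ticks * toTimescale) / fromTimescale);
}

// 2,400 ticks at 57,600 Hz and 3,750 ticks at 90,000 Hz are both ~41.7 ms.
rescaleTicks(2400, 57600, 90000); // → 3750
```

The rounding is why the declared mdhd timescale and the trun durations have to be rewritten together - scaling one without the other is exactly the silent breakage described above.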
Then there's the drift problem. Video duration per segment is calculated from frame count divided by frame rate. Audio duration comes from the actual sample count in the encoded output. These don't agree perfectly - video might average 3.037 seconds per segment while audio averages 3.016 seconds. Over hundreds of segments, this divergence adds up. In one real-world test with a 2-hour file, video timestamps had drifted 47 seconds ahead of audio after 2,320 segments. The solution is to use cumulative audio time as the single source of truth for both tracks. Each segment's fragment decode time (tfdt) for both video and audio is derived from the same running audio clock, just expressed in their respective timescales. The per-segment audio time is stored in the manifest, and that's what powers accurate seeking via binary search.
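The shared-clock idea reduces to one function: a single running audio time produces both tracks' tfdt values, each expressed in its own timescale. The numbers here are illustrative:

```typescript
const VIDEO_TIMESCALE = 90000; // after the mdhd rewrite
const AUDIO_TIMESCALE = 48000; // native audio timescale

// Derive both tracks' fragment decode times from one cumulative audio
// clock, so the tracks can never drift apart over many segments.
function tfdtForSegment(cumulativeAudioSec: number): { video: number; audio: number } {
  return {
    video: Math.round(cumulativeAudioSec * VIDEO_TIMESCALE),
    audio: Math.round(cumulativeAudioSec * AUDIO_TIMESCALE),
  };
}

// After ~3.016s of audio, both tracks start the next segment at the same
// real instant - just counted in different units.
tfdtForSegment(3.016); // → { video: 271440, audio: 144768 }
```

Contrast this with the broken version, where video time is accumulated independently from frame counts: each track's small per-segment error compounds separately, which is how 47 seconds of drift appears by segment 2,320.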
Finally, AAC encoder priming. AAC encoders prepend 2,112 samples (~44ms) of silence to the start of each encoded stream - it's inherent to the codec's overlap-add windowing. When each 3-second segment uses a fresh encoder instance, those priming samples create audible dips at every segment boundary. The fix was to strip the first 3 AAC frames from each segment's trun entries, properly shrinking the MP4 box sizes and updating data offsets. This loses about 19ms of real audio per segment, which is imperceptible, but it eliminates the audible stutter at every segment boundary.
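The arithmetic behind that trade-off, assuming a 48 kHz sample rate (an AAC-LC frame is always 1,024 samples; 48 kHz is an assumption for this example, so the result is ~20 ms rather than exactly the figure above):

```typescript
const SAMPLE_RATE = 48000;
const PRIMING_SAMPLES = 2112;        // encoder-added silence at stream start
const AAC_FRAME_SAMPLES = 1024;      // fixed AAC-LC frame size
const STRIPPED = 3 * AAC_FRAME_SAMPLES; // 3 frames = 3072 samples removed

// 2,112,000 / 48,000 = 44 ms of priming silence removed
const primingMs = (PRIMING_SAMPLES * 1000) / SAMPLE_RATE;

// (3072 - 2112) = 960 samples of real audio also removed → 20 ms
const realAudioLostMs = ((STRIPPED - PRIMING_SAMPLES) * 1000) / SAMPLE_RATE;
```

Three frames is the smallest whole-frame count that covers the 2,112 priming samples - AAC can only be cut on frame boundaries, so some real audio inevitably goes with it.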
Memory
Browsers have limited memory. Encoding the entire video then splitting into segments would blow up on anything longer than a few minutes. Instead, each segment gets written to the browser's Cache API as soon as it's produced, then the source data is released. The full encoded video is never held in memory. I also monitor memory pressure and reduce buffer sizes or pause processing when the browser is running low.
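The lifecycle can be modeled as "produce, persist, release": a testable stand-in where the store callback plays the role of the browser's Cache API (names and shapes are illustrative):

```typescript
// Persist each segment as soon as it is produced and hold no reference to
// it afterwards, so peak memory is one segment, not the whole video.
function persistSegments(
  produce: Iterable<{ index: number; bytes: Uint8Array }>,
  store: (index: number, bytes: Uint8Array) => void
): number {
  let peakBytesHeld = 0;
  for (const seg of produce) {
    peakBytesHeld = Math.max(peakBytesHeld, seg.bytes.length);
    store(seg.index, seg.bytes); // in the browser: cache.put(segmentUrl, response)
    // seg goes out of scope here - nothing accumulates across iterations
  }
  return peakBytesHeld;
}
```

Because `produce` is an iterable rather than an array, segments can be pulled one at a time straight off the encoder; an eagerly built array of all segments would defeat the point.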
What the Viewer Sees
From the viewer's side, it just looks like a video player. Progress bar, play/pause, seek, and a quality selector if multiple renditions are available. A buffer indicator shows how far ahead you've loaded.
Under the hood, the player feeds segments to the browser's MediaSource Extensions API. It keeps a 45-second forward buffer, requests segments over the data channel, and handles chunk reassembly since data can arrive out of order. Seeking to an unbuffered position triggers a new segment request, and playback resumes once enough data arrives.
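The reassembly step can be sketched as slotting chunks by index and completing only once every slot is filled. Field names here are illustrative, not the actual wire format:

```typescript
// Reassemble one segment from data-channel chunks that may arrive in any
// order. Returns null while chunks are still missing.
function reassemble(
  chunks: { index: number; data: Uint8Array }[],
  total: number
): Uint8Array | null {
  const slots: (Uint8Array | null)[] = new Array(total).fill(null);
  for (const c of chunks) slots[c.index] = c.data;
  if (slots.some((s) => s === null)) return null; // still waiting

  const size = slots.reduce((n, s) => n + s!.length, 0);
  const out = new Uint8Array(size);
  let offset = 0;
  for (const s of slots) {
    out.set(s!, offset);
    offset += s!.length;
  }
  return out; // ready to append to the MSE SourceBuffer
}
```

Indexing by position rather than arrival order is what makes out-of-order delivery a non-issue: the bytes always land where they belong.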
The player lives in the same file viewer interface as everything else - if a video has been converted, you get a "Play" button next to "Download," just as with other file types that can be displayed in the browser (PDF, Word, Excel, RTF, EPUB, EML, and many more). You can always grab the original file too if you want :-) If privacy is the main concern, there's a step-by-step guide to sharing a private video that covers password protection and best practices.
Going Deeper
For the full technical details on protocols, data channels, flow control, and the segmentation pipeline, see the How It Works article.