1. Introduction
The [WEBRTC-NV-USE-CASES] document describes several functions that can only be achieved by access to media (requirements N20-N22), including, but not limited to:
- Funny Hats
- Machine Learning
- Virtual Reality Gaming
These use cases further require that processing can be done in worker threads (requirements N23 and N24).
This specification gives an interface based on [WEBCODECS] and [STREAMS] to provide access to such functionality.
This specification provides access to raw media, which is the output of a media source such as a camera, microphone, screen capture, or the decoder part of a codec, and the input to the encoder part of a codec. The processed media can be consumed by any destination that can take a MediaStreamTrack, including HTML <video> and <audio> tags, RTCPeerConnection, canvas or MediaRecorder.
This specification explicitly aims to support the following use cases:
- Video processing: This is the "Funny Hats" use case, where the input is a single video track and the output is a transformed video track.
- Audio processing: This is the equivalent of the video processing use case, but for audio tracks. This use case overlaps partially with the AudioWorklet interface, but the model provided by this specification differs in significant ways:
  - Pull-based programming model, as opposed to AudioWorklet's clock-based model. This means that processing of each single block of audio data does not have a set time budget.
  - Offers direct access to the data and metadata of the original MediaStreamTrack. In particular, timestamps come directly from the track, as opposed to an AudioContext.
  - Easier integration with video processing, by providing the same API and programming model and allowing both to run on the same scope.
  - Does not run on a real-time thread. This means that the model is not suitable for applications with strong low-latency requirements.
  These differences make the model provided by this specification more suitable than AudioWorklet for processing that requires greater tolerance to transient CPU spikes, better integration with video MediaStreamTracks, and access to track metadata (e.g., timestamps), but that does not have strong low-latency requirements such as local audio rendering. An example would be audio-visual speech separation, which can be used to combine the video and audio tracks from a speaker on the sender side of a video call and remove noise not coming from the speaker (i.e., the "Noisy cafeteria" case). Other examples that do not require integration with video but can benefit from the model include echo detection and other forms of ML-based noise cancellation.
- Multi-source processing: In this use case, two or more tracks are combined into one. For example, a presentation containing a live weather map and a camera track showing the speaker can be combined to produce a weather report application. Audio-visual speech separation, referenced above, is another case of multi-source processing.
- Custom audio or video sink: In this use case, the purpose is not to produce a processed MediaStreamTrack, but to consume the media in a different way. For example, an application could use [WEBCODECS] and [WEBTRANSPORT] to create an RTCPeerConnection-like sink, but using a different codec configuration and networking protocols.
2. Specification
This specification shows the IDL extensions for [MEDIACAPTURE-STREAMS]. It defines some new objects that inherit the MediaStreamTrack interface and can be constructed from a MediaStreamTrack.
The API consists of two elements. One is a track sink that is capable of exposing the unencoded media frames from the track to a ReadableStream. The other is the inverse: a track source that takes media frames as input.
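For example (non-normative), the two elements can be connected back-to-back to form an identity pipeline. This is a minimal sketch; videoTrack is assumed to be a video MediaStreamTrack obtained elsewhere:

// Sketch: an identity pipeline. The sink exposes frames from a track and
// the source turns them back into a new track, unmodified.
const processor = new MediaStreamTrackProcessor({track: videoTrack});
const generator = new MediaStreamTrackGenerator({kind: 'video'});
// Frames flow from the sink's ReadableStream into the source's WritableStream.
processor.readable.pipeTo(generator.writable);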
2.1. MediaStreamTrackProcessor
A MediaStreamTrackProcessor allows the creation of a ReadableStream that can expose the media flowing through a given MediaStreamTrack. If the MediaStreamTrack is a video track, the chunks exposed by the stream will be VideoFrame objects; if the track is an audio track, the chunks will be AudioData objects. This makes MediaStreamTrackProcessor effectively a sink in the MediaStream model.
A MediaStreamTrackProcessor internally contains a circular queue that allows buffering incoming media frames delivered by the track it is connected to. This buffering allows the MediaStreamTrackProcessor to temporarily hold frames waiting to be read from its associated ReadableStream.
The application can influence the maximum size of the queue via a parameter provided in the MediaStreamTrackProcessor constructor. However, the maximum size of the queue is decided by the UA and can change dynamically, although it will not exceed the size requested by the application. If the application does not provide a maximum size parameter, the UA is free to decide the maximum size of the queue.
When a new frame arrives at the MediaStreamTrackProcessor, if the queue has reached its maximum size, the oldest frame is removed from the queue and the new frame is added. This means that, for the particular case of a queue with a maximum size of 1, any queued frame will always be the most recent one.
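For example (non-normative), an application that only ever wants the most recent frame can request a queue of size 1. A sketch, assuming an existing video MediaStreamTrack in videoTrack:

// Sketch: keep only the most recent frame. With maxBufferSize set to 1,
// older frames are dropped as new ones arrive, so a slow consumer always
// reads the latest available frame rather than a stale backlog.
const processor = new MediaStreamTrackProcessor({
  track: videoTrack,  // assumed: an existing video MediaStreamTrack
  maxBufferSize: 1
});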
The UA is also free to remove any frames from the queue at any time, for example to save resources or to improve performance in specific situations. In all cases, frames that are not dropped must be made available to the ReadableStream in the order in which they arrive at the MediaStreamTrackProcessor.
A MediaStreamTrackProcessor makes frames available to its associated ReadableStream only when a read request has been issued on the stream. The idea is to avoid the stream’s internal buffering, which does not give the UA enough flexibility to choose the buffering policy.
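In practice (non-normative), this means frames are pulled on demand: a consumer that reads explicitly via a reader triggers one frame delivery per read request. A sketch, assuming a processor created as above:

// Sketch: pull-based consumption with an explicit reader. Each read()
// issues a read request; the processor only moves a frame into the
// stream when such a request is pending.
const reader = processor.readable.getReader();
while (true) {
  const {value: frame, done} = await reader.read();
  if (done) break;
  // ... use the frame ...
  frame.close();  // release the frame's resources promptly
}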
2.1.1. Interface definition
interface MediaStreamTrackProcessor {
  constructor(MediaStreamTrackProcessorInit init);
  attribute ReadableStream readable;
};

dictionary MediaStreamTrackProcessorInit {
  required MediaStreamTrack track;
  [EnforceRange] unsigned short maxBufferSize;
};
2.1.2. Internal slots
[[track]]
  Track whose raw data is to be exposed by the MediaStreamTrackProcessor.
[[maxBufferSize]]
  The maximum number of media frames to be buffered by the MediaStreamTrackProcessor, as specified by the application. It may have no value if the application does not provide it. Its minimum valid value is 1.
[[queue]]
  A queue used to buffer media frames not yet read by the application.
[[numPendingReads]]
  An integer whose value represents the number of read requests issued by the application that have not yet been handled.
[[isClosed]]
  A boolean whose value indicates whether the MediaStreamTrackProcessor is closed.
2.1.3. Constructor
MediaStreamTrackProcessor(init)
1. If init.track is not a valid MediaStreamTrack, throw a TypeError.
2. Let processor be a new MediaStreamTrackProcessor object.
3. Assign init.track to processor.[[track]].
4. If init.maxBufferSize has an integer value greater than or equal to 1, assign it to processor.[[maxBufferSize]].
5. Set the [[queue]] internal slot of processor to an empty Queue.
6. Set processor.[[numPendingReads]] to 0.
7. Set processor.[[isClosed]] to false.
8. Return processor.
2.1.4. Attributes
readable, of type ReadableStream
  Allows reading the frames delivered by the MediaStreamTrack stored in the [[track]] internal slot. This attribute is created the first time it is accessed, according to the following steps:
  1. Initialize this.readable to be a new ReadableStream.
  2. Set up this.readable with its pullAlgorithm set to processorPull with this as parameter, cancelAlgorithm set to processorCancel with this as parameter, and highWaterMark set to 0.
The processorPull algorithm is given a processor as input. It is defined by the following steps:
1. Increment the value of processor.[[numPendingReads]] by 1.
2. Queue a task to run the maybeReadFrame algorithm with processor as parameter.
3. Return a promise resolved with undefined.
The maybeReadFrame algorithm is given a processor as input. It is defined by the following steps:
1. If processor.[[queue]] is empty, abort these steps.
2. If processor.[[numPendingReads]] equals zero, abort these steps.
3. Dequeue a frame from processor.[[queue]] and enqueue it in processor.readable.
4. Decrement processor.[[numPendingReads]] by 1.
5. Go to step 1.
The processorCancel algorithm is given a processor as input. It is defined by running the following steps:
1. Run the processorClose algorithm with processor as parameter.
2. Return a promise resolved with undefined.
The processorClose algorithm is given a processor as input. It is defined by running the following steps:
1. If processor.[[isClosed]] is true, abort these steps.
2. Disconnect processor from processor.[[track]]. The mechanism to do this is UA specific, and the result is that processor is no longer a sink of processor.[[track]].
3. Close processor.readable.[[controller]].
4. Empty processor.[[queue]].
5. Set processor.[[isClosed]] to true.
2.1.5. Handling interaction with the track
When the [[track]] of a MediaStreamTrackProcessor processor delivers a frame to processor, the UA MUST execute the handleNewFrame algorithm with processor as parameter.
The handleNewFrame algorithm is given a processor as input. It is defined by running the following steps:
1. If processor.[[maxBufferSize]] has a value and processor.[[queue]] has processor.[[maxBufferSize]] elements, dequeue an item from processor.[[queue]].
2. Enqueue the new frame in processor.[[queue]].
3. Queue a task to run the maybeReadFrame algorithm with processor as parameter.
At any time, the UA MAY remove any frame from processor.[[queue]]. The UA may decide to remove frames from processor.[[queue]], for example, to prevent resource exhaustion or to improve performance in certain situations. The application may detect that frames have been dropped by noticing that there is a gap in the timestamps of the frames.
When the [[track]] of a MediaStreamTrackProcessor processor ends, the processorClose algorithm must be executed with processor as parameter.
2.2. MediaStreamTrackGenerator
A MediaStreamTrackGenerator allows the creation of a WritableStream that acts as a MediaStreamTrack source in the MediaStream model. Since the model does not expose sources directly, but only through the tracks connected to them, a MediaStreamTrackGenerator is also a track connected to its WritableStream source. Further tracks connected to the same WritableStream can be created using the clone method. The WritableStream source is exposed as the writable field of the MediaStreamTrackGenerator.
Similarly to MediaStreamTrackProcessor, the WritableStream of an audio MediaStreamTrackGenerator accepts AudioData objects, and that of a video MediaStreamTrackGenerator accepts VideoFrame objects.
When a VideoFrame or AudioData object is written to writable, the frame’s close() method is automatically invoked, so that its internal resources are no longer accessible from JavaScript.
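Since writing a frame closes it, an application that needs to keep using a frame after writing it can write a clone instead. A non-normative sketch, assuming a frame variable holding a VideoFrame or AudioData:

// Sketch: keep access to a frame after writing it. write() closes the
// object that is written, so hand the generator a clone and keep the
// original, which the application must eventually close itself.
const writer = generator.writable.getWriter();
await writer.write(frame.clone());  // the clone is closed by the generator
// ... keep using `frame` here ...
frame.close();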
2.2.1. Interface definition
interface MediaStreamTrackGenerator : MediaStreamTrack {
  constructor(MediaStreamTrackGeneratorInit init);
  attribute WritableStream writable;  // accepts VideoFrame or AudioData objects
};

dictionary MediaStreamTrackGeneratorInit {
  required DOMString kind;
};
2.2.2. Constructor
MediaStreamTrackGenerator(init)
1. Let g be a new MediaStreamTrackGenerator object.
2. Initialize the kind field of g (inherited from MediaStreamTrack) with init.kind.
3. Return g.
2.2.3. Attributes
writable, of type WritableStream
  Allows writing media frames to the MediaStreamTrackGenerator, which is itself a MediaStreamTrack. When this attribute is accessed for the first time, it MUST be initialized with the following steps:
  1. Initialize this.writable to be a new WritableStream.
  2. Set up this.writable with its writeAlgorithm set to writeFrame with this as parameter, closeAlgorithm set to closeWritable with this as parameter, and abortAlgorithm set to closeWritable with this as parameter.
The writeFrame algorithm is given a generator and a frame as input. It is defined by running the following steps:
1. If generator.kind equals "video" and frame is not a VideoFrame object, return a promise rejected with a TypeError.
2. If generator.kind equals "audio" and frame is not an AudioData object, return a promise rejected with a TypeError.
3. Send the media data backing frame to all live tracks connected to generator, possibly including generator itself.
4. Invoke the close method of frame.
5. Return a promise resolved with undefined.
When the media data is sent to a track, the UA may apply processing (e.g., cropping and downscaling) to ensure that the media data sent to the track satisfies the track’s constraints. Each track may receive a different version of the media data depending on its constraints.
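As an alternative to piping from a ReadableStream, frames can be written directly with a writer obtained from writable. A non-normative sketch; makeFrame() is a hypothetical function standing in for any code that produces VideoFrame objects (e.g., from a canvas):

// Sketch: writing frames directly to a video generator.
const writer = generator.writable.getWriter();
const frame = makeFrame();  // hypothetical VideoFrame producer
await writer.write(frame);  // writeFrame closes the frame after sending it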
The closeWritable algorithm is given a generator as input. It is defined by running the following steps:
1. For each track t connected to generator, end t.
2. Return a promise resolved with undefined.
2.2.4. Specialization of MediaStreamTrack behavior
A MediaStreamTrackGenerator is a MediaStreamTrack. This section adds clarifications on how a MediaStreamTrackGenerator behaves as a MediaStreamTrack.
2.2.4.1. clone
The clone method on a MediaStreamTrackGenerator returns a new MediaStreamTrack object whose source is the same as that of the MediaStreamTrackGenerator being cloned. This source is the writable field of the MediaStreamTrackGenerator.
2.2.4.2. stop
The stop method on a MediaStreamTrackGenerator stops the track. When the last track connected to the writable of a MediaStreamTrackGenerator ends, its writable is closed.
2.2.4.3. Constrainable properties
The following constrainable properties are defined for video MediaStreamTrackGenerators and any MediaStreamTracks sourced from a MediaStreamTrackGenerator:
| Property Name | Values | Notes |
|---|---|---|
| width | ConstrainULong | As a setting, this is the width, in pixels, of the latest frame received by the track. As a capability, max MUST reflect the largest width a VideoFrame may have, and min MUST reflect the smallest width a VideoFrame may have. |
| height | ConstrainULong | As a setting, this is the height, in pixels, of the latest frame received by the track. As a capability, max MUST reflect the largest height a VideoFrame may have, and min MUST reflect the smallest height a VideoFrame may have. |
| frameRate | ConstrainDouble | As a setting, this is an estimate of the frame rate based on frames recently received by the track. As a capability, min MUST be zero and max MUST be the maximum frame rate supported by the system. |
| aspectRatio | ConstrainDouble | As a setting, this is the aspect ratio of the latest frame delivered by the track; that is, the width in pixels divided by the height in pixels, as a double rounded to the tenth decimal place. As a capability, min MUST be the smallest aspect ratio supported by a VideoFrame, and max MUST be the largest aspect ratio supported by a VideoFrame. |
| resizeMode | ConstrainDOMString | As a setting, this string should be one of the members of VideoResizeModeEnum. The value "none" means that the frames output by the MediaStreamTrack are unmodified versions of the frames written to the writable backing the track, regardless of any constraints. The value "crop-and-scale" means that the frames output by the MediaStreamTrack may be cropped and/or downscaled versions of the source frames, based on the values of the width, height and aspectRatio constraints of the track. As a capability, the values "none" and "crop-and-scale" both MUST be present. |
The applyConstraints method applied to a video MediaStreamTrack sourced from a MediaStreamTrackGenerator supports the properties defined above. It can be used, for example, to resize frames or adjust the frame rate of the track. Note that these constraints have no effect on the VideoFrame objects written to the writable of a MediaStreamTrackGenerator, only on the output of the track on which the constraints have been applied. Note also that, since a MediaStreamTrackGenerator can in principle produce media data with any setting for the supported constrainable properties, an applyConstraints call on a track backed by a MediaStreamTrackGenerator will generally not fail with OverconstrainedError unless the given constraints are outside the system-supported range, as reported by getCapabilities.
The following constrainable properties are defined for audio MediaStreamTracks sourced from a MediaStreamTrackGenerator, but in an informational capacity only, available via getSettings. It is not possible to reconfigure an audio track sourced by a MediaStreamTrackGenerator using applyConstraints in the way that it is possible to, for example, resize the frames of a video track. getCapabilities MUST return an empty object.
| Property Name | Values | Notes |
|---|---|---|
| sampleRate | ConstrainDouble | As a setting, this is the sample rate, in samples per second, of the latest AudioData delivered by the track. |
| channelCount | ConstrainULong | As a setting, this is the number of independent audio channels of the latest AudioData delivered by the track. |
| sampleSize | ConstrainULong | As a setting, this is the linear sample size, in bits, of the latest AudioData delivered by the track. |
2.2.4.4. Events and attributes
Events and attributes work the same as for any MediaStreamTrack. It is relevant to note that if the writable stream of a MediaStreamTrackGenerator is closed, all the live tracks connected to it, possibly including the MediaStreamTrackGenerator itself, are ended and the ended event is fired on them.
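For example (non-normative):

// Sketch: closing writable ends all live tracks connected to it,
// firing 'ended' on each, including the generator itself.
generator.addEventListener('ended', () => console.log('track ended'));
await generator.writable.close();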
3. Examples
3.1. Video Processing
Consider a face recognition function detectFace(videoFrame) that returns a face position (in some format), and a manipulation function blurBackground(videoFrame, facePosition) that returns a new VideoFrame similar to the given videoFrame, but with the non-face parts blurred. The example also shows the video before and after the effect on video elements.
const stream = await navigator.mediaDevices.getUserMedia({video: true});
const videoTrack = stream.getVideoTracks()[0];
const processor = new MediaStreamTrackProcessor({track: videoTrack});
const generator = new MediaStreamTrackGenerator({kind: 'video'});
const transformer = new TransformStream({
  async transform(videoFrame, controller) {
    const facePosition = await detectFace(videoFrame);
    const newFrame = blurBackground(videoFrame, facePosition);
    videoFrame.close();
    controller.enqueue(newFrame);
  }
});
processor.readable.pipeThrough(transformer).pipeTo(generator.writable);

const videoBefore = document.getElementById('video-before');
const videoAfter = document.getElementById('video-after');
videoBefore.srcObject = stream;
const streamAfter = new MediaStream([generator]);
videoAfter.srcObject = streamAfter;
The same example using a worker:
// main.js
const stream = await navigator.mediaDevices.getUserMedia({video: true});
const videoTrack = stream.getVideoTracks()[0];
const processor = new MediaStreamTrackProcessor({track: videoTrack});
const generator = new MediaStreamTrackGenerator({kind: 'video'});
const worker = new Worker('worker.js');
worker.postMessage(
    {readable: processor.readable, writable: generator.writable},
    [processor.readable, generator.writable]);

const videoBefore = document.getElementById('video-before');
const videoAfter = document.getElementById('video-after');
videoBefore.srcObject = stream;
const streamAfter = new MediaStream([generator]);
videoAfter.srcObject = streamAfter;

// worker.js
self.onmessage = async function(e) {
  const transformer = new TransformStream({
    async transform(videoFrame, controller) {
      const facePosition = await detectFace(videoFrame);
      const newFrame = blurBackground(videoFrame, facePosition);
      videoFrame.close();
      controller.enqueue(newFrame);
    }
  });
  e.data.readable.pipeThrough(transformer).pipeTo(e.data.writable);
};
3.2. Multi-source processing
Suppose there is a model for audio-visual speech separation, represented by a class AudioVisualModel with a method updateVideo(videoFrame) that updates the internal state of the model upon a new video frame, a method getSpeechData(audioData) that returns a noise-canceled AudioData given an input raw AudioData, and a close() method that releases resources used internally by the model.
// main.js
const stream = await navigator.mediaDevices.getUserMedia({audio: true, video: true});
const audioTrack = stream.getAudioTracks()[0];
const videoTrack = stream.getVideoTracks()[0];
const audioProcessor = new MediaStreamTrackProcessor({track: audioTrack});
const videoProcessor = new MediaStreamTrackProcessor({track: videoTrack});
const audioGenerator = new MediaStreamTrackGenerator({kind: 'audio'});
const worker = new Worker('worker.js');
worker.postMessage({
    audioReadable: audioProcessor.readable,
    videoReadable: videoProcessor.readable,
    audioWritable: audioGenerator.writable
}, [
    audioProcessor.readable,
    videoProcessor.readable,
    audioGenerator.writable
]);

// worker.js
self.onmessage = async function(e) {
  const model = new AudioVisualModel();
  const audioTransformer = new TransformStream({
    async transform(audioData, controller) {
      const speechData = model.getSpeechData(audioData);
      audioData.close();
      controller.enqueue(speechData);
    }
  });
  const audioPromise = e.data.audioReadable
                           .pipeThrough(audioTransformer)
                           .pipeTo(e.data.audioWritable);
  const videoReader = e.data.videoReadable.getReader();
  const videoPromise = new Promise(async resolve => {
    while (true) {
      const result = await videoReader.read();
      if (result.done) {
        break;
      } else {
        model.updateVideo(result.value);
        result.value.close();
      }
    }
    resolve();
  });
  await Promise.all([audioPromise, videoPromise]);
  model.close();
};
An example that instead allows video effects that are influenced by speech would be similar, except that the roles of audio and video would be reversed.
3.3. Custom sink
Suppose there are sendAudioToNetwork(audioData) and sendVideoToNetwork(videoFrame) functions that respectively send AudioData and VideoFrame objects to a custom network sink, together with a setupNetworkSinks() function to set up the sinks and a cleanupNetworkSinks() function to release resources used by the sinks.
// main.js
const stream = await navigator.mediaDevices.getUserMedia({audio: true, video: true});
const audioTrack = stream.getAudioTracks()[0];
const videoTrack = stream.getVideoTracks()[0];
const audioProcessor = new MediaStreamTrackProcessor({track: audioTrack});
const videoProcessor = new MediaStreamTrackProcessor({track: videoTrack});
const worker = new Worker('worker.js');
worker.postMessage({
    audioReadable: audioProcessor.readable,
    videoReadable: videoProcessor.readable,
}, [
    audioProcessor.readable,
    videoProcessor.readable,
]);

// worker.js
function writeToSink(reader, sinkFunction) {
  return new Promise(async resolve => {
    while (true) {
      const result = await reader.read();
      if (result.done) {
        break;
      } else {
        sinkFunction(result.value);
        result.value.close();
      }
    }
    resolve();
  });
}

self.onmessage = async function(e) {
  setupNetworkSinks();
  const audioReader = e.data.audioReadable.getReader();
  const videoReader = e.data.videoReadable.getReader();
  const audioPromise = writeToSink(audioReader, sendAudioToNetwork);
  const videoPromise = writeToSink(videoReader, sendVideoToNetwork);
  await Promise.all([audioPromise, videoPromise]);
  cleanupNetworkSinks();
};
4. Security and Privacy considerations
This API defines a MediaStreamTrack source and a MediaStreamTrack sink. The security and privacy of the source (MediaStreamTrackGenerator) rely on the same-origin policy. That is, the data a MediaStreamTrackGenerator can make available in the form of a MediaStreamTrack must be visible to the document before a VideoFrame or AudioData object can be constructed and pushed into the MediaStreamTrackGenerator. Any attempt to create VideoFrame or AudioData objects using cross-origin data will fail. Therefore, MediaStreamTrackGenerator does not introduce any new fingerprinting surface.
The MediaStreamTrack sink introduced by this API (MediaStreamTrackProcessor) exposes the same data that is exposed by other MediaStreamTrack sinks, such as WebRTC peer connections, Web Audio MediaStreamAudioSourceNode and media elements. The security and privacy of MediaStreamTrackProcessor rely on the security and privacy of the MediaStreamTrack sources of the tracks to which MediaStreamTrackProcessor is connected. For example, camera, microphone and screen-capture tracks rely on explicit use authorization via permission dialogs (see [MEDIACAPTURE-STREAMS] and [MEDIACAPTURE-SCREEN-SHARE]), while element capture and MediaStreamTrackGenerator rely on the same-origin policy.
A potential issue with MediaStreamTrackProcessor is resource exhaustion. For example, a site might hold on to too many open VideoFrame objects and deplete a system-wide pool of GPU-memory-backed frames. UAs can mitigate this risk by limiting the number of pool-backed frames a site can hold. This can be achieved by reducing the maximum number of buffered frames and by refusing to deliver more frames to readable once the budget limit is reached. Accidental exhaustion is also mitigated by the automatic closing of VideoFrame and AudioData objects once they are written to a MediaStreamTrackGenerator.