MediaStreamTrack Insertable Media Processing using Streams

Unofficial Proposal Draft,

This version:
https://w3c.github.io/mediacapture-insertable-streams/
Feedback:
public-webrtc@w3.org with subject line “[mediacapture-insertable-streams] … message topic …” (archives)
Issue Tracking:
GitHub
Editors:
(Google)
(Google)

Abstract

This document defines an API surface for manipulating the raw media data carried by MediaStreamTracks. NOT AN ADOPTED WORKING GROUP DOCUMENT.

Status of this document

1. Introduction

The [WEBRTC-NV-USE-CASES] document describes several functions that can only be achieved by access to media (requirements N20-N22), including, but not limited to:

These use cases further require that processing can be done in worker threads (requirement N23-N24).

This specification gives an interface based on [WEBCODECS] and [STREAMS] to provide access to such functionality.

This specification provides access to raw media, which is the output of a media source such as a camera, microphone, screen capture, or the decoder part of a codec, and the input to the encoder part of a codec. The processed media can be consumed by any destination that can take a MediaStreamTrack, including HTML <video> and <audio> elements, RTCPeerConnection, canvas and MediaRecorder.

This specification explicitly aims to support the following use cases:

2. Specification

This specification shows the IDL extensions for [MEDIACAPTURE-STREAMS]. It defines new objects that either inherit from the MediaStreamTrack interface or are constructed from a MediaStreamTrack.

The API consists of two elements. One is a track sink capable of exposing the unencoded media frames from the track to a ReadableStream. The other is its inverse: a track source that takes media frames as input.

2.1. MediaStreamTrackProcessor

A MediaStreamTrackProcessor allows the creation of a ReadableStream that can expose the media flowing through a given MediaStreamTrack. If the MediaStreamTrack is a video track, the chunks exposed by the stream will be VideoFrame objects; if the track is an audio track, the chunks will be AudioData objects. This makes MediaStreamTrackProcessor effectively a sink in the MediaStream model.

A MediaStreamTrackProcessor internally contains a circular queue that buffers incoming media frames delivered by the track it is connected to. This buffering allows the MediaStreamTrackProcessor to temporarily hold frames waiting to be read from its associated ReadableStream. The application can influence the maximum size of the queue via a parameter provided in the MediaStreamTrackProcessor constructor. The actual maximum size of the queue is decided by the UA and can change dynamically, but it will not exceed the size requested by the application. If the application does not provide a maximum size parameter, the UA is free to decide the maximum size of the queue.

When a new frame arrives at the MediaStreamTrackProcessor and the queue has reached its maximum size, the oldest frame is removed from the queue and the new frame is added. This means that, in the particular case of a queue with a maximum size of 1, any queued frame will always be the most recent one.

The UA is also free to remove any frames from the queue at any time. The UA may remove frames in order to save resources or to improve performance in specific situations. In all cases, frames that are not dropped must be made available to the ReadableStream in the order in which they arrive to the MediaStreamTrackProcessor.

A MediaStreamTrackProcessor makes frames available to its associated ReadableStream only when a read request has been issued on the stream. The idea is to avoid the stream’s internal buffering, which does not give the UA enough flexibility to choose the buffering policy.
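
For instance, a minimal sketch of this read-driven delivery, assuming a camera track obtained via getUserMedia and a maxBufferSize of 1 so that only the most recent frame is retained, could look like this:

// Minimal sketch (not part of the normative API): read-driven frame delivery.
const stream = await navigator.mediaDevices.getUserMedia({video: true});
const [videoTrack] = stream.getVideoTracks();
const processor = new MediaStreamTrackProcessor({track: videoTrack, maxBufferSize: 1});
const reader = processor.readable.getReader();
while (true) {
  // Each read() issues a read request; the processor hands out a frame only
  // in response to such a request, so no frames pile up in the stream itself.
  const {value: videoFrame, done} = await reader.read();
  if (done) break;
  // ... process videoFrame ...
  videoFrame.close(); // release the frame's resources promptly
}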

2.1.1. Interface definition

interface MediaStreamTrackProcessor {
    constructor(MediaStreamTrackProcessorInit init);
    attribute ReadableStream readable;
};

dictionary MediaStreamTrackProcessorInit {
  required MediaStreamTrack track;
  [EnforceRange] unsigned short maxBufferSize;
};

2.1.2. Internal slots

[[track]]
Track whose raw data is to be exposed by the MediaStreamTrackProcessor.
[[maxBufferSize]]
The maximum number of media frames to be buffered by the MediaStreamTrackProcessor as specified by the application. It may have no value if the application does not provide it. Its minimum valid value is 1.
[[queue]]
A queue used to buffer media frames not yet read by the application.
[[numPendingReads]]
An integer whose value represents the number of read requests issued by the application that have not yet been handled.
[[isClosed]]
A boolean whose value indicates whether the MediaStreamTrackProcessor is closed.

2.1.3. Constructor

MediaStreamTrackProcessor(init)
  1. If init.track is not a valid MediaStreamTrack, throw a TypeError.

  2. Let processor be a new MediaStreamTrackProcessor object.

  3. Assign init.track to processor.[[track]].

  4. If init.maxBufferSize has an integer value greater than or equal to 1, assign it to processor.[[maxBufferSize]].

  5. Set the [[queue]] internal slot of processor to an empty Queue.

  6. Set processor.[[numPendingReads]] to 0.

  7. Set processor.[[isClosed]] to false.

  8. Return processor.

2.1.4. Attributes

readable
Allows reading the frames delivered by the MediaStreamTrack stored in the [[track]] internal slot. This attribute is created the first time it is accessed, according to the following steps:
  1. Initialize this.readable to be a new ReadableStream.

  2. Set up this.readable with its pullAlgorithm set to processorPull with this as parameter, its cancelAlgorithm set to processorCancel with this as parameter, and its highWaterMark set to 0.

The processorPull algorithm is given a processor as input. It is defined by the following steps:

  1. Increment the value of the processor.[[numPendingReads]] by 1.

  2. Queue a task to run the maybeReadFrame algorithm with processor as parameter.

  3. Return a promise resolved with undefined.

The maybeReadFrame algorithm is given a processor as input. It is defined by the following steps:

  1. If processor.[[queue]] is empty, abort these steps.

  2. If processor.[[numPendingReads]] equals zero, abort these steps.

  3. Dequeue a frame from processor.[[queue]] and enqueue it in processor.readable.

  4. Decrement processor.[[numPendingReads]] by 1.

  5. Go to step 1.

The processorCancel algorithm is given a processor as input. It is defined by running the following steps:

  1. Run the processorClose algorithm with processor as parameter.

  2. Return a promise resolved with undefined.

The processorClose algorithm is given a processor as input. It is defined by running the following steps:

  1. If processor.[[isClosed]] is true, abort these steps.

  2. Disconnect processor from processor.[[track]]. The mechanism to do this is UA specific and the result is that processor is no longer a sink of processor.[[track]].

  3. Close processor.readable.[[controller]].

  4. Empty processor.[[queue]].

  5. Set processor.[[isClosed]] to true.
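
In practice, an application that no longer needs the frames can simply cancel the readable; per the processorCancel and processorClose algorithms above, this disconnects the processor from its track and empties its queue. A minimal sketch, assuming a MediaStreamTrackProcessor named processor:

// Sketch: releasing a processor by cancelling its ReadableStream.
await processor.readable.cancel();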

2.1.5. Handling interaction with the track

When the [[track]] of a MediaStreamTrackProcessor processor delivers a frame to processor, the UA MUST execute the handleNewFrame algorithm with processor as parameter.

The handleNewFrame algorithm is given a processor as input. It is defined by running the following steps:

  1. If processor.[[maxBufferSize]] has a value and processor.[[queue]] has processor.[[maxBufferSize]] elements, dequeue an item from processor.[[queue]].

  2. Enqueue the new frame in processor.[[queue]].

  3. Queue a task to run the maybeReadFrame algorithm with processor as parameter.

At any time, the UA MAY remove any frame from processor.[[queue]]. The UA may decide to remove frames from processor.[[queue]], for example, to prevent resource exhaustion or to improve performance in certain situations.

The application may detect that frames have been dropped by noticing that there is a gap in the timestamps of the frames.
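
For example, a sketch of such gap detection inside a TransformStream might look like the following; the half-duration tolerance is an arbitrary choice for illustration, and the sketch assumes that frames carry a non-null duration:

// Sketch: flag possible dropped frames by looking for timestamp gaps.
let previousTimestamp = null;
let previousDuration = null;
const gapDetector = new TransformStream({
  transform(videoFrame, controller) {
    if (previousTimestamp !== null && previousDuration !== null) {
      const expectedTimestamp = previousTimestamp + previousDuration;
      if (videoFrame.timestamp > expectedTimestamp + previousDuration / 2) {
        console.log('Possible frame drop before timestamp', videoFrame.timestamp);
      }
    }
    previousTimestamp = videoFrame.timestamp;
    previousDuration = videoFrame.duration;
    controller.enqueue(videoFrame);
  }
});
// e.g. processor.readable.pipeThrough(gapDetector).pipeTo(generator.writable);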

When the [[track]] of a MediaStreamTrackProcessor processor ends, the processorClose algorithm must be executed with processor as parameter.

2.2. MediaStreamTrackGenerator

A MediaStreamTrackGenerator allows the creation of a WritableStream that acts as a MediaStreamTrack source in the MediaStream model. Since the model does not expose sources directly, but only through the tracks connected to them, a MediaStreamTrackGenerator is also a track connected to its WritableStream source. Further tracks connected to the same WritableStream can be created using the clone method. The WritableStream source is exposed as the writable field of MediaStreamTrackGenerator.

Similarly to MediaStreamTrackProcessor, the WritableStream of an audio MediaStreamTrackGenerator accepts AudioData objects, and a video MediaStreamTrackGenerator accepts VideoFrame objects. When a VideoFrame or AudioData object is written to writable, the frame’s close() method is automatically invoked, so that its internal resources are no longer accessible from JavaScript.
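
As an illustration, an application can obtain a writer from writable and push frames into it. The sketch below generates video frames from a canvas; the canvas element, the drawing step and the 30 fps cadence are placeholders chosen purely for illustration:

// Sketch: feed canvas content into a video MediaStreamTrackGenerator.
const generator = new MediaStreamTrackGenerator({kind: 'video'});
const writer = generator.writable.getWriter();
const canvas = document.getElementById('source-canvas'); // hypothetical canvas element
let timestamp = 0;               // microseconds
const frameDurationUs = 33333;   // ~30 fps
setInterval(async () => {
  // ... draw the next image onto canvas here ...
  const frame = new VideoFrame(canvas, {timestamp});
  timestamp += frameDurationUs;
  await writer.write(frame); // the frame is closed automatically once written
}, 33);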

2.2.1. Interface definition

interface MediaStreamTrackGenerator : MediaStreamTrack {
    constructor(MediaStreamTrackGeneratorInit init);
    attribute WritableStream writable;  // VideoFrame or AudioData
};

dictionary MediaStreamTrackGeneratorInit {
  required DOMString kind;
};

2.2.2. Constructor

MediaStreamTrackGenerator(init)
  1. If init.kind is not "audio" or "video", throw a TypeError.

  2. Let g be a new MediaStreamTrackGenerator object.

  3. Initialize the kind field of g (inherited from MediaStreamTrack) with init.kind.

  4. Return g.

2.2.3. Attributes

writable, of type WritableStream
Allows writing media frames to the MediaStreamTrackGenerator, which is itself a MediaStreamTrack. When this attribute is accessed for the first time, it MUST be initialized with the following steps:
  1. Initialize this.writable to be a new WritableStream.

  2. Set up this.writable, with its writeAlgorithm set to writeFrame with this as parameter, with closeAlgorithm set to closeWritable with this as parameter and abortAlgorithm set to closeWritable with this as parameter.

The writeFrame algorithm is given a generator and a frame as input. It is defined by running the following steps:

  1. If generator.kind equals "video" and frame is not a VideoFrame object, return a promise rejected with a TypeError.

  2. If generator.kind equals "audio" and frame is not an AudioData object, return a promise rejected with a TypeError.

  3. Send the media data backing frame to all live tracks connected to generator, possibly including generator itself.

  4. Invoke the close method of frame.

  5. Return a promise resolved with undefined.

When the media data is sent to a track, the UA may apply processing (e.g., cropping and downscaling) to ensure that the media data sent to the track satisfies the track’s constraints. Each track may receive a different version of the media data depending on its constraints.

The closeWritable algorithm is given a generator as input. It is defined by running the following steps.

  1. For each track t connected to generator, end t.

  2. Return a promise resolved with undefined.

2.2.4. Specialization of MediaStreamTrack behavior

A MediaStreamTrackGenerator is a MediaStreamTrack. This section adds clarifications on how a MediaStreamTrackGenerator behaves as a MediaStreamTrack.
2.2.4.1. clone
The clone method on a MediaStreamTrackGenerator returns a new MediaStreamTrack object whose source is the same as the one for the MediaStreamTrackGenerator being cloned. This source is the writable field of the MediaStreamTrackGenerator.
2.2.4.2. stop
The stop method on a MediaStreamTrackGenerator stops the track. When the last track connected to the writable of a MediaStreamTrackGenerator ends, its writable is closed.
2.2.4.3. Constrainable properties

The following constrainable properties are defined for video MediaStreamTrackGenerators and any MediaStreamTracks sourced from a MediaStreamTrackGenerator:

Property Name | Values | Notes
width | ConstrainULong | As a setting, this is the width, in pixels, of the latest frame received by the track. As a capability, max MUST reflect the largest width a VideoFrame may have, and min MUST reflect the smallest width a VideoFrame may have.
height | ConstrainULong | As a setting, this is the height, in pixels, of the latest frame received by the track. As a capability, max MUST reflect the largest height a VideoFrame may have, and min MUST reflect the smallest height a VideoFrame may have.
frameRate | ConstrainDouble | As a setting, this is an estimate of the frame rate based on frames recently received by the track. As a capability, min MUST be zero and max MUST be the maximum frame rate supported by the system.
aspectRatio | ConstrainDouble | As a setting, this is the aspect ratio of the latest frame delivered by the track; this is the width in pixels divided by the height in pixels, as a double rounded to the tenth decimal place. As a capability, min MUST be the smallest aspect ratio supported by a VideoFrame, and max MUST be the largest aspect ratio supported by a VideoFrame.
resizeMode | ConstrainDOMString | As a setting, this string should be one of the members of VideoResizeModeEnum. The value "none" means that the frames output by the MediaStreamTrack are unmodified versions of the frames written to the writable backing the track, regardless of any constraints. The value "crop-and-scale" means that the frames output by the MediaStreamTrack may be cropped and/or downscaled versions of the source frames, based on the values of the width, height and aspectRatio constraints of the track. As a capability, the values "none" and "crop-and-scale" both MUST be present.

The applyConstraints method applied to a video MediaStreamTrack sourced from a MediaStreamTrackGenerator supports the properties defined above. It can be used, for example, to resize frames or adjust the frame rate of the track. Note that these constraints have no effect on the VideoFrame objects written to the writable of a MediaStreamTrackGenerator, just on the output of the track on which the constraints have been applied. Note also that, since a MediaStreamTrackGenerator can in principle produce media data with any setting for the supported constrainable properties, an applyConstraints call on a track backed by a MediaStreamTrackGenerator will generally not fail with OverconstrainedError unless the given constraints are outside the system-supported range, as reported by getCapabilities.
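
For example, a minimal sketch of constraining a clone of a generator-backed track, assuming a video MediaStreamTrackGenerator named generator; the values are arbitrary:

// Sketch: downscale and throttle a track cloned from a generator. The frames
// written to generator.writable are unaffected; only this track's output changes.
const previewTrack = generator.clone();
await previewTrack.applyConstraints({
  width: 640,
  height: 360,
  frameRate: 15,
  resizeMode: 'crop-and-scale'
});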

The following constrainable properties are defined for audio MediaStreamTracks sourced from a MediaStreamTrackGenerator, but only in an informational capacity, available via getSettings. It is not possible to reconfigure an audio track sourced by a MediaStreamTrackGenerator using applyConstraints in the way that it is possible to, for example, resize the frames of a video track. getCapabilities MUST return an empty object.

Property Name | Values | Notes
sampleRate | ConstrainDouble | As a setting, this is the sample rate, in samples per second, of the latest AudioData delivered by the track.
channelCount | ConstrainULong | As a setting, this is the number of independent audio channels of the latest AudioData delivered by the track.
sampleSize | ConstrainULong | As a setting, this is the linear sample size, in bits, of the latest AudioData delivered by the track.
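
These settings can be inspected with getSettings(); a minimal sketch, assuming an audio MediaStreamTrackGenerator named audioGenerator whose writable has already received AudioData:

// Sketch: read the informational audio settings of a generator-backed track.
const {sampleRate, channelCount, sampleSize} = audioGenerator.getSettings();
console.log(sampleRate, channelCount, sampleSize);
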
2.2.4.4. Events and attributes
Events and attributes work the same as for any MediaStreamTrack. It is relevant to note that if the writable stream of a MediaStreamTrackGenerator is closed, all the live tracks connected to it, possibly including the MediaStreamTrackGenerator itself, are ended and the ended event is fired on them.
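
To observe this, an application might listen for the ended event on the generator and close its writable when it is done producing media; a minimal sketch, assuming a MediaStreamTrackGenerator named generator:

// Sketch: closing the writable ends every live track connected to the generator,
// including the generator itself, and fires "ended" on each of them.
generator.addEventListener('ended', () => {
  console.log('generator track ended');
});
const writer = generator.writable.getWriter();
await writer.close();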

3. Examples

3.1. Video Processing

Consider a face recognition function detectFace(videoFrame) that returns a face position (in some format), and a manipulation function blurBackground(videoFrame, facePosition) that returns a new VideoFrame similar to the given videoFrame, but with the non-face parts blurred. The example also displays the video before and after the effect in two video elements.
const stream = await navigator.mediaDevices.getUserMedia({video: true});
const videoTrack = stream.getVideoTracks()[0];
const processor = new MediaStreamTrackProcessor({track: videoTrack});
const generator = new MediaStreamTrackGenerator({kind: 'video'});
const transformer = new TransformStream({
   async transform(videoFrame, controller) {
      let facePosition = await detectFace(videoFrame);
      let newFrame = blurBackground(videoFrame, facePosition);
      videoFrame.close();
      controller.enqueue(newFrame);
  }
});

processor.readable.pipeThrough(transformer).pipeTo(generator.writable);
const videoBefore = document.getElementById('video-before');
const videoAfter = document.getElementById('video-after');
videoBefore.srcObject = stream;
const streamAfter = new MediaStream([generator]);
videoAfter.srcObject = streamAfter;

The same example using a worker:

// main.js
const stream = await navigator.mediaDevices.getUserMedia({video: true});
const videoTrack = stream.getVideoTracks()[0];
const processor = new MediaStreamTrackProcessor({track: videoTrack});
const generator = new MediaStreamTrackGenerator({kind: 'video'});
const worker = new Worker('worker.js');
worker.postMessage(
  {readable: processor.readable, writable: generator.writable},
  [processor.readable, generator.writable]);
const videoBefore = document.getElementById('video-before');
const videoAfter = document.getElementById('video-after');
videoBefore.srcObject = stream;
const streamAfter = new MediaStream([generator]);
videoAfter.srcObject = streamAfter;

// worker.js
self.onmessage = async function(e) {
  const transformer = new TransformStream({
    async transform(videoFrame, controller) {
        const facePosition = await detectFace(videoFrame);
        const newFrame = blurBackground(videoFrame, facePosition);
        videoFrame.close();
        controller.enqueue(newFrame);
    }
  });

  e.data.readable.pipeThrough(transformer).pipeTo(e.data.writable);
}

3.2. Multi-source processing

Suppose there is a model for audio-visual speech separation, represented by a class AudioVisualModel with a method updateVideo(videoFrame) that updates the internal state of the model upon a new video frame, a method getSpeechData(audioData) that returns a noise-canceled AudioData given an input raw AudioData, and a close() method that releases resources used internally by the model.
// main.js
const stream = await navigator.mediaDevices.getUserMedia({audio: true, video: true});
const audioTrack = stream.getAudioTracks()[0];
const videoTrack = stream.getVideoTracks()[0];
const audioProcessor = new MediaStreamTrackProcessor({track: audioTrack});
const videoProcessor = new MediaStreamTrackProcessor({track: videoTrack});
const audioGenerator = new MediaStreamTrackGenerator({kind: 'audio'});
const worker = new Worker('worker.js');
worker.postMessage({
    audioReadable: audioProcessor.readable,
    videoReadable: videoProcessor.readable,
    audioWritable: audioGenerator.writable
  }, [
    audioProcessor.readable,
    videoProcessor.readable,
    audioGenerator.writable
  ]);

// worker.js
self.onmessage = async function(e) {
  const model = new AudioVisualModel();
  const audioTransformer = new TransformStream({
    async transform(audioData, controller) {
        const speechData = model.getSpeechData(audioData);
        audioData.close();
        controller.enqueue(speechData);
    }
  });

  const audioPromise = e.data.audioReadable
      .pipeThrough(audioTransformer)
      .pipeTo(e.data.audioWritable);

  const videoReader = e.data.videoReadable.getReader();
  const videoPromise = new Promise(async resolve => {
    while (true) {
      const result = await videoReader.read();
      if (result.done) {
        break;
      } else {
        model.updateVideo(result.value);
        result.value.close();
      }
    }
    resolve();
  });

  await Promise.all([audioPromise, videoPromise]);
  model.close();
}

An example that instead allows video effects that are influenced by speech would be similar, except that the roles of audio and video would be reversed.

3.3. Custom sink

Suppose there are sendAudioToNetwork(audioData) and sendVideoToNetwork(videoFrame) functions that respectively send AudioData and VideoFrame objects to a custom network sink, together with a setupNetworkSinks() function to set up the sinks and a cleanupNetworkSinks() function to release resources used by the sinks.
// main.js
const stream = await navigator.mediaDevices.getUserMedia({audio: true, video: true});
const audioTrack = stream.getAudioTracks()[0];
const videoTrack = stream.getVideoTracks()[0];
const audioProcessor = new MediaStreamTrackProcessor({track: audioTrack});
const videoProcessor = new MediaStreamTrackProcessor({track: videoTrack});
const worker = new Worker('worker.js');
worker.postMessage({
    audioReadable: audioProcessor.readable,
    videoReadable: videoProcessor.readable,
  }, [
    audioProcessor.readable,
    videoProcessor.readable,
  ]);

// worker.js
function writeToSink(reader, sinkFunction) {
  return new Promise(async resolve => {
    while (true) {
      const result = await reader.read();
      if (result.done) {
        break;
      } else {
        sinkFunction(result.value);
        result.value.close();
      }
    }
    resolve();
  });
}

self.onmessage = async function(e) {
  setupNetworkSinks();
  const audioReader = e.data.audioReadable.getReader();
  const videoReader = e.data.videoReadable.getReader();
  const audioPromise = writeToSink(audioReader, sendAudioToNetwork);
  const videoPromise = writeToSink(videoReader, sendVideoToNetwork);
  await Promise.all([audioPromise, videoPromise]);
  cleanupNetworkSinks();
}

4. Security and Privacy considerations

This API defines a MediaStreamTrack source and a MediaStreamTrack sink. The security and privacy of the source (MediaStreamTrackGenerator) relies on the same-origin policy. That is, the data MediaStreamTrackGenerator can make available in the form of a MediaStreamTrack must be visible to the document before a VideoFrame or AudioData object can be constructed and pushed into the MediaStreamTrackGenerator. Any attempt to create VideoFrame or AudioData objects using cross-origin data will fail. Therefore, MediaStreamTrackGenerator does not introduce any new fingerprinting surface.

The MediaStreamTrack sink introduced by this API (MediaStreamTrackProcessor) exposes the same data that is exposed by other MediaStreamTrack sinks such as WebRTC peer connections, Web Audio MediaStreamAudioSourceNode and media elements. The security and privacy of MediaStreamTrackProcessor relies on the security and privacy of the MediaStreamTrack sources of the tracks to which MediaStreamTrackProcessor is connected. For example, camera, microphone and screen-capture tracks rely on explicit user authorization via permission dialogs (see [MEDIACAPTURE-STREAMS] and [MEDIACAPTURE-SCREEN-SHARE]), while element capture and MediaStreamTrackGenerator rely on the same-origin policy. A potential issue with MediaStreamTrackProcessor is resource exhaustion. For example, a site might hold on to too many open VideoFrame objects and deplete a system-wide pool of GPU-memory-backed frames. UAs can mitigate this risk by limiting the number of pool-backed frames a site can hold. This can be achieved by reducing the maximum number of buffered frames and by refusing to deliver more frames to readable once the budget limit is reached. Accidental exhaustion is also mitigated by the automatic closing of VideoFrame and AudioData objects once they are written to a MediaStreamTrackGenerator.

Conformance

Document conventions

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example", like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:

Note, this is an informative note.

Conformant Algorithms

Requirements phrased in the imperative as part of algorithms (such as "strip any leading space characters" or "return false and abort these steps") are to be interpreted with the meaning of the key word ("must", "should", "may", etc) used in introducing the algorithm.

Conformance requirements phrased as algorithms or specific steps can be implemented in any manner, so long as the end result is equivalent. In particular, the algorithms defined in this specification are intended to be easy to understand and are not intended to be performant. Implementers are encouraged to optimize.

Index

Terms defined by this specification

Terms defined by reference

References

Normative References

[HTML]
Anne van Kesteren; et al. HTML Standard. Living Standard. URL: https://html.spec.whatwg.org/multipage/
[INFRA]
Infra. URL: https://infra.spec.whatwg.org/
[MEDIACAPTURE-STREAMS]
Media Capture and Streams. URL: https://www.w3.org/TR/mediacapture-streams/
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://datatracker.ietf.org/doc/html/rfc2119
[STREAMS]
Streams. URL: https://streams.spec.whatwg.org
[WEBAUDIO]
Web Audio API. URL: https://www.w3.org/TR/webaudio/
[WEBCODECS]
WebCodecs. URL: https://wicg.github.io/web-codecs/
[WEBIDL]
Edgar Chen; Timothy Gu. Web IDL Standard. Living Standard. URL: https://webidl.spec.whatwg.org/
[WEBRTC-1]
WebRTC 1.0: Real-Time Communication Between Browsers. URL: https://www.w3.org/TR/webrtc/

Informative References

[MEDIACAPTURE-SCREEN-SHARE]
Screen Capture. URL: https://w3c.github.io/mediacapture-screen-share/
[WEBRTC-NV-USE-CASES]
Bernard Aboba. WebRTC Next Version Use Cases. 23 November 2021. NOTE. URL: https://www.w3.org/TR/webrtc-nv-use-cases/
[WEBTRANSPORT]
WebTransport. URL: https://www.w3.org/TR/webtransport/

IDL Index

interface MediaStreamTrackProcessor {
    constructor(MediaStreamTrackProcessorInit init);
    attribute ReadableStream readable;
};

dictionary MediaStreamTrackProcessorInit {
  required MediaStreamTrack track;
  [EnforceRange] unsigned short maxBufferSize;
};

interface MediaStreamTrackGenerator : MediaStreamTrack {
    constructor(MediaStreamTrackGeneratorInit init);
    attribute WritableStream writable;  // VideoFrame or AudioData
};

dictionary MediaStreamTrackGeneratorInit {
  required DOMString kind;
};