Identification of Captured Application By Capturer #52
I've edited the PR (#163) to:
|
These two sound like the actual use case is:
I think it's important to look at the original use case, to not bake in assumptions or implementation decisions already taken. With that in hand, I'd take a step back and ask whether getDisplayMedia or getViewportMedia is the right tool. Attempting to build this new integrated experience over getDisplayMedia seems:
Of course, integrating this with getViewportMedia is not without challenges either, but seems more future proof (none of the above problems). Challenges that remain would be:
I think there's a lot to be worked out here to be able to support this use case. I think we should do that before we attempt to standardize pieces of the puzzle. |
I also sense a larger question here: how do webpages cooperate on the web, to create an integrated experience? Are we sure we want getDisplayMedia at the core of it? |
I think this recurring topic deserves separate discussion. If the WG decides to abandon gDM, then we can close all threads relating to incremental improvement of gDM. Until such a decision is made, I think we should judge proposals to improve gDM on their own merits. Looking at this specific proposal (#166) and its associated PR (#163), I do not see any criticism of the proposal's merits. The usefulness seems well-established. Your comment lists some ways in which gVM could eventually deal with it better - in that case, let's hope that web-developers migrate to using gVM when it's specified and implemented. But please note:
IMHO, my third point is significant, and worth repeating - it is unclear how much enthusiasm web-developers will have for investing time+money to completely overhaul their applications to make exclusive use of gVM. More likely we'll see gradual adoption and slow replacement of gDM in some spots, and plenty of applications intentionally maintaining gDM-use for many, many years to come. If you have suggestions for making this incremental improvement to gDM even better - I'd be very happy to incorporate it, as I have incorporated @youennf's suggestion wrt opt-in origin exposure. I would also be happy to discuss any adjacent improvements that you think could complement this proposal, such as the idea of adding a |
(My latest proposal is outlined in this explainer.) Following the WebRTC WG meeting we've just had, I'd like to gauge where we currently stand, and what the path forward is:
Where do we stand, then? If these two changes were made to my latest suggestion, would we have consensus? Or do we have additional points of disagreement? |
I think these two suggestions are improvements. In that case, maybe the generated ID should actually be a JS object that would be a proxy to the captured tab. |
This would be a much taller order with significant implications for security. At the moment we're having trouble reaching consensus over something far more modest, despite kicking off the effort 2 months ago and despite all previous proposals from Apple being adopted. I think it's better to think of this as a potential extension. (Please recall that while discussing this feature, Apple has argued for reducing initial scope. For example by removing
We could discuss either a generated ID that's mutable or immutable on navigation. But before we delve into that topic, I think it would be useful to understand if Mozilla and Apple would accept such a proposal. I would like to have firm goalposts. Full disclosure - I am very unhappy with a browser-assigned ID for several reasons. We can go into that soon. But I'd like to at least establish this as our last remaining point of disagreement. |
As I said in the past, I am fine with having something very modest, like exposing a static piece of information which always contains an origin (or exposing nothing). |
On the one hand: You suggested exposing the origin back on the original thread. It seems like you're still fine with it, and even ask for it to be done by default, with no way for the captured app to opt-out of that part. On the other hand: You list change of origins (e.g. when the captured tab experiences navigation) as an unexpected (?) complicating factor. Please help me understand. Has new information come to light, or have the goalposts moved? |
One known use-case is for a web-app to ensure it is capturing itself or one of its tabs. This could be supported by this 'modest' proposal. Another use-case is for a web-app to capture another tab and start driving this tab, which requires tight synchronization between the two. My understanding is that this is similar to the previous case (user needs to select the right google doc tab) except that the two tabs will start interacting. This can be supported by this 'modest' proposal. The third use-case is the same as the previous one, except that now the main captured document that is part of the synchronization might disappear and synchronization might be recreated by the next top level document of the same tab. |
The context of this message is completely lost on me. I don't understand how it correlates to our discussion thus far. I can understand each paragraph in isolation, but what is the general thrust of this message? What does it say? |
Sorry if I was not clear, I was trying to express the granularity of use case complexity. Here is another idea, not thought out at all, but it at least helps illustrate the diversity of APIs that could be used. The captured page needs to opt in, through an opt-in API, for the port to be exposed on the capturing page. To support navigation use cases, a new set of ports would be created when the opt-in API is called in the post-navigation captured context. |
I think it would be good, in the future, if our discussions of possible approaches could follow the standard progression of paring down possible solutions. Late proposals for radical deviations from the agreed-upon course are not conducive to reasonably-paced progress. We've been discussing exposure of origin-and-handle for two months now. What's changed? |
Added a fifth use-case - avoiding the "hall of mirror" effect when self-capturing. |
I also wanted APIs like this under site-isolation and capture opt-in. So I'd say we've not reached our last point of disagreement.
I feel @youennf understands the complexity that's not apparent in the OP as it conflates the capture of an "app" with capture of a tab:
To poke holes: what if the user instead chooses a non-Presentron tab, but later navigates to Presentron in it? I'm not ready to concede that to capture a web-based presentation program requires indiscriminate capture of its browsing context and all its navigation. That's an unsafe foundation to build on IMHO. We have a mandate to make web capture safe. And this isn't. I'm also not ready to concede that to solve basic "next/previous slide" controls, requires building the ability to remotely browse ("drive") a tab. This proposal also presents a stark contrast to I'd prefer to take a step back and have a higher-level discussion around that. I think there's a way to solve this that is both safe and solves the web presentation use case, but I'll open a new issue on that. |
That tabs can be navigated after capture begins was discussed explicitly two months ago in my very first message on this topic. In this linked message, search for the section titled "What if the user navigates the captured tab?"
We have discussed this. If you refer back to my slides, you will see that there is an event fired.
I am afraid that our memories of this feature's fundamental nature are out of alignment. I'd like to encourage you to re-read (a) the original thread, (b) OP and (c) the explainer. Capture-Handle does not "build" the ability to remotely drive the tab. That's something you can build on top of Capture-Handle. |
@eladalon1983 I'm not saying it wasn't discussed, but that I find it overly complex for the use case at hand, and would like to explore better web integration built on safer tech than today's unsafe non-isolated model. |
Halting all (relevant) progress on screen-capture until |
I'm not proposing we block on the |
Proposal w3c/mediacapture-screen-share-extensions#9 packages two separate issues in unnecessary union.
|
I don't really see why it does not solve 2, 3 and 4, given MessagePort would be complemented by Origin.
Well, that is the end goal of the main use case you are bringing to the table.
Small initial scope would be origin, maybe origin plus pathname or something like that. |
Assuming it IS complemented by origin, then some of these use-cases would also be partially addressed¹, but to an inferior extent. Motivating a change from our current approach to a new approach requires a set of compelling reasons. I have not yet heard such reasons. This new approach offers more complexity (where previously less complexity was requested by Apple) and inferior handling of the use-cases I have cited. Capture Handle is the superior solution here. --
|
Shameless plug: https://webrtchacks.com/capture-handle/ |
I've been discussing this with @jan-ivar, and my understanding of his current position is that he would support this proposal[*] if we add the ability for some basic messages to be sent capturer->capturee. This sounds like a good idea to me, and I am happy to resume the discussion accordingly. The proposal would then comprise two parts:
Overview of the added basic-messaging capabilities:
Shared capturer/capturee:
Capturee-side:
Capturer-side:
Noteworthy
@jan-ivar: Have I accurately represented your position? Do you agree with the general approach I suggest here for basic messaging? |
In general, I like the idea of actions. I am also interested in the trust model, which probably applies to both CaptureHandle and actions.
Another approach would be to state that the UA is the entity sanitising capturer/capturee relationship.
Can you clarify what the proposal is? I would personally be inclined to split the work in two different specs given the scopes seem to be different enough. |
There are two core issues addressed in this thread, which I'll call "identity" (original proposal) and "actions" (additional mechanisms). I also prefer splitting the work, but I tentatively propose addressing both issues together as an attempt to reach a compromise with @jan-ivar. To avoid risking misrepresenting his position, I'd like to ask @jan-ivar to explain why he thinks the two should be combined. Clarifying my question about support - it's an open ended question. Would you support either part (identity/actions)? Both? Only a certain mix? I hope that you'll be amenable to the identity part at least, @youennf, as the design was much influenced by your earlier input. :-)
Could you please clarify your proposal here?
It's an interesting issue, and might not have the same answer for both directions.
I believe that's equally true for all proposals currently under discussion (modulo declaring capabilities, discussed below).
The idea behind the current actions-proposal is that the capturing application can expose its own custom, in-content controls for the intersection of controls supported by the capturer and the capturee. So, for example, if Zoom captures Slides, it could expose |
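As a rough illustration of the "intersection of controls" idea in the comment above, the capturer could compute which controls to render from two lists of supported actions. This is only a sketch: the `sharedActions` helper and the action names (`next-slide`, `previous-slide`, and so on) are hypothetical, and nothing in this thread specifies how either side would declare its capabilities.

```typescript
// Hypothetical action names; the thread does not define an action vocabulary.
type Action = string;

// Returns the actions that both sides support, in the capturer's order,
// with duplicates removed. The capturer would render controls only for these.
function sharedActions(
  capturerSupports: readonly Action[],
  captureeSupports: readonly Action[],
): Action[] {
  const offered = new Set(captureeSupports);
  return capturerSupports.filter(
    (action, index) =>
      offered.has(action) && capturerSupports.indexOf(action) === index,
  );
}

const controls = sharedActions(
  ["next-slide", "previous-slide", "pause", "next-slide"],
  ["previous-slide", "next-slide", "first-slide"],
);
console.log(controls.join(","));
```

Keeping the capturer's ordering means the capturing application stays in charge of how its own controls are laid out.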
@eladalon1983, @jan-ivar, following up on yesterday's meeting, here are examples that I hope clarify what I have in mind.
Capturee is exposing its origin through MediaSessionProxy.origin to capturers whose origin is granted by capturee's MediaSessionProxyRules.
Capturee is exposing its origin through CaptureeProxy.origin to capturers whose origin is granted by capturee through setGetDisplayMediaChannelCallback. |
Thank you for these proposals, @youennf.
This is an interesting proposal. I have saved some nits for later, so as to focus first on the general thrust. The proposal takes great pains to latch onto an existing mechanism (MediaSession). It's not immediately clear to me what is gained by making that design decision. Could you please explain? As for the drawbacks I can see:
Would love to hear more of your thoughts, as well as those of @jan-ivar.
I'd rather steer clear of this. On the one hand, it requires tight cooperation, or else how would the capturer/capturee understand each other? So we can take it as a given that users of this API would be tightly integrated. On the other hand, it forces a communications method, and I think this should be left out of scope. Tightly integrated applications have their own various means. |
@youennf Reusing mediaSession, while intriguing, seems risky design-wise and scope-wise: repurposing a well-known API with a known trust model to now also be something else. I don’t see a lot of user benefit frankly. I'd prefer keeping what's exposed in this WG. We have to be careful not to add ways for malicious sites to remotely operate arbitrary captured pages in ways that may deceive users. E.g. I’m also not sure “advance slide” is the same thing as “next track”: e.g. a presenter may have background music playing, and expect the latter to skip to the next audio track, not advance to the next slide. While "play", "pause" and "resume" may be reasonable, what if capturee has two video elements, which one plays? I sense a slippery slope here toward users asking to click on buttons in the capture preview and have it affect buttons in the capturee page. While this might be useful, if done wrong (letting JS control coordinates) it might let scammers remotely operate a user's browser. So my instinct is we want to tightly control this separate from mediaSession.
I'd rather avoid adding another messaging channel to the platform. Also, as soon as such a channel exists, the parties can exchange IDs anyway, so this seems like a superset of @eladalon1983's approach.
I have to say I like how this separates the control surface from the MediaStreamTrack. Tracks can be cloned and transferred, and it's not clear to me this surface should follow it, e.g. to a worker (would |
In the WG meeting, YouTube was given as an example where such API could be potentially useful. If we take the approach to add specific actions, I do not want to end up in a place where we duplicate the work with MediaSession.
The CaptureHandle's proposal is adding a one way communication from capturer to capturee. The action's proposal is most probably also creating a two way communication channel between capturer and capturee (we would need to have a clear API to validate this).
I do not think it requires the same tight cooperation.
This communication method is the traditional way of doing cross-context communications on the web (postMessage).
If we take the current eventing model, calling getDisplayMedia twice in a row, say A and B. As long as A and B trigger the prompt, the assumption is that the order is preserved as:
The case where ordering is not guaranteed is if B fails, in which case promise B may reject sooner (but there will be no event B), though we could fix this as well. |
MediaSession is a rich spec that offers much more than sending simple actions. I think the duplication of work is quite minimal. I have not yet heard what entanglement with MediaSession would improve.
It's the other way around.
Looks pretty one-way to me. See explanation above. Each part (Identity, Actions) produces a distinct one-way communication channel. These APIs (Identity, Actions) are useful both in isolation as well as together. One session can involve one, the other, neither, or both.
I'd like to join @jan-ivar's objection to adding more generic message channels. Before we dive too deep into the discussion of whether it's secure, I think the onus is on you to show it's desirable to add a generic messaging channel where a limited one would do.
@youennf, I believe these two lines contradict each other. The second line explains the necessity of a mini-protocol for both approaches (my Identity, your MessageChannel-based communication approach). Namely, that even if the mini-protocol is as simple as a stringified JSON with a single key-value pair, e.g.
When applications share a cloud infrastructure, it might be preferable for the developers to go through some pre-existing RESTful API than to add code in the capturee to handle messages from the capturer. Especially if the captured application does not wish to assume that the local user delegates their permissions in captured-application to capturing-application just by allowing it to display-capture. Since the captured-application will still treat messages from the capturing-application as suspicious (the local user might only have partial understanding of what display-capture allows here), it's easier to use pre-existing mechanisms for access-control, than to replicate them for yet another communications channel. |
If we go with actions like next slide, previous slide, capturer might want to understand whether:
We could try to be very restricted in terms of actions, my gut feeling is telling me people will want more than that.
Identity provides a de facto one-way communication channel. Actions also provide a two-way communication channel:
IIRC, in a past WG meeting, you stated that, in some cases, the goal is to create such a messaging channel, either through a network intermediary or through something like RTCPeerConnection. If we believe this is something desirable, direct support through postMessaging seems a better option to me.
The capturer/capturee approach is like a client/server approach: server/capturee defines the protocol, client/capturer has to abide by it. Server/capturee may or may not restrict client to specific origins.
This does not seem contradictory to me: the same RESTful API can be used to validate or not the request to open a channel with a capturer. |
I have not understood this message.
In both cases, it is necessary for both sides to agree on how messages are structured. It's equally true in both cases, and therefore the tightness of collaboration assumed is the same.
I do not believe it makes sense to force two applications that are already communicating using tried-and-true mechanisms that were produced by expensive-to-employ engineers, to now support a new method of communication. Enable - great, let's bookmark the idea of adding a MessagePort to the capture-handle API, and circle back to it when it's time for improvements. But for the MVP, a simple string is enough. |
If we end up duplicating some of the mediaSession API surface, so what? I see benefit in doing so, as it gives JS full control over whose control they wish to enable. If things are almost the same, they are not the same. We can follow patterns without sharing WebIDL. |
The actions proposal is based on interest from 'capturer', I haven't heard any interest from 'capturee'. That puts this API at risk. To be successful, this API should be as good if not better than the out-of-band approach 'capturee' are apparently planning to use.
As I said, this gives the choice, existing communication channels can continue to be used.
It is not a new method, it is reusing a well known web pattern between two entities that do not trust each other deeply (cross-origin iframes or opener/openee).
AIUI, it is not a simple string, it is a string + an origin + an event. This makes it very close to postMessage, albeit transfer. |
We will still want the string even if we add the channel:
Can we agree to label the channel as an improvement? |
This is the culmination of discussions in w3c/mediacapture-screen-share#159. I am re-summarizing so as to avoid misunderstandings stemming from that other issue's history (it started focused on label, but evolved in a different direction).
Problem
When the user chooses a tab using getDisplayMedia, the capturing application has no good way of discovering which application it is capturing.
Use cases
1. Establishing Cross-Tab Communication
The stress here is on "establishing," by which I mean identification. Communication itself is a solved issue once identification takes place - we can use BroadcastChannel, a shared back-end, sometimes a service worker. This issue is concerned with how the capturing application can identify the captured application, so that messages could be addressed specifically to it.
For example, assume two collaborating apps from ACME corporation - a capturing VC application called ACME VC-Max, and productivity suite app called ACME Presentron. (The marketing department took a day off.) The user has many open tabs, both of Presentron as well as of other applications. When VC-Max asks to capture a tab, and the user chooses Presentron, we want VC-Max to be able to identify that this selection took place. Moreover, we want Presentron to be able to declare an ID; VC-Max can then use this ID to address messages solely to the specific captured Presentron session. (Note - reliably and ergonomically passing this ID is in-scope; use of the ID for communication is beyond scope; it's enough for us that it's possible.)
Once this ID is passed and communication is initiated, VC-Max can display controls for the user that will allow the user to flip through slides on the captured Presentron slide deck from within the VC-Max session.
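A minimal sketch of the identification step, under stated assumptions: the `ObservedCaptureHandle` shape (an optional origin plus a handle string), the `PRESENTRON_ORIGIN` value and the `sessionAddressFor` helper are all made up for illustration and are not part of any agreed API. The point is only that, once the capturer knows the origin and the ID, it can derive an address for one specific captured session.

```typescript
// Hypothetical shape of what a capturer might observe about the captured tab.
interface ObservedCaptureHandle {
  origin?: string; // assumed to be present only if the captured app exposes it
  handle: string;  // opaque ID chosen by the captured app
}

// Made-up origin for the fictional ACME Presentron app.
const PRESENTRON_ORIGIN = "https://presentron.acme.example";

// Returns an address for one specific Presentron session, or null when the
// captured tab cannot be identified as a Presentron session.
function sessionAddressFor(observed: ObservedCaptureHandle | null): string | null {
  if (observed === null) return null;                        // nothing exposed
  if (observed.origin !== PRESENTRON_ORIGIN) return null;    // some other app
  if (observed.handle.length === 0) return null;             // no usable ID
  return `presentron-session:${observed.handle}`;
}

console.log(sessionAddressFor({ origin: PRESENTRON_ORIGIN, handle: "deck-42" }));
```

The returned string could serve as a routing key on a shared back end, or as a BroadcastChannel name when both applications are served from the same origin (BroadcastChannel only reaches same-origin contexts). How the messages are actually carried stays out of scope, as noted above.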
2. Analytics
Capturing applications can gather statistics about which applications their users tend to capture. This can be used to improve service for the users by introducing collaborations. One such possible collaboration was described in the use-case above.
3. Conditional Tab-Focus Change
In w3c/mediacapture-screen-share#165 I proposed an API for one-way hand-off of tab-focus from capturer to captured. Consider the intended case for this - shortly after capture starts. With the capture-handle defined by the current issue, a capturing application could make an informed decision about whether it wants to hand off tab-focus to the captured application, depending on what the captured application is.
4. Rejecting Undesired Captures
Issue w3c/mediacapture-screen-share#143 introduced a web-developer who wanted to discard MediaStreams that resulted from the capture of either blocklisted or non-allowlisted origins. If we can enable this use-case along the way, that would be great.
5. Detecting Self-Capture
This is a sub-case of use-case 4, but deserves elaboration due to its ubiquity. It is common for VC applications to experience a "hall of mirrors" effect when the user unintentionally self-captures. If the app can detect self-capture, it can also avoid using the stream until the user chooses a new source.
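Use-cases 4 and 5 could both reduce to one decision the capturer makes right after capture starts. The sketch below assumes a hypothetical `ObservedCaptureHandle` shape and assumes the capturing app also publishes its own handle (for example a random per-tab ID), so that it can recognize itself; the `decideCapture` helper and the decision labels are illustrative only.

```typescript
// Hypothetical shape of what a capturer might observe about the captured tab.
interface ObservedCaptureHandle {
  origin?: string;
  handle: string;
}

type CaptureDecision = "use" | "discard-self-capture" | "discard-not-allowed";

// Decides whether the capturer should keep using the stream it was given.
function decideCapture(
  observed: ObservedCaptureHandle | null,
  ownOrigin: string,
  ownHandle: string,
  allowedOrigins: ReadonlySet<string>,
): CaptureDecision {
  // Use-case 5: same origin and same handle means the tab captured itself.
  if (
    observed !== null &&
    observed.origin === ownOrigin &&
    observed.handle === ownHandle
  ) {
    return "discard-self-capture";
  }
  // Use-case 4: unknown or non-allowlisted origins are rejected.
  if (
    observed === null ||
    observed.origin === undefined ||
    !allowedOrigins.has(observed.origin)
  ) {
    return "discard-not-allowed";
  }
  return "use";
}

const allowed = new Set(["https://vc.example", "https://slides.example"]);
console.log(
  decideCapture({ origin: "https://vc.example", handle: "tab-1" }, "https://vc.example", "tab-1", allowed),
);
```

On either "discard" outcome, the application would stop using the stream and could prompt the user to choose a new source. A blocklist variant would invert the set lookup.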
Solution
Define MediaDevices.captureHandle. If set, the application can use it to expose information to potential capturing applications.
The "handle" string can be an ID which is meaningful given the origin (each origin and its published ID schema).
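To make the data flow concrete, here is a toy model of the captured side. It does not call any browser API: the `CaptureHandleConfig` fields, the `exposeOrigin` flag (standing in for the opt-in origin exposure mentioned in the discussion) and the `observe` helper are assumptions made for illustration, not the proposed WebIDL.

```typescript
// Hypothetical configuration a captured app might publish.
interface CaptureHandleConfig {
  handle: string;        // ID that is meaningful within the app's own origin
  exposeOrigin: boolean; // assumed opt-in flag for revealing the origin
}

// Hypothetical view a capturing app would get of the captured tab.
interface ObservedCaptureHandle {
  origin?: string;
  handle: string;
}

// Models what the browser would hand to the capturer for a given captured tab.
function observe(
  capturedOrigin: string,
  config: CaptureHandleConfig | null,
): ObservedCaptureHandle | null {
  if (config === null) return null; // the captured app set nothing
  return config.exposeOrigin
    ? { origin: capturedOrigin, handle: config.handle }
    : { handle: config.handle };
}

const seen = observe("https://presentron.acme.example", {
  handle: "deck-42",
  exposeOrigin: true,
});
console.log(seen);
```

In this model an application that sets nothing exposes nothing, which matches the "If set" wording above.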
Noteworthy