The dashboard of Xbox One at launch was designed to support three end-to-end navigation methods: controller, gesture and speech. It was also designed with the assumption that Kinect was always present. A full team of designers and engineers were dedicated to shipping NUI features.
Two months after launch, data showed that close to half of the users who got a Kinect with their Xbox bundle had unplugged it or covered the camera. Gesture engagement was merely 1%. That same year, Phil Spencer announced that Kinect would no longer be bundled with the Xbox One. A year after that, with the NXOE update, gesture navigation was silently dropped as a feature. Hardly anyone noticed.
As part of the NUI team, life was a constant push and pull between idealistic design philosophy, looming engineering limits, and shifting business priorities. I loved the work. I loved wandering into unexplored lands. But I was also not surprised when the cause fell apart.
Direct Manipulation vs. Symbolic Gestures
Two categories of gesture design were constantly at battle. Direct manipulation uses air gestures to select, pan, and zoom the digital surface as if it were being touched physically:
Symbolic gestures take inspiration from hand signs and can convey more complex messages:
In Xbox 360, Kinect v1 gestures were limited to target (move) and select (hold). Kinect v2 had much better cameras and algorithms. It was able to track the body more accurately and recognize many more states of the hands. The Xbox One core gesture set included target (move), select (push and release), pan (grab and drag), zoom (grab and push/pull) and show menu (long push).
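One way to picture that core set is as a per-frame classifier over hand state and motion. Everything below is a hypothetical sketch for illustration: the `HandFrame` shape, the threshold, and the names are mine, not the Kinect SDK types or the shipped recognizer.

```python
from dataclasses import dataclass

@dataclass
class HandFrame:
    """One frame of hand observation (hypothetical shape, not the Kinect SDK)."""
    is_closed: bool            # grip detected
    depth_delta: float         # change in distance to camera since last frame,
                               # meters (negative = moving toward the camera)
    xy_delta: tuple            # lateral (x, y) movement since last frame, meters

PUSH_SPEED = -0.02  # meters per frame toward the camera; illustrative threshold

def classify(frame: HandFrame) -> str:
    if frame.is_closed:
        # Grabbing: motion along the depth axis zooms, lateral motion pans
        if abs(frame.depth_delta) > abs(max(frame.xy_delta, key=abs)):
            return "zoom"
        return "pan"
    if frame.depth_delta < PUSH_SPEED:
        return "select-press"   # open hand pushing toward the camera
    return "target"             # open hand moving: just a cursor
```

A fast grab-and-pull would classify as "zoom", a grab-and-drag as "pan", while an open hand merely "targets" until it pushes fast enough to "select".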
In almost every exploration for a new gesture, some symbolic metaphors came into debate. For example, the volume-control gesture:
Many internal critics attributed the low usage of gesture to direct manipulation and the fatigue it causes. It was indeed very tiresome to keep an arm raised and waving around. However, it was also a conscious decision that the Xbox One dashboard would adopt direct manipulation as its primary gesture model. It was considered easier to learn and retain, more intuitive, and less likely to cause confusion across cultures.
A Feasible Input Modality?
I worked on designing the visual feedback for these gestures. The real frustration of air gestures comes from the fact that there can be no feedback when they're not recognized. Take the select gesture as an example. It's impossible for the user to move their hand on a perfect plane parallel to the camera, so the algorithm distinguishes an unintended press from a real one by how fast the depth of the hand changes. However, when a user tries to push and the system doesn't respond, their intuition is to press slower - so "the camera can see better." It's the same mentality as speaking in a Batman voice to a voice agent. Slowing down makes the gesture even less likely to register. Thus a negative feedback loop.
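The speed-based discrimination described above can be sketched in a few lines. The threshold and frame rate here are assumptions for illustration, not the shipped values; the point is only that a slow, "careful" push covering the same distance never crosses the speed bar.

```python
def detect_press(depths, fps=30, min_speed=0.3):
    """Return True if the hand approached the camera fast enough to count as a press.

    depths: hand-to-camera distances in meters, one per frame.
    min_speed: required approach speed in m/s (illustrative threshold).
    """
    for prev, cur in zip(depths, depths[1:]):
        speed = (prev - cur) * fps  # m/s toward the camera
        if speed >= min_speed:
            return True
    return False

# A brisk push registers; a slower push of the same total distance does not,
# which is exactly the negative feedback loop: the user slows down to "help".
brisk = [1.00, 0.97, 0.94, 0.91]    # ~0.9 m/s approach
slow = [1.00, 0.995, 0.99, 0.985]   # ~0.15 m/s approach
```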
On top of that, false engagement was also annoying. The gesture system imposed an implicit constraint on what postures one could take in one's own living room. The system responded to unintended gestures often enough that many users simply blocked the camera.
In an attempt to redeem gesture, I proposed that gesture and speech may work better together. In real life, people do it all the time:
Using speech and gesture to confirm one another, the accuracy can be significantly improved. For example, raising one’s hand alone is not going to engage gesture; the user also needs to call for attention by saying “Xbox”. In a prototype, if the user brings an item closer for scrutiny (grabs and pull), speech is automatically activated and limited to what’s contextual to the item (“launch this”):
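The cross-confirmation idea from the prototype can be sketched as a simple time-window check: neither signal engages on its own, only the two together. The window length and function names are hypothetical, chosen for illustration.

```python
ENGAGE_WINDOW = 2.0  # seconds between the two signals; illustrative value

def should_engage(hand_raised_at, said_xbox_at):
    """Engage gesture control only when both signals agree in time.

    Arguments are event timestamps in seconds, or None if the event
    never happened. This sketches the prototype's cross-confirmation
    idea, not a shipped API.
    """
    if hand_raised_at is None or said_xbox_at is None:
        return False
    return abs(hand_raised_at - said_xbox_at) <= ENGAGE_WINDOW
```

A raised hand alone (`should_engage(10.0, None)`) does nothing; a raised hand shortly followed by "Xbox" (`should_engage(10.0, 11.0)`) engages. The same pattern extends to the item-scrutiny prototype, where a grab-and-pull narrows the active speech grammar to item-contextual commands.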
The Troubleshooting Hell
Xbox One launched with some magical identity scenarios relying on Kinect technology. Among them:
- Walk into the living room, the Xbox will automatically sign you in.
- When multiple users are present, saying "Xbox, go to Friends" will open the Friends list that belongs to the speaker.
- When multiple users are present, a controller is automatically assigned to the person who is holding it. No more "my controller or yours?"
The accuracy of Kinect is remarkable when broken down into its subprocesses: track body, track face, recognize face, locate voice. When those accuracies are multiplied together, not so much.
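The compounding is simple arithmetic. With made-up per-stage accuracies (the real figures were never public), four stages at 95% each already drop below 82% end to end, roughly one failed scenario in five:

```python
# Hypothetical per-stage accuracies; the real numbers were never published.
stages = {
    "track body": 0.95,
    "track face": 0.95,
    "recognize face": 0.95,
    "locate voice": 0.95,
}

end_to_end = 1.0
for accuracy in stages.values():
    end_to_end *= accuracy

print(f"{end_to_end:.3f}")  # 0.815
```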
There are two steps for a user to fix the problem: noticing that there is an identity error, then fixing it. For the first step, we spent much effort designing a subtle yet noticeable visualization of which users were signed in, tracked, and in control. Here's one of the prototypes I built to illustrate the behaviors of this control under complex multi-user situations.
The next step was also a maze of its own:
After launch, it was observed that few users noticed or understood the identity visualization in Home. Identity errors continued to cause confusion, not only among users but also among game developers. NXOE had to discard the auto-matching feature and return to manual controller binding.