I’ve always had a soft spot for input methods and soft keyboards. They’re one of those small things you don’t notice when they’re working, but that really make a difference to your quality of life. I was a little disappointed with how conservative the Xbox One virtual keyboard was. Steam’s Big Picture had a solution that was both beautiful and performant, despite a relatively steep learning curve. PS4 offered an optional free-cursor mode for its QWERTY, taking advantage of the gyroscope in the DualShock 4. Xbox’s VK was just less fun.
Although typing has never been easy on a TV, the virtual keyboard is unlikely to go away. Social, search, web, and purchase are central to our experience, and all depend on some form of textual input. Speech-to-text helps, but it’s not perfect. At a loud party, when typing non-dictionary words, or for users with accents or multiple languages, it can get quite frustrating. I knew of many VK explorations in the academic world, but compared to mobile/touch, few made their way into consumer products on TV. I was tempted to try something of my own, even if just to satisfy my curiosity.
I began by testing a simple idea: what if some keys are disabled according to context? In theory it would accelerate typing by requiring fewer keystrokes to reach the next letter, and it keeps the familiar QWERTY layout, so there is no learning curve. A mode switch could cover the case where the user needs to type a non-dictionary word. It turned out focus management was a nightmare: focus ended up on seemingly random keys when the previously focused key became disabled, and directional navigation was simply unpredictable.
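The core of the idea is easy to sketch. Here is a minimal illustration with a tiny stand-in word list (the real dictionary and key model were, of course, much larger): given the letters typed so far, only keys that can extend the prefix toward a dictionary word stay enabled.

```python
# Stand-in dictionary for illustration; the real one was much larger.
WORDS = ["hello", "help", "hero", "world", "word"]

def enabled_keys(prefix):
    """Return the set of letters that could legally come next."""
    return {w[len(prefix)] for w in WORDS
            if w.startswith(prefix) and len(w) > len(prefix)}

# After typing "he", only letters that continue a known word stay enabled:
print(sorted(enabled_keys("he")))  # ['l', 'r']
```

The trouble described above lives outside this function: when the key holding focus drops out of the enabled set, the focus engine has to move it somewhere, and that “somewhere” is what felt random to users.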
How about fewer keys altogether? I tested a 3x4 layout that was essentially the T9 keyboard from feature phones. It’s a model that most people are familiar with, and it has proven to work across languages. I found that the four-row layout did not eliminate overshooting, so the stress of aiming was still present. I then rearranged the keys slightly into a radial layout around a “select” button, with focus always snapping back to the center. Pressing a key takes a coordinated gesture: point with the left stick, then press A. A space is automatically added when a word is selected. The gesture takes some time to get used to, but once you do, you build muscle memory and type pretty fast.
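The point-with-the-stick half of the gesture boils down to mapping a stick position to one of the letter groups. A rough sketch, assuming the eight T9 letter groups arranged evenly around the center and an invented deadzone constant:

```python
import math

# Eight T9-style letter groups around a central "select" button.
GROUPS = ["abc", "def", "ghi", "jkl", "mno", "pqrs", "tuv", "wxyz"]
DEADZONE = 0.3  # assumed tuning: below this magnitude, focus snaps to center

def key_at(x, y):
    """Map a left-stick position to a letter group, or None for center."""
    if math.hypot(x, y) < DEADZONE:
        return None  # stick released: focus snaps back to "select"
    angle = math.atan2(y, x) % (2 * math.pi)
    # Divide the circle into 8 sectors, centered on each group.
    sector = int((angle + math.pi / 8) / (math.pi / 4)) % 8
    return GROUPS[sector]
```

Pressing A while `key_at` returns a group commits that group; the deadzone is what makes the focus “always snap to the center” when the stick is released.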
The biggest problem I observed in testing the T9 was that people got really uncomfortable when the program attempted to match the word as they typed. They wanted to go back and correct “the wrong letter”, but the only way forward was to keep typing. Although they figured it out in the end, the stress was much resented.
In response to this feedback, I dropped the grouping and made an alternative layout with all 26 letters in a radial menu. It was definitely less confusing and more scalable. Difficulty in targeting was expected, so I tested multiple algorithms to snap and accelerate the cursor based on the speed of the action. As with auto-correction on the phone, one does not have to type every letter accurately, which significantly reduced the stress of targeting. After tweaking the tolerance a few times, I found a balance where blind typing was possible.
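One of the snap-and-accelerate variants could look roughly like this; the gain constants are invented for illustration, and the cursor is modeled as an angle sweeping a ring of 26 letters:

```python
import math

LETTERS = "abcdefghijklmnopqrstuvwxyz"
# Angle of each letter on the ring (26 evenly spaced positions).
LETTER_ANGLE = {c: i * 2 * math.pi / 26 for i, c in enumerate(LETTERS)}

BASE_GAIN = 2.0  # rad/s at gentle deflection (assumed tuning)
ACCEL = 6.0      # extra rad/s at full deflection (assumed tuning)

def advance(cursor, stick, dt=1 / 60):
    """One tick of cursor movement; `stick` in [-1, 1] spins the ring.
    A fast flick covers more ground; a gentle push gives fine control."""
    gain = BASE_GAIN + ACCEL * abs(stick)
    return (cursor + gain * stick * dt) % (2 * math.pi)

def snap(cursor):
    """On release, settle on the nearest letter rather than between keys."""
    return min(LETTERS, key=lambda c: abs(
        (LETTER_ANGLE[c] - cursor + math.pi) % (2 * math.pi) - math.pi))
```

The “tolerance” being tuned is essentially how aggressively `snap` (and the word matcher behind it) forgives an off-target press.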
The beauty of the fuzzy circular keyboard was that precise typing was possible but not required. When typing a non-dictionary word, the user can simply aim more carefully. I even developed a demo to show that the keyboard was usable with Kinect gestures, using drag-and-drop to type, and with regions of the screen mapped to controller keys (video).
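The fuzzy part can be sketched as scoring dictionary words by how far each press landed from the word’s intended letters, then picking the cheapest fit. Toy dictionary and scoring here; the shipped-quality version would weight by word frequency as well:

```python
import math

LETTERS = "abcdefghijklmnopqrstuvwxyz"
ANGLE = {c: i * 2 * math.pi / 26 for i, c in enumerate(LETTERS)}
WORDS = ["cat", "car", "bat", "bar"]  # stand-in dictionary

def angular_error(a, b):
    """Smallest angle between two positions on the ring."""
    return abs((a - b + math.pi) % (2 * math.pi) - math.pi)

def best_match(angles):
    """Pick the dictionary word whose letters lie closest to each press."""
    def cost(word):
        return sum(angular_error(ANGLE[c], a) for c, a in zip(word, angles))
    candidates = [w for w in WORDS if len(w) == len(angles)]
    return min(candidates, key=cost) if candidates else None

# Three sloppy presses near c, a, t still resolve to "cat":
presses = [ANGLE["c"] + 0.2, ANGLE["a"] - 0.15, ANGLE["t"] + 0.1]
print(best_match(presses))  # cat
```

“Aiming more carefully” for a non-dictionary word simply means the raw nearest-letter sequence is used instead of the matcher’s best guess.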
Another area I was interested in was whether speech and controller input could be used together to improve typing speed. Speech was fast, and the controller could provide a more precise and controlled fallback.
I started by looking at the role speech could play in controller typing. One idea was to select a candidate from the auto-suggestions using speech. Because the typed letters had already narrowed the candidates, recognition accuracy improved. In testing, though, I found the frequent modality switch very stressful. Users did not like constantly deciding between typing another letter and speaking; the natural tendency was to keep doing the same thing once started.
So I went back to dictation. Xbox’s plan for dictation was very much like mobile: the user presses a button, starts talking, and a chunk of text is dumped into the text field. Unlike on the phone, going back to correct text on the console is very difficult. I tried to improve the correction experience by introducing block editing: instead of editing by character, selections are made by word, and each word is a smart entity that remembers the speech input that produced it. The user can then easily pick an alternative from a list of recognized candidates.
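A word-as-smart-entity can be modeled roughly like this; the class and field names are my own invention, not the shipped implementation:

```python
from dataclasses import dataclass, field

@dataclass
class WordBlock:
    """A word that remembers the recognition alternates behind it."""
    text: str                                        # currently displayed word
    alternates: list = field(default_factory=list)   # other recognized candidates

    def pick(self, index):
        """Swap in an alternate, keeping the old word available."""
        self.alternates.append(self.text)
        self.text = self.alternates.pop(index)

# Correcting a misrecognized word is a pick from a list, not retyping:
sentence = [WordBlock("I"), WordBlock("sea", ["see", "si"]), WordBlock("you")]
sentence[1].pick(0)
print(" ".join(b.text for b in sentence))  # I see you
```

Because each block keeps its alternates, correction becomes a single selection per word instead of character-level editing with a controller.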
This idea did perform better, but two major problems remained: it was still frustrating when most of the sentence was simply wrong and had to be deleted, and going back to correct a sentence was disruptive to the flow. I proceeded to test another model: after speaking, the result is displayed in ghost mode, uncommitted until the user actively confirms it. Confirming one word at a time reduces the amount of information the user needs to process, and there is no harm done when the recognition is absurdly wrong.
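The ghost-then-confirm flow is essentially a small buffer with two compartments. A sketch with invented names, where A advances one word and B throws the rest of the ghost away:

```python
class GhostBuffer:
    """Recognized text stays uncommitted ("ghosted") until confirmed."""

    def __init__(self):
        self.committed = []  # text the user has accepted
        self.ghost = []      # tentative recognition result

    def recognize(self, words):
        self.ghost = list(words)  # a new result replaces the old ghost

    def confirm_next(self):
        """Press A: accept the next ghost word."""
        if self.ghost:
            self.committed.append(self.ghost.pop(0))

    def confirm_all(self):
        """Accept the whole remaining ghost at once."""
        self.committed += self.ghost
        self.ghost = []

    def discard(self):
        """Press B: the rest was wrong; throw it away, nothing committed."""
        self.ghost = []

buf = GhostBuffer()
buf.recognize(["good", "morning", "world"])
buf.confirm_next()  # accept "good"
buf.confirm_next()  # accept "morning"
buf.discard()       # "world" was wrong; no harm done
print(buf.committed)  # ['good', 'morning']
```

Because nothing lands in `committed` without an explicit press, a wildly wrong recognition costs the user nothing but a single discard.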
One piece of feedback was that when the recognition result was perfect, it felt slow to accept everything word by word. I made press-and-hold A commit the full sentence, but the shortcut was not easy to discover.
In parallel with the dictation behaviors, I tested two listening models: active and passive. The passive, or “push to talk”, model is employed by most speech systems to avoid picking up unintended conversation. However, if the recognition result is not destructive, maybe it is okay to always be listening. Kinect allowed us to do that.
The first thing I tried was to always ghost whatever was recognized last. It did not interfere with typing if the user didn’t care, or if the recognition was wrong. After some experimentation, it evolved into a smarter behavior where a new result only overwrote part of the old one, allowing the user to correct a single word while confirming a long sentence.
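One way to get that partial-overwrite behavior is to replace the old word most similar to the re-spoken one, leaving the rest of the sentence intact. This is my guess at the mechanics rather than the original implementation, using string similarity as a cheap stand-in for acoustic similarity:

```python
import difflib

def correct(sentence, respoken):
    """Overwrite only the old word that best matches the re-spoken one."""
    scores = [difflib.SequenceMatcher(a=w, b=respoken).ratio()
              for w in sentence]
    i = max(range(len(sentence)), key=scores.__getitem__)
    updated = list(sentence)
    updated[i] = respoken  # the rest of the ghost survives untouched
    return updated

# Re-speaking one word fixes it without clobbering the whole sentence:
print(correct(["good", "boring", "world"], "morning"))
# ['good', 'morning', 'world']
```

A real recognizer would align on the audio or the n-best lists rather than on spelling, but the user-facing behavior is the same: speak one word again, and only that word changes.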
In testing, there was a lot of confusion because the system might pick up some random speech and the text would change unexpectedly. Some participants commented on their experience as they confirmed, and ended up losing their speech result. I ended up adopting a hybrid model: the user tells the system when they’re ready to speak, and the result is treated with the correct-as-you-confirm model above.
In the end, none of the explorations made it into the shell. The virtual keyboard was not a feature area we invested in. QWERTY might not be the fanciest, but everyone understood it, and it did its job. As a designer, I personally prefer experiences that are both easy to start with and rewarding to proficient users; I dislike our VK because a heavy user feels stuck with it. I did have fun making a promo video, though: