Author: Arvind Padmanabhan
Takeaways from Alexa Dev Day, Bangalore
Voice-based interfaces are all the rage right now. On one side, the technology is being driven by Machine Learning. On the other, speech-to-text, text-to-speech and Natural Language Understanding are complementing it from the perspective of user interfaces. This is not to say that keyboards and touchscreens are going away. It does suggest that there are many applications where sight and touch can be freed for parallel tasks while you converse with your apps by voice.
Some of the speech agents that power voice-based interfaces are Siri, Alexa and Cortana, just to name the well-known ones. Today I had the chance to attend Alexa Dev Day, organized by the Amazon Alexa team. Similar events are scheduled to happen all across the world in the coming weeks. The event hall was packed, mostly with developers, and mostly with those who already owned Echo devices. I found this to be the key differentiator with Alexa. Amazon has managed to get Echo devices into a number of homes and offices. Early adopters perhaps bought them for the novelty factor but, thanks to them, basic use cases have been shown to work. Now there’s sufficient interest from developers to reach this bunch of Echo owners, and beyond, with innovative apps. Novelty therefore is moving from Echo hardware to apps powered by Alexa. I met someone trying to do railway ticket bookings with Alexa. A couple of guys from Sulekha are exploring voice-based hyperlocal searches. Another person is looking to give first-aid advice and emergency care.
Alexa’s Process Flow
In simple terms, Alexa can be woken up with a “wake word”, which is typically “Alexa”. Once switched on this way, the Echo device will listen to human speech and pass it on to backend services. In the backend, the speech is converted to text. It is then analyzed by the rules set by that specific application. A response is then generated. Finally, this is converted to speech that is played back to the user. Thus, the traditional processing happens on text, but the interface for human interaction is voice.
A little more detail is in order to make sense of this workflow:
- Echo devices are normally in standby mode until woken up with “Alexa”. This is called wake word detection. It’s possible to change this word to “Echo” or “Computer” but for consistency it’s better to stick to one word across all your Echo devices.
- The backend could be hosted on any cloud, but AWS is probably the most suitable to avoid interoperability issues. Serverless architecture, AWS Lambda in particular, is best suited for this use case: cloud resources are allocated and billed only when the skill is invoked.
- Applications in the world of Alexa are called skills. Each skill is designed to be atomic and independent of other skills. It’s important to realize that these skills are not downloaded to your Echo devices. Skills are deployed on the cloud and reside there. Echo devices can directly access published public skills or private skills that belong to the same developer who owns that device.
- Skills have rules to process inputs. These rules are defined by the developer. The rules are made of two important components:
- Utterances: These are preconfigured and signify intent. For example, “Show me Italian restaurants nearby” implies an intent to find a restaurant based on current location. If a human answers “Yes”, this signifies a confirmation intent to a previous question or suggestion from Alexa.
- Slots: These are variables that are often associated with an utterance. For example, in the restaurant example above, cuisine type could be a variable. Depending on the app’s capability, this could be configured to take values such as Indian, Chinese, Italian or Mexican.
- Once speech is converted to text in the form of a JSON request, this request goes to a service on AWS Lambda. Depending on application logic, this code can make external API calls to obtain real-time data or access databases. The application logic generates a JSON response. This contains the response text marked up in SSML (Speech Synthesis Markup Language). SSML tells the text-to-speech converter where to add pauses or emphasis; it can also include recorded audio clips.
- In addition, the JSON response can include what are called cards. Where more detailed information is to be presented, these are sent to the user’s Alexa app on their smartphone. These visual cards complement the basic information presented via voice on Echo devices.
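The request/response flow above can be sketched as a minimal Lambda-style handler in Python. This is an illustration of the general JSON shapes involved, not production code; the intent name `FindRestaurantIntent` and the slot `cuisine` are hypothetical names for the restaurant example, and a real skill would typically use the Alexa Skills Kit SDK instead of building dicts by hand.

```python
def handle_request(event):
    """Sketch of an Alexa skill backend: read the intent and slots from
    the JSON request, then return an SSML response plus a simple card."""
    request = event["request"]
    if request["type"] == "LaunchRequest":
        text = "Welcome. What cuisine are you in the mood for?"
    elif request["type"] == "IntentRequest":
        # Slot values arrive pre-filled by Alexa's NLU, e.g. cuisine=Italian.
        slots = request["intent"].get("slots", {})
        cuisine = slots.get("cuisine", {}).get("value", "any")
        text = f"Here are some {cuisine} restaurants near you."
    else:
        text = "Goodbye."

    return {
        "version": "1.0",
        "response": {
            # SSML markup lets the text-to-speech engine add pauses, emphasis, etc.
            "outputSpeech": {
                "type": "SSML",
                "ssml": f"<speak>{text} <break time='300ms'/></speak>",
            },
            # A card with the same information, shown in the user's Alexa app.
            "card": {"type": "Simple", "title": "Restaurants", "content": text},
            "shouldEndSession": True,
        },
    }

# A sample event, shaped roughly like the JSON request Alexa sends to the backend:
sample_event = {
    "request": {
        "type": "IntentRequest",
        "intent": {
            "name": "FindRestaurantIntent",
            "slots": {"cuisine": {"name": "cuisine", "value": "Italian"}},
        },
    }
}
print(handle_request(sample_event)["response"]["outputSpeech"]["ssml"])
```

In a real deployment this function would be the Lambda entry point, with the event delivered by the Alexa service and the returned dict serialized back as the JSON response.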
Tools for Developers
If you’re a developer, how can you build stuff for Alexa? There are two things available:
- Alexa Skills Kit: allows you to add new skills to Alexa. Today’s Alexa, which perhaps gives answers only about baseball, can potentially give answers about cricket if the latter is built into it as a new skill.
- Alexa Voice Service: allows you to add Alexa’s capabilities to your own product. So the future of a smart home or office is toasters and cameras that you can listen and talk to via a voice interface.
Today’s session offered a wealth of useful design tips for building voice interfaces the right way. There are two parts to building a skill: the interaction model at the frontend and the programming logic at the backend. The former is managed at https://developer.amazon.com and the latter at https://aws.amazon.com. The two are fully integrated in two data centers: Ireland and North Virginia. There are many useful guides and tutorials to help developers get started. The Alexa Cookbook on GitHub is another useful resource.
Design tips are at Alexa Design. One may be tempted to start writing backend code, but in fact the recommended approach is to first work on the interaction and get that right. Start with typical use cases, then consider corner cases and error situations. Use a variety of responses to avoid making Alexa sound machine-like. Reduce friction for users by simplifying verification procedures. It’s okay to ask users to enter email addresses and passwords once. Alexa will manage the OAuth credentials for subsequent API calls to third-party services, such as OlaCabs. Skills can be designed to have short-term and long-term memory by storing context in DynamoDB. In any case, it is context that will make your skills and voice-based apps give more value to users. It’s also essential for personalization.
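As an illustration of what the frontend interaction model looks like, here is a fragment for the restaurant example. The invocation name, intent name, slot name and custom slot type are all hypothetical; the structure follows the general shape of the JSON schema used by the Alexa developer console, where sample utterances reference slots in curly braces.

```json
{
  "interactionModel": {
    "languageModel": {
      "invocationName": "restaurant finder",
      "intents": [
        {
          "name": "FindRestaurantIntent",
          "slots": [
            { "name": "cuisine", "type": "CUISINE_TYPE" }
          ],
          "samples": [
            "show me {cuisine} restaurants nearby",
            "find a {cuisine} restaurant",
            "where can I eat {cuisine} food"
          ]
        }
      ],
      "types": [
        {
          "name": "CUISINE_TYPE",
          "values": [
            { "name": { "value": "Indian" } },
            { "name": { "value": "Chinese" } },
            { "name": { "value": "Italian" } },
            { "name": { "value": "Mexican" } }
          ]
        }
      ]
    }
  }
}
```

Each sample utterance is mapped to the intent, and the `{cuisine}` placeholder tells Alexa’s NLU where to extract the slot value that is later delivered to the backend.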
To publish your skill, it has to be approved: it must adhere to content guidelines, not infringe intellectual property, and be functional and secure. Developers can make use of a handy dashboard that shows the usage of their skills. Users can also rate skills. Amazon helps developers monetize their published Alexa skills via offers and vouchers.
While skills can be built right from your web browser, there’s also the option to build them locally on your laptop and push them via a command-line interface. This is handy when you are working with lots of utterances and slots that would otherwise be hard to manage.
Not everything went well with today’s Alexa demos. Alexa couldn’t understand an American accent when set to an Indian accent. Since Echo devices don’t have GPS, the device used this morning thought it was still in Boston and responded with the wrong time. When asked to reduce the volume, Alexa responded, “Sorry, I don’t have any jokes about volume”. Utterances have to be preconfigured, and although an exact match is not expected, Alexa cannot handle semantically similar expressions that fall outside them.
Alexa has a long way to go when navigating the nuances of human communication. It interrupted the presenter a couple of times and had to be switched off. When asked for a random number between one and five, it replied, “Your random number is three”. You and I would probably just say “three”.
Because of the problem with homonyms, Alexa cannot be used for transcription. It works best with short, directed questions; it can’t handle long ramblings. Alexa understands only English for now, though at least there’s support for an Indian accent. Voice profiles are not something developers can use right now, but they are highly desirable when many family members share an Echo device. A private enterprise skill cannot yet easily be shared across all employees of a company, although hotel chains in the U.S. have used such arrangements.
Alexa won’t autonomously wake up and start talking to you or start recording stuff. It has to be commanded to wake up. But the skeptics among us will rightly point out that even if it did, we would never know. Notifications are not possible right now because Alexa values your privacy. When connecting to other devices in the home, security is a concern. Alexa can lock a smart lock but due to security reasons it won’t unlock one. We hope that it will make an exception during a fire.
But let these limitations not prevent you from adopting Alexa to develop rich voice interfaces. The future of communication is being defined right now and you have the opportunity to define it.
Arvind Padmanabhan graduated from the National University of Singapore with a master’s degree in electrical engineering. With more than fifteen years of experience, he has worked extensively on various wireless technologies including DECT, WCDMA, HSPA, WiMAX and LTE. He is passionate about tech blogging, training and supporting early stage Indian start-ups. He is a founder member of two non-profit community platforms: IEDF and Devopedia. In 2013, he published a book on the history of digital technology: http://theinfinitebit.wordpress.com.