
The number of applications and the importance of voice interfaces is growing rapidly

An American family in Portland, Oregon recently learned that their Alexa voice assistant had recorded their private conversations and sent them to a friend. The owner of the house, dubbed Danielle by the media, told reporters that she would "never plug this device in again, because it can't be trusted."

Alexa, built into Echo (1) speakers and other gadgets in tens of millions of US homes, starts recording when it hears its name, or "wake word," spoken by the user. This means that even if the word "Alexa" is mentioned in a TV ad, the device may start recording. That is exactly what happened in this case, says Amazon, the hardware's maker.

"The rest of the conversation was interpreted by the voice assistant as a command to send a message," the company said in a statement. “At some point, Alexa loudly asked: “To whom?” The continuation of the family conversation about hardwood flooring should have been perceived by the machine as an item on the customer’s contact list.” At least that's what Amazon thinks. Thus, the translation is reduced to a series of accidents.

The anxiety, however, remains. Because for some reason, in a house where until now we felt at ease, we have to enter a kind of "voice mode": watching what we say, what the TV is broadcasting and, of course, what the new speaker on the chest of drawers is saying to us.

Despite the technology's imperfections and privacy concerns, with the rise in popularity of devices like the Amazon Echo, people are starting to get used to the idea of interacting with computers using their voice.

As Werner Vogels, CTO of Amazon, noted during his AWS re:Invent session in late 2017, technology has so far limited our ability to interact with computers. We type keywords into Google using the keyboard, Vogels said, because this is still the most common and easiest way to enter information into a machine.

The Big Four

When using the Google search engine on a phone, you have probably long since noticed the microphone icon inviting you to speak. This is Google Now (2), which can be used to dictate a search query, compose a message by voice, and so on. In recent years, Google, Apple, and Amazon have greatly improved voice recognition technology. Voice assistants such as Alexa, Siri, and Google Assistant not only record your voice, but also understand what you say to them and answer questions.

Google Now is available free to all Android users. The application can, for example, set an alarm, check the weather forecast and look up a route on Google Maps. Google Assistant, a conversational extension of Google Now, is a virtual assistant for the device's user. It is available mainly on mobile and smart home devices. Unlike Google Now, it can take part in a two-way exchange. The assistant debuted in May 2016 as part of the Google messaging app Allo, as well as in the Google Home voice speaker (3).

3. Google Home

The iOS system also has its own virtual assistant, Siri, a program included with Apple's operating systems: iOS, watchOS, tvOS, HomePod and macOS. Siri debuted with iOS 5 and the iPhone 4s in October 2011 at the Let's Talk iPhone conference.

The software is based on a conversational interface: it recognizes the user's natural speech (since iOS 11 it has also been possible to type commands), answers questions and completes tasks. Thanks to machine learning, the assistant analyzes the user's personal preferences over time in order to provide more relevant results and recommendations. Siri requires a constant Internet connection; its main sources of information are Bing and Wolfram Alpha. iOS 10 introduced support for third-party extensions.

Another member of the big four is Cortana, an intelligent personal assistant created by Microsoft. It is supported on the Windows 10, Windows 10 Mobile, Windows Phone 8.1, Xbox One, Skype, Microsoft Band, Microsoft Band 2, Android, and iOS platforms. Cortana was first introduced at the Microsoft Build developer conference in April 2014 in San Francisco. The program is named after a character from the Halo game series. Cortana is available in English, Italian, Spanish, French, German, Chinese, and Japanese.

Users of the already-mentioned Alexa must also reckon with language restrictions: the digital assistant speaks only English, German, French and Japanese.

Amazon's virtual assistant was first used in the Amazon Echo and Amazon Echo Dot smart speakers developed by Amazon Lab126. It enables voice interaction, music playback, to-do list creation, alarm setting, podcast streaming and audiobook playback, and provides real-time weather, traffic, sports and other news (4). Alexa can control multiple smart devices, acting as the hub of a home automation system. It can also be used to shop conveniently in the Amazon store.

4. What Users Use Echo For (According to Research)

Users can extend Alexa's capabilities by installing Alexa "skills": additional features developed by third parties, more commonly referred to as apps, such as weather programs or audio players. Most Alexa devices let you activate the virtual assistant with a spoken activation phrase, called a wake word.

Amazon clearly dominates the smart speaker market today (5). IBM is trying to break into the top four with Watson Assistant, a service introduced in March 2018 and designed for companies that want to build their own voice-controlled virtual assistant systems. What is the advantage of the IBM solution? According to company representatives, above all far greater scope for personalization and privacy protection.

First, Watson Assistant is not branded. Companies can create their own solutions on this platform and label them with their own brand.

Second, they can train their assistant systems on their own data sets, which IBM says makes it easier to add features and commands than with other VUI (voice user interface) technologies.

Third, Watson Assistant does not feed information about user activity back to IBM: developers building on the platform keep their valuable data to themselves. Meanwhile, anyone who builds devices with Alexa, for example, should be aware that their valuable data will end up with Amazon.

Watson Assistant already has several implementations. The system was used, for example, by Harman, which created a voice assistant for a Maserati concept car (6). At Munich Airport, an IBM assistant powers a Pepper robot that helps passengers find their way around. A third example is Chameleon Technologies, which uses the voice technology in a smart home meter.

6. Watson Assistant in a Maserati concept car

It is worth adding that the underlying technology is not new here either. Watson Assistant combines the capabilities of existing IBM products, Watson Conversation and Watson Virtual Agent, along with APIs for language analysis and chat.

Amazon is not only a leader in smart voice technology, but has turned it into a business in its own right. Some companies, however, experimented with Echo integration much earlier. Sisense, a company in the BI and analytics industry, introduced Echo integration in July 2016. The startup Roxy, in turn, decided to create its own voice-controlled software and hardware for the hospitality industry. Earlier this year, Synqq introduced a note-taking app that uses voice and natural language processing to add notes and calendar entries without typing them on a keyboard.

All of these small businesses have high ambitions. Above all, though, they have learned that not every user wants to hand over their data to Amazon, Google, Apple or Microsoft, the most important players in building voice communication platforms.

Americans want to buy

In 2016, voice search accounted for 20% of all Google mobile searches. People who use the technology daily cite its convenience and the ability to multitask among its biggest benefits (for example, being able to use a search engine while driving a car).

Visiongain analysts estimate the current market value of smart digital assistants at $1.138 billion, and such systems are proliferating. According to Gartner, by the end of 2018 as much as 30% of our interactions with technology will take place through conversations with voice systems.

British research firm IHS Markit estimates that the market for AI-powered digital assistants will reach 4 billion devices by the end of this year, and that the number could rise to 7 billion by 2020.

According to reports from eMarketer and VoiceLabs, 35.6 million Americans used voice control at least once a month in 2017, an increase of almost 130% over the previous year. The digital assistant market alone is expected to grow by 23% in 2018, meaning that 60.5 million Americans will be using them, which will translate into real money for their makers. RBC Capital Markets estimates that the Alexa interface will generate up to $10 billion in revenue for Amazon by 2020.

Wash, bake, clean!

Voice interfaces are making ever bolder inroads into the home appliance and consumer electronics markets. This was already evident at last year's IFA 2017 trade show. The American company Neato Robotics introduced, for example, a robot vacuum cleaner that connects to one of several smart home platforms, including the Amazon Echo system. By talking to the Echo smart speaker, you can instruct the machine to clean your entire house at specific times of day or night.

Other voice-activated products were showcased at the show, ranging from smart TVs sold under the Toshiba brand by the Turkish company Vestel to heated blankets by the German company Beurer. Many of these electronic devices can also be activated remotely using smartphones.

However, according to Bosch representatives, it is too early to say which home assistant option will become dominant. At IFA 2017, the German technology group showcased washing machines (7), ovens and coffee machines that connect to Echo. Bosch also wants its devices to be compatible with the Google and Apple voice platforms in the future.

7. Bosch washing machine that connects to Amazon Echo

Companies such as Fujitsu, Sony and Panasonic are developing their own AI-based voice assistant solutions. Sharp is adding the technology to ovens and small robots entering the market. Nippon Telegraph & Telephone is enlisting hardware and toy makers to adapt its voice-controlled artificial intelligence system.

An old concept. Has its time finally come?

In fact, the concept of the Voice User Interface (VUI) has been around for decades. Anyone who watched Star Trek or 2001: A Space Odyssey years ago probably expected that by around the year 2000 we would all be controlling computers with our voices. Nor was it only science fiction writers who saw the potential of this type of interface. In 1986, Nielsen researchers asked IT professionals what they thought would be the biggest change in user interfaces by the year 2000. They most often pointed to the development of voice interfaces.

There are good reasons for such hopes. Verbal communication is, after all, the most natural way for people to consciously exchange thoughts, so using it for human-machine interaction seems the best solution yet.

One of the first VUIs, called Shoebox, was created by IBM in the early 1960s. It was the forerunner of today's voice recognition systems. However, the development of VUI devices was constrained by the limits of computing power. Parsing and interpreting human speech in real time requires a great deal of processing, and it took more than fifty years to reach the point where it actually became possible.

Devices with a voice interface began to appear in mass production in the mid-1990s, but did not gain popularity. The first telephone with voice control (voice dialing) was the Philips Spark, released in 1996. However, this innovative and easy-to-use device was not free of technological limitations.

Other phones equipped with forms of voice interface (made by companies such as RIM, Samsung or Motorola) regularly hit the market, allowing users to dial by voice or send text messages. All of them, however, required memorizing specific commands and pronouncing them in a forced, artificial form adapted to the capabilities of the devices of the time. This generated a large number of errors, which in turn led to user dissatisfaction.

Now, however, we are entering a new era of computing, in which advances in machine learning and the development of artificial intelligence are unlocking the potential of conversation as a new way to interact with technology (8). The number of devices that support voice interaction has become an important factor in the development of VUIs. Today, almost a third of the world's population owns a smartphone that can be used for this type of interaction. It looks as if most users are finally ready to adopt voice interfaces.

8. Modern history of the development of the voice interface

However, before we can talk freely to a computer, as the characters of 2001: A Space Odyssey did, we must overcome a number of problems. Machines are still not very good at handling linguistic nuance. Besides, many people still feel uncomfortable giving voice commands to a search engine.

Statistics show that voice assistants are used primarily at home or among close friends. None of those interviewed admitted to using voice search in public places. However, this barrier is likely to disappear as the technology spreads.

A technically difficult problem

The problem that automatic speech recognition (ASR) systems face is extracting useful data from the speech signal and associating it with a particular word that carries a particular meaning for a person. The sounds produced are different every time.

Speech signal variability is its natural property, thanks to which we recognize, for example, an accent or intonation. Each element of the speech recognition system has a specific task. Based on the processed signal and its parameters, an acoustic model is created, which is combined with a language model. The recognition system can work with a small or large number of patterns, which determines the size of the vocabulary it handles. These may be small dictionaries, in the case of systems that recognize individual words or commands, or large databases covering the equivalent of an entire language lexicon and taking the language model (grammar) into account.
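To make the interplay of the two models concrete, here is a minimal, purely illustrative sketch (the vocabulary and all probabilities are invented): a decoder scores each candidate word sequence by adding per-word acoustic scores to bigram language-model scores, then keeps the best hypothesis.

```python
import itertools

# Toy acoustic model: log-scores of candidate words for each audio
# segment. In a real ASR system these come from the acoustic model
# applied to the processed signal; the numbers here are invented.
acoustic = [
    {"I": -0.3, "eye": -0.5},
    {"scream": -0.4, "scree": -1.5},
]

# Toy bigram language model: log-score of word2 following word1.
bigram = {
    ("<s>", "I"): -0.2, ("<s>", "eye"): -2.0,
    ("I", "scream"): -0.5, ("I", "scree"): -3.0,
    ("eye", "scream"): -2.5, ("eye", "scree"): -3.5,
}

def sequence_score(words):
    """Combine acoustic and language-model evidence for one hypothesis."""
    score = sum(acoustic[i][w] for i, w in enumerate(words))
    prev = "<s>"
    for w in words:
        score += bigram[(prev, w)]
        prev = w
    return score

# Enumerate all candidate sequences and keep the most plausible one.
hypotheses = itertools.product(*(d.keys() for d in acoustic))
print(" ".join(max(hypotheses, key=sequence_score)))  # -> "I scream"
```

Real decoders search this space far more cleverly, but the principle, acoustic evidence weighed against a language model, is the same.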

The first problem voice interfaces face is recognizing speech correctly at all: speech in which, for example, whole grammatical sequences are omitted, and linguistic and phonetic errors, slips, omissions, speech defects, homonyms and unjustified repetitions occur. Despite all this, ASR systems are supposed to work quickly and reliably. At least those are the expectations.

Another source of difficulty is the acoustic signals other than the speech to be recognized that enter the recognition system's input, i.e. all kinds of interference and noise. In the simplest case, these need to be filtered out. The task seems routine and easy: after all, all sorts of signals get filtered, and every electronics engineer knows what to do in such a situation. However, it must be done very carefully and precisely if the result of speech recognition is to meet our expectations.

The filtering currently used makes it possible to remove from the speech signal both the external noise picked up by the microphone and those internal properties of the speech signal itself that make it difficult to recognize. A much more complex technical problem arises, however, when the interference in the analyzed speech signal is... another speech signal, for example loud discussions going on around the speaker. This question is known in the literature as the cocktail party problem. It requires complex methods, such as so-called deconvolution (unraveling) of the signal.
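For the simplest case described above, a classic filter suffices. A minimal sketch (using scipy, on an invented test signal): a high-pass filter removes low-frequency interference such as 50 Hz mains hum while leaving the speech band largely intact.

```python
import numpy as np
from scipy.signal import butter, lfilter

fs = 16000  # sampling rate in Hz, a typical value for speech

# Invented test signal: a speech-band tone plus 50 Hz mains hum.
t = np.arange(0, 1.0, 1 / fs)
speech_like = np.sin(2 * np.pi * 300 * t)  # stands in for speech energy
hum = 0.5 * np.sin(2 * np.pi * 50 * t)     # low-frequency interference
noisy = speech_like + hum

# 4th-order Butterworth high-pass filter with a 100 Hz cutoff: the hum
# is strongly attenuated, frequencies relevant to speech pass through.
b, a = butter(4, 100, btype="highpass", fs=fs)
cleaned = lfilter(b, a, noisy)

print("energy before filtering:", round(float(np.mean(noisy ** 2)), 3))
print("energy after filtering: ", round(float(np.mean(cleaned ** 2)), 3))
```

Separating one voice from another, by contrast, cannot be done with a frequency filter at all, since the two signals occupy the same band, which is what makes the cocktail party problem so hard.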

The problems with speech recognition do not end there. It is worth realizing that speech carries many different kinds of information. The human voice hints at the gender, age, character and even the state of health of its owner. There is an extensive branch of biomedical engineering devoted to diagnosing various diseases from characteristic acoustic phenomena found in the speech signal.

There are also applications in which the main purpose of acoustic analysis of the speech signal is to identify the speaker, or to verify that he is who he claims to be (voice in place of a key, password or PUK code). This can be important, especially for smart building technologies.

The first component of a speech recognition system is the microphone. However, the signal the microphone picks up is usually of little use in its raw form. Studies show that the shape and course of the sound wave vary greatly depending on the person, the speed of speech and, in part, the mood of the speaker, while only to a small extent do they reflect the actual content of the spoken commands.

The signal must therefore be properly processed. Modern acoustics, phonetics and computer science together provide a rich set of tools for processing, analyzing, recognizing and understanding a speech signal. Particularly useful is the dynamic spectrum of the signal, the so-called dynamic spectrogram. Spectrograms are fairly easy to obtain, and speech presented in the form of a dynamic spectrogram is relatively easy to recognize using techniques similar to those used in image recognition.
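Obtaining such a spectrogram really is a matter of a few lines; here is a minimal sketch with scipy, using an invented rising tone as a stand-in for a speech recording:

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16000
t = np.arange(0, 1.0, 1 / fs)
# Invented stand-in for speech: a tone whose frequency rises over time,
# so the time-frequency structure shows up clearly in the spectrogram.
x = np.sin(2 * np.pi * (200 + 400 * t) * t)

# Short-time spectral analysis; 25 ms windows with 15 ms of overlap
# are typical choices for speech.
freqs, times, Sxx = spectrogram(x, fs=fs,
                                nperseg=int(0.025 * fs),
                                noverlap=int(0.015 * fs))

# Sxx is the 2-D "image" of the signal: frequency rows x time columns,
# exactly the representation that image-recognition techniques can use.
print(Sxx.shape)
```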

Simple elements of speech (for example, commands) can be recognized by the simple similarity of whole spectrograms. For instance, the dictionary of a voice-dialing mobile phone contains only a few tens to a few hundred words and phrases, usually pre-recorded so that they can be identified easily and efficiently. This is sufficient for simple control tasks, but it severely limits the range of applications. Systems built according to this scheme, as a rule, support only the specific speakers whose voices they were trained on, so someone new who wants to control the system by voice will most likely not be recognized.
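Whole-spectrogram matching can be sketched very simply. A hedged toy example (numpy and scipy; pure tones stand in for pre-recorded commands, and every recording is assumed to be trimmed to the same length): each utterance is assigned to the stored template whose spectrogram is most similar.

```python
import numpy as np
from scipy.signal import spectrogram

FS = 16000

def features(signal):
    """Whole-utterance log-spectrogram, flattened and normalized."""
    _, _, sxx = spectrogram(signal, fs=FS, nperseg=400, noverlap=240)
    v = np.log(sxx + 1e-10).ravel()
    return v / np.linalg.norm(v)

def recognize(signal, templates):
    """Return the command whose template is most similar (cosine)."""
    v = features(signal)
    return max(templates, key=lambda name: float(v @ templates[name]))

# Invented "commands": pure tones stand in for recorded utterances.
t = np.arange(0, 0.5, 1 / FS)
templates = {
    "call": features(np.sin(2 * np.pi * 300 * t)),
    "stop": features(np.sin(2 * np.pi * 800 * t)),
}
query = np.sin(2 * np.pi * 310 * t)  # slightly off-pitch "call"
print(recognize(query, templates))   # -> "call"
```

The example also shows why such systems are brittle: any change in length, speaker or pitch shifts the whole spectrogram, and the plain similarity score collapses.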

The result of this operation is called a 2-D spectrogram, i.e. a two-dimensional spectrum. In this processing block there is one more operation worth noting: segmentation. Generally speaking, this means breaking the continuous speech signal into parts that can be recognized separately; only from these individual diagnoses is the recognition of the whole assembled. This procedure is necessary because it is not possible to identify a long and complex utterance in one go. Whole volumes have been written on which segments to distinguish in a speech signal, so we will not settle here whether the distinguished segments should be phonemes (sound equivalents), syllables, or perhaps allophones.

The process of automatic recognition always refers to certain features of objects. Hundreds of sets of different parameters have been tested for the speech signal. Having divided the speech signal into frames and selected the features by which those frames are represented in the recognition process, we can perform classification (for each frame separately), i.e. assign each frame an identifier that will represent it from then on.
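A minimal sketch of framing and per-frame classification (numpy only; the "classifier" is a toy nearest-centroid rule, and the two "phoneme-like" classes are invented tones):

```python
import numpy as np

def frames(signal, fs=16000, frame_ms=25, step_ms=10):
    """Cut the signal into overlapping frames, as described above."""
    n, step = int(fs * frame_ms / 1000), int(fs * step_ms / 1000)
    starts = range(0, len(signal) - n + 1, step)
    return np.stack([signal[s:s + n] for s in starts])

def frame_features(frame):
    """Log magnitude spectrum: the features representing one frame."""
    return np.log(np.abs(np.fft.rfft(frame)) + 1e-10)

def classify(frame, centroids):
    """Assign the frame the identifier of the nearest class centroid."""
    feats = frame_features(frame)
    return min(centroids, key=lambda k: np.linalg.norm(feats - centroids[k]))

# Invented example: two 'phoneme-like' classes, low- and high-pitched.
fs = 16000
t = np.arange(0, 0.2, 1 / fs)
centroids = {
    "low": frame_features(np.sin(2 * np.pi * 200 * t[:400])),
    "high": frame_features(np.sin(2 * np.pi * 2000 * t[:400])),
}
signal = np.concatenate([np.sin(2 * np.pi * 200 * t),
                         np.sin(2 * np.pi * 2000 * t)])
labels = [classify(f, centroids) for f in frames(signal, fs)]
print(labels[:3], labels[-3:])  # 'low' frames first, then 'high'
```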

The next stage is assembling the frames into separate words, most often on the basis of so-called hidden Markov models (HMMs). Then comes the assembly of words into complete sentences.
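The HMM idea can be illustrated with Viterbi decoding, which finds the most likely sequence of hidden states given the observed frame labels. The sketch below uses invented toy probabilities and two abstract states, not a real acoustic model:

```python
import numpy as np

# Toy HMM: two hidden states, two possible frame observations.
states = ["s1", "s2"]
start = np.log([0.6, 0.4])     # P(initial state)
trans = np.log([[0.7, 0.3],    # P(next state | current state)
                [0.4, 0.6]])
emit = np.log([[0.9, 0.1],     # P(observation | state)
               [0.2, 0.8]])

def viterbi(obs):
    """Most likely hidden-state path for a sequence of observations."""
    T = len(obs)
    delta = np.zeros((T, 2))            # best log-score ending in state j
    back = np.zeros((T, 2), dtype=int)  # backpointers for the best path
    delta[0] = start + emit[:, obs[0]]
    for t in range(1, T):
        for j in range(2):
            scores = delta[t - 1] + trans[:, j]
            back[t, j] = int(np.argmax(scores))
            delta[t, j] = scores[back[t, j]] + emit[j, obs[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):       # walk the backpointers
        path.append(back[t, path[-1]])
    return [states[i] for i in reversed(path)]

print(viterbi([0, 0, 1, 1, 1]))  # -> ['s1', 's1', 's2', 's2', 's2']
```

In a real recognizer the states correspond to sub-word units and the observations to frame features, but the decoding principle is the same.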

We can now return to the Alexa system for a moment. Its example illustrates the multi-stage process by which a machine "understands" a person - more precisely, a command given or a question asked by that person.

Understanding words, understanding meaning, and understanding user intent are completely different things.

Therefore, the next step is the work of the NLP (natural language processing) module, whose task is to recognize the user's intent, i.e. the meaning of the command or question in the context in which it was uttered. Once the intent is identified, a so-called skill is invoked: the specific capability supported by the smart assistant. In the case of a question about the weather, weather data sources are queried, and the response remains to be converted into speech (by the TTS, text-to-speech, mechanism). As a result, the user hears the answer to the question asked.
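The intent-to-skill step can be sketched as a toy dispatcher. Simple keyword matching stands in for real NLP here, and the skill names and canned replies are invented:

```python
# Toy NLP step: map a transcribed utterance to an intent via keywords.
# Real assistants use statistical models; this is only an illustration.
INTENT_KEYWORDS = {
    "get_weather": ["weather", "rain", "temperature"],
    "set_alarm": ["alarm", "wake"],
}

# Each intent is handled by a "skill" producing the text that the
# TTS (text-to-speech) engine would then read out loud.
def weather_skill(utterance):
    return "Here is today's forecast: sunny, 22 degrees."  # stub data

def alarm_skill(utterance):
    return "Alarm set."

SKILLS = {"get_weather": weather_skill, "set_alarm": alarm_skill}

def handle(utterance):
    words = utterance.lower().split()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(k in words for k in keywords):
            return SKILLS[intent](utterance)    # dispatch to the skill
    return "Sorry, I did not understand that."  # fallback response

print(handle("what is the weather like today"))
print(handle("wake me at seven"))
```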

Voice? Graphics? Or maybe both?

Most modern interaction systems are based on an intermediary known as the graphical user interface (GUI). Unfortunately, the GUI is not the most obvious way to interact with a digital product: it requires users to first learn the interface and to recall that knowledge with every subsequent interaction. In many situations, voice is far more convenient, because interacting with a VUI is as simple as speaking to the device. An interface that does not force users to memorize specific commands or interaction methods causes fewer problems.

Of course, the spread of VUIs does not mean abandoning more traditional interfaces; rather, hybrid interfaces combining several modes of interaction will become available.

The voice interface is not suitable for all tasks in a mobile context. With it we can call a friend while driving a car, and even send a text message, but checking our latest bank transfers may prove too difficult, owing to the amount of information transmitted to the system (input) and generated by the system (output). As Rachel Hinman suggests in her book The Mobile Frontier, using a VUI is most effective for tasks in which the amount of input and output information is small.

A smartphone connected to the Internet is convenient, but also inconvenient (9). Every time users want to buy something or try a new service, they have to download another app and create a new account. This creates an opening for voice interfaces to spread and develop. Instead of forcing users to install many different apps or create separate accounts for each service, experts say, VUIs will shift the burden of these cumbersome tasks onto an AI-powered voice assistant, which will carry out the strenuous activities for us. We will only give it orders.

9. Voice interface via smartphone

Today, far more than just phones and computers are connected to the Internet. Smart thermostats, lights, kettles and many other IoT devices are networked as well (10). Wireless devices thus surround us and fill our lives, but not all of them fit naturally into a graphical user interface. Using a VUI makes it easy to weave them into our environment.

10. Voice interface with the Internet of Things

Creating a voice user interface will soon become a key design skill. It is a genuinely hard problem: the need to implement voice systems will push designers to focus more on proactive design, that is, on trying to understand the user's initial intentions and anticipating their needs and expectations at every stage of the conversation.

Voice is an efficient way to enter data—it allows users to quickly issue commands to the system on their own terms. On the other hand, the screen provides an efficient way to display information: it allows systems to display a large amount of information at the same time, reducing the burden on users' memory. It is logical that combining them into one system sounds encouraging.

Smart speakers like the Amazon Echo and Google Home do not offer a visual display at all. By significantly improving the accuracy of voice recognition at moderate distances, they allow hands-free operation, which in turn increases their flexibility and efficiency; they are attractive even to users who already own smartphones with voice control. However, the lack of a screen is a huge limitation.

Only sounds can be used to inform users of possible commands, and reading output aloud becomes tedious for all but the most basic tasks. Setting a timer with a voice command while cooking is great, but having to ask how much time is left is not. Getting a regular weather forecast becomes a test of the user's memory: they must listen to and absorb a series of facts for the whole week, rather than taking them in from a screen at a glance.

Designers have already built a hybrid solution, the Echo Show (11), which added a display screen to the basic Echo smart speaker. This greatly expands the device's capabilities. Yet the Echo Show is still far less capable of performing the basic functions long available on smartphones and tablets. It cannot (yet) surf the web, show reviews, or display the contents of an Amazon shopping cart, for example.

A visual display is inherently a more effective way of giving people a wealth of information than sound alone. Designing with voice first can greatly improve voice interaction, but in the long run, arbitrarily giving up the visual display for the sake of voice will be like fighting with one hand tied behind your back. Given the looming complexity of intelligent end-to-end voice-and-display interfaces, developers should seriously consider a hybrid approach.

Increases in the efficiency and speed of speech generation and recognition systems have made it possible to use them in applications and areas such as:

• military (voice commands in planes or helicopters, for example the F-16 VISTA),

• automatic text transcription (speech to text),

• interactive information systems (Prime Speech, voice portals),

• mobile devices (phones, smartphones, tablets),

• robotics (Cleverbot - ASR systems combined with artificial intelligence),

• automotive (hands-free control of car components, such as Blue & Me),

• home applications (smart home systems).

Watch out for safety!

The automotive industry, home appliances, heating/cooling and home security systems, and a host of other household devices are starting to use voice interfaces, often AI-based. At this stage, the data obtained from millions of conversations with machines is sent to computing clouds. Marketers are clearly interested in it. And not only them.

A recent report by Symantec security experts recommends that users not control security features, such as door locks, by voice command, let alone entire home security systems. The same goes for storing passwords or confidential information. The security of artificial intelligence and smart products has not yet been studied sufficiently.

When devices throughout the home listen to every word, the risk of system hacking and misuse becomes a huge concern. If an attacker gains access to the local network or its associated email addresses, the smart device settings can be changed or reset to factory settings, which will lead to the loss of valuable information and the deletion of user history.

In other words, security professionals fear that voice-driven AI and VUI are not yet smart enough to protect us from potential threats and keep our mouths shut when a stranger asks for something.
