"Adding a Killer Feature to Snatch the Market? Is Ali's Smart Speaker Taking a Step Too Early?
(Original title: Adding a Killer Feature to Grab the Market? Alibaba's Speaker May Have Moved Too Soon)
As anticipated, Alibaba has finally launched its smart speaker, the Tmall Genie X1. The move was both surprising and logical, and it adds more intrigue to the contest over voice as an entry point.
From Amazon's accidental debut three years ago to the arrival of Alibaba today, the explosion of the smart speaker market caught many off guard, but it happened. Echo has reportedly sold nearly 20 million units cumulatively. Google, Microsoft, and Apple quickly followed suit. In the subsequent period, domestic software developers, hardware manufacturers, and content creators scrambled to join the race.
Alibaba officially unveiled the Tmall Genie X1 last month.
In fact, the day before Alibaba released the 499-yuan Tmall Genie, Lei Feng Wang had published an article titled "[Why China's Echo Isn't Here Yet, and Tomorrow's New AI Product Could Bring Surprises]."
So, what surprises does Alibaba's smart speaker offer compared to other similar products?
The Eye-Catching "Surprise"
Before the release, media reports said that Alibaba had even diverted billions of dollars from the Pepper robot project to support development of the Tmall Genie, and that staff were transferred to the Artificial Intelligence Lab. Despite the significant investment, the product's functions look quite similar to those of Echo and other smart speakers: streaming music, food delivery, weather checks, alarms, and smart home device control.
According to Tmall Genie's promotional highlights, one key feature missing from Echo is voiceprint recognition. Alibaba claims that through this technology, the speaker can differentiate between individuals in the household and push personalized content based on each person's preferences. Currently, it can identify up to six individuals. Additionally, users can verify purchases and complete payment processes using their voice. Echo still requires users to provide additional personal information to distinguish identities.
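As a rough illustration of what "differentiating between individuals" involves, the sketch below matches an incoming voice against a small set of enrolled profiles. Alibaba has not published how the Tmall Genie does this, so everything here is an assumption: the `VoiceprintRegistry` class, the cosine-similarity matching, the 0.8 threshold, and the three-number "embeddings" are all hypothetical. Only the limit of six profiles comes from the article.

```python
import math

MAX_PROFILES = 6  # household limit quoted in the article


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class VoiceprintRegistry:
    """Hypothetical store of enrolled voice profiles (not Alibaba's design)."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed minimum similarity for a match
        self.profiles = {}

    def enroll(self, name, embedding):
        # Refuse new speakers once the household limit is reached.
        if name not in self.profiles and len(self.profiles) >= MAX_PROFILES:
            raise ValueError("registry already holds the maximum number of profiles")
        self.profiles[name] = list(embedding)

    def identify(self, embedding):
        # Pick the closest enrolled profile, or None if nothing is close enough.
        best_name, best_score = None, -1.0
        for name, profile in self.profiles.items():
            score = cosine_similarity(profile, embedding)
            if score > best_score:
                best_name, best_score = name, score
        return best_name if best_score >= self.threshold else None


registry = VoiceprintRegistry()
registry.enroll("parent", [1.0, 0.0, 0.0])  # toy embeddings, not real voice data
registry.enroll("child", [0.0, 1.0, 0.0])
print(registry.identify([0.9, 0.1, 0.0]))  # parent
print(registry.identify([0.6, 0.6, 0.5]))  # None
```

The second query shows the hard part: a voice that sits between two profiles, as a noisy or distant recording might, is rejected rather than guessed at, and where to set that threshold is a trade-off between false accepts and false rejects.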
Lei Feng Wang couldn't help but wonder why Amazon hadn't implemented this cool feature in Echo.
It turns out that Amazon had tried to incorporate this technology but ran into difficulties. According to Amazon employees, feedback from hardware and software companies working on voiceprint recognition indicated that telling different users' voices apart proved much harder than anticipated.
"The processing the device needs in order to remove noise, echoes, and reverberation makes it difficult to accurately identify a person's voice," said Vineet Ganju, vice president of Conexant's voice division.
So, does Tmall Genie's voiceprint recognition technology really deliver on its key selling point?
Let's examine this closely.
Why Does Voiceprint Recognition Fall Short?
First, consider the voiceprint recognition algorithm itself. Dr. Chen Xiaoliang, founder of Shengzhi Technology, told Lei Feng Wang in an interview that voiceprint recognition remains a relatively narrow field with limited applications. Most current research focuses on dynamic, real-time detection. Dynamic detection builds on static detection techniques and requires additional algorithms such as VAD (Voice Activity Detection), noise reduction, and dereverberation. VAD determines whether a sound is human speech, while noise reduction and dereverberation aim to remove environmental interference.
VAD commonly uses one of two methods: energy-based detection or LTSD (Long-Term Spectral Divergence), with LTSD currently the more widely used. Feature extraction and modeling additionally rely on techniques such as dynamic time warping (DTW), vector quantization (VQ), support vector machines (SVM), hidden Markov models (HMM), and Gaussian mixture models (GMM).
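The simpler of the two VAD methods, energy-based detection, can be sketched in a few lines. This is a minimal illustration rather than any vendor's implementation: the function names, the 160-sample frame length, the 0.01 threshold, and the synthetic test tone are all assumptions chosen for the example.

```python
import math


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def energy_vad(samples, frame_len=160, threshold=0.01):
    """Return one flag per frame: True when frame energy exceeds the threshold."""
    return [energy > threshold for energy in frame_energies(samples, frame_len)]


# Synthetic input at an assumed 8 kHz rate: two near-silent frames, then two loud ones.
RATE = 8000
quiet = [0.001 * math.sin(2 * math.pi * 200 * n / RATE) for n in range(320)]
loud = [0.5 * math.sin(2 * math.pi * 200 * n / RATE) for n in range(320)]

flags = energy_vad(quiet + loud)
print(flags)  # [False, False, True, True]
```

A fixed energy threshold only separates loud frames from quiet ones; it cannot tell speech from a slammed door or a television, which is one reason spectral methods such as LTSD are preferred in noisy rooms.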
From the above models, it's clear that voiceprint recognition is still a data-driven pattern recognition problem, with unresolved physical and computational issues inherent to all pattern recognition tasks.
While the uniqueness of each person's voiceprint is promising, existing equipment and technology still struggle to distinguish voices accurately. A person's voice is variable and is affected by physical condition, age, and emotion. In noisy environments, or when the voice is mixed with other sounds, voiceprint features become hard to extract and model.
Chen Xiaoliang believes that deep learning has greatly improved pattern recognition, with open-source algorithms available. However, progress in voiceprint recognition remains slow, constrained by the acquisition of voiceprints and feature establishment.
Dr. Chen Dongpeng, a senior scientist at voiceprint recognition provider SpeakIn, noted that real-world conditions present numerous challenges for voiceprint recognition, including noise, multiple speakers, the speaker's physical condition, and emotional state. These issues are tricky to resolve. Many companies, including SpeakIn, are working on these common problems through both software algorithms and hardware. With the support of deep learning, the industry is progressing faster than ever. Dr. Chen added that voiceprint recognition is just one piece of the puzzle; its effectiveness depends on the product itself and the usage scenario.
At the product level, Himalaya, which recently launched the Xiaoya smart speaker, shared its thoughts. Vice President Li Haibo said that for voiceprint recognition applications, the company has been tackling the issue for a long time but cannot achieve full accuracy. Currently, it remains in an experimental stage with moderate results.
Discussing Alibaba's Tmall Genie, Li Haibo said that far-field speech recognition is typically effective within three to five meters, with noise reduction of around 70 dB. In practice, ambient noise and room acoustics make waking the device harder than these figures suggest, and far-field voiceprint recognition is even less stable at the same distance. Common smart speaker placements are the living room, beside the TV, the kitchen, and the bedside. Except at the bedside, the actual speaking distance in these scenarios is generally over three meters. The practicality of voiceprint recognition on Alibaba's speaker therefore remains unclear.
As for why Amazon Echo has not adopted this feature yet, Li Haibo believes that the technology, impressive as it sounds, is not yet mature and carries risk.
Additionally, Sensory CEO Todd Mozer believes it is difficult to identify who is speaking to far-field voice devices like Echo. As ambient noise rises and the signal-to-noise ratio falls, device performance deteriorates.
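To make the signal-to-noise ratio concrete, the sketch below computes it in decibels as 10·log10(P_speech / P_noise) and shows what happens when the speaker moves away from the device. The signals are synthetic and the numbers are assumptions for illustration: a 200 Hz tone stands in for speech, seeded Gaussian samples stand in for room noise, and "far" is modeled simply as a tenfold drop in speech amplitude against unchanged noise.

```python
import math
import random


def mean_power(samples):
    """Average squared amplitude of a signal."""
    return sum(x * x for x in samples) / len(samples)


def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_speech / P_noise)."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))


rng = random.Random(0)  # seeded so the example is repeatable
noise = [rng.gauss(0.0, 0.05) for _ in range(8000)]
speech_near = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(8000)]
speech_far = [0.1 * x for x in speech_near]  # assumed 10x amplitude loss with distance

snr_near = snr_db(speech_near, noise)
snr_far = snr_db(speech_far, noise)
print(round(snr_near - snr_far, 6))  # 20.0
```

Because power goes with the square of amplitude, a tenfold drop in speech amplitude costs 20 dB of SNR while the room noise stays exactly as loud, which is the squeeze far-field devices have to work within.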
"The process of denoising and separating speech from noise significantly impacts user identification. So far, no product on the market handles user identification, far-field speech, and noise processing simultaneously," said Mozer.
On the practical application of far-field voiceprint recognition, Dr. Liu Bin, a senior expert in intelligent voice algorithms at the Institute of Automation, Chinese Academy of Sciences, shared his views with Lei Feng Wang. Dr. Liu said that far-field recognition is disrupted by noise, echoes, and reverberation, which makes both speech recognition and voiceprint recognition challenging.
Currently, the reliable distance for far-field speech recognition is about three to five meters, and voiceprint recognition is harder still. The goal of speech recognition is to understand the spoken content in the audio signal, and that content is highly correlated with the resonance peaks (formants), which are concentrated mainly in the low-frequency band. Speech signals carry relatively high energy in the low-frequency band, so that band is less affected by external interference. Speaker-related information, by contrast, is concentrated in the high-frequency band, where speech energy is relatively low and more easily disturbed, which makes voiceprint identification at a distance more challenging. He added that everyone's speech characteristics change with circumstances (a person with a cold, for example, sounds different from normal), so even near-field voiceprint recognition is not fully mature, let alone far-field.
In general, for most users, voiceprint recognition in smart speakers isn't a necessity. Technically, voiceprint recognition isn't yet mature.
So why is Alibaba putting a far-field voiceprint recognition technology that is not yet mature into its speaker?
Besides using the technology to differentiate the product and target specific user needs in order to capture the market, Dr. Liu pointed out that Alibaba's accumulated strengths in e-commerce make identity authentication for e-commerce a key direction for the company.
Given the vast resources of Taobao and Tmall, it is not unreasonable for Alibaba to introduce voice shopping. However, Amazon's earlier experience with this scenario on Echo shows that users shop by voice infrequently and that the experience is far from ideal.
Hu Yu, an executive at iFlytek, said in an interview with Lei Feng Wang that, looking at the market as a whole, shopping on speakers is still a very immature scenario. A real use case has to answer a genuine, must-have need. Although Echo is selling well, surveys found that the features users actually rely on most are setting reminders, checking the weather, and the like. Amazon had earlier promoted Echo's voice shopping vigorously, but when users tried to buy things by voice they found the steps involved very cumbersome, and not as easy as interacting with a screen.
This is why many companies have been stressing that voice interaction needs to be paired with visual presentation. Without a screen, users lack the information they need, and complex operations become very hard to complete. Some features and scenarios are ones we in the industry dreamed up ourselves, without realizing that in practical use users do not think and behave the way the product was designed.
It follows that if users have not yet formed the habit of shopping by voice, and the voiceprint technology itself is still a problem, then voiceprint recognition for e-commerce looks unlikely to pass the test of the market.
Overall, Alibaba's reason for building voiceprint recognition into its smart speaker is sound: in a wave of homogenized products, it plays a feature card that neither Echo nor JD.com's speaker holds, using cutting-edge technology to sharpen its competitiveness.
However, Alibaba is grafting voiceprint recognition onto the speaker while both the technology and the market are still immature. This step seems to have come too early.
"