On Thursday, an Amazon AWS blog post announced that the company has moved most of the cloud processing for its Alexa personal assistant off Nvidia GPUs and onto its own Inferentia application-specific integrated circuit (ASIC). Amazon dev Sebastien Stormacq describes the Inferentia's hardware design as follows:
AWS Inferentia is a custom chip, built by AWS, to accelerate machine learning inference workloads and optimize their cost. Each AWS Inferentia chip contains four NeuronCores. Each NeuronCore implements a high-performance systolic array matrix multiply engine, which massively speeds up typical deep learning operations such as convolution and transformers. NeuronCores are also equipped with a large on-chip cache, which helps cut down on external memory accesses, dramatically reducing latency and increasing throughput.
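The "systolic array matrix multiply engine" in the quote above can be illustrated with a toy software simulation. This is a sketch only, not Inferentia's actual design: in real hardware the operands stream through a skewed 2-D grid of processing elements, but the key idea is that each cell accumulates one output element as one "wavefront" of operands passes per clock cycle.

```python
# Toy simulation of a systolic-array matrix multiply.
# Cell (i, j) accumulates output element C[i][j]; at (skewed) cycle t it
# sees A[i][t] arriving from the left and B[t][j] arriving from the top.
def systolic_matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for t in range(k):              # one wavefront of operands per cycle
        for i in range(n):          # every cell works in parallel in hardware;
            for j in range(m):      # here we just iterate over the grid
                C[i][j] += A[i][t] * B[t][j]
    return C

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Because every cell does one multiply-accumulate per cycle with no external memory traffic, an n-by-n array sustains n² operations per clock, which is why the design dominates deep-learning accelerators.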
When an Amazon customer—usually someone who owns an Echo or Echo Dot—makes use of the Alexa personal assistant, very little of the processing is done on the device itself. The workload for a typical Alexa request looks something like this:
- A human speaks to an Amazon Echo, saying: “Alexa, what’s the special ingredient in Earl Grey tea?”
- The Echo detects the wake word—Alexa—using its own on-board processing
- The Echo streams the request to Amazon data centers
- Within the Amazon data center, the voice stream is converted to phonemes (Inference AI workload)
- Still in the data center, phonemes are converted to words (Inference AI workload)
- Words are assembled into phrases (Inference AI workload)
- Phrases are distilled into intent (Inference AI workload)
- Intent is routed to an appropriate fulfillment service, which returns a response as a JSON document
- JSON document is parsed, including text for Alexa’s reply
- Text form of Alexa’s reply is converted into natural-sounding speech (Inference AI workload)
- Natural speech audio is streamed back to the Echo device for playback—“It’s bergamot orange oil.”
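The request flow above can be sketched as a simple pipeline. All of the function names below are hypothetical placeholders, not Amazon APIs; each stage marked "inference" stands in for a neural-network model that, per the blog post, now runs on Inferentia-backed instances in Amazon's data centers.

```python
import json

def speech_to_phonemes(audio):   # inference: voice stream -> phonemes
    return audio.split()         # placeholder for a real acoustic model

def phonemes_to_words(phonemes): # inference: phonemes -> words
    return phonemes              # placeholder

def words_to_intent(words):      # inference: words -> phrases -> intent
    return {"intent": "QueryIngredient", "topic": "earl grey"}

def fulfill(intent):             # fulfillment service: returns a JSON document
    return json.dumps({"text": "It's bergamot orange oil."})

def text_to_speech(text):        # inference: reply text -> natural speech
    return f"<audio:{text}>"     # placeholder

def handle_request(audio):
    words = phonemes_to_words(speech_to_phonemes(audio))
    intent = words_to_intent(words)
    reply = json.loads(fulfill(intent))  # parse the fulfillment JSON
    return text_to_speech(reply["text"])
```

Note that only the fulfillment step is conventional request/response logic; everything before and after it is inference.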
As you can see, almost all of the actual work done in fulfilling an Alexa request happens in the cloud—not in the Echo or Echo Dot device itself. And the vast majority of that cloud work is performed not by traditional if-then logic but by inference—the answer-providing side of neural network processing.