In recent years, the AI news cycle has been dominated by announcements of large language models (LLMs) and generative AI models, highlighting their growing power and versatility. Their applications span a wide range of areas, from open-ended chatbots to task-oriented assistants. While much of the focus on LLMs has been on cloud-based and server-side applications, there is growing interest in deploying these models in embedded systems and edge devices.
Embedded systems, like the microprocessors found in household appliances, industrial equipment, automobiles and a wide range of other devices, need to fit within cost and power envelopes that constrain the available compute and memory. This, in turn, makes it very challenging to deploy language models with sufficient accuracy and performance on edge devices.
Deploying LLMs to Edge Devices
One key area where LLMs are already being leveraged in embedded solutions is natural, conversational interaction between the operator and machinery, or the human machine interface (HMI). Embedded systems can support a variety of input options, like a microphone, camera or other sensors, but most do not have the full keyboards that people are used to when interacting with LLMs on personal computers, laptops and mobile phones. So, the embedded system also needs to handle audio and vision as practical inputs to the LLM. This requires a preprocessing block of automatic speech recognition (ASR) or image recognition and classification. Similarly, the output options are limited. The embedded solution might not have a screen, or it might not be practical for the user to read one. Therefore, a post-processing step is needed after the generative AI model to convert its output into audio using a text-to-speech (TTS) algorithm. At NXP, we are building the eIQ® GenAI Flow to be a modular flow with the necessary pre- and post-processing blocks to make generative AI on the edge practical.
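As an illustration, the sketch below strings a speech-to-text front end, an LLM and a text-to-speech back end into a simple voice pipeline. The component classes and their methods are placeholders invented for this example, not the eIQ GenAI Flow API; a real implementation would plug in optimized ASR, LLM and TTS engines on the target device.

```python
# Minimal sketch of a voice HMI pipeline: ASR -> LLM -> TTS.
# All classes here are illustrative stand-ins, not the eIQ GenAI Flow API.

class SpeechRecognizer:
    """Placeholder ASR block: converts raw audio samples to text."""
    def transcribe(self, audio_samples: bytes) -> str:
        # A real implementation would run an optimized ASR model here.
        return "turn on the living room lights"

class EdgeLLM:
    """Placeholder LLM block: turns a user request into a response."""
    def generate(self, prompt: str) -> str:
        # A real implementation would run a quantized on-device LLM.
        return f"Okay, handling request: {prompt}"

class SpeechSynthesizer:
    """Placeholder TTS block: converts response text to audio."""
    def synthesize(self, text: str) -> bytes:
        # A real implementation would return PCM audio for playback.
        return text.encode("utf-8")

def voice_pipeline(audio_samples: bytes) -> bytes:
    """Preprocess speech, query the LLM, post-process the answer to audio."""
    text_in = SpeechRecognizer().transcribe(audio_samples)
    text_out = EdgeLLM().generate(text_in)
    return SpeechSynthesizer().synthesize(text_out)

if __name__ == "__main__":
    print(voice_pipeline(b"<microphone capture>"))
```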
Transforming Applications with LLMs
By integrating LLM-powered speech recognition, natural language understanding and text generation capabilities, embedded devices can offer more intuitive and conversational user experiences. Examples include smart home devices that respond to voice commands, industrial machinery that can be controlled through natural language, and automotive infotainment systems that hold back-and-forth, hands-free dialogue to guide the user or operate functions within the vehicle.
LLMs are also finding use in embedded predictive analytics and decision support systems in health applications. By embedding a language model trained on domain-specific data, devices can leverage natural language processing to analyze sensor data, identify patterns and generate insights, all while operating in real time at the edge and preserving patient privacy instead of sending data to the cloud.
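As a rough sketch of that idea, the code below formats recent sensor readings into a prompt for a locally running, domain-tuned model. The run_local_llm function is a hypothetical stand-in for whatever on-device inference engine is used; the point is that the readings never leave the device.

```python
# Sketch: summarizing on-device sensor data with a local, domain-tuned LLM.
# run_local_llm() is a hypothetical stand-in for an on-device inference call.

from statistics import mean

def run_local_llm(prompt: str) -> str:
    # Placeholder: a real system would call a local inference runtime here.
    return "Heart rate is trending upward but remains within the normal range."

def analyze_vitals(heart_rates: list[int]) -> str:
    """Build a prompt from raw readings and ask the local model for insight."""
    prompt = (
        "You are a clinical monitoring assistant.\n"
        f"Last {len(heart_rates)} heart-rate samples (bpm): {heart_rates}\n"
        f"Average: {mean(heart_rates):.1f} bpm\n"
        "Summarize any notable pattern for the care team."
    )
    return run_local_llm(prompt)  # The readings stay on the edge device.

print(analyze_vitals([72, 74, 75, 79, 83, 88]))
```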
Addressing Generative AI Challenges
Deploying generative AI models with acceptable accuracy and capabilities in embedded environments comes with a set of challenges. Model size and memory footprint optimizations are needed to fit the LLM within the resource constraints of the target hardware. Models with billions of parameters need gigabytes of storage, which can be costly or simply infeasible for edge systems. Optimization techniques such as quantization and pruning, well established for convolutional neural networks, also apply to transformer models, the backbone of generative AI, and help overcome the model size problem.
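To make the memory savings concrete, the sketch below applies simple symmetric 8-bit weight quantization to a single weight matrix using NumPy. It illustrates the general technique only, not the scheme used by any particular toolchain; real deployments typically rely on per-channel scales, calibration data and quantization-aware tooling.

```python
# Sketch: symmetric int8 weight quantization for a single layer.
# Real toolchains use per-channel scales, calibration and packed kernels.

import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 values plus a single scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights at inference time."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)   # one layer's weights
q, scale = quantize_int8(w)

print(f"float32 size: {w.nbytes / 1e6:.1f} MB")       # ~67 MB
print(f"int8 size:    {q.nbytes / 1e6:.1f} MB")       # ~17 MB, a 4x reduction
print(f"mean abs error: {np.abs(w - dequantize(q, scale)).mean():.5f}")
```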
Generative AI models like LLMs also have knowledge limitations. For instance, they have a bounded understanding, can confidently produce incorrect or fabricated answers, known as “hallucinations”, and their knowledge only extends to the data available at training time. Training a new model, or fine-tuning an existing one by retraining, can improve accuracy and context awareness, but both are very costly in terms of data collection and the required training compute. Fortunately, innovation exists where there is a need, in the form of retrieval-augmented generation, or RAG for short. RAG is a method that builds a database of knowledge from context-specific data, which the LLM can reference at run time to help it answer queries accurately.
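A minimal sketch of the RAG idea is shown below. It uses a trivial bag-of-words embedding and cosine similarity to retrieve the most relevant snippet from a small local knowledge base and prepends it to the user's query. Production systems would use a proper embedding model and vector store, and llm_generate is a hypothetical stand-in for on-device inference.

```python
# Sketch of retrieval-augmented generation (RAG) with a tiny local knowledge base.
# The bag-of-words embedding and llm_generate() are illustrative stand-ins.

import math
from collections import Counter

KNOWLEDGE_BASE = [
    "To descale the coffee machine, run cycle 3 with the descaling solution.",
    "Error E42 means the water tank is empty or not seated correctly.",
    "The machine enters standby after 15 minutes of inactivity.",
]

def embed(text: str) -> Counter:
    """Trivial bag-of-words 'embedding'; a real system uses an embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str) -> str:
    """Return the knowledge-base entry most similar to the query."""
    q = embed(query)
    return max(KNOWLEDGE_BASE, key=lambda doc: cosine(q, embed(doc)))

def llm_generate(prompt: str) -> str:
    # Placeholder for a local LLM call on the edge device.
    return f"[model answer grounded in the prompt]\n{prompt}"

def answer(query: str) -> str:
    context = retrieve(query)  # the grounding passage stays on-device
    prompt = f"Context: {context}\nQuestion: {query}\nAnswer:"
    return llm_generate(prompt)

print(answer("What does error E42 mean?"))
```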
The eIQ GenAI Flow brings the benefits of generative AI and LLMs to edge use cases in a practical way. By incorporating RAG into this flow, we give embedded devices domain-specific knowledge without exposing user data to the original AI model's training process. This ensures that any adaptations made to the LLM remain private and are only available locally on the edge.