Introduction
This document describes how Audio Weaver has been integrated on the Qualcomm Snapdragon SoC using low-level APIs. The integration bypasses Audio Reach and provides significant performance improvements. Use this document in conjunction with the generic description of the Audio Weaver architecture in Audio-Weaver--Architecture. The overall design is flexible enough to handle all automotive use cases and configurations. You should be able to fully engineer your automotive audio system using Audio Weaver without having to write custom Hexagon code (see footnote 1).
The platform-specific code needed to wrap Audio Weaver is referred to as the Board Support Package (BSP), and we use that term throughout this document. The BSP is based on Qualcomm's low-level AudioLite APIs, a software layer that provides an RTOS, TDM port I/O, and basic system features.
The document covers integrations on the Gen4 SA8255 and the Gen5 SA8397 / SA8797 “Nordy” chipsets. The integrations are very similar; where they diverge, each integration is documented separately.
Platform Features
Graphical development
Multicore support - distribute audio processing across all Hexagon DSPs and the Arm.
Unified signal flow showing an overall view of all cores and threads
Integrated profiling
Highly optimized including HVX support on Gen5 chipsets
Access to IP
Over 500 Audio Weaver modules
Qualcomm voice IP
Custom module API
3rd party ecosystem
Real-time audio features
TDM serial port configuration via Audio Weaver modules
Independent TDM ports automatically synchronize to within 1 sample
ALSA I/O configuration via Audio Weaver modules
Low latency - as low as 0.5 msec digital-in to digital-out using a 0.25 msec block size
Early audio within 1.5 seconds
Resynchronizes automatically after CPU overrun
Software integration
BSP configurable via a text initialization file
Run-time control via TinyMix APIs
Text based control API with integrated data type, range, and array bounds checking
Subsystem Restart (SSR) feature reboots DSPs and restarts the Audio Service if there is a critical run-time failure
Asynchronous event handling
Integrated full-featured Matlab API for scripting and regression testing
Supports all automotive use cases with concurrent operation
TFLM and ONNX support
Fully documented
Online training available
Comparison with Audio Reach
Audio Reach | Audio Weaver |
Developed for power constrained mobile products. Single use case. | Developed for high-performance automotive audio. Multiple concurrent use cases. |
Variable processing load. | Constant / deterministic real-time load |
Must keep cores loaded < 70% | Can load cores to 90% |
Separate AMS framework needed for low latency support. End-to-end digital latency of 3 x block size. | Native low-latency support. End-to-end digital latency of 2 x block size. |
TDM ports aligned within 12 samples | TDM ports aligned within 1 sample |
SysMon does not provide actionable information to fully load the system. | Easy to understand profiling. Per module, per thread, and per core. Shows average and peak CPU load. |
Only supports Hexagon DSPs; no Arm. | Supports all cores including Arm. |
QXDM is a poor fit for real-time debugging. | Includes integrated visual debugging tools and legacy QXDM. |
Numerous side effects. Many features can only be supported by Qualcomm. | Architecture is fully documented and information is publicly available. |
Use Case Examples
We should include a full-featured block diagram which shows how to implement a complete automotive system in Audio Weaver. We could have placeholders for some functions, like RNC and Telephony, but it should be multicore and highlight best practices for designing systems.
Appendix A. System Configuration File
The R4.0 system uses an XML file to configure Audio Weaver and the underlying BSP. The XML file is converted to a binary file, which is loaded at system boot. An annotated version of the XML file is shown below.
<config platform="8755">
  <awe_process>
    <!-- Unused. Will be removed in the future. -->
    <enable type="BOOL">true</enable>
  </awe_process>
  <params>
    <log_level type="UINT32" max_level="2">
      <!-- Specifies the log levels for each core. -->
      <adsp type="UINT32">0</adsp>
      <gpdsp0 type="UINT32">0</gpdsp0>
      <gpdsp1 type="UINT32">0</gpdsp1>
      <arm type="UINT32">0</arm>
    </log_level>
    <thread_num>
      <!-- Number of child audio processing threads per core. -->
      <adsp type="UINT8">4</adsp>
      <gpdsp0 type="UINT8">10</gpdsp0>
      <gpdsp1 type="UINT8">10</gpdsp1>
      <arm type="UINT8">4</arm>
    </thread_num>
    <adsp_thread_priority>
      <!-- thread_parent is the priority of the main audio processing interrupt.
           This is the highest priority audio thread. -->
      <thread_parent type="UINT8">50</thread_parent>
      <thread_child>
        <!-- Next, separate priorities for each child thread.
             The number of child threads is specified above.
             The child threads should be in decreasing priority from thread_parent. -->
        <thread_0 type="UINT8">51</thread_0>
        <thread_1 type="UINT8">52</thread_1>
        <thread_2 type="UINT8">53</thread_2>
        <thread_3 type="UINT8">54</thread_3>
      </thread_child>
    </adsp_thread_priority>
    <gpdsp0_thread_priority>
      <thread_parent type="UINT8">50</thread_parent>
      <thread_child>
        <thread_0 type="UINT8">51</thread_0>
        <thread_1 type="UINT8">52</thread_1>
        <thread_2 type="UINT8">53</thread_2>
        <thread_3 type="UINT8">54</thread_3>
        <thread_4 type="UINT8">55</thread_4>
        <thread_5 type="UINT8">56</thread_5>
        <thread_6 type="UINT8">57</thread_6>
        <thread_7 type="UINT8">58</thread_7>
        <thread_8 type="UINT8">59</thread_8>
        <thread_9 type="UINT8">60</thread_9>
      </thread_child>
    </gpdsp0_thread_priority>
    <gpdsp1_thread_priority>
      <thread_parent type="UINT8">50</thread_parent>
      <thread_child>
        <thread_0 type="UINT8">51</thread_0>
        <thread_1 type="UINT8">52</thread_1>
        <thread_2 type="UINT8">53</thread_2>
        <thread_3 type="UINT8">54</thread_3>
        <thread_4 type="UINT8">55</thread_4>
        <thread_5 type="UINT8">56</thread_5>
        <thread_6 type="UINT8">57</thread_6>
        <thread_7 type="UINT8">58</thread_7>
        <thread_8 type="UINT8">59</thread_8>
        <thread_9 type="UINT8">60</thread_9>
      </thread_child>
    </gpdsp1_thread_priority>
    <arm_thread_priority>
      <thread_parent type="UINT8">63</thread_parent>
      <thread_child>
        <thread_0 type="UINT8">62</thread_0>
        <thread_1 type="UINT8">61</thread_1>
        <thread_2 type="UINT8">60</thread_2>
        <thread_3 type="UINT8">59</thread_3>
      </thread_child>
    </arm_thread_priority>
    <!-- Number of audio processing cores -->
    <num_of_instances type="UINT32">4</num_of_instances>
    <adsp>
      <!-- Core ID (or instance ID) of the core. -->
      <coreID type="UINT8">0</coreID>
      <!-- Sizes of the Audio Weaver heaps, in units of 32-bit words. -->
      <fastheapA type="UINT32">250000</fastheapA>
      <fastheapB type="UINT32">250000</fastheapB>
      <slowheap type="UINT32">250000</slowheap>
      <cpuclock type="UINT32" unit="KHz">1344000</cpuclock>
    </adsp>
    <gpdsp0>
      <coreID type="UINT8">1</coreID>
      <fastheapA type="UINT32">250000</fastheapA>
      <fastheapB type="UINT32">250000</fastheapB>
      <slowheap type="UINT32">250000</slowheap>
      <cpuclock type="UINT32" unit="KHz">1708800</cpuclock>
    </gpdsp0>
    <gpdsp1>
      <coreID type="UINT8">2</coreID>
      <fastheapA type="UINT32">250000</fastheapA>
      <fastheapB type="UINT32">250000</fastheapB>
      <slowheap type="UINT32">250000</slowheap>
      <cpuclock type="UINT32" unit="KHz">1708800</cpuclock>
    </gpdsp1>
    <arm>
      <coreID type="UINT8">3</coreID>
      <fastheapA type="UINT32">7000000</fastheapA>
      <fastheapB type="UINT32">70000</fastheapB>
      <slowheap type="UINT32">70000</slowheap>
      <cpuclock type="UINT32" unit="KHz">2100000</cpuclock>
    </arm>
    <!-- Size of the shared memory heap. This is visible to all cores.
         In units of 32-bit words. -->
    <shared_heap_size type="UINT32" unit="UINT32">262000</shared_heap_size>
    <!-- These values specify the "targetInfo" which is read back by the Audio Weaver
         Server when it connects to the target. This will be removed in the future. -->
    <block_size type="UINT16">48</block_size>
    <sample_rate type="UINT16">48000</sample_rate>
    <in_channel type="UINT16">16</in_channel>
    <out_channel type="UINT16">16</out_channel>
  </params>
</config>
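A configuration like the one above can be sanity-checked before it is converted and loaded. The Python sketch below parses a trimmed copy of the XML, checks that each core's thread_num matches the number of child priorities listed for it, and converts the heap sizes from 32-bit words to bytes. It is only an illustration: it is not the tool that produces the binary file, and the helper names check_thread_counts and heap_bytes are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the configuration above (ADSP and Arm only), for illustration.
CONFIG_XML = """
<config platform="8755">
  <params>
    <thread_num>
      <adsp type="UINT8">4</adsp>
      <arm type="UINT8">4</arm>
    </thread_num>
    <adsp_thread_priority>
      <thread_parent type="UINT8">50</thread_parent>
      <thread_child>
        <thread_0 type="UINT8">51</thread_0>
        <thread_1 type="UINT8">52</thread_1>
        <thread_2 type="UINT8">53</thread_2>
        <thread_3 type="UINT8">54</thread_3>
      </thread_child>
    </adsp_thread_priority>
    <arm_thread_priority>
      <thread_parent type="UINT8">63</thread_parent>
      <thread_child>
        <thread_0 type="UINT8">62</thread_0>
        <thread_1 type="UINT8">61</thread_1>
        <thread_2 type="UINT8">60</thread_2>
        <thread_3 type="UINT8">59</thread_3>
      </thread_child>
    </arm_thread_priority>
    <adsp>
      <fastheapA type="UINT32">250000</fastheapA>
      <fastheapB type="UINT32">250000</fastheapB>
      <slowheap type="UINT32">250000</slowheap>
    </adsp>
  </params>
</config>
"""

WORD_BYTES = 4  # heap sizes are given in 32-bit words


def check_thread_counts(params):
    """Return {core: (declared, listed)} for every core under <thread_num>."""
    result = {}
    for core in params.find("thread_num"):
        declared = int(core.text)
        children = params.find(f"{core.tag}_thread_priority/thread_child")
        listed = len(list(children)) if children is not None else 0
        result[core.tag] = (declared, listed)
    return result


def heap_bytes(params, core):
    """Convert the three heap sizes of one core from 32-bit words to bytes."""
    node = params.find(core)
    names = ("fastheapA", "fastheapB", "slowheap")
    return {name: int(node.findtext(name)) * WORD_BYTES for name in names}


params = ET.fromstring(CONFIG_XML).find("params")
counts = check_thread_counts(params)
heaps = heap_bytes(params, "adsp")
print(counts)
print(heaps)
```

A mismatch between the declared and listed counts for any core would be worth resolving before the file is converted.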
Appendix B: Performance Benchmarks
This section contains some performance benchmarks for this release. During the measurements, processor clock and DDR speeds were set to maximum.
Interprocessor Communication
The table below shows the measured CPU load, in MHz, needed to transfer 16 channels of 32-bit data at 48 kHz between cores. Transfers used 1 msec blocks. The test uses the ChangeThread module; part of the reported MHz is spent on the sending core and part on the receiving core.
Source core | To ADSP | To GPDSP0 | To GPDSP1 | To Arm |
From ADSP | 2.5 MHz | 24.5 MHz | 23.5 MHz | 7.8 MHz |
From GPDSP0 | 26.8 MHz | 1.7 MHz | 23.6 MHz | 7.8 MHz |
From GPDSP1 | 26.6 MHz | 24.4 MHz | 1.4 MHz | 7.8 MHz |
From Arm | 25.5 MHz | 24.3 MHz | 23.5 MHz | 2.5 MHz |
When data is sent between different cores, it goes through the carveout shared memory. When the source and destination are the same core, the data goes to another hardware thread on that core and non-shared memory is used.
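To compare these figures with other channel counts or sample widths, it can help to normalize them by the payload. The Python sketch below computes the payload per second and per block for this test and expresses a table entry as cycles per transferred byte. The cycles-per-byte metric is our own normalization, not something the benchmark reports, and it combines sender and receiver load.

```python
CHANNELS = 16
BYTES_PER_SAMPLE = 4        # 32-bit data
SAMPLE_RATE_HZ = 48_000
BLOCK_MSEC = 1.0


def payload_bytes_per_second():
    """Bytes moved between cores each second in this test."""
    return CHANNELS * BYTES_PER_SAMPLE * SAMPLE_RATE_HZ


def payload_bytes_per_block():
    """Bytes moved in one 1 msec block."""
    return int(payload_bytes_per_second() * BLOCK_MSEC / 1000)


def cycles_per_byte(load_mhz):
    """Normalize a measured load (MHz, sender plus receiver) by the payload."""
    return load_mhz * 1e6 / payload_bytes_per_second()


print(payload_bytes_per_second())
print(payload_bytes_per_block())
print(round(cycles_per_byte(24.5), 2))  # ADSP to GPDSP0 entry
print(round(cycles_per_byte(2.5), 2))   # ADSP to ADSP entry
```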
TDM I/O
The limiting factor for serial port I/O is the speed of the LP DMA memory. This is 85 MB/second for reading and 170 MB/second for writing. These tests used a 48-sample block size.
TDM Source
16 channels @ 48 kHz. 16-bit samples: 28.3 MHz (theoretical limit of 24 MHz)
16 channels @ 48 kHz. 32-bit samples: 49 MHz
TDM Sink
16 channels @ 48 kHz. 16-bit samples: 14.8 MHz (theoretical limit of 12 MHz)
16 channels @ 48 kHz. 32-bit samples: 25.6 MHz
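The quoted theoretical limits are consistent with a simple model in which the core is stalled for as long as the LP DMA memory takes to move the data. The Python sketch below applies that model. It assumes the tests ran on the ADSP at the 1344 MHz clock listed in Appendix A, that MB means 10^6 bytes, and that the TDM Source is bound by the read rate and the TDM Sink by the write rate. None of these is stated in the text; they are the assumptions that reproduce the 24 MHz and 12 MHz figures.

```python
ADSP_CLOCK_MHZ = 1344.0   # assumption: ADSP cpuclock from Appendix A
READ_MB_PER_S = 85.0      # LP DMA memory read rate
WRITE_MB_PER_S = 170.0    # LP DMA memory write rate


def stall_mhz(channels, sample_rate_hz, bytes_per_sample,
              bandwidth_mb_per_s, clock_mhz=ADSP_CLOCK_MHZ):
    """Core clock (MHz) consumed while waiting on LP DMA memory."""
    bytes_per_s = channels * sample_rate_hz * bytes_per_sample
    busy_fraction = bytes_per_s / (bandwidth_mb_per_s * 1e6)
    return busy_fraction * clock_mhz


source_16bit = stall_mhz(16, 48_000, 2, READ_MB_PER_S)
sink_16bit = stall_mhz(16, 48_000, 2, WRITE_MB_PER_S)
print(round(source_16bit, 1), round(sink_16bit, 1))
```

Under the same assumptions the 32-bit cases come out at twice these values, which is close to the measured 49 MHz and 25.6 MHz.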
Input to Output Latency
This was measured by connecting the TDMSource directly to the TDMSink as shown below.
We measured the delay with an oscilloscope and the path included:
A/D → A2B → TDMSource → Copy → TDMSink → A2B → D/A
The latency varied based on the block size as shown below:
Block Size (samples) | Total Latency (msec) | Analog Latency (msec) | Digital Latency (msec) |
12 | 1.6 | 1.1 | 0.5 |
24 | 2.1 | 1.1 | 1.0 |
48 | 3.1 | 1.1 | 2.0 |
In R4.1 provide measures with Synchronous Unaligned Ports
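The latency table follows the pattern of a digital latency of two block times plus a fixed analog latency. The Python sketch below reproduces the three rows, assuming the 48 kHz sample rate used elsewhere in this appendix; the function names are illustrative only.

```python
SAMPLE_RATE_HZ = 48_000      # assumption: same rate as the other TDM tests
ANALOG_LATENCY_MSEC = 1.1    # measured analog latency from the table


def digital_latency_msec(block_size_samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Digital-in to digital-out latency: two block times."""
    block_msec = 1000.0 * block_size_samples / sample_rate_hz
    return 2.0 * block_msec


def total_latency_msec(block_size_samples):
    """Analog-in to analog-out latency: analog path plus digital latency."""
    return ANALOG_LATENCY_MSEC + digital_latency_msec(block_size_samples)


for block in (12, 24, 48):
    print(block, digital_latency_msec(block), round(total_latency_msec(block), 1))
```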
ALSA I/O
This test measures the overhead of streaming audio data between the HLOS and Audio Weaver. 16 channels of data streamed at 48 kHz. 1 msec block size in Audio Weaver and a 10 msec block size at the HLOS.
ALSA Source SRC (R4.1)
HLOS sends data at 44.1 kHz and Audio Weaver converts to 48 kHz.
Core | MHz / 16 chan | MHz / channel |
ADSP | 52.3 | 3.3 |
GPDSP0 | 51.2 | 3.2 |
GPDSP1 | 51.9 | 3.2 |
Arm | 18.7 | 1.2 |
ALSA Sink SRC (R4.1)
Audio Weaver converts 48 kHz to 44.1 kHz for the HLOS.
Core | MHz / 16 chan | MHz / channel |
ADSP | 62.4 | 3.9 |
GPDSP0 | 60.8 | 3.8 |
GPDSP1 | 60.1 | 3.8 |
Arm | 7.2 | 0.45 |
TDM + ALSA Latency
In this test, we measure the latency from analog in to analog out including a round trip through the HLOS using ALSA modules. The path is:
A/D → A2B → TDMSource → ALSA Sink → HLOS → ALSA Source → TDMSink → A2B → D/A
An application running on the HLOS read the audio delivered by the ALSA Sink and sent it back to the ALSA Source. A 48 kHz sample rate was used throughout, and Audio Weaver was processing at a 48 sample block size. The HLOS application used a 240 sample block size. The ALSA settings used in the test were:
ALSA Sink
Buffer size: 960 samples
Block size: 240 samples
startThreshold: 0 samples
stopThreshold: 0 samples
ALSA Source
Buffer size: 960 samples
Block size: 240 samples
startThreshold: 480 samples (prefill to avoid underruns)
stopThreshold: 0 samples
The measured latency was 13.2 msec. The approximate breakdown is:
1 msec analog latency
2 msec TDM digital latency (using 1 msec block size)
10 msec ALSA latency (using 5 msec block size)
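The arithmetic behind this breakdown can be written out as a short Python sketch. It converts the ALSA settings from samples to milliseconds at 48 kHz and sums the three contributions. Note that 10 msec equals both two 5 msec ALSA blocks and the 480 sample startThreshold; the text does not say which of these accounts for it, so the sketch simply takes the 10 msec figure as given.

```python
SAMPLE_RATE_HZ = 48_000


def samples_to_msec(samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Duration of a number of samples in milliseconds."""
    return 1000.0 * samples / sample_rate_hz


alsa_block_msec = samples_to_msec(240)      # HLOS / ALSA block size
alsa_buffer_msec = samples_to_msec(960)     # ALSA buffer size
source_prefill_msec = samples_to_msec(480)  # ALSA Source startThreshold
tdm_block_msec = samples_to_msec(48)        # Audio Weaver block size

breakdown_msec = {
    "analog": 1.0,                       # from the list above
    "tdm_digital": 2 * tdm_block_msec,   # two block times
    "alsa": 10.0,                        # taken as given from the list above
}
estimated_msec = sum(breakdown_msec.values())
measured_msec = 13.2
print(alsa_block_msec, alsa_buffer_msec, source_prefill_msec)
print(estimated_msec, round(measured_msec - estimated_msec, 1))
```

The remaining difference between the estimate and the measurement is small compared with one ALSA block.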
We then modified the test to include sample rate conversion between Audio Weaver and the HLOS. The measured latency was now:
Sample Rate (Hz) | Block Size (samples) | Buffer Size (samples) | Measured Latency (msec) |
8000 | 80 | 160 | 18.2 |
11025 | 120 | 240 | 21.6 |
12000 | 120 | 240 | 21.8 |
16000 | 160 | 320 | 15.6 |
22050 | 220 | 440 | 20.0 |
24000 | 240 | 480 | 15.4 |
32000 | 320 | 640 | 17.8 |
44100 | 480 | 960 | 19.8 |
48000 | 480 | 960 | 16.2 |
Maximum CPU Loading
In this test, we used the BiquadLoading module to load each thread in the system. We measured how many Biquad stages we could run before we started having CPU overruns. We used a 1 msec block size on the Hexagon DSPs and a 10 msec block size on the Arm. This test measures code and framework efficiency.
ADSP
Thread | BiquadStages | % Loading |
1A | 1600 | 95% |
1B | 1720 | 95% |
1C | 1720 | 95% |
1D | 1720 | 95% |
GPDSP0
Thread | BiquadStages | % Loading |
1A | 2000 | 90% |
1B | 2000 | 90% |
1C | 2000 | 90% |
1D | 2000 | 90% |
1E | 2000 | 90% |
1F | 2000 | 90% |
GPDSP1
Thread | BiquadStages | % Loading |
1A | 2000 | 90% |
1B | 2000 | 90% |
1C | 2000 | 90% |
1D | 2000 | 90% |
1E | 2000 | 90% |
1F | 2000 | 90% |
Arm (only loaded a single thread)
Thread | BiquadStages | % Loading |
10A | 5000 | 50% |
Early Audio KPIs
In release R4.0, we are not able to fully measure this KPI. However, we were able to measure the Audio Weaver contribution to the boot time. We instrumented the code and measured the time from when the main() function was reached on the ADSP until real-time audio interrupts started.
For the measurement, we used the file “SA8255_Early_Audio_Parallel_with_Load.awd”. This contains 1320 modules spread across all four cores. It also has a TDM input port and a TDM output port, and generates audio on the A2B output. There is a signal generator on each core (sine wave or noise). The signal flow was designed so that you can distinguish the sound of each core and verify that each core is running properly just by listening. The top level is simple, and additional subsystems load up each of the cores.
All Hexagon DSPs were part of the early audio group, while the Arm booted later. The combined AWB file was 335,760 bytes long and was split into 4 separate AWBs, one per core:
SA8255_Early_Audio_Parallel_with_Load_0.awb [ADSP. 84,644 bytes]
SA8255_Early_Audio_Parallel_with_Load_1.awb [GPDSP0. 83,656 bytes]
SA8255_Early_Audio_Parallel_with_Load_2.awb [GPDSP1. 83,824 bytes]
SA8255_Early_Audio_Parallel_with_Load_3.awb [Arm. 83,680 bytes]
The DSPs log information when they boot. We observed:
00:00:43.617500 [awe_bsp.c 1140] AWE ADSP:awe cfg is ready!
00:00:43.635000 [awe_bsp.c 1417] AWE ADSP:The first time to pump audio
00:00:43.617500 [awe_bsp.c 1140] AWE GPDSP0:awe cfg is ready!
00:00:43.638750 [awe_bsp.c 1417] AWE GPDSP0:The first time to pump audio
00:00:43.617500 [awe_bsp.c 1140] AWE GPDSP1:awe cfg is ready!
00:00:43.645000 [awe_bsp.c 1417] AWE GPDSP1:The first time to pump audio
The time from the ADSP booting until it is ready to generate audio is 17.5 msec.
The time from the ADSP booting until GPDSP0 and GPDSP1 are ready to generate audio is 27.5 msec.
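These figures can be reproduced from the log with a short Python sketch. It takes the “awe cfg is ready!” timestamp on the ADSP as the starting point, which is our assumption and is the reading that matches the quoted numbers, and reports the delay to each core's first audio pump. The regular expression is written for the log format shown above only.

```python
import re

LOG = """\
00:00:43.617500 [awe_bsp.c 1140] AWE ADSP:awe cfg is ready!
00:00:43.635000 [awe_bsp.c 1417] AWE ADSP:The first time to pump audio
00:00:43.617500 [awe_bsp.c 1140] AWE GPDSP0:awe cfg is ready!
00:00:43.638750 [awe_bsp.c 1417] AWE GPDSP0:The first time to pump audio
00:00:43.617500 [awe_bsp.c 1140] AWE GPDSP1:awe cfg is ready!
00:00:43.645000 [awe_bsp.c 1417] AWE GPDSP1:The first time to pump audio
"""

LINE_RE = re.compile(r"^(\d+):(\d+):(\d+)\.(\d{6}) \[.*?\] AWE (\w+):(.*)$")


def parse_events(log):
    """Return {core: {message: timestamp in integer microseconds}}."""
    events = {}
    for line in log.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue
        hours, minutes, seconds, micros, core, message = match.groups()
        total_seconds = (int(hours) * 60 + int(minutes)) * 60 + int(seconds)
        timestamp_usec = total_seconds * 1_000_000 + int(micros)
        events.setdefault(core, {})[message.strip()] = timestamp_usec
    return events


events = parse_events(LOG)
start_usec = events["ADSP"]["awe cfg is ready!"]
startup_msec = {
    core: (core_events["The first time to pump audio"] - start_usec) / 1000
    for core, core_events in events.items()
}
print(startup_msec)
```

The 27.5 msec figure quoted for GPDSP0 and GPDSP1 corresponds to the later of the two cores.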
A2B Board Setup
The board shown below is the “Rev D” version. There are 4 smaller jumper switches and they need to be set as shown:
1 Unless, of course, you want to write your own custom modules that execute on the Hexagon.
2 The objectID of this SourceInt module is specified in the system configuration file.
3 You will also need a Matlab license. We recommend Matlab 2022b.
4 This is a new feature which was recently implemented in Audio Weaver. If your design has custom modules, then you may need to add a custom resetState function.
5 This is how varying sample rates are handled in Audio Weaver. The buffers are oversized for the worst case transfer size and side information - in the timing information pins - is used to regulate the flow.
6 The discussion here is based on block times which are easier to follow. The actual configuration is based on block sizes.