Introduction

This document describes how Audio Weaver has been integrated on the Qualcomm Snapdragon SoC using low-level APIs. The integration bypasses Audio Reach and provides significant performance improvements. Read this document together with the generic description of the Audio Weaver architecture in Audio-Weaver--Architecture. The overall design is flexible enough to handle all automotive use cases and configurations. You should be able to fully engineer your automotive audio system using Audio Weaver without having to write custom Hexagon code¹.

The platform-specific code needed to wrap Audio Weaver is referred to as the Board Support Package (BSP). We will use the term BSP throughout this document to describe this code. The BSP is built on Qualcomm's AudioLite APIs, a low-level software layer which provides an RTOS, TDM port I/O, and basic system features.

The document covers integrations on the Gen4 SA8255 and the Gen5 SA8397 / SA8797 “Nordy” chipsets. The integrations are very similar; where they diverge, each integration is documented separately.

Platform Features

Graphical development
Multicore support - distribute audio processing across all Hexagon DSPs and the Arm.
Unified signal flow showing an overall view of all cores and threads
Integrated profiling
Highly optimized including HVX support on Gen5 chipsets
Access to IP
- Over 500 Audio Weaver modules
- Qualcomm voice IP
- Custom module API
- 3rd party ecosystem
Real-time audio features
- TDM serial port configuration via Audio Weaver modules
- Independent TDM ports automatically synchronize to within 1 sample
- ALSA I/O configuration via Audio Weaver modules
- Low latency - as low as 0.5 msec digital-in to digital-out using a 0.25 msec block size
- Early audio within 1.5 seconds
- Resynchronizes automatically after CPU overrun
Software integration
- BSP configurable via a text initialization file
- Run-time control via TinyMix APIs
- Text based control API with integrated data type, range, and array bounds checking
- Subsystem Restart (SSR) feature reboots DSPs and restarts the Audio Service if there is a critical run-time failure
- Asynchronous event handling
Integrated full-featured Matlab API for scripting and regression testing
Supports all automotive use cases with concurrent operation
TFLM and ONNX support
Fully documented
Online training available

Comparison with Audio Reach

Audio Reach	Audio Weaver
Developed for power constrained mobile products. Single use case.	Developed for high-performance automotive audio. Multiple concurrent use cases.
Variable processing load.	Constant / deterministic real-time load
Must keep cores loaded < 70%	Can load cores to 90%
Separate AMS framework needed for low latency support. End-to-end digital latency of 3 x block size.	Native low-latency support. End-to-end digital latency of 2 x block size.
TDM ports aligned within 12 samples	TDM ports aligned within 1 sample
SysMon does not provide actionable information to fully load the system.	Easy to understand profiling per module, per thread, and per core. Shows average and peak CPU load.
Only supports Hexagon DSPs; no Arm.	Supports all cores including Arm.
QXDM is a poor fit for real-time debugging.	Includes integrated visual debugging tools and legacy QXDM.
Numerous side effects. Many features can only be supported by Qualcomm.	Architecture is fully documented and information is publicly available.

Use Case Examples

We should include a full-featured block diagram which shows how to implement a complete automotive system in Audio Weaver. We could have placeholders for some functions, like RNC and Telephony, but it should be multicore and highlight best practices for designing systems.

Appendix A. System Configuration File

The R4.0 system uses an XML file to configure Audio Weaver and the underlying BSP. This file is converted to a binary file and loaded at system boot. An annotated version of the XML file is shown below.

<awe_process>

</awe_process>

<log_level type="UINT32" max_level="2">

</log_level>

<thread_num>

</thread_num>

<adsp_thread_priority>

<!-- thread_parent is the priority of the main audio processing interrupt.

This is the highest priority audio thread. -->

<thread_parent type="UINT8">50</thread_parent>

<thread_child>

<!-- Next, separate priorities for each child thread.

The number of child threads is specified above.

The child threads should be in decreasing priority from thread_parent. -->

<thread_0 type="UINT8">51</thread_0>

<thread_1 type="UINT8">52</thread_1>

<thread_2 type="UINT8">53</thread_2>

<thread_3 type="UINT8">54</thread_3>

</thread_child>

</adsp_thread_priority>

<gpdsp0_thread_priority>

<thread_parent type="UINT8">50</thread_parent>

<thread_child>

<thread_0 type="UINT8">51</thread_0>

<thread_1 type="UINT8">52</thread_1>

<thread_2 type="UINT8">53</thread_2>

<thread_3 type="UINT8">54</thread_3>

<thread_4 type="UINT8">55</thread_4>

<thread_5 type="UINT8">56</thread_5>

<thread_6 type="UINT8">57</thread_6>

<thread_7 type="UINT8">58</thread_7>

<thread_8 type="UINT8">59</thread_8>

<thread_9 type="UINT8">60</thread_9>

</thread_child>

</gpdsp0_thread_priority>

<gpdsp1_thread_priority>

<thread_parent type="UINT8">50</thread_parent>

<thread_child>

<thread_0 type="UINT8">51</thread_0>

<thread_1 type="UINT8">52</thread_1>

<thread_2 type="UINT8">53</thread_2>

<thread_3 type="UINT8">54</thread_3>

<thread_4 type="UINT8">55</thread_4>

<thread_5 type="UINT8">56</thread_5>

<thread_6 type="UINT8">57</thread_6>

<thread_7 type="UINT8">58</thread_7>

<thread_8 type="UINT8">59</thread_8>

<thread_9 type="UINT8">60</thread_9>

</thread_child>

</gpdsp1_thread_priority>

<arm_thread_priority>

<thread_parent type="UINT8">63</thread_parent>

<thread_child>

<thread_0 type="UINT8">62</thread_0>

<thread_1 type="UINT8">61</thread_1>

<thread_2 type="UINT8">60</thread_2>

<thread_3 type="UINT8">59</thread_3>

</thread_child>

</arm_thread_priority>

<num_of_instances type="UINT32">4</num_of_instances>

<adsp>

</adsp>

<gpdsp0>

</gpdsp0>

<gpdsp1>

</gpdsp1>

<arm>

</arm>

<!-- Size of the shared memory heap. This is visible to all cores.

In units of 32-bit words -->

<shared_heap_size type="UINT32" unit="UINT32">262000</shared_heap_size>

<!-- These values specify the "targetInfo" which is read back by the Audio Weaver

Server when it connects to the target. This will be removed in the future. -->

<block_size type="UINT16">48</block_size>

<sample_rate type="UINT16">48000</sample_rate>

<in_channel type="UINT16">16</in_channel>

<out_channel type="UINT16">16</out_channel>

</params>

</config>
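
The priority entries in the listing above can be sanity-checked before the file is converted to binary. The following is a minimal sketch of one way to do that, not part of the BSP tooling. It parses a small well-formed excerpt built from the element names in the listing. The placement of the elements directly under a `<config>` root and the helper names `priority_chain` and `is_strictly_monotonic` are assumptions made for illustration. It checks that each core's priorities move in a single direction away from `thread_parent`, and it converts `shared_heap_size` from 32-bit words to bytes.

```python
import xml.etree.ElementTree as ET

# Well-formed excerpt using element names and values from the listing.
# Assumption: the elements are shown directly under <config> only so that
# the excerpt parses on its own; the real file layout may differ.
EXCERPT = """
<config>
  <adsp_thread_priority>
    <thread_parent type="UINT8">50</thread_parent>
    <thread_child>
      <thread_0 type="UINT8">51</thread_0>
      <thread_1 type="UINT8">52</thread_1>
      <thread_2 type="UINT8">53</thread_2>
      <thread_3 type="UINT8">54</thread_3>
    </thread_child>
  </adsp_thread_priority>
  <arm_thread_priority>
    <thread_parent type="UINT8">63</thread_parent>
    <thread_child>
      <thread_0 type="UINT8">62</thread_0>
      <thread_1 type="UINT8">61</thread_1>
      <thread_2 type="UINT8">60</thread_2>
      <thread_3 type="UINT8">59</thread_3>
    </thread_child>
  </arm_thread_priority>
  <shared_heap_size type="UINT32" unit="UINT32">262000</shared_heap_size>
</config>
"""


def priority_chain(core_elem):
    """Return [parent, child_0, child_1, ...] as integers (hypothetical helper)."""
    chain = [int(core_elem.find("thread_parent").text)]
    chain += [int(child.text) for child in core_elem.find("thread_child")]
    return chain


def is_strictly_monotonic(values):
    """True if the values only increase or only decrease."""
    pairs = list(zip(values, values[1:]))
    return all(a < b for a, b in pairs) or all(a > b for a, b in pairs)


root = ET.fromstring(EXCERPT)
chains = {
    elem.tag: priority_chain(elem)
    for elem in root
    if elem.tag.endswith("_thread_priority")
}
for name, chain in chains.items():
    print(name, chain, "ok" if is_strictly_monotonic(chain) else "CHECK ORDER")

# shared_heap_size is given in 32-bit words, so multiply by 4 for bytes.
heap_bytes = int(root.find("shared_heap_size").text) * 4
print("shared heap bytes:", heap_bytes)
```

In the listing, the Hexagon entries count up from the parent value while the Arm entry counts down, so the check accepts either direction rather than assuming one numeric convention.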

Appendix B: Performance Benchmarks

This section contains some performance benchmarks for this release. During the measurements, processor clock and DDR speeds were set to maximum.

Interprocessor Communication

The table below shows the measured CPU load, in MHz, needed to transfer 16 channels of 32-bit data at 48 kHz between cores. Transfers were done with 1 msec blocks. The test uses the ChangeThread module; part of the reported MHz is consumed on the sending core and part on the receiving core.

	To ADSP	To GPDSP0	To GPDSP1	To Arm
From ADSP	2.5 MHz	24.5 MHz	23.5 MHz	7.8 MHz
From GPDSP0	26.8 MHz	1.7 MHz	23.6 MHz	7.8 MHz
From GPDSP1	26.6 MHz	24.4 MHz	1.4 MHz	7.8 MHz
From Arm	25.5 MHz	24.3 MHz	23.5 MHz	2.5 MHz

When data is sent between different cores, it goes through the carveout shared memory. When the source and destination are the same core, the data goes to another hardware thread on that core and non-shared memory is used.

TDM I/O

The limiting factor for serial port I/O is the speed of the LP DMA memory. This is 85 MB/second for reading and 170 MB/second for writing. These tests used a 48-sample block size.

TDM Source

16 channels @ 48 kHz. 16-bit samples: 28.3 MHz (theoretical limit of 24 MHz)

16 channels @ 48 kHz. 32-bit samples: 49 MHz

TDM Sink

16 channels @ 48 kHz. 16-bit samples: 14.8 MHz (theoretical limit of 12 MHz)

16 channels @ 48 kHz. 32-bit samples: 25.6 MHz
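
The theoretical limits quoted for the 16-bit cases can be related to the LP DMA speeds with a short calculation. The sketch below rests on assumptions that this document does not state: that 1 MB means 10^6 bytes, that the TDM Source is bounded by the read speed and the TDM Sink by the write speed, and that the theoretical limit is purely the time the core spends moving the payload. Under those assumptions it computes the fraction of each second spent on the transfer and back-calculates the core clock that the quoted limits would imply. The function name `dma_time_fraction` is hypothetical.

```python
def dma_time_fraction(channels, rate_hz, bytes_per_sample, dma_bytes_per_sec):
    """Fraction of each second needed to move the audio payload."""
    payload_bytes_per_sec = channels * rate_hz * bytes_per_sample
    return payload_bytes_per_sec / dma_bytes_per_sec


READ_BW = 85e6    # 85 MB/second, assuming 1 MB = 10**6 bytes
WRITE_BW = 170e6  # 170 MB/second, same assumption

# 16 channels @ 48 kHz, 16-bit (2-byte) samples
source_fraction = dma_time_fraction(16, 48_000, 2, READ_BW)
sink_fraction = dma_time_fraction(16, 48_000, 2, WRITE_BW)

# Quoted theoretical limits: 24 MHz (source) and 12 MHz (sink).
# Dividing by the time fraction gives the clock those limits would imply.
implied_clock_source_mhz = 24.0 / source_fraction
implied_clock_sink_mhz = 12.0 / sink_fraction

print(f"source: {source_fraction:.4%} of time, implied clock {implied_clock_source_mhz:.1f} MHz")
print(f"sink:   {sink_fraction:.4%} of time, implied clock {implied_clock_sink_mhz:.1f} MHz")
```

Both limits lead to the same implied clock, which is consistent with this reading of the numbers but does not confirm it. Treat the result as an inference rather than a documented clock rate.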

Input to Output Latency

This was measured by connecting the TDMSource directly to the TDMSink as shown below.

We measured the delay with an oscilloscope and the path included:

A/D → A2B → TDMSource → Copy → TDMSink → A2B → D/A

The latency varied based on the block size as shown below:

Block Size (samples)	Total Latency (msec)	Analog Latency (msec)	Digital Latency (msec)
12	1.6	1.1	0.5
24	2.1	1.1	1.0
48	3.1	1.1	2.0
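
The table follows the rule given in the Audio Reach comparison: end-to-end digital latency is 2 x block size. The sketch below reproduces the rows from that rule, assuming a 48 kHz sample rate (not stated explicitly for this test) and taking the 1.1 msec analog latency from the table. The function name `digital_latency_ms` is illustrative only.

```python
SAMPLE_RATE_HZ = 48_000    # assumption: the sample rate is not stated for this test
ANALOG_LATENCY_MS = 1.1    # analog latency column from the table


def digital_latency_ms(block_size, sample_rate_hz=SAMPLE_RATE_HZ, blocks=2):
    """Digital latency as a whole number of block times (2 by default)."""
    return blocks * block_size * 1000.0 / sample_rate_hz


rows = [
    (block, round(ANALOG_LATENCY_MS + digital_latency_ms(block), 1), digital_latency_ms(block))
    for block in (12, 24, 48)
]
for block, total_ms, digital_ms in rows:
    print(f"{block:>3} samples: total {total_ms} msec, digital {digital_ms} msec")
```

With a 12-sample block (0.25 msec at 48 kHz) this gives the 0.5 msec digital-in to digital-out figure listed under Platform Features.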

Measurements with Synchronous Unaligned Ports are planned for R4.1.

ALSA I/O

This test measures the overhead of streaming audio data between the HLOS and Audio Weaver. 16 channels of data streamed at 48 kHz. 1 msec block size in Audio Weaver and a 10 msec block size at the HLOS.

ALSA Source SRC (R4.1)

HLOS sends data at 44.1 kHz and Audio Weaver converts to 48 kHz.

Core	MHz / 16 chan	MHz / channel
ADSP	52.3	3.3
GPDSP0	51.2	3.2
GPDSP1	51.9	3.2
Arm	18.7	1.2

ALSA Sink SRC (R4.1)

Audio Weaver converts 48 kHz to 44.1 kHz for the HLOS.

Core	MHz / 16 chan	MHz / channel
ADSP	62.4	3.9
GPDSP0	60.8	3.8
GPDSP1	60.1	3.8
Arm	7.2	0.45

TDM + ALSA Latency

In this test, we measure the latency from analog in to analog out including a round trip through the HLOS using ALSA modules. The path is:

A/D → A2B → TDMSource → ALSA Sink → HLOS → ALSA Source → TDMSink → A2B → D/A

An application running on the HLOS read the audio delivered by the ALSA Sink and sent it back to the ALSA Source. A 48 kHz sample rate was used throughout, and Audio Weaver processed audio with a 48-sample block size. The HLOS application used a 240-sample block size. The ALSA settings used in the test were:

ALSA Sink

Buffer size: 960 samples
Block size: 240 samples
startThreshold: 0 samples
stopThreshold: 0 samples

ALSA Source

Buffer size: 960 samples
Block size: 240 samples
startThreshold: 480 samples (prefill to avoid underruns)
stopThreshold: 0 samples

The measured latency was 13.2 msec. The approximate breakdown is:

1 msec analog latency

2 msec TDM digital latency (using 1 msec block size)

10 msec ALSA latency (using 5 msec block size)
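
The block times in this breakdown follow from the sample counts given for the test. The sketch below converts the 48-sample and 240-sample block sizes to milliseconds at 48 kHz, sums the three listed components, and compares the sum with the measured value. The helper name `samples_to_ms` is illustrative. The component values are the ones listed above, and the code does not attempt to explain how the 10 msec ALSA figure arises.

```python
SAMPLE_RATE_HZ = 48_000  # stated for this test


def samples_to_ms(samples, rate_hz=SAMPLE_RATE_HZ):
    """Convert a sample count to milliseconds."""
    return samples * 1000.0 / rate_hz


aw_block_ms = samples_to_ms(48)     # Audio Weaver block size
hlos_block_ms = samples_to_ms(240)  # HLOS application block size

breakdown_ms = {
    "analog": 1.0,                   # listed analog latency
    "tdm_digital": 2 * aw_block_ms,  # 2 x block size, per the comparison table
    "alsa": 10.0,                    # listed ALSA latency
}
estimate_ms = sum(breakdown_ms.values())
measured_ms = 13.2
residual_ms = measured_ms - estimate_ms

print(f"Audio Weaver block: {aw_block_ms} msec, HLOS block: {hlos_block_ms} msec")
print(f"sum of components: {estimate_ms} msec, unaccounted: {residual_ms:.1f} msec")
```

The listed components sum to 13.0 msec, so about 0.2 msec of the measured 13.2 msec is not attributed to a specific stage.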

We then modified the test to include sample rate conversion between Audio Weaver and the HLOS. The measured latency was now:

Sample Rate (Hz)	Block Size (samples)	Buffer Size (samples)	Measured Latency (msec)
8000	80	160	18.2
11025	120	240	21.6
12000	120	240	21.8
16000	160	320	15.6
22050	220	440	20.0
24000	240	480	15.4
32000	320	640	17.8
44100	480	960	19.8
48000	480	960	16.2

Maximum CPU Loading

In this test, we used the BiquadLoading module to load each thread in the system. We measured how many Biquad stages we could run before we started having CPU overruns. We used a 1 msec block size on the Hexagon DSPs and a 10 msec block size on the Arm. This test measures code and framework efficiency.

ADSP

Thread	BiquadStages	% Loading
1A	1600	95%
1B	1720	95%
1C	1720	95%
1D	1720	95%

GPDSP0

Thread	BiquadStages	% Loading
1A	2000	90%
1B	2000	90%
1C	2000	90%
1D	2000	90%
1E	2000	90%
1F	2000	90%

GPDSP1

Thread	BiquadStages	% Loading
1A	2000	90%
1B	2000	90%
1C	2000	90%
1D	2000	90%
1E	2000	90%
1F	2000	90%

Arm (only loaded a single thread)

Thread	BiquadStages	% Loading
10A	5000	50%

Early Audio KPIs

In release R4.0, we are not able to fully measure this KPI. However, we were able to measure the Audio Weaver contribution to the boot time. We instrumented the code and measured the time from when the main() function was reached on the ADSP until real-time audio interrupts started.

For the measurement, we used the file “SA8255_Early_Audio_Parallel_with_Load.awd”. It contains 1320 modules spread across all four cores, has a TDM input port and a TDM output port, and generates audio on the A2B output. Each core has a signal generator (sine wave or noise). The signal flow was designed so that you can distinguish the sound of each core and verify, just by listening, that each core is running properly. The top level is simple, and additional subsystems load up each of the cores.

All Hexagon DSPs are part of the early audio group while the Arm booted later. The combined AWB file was 335,760 bytes long and this was split into 4 separate AWBs, one per core:

SA8255_Early_Audio_Parallel_with_Load_0.awb [ADSP. 84,644 bytes]

SA8255_Early_Audio_Parallel_with_Load_1.awb [GPDSP0. 83,656 bytes]

SA8255_Early_Audio_Parallel_with_Load_2.awb [GPDSP1. 83,824 bytes]

SA8255_Early_Audio_Parallel_with_Load_3.awb [Arm. 83,680 bytes]

The DSPs log information when they boot. We observed:

00:00:43.617500 [awe_bsp.c 1140] AWE ADSP:awe cfg is ready!

00:00:43.635000 [awe_bsp.c 1417] AWE ADSP:The first time to pump audio

00:00:43.617500 [awe_bsp.c 1140] AWE GPDSP0:awe cfg is ready!

00:00:43.638750 [awe_bsp.c 1417] AWE GPDSP0:The first time to pump audio

00:00:43.617500 [awe_bsp.c 1140] AWE GPDSP1:awe cfg is ready!

00:00:43.645000 [awe_bsp.c 1417] AWE GPDSP1:The first time to pump audio

The time from the ADSP booting until it is ready to generate audio is 17.5 msec.

The time from the ADSP booting until GPDSP0 and GPDSP1 are ready to generate audio is 27.5 msec.
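
The two figures above come straight from the log timestamps. The sketch below shows one way to extract them, assuming the log line layout seen in the excerpt (timestamp, source location in brackets, then `AWE <core>:<message>`) and using the ADSP "awe cfg is ready!" line as the starting point. The regular expression and the name `parse_events` are illustrative and not part of the BSP.

```python
import re
from decimal import Decimal

LOG = """\
00:00:43.617500 [awe_bsp.c 1140] AWE ADSP:awe cfg is ready!
00:00:43.635000 [awe_bsp.c 1417] AWE ADSP:The first time to pump audio
00:00:43.617500 [awe_bsp.c 1140] AWE GPDSP0:awe cfg is ready!
00:00:43.638750 [awe_bsp.c 1417] AWE GPDSP0:The first time to pump audio
00:00:43.617500 [awe_bsp.c 1140] AWE GPDSP1:awe cfg is ready!
00:00:43.645000 [awe_bsp.c 1417] AWE GPDSP1:The first time to pump audio
"""

# Assumed layout: HH:MM:SS.ffffff [file line] AWE CORE:message
LINE_RE = re.compile(r"(\d+):(\d+):(\d+\.\d+) \[[^\]]*\] AWE (\w+):(.*)")
READY = "awe cfg is ready!"
FIRST_PUMP = "The first time to pump audio"


def parse_events(log_text):
    """Map (core, message) to a timestamp in integer microseconds."""
    events = {}
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        hours, minutes, seconds, core, message = match.groups()
        whole_us = (int(hours) * 3600 + int(minutes) * 60) * 1_000_000
        # Decimal keeps the fractional seconds exact before converting to int.
        events[(core, message.strip())] = whole_us + int(Decimal(seconds) * 1_000_000)
    return events


events = parse_events(LOG)
start_us = events[("ADSP", READY)]
startup_ms = {
    core: (t_us - start_us) / 1000
    for (core, message), t_us in events.items()
    if message == FIRST_PUMP
}
print(startup_ms)
```

For this log, the ADSP starts pumping audio after 17.5 msec, GPDSP0 after 21.25 msec, and GPDSP1 after 27.5 msec, so all Hexagon DSPs are running within 27.5 msec.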

A2B Board Setup

The board shown below is the “Rev D” version. There are 4 smaller jumper switches and they need to be set as shown:

¹ Unless, of course, you want to write your own custom modules that execute on the Hexagon.

² The objectID of this SourceInt module is specified in the system configuration file.

³ You will also need a Matlab license. We recommend Matlab 2022b.

⁴ This is a new feature which was recently implemented in Audio Weaver. If your design has custom modules, then you may need to add a custom resetState function.

⁵ This is how varying sample rates are handled in Audio Weaver. The buffers are oversized for the worst case transfer size and side information - in the timing information pins - is used to regulate the flow.

⁶ The discussion here is based on block times which are easier to follow. The actual configuration is based on block sizes.

AWE-Q: AWE on Snapdragon Bare Metal User Guide
