Performance Benchmarks

Notice: Pre-Release Documentation
This document is part of a prerelease and is currently a work in progress. Some content may be incomplete, subject to change, or marked as TBD. We are actively updating this documentation and will continue to provide the most accurate and up-to-date information as development progresses. Thank you for your understanding!

This section reports performance benchmarks for this release. All measurements were taken with the processor clock and DDR speeds set to their maximum.

Interprocessor Communication

The table below shows the CPU load, in MHz, required to transfer 16 channels of 32-bit data at 48 kHz between cores. Transfers used 1 msec blocks. The test uses the ChangeThread module, so part of the reported MHz is spent on the sending core and part on the receiving core.

	To ADSP	To GPDSP0	To GPDSP1	To Arm
From ADSP	2.5 MHz	24.5 MHz	23.5 MHz	7.8 MHz
From GPDSP0	26.8 MHz	1.7 MHz	23.6 MHz	7.8 MHz
From GPDSP1	26.6 MHz	24.4 MHz	1.4 MHz	7.8 MHz
From Arm	25.5 MHz	24.3 MHz	23.5 MHz	2.5 MHz

Data sent between different cores passes through the carveout shared memory. When the source and destination are on the same core, the data goes to another hardware thread on that core and non-shared memory is used.
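
To put the MHz figures in context, the payload moved in this test can be worked out from the parameters stated above. The following is a minimal sketch in Python; the function and variable names are illustrative and are not part of Audio Weaver.

```python
# Test parameters taken from the text above.
CHANNELS = 16
BYTES_PER_SAMPLE = 4      # 32-bit data
SAMPLE_RATE_HZ = 48_000
BLOCK_MSEC = 1


def block_payload_bytes(channels, bytes_per_sample, sample_rate_hz, block_msec):
    """Bytes moved per block for a multichannel stream."""
    samples_per_block = sample_rate_hz * block_msec // 1000
    return channels * bytes_per_sample * samples_per_block


def stream_rate_bytes_per_sec(channels, bytes_per_sample, sample_rate_hz):
    """Sustained data rate of the stream in bytes per second."""
    return channels * bytes_per_sample * sample_rate_hz


bytes_per_block = block_payload_bytes(
    CHANNELS, BYTES_PER_SAMPLE, SAMPLE_RATE_HZ, BLOCK_MSEC
)
bytes_per_sec = stream_rate_bytes_per_sec(CHANNELS, BYTES_PER_SAMPLE, SAMPLE_RATE_HZ)

print(bytes_per_block)  # 3072 bytes per 1 msec block
print(bytes_per_sec)    # 3072000 bytes per second
```

Each 1 msec transfer therefore carries 3072 bytes, and the table reports the combined sender and receiver cost of moving that payload 1000 times per second.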

TDM I/O

The limiting factor for serial port I/O is the speed of the LP DMA memory, which can be read at 85 MB/second and written at 170 MB/second. These tests used a block size of 48 samples.

TDM Source

16 channels @ 48 kHz. 16-bit samples: 28.3 MHz (theoretical limit of 24 MHz)

16 channels @ 48 kHz. 32-bit samples: 49 MHz

TDM Sink

16 channels @ 48 kHz. 16-bit samples: 14.8 MHz (theoretical limit of 12 MHz)

16 channels @ 48 kHz. 32-bit samples: 25.6 MHz
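
The relationship between these numbers and the LP DMA speeds can be checked with a little arithmetic. The sketch below, in Python, assumes that the TDM Source cost is dominated by reads from LP DMA memory and the TDM Sink cost by writes; that reading is an interpretation of the figures, not something stated by the measurements themselves. All names are illustrative.

```python
# LP DMA memory speeds and measured loads quoted in this section.
LP_DMA_READ_MB_S = 85.0
LP_DMA_WRITE_MB_S = 170.0

MEASURED_MHZ = {
    ("source", 16): 28.3,
    ("source", 32): 49.0,
    ("sink", 16): 14.8,
    ("sink", 32): 25.6,
}


def stream_mb_per_sec(channels, bits_per_sample, sample_rate_hz):
    """Audio data rate in MB/second (1 MB = 1e6 bytes)."""
    return channels * (bits_per_sample // 8) * sample_rate_hz / 1e6


# Ratio of write speed to read speed for LP DMA memory.
write_to_read = LP_DMA_WRITE_MB_S / LP_DMA_READ_MB_S

# Ratio of source cost to sink cost at each sample width.
source_to_sink = {
    bits: MEASURED_MHZ[("source", bits)] / MEASURED_MHZ[("sink", bits)]
    for bits in (16, 32)
}

# Ratio of 32-bit cost to 16-bit cost for each module.
wide_to_narrow = {
    kind: MEASURED_MHZ[(kind, 32)] / MEASURED_MHZ[(kind, 16)]
    for kind in ("source", "sink")
}

print(stream_mb_per_sec(16, 16, 48_000), stream_mb_per_sec(16, 32, 48_000))
print(write_to_read)
print({bits: round(r, 2) for bits, r in source_to_sink.items()})
print({kind: round(r, 2) for kind, r in wide_to_narrow.items()})
```

Under that assumption, the source costs about 1.9 times as much as the sink at both sample widths, close to the 2:1 ratio of write speed to read speed, and doubling the sample width raises the cost by about 1.7 times rather than 2 times.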

Input to Output Latency

This was measured by connecting the TDMSource directly to the TDMSink as shown below.

We measured the delay with an oscilloscope and the path included:

A/D → A2B → TDMSource → Copy → TDMSink → A2B → D/A

The latency varied based on the block size as shown below:

Block Size (samples)	Total Latency (msec)	Analog Latency (msec)	Digital Latency (msec)
12	1.6	1.1	0.5
24	2.1	1.1	1.0
48	3.1	1.1	2.0
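
The table follows a simple pattern that can be verified numerically. The sketch below, in Python, assumes a 48 kHz sample rate (the rate used in the TDM tests above) and checks that the digital latency equals the total minus the analog latency and corresponds to two block periods at each block size. The names are illustrative.

```python
SAMPLE_RATE_HZ = 48_000  # assumed: the rate used in the TDM tests above

# (block size in samples, total msec, analog msec, digital msec) from the table
ROWS = [
    (12, 1.6, 1.1, 0.5),
    (24, 2.1, 1.1, 1.0),
    (48, 3.1, 1.1, 2.0),
]


def block_msec(block_samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Duration of one block in milliseconds."""
    return 1000.0 * block_samples / sample_rate_hz


for block, total, analog, digital in ROWS:
    blocks_of_delay = digital / block_msec(block)
    print(block, block_msec(block), round(total - analog, 1), blocks_of_delay)
```

Every row works out to two blocks of digital delay, so halving the block size halves the digital latency while the 1.1 msec analog latency stays fixed.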

TBD: Measurements with Synchronous Unaligned Ports will be provided in R4.1.

ALSA I/O

This test measures the overhead of streaming audio data between the HLOS and Audio Weaver. The test streamed 16 channels of data at 48 kHz, with a 1 msec block size in Audio Weaver and a 10 msec block size at the HLOS.

ALSA Source SRC (R4.1)

HLOS sends data at 44.1 kHz and Audio Weaver converts to 48 kHz.

Core	MHz / 16 chan	MHz / channel
ADSP	52.3	3.3
GPDSP0	51.2	3.2
GPDSP1	51.9	3.2
Arm	18.7	1.2

ALSA Sink SRC (R4.1)

Audio Weaver converts 48 kHz to 44.1 kHz for the HLOS.

Core	MHz / 16 chan	MHz / channel
ADSP	62.4	3.9
GPDSP0	60.8	3.8
GPDSP1	60.1	3.8
Arm	7.2	0.45
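
The per-channel columns in the two tables above are the 16-channel totals divided by the channel count. A minimal Python sketch of that calculation follows; the dictionary and function names are illustrative.

```python
CHANNELS = 16

# 16-channel totals in MHz, copied from the two tables above.
SOURCE_SRC_MHZ = {"ADSP": 52.3, "GPDSP0": 51.2, "GPDSP1": 51.9, "Arm": 18.7}
SINK_SRC_MHZ = {"ADSP": 62.4, "GPDSP0": 60.8, "GPDSP1": 60.1, "Arm": 7.2}


def per_channel(mhz_by_core, channels=CHANNELS):
    """Divide each core's total load by the number of channels."""
    return {core: mhz / channels for core, mhz in mhz_by_core.items()}


for name, table in (("ALSA Source SRC", SOURCE_SRC_MHZ), ("ALSA Sink SRC", SINK_SRC_MHZ)):
    for core, mhz in per_channel(table).items():
        print(f"{name:16s} {core:7s} {mhz:.2f} MHz / channel")
```

The same function can be used to estimate the load for a different channel count, on the assumption that the cost scales linearly with the number of channels, which these tests do not confirm.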

TDM + ALSA Latency

In this test, we measure the latency from analog in to analog out including a round trip through the HLOS using ALSA modules. The path is:

A/D → A2B → TDMSource → ALSA Sink → HLOS → ALSA Source → TDMSink → A2B → D/A

An application running on the HLOS read the audio delivered by the ALSA Sink and sent it back to the ALSA Source. A 48 kHz sample rate was used throughout, and Audio Weaver processed audio with a 48 sample block size. The HLOS application used a 240 sample block size. The ALSA settings used in the test were:

ALSA Sink

Buffer size: 960 samples
Block size: 240 samples
startThreshold: 0 samples
stopThreshold: 0 samples

ALSA Source

Buffer size: 960 samples
Block size: 240 samples
startThreshold: 480 samples (prefill to avoid underruns)
stopThreshold: 0 samples
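
Because all of these settings are given in samples, it can help to convert them to time at the 48 kHz rate used in this test. The Python sketch below does that conversion; the dictionary keys mirror the setting names listed above, and the helper names are illustrative.

```python
SAMPLE_RATE_HZ = 48_000  # the rate used throughout this test

# Settings in samples, copied from the lists above.
ALSA_SINK = {"buffer": 960, "block": 240, "startThreshold": 0, "stopThreshold": 0}
ALSA_SOURCE = {"buffer": 960, "block": 240, "startThreshold": 480, "stopThreshold": 0}


def samples_to_msec(samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Convert a sample count to milliseconds."""
    return 1000.0 * samples / sample_rate_hz


def in_msec(settings):
    """Return the same settings expressed in milliseconds."""
    return {name: samples_to_msec(value) for name, value in settings.items()}


print(in_msec(ALSA_SINK))
print(in_msec(ALSA_SOURCE))
```

At 48 kHz the buffers are 20 msec long, the blocks are 5 msec long, and the ALSA Source prefill of 480 samples corresponds to 10 msec of audio.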

The measured latency was 13.2 msec. The approximate breakdown is:

1 msec analog latency

2 msec TDM digital latency (using 1 msec block size)

10 msec ALSA latency (using 5 msec block size)
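
As a quick sanity check, the three components can be summed and compared with the measured value. This is a small Python sketch using only the numbers listed above; the names are illustrative.

```python
MEASURED_MSEC = 13.2

# Latency components in msec, copied from the breakdown above.
BUDGET_MSEC = {
    "analog": 1.0,
    "tdm_digital": 2.0,
    "alsa": 10.0,
}

accounted = sum(BUDGET_MSEC.values())
unaccounted = round(MEASURED_MSEC - accounted, 1)

print(accounted)    # 13.0
print(unaccounted)  # 0.2
```

The listed components account for 13.0 msec of the 13.2 msec measurement, so the breakdown explains the result to within about 0.2 msec.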

We then modified the test to include sample rate conversion between Audio Weaver and the HLOS. The measured latency was now:

ALSA Sample Rate (Hz)	ALSA Block Size (samples)	ALSA Buffer Size (samples)	Measured Latency (msec)
8000	80	160	18.2
11025	120	240	21.6
12000	120	240	21.8
16000	160	320	15.6
22050	220	440	20.0
24000	240	480	15.4
32000	320	640	17.8
44100	480	960	19.8
48000	480	960	16.2
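
To compare the rows on a common time scale, the ALSA block sizes can be converted to milliseconds. The Python sketch below assumes the Block Size and Buffer Size columns are sample counts at the listed ALSA sample rate; the names are illustrative.

```python
# (sample rate in Hz, block size, buffer size, measured latency in msec)
ROWS = [
    (8000, 80, 160, 18.2),
    (11025, 120, 240, 21.6),
    (12000, 120, 240, 21.8),
    (16000, 160, 320, 15.6),
    (22050, 220, 440, 20.0),
    (24000, 240, 480, 15.4),
    (32000, 320, 640, 17.8),
    (44100, 480, 960, 19.8),
    (48000, 480, 960, 16.2),
]


def block_msec(sample_rate_hz, block_samples):
    """Duration of one ALSA block in milliseconds."""
    return 1000.0 * block_samples / sample_rate_hz


for rate, block, buffer_size, latency in ROWS:
    print(rate, round(block_msec(rate, block), 2), buffer_size // block, latency)
```

Under that assumption, every row uses a buffer of exactly two blocks, and the block duration is 10 msec or close to it (about 10.9 msec at 11025 Hz and 44100 Hz).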

Maximum CPU Loading

In this test, we used the BiquadLoading module to load each thread in the system. We measured how many Biquad stages we could run before we started having CPU overruns. We used a 1 msec block size on the Hexagon DSPs and a 10 msec block size on the Arm. This test measures code and framework efficiency.

ADSP

Thread	BiquadStages	% Loading
1A	1600	95%
1B	1720	95%
1C	1720	95%
1D	1720	95%

GPDSP0

Thread	BiquadStages	% Loading
1A	2000	90%
1B	2000	90%
1C	2000	90%
1D	2000	90%
1E	2000	90%
1F	2000	90%

GPDSP1

Thread	BiquadStages	% Loading
1A	2000	90%
1B	2000	90%
1C	2000	90%
1D	2000	90%
1E	2000	90%
1F	2000	90%

Arm (only loaded a single thread)

Thread	BiquadStages	% Loading
10A	5000	50%
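
The per-thread results can be totalled to give the number of Biquad stages sustained by each core in this test. The Python sketch below simply sums the tables above; the names are illustrative. Note that the Arm figure comes from a single thread at a 10 msec block size, so it is not directly comparable with the DSP totals.

```python
HEXAGON_6_THREADS = ["1A", "1B", "1C", "1D", "1E", "1F"]

# Biquad stages per thread, copied from the tables above.
STAGES = {
    "ADSP": {"1A": 1600, "1B": 1720, "1C": 1720, "1D": 1720},
    "GPDSP0": dict.fromkeys(HEXAGON_6_THREADS, 2000),
    "GPDSP1": dict.fromkeys(HEXAGON_6_THREADS, 2000),
    "Arm": {"10A": 5000},  # only a single thread was loaded
}

# Reported loading per thread, in percent.
LOAD_PERCENT = {"ADSP": 95, "GPDSP0": 90, "GPDSP1": 90, "Arm": 50}

total_stages = {core: sum(threads.values()) for core, threads in STAGES.items()}

for core, total in total_stages.items():
    print(f"{core:7s} {total:6d} stages at {LOAD_PERCENT[core]}% loading per thread")
```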

Early Audio KPIs

In release R4.0, we are not able to fully measure this KPI. However, we were able to measure the Audio Weaver contribution to the boot time. We instrumented the code and measured the time from when the main() function was reached on the ADSP until real-time audio interrupts started.

For the measurement, we used the file “SA8255_Early_Audio_Parallel_with_Load.awd”. It contains 1320 modules spread across all four cores, has a TDM input port and a TDM output port, and generates audio on the A2B output. Each core has a signal generator (sine wave or noise), and the signal flow is designed so that you can distinguish the sound of each core and verify, just by listening, that each core is running properly. The top level is simple, with additional subsystems that load up each of the cores.

All Hexagon DSPs are part of the early audio group, while the Arm boots later. The combined AWB file was 335,760 bytes long and was split into four separate AWBs, one per core:

SA8255_Early_Audio_Parallel_with_Load_0.awb [ADSP. 84,644 bytes]

SA8255_Early_Audio_Parallel_with_Load_1.awb [GPDSP0. 83,656 bytes]

SA8255_Early_Audio_Parallel_with_Load_2.awb [GPDSP1. 83,824 bytes]

SA8255_Early_Audio_Parallel_with_Load_3.awb [Arm. 83,680 bytes]

The DSPs log information when they boot. We observed:

00:00:43.617500 [awe_bsp.c 1140] AWE ADSP:awe cfg is ready!

00:00:43.635000 [awe_bsp.c 1417] AWE ADSP:The first time to pump audio

00:00:43.617500 [awe_bsp.c 1140] AWE GPDSP0:awe cfg is ready!

00:00:43.638750 [awe_bsp.c 1417] AWE GPDSP0:The first time to pump audio

00:00:43.617500 [awe_bsp.c 1140] AWE GPDSP1:awe cfg is ready!

00:00:43.645000 [awe_bsp.c 1417] AWE GPDSP1:The first time to pump audio

The time from the ADSP booting until it is ready to generate audio is 17.5 msec.

The time from the ADSP booting until both GPDSP0 and GPDSP1 are ready to generate audio is 27.5 msec.
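
These figures can be reproduced from the log lines above. The Python sketch below parses the timestamps and takes the ADSP "awe cfg is ready!" line as the start reference; that choice is an assumption which happens to match the quoted 17.5 msec and 27.5 msec values. The regular expression and helper names are illustrative and are not part of the BSP.

```python
import re

LOG = """\
00:00:43.617500 [awe_bsp.c 1140] AWE ADSP:awe cfg is ready!
00:00:43.635000 [awe_bsp.c 1417] AWE ADSP:The first time to pump audio
00:00:43.617500 [awe_bsp.c 1140] AWE GPDSP0:awe cfg is ready!
00:00:43.638750 [awe_bsp.c 1417] AWE GPDSP0:The first time to pump audio
00:00:43.617500 [awe_bsp.c 1140] AWE GPDSP1:awe cfg is ready!
00:00:43.645000 [awe_bsp.c 1417] AWE GPDSP1:The first time to pump audio
"""

LINE_RE = re.compile(r"^(\d+):(\d+):(\d+)\.(\d+)\s+\[[^\]]*\]\s+AWE (\w+):(.*)$")


def parse_events(log):
    """Map (core, message) to a timestamp in integer microseconds."""
    events = {}
    for line in log.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        hours, minutes, seconds, frac, core, message = match.groups()
        whole = (int(hours) * 60 + int(minutes)) * 60 + int(seconds)
        micros = int(frac.ljust(6, "0")[:6])  # normalize fraction to microseconds
        events[(core, message.strip())] = whole * 1_000_000 + micros
    return events


events = parse_events(LOG)
start_us = events[("ADSP", "awe cfg is ready!")]  # assumed start reference

first_audio_msec = {
    core: (timestamp - start_us) / 1000.0
    for (core, message), timestamp in events.items()
    if message == "The first time to pump audio"
}

print(first_audio_msec)
# {'ADSP': 17.5, 'GPDSP0': 21.25, 'GPDSP1': 27.5}
```

Using integer microseconds avoids floating-point rounding when subtracting timestamps. The result also shows that GPDSP0 is ready after 21.25 msec, so the 27.5 msec figure is set by GPDSP1.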