(8.D.2.7) Optimization / Best Practices of Signal Flow Design
Use optimal Block Size and Sample Rate
The Block Size and Sample Rate typically depend on the implementation and on the platform / application requirements. However, choosing the optimal block size is essential: it influences not only CPU load but also memory consumption. Block Size is also constrained by the latency requirements, the target platform and the algorithm used.
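To relate block size to the latency constraint, note that one block spans block size divided by sample rate seconds of audio. The following is a minimal sketch of that arithmetic; the 48 kHz sample rate and the listed block sizes are illustrative assumptions, and the function name is hypothetical.

```python
def block_duration_ms(block_size: int, sample_rate_hz: float) -> float:
    """Return the time spanned by one block of audio, in milliseconds."""
    return 1000.0 * block_size / sample_rate_hz


# Illustrative values only: a 48 kHz system with a few candidate block sizes.
for block_size in (32, 64, 128, 256):
    duration = block_duration_ms(block_size, 48000.0)
    print(f"block size {block_size:4d} -> {duration:.3f} ms per block")
```

The overall latency of a system also depends on I/O buffering and on the algorithms in the signal flow, which this sketch does not model.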
In general: the smaller the block size, the higher the CPU load. The relationship is not linear, however, i.e. enlarging the block size by a factor of 2 won’t reduce the CPU load by a factor of 2.
In some cases (e.g. SHARC with external memory) the CPU load might even increase beyond a certain block size, as more and more wires will be allocated in external memory.
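One way to see why the relationship is not linear is a toy cost model in which every block pays a fixed overhead (function calls, bookkeeping) plus a cost per sample. The sketch below assumes such a model; all cycle counts, the clock frequency and the function name are hypothetical and are not measurements from any target. It does not model the external-memory effect mentioned above.

```python
def cpu_load_percent(block_size, sample_rate_hz, overhead_cycles_per_block,
                     cycles_per_sample, core_clock_hz):
    """Toy model: each block costs a fixed overhead plus a per-sample cost."""
    blocks_per_second = sample_rate_hz / block_size
    cycles_per_block = overhead_cycles_per_block + cycles_per_sample * block_size
    return 100.0 * blocks_per_second * cycles_per_block / core_clock_hz


# Hypothetical numbers, for illustration only.
SAMPLE_RATE_HZ = 48000.0
OVERHEAD_CYCLES = 2000.0    # assumed fixed cost per block
CYCLES_PER_SAMPLE = 50.0    # assumed processing cost per sample
CORE_CLOCK_HZ = 400e6       # assumed core clock

load_32 = cpu_load_percent(32, SAMPLE_RATE_HZ, OVERHEAD_CYCLES,
                           CYCLES_PER_SAMPLE, CORE_CLOCK_HZ)
load_64 = cpu_load_percent(64, SAMPLE_RATE_HZ, OVERHEAD_CYCLES,
                           CYCLES_PER_SAMPLE, CORE_CLOCK_HZ)
print(f"block size 32: {load_32:.3f} %")
print(f"block size 64: {load_64:.3f} %")
```

In this model, doubling the block size only halves the fixed per-block share of the load; the per-sample share is unchanged, so the total drops by less than a factor of 2.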
IMPORTANT NOTE: Please use the Profiling Tool on the intended target hardware. Profiling done in Native mode may not be reliable and can give different results from those on the target hardware.
Memory
Audio Weaver heap assignment is divided into three blocks, i.e. the FastA, FastB and Slow heaps. The wires are allocated first, after which the modules in the design are allocated. As a general rule, wires should be mapped in fast memory.
Allocation Priority is also an important means of optimizing the design. Modules with higher allocation priority values are allocated first. This allows you to allocate early those modules which depend heavily on running in internal memory.
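The effect of allocation priority can be illustrated with a simplified greedy placement. The sketch below is not the Audio Weaver allocator: it assumes a single "fast" heap of limited size with everything else spilling to "slow" memory, and the module names, sizes and priorities are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    size_words: int
    alloc_priority: int = 0


def place_modules(modules, fast_words_free):
    """Greedy sketch: visit modules by descending priority, fill fast memory first."""
    placement = {}
    # sorted() is stable, so modules with equal priority keep their design order.
    for module in sorted(modules, key=lambda m: m.alloc_priority, reverse=True):
        if module.size_words <= fast_words_free:
            placement[module.name] = "fast"
            fast_words_free -= module.size_words
        else:
            placement[module.name] = "slow"
    return placement


# Hypothetical design: the FIR is assumed to benefit most from fast memory.
design = [
    Module("Delay", size_words=800),
    Module("FIR", size_words=600),
    Module("Scaler", size_words=100),
]
default_placement = place_modules(design, fast_words_free=1000)

design[1].alloc_priority = 10  # raise the FIR's allocation priority
prioritized_placement = place_modules(design, fast_words_free=1000)

print(default_placement)
print(prioritized_placement)
```

With default priorities the large Delay claims the fast heap first and the FIR spills to slow memory; raising the FIR's priority reverses that outcome.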
Profiling itself can add a non-trivial load to the overall layout (the more modules, the more overhead). Disabling the profiling API in AWECore removes this overhead and helps with the optimization.
Use Multichannel Modules
The majority of the modules in Audio Weaver have multichannel processing capability. Instead of using a separate module per channel, a single multichannel module can be used, with either the same or individual coefficients per channel. This has many advantages, including less function call overhead, less memory for individual wires, and lower profiling overhead. It also decreases the need to use Interleaver and Deinterleaver modules.
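The idea can be sketched in plain Python, outside of Audio Weaver: one call that applies an individual gain per channel produces the same output as one call per channel wrapped in a deinterleave/interleave pair. The function names and data below are hypothetical and only illustrate the equivalence, not the actual module implementations.

```python
def scale_mono(samples, gain):
    """Scale a single channel: one call per channel is needed."""
    return [gain * s for s in samples]


def scale_multichannel(frames, gains):
    """Scale all channels in one call, using an individual gain per channel."""
    return [[g * s for g, s in zip(gains, frame)] for frame in frames]


frames = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # 2 frames, 3 channels each
gains = [0.5, 1.0, 2.0]                      # one coefficient per channel

# Per-channel path: deinterleave, one call per channel, interleave again.
channels = [list(channel) for channel in zip(*frames)]
scaled_channels = [scale_mono(ch, g) for ch, g in zip(channels, gains)]
per_channel_result = [list(frame) for frame in zip(*scaled_channels)]

# Multichannel path: a single call, no intermediate per-channel buffers.
multichannel_result = scale_multichannel(frames, gains)

print(per_channel_result == multichannel_result)
```

The per-channel path needs three calls and two sets of intermediate buffers here, whereas the multichannel path needs one call and none.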
Single Vs Multithreaded
On a multithreaded platform, implementing a single-threaded design tends to increase the CPU usage. Make use of the threads available on the target: for example, the Hexagon ADSP has 4 threads, so the implemented design (awd) can use up to 3 threads while 1 thread is used for I/O, for optimal CPU utilization. BufferUp/BufferDown as well as decimating/interpolating come with costs, so it usually makes sense to use them on “larger” portions of a signal flow, and not only on a few modules.
Set clockDivider on source modules (e.g. DCSource); the clockDivider is then propagated to all connected modules. New threads operate at a lower priority and must have a block size which is a multiple of the block size of the system input pin.
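The block size rule can be checked with simple integer arithmetic. The sketch below assumes that the clock divider of a thread equals the ratio between its block size and the block size of the system input pin; this is an interpretation of the rule above, not a documented API, and the function name is hypothetical.

```python
def clock_divider(base_block_size: int, thread_block_size: int) -> int:
    """Return thread_block_size / base_block_size, rejecting invalid combinations."""
    if base_block_size <= 0 or thread_block_size <= 0:
        raise ValueError("block sizes must be positive")
    if thread_block_size % base_block_size != 0:
        raise ValueError(
            f"{thread_block_size} is not a multiple of the base block size "
            f"{base_block_size}"
        )
    return thread_block_size // base_block_size


# Illustrative values: a system input pin running at 32 samples per block.
print(clock_divider(32, 128))

try:
    clock_divider(32, 48)  # 48 is not a multiple of 32
except ValueError as error:
    print("rejected:", error)
```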
Use Subsystems / Hierarchy
Instead of building one big design with all the modules on a single page of the awd, different sections of the design can be divided into subsystems. Hierarchy and subsystems help in testing and in identifying bugs in a design. Subsystems or reusable subsystems can be utilized in other designs to avoid reinventing the wheel. It is important to note that subsystems up to a few layers deep can be beneficial, but a deep hierarchy can become problematic in turn.
Control Logic
As a best practice, the control signals are separated from the processing signal chain. In some cases the control signals can be completely decoupled using a “Control Bus”. This implementation method reduces the number of Object IDs in the design as well as its complexity. The control signal should have a decimated block size.
Different control signals can be combined into a multichannel control signal, which can be used to control Set/Get Tables. As shown in the figure below, the ParamSet modules are all controlled by DCSource modules; in turn, the individual ParamSet modules control the volume of individual channels through separate Scaler modules. In this use case, the optimized version combines all the individual Scaler modules into a single ScalerN module. This is not only more efficient for debugging but also decreases the CPU load further.
Using Modules Judiciously
A simple example of this optimization step arises when using a LimiterCore module. In the first figure, the absolute maximum of the input signal is taken and passed to the LimiterCore module. The MaxAbs module by itself is the more expensive choice in this use case and can be replaced by a SubblockStatistics module with ‘statisticsType : MaxAbs’ instead. Understanding the various modules in Audio Weaver helps in choosing the more efficient modules for the target.
Another example of optimization can be seen below. In the first figure, Clipping_A, clipping is done on individual mono channels, which requires an Interleaver and a Deinterleaver around them and increases the overhead on the mono channels. This can easily be mitigated by using multichannel clipping (Fig. Clipping_B) with the same module, which drastically decreases the memory usage.
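The reason the two clipping layouts are interchangeable is that clipping acts on each sample independently of its channel. The sketch below shows this in generic Python with hypothetical helper names; it demonstrates only that the outputs match and does not model the memory or CPU figures, which must come from profiling on the target.

```python
def clip(samples, lower, upper):
    """Clamp every sample to the range [lower, upper]."""
    return [min(max(s, lower), upper) for s in samples]


def deinterleave(buffer, num_channels):
    """Split an interleaved buffer into one list per channel."""
    return [buffer[ch::num_channels] for ch in range(num_channels)]


def interleave(channels):
    """Merge per-channel lists back into one interleaved buffer."""
    return [sample for frame in zip(*channels) for sample in frame]


# Interleaved stereo buffer with illustrative values.
stereo = [-2.0, 0.5, 1.5, 0.25, -0.75, 3.0]

# Layout A: deinterleave, clip each mono channel, interleave again.
clipped_a = interleave([clip(ch, -1.0, 1.0) for ch in deinterleave(stereo, 2)])

# Layout B: clip the multichannel buffer directly.
clipped_b = clip(stereo, -1.0, 1.0)

print(clipped_a == clipped_b)
```

Layout A allocates an extra buffer per mono channel plus the re-interleaved output, while layout B works on the multichannel buffer alone.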
Profiling number comparison between Clipping_A and Clipping_B for a SHARC target.