Optimization of AVS decoder on DSP platform

AVS (Audio Video coding Standard) is a second-generation audio and video compression standard with independent intellectual property rights, developed by China's Digital Audio and Video Standard Working Group. AVS follows a 1-yuan patent fee principle; compared with other audio and video codec standards, it offers high coding efficiency, low patent fees, and a simple licensing model. The AVS decoder has a complicated structure and a heavy computational load, so real-time decoding is difficult to achieve on an embedded platform. Decoder performance can be optimized by tuning the code to the assembly instruction set of the target platform or by improving the decoder's key algorithm modules, and both approaches help to some extent. This paper proposes a method that uses the L1P cache of the embedded platform so that the processor can fetch program code efficiently, thereby improving the performance of the AVS decoder.

1 Application of cache

At present, more and more encoding and decoding algorithms are implemented on DSPs. As DSP clock frequencies rise, memory access speed has increasingly become the bottleneck in improving system performance. With existing manufacturing processes, adding on-chip memory cells increases the load capacitance of the data lines and slows signal switching on them, so the amount of high-speed on-chip memory that can be added is very limited. To address the mismatch between memory speed and CPU core speed, high-performance CPUs generally adopt a cache mechanism.

Taking TI's C64x DSP as an example, the memory system consists of on-chip memory and off-chip memory. The on-chip memory adopts a two-level cache structure. Level 1 (L1) is closest to the DSP core and has the fastest data access speed, which can reach 600 Mbyte per second. It can be used only as cache and cannot be addressed directly, and it consists of independent L1P and L1D caches.

The L1P cache is the high-speed cache through which the processor fetches program code; it is 16 kbyte in size, direct-mapped, with a line size of 32 bytes. The L1D cache is the high-speed cache through which the processor accesses data; it is 16 kbyte in size, 2-way set-associative, with a line size of 64 bytes. Level 2 (L2) is a unified program/data space that can be mapped into the memory space entirely as SRAM, or configured as a proportional combination of cache and SRAM. The data exchange rate between L2 and L1 is 300 Mbyte per second, and the data exchange rate between L2 and SDRAM is 100 Mbyte per second. The off-chip memory is the third level and generally consists of SDRAM. L1, L2, and off-chip SDRAM together form the hierarchy of the entire memory system. Used properly, the two-level cache structure of the C64x can greatly improve program performance.
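Because L1P is direct-mapped, every program address maps to exactly one cache line, and two pieces of code whose addresses differ by a multiple of 16 kbyte evict each other. The following is a minimal sketch of that mapping, using only the L1P geometry given above (16 kbyte, 32-byte lines, hence 512 lines). The function names and the example addresses are hypothetical and are not part of any TI library.

```c
#include <assert.h>
#include <stdint.h>

/* L1P geometry taken from the text: 16 kbyte, direct-mapped, 32-byte lines. */
#define L1P_SIZE_BYTES 16384u
#define L1P_LINE_BYTES 32u

/* Generic direct-mapped index: which cache line an address maps to.
 * (Hypothetical helper for illustration only.) */
uint32_t cache_line_index(uint32_t addr, uint32_t cache_bytes, uint32_t line_bytes)
{
    assert(line_bytes > 0u && cache_bytes >= line_bytes);
    uint32_t num_lines = cache_bytes / line_bytes;   /* 512 for L1P */
    return (addr / line_bytes) % num_lines;
}

/* Line index of a program address in L1P. */
uint32_t l1p_line_index(uint32_t addr)
{
    return cache_line_index(addr, L1P_SIZE_BYTES, L1P_LINE_BYTES);
}

/* Returns 1 when two addresses lie in different memory lines that map to
 * the same L1P line, so fetching one evicts the other; 0 otherwise. */
int l1p_conflict(uint32_t a, uint32_t b)
{
    return l1p_line_index(a) == l1p_line_index(b) &&
           (a / L1P_LINE_BYTES) != (b / L1P_LINE_BYTES);
}
```

For example, code at the hypothetical addresses 0x1000 and 0x5000 is exactly 16 kbyte apart, so `l1p_conflict(0x1000, 0x5000)` returns 1, while 0x1000 and 0x2000 map to different lines and do not conflict. This is why keeping the code of each processing stage below 16 kbyte matters for a direct-mapped program cache.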

According to the three-level memory system in Figure 1, when the C64x reads program code it first checks the level 1 cache L1. If L1 already holds the required code, the code is read directly from L1; if not, the level 2 cache L2 is accessed; if L2 does not hold the code either, the external SDRAM is accessed through the EMIF interface. In that case the required code is copied from the external SDRAM into the L2 cache area, then from the L2 cache area into L1, and is finally fetched by the DSP core.
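The lookup order described above can be sketched as a small state model. This is an illustrative simplification, not a model of the real cache controller: it tracks a single line of code with two flags, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Which level of the hierarchy finally supplied the code. */
typedef enum { SERVED_BY_L1P, SERVED_BY_L2, SERVED_BY_SDRAM } fetch_source_t;

/* Where one line of program code is currently cached (hypothetical model). */
typedef struct {
    bool in_l1p;
    bool in_l2;
} code_line_state_t;

/* Fetch one line of code following the order in the text:
 * L1P first, then L2, then external SDRAM through the EMIF.
 * On a miss the line is copied inward, so later fetches hit closer to the core. */
fetch_source_t fetch_code_line(code_line_state_t *line)
{
    assert(line != NULL);
    if (line->in_l1p) {
        return SERVED_BY_L1P;          /* hit in L1P: no copy needed */
    }
    if (line->in_l2) {
        line->in_l1p = true;           /* copy L2 -> L1P */
        return SERVED_BY_L2;
    }
    line->in_l2 = true;                /* copy SDRAM -> L2 */
    line->in_l1p = true;               /* then L2 -> L1P */
    return SERVED_BY_SDRAM;
}
```

In this model the first fetch of a line comes from SDRAM and the second from L1P; if the line is later evicted from L1P only, the next fetch is served by L2.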

Figure 1 Three-level memory system (B stands for byte in the figure)

Studies have shown that this multi-level cache architecture can achieve about 80% of the execution efficiency of a system that uses on-chip memory exclusively. This article studies the cache mechanism in more depth and optimizes the data structures, processing flow, and program structure of the algorithm to raise the cache hit rate, so that the cache is used more effectively and the decoder runs more efficiently.

2 Video decoding algorithm implementation based on Cache

To overcome the deficiencies described above, this paper changes the implementation framework of the video decoding algorithm to make full use of the L1P cache, reducing the number of misses when the CPU fetches program code and improving the execution efficiency of the decoding program.

In the implementation, based on the capacity of L1P and the code size of each functional unit in the program, the functional units in Figure 2 are divided into four modules, each with a code size of less than 16 kbyte. The functional units contained in each module are: Module A, read a macroblock; Module B, entropy decoding, inverse scanning, inverse quantization, and inverse transform; Module C, reconstruction; Module D, loop filtering.

A video macroblock is fully decoded only after passing through all four modules. If the intermediate data passed between modules is placed in off-chip SDRAM, the next module is inevitably slowed when it fetches that data. If the data is placed in on-chip SRAM, the limited on-chip RAM space cannot hold a whole frame of data. As a trade-off, each module finishes decoding one macroblock row (assuming an image contains M macroblock rows of N macroblocks each) and then passes it to the next module. In this way the intermediate data can be kept on chip and L1P is used fully, which reduces code eviction between modules. Once all M macroblock rows have been processed, one decoded frame is obtained.
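The effect of this row-based schedule can be illustrated by counting how often the running module changes. The sketch below is a deliberately coarse model under a stated assumption: L1P is treated as holding the code of one module at a time, so every module change counts as one reload. The real L1P works at 32-byte line granularity, and all names here are hypothetical; the actual decoding work is left as a comment.

```c
#include <assert.h>

enum { NUM_MODULES = 4 };   /* Modules A, B, C, D from the text */

/* Coarse model: remembers which module's code is "resident" and counts changes. */
typedef struct {
    int  current_module;    /* -1 means nothing loaded yet */
    long switches;          /* number of module changes so far */
} l1p_model_t;

static void run_module(l1p_model_t *m, int module)
{
    if (m->current_module != module) {
        m->current_module = module;
        m->switches++;
    }
    /* The decoding work of this module for one macroblock would go here. */
}

/* Baseline: each macroblock passes through A, B, C, D before the next one starts. */
long switches_per_macroblock(int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    l1p_model_t m = { -1, 0 };
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            for (int mod = 0; mod < NUM_MODULES; mod++)
                run_module(&m, mod);
    return m.switches;
}

/* Row-based schedule: each module processes a whole macroblock row
 * (N macroblocks) before the next module runs. */
long switches_per_row(int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    l1p_model_t m = { -1, 0 };
    for (int r = 0; r < rows; r++)
        for (int mod = 0; mod < NUM_MODULES; mod++)
            for (int c = 0; c < cols; c++)
                run_module(&m, mod);
    return m.switches;
}
```

Under this model, a frame with M rows of N macroblocks incurs 4·M·N module changes in the baseline and 4·M in the row-based schedule, a reduction by a factor of N. The price is an on-chip buffer for one macroblock row of intermediate data instead of a single macroblock.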
