AUDIO PROCESSING
Employing ARM NEON in embedded system’s audio processing
By Yu Xu
Embedded Software Engineer
Freescale Semiconductor Inc.
The ARM Cortex-A8 processor is ARM's most advanced high-performance,
low-power processor. Based on the ARMv7 architecture, it suits a
variety of mobile and consumer applications, including mobile phones,
set-top boxes (STBs), game consoles and car navigation systems. As a
core technology of the Cortex-A8 processor, NEON is flexible enough to
implement many combinations of video encode/decode, 3D graphics, speech
processing, audio decoding, image processing and baseband processing.
NEON technology is a 64/128-bit single instruction, multiple data
(SIMD) instruction set. It supports 8-, 16-, 32- and 64-bit integer and
single-precision floating-point SIMD operations for handling audio,
video, image and other data processing. NEON has its own registers and
pipeline, independent of the ARM integer pipeline. Using NEON's
multimedia features, the Cortex-A8 processor can decode MPEG4 VGA video
(including the de-blocking filter, YUV to RGB conversion and other
operations) at 30fps with a 275MHz clock, and can run an MP3 decoder at
a processor frequency below 10MHz.
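To make the register widths concrete, the small sketch below computes how many elements (lanes) a single NEON operation works on for a given register and element width. The function name `neon_lanes` is hypothetical and used only for illustration.

```c
/* Number of SIMD lanes in a register of reg_bits holding elem_bits-wide elements. */
unsigned neon_lanes(unsigned reg_bits, unsigned elem_bits)
{
    return elem_bits ? reg_bits / elem_bits : 0;
}
```

For example, a 64-bit register holds two 32-bit elements, while a 128-bit register holds eight 16-bit audio samples.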
The Cortex-A8 processor’s
NEON media processing engine
pipeline starts at the end of the
main integer pipeline. As a result,
all exceptions and branch mispredictions are resolved before
instructions reach it. More importantly, there is a zero load-use penalty for data in the Level-1 cache.
The ARM integer unit generates the addresses for NEON loads and stores
as they pass through the pipeline, thus allowing data to be fetched
from the Level-1 cache before it is required by a NEON data processing
operation. Deep instruction and load-data buffering between the NEON
engine, the ARM integer unit and the memory system allows the latency
of Level-2 accesses to be hidden for streamed data. A store buffer
prevents NEON stores from blocking the pipeline and detects address
collisions with ARM integer unit accesses and NEON loads.
Figure 1: Shown is the flow diagram of an MP3 decoder.
eetasia.com | EE Times-Asia
The NEON unit is decoupled
from the main ARM integer
pipeline by the NEON instruction
queue (NIQ). The ARM Instruction
Execute Unit can issue up to two
valid instructions to the NEON unit
each clock cycle. NEON has 128-bit-wide load and store paths to the
Level-1 and Level-2 caches, and supports streaming from both.
The NEON engine has its own 10-stage pipeline that begins at the end of
the ARM integer pipeline. Since all mispredicts and exceptions have
been resolved in the ARM integer unit, an instruction that has been
issued to the NEON engine must run to completion, as it cannot generate
exceptions.
NEON instructions are issued and retired in order. A data processing
instruction is either a NEON integer instruction or a NEON
floating-point instruction. The Cortex-A8 NEON unit does not issue two
data-processing instructions in parallel. This avoids the area overhead
of duplicating the data-processing functional blocks, as well as the
timing-critical paths and complexity overhead associated with muxing
the read and write register ports.
The NEON integer data path consists of three pipelines: an integer
multiply/accumulate (MAC) pipeline, an integer shift pipeline and an
integer ALU pipeline. A load-store/permute pipeline is responsible for
all NEON loads and stores, data transfers to and from the integer unit,
and data permute operations such as interleave and de-interleave. The
NEON floating-point (NFP) data path has two main pipelines: a multiply
pipeline and an add pipeline.
Audio processing
Today, WMA, MP3 and AAC are the mainstream audio compression
algorithms. Applications and experiments in audio decoding and playback
show that these algorithms are computationally complex and consume many
clock cycles. This matters most in combined audio/video decoding:
because the video decoding algorithm takes up most of the processor's
resources, little remains for audio decoding. It is therefore essential
to improve the efficiency of audio decoding in such applications.
Table 1: Here’s a list of every MP3 decoder module.
MP3 is one of the most common audio compression algorithms, used both
in audio files and in compressed audio/video streams. MP3 decoding is
therefore taken as the example for describing the application of NEON
technology to audio processing.
Table 1 lists the complexity of each MP3 decoder module. The Huffman
decode, IMDCT and sub-band synthesis filter modules account for most of
the computing time, about 90 percent of the total. Reducing the
computing time of these three parts therefore improves the efficiency
of the whole MP3 decoder significantly.
The sub-band synthesis filter accounts for about 50 percent of the
computation in the MP3 decoder algorithm, so it is analyzed first. The
filter consists of a matrix operation and a PCM output window filter.
The formula of the matrix operation is:
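The matrix operation (matrixing) of the MPEG-1 audio synthesis filterbank is commonly written as below. This is the standard form from the MPEG-1 audio specification, given on the assumption that it is the formula meant here. The symbols are the customary ones rather than names taken from the text: $S_k$ are the 32 sub-band samples and $V_i$ the 64 intermediate values passed on to the window filter.

```latex
V_i = \sum_{k=0}^{31} \cos\!\left[\frac{(16 + i)(2k + 1)\pi}{64}\right] S_k,
\qquad i = 0, 1, \dots, 63
```

Each of the 64 outputs is a sum of 32 products, so the step is dominated by multiply-add work.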
The algorithm consists mainly of multiply-add operations. In ARM
assembly code, its core can be summarized as:
MUL r1, r2, r3    (1)
MUL r4, r5, r6    (2)
ADD r7, r1, r4    (3)
Because the ARM multiply instruction (MUL) must use pipeline 0,
statements (1) and (2) cannot be issued in parallel. The inputs of
statement (3) are the outputs of statements (1) and (2), so the three
statements must execute one after another. Furthermore, each MUL
instruction occupies two cycles, so the two multiplies and one add
together need five cycles when running on the ARM core.
In the sub-band synthesis filter, multiply-add is the main operation,
and each one consumes many cycles. NEON can help in this situation. The
NEON VMUL instruction completes a vector multiplication in one cycle,
which is equivalent to two multiply operations. The two multiplications
above are converted into NEON code:
VMUL D1, D2, D3
D1 to D3 are independent NEON vector registers. D2 holds the values of
r2 and r5, D3 holds the values of r3 and r6, and the two products are
stored in D1, so a single NEON instruction performs two
multiplications. Moreover, the NEON VMLA instruction performs two
multiply-add operations at once. NEON optimization therefore reduces
the time spent on multiply-add operations and, with it, the computing
time of the module.
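The idea can be sketched in C with NEON intrinsics. This is a minimal illustration under stated assumptions, not the decoder's actual code: the function name `mac_f32` is hypothetical, single-precision data is assumed (a fixed-point decoder would use the integer variants), and a scalar fallback is included so the same source builds on targets without NEON.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Multiply-accumulate over n coefficient/sample pairs: sum of c[i] * s[i]. */
float mac_f32(const float *c, const float *s, size_t n)
{
    float acc = 0.0f;
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x2_t vacc = vdup_n_f32(0.0f);       /* two partial sums in one D register */
    for (; i + 2 <= n; i += 2) {
        float32x2_t vc = vld1_f32(c + i);      /* load c[i], c[i+1] */
        float32x2_t vs = vld1_f32(s + i);      /* load s[i], s[i+1] */
        vacc = vmla_f32(vacc, vc, vs);         /* per lane: vacc += vc * vs */
    }
    acc = vget_lane_f32(vacc, 0) + vget_lane_f32(vacc, 1);
#endif
    for (; i < n; ++i)                         /* scalar path and any leftover element */
        acc += c[i] * s[i];
    return acc;
}
```

Each pass through the vector loop handles two products, matching the two-lane D-register case described above. Using the 128-bit Q-register intrinsics would process four 32-bit lanes per instruction.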
The IMDCT is the second most time-consuming module in the MP3 decoder,
at about 25 percent of the total. It operates on 32 frequency
sub-bands, each containing one long window or three sequential short
windows. A long window consists of 18 frequency lines, and a short
window consists of six frequency lines.
The formula of the IMDCT is:
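The IMDCT used in MPEG-1 Layer III is commonly written as below. This is the standard form from the MPEG-1 audio specification, given on the assumption that it is the formula meant here. The symbols are the customary ones rather than names taken from the text: $X_k$ are the frequency lines, $x_i$ the output samples, and $n$ is 36 for a long window (18 frequency lines) or 12 for a short window (six frequency lines).

```latex
x_i = \sum_{k=0}^{\frac{n}{2} - 1} X_k
      \cos\!\left[\frac{\pi}{2n}\left(2i + 1 + \frac{n}{2}\right)(2k + 1)\right],
\qquad i = 0, 1, \dots, n - 1
```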
After algorithm-level optimization, the IMDCT is reduced to a form that
consists mainly of multiply-add operations. As with the sub-band
synthesis filter, the NEON VMUL and VMLA instructions can efficiently
replace the multiply-add instructions of the ARM code, which reduces
the computing time of the IMDCT module by a large margin.
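The multiply-add structure is easiest to see in a direct, table-driven form. The sketch below is illustrative only: the function name `imdct_table` and the caller-supplied cosine table `cos_tab` are hypothetical, and a production decoder would use a faster factored algorithm rather than this direct sum.

```c
#include <stddef.h>

/*
 * Table-driven IMDCT: n outputs from n/2 inputs.
 * cos_tab is a caller-supplied (hypothetical) table with n * (n/2) entries,
 * where cos_tab[i * (n/2) + k] holds the cosine factor for output i, input k.
 */
void imdct_table(const float *X, const float *cos_tab, float *x, size_t n)
{
    size_t half = n / 2;
    for (size_t i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (size_t k = 0; k < half; ++k)
            acc += X[k] * cos_tab[i * half + k];  /* multiply-add, a VMLA candidate */
        x[i] = acc;
    }
}
```

The inner loop is the same multiply-accumulate pattern as in the synthesis filter, so the same NEON treatment applies to it.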
Other common audio decoders, such as WMA, AAC and OGG, also contain a
large number of discrete cosine transforms, so the same NEON
instruction optimization method can be applied to them; the approach is
general. Furthermore, for multimedia processing, the NEON instruction
set provides a range of optimized media processing instructions, such
as saturated vector operations and vector load/store. Used properly,
they yield a very significant optimization effect.
Figure 2: Shown is a flow diagram of NEON usage on i.MX51.
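Saturated operations matter for audio because PCM samples that overflow should clip rather than wrap around. The scalar sketch below shows what a saturating 16-bit add does for one element; NEON's saturating add (VQADD) applies the same rule to every lane of a vector at once. The function name `sat_add_s16` is hypothetical.

```c
#include <stdint.h>

/* Saturating 16-bit add: clamps to the int16_t range instead of wrapping. */
int16_t sat_add_s16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;   /* widen so the sum cannot overflow */
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}
```

In scalar code this costs a widening add plus two compares per sample, which is the overhead a single saturating vector instruction removes.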
NEON on i.MX51
The i.MX51 multimedia application processor is Freescale's
high-performance, low-power processor. It is based on the ARM Cortex-A8
architecture and can run at up to 1GHz, which allows it to be used in a
wide variety of applications such as PMPs, PNDs and PDAs.
Since the i.MX51 is designed for multimedia applications, audio
processing is one of its essential applications. Optimizing audio
processing on the i.MX51 has the following advantages:
• It reduces the load on the processor, leaving more processing capacity available;
• In audio playback mode, it lets the ARM core stay dormant for longer, giving low-power playback.
Meeting these demands on the i.MX51 is feasible. First, the i.MX51
processor is based on the ARM Cortex-A8 architecture and supports NEON
technology, so the method described above can be applied to the audio
decoder, using NEON instruction-level optimization to reduce the
computing time. Second, the NEON processing engine does cause the
chip's power consumption to rise. The i.MX51 processor addresses this
issue with a dedicated NEON control module. Its principle is that when
NEON has not been used for n cycles (n is configurable), an interrupt
request is issued. The software system can then turn the NEON
processing engine on or off according to how NEON is being used.
The optimization of the audio decoder is not repeated here. What
follows focuses on the use of the NEON processing engine, whose flow
diagram is shown in Figure 2.
When the system reaches a NEON instruction while the NEON engine is
disabled, the instruction raises an undefined instruction exception.
The exception handler first determines whether the NEON processing
engine has been enabled. If it has not, the software turns on the NEON
power and then enables NEON, after which the ARM core executes the NEON
instruction. Once the NEON instruction has completed, the NEON
processing engine goes into the IDLE state and the ARM core continues
with the following ARM code. Meanwhile, the NEON Monitor on the i.MX51
processor constantly monitors the working condition of the NEON
processing engine. Software developers can set the idle-waiting time n
by writing to the NEON Monitor registers. When NEON has not been used
for n cycles, the NEON Monitor issues an IRQ. The software interrupt
handler then disables NEON and turns off the NEON power to save the
power consumed by the NEON processing engine.
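The flow described above can be sketched as a small software model. Everything here is hypothetical and for illustration only: the structure, field and function names are invented, and the model does not represent the actual i.MX51 register interface or handler code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of the flow; not the i.MX51 register interface. */
typedef struct {
    bool     powered;      /* is the NEON power domain on? */
    bool     enabled;      /* is the NEON engine enabled? */
    uint32_t idle_cycles;  /* cycles since the last NEON instruction */
    uint32_t idle_limit;   /* n: the configurable idle-waiting time */
} neon_model;

/* Undefined-instruction path: bring NEON up if needed, then the instruction runs. */
void neon_on_instruction(neon_model *m)
{
    if (!m->enabled) {
        m->powered = true;   /* start NEON power first */
        m->enabled = true;   /* then enable the engine */
    }
    m->idle_cycles = 0;      /* NEON was just used */
}

/* One monitor tick; returns true when the idle IRQ would fire and NEON is shut down. */
bool neon_monitor_tick(neon_model *m)
{
    if (!m->enabled)
        return false;
    if (++m->idle_cycles < m->idle_limit)
        return false;
    m->enabled = false;      /* IRQ handler: disable NEON ... */
    m->powered = false;      /* ... then turn off NEON power */
    m->idle_cycles = 0;
    return true;
}
```

A larger idle limit avoids repeated power-up exceptions during bursts of NEON code, while a smaller one shuts the engine down sooner; the right value depends on the workload.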
Conclusion
After optimization with NEON instructions, the processor computing time
of each audio decoder declined significantly. The data is shown in
Table 2.
Table 2: NEON technology can improve algorithm efficiency, reduce computing time and realize a high-performance, low-power system in audio applications.
The biggest benefit to the system is that, in audio playback
applications, the ARM core can spend more time in a dormant state,
which lowers system power consumption. The power consumption of audio
decoding was evaluated on the i.MX51 development board running eCos OS
(ARM core running at 200MHz, ARM core voltage of 0.7V). The last column
of Table 2 lists the ARM core's power. With NEON technology, the power
consumption of the MP3 and WMA decoders declined. For the WMA decoder
in particular, the ARM core's power consumption dropped from 4.38mW to
3.47mW, a reduction of roughly 21 percent.
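The relative saving for the WMA decoder follows directly from the two power figures quoted above:

```latex
\frac{4.38\,\text{mW} - 3.47\,\text{mW}}{4.38\,\text{mW}}
= \frac{0.91}{4.38} \approx 0.208 \quad (\text{about } 20.8\%)
```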