Konstantinos Krommydas, Christos D. Antonopoulos, Nikolaos Bellas, Wu-Chun Feng
ABSTRACT

Newer video compression standards provide high video quality and greater compression efficiency than their predecessors. Their increased complexity can be offset by leveraging all available levels of parallelism, task- and data-level, on off-the-shelf hardware such as current-generation chip multiprocessors. As we move to more cores, however, scalability issues arise and must be tackled in order to take advantage of the abundant computational power. In this paper, we evaluate a previously implemented parallel version of the AVS video decoder on the experimental 32-core Intel Manycore Testing Lab. We examine this previous version's performance bottlenecks and scalability issues, and we introduce a distributed queue implementation as the proposed solution. Finally, we provide insight into separate optimizations for inter macroblocks and investigate the performance variations and tradeoffs that arise when these optimizations are combined with a distributed queue scheme.

Index Terms— AVS codec, task queue, video decoding

1. INTRODUCTION

Advances in video compression techniques and display technology have facilitated high-definition video (resolutions up to 1920x1080 pixels) and the first generation of three-dimensional television. Meanwhile, Quad Full High Definition is taking its first steps, and motion picture and television engineers are paving the way for Ultra High Definition TV, which will offer unprecedented picture clarity at 7680x4320 pixels.

The prevalent video standard today, H.264/AVC, is extensively used for high-definition video coding. A video codec less known in the Western world is the Chinese Audio Video Standard (AVS), drafted by the AVS Workgroup [1, 3]. The AVS Workgroup was established by the Chinese Ministry of National Information Industry, and AVS is a national standard. AVS can deliver coding efficiency similar to that of H.264/AVC and more than twice the coding efficiency of MPEG-2.

These standards can efficiently handle today's typical resolutions, and their implementations can deliver the desired frame rate, dictated by the real-time requirement of human vision of about 30 frames per second. Prospective trends toward even higher definitions, however, indicate that the already heavy workload will become heavier, as will the technical complexity of future video encoders and decoders. Programmers have new tools, new hardware, and even new computing paradigms at their disposal in their efforts to overcome such problems. Unfortunately, applying solutions tailored to a small number of cores to a larger number of cores introduces a series of issues. These can relate to the scalability of a particular algorithm itself or to side effects on the part of the hardware (e.g., cache-related issues).

This paper builds on our previous work [2], where we found that the hyper-threading feature of the Intel Core i7 multiprocessor does not yield further performance gains, mainly because of contention for the cores' shared resources. A key factor is contention on the lock-free queue used, which limits performance gains. As a solution, we propose a distributed queue scheme.

In Section 2, we provide background on the AVS standard and briefly present the base and previous parallel implementations of the decoder. In Section 3, we present related literature, motivate our work, and discuss our contribution. Section 4 introduces our evaluation platform. Section 5 describes our distributed queue approach and the inter macroblock (MB) optimization tradeoffs, and provides results. Section 6 concludes the paper with final thoughts and directions for future work.

2. AVS DECODER BACKGROUND

2.1. Base implementation

The AVS standard follows the basic structure of MPEG-2 and incorporates similar tools. The decoding process (Fig. 1) entails the entropy decoding stage, intra prediction, the motion compensation (MC) procedure for inter prediction, inverse transform, inverse quantization, as well as a smart
otherwise be blocked, do actual work, which is abundant,
especially for inter-decoded frames. Readers interested in
the full set of optimizations (sequential code optimizations,
vectorization) can refer to our original paper [2].
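This section motivates replacing a single shared task queue, whose contention limits performance gains, with a distributed queue scheme, but does not spell out how that scheme is organised. The following is therefore only a minimal Python sketch under our own assumptions: one deque per worker thread, each guarded by its own ordinary lock (the original single queue was lock-free; locks are used here only for simplicity), round-robin distribution of tasks, and stealing from another worker's queue when a worker's own queue runs dry. The names `DistributedTaskQueue` and `run`, the round-robin policy, and the stealing policy are all hypothetical, and Python threads model only the structure of the idea, not the decoder's performance.

```python
import threading
from collections import deque


class DistributedTaskQueue:
    """Hypothetical distributed queue: one deque + one lock per worker thread,
    instead of a single queue shared (and contended) by all threads."""

    def __init__(self, num_workers):
        self.queues = [deque() for _ in range(num_workers)]
        self.locks = [threading.Lock() for _ in range(num_workers)]
        self._next = 0                      # round-robin cursor (assumption)
        self._next_lock = threading.Lock()

    def push(self, task, worker=None):
        """Enqueue on the given worker's queue, or round-robin if none is given."""
        if worker is None:
            with self._next_lock:
                worker = self._next
                self._next = (self._next + 1) % len(self.queues)
        with self.locks[worker]:
            self.queues[worker].append(task)

    def pop(self, worker):
        """Take from the front of our own queue; if it is empty, steal from the
        back of another worker's queue. Returns None when all queues are empty."""
        n = len(self.queues)
        for offset in range(n):
            victim = (worker + offset) % n
            with self.locks[victim]:        # at most one lock held at a time
                q = self.queues[victim]
                if q:
                    return q.popleft() if offset == 0 else q.pop()
        return None


def run(tasks, num_workers, work):
    """Distribute independent tasks, let every worker drain the queues, and
    collect per-worker results. Assumes no task enqueues new tasks, so an
    empty scan of all queues means all work is done."""
    dq = DistributedTaskQueue(num_workers)
    for task in tasks:
        dq.push(task)
    results = [[] for _ in range(num_workers)]  # one list per worker: no sharing

    def worker_loop(wid):
        while (task := dq.pop(wid)) is not None:
            results[wid].append(work(task))

    threads = [threading.Thread(target=worker_loop, args=(w,))
               for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # e.g. the 22x18 = 396 macroblocks of one CIF (352x288) frame, as plain
    # indices; the identity "work" is a placeholder for decoding an MB.
    per_worker = run(range(396), num_workers=4, work=lambda mb: mb)
    assert sorted(mb for r in per_worker for mb in r) == list(range(396))
    print([len(r) for r in per_worker])  # split varies between runs
```

In this sketch a worker normally touches only its own queue's lock, so threads contend with each other only when stealing. Taking from the front of one's own queue and from the back of a victim's queue is a common work-stealing convention that keeps owner and thief at opposite ends; whether the scheme evaluated in this paper steals at all is not stated in this section.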
3. RELATED WORK
5.3. Results