A Methodology for Performance/Energy Consumption Characterization and Modeling of Video Decoding on Heterogeneous SoC and its Applications

Yahia Benmoussa (a,b,c), Jalil Boukhobza (a), Eric Senn (b), Yassine Hadjadj-Aoul (e), Djamel Benazzouz (d)

(a) University of Bretagne Occidentale, UMR6285, Lab-STICC
(b) University of Bretagne Sud, UMR6285, Lab-STICC
(c) University of M'Hamed Bougara de Boumerdes, LIMOSE
(d) University of M'Hamed Bougara de Boumerdes, LMSS
(e) University of Rennes 1, IRISA

Abstract

To meet the increasing complexity of mobile multimedia applications, the SoCs equipping modern mobile devices integrate powerful heterogeneous processing elements, among which Digital Signal Processors (DSP) and General Purpose Processors (GPP) are the most common. Due to the ever-growing gap between battery lifetime and hardware/software complexity, in addition to the computing power needs of applications, energy saving has become a crucial issue in the design of such architectures. In this context, we propose in this paper an end-to-end study of video decoding on both GPP and DSP. The study was achieved thanks to a two-step methodology: (1) a comprehensive characterization and evaluation of the performance and the energy consumption of video decoding, and (2) the extraction of an accurate high-level energy model based on the characterization step.

This paper is an extension of the conference papers [7] and [6]. This version proposes a generalized end-to-end methodology for the energy characterization and modeling of video decoding (section 3). The characterization of video complexity is added (section 3.1.1) and its impact on the energy consumption is discussed. The experimental tests are extended to additional videos and an additional architecture (OMAP4460). A system/application profiling is achieved (section 5.3) to better explain the results. In addition, the energy model parameters are discussed in section 6.5, model generalization is discussed in section 7, and the applications of the characterization and modeling results are discussed in section 8. Several additional related works were added and discussed in section 9.

Preprint submitted to Journal of Systems Architecture, December 2, 2014

The characterization of video decoding is based on an experimental methodology and was achieved on an embedded platform containing a GPP and a DSP. This step highlighted the importance of considering the end-to-end decoding flow when evaluating the energy efficiency of a video decoding application. The measurements obtained in this step were used to build a comprehensive analytical energy model for video decoding on both GPP and DSP. Thanks to a sub-model decomposition, the developed model estimates the energy consumption in terms of the processor clock frequency and the video bit-rate, in addition to a set of constant coefficients related to the video complexity, the operating system and the considered hardware architecture. The obtained model gives very accurate results (R-squared = 97%) for both GPP and DSP energy consumption. Finally, based on the results emerging from the modeling methodology, we show how one can rapidly build a video decoding energy model for a given target architecture without executing the full characterization steps described in this paper.

Keywords: Energy consumption, Modeling, H.264/AVC, GPP, DSP, DVFS.

1. Introduction

Mobile devices such as smartphones and tablets are more and more used in everyday life. One of the most popular applications running on these devices is video playback. This is due to the growing use of video-sharing platforms (e.g. YouTube, Dailymotion), social networks (e.g. Facebook, Twitter), mobile IPTV, video-conferencing, etc. According to a recent study [42] covering 200 million mobile users, the average video watching time is 52 minutes per day. In addition, it is expected that video data will represent 70% of the overall mobile Internet traffic in the next few years [22]. These new trends in video application usage, combined with the market explosion of multimedia consumer electronics, raise new challenges for mobile device architecture designers.
Indeed, to fit the important processing requirements and real-time constraints of video applications, the processing resources embedded in these devices tend to be more and more powerful and complex. One important issue resulting from these new trends in hardware architecture is a drastic increase in the power consumption of these devices. In fact, according to [16, 17], when playing back a video, the processing resources are responsible for more than 60% of the power consumption. This

leads to a drastic decrease in mobile device autonomy, as lithium battery technologies are not evolving fast enough to absorb the ever-growing energy requirements of such mobile architectures [13]. For these reasons, energy saving considerations have become central to modern microprocessor design. Although tremendous advances have been achieved in this field, energy optimization efforts are still insufficient. Indeed, due to the limitations of microprocessor fabrication technologies, it is expected that only 20% of energy saving can be achieved at this level in the next few years [30]. Thus, one should consider the overall system, including the hardware and software platforms, at multiple levels in order to cope with the energy saving issue. However, to take full advantage of these multi-level energy saving opportunities, mobile systems designers must deal with increasing system complexity and heterogeneity. In the case of mobile video applications, heterogeneous processing resources, such as General Purpose Processors (GPP) and Digital Signal Processors (DSP), are at the heart of the video decoding process. They interact with other platform devices, such as the memory and the I/O system, to execute software components such as the operating system and video codecs. Understanding the interactions between all these elements is necessary when considering the overall energy consumption balance in order to design energy-efficient optimization techniques. Low-level estimation of the impact of all these parameters is time consuming and very difficult in a context of increasingly complex hardware and applications. On the other hand, high-level methodologies are easier to develop but still hit a brick wall when it comes to providing users with comprehensive models describing the energy consumption behavior. From our point of view, one should find a middle ground and achieve a balance between an abstract high-level model and a detailed, complex lower-level one.
This can be achieved by bridging the gap between a comprehensive performance and energy characterization methodology based on exhaustive measurements and a high-level modeling methodology. In this perspective, we propose an end-to-end methodology to characterize and model the energy consumption of the processing resources in the context of video decoding, for embedded heterogeneous platforms containing both a GPP and a DSP. The characterization part of this work is based on an experimental measurement methodology applied to an embedded hardware platform. It aims to explain and evaluate the energy consumption behavior of video decoding on two types of processor architectures (i.e. GPP and DSP). The measurement results obtained in this phase are used to build a

high-level analytical model which estimates the consumed energy as a function of the considered characterization parameters, in addition to a set of comprehensive architecture-, system- and video-related coefficients. The parameters considered in the presented methodology are the processor frequency, the processor type (GPP or DSP), the video quality (resolution and bit-rate) and the video complexity. Two major contributions are proposed in this paper. The first contribution results from the characterization phase and provides a comparison between the energy efficiency of video decoding on GPP and DSP according to different configurations, involving the processor frequency and the video quality and complexity. It highlights the importance of considering the end-to-end decoding flow when analyzing the energy efficiency of video decoding. The obtained results reveal that the best performance-energy trade-off highly depends on the decoded video quality, and that the GPP can be the best choice in many cases due to a significant inter-processor overhead in DSP decoding. One application example of these results could be the design of energy-aware video decoding mechanisms that select the most energy-efficient processor in the context of video quality adaptation imposed by network bandwidth fluctuation. The second contribution consists in an analytical energy model with very good prediction properties (R-squared = 97%) for the two considered types of processors. Using a sub-model decomposition methodology, the proposed model describes the energy consumption behavior in terms of a set of comprehensive architecture- and video-related parameters. This makes it possible to understand and quantify how the energy consumption is impacted when these parameters are varied individually.
Moreover, based on the results emerging from the modeling methodology, we show how one can rapidly build a video decoding energy model for a given target architecture without executing the full characterization steps described in this paper. The remainder of this paper is organized as follows. In section 2, some basic architecture and system considerations for video decoding performance and energy consumption are presented. In section 3, the performance characterization and energy modeling methodology is detailed. In section 4, the experimental methodology and setup are presented. The experimental characterization results and the developed energy model are detailed in sections 5 and 6. Energy model validation and applications are discussed in sections 7 and 8, respectively. Finally, related works on the energy consumption of video decoding and a conclusion are provided in sections 9 and 10, respectively.


2. Background

We describe hereafter some elementary background related to power consumption in electronic circuits. We then discuss how the performance and energy consumption of video decoding applications depend on the architectural, operating system and application levels in a video decoding system.

In CMOS digital circuits, the total power consumption is the sum of the static and dynamic powers:

$P_{tot} = P_{static} + P_{dyn}$  (1)

where $P_{static}$ and $P_{dyn}$ are defined as:

$P_{static} = I_{leak} \cdot V$  (2)

$P_{dyn} = C_{eff} \cdot V^2 \cdot f$  (3)

$I_{leak}$ is the leakage current, $V$ is the supply voltage associated to the clock frequency $f$, and $C_{eff}$ is the circuit effective capacitance [14]. The static power is related to the circuit fabrication technology and does not depend on circuit activity. Below the 65-nm feature size, it becomes significant and poses new low-power design challenges [32]. On the other hand, the dynamic power is related to circuit activity. For example, in the case of a microprocessor, the dynamic power depends on the type of instructions executed and on the data accessed [41]. In equation (3), this is represented by the $C_{eff}$ parameter, defined as $C_{eff} = A \cdot C$, where $C$ is the circuit capacitance and $A$ is the switching probability representing the activity factor. Since the dynamic power is based on circuit activity, it highly depends on the upper system and application layers. This can be illustrated by the following example dealing with the particular case of a video decoding application. Figure 1 illustrates a simplified representation of a CMOS circuit which processes a set of sequential data D (encoded video frames) using a block B (video decoder). The block B operates at the frequencies $f/2$ and $f$, corresponding to the supply voltage levels $V_1 = 1.06$ V and $V_2 = 1.2$ V respectively². If $t$ is

² $V_1$ and $V_2$ are associated with the frequencies 250 MHz and 500 MHz, respectively, of the Cortex A8 processor used in our experiments.


Figure 1: System and architecture driven frequency scaling

the processing time when B operates at the frequency $f$ (Figure 1-a), then the energy consumption is $E_{V_2} = P_{V_2} \cdot t$, where $P_{V_2} = C_{eff} \cdot V_2^2 \cdot f$. If we suppose that the processing time at the frequency $f/2$ (Figure 1-b) is doubled³, then the ratio between the energy $E_{V_1}$ consumed by the circuit at the frequency $f/2$ with $V_1 = 1.06$ V, and $E_{V_2}$, is:

$\frac{E_{V_1}}{E_{V_2}} = \frac{C_{eff} \cdot V_1^2 \cdot \frac{f}{2} \cdot 2t}{C_{eff} \cdot V_2^2 \cdot f \cdot t} = \left(\frac{V_1}{V_2}\right)^2 \simeq \frac{4}{5}$

In this case, scaling down the voltage and the frequency decreases the power consumption to $P_{V_1} = C_{eff} \cdot V_1^2 \cdot \frac{f}{2} \simeq \frac{2}{5} \cdot P_{V_2}$, which leads to a 20% energy saving at the cost of decreased performance. This may represent a scenario where the operating system dynamically scales down the processor frequency at run time when it detects a load decrease. This illustrates a system-driven voltage scaling. In order to save energy without sacrificing performance, an architecture-driven voltage scaling [18] can be achieved by using two B blocks which are both clocked at the frequency $f/2$ and supplied with the voltage $V_1$, as described in Figure 1-c. $P_{2.V_1}$ and $E_{2.V_1}$ refer to the power and the energy consumption associated to this configuration. Since the two blocks are operating in parallel, the execution time remains unchanged, and the ratio between $E_{2.V_1}$ and

³ As we will discuss in section 6.5.1, this supposition is not always true.


$E_{V_2}$ is:

$\frac{E_{2.V_1}}{E_{V_2}} = \frac{C_{eff} \cdot V_1^2 \cdot \frac{f}{2} \cdot t + C_{eff} \cdot V_1^2 \cdot \frac{f}{2} \cdot t}{C_{eff} \cdot V_2^2 \cdot f \cdot t} = \left(\frac{V_1}{V_2}\right)^2 \simeq \frac{4}{5}$
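Both ratios above can be checked numerically from equation (3); in the sketch below, the $C_{eff}$ and $f$ values are illustrative placeholders (only the voltages come from the text).

```python
def dynamic_power(c_eff, v, f):
    """Dynamic CMOS power, equation (3): P_dyn = C_eff * V^2 * f."""
    return c_eff * v ** 2 * f

C_EFF = 1e-9        # effective capacitance (F) -- illustrative placeholder
F = 500e6           # clock frequency f (Hz) -- illustrative placeholder
V1, V2 = 1.06, 1.2  # supply voltages for f/2 and f (from the text)
T = 1.0             # processing time at frequency f (s)

p_v2 = dynamic_power(C_EFF, V2, F)      # P_V2 at (V2, f)
p_v1 = dynamic_power(C_EFF, V1, F / 2)  # P_V1 at (V1, f/2)

e_v2 = p_v2 * T        # energy at (V2, f)
e_v1 = p_v1 * (2 * T)  # energy at (V1, f/2), assuming doubled processing time

print(f"E_V1/E_V2 = {e_v1 / e_v2:.3f}")  # (V1/V2)^2, about 4/5
print(f"P_V1/P_V2 = {p_v1 / p_v2:.3f}")  # about 2/5
```

Note that the energy ratio depends only on the voltages, which is why the 4/5 figure is independent of the placeholder $C_{eff}$ and $f$.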

In this configuration, the total power consumption $P_{2.V_1}$ is the sum of the power consumptions of the two blocks, which is equal to $2 \cdot C_{eff} \cdot V_1^2 \cdot \frac{f}{2} \simeq \frac{4}{5} \cdot P_{V_2}$. The energy saving is equal to 20% without sacrificing performance, but at the cost of an additional circuit area and thus an additional static power. Theoretically, this type of parallelism provides better performance and energy efficiency for multimedia data processing applications [18]. However, it generates an energy overhead at both the architectural and system levels. At the architectural level, the use of additional processing and multiplexing/demultiplexing circuits in the architecture consumes more static power [39, 32]. On the other hand, at the system level, when this type of parallelism is implemented in external specialized processors such as DSPs⁴, the GPP/DSP inter-processor communication may generate a system overhead [53], and thus yet another additional energy consumption. As an illustration of such a system overhead, Figure 2 describes the steps of a typical DSP video decoding process controlled by a GPP, where the video frames are assumed to be in an input buffer in memory:

1. The GPP writes back the frame from its cache to a shared memory so that it can be accessed by the DSP.
2. The GPP sends the frame parameters (frame location in memory) to the DSP codec via a GPP/DSP hardware bus.
3. The DSP invalidates the entries in its cache corresponding to the frame buffer in the shared memory.
4. The DSP decodes the frame and transfers it to the output buffer.
5. The DSP sends the return status to the GPP.

The fact that both the DSP and the GPP have their own cache memory and communicate using a shared memory imposes managing cache coherency each time data are shared between the DSP and the GPP. Additionally, from the operating system point of view, the GPP/DSP communication is

⁴ Parallelism can be used internally in GPPs using pipelines or SIMD instruction sets.


Figure 2: DSP video decoding

managed by a driver through a system call. A frame decoding is considered, from the GPP point of view, as an I/O operation generating a system latency caused by entering the idle state and handling the hardware interrupt. In addition to these system implications, the energy consumption may depend on the upper application level, such as the video decoder design and the video properties. For example, if the decoder is multi-threaded, it induces additional synchronization processing and energy consumption. On the other hand, the video bit-rate, resolution and complexity have an impact on the amount of buffer memory transfers, the decoding time and the energy consumption [38]. The energy consumption of video decoding thus depends on a combination of fabrication technology, architecture, system and application parameters. Understanding and estimating the impact of these parameters on the energy consumption of video decoding on heterogeneous SoCs is the main objective of this study.

3. Characterization and Modeling Methodology Description

In this study, we consider a video decoding process on a given energy-constrained mobile device. The hardware architecture of the mobile device contains a GPP and a DSP within a SoC. Each processor supports different clock frequencies $f$. A video sequence $v$ consists of a set of video frames. It is characterized by a display rate $d$ (expressed in frames/s), a bit-rate $r$ (expressed in Kb/s), a spatial resolution $s$ (size of a frame in terms of the number of pixels) and a complexity $c$, which reflects the temporal and spatial complexity characteristics of the video. We define a video decoding configuration as a set of constant and variable parameters. The constant parameters are related to the processor architecture, the video resolution and the complexity. For a given video sequence and a given mobile device, these parameters are not supposed to change. On the other hand, the variable parameters are the video bit-rate $r$ and the processor clock frequency $f$. The bit-rate may vary depending on the network bandwidth capabilities, and the processor frequency is driven by the system frequency scaling policy. In this context, we focus in this study on the characterization and the modeling of the energy consumption of the processing resources when decoding a video, as a function of the above-cited constant and variable parameters. The objective of the characterization part is to evaluate and explain the energy consumption variation. It consists in executing and measuring the energy consumption of different video decoding configurations. The modeling part, which depends on the measurement results obtained from the characterization phase, aims at estimating the energy consumption. The proposed methodology, described in detail in the next subsections, is thus an end-to-end approach to evaluate, explain and estimate the energy consumption of video decoding on a SoC.

3.1. Characterization Methodology

In the energy characterization of the video decoding process defined above, we focus on the processing elements. The phases which are not part of the actual GPP and/or DSP video decoding are not considered. Thus, the execution time and the energy consumption related to buffering and video display processing were not taken into consideration. In fact, these parts are I/O dependent and their performance may vary according to bandwidth fluctuation or file system performance. Studying the impact of these parts is beyond the scope of this paper. As described in Figure 3, the characterization is divided into four steps: 1) video complexity characterization, 2) operating system level characterization, 3) video frame level characterization, and 4) video sequence level characterization.
The first step is achieved at the video encoding phase, while the remaining ones are executed at the video decoding phase (on an embedded hardware platform).

3.1.1. Video Complexity

This step is executed in the video encoding phase to prepare the tested video sequences used in our experiments. We used H.264/AVC, a mature and widely adopted coding standard [60]. H.264/AVC allows encoding

Figure 3: Energy modeling methodology

a video according to different profiles. In our experiments, we used the restricted baseline profile, which is suitable for use in embedded devices with constrained processing resources. A set of representative raw video sequences (with different video complexities and resolutions) was selected and encoded at different constant bit-rates using the H.264/AVC encoder. To extract the video complexity information, we kept track of the mapping between each bit-rate and the average quantization parameter $qp_{avg}$ used by the encoder to encode the videos. The quantization parameter, which is defined in the H.264/AVC standard, is adjusted dynamically by the encoder for each video frame or macro-block to fit a targeted bit-rate. The more complex a video is, the higher the value of the quantization parameters used. Therefore, the video complexity information can be provided by the couple $c = (r, qp_{avg})$. As illustrated in Figure 3, the output of this step (block 1) served to build a rate model which describes the variation of the bit-rate as a function of


$qp_{avg}$. This model helps to build an analytical energy model as a function of the bit-rate and some video complexity related parameters. The next characterization steps are those dealing with the actual decoding process executed on an embedded platform.

3.1.2. Operating-System Level

In this step of the methodology, the power consumption of both the GPP and the DSP in the idle and active states is measured at different clock frequencies, independently of any video decoding process. The objective of this level is to constitute a set of reference power consumption values that help to understand and quantify the performance and the energy consumption of the different video decoding phases at a frame granularity.

3.1.3. Video-Frame Level

This level relies on the preceding operating system level characterization, which helps in identifying the transitions of the processor to the active/idle states and the decoding/waiting phases during the decoding process. The objective of this step is to understand how the processing elements are used and where the energy goes when decoding a single frame, in order to explain the global performance and energy consumption of both GPP and DSP decoding. For this purpose, the elementary video frame decoding is characterized in terms of system metrics such as the amount of buffer transfers, the GPP/DSP communication latency and the cache coherency maintenance. We also evaluated during this step the overhead of video decoding when using both the GPP and the DSP. As discussed in section 2, decoding a video comes down to a sequence of frame processing periods. Each period is composed of a set of actions consisting in retrieving the coded frame from the input buffer, decoding it, and transferring it to the output buffer. The actual frame decoding step is thus a sub-part of a frame processing period. Within the frame processing period, we define the overhead as the actions which are not part of the frame decoding step.
For example, in the case of DSP decoding, this may be related to cache coherency maintenance and inter-processor communication, as described in section 2.

3.1.4. Video-Sequence Level

In this step, the average performance (number of decoded frames per second) of the H.264/AVC decoding of the overall video sequence is evaluated depending on the video bit-rate and resolution, and on the processor clock frequency. Both the GPP and the DSP were tested. The considered performance evaluation metric is the display rate of the decoded video. In fact, we considered that a decoding rate lower than the display rate of the coded video is not sufficient for playing back the video with respect to real-time constraints. The overall energy consumption is then calculated by multiplying the average of the elementary measured power values by the decoding time. The average energy per frame (mJ/frame) is then obtained by dividing the overall energy by the total number of frames. Each video was decoded using both the GPP and the DSP. This decoding was repeated for all the available clock frequencies. For each bit-rate, resolution and clock frequency, the decoding time and the energy consumption were measured. As shown in Figure 3 (block 4), the results of this phase are triplet data sets $(T, r, f)$ and $(E, r, f)$ describing the decoding time and energy variation in terms of $r$ and $f$. These data are used in the modeling phase, described hereafter, to build a power and a performance model for video decoding.

3.2. Modeling Methodology

In our modeling methodology, we used a top-down model decomposition approach in which the energy model is decomposed as a function of two sub-models: a decoding-time model $T$ and an average power model $P_{avg}$.

$E(r, f) = P_{avg}(r, f) \cdot T(r, f)$  (4)
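Numerically, the sequence-level measurements of section 3.1.4 feed equation (4) as follows; the sample values below are illustrative placeholders, not measurements from the paper.

```python
# Average energy per frame (mJ/frame) from sampled power and decoding time.
power_samples = [0.52, 0.55, 0.50, 0.53]  # instantaneous power (W), placeholders
decoding_time = 10.0                      # whole-sequence decoding time T (s)
n_frames = 300                            # total number of decoded frames

p_avg = sum(power_samples) / len(power_samples)  # average power P_avg (W)
e_total = p_avg * decoding_time                  # E = P_avg * T (J), as in eq. (4)
e_frame_mj = 1000.0 * e_total / n_frames         # average energy per frame (mJ)

print(f"{e_frame_mj:.2f} mJ/frame")
```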

$P_{avg}$ is then decomposed into a dynamic ($P_{avg_{dyn}}$) and a static ($P_{avg_{stat}}$) power model:

$P_{avg}(r, f) = P_{avg_{dyn}}(r, f) + P_{avg_{stat}}(r, f)$  (5)

As discussed previously in section 3.1.1, we propose to use the mapping between the bit-rate $r$ and the average quantization parameter $qp_{avg}$ to add the video complexity information to our model. For this purpose, we first use $qp_{avg}$ instead of the bit-rate $r$ as a parameter of the power and performance models. Therefore, the power model described in equation (5) becomes:

$P_{avg}(qp_{avg}, f) = P_{avg_{dyn}}(qp_{avg}, f) + P_{avg_{stat}}(qp_{avg}, f)$  (6)

To obtain an energy model as a function of the clock frequency and the bit-rate, we used a rate model $Q$ which describes $qp_{avg}$ in terms of the bit-rate $r$ and some video complexity related coefficients:

$E(r, f) = (P_{avg_{stat}}(Q(r), f) + P_{avg_{dyn}}(Q(r), f)) \cdot T(Q(r), f)$  (7)
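The decomposition in equation (7) amounts to composing sub-model functions. Every coefficient in the sketch below is a hypothetical placeholder; in the paper, the actual values come from the fitting steps of sections 3.2.1 to 3.2.3.

```python
import math

def Q(r, q_min=2.5, r_max=5120.0, a=1.2):
    """Rate sub-model, eq. (10): qp_avg as a function of the bit-rate r (Kb/s)."""
    return 4 + 6 * math.log2(q_min * (r / r_max) ** (-1 / a))

def p_stat(f):
    """Static power (W) per frequency, data-sheet-like lookup (placeholder)."""
    return {250e6: 0.05, 500e6: 0.08}[f]

def p_dyn(qp_avg, f, c_eff=1e-9, v=1.2):
    """Dynamic power (W); qp_avg modulating the activity is a toy assumption."""
    return c_eff * v ** 2 * f * (1 + 0.01 * qp_avg)

def t_decode(qp_avg, f, b1=2e-7, b2=-0.5, b0=20.0, n_frames=300):
    """Decoding time (s): 1/t (frames/s) linear in f and qp_avg (section 3.2.3)."""
    return n_frames / (b1 * f + b2 * qp_avg + b0)

def energy(r, f):
    """Eq. (7): E(r, f) = (P_stat(Q(r), f) + P_dyn(Q(r), f)) * T(Q(r), f)."""
    qp = Q(r)
    return (p_stat(f) + p_dyn(qp, f)) * t_decode(qp, f)

print(f"E(1024 Kb/s, 500 MHz) = {energy(1024, 500e6):.2f} J")
```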

The energy model described in equation (4) is thus decomposed into: 1) a rate model, 2) a static/dynamic power model and 3) a time model. Based on these sub-models, the energy model construction phase can start. A model fitting is performed on the video complexity characterization results to develop the rate model. Then, a regression analysis of the video sequence level characterization results is used to develop the power and decoding-time analytical sub-models. Details on the sub-models, the experimental measurements and the development of the sub-models are given below.

3.2.1. Video Rate Sub-model

To express the video bit-rate as a function of the quantization parameters, we used the model proposed in [38]. The authors of this paper assume that the video bit-rate can be described as follows:

$r = \left(\frac{q}{q_{min}}\right)^{-a} \cdot r_{max}$  (8)

where $q$ is the step size, which is defined as follows [60]:

$q = 2^{(qp_{avg} - 4)/6}$  (9)

$a$ is an exponent which represents how fast the video rate changes in terms of the step size parameter. $q_{min}$ is the lowest step size used to encode the video with the highest bit-rate $r_{max}$. By substituting equation (9) in (8), we obtain the following rate model:

$qp_{avg} = 4 + 6 \cdot \log_2\left(q_{min} \cdot \left(\frac{r}{r_{max}}\right)^{-1/a}\right)$  (10)

Using the values $(r, qp_{avg})$ returned by the encoder, the parameters $q_{min}$, $r_{max}$ and $a$ can be calculated using model fitting for each coded video.
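Since equation (10) is linear in $\log_2(r/r_{max})$, the parameters $a$ and $q_{min}$ can be recovered with an ordinary least-squares line fit once $r_{max}$ (the highest encoding bit-rate) is known. The sketch below uses synthetic $(r, qp_{avg})$ pairs generated from hypothetical parameters rather than real encoder output.

```python
import math

# Synthetic (r, qp_avg) pairs generated from known (hypothetical) parameters.
A_TRUE, QMIN_TRUE, R_MAX = 1.2, 2.5, 5120.0
rates = [64.0, 128.0, 256.0, 512.0, 1024.0, 2048.0, 5120.0]  # Kb/s
qp = [4 + 6 * math.log2(QMIN_TRUE * (r / R_MAX) ** (-1 / A_TRUE)) for r in rates]

# Equation (10) rewritten as a line: qp = intercept + slope * log2(r/r_max),
# with slope = -6/a and intercept = 4 + 6*log2(q_min).
x = [math.log2(r / R_MAX) for r in rates]
mx, mq = sum(x) / len(x), sum(qp) / len(qp)
slope = sum((xi - mx) * (qi - mq) for xi, qi in zip(x, qp)) \
        / sum((xi - mx) ** 2 for xi in x)
intercept = mq - slope * mx

a_fit = -6.0 / slope
qmin_fit = 2.0 ** ((intercept - 4) / 6)
print(f"a = {a_fit:.3f}, q_min = {qmin_fit:.3f}")
```

Because the synthetic data is noise-free, the fit recovers the generating parameters exactly; on real encoder data the residuals quantify how well the model of [38] holds.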

3.2.2. Power Sub-model

The static power is modeled using the processor data-sheet [54], which provides the value of the static power corresponding to each voltage level. To model the dynamic power consumption, the $C_{eff}$ parameter value is obtained by fitting the measured dynamic power (i.e. the measured total power minus the static power) with the model described in equation (3).
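A minimal sketch of this fitting step: with $P_{dyn} = C_{eff} \cdot V^2 \cdot f$, a least-squares fit through the origin on the variable $x = V^2 f$ gives $C_{eff}$ directly. The "measured" values below are synthetic, generated from a hypothetical $C_{eff}$.

```python
C_TRUE = 0.9e-9  # hypothetical effective capacitance (F)
points = [(1.06, 250e6), (1.2, 500e6), (1.35, 720e6)]  # (V, f) operating points
p_dyn = [C_TRUE * v ** 2 * f for v, f in points]       # "measured" dynamic power (W)

# Least-squares through the origin: C_eff = sum(p*x) / sum(x*x), with x = V^2*f
x = [v ** 2 * f for v, f in points]
c_eff = sum(p * xi for p, xi in zip(p_dyn, x)) / sum(xi * xi for xi in x)
print(f"C_eff = {c_eff:.3e} F")
```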

3.2.3. Decoding-Time Sub-model

To develop the video decoding time model $T$, we used an observation on the experimental results (see section 5.4.1) which reveals a linear relation between $1/t$ (where $t$ is the decoding time) and both the clock frequency $f$ and the quantization parameter $qp_{avg}$. This linear relation was validated using a multi-linear regression of $1/t$ in terms of $f$ and $qp_{avg}$.

4. Experimental Setup

The methodology discussed above is independent of the underlying hardware and software platforms. In this section, we describe its execution on an OMAP hardware platform running a Linux operating system and the GStreamer multimedia framework.

4.1. Hardware Setup

The power measurements were conducted on the OMAP3530EVM board (shown in Figure 5) containing the low-power OMAP3530 SoC. This SoC is based on a 65-nm technology and consists of a Cortex A8 ARM processor supporting the ARMv7 instruction set and a TMS320C64x DSP. The OMAP3530 supports six operating frequencies, ranging from 125 MHz to 720 MHz for the ARM and from 90 MHz to 520 MHz for the DSP. The power consumption of the DSP and the ARM processors is measured using the Open-PEOPLE framework [10], a multi-user and multi-target power and energy estimation and optimization platform. It includes the NI-PXI-4472 digitizer, allowing up to a 100 kHz sampling resolution. The OMAP3530EVM board provides a single jumper for measuring both the DSP and the ARM processor power consumption. In the case of ARM video decoding, the measured power represents the ARM dynamic consumption plus the ARM and DSP static power. In the case of DSP video decoding, both the ARM and the DSP are involved. In fact, the ARM controls the DSP, which executes the actual video decoding process. The measured power is thus the sum of the static and dynamic power of both the ARM and the DSP. In what follows, $P_{static}$ is the sum of the ARM and the DSP static power. The ARM and the DSP dynamic powers are noted $P_{dyn_{arm}}$ and $P_{dyn_{dsp}}$, respectively. The total power consumption of the ARM and the DSP is noted $P_{tot}$.

[Figure 4 (captioned below) shows two GStreamer pipes: (a) the ARM pipe, filesrc ! queue ! ffdec_h264 ! filesink location=/dev/null, with a buffering thread at regular priority and a video decode thread at real-time priority; (b) the DSP pipe, filesrc ! queue ! TIViddec2 ! filesink location=/dev/null, in which the TIViddec2 plug-in moves encoded frames through a physically-contiguous circular buffer FIFO before invoking the DSP decoder. Energy and performance measurement starts after the buffering phase and ends with the decoding phase.]

Figure 4: GStreamer GPP and DSP video decoding pipes

4.2. Software Setup

On this hardware platform, the Linux operating system (version 2.6.32) was used, with cpufreq [44] enabled to drive the ARM and DSP frequency scaling. The DSP runs the DSP/BIOS operating system and is driven from the Linux/GPP side through a driver named DSPLink. On this system, the H.264/AVC video decoding was achieved using GStreamer [23], a multimedia development framework. The use of GStreamer permits an accurate GPP/DSP decoding comparison thanks to its modular design, which allows plugging and executing both the GPP and the DSP video decoders in the same software environment. The ARM decoding was performed using ffdec_h264, an open-source plug-in based on the widely used ffmpeg/libavcodec library, compiled with support for the NEON SIMD instruction set. According to [2], NEON boosts performance by 60-150% for video codecs. For DSP decoding, we used TIViddec2, a proprietary GStreamer H.264/AVC baseline profile plug-in provided by Texas Instruments. Its internal design is illustrated in the right side of Figure 4. The video frames are moved from the encoded video buffer (the input buffer containing the coded frames) to a circular buffer by a queuing thread. The video decode thread invokes the DSP decoder via the DSPLink driver. The DSP codec executes a cache invalidation operation so that it sees the right data in the shared memory, decodes the frame, and transfers it to the decoded frame buffer using DMA. In our experimental tests, videos were decoded from a flash memory filesystem. As described in section 3.1, we focus on measuring the video decoding

Table 1: Hardware and software setup summary

Videos
  Test sequences: Harbor, Soccer, City
  Rate: 30 frames/s
  Resolutions: qcif (176x144), cif (352x288), 4cif (704x576)
  Bit-rates (Kb/s): 64, 128, 256, 512, 1024, 1536, 2048, 2560, 3072, 3584, 4096, 4608, 5120

Applications (GStreamer)
  ARM plug-in: ffdec_h264
  DSP plug-in: TIViddec2

Operating System
  GPP (ARM): Linux 2.6.32
  DSP: DSP/BIOS
  ARM/DSP DVFS driver: cpufreq
  DSP driver: DSPLink
  DSP power management driver: LPM

Hardware
  DSP: TMS320C64x; frequencies (MHz): 90, 180, 360, 400, 430, 520
  ARM: Cortex A8 + NEON; frequencies (MHz): 125, 250, 500, 550, 600, 720
  SoC: OMAP3530; voltage levels (V): 0.975, 1.05, 1.2, 1.27, 1.35, 1.35

Figure 5: Mistral EVM3530 Board

phase. However, GStreamer is multi-threaded program and the buffering operations may interleave with the video decoding operations. This makes the performance and the energy consumption of the decoding phase difficult to measure. To avoid this situation, we developed a customized video decoder using the GStreamer API. As shown in Figure 4, the decoding thread was kept initially in a pause state while the video stream was copied in an input buffer (GStreamer queue element) by the buffering thread. The decoding thread is woke-up when all the video stream is held in the input buffer. On the other hand, the decoded frames are redirected to /dev/null in order to disable the processing related to the frames copy from the output buffer to the display driver memory or to a file. Table 1 gives a complete summary of all the above described hardware and the software setups. 4.3. Video Complexity Characterization Setup For video decoding characterization we encoded 3 well-known YUV raw video sequences : City, Soccer and Harbor in H.264/AVC base-line profile. These sequences represent respectively a low, a medium and a high video complexity. Each video is available in three resolutions (qcif (176x144), cif (352x288) and 4cif (704x576)). Each video was encoded into 13 different bit-rates (from 64 Kb/s to 5120 Kb/s) using x264 encoder [40]. After each coding operation, the values of qp used in the encoding process are extracted from the encoder log and the average quantization parameter qpavg was then calculated accordingly. This step serves to characterize the video complexity and to build a linear performance model as described in section 3.1.1. 16

4.4. Video Decoding Performance and Energy Characterization Setup

As described in section 3.1, the video decoding characterization comprises the following three steps.

4.4.1. Operating-System Level

In this step, the operating system power consumption was characterized. At each clock frequency, the power consumption of the ARM processor was measured in the idle and active states. The active state corresponds to the execution of a loop incrementing a processor register. The power consumption of the DSP in the idle state was measured as the power difference due to the activation of the DSP.

4.4.2. Video-Frame Level

In this step, the video decoding power and performance were characterized at the frame granularity. A 100 KHz power measurement sampling rate (10 µs resolution) was used to capture the power variations within a frame decoding phase. This is useful especially for low quality videos, where decoding one frame can be very fast (around 1 ms). To calculate the time overhead, the sum of the frame decoding times is subtracted from the total video decoding time (measured with tracing disabled). Similarly, the energy overhead is calculated by subtracting the sum of the frame decoding energy values from the overall video decoding energy. The frame decoding energy is obtained by multiplying the average frame decoding power consumption by the frame decoding time calculated above. In fact, we noticed that the average power consumption during the frame decoding phase is constant for a given video resolution.

4.4.3. Video-Sequence Level

In the third step, the number of decoded frames per second and the energy (mJ) per frame of ARM and DSP H.264/AVC video decoding were measured. This operation was executed for each tested video sequence (Harbor, Soccer and City), bit-rate, resolution and clock frequency. The video decoding energy is obtained by summing elementary energies using a 1 KHz power measurement sampling rate.

5. Experimental Results of Video Decoding Characterization

In this section, we discuss the results obtained by executing the characterization methodology described above.

5.1. Video Complexity

Table 2 shows the mapping between the resolution, the bit-rate, and the average quantization parameters extracted for the considered sequences. The data show that, for the same resolution and bit-rate, the average quantization parameters qpavg of the City, Soccer and Harbor sequences are in increasing order. This is explained by the fact that the Harbor video sequence is more complex than Soccer, which is in turn more complex than City.

Table 2: Mapping between the video qpavg and the bit-rates

Bit-rate |        City           |       Soccer          |       Harbor
(Kb/s)   | qcif  | cif   | 4cif  | qcif  | cif   | 4cif  | qcif  | cif   | 4cif
64       | 32.81 | 42.74 | 51.00 | 38.22 | 48.81 | 51.00 | 37.78 | 45.84 | 51.00
128      | 27.69 | 36.79 | 49.38 | 31.65 | 41.55 | 50.99 | 33.94 | 41.56 | 50.70
256      | 23.06 | 31.83 | 41.52 | 25.57 | 35.43 | 45.27 | 30.07 | 37.78 | 45.50
512      | 18.49 | 27.28 | 35.76 | 19.84 | 29.76 | 39.09 | 25.83 | 34.12 | 41.26
1024     | 13.51 | 23.07 | 31.27 | 14.22 | 24.38 | 33.75 | 20.66 | 30.21 | 37.48
1536     | 10.43 | 20.71 | 29.14 | 10.89 | 21.47 | 30.87 | 16.84 | 27.64 | 35.18
2048     |  7.97 | 18.92 | 27.83 |  8.25 | 19.35 | 29.01 | 13.72 | 25.65 | 33.53
2560     |  5.69 | 17.50 | 26.86 |  5.87 | 17.77 | 27.61 | 11.08 | 23.98 | 32.21
3072     |  3.51 | 16.33 | 26.09 |  3.59 | 16.49 | 26.50 |  8.75 | 22.55 | 31.11
3584     |  1.31 | 15.34 | 25.45 |  1.80 | 15.41 | 25.60 |  6.51 | 21.27 | 30.15
4096     |  0.44 | 14.48 | 24.89 |  1.05 | 14.45 | 24.85 |  4.37 | 20.11 | 29.31
4608     |  0.22 | 13.70 | 24.40 |  0.91 | 13.60 | 24.17 |  2.31 | 19.03 | 28.56
5120     |  0.14 | 12.99 | 23.94 |  0.82 | 12.83 | 23.57 |  0.95 | 18.01 | 27.88

[Plots of qpavg versus bit-rate for the QCIF, CIF and 4CIF resolutions of the Harbor, Soccer and City sequences.]

5.2. Operating-System Level

Figure 6-a shows both the active and idle power consumption of the ARM processor at the six available clock frequencies. The power consumption in the active state is almost 35% greater than in the idle state. This difference is due to the Wait For Interrupt (WFI) ARM instruction called when entering the idle state. WFI puts the processor in a low power state by disabling most of the clocks in the processor while keeping it powered up. Note that the WFI power consumption is not equal to the processor static power, which corresponds to the state where all the clocks are gated.


Table 3: OMAP3530 power consumption

f_arm (MHz) | f_dsp (MHz) | Vdd (V) | Pact_arm (W) | Pidle_arm (W) | Pidle_dsp (W) | Pstatic (W)
720         | 520         | 1.35    | 0.5965       | 0.4342        | 0.2312        | 0.01975
600         | 430         | 1.35    | 0.4997       | 0.3801        | 0.1778        | 0.01975
550         | 400         | 1.27    | 0.4087       | 0.3089        | 0.1490        | 0.01527
500         | 360         | 1.2     | 0.3276       | 0.2476        | 0.1217        | 0.01135
250         | 180         | 1.05    | 0.1238       | 0.0913        | 0.0498        | 0.00716
125         | 90          | 0.975   | 0.0421       | 0.0275        | 0.0224        | 0.00516

[Plot of the OMAP3530 power consumption versus frequency: ARM dynamic power (active and idle), DSP idle power, and static power (ARM + DSP).]

Figure 7: OMAP3530 power consumption

Figure 6-b shows the DSP idle power consumption at the different frequency levels. The idle state corresponds to the state where the DSP is activated without executing any instruction. Table 3 summarizes the measured power characteristics of the Cortex A8 processor and the TMS320C64x DSP. The values of the static power corresponding to each voltage level are taken from [54]. One can notice that the dynamic power represents the major part of the total power as compared to the static power consumption. We can also observe that the ARM and DSP idle state power consumption is not negligible and may constitute an important part of the energy budget. The power consumption levels measured in this step provide information on how the energy is consumed in the different processor states. The amount of time spent in each state is one of the parameters that impact the overall energy consumption. This is discussed when analyzing video decoding at the frame level in the following section.
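To make this state-level accounting concrete, the energy of a decoding run can be approximated by weighting each measured state power by the time spent in that state. The sketch below is illustrative only: the power figures are the OMAP3530 values from Table 3 at 720/520 MHz, while the residency times are made-up numbers.

```python
def state_energy(residencies_s, state_power_w):
    """Total energy as the sum over processor states of
    (time in state) x (measured state power)."""
    return sum(residencies_s[s] * state_power_w[s] for s in residencies_s)

power = {"arm_active": 0.5965, "arm_idle": 0.4342, "dsp_idle": 0.2312}  # W, Table 3
residency = {"arm_active": 0.4, "arm_idle": 0.6, "dsp_idle": 1.0}       # s, illustrative
energy_j = state_energy(residency, power)
```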

[Power traces over time: (a) ARM active/idle power consumption at the 125-720 MHz frequencies; (b) ARM and DSP idle power consumption around DSP activation and deactivation.]

Figure 6: Cortex A8 and TMS320C64x DSP dynamic power consumption


[Power traces of the memory and the processors during frame decoding: (a) DSP frame decoding (4cif); (b) DSP frame decoding (qcif); (c) ARM frame decoding (4cif); (d) ARM frame decoding (qcif). Each trace shows the frame processing period, the decoding phase, the overhead phase, the decoded frame transfer using DMA, and the memory power increase due to the frame copy.]

Figure 8: ARM and DSP frame decoding

5.3. Video-Frame Level

The objective of this step is to understand where the consumed energy goes at the frame granularity and how much energy is consumed by the overhead processing as defined in section 3.1.3. Thus, we restricted our experiments to 10 frames extracted from the Harbor sequence. These frames are coded in 4cif (4 Mb/s), cif (1 Mb/s) and qcif (128 Kb/s) resolutions. We used the 720 MHz clock frequency for the ARM processor and the corresponding 520 MHz frequency for the DSP. More extensive experiments using other configurations are described at the video-sequence level in the next section. Figures 8-a and 8-b show the power consumption levels of 4cif and qcif DSP video decoding. The DSP frame decoding phase is represented by the strip varying between 0.7 W and 1.1 W, corresponding to the [32 ms, 62 ms] (Figure 8-a) and [6.2 ms, 7.5 ms] (Figure 8-b) intervals. This phase is terminated by a burst of DMA transfers of the decoded frame macro-blocks from the DSP cache to the shared memory. It corresponds to the intervals [56 ms, 62 ms] (Figure 8-a) and [7.2 ms, 7.5 ms] (Figure 8-b) and is visible as an increase in memory power consumption. When the DSP finishes decoding a frame, it returns the execution status to the GPP and enters the idle state. This event occurs, for example, at 25 ms in Figure 8-a. The ARM wake-up latency is represented by the power level of 0.66 W, which is the sum of the power consumption of the ARM and the DSP in the idle state (0.43 W + 0.23 W), as given in Table 3. The ARM wake-up

is represented by the power transition to the 0.83 W level, which is the sum of the ARM active state power (0.59 W) and the DSP idle state power (0.23 W). The ARM then sends the parameters (the next frame to decode) to the DSP codec and triggers a DSP decoding function. Figures 8-c and 8-d show the power consumption variation in the case of 4cif and qcif ARM decoding. As for DSP decoding, the frame decoding phase is characterized by an increase in power consumption. The decoded frame copy does not appear as clearly as in the DSP decoding case, since the frames are decoded in the ARM cache and evicted when no space is left in the cache. We can also notice that the frame decoding time is lower than the frame decoding period, which is due to a GStreamer overhead. One can observe that the fraction of time spent in frame decoding relative to the total video decoding time varies with the video resolution. For example, in the case of qcif DSP decoding (Figure 8-b), the frame decoding time represents almost 50% of it. The complete measured time and energy overheads (as described in section 3.1.3) are given in Table 4.

Table 4: ARM and DSP decoding times and energy overhead

ARM
Resolution       | Time (ms/frame)                | Energy (mJ/frame)
                 | Decoding | Total | Overhead    | Decoding | Total | Overhead
qcif (128 Kb/s)  | 2.19     | 2.87  | 10.04 %     | 1.20     | 1.54  | 22.07 %
cif (1024 Kb/s)  | 10.85    | 12.04 | 9.88 %      | 6.18     | 6.87  | 10.04 %
4cif (5120 Kb/s) | 47.23    | 52.39 | 9.86 %      | 27.39    | 28.4  | 3.55 %

DSP
Resolution       | Time (ms/frame)                | Energy (mJ/frame)
                 | Decoding | Total | Overhead    | Decoding | Total | Overhead
qcif (128 Kb/s)  | 1.97     | 4.16  | 31.01 %     | 1.71     | 2.63  | 34.98 %
cif (1024 Kb/s)  | 6.016    | 8.36  | 28.11 %     | 5.35     | 6.72  | 20.38 %
4cif (5120 Kb/s) | 23.73    | 25.93 | 8.48 %      | 21.59    | 22.16 | 2.5 %
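The overhead columns of Table 4 are simple derived quantities; as a sanity check, they can be recomputed from the decoding and total columns. This is a small sketch, not the authors' measurement tooling.

```python
def overhead_pct(decoding, total):
    """Overhead as a percentage of the total time or energy,
    as reported in Table 4."""
    return 100.0 * (total - decoding) / total

# e.g. DSP 4cif: 23.73 ms of decoding out of 25.93 ms total -> ~8.5% time overhead
time_ovh = overhead_pct(23.73, 25.93)
energy_ovh = overhead_pct(21.59, 22.16)  # DSP 4cif energy overhead -> ~2.6%
```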

The time overhead percentages are 31%, 28% and 8% of the total frame decoding time in the case of qcif, cif and 4cif DSP decoding, respectively. On the other hand, the overhead is almost constant (10%) in the case of ARM decoding. We note that the overhead is not negligible compared to the total decoding time. For example, the total qcif DSP decoding time is higher than that of the ARM although the frames themselves are processed faster by the DSP. In addition, the energy overhead is higher in the case of DSP decoding. In fact, the ARM-DSP communication contributes a significant part of the energy consumption, especially for low video resolutions. We can also notice that keeping the DSP in the idle state while waiting for a decoding request from the ARM contributes to the energy overhead. In fact, the DSP is not deactivated while waiting to receive the next frame: the idle interval is so short that entering a deeper sleep mode would have a negative impact on performance. During this timeout, the DSP consumes about 0.23 W at 520 MHz (refer to Table 3) without executing any task. For example, in the case of qcif decoding, the DSP is unused for more than 50% of the time, but still consumes idle power.

These observations are confirmed by an application and system profiling of the ARM and DSP video decoding performed with the OProfile tool [34], a system-wide profiler for Linux. Table 5 shows the obtained results. In the case of ARM video decoding, most of the decoding time is spent executing the libavcodec library, which implements the H.264/AVC GStreamer plug-in. The amount of time the ARM processor spends in the idle state decreases when the video resolution increases. Indeed, frames with higher resolutions take longer to decode (refer to section 5.4.1), leading to shorter idle times. In the case of DSP video decoding, most of the time the ARM processor is in the idle state, waiting for the DSP to decode the video frames. This explains the increase in idle time when the video resolution increases: as in ARM decoding, higher-resolution frames take the DSP longer to decode, leading to longer ARM idle times. The control of the DSP from the ARM side corresponds to the calls to the DSPLink driver. The actual video decoding process does not appear in the profiling results since it is executed on the DSP side.

Table 5: ARM and DSP decoding profiling

     Functions     | qcif  | cif   | 4cif
ARM: libavcodec    | 42%   | 66%   | 75%
     omap3 pm idle | 26.5% | 13%   | 5%
     libc          | 4.5%  | 4%    | 4.6%
     libgstreamer  | 2.5%  | 2%    | 2%
     other         | 31.5% | 15%   | 18%
DSP: omap3 pm idle | 61%   | 69%   | 84%
     libc          | 8%    | 6%    | 4%
     libgstreamer  | 2.6%  | 2%    | 1.15%
     dsplinkk      | 0.8%  | 1.14% | 0.6%
     other         | 27.6% | 23%   | 11%

Whatever the time spent in each of these actions, the total power consumed during ARM and DSP decoding is described by equations (11) and (12), respectively.

P_tot,arm = C_eff,arm · V^2 · f_arm + P_static    (11)

P_tot,dsp = C_eff,arm · V^2 · f_arm + C_eff,dsp · V^2 · f_dsp + P_static    (12)

In the case of DSP video decoding, both the ARM and the DSP processors are involved in the decoding process (see section 2). At a given time, the power consumption is determined by the effective capacitances of both the ARM and the DSP, as described in equation (12). This equation can be simplified using the linear relation between the ARM clock frequencies and those of the DSP: from Table 3, we can notice that f_dsp = 0.72 · f_arm (this is specific to the measured platform). Therefore, equation (12) becomes:

P_tot,dsp = (C_eff,arm + 0.72 · C_eff,dsp) · V^2 · f_arm + P_static    (13)
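Equations (11) to (13) translate directly into code. The sketch below uses illustrative capacitance values of the same order as those in Table 7 (the frequency unit is whatever unit the fit used); with f_dsp = 0.72 · f_arm, equations (12) and (13) coincide by construction.

```python
def p_tot_arm(ceff_arm, v, f_arm, p_static):
    # Equation (11): ARM-only decoding power.
    return ceff_arm * v**2 * f_arm + p_static

def p_tot_dsp(ceff_arm, ceff_dsp, v, f_arm, f_dsp, p_static):
    # Equation (12): ARM and DSP both contribute during DSP decoding.
    return ceff_arm * v**2 * f_arm + ceff_dsp * v**2 * f_dsp + p_static

def p_tot_dsp_simplified(ceff_arm, ceff_dsp, v, f_arm, p_static):
    # Equation (13): exploits f_dsp = 0.72 * f_arm on this platform.
    return (ceff_arm + 0.72 * ceff_dsp) * v**2 * f_arm + p_static
```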

We can conclude from this section that the time and energy overheads are not negligible, especially in the case of DSP decoding at qcif resolution. In addition, when decoding a frame, the power consumption varies depending on the processor state and the executed instruction type, but should follow a model of the form α · V^2 · f + P_static for both DSP and ARM video decoding. These results are used in the next section to explain and model the overall performance and energy consumption of decoding a whole video sequence.

5.4. Video-Sequence Level

5.4.1. Decoding Time

We analyze the video decoding performance in terms of the number of frames decoded per second (frames/s), which is equal to N/t, where N is the total number of frames (300 in our tests) and t is the decoding time. Figure 9 shows a comparison between ARM and DSP video decoding performance for the 4cif, cif and qcif resolutions of the considered video sequences. The dark (red) flat surface represents the acceptable reference video display rate (30 frames/s). The first observation, in the case of qcif and cif resolutions, is that the video is decoded at a higher rate than the display rate (30 frames/s) even at low clock frequencies, regardless of the video bit-rate and the processor type. The ratio between the actual decoding speed and the display rate increases for high clock frequencies and low bit-rates.

[3-D plots of the decoding performance (frames/s) versus frequency and bit-rate for ARM and DSP: qcif, cif and 4cif decoding of the Harbor and Soccer sequences.]

Figure 9: ARM and DSP video decoding performance

In the case of 4cif resolution, a decoding rate higher than 30 frames/s is achieved by the DSP starting from the 180 MHz frequency (i.e., 250 MHz ARM frequency) for low bit-rates and from the 430 MHz frequency (i.e., 600 MHz ARM frequency) for high bit-rates (see Table 3 for the correspondence between ARM and DSP frequencies). The performance of the ARM processor and the DSP is almost equivalent at qcif resolution. However, the ARM decoding speed is 43% higher than the DSP's at the 64 Kb/s bit-rate, while the DSP decoding speed is 14% higher than the ARM's at the 5120 Kb/s bit-rate. The DSP decoding is almost 50% faster than the ARM's at cif resolution and 100% faster at 4cif. This ratio decreases drastically for low bit-rates.


[3-D plots of the average decoding power consumption (W) versus frequency and bit-rate for ARM and DSP: qcif, cif and 4cif decoding of the Harbor sequence.]

Figure 10: ARM and DSP video decoding power consumption

5.4.2. Power Consumption

Figure 10 illustrates the variation of the average power consumption of ARM and DSP video decoding according to the video resolution and bit-rate for the Harbor video (the Soccer and City video sequences gave similar results). We notice that the power consumption depends mainly on the clock frequency, which is explained by the dominance of the dynamic power over the static power. For example, at 720 MHz, the static power is 19.75 mW (see Table 3), which represents 3.4% and 2.8% of the qcif ARM and DSP video decoding power consumption (540 mW and 700 mW, respectively). We can also observe that, unlike the ARM decoding average power consumption, the DSP power consumption increases when the video resolution increases. The DSP power consumption is thus 30%, 40%, and 50% higher than the ARM's for qcif, cif and 4cif resolutions, respectively. This can be explained by the results obtained in section 5.3 regarding the overhead evaluation: the percentage of time overhead is almost constant for ARM decoding and decreases with increasing video resolution for DSP decoding. The frame-level power characterization (previous sections) showed that the overhead phases correspond to a decreased power consumption due to entering the idle state (see Figure 8). Consequently, the larger the time overhead, the lower the average power consumption.

5.4.3. Energy Consumption

The previous results showed a very important variation of the DSP/ARM performance and power consumption, which depends on the clock frequency, the video bit-rate and the resolution. The energy consumption combines the power consumption and the decoding time properties. Figure 11 shows the energy consumption of ARM and DSP video decoding

[3-D plots of the decoding energy (mJ/frame) versus frequency and bit-rate for ARM and DSP: qcif, cif and 4cif decoding of the Harbor and Soccer sequences.]

Figure 11: ARM vs DSP decoding energy consumption of H.264/AVC video

(mJ/frame) for the 4cif, cif and qcif resolutions of the Soccer, Harbor and City video sequences. The DSP qcif video decoding consumes 100% more energy than the ARM for low bit-rates and 20% more for high bit-rates. This is explained by the lower performance and higher power consumption of DSP decoding compared to ARM decoding, caused by the system overhead (see Table 4). On the other hand, the DSP 4cif video decoding consumes less energy than the ARM although it draws 60% more power. This is due to a better DSP decoding performance, which can be 100% higher than that of the ARM. At cif resolution, we noticed a crossover between the ARM and DSP energy consumption levels around 1 Mb/s: for low bit-rates, the ARM consumes less energy than the DSP, while the opposite is true for high bit-rates.

The analysis of the video-sequence level characterization results shows that the performance and the energy consumption of ARM and DSP video decoding highly depend on the video quality. In fact, from the results obtained in the frame-level characterization step in the previous section, we noticed that for low video qualities, the DSP performance and energy efficiency tend to drop compared to those of the ARM processor. This is explained by the non-negligible overhead of the ARM-DSP communication. This overhead is independent of the video quality and is responsible for an important part of the processing time and the consumed energy, which explains the drop in DSP performance and energy efficiency.

6. Modeling of the Video Decoding Characterization Results

Based on the performance and energy consumption measurement data analyzed in the previous section, we describe hereafter the construction of the analytical energy model based on the sub-model decomposition methodology described in section 3.2.

6.1. Video Rate Sub-model

In this step (see block 1 of Figure 3), a model fitting is performed on the (qp, r) values obtained from the encoder (refer to Table 2) to approximate the a, rmax and qmin parameters of the model proposed in [38] and described in equation (8). Table 6 shows the model fitting results. This model has a very good precision, especially for high resolution videos (R-squared values around 97%). Figure 12 graphically illustrates the precision of the model for the Harbor video.

Table 6: Model fitting results of the bit-rate model

Resolution | Video  | a     | rmax (Kb/s) | qmin  | R^2
qcif       | Harbor | 1.031 | 475.5       | 10.15 | 0.9615
qcif       | Soccer | 0.993 | 177.5       | 18.14 | 0.9978
qcif       | City   | 1.026 | 75.72       | 54.74 | 0.9726
cif        | Harbor | 1.393 | 1422        | 14.54 | 0.9937
cif        | Soccer | 1.084 | 327.9       | 32.84 | 0.996
cif        | City   | 1.353 | 106.3       | 47.53 | 0.9972
4cif       | Harbor | 1.538 | 644.3       | 63.24 | 0.9913
4cif       | Soccer | 1.25  | 552.7       | 54.06 | 0.9855
4cif       | City   | 1.34  | 115         | 147   | 0.9805

Table 7: Model fitting results of the dynamic power model

Resolution | Video  | C_eff,arm | C_eff,arm-dsp
qcif       | Harbor | 4.20E-007 | 5.48E-007
qcif       | Soccer | 4.19E-007 | 5.53E-007
qcif       | City   | 4.17E-007 | 5.51E-007
cif        | Harbor | 4.28E-007 | 6.03E-007
cif        | Soccer | 4.23E-007 | 6.05E-007
cif        | City   | 4.21E-007 | 6.04E-007
4cif       | Harbor | 4.18E-007 | 6.47E-007
4cif       | Soccer | 4.17E-007 | 6.46E-007
4cif       | City   | 4.15E-007 | 6.44E-007
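A note on the rate-model fit of section 6.1 (Table 6): since qp = 4 + 6·log2(qmin) − (6/a)·log2(r/rmax), qp is linear in log2(r), so the parameter a can be recovered from the slope of an ordinary least-squares fit (qmin and rmax cannot be separated using the intercept alone). The sketch below illustrates one possible fitting procedure on synthetic data, not the authors' exact method, and assumes NumPy is available.

```python
import numpy as np

def fit_rate_model_slope(r, qp):
    """Fit qp = intercept + slope * log2(r) and recover the 'a' parameter
    of the rate model from slope = -6/a."""
    slope, intercept = np.polyfit(np.log2(np.asarray(r, float)), qp, 1)
    return -6.0 / slope, intercept

# Synthetic check with known (illustrative, not measured) parameters:
a_true, rmax, qmin = 1.2, 500.0, 12.0
r = np.array([64, 128, 256, 512, 1024, 2048, 4096], float)
qp = 4 + 6 * np.log2(qmin * (r / rmax) ** (-1.0 / a_true))
a_est, _ = fit_rate_model_slope(r, qp)
```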


[Plots of predicted versus measured qp as a function of bit-rate for the Harbor sequence at qcif, cif and 4cif resolutions.]

Figure 12: Rate model fitting

6.2. Power Sub-model

In this section, we model the variation of the average power consumption of ARM and DSP video decoding in terms of video resolution and bit-rate (see Figure 10). We first estimate the static power, then model the dynamic power, in order to develop a model of the total power consumption.

6.2.1. Static Power Sub-model

To construct the energy model, we used the static power values provided by the OMAP3530 data-sheet [54] (see Table 3).

6.2.2. Dynamic Power Sub-model

The dynamic power represents the major part of the total power at high frequencies. For example, the static power represents 0.02 W of the 0.6 W total ARM power consumption at 720 MHz (3.33%), while it represents 0.00516 W of the 0.05 W total power at the 125 MHz frequency (10%). Accounting for the static power is thus important, especially at low frequencies. Based on the frame-level power consumption analysis in section 5.3, we know that in both ARM and DSP video decoding, the dynamic power consumption should follow the model described in equation (3). In addition, as can be noticed in Figure 10, the power consumption does not depend on the bit-rate r. Therefore, we can suppose that (P_total - P_static)/V^2 = α · f, where α = C_eff,arm in the case of ARM decoding and α = C_eff,arm + 0.72 · C_eff,dsp in the case of DSP decoding (refer to equations (11) and (13)). For validation purposes, we executed a multi-linear regression on (P_tot - P_static)/V^2 in terms of f and r. The coefficients of the r variable are almost null, which confirms that the average power is independent of the bit-rate. The coefficients of f are shown in Table 7 for the Harbor, Soccer and City video sequences. One can observe that the regression coefficient representing the C_eff dynamic power model parameter is almost constant for a given video resolution and increases for higher resolutions. The column C_eff,arm represents the effective capacitance of the ARM processor in the case of ARM video decoding, while the column C_eff,arm-dsp represents the combined effective capacitance of the ARM processor and the DSP when the DSP is used to decode a video. We observe that C_eff,arm-dsp increases when the video resolution increases, unlike C_eff,arm, which is almost constant. This was discussed in section 5.4.2.

[3-D plots of the decoding rate (frames/s) versus frequency and qp for ARM and DSP: qcif, cif and 4cif decoding of the Harbor sequence.]

Figure 13: ARM and DSP video decoding time in terms of f and qp

6.3. Decoding Time Sub-model

In this step, we exploit a linear relation we observed between 1/t and both f and qpavg, as illustrated in Figure 13. A formal multi-linear regression analysis confirmed the observation. The results of this regression show that the decoding time can be described by equation (14). The values of the coefficients α0, α1, α2 and α3, obtained by the multi-linear regression analysis, are shown in Table 8.

1/t = α0 + α1·f + α2·qpavg + α3·f·qpavg    (14)
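The multi-linear regression behind equation (14) can be reproduced with an ordinary least-squares fit on the design matrix [1, f, qp, f·qp]. Below is a sketch using NumPy on synthetic, noise-free data; the coefficient magnitudes mimic Table 8 but are otherwise arbitrary.

```python
import numpy as np

def fit_decode_rate(f, qp, inv_t):
    """Least-squares fit of 1/t = a0 + a1*f + a2*qp + a3*f*qp (equation 14)."""
    X = np.column_stack([np.ones_like(f), f, qp, f * qp])
    coeffs, *_ = np.linalg.lstsq(X, inv_t, rcond=None)
    return coeffs  # [a0, a1, a2, a3]

rng = np.random.default_rng(0)
f = rng.uniform(125, 720, 50)              # MHz
qp = rng.uniform(5, 50, 50)
true = np.array([2.0, 2e-4, 0.2, 8e-6])    # Table-8-like magnitudes
inv_t = true[0] + true[1] * f + true[2] * qp + true[3] * f * qp
est = fit_decode_rate(f, qp, inv_t)
```

On noiseless data the true coefficients are recovered; on measured data the fit quality corresponds to the R-squared values reported in Table 8.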

The obtained results in Table 8 clearly show the accuracy of the proposed model, as can be seen from the R-squared values, which are calculated for each test defined by a video sequence, a resolution and a processor type (ARM or DSP). To obtain a decoding-time model as a function of the bit-rate, the qp parameter in the above model can be expressed in terms of the bit-rate using the rate model (10). Equation (14) then becomes:

1/t = α0 + α1·f + (α2 + α3·f) · (4 + 6·log2(qmin · (r/rmax)^(-1/a)))    (15)
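Equation (15) is simply equation (14) with qp replaced by the rate model. A direct sketch follows; the parameters are the qcif/Harbor ARM values from Tables 6 and 8, used here for illustration only.

```python
import math

def qp_from_bitrate(r, a, rmax, qmin):
    # Rate model: average quantization parameter as a function of bit-rate r.
    return 4 + 6 * math.log2(qmin * (r / rmax) ** (-1.0 / a))

def decode_rate(f, r, alphas, a, rmax, qmin):
    # Equation (15): 1/t with qp substituted by the rate model.
    a0, a1, a2, a3 = alphas
    return a0 + a1 * f + (a2 + a3 * f) * qp_from_bitrate(r, a, rmax, qmin)

# qcif/Harbor ARM parameters (Tables 6 and 8):
params = dict(a=1.031, rmax=475.5, qmin=10.15)
alphas = (2.040, 1.895e-4, 2.166e-1, 7.994e-6)
qp_512 = qp_from_bitrate(512, **params)  # model prediction; Table 2 reports 25.83
```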

Table 8: Multi-linear regression of 1/t in terms of f and qp

Resolution | Video  | Processor | α0         | α1         | α2        | α3        | R^2
qcif       | Harbor | ARM       | 2.040e+00  | 1.895e-04  | 2.166e-01 | 7.994e-06 | 0.9739
qcif       | Harbor | DSP       | 2.756e+00  | 2.178e-04  | 6.841e-02 | 4.021e-06 | 0.9915
qcif       | Soccer | ARM       | 3.482e+00  | 2.165e-04  | 1.733e-01 | 7.352e-06 | 0.9929
qcif       | Soccer | DSP       | 2.795e+00  | 2.300e-04  | 4.703e-02 | 3.795e-06 | 0.9961
qcif       | City   | ARM       | 3.332e+00  | 1.195e-04  | 1.365e-01 | 6.325e-06 | 0.9895
qcif       | City   | DSP       | 2.385e+00  | 1.908e-04  | 3.573e-02 | 2.911e-06 | 0.9913
cif        | Harbor | ARM       | -1.680e+00 | -1.943e-05 | 1.395e-01 | 4.825e-06 | 0.9782
cif        | Harbor | DSP       | -1.945e-01 | 5.243e-05  | 9.438e-02 | 3.877e-06 | 0.994
cif        | Soccer | ARM       | 3.356e-02  | 3.622e-05  | 9.569e-02 | 3.342e-06 | 0.9919
cif        | Soccer | DSP       | 1.181e+00  | 8.893e-05  | 3.783e-02 | 2.950e-06 | 0.9961
cif        | City   | ARM       | 3.128e-02  | 3.021e-05  | 8.775e-02 | 3.229e-06 | 0.9901
cif        | City   | DSP       | 1.021e+00  | 6.320e-05  | 2.803e-02 | 3.120e-06 | 0.9913
4cif       | Harbor | ARM       | -1.078e+00 | -1.995e-05 | 5.405e-02 | 1.559e-06 | 0.9911
4cif       | Harbor | DSP       | -4.915e-01 | 1.458e-06  | 2.906e-02 | 1.850e-06 | 0.9985
4cif       | Soccer | ARM       | -1.749e-01 | 6.355e-06  | 3.101e-02 | 9.765e-07 | 0.9023
4cif       | Soccer | DSP       | 6.244e-03  | 3.984e-05  | 1.502e-02 | 9.909e-07 | 0.8651
4cif       | City   | ARM       | -1.591e-01 | 5.505e-06  | 2.913e-02 | 6.564e-07 | 0.9523
4cif       | City   | DSP       | 5.564e-03  | 7.654e-05  | 2.277e-02 | 6.097e-07 | 0.92651

6.4. Energy Model

Based on the dynamic power model, the static power model and the decoding time model described in equations (3) and (14) respectively, the video decoding energy consumption can be calculated as follows:

E = (C_eff · V^2 · f + P_static) / (α0 + α1·f + (α2 + α3·f) · (4 + 6·log2(qmin · (r/rmax)^(-1/a))))    (16)
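Equation (16) can be evaluated numerically as a power-to-rate ratio. The sketch below reuses the illustrative qcif/Harbor numbers from Tables 3, 6 and 8; it is not a calibrated model.

```python
import math

def decode_energy(f, v, r, ceff, p_static, alphas, a, rmax, qmin):
    # Equation (16): per-frame energy = total power / decoding rate (1/t).
    a0, a1, a2, a3 = alphas
    qp = 4 + 6 * math.log2(qmin * (r / rmax) ** (-1.0 / a))
    power = ceff * v**2 * f + p_static          # dynamic + static power
    rate = a0 + a1 * f + (a2 + a3 * f) * qp     # decoding rate 1/t
    return power / rate

kw = dict(v=1.35, ceff=4.20e-7, p_static=0.01975,
          alphas=(2.040, 1.895e-4, 2.166e-1, 7.994e-6),
          a=1.031, rmax=475.5, qmin=10.15)
e_low = decode_energy(720, r=512, **kw)
e_high = decode_energy(720, r=4096, **kw)
```

At a fixed frequency, a higher bit-rate means a lower qp, hence a lower decoding rate and a higher modeled energy per frame.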

This model describes the energy consumption in terms of the clock frequency f and the bit-rate r, in addition to the constant parameters C_eff, α0, α1, α2, α3, qmin, rmax and a. In the next section, each parameter is discussed by highlighting its relation to the architectural aspects, the system and the video properties.

6.5. Extracted Model Parameters Discussion

6.5.1. Architecture Related Parameters

The constant parameters extracted when building the performance sub-model (see equation (14)) are related to the memory hierarchy architecture. In fact, we can observe from the analytical model described in equation (14) that 1/t depends on the frequency f, on qp, and on the correlation between f and qp weighted by the coefficient α3. A non-null α3 value means that

the decoding speed-up obtained when scaling the clock frequency f depends on the qp parameter. This can be explained by two facts: 1) the memory access rate increases for high quality video (low qp value) [1, 62, 49]; 2) unlike CPU-bound instructions, the execution time of memory-bound instructions does not scale when varying the clock frequency [27], due to the memory wall problem [19, 31]. This is illustrated graphically by a twisted surface around the qp and f axes in Figure 13.

At a constant qp, 1/t varies by a factor of (α1 + α3·qp) in terms of f. This factor reaches its minimum value when qp = 0, which corresponds to lossless H.264/AVC coding. On the other hand, at a constant f, 1/t varies by a factor of (α2 + α3·f) in terms of qp. This factor reaches its minimum value when f is minimal. This can be explained by the fact that when f decreases, the difference between the processor frequency and the memory frequency decreases; consequently, the decoding time is not impacted considerably when the memory-bound instruction rate varies. Theoretically, this factor can be null (which would mean that the decoding time is independent from qp) in one of these cases: 1) the size of the cache memory is large enough to hold the entire video sequence, or 2) the processor is clocked at the same frequency as the memory. However, these configurations are not realistic. Thus, the decoding time depends on the combination of α1, α2 and α3.

The above interpretation of the extracted parameters should be verified more deeply against different memory hierarchy configurations. However, this is hard to achieve on real platforms such as those used in this study. Fortunately, some state-of-the-art architecture and power simulators, such as McPAT [51] and Sniper [15], have recently integrated support for DVFS for both in-order and out-of-order processors.
In the context of this study, these tools provide an interesting opportunity to verify the impact of the memory hierarchy configuration on the performance scaling of video decoding using DVFS. We plan to investigate this issue more deeply in future work.

6.5.2. System Related Parameters

From the profiling results in section 5.3, we noticed that the time spent by the DSP/ARM processors in the active or idle state highly depends on the decoded video resolution. In the developed energy model, this is reflected by a different value of the Ceff parameter (see Table 7). The more time the processor (GPP or DSP) spends in the idle state, the smaller its effective capacitance. However, this does not mean a higher energy efficiency. Indeed,

in the idle state, the processor does not perform any task. To increase its energy efficiency, it may use a lower clock frequency through DVFS [44] or enter deeper low-power modes using Dynamic Power Management (DPM) [43].

6.5.3. Video Related Parameters

The parameters extracted from the video complexity characterization phase (Section 3.1.1) are related only to the video. The parameter a is an exponent which controls how fast the video bit-rate changes in terms of the step size parameter. On the other hand, rmax and qmin depend on the video complexity: the more complex a video, the higher its rmax parameter and the lower its qmin [38]. Indeed, one can observe from Table 6 that the more complex the decoded video (Harbour is the most complex and City the least complex), the higher the value of rmax and the lower the value of qmin. The value of the parameter a seems to be independent from the video complexity.

7. Models Validation

We have analyzed in the previous sections the accuracy of each developed sub-model (rate, time and power). The objectives of this section are:

• To analyze the accuracy of the performance and energy analytical models (equations 15 and 16) resulting from the combination (the reverse process of the sub-model decomposition) of the sub-models (equations 1, 10 and 14).

• To investigate the generalization and validity of these models on another execution platform. We used the OMAP4460 SoC as a case study and show how the sub-model decomposition approach proposed in this study reduces the effort of building the performance and energy consumption models of video decoding for another platform.

• To provide some guidelines for online performance and energy estimation of video decoding on a given target execution platform.

7.1. Models accuracy on OMAP3530

7.1.1.
Decoding Time Model

Table 9 shows the accuracy of the performance model described in equation 15 (frames/s in terms of f and the bit-rate), obtained from the combination of the rate sub-model described in equation 10 (qpavg in terms of

Figure 14: Measured vs predicted video decoding performance (OMAP3530)

Table 9: Performance model R²

Video     Processor   qcif     cif      4cif
Harbour   ARM         0.9851   0.9840   0.9886
Harbour   DSP         0.9750   0.9789   0.9787
Soccer    ARM         0.9753   0.9813   0.9811
Soccer    DSP         0.9699   0.9701   0.9687
City      ARM         0.9749   0.9803   0.9801
City      DSP         0.9699   0.9689   0.9797

Table 10: Energy model R²

Video     Processor   qcif     cif      4cif
Harbour   ARM         0.9851   0.9840   0.9886
Harbour   DSP         0.9750   0.9789   0.9787
Soccer    ARM         0.9753   0.9813   0.9811
Soccer    DSP         0.9699   0.9701   0.9687
City      ARM         0.9749   0.9803   0.9801
City      DSP         0.9699   0.9689   0.9797

the bit-rate) and the multi-linear time sub-model described in equation 14 (frames/s in terms of f and qpavg). The calculated R² coefficients are about 98%. Figure 14 shows the comparison between the predicted performance values and the measured ones. One can highlight an important observation: the combination of the same rate model (equation 10, which depends exclusively on the video properties) with the multi-linear time sub-model (equation 14, which depends on the execution platform) allows building an accurate performance model for both the ARM and DSP processors. We will confirm this observation for the OMAP4460 platform in section 7.2.

7.1.2. Energy Model

Table 10 shows the calculated R² values of the energy model (equation 16) as compared to the measurements. They vary around 97% for almost all the video sequences, for both ARM and DSP (see Figure 15). To show the accuracy of the energy model for both ARM and DSP, we used it to predict analytically the bit-rates for which the ARM processor is more energy efficient than the DSP in the case of the cif video resolution. Figure 16 shows the surface corresponding to the Edsp − Earm function. One can observe that, for the frequency f = 720 MHz, Edsp − Earm is null for the

bit-rate 1024 kb/s. This corresponds exactly to the results of the experimental measurements shown in Figure 11 for cif decoding, where we can notice a crossing between the energy surfaces at the bit-rate 1024 kb/s.

Figure 15: Measured vs predicted energy (Harbor/cif video)

Figure 16: ARM/DSP predicted energy difference (Harbor/cif video)

Figure 17: Multi-linear regression results on Cortex A9 (OMAP4460)

Figure 18: Video decoding performance on Cortex A9 (OMAP4460), predicted vs measured
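The ARM/DSP crossover bit-rate illustrated in Figure 16 can be located numerically by scanning Edsp − Earm for a sign change. The sketch below uses purely hypothetical parameter sets (not the fitted values of this paper) to show the mechanism:

```python
import math

def energy_model(f_mhz, r_kbps, p):
    """Equation (16) with a parameter dictionary (values illustrative)."""
    qp = 4 + 6 * math.log2(p["q_min"] * (r_kbps / p["r_max"]) ** (-1.0 / p["a"]))
    fps = p["a0"] + p["a1"] * f_mhz + (p["a2"] + p["a3"] * f_mhz) * qp
    return (p["c_eff"] * p["v"] ** 2 * f_mhz + p["p_static"]) / fps

# Hypothetical parameter sets: the DSP has a higher fixed overhead, but its
# frame rate degrades less when the quality increases (i.e. qp decreases).
ARM = dict(a0=2.0, a1=0.05, a2=0.05, a3=0.002, c_eff=6e-4, v=1.2,
           p_static=0.05, q_min=10, r_max=5000, a=1.0)
DSP = dict(a0=10.0, a1=0.06, a2=0.01, a3=0.0005, c_eff=3.5e-4, v=1.2,
           p_static=0.10, q_min=10, r_max=5000, a=1.0)

def crossover_bitrate(f_mhz, r_lo=128, r_hi=5000, step=8):
    """First bit-rate (kb/s) at which the DSP becomes more energy
    efficient than the ARM, or None if no crossing is found."""
    prev = None
    for r in range(r_lo, r_hi + 1, step):
        diff = energy_model(f_mhz, r, DSP) - energy_model(f_mhz, r, ARM)
        if prev is not None and prev > 0 >= diff:
            return r
        prev = diff
    return None
```

With these illustrative parameters, the scan reproduces the qualitative behavior of Figure 16: the GPP wins at low bit-rates and the DSP wins beyond a crossover point.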

7.2. Models generalization: the OMAP4460 SoC case study

The methodology would not be complete without showing how it can be applied to another platform. To validate the methodology on another architecture, we used the OMAP4460 SoC [55] on a Pandaboard [45]. This SoC is based on a 45 nm technology (vs 65 nm for the OMAP3530) and contains a dual-core Cortex A9 processor. Each core supports four frequencies: 350 MHz, 700 MHz, 920 MHz and 1.2 GHz. During the experiments executed on this platform, only one core was activated. The study of multi-core parallel

Figure 19: Cortex A9 power consumption, fitted with P(f) = a·f^b + c (a = 3.756e-07, b = 2.055, c = 0.06283)

Figure 20: Cortex A9 energy consumption of video decoding

video decoding is out of the scope of this work. We used the same software environment on this board as on the OMAP3530: the Linux operating system and the GStreamer video decoder. The board required some instrumentation to allow a separate power measurement of the Cortex A9 processor. More details on the board instrumentation can be found in [10].
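The multi-linear time model of equation (14), which underlies the regressions performed throughout this section, can be fitted by ordinary least squares. A minimal numpy sketch on synthetic data (the "true" coefficients below are arbitrary, for illustration only, and stand in for real (f, qpavg, frames/s) measurements):

```python
import numpy as np

# Synthetic (f in MHz, qp_avg) samples standing in for real measurements,
# generated from arbitrary coefficients plus measurement noise.
rng = np.random.default_rng(0)
f = rng.uniform(300, 1200, 200)
qp = rng.uniform(20, 50, 200)
true_alpha = (2.0, 0.02, 0.15, 0.001)
fps = (true_alpha[0] + true_alpha[1] * f + true_alpha[2] * qp
       + true_alpha[3] * f * qp + rng.normal(0, 0.5, 200))

# Design matrix [1, f, qp, f*qp] for 1/t = a0 + a1*f + a2*qp + a3*f*qp
X = np.column_stack([np.ones_like(f), f, qp, f * qp])
alpha, *_ = np.linalg.lstsq(X, fps, rcond=None)

# Coefficient of determination of the fit
r2 = 1 - ((fps - X @ alpha) ** 2).sum() / ((fps - fps.mean()) ** 2).sum()
```

On real measurements the same four-column design matrix is used; only the measured frames/s vector changes from one platform to another.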

7.2.1. Decoding Time Model

A multi-linear regression using the model of equation 14 was performed on the frames/s measurements in terms of the frequency and qpavg for the OMAP4460 platform. The calculated R² coefficient was around 97% for all tested videos. Fig. 17 illustrates the accuracy of the multi-linear model. When applying the rate model constants obtained previously in section 6.1, we obtained the performance model illustrated in Figure 18 (frames/s in terms of the frequency and the bit-rate). The model accuracy is around 96% for all tested videos.

7.2.2. Energy Model

To obtain the energy model of video decoding, the average power consumption values corresponding to each frequency should be known. In the case of the OMAP4460 SoC, we were not able to decompose the power consumption model into static plus dynamic power models, as the static power values were not available in the OMAP4460 data sheet. To overcome this issue, the average power consumption (static + dynamic) of video decoding corresponding to the different frequencies was fitted with

the model a·f^b + c, as suggested in [37]. The 512 kb/s CIF video quality was used to measure the average power consumption corresponding to each frequency. As highlighted in section 6.2, in the case of ARM video decoding, the average power consumption is not impacted by the video quality; therefore, the measured values can approximate the average power consumption for the other video qualities. Figure 19 shows the results of the power model fitting.

Based on the average power model and the performance model of the previous section, the energy was calculated analytically and compared to the measured energy values, according to our methodology described in section 3.1.4. Figure 20 shows the surfaces representing the measured and the modeled energy values (mJ/frame). The accuracy of the model (R²) was 95%.

The results of this study helped in building an accurate energy model of video decoding on the Cortex A9 processor much faster. In fact, the video complexity characterization data (the a, qmin and rmax parameters) calculated in the first experimentation set (for the OMAP3530) was reused. The power consumption values of video decoding corresponding to the different processor frequencies can be measured using a simple oscilloscope. It was shown that this average value, combined with the performance model, allowed accurate energy estimation regardless of the processor fabrication technology: 45 nm for the OMAP4460 and 65 nm for the OMAP3530. Thus, the modeling methodology of this paper can easily and efficiently be reused to build accurate performance and energy models for other platforms.

7.3. Guidelines for online models building

We provide here some guidelines for online model building on a given target architecture with the help of the results obtained in this study. Figure 21 illustrates the steps and the distribution of the roles in model parameter calculation within a video system.
Since the a, qmin and rmax parameters are video dependent, one can suggest calculating their values at the encoding phase and sending them to the decoder as metadata (step (1) in Figure 21). Such approaches are the subject of active discussions in the Green Metadata standardization initiative conducted by MPEG [56]. If these video parameters are known, the video decoder can calculate qpavg for the corresponding video sequence using the rate model described in equation (10) (step (2) in Figure 21). Thus, assuming the multi-linear model described in equation 14, online performance model building becomes easier.
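Steps (1) and (2) can be sketched as follows: given the (a, qmin, rmax) metadata of a segment and a calibrated set of α coefficients, the decoder derives qpavg from the rate model and picks the lowest frequency whose predicted frame rate meets the display rate. All numeric values below are hypothetical:

```python
import math

def qp_avg(r_kbps, a, q_min, r_max):
    """Rate model (equation 10): average QP for a target bit-rate."""
    return 4 + 6 * math.log2(q_min * (r_kbps / r_max) ** (-1.0 / a))

def min_frequency(target_fps, qp, alpha, freqs_mhz):
    """Lowest available frequency whose predicted frame rate
    (equation 14) meets the display rate; None if none does."""
    a0, a1, a2, a3 = alpha
    for f in sorted(freqs_mhz):
        if a0 + a1 * f + (a2 + a3 * f) * qp >= target_fps:
            return f
    return None

# Hypothetical calibration and an OMAP4460-like frequency grid
ALPHA = (2.0, 0.02, 0.15, 0.001)
FREQS = [350, 700, 920, 1200]
qp = qp_avg(1024, a=1.0, q_min=10, r_max=5000)  # from step (1) metadata
f_optim = min_frequency(25, qp, ALPHA, FREQS)   # step (2) + DVFS policy
```

With these placeholder values, a 25 frames/s display rate is already met at the lowest frequency, while a higher target pushes the policy toward a higher operating point.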


Figure 21: Performance and energy consumption models building

For example, an adaptive linear filtering technique [58] may be used to adjust the α0, α1, α2 and α3 parameters online. The parameter adjustment (step (3) in Figure 21) is driven by the prediction error information fed back from the online performance measurement of video decoding. Once the performance model is calibrated, it can be used within a DVFS policy (step (4) in Figure 21) to set the processor frequency accordingly for future video sequences (step (5) in Figure 21). Refer to section 8.3 for a use case example.

The power consumption model is architecture dependent. The average power consumption can be measured offline and then used to estimate the energy consumption depending on the video quality. A more accurate power measurement can be done online thanks to the power consumption measurement features supported by some platforms [46].

8. Applications

We give, in this section, different ways to use the models proposed in this study. Before doing so, we present a motivational example where the developed models can be exploited.

8.1. Motivational example

To cope with network bandwidth fluctuations and heterogeneous mobile device capabilities, more and more video content providers, such as Youtube and Netflix, support adaptive video streaming. As illustrated in

Figure 22: Adaptive streaming

Figure 23: Video-quality/Energy Aware Video Decoding on SoCs

Fig. 22, in adaptive video streaming [52], a video stream S is divided into sequential and independent elementary segments S1, S2, . . . , Sn. Each segment Si represents a few seconds of video and is coded into different video qualities Q1, Q2, . . . , Qm. A video segment Si having the quality Qj is represented by a chunk Sij. For example, a two-minute video sequence may be divided into 60 segments of 2 seconds each, and each segment may be coded into 512 kb/s, 1 Mb/s, 2 Mb/s and 4 Mb/s bit-rate video qualities. The video decoder may then switch dynamically between the different video qualities according to the network bandwidth variation or its battery level.

In this context, adaptive video streaming poses a new challenge to energy aware video decoding. In fact, the video decoder should take into consideration the run-time dynamic variation of the video quality when dimensioning its processing resources to save energy.

Figure 24: Video-quality/Energy Aware Video Decoding on SoCs

8.2. Energy aware scheduling of video decoding on heterogeneous multi-core SoCs

It was highlighted in section 5.4.3 that the energy efficiency of DSP video decoding, as compared to that of the GPP, highly depends on the video quality. As illustrated in Fig. 23-a, combining the energy models developed for the GPP and the DSP may result in an optimal energy model if the low video qualities are decoded on the GPP and the high video qualities on the DSP. This is particularly interesting in the adaptive video streaming context introduced above. As illustrated in Figure 23-b, when a video decoder dynamically adapts the quality of the video, it may accordingly decide on which processor the video is decoded depending on its quality. More details on this application can be found in [9].

8.3. Video-quality aware Dynamic Voltage and Frequency Scaling

As shown in section 5.4.1, the video decoding performance highly depends on the video quality. In the context of Dynamic Voltage and Frequency Scaling, an adaptive video decoder should react to a video quality change by adjusting the processor frequency accordingly. As introduced in section 7.3, if the video related parameters a, qmin and rmax are sent for each video segment, then, using the rate model, the video decoder may calculate the qpavg corresponding to the selected chunk. This avoids parsing all the frames of the video chunk to extract the quantization parameter and calculate its average value. Thus, as illustrated in Figure 24, assuming the multi-linear performance model (frames/s in terms of f and qpavg), the optimal processor clock frequency foptim allowing a decoding speed as close as possible to the displaying rate can be calculated proactively for a given video quality.

One can highlight that the performance model predicts the average decoding performance. To decouple the constant frames displaying

speed from the frame-to-frame workload variation, a buffer may be inserted between the decoder and the displaying device, as suggested in [24].

8.4. Video-quality/Energy Trade-off

One of the advantages of the energy model described in equation (16) is that it separates the architecture/system related parameters from the video parameters. This makes it possible to evaluate, for a given video, the impact of a quality change on the consumed energy by changing the value of the bit-rate only. If we evaluate the video quality corresponding to each bit-rate (using PSNR or any QoE metric), we can select the video-quality/energy trade-off for a given energy-constrained configuration. This can be used, for example, in a multi-objective (video-quality maximization, energy minimization) video decoding technique [26].

9. Related works

We present related research in two categories: first, we describe some relevant works on energy consumption at the architectural level, then we present different methodologies for energy consumption estimation and modeling, in general and in the particular case of video decoding.

9.1. Impact of processor architecture on energy consumption

At the architectural level, the advantages, in terms of performance and energy consumption, of H.264/AVC video decoding using ASICs (Application Specific Integrated Circuits) are highlighted in [61]. In the same way, a more general study [25] investigates the reasons for the energy inefficiency of GPPs and proposes guidelines to reduce the energy gap compared to ASICs. However, video standards evolve quickly, and ASICs do not provide the flexibility to adapt to those changes [50]. For example, hardware accelerators for the new MPEG HEVC (High Efficiency Video Coding) standard were still not available on mobile devices at the time of writing. DSP-based solutions aim to reconcile the flexibility of GPPs with the energy efficiency of ASICs.
In [18, 39], the authors focus on the performance and energy efficiency of DSPs due to the use of pipelining and parallelism in CMOS circuits. The benefit of using them in energy constrained mobile devices is highlighted in [57], especially for video decoding [33]. Despite the great advances achieved in the architecture of GPPs and DSPs, their performance tends to stall due to the frequency and power wall

limitation [11]. To overcome these barriers, modern architectures use SoCs integrating more and more heterogeneous processor cores. For example, the latest big.LITTLE ARM architecture contains four Cortex A7 and four Cortex A15 processors [3]. Many studies have investigated the impact of task scheduling on multi-core architectures for energy efficiency [48, 8]. In the case of video decoding, [59] proposes a slice-based parallel H.264/AVC decoding on a multi-core processor, where the frequency of each core is set according to the expected slice workload. However, this requires a fine-grained (slice level) workload complexity model to adjust the core frequency accordingly. In [5], the authors propose a task-level scheduling strategy for multimedia applications running on multi-core processors. Their idea consists in assigning the lowest frequency to all the cores; the tasks are then load-balanced on the cores according to their loads, and if all the cores are at full load, the frequency of one of them is increased so that it can run more tasks. The experimental tests considered a set of heterogeneous workloads including video decoding and encoding and audio decoding.

In this paper, we analyze the energy efficiency at task level (the decoding of a short video sequence) but on a heterogeneous SoC including GPP and DSP processors. Moreover, we take into consideration the video quality parameter and propose, according to the experimental results, guidelines for energy aware scheduling of video decoding on GPP and DSP in the case of adaptive video decoding.

9.2. Energy consumption modeling and estimation

To evaluate the energy consumption of embedded systems including complex microprocessors, operating systems and applications, two major approaches are used: simulation and empirical modeling based on measurements. Simulators make it possible to estimate the energy consumption of a running application without using physical measurement tools.
For example, Wattch [12], based on the SimpleScalar processor simulator framework [4], uses a suite of parameterizable power models for different hardware structures. It is based on per-cycle resource usage counts generated through cycle-accurate simulations. In the same way, and more recently, the McPAT simulator [35] was proposed to estimate the energy consumption of multi-core architectures. These simulation tools allow hardware architects to explore the energy efficiency of processor architectures early, by testing the impact of different hardware configurations. However, they are hard to build and require a very

deep knowledge of the targeted microprocessor's micro-architecture [28]. For example, as far as we know, there is no energy simulation framework which supports DSPs. Moreover, the role of simulation consists in estimating the energy at run time; it does not allow building energy models that can predict the energy consumption in terms of higher-level application parameters. On the other hand, empirical modeling approaches make it possible to build energy models based on physical power measurements. They can be implemented rapidly on any available platform using power measurement tools, which allows focusing the effort on building high-level models according to system and application parameters.

In [29], H.264/AVC decoding performance is characterized on different GPP architectures at cycle level. Based on this approach, a cycle-accurate model of the energy consumption of the different H.264/AVC decoder phases (entropy decoding, dequantization and inverse transform, motion compensation, and deblocking) is proposed in [37]. The obtained model was used to develop an energy-aware video decoding policy for an ARM processor supporting the DVFS feature. In [36], the authors proposed an ARM processor energy model considering the variation of the video bit-rate. The obtained model was used to develop an energy-aware video decoder for scalable video coding (H.264/SVC) [47]. However, this model does not consider a realistic memory system: it assumes a linear and uniform scaling of the decoding performance when varying the processor frequency, which does not hold due to the off-chip memory access latency.

The impact of off-chip memory access on performance scalability using DVFS was addressed in some other studies. For example, in [21, 20], the authors propose an online performance model for video decoding. First, they observe that, in MPEG video decoding applications, the execution time of the memory-bound instructions tends to be constant from one frame to another.
They propose to separate the frame decoding time into two parts: a frame-dependent part and a frame-independent part. The frame-independent part remains constant regardless of the frame type and tends to correspond to the memory-bound instructions, while the frame-dependent part corresponds to the CPU-bound instructions. The amount of executed memory-bound instructions is calculated based on the number of level-2 cache misses provided by a processor event counter, and the execution time of the CPU-bound instructions is estimated using a moving average filter. The combination of these two pieces of information is used to estimate the frame decoding time and to adjust the processor frequency accordingly. Such approaches rely on the existence of hardware counters in the execution platform.
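Schematically, this estimation scheme combines a frequency-independent memory part, derived from cache-miss counters, with a frequency-scaled CPU part smoothed by a moving average. A sketch with illustrative constants (the latency and window values are assumptions, not taken from [21, 20]):

```python
def estimate_frame_time(l2_misses, cpu_cycles_history, f_hz,
                        mem_latency_s=60e-9, window=8):
    """Next-frame decoding time as a frequency-independent memory part
    plus a frequency-scaled CPU part (all constants are illustrative)."""
    # Memory-bound part: off-chip accesses do not scale with the core clock
    t_mem = l2_misses * mem_latency_s
    # CPU-bound part: moving average of recent per-frame CPU cycle counts
    recent = cpu_cycles_history[-window:]
    t_cpu = (sum(recent) / len(recent)) / f_hz
    return t_mem + t_cpu
```

Doubling the clock frequency in such a model halves only the CPU part; the memory part bounds the achievable speed-up, which is the memory wall effect discussed in section 6.5.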

Moreover, they are based on online monitored information, and the proposed models do not consider the impact of the video quality on performance scaling. High-level empirical models obtained using measured energy data can provide good estimation precision but are hard to apprehend: even if this type of model is useful for performance comparison, it fails to describe the relation between the energy consumption behavior and the architecture or video related parameters. In this paper, we have proposed a modeling methodology which allowed us to build a comprehensive energy model achieving a balance between an abstract high-level model and a lower-level one. As far as we know, no previous work proposed an approach taking into consideration all the studied parameters.

10. Conclusion

This paper presented an end-to-end approach for the performance and energy consumption characterization and modeling of GPP and DSP video decoding. The characterization part of this study made it possible to understand the impact of the architectural and system design on the overall performance and energy consumption when using different video qualities. Indeed, the obtained results revealed that the best performance-energy trade-off highly depends on the required video bit-rate and resolution. For instance, the GPP can be the best choice in many cases due to a significant overhead in DSP decoding, which may represent 30% of the total decoding energy in some cases.

Using a sub-model decomposition approach, the performance and energy consumption results obtained in the characterization phase served to build an analytical model in terms of the video bit-rate and clock frequency variable parameters, in addition to a set of comprehensive constant parameters related to the video complexity and the underlying architecture. The developed model is very accurate (97% R-squared value) for both GPP and DSP video decoding.
Moreover, it was shown that the combination of these different sub-models, the reverse process of the sub-model decomposition, allows building accurate high-level performance and energy models for video decoding. This result was used to provide a fast energy model building methodology which can be generalized to a given target architecture.


11. Future Works

As future work, we plan to investigate the main issues listed below.

Firstly, the analysis of the energy consumption of other codecs: in this paper, we focused on H.264/AVC; however, the methodology can be extended to H.265 (HEVC), the successor of H.264/AVC. In fact, although H.265 has introduced a lot of changes in the internal coding algorithms, it is still based on the same principles used in most of the MPEG codec family. The high-level video related parameters used in this study (QP, bit-rate, step size) are still valid in the H.265 standard. Thus, the characterization and modeling methodology can be executed without any changes, since it is independent from the codec internal details. However, the regression analysis may lead to different analytical forms of the rate and performance models. This will be the focus of our future investigations.

Secondly, a deeper validation of the proposed performance model using architecture simulation: as discussed in section 6.5, we plan to explore the performance scaling issue in the context of DVFS and investigate the impact of the memory hierarchy configuration using simulation frameworks.

Thirdly, the online performance model calibration: section 7.3 provided some guidelines for online performance model calibration. We plan to investigate the appropriate techniques (adaptive filtering, neural network based regression, etc.) allowing the video decoder to build the performance model that best fits the underlying execution architecture.

Acknowledgment

This work was supported by BPI France, Région Ile-de-France, Région Bretagne and Rennes Métropole through the French project GreenVideo.

References

[1] Alvarez, M., Salami, E., Ramirez, A., Valero, M., Oct 2005. A performance characterization of high definition digital video decoding using H.264/AVC. In: Workload Characterization Symposium, 2005. Proceedings of the IEEE International. pp. 24–33.

[2] ARM, 2012.
The ARM NEON general-purpose SIMD. http://www.arm.com/products/processors/technologies/neon.php.


[3] ARM, 2014. big.LITTLE processing. http://www.arm.com/products/processors/technologies/biglittleprocessing.php.

[4] Austin, T., Larson, E., Ernst, D., 2002. SimpleScalar: an infrastructure for computer system modeling. Computer 35 (2), 59–67.

[5] Bautista, D., Sahuquillo, J., Hassan, H., Petit, S., Duato, J., April 2008. A simple power-aware scheduling for multicore systems when running real-time applications. In: Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on. pp. 1–7.

[6] Benmoussa, Y., Boukhobza, J., Senn, E., Benazzouz, D., 2013. Energy consumption modeling of H.264/AVC video decoding for GPP and DSP. In: Proceedings of the 16th Euromicro Conference on Digital System Design.

[7] Benmoussa, Y., Boukhobza, J., Senn, E., Benazzouz, D., 2013. GPP vs DSP: A performance/energy characterization and evaluation of video decoding. In: Proceedings of the IEEE 21st International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems.

[8] Benmoussa, Y., Boukhobza, J., Senn, E., Benazzouz, D., 2015. On the energy efficiency of parallel multi-core vs hardware accelerated HD video decoding. SIGBED Rev.

[9] Benmoussa, Y., Boukhobza, J., Senn, E., Hadjadj-Aoul, Y., Benazzouz, D., Feb. 2014. Dyps: Dynamic processor switching for energy-aware video decoding on multi-core SoCs. SIGBED Rev. 11 (1), 56–61.

[10] Benmoussa, Y., Senn, E., Boukhobza, J., Lanoe, M., Benazzouz, D., 2014. Open-PEOPLE, a collaborative platform for remote & accurate measurement and evaluation of embedded systems power consumption. In: Proceedings of the IEEE 22nd International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems.

[11] Borkar, S., 2007. Thousand core chips: A technology perspective. In: Proceedings of the 44th Annual Design Automation Conference. DAC '07. ACM, New York, NY, USA, pp. 746–749.


[12] Brooks, D., Tiwari, V., Martonosi, M., 2000. Wattch: a framework for architectural-level power analysis and optimizations. In: Computer Architecture, 2000. Proceedings of the 27th International Symposium on. pp. 83–94.
[13] Broussely, M., Archdale, G., Oct. 2004. Li-ion batteries and portable power source prospects for the next 5-10 years. Journal of Power Sources 136 (2), 386–394.
[14] Burd, T., Brodersen, R., 1995. Energy efficient CMOS microprocessor design. System Sciences, Proceedings of the Twenty-Eighth Hawaii International Conference on 1, 288–297.
[15] Carlson, T. E., Heirman, W., Eeckhout, L., 2011. Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. SC '11. ACM, New York, NY, USA, pp. 52:1–52:12. URL: http://doi.acm.org/10.1145/2063384.2063454.
[16] Carroll, A., Heiser, G., 2010. An analysis of power consumption in a smartphone. Proceedings of the USENIX Annual Technical Conference, 21–28.
[17] Carroll, A., Heiser, G., 2013. The systems hacker's guide to the galaxy: energy usage in a modern smartphone. In: Proceedings of the 4th Asia-Pacific Workshop on Systems. APSys '13. ACM, New York, NY, USA, pp. 5:1–5:7.
[18] Chandrakasan, A., Sheng, S., Brodersen, R., Apr. 1992. Low-power CMOS digital design. IEEE Journal of Solid-State Circuits 27 (4), 473–484.
[19] Choi, J., Cha, H., March 2006. Memory-aware dynamic voltage scaling for multimedia applications. Computers and Digital Techniques, IEE Proceedings 153 (2), 130–136.
[20] Choi, J., Cha, H., March 2006. Memory-aware dynamic voltage scaling for multimedia applications. Computers and Digital Techniques, IEE Proceedings 153 (2), 130–136.

[21] Choi, K., Soma, R., Pedram, M., Jan 2005. Fine-grained dynamic voltage and frequency scaling for precise energy and performance trade-off based on the ratio of off-chip access to on-chip computation times. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 24 (1), 18–28.
[22] Cisco, 2013. Cisco visual networking index: Global mobile data traffic forecast update. http://bit.ly/bwGY7L.
[23] Darling, D., Maupin, C., Singh, B., 2009. GStreamer on Texas Instruments OMAP35x processors. Proceedings of the Ottawa Linux Symposium, 69–78.
[24] Gutnik, V., Chandrakasan, A., 1997. Embedded power supply for low-power DSP. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 5 (4), 425–435.
[25] Hameed, R., Qadeer, W., Wachs, M., Azizi, O., Solomatnikov, A., Lee, B. C., Richardson, S., Kozyrakis, C., Horowitz, M., Jun. 2010. Understanding sources of inefficiency in general-purpose chips. SIGARCH Comput. Archit. News 38 (3), 37–47.
[26] He, Z., Liang, Y., Chen, L., Ahmad, I., Wu, D., 2005. Power-rate-distortion analysis for wireless video communication under energy constraints. Circuits and Systems for Video Technology, IEEE Transactions on 15 (5), 645–658.
[27] Holliman, M. J., Li, E. Q., Chen, Y.-K., 2003. MPEG decoding workload characterization. Proceedings of the Workshop on Computer Architecture Evaluation Using Commercial Workloads.
[28] Hong, S., Kim, H., Jun. 2010. An integrated GPU power and performance model. SIGARCH Comput. Archit. News 38 (3), 280–289.
[29] Horowitz, M., Joch, A., Kossentini, F., Hallapuro, A., 2003. H.264/AVC baseline profile decoder complexity analysis. Circuits and Systems for Video Technology, IEEE Transactions on 13 (7), 704–716.
[30] International Technology Roadmap for Semiconductors, 2012. Design. http://www.itrs.net/Links/2012ITRS/2012Chapters/2012Overview.pdf.


[31] Keramidas, G., Spiliopoulos, V., Kaxiras, S., 2010. Interval-based models for run-time DVFS orchestration in superscalar processors. In: Proceedings of the 7th ACM International Conference on Computing Frontiers. CF '10. ACM, New York, NY, USA, pp. 287–296.
[32] Kim, N., Austin, T., Baauw, D., Mudge, T., Flautner, K., Hu, J., Irwin, M., Kandemir, M., Narayanan, V., 2003. Leakage current: Moore's law meets static power. Computer 36 (12), 68–75.
[33] Kim, S., Papaefthymiou, M. C., 2000. Reconfigurable low energy multiplier for multimedia system design. In: VLSI, 2000. Proceedings. IEEE Computer Society Workshop on. IEEE, pp. 129–134.
[34] Levon, J., Elie, P., 2004. OProfile: A system profiler for Linux. http://oprofile.sf.net.
[35] Li, S., Ahn, J.-H., Strong, R., Brockman, J., Tullsen, D., Jouppi, N., 2009. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In: Microarchitecture, 2009. MICRO-42. 42nd Annual IEEE/ACM International Symposium on. pp. 469–480.
[36] Li, X., Ma, Z., Fernandes, F., 2012. Modeling power consumption for video decoding on mobile platform and its application to power-rate constrained streaming. Visual Communications and Image Processing (VCIP), 2012 IEEE, 1–6.
[37] Ma, Z., Hu, H., Wang, Y., Dec. 2011. On complexity modeling of H.264/AVC video decoding and its application for energy efficient decoding. IEEE Transactions on Multimedia 13 (6), 1240–1255.
[38] Ma, Z., Xu, M., Ou, Y.-F., Wang, Y., May 2012. Modeling of rate and perceptual quality of compressed video as functions of frame rate and quantization step-size and its applications. IEEE Transactions on Circuits and Systems for Video Technology 22 (5), 671–682.
[39] Markovic, D., Stojanovic, V., Nikolic, B., Horowitz, M., Brodersen, R., 2004. Methods for true energy-performance optimization. Solid-State Circuits, IEEE Journal of 39 (8), 1282–1293.


[40] Merritt, L., Vanam, R., 2006. x264: A high performance H.264/AVC encoder.
[41] Moiseev, K., Kolodny, A., Wimer, S., Oct. 2008. Timing-aware power-optimal ordering of signals. ACM Trans. Des. Autom. Electron. Syst. 13 (4), 65:1–65:17.
[42] OOYALA, 2013. Ooyala Global Video Index Q2 2013. http://go.ooyala.com/wf-video-index-q2-2013.html.

[43] Pallipadi, V., 2007. cpuidle - do nothing, efficiently... Proceedings of the Ottawa Linux Symposium.
[44] Pallipadi, V., Starikovskiy, A., 2006. The ondemand governor: past, present and future. Proceedings of the Linux Symposium, 223–238.
[45] PandaBoard Project, 2013. PandaBoard. http://www.pandaboard.org.
[46] Pathania, A., Jiao, Q., Prakash, A., Mitra, T., 2014. Integrated CPU-GPU power management for 3D mobile games. In: Proceedings of the 51st Annual Design Automation Conference. ACM, pp. 1–6.
[47] Schwarz, H., Marpe, D., Wiegand, T., 2007. Overview of the scalable video coding extension of the H.264/AVC standard. IEEE Transactions on Circuits and Systems for Video Technology, 1103–1120.
[48] Shao, Z., Zhuge, Q., Xue, C., Sha, E. H.-M., 2005. Efficient assignment and scheduling for heterogeneous DSP systems. Parallel and Distributed Systems, IEEE Transactions on 16 (6), 516–525.
[49] Slingerland, N. T., Smith, A. J., 2001. Cache performance for multimedia applications. In: Proceedings of the 15th International Conference on Supercomputing. ICS '01. ACM, New York, NY, USA, pp. 204–217.
[50] Smit, G. J., Kokkeler, A. B., Wolkotte, P. T., van de Burgwal, M. D., 2008. Multi-core architectures and streaming applications. In: Proceedings of the 2008 International Workshop on System Level Interconnect Prediction. SLIP '08. ACM, New York, NY, USA, pp. 35–42.


[51] Spiliopoulos, V., Bagdia, A., Hansson, A., Aldworth, P., Kaxiras, S., Aug 2013. Introducing DVFS-management in a full-system simulator. In: Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 2013 IEEE 21st International Symposium on. pp. 535–545.
[52] Stockhammer, T., 2011. Dynamic adaptive streaming over HTTP: standards and design principles. Proceedings of the Second Annual ACM Conference on Multimedia Systems, 133–144.
[53] Texas Instruments, 2010. Codec Engine Overhead. http://processors.wiki.ti.com/index.php/Codec_Engine_Overhead.
[54] Texas Instruments, 2012. OMAP3530 Power Estimation Spreadsheet. http://processors.wiki.ti.com/index.php/OMAP3530_Power_Estimation_Spreadsheet.
[55] Texas Instruments, 2012. OMAP4 mobile applications platform. http://www.ti.com/lit/ml/swpt034b/swpt034b.pdf.
[56] The Moving Picture Experts Group, 2014. MPEG systems technologies part 11: Energy-efficient media consumption (green metadata). http://mpeg.chiariglione.org/sites/default/files/files/standards/parts/docs/w14344-v2-w14344.zip.
[57] Wang, A., Chandrakasan, A., 2002. Energy-efficient DSPs for wireless sensor networks. Signal Processing Magazine, IEEE 19 (4), 68–78.
[58] Wang, X., Ma, K., Wang, Y., 2011. Adaptive power control with online model estimation for chip multiprocessors. Parallel and Distributed Systems, IEEE Transactions on 22 (10), 1681–1696.
[59] Wei, Y.-H., Yang, C.-Y., Kuo, T.-W., Hung, S.-H., Chu, Y.-H., 2010. Energy-efficient real-time scheduling of multimedia tasks on multi-core processors. In: Proceedings of the 2010 ACM Symposium on Applied Computing. SAC '10. ACM, New York, NY, USA, pp. 258–262.
[60] Wiegand, T., Sullivan, G., Bjontegaard, G., Luthra, A., Jul. 2003. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology 13 (7), 560–576.

[61] Xu, K., Liu, T.-M., Guo, J.-I., Choy, C.-S., 2010. Methods for power/throughput/area optimization of H.264/AVC decoding. Journal of Signal Processing Systems 60 (1), 131–145.
[62] Xu, Z., Sohoni, S., Min, R., Hu, Y., Jan 2004. An analysis of cache performance of multimedia applications. Computers, IEEE Transactions on 53 (1), 20–38.


A Methodology for Performance/Energy Consumption Characterization and Modeling of Video Decoding on Heterogeneous SoC and its Applications. Dec 2, 2014.
