Abstract

Energy is a scarce resource in real-time embedded systems because most of them run on batteries. Designers must therefore ensure that energy constraints are satisfied in addition to deadline constraints. This requires considering how interference on shared, low-level hardware resources such as the cache affects the worst-case energy consumption of tasks. Toward this aim, this article proposes a fine-grained approach to analyze bank-level interference (bank conflict and bus access interference) on real-time multicore systems, which can reasonably estimate runtime interference in the shared cache and yield a tighter worst-case energy consumption bound. In addition, we develop a bank-to-core mapping algorithm for reducing bank-level interference and improving the worst-case energy consumption. The experimental results demonstrate that our approach improves the tightness of the worst-case energy consumption by 14.25% on average compared to the upper-bound delay approach. The bank-to-core mapping reduces the worst-case energy consumption by 7.23%.

Keywords

Real-time systems, worst-case execution time, worst-case energy consumption

Introduction

Real-time embedded systems are becoming widespread, ranging from sensor networks, Internet of Things (IoT) systems,^1,2 and surveillance systems to satellite subsystems. For real-time embedded systems, energy consumption is an important design issue, since most of them operate on batteries or draw energy from limited sources. Several authors have argued that when energy plays an important role, it should also become a key factor in scheduling decisions.^3–5 That is, in addition to ensuring the deadline constraints, designers must also consider whether enough energy is available in the system for a task to complete execution. As a consequence, besides bounding the worst-case execution time (WCET) of a task, designers need to analyze its worst-case energy consumption (WCEC) to avoid potential system failures due to inadequate energy supply at runtime.

Currently, real-time systems are increasingly moving toward multicore architectures. To mitigate the high latency of off-chip memory, multicore architectures are usually equipped with on-chip caches. Caches can significantly improve performance, but their energy consumption is a concern. Several studies^6,7 report that cache energy consumption accounts for up to 50% of the overall chip energy due to the large on-chip area and high access frequency of caches. Clearly, caches are good candidates for energy optimization. Various techniques have been proposed over the years to reduce the energy consumption of caches; however, in many of these works,^8,9 the cache energy models are not tailored to the worst case, and none of them considers bank-level interference.

In a multicore architecture, the shared cache consists of multiple banks, and cache requests to different banks can be serviced in parallel. However, a bank can handle only one cache request at a time. When two or more cache requests try to access the same bank at the same time, a bank conflict occurs. Bank conflicts complicate system behavior, making WCET analysis difficult and wasting a significant amount of energy. Therefore, their influence on WCET and WCEC has to be taken into account to ensure system safety. To the best of our knowledge, only the works of Paolieri et al.¹⁰ and Yoon et al.¹¹ considered bank conflict in WCET estimation. Both employ the upper-bound delay (UBD) approach to estimate the interference delay: a potential maximum delay (i.e. the UBD) that each cache request can suffer is bounded, and this delay is then added to each request during WCET analysis. However, not all requests suffer from bank conflict, and even when bank conflicts occur among a group of requests, the bank conflict delay suffered by each request is different. The UBD approach therefore yields not only a pessimistic WCET estimate but also a conservative over-approximation of the WCEC of the task.

In this article, we investigate the impact of bank-level interference on WCEC for real-time multicore systems where the shared cache with multiple banks is used to improve performance. We assume that the target real-time embedded system is a hard real-time system where the deadline constraint of each hard real-time task (HRT) must be met. We make the following major contributions.

We model the WCEC of shared cache from the perspective of HRTs and analyze the WCEC of the HRT.

We present a fine-grained approach to analyze bank-level interference based on request timing, which improves the tightness of the WCEC. In our approach, we assume that access to the shared cache is granted by an Interference-Aware Bus Arbiter (IABA).¹⁰

We apply bank-to-core mapping to optimize interference delay and develop an algorithm for finding the best bank-to-core mapping according to the queue of cores, such that the impact of bank-level interference on the WCEC is minimized.

The rest of this article is organized as follows. Section “Related work” reviews the related work on the energy-optimization techniques of cache and bank-level interference analysis. Section “System model” introduces the system model, task model, and cache energy model. In section “Bank-level interference analysis and WCEC computation,” we analyze the delay of bank-level interference suffered by a HRT. In section “Bank mapping optimization for WCEC reduction,” we design an algorithm for finding the best bank-to-core mapping. Section “Evaluation” presents experimental results. Finally, we conclude this article in section “Conclusion and future work.”

Related work

In the existing works, various techniques have been proposed to reduce inter-core interference and optimize the energy consumption of the cache. Most of these techniques aim to improve the average case energy consumption of the cache. Furthermore, they mainly focus on the effect of cache storage interference on energy consumption. Here, we analyze some of the works that we have separated into two different categories: (1) works that focus on cache reconfiguration and (2) works that focus on cache partitioning.

Reconfigurable cache

Several reconfigurable caches^7,13 have been proposed for performance improvement, energy saving, and contention reduction. Zhang et al.⁷ proposed a highly reconfigurable cache architecture, in which cache characteristics (such as cache way, block size, and associativity) can be tuned via hardware configuration registers. This cache architecture can achieve up to 40% energy saving, but it cannot guarantee strict cache isolation among real-time applications. In Hajimiri et al.,¹⁴ an inter-task dynamic cache reconfiguration (DCR) technique was proposed to reduce contention and optimize the energy consumption of the cache in real-time systems. In Mittal et al.,¹⁵ a multicore cache energy saving technique using dynamic cache reconfiguration was proposed to save cache energy by periodically allocating a suitable amount of cache space to each running program.

Cache partitioning

Cache partitioning technique partitions the shared cache into separate regions and designates one or a few regions to individual cores, which can fully eliminate the cache storage interference. Many research works have been done to reduce the interference and optimize energy consumption by cache partitioning. Qureshi and Patt¹⁶ presented a low-overhead cache partitioning technique based on online monitoring and cache utilization of each application. By leveraging configurable cache architecture, authors¹⁷ proposed a technique to eliminate inter-task cache storage interference and optimize cache energy. Suhendra and Mitra¹⁸ proposed the use of shared cache in a predictable manner through a combination of locking and partitioning mechanisms, and explored possible design choices and evaluated their effects on the worst-case application performance.

However, cache partitioning only eliminates cache storage interference and cannot avoid bank conflict. Paolieri et al.¹⁰ partitioned the shared cache using bank-level partitioning or column-level partitioning. In bank-level partitioning, each task is assigned a private bank, and this partitioning requires as many banks as the number of tasks in the system. In column-level partitioning, the shared cache is partitioned into columns, and a subset of the total number of columns is allocated exclusively to each task, but tasks still experience bank conflict when accessing the same bank at the same time. In this case, they bound the UBD of the bank conflict to compute interference delays. Yoon et al.¹¹ proposed harmonic round-robin bus arbitration and a bank-column cache partitioning scheme, in which bank conflict can be limited to one bus round by optimizing the allocation of bus slots among cores. These approaches take the effect of bank conflict delay into consideration by adding a potential maximum delay to the time that a request takes to access the L2 cache. In fact, not all requests suffer from inter-core interference. Even when inter-core interference occurs among a group of requests, the interference delay suffered by each request differs, depending on the interference state. Approaches based on the maximum delay therefore lead to significant overestimation of the WCET, which has a negative impact on the schedulability analysis, performance checking, and WCEC of the task.

Unlike the existing works, we investigate the impact of bank-level interference on the WCEC. Such work is fundamental to establishing tighter WCEC bounds and guaranteeing that energy constraints are met.

System model

In this section, we introduce the basics of the system model, which is composed of architecture model and task model, and then we present the cache energy model from the perspective of HRTs.

Architecture model

Our architecture model assumes the real-time multicore architecture shown in Figure 1, which consists of $N_{core}$ homogeneous cores, $C = \{C_{1}, C_{2}, \dots, C_{N_{core}}\}$. Each core has its own private L1 instruction cache (IL1) and L1 data cache (DL1). All the cores share an L2 combined cache $B$, which is partitioned into $N_{bank}$ banks $\{B_{1}, B_{2}, \dots, B_{N_{bank}}\}$, and each bank is subdivided into $N_{column}$ columns. That is, the shared L2 cache has a total of $N_{bank} \cdot N_{column}$ columns. The bank access latency is $L_{M}$ cycles (the same for read and write operations and for all banks). The real-time shared bus connecting the cores and the shared L2 cache adopts the IABA.¹⁰ The IABA is composed of one inter-core bus arbiter (XCBA), which schedules among requests from different cores in a round-robin fashion, and several intra-core bus arbiters (ICBAs), one per core, each of which schedules among requests from the same core under a first in, first out (FIFO) policy. The bus access latency is $L_{B}$ cycles.

Figure 1.

A real-time multicore architecture.

In this multicore architecture, when tasks running on different cores send requests to access the shared bus and shared cache, each request is handled by its corresponding ICBA, which selects the next request to be sent to the XCBA. The XCBA then decides which of the requests from different cores accesses the bus, so as to avoid inter-core bus conflict and bank conflict.

Real-time task model

This article considers a periodic task model with a real-time task set $T = \{HRT_{1}, HRT_{2}, \dots, HRT_{N_{task}}\}$ comprising $N_{task}$ ($\leq N_{core}$) HRTs. Each $HRT_{i}$ has a deadline $D_{i}$ and a period $P_{i}$. The task set hyper-period, denoted $H$, is the least common multiple of the periods of all HRTs in $T$. We assume the task-to-core mapping is given at design time and task migration is not allowed.
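As a small illustration of the hyper-period definition, the sketch below computes $H$ as the least common multiple of the task periods and the number of job instances each task releases within $H$. The function names and the example periods are hypothetical and chosen only for illustration.

```python
from functools import reduce
from math import gcd


def hyper_period(periods):
    """Least common multiple of all task periods (the hyper-period H)."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods, 1)


def jobs_per_hyper_period(periods):
    """Number of job instances each task releases within one hyper-period."""
    h = hyper_period(periods)
    return [h // p for p in periods]


periods = [20, 50, 100]                # hypothetical periods P_i, in cycles
print(hyper_period(periods))           # 100
print(jobs_per_hyper_period(periods))  # [5, 2, 1]
```

Each task $HRT_{i}$ thus contributes $H / P_{i}$ job instances to the analysis of one hyper-period.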

In bank-column cache partitioning, each HRT is assigned a subset of the columns of the shared L2 cache; the columns allocated to a HRT are not accessed by any other HRT and remain the same throughout system execution. This column-level cache partitioning avoids cache storage interference, so this article focuses mainly on bank conflict. We use $M$ to denote the set of all potential bank-to-core mappings. The $j$th mapping in $M$ is denoted by $Map_{j}$.

The WCET of a task in a multicore architecture can be divided into two parts:¹⁰ fixed execution time (also referred to as the single-core bound) and interference delay. The former is the maximum time that a task could take to execute the instructions over its critical path; it is analyzed in isolation and is not affected by other tasks. The latter is the sum of the delays incurred by its cache accesses over the same path. For a multicore architecture adopting the IABA, the interference delay consists of the conflict delay (i.e. bus access delay and bank conflict delay) of cache accesses in the XCBA and the bus waiting time of cache accesses in the ICBA. Let $DX_{i}$ be the total conflict delay of $HRT_{i}$ in the XCBA and $DI_{i}$ be the total bus waiting time of $HRT_{i}$ in the ICBA. Then, the WCET of $HRT_{i}$ under bank-to-core mapping $Map_{j}$ can be expressed as follows

WCET_{i}(Map_{j}) = \overbrace{{WCET}_{fixed}^{i}}^{\text{single-core bound}} + \underbrace{DX_{i}(Map_{j}) + DI_{i}(Map_{j})}_{\text{interference delay}}

(1)

where ${WCET}_{fixed}^{i}$ is the fixed execution time of $HRT_{i}$, which can be computed by well-known WCET analysis techniques¹⁹ and is the same for all job instances of $HRT_{i}$. In contrast, the delays $DX_{i}(Map_{j})$ and $DI_{i}(Map_{j})$ are not the same for different job instances of $HRT_{i}$.

Let $QR_{i}$ be the sequence of requests issued by $HRT_{i}$. The L2 cache hits and misses in $QR_{i}$ can be profiled using a static timing analysis tool, such as RapiTime²⁰ or Chronos.²¹ Let ${QR}_{L2hit}^{i} (\subseteq QR_{i})$ and ${QR}_{L2miss}^{i} (\subseteq QR_{i})$ be the sequences of L2 cache hits and misses, respectively. Let $dxcba_{j}$ denote the conflict delay suffered by request $q_{j} (\in {QR}_{L2hit}^{i})$ in the XCBA, and $dicba_{j}$ the waiting time of request $q_{j}$ in the ICBA. The total conflict delay ($DX_{i}$) and total waiting time ($DI_{i}$) can be expressed as follows

DX_{i}(Map_{j}) = \sum_{\forall q_{j} \in {QR}_{L2hit}^{i}} dxcba_{j}

(2)

DI_{i}(Map_{j}) = \sum_{\forall q_{j} \in {QR}_{L2hit}^{i}} dicba_{j}

(3)

Cache energy model

The WCEC of L2 cache can be expressed as follows

WCEC_{L2} = \sum_{\forall HRT_{i} \in T} E_{L2}^{i}(Map_{j})

(4)

where $E_{L2}^{i}(Map_{j})$ is the L2 cache energy consumed by $HRT_{i}$ when the bank-to-core mapping is $Map_{j}$. The energy dissipation of the L2 cache comprises dynamic energy and static energy,⁶ so $E_{L2}^{i}(Map_{j})$ can be expressed as

E_{L2}^{i}(Map_{j}) = E_{dyn}^{i} + E_{sta}^{i}(Map_{j})

(5)

In equation (5), $E_{dyn}^{i}$ and $E_{sta}^{i}(Map_{j})$ are the dynamic energy and static energy of the L2 cache consumed by $HRT_{i}$, respectively. The dynamic energy dissipation $E_{dyn}^{i}$ originates from cache hits and cache misses

E_{dyn}^{i} = N_{hit}^{i} \cdot E_{hit} + N_{miss}^{i} \cdot E_{miss}

(6)

where $N_{hit}^{i}$ and $N_{miss}^{i}$ are the numbers of L2 cache hits and misses of $HRT_{i}$, respectively. $E_{hit}$ represents the cache access energy of an L2 cache hit. $E_{miss}$ denotes the energy dissipation of an L2 cache miss and is calculated as

E_{miss} = E_{memaccess} + E_{block\_fill}

(7)

where $E_{memaccess}$ is the energy dissipation for accessing the off-chip memory and $E_{block\_fill}$ is the energy dissipation for filling the fetched data into the L2 cache. The static energy of the L2 cache consumed by $HRT_{i}$ can be calculated as

E_{sta}^{i}(Map_{j}) = E_{fixed}^{i} + E_{delay}^{i}(Map_{j})

(8)

E_{fixed}^{i} = P_{sta}^{i} \cdot {WCET}_{fixed}^{i}

(9)

E_{delay}^{i}(Map_{j}) = P_{sta}^{i} \cdot (DX_{i}(Map_{j}) + DI_{i}(Map_{j}))

(10)

In equations (9) and (10), $P_{sta}^{i}$ is the static cache power attributed to $HRT_{i}$. We use $S_{i}$ to represent the number of columns demanded by $HRT_{i}$. Let $P_{sta}(N_{bank} \cdot N_{column})$ be the total static power of the L2 cache when its capacity is $N_{bank} \cdot N_{column}$ columns. Each HRT exclusively uses its allocated columns, and unused columns can be turned off to reduce power consumption. Thus, the static power consumed by $HRT_{i}$ can be defined as

P_{sta}^{i} = P_{sta} (N_{bank} \cdot N_{column}) \cdot S_{i} / (N_{bank} \cdot N_{column})

(11)

The value of $P_{sta}(N_{bank} \cdot N_{column})$ for a given cache capacity $N_{bank} \cdot N_{column}$, as well as $E_{hit}$ and $E_{block\_fill}$, can be obtained using simulation tools such as CACTI.²² The value of $E_{memaccess}$ can be obtained from the memory specification.⁷
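To make the energy model concrete, the following minimal sketch evaluates equations (5) to (11) for a single task. The class and function names, and all numeric values, are hypothetical; in practice the parameters would come from a tool such as CACTI and from the memory specification, and the units are arbitrary here.

```python
from dataclasses import dataclass


@dataclass
class CacheEnergyParams:
    e_hit: float         # E_hit: energy of one L2 hit
    e_memaccess: float   # E_memaccess: energy of one off-chip access
    e_block_fill: float  # E_block_fill: energy of filling one block into L2
    p_sta_total: float   # P_sta(N_bank * N_column): static power of the whole L2
    n_bank: int
    n_column: int


def static_power(params, s_i):
    """Eq. (11): static power attributed to a task that owns s_i columns."""
    return params.p_sta_total * s_i / (params.n_bank * params.n_column)


def l2_energy(params, n_hit, n_miss, s_i, wcet_fixed, dx, di):
    """Eqs. (5)-(10): L2 energy of one task under a given mapping."""
    e_miss = params.e_memaccess + params.e_block_fill   # Eq. (7)
    e_dyn = n_hit * params.e_hit + n_miss * e_miss      # Eq. (6)
    p_sta = static_power(params, s_i)
    e_fixed = p_sta * wcet_fixed                        # Eq. (9)
    e_delay = p_sta * (dx + di)                         # Eq. (10)
    return e_dyn + e_fixed + e_delay                    # Eqs. (5) and (8)


# Made-up values: 2 banks x 4 columns, task owns 2 columns.
params = CacheEnergyParams(e_hit=1.0, e_memaccess=10.0, e_block_fill=2.0,
                           p_sta_total=8.0, n_bank=2, n_column=4)
print(l2_energy(params, n_hit=100, n_miss=10, s_i=2,
                wcet_fixed=50, dx=6, di=8))  # 348.0
```

Only the `e_delay` term depends on the bank-to-core mapping, which is why tightening $DX_{i}$ and $DI_{i}$ directly tightens the static part of the WCEC.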

Bank-level interference analysis and WCEC computation

In this section, first, an example is given to explain the working of our interference analysis. Next, we provide an analytical formula for calculating conflict delay and waiting time of request. Finally, we present a fine-grained approach to compute interference delay based on request timing, which is the base of bank-to-core mapping.

Example for interference analysis

In this example, as shown in Figure 2, we assume a four-core system with $L_{B}$ of 2 cycles and $L_{M}$ of 4 cycles. Cores $C_{1}$, $C_{2}$, and $C_{3}$ share bank $B_{1}$, and core $C_{4}$ uses bank $B_{2}$ exclusively. Each HRT has only two cache requests, which are issued at cycles 1 and 9, respectively. In the UBD method, the UBD is defined as $Max((N_{core} - 1) \cdot L_{M}, (N_{core} - 1) \cdot L_{B})$ in the IABA; therefore, the conflict delay that each HRT suffers is $((4 - 1) \cdot L_{M}) \cdot 2 = 24$ cycles, where 2 is the number of cache requests issued by the HRT. However, at cycle 1, the first request issued by $HRT_{1}$ can access the bus immediately since $C_{1}$ is the first core to be served, and no other request is accessing $B_{1}$, so the first request of $HRT_{1}$ can access $B_{1}$ at cycle 3. In other words, the first request issued by $HRT_{1}$ suffers neither bus access delay nor bank conflict delay. The first request from $HRT_{2}$ can access the bus only at cycle 5 because of the bus access conflict and bank conflict with the first request of $HRT_{1}$; its bus access delay and bank conflict delay are 2 cycles each. Similarly, the conflict delay suffered by the first request of $HRT_{3}$ is 8 cycles, of which the bus access delay is 6 cycles and the bank conflict delay is 2 cycles. The first request of $HRT_{4}$, however, accesses a different bank; it therefore suffers only from the bus access conflict, and its bus access delay is 10 cycles.

Figure 2.

The conflict delays and waiting delays of HRT.

The second round of the XCBA starts at cycle 17, since all requests in the first round finish by this time. The second request of $HRT_{1}$ is granted access to the bus at cycle 17; hence, its bus waiting time in the ICBA is $17 - 9 = 8$ cycles, and its bus access delay and bank conflict delay in the XCBA are both 0 cycles. For $HRT_{2}$, the first request has not completed at cycle 9, when the second request is issued, so the bank conflict delay suffered by the first request overlaps in time with the bus waiting time suffered by the second request. In this case, the non-overlapping bus waiting time suffered by the second request is $17 - 11 = 6$ cycles, since the first request of $HRT_{2}$ completes at cycle 11, and the bus access delay and bank conflict delay suffered by the second request of $HRT_{2}$ in the second round of the XCBA are 2 cycles each. So, the interference delay of the second request of $HRT_{2}$ is $6 + 2 + 2 = 10$ cycles. Similarly, the interference delays of the second requests of $HRT_{3}$ and $HRT_{4}$ are 10 cycles each. Based on the above analysis, the total interference delays suffered by the four HRTs are 8, 14, 18, and 20 cycles, respectively. Clearly, these interference delays are smaller than the 24 cycles estimated by the UBD method.

Analyzing conflict delay and waiting time

From the above example, we can see that it is necessary to analyze the conflict delay in the XCBA and the waiting time in the ICBAs to accurately estimate the WCEC of HRTs. Suppose a request $rq_{j}$ from core $C_{j}$ ($\in C$), which tries to access bank $B_{k}$, arrives at the bus at cycle $Tarr_{j}$. If $Tarr_{j}$ is earlier than the start time $XCBA_{sta}$ of the current round of the XCBA, $rq_{j}$ has to stall in the ICBA until it is forwarded to the XCBA (which sends it to the bus) at cycle $XCBA_{sta}$. In the XCBA, $rq_{j}$ may encounter bus access interference and bank conflict, which depend on the request that precedes $rq_{j}$ in the current round of the XCBA. Let the request $rq_{p}$ from core $C_{p}$ ($\in C$) be the request preceding $rq_{j}$ in the current round of the XCBA and $Tacc_{p}$ be the time at which $rq_{p}$ is granted access to the bus. The bus access delay that $rq_{j}$ suffers can be computed by the following expression

dbus_{j} = Tacc_{p} + L_{B} - XCBA_{sta}

(12)

Let request $rq_{k}$ from core $C_{k}$ ($\in C$) be the previous request to access bank $B_{k}$ before $rq_{j}$ in the current round of the XCBA, and let $Tacc_{k}$ be the time at which $rq_{k}$ is granted access to the bus. The finish time of the access of $rq_{k}$ to bank $B_{k}$ is $Tacc_{k} + L_{M} + L_{B}$, and the start time of the access of $rq_{j}$ to bank $B_{k}$ is $XCBA_{sta} + dbus_{j} + L_{B}$. If $XCBA_{sta} + dbus_{j} + L_{B}$ is earlier than $Tacc_{k} + L_{M} + L_{B}$, $rq_{j}$ suffers from bank conflict, and the bank conflict delay is $(Tacc_{k} + L_{M} + L_{B}) - (XCBA_{sta} + dbus_{j} + L_{B})$; otherwise, $rq_{j}$ does not suffer from bank conflict, that is, the bank conflict delay is 0. So, the bank conflict delay that $rq_{j}$ suffers can be computed by the following expression

dbank_{j} = Max(0, Tacc_{k} + L_{M} - XCBA_{sta} - dbus_{j})

(13)

Based on equations (12) and (13), the total conflict delays that the $r q_{j}$ suffers in XCBA and the time that the $r q_{j}$ is granted access to bus can be expressed, respectively, as

dxcba_{j} = dbus_{j} + dbank_{j}

(14)

Tacc_{j} = XCBA_{sta} + dxcba_{j}

(15)

Algorithm 1 outlines the calculation of the conflict delay suffered by each request in one XCBA round. The algorithm takes as input the current request of each core to the XCBA, the start time of the current XCBA round, and a bank-to-core mapping. $T_{arr[i]}$ holds the time at which request $rq[i]$ is ready to access the bus. $used[i]$ indicates whether the request is handled in the current XCBA round; if it is, $used[i]$ is set to true. $T_{pre}$ keeps track of the time at which the most recently served request starts its bank access, which is used for computing the bus access delay incurred by the next request. In line 1, $T_{pre}$ is initialized to the start time of the current XCBA round, because the first request in each XCBA round does not suffer from bus access delay. Line 5 checks whether a request can be handled in the current XCBA round. The bus access delay of the request is computed in line 6. The bank conflict delay of the request is initialized to 0 in line 8 and then computed in lines 9–16. Line 10 uses a procedure $IsSameBank()$ that determines whether two requests access the same bank based on BtoCmapping[][] and their own addresses. The time at which a request finishes its bank access is computed in line 18. The finish time of the current round of the XCBA is computed in line 19.

Algorithm 1: Calculating the conflict delay within one XCBA round.
Input: $rq [i]$ , $XCB A_{sta}$ , $BtoCmapping [] []$
Output: Bus access delay suffered by request $rq[i]$ ($dbus[i]$), bank conflict delay suffered by $rq[i]$ ($dbank[i]$), the finish time of the $rq[i]$ to access the bus ($T_{acc[i]}$), the finish time of the current XCBA round ($XCBA_{fin}$)
1: $T_{pre} = XCB A_{sta}$ ;
2: $XCB A_{fin} = XCB A_{sta}$ ;
3: for $i = 1; i \leq N_{core}; i + +$ do
4: $used [i] = false$ ;
5: if $T_{arr [i]} = = XCB A_{sta}$ then
6: $dbus [i] = T_{pre} - XCB A_{sta}$ ;
7: $used [i] = true$ ;
8: $dbank [i] = 0$ ;
9: for $k = i - 1; k \geq 1; k - -$ do
10: if $used [k] = = true$ and $Is SameBank (rq [i], rq [k])$ then
11: if $dbus [k] + dbank [k] + L_{M} > dbus [i]$ then
12: $dbank [i] = dbus [k] + dbank [k] + L_{M} - dbus [i]$
13: break;
14: end if
15: end if
16: end for
17: $T_{pre} = T_{pre} + dbank [i] + L_{B}$ ;
18: $T_{acc [i]} = XCB A_{sta} + dbank [i] + dbus [i] + L_{B} + L_{M}$
19: $XCB A_{fin} = T_{pre} + L_{M}$ ;
20: end if
21: end for
22: return $dbus [], dbank [], T_{acc []}, XCB A_{fin}, used []$ ;
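The pseudocode of Algorithm 1 can be transcribed almost line by line into Python. The sketch below is one such transcription under simplifying assumptions: cores are indexed from 0, the hypothetical list `ready` replaces the test on $T_{arr[i]}$ in line 5, and a plain per-core bank identifier `bank_of` stands in for $IsSameBank()$ and the mapping table.

```python
def xcba_round(ready, bank_of, l_b, l_m, xcba_sta):
    """One XCBA round, following the structure of Algorithm 1.

    ready[i]   -- True if core i has a request ready at xcba_sta
    bank_of[i] -- bank targeted by core i's request (stand-in for IsSameBank)
    Returns (dbus, dbank, t_acc, xcba_fin, used), all indexed by core.
    """
    n = len(ready)
    dbus, dbank = [0] * n, [0] * n
    t_acc = [None] * n
    used = [False] * n
    t_pre = xcba_sta
    xcba_fin = xcba_sta
    for i in range(n):
        if not ready[i]:
            continue
        dbus[i] = t_pre - xcba_sta
        used[i] = True
        # Scan earlier requests of this round, most recent first,
        # for one that still occupies the same bank.
        for k in range(i - 1, -1, -1):
            if used[k] and bank_of[k] == bank_of[i]:
                if dbus[k] + dbank[k] + l_m > dbus[i]:
                    dbank[i] = dbus[k] + dbank[k] + l_m - dbus[i]
                    break
        t_pre = t_pre + dbank[i] + l_b
        t_acc[i] = xcba_sta + dbank[i] + dbus[i] + l_b + l_m
        xcba_fin = t_pre + l_m
    return dbus, dbank, t_acc, xcba_fin, used


# First round of the Figure 2 example: L_B = 2, L_M = 4, round starts at cycle 1,
# cores 0-2 share one bank and core 3 has its own.
dbus, dbank, t_acc, fin, _ = xcba_round([True] * 4, [1, 1, 1, 2], 2, 4, 1)
print(dbus, dbank, t_acc, fin)  # [0, 2, 6, 10] [0, 2, 2, 0] [7, 11, 15, 17] 17
```

The printed delays match the first-round values of the example (0, 2 + 2, 6 + 2, and 10 cycles), and the round ends at cycle 17, where the second round begins.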

As discussed earlier, each ICBA schedules requests from the same core under a FIFO policy to access the XCBA. The delay (the bus waiting time) suffered by a request in the ICBA is the interval between the time at which the request reaches the ICBA and the time at which it is selected to be sent to the XCBA. Suppose that $rq[j - 1]$ and $rq[j]$ are two consecutive requests from the same core, where $rq[j - 1]$ is granted access to the bus in the previous XCBA round and $rq[j]$ is granted access to the bus in the current XCBA round, and let $XCBA_{sta}$ be the start time of the current round of the XCBA. Since $rq[j]$ is granted access to the bus in the current XCBA round, $Tarr_{j}$ is less than $XCBA_{sta}$. If $Tarr_{j}$ is later than $T_{acc[j - 1]}$, where $T_{acc[j - 1]}$ is the finish time of $rq[j - 1]$, the waiting time suffered by $rq[j]$ in the ICBA is $XCBA_{sta} - Tarr_{j}$. Otherwise, the processing of $rq[j - 1]$ overlaps in time with the waiting of $rq[j]$, and the waiting time suffered by $rq[j]$ is $XCBA_{sta} - T_{acc[j - 1]}$. The non-overlapping waiting time suffered by $rq[j]$ can therefore be computed by

dicba_{j} = XCBA_{sta} - Max(Tarr_{j}, T_{acc[j - 1]})

(16)
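Equation (16) can be checked against the Figure 2 example with a one-line function. The function name is hypothetical, and the finish time of 7 used for the first request of $HRT_{1}$ is an assumption derived from a grant at cycle 1 plus $L_{B} + L_{M} = 6$ cycles.

```python
def icba_wait(xcba_sta, t_arr, t_acc_prev):
    """Eq. (16): non-overlapping waiting time of a request in the ICBA."""
    return xcba_sta - max(t_arr, t_acc_prev)


# Second requests in the Figure 2 example (second round starts at cycle 17).
print(icba_wait(17, 9, 7))   # HRT_1: previous request already done, waits 8
print(icba_wait(17, 9, 11))  # HRT_2: previous request ends at 11, waits 6
```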

Based on the above analysis, we develop an algorithm to compute the total interference delays suffered by each HRT in the XCBA and ICBAs. Algorithm 2 presents the details. $ispop[k]$ indicates whether the next request can be fetched from the request queue of core $k$, and $C_{fin[k]}$ is the finish time of request $rq[k]$. In line 3, the sequence of L2 cache hits, ${QR}_{L2hit}^{i}$, is obtained from $QR_{i}$, which is determined by a static timing analysis tool. In line 9, we pop the request from ${QR}_{L2hit}^{k}$ and update ${QR}_{L2hit}^{k}$. According to equation (16), the total waiting time of a HRT in the ICBA is computed in line 16. In line 18, Algorithm 1 is called to compute the interference delay in one schedule round. The total conflict delay in the XCBA is accumulated in line 19. The start time of the next round is set in line 24.

Algorithm 2: Calculating the total interference delay of $HR T_{i}$ in the XCBA and ICBA.
Input: $BtoCmapping[][]$, $QR_{i}$, $QR_{k}$ ($HRT_{k} \in \{T - HRT_{i}\}$)
Output: $DX [i]$ , $DI [i]$
1: $XCB A_{sta} = 0$ ;
2: Initialize conflict delays and waiting time for each core $DX [i] = 0, DI [i] = 0, ispop [i] = true$ ( $1 \leq i \leq N_{core}$ );
3: Obtain ${QR}_{L 2 hit}^{i}$ from $Q R_{i}$ ;
4: Obtain ${QR}_{L2hit}^{k}$ from $QR_{k}$ ($HRT_{k} \in \{T - HRT_{i}\}$);
5: while ( ${QR}_{L 2 hit}^{i} \neq$ NULL) do
6: for ( $k = 1; k \leq N_{core}; k + +)$ do
7: if $ispop [k] = = true$ then
8: if ( ${QR}_{L 2 hit}^{k} \neq$ NULL) then
9: Pop the current request $rq [k]$ from ${QR}_{L 2 hit}^{k}$ ;
10: $ispop [k] = false$ ;
11: end if
12: end if
13: end for
14: for ( $k = 1; k \leq N_{core}; k + +)$ do
15: if $T_{arr [k]} < XCB A_{sta}$ then
16: $DI [k] =$ $DI [k] + XCB A_{sta} - Max (T_{arr [k]}, C_{fin [k]})$ ;
17: $T_{arr [k]} = XCB A_{sta}$ ;
18: Call Algorithm 1 to obtain $dbus [k]$ , $dbank [k]$ , $used [k]$ , $XCB A_{fin}$ ;
19: $DX[k] = DX[k] + dbus[k] + dbank[k]$;
20: $C_{fin [k]} = T_{acc [k]}$ ;
21: $ispop [k] = used [k]$ ;
22: end if
23: end for
24: $XCB A_{sta} = XCB A_{fin}$ ;
25: end while
26: return $DX [i]$ , $DI [i]$ ;
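The round-by-round accumulation can be sketched as a small simulation. This is a simplified illustration rather than a faithful transcription of Algorithm 2: it assumes each core issues its requests at known cycles, treats a request as eligible when it has arrived by the start of a round, runs until every queue is empty, and returns the totals for all cores. The function name and the input layout are hypothetical.

```python
def total_interference(arrivals, bank_of, l_b, l_m):
    """Accumulate XCBA conflict delay (DX) and ICBA waiting time (DI) per core.

    arrivals[c] -- issue times of core c's L2 requests, in FIFO order
    bank_of[c]  -- bank used by core c (one bank per core, for simplicity)
    """
    n = len(arrivals)
    queues = [list(a) for a in arrivals]
    dx, di = [0] * n, [0] * n
    c_fin = [0] * n  # finish time of each core's previous request
    xcba_sta = min((q[0] for q in queues if q), default=0)
    while any(queues):
        ready = [bool(q) and q[0] <= xcba_sta for q in queues]
        if not any(ready):
            # Bus is idle: start the next round at the earliest arrival.
            xcba_sta = min(q[0] for q in queues if q)
            continue
        t_pre = xcba_sta
        served = []  # (core, dbus, dbank) of requests granted in this round
        for c in range(n):
            if not ready[c]:
                continue
            t_arr = queues[c].pop(0)
            di[c] += xcba_sta - max(t_arr, c_fin[c])        # Eq. (16)
            dbus = t_pre - xcba_sta
            dbank = 0
            for k, kbus, kbank in reversed(served):
                if bank_of[k] == bank_of[c] and kbus + kbank + l_m > dbus:
                    dbank = kbus + kbank + l_m - dbus
                    break
            served.append((c, dbus, dbank))
            dx[c] += dbus + dbank                           # Eq. (14)
            t_pre += dbank + l_b
            c_fin[c] = xcba_sta + dbus + dbank + l_b + l_m
        xcba_sta = t_pre + l_m  # the next round starts when this one ends
    return dx, di


# Figure 2 example: four cores, requests at cycles 1 and 9, L_B = 2, L_M = 4.
dx, di = total_interference([[1, 9]] * 4, [1, 1, 1, 2], 2, 4)
print([x + w for x, w in zip(dx, di)])  # [8, 14, 18, 20]
```

Under these assumptions the sketch reproduces the totals of 8, 14, 18, and 20 cycles derived in the example, compared with 24 cycles per task under the UBD method.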

Computing the WCEC of HRTs

Algorithm 3 estimates the WCEC of HRTs. Lines 2–19 analyze all job instances of the HRTs in one hyper-period. For each job instance, we first call Algorithm 2 to estimate the total conflict delay and waiting time it suffers (line 5) and then compute its WCET estimate based on equation (1) (line 6). Line 7 checks whether this job instance violates its timing constraint. If it does, the job instance is not schedulable, and we set the WCEC of the job to infinity. Otherwise, we compute the energy consumption of this job instance in lines 10–14. The total energy consumption of the HRTs is accumulated in line 16.

Algorithm 3: Calculating the WCEC of HRTs.
Input: $N_{core}$ , $N_{task}$ , H, $BtoCmapping [] []$ , $Q R_{i}$ , $D_{i}$ , $P_{i}$ ( $HR T_{i} \in T$ )
Output: The WCEC of all HRTs. $Tota l_{energy}$
1: Obtain $N_{L 2 hit}^{i}$ and $N_{L 2 miss}^{i}$ from $Q R_{i}$ ( $HR T_{i} \in T$ ) ;
2: for ( $k = 0; k \leq H; k + +)$ do
3: for ( $i = 1; i \leq N_{task}; i + +)$ do
4: if $k \mod P_{i} == 0$ then
5: Call algorithm 2 to compute the interference delay $DX[i]$, $DI[i]$ suffered by the current job instance of $HRT_{i}$;
6: $WCET[i] = {WCET}_{fixed}^{i} + DX[i] + DI[i]$;
7: if $WCET [i] > D_{i}$ then
8: $WCEC [i] = Infinity$ ;
9: else
10: Obtain $N_{L 2 hit}^{i}$ and $N_{L 2 miss}^{i}$ from $Q R_{i}$ ;
11: $E_{dyn}^{i} = N_{L 2 hit}^{i} \cdot E_{hit}$ ;
12: $E_{dyn}^{i} = E_{dyn}^{i} + N_{L2miss}^{i} \cdot (E_{memaccess} + E_{block\_fill})$;
13: $E_{sta}^{i} = P_{sta}^{i} \cdot WCET [i]$ ;
14: $WCEC [i] = WCEC [i] + E_{dyn}^{i} + E_{sta}^{i}$ ;
15: end if
16: $Tota l_{energy} = Tota l_{energy} + WCEC [i]$ ;
17: end if
18: end for
19: end for
20: return $Tota l_{energy}$ , schedule;
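The job loop of Algorithm 3 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the interference analysis is abstracted behind a hypothetical callback `delays_for_job`, each task is a plain dictionary, jobs are released at multiples of the period within $[0, H)$ so that each task contributes $H / P_{i}$ instances as in equation (17), and an unschedulable job makes the whole result infinite.

```python
import math


def total_wcec(tasks, hyper_period, e_hit, e_miss, delays_for_job):
    """Sum the L2 energy of every job instance in one hyper-period.

    tasks          -- list of dicts with keys: period, deadline, wcet_fixed,
                      n_hit, n_miss, p_sta (all hypothetical field names)
    delays_for_job -- callable (task_index, release_time) -> (DX, DI),
                      standing in for the interference analysis
    """
    total = 0.0
    for i, t in enumerate(tasks):
        for release in range(0, hyper_period, t["period"]):
            dx, di = delays_for_job(i, release)
            wcet = t["wcet_fixed"] + dx + di             # Eq. (1)
            if wcet > t["deadline"]:
                return math.inf                          # job misses its deadline
            e_dyn = t["n_hit"] * e_hit + t["n_miss"] * e_miss
            e_sta = t["p_sta"] * wcet
            total += e_dyn + e_sta
    return total


tasks = [
    {"period": 10, "deadline": 10, "wcet_fixed": 4,
     "n_hit": 2, "n_miss": 1, "p_sta": 0.5},
    {"period": 20, "deadline": 20, "wcet_fixed": 6,
     "n_hit": 3, "n_miss": 0, "p_sta": 1.0},
]
# Made-up constant delays of DX = 2 and DI = 1 cycles for every job.
print(total_wcec(tasks, 20, e_hit=1.0, e_miss=5.0,
                 delays_for_job=lambda i, r: (2, 1)))  # 33.0
```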

Bank mapping optimization for WCEC reduction

Problem formulation

In this section, we will present our bank-to-core mapping algorithm to optimize bank-level interference and improve the WCEC. This optimization problem can be formally defined as

Min(\sum_{i = 1}^{N_{task}} \sum_{k = 1}^{H / P_{i}} E_{L2}^{ik}(Map_{j}), \forall Map_{j} \in M)

(17)

where H denotes the hyper-period of all HRTs, and $E_{L 2}^{ik} (Ma p_{j})$ is the energy consumption of L2 cache consumed by kth job instance of $HR T_{i}$ within one hyper-period when bank-to-core mapping is $Ma p_{j}$ .

The optimization problem is subject to the following several constraints

Map_{j} = \{x_{ik} \cdot ncol_{ik} \mid 1 \leq i \leq N_{core}, 1 \leq k \leq N_{bank}\}

(18)

where $x_{ik}$ indicates whether $B_{k} (\in B)$ has columns mapped to $HRT_{i} (\in T)$. If $B_{k}$ has columns mapped to $HRT_{i}$, $x_{ik} = 1$; otherwise, $x_{ik} = 0$. $ncol_{ik}$ represents the number of columns of $B_{k}$ mapped to $HRT_{i}$. If $x_{ik} = 1$, $ncol_{ik} > 0$; otherwise, $ncol_{ik} = 0$.

Let $S_{i}$ denote the number of cache columns required for $HR T_{i}$ ; obviously, the number of cache columns allocated to $HR T_{i}$ must be equal to $S_{i}$ for each bank-to-core mapping.

Thus, the following constraint must be satisfied

S_{i} = \sum_{k = 1}^{N_{bank}} x_{ik} \cdot ncol_{ik}, \forall Map_{j} \in M

(19)

The number of cache columns allocated to all HRTs must be less than or equal to the total number of cache columns in multicore system. Thus

\sum_{i = 1}^{N_{task}} S_{i} \leq N_{bank} \cdot N_{column}, \forall Ma p_{j} \in M

(20)

In bank-column cache partitioning, a column cannot be shared between any two HRTs. In other words, the number of columns of one bank allocated to HRTs must be less than or equal to the capacity of that bank, that is

\sum_{i = 1}^{N_{task}} ncol_{ik} \cdot x_{ik} \leq N_{column}, \forall B_{k} \in B, \forall Map_{j} \in M

(21)

For each $Map_{j} (\in M)$, all HRTs must complete before their deadlines. Therefore, for each job instance of $HRT_{i}$, the following constraint must be met

WCET_{ik} (Map_{j}) \leq D_{i}, \forall HRT_{i} \in T, \forall Map_{j} \in M

(22)

where $WCET_{ik} (Map_{j})$ represents the WCET of the kth job instance of $HRT_{i}$ within one hyper-period when the bank-to-core mapping is $Map_{j}$.

According to equation (1), we can estimate $WCET_{ik} (Map_{j})$ by the following equation

WCET_{ik} (Map_{j}) = WCET_{fixed}^{i} + DX_{ik} (Map_{j}) + DI_{ik} (Map_{j})

(23)

In equation (23), $DX_{ik} (Map_{j})$ and $DI_{ik} (Map_{j})$ are the total delays of the kth job instance of $HRT_{i}$ in XCBA and ICBA, respectively.
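The column-allocation constraints (19) to (21) can be checked mechanically for a candidate mapping. The sketch below is illustrative only: `check_mapping` and its argument names are hypothetical, a mapping is represented as a matrix whose entry `[i][k]` plays the role of $ncol_{ik}$, and the deadline constraint (22) is not checked because it needs the interference analysis. The sample data are the task set 2 rows of Table 2 and the column demands of Table 1.

```python
def check_mapping(mapping, demand, n_column):
    """Return the list of violated constraints (hypothetical helper).

    mapping[i][k] is the number of columns of bank k given to task i,
    demand[i] is S_i, n_column is the number of columns per bank.
    """
    n_bank = len(mapping[0])
    violations = []
    # Constraint (19): each task receives exactly S_i columns.
    for i, row in enumerate(mapping):
        if sum(row) != demand[i]:
            violations.append(("eq19", i))
    # Constraint (20): total demand fits in the whole cache.
    if sum(demand) > n_bank * n_column:
        violations.append(("eq20", None))
    # Constraint (21): no bank hands out more columns than it has.
    for k in range(n_bank):
        if sum(row[k] for row in mapping) > n_column:
            violations.append(("eq21", k))
    return violations


# Task set 2: ud, ludcmp, expint, prime, insertsort, matmult.
demand_set2 = [4, 14, 2, 4, 4, 2]
table2_set2 = [
    [4, 0, 0, 0],
    [2, 8, 4, 0],
    [0, 0, 2, 0],
    [0, 0, 2, 2],
    [0, 0, 0, 4],
    [0, 0, 0, 2],
]
print(check_mapping(table2_set2, demand_set2, n_column=8))
```

For the Table 2 mapping the list of violations is empty; moving the four columns of ud from $B_{1}$ to $B_{2}$ would overfill $B_{2}$ and be reported against constraint (21).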

Algorithm for bank-to-core mapping

We now present the bank-to-core mapping used to optimize the WCEC of HRTs. Intuitively, bank conflicts can be fully eliminated by mapping each task's instructions and data exclusively to a specific bank, but doing so requires at least as many banks as there are HRTs in the system. When the number of banks is smaller than the number of HRTs, a method is needed that uses the shared cache space efficiently while minimizing the energy consumption. In this article, we optimize the bank-to-core mapping for WCEC reduction. Bank-to-core mapping can be divided into three cases.

Case 1: $\sum_{\forall C_{i} \in C} ⌈ S_{i} / N_{column} ⌉ \leq N_{bank}$. In this case, we exclusively allocate $⌈ S_{i} / N_{column} ⌉$ banks to the core $C_{i}$.

Case 2: $\sum_{\forall C_{i} \in C} ⌈ S_{i} / N_{column} ⌉ > N_{bank}$ and $\sum_{\forall C_{i} \in C} S_{i} > (N_{bank} - 1) \cdot N_{column}$. The bank-to-core mapping is constructed as follows. (1) Cores are served in the order given by a core queue: we first allocate columns for the first core in the queue, then for the second core, and so on. (2) Banks are consumed in index order: we first allocate the columns of bank $B_{1}$ to cores, then the columns of bank $B_{2}$, and so on.

Case 3: $\sum_{\forall C_{i} \in C} ⌈ S_{i} / N_{column} ⌉ > N_{bank}$ and $\sum_{\forall C_{i} \in C} S_{i} \leq (N_{bank} - 1) \cdot N_{column}$. We first reserve $⌈ (N_{bank} \cdot N_{column} - \sum_{\forall C_{i} \in C} S_{i}) / (N_{bank} \cdot N_{column}) ⌉$ columns for $((N_{bank} \cdot N_{column} - \sum_{\forall C_{i} \in C} S_{i}) \mod N_{bank})$ banks and $⌊ (N_{bank} \cdot N_{column} - \sum_{\forall C_{i} \in C} S_{i}) / (N_{bank} \cdot N_{column}) ⌋$ columns for the remaining banks; these reserved columns do not take part in bank mapping. We then apply the method of Case 2 to make the bank-to-core mapping.
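The conditions that separate the three cases can be evaluated directly from the column demands. The following sketch is illustrative; the function name `mapping_case` is hypothetical, and the demands used in the example are the "Columns" entries of Table 1 with the 4-bank, 8-column L2 cache of the experimental setup.

```python
import math


def mapping_case(demand, n_bank, n_column):
    """Return 1, 2 or 3 according to the case conditions above."""
    # Banks needed if every core got whole banks of its own.
    banks_needed = sum(math.ceil(s / n_column) for s in demand)
    if banks_needed <= n_bank:
        return 1  # exclusive banks are possible
    if sum(demand) > (n_bank - 1) * n_column:
        return 2  # demand spills into the last bank
    return 3      # spare columns exist and are reserved first


set1 = [1, 1, 2, 8, 1, 16]   # fir, bs, st, cnt, fibcall, bsort100
set2 = [4, 14, 2, 4, 4, 2]   # ud, ludcmp, expint, prime, insertsort, matmult
print(mapping_case(set1, n_bank=4, n_column=8))
print(mapping_case(set2, n_bank=4, n_column=8))
```

Both task sets need 7 exclusive banks but only 4 exist, and their total demands (29 and 30 columns) exceed $(N_{bank} - 1) \cdot N_{column} = 24$, so both fall under Case 2.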

For Case 1, the bank conflict is eliminated and the WCEC of the system is minimized. For Case 2 and Case 3, we develop an algorithm that finds the bank-to-core mapping with minimal WCEC when the number of columns allocated to $HRT_{i} (\in T)$ is $S_{i}$. Algorithm 4 gives the details of the bank-to-core mapping, which is based on a recursive search. Line 1 initializes the WCEC of the system, MinWCEC, to infinity and sets the decision variables used[] to false in preparation for the recursive calls. Line 2 computes the number of columns on each bank that may be assigned to HRTs; these columns stay active until the WCET of the HRT has elapsed, while the other columns on each bank can be shut down to save energy. The recursive function $FindBestMapping()$, defined in lines 3–34, searches the solution space for the optimal bank-to-core mapping. Lines 6–20 generate the bank-to-core mapping $BtoCmapping[][]$ from the core queue $c\_seq[]$. Line 21 then calls algorithm 3 to compute the energy consumption under $BtoCmapping[][]$, which accounts for the interference delays of each HRT. The best bank mapping found so far is saved in lines 22–25. The recursion is performed in lines 26–33: in line 28, a core queue is generated during the recursive walk and stored in the array $c\_seq[]$.

Algorithm 4: bank mapping for optimizing WCEC
Input: $N_{core}$, $QR_{i}$, $D_{i}$, $P_{i}$ ($HRT_{i} \in T$)
Output: The minimal WCEC of the system MinWCEC and the corresponding best bank-to-core mapping $BestMap[][]$
1: $MinWCEC =$ Infinity, $used[j] = false$ ($1 \leq j \leq N_{core}$);
2: $NT_{column} = ⌈ \sum_{\forall HRT_{i} \in T} S_{i} \mod N_{bank} ⌉$;
3: function $FindBestMapping(N)$
4: if $N > N_{core}$ then
5: $ncol = NT_{column}$; $nbank = 1$;
6: for each core $C_{i}$ in $c\_seq[]$ do
7: if $S_{i} \geq ncol$ then
8: while $S_{i} \geq ncol$ do
9: $BtoCmapping[i][nbank] = ncol$;
10: $S_{i} = S_{i} - ncol$;
11: $nbank++$;
12: $ncol = NT_{column}$;
13: end while
14: $BtoCmapping[i][nbank] = S_{i}$;
15: $ncol = ncol - S_{i}$;
16: else
17: $BtoCmapping[i][nbank] = S_{i}$;
18: $ncol = ncol - S_{i}$;
19: end if
20: end for
21: Call algorithm 3 to compute energy consumption $Total_{energy}$ under mapping $BtoCmapping[][]$;
22: if $MinWCEC > Total_{energy}$ then
23: $MinWCEC = Total_{energy}$; $BestMap[][] = BtoCmapping[][]$;
24: end if
25: end if
26: for $j = 1$; $j \leq N_{core}$; $j++$ do
27: if $!used[j]$ then
28: $c\_seq[N] = C_{j}$;
29: $used[j] = true$;
30: $FindBestMapping(N + 1)$;
31: $used[j] = false$;
32: end if
33: end for
34: end function
35: Call $FindBestMapping(1)$;
36: return MinWCEC, $BestMap[][]$;
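The search in algorithm 4 can be sketched compactly in Python. This is an illustration under stated assumptions, not the authors' code: the recursive enumeration of core queues is replaced by `itertools.permutations`, which visits the same orderings; the per-bank column budget is passed in as the parameter `nt_column` rather than computed; the energy evaluation of algorithm 3 is replaced by a caller-supplied function; and all names (`fill_banks`, `find_best_mapping`, `shared_banks`) are hypothetical. The toy cost `shared_banks` merely counts banks used by more than one core and stands in for a real energy model.

```python
import itertools
import math


def fill_banks(core_order, demand, n_bank, nt_column):
    """Greedy fill of lines 5-20: serve cores in queue order and move
    to the next bank whenever the current bank's budget is used up."""
    mapping = {c: [0] * n_bank for c in core_order}
    bank, free = 0, nt_column
    for c in core_order:
        need = demand[c]  # local copy, so demand is left untouched
        while need > 0:
            if bank >= n_bank:
                raise ValueError("demand exceeds the per-bank budget")
            take = min(need, free)
            mapping[c][bank] += take
            need -= take
            free -= take
            if free == 0:
                bank, free = bank + 1, nt_column
    return mapping


def find_best_mapping(demand, n_bank, nt_column, total_energy):
    """Try every core queue and keep the cheapest mapping."""
    best_cost, best_map = math.inf, None
    for order in itertools.permutations(demand):
        mapping = fill_banks(order, demand, n_bank, nt_column)
        cost = total_energy(mapping)  # stand-in for algorithm 3
        if cost < best_cost:
            best_cost, best_map = cost, mapping
    return best_cost, best_map


def shared_banks(mapping):
    """Toy cost (assumption): number of banks used by several cores."""
    n_bank = len(next(iter(mapping.values())))
    return sum(
        1 for k in range(n_bank)
        if sum(1 for row in mapping.values() if row[k] > 0) > 1
    )


toy_demand = {"a": 8, "b": 4, "c": 4}
print(find_best_mapping(toy_demand, n_bank=2, nt_column=8,
                        total_energy=shared_banks))
```

The search is exhaustive over $N_{core}!$ core queues, which is 720 for the six-core platform evaluated below; this is affordable offline but grows quickly with the number of cores.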

Evaluation

In this section, we evaluate the effectiveness of interference analysis and bank-to-core mapping approach on energy saving. Before the results are presented, we first introduce the experiment setup.

Experimental setup

We assume that our target architecture has six cores, each with an in-order, five-stage pipeline. The CPU clock speed is 500 MHz. The instruction fetch queue size is 4, the fetch width is 2, and the instruction window size is 8. Private L1 instruction and data caches are set to 128 B (1 bank, 2-way associativity, 16-byte lines, and 1-cycle access latency). The L2 cache is shared among all cores; it is 8 KB with 4 banks, 4-way associativity, 32-byte lines, and a 4-cycle access latency ($L_{M}$). Each bank is 2 KB and comprises 8 columns, each of which is 128 B. The real-time bus applies the IABA policy, and a request/access takes 2 cycles ($L_{B}$) to cross it. The main memory is set to 4 MKB, 8 banks, and 30 cycles access latency. The energy parameters of the L2 cache are generated by CACTI,²² an integrated cache leakage power model developed by HP.

The WCET analysis of the tasks is built on top of the open-source timing analysis tool Chronos.²¹ Chronos was originally a single-core WCET analysis tool, and we extended it by adding support for the IABA model. The task sets used in our experiments are shown in Table 1. All tasks are from the Malardalen WCET benchmarks,²³ compiled with a GCC cross-compiler for a MIPS-like instruction set.²⁴ The mapping of tasks to cores is given in the third column of Table 1. To obtain the number of columns demanded by each task, we use Chronos to measure the WCET of each task while varying the L2 cache size from 1 to 32 columns. Based on the measured results, the number of columns demanded by each task is listed in the fourth column of Table 1.

Table 1.

Characteristics of task sets.

Task set	Task	Core	Columns	Period (ms)
Set 1	fir	$C_{1}$	1	20
	bs	$C_{2}$	1	20
	st	$C_{3}$	2	10
	cnt	$C_{4}$	8	40
	fibcall	$C_{5}$	1	10
	bsort100	$C_{6}$	16	120
Set 2	ud	$C_{1}$	4	40
	ludcmp	$C_{2}$	14	40
	expint	$C_{3}$	2	20
	prime	$C_{4}$	4	100
	insertsort	$C_{5}$	4	200
	matmult	$C_{6}$	2	100

Experimental results

Based on the above experimental setup, we conduct four experiments. In the first experiment, the bank-to-core mapping is given in advance, and we show the impact of the bank-level interference on WCEC. The second experiment evaluates the impact of the bank-to-core mapping on WCEC saving. The third experiment compares our interference analysis approach with the UBD approach. The final experiment investigates the effect of the timing constraint on our approach.

Impact of the interference on WCEC

To quantify the impact of interference on WCEC, we assume that the bank-to-core mapping is given as shown in Table 2 and that the deadline of $HRT_{i}$ is equal to its period. We then compute the WCEC of each task with and without the interference delay and compare the two. Figure 3 shows the comparison of WCEC for these two scenarios, where the WCEC of a task without interference delay is normalized to 100%. The WCEC of all tasks in the two task sets increases, to different degrees. For example, the WCEC of cnt increases by 11.3%, and the WCEC of prime increases by 11.7%. The only exception is fibcall, whose WCEC increases by only 4.3%, because fibcall is a compute-intensive application with few L2 cache accesses. Overall, the WCEC of the tasks in the two task sets increases by 10.27% and 11.38% on average, respectively. This shows that ignoring the impact of interference on WCEC may lead to unsafe results when the analysis is relied on to guarantee system behavior within a given energy budget.

Table 2.

A bank-to-core mapping.

Set 1	$B_{1}$	$B_{2}$	$B_{3}$	$B_{4}$	Set 2	$B_{1}$	$B_{2}$	$B_{3}$	$B_{4}$
fir	1	0	0	0	ud	4	0	0	0
bs	1	0	0	0	ludcmp	2	8	4	0
st	2	0	0	0	expint	0	0	2	0
cnt	4	4	0	0	prime	0	0	2	2
fibcall	0	1	0	0	insertsort	0	0	0	4
bsort100	0	3	8	5	matmult	0	0	0	2

Figure 3.

Comparison of WCEC of task considering interference and not-considering interference: (a) task set 1 and (b) task set 2.

Impact of bank-to-core mapping on WCEC

The choice of bank-to-core mapping itself affects the WCEC of HRTs. Using task set 2 in Table 1 as an example, Figure 4 shows the total WCEC of HRTs across the solution space of bank-to-core mappings. The solution space contains 720 mappings, and the total WCEC of HRTs varies from 49076.36 to 52904.4 nJ. The difference in L2 cache energy consumption between the best and the worst bank mapping is 3828.04 nJ; that is, bank-to-core mapping yields a 7.23% WCEC reduction. This is mainly because a task suffers different interference delays under different bank-to-core mappings, and these interference delays affect the static energy consumption of the L2 cache.
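As a quick check of the figures quoted above, and assuming that the percentage is taken relative to the worst mapping (the text does not state the baseline explicitly), the arithmetic works out as:

```latex
\Delta E = 52904.4 - 49076.36 = 3828.04\ \text{nJ},
\qquad
\frac{\Delta E}{52904.4} \approx 0.072
```

which is consistent with the reported reduction of roughly 7.2%.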

Figure 4.

The impact of bank-to-core mapping on WCEC.

Our interference analysis approach versus UBD approach

In this experiment, we focus on the effectiveness of the interference analysis approach; the bank-to-core mapping is based on Table 2. We compare our approach with the UBD approach, where the UBD can be expressed as $UBD = (N_{core} - 1) \cdot Max (L_{B}, L_{M})$. Figure 5 shows the WCEC of the two approaches, normalized with respect to UBD, for the two task sets. One of the bank-to-core mappings with the minimum WCEC is shown in Table 3. As can be seen, for most of the HRTs in the two task sets, our approach significantly improves the WCEC of each HRT. For example, our approach yields up to 19.72% WCEC reduction for cnt, 22.64% for bsort100, and 21% for matmult over UBD. In summary, our approach achieves on average 14.4% and 14.1% WCEC reduction compared to the UBD approach for the two task sets, respectively, and 14.25% on average over both task sets. In addition, to assess the tightness of the proposed interference analysis approach, we compare the estimated WCEC with the WCEC observed through simulation. We have extended the SimpleScalar toolset²⁴ to facilitate our experimental evaluation. Figure 6 compares our approach with the simulation results. For a number of benchmarks, such as fir, the proposed approach obtains a very tight WCEC, which is within 3.56% of the observed simulation result. On average, the WCEC estimated by our approach is 5.31% higher than the WCEC observed through simulation.
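For the platform described in the experimental setup, the UBD expression can be evaluated directly. The sketch below simply applies the formula quoted above with $N_{core} = 6$, $L_{B} = 2$ and $L_{M} = 4$; the function name `upper_bound_delay` is hypothetical, and the result is in cycles.

```python
def upper_bound_delay(n_core, l_bus, l_mem):
    """UBD = (N_core - 1) * max(L_B, L_M), in cycles."""
    return (n_core - 1) * max(l_bus, l_mem)


# Six cores, 2-cycle bus latency, 4-cycle L2 access latency.
print(upper_bound_delay(6, 2, 4))  # 20
```

The bound depends only on the number of cores and the two latencies, not on how banks are mapped to cores, which is consistent with it being more pessimistic than the bank-level analysis.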

Figure 5.

Comparison of WCEC of two approaches: (a) task set 1 and (b) task set 2.

Table 3.

A bank-to-core mapping with minimum WCEC.

Set 1	$B_{1}$	$B_{2}$	$B_{3}$	$B_{4}$	Set 2	$B_{1}$	$B_{2}$	$B_{3}$	$B_{4}$
fir	0	1	0	0	ud	0	0	4	0
bs	0	1	0	0	ludcmp	8	6	0	0
st	0	2	0	0	expint	0	0	2	0
cnt	8	0	0	0	prime	0	2	0	0
fibcall	0	1	0	0	insertsort	0	0	0	4
bsort100	0	0	8	8	matmult	0	0	0	2

Figure 6.

Comparing the WCEC results by our approach and simulation: (a) task set 1 and (b) task set 2.

Deadline effect

The final experiment shows the effect of the deadline on WCEC. Using task set 2 in Table 1 as an example, we vary the deadline of each $HRT_{i}$ from $1.0 * period_{i}$ ms to $0.82 * period_{i}$ ms in steps of $0.03 * period_{i}$ ms (there is no solution for deadlines shorter than $0.82 * period_{i}$ ms). Figure 7 shows the results for both our approach and the UBD approach. Our approach finds efficient solutions and consistently outperforms the UBD approach at all deadline levels.

Figure 7.

Deadline effect on WCEC.

Conclusion and future work

In this article, we have presented an analysis of the bank-level interference (bank conflict and bus access interference) on a multicore platform with a shared cache. Our analysis approach provides a tighter bound on the interference delay, which is crucial for WCEC. Experimental results show that our approach improves the tightness of WCEC by 14.25% on average compared to the UBD approach. Moreover, to reduce the negative impact of bank-level interference and improve WCEC, we propose bank-to-core mapping and develop a corresponding algorithm; the experimental results indicate that bank-to-core mapping yields significant benefits, with a 7.23% WCEC reduction.

As multicore architectures are already ubiquitous, interference in shared resources needs to be alleviated. We believe that our analysis and bank mapping can be effectively used for designing predictable real-time multicore systems for processing various complex jobs, e.g., performance optimization,^12,25,26 learning & classification,^27,28 and content searching.^29–32 Plenty of future work remains in this field. We plan to (1) extend our techniques to off-chip memory so as to address system-wide energy consumption and (2) explore the effect of hardware pre-fetchers on cache interference delay.

Footnotes

Acknowledgements

The authors thank the anonymous reviewers for their helpful comments and suggestions.

Academic Editor: Xuyun Zhang

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China under grant nos 61370062 and 61462004.

References

1. Dou. A context-aware service evaluation approach over big data for cloud applications. IEEE T Cloud Comput 2015; PP: 1.

2. Duan, Zhou. Everything as a service (XaaS) on the cloud: origins, current and future trends. In: Proceedings of the 2015 IEEE 8th international conference on cloud computing, New York, 27 June–2 July 2015, pp.621–628. New York: IEEE.

3. Völp, Hähnel, Lackorzynski. Has energy surpassed timeliness-Scheduling energy-constrained mixed-criticality systems. In: Proceedings of the 20th real-time and embedded technology and applications symposium, Berlin, 15–17 April 2014, pp.275–284. New York: IEEE.

4. Wagemann, Distler, Honig. Worst-case energy consumption analysis for energy-constrained embedded systems. In: Proceedings of the Euromicro conference on real-time systems, Lund, 8–10 July 2015, pp.105–114. New York: IEEE.

5. Legout, Jan, Pautet. Scheduling algorithms to reduce the static energy consumption of real-time systems. Real-Time Syst 2013; 51(2): 99–108.

6. Zang, Gordon-Ross. A survey on cache tuning from a power/energy perspective. ACM Comput Surv 2013; 45(3): 533–545.

7. Zhang, Vahid, Najjar. A highly configurable cache for low energy embedded systems. ACM T Embed Comput S 2005; 4(2): 363–387.

8. Adegbija, Gordon-Ross, Munir. Phase distance mapping: a phase-based cache tuning methodology for embedded systems. Des Autom Embed Syst 2014; 18(3): 251–278.

9. Wang, Mishra, Ranka. Dynamic cache reconfiguration and partitioning for energy optimization in real-time multi-core systems. In: Proceedings of DAC’11, New York, 5–9 June 2011, pp.948–954. New York: IEEE.

10. Paolieri, Quiñones, Cazorla. Hardware support for WCET analysis of hard real-time multicore systems. In: Proceedings of ISCA’09, Austin, TX, 20–24 June 2009, pp.57–68. New York: IEEE.

11. Yoon, Kim, Sha. Optimizing tunable WCET with shared resource allocation and arbitration in hard real-time multicore systems. In: Proceedings of RTSS’11, Vienna, 29 November–2 December 2011, pp.227–238. New York: IEEE.

12. Dou, Chen. Weighted PCA-based service selection method for multimedia services in cloud environment. Computing 2016; 98(1): 195–214.

13. Chen, Huang. Reconfigurable cache for real-time MPSOCs: scheduling and implementation, microprocessors and microsystems. Microprocess Microsy 2016; 42(7): 200–214.

14. Hajimiri, Mishra, Bhunia. Dynamic cache tuning for efficient memory based computing in multicore architectures. In: Proceedings of the international conference on VLSI design & international conference on embedded systems, Pune, India, 5–10 January 2013, pp.49–54. New York: IEEE.

15. Mittal, Yanan, Zhao. MASTER: a multicore cache energy saving technique using dynamic cache reconfiguration. IEEE T VLSI Syst 2014; 22(8): 1653–1665.

16. Qureshi, Patt YN. Utility-based cache partitioning: a low-overhead, high-performance, runtime mechanism to partition shared caches. In: Proceedings of the IEEE international symposium on microarchitecture, Orlando, FL, 9–13 December 2006, pp.423–432. New York: IEEE.

17. Reddy, Petrov. Cache partitioning for energy-efficient and interference-free embedded multitasking. ACM T Embed Comput S 2010; 9(3): 177–185.

18. Suhendra, Mitra. Exploring locking & partitioning for predictable shared caches on multi-cores. In: Proceedings of DAC’08, Anaheim, CA, 8–13 June 2008, pp.300–303. New York: ACM.

19. Wilhelm, Engblom, Ermedahl. The worst-case execution-time problem—overview of methods and survey of tools. ACM T Embed Comput S 2008; 53(7): 1–36.

20. Rapita Systems Ltd. RapiTime: worst-case execution time analysis (user guide). York: Rapita Systems Ltd., 2016.

21. Liang, Mitra. Chronos: a timing analyzer for embedded software. Sci Comput Progr 2007; 69(1–3): 56–67.

22. HP. CACTI HP Laboratories Palo Alto (CACTI6), 2014, http://www.hpl.hp.com/

23. Gustafsson, Betts, Ermedahl. The Malardalen WCET benchmarks: past, present and future. In: Proceedings of the WCET workshop, Madrid, 18 July 2014.

24. Burger, Austin. The simpleScalar tool set, version 2.0. SIGARCH Comput Archit News 1997; 25(3): 13–25.

25. Sun, Liu. Achieving efficient cloud search services: multi-keyword ranked search over encrypted cloud data supporting parallel computing. IEICE Transactions on Communications 2015; 98(1): 190–200.

26. Zhang, Sun, Baowei. Efficient algorithm for k-barrier coverage based on integer linear programming. China Communications 2016; 13(7): 16–23.

27. Sheng, Tay. Incremental support vector learning for ordinal regression. IEEE T Neur Net Lear 2015; 26(7): 1403–1416.

28. Guan. Towards efficient multi-keyword fuzzy search over encrypted outsourced data with accuracy improvement. IEEE T Inf Foren Sec 2016; 11(12): 2706–2716.

29. Ren, Shu. Enabling personalized search over encrypted outsourced data with efficiency improvement. IEEE T Parall Distr 2016; 27(9): 2546–2559.

30. Pan, Jin, Lei. Fast reference frame selection based on content similarity for low complexity HEVC encoder. J Vis Commun Image R 2016; 40(Part B): 516–524.

31. Huang, Sun. Enabling semantic search based on conceptual graphs over encrypted outsourced data. IEEE T Serv Comput 2016. DOI: 10.1109/TSC.2016.2622697

32. Xia, Wang, Zhang. A privacy-preserving and copy-deterrence content-based image retrieval scheme in cloud computing. IEEE T Inf Foren Sec 2016; 11(11): 2594–2608.

Worst-case energy consumption minimization based on interference analysis and bank mapping in multicore systems
