
Delay-Optimal Two-Hop Cooperative Relay Communications via Approximate MDP and Distributive Stochastic Learning

Rui Wang and Vincent K. N. Lau
Department of Electronic and Computer Engineering
Hong Kong University of Science and Technology
Clear Water Bay, Kowloon, Hong Kong

Abstract

In this paper, we study the design of a distributive delay-optimal cross-layer scheduling algorithm for two-hop relay communication systems over frequency selective fading channels. The complex interactions of the queues at the source node and the M relays (RSs) are modeled as an infinite horizon average reward Markov Decision Process (MDP), whose state space involves the joint queue state information (QSI) of the queue at the source node and the queues at the M RSs as well as the joint channel state information (CSI) of all S-R links and R-D links. As a first step to address the curse of dimensionality, we propose a reduced-state MDP formulation. From the associated Bellman equation, we show that the delay-optimal power control and link selection algorithm, which is a function of both the CSI and the QSI, has a multi-level water-filling structure. Furthermore, using stochastic learning, we derive a distributive online learning algorithm in which each node recursively estimates a per-node potential function based on real-time observations of the local CSI and local QSI only. Based on the real-time local potential estimates and using approximate MDP, we propose an auction-based algorithm for link selection and show that the combined distributive learning converges almost surely to a global optimal solution for large arrival rates. The proposed online learning algorithm differs from conventional online learning algorithms in two ways: (1) our online iterative solution updates both the value function (potential) and the Lagrange multipliers (LMs) simultaneously; and (2) we establish the technical conditions for almost sure convergence even though the per-node potential update equation is no longer a contraction mapping, so the existing convergence results (based on contraction mapping) cannot be applied directly to our distributive stochastic learning algorithm. Finally, we show by simulation that the delay performance of the proposed scheme is significantly better than various baselines such as conventional CSIT-only control and throughput-optimal control (in the stability sense).

I. INTRODUCTION

Cooperative relay communication is a promising architecture for future wireless communication systems: it offers huge potential for enhancing system capacity, coverage, and reliability, and has therefore attracted tremendous attention from both industry and academia [1]–[3]. So far, much research interest has focused on improving the throughput of downlink relay-assisted cellular systems through cross-layer optimization [4]–[6]. All of these works assume infinite backlogs of packets at the transmitter, and the cross-layer controller maximizes some physical-layer objective function (e.g., weighted sum throughput or proportional fair scheduling (PFS)). The derived control policy is adaptive to the channel state information (CSI) only. Yet this kind of physical-layer-oriented design may fail to capture the burstiness of the source arrivals as well as the delay performance of real-time (delay-sensitive) applications. For example, it has been shown in [7] that naive water-filling, which achieves maximum system throughput for a point-to-point wireless link with infinite BS backlogs, is not always a good strategy with respect to delay performance. In fact, water-filling power control may not even be able to stabilize the queue. This motivates us to study cross-layer control for cooperative relay communication systems taking the source's burstiness and queue dynamics into consideration.

A. Related Work and Motivation

The design framework taking into account both the queueing delay and physical-layer performance is not trivial, as it involves combining queueing theory (to model the queue dynamics) and information theory (to model the physical layer dynamics) [8]. In general, there are various
approaches to deal with delay-optimal control problems. The first approach converts the delay constraint into an equivalent average rate constraint using large deviation theory [9] and solves the optimization problem using an information-theoretic formulation based on the rate constraint [10]–[12]. While this approach allows a potentially simple solution, the control actions are functions of the channel state information (CSI) only, and such control is good only in the large delay regime, where the probability of an empty buffer is small. In general, the delay-optimal control actions should be functions of both the CSI and the queue state information (QSI). In [3], [7], the authors showed that the LQHPR policy is delay optimal for multi-access fading channels. However, the solution utilizes stochastic majorization theory, which requires symmetry among the users and is difficult to extend to other situations. In [8], [10], [11], the authors studied the queue stability regions of various wireless systems using Lyapunov drift. In [12]–[14], the authors considered the asymptotic delay-power tradeoff of point-to-point and multi-user systems, but they focused on asymptotic analysis and derived asymptotically optimal solutions for the large delay regime using Lyapunov drift. In [15], control policies are developed for multiple access and broadcast systems to achieve order-optimal delay performance. Recently, delay-optimal cross-layer control policies for point-to-point links, multiaccess channels, and broadcast channels have been studied in [16], [17], and [18], respectively. Under the assumption that all queues are large enough, the GPD (Greedy Primal-Dual) algorithm [19] and the RT-SPD (Real-Time Stochastic Primal-Dual) algorithm [20] were proposed to solve utility-based optimization problems under a queueing network stability constraint and an average delay constraint, respectively. Stochastic approximation is used to estimate the average commodity rates and the average delay, and asymptotic optimality is also proved for both algorithms. However, in all these works, the topologies considered are one-hop networks, including point-to-point wireless links, multiaccess wireless systems, and broadcast wireless systems. In this paper, we focus on a different topology with two-hop communications between a source node (S), M relay stations (RSs), and one destination (D). There is one queue at the source node and at each of the M RS nodes, with random bit arrivals to the source node's queue (destined for the destination). As we shall illustrate, delay-optimal control for two-hop relay-assisted communications poses various unique challenges that are very different from those of the conventional one-hop topology:

• Complex interactions of buffers at the source node and the M RSs: The above-mentioned works on delay-optimal control for one-hop systems do not need to address the complex interactions between the queues at the source node and the RSs. Hence, the existing approaches cannot be applied directly to our two-hop system. Only a few existing works have considered queues at RSs and their dynamics. The authors in [21] obtained a lower bound on the stability region and average delay of a cognitive relay-assisted TDMA uplink system using a queue-dominance approach, while [22] studied the stability region of a two-user slotted ALOHA system with the users acting as cooperative relays for each other. However, in these works, the focus is on stability region characterization and delay analysis of simple, heuristic RS protocols.



• The curse of dimensionality: A systematic approach to modeling the delay-optimal control problem is based on the Markov Decision Process (MDP) [16]–[18]. While the optimal solution of an MDP can be characterized by a Bellman equation [23], it is well known that there is no simple way to solve the Bellman equation (which is a fixed point problem on a functional space). Brute force approaches such as value iteration and policy iteration cannot lead to any implementable solution [24]. The key issue is dimensionality: the system dynamics of the two-hop wireless system are characterized by a system state defined by the joint CSI and joint QSI, and the size of the state space grows exponentially with the number of RSs M. For example, for a system with maximum buffer length 20 and 3 CSI states, the total number of system states is 20^{M+1} × 3^{2M}, which is unmanageable even for a small number of RSs.



• Distributive control with local CSI and local QSI: The entire system state is characterized by the global CSI (i.e., the CSI of all S-R and R-D links) and the QSI at the M RSs and the source node. There are a number of works on resource optimization of relay-assisted communication systems [25]–[28], but these works focus on physical layer performance (such as throughput) and the solutions are functions of the global CSIT. As far as we are aware, the delay-optimal control for two-hop cooperative relay systems has not been solved, even in a centralized setting. Even if one could solve the delay-optimal control problem in a centralized manner, the centralized solution would require knowledge of the global CSI and global QSI of the system. However, it is very difficult for the source node and the RSs to obtain global knowledge of the CSI and QSI due to the huge signaling overhead involved. From an implementation standpoint, it is desirable for the delay-optimal control actions at the source node and the RSs to be functions of the local CSI and local QSI. This requirement demands a distributive control solution in which the source node and the RSs determine their local control actions based on their own local knowledge. There are quite a number of works using non-cooperative game theory and primal-dual decomposition theory to obtain distributive solutions for deterministic NUM problems [29], [30]. However, due to the stochastic nature of our (delay-optimal) problem, these existing approaches cannot be applied.

B. Contribution

In this paper, we shall address the above challenges by modeling the delay-optimal distributive power and link selection control for two-hop cooperative relay communication systems over frequency selective fading channels as a stochastic optimization problem. Specifically, the complex interactions of the queues at the source node and the RSs are modeled by an infinite horizon multi-dimensional MDP, involving the joint queue state of the queue at the source node and the queues at the M RS nodes as well as the joint channel states of all S-R and R-D links. First of all, to alleviate the curse of dimensionality, we propose a reduced-state MDP formulation. Based on an auction-based link selection scheme and the associated Bellman equation, we derive a social optimal power allocation and link selection (selecting one active link out of all the S-R and R-D links) control for the source node and the M RSs. The delay-optimal control actions are functions of both the QSI (via the potential function of the Bellman equation) and the local CSI, and the power control solution has a multi-level water-filling structure in which the CSI determines the power allocation across subcarriers but the QSI determines the water-level.

Secondly, to address the distributive requirement, we solve the Bellman equation using a distributive online learning algorithm in which each node locally estimates and learns a per-node potential function based on real-time observations of the local CSI and local QSI only. Using approximate MDP and the per-node potential functions, we derive an auction-based link selection control algorithm to determine which node (the source node or one of the M RS nodes) should transmit in the current slot. Reinforcement learning has been used in the literature to estimate the value function of the Bellman equation of an MDP [31]. However, the proposed online stochastic learning solution differs from conventional reinforcement learning in two ways. Firstly, we are dealing with a constrained MDP (CMDP), and our online iterative solution updates both the value function (potential) and the Lagrange multipliers (LMs) simultaneously. This is in contrast with previous works [18] where the LMs are determined offline. Secondly, conventional online learning algorithms are designed for centralized solutions where the control actions are determined entirely from the value function update; hence, the convergence proof follows from a standard contraction mapping and fixed point theorem argument [32]. In our case, the control action is determined not from the per-node potential alone but from the per-node potentials of all nodes via a per-slot auction. As a result, the per-node potential update equation is no longer a contraction mapping, and the existing convergence results cannot be applied directly to our distributive stochastic learning algorithm. Using separation of time scales, we establish technical conditions for the proposed distributive online learning to converge almost surely to the global delay-optimal solution for sufficiently large arrival rate and buffer size. We also illustrate with numerical examples the delay performance of the proposed scheme and compare it against various baselines such as conventional CSIT-only control and throughput-optimal control (in the stability sense). The proposed solution has significant performance gain over the existing schemes with low complexity O(M) and low signaling overhead, which is highly desirable for implementation.

The paper is organized as follows. We first introduce the system model, source model, queue dynamics, and control policy in Section II. In Section III, we elaborate on the reduced-state MDP problem formulation. In Section IV, we derive the delay-optimal power control and link selection


algorithm in terms of the potential function and discuss the associated structure. In Section V, we derive a distributive solution, which consists of a decentralized online potential learning algorithm based on the local CSI and local QSI as well as a per-stage auction mechanism for link selection. We also establish the technical conditions for almost sure convergence as well as the asymptotic global optimality of the proposed distributive solution. In Section VI, we discuss the performance simulations. Finally, we summarize the main results in Section VII.

II. MODELS

In this section, we shall introduce the two-hop relay-assisted communication model, the source model, the queue dynamics model, and the control policies.

A. System Model

We consider a two-hop cooperative relay communication system with one source node (S), M relay stations (RSs), and one destination (D), as illustrated in Fig. 1. The source node cannot deliver packets directly to the destination due to limited coverage, and the cooperative RSs are deployed to extend the source node's coverage to reach the destination, similar to the existing literature [33], [34]. We assume the M RSs are all half-duplex relays operating in Decode-and-Forward (DaF) mode [35]. Moreover, OFDM technology is used to convert the frequency selective channel into N_F parallel flat fading channels (subcarriers). The packet transmission is organized in fixed-duration frames. In each frame, only one link (out of all the S-R links and R-D links) is selected for transmitting information over the N_F subcarriers. To facilitate the delay-optimal control, each frame is further divided into three slots, elaborated as follows:

• Channel Estimation Slot: used by the RSs for local channel estimation according to the preambles transmitted by the source node and the destination.




• Contention Slot: used by the RSs for link selection, which is elaborated in Section II-D. (Since we target a distributive solution, the control actions at the M RSs, such as whether to transmit a packet, are functions of the local system state only; hence, potential contention could occur.)



• Transmission Slot: used for data transmission on the selected link.

B. Channel Model and Physical Layer Model

We consider a Rayleigh fading channel model where the channel gain of each link is an i.i.d. random variable with Rayleigh distribution. In this paper, resource scheduling is performed distributively at each RS; therefore, we first define the channel information available at each RS. Assuming a TDD system, the local CSI of the R-D links can be estimated at the RSs using channel reciprocity. Moreover, the local CSI of the S-R links can be estimated by the RSs as well. Therefore, the local CSI at the m-th RS is given by

    \mathcal{H}_m = \{(H_{S,m,k}, H_{m,D,k}) \,|\, \forall k \in \{1, 2, ..., N_F\}\},

where H_{S,m,k} denotes the channel gain of the k-th subcarrier between the source node and the m-th RS, and H_{m,D,k} denotes the channel gain of the k-th subcarrier between the m-th RS and the destination. For notational convenience, we also define the global CSI (GCSI) as H = \mathcal{H}_1 \cup ... \cup \mathcal{H}_M. We assume the global CSI H is quasi-static in each frame and i.i.d. between frames. The quasi-static assumption is reasonable in practical systems where the frame duration is 5 ms (e.g., WiMAX) and the coherence time for pedestrian mobility is around 50 ms. Since the global CSI is distributed across the RS nodes, collecting global CSI costs significant signaling overhead. Hence, we assume each RS only has local CSI in this paper.

Let X_{S,k} be the transmit symbol of the source node on the k-th subcarrier. If there is no transmission collision between the source node and the M RSs, the received symbol at the m-th RS on the k-th subcarrier is given by

    Y_{m,k} = H_{S,m,k} X_{S,k} + Z_{S,m,k},


where Z_{S,m,k} \sim \mathcal{CN}(0,1) is the complex Gaussian channel noise. Similarly, let X_{m,k} be the transmit symbol of the m-th RS on the k-th subcarrier. If there is no collision among the M RSs, the received symbol at the destination on the k-th subcarrier is given by

    Y_{D,k} = H_{m,D,k} X_{m,k} + Z_{m,D,k},

where Z_{m,D,k} \sim \mathcal{CN}(0,1) is the complex Gaussian channel noise.

At the beginning of each frame, the selected transmitter determines the size of the packet (the number of information bits) to be transmitted in that frame. For example, if the source node transmits one packet with R_{S,m} information bits to the m-th RS, the condition for successful decoding is given by (assuming no collision)

    R_{S,m} \le \sum_{k=1}^{N_F} \log_2\big(1 + p_{S,m,k}\,\chi\,|H_{S,m,k}|^2\big),   (1)

for some constant χ, where p_{S,m,k} is the transmit power of the source node on the k-th subcarrier. Note that the above expression for the data rate can be used to model both uncoded and coded systems. For an uncoded system using MQAM constellation, the symbol error rate (SER) at the m-th RS on the k-th subcarrier is given by [36]

    \mathrm{SER}_{m,k} \approx c_1 \exp\Big(-c_2\,\frac{p_S |H_{S,m,k}|^2}{2^{R_{S,m}} - 1}\Big),

and hence, for a target SER ε, we have χ = −c_2 / ln(ε/c_1). On the other hand, for a system with powerful error correction codes such as LDPC with reasonably large block length (e.g., 8 kbyte) and target PER of 0.1%, the maximum achievable data rate is given by the instantaneous mutual information (to within 0.5 dB SNR) [37]; in that case, χ = 1. Similarly, the destination can successfully decode a packet with R_{m,D} information bits transmitted from the m-th relay if

    R_{m,D} \le \sum_{k=1}^{N_F} \log_2\big(1 + p_{m,D,k}\,\chi\,|H_{m,D,k}|^2\big),   (2)

where p_{m,D,k} is the transmit power of the m-th RS on the k-th subcarrier.
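To make the rate model concrete, the following sketch computes the SNR gap χ from a target SER and evaluates the right-hand side of (1)/(2). This is a minimal illustration; the function and variable names are ours, not the paper's.

```python
import numpy as np

def snr_gap(target_ser, c1, c2):
    # Invert SER ≈ c1 * exp(-c2 * p|H|^2 / (2^R - 1)) at SER = target_ser:
    # chi = -c2 / ln(target_ser / c1), so that 2^R - 1 = p * chi * |H|^2.
    return -c2 / np.log(target_ser / c1)

def achievable_bits(p, h2, chi):
    # Right-hand side of (1)/(2): sum over subcarriers of log2(1 + p_k * chi * |H_k|^2)
    return float(np.sum(np.log2(1.0 + p * h2 * chi)))

# For a strongly coded system (e.g. LDPC with large block length), set chi = 1 instead.
```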


C. Bursty Source Model and System Queue Dynamics

There is one queue at the source node and one queue at each of the M RSs for the storage of received information bits. Let I(t) denote the number of new information bits arriving at the source node in frame t. We assume I(t) is i.i.d. over frames with a general distribution f_I(n), and that the information bits arrive at the end of each frame. Moreover, we define the following notation:

• Q_S(t) denotes the number of information bits in the source node's queue at frame t.

• Q_m(t) denotes the number of information bits in the queue of the m-th RS (m = 1, 2, ..., M) at frame t.

• Q(t) = {Q_S(t)} ∪ {Q_m(t) | ∀m} denotes the global queue state information (GQSI) at frame t.

Similarly, the GQSI is distributed across all the nodes in the system, and it involves significant signaling overhead to deliver GQSI knowledge to all the nodes. In this paper, we target a decentralized solution in which the control policy is a function of the local CSI and local QSI only. We achieve this in two steps. In Section IV, we first derive a semi-decentralized solution, which is a function of the local CSI but the global QSI (via the system potential function). In Section V, we further decentralize the solution so that it becomes a function of the local CSI and local QSI. The overall system queue dynamics at the source node and the RSs are summarized below:

• If the source node successfully delivers R_{S,m} information bits to the m-th RS, then

    Q_S(t+1) = \min\{Q_S(t) + I(t) - R_{S,m},\, N_Q\},
    Q_m(t+1) = \min\{Q_m(t) + R_{S,m},\, N_Q\},

where N_Q is the maximum buffer size, which is finite. Each information bit delivered from the source node will be received by one of the RSs, and different RSs may have different numbers of information bits in their buffers. When the source node delivers a packet to one RS, selecting RSs with different buffer lengths may have different effects on the average

packet delay of the system. Therefore, not only the CSI of all S-R links but also the QSI of all RSs should be considered when directing the source node's packet transmission. Such coupling of the system QSI is unique to delay-optimal control of multi-hop systems; for one-hop systems, the packets transmitted from one queue are never received by another queue. Fig. 2 shows the top-level architecture illustrating the interactions among all the queues in the two-hop cooperative system.

• If the source node fails to deliver any information bits to the RSs, then

    Q_S(t+1) = \min\{Q_S(t) + I(t),\, N_Q\}.

• If the m-th RS successfully delivers R_{m,D} information bits to the destination, then

    Q_m(t+1) = \max\{Q_m(t) - R_{m,D},\, 0\}.
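The queue dynamics above can be collected into a single per-frame update; the sketch below is a direct transcription (the variable names and the `mode` encoding are illustrative):

```python
def queue_update(QS, Qm, I_t, NQ, mode, m=None, R_Sm=0, R_mD=0):
    """One-frame queue update for the source queue QS and relay queues Qm (list).
    mode: 'SR' if an S-R link was selected, 'RD' if an R-D link, 'idle' otherwise."""
    Qm = list(Qm)
    if mode == 'SR':        # source delivers R_Sm bits to relay m
        QS = min(QS + I_t - R_Sm, NQ)
        Qm[m] = min(Qm[m] + R_Sm, NQ)
    elif mode == 'RD':      # relay m delivers R_mD bits to the destination
        QS = min(QS + I_t, NQ)
        Qm[m] = max(Qm[m] - R_mD, 0)
    else:                   # no successful delivery: arrivals still accumulate
        QS = min(QS + I_t, NQ)
    return QS, Qm
```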

Due to the limited buffer size, the source node may encounter packet drops due to buffer overflow. (The RSs do not suffer from overflow because their information bits all come from the source node.) In the following control policy design, we shall set a constraint on the target packet drop rate due to buffer overflow at the source node.

D. Contention Resolution Protocol among the M RSs

Since there are multiple transmitters in the system and the control policies of the M RSs are based on local CSI without centralized coordination, a contention resolution protocol is needed to coordinate the distributive spectrum access (link selection). In this paper, we propose an auction-based contention resolution mechanism as follows. We shall also prove that the proposed auction-based algorithm is social optimal, meaning that it achieves the same optimal performance as if there were global CSI and global QSI.

Protocol 1 (Auction-Based Spectrum Access): The MAC-layer procedure of the auction-based spectrum access (link selection) is given below:

• Each RS (say the m-th RS) determines the bids for both its S-R link (the link between the source node and the m-th RS) and its R-D link (the link between the m-th RS and the destination), denoted B_{S,m} and B_{m,D} respectively. Then, all the RSs take turns to submit their smallest bid (min{B_{S,m}, B_{m,D}}) in the contention slot of the frame.

• Each RS listens to the bids transmitted by the other RSs and compares the received bids with its own bid. If it receives a smaller bid from another RS, the RS keeps silent in the transmission slot. Otherwise, the link with the minimum bid is selected for transmission in the transmission slot. If an S-R link is selected, the selected RS (say the m-th RS) notifies the source node of the calculated source node transmission power {p_{S,m,k} | ∀k} and rate R_{S,m}, and the source node transmits the packet with the given parameters; otherwise, if an R-D link is selected, the selected RS transmits the packet to the destination with power {p_{m,D,k} | ∀k} and rate R_{m,D}.
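For illustration, the bid-comparison logic of Protocol 1 reduces to a global minimum over the submitted bids. A minimal sketch, assuming (as in the protocol) that every RS hears all submitted bids; the names are ours:

```python
def select_link(bids):
    """bids: list of (B_Sm, B_mD) pairs, one per RS.
    Each RS submits min(B_Sm, B_mD); the globally smallest bid wins,
    and the winning RS activates its S-R or R-D link accordingly."""
    submitted = [min(b_sr, b_rd) for (b_sr, b_rd) in bids]
    winner = min(range(len(bids)), key=lambda m: submitted[m])
    b_sr, b_rd = bids[winner]
    link = ('S-R', winner) if b_sr <= b_rd else ('R-D', winner)
    return link   # e.g. ('S-R', 2): the source transmits to RS 2
```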

Fig. 3 illustrates an example of the link auction with two relays (M = 2). In this example, the auction procedure is as follows:

• Each RS calculates the bids for both its S-R and R-D links, i.e., (B_{S,1}, B_{1,D}) for RS1 and (B_{S,2}, B_{2,D}) for RS2.



• The contention slot is further divided into two mini-slots. In the first mini-slot, RS1 submits its smaller bid, denoted B_1 = min(B_{S,1}, B_{1,D}); in the second mini-slot, RS2 submits its smaller bid, denoted B_2 = min(B_{S,2}, B_{2,D}). Each RS listens to the bid submitted by the other RS.



• At the end of the contention slot, each RS compares the received bid from the other RS with its own bid.
  – If RS1 has the smaller bid (B_1 < B_2), RS2 remains silent in the transmission slot. Moreover, if B_1 = B_{S,1}, the link between the source node and RS1 is selected for the transmission slot; otherwise (B_1 = B_{1,D}), the link between RS1 and the destination is selected.
  – If RS2 has the smaller bid (B_2 < B_1), RS1 remains silent in the transmission slot. Moreover, if B_2 = B_{S,2}, the link between the source node and RS2 is selected; otherwise (B_2 = B_{2,D}), the link between RS2 and the destination is selected.




• If an S-R link is selected, the selected RS notifies the source node in the notification mini-slot of the transmission slot, and the source node starts to transmit the packet. Otherwise, if an R-D link is selected, the selected RS notifies the destination in the notification mini-slot and then starts to transmit the packet to the destination.

As a result, the link/RS selection control can be parameterized by a bidding vector {(B_{S,m}, B_{m,D}) | ∀m}. We shall refer to the bidding vector as the link selection policy in the rest of the paper.

E. Control Policy

In the following, we first consider a semi-distributive solution in which the action of the m-th RS is based on the global QSI and the local CSI, denoted by κ_m = (\mathcal{H}_m, Q). In Section V, we shall further decentralize the solution w.r.t. the global QSI requirement. The semi-distributive stationary resource control policies for the source node and the RSs are defined below.

Definition 1 (Semi-Distributive Stationary Control Policy): A semi-distributive stationary control policy Ω = (Ω^1, ..., Ω^M) consists of one stationary policy for each RS. A stationary policy Ω^m = (Ω^m_B, Ω^m_R, Ω^m_p) of bidding strategy, rate and power allocation of the S-R link, and rate and power allocation of the R-D link for the m-th RS is a map from the relay's local information κ_m = (\mathcal{H}_m, Q) to the bidding action, the rate and power allocation actions of the S-R link, and the rate and power allocation actions of the R-D link. A policy Ω is called feasible if it is a unichain policy [23] and the associated actions satisfy the packet drop rate, rate, and transmit power constraints. Specifically,

    \Omega^m_B(\kappa_m) = \{B_{S,m}, B_{m,D}\}, \quad \Omega^m_R(\kappa_m) = \{R_{S,m}, R_{m,D}\}, \quad \Omega^m_p(\kappa_m) = \{(p_{S,m,k}, p_{m,D,k}) \,|\, \forall k\},

for m = 1, 2, ..., M. Moreover, the following packet drop rate, rate, and average transmit power constraints should be satisfied.

Constraints on the source node: (1), and

    \Pr[Q_S = N_Q] \le D, \qquad \mathbb{E}\Big[\sum_{m=1}^{M} \eta_{S,m} \sum_{k=1}^{N_F} p_{S,m,k}\Big] \le \bar{P}_S.

Constraints on the RS nodes: (2), and

    \mathbb{E}\Big[\eta_{m,D} \sum_{k=1}^{N_F} p_{m,D,k}\Big] \le \bar{P}_m, \quad \forall m = 1, 2, ..., M,


where D is the maximum tolerable packet drop rate of the source node, \bar{P}_S (or \bar{P}_m) is the average power constraint of the source node (or the m-th RS), and η_{S,m} (or η_{m,D}) is an indicator equal to 1 when the S-R link between the source node and the m-th RS (or the R-D link of the m-th RS) is selected, and 0 otherwise.

F. Objective Function

The goal of the scheduler is to choose an optimal stationary feasible unichain policy Ω* = {Ω^{1*}, Ω^{2*}, ..., Ω^{M*}} that minimizes the average end-to-end transmission delay. Assume that the arrival rate at the source node falls inside the stability region of the system. Define the global system state as κ(t) = κ_1(t) ∪ κ_2(t) ∪ ... ∪ κ_M(t) = {Q(t), H(t)}. Given a feasible unichain policy Ω, the process of global system states evolves with the following probability transition kernel:

    \Pr\big[\kappa(t+1) \,\big|\, \kappa(t), \Omega^1(\kappa_1(t)), ..., \Omega^M(\kappa_M(t))\big]
    = \Pr\big[\mathbf{H}(t+1)\big]\, \Pr\big[\mathbf{Q}(t+1) \,\big|\, \kappa(t), \Omega^1(\kappa_1(t)), ..., \Omega^M(\kappa_M(t))\big],   (3)

where the second equality holds because the channel fading process {H(t)} is i.i.d. (independent of the previous channel fading, queue states, and actions). Since the system state at frame t+1 only depends on the system state and actions at frame t, for a unichain policy where every state is aperiodic and positive recurrent [13], κ(t) is an ergodic Markov process and there exists a unique steady state distribution π_κ. To obtain the system objective function, we first introduce the following lemma.

Lemma 1 (Average End-to-End Delay): For a small average packet drop rate constraint D, the average end-to-end delay of the two-hop cooperative RS system is given by

    \bar{T}(\Omega) = \lim_{T \to +\infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_\kappa\Big[\frac{\sum_{m=S}^{M} Q_m(t)}{\lambda_S}\Big] = \mathbb{E}^{\Omega}_{\pi_\kappa}\Big[\frac{\sum_{m=S}^{M} Q_m}{\lambda_S}\Big],


where we allow a slight abuse of notation and use m = S, 1, 2, ..., M in the equation (this abuse will also appear in the rest of this paper whenever the meaning is clear), \mathbb{E}^{\Omega}_{\pi_\kappa} denotes the expectation with respect to the steady state distribution π_κ induced by the unichain control policy Ω, and λ_S is the average number of arrival bits per frame at the source node.

Proof: Please refer to Appendix A.

To simplify the notation, we shall drop λ_S from the system objective in the remainder of this paper, as it is a constant. The average power constraints for the source node and the m-th relays as well as the average packet drop rate constraint for the source node can also be expressed in a similar form:

    \lim_{T\to+\infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_\kappa\Big[\sum_{m=1}^{M} \eta_{S,m}(t) \sum_{k=1}^{N_F} p_{S,m,k}(t)\Big] = \mathbb{E}_{\pi_\kappa}\Big[\sum_{m=1}^{M} \eta_{S,m} \sum_{k=1}^{N_F} p_{S,m,k}\Big] \le \bar{P}_S,   (4)

    \lim_{T\to+\infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_\kappa\Big[\eta_{m,D}(t) \sum_{k=1}^{N_F} p_{m,D,k}(t)\Big] = \mathbb{E}_{\pi_\kappa}\Big[\eta_{m,D} \sum_{k=1}^{N_F} p_{m,D,k}\Big] \le \bar{P}_m, \quad m = 1, 2, ..., M,   (5)

    \lim_{T\to+\infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_\kappa\Big[I\big(Q_S(t) = N_Q\big)\Big] = \mathbb{E}_{\pi_\kappa}\Big[I\big(Q_S = N_Q\big)\Big] \le D.   (6)

Hence, the delay-optimal design can be formulated as the following constrained Markov decision process (CMDP):

Problem 1 (Delay-Optimal Constrained MDP): Find a feasible stationary unichain policy Ω = (Ω^1, ..., Ω^M) such that the average end-to-end delay is minimized subject to the average power constraints and the average packet drop rate constraint, i.e.,

    \min_{\Omega} \ \mathbb{E}^{\Omega}_{\pi_\kappa}\Big(\sum_{m=S}^{M} Q_m\Big)

    \text{s.t.} \quad \mathbb{E}^{\Omega}_{\pi_\kappa}\Big(\sum_{m=1}^{M} \eta_{S,m} \sum_{k=1}^{N_F} p_{S,m,k}\Big) \le \bar{P}_S, \quad \text{[source node's average power constraint]}

    \mathbb{E}^{\Omega}_{\pi_\kappa}\Big(\eta_{m,D} \sum_{k=1}^{N_F} p_{m,D,k}\Big) \le \bar{P}_m, \quad m = 1, 2, ..., M, \quad \text{[RS's average power constraint]}

    \mathbb{E}^{\Omega}_{\pi_\kappa}\Big[I\big(Q_S = N_Q\big)\Big] \le D. \quad \text{[average packet drop rate constraint]}


In the following section, we shall first convert Problem 1 into an unconstrained MDP problem via Lagrange theory [38], and then propose an equivalent reduced-state MDP formulation which significantly reduces the size of the system state space by exploiting the i.i.d. nature of the CSI.

III. CONSTRAINED MARKOV DECISION PROBLEM FORMULATION

A. Preliminary on Markov Decision Process

In this section, we review some background on optimal control with Markov decision processes. Consider a controlled Markov chain {κ(t)} on a finite state space S = {0, 1, 2, ..., M} with a finite action space A = {u_1, u_2, ..., u_N} and transition kernel

    p(s_j | s_i, u) = \Pr\big[\text{transition from the current state } s_i \text{ to the next state } s_j \text{ under the action } u\big], \quad \forall s_i, s_j \in \mathcal{S}.

Associated with each stage (say the t-th stage) is a cost g(κ(t), u(t)), which depends only on the current state κ(t) and the action u(t) taken in that stage. Given a stationary control policy Ω = α(κ) (α : S → A), which is the set of actions under all realizations of the system state, (κ(t), u(t), g(t)) forms a discrete time Markov chain with underlying probability measure π_Ω. The optimization objective of an infinite horizon MDP is to choose the control policy Ω so as to minimize the average cost per stage:

    \min_{\Omega} \lim_{L\to\infty} \frac{1}{L} \sum_{l=0}^{L-1} \mathbb{E}^{\Omega}_\kappa\big[g(\kappa(t), u(t))\big],

where \mathbb{E}^{\Omega} denotes the expectation operator under the induced measure π_Ω, which depends on the control policy Ω. It is well known that there is a unique optimal average cost per stage for each starting state if unichain stationary policies are considered [23]. Under a stationary unichain policy, the optimal control policy of the above MDP is given by the solution of the Bellman equation, as summarized in the following lemma.

Lemma 2: If there exist a θ and a vector V = [V(1), ..., V(M)]^T such that the following Bellman equation is satisfied:

    \theta + V(s_i) = \min_{u}\Big[g(s_i, u) + \sum_{s_j} p(s_j | s_i, u)\, V(s_j)\Big], \quad \forall s_i \in \mathcal{S},


then V(s) is called the potential function of the MDP and θ is the optimal average cost per stage, satisfying

    \theta = \min_{\Omega} \lim_{L\to\infty} \frac{1}{L} \sum_{l=0}^{L-1} \mathbb{E}^{\Omega}_\kappa\big[g(\kappa(t), u(t))\big].

The Bellman equation in Lemma 2 is a fixed point problem on a functional space. A general solution, known as value iteration [23], can be used to find the potential function V(s) of the Bellman equation iteratively. The convergence of value iteration follows directly from contraction mapping and Banach's fixed point theorem [39].

Algorithm 1 (Offline Value Iteration):

• Initialize the potential function, denoted V^0 = [V^0(1), ..., V^0(M)]^T. Let l = 1.

• In the l-th iteration, update the potential function according to

    V^l(s_i) = \min_{u}\Big[g(s_i, u) + \sum_{s_j} p(s_j | s_i, u)\, V^{l-1}(s_j)\Big] - V^l(0), \quad \forall s_i \in \mathcal{S}.

• If V^l = V^{l-1}, terminate the algorithm; otherwise set l = l + 1 and go to the second step.
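For reference, a compact sketch of Algorithm 1 for a generic finite MDP. The transition kernel `P`, the per-stage cost `g`, and the use of state 0 as the reference state are assumptions of this illustration:

```python
import numpy as np

def relative_value_iteration(P, g, tol=1e-9, max_iter=10000):
    """P: (num_actions, num_states, num_states) transition kernel,
    g: (num_states, num_actions) per-stage cost.
    Returns the potential vector V and the optimal average cost theta."""
    num_states = g.shape[0]
    V = np.zeros(num_states)
    for _ in range(max_iter):
        # one-step lookahead: min over actions of immediate cost + expected potential
        Q = g + np.einsum('aij,j->ia', P, V)   # Q[i, a]
        V_new = Q.min(axis=1)
        theta = V_new[0]                        # normalize by the reference state
        V_new = V_new - theta
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, theta
        V = V_new
    return V, theta
```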

B. Lagrangian Approach for the CMDP

In this paper, we define the state space and the action space as

    \mathcal{S} = \{\mathbf{H}, \mathbf{Q}\} \quad \text{and} \quad \mathcal{A} = \{p_{S,m,k}, p_{m,D,k}, R_{S,m}, R_{m,D}, B_{S,m}, B_{m,D} \,|\, \forall m, k\}.

From (3), the transition kernel of the system state is given by

    \Pr\big[\mathbf{H}(t+1)\big]\, \Pr\big[\mathbf{Q}(t+1) \,\big|\, \mathbf{Q}(t), \Omega\big],

which is a function of the stationary unichain control policy Ω mapping each system state in S to an action in A. For any given stationary unichain policy Ω, the system state {Q(t), H(t)} evolves as a discrete time Markov chain with underlying probability measure π_Ω. Therefore, Problem 1 is a constrained MDP which can be converted into


an unconstrained MDP via Lagrange theory [38]. For any vector of Lagrange multipliers (LMs) γ = [γ_{S,p}, γ_{S,d}, γ_{1,p}, ..., γ_{M,p}]^T, we define the Lagrangian as

    L(\Omega, \gamma) = \lim_{T\to+\infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Omega}_\kappa\Big[g\big(\kappa(t), \Omega(\kappa(t)), \gamma\big)\Big],

where

    g\big(\kappa(t), \Omega(\kappa(t)), \gamma\big) = \sum_{m=1}^{M}\Big[Q_m + \gamma_{m,p}\,\eta_{m,D}\sum_{k=1}^{N_F} p_{m,D,k}\Big] + Q_S + \gamma_{S,p}\sum_{m=1}^{M}\eta_{S,m}\sum_{k=1}^{N_F} p_{S,m,k} + \gamma_{S,d}\, I\big(Q_S = N_Q\big).

Therefore, the corresponding unconstrained MDP for a particular vector of LMs is given by

    \min_{\Omega} L(\Omega, \gamma) = \min_{\Omega} \lim_{T\to+\infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Omega}_\kappa\Big[g\big(\kappa(t), \Omega(\kappa(t)), \gamma\big)\Big].   (7)

According to Lemma 2, for a given LM vector γ, the optimizing unichain policy for the unconstrained MDP (7) can be obtained by solving the Bellman equation w.r.t. (θ, {J(κ)}):

    \theta + J(\kappa^i) = \min_{u(\kappa^i)}\Big\{g\big(\kappa^i, u(\kappa^i)\big) + \sum_{\kappa^j} \Pr\big[\kappa^j \,\big|\, \kappa^i, u(\kappa^i)\big]\, J(\kappa^j)\Big\}, \quad \forall \kappa^i,   (8)

where u(κ^i) is the aggregation of the actions of all transmitters given the system state κ^i, θ = min_Ω L(Ω, γ) is the optimal average reward per stage, and {J(κ)} is the potential function of the MDP. For unichain policies Ω, the solution to (8) is unique [23]. Denote the optimal unichain policy for (8) as Ω*. Using standard optimization theory [38], the problem in (7) has an optimal solution for a particular choice of LMs γ*, where γ* is chosen to satisfy the average power constraints in (4), (5) and the packet drop constraint in (6). Moreover, the following saddle point condition holds:

    L(\Omega, \gamma^*) \ge L(\Omega^*, \gamma^*) \ge L(\Omega^*, \gamma).   (9)
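The per-stage cost g(κ(t), Ω(κ(t)), γ) above is a direct sum of the queue backlog and the LM-weighted power and drop penalties; a literal transcription (all names are illustrative):

```python
def per_stage_cost(QS, Qm, p_sr, p_rd, eta_s, eta_d, gam_Sp, gam_Sd, gam_mp, NQ):
    """Per-stage cost g(kappa, Omega(kappa), gamma): queue backlog plus
    LM-weighted power penalties and the source buffer-overflow penalty."""
    M = len(Qm)
    cost = QS + sum(Qm)
    cost += gam_Sp * sum(eta_s[m] * sum(p_sr[m]) for m in range(M))    # source power
    cost += sum(gam_mp[m] * eta_d[m] * sum(p_rd[m]) for m in range(M)) # RS power
    cost += gam_Sd * (1.0 if QS == NQ else 0.0)                        # drop penalty
    return cost
```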

C. Reduced State Formulation for the CMDP

Note that (7) is an infinite horizon Markov Decision Process (MDP). However, there are several technical challenges. First of all, the state space of the MDP in (7) involves H and Q,


which is extremely huge. Secondly, a brute force solution of (7) would lead to a partially observed MDP (POMDP) due to the dependency of the control actions on the local CSI [13]. Solving a POMDP is well known to be very complicated, resulting in high complexity solutions. In this section, we shall overcome the above challenges by first converting the problem into a reduced-state MDP. For notational convenience, we partition the unichain policy Ω into a collection of actions based on the QSI. Specifically, we have the following definition.

Definition 2 (Conditional Actions for the m-th Relay): Given a unichain control policy Ω^m, we define Ω^m(Q) = {Ω^m(Q, \mathcal{H}_m) | ∀\mathcal{H}_m} as the collection of actions under a given QSI Q for all possible local CSIT \mathcal{H}_m.

As a result, the original MDP is equivalent to a reduced-state MDP, which is summarized in the following lemma.

Lemma 3 (Equivalent MDP on a Reduced State Space): The original problem (7) is equivalent to the following reduced-state MDP with state space given by the QSI Q(t):

    \min_{\Omega}(L) = \min_{\Omega} \lim_{T\to+\infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Omega}_\kappa\Big[g\big(\mathbf{Q}(t), \Omega(\mathbf{Q}(t))\big)\Big],   (10)

where

    g\big(\mathbf{Q}(t), \Omega(\mathbf{Q}(t))\big) = \mathbb{E}_{\mathbf{H}}\Big[g\big(\kappa(t), \Omega(\kappa(t))\big)\Big].

Proof: Please refer to Appendix B.

As a result of Lemma 3, the reduced-state MDP is no longer a POMDP and the size of the state space is significantly reduced. In the following section, we shall elaborate on how to solve the above problem.

IV. OPTIMAL SEMI-DISTRIBUTIVE CONTROL

The problem in (10) is a Markov Decision Process with infinite horizon. Hence, according to Lemma 2, the solution is characterized by the Bellman equation:

    \theta + V(\mathbf{Q}^i) = \min_{u(\mathbf{Q}^i)}\Big\{g\big(\mathbf{Q}^i, u(\mathbf{Q}^i)\big) + \sum_{\mathbf{Q}^j} \Pr\big[\mathbf{Q}^j \,\big|\, \mathbf{Q}^i, u(\mathbf{Q}^i)\big]\, V(\mathbf{Q}^j)\Big\},   (11)


where u(Q^i) denotes the aggregation of the system actions when the system state is Q^i and V(Q^i) is the system potential for state Q^i. Specifically, for unichain policies, there is a unique {θ, V(Q)} that satisfies the Bellman equation (11), and θ and the corresponding u*(Q^i) give the optimal value and optimal control of the problem, respectively. As a result, the key step in deriving the optimal control actions is to obtain the system potential function V(Q^i) in (11). To illustrate the structure of the solution, we first assume we could obtain the potential function V(Q^i) (e.g., via value iteration [13]) and focus on deriving a semi-distributive control action u*(Q^i) which is a function of the local CSI only. Afterwards, we shall propose in Section V an online decentralized iterative algorithm to obtain the potential function V(Q^i), which requires only local QSI at each RS.

A. Structure of the Optimal Control Policy

In this section, we first derive the semi-distributive solution as a function of the local CSI only. From the Bellman equation in (11), we observe that the dependency on the global QSI is due to the potential function V(Q). Given a potential function V(Q^i), the optimal control action is given by the following problem.

Problem 2 (Optimal Distributive Control with Local CSIT): For the given potential function V(Q), find the optimal system actions u*(Q) satisfying the Bellman equation in (11). Specifically,

    u^*(\mathbf{Q}^i) = \arg\min_{u(\mathbf{Q}^i)} \mathbb{E}_{\mathbf{H}}\Big\{g\big((\mathbf{Q}^i,\mathbf{H}), u(\mathbf{Q}^i,\mathbf{H})\big) + \sum_{\mathbf{Q}^j} \Pr\big[\mathbf{Q}^j \,\big|\, \mathbf{Q}^i, \mathbf{H}, u(\mathbf{Q}^i,\mathbf{H})\big]\, V(\mathbf{Q}^j)\Big\}, \quad \text{s.t. (1) and (2)},   (12)

where the objective can be expanded as

    \sum_{m=S}^{M} Q_m + \gamma_{S,d}\, I[Q_S = N_Q] + \min_{u(\mathbf{Q}^i)} \mathbb{E}_{\mathbf{H}}\Big\{\sum_{n} f_I(n)\, V\big(\mathbf{Q}^i(n)\big) + \sum_{m} \eta_{S,m}\, F_{S,m}\big(R_{S,m}, \{p_{S,m,k}\}\big) + \sum_{m} \eta_{m,D}\, F_{m,D}\big(R_{m,D}, \{p_{m,D,k}\}\big)\Big\},

with

    F_{S,m}\big(R_{S,m}, \{p_{S,m,k}\}\big) = \gamma_{S,p}\sum_{k} p_{S,m,k} + \sum_{n} f_I(n)\,\Delta V\big(\mathbf{Q}^i_S(m, R_{S,m}, n)\big),

    F_{m,D}\big(R_{m,D}, \{p_{m,D,k}\}\big) = \gamma_{m,p}\sum_{k} p_{m,D,k} + \sum_{n} f_I(n)\,\Delta V\big(\mathbf{Q}^i_m(R_{m,D}, n)\big),


where the system action u(·) is defined as u(\mathbf{Q}^i) = [u^1(\mathbf{Q}^i), ..., u^M(\mathbf{Q}^i)]^T,

    \Delta V\big(\mathbf{Q}^i_S(m, R_{S,m}, n)\big) = V\big(\mathbf{Q}^i_S(m, R_{S,m}, n)\big) - V\big(\mathbf{Q}^i(n)\big),

    \Delta V\big(\mathbf{Q}^i_m(R_{m,D}, n)\big) = V\big(\mathbf{Q}^i_m(R_{m,D}, n)\big) - V\big(\mathbf{Q}^i(n)\big),

and the post-action system states Q^i_S(m, R_{S,m}, n) and Q^i_m(R_{m,D}, n) are defined as

    \mathbf{Q}^i_S(m, R_{S,m}, n) = \big[\min\{N_Q, Q^i_S + n - R_{S,m}\}, Q^i_1, ..., Q^i_{m-1}, Q^i_m + R_{S,m}, Q^i_{m+1}, ..., Q^i_M\big]^T,

    \mathbf{Q}^i_m(R_{m,D}, n) = \big[\min\{N_Q, Q^i_S + n\}, Q^i_1, ..., Q^i_{m-1}, Q^i_m - R_{m,D}, Q^i_{m+1}, ..., Q^i_M\big]^T.

One challenge in obtaining the optimal solution of Problem 2 is the distributive requirement on the local CSI. For example, we require that the m-th RS determine its bid, power allocation, and rate allocation based on \mathcal{H}_m only. From (12), we observe that a brute force solution would only lead to a centralized solution requiring global CSI knowledge and is hence not desirable. To obtain a distributive solution, we perform the decomposition summarized in the following lemma.

Lemma 4 (Problem Decomposition): Given the system potential {V(Q^i)}, the following semi-distributive controls are globally optimal for Problem 2 for any realization of the CSI H and QSI Q^i:

• Optimal S-R and R-D rate allocation at the RSs:

    R^*_{S,m} = \arg\min_{R_{S,m}} \bigg[\underbrace{\gamma_{S,p} \sum_{k} \Big(\frac{2^{R_{S,m}/n_S}}{\chi\big(\prod_{i=1}^{n_S}|H_{S,m,k_i}|^2\big)^{1/n_S}} - \frac{1}{\chi|H_{S,m,k}|^2}\Big)^+}_{\text{Term I: local CSI at the source node}} + \underbrace{\sum_{n} f_I(n)\,\Delta V\big(\mathbf{Q}^i_S(m, R_{S,m}, n)\big)}_{\text{Term II: QSI}}\bigg],   (13)

    R^*_{m,D} = \arg\min_{R_{m,D}} \bigg[\underbrace{\gamma_{m,p} \sum_{k} \Big(\frac{2^{R_{m,D}/n_m}}{\chi\big(\prod_{i=1}^{n_m}|H_{m,D,k_i}|^2\big)^{1/n_m}} - \frac{1}{\chi|H_{m,D,k}|^2}\Big)^+}_{\text{Term I: local CSI at the RS}} + \underbrace{\sum_{n} f_I(n)\,\Delta V\big(\mathbf{Q}^i_m(R_{m,D}, n)\big)}_{\text{Term II: QSI}}\bigg],   (14)


where we re-sort the channel gains of the N_F subcarriers according to their norms, i.e., |H_{S,m,k_1}|^2 ≥ |H_{S,m,k_2}|^2 ≥ ... ≥ |H_{S,m,k_{N_F}}|^2 and |H_{m,D,k_1}|^2 ≥ |H_{m,D,k_2}|^2 ≥ ... ≥ |H_{m,D,k_{N_F}}|^2, and n_S and n_m are integers taking values from 1 to N_F. The determination of n_S and n_m follows the water-filling approach, which is elaborated in detail in Appendix C.

• Optimal S-R and R-D power allocation at the RSs:

    p^*_{S,m,k} = \Big(\frac{2^{R^*_{S,m}/n_S}}{\chi\big(\prod_{i=1}^{n_S}|H_{S,m,k_i}|^2\big)^{1/n_S}} - \frac{1}{\chi|H_{S,m,k}|^2}\Big)^+ \quad \forall k,   (15)

    p^*_{m,D,k} = \Big(\frac{2^{R^*_{m,D}/n_m}}{\chi\big(\prod_{i=1}^{n_m}|H_{m,D,k_i}|^2\big)^{1/n_m}} - \frac{1}{\chi|H_{m,D,k}|^2}\Big)^+ \quad \forall k,   (16)

• Optimal S-R and R-D bid (link selection) at the RSs:

    B^*_{S,m} = F_{S,m}\big(R^*_{S,m}, \{p^*_{S,m,k}\}\big) \quad \text{and} \quad B^*_{m,D} = F_{m,D}\big(R^*_{m,D}, \{p^*_{m,D,k}\}\big) \quad \forall m = 1, 2, ..., M.   (17)

Proof: Please refer to Appendix C.

Remark 1: Note that the optimal controls of the RSs are all functions of the QSI (via the potential function) and the local CSI. For example, the optimal rate control in (13) and (14) depends on two terms, where the first term is related to the local CSI and the second term is related to the global QSI. This justifies the importance of queue-aware resource allocation.

B. Structure of the Asymptotically Optimal Power and Link Selection (Bidding) Control Policy

The optimal unichain control policy introduced in the above section involves a discrete search whose complexity may be large when the system state space is large. In the following, we derive a closed-form expression for the unichain control policy which is proved to be asymptotically optimal.

Lemma 5 (Asymptotically Optimal Control): Let λ_S be the average packet arrival rate at the source node. If λ_S and N_Q/λ_S are sufficiently large and γ_{m,p} (m = S, 1, 2, ..., M) are sufficiently small, the following resource allocation policy is asymptotically optimal for any realization of the CSI H and QSI Q^i.




• Optimal S-R and R-D rate allocation at the RSs:

    R^*_{S,m} = \log_2\Big(\chi^{N_F} \prod_{k=1}^{N_F} |H_{S,m,k}|^2\Big) + N_F \log_2\Big(\frac{V'_S(\mathbf{Q}^i) - V'_m(\mathbf{Q}^i)}{\gamma_{S,p}\ln 2}\Big),   (18)

    R^*_{m,D} = \log_2\Big(\chi^{N_F} \prod_{k=1}^{N_F} |H_{m,D,k}|^2\Big) + N_F \log_2\Big(\frac{V'_m(\mathbf{Q}^i)}{\gamma_{m,p}\ln 2}\Big),   (19)

where V'_S(\mathbf{Q}^i) = V([Q^i_S + 1, Q^i_1, ..., Q^i_M]^T)/2 − V([Q^i_S − 1, Q^i_1, ..., Q^i_M]^T)/2 and V'_m(\mathbf{Q}^i) = V([Q^i_S, ..., Q^i_m + 1, ..., Q^i_M]^T)/2 − V([Q^i_S, ..., Q^i_m − 1, ..., Q^i_M]^T)/2.

• Optimal S-R and R-D power allocation at the RSs:

    p^*_{S,m,k} = \frac{V'_S(\mathbf{Q}^i) - V'_m(\mathbf{Q}^i)}{\gamma_{S,p}\ln 2} - \frac{1}{\chi|H_{S,m,k}|^2} \quad \forall k,   (20)

    p^*_{m,D,k} = \frac{V'_m(\mathbf{Q}^i)}{\gamma_{m,p}\ln 2} - \frac{1}{\chi|H_{m,D,k}|^2} \quad \forall k.   (21)

• Optimal S-R and R-D bid at the RSs: same as (17).

Proof: Please refer to Appendix D.
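The power allocations (20) and (21) are multi-level water-filling: the water level is the potential difference scaled by the LM, while the CSI term 1/(χ|H_k|^2) shapes the per-subcarrier allocation. A minimal sketch (names are ours; we clip at zero since power must be nonnegative):

```python
import numpy as np

def waterfill_power(dV, gamma_p, h2, chi):
    """Multi-level water-filling of (20)-(21):
    the water level dV / (gamma_p * ln 2) is driven by the QSI (via the potential),
    while 1/(chi*|H_k|^2) shapes the allocation across subcarriers (CSI)."""
    level = dV / (gamma_p * np.log(2.0))
    return np.maximum(level - 1.0 / (chi * h2), 0.0)

# S-R link of RS m: p = waterfill_power(V'_S(Q) - V'_m(Q), gamma_Sp, |H_{S,m,k}|^2, chi)
# R-D link of RS m: p = waterfill_power(V'_m(Q), gamma_mp, |H_{m,D,k}|^2, chi)
```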

Remark 2: The asymptotically optimal controls in Lemma 5 are closed-form functions of both the CSI and the QSI, where the dependence on the QSI is indirect, via the potential function V(Q^i). The power control solution has the form of multi-level water-filling, where power is allocated across subcarriers according to the CSI but the water level is adaptive to the QSI. Similarly, the optimal rate control and bidding strategy (link selection) depend on both the CSI and the QSI.

Remark 3: The control actions are functions of the potential function V(Q^i) and the LMs. Unfortunately, determining the potential function and the LMs involves solving the Bellman equation in (11), which has exponential complexity. Furthermore, even if we could solve it, global QSI knowledge (from the source node and all the RSs) would be required, which is highly undesirable.

V. DISTRIBUTIVE ONLINE LEARNING FOR POTENTIAL FUNCTION

In this section, we shall propose a distributive algorithm to determine the potential function V(Q) and the LMs γ = {γ_{S,d}, γ_{S,p}, γ_{1,p}, ..., γ_{M,p}} which requires knowledge of the local QSI


and local CSI only at each RS. We show that under the proposed auction mechanism, the localized learning algorithm converges almost surely. Furthermore, for sufficiently large M, the proposed distributive solution is asymptotically globally optimal. We first review the background materials on stochastic learning before elaborating the proposed algorithm.

A. Preliminary on Stochastic Approximation and Learning

In this section, we review some preliminary results on stochastic approximation and reinforcement learning. The stochastic approximation algorithms considered in this paper can be characterized by the following d-dimensional recursion:

    X(n+1) = X(n) + \epsilon(n)\big[h(X(n)) + Z(n)\big],   (22)

where X(n) = [X_1(n), X_2(n), ..., X_d(n)]^T is a d-dimensional vector and {ε(n)} is a sequence of positive step sizes. Suppose the following conditions are satisfied:



• The map h is Lipschitz: ||h(x) − h(y)|| ≤ L ||x − y|| for some 0 < L < ∞.

• The step sizes satisfy \sum_n \epsilon(n) = \infty and \sum_n \epsilon(n)^2 < \infty.

• {Z(n)} is a martingale difference sequence with respect to the increasing family of σ-fields F_n = σ(X_m, Z_m, m ≤ n). Furthermore, {Z_n} are square-integrable with

    \mathbb{E}\big[\|Z_{n+1}\|^2 \,\big|\, \mathcal{F}_n\big] \le C(1 + \|X_n\|^2) \quad \text{a.s.}, \quad n \ge 0,

for some constant C > 0.

• The iterates remain bounded almost surely, i.e., \sup_n \|X_n\| < \infty a.s.

Then we have the following theorem on the convergence property (Theorem 2, [40]):

Theorem 1: Almost surely, the sequence X(n) generated by (22) converges to a (possibly sample path dependent) compact connected internally chain transitive invariant set of the ordinary differential equation (ODE)

    \dot{X}(t) = h(X(t)).
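A toy illustration of the recursion (22) and Theorem 1 (everything here, including the choice of h and step sizes, is a constructed example, not from the paper): with h(x) = 1 − x, the iterates converge to the equilibrium of the ODE dX/dt = h(X).

```python
import numpy as np

rng = np.random.default_rng(0)
h = lambda x: 1.0 - x            # the ODE dX/dt = h(X) has equilibrium x* = 1
x = 0.0
for n in range(1, 20000):
    eps = 1.0 / n                # sum eps(n) = inf, sum eps(n)^2 < inf
    z = rng.normal()             # martingale-difference noise Z(n)
    x += eps * (h(x) + z)        # X(n+1) = X(n) + eps(n) [h(X(n)) + Z(n)]
print(x)                         # close to 1, as Theorem 1 predicts
```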


The stochastic approximation approach can also be used to solve the Bellman equation associated with an infinite horizon MDP [23]. Specifically, using stochastic approximation, one can iteratively estimate the potential function based on real-time observations of the system state and rewards. This technique is known as reinforcement learning [40] or Q-learning in the literature and is very powerful because it can solve the Bellman equation iteratively even when the transition kernel of the MDP is not known explicitly. Using the theory of stochastic approximation, we can also establish sufficient conditions for convergence of the iterative learning. For example, using the notation defined in Section III-A, we define the Q-factor of an infinite horizon MDP problem as

    \theta + Q(s_i, u) = \sum_{s_j} \Pr(s_j | s_i, u)\Big[g(s_i, u, s_j) + V(s_j)\Big],   (23)

where g(s_i, u, s_j) denotes the system cost of the transition from state s_i to s_j under action u. Hence, the value iteration for the Q-factors is given by

    Q^{l+1}(s_i, u) = \sum_{s_j} \Pr(s_j | s_i, u)\Big[g(s_i, u, s_j) + \min_{u'} Q^l(s_j, u')\Big] - \min_{u'} Q^l(s_I, u'),

where the state s_I is the reference state. The online Q-learning algorithm is given by the following recursion:

    Q^{l+1}(s_i, u) = Q^l(s_i, u) + \epsilon^l\Big(g(s_i, u, s_j) + \min_{u'} Q^l(s_j, u') - \min_{u'} Q^l(s_I, u') - Q^l(s_i, u)\Big),   (24)

where s_j and g(s_i, u, s_j) are real-time observations obtained online. Using the theory of stochastic approximation in Theorem 1, the Q-learning algorithm in (24) converges almost surely to a fixed point Q^∞(s_i, u) ∀ s_i, u, where the converged fixed point satisfies the Bellman equation

    \theta + Q^\infty(s_i, u) = \sum_{s_j} \Pr(s_j | s_i, u)\Big[g(s_i, u, s_j) + \min_{u'} Q^\infty(s_j, u')\Big] \quad \forall s_i,

and V(s_i) = \min_u Q^\infty(s_i, u) ∀ s_i. The convergence property of the online learning algorithm (24) is summarized below:




• Note that g(s_i, u, s_j) + \min_{u'} Q^l(s_j, u') is an unbiased estimate of \sum_{s_j} \Pr(s_j | s_i, u)\big[g(s_i, u, s_j) + \min_{u'} Q^l(s_j, u')\big]; that is,

    g(s_i, u, s_j) + \min_{u'} Q^l(s_j, u') = \sum_{s_j} \Pr(s_j | s_i, u)\Big[g(s_i, u, s_j) + \min_{u'} Q^l(s_j, u')\Big] + \delta M_l,

where {δM_l} is a martingale difference sequence.

• For sufficiently large l, equation (24) can be rewritten as

    Q^{l+k}(s_i, u) = Q^l(s_i, u) + \sum_{m=l}^{l+k-1} \epsilon^m \Big\{\sum_{s_j} \Pr(s_j | s_i, u)\Big[g(s_i, u, s_j) + \min_{u'} Q^m(s_j, u')\Big] - \min_{u'} Q^m(s_I, u') - Q^m(s_i, u)\Big\},

where the estimation noise is averaged out almost surely due to the martingale property.

• Define the functional mapping T as

    T^u_{s_i}(Q) = \sum_{s_j} \Pr(s_j | s_i, u)\Big[g(s_i, u, s_j) + \min_{u'} Q(s_j, u')\Big].

It can be proved that T is a contraction mapping [23]. Therefore, by Banach's fixed point theorem [40], the recursion (24) converges to a fixed point Q^∞ satisfying

    \theta + Q^\infty(s_i, u) = \sum_{s_j} \Pr(s_j | s_i, u)\Big[g(s_i, u, s_j) + \min_{u'} Q^\infty(s_j, u')\Big].

• Comparing with equation (23), we have the following relationship between the Q-factor and the potential function: V(s_i) = \min_u Q^\infty(s_i, u).
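A sketch of the online recursion (24); the environment callback `step` and the exploration scheme are assumptions of this illustration:

```python
import numpy as np

def q_learning(step, num_states, num_actions, num_iters=100000, s0=0, sI=0):
    """Online recursion (24):
    Q(s,u) += eps*(g + min_u' Q(s',u') - min_u' Q(sI,u') - Q(s,u)).
    step(s, u) -> (s_next, cost) is a single online observation of the MDP."""
    Q = np.zeros((num_states, num_actions))
    s = s0
    for n in range(1, num_iters + 1):
        u = np.random.randint(num_actions)   # exploratory action
        s_next, cost = step(s, u)
        eps = 1.0 / n
        Q[s, u] += eps * (cost + Q[s_next].min() - Q[sI].min() - Q[s, u])
        s = s_next
    V = Q.min(axis=1)                         # potential: V(s) = min_u Q(s,u)
    return Q, V
```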

B. Per-Node Potential and Distributive Learning Algorithm

In the semi-distributive control policies proposed in Section IV, the transmitters would have to exchange queue state information (QSI) in each frame, since the control policies are functions of the global QSI; this may lead to significant communication overhead, especially when the number of relays M is large. Therefore, in this section, we shall propose a distributive control algorithm in which the control policy of each transmitter depends only on the local CSI and local QSI, thereby reducing the communication overhead due to global QSI exchange. Specifically, we shall use a feature-based method to approximate the original system potential vector V by a linear form of per-node potentials {Ṽ_m | ∀m}, and update the per-node potentials as well as the LMs locally at each RS according to the local CSI and local QSI. The linear approximation architecture [41] for the original system potential function V is given by

    V(\mathbf{Q}) \approx \sum_{m=S}^{M} \sum_{q=1}^{N_Q} \widetilde{V}_m(q)\, I[Q_m = q] = \mathbf{W}^T \mathbf{F}(\mathbf{Q}),   (25)

where the parameter vector W and the feature vector F(Q) are given by

    \mathbf{W} = \big[\widetilde{V}_S(1), ..., \widetilde{V}_S(N_Q), \widetilde{V}_1(1), ..., \widetilde{V}_1(N_Q), ..., \widetilde{V}_M(1), ..., \widetilde{V}_M(N_Q)\big]^T,

    \mathbf{F}(\mathbf{Q}) = \big[I[Q_S = 1], ..., I[Q_S = N_Q], ..., I[Q_M = 1], ..., I[Q_M = N_Q]\big]^T,

and we define Ṽ_m = [Ṽ_m(1), ..., Ṽ_m(N_Q)]^T (m = S, 1, 2, ..., M) as the per-node potential vectors of the source node and the RSs, respectively, which can be obtained via distributive per-node stochastic learning (to be elaborated below). For the above linear architecture of approximate MDP, the update of the parameter vector is done only when the system is in certain reference states. In this paper, we define the reference states as

    \mathcal{S}_I = \{i_{m,q} \,|\, \forall m = S, 1, ..., M;\ q = 1, 2, ..., N_Q\},

where i_{m,q} denotes the state with Q_m = q and Q_i = 0 (i ≠ m). For notational convenience, let κ̃_m = {\mathcal{H}_m, Q_m, Q_S} be the observed local state of the m-th RS node, and let the per-stage rewards of the source node and the M RSs be, respectively,

    \widetilde{g}_S(\gamma_S, \widetilde{\kappa}_m) = Q_S + \gamma_{S,p} \sum_{m=1}^{M} \eta_{S,m} \sum_{k=1}^{N_F} p_{S,m,k} + \gamma_{S,d}\, I[Q_S = N_Q],

    \widetilde{g}_m(\gamma_m, \widetilde{\kappa}_m) = Q_m + \gamma_{m,p}\, \eta_{m,D} \sum_{k=1}^{N_F} p_{m,D,k}, \quad m = 1, 2, ..., M.
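Evaluating the linear architecture (25) is a simple per-node table lookup; a minimal sketch (the data layout is our assumption):

```python
def approx_potential(Q, V_tilde):
    """V(Q) ≈ sum_m V_tilde[m][Q_m], per (25).
    Q: dict mapping node id ('S', 1, ..., M) to its queue length;
    V_tilde: dict mapping node id to its per-node potential vector
             indexed 1..N_Q (index 0 unused)."""
    total = 0.0
    for node, q in Q.items():
        if q > 0:            # the feature I[Q_m = q] exists only for q >= 1,
            total += V_tilde[node][q]   # so Q_m = 0 contributes nothing
    return total
```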

The system procedure for distributive online learning is given below:

1. Initialization: Each RS initializes the LM and the per-node potential vector for itself, denoted γ^0_{m,p} and Ṽ^0_m, as well as the LMs and the per-node potential vector for the source node,


denoted (γ^0_{S,d}, γ^0_{S,p}) and Ṽ^0_S, where the initialization of (γ^0_{S,d}, γ^0_{S,p}) and Ṽ^0_S should be the same among all the RSs.

2. Calculation of Control Actions: At the beginning of the l-th frame, the source node broadcasts its QSI Q^l_S to all the RSs. Each RS determines its actions, including the S-R and R-D power allocation ({p_{S,m,k}}, {p_{m,D,k}}), the S-R and R-D rate allocation (R_{S,m}, R_{m,D}), and the S-R and R-D bids (B_{S,m}, B_{m,D}) for this frame according to Lemma 4 (or Lemma 5), approximating the original system potential vector in (11) by (25). The unichain control policy of each RS in Lemma 4 (or Lemma 5) is now feasible, as it becomes a function of the locally observed system state κ̃_m(t) = {\mathcal{H}_m, Q_m, Q_S} only. Specifically, the S-R and R-D rate allocation actions at the m-th RS are given by

    R^*_{S,m} = \arg\min_{R_{S,m}} \bigg[\gamma_{S,p} \sum_{k} \Big(\frac{2^{R_{S,m}/n_S}}{\chi\big(\prod_{i=1}^{n_S}|H_{S,m,k_i}|^2\big)^{1/n_S}} - \frac{1}{\chi|H_{S,m,k}|^2}\Big)^+ + \sum_{n} f_I(n)\,\Delta\widetilde{V}\big(\mathbf{Q}^i_S(m, R_{S,m}, n)\big)\bigg],

    R^*_{m,D} = \arg\min_{R_{m,D}} \bigg[\gamma_{m,p} \sum_{k} \Big(\frac{2^{R_{m,D}/n_m}}{\chi\big(\prod_{i=1}^{n_m}|H_{m,D,k_i}|^2\big)^{1/n_m}} - \frac{1}{\chi|H_{m,D,k}|^2}\Big)^+ + \sum_{n} f_I(n)\,\Delta\widetilde{V}\big(\mathbf{Q}^i_S(R_{m,D}, n)\big)\bigg],

where

    \Delta\widetilde{V}\big(\mathbf{Q}^i_S(m, R_{S,m}, n)\big) = \widetilde{V}_S(Q^i_S + n - R_{S,m}) - \widetilde{V}_S(Q^i_S) + \widetilde{V}_m(Q^i_m + R_{S,m}) - \widetilde{V}_m(Q^i_m),

    \Delta\widetilde{V}\big(\mathbf{Q}^i_S(R_{m,D}, n)\big) = \widetilde{V}_S(Q^i_S + n) - \widetilde{V}_S(Q^i_S) + \widetilde{V}_m(Q^i_m - R_{m,D}) - \widetilde{V}_m(Q^i_m),

and the S-R and R-D power allocation actions at the m-th RS are given by

    p^*_{S,m,k} = \Big(\frac{2^{R^*_{S,m}/n_S}}{\chi\big(\prod_{i=1}^{n_S}|H_{S,m,k_i}|^2\big)^{1/n_S}} - \frac{1}{\chi|H_{S,m,k}|^2}\Big)^+ \quad \forall k,

    p^*_{m,D,k} = \Big(\frac{2^{R^*_{m,D}/n_m}}{\chi\big(\prod_{i=1}^{n_m}|H_{m,D,k_i}|^2\big)^{1/n_m}} - \frac{1}{\chi|H_{m,D,k}|^2}\Big)^+ \quad \forall k.

3. Contention-Based Link Selection: All the RSs take turns to submit their smallest bid min{B^*_{S,m}, B^*_{m,D}}, as well as one bit indicating whether their buffer is empty, in the contention


slot of the frame, where

    B^*_{S,m} = \gamma_{S,p} \sum_{k} p^*_{S,m,k} + \sum_{n} f_I(n)\,\Delta\widetilde{V}\big(\mathbf{Q}^i_S(m, R^*_{S,m}, n)\big),

    B^*_{m,D} = \gamma_{m,p} \sum_{k} p^*_{m,D,k} + \sum_{n} f_I(n)\,\Delta\widetilde{V}\big(\mathbf{Q}^i_m(R^*_{m,D}, n)\big).

The link with the smallest submitted bid among all the RSs is used for packet transmission in the frame. If an S-R link is selected, the selected RS (say the m-th RS) notifies the source node of the calculated source node transmission power {p_{S,m,k} | ∀k} and rate R_{S,m}, and the source node transmits the packet with the given parameters; otherwise, if an R-D link is selected, the selected RS transmits the packet to the destination with power {p_{m,D,k} | ∀k} and rate R_{m,D}.

4. Local Per-Node Potential Update: According to the one-bit indicators submitted by all the RSs in the contention slot, each RS can determine whether the current system state falls into the set of reference states S_I. If the current system state is a reference state, each RS updates the LMs and per-node potential vectors according to Algorithm 2 at the end of the transmission slot. Otherwise, let Ṽ^{l+1}_m = Ṽ^l_m and γ^{l+1} = γ^l. Finally, let l = l + 1 and go to step 2.

Fig. 4 illustrates the above procedure with a flowchart. The algorithm for the LM and per-node potential vector update is given below:

Algorithm 2 (Online Learning Algorithm for Per-Node Potential and LMs): The update of the per-


node potential and LMs in the mth (m = 1, 2, ..., M ) RS is given by:  · ¸   l l l l l l l  Ql = im,q Vem (q) + ²v γS,d I[QS = NQ ] + Qm + B −Vem (Qm ) − T0 (VeS )   | {z }  New observations Veml+1 (q) = (26) in the l-th frame      Veml (q) Ql = 6 im,q µ l γm,p

NF X

²lγ (

¶+ −P m )

l+1 γm,p

=

l+1 γS,p

|k=1 {z } New observations in the l-th frame µ ¶+ NF X l l = γS,p + ²γ ( pS,m,k −P S )

+

pm,D,k

m = 1, 2, ..., S

(27)

(28)

k=1

| {z } New observations in the l-th frame

µ l+1 γS,d

=

l γS,d

+

²lγ (

¶+ −D)

I[QlS

=N ] {z Q} | New observations in the l-th frame

(29) µ

l

where m = S, 1, 2, ..., M , B = max{BS,m , Bm,D |∀m},

¶ {²lv },

{²lγ }

are the sequences of step

size which satisfy ∞ X l=0 ∞ X

²vl = ∞, ²vl > 0, ²γl = ∞, ²γl > 0,

l=0

∞ · X

¸

(²vl )2

+

(²γl )2

< ∞,

l=0

Moreover, T0 (VeSl ) =

X

lim ²vl = 0

l→+∞

lim ²γl = 0

l→+∞

²γl = 0. l→+∞ ²v l lim

fI (n)VeSl (n).

n
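To make the mechanics of Algorithm 2 concrete, the following sketch performs one pass of (26)-(29), together with example step-size sequences satisfying the conditions above (e.g. $\epsilon_v^l = 1/(l+1)^{0.6}$ and $\epsilon_\gamma^l = 1/(l+1)$, so that $\epsilon_\gamma^l/\epsilon_v^l \to 0$). The data-structure choices and all names are ours, not the paper's.

    def step_sizes(l):
        """Example step sizes: both sum to infinity, are square-summable,
        and eps_gamma/eps_v -> 0 (two-timescale condition)."""
        return 1.0 / (l + 1) ** 0.6, 1.0 / (l + 1)      # (eps_v, eps_gamma)

    def algorithm2_update(l, V_m, q_ref, obs, gammas, Pbar_m, Pbar_S, Dbar, NQ):
        """One pass of (26)-(29) at RS m when the current state hits the
        reference state i_{m, q_ref}. `obs` carries this frame's observations:
        Q_S, Q_m, the bid B, the per-link sum powers, and T0 = sum_n f_I(n) V_S(n)."""
        eps_v, eps_g = step_sizes(l)
        g_sp, g_mp, g_sd = gammas
        # (26): per-node potential update at the visited reference entry only
        target = g_sd * float(obs['Q_S'] == NQ) + obs['Q_m'] + obs['B'] \
                 - V_m[obs['Q_m']] - obs['T0']
        V_m[q_ref] += eps_v * target
        # (27)-(28): power LM updates, projected onto the nonnegative orthant
        g_mp = max(g_mp + eps_g * (obs['sum_p_mD'] - Pbar_m), 0.0)
        g_sp = max(g_sp + eps_g * (obs['sum_p_Sm'] - Pbar_S), 0.0)
        # (29): packet drop rate LM update
        g_sd = max(g_sd + eps_g * (float(obs['Q_S'] == NQ) - Dbar), 0.0)
        return V_m, (g_sp, g_mp, g_sd)

Note that the projection $(\cdot)^+$ is realized by the max with 0, and the potential is only written at the reference entry, exactly as the case structure of (26) prescribes.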

Remark 4 (Comparison with Conventional Reinforcement Learning): Compared with the traditional learning approaches for MDP problems mentioned above, there are two key novelties in the online update algorithm proposed in this paper. Firstly, most of the existing literature on online learning addresses unconstrained MDPs only; in the case of a CMDP, the LMs are determined offline by simulation. In our case, both the LMs and the per-node potential functions are updated simultaneously according to (26), (27) and (29). Secondly, conventional online learning is designed for centralized solutions, in which the control actions are determined entirely from the potential update. The dynamic update equations can then be characterized by a contraction mapping (the T-mapping in [23]) and the convergence argument follows directly from Banach's fixed point theorem [40]. In our case, however, the control action is determined not from the potential update alone but also via per-frame bidding for spectrum access. During the iterative updates, both the per-node potentials/LMs and the control actions change dynamically. As a result, the local per-node potential update equations in (26) are no longer a contraction mapping, and the existing convergence analysis techniques (based on contraction mapping and fixed-point arguments) cannot be applied directly to our distributive stochastic learning algorithm.

Remark 5 (Comparison with Deterministic NUM): In conventional iterative solutions for deterministic NUM, the iterative updates (with signaling message exchange) are performed within the CSI coherence time, which limits the number of iterations and hence the performance. In the proposed online algorithm, by contrast, the iteration updates evolve on the same time scale as the CSI and QSI. Hence, the algorithm can converge to a better solution because the number of iterations is no longer limited by the coherence time of the CSI.

C. Convergence Analysis

In this section, we establish technical conditions for the almost-sure convergence of the online distributive learning algorithm. Since we have two different step-size sequences $\{\epsilon_v^l\}$, $\{\epsilon_\gamma^l\}$ with $\epsilon_\gamma^l = o(\epsilon_v^l)$, the LM updates and the per-node potential updates are done simultaneously but over two different time scales. During the per-node potential update (timescale I), we have $\gamma^{l+1} - \gamma^l = O(\epsilon_\gamma^l) = o(\epsilon_v^l)$, so the LMs appear to be quasi-static [40] during the per-node potential update in (26). We have the following relationship between the original system potential vector $V$ and the parameter vector $W$:
$$V = MW \quad \text{and} \quad W = M^\dagger V, \tag{30}$$
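To make the mapping in (30) concrete, here is a toy construction of $M$ and $M^\dagger$ for a small system. Purely for illustration we take the reference state $i_{m,q}$ to pin all queues other than the $m$-th at 0; the actual reference-state convention is the one defined earlier in the paper, so this block is a sketch under that assumption only.

    import itertools
    import numpy as np

    def build_mapping(M_relays, NQ):
        """Toy construction of M and M^dagger in (30). Row i of M is the feature
        vector F(Q^i): indicators I[Q_m = q] for q = 1..NQ and m = S,1,...,M.
        M^dagger has a single 1 per row, picking out the reference states."""
        nodes = M_relays + 1                                # source + M relays
        states = list(itertools.product(range(NQ + 1), repeat=nodes))
        M = np.zeros((len(states), nodes * NQ))
        for i, Q in enumerate(states):
            for m, qm in enumerate(Q):
                if qm >= 1:
                    M[i, m * NQ + (qm - 1)] = 1.0
        Mdag = np.zeros((nodes * NQ, len(states)))
        for m in range(nodes):
            for q in range(1, NQ + 1):
                ref = tuple(q if j == m else 0 for j in range(nodes))  # assumed i_{m,q}
                Mdag[m * NQ + (q - 1), states.index(ref)] = 1.0
        return M, Mdag

With this construction, $W = M^\dagger V$ simply reads off the potential at the reference states, and $V = MW$ reassembles the approximate system potential as a sum of per-node terms.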


where $M \in \mathbb R^{N_S\times(M+1)N_Q}$ ($N_S$ is the total number of system states) with the $i$-th row equal to $F(Q^i)$, and $M^\dagger \in \mathbb R^{(M+1)N_Q\times N_S}$ has exactly one element equal to 1 in each row, where the positions of the 1s correspond to the positions of the reference system states $\{i_{m,q}|\forall m, q\}$ in the original system potential vector $V$. Hence, we first have the following lemma on the convergence of the per-node potential learning.

Lemma 6 (Convergence of Per-Node Potential Learning over Timescale I): Define the extended mapping and inverse mapping matrices $\widetilde M \in \mathbb R^{N_S\times(M+1)(N_Q+1)}$, $\widetilde M^\dagger \in \mathbb R^{(M+1)(N_Q+1)\times N_S}$ as follows:

• Each row of $\widetilde M$ corresponds to one global queue state, denoted as $Q^i$, and is given by
$$\big[I[Q_S^i = 0], I[Q_S^i = 1], ..., I[Q_S^i = N_Q], ..., I[Q_M^i = 0], I[Q_M^i = 1], ..., I[Q_M^i = N_Q]\big];$$
• $\widetilde M^\dagger$ is constructed by inserting the row vector $[1, \underbrace{0, ..., 0}_{N_S - 1\ \text{zeros}}]$ into $M^\dagger$ on every $N_Q$ rows (including the first row).

Denote
$$A^{l-1} = (1 - \epsilon_v^{l-1})I + \widetilde M^\dagger P(\Omega^l)\widetilde M\,\epsilon_v^{l-1} \quad \text{and} \quad B^{l-1} = (1 - \epsilon_v^{l-1})I + \widetilde M^\dagger P(\Omega^{l-1})\widetilde M\,\epsilon_v^{l-1},$$
where $\Omega^l$ is the unichain system control policy at the $l$-th frame, $P(\Omega^l)$ is the transition matrix of the system states under the unichain system control policy $\Omega^l$, and $I$ is the identity matrix. If, for the entire sequence of control policies $\{\Omega^l\}$, there exist a $\delta > 0$ and some positive integer $\beta$ such that
$$\big[A^{\beta-1}\cdots A^1\big]_{(a,\delta)} \ge \tau^l, \quad \big[B^{\beta-1}\cdots B^1\big]_{(a,\delta)} \ge \tau^l \quad \forall a, \tag{31}$$
where $[\cdot]_{(a,b)}$ denotes the element in the $a$-th row and $b$-th column and $\tau^l = O(\epsilon_v^l)$, the following statements are true:

• The update of the parameter vector (or per-node potential vector) converges almost surely for any given initial parameter vector $W^0$ and LM $\gamma$, i.e., $\lim_{l\to\infty} W^l(\gamma) = W^\infty(\gamma)$.
• The steady-state parameter vector $W^\infty$ satisfies
$$\theta e + W^\infty(\gamma) = M^\dagger T\big(\gamma, MW^\infty(\gamma)\big), \tag{32}$$
where $\theta$ is a constant and the mapping $T$ is defined as
$$T(\gamma, V) = \min_\Omega\big[g(\gamma, \Omega) + P(\Omega)V\big]. \tag{33}$$

Proof: Please refer to Appendix E.

Remark 6 (Interpretation of the Conditions in Lemma 6): Note that $A^l$ and $B^l$ are related to the transition probabilities of the reference states. Condition (31) simply means that there is one reference state accessible from all the other reference states after some finite number of transition steps. This is a very mild condition and is satisfied in most practical cases. Note that (33) is equivalent to the following Bellman's equation on the reference states $\mathcal S_I$:
$$\theta + V(i_{m,q}) = g\big(\gamma, i_{m,q}, u(i_{m,q})\big) + \sum_{Q^j}\Pr\big[Q^j|i_{m,q}, u(i_{m,q})\big]V(Q^j), \quad \forall i_{m,q}\in\mathcal S_I.$$

On the other hand, since the ratio of step sizes satisfies $\epsilon_\gamma^l/\epsilon_v^l \to 0$, during the LM update (timescale II) the per-node potential functions are updated much faster than the Lagrange multipliers. Hence, in timescale I the Lagrange multipliers can be treated as quasi-static during the update of the local per-node potential functions, while each update of the Lagrange multipliers in timescale II triggers another update process of the local per-node potential functions in timescale I. By Corollary 2.1 of [42], we have $\lim_{l\to\infty}\|\widetilde{\mathbf V}_m^l - \widetilde{\mathbf V}_m^\infty(\gamma^l)\| = 0$ w.p.1. Hence, during the LM updates in (27) and (29), the per-node potential update in (26) can be seen as almost equilibrated. Define $G(\gamma) = E_{\Omega(Q)}\big[g(\gamma, Q)\big]$; we have the following convergence result for the LMs.

Lemma 7 (Convergence of the LMs over Timescale II): The iteration on the vector of LMs $\gamma = [\gamma_S, \gamma_1, ..., \gamma_M]^T$ ($m = S, 1, 2, ..., M$) converges almost surely to the set of maxima of $G$, i.e., $\gamma^* \in \arg\max G(\gamma)$, and $\gamma^* = [\gamma_S^*, \gamma_1^*, ..., \gamma_M^*]^T$ satisfies the power and packet drop rate constraints in (4), (5) and (6).

Proof: Please refer to Appendix F.

Based on the above lemmas, we summarize the convergence of the online per-node potential and LM learning algorithm in the following theorem.

Theorem 2 (Convergence of Online Per-Node Learning Algorithm): Under the same conditions as in Lemma 6, we have $(\gamma_m^l, \widetilde{\mathbf V}_m^l) \to \big(\gamma_m^*, \widetilde{\mathbf V}_m^\infty(\gamma^*)\big)$ w.p.1 for all $m = S, 1, 2, ..., M$, where $\big(\gamma_m^*, \widetilde{\mathbf V}_m^\infty(\gamma^*)\big)$ satisfies
$$\theta_m e + \widetilde{\mathbf V}_m^\infty(\gamma^*) = \widetilde T_m\big(\gamma_m^*, \widetilde{\mathbf V}_m^\infty(\gamma^*)\big)$$
and the average power constraints (4), (5) as well as the average packet drop rate constraint (6), where $e$ is an $(N_Q + 1)\times 1$ vector with all elements equal to 1.

D. Asymptotic Performance of the Distributive Algorithm

Finally, we show that the performance of the distributive algorithm is asymptotically globally optimal for large average arrival bit rate at the source node and large buffer size.

Theorem 3 (Asymptotic Global Optimality at High Traffic Loading): For sufficiently large $N_Q$ and high traffic loading$^4$, the performance of the online distributive per-node primal-dual potential learning algorithm is asymptotically globally optimal. In other words, at high traffic loading, the proposed distributive online learning algorithm achieves the same performance as the centralized solution obtained by brute-force solving of the Bellman equation in (11).

Proof: Please refer to Appendix G.

VI. SIMULATION AND DISCUSSION

In this section, we compare our proposed online per-node potential learning algorithm via stochastic approximation and approximate MDP (Section V) with the delay-optimal control with global QSI (Section IV) and two other reference baselines. Baseline 1 refers to CSIT-only scheduling, in which the link selection (bidding) and power allocation are adaptive to the CSIT only so as to optimize the end-to-end capacity (assuming infinite backlog at the buffers). Baseline 2 refers to the throughput-optimal policy (in the stability sense), namely the dynamic backpressure algorithm [43]. In the simulations, we consider Poisson packet arrivals at the source node with deterministic packet size. The path loss of all S-R links and R-D links is 10 dB. There are 10 independent subbands ($N_F = 10$) with total bandwidth 10 MHz. The scheduling slot duration is 5 ms.

Figure 5 illustrates the average end-to-end delay versus the average transmit SNR with M = 2 relays, buffer size $N_Q = 6$ packets, packet size L = 40K bits and average arrival rate λ = 200 pck/s. It can be observed that both the optimal MDP solution$^5$ and the distributive MDP solution$^6$ have significant performance gains over the two baselines. Moreover, in Figure 6, we compare the optimal control policy (rate, power, bid) of Lemma 4 with the closed-form (asymptotically optimal) control policy of Lemma 5 for M = 2 relays, buffer size $N_Q = 22$ packets, packet size L = 5K bits and average arrival rate λ = 1200 pck/s. It is shown that for a moderate buffer size ($N_Q = 22$), the performance of the closed-form expressions in Lemma 5 is very close to that of the optimal control policy, which is based on exhaustive search. Figure 7 illustrates the average end-to-end delay versus the number of relays with transmit power $P_0 = 8.5$ dB, buffer size $N_Q = 10$ packets, packet size L = 40K bits and average arrival rate λ = 200 pck/s. The distributive solution has a significant delay gain over the baselines. Figure 8 further illustrates the cumulative distribution function (cdf) of the queue length for M = 7, transmit power $P_0 = 8.5$ dB, buffer size $N_Q = 10$ packets, packet size L = 40K bits and average arrival rate λ = 200 pck/s. It can be seen that the distributive solution achieves a smaller queue length than the other baselines. Figure 9 illustrates the convergence of the proposed distributive online learning algorithm with buffer size $N_Q = 6$ packets, packet size L = 40K bits and average arrival rate λ = 200 pck/s. We plot the average potential function of the first relay versus the scheduling slot index at a transmit SNR of 9 dB. It can be seen that the distributive algorithm converges quite fast. Comparing with Figure 5, we can observe that even the average delay corresponding to the potential function at the 500-th scheduling slot is already very close to that of the converged potential, and is better than the two baselines.

$^4$ High traffic loading means a large average arrival rate $\lambda_S$ at the source node while the system remains stable.
$^5$ The optimal MDP solution refers to the optimal control policy of Lemma 3, presented in Section IV.
$^6$ The distributive MDP solution refers to the distributive control policy derived in Section V via approximate MDP.

VII. SUMMARY

In this paper, we study the design of a distributive delay-optimal cross-layer scheduling algorithm for two-hop relay communication systems over frequency-selective fading channels. We use an infinite-horizon average-reward MDP to model the complicated interactions between the source node's queue and the relays' queues, whose state space involves the joint queue state (QSI) of the queue at the source node and the queues at the $M$ RSs as well as the joint channel state (CSI) of all S-R links and R-D links. To address the curse of dimensionality, we propose a reduced-state MDP formulation in which the control policy is a function of the local CSI but the global QSI. Moreover, using stochastic learning and approximate MDP, we derive a distributive online learning algorithm in which each node recursively estimates a per-node potential function based on real-time observations of the local CSI and local QSI only. We prove that this distributive learning algorithm converges almost surely to a global optimal solution for large bit arrival rate and buffer size. The simulation results show that the delay performance of the proposed scheme is significantly better than various baselines such as the conventional CSIT-only control and the throughput-optimal control.

APPENDIX A: PROOF OF LEMMA 1

Let $W$, $W_S$ and $W_R$ be the average times (in units of frames) that one information bit stays in the system, in the source node's queue and in some relay's queue, respectively; thereby,
$$W = W_S + W_R. \tag{34}$$
Let $\lambda_S$ be the bit arrival rate of the source node. The average number of bits received by the source node is given by $\lambda_S(1 - D)$, which is also the average number of information bits received by the relay cluster, as the source node and the relay cluster are in cascade. Let $N_S$ and $N_R$ be the average numbers of information bits in the source node's queue and in the relays' queues, respectively. By Little's Law,
$$N_S = (1 - D)\lambda_S W_S \quad \text{and} \quad N_R = (1 - D)\lambda_S W_R. \tag{35}$$
Combining (34) and (35), we have
$$W = \frac{N_S + N_R}{\lambda_S(1 - D)}.$$
Since the evolution of the system queue state forms a Markov chain, we have
$$W = E_{\pi_\kappa}\bigg[\frac{Q_S + \sum_{m=1}^M Q_m}{\lambda_S(1 - D)}\bigg],$$
where $\pi_\kappa$ is the steady-state distribution. For a sufficiently small packet drop rate requirement, $1 - D \approx 1$, and the end-to-end average delay is given by
$$W = E_{\pi_\kappa}\bigg[\frac{Q_S + \sum_{m=1}^M Q_m}{\lambda_S}\bigg].$$
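As a sanity check on Lemma 1, the following sketch simulates a single queue and compares the time-average $E[Q]/\lambda$ with the per-bit delay from Little's Law (drop rate $\approx 0$). The Poisson-arrival, fixed-service model here is illustrative only and is not the system model of the paper.

    import numpy as np

    def simulated_delay(lam, serve, T=200000, seed=0):
        """Simulate Q_{t+1} = Q_t + arrivals - departures and return the
        time-average E[Q]/lambda, which by Little's Law approximates the
        average sojourn time W (in frames) when nothing is dropped."""
        rng = np.random.default_rng(seed)
        Q, area = 0, 0.0
        for _ in range(T):
            Q += rng.poisson(lam)            # arrivals in this frame
            Q -= min(Q, serve(Q))            # departures in this frame
            area += Q
        return (area / T) / lam

    # e.g. simulated_delay(2.0, lambda Q: 3) estimates W for a stable queue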

APPENDIX B: PROOF OF LEMMA 3

The Bellman's equation for the MDP problem in (7) is given by
$$\theta + J(Q^i, H^i) = \min_{u(Q^i,H^i)}\bigg\{g\big(Q^i, H^i, u(Q^i, H^i)\big) + \sum_{Q^j,H^j}\Pr\big[Q^j, H^j | Q^i, H^i, u(Q^i, H^i)\big]J(Q^j, H^j)\bigg\}$$
$$= \min_{u(Q^i,H^i)}\bigg\{g\big(Q^i, H^i, u(Q^i, H^i)\big) + \sum_{Q^j,H^j}\Pr\big[Q^j | Q^i, H^i, u(Q^i, H^i)\big]\Pr\big[H^j\big]J(Q^j, H^j)\bigg\},$$
where $Q^i$ and $H^i$ denote one realization of the global QSI and CSI, and $u(Q^i, H^i)$ is the aggregation of system actions when the system state is $(Q^i, H^i)$. Denoting $V(Q^j) = E_H J(Q^j, H^j)$ and taking the expectation w.r.t. $H^i$ on the above equation, we have
$$\theta + V(Q^i) = \min_{u(Q^i)}\bigg\{E_H\Big[g\big(Q^i, H^i, u(Q^i, H^i)\big)\Big] + \sum_{Q^j}\Pr\big[Q^j | Q^i, u(Q^i)\big]V(Q^j)\bigg\},$$
which is the Bellman's equation of (10).

APPENDIX C: PROOF OF LEMMA 4

We first derive the S-R power allocation $\{p_{S,m,k}\}$ of the $m$-th RS given the data rate $R_{S,m}$. By the standard water-filling approach, we have
$$\sum_{k=1}^{N_F}\max\bigg\{0, \log_2\frac{\chi|H_{S,m,k}|^2}{\beta}\bigg\} = R_{S,m} \quad \text{and} \quad p_{S,m,k} = \bigg(\frac{1}{\beta} - \frac{1}{\chi|H_{S,m,k}|^2}\bigg)^+,$$
where $\beta$ is the Lagrange multiplier. We re-sort the channel gains of all subcarriers according to their norms, i.e., $|H_{S,m,k_1}|^2 \ge |H_{S,m,k_2}|^2 \ge ... \ge |H_{S,m,k_{N_F}}|^2$. Obviously, as $R_{S,m}$ increases, more and more subcarriers are given positive transmit power, and in the re-sorted sequence a subcarrier with a smaller index receives positive power before a subcarrier with a larger index. $R_{S,m}$ and the set of subcarriers with positive power satisfy the following relationship:
$$\sum_{i=1}^{n_S}\log_2\frac{|H_{S,m,k_i}|^2}{|H_{S,m,k_{n_S}}|^2} \le R_{S,m} < \sum_{i=1}^{n_S+1}\log_2\frac{|H_{S,m,k_i}|^2}{|H_{S,m,k_{n_S+1}}|^2}: \ \text{subcarriers } k_1\text{-}k_{n_S} \text{ have positive power;}$$
$$R_{S,m} \ge \sum_{i=1}^{N_F}\log_2\frac{|H_{S,m,k_i}|^2}{|H_{S,m,k_{N_F}}|^2}: \ \text{all subcarriers have positive power.}$$
Suppose $R_{S,m}$ is in the region where subcarriers $k_1$-$k_{n_S}$ are scheduled with positive power; then
$$\frac{1}{\beta} = \frac{2^{R_{S,m}/n_S}}{\chi\big(\prod_{i=1}^{n_S}|H_{S,m,k_i}|^2\big)^{1/n_S}}.$$
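The closed-form water level above can be checked numerically. The following sketch determines the active-set size $n_S$ from the sorted gains via the breakpoint relationship and verifies that the resulting per-subcarrier rates sum to $R_{S,m}$; the function name and test instance are ours.

    import numpy as np

    def verify_water_level(H_sorted, chi, R):
        """With 1/beta = 2^(R/n) / (chi * geomean of the n strongest gains),
        the scheduled rate sum_{k<=n} log2(chi*|H_k|^2 / beta) equals R."""
        NF = len(H_sorted)
        n = NF
        for m in range(1, NF):               # breakpoint test for the active set
            if R < sum(np.log2(H_sorted[i] / H_sorted[m]) for i in range(m + 1)):
                n = m
                break
        level = 2.0 ** (R / n) / (chi * np.prod(H_sorted[:n]) ** (1.0 / n))
        rates = [np.log2(chi * h * level) for h in H_sorted[:n]]
        return np.isclose(sum(rates), R)     # True up to floating-point tolerance

    # e.g. verify_water_level(np.array([2.0, 1.0, 0.3]), chi=1.0, R=1.5) -> True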

Therefore, (15) and (16) are straightforward. We then turn to the proof of the optimal rate allocation and bidding strategy. Noticing that $F_{S,m}(\cdot)$ and $F_{m,D}(\cdot)$ are functions of the local CSI at the $m$-th relay, we have
$$\min_{u(Q^i)} E_H\bigg\{\sum_m \eta_{S,m}F_{S,m}(R_{S,m}, \{p_{S,m,k}\}) + \sum_m \eta_{m,D}F_{m,D}(R_{m,D}, \{p_{m,D,k}\})\bigg\}$$
$$= \min_{\{B_{S,m}(Q^i,H_m)\},\{B_{m,D}(Q^i,H_m)\}} E_H\bigg\{\sum_m \eta_{S,m}\min_{R_{S,m},\{p_{S,m,k}\}}F_{S,m}(R_{S,m}, \{p_{S,m,k}\}) + \sum_m \eta_{m,D}\min_{R_{m,D},\{p_{m,D,k}\}}F_{m,D}(R_{m,D}, \{p_{m,D,k}\})\bigg\}.$$
Therefore, the policies (13), (14) are straightforward. Moreover, since the minimization with local CSI should not be smaller than the minimization with global CSI, we have
$$\min_{\{B_{S,m}(Q^i,H_m)\},\{B_{m,D}(Q^i,H_m)\}} E_H\bigg\{\sum_m \eta_{S,m}F_{S,m}(R_{S,m}, \{p_{S,m,k}\}) + \sum_m \eta_{m,D}F_{m,D}(R_{m,D}, \{p_{m,D,k}\})\bigg\}$$
$$\ge E_H\bigg\{\min_{\{B_{S,m}\},\{B_{m,D}\}}\bigg[\sum_m \eta_{S,m}F_{S,m}(R_{S,m}^*, \{p_{S,m,k}^*\}) + \sum_m \eta_{m,D}F_{m,D}(R_{m,D}^*, \{p_{m,D,k}^*\})\bigg]\bigg\}. \tag{36}$$
Noticing that the lower bound (36) is achieved by the bidding strategies (17), these bidding strategies are globally optimal.


APPENDIX D: PROOF OF LEMMA 5

When $\gamma_S$ and $N_Q/\gamma_S$ are sufficiently large, $Q_m/\gamma_S$ ($m = S, 1, 2, ..., M$) is sufficiently large with high probability. Although $V$ is discrete, we can interpolate the value of $V$ between the discrete values so that it is differentiable, following the same approach as in [44]. Moreover, it can be proved that the potential function $V(Q)$ is an increasing polynomial in $Q = [Q_S, Q_1, ..., Q_M]^T$. Hence, we have the following equations:
$$V\big(Q_S^i(m, R_{S,m}, n)\big) = V(Q^i) + (n - R_{S,m})V_S'(Q^i) + R_{S,m}V_m'(Q^i) \tag{37}$$
$$V\big(Q_S^i(R_{m,D}, n)\big) = V(Q^i) - R_{m,D}V_m'(Q^i) \tag{38}$$
where $V_m' = \frac{\partial V}{\partial Q_m}$ ($m = S, 1, ..., M$). Moreover, when $Q_m$ ($m = S, 1, 2, ..., M$) is sufficiently large, we have
$$\lim_{Q_m\to\infty}\frac{V([Q_S^i, ..., Q_m^i + 1, ..., Q_M^i]^T) - V([Q_S^i, ..., Q_m^i - 1, ..., Q_M^i]^T)}{2} = V_m', \quad m = S, 1, 2, ..., M.$$
On the other hand, when $\gamma_{m,p}$ is sufficiently small, the transmitter schedules positive power on all subcarriers, and therefore $n_S, n_m \to N_F$ in (13) and (14). Substituting (37), (38) into (13), (14) and taking derivatives w.r.t. $R_{S,m}$ and $R_{m,D}$, we obtain the following closed-form expressions for the optimal rate allocation:
$$R_{S,m}^* = \log_2\bigg(\chi^{N_F}\prod_{k=1}^{N_F}|H_{S,m,k}|^2\bigg) + N_F\log_2\bigg(\frac{V_S'(Q^i) - V_m'(Q^i)}{\gamma_{S,p}\ln 2}\bigg)$$
$$R_{m,D}^* = \log_2\bigg(\chi^{N_F}\prod_{k=1}^{N_F}|H_{m,D,k}|^2\bigg) + N_F\log_2\bigg(\frac{V_m'(Q^i)}{\gamma_{m,p}\ln 2}\bigg).$$

Furthermore, (20), (21) are straightforward from the above conclusions.

APPENDIX E: PROOF OF LEMMA 6

Since we consider Rayleigh fading channels, where the channel gain varies from 0 to infinity, it is easy to see that each reference state is updated comparably often$^7$ in the asynchronous learning algorithm. Quoting the conclusion in [45], the convergence properties of the asynchronous update and the synchronous update are the same. Therefore, for simplicity, we consider the convergence of the related synchronous version in this proof.

Let $c \in \mathbb R$ be a constant; then $T_0(c\widetilde V_S^l) = cT_0(\widetilde V_S^l)$. Similar to [46], it is easy to see that the local per-node potentials $\{\widetilde{\mathbf V}_m\}$ are bounded almost surely during the iterations of the algorithm. From the construction of the parameter vector $W$, the update on the local per-node potentials $\widetilde{\mathbf V}_m$ is clearly equivalent to the update on the parameter vector $W$, and proving the convergence in Lemma 6 is equivalent to proving the convergence of the update on the parameter vector $W$. In the following, we first introduce and prove a lemma on the convergence of the learning noise.

Lemma 8: Define
$$q^l = M^\dagger\Big[g(\Omega^l) + P(\Omega^l)MW^l - MW^l - T_0(MW^l)e\Big];$$
when the number of iterations $l \ge j \to \infty$, the update procedure can be written as follows with probability 1:
$$W^{l+1} = W^j + \sum_{i=j}^l \epsilon_v^i q^i.$$

Proof: The update of the local per-node potentials can be written in the following vector form:
$$W^{l+1} = W^l + \epsilon_v^l M^\dagger\Big[g(H^l) + J^l MW^l - MW^l - T_0(MW^l)e\Big],$$
where the matrix $J^l$, with exactly one element equal to 1 in each row, denotes the state transition from the $l$-th frame to the $(l+1)$-th frame in the real-time observation. Define
$$Y^l = M^\dagger\Big[g(H^l) + J^l MW^l - MW^l - T_0(MW^l)e\Big],$$
$\delta Z^l = q^l - Y^l$ and $Z^l = \sum_{i=j}^l \epsilon_v^i\delta Z^i$. The online potential estimation (26) can then be rewritten as
$$W^{l+1} = W^l + \epsilon_v^l Y^l = W^l + \epsilon_v^l q^l - \epsilon_v^l\delta Z^l = W^j + \sum_{i=j}^l \epsilon_v^i q^i - Z^l. \tag{39}$$
The proof of Lemma 8 can be divided into the following steps:

1. Letting $\mathcal F^l = \sigma(W^m, m \le l)$, it is easy to see that $E[\delta Z^l|\mathcal F^{l-1}] = 0$. Thus $\{\delta Z^l|\forall l\}$ is a martingale difference sequence and $\{Z^l|\forall l\}$ is a martingale. Moreover, $Y^l$ is an unbiased estimate of $q^l$ and the estimation noise is uncorrelated.

2. From the uncorrelated estimation errors in Step 1, we have
$$E\Big[|Z^l|^2\Big|\mathcal F^{j-1}\Big] = E\bigg[\Big|\sum_{i=j}^l \epsilon_v^i\delta Z^i\Big|^2\bigg|\mathcal F^{j-1}\bigg] = \sum_{i=j}^l E\Big[|\epsilon_v^i\delta Z^i|^2\Big|\mathcal F^{j-1}\Big] \le \widetilde Z\sum_{i=j}^l(\epsilon_v^i)^2 \to 0 \quad \text{as } j\to\infty,$$
where $\widetilde Z = \max_{j\le i\le l}E[|\delta Z^i|^2|\mathcal F^{j-1}]$ is a bounded constant and the convergence of $\sum_{i=j}^l(\epsilon_v^i)^2$ follows from the definition of the sequence $\{\epsilon_v^i\}$.

3. From Step 1, $\{Z^l|\forall l\}$ is a martingale. Hence, by the martingale inequality,
$$\Pr\bigg[\sup_{j\le i}|Z^i| \ge \lambda\bigg|\mathcal F^{j-1}\bigg] \le \frac{E\big[|Z^l|^2\big|\mathcal F^{j-1}\big]}{\lambda^2} \quad \forall\lambda.$$
From the conclusion of Step 2, we have $\lim_{j\to\infty}\Pr\big[\sup_{j\le i}|Z^i| \ge \lambda\big|\mathcal F^{j-1}\big] = 0$ for all $\lambda$. Hence, from (39), we almost surely have $W^{l+1} = W^j + \sum_{i=j}^l \epsilon_v^i q^i$ when $j\to\infty$.

Moreover, the following lemma concerns the limit of the sequence $\{q^l\}$.

Lemma 9: Suppose the following two inequalities hold for $l = a, a+1, ..., a+b$:
$$g(\Omega^l) + P(\Omega^l)MW^l \le g(\Omega^{l-1}) + P(\Omega^{l-1})MW^l \tag{40}$$
$$g(\Omega^{l-1}) + P(\Omega^{l-1})MW^{l-1} \le g(\Omega^l) + P(\Omega^l)MW^{l-1}; \tag{41}$$
then we have
$$|q_i^{a+b}| \le C_1\prod_{i=0}^{\lfloor b/\beta\rfloor - 1}(1 - \tau^{a+i\beta}) \quad \forall i, \tag{42}$$
where $q_i^{a+b}$ denotes the $i$-th element of the vector $q^{a+b}$ and $C_1$ is some constant.

Proof: We first define the extended parameter vector
$$\widetilde W = [\widetilde V_S(0), \widetilde V_S(1), ..., \widetilde V_S(N_Q), \widetilde V_1(0), \widetilde V_1(1), ..., \widetilde V_1(N_Q), ..., \widetilde V_M(0), \widetilde V_M(1), ..., \widetilde V_M(N_Q)]^T,$$
where the initial values of $\{\widetilde V_m(0)|m = S, 1, 2, ..., M\}$ are all 0 and the initial values of $\{\widetilde V_m(i)|m = S, 1, 2, ..., M; i = 1, 2, ..., N_Q\}$ are the same as those in $W$. Thus $\widetilde W$ has $M + 1$ more elements than $W$. It is easy to see that the iteration on the extended parameter vector $\widetilde W$ with the mapping and inverse mapping matrices $\widetilde M$, $\widetilde M^\dagger$ is equivalent to the iteration on $W$ with the mapping and inverse mapping matrices $M$, $M^\dagger$, where equivalence means that the common elements in $\widetilde W^l$ and $W^l$ are the same and the values of the elements $\{\widetilde V_m(0)|m = S, 1, 2, ..., M\}$ remain 0. Thus,
$$\widetilde M\widetilde W^l = MW^l, \quad l = 0, 1, 2, ....$$
Furthermore, we define the extended version of $q^l$, denoted $\widetilde q^l$, as
$$\widetilde q^l = \widetilde M^\dagger\Big[g(\Omega^l) + P(\Omega^l)\widetilde M\widetilde W^l - \widetilde M\widetilde W^l - w^l e\Big].$$
Obviously, the relationship between $\widetilde q^l$ and $q^l$ is given by
$$\widetilde q^l = [0, q_1^l, ..., q_{N_Q}^l, 0, q_{N_Q+1}^l, ..., q_{2N_Q}^l, ..., 0, q_{MN_Q+1}^l, ..., q_{(M+1)N_Q}^l]^T.$$
Therefore, from (40) and (41), we have
$$\widetilde q^l = \widetilde M^\dagger\Big[g(\Omega^l) + P(\Omega^l)\widetilde M\widetilde W^l - \widetilde M\widetilde W^l - w^l e\Big] \le \widetilde M^\dagger\Big[g(\Omega^{l-1}) + P(\Omega^{l-1})\widetilde M\widetilde W^l - \widetilde M\widetilde W^l - w^l e\Big]$$
$$\widetilde q^{l-1} = \widetilde M^\dagger\Big[g(\Omega^{l-1}) + P(\Omega^{l-1})\widetilde M\widetilde W^{l-1} - \widetilde M\widetilde W^{l-1} - w^{l-1}e\Big] \le \widetilde M^\dagger\Big[g(\Omega^l) + P(\Omega^l)\widetilde M\widetilde W^{l-1} - \widetilde M\widetilde W^{l-1} - w^{l-1}e\Big],$$
where $w^l = T_0(MW^l) = T_0(\widetilde M\widetilde W^l)$. According to Lemma 8, we have
$$W^l = W^{l-1} + \epsilon_v^{l-1}q^{l-1} \;\Rightarrow\; \widetilde W^l = \widetilde W^{l-1} + \epsilon_v^{l-1}\widetilde q^{l-1};$$
therefore,
$$\widetilde q^l \le \Big[(1 - \epsilon_v^{l-1})I + \widetilde M^\dagger P(\Omega^{l-1})\widetilde M\epsilon_v^{l-1}\Big]\widetilde q^{l-1} + w^{l-1}e - w^l e = B^{l-1}\widetilde q^{l-1} + w^{l-1}e - w^l e$$
$$\widetilde q^l \ge \Big[(1 - \epsilon_v^{l-1})I + \widetilde M^\dagger P(\Omega^l)\widetilde M\epsilon_v^{l-1}\Big]\widetilde q^{l-1} + w^{l-1}e - w^l e = A^{l-1}\widetilde q^{l-1} + w^{l-1}e - w^l e.$$
Notice that
$$A^{l-1}e = (1 - \epsilon_v^{l-1})e + (M + 1)\epsilon_v^{l-1}e = (1 + M\epsilon_v^{l-1})e, \quad B^{l-1}e = (1 - \epsilon_v^{l-1})e + (M + 1)\epsilon_v^{l-1}e = (1 + M\epsilon_v^{l-1})e.$$
Since $A^{l-1}e = B^{l-1}e$, we have
$$A^{l-1}\cdots A^{l-\beta}\widetilde q^{l-\beta} - C_1 e \le \widetilde q^l \le B^{l-1}\cdots B^{l-\beta}\widetilde q^{l-\beta} - C_1 e$$
$$\Rightarrow (1 - \tau^l)\big[\min\widetilde q^{l-\beta}\big] \le \widetilde q^l + C_1 e \le (1 - \tau^l)\big[\max\widetilde q^{l-\beta}\big]$$
$$\Rightarrow \max\widetilde q^l + C_1 \le (1 - \tau^l)\max\widetilde q^{l-\beta} \quad \text{and} \quad \min\widetilde q^l + C_1 \ge (1 - \tau^l)\min\widetilde q^{l-\beta}$$
$$\Rightarrow \max\widetilde q^l - \min\widetilde q^l \le (1 - \tau^l)\big[\max\widetilde q^{l-\beta} - \min\widetilde q^{l-\beta}\big]$$
$$\Rightarrow |q_i^l| \le \max\widetilde q^l - \min\widetilde q^l \le C_2(1 - \tau^l) \quad \forall i,$$
where the first step is due to the conditions of Lemma 6 on the matrix sequences $\{A^l\}$ and $\{B^l\}$; $\max\widetilde q^l$ and $\min\widetilde q^l$ denote the maximum and minimum elements of $\widetilde q^l$, respectively; $C_1$ and $C_2$ are constants; and the first inequality of the last step holds because $\min\widetilde q^l \le 0$. Hence, the conclusion (42) is straightforward.

Therefore, the proof of Lemma 6 can be divided into the following steps:
1. From the property of the sequence $\{\epsilon_v^l\}$, we have $\prod_{i=0}^{\lfloor l/\beta\rfloor - 1}(1 - \epsilon_v^{i\beta}) \to 0$ as $l\to\infty$.
2. From Step 1, and noting that $\tau^l = O(\epsilon_v^l)$, (42) gives $q^l \to 0$ as $l\to\infty$.
3. Therefore, the update on $\{W^l\}$ converges, and the fixed point $W^\infty$ of the convergence satisfies $T_0(MW^\infty)e + W^\infty = M^\dagger T(MW^\infty)$.

This completes the proof.

$^7$ Please refer to [31] for the definition of "comparably often".


APPENDIX F: PROOF OF LEMMA 7

By Lemma 4.2 of [47], $G(\gamma)$ is concave and continuously differentiable except at finitely many points where both right and left derivatives exist. By the envelope theorem [47], the ODE of $\gamma$ is the same as the gradient ascent $\gamma'(t) = \nabla G(\gamma(t))$. Thus $\frac{dG(\gamma(t))}{dt} = |\nabla G(\gamma(t))|^2 > 0$ for almost every $t$ as long as $\gamma(t)$ is not in $\arg\max G(\cdot)$. Therefore, the above ODE converges to $\arg\max G(\cdot)$, which corresponds to $\frac{dG(\gamma(t))}{dt} = 0$, i.e., the power and packet drop rate constraints are satisfied.

APPENDIX G: PROOF OF THEOREM 3

Without loss of generality, in this proof we consider the aforementioned approximate MDP
$$V(Q) = \sum_{m=S}^M\sum_{q=1}^{N_Q}\widetilde V_m(q)I[Q_m = q] \tag{43}$$

on the following redefined set of reference states $\mathcal S_I$: $\mathcal S_I = \{j_{m,q}|m = S, 1, 2, ..., M;\ q = 0, 1, ..., q_I - 1, q_I + 1, ..., N_Q\}$, where the state $j_{m,q}$ is given by
$$j_{m,q} = [Q_S = q_I, Q_1 = q_I, ..., Q_m = q, ..., Q_M = q_I]^T \tag{44}$$
and $q_I < N_Q$ is sufficiently large. Correspondingly, the inverse mapping $M^\dagger$ in (30) should also be redefined such that the per-node potentials $\{\widetilde{\mathbf V}_m\}$ are updated on the reference states $\mathcal S_I$ [41].

First of all, following an approach similar to the proof of Lemma 6, it is easy to see that under the new reference states the per-node potentials also converge almost surely to $\{\widetilde{\mathbf V}_m^\infty(\gamma)\}$ for any given LMs $\gamma$. Next, when the conditions of Theorem 3 are satisfied, given any $\epsilon > 0$, there is an integer $Q_0(\epsilon)$ such that for all $q > Q_0(\epsilon)$ and $q_I = Q_0(\epsilon)$, we have (from the proof of Lemma 5 in Appendix D):
$$\widetilde V_m^\infty(q - r) - \widetilde V_m^\infty(q) = \widetilde V_m^\infty(q_I - r) - \widetilde V_m^\infty(q_I) + O(\epsilon). \tag{45}$$
Moreover, since the $\{\widetilde V_m^\infty(q)\}$ are all monotonically increasing functions of $q$ and the $\{\widetilde V_m^\infty(N_Q)\}$ are all bounded$^8$, we have $\widetilde V_m\big(Q_0(\epsilon)\big) = O(\epsilon)$ for sufficiently large arrivals. Therefore, (45) holds for all $q \in [0, N_Q]$ for sufficiently large $N_Q$ and input arrivals. Similarly, for sufficiently large $N_Q$ and input arrivals, we have
$$\widetilde V_S^\infty(q + n - r) - \widetilde V_S^\infty(q + n) = \widetilde V_S^\infty(q_I + n - r) - \widetilde V_S^\infty(q_I + n) + O(\epsilon) \tag{46}$$
$$\widetilde V_m^\infty(q + r) - \widetilde V_m^\infty(q) = \widetilde V_m^\infty(q_I + r) - \widetilde V_m^\infty(q_I) + O(\epsilon). \tag{47}$$
Hence, with the above equations and applying the converged per-node potentials $\{\widetilde{\mathbf V}_m^\infty(\gamma)\}$ to (11) for the reference states, we get
$$\widetilde V_S^\infty(q) = q + \gamma_{S,d}I[q = N_Q] + \sum_n f_I(n)\big(\widetilde V_S^\infty(q + n) - \widetilde V_S^\infty(n)\big) + \min_u E_H\bigg\{\sum_m \eta_{S,m}\Big[\gamma_{S,p}\sum_k p_{S,m,k} + \sum_n f_I(n)\big(\widetilde V_S^\infty(q + n - R_{S,m}) - \widetilde V_S^\infty(q + n) + \widetilde V_m^\infty(q_I + R_{S,m}) - \widetilde V_m^\infty(q_I)\big)\Big]\bigg\} \tag{48}$$
$$\widetilde V_m^\infty(q) = q + \widetilde V_m^\infty(q) + \min_u E_H\bigg\{\eta_{m,D}\Big[\gamma_{m,p}\sum_k p_{m,D,k} + \widetilde V_m^\infty(q - R_{m,D}) - \widetilde V_m^\infty(q)\Big]\bigg\}, \tag{49}$$
where $m = 1, 2, ..., M$. Finally, for any system state $Q^i = [Q_S^i, ..., Q_M^i]^T$, substituting the above equations into the RHS of the original Bellman's equation in (12), we get
$$\sum_{m=S}^M Q_m^i + \gamma_{S,d}I[Q_S^i = N_Q] + \min_{u(Q^i)}E_H\bigg\{\sum_n f_I(n)V\big(Q^i(n)\big) + \sum_m \eta_{S,m}\Big[\gamma_{S,p}\sum_k p_{S,m,k} + \sum_n f_I(n)\Delta V\big(Q_S^i(m, R_{S,m}, n)\big)\Big] + \sum_m \eta_{m,D}\Big[\gamma_{m,p}\sum_k p_{m,D,k} + \sum_n f_I(n)\Delta V\big(Q_m^i(R_{m,D}, n)\big)\Big]\bigg\}$$
$$\overset{(a)}{=} \sum_{m=S}^M Q_m^i + \gamma_{S,d}I[Q_S^i = N_Q] + \sum_n f_I(n)\widetilde V_S^\infty(Q_S^i + n) + \sum_{m=1}^M \widetilde V_m^\infty(Q_m^i) + \min_{u(Q^i)}E_H\bigg\{\sum_m \eta_{S,m}\Big[\gamma_{S,p}\sum_k p_{S,m,k} + \sum_n f_I(n)\big(\widetilde V_S^\infty(Q_S^i + n - R_{S,m}) - \widetilde V_S^\infty(Q_S^i + n)\big) + \widetilde V_m^\infty(Q_m^i + R_{S,m}) - \widetilde V_m^\infty(Q_m^i)\Big] + \sum_m \eta_{m,D}\Big[\gamma_{m,p}\sum_k p_{m,D,k} + \widetilde V_m^\infty(Q_m^i - R_{m,D}) - \widetilde V_m^\infty(Q_m^i)\Big]\bigg\}$$
$$\overset{(b)}{=} \sum_{m=S}^M Q_m^i + \gamma_{S,d}I[Q_S^i = N_Q] + \sum_n f_I(n)\widetilde V_S^\infty(Q_S^i + n) + \sum_{m=1}^M \widetilde V_m^\infty(Q_m^i) + \min_{u(Q^i)}E_H\bigg\{\sum_m \eta_{S,m}\Big[\gamma_{S,p}\sum_k p_{S,m,k} + \sum_n f_I(n)\big(\widetilde V_S^\infty(Q_S^i + n - R_{S,m}) - \widetilde V_S^\infty(Q_S^i + n)\big) + \widetilde V_m^\infty(q_I + R_{S,m}) - \widetilde V_m^\infty(q_I)\Big] + \sum_m \eta_{m,D}\Big[\gamma_{m,p}\sum_k p_{m,D,k} + \widetilde V_m^\infty(Q_m^i - R_{m,D}) - \widetilde V_m^\infty(Q_m^i)\Big]\bigg\} + O(\epsilon)$$
$$\overset{(c)}{=} \sum_{m=S}^M \widetilde V_m^\infty(Q_m^i) + \sum_n f_I(n)\widetilde V_S^\infty(n) + O(\epsilon) = V(Q^i) + \sum_n f_I(n)\widetilde V_S^\infty(n) + O(\epsilon),$$
where equality (a) is due to (43), equality (b) is due to (47), and equality (c) is due to (48) and (49). Since $\sum_n f_I(n)\widetilde V_S^\infty(n)$ is a constant independent of $Q^i$ and $\epsilon$ is chosen arbitrarily, we have shown that the approximate value function $V(Q) = \sum_{m=S}^M\sum_{q=1}^{N_Q}\widetilde V_m^\infty(q)I[Q_m = q]$ in (43) (formed by the converged per-node potentials $\{\widetilde{\mathbf V}_m\}$) satisfies the original Bellman equation (11) asymptotically (when $N_Q \to +\infty$). As a result, the proposed distributive per-node potential algorithm converges to the global optimal solution, and this completes the proof.

$^8$ $\widetilde V_m^\infty(q)$ measures the contribution of the reward $q + \gamma_{m,p}\sum_k p_{m,D,k}$ if the system starts at $Q = q$. For finite $N_Q$, $\widetilde V^\infty(q)$ is always bounded. On the other hand, since the system is stable, the queue length is bounded with probability 1 for arbitrarily large $N_Q$; hence $\widetilde V_m^\infty(N_Q)$ must be bounded almost surely for arbitrarily large $N_Q$.

REFERENCES

[1] IEEE 802.16's Relay Task Group. [Online]. Available: http://www.ieee802.org/16/relay/index.html

[2] WINNER - Wireless World Initiative New Radio. [Online]. Available: http://www.ist-winner.org/
[3] O. Oyman, N. Laneman, and S. Sandhu, "Multihop relaying for broadband wireless mesh networks: From theory to practice," IEEE Communications Magazine, vol. 45, no. 11, pp. 116-122, Nov. 2007.
[4] G. Li and H. Liu, "Resource allocation for OFDMA relay networks with fairness constraints," IEEE Journal on Selected Areas in Communications, vol. 24, pp. 2061-2069, Nov. 2006.
[5] T. C.-Y. Ng and W. Yu, "Joint optimization of relay strategies and resource allocations in cooperative cellular networks," IEEE Journal on Selected Areas in Communications, vol. 25, no. 2, pp. 328-339, Feb. 2007.
[6] Y. Cui and V. K. N. Lau, "Distributive subband allocation, power and rate control for relay-assisted OFDMA cellular system with imperfect system state knowledge," IEEE Transactions on Wireless Communications, submitted for publication.
[7] S. Kittipiyakul and T. Javidi, "Resource allocation in OFDMA: How load-balancing maximizes throughput when water-filling fails," UW Technical Report UWEETR-2004-0007, 2004.
[8] R. A. Berry and R. Gallager, "Communication over fading channels with delay constraints," IEEE Transactions on Information Theory, vol. 48, pp. 1135-1148, May 2002.
[9] R. S. Ellis, Entropy, Large Deviations, and Statistical Mechanics. Springer, 2006.
[10] D. Wu and R. Negi, "Effective capacity: A wireless link model for support of quality of service," IEEE Transactions on Wireless Communications, vol. 2, pp. 630-643, July 2003.
[11] D. Hui and V. Lau, "Cross-layer design for OFDMA wireless systems with heterogeneous delay requirements," IEEE Transactions on Wireless Communications, vol. 6, pp. 2872-2880, Aug. 2007.
[12] J. Tang and X. Zhang, "Quality-of-service driven power and rate adaptation over wireless links," IEEE Transactions on Wireless Communications, vol. 6, pp. 3059-3068, Aug. 2007.
[13] D. P. Bertsekas, Dynamic Programming: Deterministic and Stochastic Models. Prentice Hall, NJ, USA, 1987.
[14] X. Cao, Stochastic Learning and Optimization: A Sensitivity-Based Approach. Springer, 2007.
[15] M. J. Neely, "Order optimal delay for opportunistic scheduling in multi-user wireless uplinks and downlinks," IEEE/ACM Transactions on Networking, vol. 16, Oct. 2008.
[16] I. Bettesh and S. Shamai, "Optimal power and rate control for minimal average delay: The single-user case," IEEE Transactions on Information Theory, vol. 52, pp. 4115-4141, Sept. 2006.
[17] M. Goyal, A. Kumar, and V. Sharma, "Optimal cross-layer scheduling of transmissions over a fading multiaccess channel," IEEE Transactions on Information Theory, vol. 54, pp. 3518-3537, Aug. 2008.
[18] Y. Cui and V. K. N. Lau, "Delay-optimal power and subcarrier allocation for OFDMA systems via stochastic approximation," IEEE Transactions on Wireless Communications, submitted for publication, 2008.
[19] L. Tassiulas and A. Ephremides, "Dynamic server allocation to parallel queues with randomly varying connectivity," IEEE Transactions on Information Theory, vol. 39, pp. 466-478, Mar. 1993.
[20] E. M. Yeh and A. Cohen, "Throughput and delay optimal resource allocation in multiaccess fading channels," in Proc. ISIT, June-July 2003, p. 245.
[21] A. K. Sadek, K. J. R. Liu, and A. Ephremides, "Cognitive multiple access via cooperation: Protocol design and performance analysis," IEEE Transactions on Information Theory, vol. 53, no. 10, pp. 3677-3696, Oct. 2007.
[22] Y.-W. Hong, C.-K. Lin, and S.-H. Wang, "On the stability region of two-user slotted aloha with cooperative relays," in Proc. IEEE ISIT, June 2007, pp. 356-360.
[23] D. Bertsekas, Dynamic Programming and Optimal Control, Vol. 2. Athena Scientific, 2007.
[24] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley-Interscience, 2007.
[25] W. Nam, W. Chang, S.-Y. Chung, and Y. Lee, "Transmit optimization for relay-based cellular OFDMA systems," in Proc. IEEE International Conference on Communications (ICC '07), pp. 5714-5719, June 2007.
[26] O. Oyman, "Opportunistic scheduling and spectrum reuse in relay-based cellular OFDMA networks," in Proc. IEEE Global Telecommunications Conference (GLOBECOM '07), pp. 3699-3703, Nov. 2007.
[27] G. Li and H. Liu, "Resource allocation for OFDMA relay networks with fairness constraints," IEEE Journal on Selected Areas in Communications, vol. 24, no. 11, pp. 2061-2069, Nov. 2006.
[28] T. C.-Y. Ng and W. Yu, "Joint optimization of relay strategies and resource allocations in cooperative cellular networks," IEEE Journal on Selected Areas in Communications, vol. 25, no. 2, pp. 328-339, Feb. 2007.
[29] B. Wang, Z. Han, and K. Liu, "Distributed relay selection and power control for multiuser cooperative communication networks using buyer/seller game," May 2007, pp. 544-552.
[30] Z. Zhang, J. Shi, H.-H. Chen, M. Guizani, and P. Qiu, "A cooperation strategy based on Nash bargaining solution in cooperative relay networks," IEEE Transactions on Vehicular Technology, vol. 57, no. 4, pp. 2570-2577, July 2008.
[31] J. Abounadi, D. Bertsekas, and V. S. Borkar, "Learning algorithms for Markov decision processes with average cost," SIAM Journal on Control and Optimization, vol. 40, pp. 681-698, 1998.
[32] H. Huang and V. K. N. Lau, "Delay-optimal distributed power and transmission threshold control for S-ALOHA network with finite state Markov chain (FSMC) fading channels," IEEE Transactions on Wireless Communications, submitted for publication.
[33] M. Gastpar and M. Vetterli, "On the capacity of wireless networks: The relay case," in Proc. IEEE INFOCOM 2002, vol. 3, pp. 1577-1586, 2002.
[34] M. Gastpar and M. Vetterli, "On the capacity of large Gaussian relay networks," IEEE Transactions on Information Theory, vol. 51, no. 3, pp. 765-779, March 2005.
[35] T. Cover and A. Gamal, "Capacity theorems for the relay channel," IEEE Transactions on Information Theory, vol. 25, no. 5, pp. 572-584, Sep. 1979.
[36] Z. Han, Z. Ji, and K. Liu, "Non-cooperative resource competition game by virtual referee in multi-cell OFDMA networks," IEEE Journal on Selected Areas in Communications, vol. 25, no. 6, pp. 1079-1090, Aug. 2007.
[37] T. Richardson and R. Urbanke, "The capacity of low-density parity-check codes under message-passing decoding," IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 599-618, Feb. 2001.
[38] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[39] V. I. Istratescu, Fixed Point Theory: An Introduction. Springer Science & Business, 2002.
[40] V. S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008.
[41] J. N. Tsitsiklis and B. Van Roy, "Feature-based methods for large scale dynamic programming," Machine Learning, vol. 22, pp. 59-94, March 1996.
[42] V. S. Borkar, "Stochastic approximation with two time scales," Systems & Control Letters, vol. 29, pp. 291-294, 1997.
[43] L. Georgiadis, M. Neely, and L. Tassiulas, Resource Allocation and Cross Layer Control in Wireless Networks. Now Publishers Inc, 2006.
[44] I. Bettesh and S. Shamai, "Optimal power and rate control for minimal average delay: The single-user case," IEEE Transactions on Information Theory, vol. 52, no. 9, pp. 4115-4141, Sept. 2006.
[45] V. S. Borkar, "Asynchronous stochastic approximation," SIAM Journal on Control and Optimization, vol. 36, pp. 840-851, 1998.
[46] V. S. Borkar and S. P. Meyn, "The ODE method for convergence of stochastic approximation and reinforcement learning algorithms," SIAM Journal on Control and Optimization, vol. 38, pp. 447-469, 2000.
[47] V. S. Borkar, "An actor-critic algorithm for constrained Markov decision processes," Systems & Control Letters, vol. 54, pp. 207-213, 2005.

[Figure 1: system topology with a source node, two relays (RS1, RS2) and a destination.]
Fig. 1. Illustration of the system topology, in which the source node delivers packets to the destination via two relays; the transmitters' queue states are also plotted.

Fig. 2. A block diagram illustrating the complex interactions of the queues at the source node and the M relays. The bit departures from the source node queue are delivered to one of the M relay queues via a contention resolution mechanism (to be elaborated in Section II-D). Each node also has a local power control mechanism that is adaptive to the local CSI and local QSI. The RS selection for delay optimality is not trivial, and it is not clear whether we should always select the RS with the strongest S-RS link. For example, consider a scenario in which the source node wants to deliver bits (or packets) to the destination via one of two RSs. Suppose the S-RS1 link is good but the queue length of RS1 is large, while the S-RS2 link is medium but the queue of RS2 is empty. In this example, it is not obvious whether the source node should choose RS1 (to exploit the good link condition) or choose RS2 so that RS1 and RS2 can enjoy spatial diversity in the R-D link.

[Figure 3: frame structure with channel estimation slot, contention slot (one contention mini-slot per RS plus a notification mini-slot) and transmission slot.]
Fig. 3. An example of the auction mechanism in a two-relay scenario. Each RS calculates the bids for both its S-R and R-D links, i.e., (B_{S,1}, B_{1,D}) for RS1 and (B_{S,2}, B_{2,D}) for RS2. The contention slot is further divided into two mini-slots: in the first mini-slot, RS1 submits the bid min(B_{S,1}, B_{1,D}), and in the second mini-slot, RS2 submits the bid min(B_{S,2}, B_{2,D}). The link with the minimum bid is selected for the transmission slot. If an S-R link is selected, the selected RS notifies the source node in the notification mini-slot of the transmission slot, and the source node starts to transmit the packet. Otherwise, if an R-D link is selected, the selected RS notifies the destination in the notification mini-slot and then starts to transmit the packet to the destination.

[Figure 4 flowchart: initialize the LMs and per-node potential vectors for the source and itself; receive the QSI of the source at the beginning of a frame; calculate the control actions for the frame according to Lemma 4 (or Lemma 5) based on local CSI, local QSI and the QSI of the source; select one link out of all S-R and R-D links according to the minimum-bid criterion in the contention slot; update the LMs and per-node potential vectors according to Algorithm 2 at the end of the frame when the system is in a reference state.]
Fig. 4. The procedure of potential learning.

[Figure 5: average end-to-end delay versus average power constraint per transmitter (dB), with curves for Baseline 2 (dynamic backpressure), Baseline 1 (CSIT only), the centralized optimal solution and the proposed distributive online learning.]
Fig. 5. Average end-to-end delay versus transmit SNR. The number of relays M = 2, the buffer size N_Q = 6 packets, each packet has a fixed size of 40K information bits, and the average arrival rate λ = 200 pck/s. The packet drop rates of Baseline 1, Baseline 2, the proposed distributive online learning and the centralized optimal solution are 13%, 10%, 5% and 5%, respectively.

[Figure 6: average end-to-end delay versus average power constraint per transmitter (dB), comparing the proposed distributive online learning with the closed-form control of Lemma 5 and with the combinatorial-search control of Lemma 4 against Baseline 1 (CSIT only) and Baseline 2 (dynamic backpressure).]
Fig. 6. Average end-to-end delay versus transmit SNR. The number of relays M = 2, the buffer size N_Q = 22 packets, each packet has a fixed size of 5K information bits, and the average arrival rate λ = 1200 pck/s. The packet drop rates of Baseline 1, Baseline 2, the proposed distributive online learning with combinatorial search control and the proposed distributive online learning with closed-form control are 49%, 10%, 5% and 5%, respectively.

[Figure 7: average end-to-end delay versus the number of relays for Baseline 1 (CSIT only), Baseline 2 (dynamic backpressure) and the proposed distributive online learning.]
Fig. 7. Average end-to-end delay versus the number of relays with transmit SNR = 8.5 dB. The buffer size N_Q = 10 packets, each packet has a fixed size of 40K information bits, and the average arrival rate λ = 200 pck/s. The packet drop rates of Baseline 1, Baseline 2 and the proposed distributive online learning are 47%, 28% and 1%, respectively.

[Figure 8: cdf of the queue length, Pr[Q >= Q_0] versus Q_0, for Baseline 1 (CSIT only), Baseline 2 (dynamic backpressure) and the proposed distributive online learning.]
Fig. 8. Cumulative distribution function (cdf) of the queue length with transmit SNR = 8.5 dB. The number of relays M = 7, the buffer size N_Q = 10 packets, each packet has a fixed size of 40K information bits, and the average arrival rate λ = 200 pck/s. The packet drop rates of Baseline 1, Baseline 2 and the proposed distributive online learning are 47%, 28% and 1%, respectively.

[Figure 9: per-node potential functions Ṽ_1(1)-Ṽ_1(5) of the first relay versus the number of iterations; insets report average delays at two points of the trajectory (Proposed: 7.8 / 7.5, CSIT Only: 14.5, Backpressure: 8).]
Fig. 9. Illustration of the convergence property. Potential function versus the scheduling slot index with transmit SNR = 9 dB. The buffer size N_Q = 6 packets, each packet has a fixed size of 40K information bits, and the average arrival rate λ = 200 pck/s. The packet drop rate is 5%.
