Alternative Data Gathering Schemes for Wireless Sensor Networks


Xu Li∗, Kaiyuan Lu∗, Nicola Santoro∗, Isabelle Simplot-Ryl† and Ivan Stojmenovic‡

∗ SCS, Carleton University, Canada, {xlii, klu2, santoro}@scs.carleton.ca
† LIFL, University of Lille 1, France, [email protected]
‡ EECE, University of Birmingham, UK, and SITE, University of Ottawa, Canada, [email protected]

Abstract

One fundamental task of wireless sensor networks (WSNs) is to collect useful information from the sensory field and answer users’ queries. In order to make effective use of the gigantic amount of individual sensor readings, it is essential to equip WSNs with scalable and energy-efficient data gathering mechanisms. Distinct characteristics of WSNs, e.g., large node density, unattended operation mode, high dynamicity and severe resource constraints, pose a number of design challenges on sensor data gathering schemes. An ever-increasing number of research activities have been carried out on this fundamental and attractive research issue. In this paper, we survey some existing solutions, with an emphasis on four data-centric storage-based methods, i.e., Geographic Hash Table (GHT), Distributed Index for Multi-dimensional range query (DIM), double ruling information brokerage, and landmark-based information storage and retrieval, which cover a range of design choices. We comparatively discuss advantages and disadvantages of the four schemes in detail, in accordance with several important design factors including communication overhead, communication hotspots, fault-tolerance and query locality. We seek to present advances on the topic, to trigger new ideas and to help extend existing techniques to increasingly complicated future protocols.

Index Terms: Data gathering, Data-centric storage, Wireless sensor networks.

1. Introduction

Wireless sensor networks (WSNs) are collections of micro-sized sensors that are powered by low-energy batteries and equipped with micro-processors, small memory and radio transceivers. They are usually deployed at random in unknown or hazardous environments for object monitoring and target tracking. Recently, WSNs have emerged as a promising solution to a wide range of applications from data collection to distributed control in both military and civilian fields.

In most WSN applications, one fundamental task is to retrieve useful data about the sensory field and answer users’ queries. As these data are sensor readings about events of interest, we will use “data” and “events” interchangeably in the sequel. WSN-based data retrieval is, however, a challenging problem because of the overwhelming data volume and the distributed data storage mode. The challenge is further compounded by the distinct characteristics of WSNs, e.g., large network size (usually thousands of nodes), stochastic node distribution, high node density, dynamic and unreliable environment, and severe resource (especially energy) constraints. In order to ensure the functionality of WSNs, it is essential to develop scalable, robust, self-adaptive and energy-efficient data gathering mechanisms, for which the following design factors should be considered.

Data-centric models vs. address-centric models: Data-centric models are more desirable than address-centric models for data gathering in WSNs. An address-centric model assigns sensor nodes unique IDs (or names or labels) based on low-level network topology information. It emphasizes data holders and suits queries that are issued to individual sensors. But WSNs center around data rather than the nodes holding the data, and in most cases, queries are issued to the whole network instead of a particular sensor node.
Data-centric models support such data-oriented queries well by giving names to data based on their attributes, e.g., event type, sampling time, geographic location, rather than to sensor nodes. Because redundancy often exists in the sensory data collected by neighboring sensors or by a single sensor at different times, it is not wise to allow every node to reply to queries directly with its raw readings. A local preprocessing phase for redundant information removal may be needed.

Localized algorithms vs. distributed algorithms: Localized data gathering algorithms are more desirable than distributed algorithms that rely on globalized structures (e.g., a spanning tree). The construction of a globalized structure does not necessarily require every node to know the entire network topology, but it requires nodes to know whether they are part of the structure and what their roles are in the structure. The acquisition of this knowledge, however, involves global computation. As a result, the maintenance of such a structure is expensive in both bandwidth and energy, especially when the network topology changes frequently. Localized algorithms are distributed in nature, but they do not rely on any globalized structure. They enable sensors to make decisions solely using local knowledge, i.e., k-hop neighborhood information for a constant k. As they restrict the impact of any topology change to its vicinity (a range of k hops), they are adaptive to topological changes, scalable, and both communication- and energy-efficient.

Fault tolerance: Considering the unreliable nature of WSNs, a data gathering algorithm should have embedded fault tolerance capability, especially in critical or real-time applications such as intrusion detection and emergency rescue. Here, fault tolerance means that a query submitted to the network can always be correctly answered within finite time in spite of environment dynamics and node failures. To improve the entire system’s fault tolerance performance, redundancy must be enforced at different levels, ranging from hardware to networking protocols and to application software. In terms of data gathering algorithms, it should be accomplished by data redundancy (i.e., storing data at multiple locations) and route redundancy (i.e., maintaining multiple data communication routes). In addition, a distributed data storage mode is obviously more resilient against node failure than a concentrated mode.

The remainder of this paper is organized as follows: Sec. 2 presents a taxonomy of the existing data gathering approaches and briefly reviews some schemes in the literature; Sec. 3 describes four different data-centric storage based information storage and retrieval schemes in detail; Sec. 4 compares and comments on the four schemes; Sec. 5 concludes the paper.

2. Literature review

Figure 1 depicts a taxonomy [3] of sensor data gathering and dissemination schemes in the literature. At the top level of the taxonomy are the three different classification methods, i.e., storage location oriented, diffusion direction oriented, and structure oriented. In each of these three branches, the corresponding classification is displayed.

Figure 1: Taxonomy of sensor data gathering and dissemination schemes

In the storage location oriented branch, there are three categories: the external storage approach, the local storage approach and the data-centric storage approach. The three approaches lead to very different cost structures. In the external storage approach, each sensor transmits its readings to an external sink at a message cost of O(√n) per transmission, where n is the network size. The intuition behind this cost is that, in the worst case, a transmission spans the entire network, whose diameter is approximately √n on average. As the external sink collects and stores data from all sensors, external queries (i.e., queries generated outside the network) are cost free. However, each in-network query has to be delivered to the sink, generating O(√n) messages. In the local storage approach, each sensor stores its collected data locally at no communication cost. Because data are distributed in the network, each query, whether in-network or external, has to be directed to all the sensors (e.g., by flooding), leading to O(n) messages. In the data-centric storage approach, each sensor maps its collected data to a unique label, e.g., a geographic location or virtual coordinate in the network, using a global hash function, and sends the data to a sensor determined by the label through the underlying routing protocol. This approach yields O(√n) messages for either storage or query. Data-centric storage-based data gathering is the focus of this paper. Its four representative schemes, GHT [20], DIM [15], Double-Ruling [21] and Landmark [6], will be studied later, in Sec. 3 and 4.

In the diffusion direction oriented branch, there are two categories: the pull diffusion approach and the push diffusion approach. In the pull diffusion approach, a sink “pulls” data from data sources by actively sending queries. This approach can be further divided into two sub-categories, two-phase pull diffusion and one-phase pull diffusion.
In a two-phase pull diffusion algorithm [12], both queries and replies are broadcast throughout the network, resulting in multiple routes established between a source and a sink. A sink gradually increases the data rate to identify the best route, i.e., the route with the lowest latency, to each source, which is subsequently used by the source for data transmission. In a one-phase pull diffusion algorithm [9], queries are also broadcast by a sink to all the sensors, but replies are delivered back to the sink along the reverse of the lowest-latency path taken by the queries, instead of using blind flooding. Specifically, each node retransmits a reply message only to the neighbor from which it received the first copy of the query. Compared with two-phase pull diffusion, the one-phase variant greatly reduces communication overhead, especially when a large number of different events are queried. In the push diffusion approach [10], a source actively “pushes” its collected data to a sink by flooding, and a sink subscribes to interesting events by reinforcing the best data delivery paths (e.g., by changing data rates).

In the structure oriented branch, data gathering algorithms are categorized according to the network structure they employ. A tree is a commonly used structure for data gathering. In the tree-based approach, each sink has a data gathering tree rooted at itself. This tree can be a localized minimum spanning tree including all the sensor nodes [22], or a simple reverse multicast tree with data sources as leaves [12, 16, 17]. A cluster is another favored structure for data gathering. In the cluster-based approach [11, 4, 18], sensors are grouped into clusters, and each cluster has a node elected as cluster head. In a cluster, cluster members send their readings to the cluster head, which processes the received data and reports a digest to sinks. A grid structure can be adopted for scenarios with a single source and multiple mobile sinks [23].
The source constructs a grid where sensors close to the grid points are elected as dissemination nodes. A sink connects to the nearest grid point in its local grid, and uses the overlay network composed of dissemination nodes to obtain data from the source. After the sink moves, it simply needs to connect to its new local grid.
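To make the cost structures of the three storage approaches discussed in this section more concrete, the following Python sketch tallies approximate message counts for a given workload. It is only an illustration: the function name `total_messages` and its parameters are hypothetical, the hidden constants in the O(·) bounds are assumed to be 1, and the network diameter is approximated by √n as in the text.

```python
import math


def total_messages(n, num_events, num_in_queries, num_ext_queries):
    """Estimate total messages for each storage approach.

    Assumption: every O(.) bound from the text is used with constant 1,
    and the network diameter is taken to be sqrt(n) hops.
    """
    hop = math.sqrt(n)
    return {
        # Each reading travels to the external sink; external queries are
        # free, while in-network queries must also reach the sink.
        "external": num_events * hop + num_in_queries * hop,
        # Storing is free; every query is flooded to all n sensors.
        "local": (num_in_queries + num_ext_queries) * n,
        # Both storing and querying are routed to the hashed location.
        "data_centric": (num_events + num_in_queries + num_ext_queries) * hop,
    }


costs = total_messages(n=10_000, num_events=500,
                       num_in_queries=50, num_ext_queries=20)
for approach, messages in costs.items():
    print(f"{approach}: {messages:.0f} messages")
```

Under these assumptions, varying the ratio of events to queries shows when each approach is cheapest: local storage suffers as queries multiply, whereas external storage pays for every event whether or not it is ever queried.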

3. Data-Centric Storage for Data Gathering and Dissemination

In this section, we focus on data-centric storage-based data gathering, which is in general suitable for scenarios where event locations are not known in advance, in-network queries are common, and the network is large in scale and has many simultaneously detected events that are not necessarily all queried by users. We will study four representative schemes, i.e., Geographic Hash Table (GHT) [20], Distributed Index for Multi-dimensional range query (DIM) [15], double ruling information brokerage [21], and landmark-based information storage and retrieval [6]. As we will see, even though the four schemes belong to the same category, their internal mechanisms are very different, and each of them has its own pros and cons, so that they cover a range of design choices under various situations.

3.1. Geographic Hash Table Based Data Gathering (GHT)

GHT [20] is a localized data gathering scheme. It hashes an event Evt to a unique geographic location L(Evt) according to the type/key/index of Evt. This implies that events of the same type have the same hash location. GHT stores Evt among the nodes that immediately enclose L(Evt) and directs any query for Evt to them. Under this circumstance, the query can be answered by any of the storage nodes. Data communication for storage and retrieval is implemented by a combined greedy-face routing protocol, GPSR [13], which has been recognized to be a duplicate (with errors) of the well-known Greedy-Face-Greedy (GFG) routing protocol [2]. Before going through the details of GHT, we briefly introduce protocol GFG.

3.1.1. Greedy-Face-Greedy (GFG)

GFG [2] is a combination of the simple geographic greedy forwarding strategy and the face routing technique. It is the first protocol of its kind that provides guaranteed packet delivery [8].
In a GFG routing process, greedy forwarding takes charge of packet delivery whenever possible, while face routing is used only for passing packets around the void areas that block greedy forwarding. A node greedily forwards a packet towards the destination by choosing as the next hop its neighbor geographically closest to the destination. In the local-minimum case, where the node itself is closer to the destination than all of its neighbors, the packet is forwarded in face routing mode using the right-hand/left-hand rule until the destination, or a node closer to the destination than the local minimum, is found. The right-hand rule is that, to traverse the interior (resp., exterior) of a face, a packet is forwarded in the counterclockwise (resp., clockwise) direction along the perimeter of the face. The left-hand rule is just the opposite of the right-hand rule.

Since face routing is based on a planar graph, a pre-processing phase is needed for planarizing the underlying network graph, which is modeled as a unit disk graph. Any planar graph may be used to support face routing as long as it can be constructed in a localized manner. One option is the Gabriel Graph (GG). To construct GG, a node u preserves every outgoing edge uv that satisfies the condition: the diametral circle passing through u and v contains no neighbor node other than u and v. Another possible option is the Relative Neighborhood Graph (RNG). To construct RNG, a node u preserves every outgoing edge uv that satisfies the condition: the lune formed by the intersection of the two circles, centered at u and v respectively, with radius |uv| contains no other node. Localized Delaunay Triangulation (LDT) [14] can be used as well.

Figure 2: Face routing from s to d.

There are two variants of GFG. One employs the “before-crossing” scheme. Suppose a routing packet Pkt for destination node t is switched to face routing mode at node u. Draw an imaginary line ut from u to t.
Line ut intersects a sequence of faces in the previously constructed GG (or RNG, or LDT). Node u sends Pkt along the first face using the right-hand (or left-hand) rule. Pkt traverses the face perimeter and reaches a node v. Before v forwards Pkt, it checks whether the link, say E, that the packet is about to traverse intersects the imaginary line ut at some point p. If so, v instead sends Pkt along its incident link next to E in the clockwise direction, resulting in a face change. In the new face, Pkt is forwarded in the same way (i.e., using the right-hand rule), and the next face change happens only at an edge intersecting the remaining line segment pt. Face changes ensure that the routing message progresses towards the destination. The other variant employs the “after-crossing” scheme. This scheme is the same as the before-crossing scheme except that the face change happens after a routing packet passes the intersecting link, and that, in the new face, the hand is changed as well, either from right-hand to left-hand or from left-hand to right-hand. Figure 2 illustrates the two variants of GFG. The route discovered using the before-crossing scheme is marked by solid arrows; the route established by the after-crossing scheme is highlighted by dashed arrows.

3.1.2. Data storage and retrieval

When a sensor s detects an event Evt, it hashes Evt to a unique geographic location L(Evt) in the sensory field using the type/key/index of Evt. Thereafter, s encapsulates Evt in an update message and sends the message towards L(Evt) using routing protocol GPSR (a duplicate of GFG). The routing protocol guarantees that the update message is received by the home node of Evt, which is a node geographically closest to L(Evt). The home node locally stores the event data retrieved from the update message. GHT supports hash-point-based data retrieval.
That is, queries for an event Evt are routed towards L(Evt) following a procedure similar to that for update messages, and are eventually received and answered by the home node of Evt. In fact, as we will see in the next section, the queries can also be answered by any replica node on the home perimeter of Evt.
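The storage and retrieval steps of Sec. 3.1.2 can be sketched in Python as follows. This is a simplified, centralized illustration rather than the GHT implementation: the SHA-256-based `hash_location` is an assumed hash function, `home_node` finds the closest node by global search instead of by routing, and `greedy_next_hop` covers only the greedy part of GFG, returning `None` at a local minimum where face routing would take over. All function names are hypothetical.

```python
import hashlib
import math


def hash_location(event_type, width, height):
    """Map an event type to a point in a width x height field.

    Assumption: any deterministic hash shared by all nodes would do;
    SHA-256 is used here purely for illustration.
    """
    digest = hashlib.sha256(event_type.encode("utf-8")).digest()
    x = int.from_bytes(digest[:8], "big") / 2**64 * width
    y = int.from_bytes(digest[8:16], "big") / 2**64 * height
    return (x, y)


def home_node(nodes, location):
    """Return the id of the node geographically closest to `location`."""
    return min(nodes, key=lambda nid: math.dist(nodes[nid], location))


def greedy_next_hop(current, neighbors, nodes, target):
    """One greedy forwarding step towards `target`.

    Returns the neighbor closest to the target, or None when no neighbor
    is closer than the current node (a local minimum).
    """
    best = min(neighbors, key=lambda nid: math.dist(nodes[nid], target),
               default=None)
    if best is None:
        return None
    if math.dist(nodes[best], target) >= math.dist(nodes[current], target):
        return None
    return best


# Store and retrieve one reading at the home node of its event type.
nodes = {"a": (0.0, 0.0), "b": (5.0, 5.0), "c": (9.0, 9.0)}
storage = {}
location = hash_location("temperature", 10.0, 10.0)
storage.setdefault(home_node(nodes, location), []).append(("temperature", 21.5))
answer = storage[home_node(nodes, hash_location("temperature", 10.0, 10.0))]
```

Because producers and queriers compute the same hash location from the event type alone, they meet at the same home node without exchanging any node addresses.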

3.1.3. Home perimeter refresh protocol

The home node of Evt sends, at interval Th, a refresh message that contains all its stored entries for Evt to L(Evt) using GPSR. The message is routed along a face perimeter enclosing L(Evt) and returns to the home node. This face perimeter is called the home perimeter of Evt, and each node on the home perimeter is called a replica node. Replica nodes locally store the data about Evt carried by the refresh message, and append any additional relevant data that they have to the message. Both the home node and the replica nodes maintain an expiry timer Td for Evt to ensure data freshness. They reset Td every time they receive a refresh message (whether from themselves or from other nodes). To ensure that this periodic refresh process generates only local traffic, a node is assumed not to move many radio communication ranges in a period shorter than the refresh interval Th.

During home perimeter traversal, a refresh message may reach a node closer to L(Evt) than the current home node. In this case, this node becomes the new home node and then initiates its own refresh process. By this means, it is always ensured that a node closest to L(Evt) is the home node of Evt and stores the event data in spite of topology changes (e.g., caused by node failure or node mobility). Every time a replica node receives a refresh message originated by another node, it stores the embedded event data and sets a timer Tt. When the timer expires, it initiates a refresh process so as to tolerate a possible home node failure. This refresh process will elect a new home node if the current home node has failed.

Figure 3 shows an example of home perimeter refreshment. In this example, node d sends an update message towards the hash location (marked by a triangle in the figure), and the message finally reaches the home node p. Node p periodically sends a refresh message along the home perimeter and then finds a newcomer q closer to the hash location than itself.
Then, q becomes the new home node, and a new home perimeter is identified. A querier s that previously obtained data from a replica node now needs to acquire the data from the new home node q.

Figure 3: Home perimeter refreshment.

3.1.4. Optimization

To improve performance on dynamic topologies, a join optimization technique is used in GHT. That is, when a node u detects a new neighbor w, it sends w all the event entries it currently has for which w is closer to the event destination (i.e., the hash location) than u, and for which u is the closest among its neighbors to that event destination. Besides, structured replication is adopted to balance storage load. The sensory field is evenly partitioned into grid squares. By repeatedly grouping four neighboring squares, a quad-tree hierarchy is established over squares of different sizes. The home node of Evt selects a number of mirrors from the hierarchy at a certain depth d; a node that detects Evt routes the event to the nearest mirror of the home node. However, in this case, queries must be routed to all the mirror nodes.

3.2. Distributed Index for Multi-dimensional Data Gathering (DIM)

DIM [15] is a localized data gathering scheme. It supports scalable multi-dimensional range queries by utilizing a distributed index, which maps the multi-dimensional space of sensor readings to a two-dimensional geographic space using a geographic locality-preserving hash function. This mapping allows each node in the network to claim a disjoint subset of the attribute space for itself; events falling into that subset are routed to and stored at that node. Similarly to GHT [20], DIM uses a combined greedy-face routing protocol to support data communication for data storage and retrieval.

3.2.1. Network Zone Division

A rectangle R that contains the entire sensory field is known a priori.
R is evenly divided into two zones at level 0 by a vertical line; each of the two level-0 zones is further evenly subdivided into two level-1 zones by a horizontal line. This division is applied recursively to every obtained zone until a desirable level is reached. A zone at level i is uniquely encoded as a binary bit array of length i as follows: for an odd j (corresponding to a vertical division), the j-th bit is ‘1’ if the zone is in the right region, or ‘0’ otherwise; for an even j (corresponding to a horizontal division), the j-th bit is ‘1’ if the zone is in the upper region, or ‘0’ otherwise. The address of a zone is defined as the centroid of the zone. Figure 4 shows a four-level zone division. Note that the above zone division and naming are executed by each node locally. As sensors may not be uniformly deployed in the sensory field, it is possible that there are empty zones at a given level.

A node is said to “own” a zone if the zone is the largest zone that contains only that node and no others. This ownership definition allows nodes to be mapped to zones of different sizes. To build such a node-zone relationship, each node independently and asynchronously decides its own tentative zone (which is assumed to initially contain the entire network) at startup. The only information a node needs for zone determination is the network boundary, which is known a priori, and the locations of its neighbors, which can be easily obtained through local communication. When a node has no one-hop neighbor in a certain direction, it is not able to determine the zone boundary in that direction. In this case, some nodes may have zones with undecided boundaries. Undecided zone boundaries will be decided during the event insertion and query processes.

Figure 4: Event insertion in DIM.
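The zone encoding rule of Sec. 3.2.1 can be illustrated with the short Python sketch below. It assumes an axis-aligned bounding rectangle given as (x_min, y_min, x_max, y_max) and assigns points lying exactly on a dividing line to the right or upper zone; both conventions, as well as the function names `zone_code` and `zone_centroid`, are assumptions made for the example.

```python
def zone_code(x, y, bounds, length):
    """Return the zone code of point (x, y) as a string of `length` bits.

    Odd bit positions (1st, 3rd, ...) come from vertical divisions
    ('1' = right half); even positions come from horizontal divisions
    ('1' = upper half). Points on a dividing line go to the '1' side
    (an assumed tie-breaking rule).
    """
    x_min, y_min, x_max, y_max = bounds
    bits = []
    for j in range(1, length + 1):
        if j % 2 == 1:  # vertical division
            mid = (x_min + x_max) / 2
            if x >= mid:
                bits.append("1")
                x_min = mid
            else:
                bits.append("0")
                x_max = mid
        else:  # horizontal division
            mid = (y_min + y_max) / 2
            if y >= mid:
                bits.append("1")
                y_min = mid
            else:
                bits.append("0")
                y_max = mid
    return "".join(bits)


def zone_centroid(code, bounds):
    """Return the address (centroid) of the zone named by `code`."""
    x_min, y_min, x_max, y_max = bounds
    for j, bit in enumerate(code, start=1):
        if j % 2 == 1:
            mid = (x_min + x_max) / 2
            if bit == "1":
                x_min = mid
            else:
                x_max = mid
        else:
            mid = (y_min + y_max) / 2
            if bit == "1":
                y_min = mid
            else:
                y_max = mid
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


field = (0.0, 0.0, 1.0, 1.0)
code = zone_code(0.1, 0.9, field, 3)
address = zone_centroid(code, field)
```

Since the code of a zone is a prefix of the codes of all zones nested inside it, comparing two codes bit by bit from the left indicates how closely the corresponding zones are related in the division hierarchy.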

3.2.2. Event insertion

In DIM, an event Evt is normalized as a tuple of m attributes ⟨A_1, ..., A_m⟩. Denote by s the node that detects Evt and by k the length of the zone code of s. First, s hashes Evt to a binary code of the same length k as its own zone code. The hash code is computed as follows: (i) for 1 ≤ i ≤ m, if A_i < 0.5, the i-th bit of the code is assigned 0, or 1 otherwise; (ii) for m + 1 ≤ i ≤ 2m, if A_{i−m} < 0.25 or 0.5 ≤ A_{i−m} < 0.75, the i-th bit of the code is assigned 0, or 1 otherwise; (iii) repeat this procedure until all k bits have been decided. Next, node s generates an event insertion message containing Evt and the hash code. In the message, s also marks itself as the owner of Evt. Then it sends the message to the address (i.e., centroid) of the zone corresponding to the hash code using a combined greedy-face routing protocol.

Upon receiving the event insertion message, an intermediate node u first recomputes the hash code of event Evt in accordance with the length of its own zone code, and updates the message with the new hash code if it is longer than the one embedded in the message. Afterwards, u compares the hash code with its own zone code. If u has a longer match than the current owner of Evt, it sets itself to be the event owner in the message and delivers the message to the next hop towards the address of the hash zone. If u has an exact match, u further checks whether its zone has any undecided boundaries. If it has none, it stores Evt locally; otherwise, it sets itself as Evt’s owner in the message and forwards the message to probe around the void area caused by the undecided boundary. During a zone boundary exploration process, there are two possible cases: (i) no other node updates the message, or (ii) a node v updates the message with a longer hash code. In the first case, the message will go back to u, and then u knows that it is the true owner of Evt and stores Evt locally.
In the second case, the message will be forwarded by v to a zone at a finer level, and then the message will either reach a node that owns that zone, or go back to v and then be returned to u, as u is in fact the closest node to that zone. During the above process, if the message encounters a node w whose zone overlaps with u’s, w will communicate with u, and the two nodes u and w then shrink their zones accordingly.

Figure 4 shows an event insertion process. In the figure, both nodes a and b initially guess 0 as their own zone codes. Node a hashes an event ⟨0.4, 0.8, 0.9⟩ to code 0, marks itself as the owner, and sends an event message, briefly represented as a pair (code, owner), towards the address of zone 0. Each intermediate node refines the event message by updating the hash code and forwards it accordingly. The trajectory of the message is marked by solid arrowed lines. The dashed lines pointing to a triangle indicate the transient hash zone at each intermediate node. Finally, node b finds that it is the true owner of the event and then stores the event locally. Also, b notices that the marked owner a has the same zone code as itself and then shrinks its zone; meanwhile, it notifies a, which then also shrinks its zone. After the event insertion, a and b have decided their uncertain zone boundary and change their own zone codes to 00 and 01 respectively.

3.2.3. Range query

In DIM, data retrieval is hash-point-based as in GHT [20]. DIM applies the same encoding rules to range queries and routes query messages in the same way as event insertion messages. Notice that the zone division introduced in Sec. 3.2.1 actually establishes a tree hierarchy. A query is routed down the zone tree from the querier’s zone, and is answered by some nodes at the leaf level. On its way to the leaf zones, the query is split into multiple sub-queries if there is an overlap between the zone of an intermediate node u and the zone code associated with the query.
Specifically, if the range of the first attribute contains the value 0.5, u divides the query into two sub-queries at the middle, i.e., one in which the first attribute ranges from 0 to 0.5, and the other in which it ranges from 0.5 to 1. If one of them overlaps u’s zone, u continues splitting in this way until there is a sub-query small enough to fall completely into its zone; otherwise, it stops and retransmits. In the zone tree, the first split happens at the root of the smallest subtree that contains the entire query.

3.3. Double Ruling Information Brokerage

Double ruling information brokerage (Double-Ruling for short) [21] is a localized data gathering scheme. It differs from GHT [20] and DIM [15] in both data storage and data retrieval. Instead of storing data at an isolated node, it stores data or data pointers along a one-dimensional curve, called the replication curve; in addition to hash-point-based data retrieval, it also supports rendezvous-based data retrieval: a query travels along a one-dimensional curve, called the retrieval curve, to fetch data at the intersection point of the replication curve and the retrieval curve. In this scheme, data communication along a curve is implemented by curve routing [19].

3.3.1. Mapping Sensor Field to Sphere

Since two arbitrary curves in a finite region of the plane are not guaranteed to intersect, the authors turn their attention to a three-dimensional sphere, which has the nice property that any two of its diametral circles must intersect. This raises the problem of mapping the sensory field to a virtual sphere. Stereographic projection [5] is used in this scheme to provide a one-to-one mapping between points on a sphere and points in the plane (i.e., the sensory field). Consider a virtual sphere S of radius r tangent to the sensory field F and placed at the center of F. Refer to the tangent point as the south pole and its antipodal point as the north pole. As shown in Fig. 5, a point h∗ on F can be uniquely mapped to a point h on S, where the line through h∗ and the north pole of S intersects S. Given the location (x∗, y∗) of h∗, the location (x, y, z) of h can be easily calculated, and vice versa. Moreover, every diametral circle on S can be mapped to a circle in F. In F, although these image circles may have different sizes and centers, they all enclose the image of the south pole of S. Notice that the mapping does not preserve distance or area, and that the distortion around the north pole can be high.

Figure 5: Stereographic projection.
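As an illustration of the mapping in Sec. 3.3.1, the following Python sketch computes the stereographic projection in one possible coordinate frame. The frame is an assumption made for the example: the sensory field is the plane z = 0, the south pole (tangent point) is the origin, the sphere center is (0, 0, r), and the north pole is (0, 0, 2r). The function names are hypothetical.

```python
def plane_to_sphere(x_star, y_star, r):
    """Map a field point (x*, y*) to its image (x, y, z) on the sphere.

    The image is the second intersection of the sphere with the line
    through the north pole (0, 0, 2r) and the point (x*, y*, 0).
    """
    rho_sq = x_star**2 + y_star**2
    t = 4 * r**2 / (rho_sq + 4 * r**2)  # line parameter of the intersection
    return (t * x_star, t * y_star, 2 * r * (1 - t))


def sphere_to_plane(x, y, z, r):
    """Inverse mapping; undefined at the north pole (z == 2r)."""
    if z >= 2 * r:
        raise ValueError("the north pole has no image in the plane")
    scale = 2 * r / (2 * r - z)
    return (scale * x, scale * y)


# A field point two units from the tangent point, with unit sphere radius.
h = plane_to_sphere(2.0, 0.0, r=1.0)
h_star = sphere_to_plane(*h, r=1.0)
```

In this frame the tangent point maps to itself, and points far from the center of the field map close to the north pole, which is where the distortion mentioned above is concentrated.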

However, it is proven that the length of a circle ξ on S is not too different from the length of its image circle ξ∗ in F. In fact, we can adjust the radius r of S to a suitable value so that the mapping gives a constant distortion on distance. This justifies the use of stereographic projection for mapping the sensory field to a sphere. Each sensor computes its image on S and uses the image to perform data replication and retrieval as if it were on S.

3.3.2. Data Replication

A data producer hashes its data to a location h on the sphere S according to the data type/key/index. Then it routes the data to h following a diametral circle on S. This circle is called the replication curve and is denoted by C(p, h). It is uniquely defined by the location p of the data producer and the hash location h. During the routing process, every node along C(p, h) either replicates the data locally or, in order to save storage space, stores a pointer to where the actual data is stored. Notice that, in addition to the hash location h, all the replication curves of the same data type pass through the antipodal point h̄ as well. Thus there are actually two far-apart rendezvous nodes h and h̄ storing all the data of the same type, providing good fault-tolerance support.

3.3.3. Data Retrieval

Suppose that a data consumer at location q wants to obtain a certain type of data generated by a data producer at location p. The consumer first computes the hash point h of the data and then gets the data using different methods depending on system requirements. An intuitive and simple way is to direct the query to h or the antipodal point h̄, as in protocol GHT [20]. An alternative and preferable method is trajectory based distance-sensitive data retrieval.
In the trajectory based distance-sensitive retrieval method, the consumer computes, based on its own location q and the hash point h, a retrieval circle L(q, h) on S, which is perpendicular to the line through h and h̄. Then it travels (using a message) along L(q, h) until it hits the replication curve C(p, h) of the data. Notice that L(q, h) may not be a diametral circle. L(q, h) and C(p, h) have two intersection points. To quickly find the closer one, the consumer uses a doubling trick: it travels a distance 2^i in a randomly selected direction, with i = 0 initially; if it has not encountered an intersection point, it turns around, increments i and travels a distance 2^i in the opposite direction from q; it stops once it reaches an intersection point. The traveled distance is proven to be bounded by O(d), where d is the distance between p and q on S. If the consumer wants to aggregate all the data of the type concerned, it just needs to traverse the entire L(q, h), since L(q, h) intersects all the replication curves of that data type. As a matter of fact, it is sufficient for the consumer to traverse any closed circle on S that separates the hash point h and the antipodal point h̄, yielding a large degree of flexibility and load balancing. The most powerful retrieval strategy is that the consumer traverses an arbitrary diametral circle to obtain all the data stored in the network.

Figure 6: Distance sensitive data retrieval

3.3.4. Routing along a curve

In a discrete WSN, data replication and retrieval curves can be implemented as routing paths [6]. In the process of curve routing, two pieces of additional information are attached to a routing packet: a parametric equation of the curve and the direction along the curve for forwarding the message. Based on local position information, each forwarding node makes a greedy decision to select the next hop neighbor that is further along the curve than itself in the required direction.
Specifically, each intermediate node takes points at small intervals along the curve within its transmission range and computes the nearest such point for each neighbor. The message is then forwarded to the neighbor that advances furthest along the curve in the required direction, so that data replication or retrieval completes in as few hops as possible.

3.4. Landmark-based Information storage and retrieval Landmark-based information storage and retrieval (Landmark for short) [6] is a distributed, location-free data gathering scheme. It relies heavily on a landmark-based distributed routing protocol, GLIDER [7], and establishes a two-layer hierarchy. At the top layer are a number of pre-determined landmark nodes, while at the bottom layer are regular sensors. A distributed hash table is then constructed using rooted shortest path trees at the top level and finger trees at the bottom level for data storage and retrieval. Below we first introduce the basic building block, GLIDER routing, and then go through the details of this scheme.

3.4.1. GLIDER Naming and Routing Protocol Based on a set of carefully selected landmark nodes, a Voronoi diagram [1], called the landmark Voronoi complex (LVC), is first constructed over the network in a distributed fashion using graph distance. Specifically, each landmark node broadcasts a message with a hop counter recording the number of hops that the message has traveled. A node may receive such messages from multiple landmark nodes, but it forwards only the messages of the nearest landmark nodes, in terms of graph distance, that it has seen so far. By listening to these messages, a node is able to maintain a list of its closest landmarks. In the LVC, each Voronoi cell is referred to as a Voronoi tile. The nodes having only one closest landmark are generally the internal nodes of the Voronoi tile of that landmark, and the nodes with multiple closest landmarks are the border nodes of several adjacent Voronoi tiles.
However, a node with only one closest landmark may also be considered a border node if one of its neighbors lies in a different Voronoi tile.
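The tile and border-node classification described above can be sketched as follows. This is a centralized simulation that runs one breadth-first search per landmark on a connected graph; the distributed protocol instead obtains the same hop counts from pruned landmark broadcasts. Breaking ties by the smallest landmark identifier when assigning a home tile is an assumption made here for illustration, as are the function names.

```python
from collections import deque


def hop_distances(adj, source):
    """Breadth-first search: hop count from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def landmark_voronoi(adj, landmarks):
    """Return (closest, home, border) for a connected graph `adj`.

    closest -- node -> sorted list of its nearest landmarks (by hop count)
    home    -- node -> home landmark (assumed tie-break: smallest identifier)
    border  -- set of nodes with several closest landmarks, or with a
               neighbor whose home tile differs from their own
    """
    dist = {lm: hop_distances(adj, lm) for lm in landmarks}
    closest = {}
    for node in adj:
        best = min(dist[lm][node] for lm in landmarks)
        closest[node] = sorted(lm for lm in landmarks if dist[lm][node] == best)
    home = {node: lms[0] for node, lms in closest.items()}
    border = {node for node in adj
              if len(closest[node]) > 1
              or any(home[nb] != home[node] for nb in adj[node])}
    return closest, home, border
```

On the five-node path 0-1-2-3-4 with landmarks 0 and 4, node 2 is equidistant from both landmarks and is therefore a border node, while node 3 has a single closest landmark but is still a border node because its neighbor 2 is assigned to the other tile.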

After LVC construction, one node is elected as leader to collect information about all the Voronoi tiles, such as their adjacency relations and the graph distances between neighboring landmarks. From the collected information, the leader node builds a Delaunay triangulation on the landmark nodes. This Delaunay triangulation is called the combinatorial Delaunay triangulation (CDT). It stores a global connectivity abstraction of the underlying network. Note that both the LVC and the CDT are approximations of the ones that would be constructed using exact geographic location information. The leader node then broadcasts the CDT to all the landmark nodes, each of which computes the shortest path tree rooted at itself over the CDT and broadcasts the CDT together with the tree to all the nodes within its own Voronoi tile. The shortest path trees computed by landmark nodes are later used as routing tables. Having obtained the CDT, each node starts a flooding process confined to its Voronoi neighborhood, i.e., its home tile and neighboring tiles, to compute its local landmark coordinate. A node’s local landmark coordinate is a vector of neighborhood distances, i.e., the distances from the node to each of its reference landmarks. A node’s reference landmarks include its home landmark and the neighboring landmark nodes, which can be derived from the CDT. The local landmark coordinates are used to conduct greedy routing within a Voronoi tile.

Figure 7: GLIDER routing from s to d.

When a node wishes to talk with another node in a different tile, it first finds the shortest path from its home landmark (home tile) to the destination landmark (destination tile) in its pre-computed shortest path tree. Routing then consists of two parts: inter-tile routing and intra-tile routing. In inter-tile routing, the next tile on the path is set as the temporary destination, and the goal is to cross the current tile to reach the next tile by intra-tile routing.
Intra-tile routing is straightforward: greedy routing is applied in the local landmark coordinate system. During greedy routing, if a local minimum is reached, a flooding process, restricted to the current tile, is started to escape it. Figure 7 illustrates GLIDER routing from node s to node d. In the figure, landmark nodes are represented by big dots; the shortest path tree rooted at the home landmark of s is shown by solid lines; the routing path is highlighted by thick broken lines; the other network details are omitted.

3.4.2. Data Replication A data producer s, whose home tile is denoted by T(s), first hashes data Evt to a tile HT(Evt) according to data type/key/index. Then it routes Evt, by GLIDER, to HT(Evt) along the shortest path sp(T(s), HT(Evt)) (which we refer to as the replication path) in the shortest path tree rooted at the landmark node of T(s). On its way to HT(Evt), Evt is replicated in each intermediate tile. The replication is restricted to a finger tree F(p) rooted at a randomly selected sensor p in the tile. Finger tree F(p) has three branches, each of which ends at a boundary of its home tile. If the tile has at least three neighbor tiles, one branch of F(p) intersects the boundary b_main shared with the next tile along sp(T(s), HT(Evt)), and the other two intersect the two boundaries adjacent to b_main, as shown in Fig. 8, where the finger tree is indicated by solid arrowed lines. If a tile has fewer than three neighbor tiles, all three branches intersect b_main. All nodes on a finger tree store either the data or a pointer to where the data is stored, depending on the tradeoff between storage cost and querying cost.

Figure 8: Finger tree and retrieval path.

3.4.3. Data Retrieval When a data consumer c in tile T(c) wants to obtain data Evt, it computes HT(Evt) using the same hash function and sends a query by GLIDER routing along sp(T(c), HT(Evt)) (referred to as the retrieval path) in the shortest path tree rooted at the landmark of T(c). If an intermediate tile has Evt, it (in fact, the corresponding landmark node) answers the query immediately; otherwise it forwards the query towards HT(Evt), and Evt will eventually be retrieved from another intermediate tile or from HT(Evt). In order to detect as soon as possible whether an intermediate tile has Evt, in-tile data retrieval obeys two rules: (i) if the current tile has at least three neighboring tiles, the query is routed by GLIDER inter-tile routing to one of the boundaries adjacent to b_main, from there to the other adjacent boundary of b_main, and finally to b_main, as shown in Fig. 8, where the retrieval path is marked by a dashed arrowed line; (ii) otherwise, the GLIDER intra-tile routing scheme is used. It is proven that these rules guarantee that the retrieval path sp(T(c), HT(Evt)) intersects the finger tree if there is one. The retrieval path produced by the above rules follows a zig-zag line, which brings extra message overhead. To improve performance, finger tree construction is slightly modified so that a query will hit the finger tree in most cases by just following the GLIDER routing path to the hash tile. This modification replaces the zig-zag curve by a natural GLIDER route towards the hash tile, and the query cost is noticeably reduced without sacrificing any other network resource.
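At the granularity of tiles, the replication and retrieval procedure of Sections 3.4.2 and 3.4.3 can be sketched as below. The sketch deliberately ignores finger trees and in-tile routing: a tile either holds an event key or it does not. The SHA-1 based hash_tile function, the dictionary used as storage, the use of hop counts on the tile graph as path lengths and all function names are assumptions made for illustration and are not taken from the Landmark scheme itself.

```python
import hashlib
from collections import deque


def hash_tile(key, tiles):
    """Map an event key to one tile deterministically (assumed hash: SHA-1)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    ordered = sorted(tiles)
    return ordered[int.from_bytes(digest[:4], "big") % len(ordered)]


def tree_path(cdt, root, target):
    """Path from `root` to `target` in a BFS shortest path tree rooted at `root`.

    `cdt` maps each tile to the set of its adjacent tiles and is assumed
    to be connected.
    """
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in sorted(cdt[u]):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    path = [target]
    while path[-1] != root:
        path.append(parent[path[-1]])
    return path[::-1]


def replicate(cdt, store, source_tile, key):
    """Store `key` in every tile on the replication path sp(T(s), HT(key))."""
    for tile in tree_path(cdt, source_tile, hash_tile(key, cdt)):
        store.setdefault(tile, set()).add(key)


def retrieve(cdt, store, consumer_tile, key):
    """Walk the retrieval path sp(T(c), HT(key)); the first tile holding
    `key` answers. Returns (tile, tiles_crossed) or (None, None)."""
    path = tree_path(cdt, consumer_tile, hash_tile(key, cdt))
    for hops, tile in enumerate(path):
        if key in store.get(tile, ()):
            return tile, hops
    return None, None
```

Since the hash tile is the last tile on both the replication path and the retrieval path, a query for replicated data always succeeds in this simplified model, and it stops early whenever the two paths share an earlier tile.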

4. Comparative Study In this section, we comparatively study the performance of the data gathering schemes described above. The study is based on four design factors that are crucial to a scalable and robust sensor data gathering approach, especially with respect to data-centric storage-based methods: communication overhead, communication hotspots, fault tolerance and query locality. A summarized comparison can be found in Tab. 1.

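Table 1: Scheme comparison

| Factor | Criterion | GHT [20] | DIM [15] | Double-Ruling [21] | Landmark [6] |
|---|---|---|---|---|---|
| Assumptions | location-awareness | yes | yes | yes | no |
| Assumptions | network boundary | yes | yes | yes | no |
| Communication Overhead | | large | moderate | moderate | very large |
| Query Locality | | no | no | some level | no |
| Hotspot Problem | storage hotspot | no | relieved | yes | yes |
| Hotspot Problem | query hotspot | aggravated | relieved | no | relieved |
| Fault Tolerance | distributed storage | yes | yes | yes | yes |
| Fault Tolerance | data redundancy | 3 replications | 2 replications | dynamic | dynamic |
| Fault Tolerance | retrieval flexibility | no | no | yes | no |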

4.1. Communication overhead Although GHT is a localized algorithm, it is likely to have high communication complexity due to its periodic home perimeter refresh. A home node periodically sends a refresh message along the home perimeter to maintain consistency of home node selection. This message contains all the relevant event data stored at the home node and is thus large in size. In the worst case, its size can be as large as the home node’s local memory. In practice, such a long message has to be divided into small pieces and transmitted piece by piece, possibly leading to large message overhead. Landmark [6] is built on the location-free routing protocol GLIDER [7]. GLIDER uses network-wide flooding to construct its global structures (i.e., landmark Voronoi complex, combinatorial Delaunay triangulation and shortest path trees) and to set up nodal landmark coordinates. These globalized structures and virtual coordinates are sensitive to topological change, and their maintenance is expensive (it usually involves global computation). Since WSNs are failure-prone environments, Landmark is more costly than the localized GHT. Compared with GHT and Landmark, both DIM and Double-Ruling have relatively moderate message cost because they use neither periodic messages nor globalized topology-sensitive structures.

4.2. Communication hotspot A communication hotspot is a node that is accessed frequently, or simultaneously by a large number of sensors. It is a communication bottleneck in the network and often becomes a single point of failure. Nodes around a hotspot are heavily involved in relaying messages for the hotspot and may thus deplete their battery power quickly, causing energy holes and shortened network lifetime. Clearly, a sound data gathering scheme should not generate hotspots. In GHT, home node mirrors are used to balance storage load and avoid the hotspot problem during data storage.
However, queries have to be sent to all the home node mirrors, leading to an increased number of hotspots during data retrieval. DIM divides data into disjoint data sets according to multiple data attributes and stores them at distinct nodes. This multi-attribute-based division has larger cardinality than data-type-based division and thus decreases the probability of hotspots occurring. Landmark and Double-Ruling both route data to a unique hash location determined by data type and thus have a hotspot problem during data storage. However, because they support trajectory-based data retrieval (queries are answered at an intersection of a replication path and a retrieval path) rather than hash-point-based retrieval (i.e., queries are answered at the hash location), no hotspots, or a reduced number of hotspots, occur during data retrieval.

4.3. Fault tolerance The four schemes achieve fault tolerance by using data redundancy and/or distributed data storage. They store data of an event type in a distributed and redundant manner at multiple nodes. The selection of storage nodes has great impact on fault tolerance performance. These nodes should be located as far apart from each other as possible, because otherwise they may all fail together in certain regional damaging incidents (e.g., a fire), causing permanent and total data loss. GHT uses evenly distributed home node mirrors to balance storage load. That is, data of an event type are stored at the home node mirrors closest to the event location rather than concentrated at the home node. With this mechanism, data of an event type will not be totally lost even if the home node fails. Furthermore, GHT uses replica nodes (nodes on the home perimeter) to tolerate home node failure. In a network with a certain level of density, most face perimeters of its planar graph are 3 hops in length [20], and the few replica nodes are located close to the hash location. Under these circumstances, the improvement is limited.
DIM hashes event data to nodes according to multiple event attributes rather than solely by event type. As a result, an event type is mapped to multiple nodes evenly distributed in the network. Data of an event type will not be totally lost unless its entire set of hash nodes fails, which rarely happens. However, as these hash nodes store disjoint sets of entries for the event type, DIM is vulnerable to partial data loss. To prevent partial data loss, DIM backs up the event data stored by a hash node at a node located in a zone adjacent to the zone of the hash node. This method provides very limited improvement because the only backup node is geographically close to the hash node. Double-Ruling has the best fault tolerance capability among the four schemes. Event data are redundantly stored at nodes along a replication curve defined by the location of the data source and the hash location of the event type; data of the same event type from different sources are routed to the same hash location along different replication curves. Queriers are able to retrieve the data by querying along any curve separating the hash location and the antipodal hash location of the event type. Retrieval flexibility and well-distributed data replication together ensure data availability and accessibility in the presence of node failures. Like Double-Ruling, Landmark replicates event data along a replication path defined by the source tile and the hash tile. Unlike Double-Ruling, its retrieval path (a shortest path from the querier tile to the hash tile) is fixed. Due to the lack of retrieval flexibility, Landmark has inferior fault tolerance capability compared with Double-Ruling. Consider the following scenario: all the nodes in the hash tile fail, and all the nodes in the intersection tile of the retrieval path and the replication path fail too. Since the retrieval path is a pre-determined shortest path, data retrieval will not succeed in this case although the event data themselves still exist (in other tiles) in the network.

4.4. Query locality Query locality means that the distance (i.e., hop count) traveled by a data query is proportional to the distance between the querier and the data source. The lack of query locality induces prolonged data acquisition latency and increased message overhead. All four schemes provide limited support for query locality. In particular, GHT and DIM perform worst because they use hash-point-based data retrieval and place no restriction on the distance between a data source and the corresponding hash node. Double-Ruling achieves a high degree of flexibility in query processing by mapping the sensory field to a virtual sphere and replicating and querying data along circles on the sphere. It supports trajectory-based distance-sensitive data retrieval for fetching concerned data from a particular data producer.
However, in order to obtain all the data of a certain type, a query message has to traverse the entire retrieval curve, which can be very long. In Landmark, for a data source and a querier that are far apart from each other, the probability that their replication path and retrieval path intersect quickly becomes low, and in the worst case the query travels all the way to the hash tile of the data. Landmark thus provides no guarantee on query locality.

5. Conclusions Wireless sensor networks (WSNs) have opened new vistas for a wide range of application domains, and also pose many design challenges to the research community. One of the fundamental research issues of WSNs is data gathering. In this paper, we introduced an existing taxonomy for data gathering in WSNs and studied in detail four representative data-centric storage-based schemes, i.e., GHT [20], DIM [15], Double-Ruling [21] and Landmark [6], which cover a range of design choices. Double-Ruling is the most promising of the four schemes: it outperforms the others in almost all aspects. However, we note that all four schemes suffer from the hotspot problem because of the inherent weakness of the hash-point-based data storage and/or retrieval methods they employ, and that they all have weaknesses in query locality. Notably, Landmark is a special scheme that requires no geographic location information for operation but induces huge message overhead. It sacrifices precious network bandwidth and energy in order to relax the location-awareness requirement, a requirement that has nevertheless been considered reasonable and crucial to the functionality of WSNs in surveillance applications. Thus it cannot be considered a practical solution candidate. The incompleteness of previous work indicates that research on data gathering is still far from mature. On the basis of existing proposals, and given the many open research possibilities, we can expect more research efforts to be devoted to this area in the future and can hope for fruitful results to follow.

6. Acknowledgements This work was partially supported by NSERC Strategic grants STPGP 336406-07 and STPSC 356913-2007B, NSERC Discovery and Doctoral grants, and UK Royal Society Research Merit Award.

7. References

[1] Aurenhammer, F. and Klein, R. “Voronoi Diagrams”. http://www.pi6.fernuni-hagede/publ/tr198.pdf

[2] Bose, P., Morin, P., Stojmenovic, I., and Urrutia, J. “Routing with Guaranteed Delivery in Ad Hoc Wireless Networks”. In Proceedings of the 3rd ACM International Workshop on Discrete Algorithms and Methods for Mobile Computing and Communications (DIAL-M), pp. 48–55, 1999.

[3] Chen, W.-P. and Hou, J. C. “Data Gathering and Fusion in Sensor Networks”. In I. Stojmenovic, editor, Handbook of Sensor Networks, chapter 8, pp. 493–526, Wiley, 2005.

[4] Chen, W.-P., Hou, J. C., and Sha, L. “Dynamic clustering for acoustic target tracking in wireless sensor networks”. IEEE Transactions on Mobile Computing, 3(3):258–271, 2004.

[5] Coxeter, H. S. M. Introduction to Geometry, John Wiley & Sons, New York, 2nd edition, 1969.

[6] Fang, Q., Gao, J., and Guibas, L. J. “Landmark-Based Information Storage and Retrieval in Sensor Networks”. In Proceedings of the 25th IEEE International Conference on Computer Communications (INFOCOM), pp. 1–12, 2006.

[7] Fang, Q., Gao, J., Guibas, L., de Silva, V., and Zhang, L. “GLIDER: gradient landmark-based distributed routing for sensor networks”. In Proceedings of the 24th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM), vol. 1, pp. 339–350, 2005.

[8] Frey, H. and Stojmenovic, I. “On Delivery Guarantees of Face and Combined Greedy-Face Routing algorithms in Ad Hoc and Sensor Networks”. In Proceedings of the 12th Annual ACM International Conference on Mobile Computing and Networking (MobiCom), pp. 390–401, 2006.

[9] Heidemann, J., Silva, F., and Estrin, D. “Matching data dissemination algorithms to application requirements”. In Proceedings of the 1st International Conference on Embedded Networked Sensor Systems (Sensys), pp. 218–229, 2003.

[10] Heinzelman, W., Kulik, J., and Balakrishnan, H. “Adaptive protocols for information dissemination in wireless sensor networks”. In Proceedings of the 5th Annual ACM/IEEE International Conference on Mobile Computing and Networking (MobiCom), pp. 174–185, 1999.

[11] Heinzelman, W., Chandrakasan, A., and Balakrishnan, H. “Energy-efficient communication protocol for wireless microsensor networks”. In Proceedings of the 33rd Annual Hawaii International Conference on Systems Science (HICSS), 2000.

[12] Intanagonwiwat, C., Govindan, R., and Estrin, D. “Directed Diffusion: A Scalable and Robust Communication Paradigm for Sensor Networks”. In Proceedings of the 6th ACM Annual International Conference on Mobile Computing and Networking (MobiCom), pp. 56–67, 2000.

[13] Karp, B. and Kung, H. T. “Greedy Perimeter Stateless Routing for Wireless Networks”. In Proceedings of the 6th ACM Annual International Conference on Mobile Computing and Networking (MobiCom), pp. 243–254, 2000.

[14] Li, X.-Y., Calinescu, G., Wan, P.-J., and Wang, Y. “Localized Delaunay triangulation with application in ad hoc wireless networks”. IEEE Transactions on Parallel and Distributed Systems, 14(10):1035–1045, 2003.

[15] Li, X., Kim, Y. J., Govindan, R., and Hong, W. “Multi-dimensional range queries in sensor networks”. In Proceedings of the 1st ACM International Conference on Embedded Networked Sensor Systems (Sensys), pp. 63–75, 2003.

[16] Madden, S., Franklin, M. J., Hellerstein, J. M., and Hong, W. “The Design of an Acquisitional Query Processor for Sensor Networks”. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 491–502, 2003.

[17] Madden, S., Franklin, M. J., Hellerstein, J. M., and Hong, W. “TAG: a Tiny AGgregation service for ad-hoc sensor networks”. ACM SIGOPS Operating Systems Review, vol. 36, pp. 131–146, 2002.

[18] Melodia, T., Pompili, D., Gungor, V. C., and Akyildiz, I. F. “A Distributed Coordination Framework for Wireless Sensor and Actor Networks”. In Proceedings of the 6th ACM International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc), pp. 99–110, 2005.

[19] Nath, B. and Niculescu, D. “Routing on a curve”. ACM SIGCOMM Computer Communication Review, 33(1):155–160, 2003.

[20] Ratnasamy, S., Karp, B., Shenker, S., Estrin, D., Govindan, R., Yin, L., and Yu, F. “Data-centric storage in sensornets with GHT, a geographic hash table”. Mobile Networks and Applications, 8(4):427–442, 2003.

[21] Sarkar, R., Zhu, X., and Gao, J. “Double Rulings for Information Brokerage in Sensor Networks”. In Proceedings of the 12th ACM Annual International Conference on Mobile Computing and Networking (MobiCom), pp. 286–297, 2006.

[22] Tan, H. O., Korpeoglu, I., and Stojmenovic, I. “A Distributed and Dynamic Data Gathering Protocol for Sensor Networks”. In Proceedings of the 21st IEEE International Conference on Advanced Information Networking and Applications (AINA), pp. 220–227, 2007.

[23] Ye, F., Luo, H., Cheng, J., Lu, S., and Zhang, L. “A two-tier data dissemination model for large-scale wireless sensor networks”. In Proceedings of the 8th Annual ACM/IEEE International Conference on Mobile Computing and Networking (MobiCom), pp. 148–159, 2002.
