Computer Networks 56 (2012) 1710–1722


End-user perspectives of Internet connectivity problems

Sihyung Lee (Seoul Women's University, 621 Hwarangro, Nowon-Gu, Seoul 139-774, South Korea)
Hyong S. Kim (Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, USA)

Article info

Article history: Received 13 February 2011; Received in revised form 20 July 2011; Accepted 17 January 2012; Available online 26 January 2012.

Keywords: Network measurement; Connectivity problems; End-user perspective; Network monitoring; Availability; Access network

Abstract

Network connectivity problems often cause severe issues for end users, ranging from the malfunction of applications (e.g., WWW and email) to a complete loss of connectivity. This paper seeks to characterize these problems and to discover the most efficient ways to improve network connectivity from the perspective of end users. Over a period of 7 months, we monitored network connection failures on 103 hosts used daily by end users. We find that more than 60% of downtime involves misconfigurations in the end hosts. These errors occur for various reasons, such as subtle interactions between end-host applications and routers, inconsistent network policies, and software bugs. Solving these problems can require an excessive amount of time, for expert and non-expert users alike. In contrast, problems occurring in network cores and servers are less visible to end users. For example, certain routing problems in network cores are much less likely to be seen than previously reported (persistent forwarding loops comprise roughly 0.02% of the observed downtime, compared with the 2.5% reported in previous studies). Our results show that, although a single error in a network core or a server might affect many end users, the accumulated impact of errors near end hosts is much larger than that of errors in network cores and servers. We thus believe that by focusing on the problems that occur at or near end systems, we can significantly improve network availability for end users.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

When Internet users are unable to access online resources, they quickly become frustrated. Although these connectivity disruptions can be traced to various segments of an Internet path (Fig. 1), past research has emphasized problems in core networks and servers [12,19,24,40], since one such problem can affect a large number of end users. For example, on one occasion, a configuration error in a single router completely disrupted connectivity between YouTube and the rest of the Internet for hours [23]. Correcting these problems is also critical for network administrators. In contrast, problems near end hosts have been relatively neglected, since each of these problems typically impacts only a few users.

* Corresponding author. Tel.: +82 2 970 5608; fax: +82 2 970 5974. E-mail addresses: [email protected] (S. Lee), [email protected] (H.S. Kim).

1389-1286/$ - see front matter © 2012 Elsevier B.V. All rights reserved. doi:10.1016/j.comnet.2012.01.009

We re-evaluate this assumption and show that connectivity problems frequently occur near end hosts and that their accumulated impact is far more significant than that of errors in core networks and servers. We also observe a diverse range of root causes for these problems, and end users' struggles to solve them. In particular, we aim to characterize network connectivity problems as they are seen by end users. In this way, we can discover how to improve end users' perception of network connectivity, as opposed to network administrators' perspective. This distinction is important since the perspectives of end users can differ from those of network administrators: many end users' first reaction to a connectivity problem is to examine their local networks before they blame core networks and servers. To this end, we deployed our measurement systems on geographically diverse hosts and observed the connection failures that occurred on these hosts. We then estimated the location, duration, and potential causes of these


[End hosts & access networks (clients): 62.6%; core & edge networks: 8.1%; destinations (servers): 29.3%.]

Fig. 1. Locations of possible failures in the Internet. The numbers represent the percentage of connectivity problems that an end user experiences.

failures using a combination of measurements, including active probes and passive monitoring of connections and configuration changes. In more than 60% of the occurrences, the root causes of the identified problems involved end hosts and access networks, such as parameter mismatches between end hosts and access routers. Table 1 presents selected results, and Table 3 provides more examples of these problems, which are both diverse and complex. Although some problems arise from users' lack of basic knowledge, most of the identified problems occurred equally across different expertise levels, and even expert users could not easily solve them. Based on the extent and complexity of problems near end hosts, we claim that the management of such problems is as important as that of problems in core networks and servers. We also argue that focusing more on end-host problems can significantly reduce the amount of time that end users experience networking problems. This paper makes the following contributions:

- We improve our understanding of connectivity loss from the perspective of end users. Most previous studies on network failures either do not consider failures that involve end hosts [19,40], or they focus on performance bottlenecks (e.g., congested links) [2,17,25,27].


We focus on connectivity loss and quantify these problems by observing both network connections and host status.

- We report the results of running our measurement system over a seven-month period. We present various characteristics of the identified problems, such as their long duration and frequent recurrence (e.g., over 10% of near-host problems lasted more than an hour and recurred within a week). We also classify these problems according to their root causes, and then suggest solutions to prevent them.

- We identify and demonstrate the potential benefits of observing real user activities when quantifying problems from the perspective of real end users, as compared to previous studies that artificially create connections to random destinations [19,40]. In particular, we show that failures in the middle of the Internet are far less significant than previous works have suggested.

- We describe the design of the passive and active measurement techniques we employ, and present the methods we use to characterize connection patterns. These techniques can serve as a basis for network diagnostic systems, and they can also be used to classify traffic for other purposes, such as security, rate-limiting, and billing.

The remainder of the paper is organized as follows. In Section 2, we review related work. We then describe our dataset in Section 3 and explain the measurement methodology for our analysis in Section 4. In Section 5, we present our analysis results: the classification and quantification of network connectivity problems. We then further analyze the causes of these problems and discuss possible solutions in Section 6. Lastly, we conclude in Section 7 with future directions.

2. Related work

In this section, we present previous works relevant to the analysis of network failures.

2.1. Problems in edge and core networks

A significant amount of work has been performed to analyze and circumvent failures in network cores and edges. Refs. [5,16] characterize failures in the IP backbone

Table 1. Results summary.

Results related to near-host problems:
- Extent: More than 60% of the downtime experienced by the end users is related to near-host problems. This corresponds to over 6.9% of the time the users spent accessing the Internet.
- Frequency and duration: A significant fraction of near-host problems are long-lasting (over 10% last more than 1 h) and continue to occur (most recur within a week).
- Types of problems: Problems are caused by diverse reasons, such as conflicts among applications, incorrect configuration changes triggered by software updates, and inconsistent configuration procedures. Both expert and non-expert users struggle to diagnose these problems due to the lack of proper support and automation.

Results related to measurement methodology:
- We find evidence that previous measurement studies may over- or under-estimate problems when they artificially create traffic equally over all destinations, without considering the popularity of these destinations. For example, users often access a small number of destinations, and these popular destinations are much less likely to experience forwarding loops (0.02% of the downtime) than is reported by previous studies.



in terms of their frequency, duration, and type. These analyses are followed by solutions that reduce core and edge downtime. Hubble [12] and iPlane [18] help pinpoint networks that are responsible for a large number of packet drops and delays. To avoid paths through these networks, Pathlet Routing [29] and MIRO [41] provide multiple alternative paths. Faulty network configurations are shown to cause more than 50% of failures in computer networks [6,9,31]. Minerals, rcc, and NetPiler [14,26,35] were thus developed to identify these errors. Griffin and others also propose guidelines for designing network configurations that prevent route oscillation and loops [13,22,39]. We believe that this collective body of work has substantially reduced problems in core networks, which is why we observe relatively more problems near end hosts.

2.2. End-user perspective of problems

Several previous studies analyze the quality of Internet paths from the end hosts' perspective. We classify these works into three groups according to their data collection methods and the locations of data collection.

The first group of studies collects user traffic from aggregation points within service providers' networks [17,24] in order to characterize the aggregated behavior of user traffic. Our work collects data from individual end hosts in order to better characterize failures as perceived by each end user. This collection method has additional visibility into problems that happen between end hosts and aggregation points.

The second group of studies [2,8,19,25,27,32,37,40] collects data by artificially generating traffic. Most of these studies are performed on testbeds built for measurement studies (e.g., PlanetLab and RON). Our methodology differs from these studies in two aspects. First, these works observe potential failures in artificially created connections, either to randomly chosen destinations or between two hosts that participate in their studies. The artificial connections may not precisely represent real end-user patterns. For example, users may rarely access destinations that are shown to be highly problematic by these studies. Our study focuses on connections made by real end hosts based on their usage behavior and thus more closely characterizes failures experienced by real end users. We quantify this difference in Section 5.4. Second, the previous studies perform measurements only while their end hosts maintain access to the Internet. Our measurements continue to track cases where end hosts lose access to the Internet, thereby taking into account failures that may completely disconnect hosts from the Internet. To further investigate the reasons for these problems, we also collect additional information from end hosts, such as configuration changes and NIC status.

The third group of studies collects real user traffic from end hosts, but with objectives different from ours. ConnectionWatch [10], HostView [11], and OneClick [21] all focus on users' perspectives on performance characteristics (e.g., round-trip time). Giroire et al. [15] characterize network port usage and session lifetime as users move across different environments (e.g., inside and outside of a company). Our work focuses on connectivity loss, which may disconnect a user from the Internet or may prevent the user from accessing certain destinations. The methodologies also differ, in addition to the objectives. The three previous studies rely on passive measurements, whereas our work additionally performs active measurements by replicating various user requests (e.g., HTTP requests and TCP SYNs). Our work also leverages traditional probing techniques (e.g., ping and traceroute). These active measurements provide richer information about the duration of problems and the location of problematic areas.

3. Data description

We examined over 18 million sessions, collected from a total of 103 end hosts over a 7-month period. This period is between June 2009 and December 2009 for the first set of 57 hosts, and between November 2010 and May 2011 for the second set of 46 hosts. The collected logs contain approximately 112 GB of data. These logs are novel in three respects. First, the hosts are operated by real users on a daily basis. The network traffic and access patterns are thus not artificially created as in [2,8,19,25,27,32,37,40]. Second, the users own the hosts, freely accessing various Web sites and downloading content of interest to them. This setting is different from that of enterprises, where policies typically discourage the use of certain applications (e.g., p2p) and Web sites [15]. Third, the logs record various types of user activities in detail, such as Web accesses, application usage, and configuration changes. These detailed measurements would not have been possible had we used unprivileged software, such as Java applets [8]. Our passive measurements are then reinforced by active measurements to more accurately identify the locations and durations of problems. In summary, the logs are real and detailed, allowing us to observe a broad range of user behavior.

The 103 hosts (61 desktops and 42 laptops) are geographically dispersed: 39 hosts in North America, 15 in Europe (Greece, Italy, and Portugal), and 49 in Asia (Korea, Taiwan, and Singapore). These hosts are mainly based in residential areas and academic networks, and 31 laptops occasionally moved as their users traveled. The hosts run various versions of Windows (XP/Vista/7). As shown in Fig. 1, the hosts are connected to the Internet through a single first-hop router, and thus a failure within this first hop can entirely disconnect the hosts. We solicited users to participate in the measurement on a voluntary basis through social networks and emails. The 103 participants then downloaded and installed the measurement system on their personal machines. Gift cards were offered as an incentive to participate. Since our measurements might collect private information of the users, we preserved their privacy by taking the following measures. First, we anonymized all IP addresses in our logs according to the prefix-preserving algorithm [20]. Second, we recorded the headers of TCP and HTTP packets, but recorded neither their payload nor the URIs that the users access. Third, we published the source code of our tool


on the Web, so that the users could inspect the code [43]. The 103 participants represent a wide range of users: their ages vary from 17 to 60, and more than half of the users are not comfortable with fault diagnosis in computer systems (e.g., they have no knowledge of IP addresses). The participants' most dominant traffic types are Web and email accesses, corresponding to more than 80% of the traffic volume. For several participants, p2p applications corresponded to 5-20% of the traffic.

Our dataset represents a small fraction of the total users worldwide, because deploying detailed instrumentation on real user machines is difficult due to data sensitivity (as compared with a testbed or an enterprise setting, in which installation can be enforced). However, we believe that our results capture important characteristics of the network connectivity problems experienced by real users, for the following reasons. First, we observe the 103 hosts for an extended period (7 months) in order to compensate for the small number of participants. This extended period allows us to observe connections to nearly 33.6% of all /24 IP prefixes (i.e., measurements performed over sessions to 5.64 million distinct /24 prefixes), a non-negligible fraction of the Internet. Second, we test the statistical significance of our results: near-host problems are significant, and these problems account for more than twice the number of problems in the other two areas. Note that the number of hosts used in our measurement is comparable to the 37 hosts in Paxson's work [40] and the 39 hosts in Savage's work [37].

4. Measurement methodology

Fig. 2 illustrates the components of our measurement system. The four data-collection components (i.e., passive monitoring, active measurement, status monitoring, and user report) run continuously in real time on the end hosts while these machines are powered on.
The result analysis engine runs on a remote server, where measurements are periodically collected. The primary objective of our measurement collection is to determine the location and duration of problems related to networking. We monitor connectivity problems that occur when (i) an end host cannot connect to destinations at the IP layer and (ii) an end host fails to retrieve data at the application layer even though the underlying IP-layer connection is successfully established. The measurement is performed according to the following steps.

(1) Our measurement system passively monitors connections made by end hosts and identifies connection failures. We focus on TCP connections and HTTP transactions since TCP and HTTP dominate Internet traffic by a significant margin [17]. A failure occurs when a TCP SYN is not followed by a SYN/ACK, or when an HTTP request does not receive a non-erroneous reply, within a specified time. The passive monitoring component leverages the pcap library [30].

(2) If the passive monitoring observes a connection failure, the system determines the location of this failure by initiating an active measurement toward the same destination D and port P. The active measurement comprises a combination of probes: replications of the failed connection attempts toward (D, P) (i.e., TCP SYN or HTTP request), as well as pings and traceroutes toward D. The probes are sent at specific intervals and continue until the problem is resolved, in order to estimate the duration of the failure. We timestamp each measurement and use these timestamps to estimate the duration of problems.

(3) In addition to monitoring connections, we also monitor the status of Network Interface Cards (NICs). For example, if all NICs fail to authenticate themselves to a wireless AP and are disconnected, the host completely loses its connectivity, and the problem is identified as existing near the host. We use a binary representation of NIC status: connected or disconnected. We log the status each time it changes, and we also timestamp each log entry for estimating the duration of NIC-down periods.

The secondary objective of our measurement is to further categorize the types and investigate the causes of discovered problems.
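The step-(1) failure rule for TCP can be sketched as follows. This is a minimal offline illustration over a simplified, hypothetical event log (tuples of timestamp, event type, connection id); the deployed system instead inspects live packets via the pcap library, and applies an analogous rule to HTTP requests without a non-erroneous reply.

```python
# Sketch of the step-(1) failure rule, assuming a simplified event log.
# The (timestamp, event, connection_id) record format is hypothetical;
# the real system derives these events from live pcap traces.

FAILURE_TIMEOUT = 5.0  # seconds; no reply within this window counts as a failure

def find_failed_connections(events):
    """Return connection ids whose TCP SYN saw no SYN/ACK within the timeout."""
    syn_times = {}      # connection_id -> time the first SYN was sent
    answered = set()    # connections that received a timely SYN/ACK
    for ts, event, conn in sorted(events):
        if event == "SYN":
            syn_times.setdefault(conn, ts)
        elif event == "SYNACK" and conn in syn_times:
            if ts - syn_times[conn] <= FAILURE_TIMEOUT:
                answered.add(conn)
    return [c for c in syn_times if c not in answered]
```

A connection is flagged both when no SYN/ACK arrives at all and when it arrives only after the timeout, matching the role of the 5-s failure timeout described in Section 4.1.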
Fig. 2. Measurement system overview.

We use the details of failed connections in passive measurements, such as the identification of

responsible applications and port numbers. We also (i) collect configuration changes related to networking (e.g., proxy server settings) and (ii) ask users of each end host to document the networking problems they experience (User Report). The collected information is then correlated with the failed connections. For example, we observed that HTTP connections started to fail, and also discovered that a proxy configuration had been added just before the onset of this failure.

In the following subsections, we describe in more detail how we perform the passive/active measurements (Section 4.1), how we monitor changes in NIC status and configurations (Section 4.2), and how we estimate the locations and durations of connectivity problems using the various measurement components (Section 4.3). We also present methods to increase the accuracy of our analysis by correcting artifacts in our logs (Section 4.4).

4.1. Measurement settings

Failure timeout: We consider a connection request to have failed if we observe no response within 5 s. This 5-s period indicates a potential failure because the round-trip time of a packet is rarely more than 5 s [38]. We thus initiate active measurements after the 5-s period so as to identify failures at the earliest possible time.

Active measurement interval: The probe sets are sent at specific intervals. These intervals are independent and exponentially distributed, following Wolff's PASTA property [33]: if measurement times form a Poisson process, the proportion of our measurements that observe a given state equals the proportion of time that the Internet spends in that state. We use two different intervals: an interval of 2 min is employed for the first six probe sets, while an interval of 15 min is used if the problem is not resolved after these six sets. The short intervals for the first six probes allow us to more accurately measure the duration of transient failures.
The long intervals after the first six probes limit the measurement overhead by suppressing unnecessary probes; the probability of a valid response drops below 0.2% if none of the first six probes receives a valid response [12]. The increase in intervals also prevents the probes from being misinterpreted as DoS attacks.

Composition of a probe set: For each probe set, we send three consecutive probes in order to increase the accuracy of our estimation. For example, we define a ping failure as three consecutive pings with no reply. We limit traceroutes to 30 hops since most Internet paths complete within 30 hops, as shown in [40]. Finally, in order to reduce the measurement overhead, we do not allow duplicate probes to the same (D, P).

Termination of active measurement: A series of probes sent to (D, P) ends either (i) when we no longer observe the failure or (ii) when the duration of the probes exceeds 24 h. This 24-h limit prevents probes from running indefinitely. Although we may miss a few problems that last longer than 24 h, we believe this does not significantly bias our results. After termination, another probe is soon re-initiated if the user re-attempts to access D and the problem persists.
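The probing schedule described above can be sketched as a generator of probe-set times. This is a minimal sketch under our own naming (`probe_schedule` and its parameters are illustrative, not the authors' code); the real system sends probe sets at these times rather than merely producing timestamps.

```python
import random

def probe_schedule(seed=None, short_mean=120.0, long_mean=900.0,
                   short_sets=6, max_duration=24 * 3600.0):
    """Yield probe-set times, in seconds after failure detection.

    Intervals are independent and exponentially distributed so that, by the
    PASTA property, the probes observe time averages of the network state.
    The first six sets use a 2-min mean interval; subsequent sets use a
    15-min mean; and the whole series is capped at 24 h.
    """
    rng = random.Random(seed)
    t, sent = 0.0, 0
    while True:
        mean = short_mean if sent < short_sets else long_mean
        t += rng.expovariate(1.0 / mean)  # exponential inter-probe gap
        if t > max_duration:
            return  # 24-h limit: stop probing this (D, P)
        yield t
        sent += 1
```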

4.2. Collection of additional information

Monitoring NIC status: We observe the connectivity of NICs to help identify problems near end hosts. To determine the connectivity of a NIC, we use Windows Management Instrumentation (WMI), which consists of groups of monitored components (e.g., configurations related to NICs) and functions to access the status of these components. We monitor three WMI groups: Network Adapter, Network Adapter Configuration, and System Driver. We determine that a NIC is connected if these three WMI groups have the following properties; otherwise, we consider the NIC to be disconnected:

NetworkAdapter:NetConnectionStatus = "connected" or "authenticated"
NetworkAdapterConfiguration:ConfigManagerErrorCode = "working"
NetworkAdapter:Availability = "running"
SystemDriver:Status = "OK"

Configuration changes related to networking: We collect configurations related to IP, wireless, proxy, browser, and security, since incorrect configurations often lead to connectivity problems. We also monitor the status of NIC drivers because end hosts occasionally lose connectivity when NIC drivers are corrupted. We access these configurations through (i) the above three WMI groups, (ii) Windows registries (Internet Settings, firewall policies, IE settings, and Google Chrome settings), and (iii) file systems (Mozilla Firefox configurations). The full list of configurations can be found at [43]. The entire set of configurations is collected the first time our measurement tool is installed, and any changes are then collected incrementally.

User report: We encourage the participants of our measurements to specify the nature of the connectivity problems they experience. For the users' convenience, we implemented user reports as an icon in the system tray, so that users can easily click this icon and describe problems.
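The NIC-status predicate above can be expressed as a pure function over the four WMI properties. This is a sketch of the decision rule only: the values are the symbolic names used in the text, and the Windows-specific WMI queries that would supply them are omitted (the function name and signature are ours).

```python
def nic_connected(net_connection_status, config_error_code,
                  availability, driver_status):
    """Apply the paper's NIC-status rule to already-queried WMI properties.

    A NIC counts as connected only if all four conditions hold; any other
    combination is treated as disconnected.
    """
    return (net_connection_status in ("connected", "authenticated")
            and config_error_code == "working"
            and availability == "running"
            and driver_status == "OK")
```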
Since user reports also indicate approximate failure times, we can examine configuration changes starting from the reported time range, which provides helpful information for discovering the causes of failures. For example, a user reported that incorrect proxy configurations led to failed HTTP connections. This report led us to find a series of configuration changes as a signature: at time t1, a new proxy server at IP3 is added to a browser's settings; at time t2, HTTP connections made by this browser are directed to IP3 and fail; and at time t3, the proxy setting is removed, and HTTP connections no longer go through IP3 and succeed.

4.3. Estimating the locations/durations of connectivity problems

We classify the locations of connectivity problems into three areas: near-host (within one hop of an end host), near-destination, and in the middle of the Internet (core and edge networks).
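A minimal sketch of this three-way classification, anticipating the precise criteria detailed in the remainder of Section 4.3 (near-host if all NICs are down or no packets return from beyond the first hop; near-destination if the destination responds or the last responsive traceroute hop shares its /24 subnet; middle otherwise). The boolean inputs and function names are ours, and are assumed to be pre-computed from the passive and active measurements.

```python
def classify_failure(all_nics_down, any_reply_past_first_hop,
                     dest_replied, last_hop_ip, dest_ip):
    """Classify a failure as 'near-host', 'near-destination', or 'middle'."""
    if all_nics_down or not any_reply_past_first_hop:
        return "near-host"
    if dest_replied or same_slash24(last_hop_ip, dest_ip):
        return "near-destination"
    return "middle"

def same_slash24(ip_a, ip_b):
    """True if two dotted-quad IPv4 addresses share their first 24 bits."""
    return ip_a.split(".")[:3] == ip_b.split(".")[:3]
```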


A connection failure identified by our system is classified as near-host if the problem surfaces within the first hop of the end host and no packets are forwarded past this first hop. This definition includes both an end host and its first-hop router, since these devices are often managed by end users, and the configurations of the two devices often conflict with each other. More precisely, a connectivity problem is near-host either (i) if all NICs are down or (ii) if all existing active measurements (as well as passive measurements) observe no inbound packets from beyond the first hop. If a failure is not assigned to near-host, then it is classified as near-destination or middle, according to the active measurements. A problem is considered near-destination (i) if the active measurements elicit any responses from the destination or (ii) if the last responsive IP of the traceroute is within the same /24 subnet as the destination [24]. Otherwise, the problem is assigned to the middle category.

Once the location of a connectivity failure is determined, its duration is computed as t_end - t_inception. For the case where all NICs are down, t_end is the time when the NIC status changes back to connected, and t_inception is the time when all NICs become disconnected. Otherwise, t_end is the termination time of the probes corresponding to the failure, and t_inception is the initiation time of the same probes.

4.4. Data cleaning

To further increase the accuracy of our analysis, we exclude the following periods from our calculation of network downtime. These periods do not significantly contribute to the end users' perception of network connectivity.

Idle time and intentional offline periods: We exclude periods during which end users are logged out, and also those during which end hosts are shut off or on standby. During these idle periods, users do not use the hosts to access the Internet, and we thus suspend our measurements.
We then resume these measurements as soon as the end hosts come back up and the users log in. Our analysis also excludes intentional offline periods, i.e., periods during which users choose to stay offline. For example, laptop users who travel often may stay at places where they have no Internet access, and may work offline for an extended period of time. We identify these periods as log-in sessions in which no additional connection attempts are made after an initial attempt; this initial attempt could be automatically generated by the OS or a wireless management application. Nevertheless, according to our logs and our conversations with the participants, we find that they attempted to stay online most of the time.

Never-accessed destinations: We filter out failed connections to (D, P) if no successful connection was made to (D, P) throughout our entire 7-month measurement period. This rule is based on the assumption that if the service at (D, P) were significant to a user, (D, P) would have been successfully accessed at least once during this period. We analyzed the failures filtered by this rule, and nearly 90% of them involve connections made by p2p applications. These p2p applications attempt to connect to peers that


dynamically receive new IPs over time and that often go down. The rest of the failures represent connections to unassigned or unreachable IP ranges (e.g., private addresses). We find that many of these failures occur when some form of malware is scanning IP addresses.
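The never-accessed-destinations rule can be sketched as a simple set filter. The function name and data shapes are ours: `failed` is an iterable of (destination, port) pairs from failed attempts, and `succeeded` is the set of pairs that succeeded at least once during the measurement period.

```python
def filter_never_accessed(failed, succeeded):
    """Keep only failed (destination, port) pairs that also succeeded at
    least once during the whole measurement period; pairs that never
    succeeded (e.g., churned p2p peers, malware scans to unreachable
    ranges) are dropped from the downtime calculation."""
    succeeded = set(succeeded)
    return [dp for dp in failed if dp in succeeded]
```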

5. Measurement results

5.1. Proportion of near-host problems

We first classify the identified failures into the three areas (near-host, middle, and near-destination) according to the classification described in Section 4.3. We then accumulate the durations of the problems for each area. These accumulated durations represent the impact of failures in the three areas, i.e., how long the failures in each area disturb the connectivity of end users. Fig. 3 presents the proportion of these three classes of problems that we observed over the 7-month period. Each vertical bar represents one participant, and it is divided into the three classes: near-destination at the top, middle in the middle, and near-host at the bottom. The bars are ordered according to the proportion of near-host problems. We present the results for the first set of 57 hosts due to space limitations, but a similar trend can be found in the second set of 46 hosts.

The majority of the participants (70.9%, or 73 of the 103) experience greater near-host downtime than the downtime in the other two areas combined. Most of the remaining participants experience greater near-destination downtime. Only two users (e.g., user ID 11 in Fig. 3) experience greater middle downtime, and these users frequently reported their providers being under maintenance, often on a weekly basis. We also observe similar proportions of the three classes in the total number of hours. Among a total of 22.2K hours of downtime, near-host downtime corresponds to 13.9K hours, nearly 6.9% of the non-idle time during which users access the Internet.[2] The other two classes, near-destination and middle, correspond to 6.5K hours and 1.8K hours, respectively.

Although the majority of the participants consistently experience more near-host problems, we further confirm this using statistical tests. We test for significant differences between the percentages of near-host problems and those of the other two areas over the 103 samples.
We hypothesize that we see more near-host problems than problems in the other two areas, as shown in Table 2. Let p_near-host denote the percentage of near-host problems and p_middle,near-dest that of the other two areas combined. To test our hypotheses, we conduct a one-sided t-test on p_near-host and p_middle,near-dest. This test yields a p-value, the smallest value of the significance level α for which H0 can be rejected; the smaller the p-value, the more compelling the evidence that H0 should be rejected and Ha accepted. The test is significantly in favor of our argument: the p-value is far less than 0.05, and H0, which posits no difference, is therefore rejected at the α = 0.05 significance level.

The hours of near-host downtime could be even higher than what is presented here, for the following two reasons. First, we focus on HTTP and do not exhaustively analyze all application-layer errors. Many of these errors occur when applications are not correctly configured on the client side (e.g., users are unable to use an application when they incorrectly enter registration information or when the configuration is corrupted by other applications). Second, we could not observe problems that are visible only with access to the internal state of applications (e.g., an application hangs and does not send any requests).

In the following sections, we further investigate the near-host problems, including their frequency, duration, types, and causes. A detailed analysis of problems in the other two areas has been extensively studied in previous works [5,12,16,17,31] and is outside the scope of this paper.

[2] If a connectivity problem is identified, we measure the entire duration of this problem until it is resolved, rather than measuring only the period during which users attempt to connect; users may want to access a destination even though they do not explicitly re-try. However, we exclude cases where connections cannot be made but users may not experience problems, as shown in Section 4.4.

5.2. Frequency and duration of near-host problems

We now examine (i) how long each near-host failure lasts and (ii) how often these failures recur over the 7-month period. Fig. 4 presents the cumulative distribution of near-host failure durations. As shown in the figure, nearly 45% of the near-host incidents last less than 10 min. We find that most of these short-lived incidents occur when the hosts attempt to connect to a wireless AP, either because the hosts are moved from the range of one AP to another, or because

6

12

24 hr

Fig. 4. CDF of the durations of near-host failures.

50

Fig. 3. Proportion of problems in the three regions.

1 hr

Log (duration of connectivity loss)

0

0

4592

10 min

the hosts are restarted after stand-by. Although the number of the short-lived incidents is significant, their accumulated duration accounts for only 5.8% of the entire near-host downtime. The rest of the near-host incidents (55%) last more than 10 min, which correspond to 94.2% of the accumulated downtime. Among these long-lived incidents, nearly 10% lead to more than an hour of connectivity loss. Since the long-lasting incidents (i.e., those that last more than 10 min) comprise a significant fraction of the downtime, we then investigate whether the long-lasting incidents are one-time rare events or whether they continue to occur to the users. We analyze the inter-arrival times of these incidents and present them in Fig. 5(a). Nearly 80% of the incidents are followed by another problem within 24 h. As a glance into our data, we also show a one-week timeline in Fig. 5(b). Each thin horizontal line represents the week timeline of one participant, and a bold line denotes the period of one near-host incident. The figure shows that near-host incidents both last long and occur frequently for a number of users. This is also true across the entire 7-month period. We present the results for the first set of 57 hosts, but a similar trend can be found in the second set of 46 hosts as well. The recurrence of near-host problems is due to several reasons. First, host configurations can be incorrectly modified as users continue to install/remove/update applications. Second, many users fail to resolve near-host problems, and they resort to temporary measures, such as rebooting routers and hosts. Lastly, users unnecessarily spend time on configuring their hosts when configuration interfaces are inconsistent across locations and applications. We describe more details of these reasons in Sections 5.3 and 6. In summary, a significant fraction of near-host incidents are sufficiently long to cause inconvenience to users, and

Table 2. Summary of statistical tests for significant differences in the percentage of near-host problems (p_near-host) and that of the other two areas (p_middle,near-dest).

  Hypotheses:
    H0: p_middle,near-dest = p_near-host (the amount of near-host problems and that of the other areas bear no significant difference)
    Ha: p_middle,near-dest < p_near-host (more near-host problems exist)

  p_middle,near-dest (10th, median, 90th percentile): (0.33%, 18.5%, 96.3%)
  p_near-host (10th, median, 90th percentile): (3.68%, 81.5%, 99.7%)
  p-value: p < 0.05
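The hypothesis test summarized in Table 2 can be sketched with Python's standard library. The per-user percentages below are synthetic stand-ins for the measured data, and approximating the t distribution with the standard normal is a simplification of this sketch (reasonable at n = 103).

```python
import math
from statistics import NormalDist, mean, stdev

def paired_one_sided_t(p_near_host, p_other):
    """One-sided paired test of H0: p_other = p_near_host against
    Ha: p_other < p_near_host.

    Computes the paired t statistic on the per-user differences; with
    n = 103 the t distribution is close to standard normal, so the
    upper-tail p-value is approximated with the normal CDF.
    """
    diffs = [a - b for a, b in zip(p_near_host, p_other)]
    t = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
    return 1.0 - NormalDist().cdf(t)

# Synthetic per-user percentages mimicking Table 2's skew (not real data):
near_host = [80.0 + (i % 10) for i in range(103)]
other = [100.0 - x for x in near_host]
p_value = paired_one_sided_t(near_host, other)
# p_value is far below 0.05, so H0 is rejected at the 0.05 level
```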

[Fig. 5. Frequency of near-host failures: (a) CDF of inter-arrival time of near-host failures and (b) near-host failure periods over a week between 2009-09-01 and 2009-09-08.]

These problems also recur, some within a few hours of the previous incident.

5.3. Classification of near-host problems

Motivated by the recurrence and long duration of near-host problems, we investigate their likely causes and classify the problems accordingly. To this end, we first (i) discover the signatures of each type of near-host problem in our logs and then (ii) match our logs against these signatures in order to quantify each type of problem. To discover the signatures, we use a machine-learning technique, a decision tree, as it has been used successfully by NetPrints [7] for discovering signatures of home-network problems in network traffic and configurations. We label our logs with the problem types reported in the user reports. These labeled logs are fed into the machine-learning algorithm, which then discovers signatures for each problem type. The signatures we have found cover nearly 65% of our dataset; the remainder is classified as 'unknown'. In the following subsections, we present the identified problems in three groups, as shown in Fig. 6. More specific examples in these groups can be found in Table 3. In Section 6, we suggest solutions for reducing these problems.

[Fig. 6. Classification of near-host problems.]

Inconsistent configurations between hosts and networks: Problems often occur because end-host configurations conflict with network policies. For example, an end host may be configured to use an SMTP server at IP1, but the network may only allow SMTP connections through a designated server at IP2. Such a network policy helps the network filter out spam emails. This class of problems is likely to happen (1) when users are not aware of changes in network policies, or (2) when host configurations are incorrectly modified by application updates, installations, and removals. The associated applications and underlying network protocols often do not provide users with sufficient information for troubleshooting. In Table 3, problems 1-8 belong to this category. Resolving these problems requires changes in the host configurations (problems 1-4), the network policies (problems 5-7), or both (problem 8).

Problems in end hosts and first-hop routers: Some problems occur within end hosts and first-hop routers and are not related to interactions between end hosts and networks. For example, buggy software can corrupt NIC drivers and prevent communication over a particular port. This group also includes hardware failures and cabling problems. These problems are often fixed temporarily by restarting hardware or software, reinstalling software, or restoring configuration settings to a previous point. However, repairing them permanently takes a large amount of time, since a number of potential root causes exist for the same symptom. In Table 3, problems 9, 10, and 11 belong to this category.

Usability of configurations and fault diagnosis: Complex and inconsistent configuration environments often confuse users. Consequently, users make mistakes and spend excessive time configuring hosts and routers. Although problem 12 in Table 3 is directly related to this category, we claim that most of the other problems are also somehow relevant to it.

Table 3. Sample problems that involve end hosts. In most of the scenarios, the end hosts are behind a router, as shown in Fig. 1.

1. Reported problem: An email client could not send outgoing emails when the user was at a particular set of locations.
   Cause: The service provider changed its policies so that SMTP connections could be made only through a set of servers in its networks; on other occasions, the provider changed the port numbers used for email. The user was not notified of these policy changes.
   Fixes: reconfigure the SMTP server IP addresses and port numbers.

2. Reported problem: An end host regularly lost complete access to the Internet.
   Cause: The router ceased to function when a BitTorrent client created a large number of outgoing connections, overflowing the connection table at the router. The user could not easily associate this problem with BitTorrent since many other applications were also running.
   Fixes: (i) limit the number of connections that BitTorrent can make, if possible; (ii) terminate BitTorrent; or (iii) replace the router with one of larger capacity.

3. Reported problem: An end host could not connect to wired Internet service.
   Cause: The IP option in the host was incorrectly changed to "static". The user was not aware of this change and assumed that the Ethernet port was not working. The OS's built-in diagnosis tool did not run automatically.
   Fixes: switch the IP option to "dynamic".

4. Reported problem: A Web browser could not connect to any Web sites.
   Cause: A proxy server was configured in the browser, and this proxy was not working correctly. The user did not remember when and why this configuration was made.
   Fixes: (i) disable the proxy setting; or (ii) reconfigure the setting with the IP address of a working proxy server.

5. Reported problem: A client application that uses transport-layer port p1 failed to receive data from a server.
   Cause: The client machine was behind a router configured to forward packets on port p1 toward IP1. The client machine was initially assigned IP1, but DHCP assigned a different address, IP2, when the router was restarted.
   Fixes: (i) reconfigure the router so that it forwards packets on port p1 toward IP2; or (ii) replace the router with one that can statically assign IP1 to the client machine based on its MAC address.

6. Reported problem: An end host was unable to access the Internet through a router.
   Cause: The router allowed access only from machines registered by MAC address. The registration of the problematic end host was deleted, either mistakenly by the user or by an update of the router software. The OS's diagnostics tool only suggested rebooting the router and could not diagnose the problem further.
   Fixes: register the machine's MAC address in the router.

7. Reported problem: A remote-desktop client was unable to connect to a remote server.
   Cause: A firewall in the router blocked remote-desktop traffic. The user was not aware of this configuration change.
   Fixes: reconfigure the firewall so that it opens the remote-desktop port.

8. Reported problem: An end host was unable to access wireless service at a particular set of locations.
   Cause: The service provider blocked access from the host because it was not fully updated with security patches, or because it carried a virus that generated spam emails.
   Fixes: fully update the host or remove the malware, then ask the service provider to lift the access restriction.

9. Reported problem: A user had to configure the same wireless settings repeatedly since they were reset to default values on a regular basis.
   Cause: A third-party wireless configuration program kept rewriting wireless settings that had been configured through the OS's default configuration panel.
   Fixes: uninstall the third-party program (or replace it with one that does not conflict).

10. Reported problem: A network interface was turned off and could not be turned on.
    Cause: The interface drivers were corrupted by a bug in a software update.
    Fixes: uninstall and reinstall the interface.

11. Reported problem: A user was unable to connect to port 443 (HTTPS).
    Cause: The installation of a packet-monitoring program dropped all incoming packets from port 443.
    Fixes: uninstall the program (or reconfigure it so that it does not conflict).

12. Reported problem: Users spent excessive time (as much as 30 min) connecting to wireless service at public places.
    Cause: Different locations have inconsistent sets of procedures, which confused users. Some places have complex procedures, such as purchasing a card, signing up, logging in, and completing a survey. These procedures must also be repeated at places where credentials are temporary (e.g., a new passcode is issued every hour).
    Fixes: enforce consistent procedures (e.g., as a standard) and improve the usability of configuration interfaces.
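The signature-matching step of Section 5.3 (step (ii)) can be sketched as follows. The two signatures are hand-written illustrations loosely modeled on problems 1 and 3 of Table 3; they are not the learned decision-tree rules from the paper, and the log field names are hypothetical.

```python
# Step (ii) of Section 5.3: match log records against failure signatures.
# These predicates are illustrative stand-ins for what a decision tree
# would produce; the log fields ('app', 'tcp_rst', etc.) are hypothetical.

SIGNATURES = {
    'smtp-policy-mismatch':
        lambda log: log.get('app') == 'smtp' and log.get('tcp_rst', False),
    'static-ip-misconfig':
        lambda log: log.get('ip_option') == 'static' and not log.get('dhcp_lease'),
}

def classify(log):
    """Return the first matching problem type, or 'unknown' (as the paper
    does for the roughly 35% of records no signature covers)."""
    for problem, predicate in SIGNATURES.items():
        if predicate(log):
            return problem
    return 'unknown'

labels = [classify({'app': 'smtp', 'tcp_rst': True}),   # 'smtp-policy-mismatch'
          classify({'app': 'http'})]                    # 'unknown'
```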

The usability of configuration interfaces and diagnostic tools lengthens the time to repair the problems. This is true for both expert and non-technical users, although the problem is slightly aggravated for non-technical users; one third of the 103 participants in our study do not know how to access their home routers to change configurations, nor do they know how to run command-line diagnostic tools such as ping and traceroute. We also find that this category includes more than a quarter of the problems that last between 10 min and 40 min (shown on the left of Fig. 4). These problems happen when users make a series of failed attempts to connect to wireless services. If we reduced these durations to 1 min for the most frequent traveler among the 103 participants, nearly 63 hours could have been saved over the 7-month period.

5.4. Popularity of destinations and their impact on measurement results

One aspect that differentiates our work from a large body of previous works [2,8,19,25,27,32,37,40] is that our measurements are based on real users' network access traces, rather than on artificially generated measurements to random destinations. Thus, we aim to answer the question of how much difference the use of real-user traces makes when analyzing the extent of network connectivity problems. To this end, we first examine the popularity of the destinations that the users access: (i) how often these destinations are accessed by a single user and (ii) how widely they are accessed across multiple users. Then, we demonstrate the impact of (i) and (ii) on estimating the extent of one representative problem, forwarding loops. We choose forwarding loops because considerable work has been devoted to studying them, which can serve as a reference for comparison.

Popularity distribution of destinations: For each destination d that is accessed by the participants, we count the number of packets exchanged with d. This count approximately represents the popularity of d relative to the other destinations. Fig. 7(a) presents this count for the 140K destinations accessed by a single user. The figure shows that the popularity distribution of destinations is long-tailed; fewer than 100 destinations are accessed frequently, and the other destinations are accessed only occasionally, most of them no more than a few times. For the other users, we observe that the popularity distribution is also long-tailed, and the total number of destinations varies between 10K and 180K. We then analyze whether the destinations accessed by a single user are also accessed by the other users. Fig. 7(b) presents the number of other users who access the same 140K destinations as shown in Fig. 7(a). The figure shows that the popular destinations of a single user are also popular across users (e.g., Facebook), whereas most of the unpopular destinations are accessed only by that single user. More precisely, 62.6% of the top 1% of destinations are also accessed by other users, sometimes by up to 35 other users. In contrast, 95% of the remaining destinations are accessed by only a single user. Our analysis indicates that, in reality, destinations vary greatly in their popularity, and this must be properly considered if we generate artificial traffic for measurements. In the next subsection, we further investigate how this popularity distribution affects measurement results.

Extent of forwarding loops: To quantify the differences between the use of real traffic traces and artificial traffic, we analyze the extent of forwarding loops in our logs and compare the result with those of previous works by Xia [19] and Paxson [40]. Forwarding loops have frequently been a subject of both empirical and theoretical work and have been presented as a serious threat [13,19,40]. To identify forwarding loops, we analyze traceroute results and look for sequences of IPs or domain names that continue to reappear. Since certain loops in traceroute output may not represent real loops, we take all of the precautionary measures taken in Paxson's and Xia's work. This also ensures that our method of counting forwarding loops is consistent with those used in the two previous works. For instance, we disregard the following two cases, as done by the previous works [19,40,42].

(1) We do not consider loops that have only '*' and '!' entries between two occurrences of the same address. We also do not consider loops in which a single address appears repeatedly in a row; a combination of NATs and firewalls can produce this type of traceroute result. (2) We also remove transient loops that end before one round of traceroute terminates. We focus on long-term persistent loops that occur as a result of configuration errors, rather than temporary loops that occur while routing converges.
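A minimal sketch of loop detection under these two precautions, assuming each traceroute round has already been parsed into a list of hop addresses ('*' marking a non-responding hop):

```python
def has_persistent_loop(rounds):
    """Detect a persistent forwarding loop across traceroute rounds.

    `rounds` is a list of traceroute runs, each a list of hop addresses,
    with '*' marking a non-responding hop. Following the precautions in
    the text: '*' hops are ignored, a single address repeating back to
    back (a possible NAT/firewall artifact) is not counted, and the same
    loop must appear in every round to be considered persistent rather
    than transient.
    """
    def loop_of(hops):
        hops = [h for h in hops if h != '*']
        first_seen = {}
        for i, h in enumerate(hops):
            if h in first_seen:
                segment = hops[first_seen[h]:i]
                if len(set(segment)) >= 2:  # at least two distinct addresses
                    return tuple(sorted(set(segment)))
            first_seen[h] = i
        return None

    loops = [loop_of(r) for r in rounds]
    return loops[0] is not None and all(l == loops[0] for l in loops)

persistent = has_persistent_loop([['a', 'b', 'c', 'd', 'c', 'd'],
                                  ['a', 'b', 'c', 'd', 'c']])
transient = has_persistent_loop([['a', 'b', 'c', 'd', 'c'],
                                 ['a', 'b', 'e', 'f']])
# persistent is True; transient is False
```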

[Fig. 7. Popularity distribution of destinations: (a) # of accesses per destination (one user) and (b) # of other users who access the same destinations.]

As a result of the analysis, we find that persistent forwarding loops cause slightly over 0.02% of the entire downtime that end users experience. Only a handful of destinations are associated with these forwarding loops. These results contrast with those of the previous works that show a prevalence of forwarding loops [19,40] (e.g., 2.47% of routable addresses contained persistent loops, as reported in Jan. 2007). We believe that this difference comes from the fact that we monitor destinations according to real user access patterns, rather than choosing destinations uniformly over a large pool. We further look into the destinations associated with the forwarding loops and find that they are not significant to users, for two reasons. First, all of these destinations lie on the far right side of Fig. 7: they are rarely accessed, and only by a single user. Second, in all of the forwarding loops that we observe, two different addresses repeat continuously. Xia et al. [19] conjecture that this is a signature of pull-up route misconfiguration, where loops occur when packets are sent to IPs that are not assigned to live machines. To further confirm this conjecture, we sent probe sets to the destinations after our 7-month measurement period. Although we did not observe forwarding loops, the destinations continued to fail to respond to our probes, and traceroutes to half of these destinations ended with the error ''destination host unreachable.'' In summary, forwarding loops are much less significant in our logs than in the previous works. We believe this difference arises because the previous works weigh all destinations equally, although the popularity distribution of destinations is not uniform but long-tailed. The results in the previous works are still significant, as they show the extent of the problems uniformly over the entire address space.
However, to interpret these results from the perspective of real end users, we need to properly weigh the results according to real users’ access patterns.
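The weighting argument can be illustrated numerically. The destination names and counts below are invented for illustration and are not the paper's data.

```python
def loop_prevalence(destinations, weighted):
    """Fraction of destinations behind a forwarding loop.

    `destinations` maps a destination to (access_count, has_loop).
    With weighted=False every destination counts equally, as in uniform
    address-space probing; with weighted=True each destination counts in
    proportion to how often users actually access it.
    """
    total = looped = 0
    for accesses, has_loop in destinations.values():
        weight = accesses if weighted else 1
        total += weight
        if has_loop:
            looped += weight
    return looped / total

# A long-tailed workload: popular destinations never loop; rare ones do.
dests = {'popular-a': (5000, False), 'popular-b': (300, False),
         'rare-a': (2, True), 'rare-b': (1, True)}
uniform_view = loop_prevalence(dests, weighted=False)   # 0.5 of destinations
user_view = loop_prevalence(dests, weighted=True)       # ~0.0006 of accesses
```

Under uniform weighting half the destinations appear affected, yet the access-weighted share, the one end users actually experience, is three orders of magnitude smaller.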

6. Discussion

Having analyzed the observed near-host problems in Sections 5.1-5.3, we now discuss suggestions for reducing them. We argue that the automation of configurations and problem diagnosis is important for the following reasons.

- First, a diverse set of potential causes exists for the same symptom, so even expert users often forget to assess all of the possibilities. For example, problems 6, 8, 9, and 10 in Table 3 all appear to be failures in accessing wireless service, but their causes vary widely, from conflicts among applications to inconsistent configurations in networks and hosts.
- Second, many configuration changes are made not by the users but by applications, systems, and network administrators. Users therefore often assume, falsely, that the configurations remain as previously set, and consequently spend a large amount of time troubleshooting. For example, problem 3 happened to many expert users, but they did not suspect that their IP options could be incorrect, since these options had not been changed by the users themselves. Instead, the users assumed that their Ethernet ports were having problems.
- Third, the problems presented in this paper continue to occur repeatedly. Applications and operating systems are updated, incorrectly overwriting existing configurations or conflicting with other applications. Network policies evolve, creating inconsistencies with host configurations. Users also move around networks with different policies.
- Lastly, in many cases, existing fault-diagnostic tools did not provide the users with sufficient information to solve the problems, across all three classes shown in Fig. 6. For example, in one failure event, a host failed to receive an IP address from a router because its MAC address was not registered at this router; the diagnostic tool in Windows 7 merely suggested rebooting the router (problem 6 in Table 3).

Multiple diagnostic systems have been proposed. These systems are extremely useful for detecting certain types of problems, but no single approach can cover all of the various types of failures that we identified; a hybrid approach would probably perform better. We classify the diagnostic systems into three groups according to their detection methods and collaboration models, which determine the coverage and accuracy of fault diagnosis. In addition to the classification, we also provide suggestions for improving the systems.

Fault diagnosis with network-wide knowledge: Many of the connectivity problems that we found were caused by mismatches between host configurations and network policies (problems 1-8 in Table 3).
Resolving these problems requires (i) the capability to observe or infer network policies, in addition to monitoring one host alone (for resolving all of problems 1-8), and (ii) the capability to automatically reconfigure network devices if the identified root causes reside in these devices' configurations (for resolving problems 5-8). Diagnostic systems such as NetMedic [34] and Sherlock [28] pinpoint the origins of problems based on network-wide knowledge. These systems build a graph of dependencies among different components in a network and then use this graph to trace the origins. The systems assume that the configurations and status of various nodes can be readily obtained, which may be possible only in corporate environments where both routers and end hosts are managed by network administrators. A more viable solution, and, we believe, the more important one, is to design network management protocols that allow end hosts to communicate with network devices and to share information for diagnosing problems. Using these protocols, applications can inform end hosts of mismatches and guide them toward correct configurations. The protocols can also reconfigure both network devices and end hosts according to the problem diagnosis. The protocols can either be embedded in existing protocols or be developed separately. This automation should also apply where new features are added.

This solution could be used, for example, with DHCP, which is designed to supplement static configuration of IPs but also adds another step to configure: whether to use DHCP or to configure IPs statically. An incorrect choice of this option leads to connectivity loss, as we observe in our study (problem 3 in Table 3). To automate this step, network protocols can inform the IP-configuration utility of the option used in the network so that the utility chooses the correct option.

Fault diagnosis with collaboration among end hosts: When obtaining network-wide knowledge is not possible, the next option is to infer network policies and states. One inference method requires collaboration among end hosts. NetPrints [7] builds a knowledge base in which multiple hosts in different networks share the problems they experience as well as the solutions they identify. This system can provide solutions when a problem appears commonly in many different hosts. However, a significant fraction of the observed failures are specific to a subnet, host, or application, so finding a common solution may not always be possible. In this regard, one approach for future investigation is to (i) automatically identify a host h that does not experience the same problem (e.g., by replicating the same failed connection attempts at h), and (ii) infer the sources of the problem from the differences between the configurations of h and the problematic host. For example, h1 may experience a problem accessing a Web site, whereas h2 does not. By comparing the Web-related configurations of h1 and h2, we might identify that an erroneous add-on is installed in h1's browser, whereas h2 does not have this add-on.
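The configuration-comparison idea for hosts h1 and h2 can be sketched as follows, assuming configurations are flattened into key-value dictionaries; the setting names are hypothetical, not a real browser's schema.

```python
def config_diff(problem_host, healthy_host):
    """Return the settings that differ between a failing and a working host.

    Both arguments are flat {setting: value} dictionaries; differing
    entries are candidate root causes for the failing host's problem.
    (The setting names used below are hypothetical.)
    """
    keys = set(problem_host) | set(healthy_host)
    return {k: (problem_host.get(k), healthy_host.get(k))
            for k in keys if problem_host.get(k) != healthy_host.get(k)}

h1 = {'proxy': '10.0.0.9:8080', 'addons': ('adblock', 'badbar'), 'dns': 'auto'}
h2 = {'proxy': None, 'addons': ('adblock',), 'dns': 'auto'}
suspects = config_diff(h1, h2)
# 'proxy' and 'addons' differ and become suspects; 'dns' is ruled out
```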
This collaboration is particularly useful when a range of possible configuration options exists but the current network policies and states cannot easily be inferred by a single host alone (e.g., a service provider may use several atypical network ports for email, as shown in problem 1). Also, trying all possible options is cumbersome, and users often forget to do so.

Fault diagnosis by a single end host: When collaboration among end hosts is not enabled, each host can infer network policies and states by testing different configuration options within the host and then watching whether connectivity is restored. Among the failures caused by interactions between end hosts and access networks (e.g., problems 1-8 in Table 3), this single-host inference can efficiently resolve cases where a limited set of configuration options exists (e.g., ''static IP'' vs. ''dynamic IP'' in problem 3, and ''disable proxy server'' in problem 4). Most of the existing systems in this group depend on a set of predefined rules, so a new rule must be added to examine each new type of problem. Netalyzr [8] and SNIPS [36] scan a list of network ports and use this information to determine whether the most widely used network services (e.g., DNS and SMTP) are running and whether these services are allowed through a firewall. In addition to analyzing network services, single-host diagnosis can resolve many problems that are contained within the single host (e.g., problems 9-11). For example, Windows' Network Diagnostics Framework (NDF) examines the accuracy of configurations in end hosts (e.g., the NIC driver corruption in problem 10). This system may also detect and resolve problems related to the peculiarities of applications (e.g., problems 9 and 11), if new rules are added and if the system can trace all applications responsible for configuration changes.
Additional considerations: When errors are detected and solutions are identified, we can then repair the errors. This must be performed carefully, since complex dependencies exist among configurations and applications, and one configuration change can break seemingly unrelated applications. We also find it worthwhile to target problems related to accessing wireless service, since these comprise over 30% of near-host downtime. Wireless problems differ from those in wired networks, since wireless has its own medium and protocols. Some of the existing solutions [3,4] provide statistics that must be further analyzed to identify problems (e.g., the number of dropped packets at an AP). Other solutions localize the sources of connectivity problems, but at a coarse level of granularity, such as client, wireless medium, and AP [1]. More work needs to be done to accurately pinpoint the sources of these failures.
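Returning to the single-host inference strategy described above (testing a limited set of configuration options and probing for restored connectivity), a minimal sketch with placeholder hooks; `apply_option` and `probe` stand in for host-specific actions and are assumptions of this sketch, not real OS APIs.

```python
def restore_connectivity(apply_option, probe, options):
    """Try each configuration option in turn until a probe succeeds.

    `apply_option` applies a candidate setting and `probe` tests
    connectivity (both are placeholder hooks, not real OS APIs).
    Returns the option that restored connectivity, or None.
    """
    for option in options:
        apply_option(option)
        if probe():
            return option
    return None

# Toy hooks modeled on problem 3: only 'dynamic' makes the probe succeed.
state = {'ip': 'static'}
fixed = restore_connectivity(
    apply_option=lambda opt: state.update(ip=opt),
    probe=lambda: state['ip'] == 'dynamic',
    options=['static', 'dynamic'],
)
# fixed == 'dynamic'
```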

7. Conclusion

We analyze the network connection failures that end users experience, through a combination of passive and active measurements. We discover that more than 60% of downtime is caused near end hosts (e.g., by end-system misconfigurations and software failures), rather than in core networks and servers. We also find that certain routing anomalies in core networks are much less likely to happen than previously reported (e.g., persistent forwarding loops are responsible for only 0.02% of the downtime). These results do not mean that problems in core networks and servers do not have high impact. Rather, they show that problems near end hosts disturb users more frequently, and thus their accumulated impact can be much larger. To reduce the likelihood of these problems, we argue for better protocol design and fault diagnosis near end hosts. We are currently extending the analysis system for troubleshooting problems near end hosts and increasing the number of users. This system will be tested in enterprise and academic networks. The system detects failures based on the failure signatures that we identified (Section 5.3). We also find that correlating connection failures with the applications responsible could help in classifying applications for various purposes, such as anomaly detection, rate-limiting, and billing. For example, we find interesting failure patterns related to high-impact applications, such as p2p applications and DNS. We discover that BitTorrent clients often create an excessive number of concurrent connections with peers, causing low-end routers to crash when their connection tables overflow. We also find that certain malware redirects DNS requests to a fake server that always returns invalid IPs in the range of 169.xxx.xxx.xxx, causing the subsequent connections toward these IPs to fail.
These patterns help identify the associated applications and thus take appropriate actions, such as dropping or rate-limiting the traffic.

Acknowledgments We thank all of the participants of the analysis for their time and valuable suggestions. This work was supported by a research grant from Seoul Women’s University (2012).

References

[1] A. Adya, P. Bahl, R. Chandra, L. Qiu, Eden: architecture and techniques for diagnosing faults in IEEE 802.11 infrastructure networks, in: Proceedings of the ACM MOBICOM, 2004.
[2] A. Akella, S. Seshan, A. Shaikh, An empirical evaluation of wide-area Internet bottlenecks, in: Proceedings of the Internet Measurement Conference (IMC), October 2003.
[3] AirCheck Wi-Fi Tester, February 2011.
[4] AirWave Management Platform, February 2011.
[5] A. Markopoulou, G. Iannaccone, S. Bhattacharyya, C.-N. Chuah, Y. Ganjali, C. Diot, Characterization of failures in an operational IP backbone network, IEEE/ACM ToN 16 (4) (2008) 2307–2317.
[6] A. Wool, A quantitative study of firewall configuration errors, IEEE Computer 37 (6) (2004) 62–67.
[7] B. Aggarwal, R. Bhagwan, T. Das, S. Eswaran, V.N. Padmanabhan, G.M. Voelker, NetPrints: diagnosing home network misconfigurations using shared knowledge, in: Proceedings of the USENIX NSDI, 2009.
[8] C. Kreibich, N. Weaver, B. Nechaev, V. Paxson, Netalyzr: illuminating the edge network, in: Proceedings of the IMC, 2010.
[9] D. Oppenheimer, A. Ganapathi, D. Patterson, Why do Internet services fail, and what can be done about it?, in: Proceedings of the USITS, 2003.
[10] D. Joumblatt, R. Teixeira, ConnectionWatch: passive monitoring of round-trip times at end-hosts, in: Proceedings of the ACM CoNEXT Student Workshop, 2008.
[11] D. Joumblatt, R. Teixeira, J. Chandrashekar, N. Taft, HostView: annotating end-host performance measurements with user feedback, in: Proceedings of the ACM HotMetrics, June 2010.
[12] E. Katz-Bassett, H.V. Madhyastha, J.P. John, A. Krishnamurthy, D. Wetherall, T. Anderson, Studying black holes in the Internet with Hubble, in: Proceedings of the USENIX NSDI, 2008.
[13] F. Le, G.G. Xie, H. Zhang, Understanding route redistribution, in: Proceedings of the IEEE ICNP, 2007.
[14] F. Le, S. Lee, T. Wong, H.S. Kim, D. Newcomb, Detecting network-wide and router-specific misconfigurations through data mining, IEEE/ACM ToN 17 (1) (2009) 66–79.
[15] F. Giroire, J. Chandrashekar, G. Iannaccone, K. Papagiannaki, E.M. Schooler, N. Taft, The cubicle vs. the coffee shop: behavioral modes in enterprise end-users, in: Proceedings of the PAM, 2008.
[16] G. Iannaccone, C. Chuah, R. Mortier, S. Bhattacharyya, C. Diot, Analysis of link failures in an IP backbone, in: Proceedings of the ACM Internet Measurement Workshop, November 2002.
[17] G. Maier, A. Feldmann, V. Paxson, M. Allman, On dominant characteristics of residential broadband Internet traffic, in: Proceedings of the ACM IMC, 2009.
[18] H.V. Madhyastha, T. Isdal, M. Piatek, C. Dixon, T. Anderson, A. Krishnamurthy, A. Venkataramani, iPlane: an information plane for distributed services, in: Proceedings of the USENIX OSDI, 2006.
[19] J. Xia, L. Gao, T. Fei, A measurement study of persistent forwarding loops on the Internet, Elsevier Computer Networks 51 (17) (2007) 4780–4796.
[20] J. Xu, J. Fan, M.H. Ammar, S.B. Moon, Prefix-preserving IP address anonymization: measurement-based security evaluation and a new cryptography-based scheme, in: Proceedings of the IEEE ICNP, 2002.
[21] K. Chen, C. Tu, W. Xiao, OneClick: a framework for measuring network quality of experience, in: Proceedings of the IEEE INFOCOM, April 2009.
[22] L. Gao, J. Rexford, Stable Internet routing without global coordination, IEEE/ACM ToN 9 (6) (2001) 681–692.
[23] M.A. Brown, T. Underwood, E. Zmijewski, The day the YouTube died, NANOG 43, June 2008.
[24] M. Dahlin, B. Baddepudi, V. Chandra, L. Gao, A. Nayate, End-to-end WAN service availability, IEEE/ACM ToN 11 (2) (2003) 300–313.
[25] M. Dischinger, A. Haeberlen, K.P. Gummadi, S. Saroiu, Characterizing residential broadband networks, in: Proceedings of the ACM IMC, October 2007.
[26] N. Feamster, H. Balakrishnan, Detecting BGP configuration faults with static analysis, in: Proceedings of the USENIX NSDI, May 2005.
[27] N. Hu, L. Li, Z.M. Mao, P. Steenkiste, J. Wang, Locating Internet bottlenecks: algorithms, measurements, and implications, in: Proceedings of the ACM SIGCOMM, August 2004.
[28] P. Bahl, R. Chandra, A. Greenberg, S. Kandula, D.A. Maltz, M. Zhang, Towards highly reliable enterprise network services via inference of multi-level dependencies, in: Proceedings of the ACM SIGCOMM, August 2007.
[29] P.B. Godfrey, S. Shenker, I. Stoica, Pathlet routing, in: Proceedings of the ACM SIGCOMM, 2009.
[30] Pcapr, February 2011.
[31] R. Mahajan, D. Wetherall, T. Anderson, Understanding BGP misconfigurations, in: Proceedings of the ACM SIGCOMM, August 2002.
[32] R. Mahajan, N. Spring, D. Wetherall, T. Anderson, User-level Internet path diagnosis, in: Proceedings of the ACM SOSP, October 2003.
[33] R. Wolff, Poisson arrivals see time averages, Operations Research 30 (2) (1982) 223–231.
[34] S. Kandula, R. Mahajan, P. Verkaik, S. Agarwal, J. Padhye, P. Bahl, Detailed diagnosis in enterprise networks, in: Proceedings of the ACM SIGCOMM, August 2009.
[35] S. Lee, T. Wong, H.S. Kim, NetPiler: detection of ineffective router configurations, IEEE JSAC Special Issue on Network Infrastructure Configuration 27 (3) (2009) 291–301.
[36] System & Network Integrated Polling Software (SNIPS), February 2011. http://netplex-tech.com/snips.
[37] S. Savage, A. Collins, E. Hoffman, J. Snell, T. Anderson, The end-to-end effects of Internet path selection, in: Proceedings of the ACM SIGCOMM, September 1999.
[38] S. Shakkottai, R. Srikant, N. Brownlee, A. Broido, kc claffy, The RTT distribution of TCP flows in the Internet and its impact on TCP-based flow control, CAIDA Technical Report tr-2004-02, January 2004.
[39] T.G. Griffin, F.B. Shepherd, G. Wilfong, The stable paths problem and interdomain routing, IEEE/ACM ToN 10 (2) (2002) 232–243.
[40] V.

[26] N. Feamster, H. Balakrishnan, Detecting BGP configuration with static analysis, in: Proceedings of the USENIX NSDI, May 2005. [27] N. Hu, L. Li, Z. M. Mao, P. Steenkiste, J. Wang, Locating Internet bottlenecks: algorithms, measurements, and implications, in: Proceedings of the ACM SIGCOMM, August 2004. [28] P. Bahl, R. Chandra, A. Greenberg, S. Kandula, D.A. Maltz, M. Zhang, Towards highly reliable enterprise network services via inference of multi-level dependencies, in: Proceedings of the ACM SIGCOMM, August 2007. [29] P.B. Godfrey, S. Shenker, I. Stoica, Pathlet routing, in: Proceedings of the ACM SIGCOMM, 2009. [30] Pcapr, Feb. 2011. . [31] R. Mahajan, D. Wetherall, T. Anderson, Understanding BGP misconfigurations, in: Proceedings of the SIGCOMM, August 2002. [32] R. Mahajan, N. Spring, D. Wetherall, T. Anderson, User-level Internet path diagnosis, in Proceedings of the ACM SOSP, October 2003. [33] R. Wolff, Poisson arrivals see time averages, Operations Research 30 (2) (1982) 223–231. [34] S. Kandula, R. Mahajan, P. Verkaik, S. Agarwal, J. Padhye, P. Bahl, Detailed diagnosis in enterprise networks, in: Proceedings of the ACM SIGCOMM, August 2009. [35] S. Lee, T. Wong, H.S. Kim, NetPiler: detection of ineffective router configurations, IEEE JSAC Special Issue on Network Infrastructure Configuration 27 (3) (2009) 291–301. [36] System & Network Integrated Polling Software (SNIPS), February 2011. http://netplex-tech.com/snips. [37] S. Savage, A. Collins, E. Hoffman, J. Snell, T. Anderson, The end-to-end effects of internet path selection, in: Proceedings of the ACM SIGCOMM, September 1999. [38] S. Shakkottai, R. Srikant, N. Brownlee, A. Broido, kc claffy, The rtt distribution of TCP flows in the Internet and its impact on TCP-based flow control, CAIDA Technical Report tr-2004-02, January 2004. [39] T.G. Griffin, F.B. Shepherd, G. Wilfong, The stable paths problem and interdomain routing, IEEE/ACM ToN 10 (2) (2002) 232–243. [40] V. 
Paxson, End-to-end routing behavior in the Internet, IEEE/ACM ToN 5 (5) (1997) 601–615. [41] W. Xu, J. Rexford, MIRO: multi-path interdomain routing, in: Proceedings of the ACM SIGCOMM, September 2006. [42] Z.M. Mao, J. Rexford, J. Wang, R.H. Katz, Towards an accurate ASlevel traceroute tool, in: Proceedings of the ACM SIGCOMM, 2003. [43] S. Lee, Privacy notes for data collection program, February 2011. .

Sihyung Lee received the B.S. (summa cum laude) and M.S. degrees in Electrical Engineering from the Korea Advanced Institute of Science and Technology (KAIST) in 2000 and 2004, respectively, and the Ph.D. degree in Electrical and Computer Engineering from Carnegie Mellon University (CMU) in 2010. He then worked at the IBM T.J. Watson Research Center as a post-doctoral researcher. He is currently an assistant professor in the Department of Information Security at Seoul Women’s University. His research interests include the management of large-scale network configurations and sentiment pattern mining from social network traffic.

Hyong S. Kim received the B.Eng. (Hons) degree in Electrical Engineering from McGill University in 1984, and the M.A.Sc. and Ph.D. degrees in Electrical Engineering from the University of Toronto, Toronto, Canada, in 1987 and 1990, respectively. He has been with Carnegie Mellon University (CMU), Pittsburgh, PA, since 1990, where he is currently the Drew D. Perkins Chaired Professor of Electrical and Computer Engineering. His primary research areas are advanced switching architectures; fault-tolerant, reliable, and secure network architectures; and network management and control.
