FACT: A Framework for Authentication in Cloud-based ...


FACT: A Framework for Authentication in Cloud-based IP Traceback∗

Long Cheng†, Dinil Mon Divakaran†, Aloysius Wooi Kiak Ang‡, Wee Yong Lim†, Vrizlynn L. L. Thing†
† Cyber Security & Intelligence Department, Institute for Infocomm Research (I2R), Singapore
‡ Department of Electrical and Computer Engineering, National University of Singapore
{chengl, divakarand, weylim, vriz}@i2r.a-star.edu.sg, [email protected]

Abstract: IP traceback plays an important role in cyber investigation processes, where the sources and the traversed paths of packets need to be identified. It has a wide range of applications, including network forensics, security auditing, network fault diagnosis, and performance testing. Despite a plethora of research on IP traceback, the Internet is yet to see a large-scale practical deployment of traceback. Some of the major challenges that still impede an Internet-scale traceback solution are the concern of disclosing ISPs’ internal network topologies (in other words, the concern of privacy leakage), poor incremental deployability, and the lack of incentives for ISPs to provide traceback services. In this work, we argue that cloud services offer better options for practical deployment of an IP traceback system. We first present a novel cloud-based traceback architecture, which possesses several favorable properties that encourage ISPs to deploy traceback services on their networks. While this makes the traceback service more accessible, regulating access to the traceback service in a cloud-based architecture becomes an important issue. Consequently, we address the access control problem in cloud-based traceback. Our design objective is to prevent illegitimate users from requesting traceback information with malicious intent (such as ISP topology discovery). To this end, we propose a temporal token-based authentication framework, called FACT, for authenticating traceback service queries. FACT embeds temporal access tokens in traffic flows, and then delivers them to end-hosts in an efficient manner. The proposed solution ensures that the entity requesting traceback service is an actual recipient of the packets to be traced. Finally, we analyze and validate the proposed design using real-world Internet datasets.

Index Terms: IP Traceback; Access Control; Authentication; Cloud-based Traceback

I. INTRODUCTION

IP traceback is an effective solution to identify the sources of packets as well as the paths taken by the packets. It is mainly motivated by the need to trace back network intruders or attackers with spoofed IP addresses, for attribution as well as attack defense and mitigation. For example, traceback is useful in defending against Internet DDoS attacks [1]. It also assists in mitigating attack effects [2]; DoS attacks, for instance, can be mitigated if they are first detected, then traced back to their origins, and finally blocked at entry points. In addition, IP traceback can be used for a wide range of practical applications, including network forensics, security auditing, network fault diagnosis, performance testing, and path validation [3], [4].

∗ This material is based on research work supported by Singapore National Research Foundation under NCR Award No. NRF2014NCR-NCR001-034.

While many different IP traceback approaches have been proposed, none of them has achieved universal acceptance or practical deployment. The risk of leaking network topology information ranks as the major challenge hindering the acceptance of traceback techniques. ISPs (Internet Service Providers) are normally reluctant to allow any external party to gain visibility into their internal structure, since such exposure not only leaks sensitive information to their competitors [5], but also makes their networks vulnerable to attacks. For example, an adversary may misuse traceback services to reconstruct an ISP’s network topology [6]. As a result, ISPs will not wish to participate if the deployment of traceback could leak any sensitive information. Incremental deployability is another important factor for a viable IP traceback solution; it is unrealistic to expect all ISPs to deploy IP traceback services in their networks at the same time [7]. Unfortunately, existing IP traceback mechanisms neither guarantee privacy nor adequately support incremental deployment. Besides technical shortcomings, economic inefficiency, such as the lack of financial incentive for ISPs, also hinders the practical deployment of existing traceback solutions.

The advent of cloud services, however, offers an appealing new option to support IP traceback service over the Internet. It provides an opportunity to design a traceback system that is incrementally deployable. Cloud storage also increases the feasibility of logging traffic digests for forensic traceback. With a proper access control mechanism, cloud-based traceback can alleviate ISPs’ privacy concerns about disclosing their internal network topologies. In addition, the pay-per-use nature of cloud service provides incentives that encourage ISPs to deploy traceback service in their networks. Consequently, migrating traditional traceback solutions to the cloud becomes a natural choice.
In this work, we first present a novel cloud-based traceback architecture, which exploits increasingly available cloud infrastructures for logging traffic digests, in order to implement forensic traceback. Such cloud-based traceback simplifies traceback processing and makes the traceback service more accessible. It not only possesses privacy-preserving and incremental deployment properties, but also increases robustness against attacks and offers strong financial motivation. Yet, regulating access to the cloud-based traceback service becomes an important problem. In this paper, we also address the access control problem in the cloud-based traceback architecture. To this end, we propose a framework for authentication in cloud-based IP traceback, named FACT, which enhances traditional authentication protocols such as the password-based scheme in cloud-based traceback. Our key idea is to embed temporal (time-based) access tokens in traffic flows and then deliver them to end-hosts in an efficient manner. The proposed method not only ensures that the user (or entity) requesting traceback service is an actual recipient of the packets to be traced, but also adapts well to the limited marking space in the IP header. Evaluation studies using real-world Internet traffic datasets demonstrate the feasibility and effectiveness of our proposed FACT traceback authentication scheme.

The rest of the paper is organized as follows. We begin by reviewing existing works from the perspective of IP traceback system architecture in Section II. We describe the novel cloud-based traceback architecture in Section III. Section IV presents the proposed authentication framework for cloud-based IP traceback. We present and discuss the results of performance evaluation in Section V.

II. RELATED WORK

In this section, we provide a new taxonomy to classify existing IP traceback works based on system architecture, and subsequently motivate the need for a new traceback architecture. This taxonomy will enable us to identify the fundamental reasons hindering the practical deployment of traceback techniques.

A. Classification of IP Traceback System Architectures

The majority of research efforts on traceback can be broadly classified into three categories: 1) end-host centric marking, 2) distributed logging, and 3) overlay-based logging. We briefly survey the related works accordingly.

1) End-host Centric Marking: A large number of existing contributions in IP traceback focus on packet marking-based traceback [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19]. As shown in Fig. 1, in marking-based traceback, routers add packet-tracing information (e.g., router identity) into IP headers to help end-hosts trace packets with spoofed source addresses. When a sufficient number of packets have been received at the end-host (victim), the path that the flow of packets has traversed can be reconstructed by the end-host. For the example in Fig. 1, after receiving several packets, the victim knows that the routing path [R1 ⇒ R2 ⇒ R3 ⇒ R5] was followed by the flow of packets of interest. The figure shows router R4 as a legacy router, which does not participate in the marking process; the end-host therefore skips R4 when tracing the source of the packets. Essentially, marking-based traceback works fine under partial traceback deployment. From the perspective of system architecture, we term such traceback solutions end-host centric marking, because the traceback procedure (i.e., path reconstruction) is conducted purely at the end-host. Marking-based traceback was considered to be a promising approach to realize IP traceback, since it imposes relatively little computational and storage overhead on routers [5]. However, end-host centric marking has several shortcomings. First, it incurs a heavy burden at end-hosts by requiring them to log the received packet-tracing information and then to reconstruct

network paths [20]. Second, marking-based traceback has the risk of disclosing ISPs’ topology information to external entities. End-hosts can potentially reconstruct upstream router maps of the network from received packets with marking values [6]. Third, this approach is also vulnerable to compromised routers. A downstream router can erase all marking information from upstream routers, leading to a dysfunctional traceback mechanism. Similarly, a compromised router can forge other routers’ markings and inject misleading information to confuse the end-host during path reconstruction [21]. Fourth, end-host centric marking offers no incentive to early traceback adopters, since an ISP that deploys traceback service does not benefit its own customers; most likely, it protects customers of other ISPs. A lack of interoperability is another challenge when deploying marking-based traceback. For example, if different ISPs adopt heterogeneous message content encoding schemes, end-hosts must be able to decode the markings of all these ISPs. Finally, the marking space in the IP header is rather limited, which poses a challenge for marking a large traceback message, such as a router identity with authentication [10].

Fig. 1: End-host centric marking for IP traceback

2) Distributed Logging: Orthogonal to the packet marking scheme, logging-based traceback involves storing packet digests (e.g., hashed values of packets [22]) at intermediate routers on the path toward end-hosts. This is illustrated in Fig. 2. The traceback procedure is initiated by the victim host when it sends a query to the last-hop router. During the process, upstream routers are sequentially queried hop-by-hop, in a reverse-path flooding manner, in order to reconstruct the attack path [23]. We group such traceback schemes under distributed logging traceback, since the network path is reconstructed in a distributed manner (i.e., without a central point).
Performing packet-level logging [22], however, requires significant storage space and incurs high processing overhead at intermediate routers. In other words, logging-based traceback requires infrastructure support. To reduce the required log storage, researchers have proposed to sample traffic flows, rather than recording individual packet digests [24]. For instance, the solution proposed in [25] samples and logs only around 3.3% of packets. Since traffic information is logged locally on individual routers, distributed logging has better privacy-preserving properties than the marking-based approach. Nevertheless, it is still vulnerable to information leakage in the event that adversaries misuse the traceback technique for ISP topology discovery. Another drawback of distributed logging is its poor support for partial deployment. Fig. 2 illustrates this with an example in which a traceback query is obstructed. When R5 queries a legacy router R4 along the reverse path of attack packets, the traceback process reaches an impasse at R4, since R4 is not traceback-enabled.
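To make the partial-deployment problem concrete, the following minimal Python sketch models hop-by-hop querying over logged digests. It is an illustration only, not the scheme of [22] or [23]: the `Router` class, the `trace` and `build_path` functions, and the use of a plain set of SHA-256 digests are assumptions chosen for brevity, and the five-router chain simply mirrors the R1 to R5 path of Fig. 2.

```python
import hashlib


def digest(packet: bytes) -> str:
    """Digest stored in place of the packet (SHA-256 is an assumed choice)."""
    return hashlib.sha256(packet).hexdigest()


class Router:
    """Hypothetical router model; legacy routers keep no digest log."""

    def __init__(self, name, upstream=None, traceback_enabled=True):
        self.name = name
        self.upstream = upstream            # previous hop toward the source
        self.traceback_enabled = traceback_enabled
        self.log = set()                    # digests of forwarded packets

    def forward(self, packet: bytes) -> None:
        if self.traceback_enabled:
            self.log.add(digest(packet))


def trace(last_hop, packet: bytes):
    """Walk the reverse path hop by hop and return the routers that confirm
    the packet. The walk stops at the first router that cannot answer."""
    confirmed = []
    router = last_hop
    while router is not None:
        if not router.traceback_enabled or digest(packet) not in router.log:
            break                           # impasse: legacy router or no record
        confirmed.append(router.name)
        router = router.upstream
    return confirmed


def build_path(legacy=()):
    """Build the chain R1 -> ... -> R5 and return the routers by name."""
    routers, upstream = {}, None
    for name in ["R1", "R2", "R3", "R4", "R5"]:
        upstream = Router(name, upstream, traceback_enabled=name not in legacy)
        routers[name] = upstream
    return routers


packet = b"attack packet"
partial = build_path(legacy={"R4"})
for r in partial.values():
    r.forward(packet)
print(trace(partial["R5"], packet))   # ['R5']
```

With R4 marked as legacy, the walk confirms only R5 and never reaches R1 to R3, even though those routers hold matching digests.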

TABLE I: Comparison of existing IP traceback solutions according to different system architectures

Architecture: End-host Centric Marking
Advantages: Little computational and storage overhead on routers; less infrastructure support
Disadvantages: Long detection cycle; high computational and storage overhead on end-hosts; vulnerable to compromised routers; serious privacy concerns; lack of economic incentive

Architecture: Distributed Logging
Advantages: Incur small overhead at end-hosts; support single-packet traceback and forensic investigations
Disadvantages: Significant storage space requirement at routers; poor incremental deployability; vulnerable to compromised routers; need infrastructure support; lack of economic incentive

Architecture: Overlay-based Logging
Advantages: Improved incremental deployability; support heterogeneous logging techniques and forensic investigations
Disadvantages: Inefficient traceback processing; potential information leakage vulnerability; need infrastructure support; lack of economic incentive

Fig. 2: Distributed logging for IP traceback

The same problem is encountered if any router on the attack path is compromised. Hence, this approach is also vulnerable to attacks.

3) Overlay-based Logging: The overlay-based traceback architecture has been proposed to address the aforementioned partial deployment issue in distributed logging. The authors of [26] proposed a logging-based traceback solution for AS (Autonomous System)-level partial deployment scenarios, where all traceback-deployed ASes exchange deployment information with each other. As such, any AS is aware of the traceback deployment information of all other ASes. For the example in Fig. 3, AS7 knows that AS3 and AS6 are its one-hop neighbors, and that its two-hop neighbors include AS1, AS2 and AS4. Upon receiving a traceback request from the victim, AS7 (being the last-hop AS) will first send queries to its one-hop traceback-deployed AS neighbors, i.e., AS3 and AS6. If the attack path cannot be reconstructed, it sends queries to its two-hop traceback-deployed AS neighbors, and so on. Evidently, such a flooding-based traceback process suffers from high communication overhead and low scalability.

Recently, the authors of [27] proposed SampleTrace, an incrementally deployable flow-based traceback scheme. Unlike prior methods that use hash-based techniques for logging [22], [24], [25], SampleTrace exploits existing xFlow (sFlow, NetFlow and IPFIX) functions to implement traceback, which increases the feasibility of practical deployment. In SampleTrace, each traceback-deployed AS has a traceback server, which exposes the traceback functionality to other ASes, end-users or IDSs (intrusion detection systems). An AS-level overlay network is built among all traceback-deployed ASes. As a result, attacking flows can be traced back by hop-by-hop flooding to upstream neighboring ASes in the overlay network. However, flooding-based querying remains an inefficient approach for traceback.

Fig. 3: Overlay-based logging for IP traceback

B. The Need for a New System Architecture for IP Traceback

Table I shows a summary of the advantages and disadvantages of the different IP traceback system architectures. As the comparative summary shows, none of the existing traceback solutions fully provides satisfactory properties favouring traceback deployment by ISPs. End-host centric marking poses inherent privacy issues for ISPs and has several other technical problems. Poor incremental deployability and high resource requirements at individual routers are intrinsic problems in distributed logging. Overlay-based logging improves incremental deployability, but still suffers from problems such as high communication overhead in traceback processing.

Certainly, there are also other traceback approaches that do not fit into this classification. For example, ICMP messages can be generated for traceback purposes [28], or, as suggested in [29], ICMP error messages can be useful in detecting sources of spoofed IPs. While the former generates additional traffic, the latter is a passive approach dependent on path scatters. More recently, packet traceback for Software-Defined Networks (SDN) has been proposed [30], [31]. Zhang et al. [31] proposed to use packet-processing policies from higher-level SDN controllers to derive how a packet reached its current location, without the need for marking or logging.

In addition to technical shortcomings, the lack of financial motivation for ISPs to deploy anti-spoofing mechanisms [32] is another reason why IP traceback is still an open and challenging problem despite much research. To address this issue, Gong et al. [5] proposed to restrict packet marking information to paying customers only, based on a subscription charging model. That is, each AS that deploys the traceback service charges a fee to its customers (networks or end users) who are interested in accessing the service. Thus, only paying customers can get the marking information. However, a pay-as-you-go charging model would be more attractive to users, because in many instances customers only need traceback services after they have been attacked. Due to these limitations in traditional traceback systems, we are in need of a new traceback system architecture, such as the cloud-based traceback presented in the next section.

III. SYSTEM ARCHITECTURE

A. Motivations

1) Exploiting cloud infrastructures for forensic traceback: The storage requirement was considered the main limiting factor for logging-based traceback [5]. Over time, however, technological advances have increased the feasibility of logging-based solutions. With the advancement of distributed file systems, ISPs have started to offer cloud storage services, so traceback logs can be stored and managed in local ISPs’ data centers. In traditional logging-based traceback [22], [25], [26], traffic digests are assumed to be stored at local routers for some period of time, which is greatly constrained by the limited storage capacity. Consequently, traceback must be initiated before the corresponding log tables are overwritten. In cloud-based traceback, the storage available for traceback logs is higher by multiple orders of magnitude than in traditional logging-based traceback systems. In addition, the pay-per-use nature of cloud service encourages network providers to deploy the traceback service. Migrating the logging-based traceback solution to a cloud computing environment is therefore not only technically sound but also economically preferable. This motivates us to exploit the increasingly available cloud infrastructures for logging traffic digests for forensic traceback.

2) Utilizing generic network functions for flow-level logging: Nowadays, network service providers routinely collect flow-level measurements to guide the execution of many network management applications [33]. Flow-based monitoring technologies like xFlow (NetFlow, IPFIX, sFlow, jFlow) are increasingly being deployed, with applications that range from customer accounting, identification of unwanted traffic, and anomaly detection to network forensic analysis [27]. Take NetFlow [34] for example: routers report collected flow statistics to a centralized unit for further aggregation at a preconfigured time interval. Hence, flow-level logging in cloud-based IP traceback that utilizes generic network functions becomes a promising traceback solution.

B. Cloud-based Traceback Architecture

Based on the above two motivations, we propose the cloud-based traceback architecture, as depicted in Fig. 4. It exhibits a hierarchical structure organized in three layers: the central traceback coordinator layer, the AS-level traceback server layer (i.e., the overlay layer), and the router layer (i.e., the underlying network layer).

1) Intra-AS Structure: A traceback server is deployed in each traceback-deployed AS. Traffic flow information collected at traceback-enabled routers is exported to internal cloud storage, which is managed by the traceback server in each AS, for long-term storage and analysis. Routers may independently sample the traffic or collect the traffic flow in a coordinated fashion [33]. In the interest of space, we do not discuss the details of traffic sampling, and instead refer interested readers to [27], [33], [35] for more details about sampling and logging traffic flows. Typically, flow-level traffic digests contain the following information: source IP address, destination IP address, source port, destination port, protocol, timestamp, etc. Data aggregation is performed at the traceback server.
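As a rough illustration of the flow-level digest and the aggregation step described above, the sketch below merges per-router reports into one record per flow. The field names, the `FlowReport` and `FlowDigest` types, and the merge rule (earliest first-seen time, latest last-seen time, set of reporting routers) are assumptions made for this example; the actual record layout depends on the xFlow exporter in use.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set, Tuple

# 5-tuple flow ID: (srcIP, dstIP, srcPort, dstPort, protocol)
FlowID = Tuple[str, str, int, int, str]


@dataclass
class FlowReport:
    """One report exported by a traceback-enabled router (hypothetical layout)."""
    flow_id: FlowID
    router: str
    first_seen: float
    last_seen: float


@dataclass
class FlowDigest:
    """Aggregated record kept by the traceback server (hypothetical layout)."""
    flow_id: FlowID
    first_seen: float
    last_seen: float
    routers: Set[str] = field(default_factory=set)


def aggregate(reports: Iterable[FlowReport]) -> Dict[FlowID, FlowDigest]:
    """Merge reports of the same flow from different routers into one digest."""
    digests: Dict[FlowID, FlowDigest] = {}
    for rep in reports:
        d = digests.get(rep.flow_id)
        if d is None:
            d = FlowDigest(rep.flow_id, rep.first_seen, rep.last_seen)
            digests[rep.flow_id] = d
        d.first_seen = min(d.first_seen, rep.first_seen)
        d.last_seen = max(d.last_seen, rep.last_seen)
        d.routers.add(rep.router)
    return digests


flow = ("198.51.100.7", "203.0.113.9", 4444, 80, "TCP")
reports = [
    FlowReport(flow, "R1", first_seen=10.0, last_seen=12.0),
    FlowReport(flow, "R2", first_seen=10.5, last_seen=13.0),
]
merged = aggregate(reports)
print(merged[flow].first_seen, merged[flow].last_seen, sorted(merged[flow].routers))
```

Keeping the set of reporting routers inside the AS is what later allows the traceback server to build a local attack graph at whatever granularity the ISP is willing to expose.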

Fig. 4: Architecture overview of cloud-based traceback

Since the traceback server as well as the internal cloud storage is managed by the local AS, sensitive information can be secured. Thus, cloud-based traceback has the potential to offer stronger privacy-preserving guarantees.

2) Traceback as a Service: Traceback-enabled ASes expose their traceback services through the traceback coordinator, e.g., by publishing traceback services in standard form using Web service technology (WS-API). The published traceback service is accessible as a charged service to network forensic investigators (e.g., victims, network administrators, or law enforcement agencies) and other applications, as shown in Fig. 4. The traceback coordinator is the central point (portal) of access into the system. It functions mainly as a querying hub that does not store any traceback data; it retrieves logs from individual traceback servers when a request has been made and authenticated.

3) Inter-AS Logical Links: To maintain inter-AS logical relations, and to achieve efficient traceback processing and high incremental deployability, we introduce flow-level marking at AS-level border routers. The key idea is to add an extra attribute to flow logs that indicates the immediate upstream traceback-deployed AS from which the packet flow arrived. In this way, we maintain logical links between these traceback-deployed ASes. As a result, during the traceback process, a downstream AS will be able to know the next AS that should be contacted for tracing the flow.
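The logical-link idea can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the paper's implementation: `forward_flow` and `follow_links` are hypothetical names, the marking field is modelled as a single variable, and the AS path is a made-up example in which AS4 is a legacy AS that forwards packets without touching the mark.

```python
def forward_flow(path, deployed):
    """Simulate a flow crossing the ASes in `path` (source first).

    Returns {AS: upstream traceback-deployed AS, or None}, i.e. the extra
    attribute each traceback-deployed AS records in its flow log.
    """
    logs = {}
    mark = None                      # value carried in the header marking field
    for asn in path:
        if asn in deployed:
            logs[asn] = mark         # ingress: record who marked the flow last
            mark = asn               # egress border router overwrites the mark
        # legacy ASes neither log the flow nor modify the mark
    return logs


def follow_links(logs, last_hop_as):
    """Follow the logical links from the victim's AS back to the first
    traceback-deployed AS on the path."""
    chain, current = [], last_hop_as
    while current is not None:
        chain.append(current)
        current = logs[current]
    return chain


path = ["AS1", "AS3", "AS4", "AS6", "AS7"]          # AS4 is a legacy AS
deployed = {"AS1", "AS2", "AS3", "AS6", "AS7"}
logs = forward_flow(path, deployed)
print(logs["AS6"])                    # AS3
print(follow_links(logs, "AS7"))      # ['AS7', 'AS6', 'AS3', 'AS1']
```

Because each deployed AS overwrites the previous mark, the field never grows along the path, and an end-host only ever sees the identity of the last marking AS.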

Fig. 5: Example of marking at AS-level border routers

In our design, a border router marks its AS identity (e.g., the globally unique 16-bit AS number or an internally assigned ID) on flows that leave one AS for another. Prior works [9], [36], [37] identified up to 25 bits in the IP header that may be used for marking. In IPv6 networks, more fields of the IPv6 header, such as Flow Label (24 bits) and Hop-by-Hop options (8 bits), can be used for marking [38]. The flow marking is similar to the marking scheme in [39], which marks every flow (e.g., by marking the first few packets of a flow) instead of every packet. A flow in this context can be defined as a unidirectional sequence of packets between two endpoints that share a common flow ID and are separated by no more than a specific inter-packet delay.

Fig. 5 illustrates an example of our approach. AS4 and AS5 are legacy ASes, and the others are traceback-deployed ASes. Assume an attack flow traverses [AS1 → AS3 → AS4 → AS6 → AS7]. When the border router in AS1 receives a packet of the attack flow from its local AS and forwards the packet to AS3, it marks the local AS number in the packet’s IP header. When the packet is forwarded by routers in AS3, the upstream traceback-deployed AS information is recorded in the flow report. Since flow marking is transparent to legacy routers and ASes, our scheme works well in partial deployment situations. For the example in Fig. 5, AS6 knows that the packet flow has come from AS3. Note that once a packet has been marked by a border router (e.g., the corresponding marking field in the IP packet header has a non-zero value), the downstream ASes will mark this packet deterministically. As a result, the marking information of the previous AS is overwritten by the downstream AS. Therefore, our marking scheme protects the privacy of ASes from end-hosts. For the same reason, the required marking space does not increase along the path. The same marking space will be reused by the last-hop AS for passing the tokens for authentication, which will be described later.

4) Traceback Processing: In our proposed cloud-based traceback, the traceback procedure starts with an investigator sending queries to the traceback coordinator. Suppose a user starts a traceback request consisting of the 5-tuple flow ID (srcIP, dstIP, srcPort, dstPort, protocol) and the estimated attack time.
The traceback coordinator will first contact the traceback server in the same domain as the victim, which is responsible for authenticating this traceback request (the details are given in Section IV). Upon verification, the retrieved result, including the upstream traceback-deployed AS information, is returned by the traceback server that witnessed the flow of interest. In the next step, the traceback coordinator sends a query to the traceback server of the upstream AS. The traceback coordinator continues this recursive query process until a traceback server identifies itself as the first traceback-deployed AS on the attack path. Each traceback server generates an attack graph for its local domain. Clearly, this approach achieves efficient traceback processing by avoiding traceback query flooding. Note that flexibility rests with the ISP: the granularity of an attack graph can be controlled by each individual traceback server to avoid leaking sensitive information. The traceback coordinator assembles the attack graphs from the individual ASes into a complete attack graph.

C. Benefits of the Cloud-based Traceback

Given the promise of cloud computing with reduced infrastructure costs, ease of management, high flexibility and scalability [40], deploying traceback service in the cloud not only meets several favorable properties identified by prior work [6], but also presents appealing new opportunities. We argue that such a centralized system simplifies traceback processing and addresses the technical and economic challenges for the practical deployment of an IP traceback system. We list the main advantages of cloud-based traceback as follows.

1) The cloud architecture makes a traceback system incrementally deployable without much extra effort, thus providing a progressive traceback solution.

2) It has the potential to offer stronger privacy-preserving guarantees. With each ISP handling its own traceback server independently, its privacy and autonomy can be securely and adequately maintained.

3) Cloud-based traceback shows increased robustness against attacks. As the cloud storage is for private use, the AS can hide the storage server from the Internet by placing it within its private network. In addition, multi-layer restrictions (using IP addresses, ports, protocol, user access control, etc.) can be put in place. The information can also be stored in encrypted form. Private cloud storage can be made robust against tampering by attackers even without resorting to cryptographic techniques. For example, the central server can check for routing inconsistencies and thereby identify compromised routers or corrupted information. This is in contrast to the marking-based approach, where compromised routers can pass spoofed marking information or erase it to misdirect the traceback procedure. Likewise, in the traditional logging-based approach, the hop-by-hop traceback process [35] is also vulnerable to compromised routers.
4) The cloud-based traceback architecture enables forensic investigations in the aftermath of attacks, as logs can be maintained for a longer period than in traditional logging-based traceback (where router storage capacity is limited).

5) The pay-per-use nature of cloud service encourages ISPs to deploy the traceback service, since the traceback coordinator can distribute monetary rewards to traceback deployers.

It is worth mentioning that the proposed cloud-based traceback architecture resonates highly with software-defined networking (SDN), an emerging paradigm that physically decouples the network’s control plane from its data plane [41]. SDN offers a centralized view of the network in each AS, and shows similarities with our cloud-based traceback architecture. Since the SDN architecture provides more customized and flexible traffic flow measurement, and routers regularly send collected flow statistics to the controller [42], our cloud-based traceback can integrate well into SDN.

D. The Need for a New Traceback Authentication

In the context of cloud-based traceback, suppose a malicious entity has access to the cloud-based traceback service, and can retrieve records from the corresponding traceback server. On one hand, there is a risk that a misbehaving user derives the ISP’s network topology after collecting sufficient traceback results. On the other hand, malicious users may launch denial of service (DoS) attacks against the traceback service [22]. In addition, we want to protect the privacy of legitimate Internet users, since they normally do not want to be traced. Therefore, any entity wishing to perform a traceback should be appropriately authenticated. User name and password are widely used as the main authentication mechanism. However, password-based authentication is not scalable and is vulnerable to password cracking. This paper proposes an enhanced user authentication scheme which is customized for regulating access to the traceback service in a cloud-based traceback system.

IV. AUTHENTICATION IN CLOUD-BASED TRACEBACK

This section describes a novel token-based authentication framework for cloud-based traceback. We first present the adversary model and the design objective. Then, we introduce the design overview of the FACT authentication framework, followed by detailed descriptions of its key components.

A. Adversary Model and Design Goal

We consider that an adversary may attempt to acquire traceback information with ill intent. Examples of adversaries are potential attackers or competitors who wish to retrieve such information for ISP topology discovery [5]. An adversary may also use traceback techniques to invade Internet users’ privacy, for example by tracing users who have visited certain websites. We also consider that an adversary may launch DoS attacks against the traceback system.

Our design goal is to ensure that the individual requesting the traceback procedure is an actual recipient of the packet flow to be traced (this requirement may not apply to privileged entities such as law enforcement investigators). This will prevent users with malicious intent from retrieving traceback information that is not meant to be released to them. User authentication can also prevent DoS attacks against traceback services. To elaborate, in a DoS attack against a traceback server, attackers send illegitimate queries to the traceback server, thereby forcing the server to initiate a large number of traceback queries. Such DoS attacks can be mitigated effectively by enforcing authentication¹. The authentication solution should be lightweight and robust, minimally affecting routers and routing protocols.

¹ However, attacks in which both the source and destination machines are controlled by the attacker, so as to overwhelm the traceback server with legitimate queries, are not mitigated by our authentication solution. Rather, such attacks, which usually require coordination among a large number of machines or bots (and are therefore expensive), can be blocked at or near the sources once they are located using the traceback service.

B. FACT Design for Cloud-based Traceback

1) Framework Overview: Token-based access control has been widely used to protect sensitive information in cloud computing environments [43], [44]. Instead of authenticating with a username and password for protected resources, a user obtains a time-limited token and uses this token for authentication. Fig. 6 illustrates the proposed framework for authentication in cloud-based IP traceback, named FACT. In our design, an access token is associated with a "validity period": an entity in possession of an access token is permitted to retrieve traffic flow data of that specific period. A traceback server distributes temporal access tokens to end-hosts that are indeed the intended recipients of the packets to be traced. The issuance of access tokens can be triggered on demand by deployed security solutions, or by end-users who have subscribed to the traceback service and may retrieve the traceback logs later [5]. For example, an intrusion detection system detects potential anomalies and thus triggers the traceback server to issue access tokens to the end-host. If it is indeed a DDoS attack, the victim is likely to need traceback information as forensic evidence so as to 'prosecute' the perpetrators. The end-host could also pass the gathered access tokens to other entities that it is willing to trust, such as a law enforcement agency, for forensic investigation.

As shown in Fig. 6, the last-hop router takes on the role of passing tokens to end-hosts. This role can be assigned to the edge routers of an ISP, in particular to the routers connecting to customer premises. We make the common assumption that a router failure will not affect the token marking functionality, as the backup router that becomes active in the event of a failure (or even an attack) will carry on with the function. However, if a router is compromised, the users it serves will be affected until the router is secured again. Note, however, that only part of the ISP's customer base will be affected.

Our idea is to use the traffic flow to carry access tokens to end-hosts without incurring extra message overhead. This makes the access token known only to the actual recipients of the packet flow, who may want to retrieve the flow information later for forensic analysis in a cloud-based traceback system. Malicious users are unlikely to be able to obtain the token. Since the access tokens vary both temporally and spatially, even if an adversary manages to intercept some tokens, it is difficult to impersonate a legitimate end-host all the time.

Fig. 6: Framework overview for temporal token-based authentication in cloud-based traceback (diagram elements: traceback coordinator performing authentication, traceback server with on-demand access token issuance, last-hop router embedding the token in traffic flows, and end-host whose traceback client extracts the token and submits a traceback query carrying the token and a timestamp)

Specifically, we introduce the traceback client at the end-host. As illustrated in Fig. 6, the traceback client is in charge of extracting the tokens from incoming marked packets and storing the reconstructed access tokens for further use. It can be considered a black box that hides the actual implementation from the end-host. An end-host with a valid access token can retrieve the corresponding traceback information through the cloud-based traceback system.

2) Key Challenge: A key challenge is how to transmit a token to end-hosts in an efficient and robust manner after the token is issued by the traceback server in an AS. One straightforward approach is to write the token into the IP packet header, so that the end-host can obtain the token when receiving the marked packets. We refer to this scheme as direct marking. However, the available marking space in the IP header is rather limited [37], [9], [45]. For example, most packet marking methods have suggested using the 16-bit identification (ID) field, but RFC 6864 now prohibits any such use [46]. Meanwhile, an access token must be sufficiently long to make it hard to guess. An alternative solution might be to employ network flow watermarking [47], which manipulates the statistical properties of a flow of packets to insert the token into the network flow. Unfortunately, the watermarking-based approach introduces significant delays to the traffic flow, and it suffers from low robustness and severe decoding errors [48]. Since tokens delivered to end-hosts are used for authentication and validation, accuracy and robustness are of paramount importance in token delivery.

C. Match-based Marking for Token Delivery

1) Basic Idea: Our design objective is to adapt to the limited marking space in the IP header for efficient token delivery. In the ideal case, there is an entire bitwise match between certain pre-defined packet fields and the token, i.e., the bit values in specific packet fields (either in the IP header or in the data payload) and the token are identical. In this case, a 1-bit flag suffices to mark the packet and indicate that it contains the token. However, such an occurrence is very rare. If the token has a size of 64 bits and the bit values in a packet are random variables, the chance of a full match could be as low as $1/2^{64}$. In addition, using only one packet to deliver a token is vulnerable to packet drop attacks. In FACT, we propose an efficient token delivery scheme that spreads a token across a wide spectrum of packets.
This design makes the token difficult to capture and thus reduces the risk of packet dropping attacks, while minimizing the bit space per packet required for marking. The basic idea is to partition a token into a sequence of non-overlapping fragments. Given an IP packet at the last-hop router, we check whether a certain field (or its hash value) of this packet matches any fragment of the token that is to be delivered to an end-host. If there is a match, we mark the packet to notify the end-host that it carries partial information of the token. When the end-host receives a marked packet, it extracts the partial token information embedded in the received packet. Given a collection of marked packets, the end-host can reconstruct the complete access token.

2) Possible Attributes for Token Fragment Match: Since an access token is essentially a random bit string, we want to find the attributes in IP packets with the largest variance for token fragment match. Fields in the data payload are likely to take more widely differing values than other fields. We also compared the uniqueness of different attributes in the IP header using CAIDA datasets [49], and found that the 16-bit checksum field and the identification field in the IPv4 header may be used for token fragment match. Since the matching operation is only performed at the last hop, after the checksum has been recalculated, the checksum will not be adjusted again before the packet arrives at the end-host. Therefore, both the checksum and identification fields can be used for our purpose. However, when a Network Address Translator (NAT) is in effect, we cannot use the IP header checksum for token fragment match, since NAT changes the IP address as a packet arrives at the destination host and the checksum value is calculated over the IP header. Another option is to use the hash value of a particular attribute for token fragment match. In this case, the last-hop router and the traceback client at the end-host must use the same hash function.

3) Marking Procedure: For clarity, let MA denote the selected match attribute for token fragment match. We first define the token fragment match, and then describe the marking procedure.

Definition. Token Fragment Match: Given a token fragment (TF) and the selected attribute (MA) of an IP packet, if MA contains a non-empty subset of the set bits (i.e., bits that are set to 1) in TF, and MA retains all the clear bits (i.e., bits that are set to 0) in TF, we call this a token fragment match between MA and TF.

Fig. 7: Examples of token fragment match and mismatch: (a) token fragment match, TF = 0100110101001101 and MA = 0000110000001100; (b) token fragment mismatch, TF = 0100110101001101 and MA = 0010110001001001

Fig. 7(a) shows an example of a token fragment match, where we assume that the size of a token fragment TF is 16 bits. An example of a mismatch is illustrated in Fig. 7(b); since MA fails to retain all the cleared bits in TF, it does not match TF. According to the definition of token fragment match, the probability of a match depends strongly on the percentage of cleared bits in the token. For example, given a 16-bit token fragment with 50% cleared bits (i.e., 8 cleared bits) and assuming that MA values are uniformly random, the match probability is about $1/2^{8}$. This low probability may lead to poor token delivery performance. The smaller a token fragment, the higher the expected match probability. However, decreasing the size of the token fragment increases the marking space requirement and the number of marked packets. Hence, there is an inherent trade-off between the match probability and the required marking space. In this work, we mainly introduce the generic FACT authentication framework, and leave the optimal token fragmentation as an open problem for future research.

Without loss of generality, we assume an access token is partitioned into n non-overlapping fragments. Let f denote the length of each token fragment. The length of MA is equal to f. Suppose there are k (k ≥ n) bits of marking space in each IP header that can be used to encode information for token delivery at the last-hop router. For simplicity, we use an 8-bit token fragment (i.e., f = 8) to describe our token delivery design; f can also be set to other values. In order to minimize the marking space requirement and improve the marking efficiency, we use an 8-bit MA for token fragment match. As a result, we use only 1 bit per token fragment to indicate a match or a mismatch with the MA value.
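The match definition can be made concrete with a short Python sketch. It is only an illustration under our reading of the definition: the function names `is_fragment_match` and `match_probability` are hypothetical and do not come from the FACT implementation, and MA is assumed to be uniformly random. Under these assumptions the exact match probability is $(1 - 2^{-s}) \cdot 2^{-c}$ for a fragment with s set bits and c cleared bits, which is close to the $1/2^{8}$ quoted above when s = c = 8.

```python
def is_fragment_match(ma: int, tf: int) -> bool:
    """Token fragment match: MA carries at least one set bit of TF and
    sets no bit that is clear in TF (hypothetical helper)."""
    return ma != 0 and (ma & ~tf) == 0


def match_probability(tf: int, width: int) -> float:
    """Match probability for a uniformly random MA of `width` bits."""
    set_bits = bin(tf).count("1")
    cleared_bits = width - set_bits
    # MA must be 0 on every cleared bit of TF (2^-c) and must not be
    # all-zero on the set bits of TF (1 - 2^-s).
    return (1 - 2.0 ** -set_bits) * 2.0 ** -cleared_bits


TF = 0b0100110101001101  # 16-bit fragment used in Fig. 7

print(is_fragment_match(0b0000110000001100, TF))  # Fig. 7(a): True
print(is_fragment_match(0b0010110001001001, TF))  # Fig. 7(b): False
print(match_probability(TF, 16))                  # 255/65536, about 1/2^8
```

Shorter fragments raise this probability, which is the trade-off against marking space discussed above.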

Fig. 8: Example of packet marking for token delivery (n = 4 fragments of 8 bits each: TF0 = 11000011, TF1 = 10101010, TF2 = 01010101, TF3 = 00111100; MA = 10000010; 4-bit marking space with resulting mark 1100)

For the example in Fig. 8, assume the token length is 32 bits and the marking space is 4 bits, where the marking space is used to indicate token fragment matches. When the last-hop router receives a packet, the MA value is checked against each token fragment in turn. We first check whether the token fragment TF0 matches MA; it does, so the first bit in the marking field is set to "1". Similarly, MA matches TF1, so the second bit of the marking value is set to "1". Finally, we obtain the marking value "1100" in this example. Note that all packets to the end-host, regardless of whether they are suspicious or not, can be used for marking, resulting in fast and efficient token delivery.

Our design can easily be extended to adapt to the marking space available in the IP header. For the example in Fig. 8, if the IP header has 8 bits for marking, we can select two 8-bit attributes MA1 and MA2 for token fragment match. We use 2 bits per token fragment to indicate the usage of MA1 or MA2. That is, "00" denotes that the token fragment matches neither MA1 nor MA2, "10" denotes a match with MA1, "01" denotes a match with MA2, and "11" denotes a match with both. This operation increases the token fragment matching ratio and thus further improves the token delivery efficiency.

4) Concise Marking: If the last-hop router simply marks all the packets that match any token fragment, we call this simple scheme blind marking. One drawback of blind marking is that, since the last-hop router does not keep track of the portions of the token that have been relayed to an end-host, the marking has to be executed throughout a specified time period without knowing whether the access token has been fully matched or not. Moreover, when a partial token has already been formed at the end-host, blind marking may result in marked packets carrying redundant information to the end-host. To minimize the marking overhead, we introduce the idea of concise marking.
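As a baseline for the concise scheme, the per-packet marking of Fig. 8 and its two-attribute extension can be sketched in Python as follows. This is an illustrative sketch only: the helper names are hypothetical, the marking field is represented as a bit string for readability, and the second attribute value used for MA2 is an assumption chosen for the example rather than a value given for this extension.

```python
def is_fragment_match(ma: int, tf: int) -> bool:
    """MA carries some set bits of TF and none of its cleared bits."""
    return ma != 0 and (ma & ~tf) == 0


def mark_single(ma: int, fragments: list) -> str:
    """One marking bit per fragment, TF0 first, as in Fig. 8."""
    return "".join("1" if is_fragment_match(ma, tf) else "0"
                   for tf in fragments)


def mark_dual(ma1: int, ma2: int, fragments: list) -> str:
    """Two marking bits per fragment: first bit for MA1, second for MA2,
    so "10" means MA1 only, "01" MA2 only, "11" both, "00" neither."""
    return " ".join(
        ("1" if is_fragment_match(ma1, tf) else "0")
        + ("1" if is_fragment_match(ma2, tf) else "0")
        for tf in fragments
    )


# Token fragments TF0..TF3 and the match attribute from Fig. 8.
FRAGMENTS = [0b11000011, 0b10101010, 0b01010101, 0b00111100]

print(mark_single(0b10000010, FRAGMENTS))            # 1100
# MA2 = 00010101 is an assumed value, used only to exercise the encoding.
print(mark_dual(0b10000010, 0b00010101, FRAGMENTS))  # 10 10 01 00
```

With the second attribute, TF2 is also matched by the same packet, which is how the extension raises the matching ratio.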

Fig. 9: Example of the concise marking scheme (original token: TF0 = 11000011, TF1 = 10101010, TF2 = 01010101, TF3 = 00111100; MA = 10000010 at t1, 00010101 at t2, 00111000 at t3, and 10000010 at t4, the last being redundant)

Whenever the last-hop router finds a token fragment match, it marks the packet and records which bit values have been relayed to the end-host. As shown in Fig. 9, the last-hop router keeps track of the token delivery progress to an end-host. It marks the next packet only if this packet can carry new set bit values to the end-host. For example, at time t1, TF0 and TF1 match MA, and thus the last-hop router updates their remaining set bits to "01000001" and "00101000", respectively. At time t2, the remaining set bits of TF2 are updated to "01000000". Later, at time t3, the remaining set bits of TF3 are updated to "00000100". At time t4, however, the router finds only a redundant token fragment match, and thus it does not mark the packet.
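The walk-through above can be reproduced with a minimal Python sketch. It is a sketch under stated assumptions, not the authors' implementation: the names `concise_match`, `router_marks`, and `client_rebuild` are hypothetical, marks are kept as bit strings, and the client side simply ORs the attribute into every flagged fragment, which anticipates the token extraction step described later in the paper.

```python
FRAGMENTS = [0b11000011, 0b10101010, 0b01010101, 0b00111100]    # TF0..TF3
ATTRIBUTES = [0b10000010, 0b00010101, 0b00111000, 0b10000010]   # MA at t1..t4


def concise_match(value: int, tf: int, remaining: int):
    """Return (matched, new_remaining) for one fragment."""
    if value & ~tf:              # value sets a bit that is clear in TF
        return False, remaining
    if value & remaining == 0:   # nothing new to convey (or value == 0)
        return False, remaining
    return True, remaining & ~value


def router_marks(fragments, attributes):
    """Last-hop router side: mark only packets that convey new set bits."""
    remaining = list(fragments)
    marked = []
    for ma in attributes:
        mark = ""
        for i, tf in enumerate(fragments):
            matched, remaining[i] = concise_match(ma, tf, remaining[i])
            mark += "1" if matched else "0"
        if "1" in mark:
            marked.append((ma, mark))
        if not any(remaining):   # whole token delivered
            break
    return marked, remaining


def client_rebuild(marked, n):
    """End-host side: OR the attribute into every flagged fragment."""
    token = [0] * n
    for ma, mark in marked:
        for i, bit in enumerate(mark):
            if bit == "1":
                token[i] |= ma
    return token


marked, remaining = router_marks(FRAGMENTS, ATTRIBUTES)
partial = client_rebuild(marked, len(FRAGMENTS))
print([mark for _, mark in marked])           # ['1100', '0010', '0001']
print([format(r, "08b") for r in remaining])  # bits still to be delivered
```

After the four packets of Fig. 9, the bits already held by the client and the bits still recorded as remaining at the router together make up the original token, and the packet at t4 is left unmarked.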

Input: Token fragments TF_i, i ∈ [0, n − 1]
Output: Marked packets
 1  remainingBits_i ← TF_i, i ∈ [0, n − 1]
 2  while ConciseMarking(Packet P) do
 3      MA = getMatchAttribute(P);
 4      mark ← 0;
 5      for i = 0 to n − 1 do
 6          if ConciseMatch(MA, TF_i, &remainingBits_i) then
 7              mark |= (1 << (7 − i));  // 8-bit marking space, TF_0 maps to the most significant bit
 8          end
 9      end
10      if mark ≠ 0 then
11          MarkPacket(P, mark);
12      end
13      if ∀ i, remainingBits_i == 0 then
14          break;
15      end
16  end

Algorithm 1: Algorithm for token delivery using concise marking

 1  Function: bool ConciseMatch(value, TF, *remainingBits)
 2  if ((value ⊕ TF) & value) ≠ 0 then  // value sets a bit that is clear in TF
 3      return false;
 4  end
 5  completedBits = (TF ⊕ *remainingBits);
 6  newBits = value & (∼completedBits);
 7  if newBits == 0 then
 8      return false;
 9  end
10  *remainingBits = (*remainingBits ⊕ value) & (∼completedBits);
11  return true;

Algorithm 2: Function for concise token fragment match

Algorithm 1 describes the concise marking-based token delivery in FACT. Suppose there is an access token to be delivered to an end-host. When the last-hop router receives a packet, it first extracts MA (line 3). Then, for each token fragment, it sequentially checks whether there is a concise token fragment match. If so, the marking field is updated and then embedded in the packet's IP header (lines 5-12). The benefit of concise marking is a reduction in the number of redundantly marked packets: the maximum number of packets to be marked is the number of set bits in the token. Concise marking also provides an end point for the token delivery. When the entire token has been relayed to the host, there is no need to mark any further packets, which ends the token delivery process (lines 13-15). Algorithm 2 describes the function that checks for a concise token fragment match. It first makes sure there is a token fragment match (lines 2-4). Then, it checks whether any new bit can be conveyed by the selected attribute. Finally, the remaining set bits of the token fragment are updated (line 10).

D. Token Extraction

The traceback client deployed at the end-host is in charge of token extraction. The last-hop router can use a preamble to notify the traceback client at the end-host that a new access token has been issued. For example, all bits in the marking field are set to indicate a preamble. In this case, the last-hop router ignores the matching case in which all marking bits would be set. This is viable and has a negligible effect on performance, since the probability that all token fragments match the MA is extremely low. When the traceback client receives a token delivery preamble, it generates a token instance with all bits cleared. Upon receiving a marked packet, the traceback client updates the temporal token. Since the last-hop router keeps track of the token fragment delivery progress in concise marking, it sends out a postamble to end the token delivery once the entire token has been relayed to the end-host. After receiving a certain number of marked packets, the full access token can be recovered at the end-host.

Fig. 10: Example of the token extraction at the end-host: (a) after the first marked packet (MA = 10000010, marking 1100), TF0 = TF0|MA = 10000010 and TF1 = TF1|MA = 10000010, while TF2 = TF3 = 00000000; (b) after the second marked packet (MA = 00010101, marking 0010), TF2 = TF2|MA = 00010101

Let us revisit the example in Fig. 8; its corresponding token extraction procedure is illustrated in Fig. 10. The end-host decodes the marking "1100" when receiving the first marked packet. It then updates the token with TF0 = TF0|MA and TF1 = TF1|MA, where "|" is the bitwise OR operator. After receiving the second marked packet, the traceback client updates TF2 as shown in Fig. 10(b). Note that to reconstruct a new access token, the traceback client does not need to store the marked packets. It only needs to maintain a token instance in the buffer, and keeps updating the token when receiving marked packets until a postamble is received.

E. Design Discussions

1) Comparison with Direct Marking: In direct marking, a token normally needs to be partitioned into a sequence of fragments so that one fragment can be embedded into an IP header. As a result, the marking space must contain two parts, namely the fragment index field and the payload field, to make sure that the token can be reconstructed at the end-host. Given a fixed token length l, we can derive the number of token fragments n that leads to the minimal marking space required by direct marking by solving Eq. (1):

$$k^{*} = \min_{n \in [1, l]} \left( \lceil \log_2 n \rceil + \lceil l/n \rceil \right), \qquad (1)$$

where $k^{*}$ denotes the minimal marking space, $\lceil \log_2 n \rceil$ is the bit length of the fragment index, and $\lceil l/n \rceil$ is the bit length of the payload in the marking space. Similarly, given a known token length l and available marking space k, we can obtain the minimum number of token fragments, i.e., the minimum number of marked packets in direct marking, by solving Eq. (2):

$$\text{minimize } n \quad \text{subject to} \quad \lceil \log_2 n \rceil + \lceil l/n \rceil \le k, \quad n \in [1, l]. \qquad (2)$$
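Eqs. (1) and (2) are small enough to be solved by exhaustive search over n. The Python sketch below is one way to do so; the function names are hypothetical, and integer arithmetic is used for the two ceiling terms to avoid floating-point rounding.

```python
def marking_bits(l: int, n: int) -> int:
    """Bits needed per packet in direct marking with n fragments:
    ceil(log2 n) index bits plus ceil(l / n) payload bits."""
    index_bits = (n - 1).bit_length()   # equals ceil(log2 n) for n >= 1
    payload_bits = -(-l // n)           # integer ceiling of l / n
    return index_bits + payload_bits


def min_marking_space(l: int):
    """Eq. (1): minimal marking space k* and every n that achieves it."""
    best = min(marking_bits(l, n) for n in range(1, l + 1))
    return best, [n for n in range(1, l + 1) if marking_bits(l, n) == best]


def min_fragments(l: int, k: int):
    """Eq. (2): smallest n whose marking fits in k bits, or None."""
    for n in range(1, l + 1):
        if marking_bits(l, n) <= k:
            return n
    return None


print(min_marking_space(64))   # (7, [32, 64])
print(min_fragments(64, 8))    # 16
print(min_fragments(64, 6))    # None: 6 bits are not enough for 64-bit tokens
```

For a 64-bit token this gives a minimal marking space of 7 bits, reached with 32 or 64 fragments, and at least 16 fragments when 8 marking bits are available.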

Fig. 11: Constraints in direct marking scheme for a 64-bit token: (a) minimal marking space (required marking space in bits versus number of token fragments); (b) minimum number of marked packets versus available marking space in bits

Fig. 11 depicts the derived minimal marking space and the minimum number of marked packets in the direct marking scheme, where the access token is 64 bits long. As shown in Fig. 11(a), with direct marking a 64-bit token requires at least 7 bits of marking space, and the corresponding number of token fragments is either 32 or 64. In contrast, our design adapts well to limited marking space. Given the token length l and the available marking space k, we can simply set the length of each token fragment to l/k to adapt to a k-bit marking space. For example, if the access token is 128 bits long and only 8 bits of marking space are available in the IP header, the length of each token fragment is set to 16 bits. In addition, customized marking schemes may be applied in scenarios with extremely limited marking space. Fig. 11(b) shows the minimum number of token fragments for a fixed marking space. For example, when the marking space is 8 bits, a token has to be divided into at least 16 fragments to be delivered to the end-host, where the marking space contains a 4-bit fragment index field and a 4-bit payload field. This means that at least 16 marked packets are required in the direct marking scheme. In Section V, we demonstrate experimentally that our scheme requires fewer marked packets.

2) On-Demand Token Issuance: In traditional logging-based traceback, attack packets are required as "material evidence" for launching a traceback query, where the typical query size is between 1MB and 10MB [25]. In contrast, in our FACT design, end-hosts do not need to store a large amount of packet digests for future forensic investigation. FACT applies an on-demand procedure to issue access tokens, and allows the issuance to be offered as a charged service. Therefore, the token-based authentication incurs lower overhead on the end-host. It is worth mentioning that the proposed FACT authentication framework can also be applied to traditional logging-based traceback systems to convey information to end-hosts.

F. Computational Complexity of Authentication Process

The authentication process checks whether the client-provided token is valid. Specifically, the task involves searching and matching against a set of tokens, given the time period, token, and client IP address as input. This boils down to comparing the client-provided token against each token selected based on the input criteria. Comparing two bit strings of word size takes constant time. Assuming n tokens are generated during the given time period for the given client, the computation time is O(n). When a token is generated for each active flow, the number of tokens generated per second is equal to the number of active flows; this number can range from a few tens for typical users to a few thousands for servers. Therefore, the number of search and comparison operations is bounded by this number, which is low in terms of computational load. Moreover, in many cases access tokens are issued on demand, so the number of tokens stored in the database for an end-host is very limited. We conclude that the computational load is not a bottleneck in the implementation of cloud-based traceback.

V. EVALUATION

This section presents evaluation results of the FACT authentication framework using MAWILab 2016 [50] and CAIDA 2007 [49] traffic traces. The main objective is to investigate the efficiency of access token delivery to end-hosts.

A. Experiment Settings

We implement the token delivery scheme on Ubuntu Linux desktops using the libpcap library. Our program, emulating a last-hop router, receives packets from an offline capture (e.g., MAWILab [50] traffic traces) by invoking the pcap_next_ex function, which retrieves the IP header of the next captured packet. Given a token to be delivered to the end-host, our program checks whether to mark the captured packet or not. After performing the marking, our program calls the pcap_dump function to output a marked packet, and then sends it to the end-host. Unless otherwise specified, we use the eighth character of the data payload for token fragment match and adopt the concise marking scheme. We assume there are 8 bits of marking space available in the IP header, and the token fragment length is set to 8 bits by default. We partition the single 1.6GB MAWILab tcpdump trace file into multiple traces and randomly select three different files to evaluate our solution. The results are averaged over 500 runs, and the corresponding standard deviations are provided as error bars.

We implement the traceback client for token extraction and management at the end-host. Token extraction is the reverse process of token marking. The traceback client maintains an access token table. Upon successfully receiving a token, it stores the token as well as the time of receipt in the table. After the traceback client receives a marked packet with all bits set in the marking field, it considers this packet a preamble that initiates the token delivery. If the traceback client receives consecutive marked packets with all bits set in the marking field, it considers these packets a postamble that ends the token delivery. In this context, when the traceback client receives the first all-bit-set marked packet, it delays making a decision for a short period to distinguish the preamble from the postamble.

We use the following metrics for comparison.
• Number of marked packets for token delivery: the number of packets marked by the last-hop router for delivering an access token to an end-host.
• Token delivery delay: the time elapsed from a preamble sent by the last-hop router to the last marked packet received by the end-host when delivering an access token.

B. Experiment Results

1) Impact of Token Pattern: According to the definition of token fragment match, the token delivery performance is strongly correlated with the number of set bits in the token. In this experiment, we vary the set bit percentage per token from 25% to 75%, where the offsets of the set bits are randomly generated in a token. The token length is set to 64 bits in this experiment.

Fig. 12: Impact of token pattern using MAWILab dataset: (a) number of marked packets; (b) token delivery delay (s); both versus the percentage of set bits in a token (25%, 50%, 75%) for Datasets 1 to 3

In Fig. 12(a), we compare the number of marked packets with increasing set bit percentage per token. We observe that the three different datasets show a similar trend. In the test cases with 25% set bits, packets can only be marked if the eighth characters of the data payloads retain the 75% of cleared bits in a token fragment. Therefore, the probability of a token fragment match is lower than in the 50% and 75% set bits cases. As shown in Fig. 12(a), the average number of marked packets for 25% set bits is around 8. With 50% set bits in each token, the average number of marked packets increases to 14. Fig. 12(b) illustrates the token delivery delay under different token patterns. The overall trend is that, as the set bit percentage per token increases, the token delivery delay decreases remarkably. The reason is that the token fragment match probability increases with the set bit percentage per token, which accelerates the process of token delivery.
However, we observe that with 25% set bits per token, the token delivery delay for dataset 2 is much larger than in the other cases. This indicates that the token delivery performance depends on the traffic flows. As the proposed scheme relies on the diversity of particular attributes for token fragment match, there exists an element of randomness in the token delivery process.

One question arises regarding the evaluation results in Fig. 12: why does the 75% set bits per token setting yield fewer marked packets than the 50% set bits case? Intuitively, as the number of set bits per token increases, more packets are required to convey these bits to the end-host (the maximum number of marked packets required by concise marking is proportional to the number of set bits per token). However, we observe the counter-intuitive result. The underlying reason is illustrated in Fig. 13.

Fig. 13: System insight with different token patterns using MAWILab dataset: (a) set bits per mark; (b) conveyed set bits per packet; both versus the percentage of set bits in a token (25%, 50%, 75%) for Datasets 1 to 3

Fig. 13(a) plots the set bits per mark under different token patterns, which indicates the number of token fragment matches found in a marked packet. As shown in the figure, the results for 75% set bits per token are higher than in the other cases. From Fig. 13(b), we observe that the conveyed set bits per marked packet increase with the set bit percentage per token. A packet can convey more set bits of a token since the token fragment match probability is higher in the 75% set bits cases. That is why, although the number of set bits increases in the 75% set bits per token setting, the number of marked packets is smaller than in the 50% set bits case in Fig. 12(a).

2) Comparison with Blind Marking: In the second experiment, we compare the performance of the blind marking and concise marking schemes. Since both schemes have the same token delivery delay performance, we only report the number of marked packets, in Table II.

TABLE II: Number of marked packets using MAWILab dataset

                   25% Set Bits                          50% Set Bits
                   Dataset 1   Dataset 2   Dataset 3     Dataset 1   Dataset 2   Dataset 3
Blind Marking      1891        2777        1442          645         833         671
Concise Marking    8.5         8.2         7.2           13.3        13.8        14.5

As shown in Table II, the blind marking scheme marks between 46 and 338 times as many packets as concise marking across the tested cases. Concise marking provides the largest reduction in marking overhead in the 25% set bits cases, because there are a large number of redundant token fragment matches when only 25% of the bits per token are set. On the other hand, redundancy in packet marking actually increases the robustness of the token delivery against packet drops. Therefore, to improve the robustness of the solution, we can introduce probabilistic redundant marking into our concise marking scheme.

3) Impact of Token Length: Next, we study the impact of token length on the token delivery performance, where the token fragment length is fixed at 8 bits. In this experiment, we randomly generate tokens without specifying the percentage of set bits per token. Fig. 14 shows the evaluation results.

Fig. 14: Impact of token length using MAWILab dataset: (a) number of marked packets; (b) token delivery delay (s); both versus token length (16, 32, 48, 64 bits) for Datasets 1 to 3

As the token length increases, the number of marked packets and the token delivery delay also increase for all datasets. Since tokens are randomly generated, Fig. 14 exhibits higher variance than the results in Fig. 12(b). From this experiment, we can expect the marking overhead in our design to increase only linearly with the token length.

4) Impact of Attribute for Matching: Finally, we investigate the impact of different attributes in the IP header used for matching (including the upper checksum, the upper identification field, and the hash value of the upper checksum) on the token delivery performance. Since the checksum and identification fields in the MAWILab dataset show very limited variance (one possible reason is that these fields were sanitized before the data was published), we choose three different CAIDA [49] traffic traces in this experiment. The set bit percentage per token is set to 50%, and the token length is set to 64 bits.

Fig. 15: Impact of different attributes for matching using CAIDA dataset

Fig. 15(a) compares the number of marked packets when different attributes are used in the token fragment match. For dataset 1 and dataset 2, using the hash value yields slightly fewer marked packets than using the other two attributes. For dataset 3, however, using the checksum slightly outperforms using the hash value. Fig. 15(b) reports the comparison of token delivery delay. We observe that different datasets can exhibit different delay results. Overall, using the hash value for the token fragment match leads to a lower average token delivery delay than using the other two attributes. This experiment shows the feasibility of using attributes in the IP header, such as the checksum and identification fields. However, doing so may lead to a higher token delivery delay than using payload fields, which exhibit more pronounced differences in value.

5) Computational Overhead: To measure the marking delay for individual packets, we conducted an experiment to measure the time taken to mark 10,000 packets. We used an Ubuntu desktop with a standard configuration (Intel Xeon processor @ 3.50GHz and 16GB of RAM) as the last-hop router that marks packets in our experiment. The total marking delay (for 10,000 packets) is around 2.5ms, i.e., roughly 0.25µs per packet on average, which indicates that the computational overhead of marking at the last-hop router is negligible. As marking is done only for subscribed users, in an actual implementation the last-hop router also needs to maintain a set of subscribed users. For each arriving packet, the router has to perform a membership query on this set. Such a set of subscribed users can be represented by a dictionary implemented using a hash table. With a hash table implementation based on perfect hashing, a membership query takes constant time in both the average case and the worst case, while requiring space linear in the number of users [51], [52].

VI. CONCLUSION

In this work, we first presented the cloud-based IP traceback architecture, which possesses several favorable properties that previous traceback schemes failed to satisfy simultaneously. We then focused on the access control problem in the context of cloud-based traceback, where the objective is to prevent illegitimate users from requesting traceback information with ill intentions. To this end, we proposed FACT, an enhanced user authentication framework that ensures that the entity requesting the traceback procedure is an actual recipient of the flow packets to be traced. Evaluation studies based on real-world Internet traffic datasets demonstrated the feasibility and effectiveness of FACT. In future work, we will investigate the optimal marking scheme for token delivery and implement the FACT framework on our cloud-based IP traceback testbed.

REFERENCES

[1] H. Aljifri, “IP traceback: a new denial-of-service deterrent?” IEEE Security and Privacy, vol. 1, no. 3, pp. 24–31, 2003.
[2] M. Sung and J. Xu, “IP traceback-based intelligent packet filtering: a novel technique for defending against Internet DDoS attacks,” IEEE Trans. on Parallel and Distributed Systems, vol. 14, no. 9, pp. 861–872, 2003.
[3] L. Lu, M. C. Chan, and E.-C. Chang, “A general model of probabilistic packet marking for ip traceback,” in ASIACCS ’08, 2008, pp. 179–188.
[4] T. H.-J. Kim, C. Basescu, L. Jia, S. B. Lee, Y.-C. Hu, and A. Perrig, “Lightweight Source Authentication and Path Validation,” in SIGCOMM ’14, 2014, pp. 271–282.
[5] C. Gong and K. Sarac, “Toward a Practical Packet Marking Approach for IP Traceback,” International Journal of Network Security, vol. 8, no. 3, pp. 71–84, 2009.
[6] A. Yaar, A. P., and D. Song, “FIT: fast internet traceback,” in INFOCOM ’05, 2005, pp. 1395–1406.
[7] H. Lee, M. Kwon, G. Hasker, and A. Perrig, “BASE: An incrementally deployable mechanism for viable ip spoofing prevention,” in ASIACCS ’07, 2007, pp. 20–31.
[8] A. Belenky and N. Ansari, “On Deterministic Packet Marking,” Computer Networks, vol. 51, no. 10, pp. 2677–2700, 2007.
[9] Y. Xiang, W. Zhou, and M. Guo, “Flexible Deterministic Packet Marking: An IP Traceback System to Find the Real Source of Attacks,” IEEE Trans. on Parallel and Distributed Systems, vol. 20, no. 4, pp. 567–580, 2009.
[10] D. X. Song and A. Perrig, “Advanced and authenticated marking schemes for IP traceback,” in INFOCOM ’01, 2001, pp. 878–886.
[11] T. Peng, C. Leckie, and K. Ramamohanarao, “Adjusted probabilistic packet marking for IP traceback,” in NETWORKING 2002, 2002, pp. 697–708.
[12] A. Yaar, A. Perrig, and D. Song, “Pi: a path identification mechanism to defend against DDoS attacks,” in Proc. Symposium on Security and Privacy, 2003, pp. 93–107.
[13] T. Law, J. Lui, and D. Yau, “You can run, but you can’t hide: an effective statistical methodology to trace back DDoS attackers,” IEEE Trans. Parallel Distrib. Syst., vol. 16, no. 9, pp. 799–813, 2005.
[14] B. Al-Duwairi and M. Govindarasu, “Novel hybrid schemes employing packet marking and logging for IP traceback,” IEEE Trans. on Parallel and Distributed Systems, vol. 17, no. 5, pp. 403–418, 2006.
[15] A. Yaar, A. Perrig, and D. Song, “StackPi: New Packet Marking and Filtering Mechanisms for DDoS and IP Spoofing Defense,” IEEE Journal on Selected Areas in Communications, vol. 24, no. 10, pp. 1853–1863, 2006.
[16] J. Liu, Z.-J. Lee, and Y.-C. Chung, “Dynamic probabilistic packet marking for efficient IP traceback,” Computer Networks, vol. 51, no. 3, pp. 866–882, 2007.
[17] Z. Gao and N. Ansari, “A practical and robust inter-domain marking scheme for IP traceback,” Computer Networks, vol. 51, no. 3, pp. 732–750, 2007.
[18] S. Yu, W. Zhou, R. Doss, and W. Jia, “Traceback of ddos attacks using entropy variations,” IEEE Trans. on Parallel and Distributed Systems, vol. 22, no. 3, pp. 412–425, March 2011.
[19] L. Cheng, D. M. Divakaran, W. Y. Lim, and V. L. L. Thing, “Opportunistic Piggyback Marking for IP Traceback,” IEEE Transactions on Information Forensics and Security, vol. 11, no. 2, pp. 273–288, 2016.
[20] V. Thing, M. Sloman, and N. Dulay, “Locating network domain entry and exit point/path for DDoS attack traffic,” IEEE Trans. on Network and Service Management, vol. 6, no. 3, pp. 163–174, 2009.
[21] K. Park and H. Lee, “On the effectiveness of probabilistic packet marking for IP traceback under denial of service attack,” in INFOCOM ’01, 2001, pp. 338–347.
[22] A. C. Snoeren, C. Partridge, L. A. Sanchez, C. E. Jones, F. Tchakountio, S. T. Kent, and W. T. Strayer, “Hash-based IP Traceback,” in SIGCOMM ’01, 2001, pp. 3–14.
[23] C. Gong and K. Sarac, “A More Practical Approach for Single-Packet IP Traceback using Packet Logging and Marking,” IEEE Trans. on Parallel and Distributed Systems, vol. 19, no. 10, pp. 1310–1324, 2008.
[24] T.-H. Lee, W.-K. Wu, and T.-Y. Huang, “Scalable packet digesting schemes for IP traceback,” in ICC ’04, 2004, pp. 1008–1013.
[25] J. Li, M. Sung, J. Xu, and L. Li, “Large-scale IP traceback in high-speed Internet: practical techniques and theoretical foundation,” in Security and Privacy ’04, May 2004, pp. 115–129.
[26] C. Gong, T. Le, T. Korkmaz, and K. Sarac, “Single packet IP traceback in AS-level partial deployment scenario,” in GLOBECOM ’05, 2005.
[27] H. Tian and J. Bi, “An Incrementally Deployable Flow-Based Scheme for IP Traceback,” IEEE Communications Letters, vol. 16, no. 7, pp. 1140–1143, July 2012.
[28] A. Mankin, D. Massey, C. Wu, S. F. Wu, and L. Zhang, “On design and evaluation of "intention-driven" ICMP traceback,” in Proc. of the 10th International Conference on Computer Communications and Networks (ICCCN), 2001, pp. 159–165.
[29] G. Yao, J. Bi, and A. V. Vasilakos, “Passive IP Traceback: Disclosing the Locations of IP Spoofers From Path Backscatter,” IEEE Trans. on Info. Forensics and Security, vol. 10, no. 3, pp. 471–484, March 2015.
[30] S. Shin, P. Porras, V. Yegneswaran, M. Fong, G. Gu, and M. Tyson, “Fresco: Modular composable security services for software-defined networks,” in NDSS ’13, 2013.
[31] H. Zhang, J. Reich, and J. Rexford, “Packet traceback for software-defined networks,” Princeton University, Tech. Rep., 2015.
[32] B. Liu, J. Bi, and A. V. Vasilakos, “Toward incentivizing anti-spoofing deployment,” IEEE Transactions on Information Forensics and Security, vol. 9, no. 3, pp. 436–450, 2014.
[33] V. Sekar, M. K. Reiter, W. Willinger, H. Zhang, R. R. Kompella, and D. G. Andersen, “CSAMP: A system for network-wide flow monitoring,” in NSDI ’08, 2008, pp. 233–246.
[34] “Cisco Systems NetFlow Services,” https://www.ietf.org/rfc/rfc3954.txt.
[35] M. Sung, J. Xu, J. Li, and L. Li, “Large-scale IP Traceback in High-speed Internet: Practical Techniques and Information-theoretic Foundation,” IEEE/ACM Trans. Netw., vol. 16, no. 6, pp. 1253–1266, 2008.
[36] D. Dean, M. Franklin, and A. Stubblefield, “An algebraic approach to ip traceback,” ACM Trans. Inf. Syst. Secur., vol. 5, no. 2, pp. 119–137, 2002.
[37] M. T. Goodrich, “Probabilistic packet marking for large-scale IP traceback,” IEEE/ACM Trans. Netw., vol. 16, no. 1, pp. 15–24, 2008.
[38] S. Amin and C. S. Hong, “On IPv6 traceback,” in ICACT ’06, 2006.
[39] V. Aghaei-Foroushani and A. Zincir-Heywood, “IP traceback through (authenticated) deterministic flow marking: an empirical evaluation,” EURASIP Journal on Info. Security, 2013.
[40] J. Sherry, S. Hasan, C. Scott, A. Krishnamurthy, S. Ratnasamy, and V. Sekar, “Making middleboxes someone else’s problem: Network processing as a cloud service,” in SIGCOMM ’12, 2012, pp. 13–24.
[41] W. Xia, Y. Wen, C. H. Foh, D. Niyato, and H. Xie, “A survey on software-defined networking,” IEEE Communications Surveys Tutorials, vol. 17, no. 1, pp. 27–51, 2015.
[42] M. Yu, L. Jose, and R. Miao, “Software defined traffic measurement with opensketch,” in NSDI ’13, 2013, pp. 29–42.
[43] A.-R. Sadeghi, T. Schneider, and M. Winandy, “Token-based cloud computing: Secure outsourcing of data and arbitrary computations with lower latency,” in TRUST ’10, 2010, pp. 417–429.
[44] A. Khaled, M. Husain, L. Khan, K. Hamlen, and B. Thuraisingham, “A Token-Based Access Control System for RDF Data in the Clouds,” in Proc. CloudCom ’10, 2010, pp. 104–111.
[45] M.-H. Yang and M.-C. Yang, “RIHT: A novel hybrid IP traceback scheme,” IEEE Trans. on Info. Forensics and Security, vol. 7, no. 2, pp. 789–797, 2012.
[46] “RFC 6864: Updated Specification of the IPv4 ID Field,” http://tools.ietf.org/html/rfc6864.
[47] Z. Lin and N. Hopper, “New attacks on timing-based network flow watermarks,” in Security ’12, 2012.
[48] X. Gong, M. Rodrigues, and N. Kiyavash, “Invisible flow watermarks for channels with dependent substitution deletion and bursty insertion errors,” IEEE Trans. on Info. Forensics and Security, vol. 8, no. 11, pp. 1850–1859, 2013.
[49] “The CAIDA UCSD DDoS Attack 2007 Dataset,” http://www.caida.org/data.
[50] “MAWILab 2016 Traffic Trace,” http://www.fukuda-lab.org/mawilab.
[51] M. L. Fredman, J. Komlós, and E. Szemerédi, “Storing a Sparse Table with O(1) Worst Case Access Time,” J. ACM, vol. 31, no. 3, pp. 538–544, Jun. 1984.
[52] Y. Lu, B. Prabhakar, and F. Bonomi, “Perfect Hashing for Network Applications,” in 2006 IEEE International Symposium on Information Theory, July 2006, pp. 2774–2778.

Long Cheng is a Research Scientist at the Institute for Infocomm Research (I2R), Singapore. He received the Ph.D. degree in Computer Science from the State Key Lab of Network and Switching Technology, Beijing University of Posts and Telecommunications, China, in 2012. Before joining I2R, he worked as a Research Fellow at Singapore University of Technology and Design from 2012 to 2014. From 2009 to 2011, he was a research assistant at the Hong Kong Polytechnic University and a visiting student at the University of Texas at Arlington, USA. He received the Best Paper Award of IEEE WCNC 2013 and the Erasmus Mundus Scholar Award in 2014. His research interests include network security and forensics, wireless sensor networks, cyber-physical systems, and mobile and pervasive computing. He is a member of ACM and IEEE.

Dinil Mon Divakaran (Senior Member, IEEE) is a Research Scientist at the Cyber Security & Intelligence Department in the A*STAR Institute for Infocomm Research (I2R), Singapore. He is also an A*STAR Graduate Scholarship PhD advisor. Prior to this, he was a Research Fellow for two years at the Department of Electrical and Computer Engineering in the National University of Singapore (NUS). He has also worked as an Assistant Professor in the School of Computing and Electrical Engineering at the Indian Institute of Technology (IIT) Mandi. He carried out his PhD at the INRIA team in ENS Lyon, France. He holds a Master's degree from IIT Madras, India. His research revolves around the applications of statistical models, machine learning and game theory, as well as the study of optimization problems and design of algorithms, all in the broad area of computer networks. He is keenly interested in the study of systems, architectures and protocols in the context of network security and QoS delivery.

Aloysius Wooi Kiak Ang is an undergraduate student in the Department of Electrical and Computer Engineering, National University of Singapore. He did his Final Year Project (FYP) at the Cyber Security & Intelligence Department in the Institute for Infocomm Research (I2R). His research interests include software engineering and network security.

Wee Yong Lim is the Deputy Lab Head for the Security Intelligence Lab, in the Cyber Security & Intelligence R&D Department at the Institute for Infocomm Research (I2R), A*STAR, Singapore. His research interests include predictive intelligence, text analysis, object recognition and machine learning.

Vrizlynn L. L. Thing leads the Cyber Security & Intelligence R&D Department at the Institute for Infocomm Research (I2R), A*STAR, Singapore. The department focuses on digital forensics, cybercrime, cyber security and mobile security research and technology development. She is also an A*STAR Graduate Scholarship Ph.D. advisor, and an Adjunct Associate Professor at the Singapore Management University and the National University of Singapore. She has over 13 years of security and forensics R&D experience with in-depth expertise in cyber crime & attack evolvement detection and mitigation, cyber security, digital forensics, and security intelligence & analytics. Her research draws on her multidisciplinary background in computer science (Ph.D. from Imperial College London, United Kingdom), and electrical, electronics, computer and communications engineering (Diploma from Singapore Polytechnic, B.Eng. and M.Eng. by Research from Nanyang Technological University, Singapore). During her career, she has taken on various roles with the key focus of leading and conducting world-class, industry-relevant R&D that brings a positive impact to our economy and society. She also participates actively as the Principal Investigator and Lead Scientist of several collaborative projects with industry partners such as MNCs and government agencies.
