Visualizing a Malware Distribution Network

Sebastian Peryt1, Jose Andre Morales2, William Casey3, Aaron Volkmann4, Bud Mishra5, Yang Cai6*

1 Cylab-CIT and SCS, Carnegie Mellon Univ.
2 SEI-CERT, Carnegie Mellon Univ.
3 SEI-CERT, Carnegie Mellon Univ.
4 SEI-CERT, Carnegie Mellon Univ.
5 Courant Institute, New York Univ.
6 Cylab-CIT and SCS, Carnegie Mellon Univ.

ABSTRACT

In this paper, we present a case study of visual analytics of a Malware Distribution Network (MDN), a connected set of maliciously compromised domains used to disseminate malicious software to victimize computers and users. We formally define the graph of an MDN to visualize top-level-domain (TLD) data collected from Google Safe Browsing reports in a temporal manner characterizing the topological structure. From the collected data, we were able to identify and label a TLD’s role in malware distribution. The visual analytics provided insights on the topological structure of MDNs over time including highly connected and persistent TLDs and subnetworks.

Keywords: malware; malware distribution network; visualization; top-level domain; Google Safe Browsing; behavioral graph.

Index Terms: graph theory; network flows; malware and its mitigation; visual analytics; visualization design and evaluation methods.

1 INTRODUCTION

Rapidly growing cyber security data streams enable us to discover dynamic behavioral patterns from sequential events. Visual analytics has recently emerged as a powerful tool for seeing through the fog of cyber wars beyond numbers. In this paper, we present a case study of visual analytics of a Malware Distribution Network (MDN), which is a connected set of maliciously compromised top-level domains (TLDs) used to facilitate the dissemination of malicious software attempting to victimize computers. MDNs have been used in botnets [1-2], spam campaigns [3] and distributed denial of service (DDoS) attacks [4]. In general, MDNs are the essential back-end distribution highway fueling underground economies in monetized schemes that generate large revenues for malicious actors [3].

In this study, we provide a fundamental basis for the construction and analysis of an MDN’s topological structure. Our MDN graphs are top-level domain based, with three classifications defining the role of a TLD. Using these classifications, we visualized an MDN over a month and gained insight into its structure, persistence, and evolution over time. Our MDN graphs were based on Google Safe Browsing (GSB) reports [5]. Our research shows that the novel approach of leveraging GSB reports to graph MDNs reveals deep insight into structural changes over time.

The principal contributions of this paper include: 1) a formal definition of Malware Distribution Networks, 2) a novel visualization model that reveals the existence of persistent subnetworks and individual TLDs critical to the successful operation of MDNs, and 3) the use of GSB as a rich MDN data set.

*Corresponding author: Yang Cai

2 RELATED WORK

A corpus of research [6-7] has proposed various approaches to identifying malicious URLs. Studies such as [8-9] describe and analyze the use of MDNs and their various components in monetized schemes such as botnets, pay-per-install affiliate programs and traffic direction systems. In [10], Behfarshad describes MDNs as a set of landing pages, redirectors, and malware repositories, and suggests detecting the presence of an MDN by identifying drive-by download attempts via two methods: top-down and bottom-up. The research of Provos et al. [11-12] is part of the early work describing web-based malware and the existence and identification of MDNs. Their research provided insight into identifying malicious URLs and explains the critical role of iframe¹ redirects as the fundamental link binding multiple URLs together in an MDN. The culmination of their work is the Google Safe Browsing service, which is the key data source for our research. We enhance the current literature by fundamentally defining an MDN and visualizing its graphical topological structure, including role-based vertex types indicative of a TLD’s role in malware distribution. Our work further shows how to visualize MDNs using GSB reports, the results of which revealed complex MDNs and provided insight into their structure, evolution, and persistence over time.

3 MALWARE DISTRIBUTION NETWORK GRAPHS

An MDN is a dynamic structure topologically consisting of interconnected TLDs. An MDN changes over time, and we capture this change by visualizing its topological structure at different points in time. We formally define a graph capturing the temporal topological structure of an MDN. A representation of an MDN at a given point in time is defined as a directed graph $M(t) = \langle V(t), E(t) \rangle$ such that $V(t)$ is the set of vertices at time $t$ and $E(t)$ is the set of directed edges $(v_i(t), v_j(t))$ with $v_i(t), v_j(t) \in V(t)$ at time $t$. In a given graph $M(t)$, a vertex $v \in V(t)$ represents an MDN node in one or more modes. The full range of a node’s modality in the MDN is unknown. We suggest, based on GSB reports, three possible roles. In an MDN, a node may act as an intermediary (MI) by facilitating malicious traffic, or as a malicious host (MH) or root malicious host (RMH) if malware files were hosted from that domain. In our visualizations, MI and MH nodes have incoming edges; RMH nodes do not. This implies that malware in RMH nodes came from some source that is not represented by a node in the MDN.
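As a minimal sketch of this definition, one snapshot of the graph at a single time can be held as a set of vertices with role labels plus a set of directed edges, from which RMH nodes are derived as MH nodes with no incoming edge. The class name `MDNSnapshot`, its fields, and the `.example` domains are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class MDNSnapshot:
    """One graph M(t) = <V(t), E(t)> for a single collection time t."""
    timestamp: str
    # V(t): domain name -> set of reported roles ("MI" and/or "MH").
    roles: dict = field(default_factory=dict)
    # E(t): directed edges stored as (source domain, destination domain).
    edges: set = field(default_factory=set)

    def add_node(self, domain, *node_roles):
        self.roles.setdefault(domain, set()).update(node_roles)

    def add_edge(self, source, destination):
        # Every edge endpoint must also be a vertex of the graph.
        self.add_node(source)
        self.add_node(destination)
        self.edges.add((source, destination))

    def root_malicious_hosts(self):
        """MH nodes with no incoming edge are treated as RMH nodes."""
        has_incoming = {destination for _, destination in self.edges}
        return {domain for domain, node_roles in self.roles.items()
                if "MH" in node_roles and domain not in has_incoming}


# Hypothetical three-node network: a -> b -> c.
snapshot = MDNSnapshot(timestamp="03-01-2014")
snapshot.add_node("a.example", "MH")
snapshot.add_node("b.example", "MI", "MH")
snapshot.add_node("c.example", "MI")
snapshot.add_edge("a.example", "b.example")
snapshot.add_edge("b.example", "c.example")
rmh_nodes = snapshot.root_malicious_hosts()
```

Here `b.example` hosts malware but has an incoming edge, so only `a.example` qualifies as a root malicious host.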

4 DATA COLLECTION

Several MDNs were graphed for a one-month period from 03-01-2014 to 03-27-2014. We initialized the data collection using the top 500 most visited websites reported by Alexa.com on 03-01-2014. The process of collecting malicious external links and crawling GSB ran three times a day.

¹ iframes are a simple and widely accepted mechanism for redirection in HTTP traffic.

978-1-5090-1605-1/16/$31.00 ©2016 IEEE

SearchDiggity [13] is a freely available search tool that leverages Google and Bing to provide enhanced search results. We use its MalwareDiggity (MD) feature, which provides a list of malicious external links. First, we input the list of TLDs from Alexa to MD, which submits each domain to Bing’s linkfromdomain search operator [14] and returns the TLD name of all external URLs linked from that domain. Second, each external domain name is queried in GSB to determine whether that domain is currently classified as malicious. If it is, the query returns the URL of the domain’s GSB diagnostic report. Completing both steps produces a list of external malicious links accessible from one or more of the Alexa TLDs.

The retrieved GSB reports are the dataset for our MDN graphs. A GSB report lists several other identified malicious domains, each of which, when clicked, redirects to that domain’s report. By following this chain we exhaustively retrieved all relevant reports. Report retrieval ended when a report contained no domain links or when all domain links led to a previously retrieved report.

Some time after we completed data collection, Google changed the structure of its GSB diagnostic reports. During our collection period, a GSB diagnostic report included 5 questions:
1. What is the current listing status for “domain name”?
2. What happened when Google visited this site?
3. Has this site acted as an intermediary resulting in further distribution of malware?
4. Has this site hosted malware?
5. How did this happen?

We classify a node as an MI when the answer to question 3 is yes, and as an MH when the answer to question 4 is yes. We designate an MH node with no in-degree as an RMH node; these nodes were discovered via graph analysis. A node can have multiple node types. The only allowed node type classifications for a single node are RMH, MH, MI, and MH & MI.
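The report-following procedure and the question-based classification can be sketched as a breadth-first traversal that stops once every linked report has been seen. This is only an illustration under stated assumptions: the report fields `intermediary`, `hosted_malware`, and `linked_domains` are hypothetical names, `fetch_report` stands in for whatever retrieves a diagnostic report, and the in-memory `FAKE_REPORTS` table replaces real GSB queries.

```python
from collections import deque


def classify(report):
    """Map the yes/no answers to questions 3 and 4 onto node types."""
    node_types = set()
    if report["intermediary"]:      # question 3 answered yes
        node_types.add("MI")
    if report["hosted_malware"]:    # question 4 answered yes
        node_types.add("MH")
    return node_types


def crawl_reports(seed_domains, fetch_report):
    """Follow domain links between reports until no unseen report remains."""
    reports = {}
    queue = deque(seed_domains)
    while queue:
        domain = queue.popleft()
        if domain in reports:
            continue                # previously retrieved report
        report = fetch_report(domain)
        if report is None:
            continue                # no report available for this domain
        reports[domain] = report
        queue.extend(d for d in report["linked_domains"] if d not in reports)
    return reports


# Hypothetical stand-in for retrieved diagnostic reports.
FAKE_REPORTS = {
    "a.example": {"intermediary": False, "hosted_malware": True,
                  "linked_domains": ["b.example"]},
    "b.example": {"intermediary": True, "hosted_malware": True,
                  "linked_domains": ["a.example", "c.example"]},
    "c.example": {"intermediary": True, "hosted_malware": False,
                  "linked_domains": []},
}

reports = crawl_reports(["a.example"], FAKE_REPORTS.get)
node_types = {domain: classify(report) for domain, report in reports.items()}
```

The link from `b.example` back to `a.example` is skipped because that report was already retrieved, which is the termination condition described in the text.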

5 DYNAMIC MALWARE BEHAVIORAL MODEL

Graphs are represented by an augmented adjacency list data structure designed to capture both the dependencies of graph links and the mode of linkage type, MI or MH. We describe this data structure as a list of key-value pairs, whose keys are website addresses, denoted as sources, and whose values are pairs in which the destination is the address of a domain reported as being affected by the source. To place all of the website addresses on the visualization, we used the Dynamic Behavioral Graph to incorporate event frequencies, protocol types, packet contents and data flow information into one graph. In contrast to a typical force-directed graph such as D3 [15], our model goes beyond the aesthetic layout of a graph to reveal the dynamic sequential patterns in a three-dimensional virtual space. In the model, the attraction force between a pair of nodes is calculated using the formula

$$f_A = \frac{\| x_j - x_i \|^2}{\alpha \cdot T}$$

where $i$ and $j$ are distinct nodes, $\alpha$ is the value of elasticity, where a greater value increases the length of the edge, $T$ is equal to the average time between the nodes’ time stamps, and $\| x_j - x_i \|$ is the distance between the two nodes.
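A small worked example may help, assuming the formula is read as the squared distance between the two nodes divided by the product of the elasticity and the average time gap. The function name `attraction_force` and the sample coordinates are hypothetical, and the positivity check on the parameters is an added assumption rather than something the paper states.

```python
import math


def attraction_force(x_i, x_j, alpha, avg_time_gap):
    """Attraction between two nodes: squared distance over (alpha * T)."""
    if alpha <= 0 or avg_time_gap <= 0:
        raise ValueError("alpha and avg_time_gap must be positive")
    distance = math.dist(x_i, x_j)  # Euclidean distance in 3D space
    return distance ** 2 / (alpha * avg_time_gap)


# Two nodes 5 units apart, elasticity 2, average time gap 5.
force = attraction_force((0.0, 0.0, 0.0), (3.0, 4.0, 0.0),
                         alpha=2.0, avg_time_gap=5.0)

# Doubling the elasticity halves the attraction, so the edge settles longer.
weaker = attraction_force((0.0, 0.0, 0.0), (3.0, 4.0, 0.0),
                          alpha=4.0, avg_time_gap=5.0)
```

Under this reading, a larger elasticity or a longer average time between events both weaken the pull, so nodes linked by infrequent events drift further apart in the layout.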

6 PREPROCESSING FOR VISUALIZATION

Sixty-five collections were gathered from GSB from 03-01-2014 to 03-27-2014. We first extracted all RMH nodes from networks containing at least 2 nodes. Next, we found all persistent RMH nodes. We used the following algorithm to create chains of connected nodes for every RMH node in each collection.

function CREATE CHAINS(rmh, rows)
    search_rows ← empty array of tuple
    output_rows ← empty array of tuple
    for all rows[i] in rows do
        if rows[i].source == rmh then
            insert rows[i] into search_rows
        end if
    end for
    while search_rows.size > 0 do
        row ← search_rows[0]
        drop search_rows[0]
        if row.destination is already a destination in output_rows then
            insert row into output_rows
            continue
        end if
        temp_rows ← empty array of tuple
        for all rows[i] in rows do
            if rows[i].source == row.destination then
                insert rows[i] into temp_rows
            end if
        end for
        insert row into output_rows
        insert temp_rows into search_rows
    end while
    return output_rows
end function
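The chain-building algorithm can be sketched in Python as follows. This is one possible reading, not the authors' code: it assumes each row is a (source, destination) pair, that the initial filter selects rows whose source is the RMH, and that the function returns the collected rows. The `Row` type and the `.example` domains are hypothetical.

```python
from collections import deque, namedtuple

Row = namedtuple("Row", ["source", "destination"])


def create_chains(rmh, rows):
    """Collect every edge reachable from the given RMH node."""
    # Start from the edges that leave the RMH node itself.
    search_rows = deque(row for row in rows if row.source == rmh)
    output_rows = []
    seen_destinations = set()
    while search_rows:
        row = search_rows.popleft()
        output_rows.append(row)
        if row.destination in seen_destinations:
            # This destination was already expanded: keep the edge,
            # but do not follow its outgoing edges a second time.
            continue
        seen_destinations.add(row.destination)
        # Queue the edges that continue the chain from this destination.
        search_rows.extend(r for r in rows if r.source == row.destination)
    return output_rows


rows = [
    Row("a.example", "b.example"),
    Row("a.example", "c.example"),
    Row("b.example", "d.example"),
    Row("c.example", "d.example"),
    Row("x.example", "y.example"),  # not reachable from a.example
]
chain = create_chains("a.example", rows)
```

Tracking expanded destinations in a set keeps the loop from revisiting a node, which is what guarantees termination when the network contains cycles.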

Figure 1. Example of the 3D Dynamic Behavior Graph of an MDN (Red - MI; Blue – MH)

We use a gradient arc to display the direction of edges. The decrease in alpha (opacity) value indicates the direction, with 1 at the source and 0 at the destination. This novel visual representation also enables us to add packet information to the edges. Figure 1 shows an example of the 3D dynamic behavioral graph of an MDN built from multiple chains that originate at the same RMH node and were recorded in different collections. In the figure, circles are individual nodes of a network. For each visualization, we display all of the nodes within a given date range regardless of the number of collections in which each node occurs. The edges, on the other hand, are displayed only if they are present in all of the collections. This way we can highlight the paths of malware distribution that are common to all of the collections within a given date range.
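The display rule for nodes and edges amounts to a union of nodes and an intersection of edges across the collections in the date range. The sketch below assumes each collection is available as a set of (source, destination) pairs; the function name `shared_view` and the sample data are hypothetical.

```python
def shared_view(collections):
    """Return (all nodes seen in any collection, edges present in every one)."""
    if not collections:
        return set(), set()
    # Nodes are shown if they occur in at least one collection.
    all_nodes = {node for edges in collections
                 for edge in edges for node in edge}
    # Edges are shown only if they occur in every collection.
    shared_edges = set.intersection(*(set(edges) for edges in collections))
    return all_nodes, shared_edges


# Three hypothetical collections within one date range.
collections = [
    {("a", "b"), ("b", "c"), ("a", "d")},
    {("a", "b"), ("b", "c")},
    {("a", "b"), ("b", "c"), ("c", "e")},
]
nodes, edges = shared_view(collections)
```

Nodes `d` and `e` still appear in the view even though the edges that introduced them are dropped, which matches the rule that nodes are kept regardless of how often they occur.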

7 DISCOVERED MALWARE DISTRIBUTION PATTERNS

From our visual analytical experiments, we have discovered at least 3 major patterns.

2016 IEEE SYMPOSIUM ON VISUALIZATION FOR CYBER SECURITY (VIZSEC)

7.1 Persistent Shared Edges from Root Malicious Host

Based on over 100 case studies, we found that an RMH may originate different edges over a period of time. Certain edges, however, are persistent and are shared by the connected nodes over time. Figure 2 shows a small MDN with a single RMH, where we have 4 persistent shared edges through the whole period of data collection from 03-01-2014 through 03-27-2014.

Figure 2. Shared edges for collections from 03-01-2014 to 03-27-2014 (Blue – MH)

7.2 Burst Pattern

The Burst pattern consists of multiple connected groups of edges. All edges have the same direction and a single origin. They appear mostly in big networks. Figure 3 shows a typical branching pattern, where a branch originates from an RMH node and goes down the figure, and a chain of edges originates at this same point and ends at the last connected node on the right of the figure.

Figure 3. Visualization of a Burst pattern of chains for collections from 03-01-2014 to 03-27-2014 (Red – MI, Blue – MH)

7.3 Malware Spread with Multiple Origins

Although malware originates at an RMH node, the path it takes to be distributed might differ in every collection. Figures 4 and 5 show that there can be multiple nodes that originate shared edges for all of the collections. Figure 4 presents the first collection period, from 03-01-2014 to 03-12-2014. As we can see, there are two big main groups of edges and some smaller ones, with long joined edges in between. In this case there are 148 common edges among all of the collections. Figure 5 presents chains for this same RMH node but for the collection period from 03-01-2014 to 03-27-2014. Here, there are only 86 shared edges. There are also fewer groups of edges and they are not as big as in Figure 4, but we still have long chains of edges attached to them. This may indicate that the center nodes of node groups are important in malware distribution. This information might be very helpful, especially when, as in this case, the RMH momentarily affects adjacent nodes that then go on to serve as persistent intermediaries of malware.

Figure 4. Malware spread pattern with multiple points of origin for collections from 03-01-2014 to 03-12-2014 (Red – MI, Blue – MH)

Figure 5. Malware spread pattern with multiple points of origin for collections from 03-01-2014 to 03-27-2014 (Red – MI, Blue – MH)

7.4 Dynamics of Malware Spread Pattern

Malware spreading is a dynamic process. Figures 6 and 7 show how, as the collection period lengthens, the number of shared edges decreases, leaving the most persistent chains. Both figures show chains for the same RMH: Figure 6 highlights edges for the collection period from 03-01-2014 to 03-12-2014, while Figure 7 shows edges for the collection period from 03-01-2014 to 03-27-2014. In those figures we can see branching patterns, which vary in size. Figure 6 presents the main group of nodes on the right and about 7 other smaller groups. Figure 7, on the other hand, presents a much smaller branch block with the main group of edges on the right, but with fewer edges than in Figure 6. We can also see a decrease in the total number of clusters. This can mean that in the additional days separating the collections, malware was transmitted using several different paths, but some of the clusters and edges are still visible, which means they are utilized in all collections.
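The shrinking edge counts are what one would expect if shared edges are the edges present in every collection of a date range. As a short formalization, with notation introduced here only for illustration, let $C$ be the set of collections in a date range, $E_c$ the edge set of collection $c$, and $S(C)$ the shared edges:

$$S(C) = \bigcap_{c \in C} E_c, \qquad C \subseteq C' \;\Rightarrow\; S(C') \subseteq S(C) \;\Rightarrow\; |S(C')| \le |S(C)|$$

Because the range from 03-01-2014 to 03-27-2014 contains every collection of the range from 03-01-2014 to 03-12-2014, its shared edge set can only stay the same or shrink, as in the drop from 148 to 86 shared edges reported for Figures 4 and 5.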

Figure 6. Malware spread pattern for collections from 03-01-2014 to 03-12-2014 (Red – MI, Blue – MH)

Figure 7. Malware spread pattern for collections from 03-01-2014 to 03-27-2014 (Red – MI, Blue – MH)

8 CONCLUSIONS

In this study, we formally defined and visualized a URL-based malware distribution network using Google Safe Browsing reports. Our novel graphical representation of GSB reports produced comprehensive MDN graphs for the total collection period. We discovered that our MDN graphs provide critical inferred data about TLDs and subnetworks belonging to an MDN that cannot be derived from static GSB reports alone. The key data provided by our MDN graphs are: TLD role in distribution, TLD and subnetwork persistence, and structural trends over time. Our analysis for a one-month period revealed that the vast majority of malware distribution is conducted over a small number of highly complex MDNs. Using visualization methods, we found 3 major patterns in MDN graphs: persistent shared edges, burst pattern, and dynamics of malware spread pattern, both from single and multiple points of origin. Our study has demonstrated how GSB reports can be leveraged to conduct temporal tracking and analysis of malware distribution. Our data analysis revealed that the discovery of highly critical TLDs and subnetworks can be conducted efficiently, potentially facilitating traffic reduction and TLD removal to continually reduce the risk of malicious compromise to the computers and devices of users simply visiting popular websites. Future work includes adding data such as WHOIS, DNS, IP addresses, and geo-locations to enhance our current results and to associate malware binaries and malicious actors with MDN components.

ACKNOWLEDGEMENT

This study is in part supported by Northrop Grumman Corporation and PNC Bank. This material is based upon work funded and supported by the Department of Defense under Contract No. FA8721-05-C-0003 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center. References herein to any specific commercial product, process, or service by trade name, trade mark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by Carnegie Mellon University or its Software Engineering Institute. [Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution. DM-0003890.

REFERENCES

[1] G. Gu, R. Perdisci, J. Zhang, and W. Lee, “BotMiner: Clustering analysis of network traffic for protocol- and structure-independent botnet detection,” in Proc. of the 17th USENIX Security Symposium (Security’08), 2008.
[2] G. Gu, J. Zhang, and W. Lee, “BotSniffer: Detecting botnet command and control channels in network traffic,” in Proc. of the 15th Annual Network and Distributed System Security Symposium (NDSS’08), February 2008.
[3] D. McCoy, A. Pitsillidis, G. Jordan, N. Weaver, C. Kreibich, B. Krebs, G. M. Voelker, S. Savage, and K. Levchenko, “Pharmaleaks: understanding the business of online pharmaceutical affiliate programs,” in Proc. of the 21st USENIX conference on Security symposium, ser. Security’12. Berkeley, CA, USA: USENIX Association, 2012, pp. 1–1.
[4] M. Karami and M. Damon, “Understanding the emerging threat of ddos-as-a-service,” in Proc. of the USENIX Workshop on Large-Scale Exploits and Emergent Threats, 2013.
[5] “Google safe browsing,” https://developers.google.com/safe-browsing/
[6] J. Zhang, C. Seifert, J. W. Stokes, and W. Lee, “Arrow: Generating signatures to detect drive-by downloads,” in Proc. of the 20th International Conference on World Wide Web, WWW 2011, Hyderabad, India, March 28 - April 1, 2011, S. Srinivasan, K. Ramamritham, A. Kumar, M. P. Ravindra, E. Bertino, and R. Kumar, Eds. ACM, 2011.
[7] C. Rossow, C. Dietrich, and H. Bos, “Large-scale analysis of malware downloaders,” in Proc. of the 9th international conference on DIMVA, ser. DIMVA’12. Berlin, Heidelberg: Springer-Verlag, 2013.
[8] J. Caballero, C. Grier, C. Kreibich, and V. Paxson, “Measuring pay-per-install: the commoditization of malware distribution,” in Proc. of the 20th USENIX conference on Security, ser. SEC’11. Berkeley, CA, USA: USENIX Association, 2011.
[9] M. Goncharov, “Traffic direction systems as malware distribution tools,” Trend Micro, Tech. Rep., 2011.
[10] Z. Behfarshad, “Survey of malware distribution networks,” Electrical and Computer Engineering, University of British Columbia, Tech. Rep., 2012.
[11] N. Provos, D. McNamee, P. Mavrommatis, K. Wang, and N. Modadugu, “The ghost in the browser analysis of web-based malware,” in Proc. of the first conference on First Workshop on Hot Topics in Understanding Botnets, ser. HotBots’07. Berkeley, CA, USA: USENIX Association, 2007.
[12] N. Provos, P. Mavrommatis, M. A. Rajab, and F. Monrose, “All your iframes point to us,” in Proc. of the 17th conference on Security symposium, ser. SS’08. Berkeley, CA, USA: USENIX Association, 2008.
[13] http://www.stachliu.com/2012/08/search-diggity-install/
[14] “Bing linkfromdomain search operator,” http://www.bing.com/blogs/site_blogs/b/search/archive/2006/10/16/search-macroslinkfromdomain.aspx
[15] http://www.d3.org
