Hierarchical State Machine Architecture for Regular Expression Pattern Matching

Cheng-Hung Lin

Hsien-Sheng Hsiao

National Taiwan Normal University, Taipei, Taiwan

National Taiwan Normal University, Taipei, Taiwan

[email protected]

[email protected]

Take the regular expression “.*AB.*CD” as an example, where the metacharacters “.*” denote zero or more instances of any character. Figure 1 shows the DFA that recognizes the regular expression pattern “.*AB.*CD”, where state 0 is the initial state and state 4 is the final state, indicating a match of the regular expression pattern.

ABSTRACT

Regular expressions have been widely used in network intrusion detection systems to represent attack patterns due to their expressive power and flexibility. However, the traditional memory architecture suffers from memory explosion for certain types of complex regular expressions. In this paper, we propose a hierarchical state machine architecture which can significantly reduce the memory required to accommodate complex regular expression patterns. The experiments demonstrate a significant reduction in memory for the complex regular expression patterns commonly used in network intrusion detection systems.

Figure 1: Deterministic finite automaton of “.*AB.*CD”
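The automaton of Figure 1 can be written down as a small transition table. The following is a minimal sketch, not the authors' implementation: the table layout and the names `EXPLICIT`, `FALLBACK`, and `matches` are illustrative, and treating state 4 as absorbing (once the pattern has matched, it stays matched) is an assumption made here for simplicity.

```python
# Labeled edges of the DFA for ".*AB.*CD" (states 0..4, state 4 accepting).
EXPLICIT = {
    0: {"A": 1},
    1: {"A": 1, "B": 2},
    2: {"C": 3},
    3: {"C": 3, "D": 4},
    4: {},  # assumption: the accepting state is absorbing
}

# Where each state goes on any character without an explicit edge.
FALLBACK = {0: 0, 1: 0, 2: 2, 3: 2, 4: 4}

ACCEPT = 4


def matches(text: str) -> bool:
    """Run the DFA over text, one table lookup per input character."""
    state = 0
    for ch in text:
        state = EXPLICIT[state].get(ch, FALLBACK[state])
    return state == ACCEPT
```

Each input character costs exactly one lookup, which is the property that makes the memory (DFA) architecture attractive; the price is that every state needs an entry in the table.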

Categories and Subject Descriptors C.2.0 [Computer Communication Networks]: General-Security and protection (e.g., firewalls)

It is well known that when m patterns are compiled into a single DFA, the time complexity of matching m patterns against an input text of length n is reduced from O(mn) to O(n). Therefore, compiling multiple regular expression patterns into a single composite DFA is an efficient way to match multiple regular expression patterns. However, there are specific regular expression patterns which cause the composite DFA to grow exponentially. For example, using Flex [18] for compilation, the DFAs of the two Linux L7-filter patterns “.*membername.*session.*player” and “.*rdpdr.*cliprdr.*rdpsnd” contain 47 and 43 states, respectively. When these two patterns are compiled into a composite DFA, the number of states grows to 232. The increase in the number of states comes from the multiple sets of metacharacters “.*” in these patterns. Figure 2 illustrates why compiling such regular expression patterns into a composite DFA causes the corresponding state machine to grow exponentially. It shows the composite DFA for matching the regular expression patterns “.*AB.*CD” and “.*MN.*PQ”, where some edges are omitted for clarity. In Figure 2, states 0, 1, 2, 3, and 4 belong to the original state machine of “.*AB.*CD” and states 5, 6, 7, and 8 belong to the original state machine of “.*MN.*PQ”.
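The growth described above can be observed with a small experiment. The sketch below is an illustration under stated assumptions, not the Flex-based construction used in the paper: each pattern “.*R.*R’” is modeled as a list of fragments, accepting states are treated as absorbing, and the composite DFA is obtained by a breadth-first product construction. The function names and the placeholder character `other` are hypothetical. Because of these conventions, the state counts will not coincide with the numbering in Figure 2 or with the Flex figures quoted above.

```python
from collections import deque
from typing import List, Tuple

State = Tuple[int, int]  # (index of current fragment, matched prefix length)


def step(fragments: List[str], state: State, ch: str) -> State:
    """Advance the DFA of one pattern '.*f0.*f1...' by one character."""
    i, k = state
    if i == len(fragments):  # assumption: accepting state is absorbing
        return state
    frag = fragments[i]
    t = frag[:k] + ch
    # Fall back to the longest suffix that is still a prefix of the fragment.
    while t and not frag.startswith(t):
        t = t[1:]
    if t == frag:
        return (i + 1, 0)  # fragment completed, move on to the next one
    return (i, len(t))


def count_composite_states(patterns: List[List[str]], other: str = "\0") -> int:
    """Count reachable states of the product (composite) DFA."""
    alphabet = sorted({c for frags in patterns for f in frags for c in f} | {other})
    start = tuple((0, 0) for _ in patterns)
    seen = {start}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        for ch in alphabet:
            nxt = tuple(step(f, s, ch) for f, s in zip(patterns, cur))
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)
```

Calling `count_composite_states` on one, two, and three patterns of this shape shows the composite count pulling away from the sum of the individual counts, which is the effect the text attributes to the “.*” metacharacters.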

General Terms: Algorithms, Design, Security

Keywords: Pattern Matching, Regular Expression, State Machine

1. INTRODUCTION

In a signature-based network intrusion detection system (NIDS), the input packets are matched against thousands of attack patterns to identify malicious packets. To represent attack patterns efficiently, regular expressions have been widely adopted in NIDS systems such as Snort [1], Bro [2], and Linux L7-filter [3] because of their better expressive power and flexibility compared with explicit string representation. Because regular expression matching is the most computationally intensive task in NIDS systems, many hardware approaches have been proposed to accelerate it. The hardware approaches may be classified into two main categories, the logic [4,5,6,7] and memory [8,9,10,11,12,13] architectures. Basically, the logic architecture is based on the Nondeterministic Finite Automaton (NFA), while the memory architecture is based on the Deterministic Finite Automaton (DFA). From the perspectives of re-configurability and scalability, memory architectures are attractive because they are flexible and scalable. The memory architecture for matching regular expression patterns works as follows. First, the regular expression patterns are compiled into a DFA. Then, the corresponding state transition table is stored in memory to recognize regular expression patterns.


Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. GLSVLSI’09, May 10–12, 2009, Boston, Massachusetts, USA. Copyright 2009 ACM 978-1-60558-522-2/09/05…$5.00.

Figure 2: Deterministic finite automaton of “.*AB.*CD” and “.*MN.*PQ”

The additional states 9, 10, 11, 12, and 13 come from the correlation of the two patterns. We observe that the additional states are determined by the order and the lengths of the four fragments “AB”, “CD”, “MN”, and “PQ”. In other words, the additional states are duplicated from the sub-patterns segmented by the metacharacters “.*”. As more such regular expression patterns are compiled together, the number of additional states increases tremendously. Because the memory requirement is proportional to the number of states, reducing the number of states reduces the memory requirement and, therefore, the power consumption and cost.

In this paper, we propose a hierarchical state machine architecture composed of two state machines, called master and slave, to match these complex regular expression patterns. The master state machine matches the segmented sub-patterns, while the slave state machine identifies the correlation of the segmented sub-patterns. Compared to the traditional DFA, the hierarchical state machine architecture can significantly reduce the number of states and therefore the memory size. Experimental results show that the hierarchical state machine architecture achieves an average state reduction of 95% and an average memory reduction of 97% for the complicated regular expression patterns in the Snort, Bro, and L7-filter systems.

2. COMPLEX REGULAR EXPRESSION

In this paper, a complex regular expression pattern denotes a regular expression pattern with multiple sets of the metacharacters “.*”, such as the two patterns in L7-filter, “.*membername.*session.*player” and “.*rdpdr.*cliprdr.*rdpsnd”. Snort and Bro have similar complex regular expressions. As described in the introduction, when multiple complex regular expression patterns are compiled into a composite DFA, the correlation of the segmented sub-patterns causes the state machine to grow exponentially. Basically, compiling m patterns of length $n_i$ (i=0…m) with one “.*” in each pattern needs $O\left(\left(\sum_{i=0}^{m} n_i\right) \cdot 2^{m}\right)$ states. Furthermore, compiling m patterns of length $n_i$ (i=0…m) with x instances of “.*” in each pattern needs $O\left(\left(\sum_{i=0}^{m} n_i\right) \cdot (m+x)^{m}\right)$ states.

In other words, as the number of metacharacters “.*” and the number of patterns increase, the composite state machine grows exponentially due to the duplication of segmented sub-patterns. Therefore, if we can reduce the duplication of segmented sub-patterns, we can reduce the number of additional states and therefore the memory size.

3. HIERARCHICAL STATE MACHINE

In this section, we propose a hierarchical state machine architecture to handle such complex regular expression patterns. The basic idea is to use two concurrent state machines: one matches the segmented sub-patterns, while the other identifies the correlation of the segmented sub-patterns. This efficiently reduces the additional states. For m regular expressions, “.*R1.*R1’”, “.*R2.*R2’”, …, and “.*Rm.*Rm’”, Figure 3 shows the hierarchical state machine architecture, which is composed of two parallel state machines, the master and slave state machines. The master state machine matches the input stream against the segmented sub-patterns R1, R1’, R2, R2’, …, Rm, and Rm’. The matching results α, β, γ, δ, …, ε, and θ of the master state machine, called inter-results, are then fed to the slave state machine, which identifies the correlation of the segmented sub-patterns and outputs the final results, called the match vector.

Figure 3: Hierarchical State Machine Architecture

We now illustrate the hierarchical state machine architecture using an example. Consider the two patterns “.*AB.*CD” and “.*MN.*PQ”. We first construct a master state machine that matches input characters against the segmented sub-patterns “AB”, “CD”, “MN”, and “PQ”, as shown in Figure 4(a), where some edges are omitted for clarity. In Figure 4(a), the outputs, called inter-results, of final states 2, 4, 6, and 8 are α, β, γ, and δ, respectively. All other states output 0. Then, we construct a slave state machine that matches the inter-results against the new regular expressions “.*α.*β” and “.*γ.*δ”, as shown in Figure 4(b).

Figure 4: Master and Slave State Machine. (a) Master State Machine; (b) Slave State Machine

Consider the example in Figure 5, which matches the input string “ABMNCD” against the regular expressions “.*AB.*CD” and “.*MN.*PQ”. First, as the master state machine reads “AB”, it moves from state 0 to state 2 and outputs “α”. At the same time, the slave state machine reads the inter-results “00α” and moves from state 0 to state 1. Then, the master state machine reads “MN”, moves from state 2 to state 6, and outputs “γ”. Concurrently, the slave state machine reads “0γ” and moves from state 1 to state 5. Finally, the master state machine reads “CD”, moves from state 6 to state 4, and outputs “β”. The slave state machine then reads “0β” and moves from state 5 to state 2, which is the final state of the pattern “.*AB.*CD”.
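The walk-through of “ABMNCD” can be reproduced with a short simulation. This is a sketch under several assumptions rather than the authors' hardware design: the master is modeled as a prefix automaton whose states are numbered in the order the fragment prefixes are inserted (which happens to give the numbering used in the walk-through), the slave is modeled as one progress counter per pattern instead of the numbered states of the slave state machine, and the slave consumes each inter-result in the same step rather than one step later. The names `build_master` and `hierarchical_match` are hypothetical.

```python
from typing import Dict, List, Tuple


def build_master(fragments: List[str]) -> Dict[str, int]:
    """Number every fragment prefix as a master state; 0 is the empty prefix."""
    ids: Dict[str, int] = {"": 0}
    for frag in fragments:
        for k in range(1, len(frag) + 1):
            ids.setdefault(frag[:k], len(ids))
    return ids


def hierarchical_match(
    patterns: List[List[str]], text: str
) -> Tuple[List[int], List[bool]]:
    """Return the master state trace and the slave's match vector."""
    fragments = [f for pat in patterns for f in pat]
    ids = build_master(fragments)
    state = ""                        # master state, kept as the matched prefix
    trace = [ids[state]]
    progress = [0] * len(patterns)    # slave: next expected fragment per pattern
    for ch in text:
        # Master: longest suffix of the input that is a prefix of a fragment.
        state += ch
        while state not in ids:
            state = state[1:]
        trace.append(ids[state])
        # Inter-results: the fragments that end at this position.
        emitted = {f for f in fragments if state.endswith(f)}
        # Slave: advance a pattern only when its next fragment was emitted.
        for p, pat in enumerate(patterns):
            if progress[p] < len(pat) and pat[progress[p]] in emitted:
                progress[p] += 1
    match_vector = [progress[p] == len(pat) for p, pat in enumerate(patterns)]
    return trace, match_vector
```

For example, `hierarchical_match([["AB", "CD"], ["MN", "PQ"]], "ABMNCD")` yields the master trace 0, 1, 2, 5, 6, 3, 4 and reports a match for the first pattern only. The master needs one state per fragment prefix, so its size grows with the total fragment length rather than with the product of the individual automata.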

Input character: A B M N C D
State of master: 0 → 1 → 2 → 5 → 6 → 3 → 4
Inter-results: 0 0 α 0 γ 0 β
State of slave: 0 → 0 → 0 → 1 → 1 → 5 → 5 → 2

Figure 5: State transitions of matching “ABMNCD”

4. EXCEPTION PROBLEMS

Although the hierarchical state machine architecture can effectively reduce the memory requirements of merging such complex regular expression patterns, it creates a new problem, called the exception problem, when merging certain types of complicated regular expressions. The exception problem may lead to false positive results. We now illustrate the exception problem using an example. Figure 6 shows the master state machine of the two patterns “.*ABC.*BCD” and “.*MNO.*NOP”. Consider an input string containing the substring “ABCDE”. When “ABC” is fed, the master state machine moves to state 3 and outputs “α”, which causes the slave state machine to move to state 1. When the next character “D” is fed, the master state machine moves to state 6 and outputs “β”. Then, the slave state machine moves to state 2, which is the final state of the pattern “.*ABC.*BCD”. This matching result is a false positive because the input string “ABCD” should not match the pattern “.*ABC.*BCD”. The exception problem arises because the input “BC” matches both a suffix of “ABC” and a prefix of “BCD”.

In the following theorem, we show a necessary condition for the exception problem. Therefore, we can safely merge complex regular expressions without the exception problem if the necessary condition is not satisfied. Consider a hierarchical state machine architecture which merges m regular expressions, “.*R1.*R1’”, “.*R2.*R2’”, ..., and “.*Rm.*Rm’”.

Definition: A sub-pattern R’ is called an overlapped-expression if R’ matches both a suffix of R1, R2, …, or Rm and a prefix of R1’, R2’, …, or Rm’.

Consider again the example in Figure 6, where R1 and R1’ denote “ABC” and “BCD”, respectively, while R2 and R2’ denote “MNO” and “NOP”, respectively. The sub-pattern “BC” matches both a suffix of R1 and a prefix of R1’. Therefore, “BC” is an overlapped-expression. Similarly, the sub-pattern “NO” is also an overlapped-expression.

Theorem: The hierarchical state machine architecture does not have the exception problem if there is no overlapped-expression, that is, no sub-pattern that matches both a suffix of R1, R2, …, or Rm and a prefix of R1’, R2’, …, or Rm’.

Proof: The exception problem arises only when there exists an overlapped-expression. Suppose an input string contains an overlapped-expression of R1 and R1’. Then the master state machine, while matching R1, also matches a prefix of R1’. If R1’ is subsequently matched by the master state machine without moving back to the initial state, the exception problem arises. ■
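Both the false positive and the condition in the theorem can be checked mechanically. The sketch below is illustrative only: `overlapped_expressions` enumerates the non-empty strings that are a suffix of some Ri and a prefix of some Rj’, and `naive_hierarchical_match` is a deliberately simplified stand-in for the master and slave pair (it only tracks where fragments end, as the inter-results do). Both names are hypothetical, and Python's `re` module is used only as a reference matcher.

```python
import re
from typing import List, Set


def overlapped_expressions(firsts: List[str], seconds: List[str]) -> Set[str]:
    """Non-empty strings that are a suffix of some Ri and a prefix of some Rj'."""
    suffixes = {r[i:] for r in firsts for i in range(len(r))}
    prefixes = {r[:i] for r in seconds for i in range(1, len(r) + 1)}
    return suffixes & prefixes


def naive_hierarchical_match(first: str, second: str, text: str) -> bool:
    """Report a match when `second` ends at some position after `first` ended.

    Like the inter-results, this only records where fragments END, so it
    cannot tell whether the two occurrences overlap in the input.
    """
    seen_first = False
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if seen_first and prefix.endswith(second):
            return True
        if prefix.endswith(first):
            seen_first = True
    return False


# "ABCD" is reported as a match although ".*ABC.*BCD" needs six characters.
false_positive = (
    naive_hierarchical_match("ABC", "BCD", "ABCD")
    and re.search("ABC.*BCD", "ABCD") is None
)
```

A rule compiler could run a check like `overlapped_expressions` before merging patterns and fall back to another treatment for any pattern set where the result is non-empty; how the authors handle such sets is not stated in the text.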

5. EXPERIMENTAL RESULTS

We implement the hierarchical state machine architecture on the regular expression patterns of Snort, Bro, and L7-filter. The results are compared with the traditional DFA approach, as shown in Table 1. Columns one and two show the rule set and the number of complex regular expression patterns. Columns three and four show the number of states and the memory size of the traditional DFA. Columns five and six show the number of states and the memory requirement of the master state machine. Columns seven and eight show the number of states and the memory requirement of the slave state machine. For example, in the first row of Table 1, the Snort rule set contains 71 complex regular expressions. Constructing a composite DFA needs 211,438 states and 153,321,472 bytes. With our hierarchical state machine architecture, the master state machine needs 2,914 states and 1,694,720 bytes, while the slave state machine needs 2,724 states and 1,498,624 bytes. The state and memory reductions are 97% and 98%, respectively. Furthermore, when applied to Bro and Linux L7-filter, we get 99% and 95% memory reduction, respectively. The results show that the hierarchical state machine architecture is very efficient in reducing memory when compiling multiple complex regular expression patterns.
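The reduction percentages can be recomputed from the raw figures. The sketch below assumes, since the text does not spell it out, that a reduction is one minus the ratio of the combined master and slave figures to the traditional DFA figure; the numbers are those reported for Table 1, and the names `ROWS`, `reduction`, and `summarize` are illustrative.

```python
from typing import Dict, Tuple

# (DFA states, DFA bytes, master states, master bytes, slave states, slave bytes)
ROWS: Dict[str, Tuple[int, int, int, int, int, int]] = {
    "Snort": (211_438, 153_321_472, 2_914, 1_694_720, 2_724, 1_498_624),
    "Bro": (300_498, 324_534_272, 3_367, 1_731_072, 4_051, 1_943_552),
    "L7-filter": (9_614, 7_362_560, 758, 335_616, 118, 7_120),
}


def reduction(baseline: int, master: int, slave: int) -> float:
    """Percentage saved by master + slave relative to the traditional DFA."""
    return 100.0 * (1.0 - (master + slave) / baseline)


def summarize(rows: Dict[str, Tuple[int, int, int, int, int, int]]) -> Dict[str, Tuple[int, int]]:
    """Map each rule set to (state reduction %, memory reduction %), rounded."""
    out = {}
    for name, (s0, m0, ms, mm, ss, sm) in rows.items():
        out[name] = (round(reduction(s0, ms, ss)), round(reduction(m0, mm, sm)))
    return out
```

Under this assumption, `summarize(ROWS)` gives 97% and 98% for Snort, 98% and 99% for Bro, and 91% and 95% for L7-filter, in agreement with the values reported above.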

Figure 6: Example of Exception Problem. (a) Master State Machine; (b) Slave State Machine

Table 1: Experimental results on Snort, Bro and L7-filter rule sets

| Rule set | # of patterns | Traditional DFA: # of states | Traditional DFA: memory (bytes) | Master state machine: # of states | Master state machine: memory (bytes) | Slave state machine: # of states | Slave state machine: memory (bytes) | State reduction | Memory reduction |
|---|---|---|---|---|---|---|---|---|---|
| SNORT | 71 | 211,438 | 153,321,472 | 2,914 | 1,694,720 | 2,724 | 1,498,624 | 97% | 98% |
| BRO | 88 | 300,498 | 324,534,272 | 3,367 | 1,731,072 | 4,051 | 1,943,552 | 98% | 99% |
| L7-FILTER | 11 | 9,614 | 7,362,560 | 758 | 335,616 | 118 | 7,120 | 91% | 95% |
| Total | 170 | 521,550 | 485,218,304 | 7,039 | 3,761,408 | 6,893 | 3,449,296 | 95% | 97% |
| AVG. | | 1 | 1 | 1.35% | 0.78% | 1.32% | 0.71% | | |

6. CONCLUSIONS

Because there are specific regular expression patterns in NIDS which cause the memory requirement to grow exponentially, reducing the memory requirement of such complicated patterns has become imperative. In this paper, we have presented a hierarchical state machine architecture to accommodate complex regular expression patterns. By using two parallel state machines, the hierarchical state machine architecture achieves significant state and memory reductions while maintaining the same matching time complexity. In addition, as the hierarchical state machine architecture is based on a general memory architecture, it can also be applied to simple regular expression patterns, such as exact string patterns. The experiments demonstrate a significant reduction in memory for the complex regular expression patterns commonly used in NIDS systems.

References

[1] Snort official website, http://www.snort.org/

[2] Bro official website, http://www.bro-ids.org/

[3] Linux L7-filter official website, http://l7-filter.sourceforge.net/

[4] R. Sidhu and V. K. Prasanna, “Fast regular expression matching using FPGAs,” in Proc. 9th Ann. IEEE Symp. Field-Program. Custom Comput. Mach. (FCCM), 2001, pp. 227-238.

[5] B. L. Hutchings, R. Franklin, and D. Carver, “Assisting Network Intrusion Detection with Reconfigurable Hardware,” in Proc. 10th Ann. IEEE Symp. Field-Program. Custom Comput. Mach. (FCCM), 2002, pp. 111-120.

[6] C. R. Clark and D. E. Schimmel, “Scalable Pattern Matching for High Speed Networks,” in Proc. 12th Ann. IEEE Symp. Field-Program. Custom Comput. Mach. (FCCM), 2004, pp. 249-257.

[7] J. Moscola, J. Lockwood, R. P. Loui, and M. Pachos, “Implementation of a Content-Scanning Module for an Internet Firewall,” in Proc. 11th Ann. IEEE Symp. Field-Program. Custom Comput. Mach. (FCCM), 2003, pp. 31-38.

[8] M. Aldwairi, T. Conte, and P. Franzon, “Configurable String Matching Hardware for Speeding up Intrusion Detection,” in ACM SIGARCH Computer Architecture News, 2005, pp. 99-107.

[9] S. Dharmapurikar and J. Lockwood, “Fast and Scalable Pattern Matching for Content Filtering,” in Proc. of Symp. Architectures Netw. Commun. Syst. (ANCS), 2005, pp. 183-192.

[10] Y. H. Cho and W. H. Mangione-Smith, “A Pattern Matching Co-processor for Network Security,” in Proc. 42nd Des. Autom. Conf. (DAC), 2005, pp. 234-239.

[11] L. Tan and T. Sherwood, “A high throughput string matching architecture for intrusion detection and prevention,” in 32nd Ann. Int. Symp. on Comp. Architecture (ISCA), 2005, pp. 112-122.

[12] H. J. Jung, Z. K. Baker, and V. K. Prasanna, “Performance of FPGA Implementation of Bit-split Architecture for Intrusion Detection Systems,” in 20th Int. Parallel and Distributed Processing Symp. (IPDPS), 2006.

[13] F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz, “Fast and Memory-Efficient Regular Expression Matching for Deep Packet Inspection,” in Proc. ACM/IEEE Symp. Architectures Netw. Commun. Syst. (ANCS), 2006, pp. 93-102.

[14] A. V. Aho and M. J. Corasick, “Efficient String Matching: An Aid to Bibliographic Search,” in Communications of the ACM, 18(6):333-340, 1975.

[15] B. Brodie, R. Cytron, and D. Taylor, “A Scalable Architecture for High-throughput Regular Expression Matching,” in Proc. 33rd Int’l Symposium on Computer Architecture (ISCA), 2006, pp. 191-202.

[16] S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, and J. Turner, “Algorithms to Accelerate Multiple Regular Expressions Matching for Deep Packet Inspection,” in ACM SIGCOMM Computer Communication Review, ACM Press, vol. 36, issue 4, Oct. 2006, pp. 339-350.

[17] P. Piyachon and Y. Luo, “Compact State Machines for High Performance Pattern Matching,” in Proc. of Design Automation Conference (DAC), 2007, pp. 493-496.

[18] Flex, The Fast Lexical Analyzer, http://flex.sourceforge.net/
