
A Heuristic Correlation Algorithm for Data Reduction through Noise Detection in Stream-Based Communication Management Systems

Faisal Zaman*, Sebastian Robitzsch*, Zhuo Wu*, John Keeney†, Sven van der Meer† and Gabriel-Miro Muntean*

*Dublin City University, The Rince Institute, Dublin, Ireland
Email: {faisal.zaman, sebastian.robitzsch, gabriel.muntean}@dcu.ie and [email protected]
†Ericsson, Network Management Lab, Athlone, Ireland
Email: {john.keeney, sven.van.der.meer}@ericsson.com

Abstract: Monitoring and management of modern telecommunication networks has become more and more challenging due to the explosion in scale of data generated by network elements. Not only has the size of the network, the number of nodes, and the number of customers increased, but the amount and dimensionality of the data coming from each managed element has also increased. To support sophisticated monitoring and management strategies it is desirable to forward as much trace data as possible into operators' Operations Support Systems (OSSs). In this paper a heuristic algorithm is presented which reduces the data-stream by removing uncorrelated noise events, determining the degree of inter-relationship between the events in the data-stream. With a sophisticated open source control plane emulator used as the source generator, the results show that the presented algorithm is capable of differentiating noise from useful information, thus significantly reducing the scale and dimensionality of network monitoring data-streams.

I. INTRODUCTION

When comparing communication systems from 10 to 20 years ago with what is currently being deployed, it becomes apparent that the performance limits of those systems have been pushed to new horizons. However, the complexity of managing and maintaining those networks has also increased significantly. Not only has the number of implemented communication protocols increased, or existing protocols been amended with new features, but the number of Network Elements (NEs) in a communication system and their functionalities have also increased steadily. This trend is desirable from a user's point of view to achieve better data plane performance, but has a downside on the network operator's side: the size and rate of control plane trace data arriving from the communication system at the operator's OSSs have also increased dramatically. Consequently, the Network Operations Centres (NOCs) are often restricted to dealing with only the most serious network failures in the system due to the operational costs of running those centres. There remains a need for more sophisticated (semi-)automated processes to assist NOCs' operations by autonomously monitoring, analysing and optimizing network behaviours.

One of the main challenges operators face in this regard is the vast amount of data generated in each NE. Not only does the OSS need to receive the data stream from each NE in a timely manner without compromising customer traffic capacity, the data stream must then also be processed effectively in a distributed computing framework. It is becoming more apparent that the multivariate data originating from each NE is too large in its dimensions and size to be forwarded directly to OSSs in its entirety. Additionally, whenever working with streamed data, there are several challenges which need to be addressed: data arrives on-line, the order in which the data elements arrive is unpredictable, the stream is unbounded in size, and individual data elements are usually discarded after being processed [1].

The algorithm presented in this paper leverages a heuristic correlation approach to reduce the multivariate data-stream in size and time without removing from the stream any of the information required to identify network abnormalities, traffic trends or customer issues. In this work we assume that any important stream data element holds information that is correlated to other elements in the stream. We investigate whether it is possible to differentiate continuously between uncorrelated noise and correlated data in a data-stream.

The remainder of the paper is structured as follows: Section II presents the correlation algorithm used to identify noise in a data-stream; an extremely low-dimensional example data-stream is used to explain each individual step of the algorithm. The results of the heuristic correlation algorithm are presented in Section III, which starts off with an evaluation of various parameters and how they should be adjusted in order to provide the best possible performance. The paper is concluded in Section IV.

978-1-4799-0913-1/14/$31.00 © 2014 IEEE

II. THE HEURISTIC CORRELATION ALGORITHM

Before the work-flow of the heuristic correlation algorithm is presented in Section II-B, the origin of the input data used to test the hypothesis later on is explained.

A. Preliminary Considerations

In order to test and evaluate the algorithm presented in the next section, the source data stream shall have the following characteristics: the input data should represent control plane traffic from a communication system, and the input data should be publicly available to allow follow-up researchers to build on top of the results presented in this paper. As management and traffic data from real networks are commercially and personally sensitive, the open source emulator OpenMSC is used to generate a stream of a pre-defined control plane communication [2].

TABLE I. SETTINGS OF THE OPENMSC EMULATOR TO EMULATE A CONTROL PLANE MANAGEMENT DATA-STREAM

| Parameter | Value |
|---|---|
| User Equipments per Base-Station | 5 |
| Base-Station | 10 |
| Network Elements | 3 |
| Communication description length | 7 |
| Information Elements per communication descriptor | 1 |

For the algorithm presented in this paper, the OpenMSC emulator was set up using the configuration given in Table I. With those settings, the emulator generates a data-stream with 355 different Event Identifiers (EventIDs). As the data coming from OpenMSC strictly adheres to the communication descriptions defined in the configuration files, the resulting data-stream will not have any EventIDs which can be classified as noise. Compared to the emulated EventIDs, which show strong inter-correlations, noisy EventIDs will not create a predictable data-stream and so are much less correlated than the emulated data-stream. To demonstrate this, the emulated data-stream is augmented with randomly generated EventIDs that follow a uniform distribution, and so noise is artificially added to the stream. Those noise EventIDs are different in their numerical range for test and verification purposes in the results section. The merging process is performed on a pre-defined ratio basis where a fixed number of emulated EventIDs and a fixed number of noisy EventIDs are read in a round-robin fashion and merged into a single data-stream. Thus, the data-stream comprises 710 different EventIDs, merging the emulated and randomly generated EventIDs according to a configurable ratio.

B. Algorithm Description

The work-flow of the correlation algorithm including its internal steps is presented in this section by using an illustrative example of a small set of 8 EventIDs. The sets of EventIDs are denoted as E and N, where E represents the set of emulated EventIDs which describe the information from an emulated network. N represents the noise in the data, which is completely random, does not have any correlation with any other EventID and is therefore very unlikely to be useful.

In order to perform a comprehensive analysis of the correlation algorithm, the sets of EventIDs E and N are merged with different ratios throughout the paper, where all ratios are given in the format E:N. Assuming emulated EventIDs {1, 2, 3, 4} and synthetically generated noise EventIDs {5, 6, 7, 8}, the ratios 4:1 and 1:1 result in the merged streams of data shown in Table II. Note that for illustrative purposes EventIDs {1, 2, 3, 4} occur in sequence in this example while EventIDs {5, 6, 7, 8} are random. As N is randomly generated with equal probability for each noise EventID, the order in which {5, 6, 7, 8} occur changes in the table.

TABLE II. MERGED EMULATED AND RANDOMLY GENERATED NOISE EVENT IDENTIFIERS WITH E:N RATIOS 4:1 AND 1:1

| E:N Ratio | Stream of EventIDs |
|---|---|
| 4:1 | 1, 2, 3, 4, 8, 1, 2, 3, 4, 5, 1, 2, 3, 4, 8, 1, 2, 3, 4, 8, 1, 2, 3, 4, 6, 1, 2, 3, 4, 5, 1, 2, 3, 4, 7, ... |
| 1:1 | 1, 5, 2, 6, 3, 7, 4, 8, 1, 6, 2, 5, 3, 8, 4, 5, ... |

The main objective of the algorithm is to calculate the correlation between any pair of EventIDs in the data source. To do so, the source is divided into transactions as described in the next section. The entire input source is denoted as the window size, which holds multiple transactions, as illustrated in Figure 1.

Fig. 1. Relation of multiple transactions and the window size

1) Transaction Length: The algorithm presented in this paper uses an increasing transaction length; for every new transaction its length increases by a pre-defined fraction relationship of the transaction step $i$ and the step size $\lambda$. The constantly increasing transaction length $|T_i|$ for a given transaction step $i$ is given by:

$$|T_i| = m \cdot \left(1 + \left\lfloor \frac{i}{\lambda} \right\rfloor\right) \qquad (1)$$

where $m$ stands for the total number of unique EventIDs in the data-stream with $m \in \mathbb{Z}$, $0 < \lambda$ with $\lambda \in \mathbb{N}$ and $\lambda \leq i$. For instance, when using an E:N ratio of 4:1, $m = 8$ and a step size $\lambda = 1$, the length of the $i$-th transaction for $i = \{1, 2, 3\}$ equals $|T_i| = \{16, 24, 32\}$.

TABLE III. EXAMPLE EVENT IDENTIFIERS (EVENTIDS) STREAM FOR AN E:N RATIO OF 4:1 AND A STEP SIZE $\lambda = 1$ (SEE TABLE II)

| $T_i$ | $\lvert T_i \rvert$ | Event Identifiers |
|---|---|---|
| $T_1$ | 16 | 1, 2, 3, 4, 8, 1, 2, 3, 4, 5, 1, 2, 3, 4, 8, 1 |
| $T_2$ | 24 | 2, 3, 4, 8, 1, 2, 3, 4, 6, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5 |
| $T_3$ | 32 | 1, 2, 3, 4, 8, 1, 2, 3, 4, 6, 1, 2, 3, 4, 7, 1, 2, 3, 4, 6, 1, 2, 3, 4, 8, 1, 2, 3, 4, 7, 1, 2 |

2) Window Size: The window size $w$ defines the total count of EventIDs extracted from the data-stream in order to be analysed by the correlation algorithm. As illustrated in Figure 1, a window holds multiple transactions which split all EventIDs into logical chunks of data. The total window size $w$ is calculated as follows:

$$w = \sum_{a=1}^{|i|} |T_a| \qquad (2)$$

Consequently, the window size for all transactions in Table III equals 72 assuming $\lambda = 1$ and $|i| = 3$.

3) Frequency Table: In this step, the occurrences of each EventID in every transaction are counted to approximate the magnitude of the relationship between two EventIDs. The algorithm essentially takes the count of each distinct EventID's occurrences within a transaction and populates the frequency table. In this way the EventIDs in the data-stream are mapped from a sequential to a frequency representation.

The frequency table for the example data-stream given in Table III is shown in Table IV. It can be seen that frequent correlated EventIDs such as 1, 2, 3 and 4 increase their frequency counts with an increasing transaction length, while noise EventIDs such as 5, 6, 7 and 8 do not follow the same trend. Note that for computing the correlation matrix in the next step it is crucial that a non-occurrence of an EventID is marked as 0 in the frequency table.

TABLE IV. FREQUENCY TABLE FOR DATA FROM TABLE III

| $T_i$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| $T_1$ | 4 | 3 | 3 | 3 | 1 | 0 | 0 | 2 |
| $T_2$ | 4 | 5 | 5 | 5 | 3 | 1 | 0 | 1 |
| $T_3$ | 7 | 7 | 6 | 6 | 0 | 2 | 2 | 2 |

4) Correlation Matrix: The correlation matrix characterises the statistical dependence structure between the events in a single window [3]. In this first early version of this work the Pearson correlation coefficient is used to calculate the pairwise correlation $r_{m_a,m_b}$ between the frequencies of two events $m_a$ and $m_b$ with $a, b \in (1, m)$ and $a \neq b$:

$$r_{m_a,m_b} = \frac{|i| \sum_{k=1}^{|i|} m_{a_k} m_{b_k} - \sum_{k=1}^{|i|} m_{a_k} \sum_{k=1}^{|i|} m_{b_k}}{\sqrt{|i| \sum_{k=1}^{|i|} m_{a_k}^2 - \left(\sum_{k=1}^{|i|} m_{a_k}\right)^2} \, \sqrt{|i| \sum_{k=1}^{|i|} m_{b_k}^2 - \left(\sum_{k=1}^{|i|} m_{b_k}\right)^2}} \qquad (3)$$

where $m_{a_k}$ is the frequency of EventID $m_a$ in transaction $T_k$ and $|i|$ denotes the total number of transactions $T$. This formula is used to maintain low latency in computing the correlation matrix of large window sizes, e.g., 1 million. The correlation values of the matrix should be parsed with considerable caution; e.g., as described in [4], false negative correlation between pairs of events is not considered at this stage and so such values are removed. For example, in the correlation matrix presented in Table V, which was computed from the frequency Table IV, the correlation between EventID 5 and the other EventIDs will be ignored in the next step of this illustrative example.

TABLE V. CORRELATION MATRIX FOR FREQUENCY TABLE IV WITH E:N RATIO OF 4:1, STEP SIZE $\lambda = 1$ AND WINDOW SIZE $w = 72$

| EventID | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| 1 | - | 0.86 | 0.75 | 0.75 | -0.76 | 0.86 | 1.0 | 0.50 |
| 2 | 0.86 | - | 0.98 | 0.98 | -0.33 | 1.0 | 0.86 | 0.0 |
| 3 | 0.75 | 0.98 | - | 1.0 | -0.15 | 0.98 | 0.75 | -0.19 |
| 4 | 0.75 | 0.98 | 1.0 | - | -0.15 | 0.98 | 0.75 | -0.19 |
| 5 | -0.76 | -0.33 | -0.15 | -0.15 | - | -0.33 | -0.76 | -0.95 |
| 6 | 0.86 | 1.0 | 0.98 | 0.98 | -0.33 | - | 0.86 | 0.0 |
| 7 | 1.0 | 0.86 | 0.75 | 0.75 | -0.76 | 0.86 | - | 0.50 |
| 8 | 0.50 | 0.0 | -0.19 | -0.19 | -0.95 | 0.0 | 0.50 | - |

Fig. 2. Correlation results for fixed transaction length of $|m| = 710$, E:N ratio of 2:1, window size $w = 250\,000$

5) Correlation Filtering: To filter out uncorrelated EventIDs based upon the correlation matrix presented in Table V, the calculated correlation $r_{m_{a_i},m_{b_i}}$ between two EventIDs $m_a$ and $m_b$ in a given transaction $T_i$ must be further processed first. The main challenge in this step is to keep the EventIDs exhibiting unstable correlative behaviour across all calculated correlations. Applying a fixed correlation threshold $r^*$ directly to the correlation matrix could result in removing not only noise EventIDs. For instance, EventID 2 of Table V has a strong correlation coefficient with EventID 1 of 0.86 but a correlation with EventID 8 of 0. Consequently, EventIDs such as 2 could be falsely removed from the analysis due to this equivocal characteristic. To make the algorithm more robust and to reduce false-negative correlations, the algorithm scans through all the pairwise correlation values $r_{m_a,m_b}$ and determines the maximum correlation coefficient $r'_m$ for each EventID when all correlations with all other EventIDs are considered:

$$r'_{m_a} = \max_{b \in (1,m),\; b \neq a} r_{m_a,m_b} \qquad (4)$$

With a correlation threshold $r^* = 0.95$, the filter will retain EventIDs 2, 3, 4 and 6 and discard the other events.
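The five steps can be retraced on the illustrative example in a few lines of Python. The sketch below is not the authors' implementation: the function names are hypothetical, Pearson's coefficient is written in its standard sum-based form, and pairs with a perfect correlation of 1 are skipped when taking the per-event maximum, which is an assumption based on the caution about false strongly correlated pairs discussed in this section. Under those assumptions it yields the transaction lengths {16, 24, 32}, the frequency counts of Table IV and the retained EventIDs for a threshold of 0.95.

```python
from math import floor, sqrt

# Window of 72 EventIDs copied from Table III (T1, T2 and T3 concatenated).
WINDOW = (
    [1, 2, 3, 4, 8, 1, 2, 3, 4, 5, 1, 2, 3, 4, 8, 1]
    + [2, 3, 4, 8, 1, 2, 3, 4, 6, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
    + [1, 2, 3, 4, 8, 1, 2, 3, 4, 6, 1, 2, 3, 4, 7, 1,
       2, 3, 4, 6, 1, 2, 3, 4, 8, 1, 2, 3, 4, 7, 1, 2]
)
EVENT_IDS = list(range(1, 9))  # m = 8 distinct EventIDs


def transaction_lengths(m, step, count):
    """Equation (1): |T_i| = m * (1 + floor(i / step)) for i = 1..count."""
    return [m * (1 + floor(i / step)) for i in range(1, count + 1)]


def frequency_table(window, lengths, event_ids):
    """Split the window into transactions and count each EventID per transaction."""
    table, start = [], 0
    for length in lengths:
        transaction = window[start:start + length]
        # list.count returns 0 for an EventID that does not occur.
        table.append([transaction.count(e) for e in event_ids])
        start += length
    return table


def pearson(x, y):
    """Pearson coefficient in its sum-based form; 0.0 if a column is constant."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx, syy = sum(a * a for a in x), sum(b * b for b in y)
    denom = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return (n * sxy - sx * sy) / denom if denom else 0.0


def retained_event_ids(table, event_ids, threshold):
    """Keep an EventID if its largest non-perfect pairwise correlation >= threshold."""
    columns = list(zip(*table))  # one frequency vector per EventID
    kept = []
    for a, event_id in enumerate(event_ids):
        values = [pearson(columns[a], columns[b])
                  for b in range(len(event_ids)) if b != a]
        # Assumption: pairs with r = 1 are treated as false strong correlations.
        values = [r for r in values if r < 1.0 - 1e-9]
        if values and max(values) >= threshold:
            kept.append(event_id)
    return kept


lengths = transaction_lengths(m=8, step=1, count=3)
table = frequency_table(WINDOW, lengths, EVENT_IDS)
kept = retained_event_ids(table, EVENT_IDS, threshold=0.95)
print(lengths, sum(lengths))
print(table)
print(kept)
```

Keeping the frequency table as plain lists keeps the sketch dependency-free; for windows of the sizes used in the evaluation a vectorised correlation routine would be the more natural choice.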

EventID pairs which are not observed to co-occur frequently but yield a strong correlation value based on their co-occurrences are defined as false strongly correlated pairs. In practice, EventID pairs showing a false strongly positive correlation should be ignored [5] since these event pairs can result in misleading conclusions. In our approach we discard the event pairs which show a perfect positive correlation value (which is 1) as a primary measure of caution. The rationale is that it is extremely unlikely in a full-scale scenario with the dynamically growing transaction technique that any pair of events can be observed in every transaction. Computing correlation values based on the relative frequency of the EventIDs, rather than the absolute frequencies, can also scale the false strong correlation values. Ongoing work on more sophisticated correlation coefficients is considering these aspects.

For example, from Table V it can be seen that the correlation filter would pick EventID 7 when its event-wise maximal correlation of 1 is considered. But in Table IV it is obvious that the pairwise occurrence of EventID 7 and EventID 1 does not reflect a strong positive correlative relation. EventID 7 occurred only in $T_3$, which recorded the bulk of all event occurrences and in turn wrongly amplified the correlation value of EventID 7 with all other events.

III. RESULTS

In this section the results of the heuristic noise detection algorithm are presented using the input data-stream described in Section II-A. To provide a comprehensive but still sophisticated hypothesis test, the input data-stream was merged using two E:N ratios, i.e., 4:1 and 1:1. Both ratios are used to test the hypothesis over window sizes $w = \{100\,000, 250\,000, 500\,000\}$ and step sizes $\lambda = \{0.5, 1, 2, 4, 8, 16, 32, 64, 128\}$.

A. Effect of Fixed Versus Changing Transaction Lengths

As described in Section II-B1, a steadily increasing transaction length is implemented in this paper. This is mainly due to the correlation coefficient results for a data-stream with constant bit-rate and EventID types $m$. If each transaction has the same length throughout the window, the corresponding correlation coefficient is the same for EE, EN and NN event pairs. For example, assuming an E:N ratio of 2:1, a window size $w = 250\,000$ and a step size $\lambda = 1$, the correlation results $r_{m_a,m_b}$ are shown in Figure 2.

Note that this observation does not imply that fixed transaction lengths do not provide similar results. However, it is required that the algorithm works autonomously, which would not be possible when using a fixed transaction length, as the optimal length of each transaction cannot be determined prior to the data processing. That is why the total number of distinct EventIDs in the data stream is counted and taken for $m$ in Equation 1.

Given the knowledge about the distinct ranges for emulated and noise EventIDs, the three EventID correlation coefficient categories, emulated-to-emulated (EE), emulated-to-noise (EN) and noise-to-noise (NN), are illustrated using the colours ochre, grey and blue, respectively. As can be seen in Figure 2, all three correlation distributions have a range from 0 to approximately 0.2 and cannot be separated based on their correlation coefficient.

The correlation coefficients for a steadily increasing transaction length over the same data-stream as in Figure 2 using Equation 1 are given in Figure 3. It can be observed that all three distributions, i.e., EE, EN and NN, can be separated much better. In particular, the ochre distribution bell to the far right, representing EE correlation coefficients, indicates a clear distinction from the remaining distribution bells, which allows noisy events to be removed at a correlation threshold of approximately 0.95.

It should be noted that the current implementation of the algorithm is designed to work with a constant event rate stream rather than the bursty nature of realistic streams. The step size of the transactions is based on the event distribution in each window rather than the distinct EventID types in each window. Based on further experimental evidence (which is out of the scope of this paper) it can be suggested that the dependency of the algorithm on step sizes varies with the event occurrence rate in each transaction.

Fig. 3. Event Identifier (EventID) pair correlation results for dynamic transaction length, E:N ratio of 2:1, window size $w = 250\,000$ and $\lambda = 1$

Fig. 4. EE to NN distribution bell distance comparison using Equation 1's $\lambda = \{0.5, 1, 2, 4, 8, 16, 32, 64, 128\}$ over window sizes $w = \{100\,000, 250\,000, 500\,000\}$

B. Evaluation of Various Step Sizes $\lambda$

The calculation of the transaction length $|T_i|$ and the amount by which it steadily increases mainly depend on the step size $\lambda$ used in Equation 1. As described in Section II-B5, the correlation threshold $r^*$ defines the value that indicates which EventIDs will be kept after the correlation has been calculated and $r'_m$ determined for each EventID; preferably, only emulated EventIDs should be kept and all noise EventIDs removed from the data-stream. However, with a different E:N ratio and window size over which the transaction length steadily grows, the EE, EN and NN distribution bell curves move significantly in terms of their peaks as well as their minimal and maximal tail values. For this reason the effect of a changing $\lambda$ in Equation 1 is analysed first.

For this analysis the window sizes $w$ of 100 000, 250 000 and 500 000 are tested over step sizes $\lambda$ of 0.5, 1, 2, 4, 8, 16, 32, 64 and 128 for the E:N ratios 4:1 and 1:1. As the distance between the EE and NN bell distributions reveals whether or not any of the EventID pairs could possibly be affected by the noise reduction algorithm, Figure 4 shows the correlation distances between the right tail end of the NN distribution bell and the left tail end of the EE distribution bell. When using Figure 3 as a reference, the EE to NN distribution bell distance is approximately $0.98 - 0.92 = 0.06$.
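The distance measure can be written down directly. The following sketch is illustrative only: the function names and the sample correlation values are made up, with the tails chosen to match the values read from Figure 3. The second function shows one possible reading of placing a threshold halfway between the two tails, which is an assumption rather than the procedure used for the figures.

```python
def bell_distance(ee_correlations, nn_correlations):
    """Left tail of the EE bell minus right tail of the NN bell.

    A negative result means the two distributions overlap.
    """
    return min(ee_correlations) - max(nn_correlations)


def midpoint_threshold(ee_correlations, nn_correlations):
    """Hypothetical rule: place the threshold halfway between the two tails."""
    gap = bell_distance(ee_correlations, nn_correlations)
    return max(nn_correlations) + gap / 2


# Made-up samples; only the tails (0.98 and 0.92) follow Figure 3.
ee = [0.98, 0.985, 0.99]
nn = [0.80, 0.88, 0.92]

distance = bell_distance(ee, nn)
threshold = midpoint_threshold(ee, nn)
print(round(distance, 2), round(threshold, 2))
```

With these tails the midpoint lands near the threshold of approximately 0.95 mentioned for Figure 3.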

Fig. 5. Event Identifier (EventID) pair correlation coefficient distributions for E:N ratios 4:1 and 1:1, window sizes $w = \{100\,000, 250\,000, 500\,000\}$ and step size $\lambda$ as indicated in the caption of each sub-figure. Sub-figures (x-axis: Correlation): (a) Ratio 4:1, $w = 100\,000$, $\lambda = 4$; (b) Ratio 1:1, $w = 100\,000$, $\lambda = 4$; (c) Ratio 4:1, $w = 250\,000$, $\lambda = 16$; (d) Ratio 1:1, $w = 250\,000$, $\lambda = 4$; (e) Ratio 4:1, $w = 500\,000$, $\lambda = 32$; (f) Ratio 1:1, $w = 500\,000$, $\lambda = 8$.

Figure 4a indicates that for $w = 100\,000$ and ratio 4:1 the highest possible EE to NN distribution bell distance can be achieved with $\lambda = 4$, which results in a distribution bell distance of approximately 0.025. When moving to $w = 250\,000$, a step size of $\lambda = 16$ gives the highest distance. For $w = 500\,000$ several step sizes indicate the same distance, i.e., 32 and 64. As a higher step size results in more transactions $T$ which need to be analysed, the lowest possible step size $\lambda$ is preferred so that the fewest transactions are generated. Thus, for $w = 500\,000$ a step size of $\lambda = 32$ is chosen.
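The selection rule described here (largest bell distance, ties broken in favour of the smallest step size) could be expressed as follows. This is a hedged sketch: `choose_step_size` is a hypothetical helper, and the distances in the example are invented for illustration and are not read from Figure 4.

```python
def choose_step_size(distances, tolerance=1e-9):
    """Pick the step size with the largest EE to NN distance.

    distances maps a step size to its measured bell distance. Step sizes
    whose distance is within `tolerance` of the best one count as tied,
    and the smallest of them is returned so that fewer transactions
    have to be analysed.
    """
    best = max(distances.values())
    tied = [step for step, d in distances.items() if best - d <= tolerance]
    return min(tied)


# Invented distances: 32 and 64 tie for the largest distance.
example = {8: 0.03, 16: 0.05, 32: 0.07, 64: 0.07, 128: 0.06}
chosen = choose_step_size(example)
print(chosen)
```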

Moving to an E:N ratio of 1:1, Figure 4b reveals much lower distances across all step sizes than for a ratio of 4:1. This is due to a higher amount of noise EventIDs in the data-stream, which decreases the correlation coefficients for emulated EventIDs and increases the correlation coefficients for noise EventIDs. Nevertheless, for window sizes of 250 000 and 500 000 sufficiently high step sizes $\lambda$ can be derived from Figure 4b of 4, 8 and 32, respectively. For $w = 100\,000$ all step sizes give an EE to NN distribution bell distance very close to zero. As can be seen later, when manually analysing the resulting distributions, a step size of $\lambda = 4$ still provides reasonable results.

C. Correlation Results - Event Identifier Pairs

This sub-section presents the correlation results for E:N ratios of 4:1 and 1:1 in combination with window sizes $w = \{100\,000, 250\,000, 500\,000\}$. For each ratio and window size plot in Figure 5, the step size $\lambda$ was chosen according to the previous section and its value is given in each sub-figure's caption. As can be observed from Figures 5a, 5c and 5e, with an increasing window size and a $\lambda$ chosen as high as possible from Figure 4a for each window size, the obtained EE, EN and NN distribution bells show lower tails and a higher distance.

The same trend can be observed for an E:N ratio of 1:1, depicted in the right column of Figure 5. However, the distance between the EE and the EN/NN distributions is significantly lower, with the EN and NN distribution bells settling around $r_{m_a,m_b}$ of 0.94 and 0.82 in Figure 5f, respectively. Compared to an E:N ratio of 1:1, the left tails of the EN and NN distribution bells for 4:1 always go down to a correlation of 0, where only their peaks and right tails move further towards lower correlations.

In summary, with an increasing window size and a well chosen $\lambda$ to determine by how much each transaction increases, the noise and emulated EventID pairs can be clearly separated using their correlations.

Fig. 6. Event Identifier (EventID) correlation coefficient distributions for E:N ratios 4:1 and 1:1, window sizes $w = \{100\,000, 250\,000, 500\,000\}$ and step size $\lambda$ as indicated in the caption of each sub-figure. Sub-figures (x-axis: Correlation): (a) Ratio 4:1, $w = 100\,000$, $\lambda = 4$; (b) Ratio 1:1, $w = 100\,000$, $\lambda = 4$; (c) Ratio 4:1, $w = 250\,000$, $\lambda = 16$; (d) Ratio 1:1, $w = 250\,000$, $\lambda = 4$; (e) Ratio 4:1, $w = 500\,000$, $\lambda = 32$; (f) Ratio 1:1, $w = 500\,000$, $\lambda = 8$.

D. Correlation Results - Event Identifiers

In order to use a correlation threshold to remove certain EventIDs which have been identified as noise, the event-wise correlation $r'_m$ is derived for the results in Figure 5 using Equation 4. When looking at the results presented in Figure 6 it becomes apparent that all distributions appear sparse and that the right tail of especially the EN bell disappears. Additionally, the EE to NN distribution bell curve separation distance further increases across all E:N ratio and window size $w$ combinations.

While for our evaluation we explicitly know which EventIDs are noise and which are emulated data, we strive to automatically distinguish noise from data. Assuming that the EventIDs cannot be classified into emulated and noise using their numerical ranges, the results presented in Figure 6 still allow noise to be clearly identified through the correlation coefficient, which concludes a successful hypothesis test. Consequently, an optimal correlation threshold can be dynamically derived to reduce the data-stream by removing noise without affecting any emulated EventID, as described in the next section.

TABLE VI. APPLIED DATA REDUCTION FOR E:N RATIO OF 4:1 USING VARIOUS CORRELATION COEFFICIENT THRESHOLDS AND WINDOW SIZES

| $w$ ('000) | 0.5 | 0.75 | 0.8 | 0.85 | 0.9 | 0.95 |
|---|---|---|---|---|---|---|
| 100 | 0% | 16% | 19% | 20% | 20% | 20% |
| 250 | 7% | 19% | 19% | 19% | 19% | 19% |
| 500 | 10% | 20% | 20% | 20% | 20% | 20% |

TABLE VII. APPLIED DATA REDUCTION FOR E:N RATIO OF 1:1 USING VARIOUS CORRELATION COEFFICIENT THRESHOLDS AND WINDOW SIZES

| $w$ ('000) | 0.5 | 0.75 | 0.8 | 0.85 | 0.9 | 0.95 |
|---|---|---|---|---|---|---|
| 100 | 0% | 15% | 34% | 46% | 49% | 49% |
| 250 | 0% | 0% | 10% | 36% | 49% | 49% |
| 500 | 0% | 0% | 14% | 46% | 50% | 50% |

E. Data Reduction

The core functionality of the algorithm is to reduce noise EventIDs based on the given correlation threshold $r^*$. The precision of this algorithm depends on the amount of noise EventIDs left after the reduction without removing any emulated EventID. At the optimal performance level, the algorithm should remove the noisy portion completely. In the experiment all these hypotheses are examined with different values of $r^*$. The results on the amount of data reduction achieved by the algorithm are presented in Table VI for an E:N ratio of 4:1 and in Table VII for an E:N ratio of 1:1. In both tables the window sizes $w$ are given in thousands. In Figure 7 the percentage of noise kept in the reduced events is reported.

For an E:N ratio of 4:1, the optimal ratio of noise to data is 20 % of all the data. As mentioned previously, for a higher E:N ratio the separation between the correlation distributions of the event types is further maximised with a higher window size. As a consequence, even for lower $r^*$ the algorithm can perform optimally with a higher E:N ratio and a larger window length. This can be observed in Table VI: with the largest window length of 500 thousand events and $r^* \geq 0.75$ the algorithm is able to reduce 20 % of the data, which corresponds to all of the added uncorrelated noise data. As the window lengths get shorter, the algorithm is able to reduce the noisy part completely only for higher $r^*$. In case of the largest window size, for the highest $r^*$ the algorithm removes almost 90 % of the events, which include most of the emulated events. To prevent such false reduction it is important to set the parameters carefully; that is, for larger window sizes $r^*$ should be set to lower values.
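The relation between the E:N ratio and the best achievable reduction can be checked on a toy stream. The sketch below is an assumption-laden illustration: `merge_round_robin` is one possible reading of the ratio-based merging described in Section II-A, `reduction_percent` is a hypothetical helper, and the short EventID lists are invented rather than taken from the emulator.

```python
from itertools import islice


def merge_round_robin(emulated, noise, e_count, n_count):
    """Alternate e_count emulated and n_count noise EventIDs (ratio E:N)."""
    merged = []
    e_iter, n_iter = iter(emulated), iter(noise)
    while True:
        e_chunk = list(islice(e_iter, e_count))
        n_chunk = list(islice(n_iter, n_count))
        # Stop once either source can no longer fill a complete chunk.
        if len(e_chunk) < e_count or len(n_chunk) < n_count:
            break
        merged.extend(e_chunk)
        merged.extend(n_chunk)
    return merged


def reduction_percent(stream, kept_ids):
    """Share of events removed when only kept_ids survive the filter."""
    if not stream:
        return 0.0
    removed = sum(1 for event in stream if event not in kept_ids)
    return 100.0 * removed / len(stream)


# Toy data: emulated EventIDs {1, 2, 3, 4}, noise EventIDs {5, 6, 7, 8}.
stream_4_1 = merge_round_robin([1, 2, 3, 4] * 5, [8, 5, 8, 8, 6], 4, 1)
stream_1_1 = merge_round_robin([1, 2, 3, 4], [5, 6, 7, 8], 1, 1)

# A perfect filter keeps exactly the emulated EventIDs.
print(reduction_percent(stream_4_1, {1, 2, 3, 4}))
print(reduction_percent(stream_1_1, {1, 2, 3, 4}))
```

A filter that removes all noise and nothing else therefore cannot reduce a 4:1 stream by more than one fifth or a 1:1 stream by more than one half, which matches the upper ends of Tables VI and VII.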

IV. CONCLUSION

The challenge of reducing the massive amount of streamed data generated by NEs is addressed in this paper. A step-wise heuristic algorithm is designed to detect noisy events in the data-streams and eventually remove those events in order to preserve the events leading to various network incidents. The algorithm first maps the unbounded streaming data into a finite dimension by dynamically slicing the input stream and counting the occurrences of each EventID type in each slice of the stream. The size of the slices dictates the computational complexity and precision of the algorithm. At the core of the algorithm the pairwise co-occurrences of the events are taken into account to measure the magnitude of the relationship between the events and to infer the inclusion or removal of each event.

From the hypothesis tests it can be observed that when choosing proper step sizes $\lambda$, window sizes $w$ and correlation threshold $r^*$, the correlation algorithm can remove all artificially injected noise events without removing any single event originating from the emulator. The results show a strong relationship between the obtained correlation distributions and the manually selected correlation thresholds $r^*$. Thus, by automatically determining $r^*$ using half of the distribution distance, the proposed correlation algorithm can be realised in a completely autonomous fashion.

The current prototype correlation formula and algorithms require further development before being reliably applied to real telecoms network data. The algorithm should dynamically tune the transaction length rather than tuning the step sizes. Summary statistics of the distinctive EventID types in each window can be used as the evaluation metric for this purpose. The validation of the transaction length can be carried out in a sequential manner to reduce the computational complexity. In summary, this work shows good potential for automatically reducing noise and therefore the dimensionality of streamed data, thus facilitating more automated data processing to support future OSS systems.

ACKNOWLEDGMENT

This work was funded by Enterprise Ireland Innovation Partnership Programme with Ericsson Ireland under grant agreement IP/201110135 [6].

Fig. 7. The amount of kept noise Event Identifiers (EventIDs) after applying a certain correlation coefficient threshold for E:N ratios 4:1 and 1:1, window sizes $w = \{100\,000, 250\,000, 500\,000\}$ and step size $\lambda$ as indicated in the caption of each sub-figure. Sub-figures (x-axis: Correlation Threshold, from 0.5 to 0.95): (a) Ratio 4:1, $w = 100\,000$, $\lambda = 4$; (b) Ratio 1:1, $w = 100\,000$, $\lambda = 4$; (c) Ratio 4:1, $w = 250\,000$, $\lambda = 16$; (d) Ratio 1:1, $w = 250\,000$, $\lambda = 4$; (e) Ratio 4:1, $w = 500\,000$, $\lambda = 32$; (f) Ratio 1:1, $w = 500\,000$, $\lambda = 8$.

REFERENCES

[1] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, "Models and Issues in Data Stream Systems," in Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems - PODS '02. New York, New York, USA: ACM Press, Jun. 2002, pp. 1-16.

[2] S. Robitzsch, "OpenMSC - An Open Source MSCgen-Based Control Plane Trace Emulator for Communication Networks," 2014. [Online]. Available: http://openmsc.blogspot.com

[3] K. L. Mills and J. J. Filliben, "Comparison of Two Dimension-Reduction Methods for Network Simulation Models," Journal of Research of the National Institute of Standards and Technology, vol. 116, no. 5, pp. 771-783, 2011.

[4] G. Reeves, J. Liu, S. Nath, and F. Zhao, "Managing massive time series streams with multi-scale compressed trickles," Proc. VLDB Endow., vol. 2, no. 1, pp. 97-108, Aug. 2009.

[5] A. Mueen, S. Nath, and J. Liu, "Fast approximate correlation for massive time-series data," in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD '10. New York, NY, USA: ACM, 2010, pp. 171-182.

[6] Dublin City University and Ericsson, "E-Stream Project." [Online]. Available: http://estream-project.com
