Lecture 14: Monte Carlo on networks

Eduardo López

March 28, 2017

As we learn more and more about statistical testing in networks and about random networks, we discover that the two are intimately connected: when we have data and we are looking for significant events, we compare against something that we consider typical. So, in the case of the Krackhardt study, we wanted to establish a baseline against which to compare the data, and so we used ER networks as representative networks and then proceeded to make comparisons. We found that there is subtlety in making the comparison: yes, the degree histogram had some features similar to the data, such as two tails (a growing and a decaying part), and yes, we engineered it to have the same number of nodes and the same density of links, which meant that the average degrees of our networks and those of the Krackhardt dataset were about the same. However, we also discovered that to make the comparison properly we needed precise ER modelling, which we did not do computationally (we only made 2 example ER networks and used them only for a qualitative comparison), but rather through equations. In this chapter, we are going to familiarize ourselves with the first powerful steps into creating computational statistical models through the use of Monte Carlo, a technique for creating many random networks in the computer and using them as our baseline model to make accurate calculations for statistical testing and other purposes.

1

Revising the Krackhardt example

In order to revisit the way in which we studied the Krackhardt dataset, we discuss a few important features of that problem.

1.1

The need for large computer simulations of random networks: The Krackhardt case study situation

Let us revise what we learned in that lecture. We found two interesting things. First, we saw that in order to compare the data to ER networks, we could only establish a simple qualitative comparison by creating a few (in our case we created 2) examples of ER networks and looking for similarities and differences with the data by eye. By eye is not very rigorous, though. When we actually had to precisely determine the probability (obtain a p-value) of observing the nodes of high degree in the data, one with degree 14 and the other 18, we calculated it using the formulas for degree probability in ER networks. Why was that? Because the formulas represent the equivalent of making not just two, but infinitely many ER networks and averaging them, so that when we compare the data with the ER model, we do it against a highly detailed model in which fluctuations have been averaged out. Recall, however, that towards the end of the session (lecture notes), I also presented the degree histogram obtained by averaging one thousand ER networks. We could have also calculated a p-value simply by reading the probabilities off that plot. In this chapter, we will learn to do this kind of comparison: to use many, many simulations in order to obtain well-averaged statistical models with which to make rigorous statistical comparisons. This avoids the need for formulas, not out of laziness, but because in many situations formulas cannot be calculated exactly, and therefore we need computer simulations. An additional benefit is that, in situations where we can construct new mathematical calculations, we can check them by comparing them to simulations. I have used the word simulation above a few times without defining it. But what is a simulation?

Definition 1 A simulation is the creation of a system (typically in a computer) that is based on well-defined rules. Here, we are mostly concerned with the creation of random networks, so we usually refer to the creation of one or many random networks on the computer as a simulation.
The second thing we found by comparing the Krackhardt dataset against ER networks was that indeed the nodes of high degree in the data were very, very improbable if the data came from an ER network. We will wait to address the problem until we develop so-called exponential graph models a few lectures from now.

1.2

p-values in the Krackhardt case from simulations

In order to make 1000 individual ER networks, we previously presented the following code (with some useful additions):

import networkx as nx
import random as rn

def ER(n, p):
    U = nx.Graph()
    for i in range(n):
        for j in range(i + 1, n):
            r = rn.random()
            if r <= p:
                U.add_edge(i, j)
    return U

n, p = 21, 0.376  # parameters matching the Krackhardt dataset
htotal = {}
pr = {}
NormPr = 0.
for counter in range(1000):
    H = ER(n, p)
    for node in H.nodes():
        k = H.degree(node)
        htotal[k] = htotal.get(k, 0) + 1
        pr[k] = pr.get(k, 0) + 1
        NormPr = NormPr + 1
for k in htotal.keys():
    htotal[k] = htotal[k] / 1000.
    pr[k] = pr[k] / NormPr

The first part of it is just our usual little bit of code for an ER network. The second part is mainly a loop that runs 1000 times, each time creating an ER network, extracting the degrees from it, and adding them up. There is a small addition in comparison to the code in the previous lecture: the introduction of ‘pr’, which is the probability of a degree, together with a normalization variable ‘NormPr’. The logic of the code is that, since each individual example of a random ER network may have an atypically low or high number of nodes with a given degree k, if you make many ER networks, these atypical cases should average out. Then, the last loop divides the overall degree histogram ‘htotal’ by 1000 so that it is equivalent to having an almost perfectly averaged single ER network, one that behaves as the formulas say, and also normalizes ‘pr’. (As a reminder, the probability for a node to have k connections is the binomial

P(k, n − 1) = (n−1 choose k) p^k (1 − p)^(n−1−k),

and we expect ‘pr’ to approach it; for a network of n nodes, the histogram from this “perfect” case is h(k) = n P(k, n − 1), and ‘htotal’ should approach that.) Let us see what I mean when I say we can calculate p-values from simulations. Figure 1 is the plot of ‘htotal’ as presented previously. Next to it is the probability ‘pr’ associated with it. In the lecture on analyzing the Krackhardt dataset, we calculated the p-value of k = 14, i.e. the probability of obtaining from an ER baseline model degrees equal to or larger than k = 14. How can we get an approximation to this probability from our results above? The following code allows us to do this:

pval = 0.
for k in range(14, 21):
    pval = pval + pr.get(k, 0)
    print(pval)

The last line, in my simulation, produces the number 0.005333587313681603. Furthermore, if you run the code with the print statement included, in my simulation nothing changes after ‘k=15’, which means that my simulation did not produce any nodes with degrees 16, 17, . . . , 20. Let us compare this to a theoretically calculated value. For this, we use once again the code of the previous lecture:

import scipy.special as sp


run    p-value
1      0.000
2      0.000
3      0.0476
4      0.000
5      0.000

Table 1: A table of p-values for k = 14 on an ER baseline model compatible with the Krackhardt dataset with n = 21 and p ≈ 0.376.

n = 21
p = 0.
for k in range(14, 21):
    p = p + sp.binom(n - 1, k) * ((0.376) ** k) * ((1 - 0.376) ** (n - 1 - k))
print(p)

When we print out ‘p’, we obtain 0.003341173838685724. This is not the same as the result of the code above! In fact, with 1000 different ER networks, our simulation is off from the calculation by as much as 60%. Why can this be? Well, even though 1000 ER networks seem like a lot, in fact they may not be enough. Running the same code with 10000 networks produces a better value, 0.0033907668276328446, which is only off by 1.5%. Now, do you recall that we first tried only two ER networks? What would happen if we tried to use that to approximate the p-value? Well, we can simply change the code above so that we only make 2 ER networks. But to make the whole process more efficient, let us write this bit of code to wrap the code above and make life easier:

def DegDistER(n, p, reps):
    pr = {}
    NormPr = 0.
    for count in range(reps):
        G = ER(n, p)
        for node in G.nodes():
            k = G.degree(node)
            pr[k] = pr.get(k, 0) + 1
            NormPr = NormPr + 1
    for k in pr.keys():
        pr[k] = pr[k] / NormPr
    return pr

for count in range(5):
    pr = DegDistER(21, 0.376, 2)
    pval = 0.
    for k in range(14, 21):
        pval = pval + pr.get(k, 0)
    print(pval)

Look at Tab. 1, which shows 5 different p-values I obtain by running the code 5 different times. The reason why most of the runs produce a p-value of zero is simply that, with just two ER networks used to create a baseline, degrees of k = 14 or larger don’t even show up in 4 out of the 5 runs. Before we leave this section, it is worth pointing out that results obtained through simulation are typically what we call estimates or estimations.

Definition 2 An estimate of a calculation is the result that we obtain when the calculation is not performed exactly, for whatever reason, but instead is done by the use of a method that provides an approximation. In our case, many approximations come from simulations.

Figure 1: Averaged histogram (left) and probability distribution (right) for degrees of 1000 ER networks, each one with the same n and p as Krackhardt’s data.

2

Monte Carlo method

The field of Monte Carlo methods is one of the most important in physics, statistics, and related fields of the second half of the 20th century. The ideas behind the method have even earlier origins. For our purposes, this is what we need to know.

Definition 3 The Monte Carlo method (simulation) in networks is the process of creating many random networks of a specific type (for instance, ER with given n and p) and using those outcome networks for the construction of probability distributions necessary for statistical estimations.

Incidentally, the name Monte Carlo comes from a quarter of the principality of Monaco in Europe (surrounded by France, near Nice), made famous by the casino installed there by the mid-1800s. The relationship between the method and the location is precisely the element of chance in casinos. The example of the previous section constituted a Monte Carlo simulation: we created many networks (in one instance 1000 and in another 10000) and extracted from them a degree distribution that we then deployed in the calculation of p-values. We could also have proceeded slightly differently there: instead of making one averaged degree distribution, we could have made a degree distribution for each of the 1000 or 10000 networks, and then combined them all after the fact to make the average degree distribution. However, it turns out that if we do the math, the results would have been exactly the same (I’ll spare you the details).

The power of Monte Carlo methods lies in the fact that they are the computer-based calculational substitute for most random processes that cannot be calculated mathematically. To deploy a Monte Carlo method, we basically do the following:

• we define the specific random process we want to simulate, such as the throw of a die, the flip of a coin, or the creation of a random network,

• proceed with the creation of many repetitions of the process, saving from each the relevant information that allows us to perform estimation, and

• calculate the estimates we are interested in.

Repetitions in Monte Carlo have a name: realizations or instantiations.

Definition 4 A realization or instantiation in a Monte Carlo method is one of the results that occurs when we run one repetition of the method.

Thus, in the throw of a die, a realization is one integer between 1 and 6 coming out of the throw; in networks, a realization is one random network produced with the specified input parameters. Realizations can produce repeated outcomes, so in throwing a die one can obtain the same result multiple times. Furthermore, the fact that a Monte Carlo produces repeated outcomes is the expression of probability at work. Say that a coin is loaded so that it produces twice as many tails as heads; then a Monte Carlo in which, on average, twice as many tails come out as heads is precisely what we want. Using the notation of random variables, we can state that for a random variable U with sample space {u1, . . . , uq}:

Remark 1 The outcomes of a Monte Carlo are drawn from the sample space of the problem (or equivalently of the random variable), and the frequency H(uα) with which outcome uα occurs approaches R × Pr(U = uα), where R is the number of realizations performed with the Monte Carlo. In other words,

H(uα)/R → Pr(U = uα)    (1)

which is simply the experimental probabilities we discussed in a previous lecture.
Now, just to be clear:

Remark 2 In our context, the random variable U refers to the type of thing that comes out of the Monte Carlo, such as a side of a coin, a face of a die, or a network. When U is said to take one of its values uα from the sample space, we are saying that the coin came out (say) heads, or the die was ‘2’, or the network was the one with specific nodes and links.

To close this section, let me phrase all of the above like this: a good way to think of a Monte Carlo simulation is as the creation of the chance experiments one is interested in (virtually always on the computer, although one can run a Monte Carlo by other means we won’t discuss), which provide the experimental probabilities for the random variable of interest.
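As a minimal illustration of Remark 1 and Eq. (1) (a sketch of my own, not from the lecture), consider the loaded coin mentioned above, which lands tails twice as often as heads. The observed frequencies H(uα)/R should approach the true probabilities 1/3 and 2/3:

```python
import random as rn

rn.seed(1)  # fixed seed so the run is reproducible

R = 100000  # number of realizations
H = {"heads": 0, "tails": 0}  # frequency counts H(u_alpha)

for _ in range(R):
    # loaded coin: tails is twice as likely as heads
    u = "tails" if rn.random() < 2.0 / 3.0 else "heads"
    H[u] = H[u] + 1

# H(u)/R should approach Pr(U = u): 1/3 for heads, 2/3 for tails
print(H["heads"] / R, H["tails"] / R)
```

With R = 100000 the printed frequencies land within about a percent of 1/3 and 2/3, which is exactly the convergence Eq. (1) describes.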

3

Worked Examples

Let us show here a couple of simple examples of Monte Carlo. Some of the key things to pay attention to are what U is (whether it’s a coin, a die, or a network), the question we want answered (an average, a p-value, etc.), and the information we need to keep in order to get the answers.

Example 1 Let us imagine we have a board game in which the number of steps forward on the board corresponds to the outcome of a simultaneous throw of 2 dice. This problem can be thought of with the use of 3 random variables U1, U2, and S, where S = U1 + U2. What are the average and standard deviation of the number of steps taken with one simultaneous roll of the 2 dice?

The sample space is given by {2, 3, . . . , 12}. However, these outcomes are not all equally likely, in contrast to when we roll a single die. Let us call the outcome of a single trial labeled α as sα. To determine the average and standard deviation, we can proceed in several ways. The simplest is to keep a list of all the outcomes of our trials and then perform the calculations afterwards. We could also construct the distribution Pr(S) and calculate from it. By the way, this problem can be solved mathematically, but it is not as simple at this level, and thus, for most of us, it may be easier to use Monte Carlo. The Monte Carlo is executed with the following code:

import random as rn

def DoubRoll():
    u1 = rn.randint(1, 6)
    u2 = rn.randint(1, 6)
    s = u1 + u2
    return s

# Let's say we do 1000 repetitions
avg = 0.
reps = 1000
Outcomes = []
for count in range(reps):
    s = DoubRoll()
    Outcomes.append(s)
    avg = avg + s
avg = avg / reps
sd = 0.
for s in Outcomes:
    sd = sd + (s - avg) ** 2
sd = (sd / reps) ** 0.5

We could wrap all of this in a function, but for our purposes this is enough. When I run this code, my results are ‘avg=6.848’ and ‘sd=2.4105800131918427’. The theoretical values, if ‘reps’ were ∞, would be an average of exactly 7 and a standard deviation of √(35/6) ≈ 2.4152.
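The theoretical values quoted above can in fact be checked directly, because the sample space of two dice is tiny: enumerating all 36 equally likely outcomes gives the exact mean and standard deviation. A short sketch of my own:

```python
# Enumerate all 36 equally likely outcomes of two dice to obtain
# the exact mean and standard deviation of S = U1 + U2.
outcomes = [u1 + u2 for u1 in range(1, 7) for u2 in range(1, 7)]

avg = sum(outcomes) / len(outcomes)  # exactly 7.0
var = sum((s - avg) ** 2 for s in outcomes) / len(outcomes)  # exactly 35/6
sd = var ** 0.5  # sqrt(35/6), about 2.4152

print(avg, sd)
```

This is the mathematical solution the example alludes to; comparing its output with the Monte Carlo numbers above is precisely the kind of cross-check mentioned in the introduction.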

Example 2 We want to determine the distribution of the largest degree for the family G(n, p) with n = 21 and p = 0.376; this is another cross-check of our analysis of Krackhardt’s data.

Note that in this case, we are not looking for the degree distribution over all the nodes, but rather only the distribution of the largest degree we find on each ER network we produce. To solve this problem, we use our own code for ER networks. For each network, we loop over all the nodes, find the degree of each, and keep track of the largest degree. This is then saved and used to create the distribution we are interested in. Although we could introduce the notation of random variables, it does not help much in this case. The following code provides a solution:

# We choose to do 1000 realizations
reps = 1000
n = 21
p = 0.376
pr = {}
for count in range(reps):
    U = ER(n, p)
    kmax = 0
    for node in U.nodes():
        k = U.degree(node)
        if k > kmax:
            kmax = k
    pr[kmax] = pr.get(kmax, 0) + 1.
for kmax in pr.keys():
    pr[kmax] = pr[kmax] / reps

Figure 2 shows the results. The distribution grows and decays, but is not symmetric around the maximum (it grows faster and decays more slowly). As you can also see, it is much more typical for these ER graphs to have a maximum degree of about 11 or 12, and much less likely to have a maximum degree of 14 or beyond; our Monte Carlo doesn’t show a single realization with degree 17 or greater.
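A rough sanity check on this Monte Carlo (a sketch of my own, not part of the lecture) comes from pretending the n node degrees are independent draws from the binomial P(k, n − 1). They are not exactly independent, since nodes share edges, but under that assumption Pr(Kmax ≤ k) ≈ F(k)^n, where F is the binomial CDF, and the approximate distribution of the maximum follows by differencing:

```python
from math import comb

n, p = 21, 0.376  # parameters matching Krackhardt's data

def F(k):
    # Binomial CDF for one node's degree: F(k) = Pr(K <= k)
    return sum(comb(n - 1, j) * p**j * (1 - p) ** (n - 1 - j)
               for j in range(k + 1))

# Independence approximation: Pr(Kmax <= k) ~ F(k)^n, so the
# approximate probability of Kmax = k is F(k)^n - F(k-1)^n.
pr_max = {k: F(k) ** n - F(k - 1) ** n for k in range(1, n)}

mode = max(pr_max, key=pr_max.get)
print(mode)
```

The mode of this approximate distribution falls near the 11–12 that the Monte Carlo of Figure 2 finds, which suggests the simulation is behaving sensibly even though the independence assumption is only approximate.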

4

Some uses of Monte Carlo in our Network Science course

In Network Science, which is basically what we are doing in this course, Monte Carlo is used for all kinds of applications. Ex. 2 above is a good illustration: we formulated a question about a statistical property of random networks of a certain kind, and by executing a Monte Carlo we managed to generate an answer. This gives us a template for many of the uses we encounter both in the networks literature and in some of the questions coming up later in the course. I now mention some quite typical uses of Monte Carlo that we will encounter:


Figure 2: Distribution of the maximum degree of an ensemble of ER networks with parameters n and p consistent with Krackhardt’s data.

1. Generating full statistics of a network model: on the basis of the rules of creation of a random network, Monte Carlo allows us to generate the probability distributions of those random networks. The distribution carries all the information necessary to study statistical properties of the network model.

2. Understanding the properties of a type of random network: in this case, we mean, for instance, that we want to know something difficult to calculate, like the clustering of scale-free networks. We use Monte Carlo to generate many scale-free networks and then measure the clustering. Basically, this is equivalent to the previous point with the measurement of something added to it.

3. Sampling special subtypes of networks: imagine we want ER networks with a given n and p, but we want to make sure that all of them have maximum degree above some value. In Ex. 2, we might want to work only with networks that have a maximum k equal to or larger than 14. Monte Carlo allows us to create ER networks, some of which satisfy the property of maximum degree ≥ 14 and others that do not, and to select the ones we want. After this selection, one can do the kinds of things mentioned in the previous two points.

You should not take the previous list as exhaustive in any way. The use of Monte Carlo is basically equivalent to the use of random simulation, so the possibilities are endless. However, the examples above do give a good sense of the things we will do in future lectures.
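The third item can be sketched as a simple rejection loop. This is my own sketch (the helper names ‘er_degrees’ and ‘sample_with_min_kmax’ are hypothetical, and for compactness it records only the degree sequence rather than building a full graph object):

```python
import random as rn

def er_degrees(n, p):
    # Degree sequence of one ER(n, p) realization, without
    # storing the network itself
    deg = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rn.random() <= p:
                deg[i] += 1
                deg[j] += 1
    return deg

def sample_with_min_kmax(n, p, kmin):
    # Rejection sampling: regenerate ER realizations until one has
    # maximum degree >= kmin; this gets slow when the condition is rare
    tries = 0
    while True:
        tries += 1
        deg = er_degrees(n, p)
        if max(deg) >= kmin:
            return deg, tries

deg, tries = sample_with_min_kmax(21, 0.376, 14)
print(max(deg), tries)  # max degree is guaranteed to be at least 14
```

With these parameters only a small fraction of realizations reaches maximum degree 14 (judging from Fig. 2), so the loop typically needs a few dozen tries; for much rarer conditions, rejection sampling becomes impractical and cleverer sampling schemes are needed.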


5

Some technical challenges implementing Monte Carlo

When using Monte Carlo methods, error checking of the results can be tricky. This is because Monte Carlo comes into play when other methods are not available, and thus the choices of independent checks are very limited. This forces us to be very judicious about its use, and we need to be resourceful when making sure our results are meaningful. Below, I mention a few things one can keep in mind in order to check whether Monte Carlo results are correct.

5.1

How many realizations are needed?

One constantly tricky problem with Monte Carlo is that we do not know from the start (a priori) how many realizations are needed to obtain good results. Therefore, it is always important to test this. How do we go about it? The key is that we must test the trends our results exhibit as a function of the number of realizations. To illustrate how this may look, let us go back to Ex. 1 and ask whether the averages being calculated trend towards some limit. Let us wrap the code from that example in a way that makes it easy to check the trends (assume the function ‘DoubRoll()’ is already loaded in memory).

def avgDoubRoll(reps):
    avg = 0.
    for count in range(reps):
        avg = avg + DoubRoll()
    return avg / reps

AvgTrend = {}
for reps in range(10, 2001, 10):
    AvgTrend[reps] = avgDoubRoll(reps)

After running the code, we produce Fig. 3, which shows the gradual approach towards a value at the center of the dots as ‘reps’ becomes larger. Note that I have decided to consider the problem of the number of repetitions not from the standpoint of having a very large ‘reps’ value and leaving it at that, but rather by focusing on the trend as ‘reps’ becomes larger. This is because what counts as a large value of ‘reps’ depends on the problem we are studying; in some cases ‘reps’ can be very small (say 10) and already be sufficient, but in others ‘reps’ needs to be a massive number. This is where knowing the trend works very well, because we may be able to extrapolate what the target result should be and how many ‘reps’ may be needed to get close.
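The trend can also be quantified. If we repeat the whole estimate many times at the same ‘reps’ and measure the spread of the answers, that spread shrinks roughly like 1/√reps. Here is a sketch of my own (the helper names ‘batch_avg’ and ‘spread’ are hypothetical, not from the lecture):

```python
import random as rn

rn.seed(3)  # fixed seed for reproducibility

def batch_avg(reps):
    # One Monte Carlo estimate of the double-roll average
    return sum(rn.randint(1, 6) + rn.randint(1, 6)
               for _ in range(reps)) / reps

def spread(reps, batches=200):
    # Standard deviation of many independent estimates; by the
    # 1/sqrt(reps) rule, this should shrink as reps grows
    avgs = [batch_avg(reps) for _ in range(batches)]
    m = sum(avgs) / batches
    return (sum((a - m) ** 2 for a in avgs) / batches) ** 0.5

s_small, s_large = spread(100), spread(10000)
print(s_small, s_large)  # the second spread is about 10 times smaller
```

Since a single roll has standard deviation √(35/6) ≈ 2.415, the spread of an average over ‘reps’ rolls should be about 2.415/√reps: roughly 0.24 at reps = 100 and 0.024 at reps = 10000, consistent with the scatter of the dots in Fig. 3.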

5.2

How representative is the sampling?

Monte Carlo can be used in other ways than the ones mentioned in the previous sections. In the cases above, we always knew how to simulate fairly, but in other situations fair simulation can be a challenge. In those cases, it is not clear whether the frequencies we observe (the ones we make reference to in Eq. (1)) are correct or are biased. When this happens, one needs to perform many more checks to make sure we have a fair distribution. I will leave this problem for now.

Figure 3: Average of the result of the roll of a pair of dice as a function of the number of realizations of a Monte Carlo for the problem. As the number of realizations increases, the result gets closer to the theoretical value 7, and we could probably guess this result within a margin of error from the dots on the plot even if we did not know that the theoretical result is 7.

References

[1] D. Krackhardt, “Cognitive social structures”, Social Networks, 9, 109–134 (1987).

