MEASURING THE SENSORY ORDER: MEASURE THEORY, NEURAL NETWORKS & EQUIVALENT MODELS

A Thesis Presented to the Faculty of California State Polytechnic University, Pomona

In Partial Fulfillment of the Requirement for the Degree Master of Science In Pure Mathematics

By Brendan Patrick Purdy 2005

SIGNATURE PAGE

THESIS:

MEASURING THE SENSORY ORDER: Measure Theory, Neural Networks & Equivalent Models

AUTHOR:

Brendan Patrick Purdy

DATE SUBMITTED:

_____________________________________________ Department of Mathematics

Prof. Jim McKinney Thesis Committee Chair Mathematics

_____________________________________________

Prof. Michael Green Mathematics

_____________________________________________

Prof. Randy Swift Mathematics

_____________________________________________

ii

Dedication & Acknowledgements Dedication I would like to dedicate this Thesis in the memory of my paternal grandparents who both passed away during my tenure at Cal Poly: Robert “Poppy” Purdy—the finest of the American Yankee tradition. Pat “Nanny” Purdy—the reason for the phrase “Irish Eyes Are Smiling” Agnus Dei, qui tollis peccáta mundi, dona eis réquiem. Agnus Dei, qui tollis peccáta mundi, dona eis réquiem. Agnus Dei, qui tollis peccáta mundi, dona eis réquiem sempitérnam. Acknowledgements Professor Krinik—for getting me in, and later helping me through my period of doubt. Professors Green & Swift—for each having me for two quarters and yet still agreeing to be on my committee. Professor McKinney—for being my advisor on life, not just mathematics and naturally, this would have never been completed without you; deepest gratitude. Fayez Khoury—for being with my on this Cal Poly journey and editing both my thoughts and this thesis. My parents—for helping me have a second chance and encouraging me all the way. My brother—for being the first person to teach me that symbols can be a powerful language. My wife—anh yêu em. And I must give thanks to my patrons St Thomas Aquinas and St Albert the Great. Ad Majorem Dei Gloriam!

iii

ABSTRACT This thesis discusses measuring neural networks with measure theory. In particular, we show that finite artificial (recurrent) neural networks can be quantified by the theory of Lebesgue measure. Using measure theory to quantify neural networks is mathematically valid because of the following. First, using computation theory, neural networks are equivalent to finite automata, which are in turn equivalent to hidden Markov models. Next Markov models are valid on the discrete probability space by using measure theory and the axioms of probability. Finally since Markov models are equivalent to neural networks, then it is valid to measure these networks with measure theory. We call this measure of neural nets the McCulloch-Pits Measure and denoted µ Ν . Once we have the McCulloch-Pitts measure, we apply it to a few examples. We then finish our discussion with a sketch of the proof showing that infinite neural networks can be measured by continuous probability space, possibilities for further research, and concluding thoughts.

iv

TABLE OF CONTENTS

Signature Page

ii

Dedication & Acknowledgments

iii

Abstract

iv

Table of Contents

v

List of Figures and Tables

vi

List of Major Results

vii

Notational Glossary

ix

Chapter 0

Introductory

1

Chapter 1

Set Theoretic-cum-Computational Beginnings

11

Chapter 2

Equivalence of Finite Models

21

Chapter 3

Markov Models & Measure on Finite Measure Space

37

Chapter 4

Finite Applications

45

Chapter 5

Conclusion and Future Research

55

References

61

Appendix I

A Short Philosophical Note on Models and State Diagrams

74

Appendix II

Source Code, Sample Output, and Data with Technical Remarks

76

v

List of Figures and Tables

Table G.1 A Glossary of common symbols

viii

Figure 0.1.1 A directed graph.

3

Figure 0.1.2 Three examples of cyclic graphs.

4

Figure 0.1.3 A state diagram of a finite automaton.

7

Figure 1.1.1 The path of the thesis.

11

Figure 1.2.1 An example of a finite (recurrent) neural network.

14

Figure 1.2.2 An example of a finite automaton.

15

Table 1.2.3 The transition function for M.

17

Table 1.2.4 Isomorphic Relation of the three models.

20

Table 2.1.1 Truth tables for AND, OR, and NOR.

24

Figure 2.1.2 A canonical finite-state neural net.

26

Table 2.1.3 I/O and state representation of M and some NM.

29

Figure 3.1.1 A state diagram of a Markov chain.

38

Figure 4.1.1 The process of teaching a neural network.

46

Figure 4.1.2 An illustration of the 1-1-1 two-layer network.

47

Figure 4.3.1 An illustration of the 1-2-1 two-layer network.

51

Table 4.3.2 Possible runner situations in a baseball game, ignoring outs.

53

Figure 5.2.1 The path for further work.

57

Table A.1 Glossary of the terminology (Logical, Sensory).

75

Table A.2 Results for §4.2.

95

Table A.3 Randominization of the Los Angeles Dodgers Wins 1969-2004.

97

Table A.4 Results for §4.3.

98

vi

List of Major Results Theoretical

Theorem 2.1.18 Every finite automaton, M, is equivalent to some neural net NM , i.e. ∀M , ∃N M ( M ∼ N M ) , p. 25.

Theorem 2.2.4 Finite automata and hidden Markov models are equivalent, i.e.,

M ∼ H , p. 33.

Corollary 2.2.2 (Finite Equivalence Theorem) Finite neural nets are equivalent to finite automata and finite automata are equivalent to hidden Markov models; i.e.,

N ∼ M ∼ H . Further, neural nets and Markov models are equivalent: N ∼ H , p. 35.

Theorem 3.1.4 Markov chains, and hence hidden Markov models, obey the Axioms of Probability. In other words, they exist (i.e., are valid) on probability space, p. 39.

Corollary 3.1.5 Finite neural nets and finite automata are valid on probability space, p. 39.

Definition 3.2.13 The McCulloch-Pitts measure is the measure of a neural network. We denote the measure µ Ν and define the measure as µ Ν = E [ g ( X )] = ∫ g ( x) PX , where g is the indicator function of the probability space X, p. 43.

vii

List of Major Results Applied

Solution 4.2.2 (Cryptography) After running the program 50 times, the expected value of this neural net is µ N1 = 9.78 iterations. Thus, the measure of this neural net is 9.78. p. 50.

Solution

4.3.3

(Baseball)

The

expected

value

of

this

neural

net

is

µ N 2 = 25.6 iterations. Thus, the measure of this neural net is 25.6. The neural net predicts that the Dodgers will win 92.8 or 93 games in 2005. p. 52.

viii

Notational Glossary

Table G.1 A Glossary of common symbols Symbol Meaning µΝ McCulloch-Pitts Measure ~ Equivalence N Finite Neural Net Finite Automaton M Hidden Markov Model A Any Formal Model, e.g. M , N , A M Isomorphism ≈ Language, Formal L

Page(s) 1, 12, 42 7, 22 13 15 17 20

I/O ∧ ∨ ↓

22 23 23 23 25 25





Input/Output AND (Logical Operator) OR (Logical Operator) NOR (Logical Operator) “For every” (Universal Quantifier) “For some” (Existential Quantifier)

20 21

Note: The column “Page(s)” refers to where the symbol was first used and any other particularly important uses of the symbol. Please see Table 1.2.4.

ix

CHAPTER 0 INTRODUCTORY “Because of the “all-or-none” character of nervous activity, neural events and the relations among them can be treated by means of propositional logic.” —W.S. McCulloch & W. Pitts, 1943.

§0.0 The Thesis Statement of the Thesis Artificial (recurrent) neural networks can be quantified by measure theory. Using measure theory to quantify neural networks is mathematically valid because of the following. First, using computation theory, neural networks are equivalent to automata, which are in turn equivalent to Markov models. Next Markov models are valid on probability space by using measure theory and the axioms of probability. Finally since Markov models are equivalent to neural networks, then it is valid to measure these networks with measure theory. There are two types of artificial (recurrent) neural networks, viz. finite and infinite. In the finite case, finite neural networks are equivalent to finite automata, which are equivalent to hidden Markov models. We show that discrete probability can be used to measure finite neural networks. Regarding the infinite case, we will not prove that infinite neural networks can be measured by indiscrete probability. Rather we merely sketch the proof that infinite neural networks are equivalent to Turing machines and Turing machines are equivalent to labeled Markov algorithms. Naturally, Markov algorithms can be measured with continuous (indiscrete) probability space. We christen the measure of neural networks the McCulloch-Pits Measure and denoted µ Ν .

Theoretical Frameworks In order to be able to measure neural networks (or simply neural nets), we must deal with two theoretic frameworks. First, we have to be able to move from neural nets to automata (either finite automata or Turing machines) and then from automata to Markov models (either hidden or labeled). In order to do this, we must use computation theory. Second, we have to show that Markov models are valid on probability space, which is possible only by measure theory. Thus, to show equivalence of models we use the theory of computation while to show validity of models on probability space, we use the theory of Lebesgue measure.

The Big Ideas This thesis is a sequence of connected ideas. Of course, we prove that we can move from idea to idea, but here we will list the ideas that concern Part I. None of the ideas have their origin with us, but as far as we know, we are the first to connect the dots. •

Neural events can be treated by means of an (indefinite) propositional calculus, and hence set theory (McCulloch & Pitts, 1943; Cf. Carnap, 1934).



Every finite automata machine is equivalent to, and can be “simulated” by, some finite neural net (Minsky, 1967).



Finite automata are equivalent to hidden Markov models (Cf. Markov, c. 1906).



Markov models are valid in (classical) probability (Markov, 1913).



Probability space is a subset of measure space (Kolmogorov, 1933).

2

§0.1 Preliminary Definitions We begin with a number of definitions that are basic to our study. Definition 0.1.1 A network is a directed graph together with a function which assigns a positive real number to each edge [48, p. 52]. Artificial neural networks are modeled after the human brain, which can be viewed as a biological neural network. This thesis concerns the former type of network. Recall 0.1.2 A graph G is an ordered pair of disjoint sets (V , E ) such that E, the set of edges, is a subset of the set V (2) of unordered pairs of V, the vertices [17, p. 1]. Further a directed graph is a graph that has edges that are ordered pairs of vertices (Ibid., p. 8). In this case, the edges are called arcs [52, p. 2].

Figure 0.1.1 A directed graph [123].

Definition 0.1.3 Recurrent refers to the network’s cyclic architecture. The dichotomous characterization of neural networks is either feedforward or feedback, i.e. recurrent. To use graph theoretic terms, feedforward graphs are acyclic while recurrent networks are cyclic.

3

Figure 0.1.2 Three examples of cyclic graphs, which are the graph theoretic analogues to recurrent neural nets [122].

This means that the computation of feedforward networks ends in a fixed number of finite sets while the recurrent networks are allowed to have memory of past events; thus, making them more computationally powerful [109, p. 2]. Further, we can even define the “motion” of the network with an equation that is a system of recurrence relations. [1, p. 92]. Also note that the human brain is a recurrent network [24, p. 208].

Remark 0.1.4 An artificial (recurrent) neural network (or neural net) is a recurrent network that is modeled after a connectionist view of the human brain. The first two critera are connectionist. In particular, the three aspects that make this (artificial) network “neural” are the subsequent: 1. Connectionism: The building blocks of both artificial and biological networks are simple computational devices that are highly interconnected [46, p. 1-9]. 2. Connectionism: The connections between the neurons of both types of networks determine the function of the network (Ibid).

4

3. Learning: Both types of networks can learn. In particular, neural nets learn by learning rules. Definition 0.1.5 A learning rule (or training algorithm) for a neural net is a “procedure for modifying the weights and biases of a network” [46, p. 4-2]. So, the function of a learning rule “is to train the network to perform some task” (Ibid., p. 4-3). For the applications in Chapter 4, we use the Backpropagation algorithm as our learning rule. Remark 0.1.6 An algorithm is a step-by-step procedure that solves a problem. The classic example of an algorithm is Euclid’s Algorithm. Euclid’s Algorithm solves the problem of computing the greatest common divisor of two integers [81, p. 428]; [129, p. 40ff]. Thus, learning rules are a particular kind of algorithm. Neural nets are a type of automata. The word ‘automaton’ comes from the Greek αυτόµατο, i.e. “acting of itself.” In other words, automata are theoretical computers that are capable of solving algorithms. In mathematics and computer science, we have the following definition of automata.

Definition 0.1.7 Automata are theoretical machines, which are also called computational models. The two traditional automata that we discuss are finite automata and Turing machines. It is also common to refer to automata as machines. Automata theory is concerned “with the definitions and properties of mathematical models of computation” [110, p. 3]. We should pause a moment and discuss what is meant by a mathematical model in this thesis.

5

The most common usage of “mathematical model” is in the field of applied mathematics. In applied mathematics, it refers to a system of equations that describe certain physical characteristics of the world given a set of a priori conditions. For example, Newton’s Law of Cooling models the physical fact the surface temperature of an object changes at a rate proportional to its relative temperature. So, mathematical models in this sense are models that approximate empirical observations of the physical world [70], [112], [113]. Some of the most glorious examples of mathematical models can be found in Newton’s Principia [82], [83] and Einstein’s Relativity [31, 32]. In mathematical logic, which includes theoretical computer science, the phrase “mathematical model” is used in a different sense. The canonical exemplars of a mathematical model in this sense are Gödel’s Theorems [43], [44]. Unfortunately, it would take us too far afield to define this phrase precisely and completely [33, p. 80-99]; [59, p. 177-180], [79, p. 243], [108, p. 22], [114, p. 416-417]. Thus, we will describe mathematical model qua mathematical logic by a descriptive analog. This analog is a state diagram.

Definition 0.1.1 stated that a network is a directed graph. This means that networks, like graphs can be denoted with diagrams of vertices (or nodes) and edges as can be seen in Figure 0.1.1. Finite automata can also be described in this fashion, which is called a state diagram.

Definition 0.1.8 A state diagram (or transition diagram) is a labeled directed graph associated with a finite automaton as follows. The vertices of the graph correspond to the

6

states of the diagram (q0 usually denotes the start state). If there is a transition from state p to q on input a, then there is an arc (or fiber) labeled a from p to q in the diagram. [52, p. 16]. The term ‘fiber’ is used to describe arcs of state diagrams for neural nets. Further, state diagrams can also be described using a table, which we call a state table.

Figure 0.1.3 A state diagram of a finite automaton [90].

It is no accident that both networks and finite automata can be described by state diagrams. In fact, this lends credence to our aim of showing that neural nets and finite automata are equivalent. A state diagram can also be considered to be a visual representation of a mathematical model. (It should be noted that in predicate logic, the diagrams are not directed.) And of course, we can denote our mathematical model in set theoretic notation. N.B. Appendix I contains a short philosophical note on models and state diagrams. In summary, the formal definition of a mathematical model would take us too far afield, so instead we defined a state diagram. A state diagram is a visual representation of a mathematical model. Further, based on the state diagram, we can define a model using set theoretic notation. That is,

Mathematical Model ~ State Diagram ~ Set Theoretic Model.

7

As a result, when we formally define finite neural nets, finite automata, and hidden Markov models, we do so using set theoretic models, which are equivalent to mathematical models.

§0.2 Comparisons of Methods of Measuring Theoretical Computers

Now that we have given a brief introduction to neural nets, automata, and mathematical models of computation, we can state this thesis’ contribution to mathematics.

Definition 0.2.1 It is said that one type of computational model is stronger than another

if the following is true: “the first can compute (or generate) a set of functions or decide/recognize (or accept) a set of languages, while the latter can compute only a strict

subset of these functions or languages” [109, p. 2]. In other words, the second computational model is weaker than the first. We will now give a couple of examples from physical computers. Example 0.2.2 1. A HP-48G graphing calculator is stronger than a HP-33S scientific

calculator (since the former can perform many more functions, e.g. derivatives). 2. A Power Mac G5 is stronger than an Apple 2e because of, in part, the G5’s speed. Observation 0.2.3 Turing machines are the most common type of automata and are

neither the strongest nor weakest.1 They are stronger, however, than finite automata. In the theory of computation, mathematical models are usually compared to Turing machines. For example, one would say that a particular model is weaker (or

1

Turing machines are mathematically equivalent to a digital computer with unbounded resources [109]. Further, there are many different types of Turing machines [110].

8

stronger) than a given Turing machine. This gives us the relative strength of the models, but it does not give us the absolute strength of the models. By that we mean, we cannot say that Mac G5 has a measure of 72.4 whereas the measure of an Apple 2e has a measure of .13, which thereby implies that the Mac G5 is stronger. 2 Yet, with respect to our computational models, we would like to be able to say that the measure of a particular neural net (which is equivalent to a finite automaton) has a measure of 1 whereas another neural net (which is equivalent to a Turing machine) has a measure of

.3

Ergo, we show that the strength of neural nets (and all equivalent models) can be measured qualitatively by the McCulloch-Pitts measure. In Chapter 4 we will measure two applications of finite neural nets.

Fields of Mathematics

It may be helpful to state what fields of mathematics are being covered in each chapter (excluding Chapter 6) in order to get a better understanding of the flow of the thesis. •

Chapters 1 and 2 concern the theory of computation, with specific reference to finite automata, neural nets, and hidden Markov models.



Chapter 3 involves probability and measure theory on a finite measure and in particular, the discrete probability space.



Chapter 4 applies the theory discussed previously to neural nets, and as such this chapter is about applied mathematics and computer programming.

2 3

These values were picked at random; they do not represent the true measures of the computers. Once again, the actual values themselves do not matter. What matters is that we have (cardinal) values.

9

A Comment on Neural Networks and the Theory of Computation

We have intentionally left out many details concerning neural nets, to include their biological, historical, and philosophical background. Further, we could discuss, in broad strokes, the theory of computation. These topics, while important, are not necessary for an understanding of the mathematics that follows in the main body of the text. That being said, we offer occasional and brief historical notes in the chapters that follow. Also, in the references we list a number of works, while not mentioned explicitly in the thesis, that can be investigated for a broader understanding of neural nets and the theory of computation. (In fact one might want to say that the works in this category are implicitly referenced in this thesis.)

10

CHAPTER 1 SET THEORETIC-CUM-COMPUTATIONAL BEGINNINGS “The only way in which we can break the circle in which we move so long as we discuss sensory qualities in terms of each other, and can hope to arrive at an explanation of the processes of which the occurrence of sensory qualities forms a part, therefore, is to construct a system of physical elements which is ‘topologically equivalent’ or ‘isomorphous’ with the system of sensory qualities…” — F. A. Hayek, 1952

§1.1 Overview of Part I

We want to be to be able to measure finite neural networks. In particular, we want to state that the measure of a particular neural network has a measure of, say, 23.4. We wish to use finite measure space, i.e. probability space, to measure our finite neural networks. In order to justify the use of measure theory, we must show the following equivalences. First, we must show that finite neural nets are equivalent to finite automata. Next, we show that finite automata are equivalent to hidden Markov models. Then we demonstrate that hidden Markov models are valid on probability space. So it follows that both finite automata and finite neural nets are also valid on probability space. The following figure demonstrates our method of exposition. Finite Automata



Markov Chains

Key:



x ⇔ y : denotes x is equivalent to y x → y : denotes x is valid on y

Finite Neural Nets Finite Measure Space

Figure 1.1.1 The path showing that Finite Neural Nets can be measured by Probability Space.

Once we have reached the point where we know that neural nets can be measured with measure theory, we can apply this theory to some actual neural networks and

11

measure them by using Expected Value. Before we continue, we discuss the two neural net applications that we are going to solve in Chapter 4. The applications concern number theory (cryptography) and statistics (baseball). In each case, we design a neural network that solves our particular problem. We implement our design using a programming language. We then run the program, which allows us to determine the expected value of the neural net. The expected value is the McCulloch-Pitts measure, µ Ν , i.e. the measure of the neural net.

A remark on style In this part, whenever we discuss finite recurrent neural networks, we simply refer to them as neural nets and, likewise, we call hidden Markov models, merely Markov models.

§1.2 Definitions and Examples of the Three Models

We formally state the definitions of finite neural nets, finite automata, and Markov models in set theoretic terms, which is the language of both mathematics and computation. Set theory will be the organon of choice throughout this thesis. Recall that the set theoretic definition is equivalent to a mathematical model. After we define each computational model, we give an illustrative example to demonstrate it. Of the three models to be defined set theoretically, only finite automata seem to be naturally defined in that matter. Defining neural nets and Markov models set theoretically seems forced, yet we do this in order to help illustrate that neural nets, finite automata, and Markov models are all equivalent to one another. That being said,

12

historically neural nets were initially defined in terms of set theory, but presently this is rarely the case. For the remainder of Part I, unless otherwise stated, neural nets shall refer to the finite case. It should be noted that the reason these models are finite is due to their lack of memory, i.e. they have a limited amount of memory. In particular, the weights of the neural nets are contained in the integers.

Definition 1.2.1 A finite (recurrent) neural network is a 5-tuple N = (Q, Σ, f , n, F ) ,

where i. Q is a finite set called the network (or net) space (and ∀q ∈ Q, q is called a cell), ii. Σ is a finite set called the neuron space, iii. f: Q × Σ → Q is the transfer function, iv. n ∈ Q is the input fiber, and v. F = {a} ⊆ Q is the output fiber space. (Cf. [46, p. 2-2ff.].) Recall 1.2.2 In definition 0.1.3, we stated that recurrent refers to the fact that the network

architecture is like a cyclic graph, e.g. a triangle (a 3-cycle), a quadrilateral (a 4-cycle), or a pentagon (a 5-cycle). Cf. Figure 0.1.2. Example 1.2.3 A finite (recurrent) neural net, N:

i. Q =

,

ii. Σ = {0,1} ,

13

iii. f: Q × {0} → Q , where f is given by the symmetric hard limit (hardlims), viz. f (n) =

{

a = −1 if n < 0 a = +1 if n ≥ 0

iv. n ∈{−1,0,1} ⊂

,

, and

v. F = {±1} . Q, which is the net space, is composed of cells. The number of cells, which are the elements of the net space, are countably infinite since they are contained in

. Also,

∑, which is the neuron space, is the number of neurons in the network, in this case we have 0 (= the null case) or 1. The transfer function, f, helps to define the learning rule of the neural net. In this case, the learning rule is hardlims which is given by f (n) above. The input states that our input fibers (arcs) for a finite neural net must be in n ∈{−1, 0,1} . Lastly, as can be seen by the two possible values of a for f (n) , the output fibers (arcs) must be either 1 or -1. This neural net is given below in its traditional engineering representation. This figure, and all like it, were created using the MATLAB Neural Network Toolbox.

Figure 1.2.1 An example of a finite (recurrent) neural network (Example 1.2.3).

14

In Chapters 2 and 4 we will have the opportunity to discuss these and other aspects of neural nets in greater detail, to include transfer functions and engineering representations. Definition 1.2.4 A finite automaton is a 5-tuple M = (Q, Σ, δ , q0 , F ) , where

i. Q is a finite set called the states, ii. Σ is a finite set called the alphabet or input symbols, iii. δ : Q × Σ → Q is the transition function and where for ∀a ∈ Σ, δ (q, a ) is interpreted as the next-state, iv. q0 ∈ Q is the start or initial state, and v. F ⊆ Q is the set of accepts states; further we assume that δ (q, a ) = q for any accepting state F.4 ([109, p. 6], [110, p. 35].) Example 1.2.5 A finite automaton, M:

Let A be given by the state diagram given in Figure 0.1.2, which we will reproduce here.

Figure 1.2.2 An example of a finite automaton (Example 1.2.5). (Cf. Figure 0.1.2).

The finite automata has three states: q0 , q1, & q2 , with q0 denoted as the start state with an arrow pointing at it from nowhere and q1 denoted as the accept state with a double

4

This is a deterministic finite automata, which means that it cannot accept empty strings or words. Finite automata that can are called nondeterministic. These two types of automata are equivalent and since we do not need nondeterministic finite automata for this thesis, we will not bother to prove this theorem. A proof for it can be found in [110, p. 55-56].

15

circle. The arrows going from one state to another are called the transition arcs, or simply arcs. We will return to this example on occasion because it will help us understand such terms as string and accept, which will be formally defined in the next chapter. Let us suppose that we are given a string 1001. If we feed this string into the state diagram in Figure 1.2.3, then the finite automaton M computes the algorithm as follows: 1. start in state q0; 2. read 1, follow the transition from q0 to q0; 3. read 0, follow the transition from q0 to q1; 4. read 0, follow the transition from q1 to q2; 5. read 1, follow the transition from q2 to q1; 6. accept because the finite automata is in an accept state at the end of the input. Here is the state diagram in set theoretic notation. i. Q = {q0, q1, q2}, ii. Σ = {0,1}, iii. δ : {q 0 , q1 , q 2 } × {0,1} → {q 0 , q1 , q 2 } , where δ is given by the following table, which is an example of a state table for the finite automaton: Table 1.2.3 The transition function for M. 0

1

q0

q1

q0

q1

q2

q1

q2

q1

q1

16

iv. q0 ∈ Q , and v. F = {q1}. (Cf. [110, p. 34-36].) As was mentioned earlier it is unnatural to define Markov models in terms of set theory, but we have done this in order to hint at the equivalences of our three models. That being said, now that we have given definitions and illustrations of neural nets and finite automata, hopefully the following definition of Markov models will seem familiar if non-canonical. Definition 1.2.6 A hidden Markov model is a 5-tuple H = (Q, Y , δ , qo , ζ ) where

i. Q is a finite set called the states, ii. Y is a finite set called the output symbols, iii. δ : Q → Q is the transition function, iv. q0 ∈ Q is the start or initial state, and v. ζ is the output function ζ : Q → Y . [Cf. 26.] Observation 1.2.7 These models have the Markov property, viz. “given the current

state Xn, any information about the past is irrelevant for predicting the next state Xn+1” [30]. These types of Markov models are called hidden because only the outcome or observation matters to the observer, i.e. the observer is indifferent to the states. (In fact, we can say that these models are black boxes.)

17

Example 1.2.8 An Input/Output Hidden Markov Model:

⎡p ⎢ i. Q = ⎢ p ⎢p ⎣

11 21 31

p 13 ⎤ ⎥ p 23 ⎥ , ∀p p 33 ⎥⎦

p 12 p 22 p 32

∈ Σ ∋ ∃i, j ∈{1, 2,3} ,

i, j

ii. Let Σ = (a) [ p 12 ] , (b) [ p 33] , and (c) [ p 23 ] ,

⎡ p 11 iii. δ : ⎡⎣ p i , j ⎤⎦ × ⎢ p 21 ⎢ ⎢⎣ p 31

p 12 p p

22 32

⎤ p 23 ⎥⎥ → ⎡⎣ p p 33 ⎥⎦ p

13

i, j

⎤⎦ where we will let δ be given by the

subsequent matrix:

⎡.8 .3 .2 ⎤ δ [ p i , j ] = ⎢⎢.1 .2 .6 ⎥⎥ , ⎢⎣.1 .5 .2 ⎥⎦ ⎡ p 1j ⎤ ⎢ ⎥ iv. V = ⎢ p 2 j ⎥ , where p ⎢p 3j⎥ ⎣ ⎦ v. Then, (a) P[ p

1,2

1j

+p

2j

+p

3j

= 1,

] = .3 , (b) P[ p 3,3 ] = .2 , (c) P[ p

2,3

] = .6 .

[5, p. 615]. The hidden Markov model example that we gave is in matrix form, thus it may be helpful if we give an alternate definition of this model in the terms of linear algebra. This alternate definition is, of course, equivalent to the set theoretic one. Definition 1.2.9 A hidden Markov model in linear algebra is a 5-tuple

H = (Q, Σ , δ, V, P), where i. Q = ⎡⎣ p

i, j

⎤⎦ for any row i, column j of the matrix and is called the observation,

ii. Σ is a finite matrix called the state matrix, where its entries (i, j ) meet the following

18

criterion: ∀i, j , 1 ≤ i ≤ j ≤ k ∋ k ∈

. In other words, Σ has k states or, Σ is a k by k

matrix, iii. δ : Σ × Q → Σ is the stochastic matrix, which the sum of all of the entries of each column (fixed j) is one, i.e. ∀p

i, j

∈ V ∋ ∃i, j ∈

~ {0} p

1j

+p

2j

+ ... + p

i, j

= 1 ,5

⎡ p 1j ⎤ ⎢p ⎥ 2j⎥ iv. V is a finite vector called the probability vector, with V = ⎢ , where the sum of ⎢ ... ⎥ ⎢ ⎥ ⎣⎢ p i , j ⎦⎥ the column is 1. So, each column of the stochastic matrix is a probability vector. This implies that p v. P[ p

i, j

i, j

∈ [0,1] . Further, there are k many probability vectors.

] ⊆ Q is the transition probability.

(Cf. [5, p. 615ff.].) We spent this chapter defining and illustrating our three basic mathematical models of computation. In the next chapter we show that neural nets and finite automata are equivalent as well as the equivalence of finite automata and Markov models. The following table shows the isomorphic relationship of the 5-tuples that compose each model. Recall 1.2.10 Recall that models are isomorphic if there is one-to-one correspondence

between the members of the models that preserves the operations and relations. By inspection one can see that these models are isomorphic. Yet if one is not convinced by

5

In linear algebra (in general) this is called a transition matrix, but for Markov theory it is called a stochastic matrix because each column is a probability vector. A stochastic matrix is also called either a probability or a Markov matrix.

19

inspection, then the proofs of the equivalences of these models will show that these models are isomorphic. We denote that a model M1 is isomorphic to M2 by M1 ≈ M2. Table 1.2.4 Isomorphic Relation of the three models. Finite Automaton Name 5-tuple

Model States

Symbol M = (Q, Σ, δ , q0 , F ) M Q

Alphabet/ Σ Input Transition δ : Q × Σ → Q Function Start q0 State Accept F⊆ Q States

Neural Nets Name 5-tuple

Symbol N = (Q, Σ, f , n, F )

Model N Net Q space Neuron Σ space Transfer f: Q × Σ → Q function n∈Q Input fiber Output F = {a} ⊆ Q fiber space

20

Markov Models Name 5-tuple

Model States

Symbol H = (Q, Y , δ , qo , ζ ) H

Q

Output Y Symbols Transition δ : Q → Q Function Start q0 ∈ Q State ζ :Q →Y Output function

CHAPTER 2 EQUIVALENCE OF FINITE MODELS “Two sets of sentences are called “logically” or “inferentially equivalent”, or simply “equivalent”, if they have all their consequences in common (i.e. their sets of consequences coincide).” —Alfred Tarski, Fundamental Concepts of The Methodology of the Deductive Sciences, V, §3, 1930

§2.1

Proof Theoretic Equivalence of Finite Neural Networks and Finite Automata

In this chapter we prove two equivalences. First, we show that finite neural nets are equivalent to finite automata. Second, we demonstrate that finite automata are equivalent to hidden Markov models. By transitivity of equivalence we have as a corollary that finite Markov models are equivalent to finite neural nets.

Definition 2.1.1 A string over an alphabet is a finite sequence from that alphabet. An

alphabet is a group of symbols and symbols are taken to be intuitive like points in geometry. Definition 2.1.2 A language is a set of strings.

Before we define a number of technical terms relating to the theory of computation, recollect Example 1.2.5. We were given a string 1001, with the alphabet

Σ = {0,1} . This string is part of a language. In our example, the model returned the value True and so it is said to accept the string. Further, if this is true for all strings in a language, then the model is said to recognize the language. Definition 2.1.3 A class of languages, L, is simply a set of strings from the domain set.

21

Definition 2.1.4 In the theory of computation, two algorithms are said to be equivalent if

they recognize the same language [109, p. 1, 6], [110, p. 14]. That is, for two languages L1 and L2, we write L1 ~ L2, where we use the tilde to denote equivalence. Definition 2.1.5 Any formal model, M, recognizes language L if and only if L = {w ∈Σ | M accepts w}.

The first theorem that we prove as part of this thesis demonstrates the equivalence of finite automata and neural nets. We took as our guide for this proof [77, p. 55-57]. Before we turn to this proof, we must state a few definitions. Definition 2.1.6 A black-box automaton is a automaton where the user is only

concerned about the inputs and outputs (I/O), but not what happens internally. That is, the user does not care how the automaton works, only that it does. So, we have this:

I → Black Box → O . Example 2.1.7 1. A hidden Markov model can be viewed as black-box automaton because we do

not care about previous states. 2. An every day example is someone testing a computer program. The tester knows

the inputs and what the expected outcomes should be, but she has no idea (nor does she want to know) how the program works. Thus, she does not view the code—only the I/O. Definition 2.1.8 At each moment a cell, which is an element of the neuron space (Cf.

Definition 1.2.1), is either firing or quiet—these are the two possible states of the cell. We can think of the firing state as a (electronic) pulse while the quiet state yields no

22

(electronic) pulse. Or, in terms of the Boolean operation, the firing state can be given the value 1 (or True) and the quiet state the value of 0 (or False). Remark 2.1.9 From the definitions of firing and quiet, it can be seen that each cell is a

finite-state machine and accordingly operates discretely. This fact hints at our eventual goal of showing that neural nets can be measured by discrete probability space. Further, it is interesting to note that neural nets themselves are composed of “mini” finite-state machines. Definition 2.1.10 The output fiber (arc), at some point, will terminate as an input

connection to another (or perhaps the same) cell. In Figure 1.2.2 on p. 15, q1 has an input that terminates in another cell (0) and an input connection that terminates to itself (1). There are two types of termination: excitatory or inhibitory. Definition 2.1.11 Excitatory inputs allow a cell to fire. Example 2.1.12 Both the AND and OR operators (or gates) are excitatory inputs, see

Table 2.1.1. Definition 2.1.13 Inhibitory inputs stop a cell from firing. Example 2.1.14 The NOR operator (or gate) is an inhibitory input, see Table 2.1.1. NOR

is typically called the joint denial. The following truth tables give the Boolean values for AND ( ' ∧ ' ), OR ( ' ∨ ' ), and NOR ( ' ↓ ' ). Truth tables usually contain T and F as their values, but in this case we will follow the computer science practice and use 1 and 0, where T = 1 and F = 0. The first two columns of each table can be considered to be the signals on input fibers and column three can be considered to be the signals on output fibers. Thus, when a row of a truth

23

table has the value of T = 1 , then the cells fire. For example, row 1 of AND, rows 1 - 3 of OR, and row 4 for NOR; we have bolder these rows.

Table 2.1.1 Truth tables for AND, OR, and NOR.

P

Q

P∧Q

P

Q

P∨Q

P

Q

P↓Q

1

1

1

1

1

1

1

1

0

0

1

0

0

1

1

0

1

0

1

0

0

1

0

1

1

0

0

0

0

0

0

0

0

0

0

1

Historical Note: The Austrian L. Wittgenstein in 1922 and the American E. Post, 1921, independently and simultaneously discovered truth tables. Both of these works were used for their doctoral thesis, at Cambridge and Princeton respectively. Also, the joint denial,

' ↓ ' , was discovered independently by C.S. Pierce (c. 1880) and H. M. Scheffer (1912).

Definition 2.1.15 The threshold of a cell decides the state-transition properties of a cell

C. That is, the threshold can be viewed as a type of transition function. The threshold is determined precisely by the number of values that must be excited, i.e. true, in order for the cell to fire an output. This can be stated as the following principle. Principle 2.1.16 (Minsky, 1967) Principle of Excitatory Inputs: If no inhibitor is firing,

then we count up the number of excitory inputs that are firing; if this number is equal to

or greater than the threshold of the cell, then the cell will fire at time t + 1. In other

24

words: a cell will fire at a time t + 1 if and only if, at time t, the number of active

excitatory inputs equals or exceeds the threshold, and no inhibitor is active. The subsequent examples of the Principle will make most sense in reference to Table 2.1.1. Example 2.1.17 1. A cell with threshold 0 will fire at any time unless an inhibitor prevents this, e.g.

NOR operator. 2. A cell with threshold 1 will fire if any excitatory fiber is fired and no inhibitor is

active, e.g. OR operator. 3. A cell with threshold 2 requires at least two excitations (and no inhibition) if it is

to fire at the next moment, e.g. AND operator.

Here is the statement of the first model equivalence of this thesis. Theorem 2.1.18 (Minsky, 1967) Every finite automaton, M, is equivalent to some neural

net NM , i.e. ∀M , ∃N M ( M ~ N M ) , where NM means the set of functions f : M → N . That is, given any finite automaton M, we can build a certain neural net NM which, regarded as a black-box automaton, will be precisely like M. So, we wish to prove that

M ~ NM. Before we prove Theorem 2.1.18, we discuss an example that is an instantiation of the proof. Example 2.1.19 Given: Suppose that a finite automaton, M , is equivalent to a neural net

N M . Let M have inputs S 1 and S 2 , outputs R 1 and R 2 , and states Q 1 and Q 2 . Also, let N M be with input fibers s 1 and s 2 , output fibers r 1 and r 2 , and four cells, 25

C 11 ,C 12 ,C 21 , and C 22 . We will arrange these cells in a two-dimensional ( 2 × 2 ) matrix and the cells has a threshold of 2 and the output has a threshold of 1.

Problem: Using C 21 , demonstrate that M ~ N M is true for this particular case. See Figure 2.1.1, which is a partial representation of this equivalence. In particular, it shows part of the activity of C 21 .

G-Function Connection Box G (Q1, S2) = Q1

Q1 S1

Q2

2

2 C 12

C 11

S2

2

2

C 22

C 21

1

F (Q1, S2) = R2 R22

F-Function Connection Box

Figure 2.1.2 A canonical finite-state neural net for (Q 1 , S 2 ) from Example 2.1.19,

partial diagram. Recall that S 1 and S 2 are the inputs and R 2 is the output.

26

Solution: This solution can be broken into four parts. 1. We need to use C 21 to confirm that M ~ N M so we look at Figure 2.1.2, p. 26, as a matrix. Since, C 21 is the intersection of row S 2 and column Q 1 , this means that C 21 will fire at time t + 1 with the input of S 2 occurring at time t and when the machine M is at state Q 1 . Thus the firing of C 21 is equivalent to the pair of events (Q 1 , S 2 ) . To make N M equivalent to the machine M we only have to arrange things so that it produces the proper output F (Q 2 , S1 ) and goes into the appropriate internal state

G (Q 2 , S1 ) . (N.B. The indices are switched because we are discussing on how to convert the I/O from M to N M .) We will do this by concentrating on the vertical column Q 1 . 2. At any moment t there will be, among all the descending (column) fibers, precisely one active fiber. Suppose that this fiber is in the 1st column. Then the neural net N M is simulating the state Q 1 of M . Suppose also that at this same moment, t , one horizontal (input) fiber is active in the 2nd row. This corresponds to the input signal S 2 . Then precisely one array call will fire at the time t + 1 ; this will be cell C 21 . The fiber from C 21 divides into two branches. 3. The descending branch (the downward arrows in Figure 2.1.2) goes to the “Ffunction connection box” and there leads to the cell which represents the appropriate output F (Q 1 , S 2 ) . Thus, the function is “wired” into this part of the

27

neural net, i.e. the output will occur automatically. As a result we do not need to concern ourselves with the descending branch any further. 4. The ascending branch (the upward arrow in Figure 2.1.2) of the output from C 21 is responsible for the machine’s change of state. This fiber goes up through the “G-function connection box” which is so wired that this fiber enters the descending column for the appropriate new state, viz. G (Q 1 , S 2 ) . If the cell C 21 is the only cell that fires at time t + 1 , then we are thus assured that at time t + 1 there will again be precisely one active fiber among the descending columns (and that it will be in the column corresponding to the next state of the simulated machine M ). Thus, if the signals entering N M are at each moment the same as those entering M , both machines will go through the same sequences of states and outputs. (The output of

N M will be delayed by one time unit because of the OR cells in the output connection box.) This shows that C 21 demonstrates M ~ N M .

The proof of Theorem 2.1.18 is a generalization of Example 2.1.19 that we just solved, ergo the proof should be easy to understand, if not a little redundant. Yet, redundancy in the name of perspicuity is always allowable. The form of the proof is subsequent. Show that: 1. the firing of Ci , j is equivalent to the pair of events (Q j , Si ). 2. that Ci , j is the only cell firing at time t + 1 . 3. that the neural net produces the proper output.

28

4. the neural net produces the correct change of state.

Proof (Theorem 2.1.18): (Modeled closely after [76, p. 55ff.].)

(⇐) : We first show that each neural net is a finite automata. At any moment, all of the states of the neural net are given by the firing pattern of its cells. The truth tables in Table 2.1.1 are example of firing patterns. The transition function Q (t + 1) = G (Q(t ), S (t )) is determined by the connection structure of the neural net and the output function

R (t + 1) = F (Q(t ), S (t )) is determined by which fibers are designated as carrying output signals (cf. Figure 2.1.2).

(⇒) : We need to construct a neural net NM which is equivalent to M. In order to do this, we must stipulate how the input, output, and states are to be represented. Note that we have used the terminology from the definition of a finite automaton (viz., input, output, and states) as opposed to the terminology of neural nets (Cf. Definition 1.2.1). The reason for this is that the terminology of finite automata is the canonical terminology in mathematics and computer science (Cf. Table 1.2.4). Table 2.1.3 has the I/O and the states.

Table 2.1.3 I/O and state representation of any finite automaton (M) and some neural net

(NM). Machine Finite Automaton (M)

Input (I) S1 ,..., S m ∀m ∈

Output (O) R1 ,..., Rn ∀n ∈

States Q1 ,..., Q p ∀p ∈

Neural Net (NM)

s1 ,..., s p ∀m ∈

r1 ,..., rn ∀n ∈

m cells, C 1, j ,..., C m , j ,

∀Q j ∈ M

29

The I/O for both M and NM are self-explanatory as are the states for M; however, we need to explain the exact meaning of the states for NM. Recall that cells are the elements of the net space. In this proof, the cells Ci , j are to be arranged in a twodimensional ( m × p ) matrix. The reason that the cells are arranged in this way can be seen by the diagram. Of course, in this proof we want to mentally extend the diagram from a 2 × 2 array to a m × p array. Mimicking our example, we will break the proof into four parts. 1. Each of the cells Ci , j has threshold 2 since it is a two dimensional array. That is, the cell’s threshold is dependent on both S i ∧ Q j , and thus it is dependent on the AND gate. We will exploit this fact to detect coincidences. This means that the cell Ci , j will fire (at time t + 1 ) precisely when it receives an input from Si (at time t ) and when the simulated machine M is in a state Q j by the Principle of Excitatory Inputs. Thus the firing of Ci , j is equivalent to the pair of events

(Q j , Si ). To make the neural net, N M , equivalent to the machine M we now have only to arrange things so that it produces the proper output F (Q i , S j ) and goes to the appropriate internal state G (Q i , S j ) . (N.B. The indices are switched because we are discussing on how to convert the I/O from M to N M .) We will do this by concentrating on the vertical columns, Q j . 2. At any moment t there will be, among all the descending fibers (i.e., the (columns), precisely one active fiber. Suppose that this fiber is in the jth column.

30

Then the neural net NM is simulating the state Qj of M. (It does not matter which fiber of the column is active, since all of the columns have the same connections.) Suppose also that at this same moment, t , one horizontal (input) fiber is active the ith row. This corresponds to some input signal S i . Then precisely one array cell will fire at the time t + 1 (by the Principle of Excitatory Inputs); this will be cell C i j . The fiber from C i j divides into two branches. 3. The descending branch (the downward arrows in Figure 2.1.2) goes to the “Ffunction connection box” and there leads to the cell which represents the appropriate output F (Q j , Si ) . Thus, the function is “wired” into this part of the neural net, i.e. the output will occur automatically. As a result we do not need to concern ourselves with the descending branch any further. 4. The ascending branch (the upward arrow in Figure 2.1.2) of the output from C i j is responsible for the machine’s change of state. This fiber goes up through the “G-function connection box” which is so wired that this fiber enters the descending column for the appropriate new state G (Q j , S i ) . If the cell C i j is the only cell that fires at time t + 1 , then we are thus assured that at time t + 1 there will again be precisely one active fiber among the descending columns (and that it will be in the column corresponding to the next sate of the simulated machine

M ). Thus, if the signals entering N M are at each moment the same as those entering M , both machines will go through the same sequences of states and outputs. (The output of

31

N M will be delayed by one time unit because of the OR cells in the output connection box.) Ergo, M ~ N M .

§2.2

Q.E.D.

Proof Theoretic Equivalence of Finite Automata and Markov Chains

We have shown that finite neural nets are equivalent to finite automata. Now we will demonstrate that finite automata are equivalent to Markov chains, or in particular a hidden Markov models. Then, trivially from the law of transitivity for equivalences we will have the subsequent chain of equivalences: N ∼ M ∼ H . Recall that each model is defined as a 5-tuple, in particular for finite automata, we have M = (Q, Σ, δ , q0 , F ) and for Markov models, H = (Q, Y , δ , qo , ζ ) . Theorem 2.2.4 is based on a partial proof by [26]. This partial proof is contained in a online tutorial of Dynamical Systems. We, of course, do the entire proof—and leave nothing as an exercise. First, we are going to return to some of the earlier definitions and refine them in light of the proof for the following theorem.

Denotation 2.2.1 1. If I is a set of finite inputs, then let I * denote the set of all finite sequences of

inputs. 2. Let ε denote the empty or null string. Definition 2.2.2 We say that a hidden Markov models accepts a string

t = y0 ,..., yk ∈ Y * if and only if there exists a sequence of states and outputs

32

q0 , y0 , q1 ,..., yk , qk , yk +1 such that qi+1 ∈ δ (q1 ) and ζ (qi ) = yi for i ∈ [0, k ] and

ζ (qi ) = ε for i ∈ [k + 1, ∞) . Definition 2.2.3 We say that a finite automaton accepts a string t = σ0 ,..., σk ∈ Σ* if

and only if there exists a sequence of states and inputs q0 , σ0 , q1 ,..., σ k , qk +1 such that

δ (q i , σ i ) for i ∈ [0, k ] and q k +1 ∈ F . Theorem 2.2.4 Finite automata and hidden Markov models are equivalent, i..e.,

M ∼ H. In order to prove this theorem, we ask ourselves how one can model M using H and the converse query. This proof is more set theoretic then the previous one; however the form of the proof for each direction will be similar. That is, we must do the following: 1. State how to model the accepting state of one model in terms of the other. 2. Similarly for the outputs. 3. Stipulate how the inputs of one model will be converted to the outputs of the other. 4. Show that the two models accept the same language. Proof:

We prove the converse first, viz. any hidden Markov model is equivalent to some finite automaton. ( ⇐ ): We want to model H = (Q, Y , δ , qo , ζ ) using M = (Q, Σ, δ , q0 , F ) . 1. To model accepting states in H, we create a set of states Q ′ = Q ∪ {a} , where a is a new state. 2. To model the outputs using M, let Y = Σ ∪ {ε} .

33

3. To convert from the inputs of M to the outputs of H, let Q ′′ = Q ×Σ ,

δ ′ ((q, σ )) = {(q ′, σ ′) : δ (q, σ ′) = q ′ , ∃σ ′ ∈ Σ} , δ ′ ((a, σ )) = {(a, ε)} , and ζ ((q, σ )) = σ. The equivalent hidden Markov model is H ′ = {Q ′′, Y , δ ′,(q0 , ε), ζ ). Now we must show that M accepts a string t in Σ* if and only if H ′ accepts t. 4. Since Step 3 converts the inputs of M to the outputs of H, we know that M accepts a string t of states and inputs if and only if H ′ accepts a string t of states and outputs for the conditions given in definitions 2.2.2 and 2.2.3. ( ⇒ ): We want to model M = (Q, Σ, δ , q0 , F ) using H = (Q, Y , δ , qo , ζ ) . 1. Let the accepting states remain the same. 2. Also, let the output stay the same. 3. To convert from the inputs of H to the outputs of M, let

δ ′(q, ζ (q ′)) = q ′ ∀q ′ ∈ ζ (q ) . Let F be the unique largest sets of all states in Q such that ζ (q ) = ε and δ (q ) ∈ F . The equivalent finite automata is M ′ = (Q, Y , δ ′,qo , F ). Now we must show that H accepts a string t in Y * if and only if M ′ accepts t. 4. Since Step 3 converts the inputs of M to the outputs of H, we know that H accepts a string t of states and inputs if and only if H ′ accepts a string t of states and outputs for the conditions given in definitions 2.2.2 and 2.2.3.

Q.E.D.

Corollary 2.2.2 (Finite Equivalence Theorem) Finite neural nets are equivalent to

finite automata and finite automata are equivalent to hidden Markov models, i.e.,

N ∼ M ∼ H . Further, neural nets and Markov models are equivalent, i.e. N ∼ H . .

34

Proof: The proof is given in the terms and methods of deductive logic. (Cf. [26].) 1. N ∼ M . (Premise Theorem 2.1.18) 2. M ∼ H . (Premise Theorem 2.2.1) 3. ( N ∼ M ) ∧ ( M ∼ H ). (1, 2 Adjunction) 4. ∴ N ∼ H . (3 Transitivity of Equivalences) 5. ( N ∼ M ) ∧ ( M ∼ H ) ∧ ( N ∼ H ). (3, 4 Adjunction) 6. ∴ N ∼ M ∼ H . (5 Transitivity of Equivalences) Q.E.D.

§2.3

Remarks on the Finite Equivalences

Using the machinery of mathematical logic, we are able to show that the three models are equivalent. This means that anything that can be demonstrated about one computational model, can be said about the other two. In particular for this thesis, when we show that Markov models are valid on finite measure space, we will be able to state that this is also true for neural nets (as well as finite automata). Both equivalence proofs (Theorems 2.1.18 and 2.2.4) use a similar methodology to show that N ~ M and M ~ H respectively. In both proofs, in Step 1 we began by showing that the accept states (i.e.,. firing) of the two models are the same. For Steps 2 and 3 we discussed the relationship between the inputs and outputs. Lastly, Step 4 we show that the proposed equivalent model has the correct change of states (i.e., accepts the same language).

35

So mathematical logic allows us to bridge gaps between different mathematical models. In this way mathematical logic is an unlikely, yet natural, heir to the critical philosophy of Immanuel Kant [60] and Ludwig Wittgenstein [125], [126]. Both of those philosophers set out to bridge the gap between empirical data (whether it be sensory or linguistic) with the brute facts that compose the world. Similarly, logical models can set out to bridge the gap between physical models (e.g. neural nets) and abstract models (e.g. finite automata).

36

CHAPTER 3 MARKOV MODELS & MEASURE ON FINITE MEASURE SPACE “It is a common view that belief and other psychological variables are not measurable, and if this is true our inquiry will be in vain…” —F. P. Ramsey, 1926

§3.1 Probability Axioms, Markov Property, &c.

With the exception with talking about coins or games of chance, the best way to begin the discussion on probability theory is by stating the three probabilistic axioms: Axiom 3.1.1 (Axioms of Probability Theory)

A probability, P, is a way of assigning numbers to events that satisfy these conditions: 1.

∀ Events A, 0 ≤ P ( A) ≤ 1 . N

2. Ω is the sample space ⇒ P (Ω) = 1 , where Ω = ∪ Ai . i =1

3. For a finite or countably infinite sequence of disjoint events, we have the following: P (∪ i Ai ) = Σi P( Ai ) . (This condition is called countable additivity.) Definition

3.1.2

We

will

define

conditional

probability

as

such:

P ( B | A) = P ( A ∩ B ) / P ( A) , with P ( A) ≠ 0 . Conditional probability is valid by the axioms

since

it

follows

directly

from

the

Multiplication

Rule,

viz.

P ( A ∩ B ) = P ( A) ⋅ P( B | A) . Probability has its origins in statistics, thus the terminology is different than that of analysis.6 We will discuss the implications of this in the next section. We will now give the formal definition of the Markov property, the essential feature of Markov models. Both Axiom 3.1.1 and Property 3.1.3 were modified from [30, p. 2, 28-29]. 6

Our favorite general book on the: (a) history of probability [45] and (b) philosophy of probability is [41].

37

Property 3.1.3 (Markov Property)

We say that Xn is a discrete time Markov chain with transition matrix p (i, j ) if

∀ j , i n−1 , ... i 0 , i P ( X n+1 = j | X n = i, X n−1 = i n −1 ,..., X 0 = i 0 ) = p (i, j ) . To simplify this

property

we

could

focus

on

the

temporally

homogeneous

case,

viz.

p (i, j ) = P ( X n+1 = j | X n = i ) . A further note on terminology. We have just called Xn a Markov chain as opposed to model, this is in order to follow stochastic convention. For our purposes, these terms are essentially interchangeable, but generally chain implies working within probability and measure theory whereas model generally implies working within computability and set theory. In §0.1, we discussed how a state diagram can describe both neural nets and finite automata. By the Finite Equivalence Theorem it should come as no surprise that Markov models can also be described by state diagrams. In fact, pictorial representations of Markov models hint as to why these models are referred to as chains.

Figure 3.1.1 A state diagram of a Markov Chain; taken from [118].
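To make Property 3.1.3 concrete, the following is a minimal sketch of ours (not part of the thesis programs, written in the same C++ style as Appendix II) that simulates a temporally homogeneous two-state chain: the next state is sampled from the current row of the stochastic matrix p(i, j), and no older history is consulted.

#include <cstdlib>
#include <ctime>
#include <iostream>
using namespace std;

// Sample the next state j with probability p(i, j); each row of the
// stochastic matrix is nonnegative and sums to 1.
int nextState (const double p[2][2], int i)
{
    double u = static_cast<double>(rand()) / RAND_MAX;
    double cum = 0.0;
    for (int j = 0; j < 2; j++)
    {
        cum += p[i][j];
        if (u <= cum) return j;
    }
    return 1; // guard against floating-point rounding
}

int main ()
{
    srand (time (0));
    const double p[2][2] = { {0.9, 0.1},
                             {0.4, 0.6} };
    int x = 0; // X_0
    for (int n = 0; n < 10; n++)
    {
        cout << "X_" << n << " = " << x << endl;
        x = nextState (p, x); // only X_n matters: the Markov property
    }
    return 0;
}

Note that nextState receives only the current state; the ignored history is exactly the criterion "any information about the past is ignored" discussed below.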

Naturally there are many applications of Markov chains, such as queueing theory, but we will not concern ourselves with those. All we wish to show is that the Markov property is valid within probability, i.e. it does not contradict the Probability Axioms. The Markov property is merely the definition of conditional probability with a particular criterion concerning the previous states, viz. “any information about the past is ignored.” Thus, the property is clearly valid under the Axioms since the definition of conditional probability is itself valid. Further, in Example 1.2.8, there was a matrix that we said met the Markov property. This is true because there is an existence proof stating that, given such a matrix, there exists some Markov chain on the probability space [7, p. 232]; [15, p. 112]. To reiterate: given a stochastic matrix, there exists a Markov chain on probability space. The discussion in the last two paragraphs proves the following theorem.

Theorem 3.1.4 Markov chains, and hence hidden Markov models, obey the Axioms of Probability. In other words, they exist (i.e., are valid) on probability space.

Corollary 3.1.5 Finite neural nets and finite automata are valid on probability space. This follows directly from the Finite Equivalence Theorem and the previous theorem.

§3.2 Measure Theory and Probability Measure—Finite Measure Theory

The task before us now is to show that probability space is a finite measure space. We also define the space discretely; this is because we are dealing with finite models. Once we show this, we can use probability to measure the computability of neural nets and equivalent computational models. We will use the notion of expected value as the mechanism for measure. In other words, the expected value of a neural net will be its computability. This measure is called the McCulloch-Pitts measure, µ_N. We should mention that by measure theory, we are referring to Lebesgue measure and the Lebesgue integral.7 It would be, of course, beyond the scope of this thesis if we were to show that all of probability theory is valid on all of measure theory. Since the metric of computability that we desire is the expected value, we will only need to offer justification for that. We are going to assume that the reader is familiar with measure theory and probability theory; thus we only discuss the essentials of the two theories that point out their concomitantness.

7 We have found the following books on analysis and measure theory to be helpful: [65], [95], [96], [97].

Definition 3.2.1 In (Lebesgue) measure theory, we have the three-tuple (X, A, µ), where X is a set, A is a σ-algebra, i.e. the Borel field, and µ is the measure. The only restrictions on X are the Axioms of Zermelo-Fraenkel Set Theory with the Axiom of Choice (ZFC) (Cf. [79]).

Definition 3.2.2 A collection A of subsets of X is an algebra if it meets the two Boolean properties:
1. (A ∈ A) ∧ (B ∈ A) → A ∪ B ∈ A.
2. A ∈ A → Aᶜ ∈ A.
Further, an algebra A is called a σ-algebra if every union of a countable collection of sets in A is again in A.

Definition 3.2.3 The Carathéodory definition of measure is: a set E ⊆ X is measurable if ∀ A we have µ*(A) = µ*(A ∩ E) + µ*(A ∩ Eᶜ). We will state, without argument, that we can drop the superscript ‘*’ because the outer measure is equal to the inner measure.

Definition 3.2.4 Analogously, in probability theory we have the three-tuple (Ω, F, P), where, as above, Ω is a set, the sample space; F is called a σ-field (which is the same as a σ-algebra); and P is the measure, and as the second Axiom of Probability states, the measure of the probability space is 1. The online text [11] proved most helpful for describing probabilistic terminology analytically and so we followed it closely.

Discussion 3.2.5 Let F be the σ-field of all subsets of a countable space Ω and let p(ω) be a nonnegative function on Ω. Suppose that Σ_{ω∈Ω} p(ω) = 1 (by Axioms 2 and 3), and define P(A) = Σ_{ω∈A} p(ω). Since p(ω) ≥ 0, the order of summation is irrelevant (by Dirichlet’s Theorem, [6], p. 570-71). Suppose that A = ∪_{i=1}^{∞} A_i, where the A_i are disjoint, and let ω_{i1}, ω_{i2}, … be the points in A_i. By the theorem on nonnegative double series (Ibid., p. 571), P(A) = Σ_{i,j} p(ω_{ij}) = Σ_i Σ_j p(ω_{ij}) = Σ_i P(A_i), and so P is countably additive. (Ibid., p. 21).

Definition 3.2.6 If a probability space meets the criteria in Discussion 3.2.5, then this space, (Ω, F, P), is a discrete probability space. This space is the formal basis for discrete probability theory, and it is the space we use since our machines are discrete.

Definition 3.2.7 Elements of F are called events.

Remark 3.2.8 One can see that probability space is a subset of Lebesgue space. In particular, the probability space has a Lebesgue measure of 1.

Definition 3.2.9 The expected value of X is E(X) = ∫_A X dP, which is clearly a Lebesgue integral.


Definition 3.2.10 Further, we can say that random variables are real-valued functions from Ω to ℝ. So given a random variable X, we can define a probability on ℝ as P_X(A) = P(X ∈ A), ∀ A ⊆ ℝ.

Remark 3.2.11 The probability P_X is the distribution of X. The random variable 1_A can be defined as

1_A(ω) = 1 if ω ∈ A ⊆ Ω, and 0 otherwise,

which is called the indicator.8

8 In analysis we call this the characteristic function, but in probability that term is reserved for the Fourier transform.

Let us return to expected value for a moment. There is an important aspect of expected value that must be made clear. We demonstrate this aspect by using a single die. The expected value for rolling a die (once) is E(X) = (1 + 2 + 3 + 4 + 5 + 6)/6 = 21/6 = 3.5, since P(X = x) = 1/6 for x = 1, …, 6. Obviously, it is impossible to roll a 3.5, but that is the expected value, i.e. the mean of X. This concept should be recalled when we find the expected value of neural net applications.
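The following is a small sketch of ours (not one of the thesis programs) that checks this computation two ways: from the definition E(X) = Σ x · P(X = x), and as the long-run average of simulated rolls.

#include <cstdlib>
#include <ctime>
#include <iostream>
using namespace std;

int main ()
{
    // From the definition: each face 1,...,6 has probability 1/6.
    double ev = 0.0;
    for (int x = 1; x <= 6; x++)
        ev += x * (1.0 / 6.0);

    // By simulation: the sample mean of many rolls approaches 3.5.
    srand (time (0));
    const int N = 100000;
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += 1 + rand() % 6;

    cout << "E(X) from the definition: " << ev << endl; // prints 3.5
    cout << "Sample mean of " << N << " rolls: " << sum / N << endl;
    return 0;
}

The second computation is the same mechanism we use for the McCulloch-Pitts measure below: the measure of a neural net is estimated by averaging repeated runs.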

The following proposition tells us that the expected value of an indicator function has a measure. This is important to prove because our neural nets will be thought of as indicator functions. The proof is entirely measure theoretic, as one would expect.

Proposition 3.2.12 If g, an indicator, is bounded or nonnegative, then E[g(X)] = ∫ g(x) dP_X.

Proof: If g is the indicator of an event A, this is merely the definition of P_X. By linearity, the result holds for simple functions. By the Monotone Convergence Theorem, the result holds for nonnegative functions, and by linearity again, it holds for bounded g. Q.E.D.

We are at the point where we should summarize where we are and foreshadow where we are going. Finite neural nets are equivalent to finite automata, which are equivalent to hidden Markov models. Markov chains are valid in probability theory, and probability theory is merely Lebesgue measure on finite (discrete) measure space. Thus, Markov chains can be measured in probability space, and by implication so can neural nets. In particular, the probability space has measure 1. In probability we consider the Lebesgue integral to be the expected value, E. This integral tells us the expected value of a function. Therefore, we can use E to measure the computability of computational models like neural nets, finite automata, and Markov chains. Our next step will be to give some examples of neural nets and to measure them using the McCulloch-Pitts Measure.

Definition 3.2.13 The McCulloch-Pitts measure is the measure of a neural network. We denote the measure µ_N and define it as µ_N = E[g(X)] = ∫ g(x) dP_X, where g is the indicator function of the probability space X.

Remark 3.2.14 It follows from this definition and the Equivalence of Models theorem that we can use this measure to measure both finite automata and hidden Markov models.

Historical Note: The reason that µ_N is christened the McCulloch-Pitts measure is that in 1943, W. McCulloch and W. Pitts published a paper in the Bulletin of Mathematical Biophysics called “A logical calculus of the ideas immanent in nervous activity” [75]. Logical Calculus is a seminal paper for many reasons, some of which follow. First, it was the first mathematical model of neural nets and finite automata; hence, all neural and automata research since is indebted to it. Second, the authors took care to use current neuro-physiological data as the basis for their model. Third, McCulloch and Pitts used modern logic as the language of their paper. N.B. More works on the history, philosophy, and science of neural nets can be found in the references.


CHAPTER 4
APPLICATIONS OF FINITE MEASURE

“What a finite automaton can and cannot do is thought to be of some mathematical interest intrinsically, and may contribute to better understanding of problems which arise on the practical level.” —S.C. Kleene, 1956

§4.1 Design Procedure for Neural Networks

We are going to look at two applications of finite neural networks. Both of these examples are simple. We have chosen to do this because the aim of this thesis is to show how one uses measure theory to measure the computability of neural networks, not to measure a particularly difficult neural net. Of course, we would like to be able to do that in future study. The two applications concern the following topics: number theory (cryptography) and statistics (baseball). We chose to do only two examples for the following reasons. First, this is a pure mathematical thesis and thus the emphasis should be on the theory. Second, two applications are sufficient to see how neural nets are applied and how we measure them using the McCulloch-Pitts measure µ_N. Third, two applications are all we need to compare their strength quantitatively. For both applications, we state the problem, discuss the application, describe the solution to the problem, and measure the computability of the neural net. The programming code that is used to solve the problem for each application is in Appendix II, with sample output as well as other technical remarks. We used C++ for the cryptography application and MATLAB, including the Neural Network Toolbox, for the baseball application. We have been remiss up to this point in not talking about some aspects of


neural nets that did not arise when we discussed them with respect to the theory of computation, but that now come into play since we are applying the theory. Neural nets are used to solve problems, and since they can be programmed in computer languages, they are able to deal with problems that have large data sets. As was mentioned at the beginning of the thesis, neural nets are “neural” because they are based on a connectionist view of the mind. Of course, many problems can be solved by other networks, and in many cases those networks are superior. Yet neural nets are quite good at solving certain problems because they are trained, or taught, by a learning rule. So the general process of teaching a neural net is this. First, the net receives the input. Next, the neural net alters the data set and produces an output. It compares this output to the target value and then adjusts the weights accordingly. It repeats this process until the error is at an acceptable level, i.e. the net has learned the target value.

Figure 4.1.1 The process of teaching a neural network (MATLAB Neural Network Toolbox).


As can be seen, Figure 4.1.1 illustrates the overall process of a neural net. Yet it is the internal process, which occurs in the “Neural Network” box, where the action happens. For it is there that the learning rule is incorporated, the biases are set, and the output is calculated, among other mathematical operations. Ergo, we will now turn our attention to that part of the network.

Definition 4.1.1 A neural net is composed of layers of S neurons; each neuron includes the weight matrix, the summers, the bias vector b, the transfer function boxes, and the output vector a. Each element of the input vector p is connected to each neuron through the weight matrix W. This implies that the input is not part of the neuron. Recall that our weights are all contained in ℤ.

The following figure is an illustration of a 1-1-1 two-layer neural network. This is the illustration that describes our first application. It is a two-layer network because it has two neurons, viz. the Log-Sigmoid and the Linear layer. It is called a “1-1-1” because it has 1 input, 1 neuron in the first layer, and 1 neuron in the second layer.

[Figure: Inputs → Log-Sigmoid Layer, a1 = logsig(w1·p + b1) → Linear Layer, a2 = purelin(w2·a1 + b2)]

Figure 4.1.2 An illustration of the 1-1-1 two-layer network (MATLAB Neural Network Toolbox).


Definition 4.1.2 The backpropagation algorithm is an algorithm that propagates the input forward through the network and then propagates the sensitivities backward through the network. Finally, the weights and biases are updated using the approximate steepest descent rule. Our programs use this algorithm.

Remark 4.1.3 Sensitivities are a mathematical way of describing the recurrence relationship in which the sensitivity at layer m is computed from the sensitivity at layer m + 1. The approximate steepest descent rule is an algorithm that allows us to search the parameter space and locate the minimum points of the surface, i.e. to find the optimum weights and biases for a given neural network.

Algorithm 4.1.4 Here is the listing for the backpropagation algorithm ([46], p. 11-7ff.). A condensed code sketch of these steps appears after Remark 4.1.5 below.
1. Decide on the architecture of the neural net, e.g. 1-1-1 or 1-2-1, and what transfer functions are to be used.
2. Set the weights and biases.
3. Send the initial input through the first layer.
4. The output of the first layer becomes the input of the second layer.
5. Repeat this process until there are no more layers.
6. Determine the error of the network and take the derivatives of the transfer functions.
7. Backpropagate the sensitivities through each layer (from the last back to the first). (Cf. Definition 4.1.2.)
8. Use the sensitivities to update the weights. Return to step 3, but use a new input from the target set.
9. Continue steps 3-8 until the error is reasonable, usually ±0.5.
10. The average of the number of iterations is the measure of the neural net.

Remark 4.1.5 There are dozens of different learning algorithms for neural nets, so why would one want to choose backpropagation over other algorithms? The reason is that with backpropagation, we are guaranteed convergence (Ibid., p. 11-19 through 11-21). That means at some point, albeit it may take a long time to get there, the neural net will learn the target value. Of course, both of our applications converge rather quickly.
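The following condensed sketch of ours distills steps 1-10 for the 1-1-1 network of Figure 4.1.2. The wrapper trainOnce and the sample target t = 6 are our own choices; the full, interactive program is in Appendix II.

#include <cmath>
#include <cstdlib>
#include <ctime>
#include <iostream>
using namespace std;

// One training run: returns the number of iterations until |error| <= 0.5.
int trainOnce (double t)
{
    double w1 = 1 + rand() % 9; // random integer weights, one per layer
    double w2 = 1 + rand() % 9;
    const double alpha = 0.1;   // learning rate
    int iterations = 0;
    double e;
    do
    {
        double p  = rand() % 9;                 // input
        double a1 = 1.0 / (1.0 + exp(-w1 * p)); // layer 1: logsig
        double a2 = w2 * a1;                    // layer 2: purelin
        e = t - a2;                             // error
        double s2 = -2.0 * e;                   // sensitivity, layer 2
        double s1 = (1.0 - a1) * a1 * w2 * s2;  // backpropagated to layer 1
        w2 -= alpha * s2 * a1;                  // approximate steepest descent
        w1 -= alpha * s1 * p;
        iterations++;
    } while (e > 0.5 || e < -0.5);
    return iterations;
}

int main ()
{
    srand (time (0));
    const int runs = 50;
    double total = 0.0;
    for (int i = 0; i < runs; i++)
        total += trainOnce (6.0); // target "code" t = 6
    // Step 10: the average iteration count is the measure of the net.
    cout << "mu_N is approximately " << total / runs << " iterations" << endl;
    return 0;
}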

§4.2 Finite Application Prīmus: Number Theory (Cryptography)

Problem 4.2.1 (Cryptography) To teach a neural net to decipher a single digit code.

Since this is a thesis in pure mathematics, we will not go through all of the design specifications for this cryptography neural net. Rather, we will state the steps of the algorithm so that we have a general feel for the program. The code, with detailed comments, and the output can be found in Appendix II, where all of the design specifications are spelled out. This is a 1-1-1 network like the one in Figure 4.1.2. The neural net is given a target value, which is our code. It then uses the backpropagation algorithm to learn this code. The number of iterations is the measure of the neural net. Of course, we run this program many times (50) in order to get an acceptable expected value.
1. From a random number table, we pick the target value.
2. The program randomly picks the initial input for the neural net.
3. Two weights, one for each layer of the neural net, are also randomly generated.
4. The outputs and sensitivities are produced.
5. If the error is within ±0.5, then the program stops. Otherwise it continues to the next step.
6. The weights and sensitivities are updated and the error is recalculated. The program then returns to step 5.

Solution 4.2.2 (Cryptography) After running the program 50 times, the expected value of this neural net is µ_N1 = 9.78 iterations. Thus, the measure of this neural net is 9.78. Of course, if the program is run more times, then the measure of the neural net will become more accurate. We have decided that 50 runs are sufficient for this thesis, but optimally one might want to run a program 100, 500, or even 1,000 times in order to be closer to the true measure of the neural net. Naturally this depends on how accurate one needs to be.

Further Work 4.2.3 This cryptography example is a simple one because the purpose of the application is to show how we can apply the theory, i.e. the application is not an end in itself. Yet this application lends itself to many paths for future work. Initially, one could choose to make the code have more digits, say 4 like a PIN, or be alphanumeric. Yet the neural net would quickly crack these codes as well. Thus, one may wish to train a neural net to learn more advanced cryptographies, e.g. a secret-key cryptography. Another avenue of research would be to use neural nets for public-key cryptography. My initial research concerning neural nets and cryptography led me to a group of neural cryptographers; their work can be found in [100].


§4.3 Finite Application Secundus: Statistics (Baseball)

Problem 4.3.1 (Baseball) To teach a neural network to predict the number of wins for the LA Dodgers for the 2005 season.

Baseball is the most statistical of all sports and, arguably, the most beautiful.9 The sport is so open for statistical investigation that there is even a term that refers to the statistics of baseball: sabermetrics. Since baseball is stochastic and neural nets are algorithms that are able to learn from data, baseball is a perfect place to train a neural net. That being said, there is a paucity of literature on neural nets and baseball. Regardless, we will make an initial, albeit small, contribution to using baseball with neural nets. This neural net is similar to the previous one, and as such we give the minimum necessary explanation. The program first gets data from an Excel file that has the Win-Loss records for the LA Dodgers from 1969-2004. We picked 1969 since that was the first year of the National League West, of which the Dodgers are a member. The neural net that we create is a 1-2-1, i.e. it has 1 input, 2 neurons in the first (logsig) layer, and 1 neuron in the second (purelin) layer, as can be seen in Figure 4.3.1.

Figure 4.3.1 An illustration of the 1-2-1 two-layer network.

9 [2], [94], and [103] are all recent books that explore the marriage of mathematics, statistics, probability, and the National Pastime.


The input is a randomly generated integer between 58 and 102 from half of the seasons (18) played in the given time period. This range, the training set, covers the least and the most wins that the Dodgers recorded during the 1969-2004 seasons. Fifty-eight comes from the shortened strike-year season, but it is still a good minimum value for the range. Also, a season is usually 162 games. The weights are randomly generated integers (recall that having weights from ℤ is an essential property of finite neural nets) between 0 and 9, with all biases set at .5. The program then will try to reach the target value. The target values were randomly taken from a quarter of the seasons between 1969-2004, which is 9 seasons. Lastly, the remaining quarter of seasons is for validation. It should be noted that the Wins were scrambled before they were broken into the three different sets.

Solution 4.3.3 (Baseball) The expected value of this neural net is µ_N2 = 25.6 iterations. Thus, the measure of this neural net is 25.6. The neural net predicts that the Dodgers will win 92.8, or 93, games in 2005. (The prediction is determined by averaging all of the target values whose iteration counts equal 25.)

Remark 4.3.4 In classical computation theory, we would not be able to say that the baseball neural net is computationally stronger than the cryptography neural net other than in a generic way, viz. the baseball application has two neurons in its first layer, whereas the cryptography application has only one per layer. Yet we can now say definitively and quantitatively, since each of the applications has a cardinal value, that the measure of the baseball example is greater than the measure of the cryptography example. In particular, 25.6 = µ_N2 > µ_N1 = 9.78. One may think this measurability process is not necessary, since these two neural nets can be compared qualitatively by the number of neurons in each layer. But what happens when you have two identical neural nets that differ only by one bias? The biases will not help one determine which neural net is stronger, but our theory of measure will.

Further Work 4.3.5 There are many ways that one could approach further research into

the fields of sabermetrics and neural networks. For example, there are various other factors, many of them more meaningful, that affect a team’s Win-Loss record than previous years’ records. To name a few examples, there are the ERAs of the starting pitchers, the number of day games, and the OBP of the lead-off runner.10 But neural nets can not only be used to model a particular team or even an individual player; they can also be used to model the mechanics of the game itself. One can look at the possible run situations during a half-inning.11 So, in particular, one can look at a half-inning of baseball as a Markov chain. There are a total of eight possible base situations, when one ignores outs, as can be seen in Table 4.3.2.

Table 4.3.2 Possible runner situations in a baseball game, ignoring outs.

Situation                 Notation
Bases empty               0-0-0
Runner on 1st             1-0-0
Runner on 2nd             0-1-0
Runner on 3rd             0-0-1
Runners on 1st & 2nd      1-1-0
Runners on 1st & 3rd      1-0-1
Runners on 2nd & 3rd      0-1-1
Bases loaded              1-1-1

10 ERA: Earned Run Average; OBP: On Base Percentage.
11 Case Study 9-1, “A Half-Inning of Baseball as a Markov Chain,” from [2] is the inspiration for this possible avenue of research.

When outs are taken into consideration, there are 25 possible run states. From here we could use a neural net to help us determine the best running strategies or the likelihood that a particular hitter is going to hit into a double-play. Once again considering the wealth of data that baseball has, we find it surprising that neural nets have not been used to exploit this data in any sort of meaningful way.
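As a starting point for such a study, here is a sketch of ours that merely enumerates that 25-state space (8 base situations × 3 out counts, plus the absorbing three-out state); a full model would attach a transition matrix estimated from play-by-play data.

#include <iostream>
using namespace std;

int main ()
{
    int state = 0;
    for (int outs = 0; outs < 3; outs++)
        for (int bases = 0; bases < 8; bases++) // bit 0 = 1st, bit 1 = 2nd, bit 2 = 3rd
        {
            cout << "state " << state++ << ": outs = " << outs
                 << ", bases = " << (bases & 1) << "-"
                 << ((bases >> 1) & 1) << "-"
                 << ((bases >> 2) & 1) << endl;
        }
    cout << "state 24: three outs (absorbing)" << endl;
    return 0;
}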

We have seen how one can use the McCulloch-Pitts measure, µ_N, to compare two neural nets quantitatively. Now since N ~ M ~ H in general, this means in particular that N1 ~ M1 ~ H1 and N2 ~ M2 ~ H2. By the results in the preceding two sections, we can state that the measure of the hidden Markov model (or finite automaton) of the first application is less than the measure of the hidden Markov model (or finite automaton) of the second application by precisely 15.82 iterations; or, the baseball neural net is approximately 2.6 times stronger than the cryptography neural net.


CHAPTER 5
CONCLUSION AND FURTHER RESEARCH

“The logic of the world which the propositions of logic show in tautologies, mathematics shows in equations.” —L. Wittgenstein, 1922.

§5.1 Where we are and where we can go

In some ways it is strange to speak of conclusions in mathematics since, as in science, a final conclusion is never reached. Yes, discoveries are made and models are verified, but it is always possible to make the system more exact. So, in one sense we can conclude that we have resolved the issue of how to measure neural nets with cardinal numbers. Yet, by the very fact of doing this, a number of questions spring to mind. For example, what is the appropriate number of times that a neural net program should be run in order to get sufficient data to produce an accurate expected value, µ_N? Another question: to what extent does the data, independent of the model, affect µ_N? There are many more questions of these types, but perhaps the most pressing one is how to measure an infinite neural net. We will now turn to this.

§5.2 Infinite Neural Networks and Future Work

In the first four chapters of this thesis we justified the statement that finite neural nets can be measured by expected value on discrete probability space. If we had time for another four chapters, we would prove that infinite neural nets can be measured by expected value on continuous probability space. Since we only have one chapter, we will merely sketch the steps in the justification of this statement. We will see,


however, that the justification is similar to the one for the finite case. Yet, as is often the state of affairs with the infinite case, the justification becomes more complex.

1. Set Theoretic-cum-Computational Beginnings: We alter the definition of a neural net so that they now have infinite memory. Next we define Turing machines, which are computationally more powerful than finite automata. In fact, one can view Turing machines as “infinite automata.” Lastly, we modify the definition of hidden Markov models in order to get labeled Markov models.

2. Equivalence of Infinite Models: Infinite neural nets can be shown to be equivalent to Turing machines. Turing machines, which are a type of automaton, are equivalent to labeled Markov models.

3. Markov Models & Measure on Infinite Measure Space: Labeled Markov models are valid on probability space and, moreover, labeled Markov models are valid on continuous probability space.

4. Thus, we can extend the definition of µ_N to the infinite case and use it to measure applications of infinite neural networks.

Thus the path to show that µ_N can measure infinite neural nets is similar to the path to show that µ_N can measure finite neural nets. In the next three sections we will sketch the proof in a little more detail. But it is still a sketch, and thus it does not show definitively that our conjecture is true.

A Sketch for Infinite Models

In this sketch we make a comment about what makes a neural network infinite. Next, we define Turing machines and compare them to finite automata. Then we state the theorems that show that infinite neural nets, Turing machines, and labeled Markov models are equivalent. We will conclude by making a comment about continuous probability and the McCulloch-Pitts measure for infinite neural nets. The following figure is akin to Figure 1.1.1.

Infinite Neural Nets ⇔ Turing Machines ⇔ Labeled Markov Models → Continuous Probability Space
(Key: x ⇔ y denotes that x is equivalent to y; x → y denotes that x is valid on y.)

Figure 5.2.1 The path showing that Infinite Neural Nets can be measured by Probability Space.

Remark 5.2.1 If we want to have an infinite neural network, then we need the weights to be rational. Finite neural nets are finite because their weights are in ℤ, whereas infinite neural nets are infinite because their weights are in the rationals, ℚ. (Naturally, neural nets with weights in ℝ are infinite as well, but since systems with infinitely precise constants cannot be built, we will neglect neural nets with real weights.)12 In particular, the neurons in neural nets with only integer weights can only assume two values, whereas the neurons in neural nets with rational weights can assume countably infinitely many values.

Remark 5.2.2 The following are the central differences between finite automata and Turing machines [110].
1. A Turing machine can both write to and read from its tape; i.e., a finite automaton is ROM.

12 That is not to say that these real-weighted neural nets do not have applications; they do. But those applications would take us too far afield.

2. The read-write head in the Turing machine can move both to the left and the right, whereas the finite automaton’s head can move in only one direction.
3. The Turing machine tape is infinite.
4. The special Turing states for rejecting and accepting take immediate effect.

Now we will define Turing machines.

Definition 5.2.3 A Turing machine is a 7-tuple, TM = (Q, Σ, Γ, δ, q0, q_accept, q_reject), where Q, Σ, Γ are all finite sets:
i. Q is the set of states,
ii. Σ is the input alphabet not containing the blank symbol ⊔,
iii. Γ is the tape alphabet, where ⊔ ∈ Γ and Σ ⊆ Γ,
iv. δ : Q × Γ → Q × Γ × {L, R} is the transition function,
v. q0 ∈ Q is the start state,
vi. q_accept ∈ Q is the accept state,
vii. q_reject ∈ Q is the reject state, where q_accept ≠ q_reject.
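To make the 7-tuple concrete, here is a minimal sketch of ours (not from [110]) of a deterministic Turing machine. The example machine merely scans right and accepts iff the input contains an even number of 1s; the language is trivial, but the sketch exercises every part of the definition: the transition function δ, the movable read-write head, the writable and extendable tape, and the immediately effective accept and reject states.

#include <iostream>
#include <map>
#include <string>
#include <vector>
using namespace std;

struct Move { int q; char write; int dir; }; // dir: -1 = L, +1 = R

// Small helper to build a transition-table entry.
Move mv (int q, char w, int d) { Move m; m.q = q; m.write = w; m.dir = d; return m; }

int main ()
{
    // Q = {0, 1, accept = 2, reject = 3}; Sigma = {0, 1}; blank = '_'.
    map< pair<int, char>, Move > delta;
    delta[make_pair(0, '0')] = mv(0, '0', +1); // even number of 1s so far
    delta[make_pair(0, '1')] = mv(1, '1', +1);
    delta[make_pair(1, '0')] = mv(1, '0', +1); // odd number of 1s so far
    delta[make_pair(1, '1')] = mv(0, '1', +1);
    delta[make_pair(0, '_')] = mv(2, '_', +1); // end of input: accept
    delta[make_pair(1, '_')] = mv(3, '_', +1); // end of input: reject

    string input = "10110"; // three 1s, so this machine rejects
    vector<char> tape (input.begin(), input.end());
    tape.push_back ('_');   // the rest of the (one-way infinite) tape
    int q = 0, head = 0;    // start state q0 and head position
    while (q != 2 && q != 3) // accepting and rejecting take immediate effect
    {
        Move m = delta[make_pair(q, tape[head])];
        tape[head] = m.write; // a Turing machine can write, unlike an automaton
        head += m.dir;
        if (head < 0) head = 0;
        if (head >= (int) tape.size()) tape.push_back ('_');
        q = m.q;
    }
    cout << (q == 2 ? "accept" : "reject") << endl;
    return 0;
}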

We will now list a few theorems and where they can be found.

Theorem 5.2.4 Infinite neural nets are equivalent to Turing machines [109, Ch. 3].

Theorem 5.2.5 Turing machines are equivalent to labeled Markov models [19, p. 115-118]; [68, p. 68ff.].

Since both labeled and hidden Markov models have the Markov property, it follows that they are both valid on probability space. But since labeled Markov models are infinite, we would want to show that they exist on the continuous probability space.

After that, we would re-define Expected Value and the McCulloch-Pitts measure in order to accommodate the infinite case. Once we were there, we could happily begin applying the extended definition of µ_N.

Remark 5.2.6 In a recent (October 2004) Notices of the American Mathematical Society there is an article, “Computing over the Reals: Where Turing Meets Newton” [16]. This article demonstrates that Turing machines can be viewed as a commutative ring or field (possibly ordered) with unit. This means that once we show that infinite neural nets are equivalent to Turing machines, instead of proving Theorem 5.2.5 we can instead prove the following theorem.

Theorem 5.2.6 (Blum, Shub, Smale, 1989) Turing machines are equivalent to an arbitrary ring or field R (possibly ordered) with commutativity and identity. Further, (a) if R = ℤ₂ = ({0,1}, +, −, ·), the Turing theory of computation is valid, and (b) if R = ℝ, Newton’s Algorithm is valid.

Recall 5.2.7 [56, p. 42ff.] A ring in modern algebra is a nonempty set R equipped with two operations, (+, ·), that satisfy the following axioms ∀ a, b, c ∈ R:
1. a ∈ R ∧ b ∈ R → a + b ∈ R (closure for +).
2. a + (b + c) = (a + b) + c (associative +).
3. a + b = b + a (commutative +).
4. ∃ 0_R ∈ R such that a + 0_R = a = 0_R + a, ∀ a ∈ R (additive identity or zero element).
5. ∀ a ∈ R, the equation a + x = 0_R has a solution in R.
6. a ∈ R ∧ b ∈ R → a·b ∈ R (closure for ·); we will write ab for a·b.
7. a(bc) = (ab)c (associative ·).
8. a(b + c) = ab + ac and (a + b)c = ac + bc (distributive laws).
Further,
9. A commutative ring is a ring R such that ab = ba, ∀ a, b ∈ R (commutative ·).
10. A ring with identity is a ring R such that ∃ 1_R ∈ R with a1_R = a = 1_R a, ∀ a ∈ R (· identity).
11. An integral domain is a ring that satisfies axioms 1-10 where 1_R ≠ 0_R and (∀ a, b ∈ R)(ab = 0_R → a = 0_R ∨ b = 0_R). Example: ℤ.
12. A field is a ring that satisfies axioms 1-11 where 1_R ≠ 0_R and ∀ a ≠ 0_R, the equation ax = 1_R has a solution in R. Examples: ℤ_p, ℚ, ℝ, and ℂ, all with the usual addition and multiplication, where p is a prime.

Once we prove Theorem 5.2.6, then it follows from (a) and (b) that R is valid on Lebesgue measure space. From this point, we would continue as we did after Theorem 5.2.5, viz. we would show that probability space is part of measure theory and that µ_N can measure infinite neural nets.

This discussion leads to other thoughts about future work. The first thought is to complete the sketch that was just given. In addition to filling in the argument, we could explore a few applications as well. Then once we measured these applications, we could compare them to the results that we received in Chapter 4. For the finite case we only had weights in ℤ, and for the infinite case just discussed the weights are in ℚ. As was mentioned, one can also have weights in ℝ; thus a second idea for future work is that we could show that µ_N can measure infinite neural nets with neurons that can take on uncountably many different values. A third, and perhaps the most exciting, thought for future work is to connect the measurability of neural networks (and their equivalent models) to the computational notion of complexity, both with respect to time and space.

§5.3 Concluding Thoughts

The stated thesis of this work is to show that the sensory order can be measured, and in particular, neural nets and equivalent models can be measured using measure theory. Yet, there is an implicit thesis as well. This implicit thesis is: the power and flexibility of mathematics can be seen by how the mathematician can move from one subdiscipline of mathematics to another in order to prove theorems. And further, mathematical logic is not just connected to analysis through topology or algebra, but it is also connected through probability.


REFERENCES

Note: Many of these works are not directly referenced in the text, but we have taken this opportunity to give an overview of a delectable sampling of the essential works on the history, science, and philosophy of neural networks (and related topics, e.g., the mind), mathematical logic, and the theory of computation. That being said, all of these works have influenced us as we researched the topic of this thesis. Notation: The works that are directly referenced in the text are denoted with a ‘*’, e.g. [1]*.

[1]* Aiserman, Mark A., Gusev, Leonid A., Rozonoer, Lev I., Smirnova, Irina M., & Tal’, Aleksey A. (1971), Logic, Automata, and Algorithms. New York: Academic Press.

[2]* Albert, Jim. (2003), Teaching Statistics Using Baseball. Washington, D.C.: MAA.

[3] Almog, Joseph. (2002), What am I? Descartes and the Mind-Body Problem. Oxford: OUP.

[4] Anderson, James A. & Rosenfield, Edward, ed. (1988), Neurocomputing: Foundations of Research. Cambridge, MA: MIT Press.

[5]* Anton, Howard & Rorres, Chris. (1994), Elementary Linear Algebra: Applications Version (7th ed.). New York: John Wiley & Sons.

[6] Arbib, Michael A., ed. (2002), The Handbook of Brain Theory and Neural Networks (2nd ed.). Cambridge, MA: MIT Press.

[7]* Ash, Robert B. (1972), Real Analysis and Probability. New York: Academic Press.

[8] Aquinas, St Thomas. (c.1274/1981), Summa Theologica, trans. Fathers of the English Dominican Province. Notre Dame: Ave Maria.

[9] Levy, Azriel. (2002), Basic Set Theory. Mineola, NY: Dover.

[10] Baars, Bernard J. (1997), In the Theater of Consciousness: The Workspace of the Mind. Oxford: OUP.

[11]* Bass, Richard F. (1998), Probability Theory. www.math.uconn.edu/~bass/prob.pdf 1/27/2005.

[12] Benacerraf, Paul & Putnam, Hilary, ed. (1983), Philosophy of Mathematics (2nd ed.). Cambridge: CUP.

[13] Berkeley, George. (1975), Philosophical Works, ed. M.R. Ayers. London: Everyman.

[14] Bernays, Paul. (1968/1991), Axiomatic Set Theory. Mineola, NY: Dover.

[15]* Billingsley, Patrick. (1986), Probability and Measure (2nd ed.). New York: John Wiley & Sons.

[16]* Blum, Lenore. (2004), “Computing over the Reals: Where Turing Meets Newton,” Notices of the American Mathematical Society, October 2004, Vol. 51, No. 9, p. 1024-1034.

[17]* Bollobás, Béla. (1998), Modern Graph Theory. New York: Springer-Verlag.

[18] Boole, George. (1854/1958), The Laws of Thought. Mineola, NY: Dover.

[19]* Brainerd, Walter S. & Landweber, Lawrence H. (1974), Theory of Computation. New York: John Wiley & Sons.

[20] Brentano, Franz. (1874/1995), Psychology from an Empirical Standpoint (2nd ed.), trans. Linda L. McAlister. New York: Routledge.

[21] Carnap, Rudolph. (1928/1967, 2003), The Logical Structure of the World and Pseudoproblems in Philosophy (2nd ed.), trans. Rolf A. George. Chicago: Open Court.

[22]* — (1934/1937), The Logical Syntax of Language. New York: Harcourt Brace.

[23] Chalmers, David J. (1996), The Conscious Mind: In Search of a Fundamental Theory. Oxford: OUP.

[24]* Churchland, Paul M. (1989), A Neurocomputational Perspective: The Nature of Mind and the Structure of Science. Cambridge, MA: MIT Press.

[25] — (1995), The Engine of Reason, the Seat of the Soul: A Philosophical Journey into the Brain. Cambridge, MA: MIT Press.

[26]* Dean, Thomas, Leach, Sonia, & Shatkay, Hagit. Finite State Dynamical Systems, from “Learning Dynamical Systems: A Tutorial.” http://www.cs.brown.edu/research/ai/dynamics/tutorial/Documents/FiniteAutomata.html 1/11/2005.

[27] Dedekind, Richard. (1901/1963), Essays on the Theory of Numbers. Mineola, NY: Dover.

[28] Deitel, H.M. & Deitel, P.J. C++: How to Program (3rd ed.). Upper Saddle River, NJ: Prentice Hall.

[29] Descartes, René. (1641/1990), Meditationes de prima philosophia / Meditations on the First Philosophy: A Bilingual Edition, ed. George Heffernan. Notre Dame: UNDP.

[30]* Durrett, Rick. (1999), Essentials of Stochastic Processes. New York: Springer-Verlag.

[31]* Einstein, Albert. (1916/1961), Relativity: The Special and the General Theory. New York: Three Rivers Press.

[32]* — (1949/1970), Albert Einstein: Philosopher-Scientist, ed. Paul Arthur Schilpp. Chicago: Open Court.

[33]* Enderton, Herbert B. (2001), A Mathematical Introduction to Logic (2nd ed.). San Diego: Harcourt Academic Press.

[34] Feynman, Richard P. (1996), Lectures on Computation, ed. Tony Hey & Robin W. Allen. New York: Westview.

[35] Flanagan, Owen. (1991), The Science of the Mind (2nd ed.). Cambridge, MA: MIT Press.

[36]* Frege, Gottlob. (1879/1967), “Begriffsschrift, a formula language, modeled upon that of arithmetic, for pure thought,” [116], pg. 8-82.

[37]* — (1884/1980), The Foundations of Arithmetic. Evanston, IL: Northwestern University Press.

[38]* — (1997), The Frege Reader, ed. Michael Beaney. Malden, MA: Blackwell.

[39] Gallian, Joseph A. (2002), Contemporary Abstract Algebra (5th ed.). Boston: Houghton Mifflin.

[40] Garfield, Jay L., ed. (1990), Foundations of Cognitive Science: The Essential Readings. New York: Paragon House.

[41]* Gillies, Donald. (2000), Philosophical Theories of Probability. New York: Routledge.

[42] Goble, Lou, ed. (2001), The Blackwell Guide to Philosophical Logic. Oxford: Blackwell.

[43]* Gödel, Kurt. (1930/1967), The Completeness of the Axioms of the Functional Calculus of Logic, [116], pg. 583-591.

[44]* — (1931/1967), On Formally Undecidable Propositions of Principia Mathematica and Related Systems I, [116], pg. 596-616.

[45]* Hacking, Ian. (1975), The Emergence of Probability. Cambridge: CUP.

[46]* Hagan, Martin T., Demuth, Howard B., & Beale, Mark. (1996), Neural Network Design. Boston: Thomson Learning.

[47]* Hanselman, Duane & Littlefield, Bruce. (2001), Mastering MATLAB 6: A Comprehensive Tutorial and Reference. Upper Saddle River, NJ: Prentice Hall.

[48]* Harary, F. (1994), Graph Theory. Reading, MA: Addison-Wesley.

[49] Hart, W.D., ed. (1997), The Philosophy of Mathematics. Oxford: OUP.

[50]* Hayek, F.A. (1952/1976), The Sensory Order: An Inquiry into the Foundations of Theoretical Psychology. Chicago: UCP.

[51] Hofstadter, Douglas R. (1979), Gödel, Escher, Bach: An Eternal Golden Braid. New York: Basic Books.

[52]* Hopcroft, John E. & Ullman, Jeffrey D. (1979), Introduction to Automata Theory, Languages and Computation (1st ed.). Menlo Park, CA: Addison-Wesley.

[53] Hopfield, J.J. (1982/1988), Neural Networks and Physical Systems with Emergent Collective Computational Abilities, [4], pg. 460-464.

[54] Hume, David. (1739-40/1978), A Treatise of Human Nature (2nd ed.), ed. L.A. Selby-Bigge & P.H. Nidditch. Oxford: Clarendon Press.

[55] — (1748, 1751, 1777/1975), Enquiries Concerning Human Understanding and Concerning the Principles of Morals (3rd ed.), ed. L.A. Selby-Bigge & P.H. Nidditch. Oxford: Clarendon Press.

[56]* Hungerford, Thomas W. (1990), Abstract Algebra: An Introduction (2nd ed.). San Diego, CA: Saunders College Publishing.

[57] Jacquette, Dale, ed. (2002), Philosophy of Logic: An Anthology. Oxford: Blackwell.

[58] — (2002), Philosophy of Mathematics: An Anthology. Oxford: Blackwell.

[59]* Kalish, Donald & Montague, Richard. (1980), Logic: Techniques of Formal Reasoning (2nd ed.). San Diego: Harcourt Brace Jovanovich.

[60]* Kant, Immanuel. (1781, 1787/1998), Critique of Pure Reason, ed. Paul Guyer & Allan W. Wood.

[61] Kenny, Anthony. (1993), Aquinas on the Mind. New York: Routledge.

[62]* Kleene, S.C. (1956), Representation of Events in Nerve Nets and Finite Automata, [107], pg. 3-41.

[63]* — (1967/2002), Mathematical Logic. Mineola, NY: Dover.

[64] Kneale, William & Kneale, Martha. (1962), The Development of Logic. Oxford: Clarendon Press.

[65]* Kolmogorov, A.N. & Fomin, S.V. (1975), Introductory Real Analysis, trans. Richard A. Silverman. Mineola, NY: Dover.

[66] Kripke, Saul. (1980), Naming and Necessity. Cambridge, MA: HUP.

[67] — (1982), Wittgenstein on Rules and Private Language. Cambridge, MA: HUP.

[68]* Kurki-Suonio, Reino. (1971), A Programmer’s Introduction to Computability and Formal Languages. Princeton, NJ: Auerbach Publishers.

[69] Lakatos, Imre. (1976), Proofs and Refutations: The Logic of Mathematical Discovery, ed. John Worrall & Elie Zahar. Cambridge: CUP.

[70]* Lin, C.C. & Segel, L.A. (1988), Mathematics Applied to Deterministic Problems in the Natural Sciences. Philadelphia: SIAM.

[71] Lycan, William G., ed. (1990), Mind and Cognition: A Reader. Cambridge, MA: Blackwell.

[72] Locke, John. (1689/1975), An Essay Concerning Human Understanding, ed. Peter H. Nidditch. Oxford: OUP.

[73] Masters, Timothy. (1993), Practical Neural Networks in C++. San Diego, CA: Morgan Kaufmann.

[74] Maritain, Jacques. (1932/1995), Distinguish to Unite or The Degrees of Knowledge (4th ed.), trans. Gerald B. Phelan. Notre Dame: UNDP.

[75]* McCulloch, W. & Pitts, W. (1943/1988), A Logical Calculus of the Ideas Immanent in Nervous Activity, [4], pg. 18-27.

[76]* Minsky, M.L. (1956), Some Universal Elements for Finite Automata, [107], pg. 117-128.

[77]* — (1967), Computation: Finite and Infinite Machines. Englewood Cliffs, NJ: Prentice-Hall.

[78] Moore, E.F. (1956), Gedanken-Experiments on Sequential Machines, [107], pg. 129-153.

[79]* Moschovakis, Yiannis N. (1994), Notes on Set Theory. New York: Springer-Verlag.

[80] Nagel, Ernest & Newman, James R. Gödel’s Proof, rev. ed., ed. Douglas Hofstadter. New York: NYU Press.

[81]* Neapolitan, Richard & Naimipour, Kumarss. (2004), Foundations of Algorithms: Using C++ Pseudocode (3rd ed.). Boston: Jones and Bartlett.

[82]* Newton, Isaac. (1687/1999), Principia: Mathematical Principles of Natural Philosophy, trans. I. Bernard Cohen & Anne Whitman. Berkeley: UC Press.

[83]* — (1995), Newton, ed. I. Bernard Cohen & Richard S. Westfall. New York: Norton Critical Editions.

[84] Nievergelt, Yves. (2002), Foundations of Logic and Mathematics: Applications to Computer Science and Cryptography. Boston: Birkhäuser.

[85] Popper, Karl. (1959/2002), The Logic of Scientific Discovery. New York: Routledge.

[86] — (1963/2002), Conjectures and Refutations. New York: Routledge.

[87] Penrose, Roger. (1989), The Emperor’s New Mind. Oxford: OUP.

[88] Quine, Willard Van Orman. (1969), Set Theory and its Logic, rev. ed. Cambridge, MA: Harvard.

[89] — (1982), Methods of Logic (4th ed.). Cambridge, MA: Harvard.

[90]* Ramaswamy, Srinivasan. “Applications of Functions: Finite State Automata.” http://www.csc.tntech.edu/~srini/DM/chapters/review4.2.html 1/9/2005.

[91]* Ramsey, F.P. (1926/1990), Philosophical Papers, ed. D.H. Mellor. Cambridge: CUP.

[92] Rogers, Joey. (1997), Object-Oriented Neural Networks in C++. San Diego, CA: Academic Press.

[93] Rosenblatt, F. (1958/1988), The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain, [4], pg. 92-114.

[94]* Ross, Ken. (2004), A Mathematician at the Ballpark: Odds and Probabilities for Baseball Fans. New York: Pearson Education.

[95]* Royden, H.L. (1988), Real Analysis (3rd ed.). Englewood Cliffs, NJ: Prentice-Hall.

[96]* Rudin, Walter. (1976), Principles of Mathematical Analysis (3rd ed.). Singapore: McGraw Hill.

[97]* — (1987), Real and Complex Analysis (3rd ed.). Singapore: McGraw Hill.

[98] Russell, Bertrand. (1903/1996), The Principles of Mathematics (2nd ed.). New York: Routledge.

[99] — (1920/1993), Introduction to Mathematical Philosophy (2nd ed.). Mineola, NY: Dover.

[100]* Ruttor, Andreas. “Neural Cryptography.” http://theorie.physik.uni-wuerzburg.de/~ruttor/neurocrypt.html 1/29/2005.

[101] Ryle, Gilbert. (1949), The Concept of Mind. Chicago: UCP.

[102] Savitch, Walter. (2002), Absolute C++ (1st ed.). San Francisco: Addison Wesley.

[103]* Schwartz, Alan. (2004), The Numbers Game: Baseball’s Lifelong Fascination with Statistics. New York: Thomas Dunne Books.

[104] Searle, John R. (1992), The Rediscovery of the Mind. Cambridge, MA: MIT Press.

[105] — (1997), The Mystery of Consciousness. New York: NYRB.

[106] — (2004), Mind: A Brief Introduction. Oxford: OUP.

[107] Shannon, C.E. & McCarthy, J. (1956), Automata Studies, Annals of Mathematics Studies, No. 34. Princeton: Princeton University Press.

[108]* Shoenfield, Joseph R. (1967), Mathematical Logic. Natick, MA: ASL.

[109]* Siegelmann, Hava T. (1999), Neural Networks and Analog Computation: Beyond the Turing Limit. Boston: Birkhäuser.

[110]* Sipser, Michael. (1997), Introduction to the Theory of Computation. San Francisco: PWS Publishing Co.

[111] Sports Reference. “Baseball Reference: Los Angeles Dodgers.” http://www.baseball-reference.com/teams/LAD/ 1/31/2005.

[112]* Strogatz, Steven. (1994), Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering. Cambridge, MA: Westview Press.

[113]* — (2003), Sync: The Emerging Science of Spontaneous Order. New York: Athena.

[114]* Tarski, Alfred. (1936/1956), Logic, Semantics, and Metamathematics: Papers from 1923 to 1938, trans. J.H. Woodger. Oxford: Clarendon Press.

[115] Tymoczko, Thomas, ed. (1998), New Directions in the Philosophy of Mathematics, rev. ed. Princeton: PUP.

[116] Van Heijenoort, Jean. (1967), From Frege to Gödel: A Source Book in Mathematical Logic. Cambridge, MA: HUP.

[117]* Von Neumann, J. (1958), The Computer and the Brain. New Haven: Yale.

[118]* Walrand, Jean. “EECS 126 - Probability and Random Processes.” http://robotics.eecs.berkeley.edu/~wlr/126/w12.htm 1/30/2005.

[119] Weyl, Hermann. (1918/1994), The Continuum: A Critical Examination of the Foundation of Analysis, trans. Stephen Pollard & Thomas Bole. Mineola, NY: Dover.

[120] Whitehead, A.N. & Russell, B. (1910/1997), Principia Mathematica to *56. Cambridge: CUP.

[119] Whitehead, A.N. (1911/1948), An Introduction to Mathematics. Oxford: OUP.

[121]* Weisstein, Eric W. “Graph.” From MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com/Graph.html 1/9/2005.

[122]* —, et al. “State Diagram.” From MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com/StateDiagram.html 1/9/2005.

[123]* — “Cyclic Graph.” From MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com/CyclicGraph.html 1/10/2005.

[124]* Wittgenstein, Ludwig. (1922/1981), Tractatus Logico-Philosophicus, trans. C.K. Ogden. New York: Routledge.

[125]* — (1958), Philosophical Investigations (2nd ed.), trans. G.E.M. Anscombe. Boston: Blackwell.

[126] — (1975), Wittgenstein’s Lectures on the Foundations of Mathematics, Cambridge 1939, ed. Cora Diamond. Chicago: UCP.

[127] — (1978), Remarks on the Foundation of Mathematics, rev. ed., ed. G.H. von Wright, R. Rhees, & G.E.M. Anscombe. Cambridge, MA: MIT Press.

[128] Yan, Song Y. (2002), Number Theory for Computing (2nd ed.). New York: Springer-Verlag.

APPENDIX I
A SHORT PHILOSOPHICAL NOTE ON MODELS AND STATE DIAGRAMS

F. A. Hayek’s The Sensory Order: An Inquiry into the Foundations of Theoretical Psychology is one of the founding texts of connectionism [50]. The title of this thesis was inspired by the title of that work. In The Sensory Order, Hayek describes how his neural nets (to use an anachronistic term for him) can be viewed as models or maps, where his notion of “models” is equivalent to our notion of “mathematical model” and his notion of “map” is equivalent to our notion of a “state diagram” (Ibid., §5.33-5.49). Thus, Hayek offers a philosophical justification for using mathematical models and state diagrams as a way to describe neural nets. Also, this thesis connects state diagrams to mathematical logic, following in the logical tradition of McCulloch and Pitts [75], S.C. Kleene [62], [63], M. L. Minsky [76], [77], and J. von Neumann [117], among others. This was discussed in §0.1. Hayek, on the other hand, connects the models to the actual physical or mental objects [50, §5.77-5.91]. Thus, we can see that Hayek’s concern was philosophical-cum-psychological while our concern is mathematical-cum-computational. Albeit Hayek is not in the logical tradition, The Sensory Order is modeled after his cousin L. Wittgenstein’s Tractatus Logico-Philosophicus [124], which is one of the founding documents of the logical tradition that began in the late 19th century with the works of G. Frege [36, 37, 38]. It should come as no surprise, then, that Wittgenstein’s Tractatus has a similar structure to both the logical and sensory traditions, since both traditions looked to him for inspiration. The following table is a glossary of the logical


(McCulloch, Pitts, Kleene, Minsky, von Neumann) and sensory (Hayek) traditions of neural nets with reference to Wittgenstein’s Tractatus.

Table A.1 Glossary of the terminology used by the logical tradition of neural net research as compared to the Hayekian terminology, with an inclusion of Wittgenstein’s philosophical structure in his Tractatus.

School of Thought   Theoretical Framework   Descriptive Analog   Thing being Investigated
Logical             Mathematical Model      State Diagram        Mathematical Logic (Theoretical Computers)
Sensory             Biological Model        Map                  Physical/Mental Objects
Wittgenstein        Propositions            Pictures of Facts    Reality (The totality of facts)

APPENDIX II
SOURCE CODE, SAMPLE OUTPUT, AND DATA WITH TECHNICAL REMARKS

We have two applications in Chapter 4. The applications are: Finite Application Prīmus: Number Theory (Cryptography) and Finite Application Secundus: Statistics (Baseball). We give the source code, sample output, and relevant data with technical remarks as needed.

A.2.1 Finite Application Prīmus: Number Theory (Cryptography), §4.2

The source code is written in C++ using standard libraries and with extensive comments.

Source Code

/* Neural Networks and Cryptography. This program will allow us to teach a neural
   network to decipher a key. Brock Gibson of The Boeing Company assisted me with
   function update1() and function update2() as well as reviewing all of the code. */

#include <iostream>
#include <cmath>
#include <cstdlib>
#include <ctime>

using namespace std;

// Function declarations (for the functions that use randomly generated weights).
int targetCheck (void);     // Error-checking function: the user must input an integer between 0 and 9.
int input (void);           // Returns the input, 'p'. This is the network's datum.
int weight (void);          // Returns the weight, 'w'.
float output01 (void);      // Returns a1 (transfer function is logsig).
float output02 (void);      // Returns a2 (transfer function is purelin).
float error (void);         // Returns the error, e.
float sensitivity02 (void); // Returns the sensitivity for layer 2, s2.
float sensitivity01 (void); // Returns the sensitivity for layer 1, s1.
float update02 (void);      // Returns the updated weight for layer 2.
float update01 (void);      // Returns the updated weight for layer 1.

// Function declarations (for the functions that use updated weights).
int input1 (void);          // Returns the input, 'p'. This is the network's datum.
float output1 (void);       // Returns a1 (transfer function is logsig).
float output2 (void);       // Returns a2 (transfer function is purelin).
float err (void);           // Returns the error, e.
float sensitivity2 (void);  // Returns the sensitivity for layer 2, s2.
float sensitivity1 (void);  // Returns the sensitivity for layer 1, s1.
float update2 (void);       // Returns the updated weight for layer 2.
float update1 (void);       // Returns the updated weight for layer 1.

// Global variables
// Variables for randomly generated weights
int t01 = 0, w01 = 0, w02 = 0, p01 = 0;
double alpha = .1;
float a01 = 0, a02 = 0, e01 = 0, s02 = 0, s01 = 0, w02Update = 0, w01Update = 0,
      w2UpdatePrevious = 0, w1UpdatePrevious = 0;

// Variables for updated weights
int t = 0, p1 = 0;
double alpha1 = .1;
float w1 = 0, w2 = 0, w2Update = 0, w1Update = 0, a1 = 0, a2 = 0, s2 = 0, s1 = 0;

// Counter for iterations before successful completion
int counter = 1;

// The main program. It contains only I/O, function calls, and flow of control mechanisms.
int main()
{
    // Initial output and input to get the user's code.
    cout << "Welcome to the Neural Cryptography Program for MAT 580.\n";
    cout << "The purpose of this program is to teach the network to decipher a code.\n";
    cout << "Please input the first digit of your code, i.e.\n";
    cout << "the target value for the network.\n";
    cout << "It will be an integer between 0 and 9.\n";
    cout << "t = \n";
    cin >> t01;
    cout << endl;

    targetCheck (); // Function call for the error-checking function.

    // The output statements and function call for the network's initial input.
    cout << "Now the program will state an integer between 0 and 9.\n";
    cout << "The integer is the initial input for the network.\n";
    srand (time (0));
    input (); // Function call for the input, p.
    cout << endl;

    // The randomly generated weights.
    cout << "Now the program will state the randomly generated weights.\n";
    srand (time (0));
    weight (); // Function call for the weight, w.
    cout << endl;

    /* Now the program will perform the various calculations.
       It will do this by calling the relevant functions. */
    output01 (); // Function call for a1.
    output02 (); // Function call for a2.
    cout << endl;

    error (); // Function call for e01.
    cout << endl;

    sensitivity02 (); // Function call for s2.
    sensitivity01 (); // Function call for s1.
    cout << endl;

    cout << "The network will now update the weights.\n";
    cout << "We will use the learning rate of alpha = 0.10.\n";
    cout << endl;

    update02 (); // Function call for w02Update.
    update01 (); // Function call for w01Update.
    cout << endl;

    // If-then branch to see if the error is acceptable, viz. |e01| <= .5.
    if ((e01 >= 0 && e01 <= .5) || (e01 >= -.5 && e01 < 0))
    {
        cout << "The program has learned the code\n";
        return 0;
    }
    else
    {
        // The do-while lets the iterations continue while e01 is not acceptable.
        do
        {
            /* cout << "Enter your same code number.\n";
               cin >> t;
               cout << endl; */
            t = t01;

            srand (time (0));
            input1 ();

            /* cout << "Enter your update w2 from last time\n";
               cin >> w2;
               cout << "Enter your update w1 from last time\n";
               cin >> w1; */
            w2 = w2UpdatePrevious;
            w1 = w1UpdatePrevious;
            cout << endl;

            output1 (); // Function call for a1.
            output2 (); // Function call for a2.
            cout << endl;

            err (); // Function call for e.
            cout << endl;

            sensitivity2 (); // Function call for s2.
            sensitivity1 (); // Function call for s1.
            cout << endl;

            cout << "The network will now update the weights.\n";
            cout << "We will use the learning rate of alpha = 0.10.\n";
            cout << endl;

            update2 (); // Function call for w2Update.
            update1 (); // Function call for w1Update.

            // Increment the iteration counter.
            counter++;
        } while ((e01 >= .5) || (e01 <= -.5));
    }

    cout << "The program has learned the code after " << counter << " iterations\n";
    cout << "Good bye Cal Poly\n";

    return 0;
}

//These functions handle the user's target and the randomly generated input and weights.

/*An if-then branching function to check for errors, e.g. the user
inputs a letter or a two-digit number. (Doesn't catch all errors.)*/
int targetCheck (void) {
    if (t01 >= 0 && t01 <= 9) {
        cout << "Thank you.\n";
        cout << endl;
    }
    else {
        cout << "Please input an integer between 0 and 9.\n";
        cin >> t01;
        targetCheck ();
    }

    return 0;
}

//The function for getting 'p'.
int input (void) {
    p01 = rand() % 9; //NB: rand() % 9 yields an integer 0-8.
    cout << "The initial input,\n";
    cout << "p = " << p01 << endl;

    return 0;
}

//The function for returning 'w'.
int weight (void) {
    w01 = 1 + rand() % 9; //w01 is the weight for the first layer (1-9).
    w02 = 1 + rand() % 9; //w02 is the weight for the second layer (1-9).
    cout << "The weight for layer 1,\n";
    cout << "w1 = " << w01 << endl;
    cout << "The weight for layer 2,\n";
    cout << "w2 = " << w02 << endl;

    return 0;
}

//NB: Both p and w are randomly generated.

//Below are all the functions for the backpropagation calculations.
//The function for calculating the output for the first layer (logsig).
float output01 (void) {
    float n01 = 0, logsig = 0;
    n01 = (w01 * p01);
    cout << "The n value for layer 1,\n";
    cout << "n1 = " << n01 << endl;
    logsig = 1/(1 + exp(-n01)); //logsig
    a01 = logsig;
    cout << "The output for the first layer,\n";
    cout << "a1 = " << a01 << endl;

    return 0;
}

//The function for calculating the output for the second layer (purelin).
float output02 (void) {
    a02 = (w02 * a01); //purelin
    cout << "The output for the second layer,\n";
    cout << "a2 = " << a02 << endl;

    return 0;
}

//The function for the error.
float error (void) {
    e01 = t01 - a02;
    cout << "The error,\n";
    cout << "e = " << e01 << endl;

    return 0;
}

//The function for the sensitivity of the second layer, s2.
float sensitivity02 (void) {
    int F02 = 1; //The derivative of purelin (layer 2) is always 1.
    s02 = -2 * F02 * e01;
    cout << "The sensitivity for the second layer,\n";
    cout << "s2 = " << s02 << endl;

    return 0;
}

//The function for the sensitivity of the first layer, s1.
float sensitivity01 (void) {
    float F01 = (1-a01)*a01; //The derivative of logsig (layer 1) is (1-a1)(a1).
    s01 = F01 * w02 * s02;
    cout << "The sensitivity for the first layer,\n";
    cout << "s1 = " << s01 << endl;

    return 0;
}

//The function for updating the second weight.
float update02 (void) {
    w02Update = w02 - (alpha * s02 * a01);
    cout << "The updated weight for the second layer,\n";
    cout << "w2_update = " << w02Update << endl;
    w2UpdatePrevious = w02Update;

    return 0;
}

//The function for updating the first weight.
float update01 (void) {
    w01Update = w01 - (alpha * s01 * p01);
    cout << "The updated weight for the first layer,\n";
    cout << "w1_update = " << w01Update << endl;
    w1UpdatePrevious = w01Update;

    return 0;
}

/* ********************* */
//These functions perform the iterations with the updated weights.

//The function for the randomly generated input.
int input1 (void) {
    p1 = rand() % 9; //NB: as above, this yields an integer 0-8.
    cout << "The initial input,\n";
    cout << "p = " << p1 << endl;

    return 0;
}

//The function for calculating the output for the first layer (logsig).
float output1 (void) {
    float n1 = 0, logsig1 = 0;
    n1 = (w1 * p1);
    cout << "The n value for layer 1,\n";
    cout << "n1 = " << n1 << endl;
    logsig1 = 1/(1 + exp(-n1)); //logsig
    a1 = logsig1;
    cout << "The output for the first layer,\n";
    cout << "a1 = " << a1 << endl;

    return 0;
}

//The function for calculating the output for the second layer (purelin).
float output2 (void) {
    a2 = (w2 * a1); //purelin
    cout << "The output for the second layer,\n";
    cout << "a2 = " << a2 << endl;

    return 0;
}

//The function for the error.
float err (void) {
    e01 = t - a2;
    cout << "The error,\n";
    cout << "e = " << e01 << endl;

    return 0;
}

//The function for the sensitivity of the second layer, s2.
float sensitivity2 (void) {
    int F2 = 1; //The derivative of purelin (layer 2) is always 1.
    s2 = -2 * F2 * e01;
    cout << "The sensitivity for the second layer,\n";
    cout << "s2 = " << s2 << endl;

    return 0;
}

//The function for the sensitivity of the first layer, s1.
float sensitivity1 (void) {
    float F1 = (1-a1)*a1; //The derivative of logsig (layer 1) is (1-a1)(a1).
    s1 = F1 * w2 * s2;
    cout << "The sensitivity for the first layer,\n";
    cout << "s1 = " << s1 << endl;

    return 0;
}

//The function for updating the second weight.
float update2 (void) {
    w2Update = w2 - (alpha1 * s2 * a1);
    cout << "The updated weight for the second layer,\n";
    cout << "w2_update = " << w2Update << endl;
    w2UpdatePrevious = w2Update;

    return 0;
}

//The function for updating the first weight.
float update1 (void) {
    w1Update = w1 - (alpha1 * s1 * p1);
    cout << "The updated weight for the first layer,\n";
    cout << "w1_update = " << w1Update << endl;
    w1UpdatePrevious = w1Update;

    return 0;
}
//End of the program's source code.
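Before the sample run, it may help to see the arithmetic of a single iteration in isolation. The following self-contained sketch is an editorial addition, not part of the thesis program: it hard-codes the values t = 6, p = 2, w1 = 3, w2 = 2 from the first iteration of the sample run below and reproduces the printed quantities with the same logsig/purelin pair and alpha = 0.10.

    //A minimal standalone cross-check of one backpropagation iteration
    //(editorial sketch; values taken from the sample run below).
    #include <cmath>
    #include <iostream>

    int main () {
        const double alpha = 0.10;               //learning rate, as in the thesis program
        double p = 2, w1 = 3, w2 = 2, t = 6;

        double n1 = w1 * p;                      //net input to layer 1
        double a1 = 1.0 / (1.0 + std::exp(-n1)); //logsig transfer function
        double a2 = w2 * a1;                     //purelin transfer function
        double e  = t - a2;                      //error

        double s2 = -2.0 * e;                    //layer-2 sensitivity (purelin' = 1)
        double s1 = (1.0 - a1) * a1 * w2 * s2;   //layer-1 sensitivity (logsig' = (1-a1)(a1))

        w2 -= alpha * s2 * a1;                   //delta-rule weight updates
        w1 -= alpha * s1 * p;

        std::cout << "a1 = " << a1 << ", a2 = " << a2 << ", e = " << e << "\n"
                  << "w2_update = " << w2 << ", w1_update = " << w1 << "\n";
        return 0;
    }
    //Expected output (cf. the first block of the transcript):
    //a1 = 0.997527, a2 = 1.99505, e = 4.00495
    //w2_update = 2.79901, w1_update = 3.0079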

Sample output, with t = 6

Welcome to the Neural Cryptography Program for Brendan's Thesis.
The purpose of this program is to teach the network to decipher a code.
Please input the first digit of your code, i.e. the target value for the network.
It will be an integer between 0 and 9.
t = 6

Thank you.

Now the program will state an integer between 0 and 9.
The integer is the initial input for the network.
The initial input, p = 2

Now the program will state the randomly generated weights.
The weight for layer 1, w1 = 3
The weight for layer 2, w2 = 2

The n value for layer 1, n1 = 6
The output for the first layer, a1 = 0.997527
The output for the second layer, a2 = 1.99505

The error, e = 4.00495

The sensitivity for the second layer, s2 = -8.00989
The sensitivity for the first layer, s1 = -0.0395132

The network will now update the weights.
We will use the learning rate of alpha = 0.10.

The updated weight for the second layer, w2_update = 2.79901
The updated weight for the first layer, w1_update = 3.0079

The initial input, p = 2

The n value for layer 1, n1 = 6.01581
The output for the first layer, a1 = 0.997566
The output for the second layer, a2 = 2.7922

The error, e = 3.2078

The sensitivity for the second layer, s2 = -6.41561
The sensitivity for the first layer, s1 = -0.043601

The network will now update the weights.
We will use the learning rate of alpha = 0.10.

The updated weight for the second layer, w2_update = 3.43901
The updated weight for the first layer, w1_update = 3.01662

The initial input, p = 2

The n value for layer 1, n1 = 6.03325
The output for the first layer, a1 = 0.997608
The output for the second layer, a2 = 3.43078

The error, e = 2.56922

The sensitivity for the second layer, s2 = -5.13844
The sensitivity for the first layer, s1 = -0.0421681

The network will now update the weights.
We will use the learning rate of alpha = 0.10.

The updated weight for the second layer, w2_update = 3.95162
The updated weight for the first layer, w1_update = 3.02506

The initial input, p = 2

The n value for layer 1, n1 = 6.05011
The output for the first layer, a1 = 0.997648
The output for the second layer, a2 = 3.94233

The error, e = 2.05767

The sensitivity for the second layer, s2 = -4.11534
The sensitivity for the first layer, s1 = -0.0381599

The network will now update the weights.
We will use the learning rate of alpha = 0.10.

The updated weight for the second layer, w2_update = 4.36219
The updated weight for the first layer, w1_update = 3.03269

The initial input, p = 2

The n value for layer 1, n1 = 6.06538
The output for the first layer, a1 = 0.997684
The output for the second layer, a2 = 4.35208

The error, e = 1.64792

The sensitivity for the second layer, s2 = -3.29583
The sensitivity for the first layer, s1 = -0.0332269

The network will now update the weights.
We will use the learning rate of alpha = 0.10.

The updated weight for the second layer, w2_update = 4.69101
The updated weight for the first layer, w1_update = 3.03933

The initial input, p = 2

The n value for layer 1, n1 = 6.07867
The output for the first layer, a1 = 0.997714
The output for the second layer, a2 = 4.68028

The error, e = 1.31972

The sensitivity for the second layer, s2 = -2.63943
The sensitivity for the first layer, s1 = -0.0282398

The network will now update the weights.
We will use the learning rate of alpha = 0.10.

The updated weight for the second layer, w2_update = 4.95435
The updated weight for the first layer, w1_update = 3.04498

The initial input, p = 2

The n value for layer 1, n1 = 6.08996
The output for the first layer, a1 = 0.99774
The output for the second layer, a2 = 4.94315

The error, e = 1.05685

The sensitivity for the second layer, s2 = -2.1137
The sensitivity for the first layer, s1 = -0.0236173

The network will now update the weights.
We will use the learning rate of alpha = 0.10.

The updated weight for the second layer, w2_update = 5.16524
The updated weight for the first layer, w1_update = 3.04971

The initial input, p = 2

The n value for layer 1, n1 = 6.09941
The output for the first layer, a1 = 0.997761
The output for the second layer, a2 = 5.15367

The error, e = 0.846325

The sensitivity for the second layer, s2 = -1.69265
The sensitivity for the first layer, s1 = -0.0195331

The network will now update the weights.
We will use the learning rate of alpha = 0.10.

The updated weight for the second layer, w2_update = 5.33413
The updated weight for the first layer, w1_update = 3.05361

The initial input, p = 2

The n value for layer 1, n1 = 6.10722
The output for the first layer, a1 = 0.997778
The output for the second layer, a2 = 5.32228

The error, e = 0.677725

The sensitivity for the second layer, s2 = -1.35545
The sensitivity for the first layer, s1 = -0.016028

The network will now update the weights.
We will use the learning rate of alpha = 0.10.

The updated weight for the second layer, w2_update = 5.46937
The updated weight for the first layer, w1_update = 3.05682

The initial input, p = 2

The n value for layer 1, n1 = 6.11364
The output for the first layer, a1 = 0.997792
The output for the second layer, a2 = 5.4573

The error, e = 0.542704

The sensitivity for the second layer, s2 = -1.08541
The sensitivity for the first layer, s1 = -0.0130767

The network will now update the weights.
We will use the learning rate of alpha = 0.10.

The updated weight for the second layer, w2_update = 5.57767
The updated weight for the first layer, w1_update = 3.05943

The initial input, p = 2

The n value for layer 1, n1 = 6.11887
The output for the first layer, a1 = 0.997804
The output for the second layer, a2 = 5.56542

The error, e = 0.434578

The sensitivity for the second layer, s2 = -0.869156
The sensitivity for the first layer, s1 = -0.0106232

The network will now update the weights.
We will use the learning rate of alpha = 0.10.

The updated weight for the second layer, w2_update = 5.6644
The updated weight for the first layer, w1_update = 3.06156

The program has learned the code after 11 iterations
Good bye Cal Poly
Press any key to continue
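One observation on the transcript (an editorial note, not part of the thesis): because a1 stays saturated near 1, the layer-2 update dominates, and each pass contracts the error by a nearly constant factor. Since the update gives $w_2^{(k+1)} = w_2^{(k)} + 2\alpha\, e_k\, a_1$ and $a_2 = w_2 a_1$, neglecting the small change in $a_1$,

\[ e_{k+1} \;\approx\; e_k\bigl(1 - 2\alpha\, a_1^2\bigr) \;\approx\; 0.80\, e_k, \]

which matches the printed error sequence 4.00495, 3.2078, 2.56922, ... to within rounding.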

N.B. The code was compiled and executed with Microsoft Visual C++ 6.0.

Data Table A.2 Results for §4.2

Target Value   Input   Iterations
3              7       8
3              5       5
1              4       5
0              3       5
9              6       14
7              6       8
4              3       5
0              7       14
3              4       12
1              6       14
6              6       1
8              5       11
4              4       5
9              5       12
6              8       8
8              7       13
8              6       5
7              0       33
6              6       10
8              3       5
0              4       14
6              0       48
4              4       11
4              2       10
                       Sum 1-25: 276

Target Value   Input   Iterations
6              2       5
8              2       13
6              1       4
0              5       12
7              4       12
1              6       14
8              4       1
7              3       1
4              8       10
9              3       1
4              2       11
5              6       8
2              8       10
3              6       1
3              6       11
0              3       14
5              6       11
1              0       23
8              7       10
4              3       12
1              4       8
7              5       10
0              3       10
9              6       1
                       Sum 26-50: 213

Sum 1-50: 489
McCulloch-Pitts Measure: 9.78
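For reference, the measure reported in the table's last row is the per-trial average of the iteration counts; this restatement of the computation is the editor's, so the notation should be checked against the definition of $\mu_N$ in Chapter 3:

\[ \mu_N \;=\; \frac{1}{50}\sum_{i=1}^{50} I_i \;=\; \frac{489}{50} \;=\; 9.78, \]

where $I_i$ is the number of iterations the network needed on trial $i$.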

A.2.2 Finite Application Secundus: Statistics (Baseball), §4.3

For this application MATLAB was used along with the add-on Neural Network Toolbox; in particular, MATLAB 6.5 R13 with Neural Network Toolbox 4.01. The toolbox lets the user solve parts of the problem automatically, i.e. the user does not have to program in every command. It should be mentioned that the toolbox has excellent documentation, which should not come as a surprise since it was created, in part, by the authors of [21].

Source Code, First Iteration with Output

>> % Generally a stands for input, w stands for weight, and
>> % b stands for bias, e for error, & s for sensitivity.
>> % N.B. the weights are integers.
>> % a01 is taken from the training set.
>> a01=81; w01=[5; 4]; b01=[-.48; -.13]; w02=[9 -7]; b02=[.48];
>> % the following equation is the first transfer function
>> a01=logsig(w01*a01+b01)
a01 =
     1
     1
>> % the following equation is the second transfer function
>> a02=purelin(w02*a01+b02)
a02 =
    2.4800
>> e01=79-a02
e01 =
   76.5200
>> f02=1;
>> s02=-2*(1)*e01
s02 =
 -153.0400
>> s01=[(1-0.2921)*0.2921; (1-0.3884)*0.3884]
s01 =
    0.2068
    0.2375
>> w12=-w02-.1*s02*[0.2921 0.3884]
w12 =
   -4.5297   12.9441
>> b12=b02-.1*s02
b12 =
   15.7840
>> w11=w01-.1*s01*[81]
w11 =
    3.3251
    2.0759
>> b11=b01-.1*s01
b11 =
   -0.5007
   -0.1538
>> % The end of the first iteration. This must be continued until the
>> % error is less than alpha, i.e. .1.
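For orientation, the session follows the standard two-layer backpropagation recurrences of [21]; this matrix summary is the editor's, and the transcript takes some notational liberties (e.g. the sign written into the w12 line), so it should be read against Chapter 4:

\[
\begin{aligned}
\mathbf{a}^1 &= \operatorname{logsig}\!\bigl(\mathbf{W}^1 p + \mathbf{b}^1\bigr), \qquad
a^2 = \operatorname{purelin}\!\bigl(\mathbf{W}^2 \mathbf{a}^1 + b^2\bigr),\\
s^2 &= -2\,(t - a^2), \qquad
\mathbf{s}^1 = \operatorname{diag}\!\bigl[(1-a^1_i)\,a^1_i\bigr]\,(\mathbf{W}^2)^{\mathsf{T}} s^2,\\
\mathbf{W}^m &\leftarrow \mathbf{W}^m - \alpha\, \mathbf{s}^m (\mathbf{a}^{m-1})^{\mathsf{T}}, \qquad
\mathbf{b}^m \leftarrow \mathbf{b}^m - \alpha\, \mathbf{s}^m, \qquad \alpha = 0.1.
\end{aligned}
\]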

Data Table A.3 Randomization of the Los Angeles Dodgers Wins 1969-2004 [111]

Train   Train   Test   Valid
79      88      81     102
91      87      63     95
88      85      93     85
63      77      86     89
92      94      86     93
79      77      86     85
95      83      73     92
98      88      73     90
92      78      95     58

Table A.4 Results for §4.3

Target Value   Input   Iterations
79             81      32
91             81      25
88             63      23
63             63      23
92             93      25
79             93      24
95             86      36
98             86      25
92             86      25
                       Sum 1-9: 238

Target Value   Input   Iterations
88             86      24
87             86      24
85             86      24
77             73      28
94             73      25
77             73      24
83             73      24
88             95      25
78             95      24
                       Sum 10-18: 222

Sum 1-18: 460
McCulloch-Pitts Measure: 25.6
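As with Table A.2, the reported measure is the per-trial average (again the editor's restatement of the computation):

\[ \mu_N \;=\; \frac{1}{18}\sum_{i=1}^{18} I_i \;=\; \frac{460}{18} \;\approx\; 25.6. \]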

N.B. We ran only 18 trials here, instead of fifty, because of the limited data. The target values come from the training array and the inputs from the test array.


Feb 17, 2015 - WASHINGTON—How big is the retirement savings gap that most Americans are ... EBRI's new analysis is based on results from its proprietary ...