Vector Space Models and Search Engines

Robert Mickle (Section 001) Ho Yun Chan (Section 002) APPM 3310: Matrix Methods December 10th, 2007

Abstract

Search engines are a fundamental part of modern technology, with an estimated 25% of internet users visiting google.com on a daily basis and over 27% visiting yahoo.com. [4] Quick information retrieval has changed the way people live their lives and go about their business, and with such a revolution, it is unsurprising that the technology behind search engines is impressive. In this paper we explore the mathematical basis behind modern information retrieval systems and show how linear algebra allows us to model vast amounts of data and search it effectively. There are many ways to build an information retrieval system, but we will be exploring one system in particular: the vector space model.


Introduction

It is easy to see how the advent of computing has brought many new benefits and challenges to the world, but as with any sufficiently advanced technology, it is sometimes hard to distinguish its mathematical underpinnings from magic. One of the most significant areas of recent development is the field of search engines. Fifteen years ago, the best ways to navigate the web were essentially curated lists of favorites. [1, 2] This has changed dramatically since then, and it is important to understand what exactly happened to search engine technology to make it so much better today.

Google's PageRank algorithm is what most people think of when they think of search engine algorithms. However, PageRank and similar algorithms only serve to model the importance of the pages of the web; they are not a complete solution for search on their own. Most importantly, they do not provide information about what the pages are about or how they relate to the user's query. For this fundamental problem, we have to use other methods.

The method that we will be exploring is the vector space model. This model creates an algebraic system that represents all of the documents that we want to look through as well as the query the user has entered. The objective is to be able to compare a query to a document and determine how similar they are. Entire books have been written about this topic, so we do not aim to cover all facets of search technology. [3] Rather, we will look into the basics of vector space models and give an example of using one for a limited language and document set. We will focus on how to find similarities between queries and documents using QR decomposition. In doing so, we will not go into some important steps of the algorithm; in some of these cases we will provide references to an appendix where their importance is explained, but we will not give a thorough explanation (or examples) of their mathematics and implementations.

There are several conventions that this paper will use. References to the bibliography will be in brackets, e.g. [1]. References to the appendices will be in the form (A.1), where A is the appendix and .1 is the section. Equations will be in the form (E.1), where .1 is the equation number. Figures will also be done in this fashion; (F.1) will refer to Figure 1.


Vector Space Models

Model Development

Key to building this system is realizing that what we want is a way to compare one set of words to another and see how similar they are. For search engines, we need to compare one set of words (the query) to many other sets of words¹ (the documents that make up the web, each of which is its own set). So we need to be able to compare the query to the first document, get a number representing how similar they are to one another, then to the second, and so on. Once we know how similar the query is to each document, we can return the most similar documents to the user. In the vector space model, we achieve this by modeling each set of words as a vector: the query is modeled by the query vector, and each document is modeled by a document vector. Modeling them as vectors allows us to use the tools of linear algebra to determine similarity, since we can easily determine the angle between two vectors (E.1). This angle is the similarity: if the documents have no words in common, the angle between them will be 90°, and the more similar they are, the closer to 0° the angle becomes.

\cos\theta = \frac{x \cdot y}{\|x\| \, \|y\|}    (E.1)

Using E.1, we can express the similarity between a query vector q and a document vector d_j to be

\cos\theta_j = \frac{q^T d_j}{\|q\| \, \|d_j\|}    (E.2)
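As a concrete illustration (our own, not part of the original derivation), E.2 can be evaluated directly in Matlab. The two toy vectors here are hypothetical, chosen only to show the computation:

% Toy term-count vectors (hypothetical); each row counts one term.
x = [1; 1; 0; 2];
y = [0; 1; 0; 1];
% E.2: the angle between the vectors, in degrees (acosd returns degrees).
theta = acosd( (x' * y) / (norm(x) * norm(y)) )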

So what do these vectors look like? Above, we described the query as a set of words, and we model it as such. We will have a row for each word, and in that row we will have the number of times that word appears (see F.2). A row for each word isn't as clear-cut as it sounds, though; there are many different ways to improve search engines through the choice of the words you use to represent the vectors. Two of the most common approaches are stop lists (A.2) and stemming (A.3). For our examples, however, we will use a limited subset of the English language to

¹ Comparing sets of words isn't necessarily what the user wants from a search engine, though, so there is a gap between the physical world and our mathematical model; see A.1.


simplify things. Our language is given in (F.1); the suffixes in parentheses after some of the words (e.g., -ion) are an example of stemming (A.3).

Terms
 1  mickey
 2  mouse (mice)
 3  protect (-ion, -ing, -ed)
 4  copyright
 5  exten (-d, -sion, -ed)
 6  law
 7  walt
 8  disney
 9  property
10  classics
11  advance

Figure 1: Our limited language

Query: mickey mouse protection act

Vector (full English language), one row per English word, almost all of them zero:

aardvark    0
abandon     0
  ...
acrylics    0
act         1
aculeate    0
  ...
mickey      1
micro       0
  ...
mouse       1
movie       0
  ...
protection  1
  ...
zymurgy     0

Vector (limited language), one row per term of Figure 1:

mickey                     1
mouse (mice)               1
protect (-ion, -ing, -ed)  1
copyright                  0
exten (-d, -sion, -ed)     0
law                        0
walt                       0
disney                     0
property                   0
classics                   0
advance                    0

Figure 2: Various query vectors

To create our query vector q, all we did was set each row of the vector equal to the number of times that word appeared in our query. So for the full English version of q, we ended up with four ones, as we had 4 distinct words in our query. We also ended up with a great many zeros, which is typical for vector space models. In our limited-language version of q, we only ended up with three ones, as we don't have the word "act" in our defined language. From now on, q will represent the query vector in our limited language:

q = (1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)^T    (E.3)
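For concreteness, here is a minimal Matlab sketch of how such a query vector could be assembled. The variable names and the cell-array representation are our own illustration, not something specified in the paper:

% Term list in the order of Figure 1.
terms = {'mickey','mouse','protect','copyright','exten', ...
         'law','walt','disney','property','classics','advance'};
% The query, already stemmed; 'act' is dropped since it is not in the language.
query = {'mickey','mouse','protect'};
q = zeros(numel(terms), 1);
for i = 1:numel(query)
    q = q + strcmp(query{i}, terms)';  % add 1 to the matching term's row
end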

Now that we know how to represent a query vector, all we need are document vectors to compare it against and get our results. On the web, this would be all of the documents we could reach; in our example we will use a more limited set. Additionally, we are only going to index the titles of the documents, not all of the words inside them. We chose a variety of titles to see how they would do against our sample query: the first 7 definitely have relevance, the next 2 might (depending on what is inside them), and the last 3 definitely don't. This will allow us to see why the vector space model does and does not choose certain things.

1. Modern Copyright law
2. The Copyright Term Extension Act
3. How Mickey Mouse will never be free
4. Protecting Mickey Mouse
5. Copyright Protection Basics
6. How US copyright law is moving creative production offshore
7. To protect and smother; how advances in copyright law have extended Mickey's life 95 years past Walt's death (AKA, why we need to get rid of copyright)
8. Disney v. the People
9. Protecting the Classics
10. Don't eat mice! A family's guide to eating around Disney World
11. Steamboat Willie: The birth of Mickey Mouse
12. The best of Disney, a 10 DVD collection including Snow White, Mickey Mouse, and more classics

Figure 3: Documents (terms that appear in our language are underlined)


Looking at the documents, the procedure we will use to create the document vectors is the same as the one we used to create the query vector; the results are shown in Figure 4. Note² how in document 7, where the term "copyright" shows up twice, the value of the 4th row is two. Shown this way, it is also easy to see why this set of vectors together is used as a matrix, D, the document matrix:

D = (d_1 \; d_2 \; \cdots \; d_{12})    (E.4)

                           Documents
Terms                      1  2  3  4  5  6  7  8  9 10 11 12
mickey                     0  0  1  1  0  0  1  0  0  0  1  1
mouse (mice)               0  0  1  1  0  0  0  0  0  1  1  1
protect (-ion, -ing, -ed)  0  0  0  1  1  0  1  0  1  0  0  0
copyright                  1  1  0  0  1  1  2  0  0  0  0  0
exten (-d, -sion, -ed)     0  1  0  0  0  0  1  0  0  0  0  0
law                        1  0  0  0  0  1  1  0  0  0  0  0
walt                       0  0  0  0  0  0  1  0  0  0  0  0
disney                     0  0  0  0  0  0  0  1  0  1  0  1
property                   0  0  0  0  0  0  0  0  0  0  0  0
classics                   0  0  0  0  0  0  0  0  1  0  0  1
advance                    0  0  0  0  0  0  1  0  0  0  0  0

Figure 4: The document vectors

Now that we have defined both q and D, we can calculate the similarity of any of the documents to our query using E.2. However, from a practical standpoint, we will want to speed this up: we want to put the equation in a form against which we can quickly run queries. Our ultimate goal is to make the computation of E.2 as fast as possible. Looking at E.2, we can represent the equation, calculated for all the documents at once, with two parts.

² Also note how sparse the matrix is, even with our extremely limited vocabulary. When contemplating how big the D matrix would be for a web search engine, with half a million different terms making up the rows and billions of documents making up the columns, keep in mind that it would be considerably more sparse than this, and because of that have a relatively small rank.


1) n, a vector representing the numerators of the fraction:

n = D^T q    (E.5)

2) f, a vector representing the denominators of the fraction:

f_j = \|d_j\| \, \|q\|    (E.6)

Thus,

\cos\theta_j = \frac{n_j}{f_j}    (E.7)
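In Matlab, this two-part split can be sketched as follows. This is a minimal illustration assuming D and q as defined above; the paper itself does not give this code:

n = D' * q;                         % E.5: one numerator per document
f = sqrt(sum(D.^2, 1))' * norm(q);  % E.6: ||d_j||*||q|| for each document
cosines = n ./ f;                   % E.7: cosine similarity per document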

In part 2, the magnitudes of the document vectors can be pre-computed: each one only has to be calculated once, so this is an extremely quick calculation, only one additional multiplication per additional document. Part 1, on the other hand, means multiplying an extremely large matrix by a vector, and this calculation could take much longer. Part 1 is therefore where we aim to speed up the calculation. In our model, we are going to use a QR decomposition to represent D; for an explanation and alternatives see A.4. Thus,

1) This is the definition of a QR decomposition:

D = QR

2) Update E.5 and we get this:

n = D^T q = (QR)^T q = R^T Q^T q

3) Derived from 1, where d_j is the jth column of D and r_j is the jth column of R:

d_j = Q r_j

4) This is the definition of E.2, then updated to use the definition in 3:

\cos\theta_j = \frac{d_j^T q}{\|d_j\| \, \|q\|} = \frac{(Q r_j)^T q}{\|Q r_j\| \, \|q\|}

5) This is the definition of Q (it is orthogonal):

Q^T Q = I

6) Using the definition of the L2 norm:

\|x\| = \sqrt{x^T x}

7) Derivation using 5 and 6:

\|Q r_j\| = \sqrt{(Q r_j)^T (Q r_j)} = \sqrt{r_j^T Q^T Q r_j} = \sqrt{r_j^T r_j} = \|r_j\|

8) Expansion of the transpose:

(Q r_j)^T q = r_j^T Q^T q

9) Continuation of 4 using 7 and 8:

\cos\theta_j = \frac{r_j^T Q^T q}{\|r_j\| \, \|q\|}

10) Continuation of 9, the updated version of E.2:

\cos\theta_j = \frac{r_j^T (Q^T q)}{\|r_j\| \, \|q\|}    (E.8)
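As a quick numerical sanity check (our own addition, not in the original report), E.8 can be verified against E.2 directly in Matlab once D and q are defined:

[Q, R] = qr(D);
direct = (D' * q) ./ (sqrt(sum(D.^2, 1))' * norm(q));        % E.2 for all j
viaQR  = (R' * (Q' * q)) ./ (sqrt(sum(R.^2, 1))' * norm(q)); % E.8 for all j
max(abs(direct - viaQR))   % should be on the order of machine precision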

Though we can still represent E.8 in the same fashion as E.7, doing so does not help our calculations further. We are now ready to calculate Q, R, and the results.

Numerical Work

The first step we took in order to rank each document by relevance was to put the document matrix D (E.4) into Matlab. To do this we used the code (M.1) below:

D = [ 0,0,1,1,0,0,1,0,0,0,1,1;
      0,0,1,1,0,0,0,0,0,1,1,1;
      0,0,0,1,1,0,1,0,1,0,0,0;
      1,1,0,0,1,1,2,0,0,0,0,0;
      0,1,0,0,0,0,1,0,0,0,0,0;
      1,0,0,0,0,1,1,0,0,0,0,0;
      0,0,0,0,0,0,1,0,0,0,0,0;
      0,0,0,0,0,0,0,1,0,1,0,1;
      0,0,0,0,0,0,0,0,0,0,0,0;
      0,0,0,0,0,0,0,0,1,0,0,1;
      0,0,0,0,0,0,1,0,0,0,0,0 ]    M.1

Once Matlab stored the document matrix as an 11 × 12 matrix, we used the code (M.2) so that Matlab could directly decompose matrix D and store the factors into their respective matrices Q (E.11) and R (E.12):

[Q,R] = qr(D)

M.2

Q = [the 11 × 11 orthogonal factor returned by M.2; its nonzero entries are values such as ±0.408, ±0.577, ±0.707, ±0.817, and ±1]    E.11

R = [the 11 × 12 upper-triangular factor returned by M.2; its nonzero entries are values such as ±0.40, ±0.58, ±0.71, ±1, ±1.22, ±1.41, and −2.12]    E.12

After Matlab decomposed the matrix, we were able to calculate the angles between the document vectors and the query vector individually. To do this we used the code in M.3, calculating the angle between each document vector and the query vector separately by changing r from 1 to 12. The resulting angles are shown in Figure 5. From there we could rank the documents from most to least relevant (Figure 6), using the fact that the angles closer to 0° belong to the documents that best matched the query, as opposed to those closer to 90°.


theta = acosd((transpose(R(:,r)) * (transpose(Q) * q)) / (norm(R(:,r)) * norm(q)))    M.3

Document           1   2   3        4   5        6   7        8   9        10       11       12
Angle in degrees   90  90  35.2644  0   65.9052  90  68.5833  90  65.9052  65.9052  35.2644  54.7356

Figure 5

Ranked documents   4   3        11       12       5        9        10       7        1   2   6   8
Angle in degrees   0   35.2644  35.2644  54.7356  65.9052  65.9052  65.9052  68.5833  90  90  90  90

Figure 6
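Rather than changing r by hand, the whole ranking can be produced in one pass. This small sketch is our own, built from M.3 and assuming Q, R, and q are already in the workspace; it reproduces Figures 5 and 6:

angles = zeros(1, 12);
for r = 1:12
    % M.3, evaluated for each document in turn
    angles(r) = acosd((R(:,r)' * (Q' * q)) / (norm(R(:,r)) * norm(q)));
end
[sortedAngles, ranking] = sort(angles);  % Figure 6: most to least relevant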


Conclusion

Of course, in practice, with considerations for speed, accuracy, and ease of use, things quickly become much more complicated. We can see this in the discrepancy between the simple example here and the much more complex algorithmic overview in Appendix B. (And naturally, actually implementing something like this would be an order of magnitude more complicated still.)


Appendix A: Additional Information

A.1

The idea of what people want from a search engine is hard to classify, and it is not always possible to discern what they want from what they type into a query. We can see the response to this in the commercial market, in things like ask.com using a more guided search, or even Google offering several different sets of results prefaced with a "did you mean." More generally, people aren't looking for the similarity between sets of words, but rather between ideas, which is why vector space models have trouble dealing with synonymy (people referring to the same idea by different terms) and polysemy (words referring to different ideas depending on context). The user also might place importance on the document coming from a reliable or important source (which is what PageRank deals with) instead of wanting the document with the highest ratio of query words.

A.2

There are many common words that will not help us determine how relevant a document is. These words (like "a," "and," "or," and "the") are called stop words, and we should filter them out of our documents prior to processing them. We do this by creating (or using an existing) stop list of all the words that are irrelevant to how similar one document is to another.
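A minimal Matlab sketch of applying a stop list might look like the following; the word lists here are purely illustrative:

stoplist = {'a','and','or','the'};
words    = {'the','mickey','mouse','and','copyright'};
kept     = words(~ismember(words, stoplist));  % drops 'the' and 'and'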

A.3

When the user searches for something like "fruit," we want to match all (or at least some) of its variants; for instance, "fruits" should be counted as a match, but probably not "fruity," as that refers to a different concept. This is called stemming. We use stemming to help better match words to ideas and avoid the problem of polysemy. (A.1)
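A crude suffix-stripping sketch in Matlab is shown below. Real stemmers (such as the Porter stemmer) are far more careful; this one-line rule is only an illustration:

word = 'protection';
stem = regexprep(word, '(ion|ing|ed|s)$', '');  % gives 'protect'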

A.4

An important idea from this paper is that the document matrix D is extremely sparse. Computationally, it is a waste to work with the full matrix unchanged, as we would be doing many pointless operations. By decomposing the matrix, we can get a simplified representation of D that will be faster to compute with. There are two popular methods for this: QR (A.5) and SVD (A.6). The general idea in both, though, is to reduce the rank of D by an amount that will not be noticeable to the users, thus reducing the size and increasing the speed of the computation. Generally, once the rank is down to about 300, the system will be fast enough to run queries against. [5]

A.5

If we take our document matrix D, we can rewrite it as D = QR, where Q is an orthogonal matrix and R is upper triangular. In the calculation, if we use column pivoting, we can compute a QR factorization in which the largest elements of R are in the upper left, effectively isolating a sub-matrix of R in the lower right. This sub-matrix is significant because if we remove it (set it to zero), we have a rank-reduced approximation of D, since D \approx Q \tilde{R}, where \tilde{R} is R with the lower-right sub-matrix discarded. Since it is smaller, it will be quicker to calculate with and gives us a good approximation at a greatly reduced computation time.
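In Matlab, the rank reduction described above can be sketched with the three-output, column-pivoted form of qr. The cutoff k here is a hypothetical choice, not a value from the paper:

[Q, R, E] = qr(D);    % column pivoting: D*E = Q*R, |diag(R)| decreasing
k = 6;                % hypothetical reduced rank
Rk = R;
Rk(k+1:end, :) = 0;   % discard the lower-right sub-matrix of R
Dk = Q * Rk * E';     % rank-k approximation of D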

A.6

While the Singular Value Decomposition takes longer computationally up front than the QR factorization, it gives several advantages: this decomposition allows us to arbitrarily change what rank we are approximating with, and the rank approximation is guaranteed to be the best approximation according to a theorem of Eckart and Young. [5] In this form, D = U \Sigma V^T, where \Sigma is a diagonal matrix composed of the singular values of D (\sigma_i = \sqrt{\lambda_i}, where \lambda_i is an eigenvalue of the associated Gram matrix D^T D), and U and V are orthogonal. rank(D) equals the number of nonzero elements on the diagonal of \Sigma, so to reduce the rank, we just discard (set to 0) the least significant elements of \Sigma (the lower right). The Eckart-Young theorem states that the error in approximating a matrix A by a rank-k matrix A_k is determined by the discarded singular values. This gives this method the significant advantage that it will always be the best approximation for any given rank.
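The corresponding Matlab sketch is below; again, k is a hypothetical rank, and the code is our own illustration rather than anything prescribed in [5]:

[U, S, V] = svd(D);                      % D = U*S*V'
k = 6;                                   % hypothetical reduced rank
Dk = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';  % best rank-k approximation (Eckart-Young)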


Appendix B: An Algorithmic Overview

1) Analyze and vet the content of the document
   a. Search engines do not just analyze blocks of text. When reading a document on the internet, there are many things that have to be analyzed and weighted. For instance, you do not want to count words that appear as comments in the HTML but are not shown to the people who read the page. Words that appear in document titles might be given more importance, as might items marked as headings. Web pages also may contain meta information that we might treat differently.
   b. Apply a stop list. (A.2)
2) Normalize the documents
   a. Stem words. (A.3)
   b. We need to put the documents in a file format standard to our search engine, so that we can easily find information pertinent to us, e.g., where a word occurs in a document. The most common format is an IFS, or inverted file structure.
3) Normalize the document vectors
   a. Weight terms; some words are scarcer than others, even after we have gotten rid of the extremely common words on our stop list. If the user queries with something like "evergreen tree Pinus longaeva," we should put more weight on the last 2 terms in order to get better results (Pinus longaeva is a type of evergreen tree). Many documents will have the first 2 terms, but the last 2 are rarer, and thus possibly more important. We can also use term weighting to counteract the discrepancies between large and small documents, since we are much more likely to find all of our search terms together in a larger document. There are many strategies to weight terms (one common scheme is sketched after this list); see [5] for a better overview.
4) Build the document matrix
   a. Since the matrix is so sparse, there are specialized formats we can take advantage of to be more efficient. We can store the matrix in Compressed Row Storage or Compressed Column Storage; see [5] for more information.
5) Put the document matrix in a form against which we can quickly run queries. (A.4)
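As an example of step 3a, here is a sketch of tf-idf weighting, one widespread term-weighting scheme. The paper itself defers to [5] and does not prescribe this exact formula:

tf  = D;                                % raw term frequencies (terms x docs)
df  = sum(D > 0, 2);                    % how many documents contain each term
idf = log(size(D, 2) ./ max(df, 1));    % rarer terms get larger weights
W   = tf .* repmat(idf, 1, size(D, 2)); % weighted document matrix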

Bibliography

[1] Yahoo History. http://docs.yahoo.com/info/misc/history.html
[2] Vector Space Model Definition. http://en.wikipedia.org/wiki/Vector_space_model
[3] The Classic Vector Space Model. http://www.miislita.com/term-vector/term-vector-3.html
[4] Alexa Web Information. http://www.alexa.com/data/details/traffic_details/google.com
[5] Berry, Michael W. and Browne, Murray. Understanding Search Engines: Mathematical Modeling and Text Retrieval, Second Edition. Society for Industrial and Applied Mathematics, 2005.

