Text Search in Wade

Kabir Shah

1 Overview

Text search with Wade searches a set of documents for a query and returns the most relevant documents. To do this, the following must be taken into consideration:

1. The number of documents.
2. The number of terms in the search query.
3. The significance of each term.

2 Index

An index can be generated to optimize searches within a text. The index is generated by:

1. Preprocessing each document to make it lowercase, free of punctuation, and free of stop words.
2. Splitting each document into a multiset of terms.
3. Generating a trie of the terms and storing, for each term, the indexes of the documents containing it.
4. Storing the weighted significance of each term in the documents.
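The first three steps can be sketched in Python as follows. This is a minimal illustration, not Wade's implementation: the stop-word list, the `preprocess` and `build_index` names, and the `"docs"` key used to hold document indexes are all assumptions, and step 4 (storing the significance) is left out.

```python
import string
from collections import Counter

# Hypothetical stop-word list; the source does not specify one.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}

def preprocess(text):
    """Lowercase, strip punctuation, drop stop words; return a list of terms."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [term for term in text.split() if term not in STOP_WORDS]

def build_index(documents):
    """Build a character trie whose terminal nodes list document indexes."""
    root = {}
    for doc_id, text in enumerate(documents):
        # Counter models the multiset; iterating it visits each distinct term once.
        for term in Counter(preprocess(text)):
            node = root
            for ch in term:
                node = node.setdefault(ch, {})
            # "docs" is an assumed key; it cannot collide with one-character keys.
            node.setdefault("docs", []).append(doc_id)
    return root

index = build_index(["The cat sat.", "A cat and a car."])
```

Here the node reached through `c`, `a`, `t` lists both documents, while the node reached through `c`, `a`, `r` lists only the second.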

2.1 Significance

The significance of the term t in the set of documents d can be represented by the function:

$$w_m(t, d) = 1.5 - \frac{\sum_{i=0}^{|d|-1} \left[ \frac{\mu(t)}{\sum_{p \in d_i} \mu(p)} \right]}{|d|}$$

where:

1. $d$ is a set of multisets $d_i$.
2. $t \in d_i$ for at least one $d_i$.
3. $\mu(t)$ is the multiplicity of $t$ in the multiset $d_i$.
4. $\sum_{p \in d_i} \mu(p)$ is the cardinality of the multiset $d_i$.
5. $\frac{\mu(t)}{\sum_{p \in d_i} \mu(p)}$ is the ratio of occurrences of the term $t$ in the document $d_i$ to the total number of terms in the document $d_i$.

This works by finding the average of how often the term appears within a document. The average is then subtracted from 1.5, so the significance lies between 0.5 and 1.5 and becomes higher when the average occurrence is lower. This amplifies the significance of rarer terms.
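As a rough sketch of the significance function, the Python below computes the average term ratio over already-preprocessed documents and subtracts it from 1.5. The `significance` name and the sample documents are illustrative assumptions, and every document is assumed to be non-empty.

```python
from collections import Counter

def significance(term, documents):
    """Return 1.5 minus the mean of count(term) / len(document)."""
    ratios = []
    for terms in documents:
        counts = Counter(terms)  # multiset of the document's terms
        ratios.append(counts[term] / len(terms))
    return 1.5 - sum(ratios) / len(documents)

# Hypothetical documents, already split into terms.
docs = [["cat", "sat", "mat"], ["cat", "car"], ["dog", "ran"]]
cat_weight = significance("cat", docs)  # 1.5 - (1/3 + 1/2 + 0) / 3
dog_weight = significance("dog", docs)  # 1.5 - (0 + 0 + 1/2) / 3
```

Because "dog" occurs less often on average than "cat", it receives the higher significance.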

3 Search

A query must go through the same preprocessing that was used when building the index. Searching for a query can be done by:

1. Preprocessing the query to make it lowercase, free of punctuation, and free of stop words.
2. Splitting the query into a multiset of terms.
3. Searching the index for each term in the query.

3.1 Relevance

To find how relevant each term is to a document, the length of the query must be taken into consideration. The relevance of the term t in the query q to the set of documents d can be represented by the function:

$$w_r(t, q, d) = w_m(t, d) \left[ \frac{1}{\sum_{p \in q} \mu(p)} \right]$$

where:

1. $q$ is a multiset.
2. $\mu(p)$ is the multiplicity of $p$ in the multiset $q$.
3. $\sum_{p \in q} \mu(p)$ is the cardinality of the multiset $q$.

This represents how much each term should affect the score of the query. The significance of the term is divided by the length of the query, so each term of a longer query contributes less.
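The relevance function amounts to a single division, sketched below in Python. The `relevance` name and the sample significance value of 1.2 are assumptions for illustration; the significance is taken as already computed.

```python
from collections import Counter

def relevance(term_significance, query_terms):
    """Divide the term's significance by the cardinality of the query multiset."""
    query = Counter(query_terms)
    cardinality = sum(query.values())  # total terms, counting repeats
    return term_significance / cardinality

# A three-term query: each term contributes a third of its significance.
weight = relevance(1.2, ["cat", "car", "cat"])
```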

3.2 Exact Search

Every term except the last is searched for exactly. The term is split into characters, which are used as keys into the trie. Each character is looked up inside the previous node, starting from the root of the index, and the node that is found becomes the one in which the next character is looked up. At the end, if every character was found and the final node has a list of documents containing the term, there was an exact match of the term, and the scores of those documents can be updated. The score of each document found for the term t in the query q searched within the documents d is updated by adding:

$$w_r(t, q, d)$$
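The exact lookup can be sketched in Python as below. The nested-dictionary trie, the `"docs"` key holding document indexes, the function names, and the weight of 0.5 are assumptions made for this example rather than Wade's actual data layout.

```python
def exact_lookup(index, term):
    """Walk the trie one character at a time; return the documents for the term."""
    node = index
    for ch in term:
        if ch not in node:
            return []  # a character is missing, so there is no match
        node = node[ch]
    return node.get("docs", [])  # empty if the term is only a prefix

def add_exact_scores(index, term, weight, scores):
    """Add the term's relevance weight to every document that contains it."""
    for doc_id in exact_lookup(index, term):
        scores[doc_id] = scores.get(doc_id, 0.0) + weight

# Hand-built trie holding "cat" (documents 0 and 1) and "car" (document 1).
index = {"c": {"a": {"t": {"docs": [0, 1]}, "r": {"docs": [1]}}}}
scores = {}
add_exact_scores(index, "cat", 0.5, scores)
```

Looking up "ca" returns no documents here, because that node has children but no document list of its own.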

3.3 Depth-First Search

For the last term, a different process is used to update the scores. Since a user might still be typing the last term, it is treated as a prefix and a depth-first search is used. The index is searched in the same manner as described above, but the process changes once the node for the final character is reached. Instead of stopping there, the search traverses all child nodes for document indexes, and every document found is updated for the term. The score is updated for the term t in the query q searched within the documents d by adding:

$$w_r(t, q, d)$$

As a result, all of the documents that have a term with the same prefix are updated.
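A minimal Python sketch of the prefix search follows, using an explicit stack for the depth-first traversal. The trie layout, the `"docs"` key, and the weight of 0.5 are assumptions. The sketch also assumes that a document is updated once for every indexed term that shares the prefix, which the description above leaves open.

```python
def prefix_lookup(index, prefix):
    """Return document indexes for every indexed term starting with prefix."""
    node = index
    for ch in prefix:
        if ch not in node:
            return []  # the prefix itself is not in the trie
        node = node[ch]
    found = []
    stack = [node]  # depth-first traversal of the subtree below the prefix
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "docs":
                found.extend(child)  # a complete term ends at this node
            else:
                stack.append(child)
    return found

# Hand-built trie holding "cat" (documents 0 and 1) and "car" (document 1).
index = {"c": {"a": {"t": {"docs": [0, 1]}, "r": {"docs": [1]}}}}
scores = {}
for doc_id in prefix_lookup(index, "ca"):
    scores[doc_id] = scores.get(doc_id, 0.0) + 0.5
```

With the prefix "ca", document 1 is updated twice because it contains both "cat" and "car".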
