IJRIT International Journal of Research in Information Technology, Volume 2, Issue 4, April 2014, Pg: 512- 518

International Journal of Research in Information Technology (IJRIT) www.ijrit.com

ISSN 2001-5569

Contextual Query Based On Segmentation & Clustering For Acquiring Related Features in Medical Diagnosis

Prof. S. P. Akarte 1, Prof. G. A. Dashmukhe 2

1 Assistant Professor, Dept. of Computer Science & Engineering, PRMIT & R, Badnera, Amravati, Maharashtra, India. [email protected]

2 Assistant Professor, Dept. of Computer Science & Engineering, Siddhivinayak Technical Campus, SPRT, Khamgaon, Maharashtra, India. [email protected]

Abstract

Nowadays the internet plays an important role in information retrieval, but users often do not get the desired results from search engines. Web search engines have a key role in the discovery of relevant information, but this kind of search is usually performed using keywords, and the results do not consider the context. This paper describes the use of information extraction techniques applied to previously defined resources in order to suggest terms to users and to run the expanded queries in web search engines, obtaining more useful search results that take into account the domain context of the required information. The terms most often found in an information resource that is representative of a subject are more likely to also be present in other related documents available in the database. The results show that the proposed approach can be used in a corporate environment to support contextual search activities with good results.

Keywords: information, engines, context, extraction, terms, documents

1. Introduction

The web is nowadays one of the main sources of information, and information search is an important area in which many advances have been registered. One approach to improving web search results is to consider contextual information. Usually, information about context has been provided through user logs of previous searches or by monitoring clicks on the first results, but different approaches can be used in special environments. In a web-based learning environment, existing documents and exchanged messages could provide contextual information. The main goal of this work is therefore to provide a contextual web search engine. Contextual search is provided through query expansion using medical documents. The proposed approach makes context acquisition faster and more dynamic, as it relies on automatic text processing of documents. However, information retrieval is strongly dependent on context. What one user, with specific knowledge and experience, considers relevant may not be relevant to another user with different characteristics and experience, even if both use the same search expressions. This paper presents a proposal to make web searches adaptive to the context of the users, according to their information needs, thus improving query results.


2. Literature Review

Data mining [11] is a process of nontrivial extraction of implicit, previously unknown, and potentially useful information (such as knowledge rules, constraints, and regularities) from data in databases. In fact, the term "knowledge discovery" is more general than the term "data mining." Data mining is usually viewed as one step in the process of knowledge discovery, although the two terms are often treated as synonyms in the computing literature. The entire life cycle of knowledge discovery includes steps such as data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation.

Fig. 1. Life cycle of knowledge discovery

Data cleaning [11] removes noise and inconsistent data. Data integration combines data from multiple data sources, such as a database and a data warehouse. Data selection retrieves the data relevant to the task. Data transformation transforms data into forms appropriate for mining. Data mining applies intelligent methods to extract data patterns. Pattern evaluation identifies the truly interesting patterns based on some interestingness measures. There are many data mining techniques, such as association rule mining, classification, clustering, sequential pattern mining, etc.

Context Sensitive IR approach
Information retrieval (IR) [18] is a scientific research field concerned with the design of models and techniques for selecting relevant information in response to user queries within a collection (corpus) of documents. Two main steps characterize an IR process: document indexing and document-query matching. The objective of the indexing stage is to assign to each document in the collection the set of words, terms or concepts expressing the topic(s) or subject matter(s) addressed in the document. The matching stage aims at identifying the most valuable documents that best fit the query. Several different issues arise from both indexing and matching in IR. In this paper, we are particularly interested in biomedical IR, where collections entail medical knowledge and queries cover the information needs of physicians, researchers in the biomedical domain or, more generally, users of biomedical search tools. Our context sensitive IR [12][13] approach relies on two main steps, detailed below: (1) Conceptual Document Indexing and (2) Context Sensitive Document Retrieval. We integrate them into a biomedical IR process as the combination of the global and local semantic contexts for improving biomedical IR effectiveness.

Query expansion vs document expansion
In order to close the semantic gap between the user's query and the documents in the collection, several research works have focused on applying data smoothing techniques such as document expansion and query expansion to the original document/query. Theoretically, such techniques enhance the semantics of the document/query by bringing the query closer to the relevant documents in the collection.


As stated earlier, semantic information can be detected in a global context (usually from a domain knowledge source or an entire collection) or in a local context (usually from a sub-collection of related top-ranked documents). The principal goal of QE [7] is to increase search performance by increasing the likelihood of term overlap between a given query and documents that are likely to be relevant to the user query. Current approaches to QE can be subdivided into two main categories: global analysis and local analysis. Global techniques aim to discover word relationships in a large collection (global context) such as web documents or external knowledge sources like WordNet, MeSH, UMLS or multiple terminological resources. Local techniques emphasize the analysis of the top-ranked documents (local context) retrieved for a given query in a previous retrieval stage.

Word mismatch is a common problem in information retrieval. Most retrieval systems match documents and queries on a syntactic level; that is, the underlying assumption is that relevant documents contain exactly those terms that a user chooses for the query. However, a relevant document might not contain the query words as given by the user. Query expansion (QE) is intended to address this issue. Other topical terms are located in the corpus or in an external resource and are appended to the original query, in the hope of finding documents that do not contain any of the query terms or of re-ranking documents that contain some query terms but have not scored highly. A disadvantage of QE is the inherent inefficiency of reformulating a query. With the exception of earlier work [7], these inefficiencies have largely not been investigated. That work improved the efficiency of QE by keeping a brief summary of each document in the collection in memory, so that no time-consuming disk accesses need to be made during the expansion process. While some of the methods proposed in that research more or less maintain effectiveness, the process is sped up by roughly two thirds. However, expanding queries using the best of these methods still takes significantly longer than evaluating queries without expansion.

Document expansion (DE) has been proposed as an alternative to QE. In DE, documents are enriched with related terms. Although there is a significant (though not prohibitive) cost associated with expanding documents, this cost is incurred at indexing time, and only a marginal cost remains at retrieval time. In principle it is reasonable to suppose that DE will help resolve the problem of vocabulary mismatch and thus yield benefits like those obtainable with QE. Two corpus-based methods for DE are proposed in [7] (a sketch of the first method is given after this passage):
1. The first method adds terms to documents in a process analogous to QE: each document is run as a query and is subsequently augmented with expansion terms.
2. The second method regards each term in the vocabulary as a query, which is expanded using QE and used to rank documents. The original query term is then added to the top-ranked documents.
The experiments in [7] measure the efficiency and effectiveness of QE and DE on several collections and query sets. On balance, DE leads to improvements in effectiveness, but few of the measured gains are statistically significant; the computational cost at query time is small. In contrast, both standard QE and the efficient QE lead to gains in most cases, many of them significant, while the efficient QE costs less than twice as much as querying without expansion. Those experiments were, within the constraints of the available resources, reasonably exhaustive: several alternative configurations of DE were tested and the parameters explored, but no useful gains in effectiveness were observed. The conclusion is that corpus-based DE is unpromising, whereas QE, though not explored to the same extent, consistently improved effectiveness, so further gains in performance may be available.
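For illustration only, the following is a minimal Java sketch of the first DE method described above (each document is run as a query and enriched with the selected expansion terms). It is not the implementation from [7]; the Index interface and its methods are hypothetical placeholders for an inverted index and a term-selection routine.

import java.util.*;

// Illustrative sketch only (not the implementation from [7]) of the first document
// expansion method: each document is run as a query, the top-ranked documents are
// examined, and the best candidate terms are appended to the original document.
// Index, search() and selectExpansionTerms() are hypothetical helpers.
class DocumentExpansion {

    interface Index {
        // Returns the identifiers of the top-k documents for the given query terms.
        List<Integer> search(List<String> queryTerms, int k);
        // Returns candidate expansion terms ranked by a term selection value (e.g. TSV).
        List<String> selectExpansionTerms(List<Integer> topRankedDocIds, int numberOfTerms);
    }

    // Expands every document at indexing time; the expanded texts are then indexed as usual.
    static List<List<String>> expandAll(List<List<String>> documents, Index index,
                                        int topDocs, int termsPerDocument) {
        List<List<String>> expanded = new ArrayList<List<String>>();
        for (List<String> documentTerms : documents) {
            // Run the document itself as a query.
            List<Integer> topRanked = index.search(documentTerms, topDocs);
            // Append the selected expansion terms to the document.
            List<String> enriched = new ArrayList<String>(documentTerms);
            enriched.addAll(index.selectExpansionTerms(topRanked, termsPerDocument));
            expanded.add(enriched);
        }
        return expanded;
    }
}

Because the expansion runs at indexing time, the retrieval-time cost of DE is marginal, which is exactly the trade-off discussed above.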

3. System Architecture

3.1 Knowledge base creation

The contextual information must be accessible through the system, so it is necessary to create a connector layer, called the data integration layer [4], which contains a set of specific software components that can read each kind of information available. From all the sources of information available and accessible through the system, a domain expert must select the contents that are good representatives of the subjects that compose the context. Context is modeled with the use of existing resources such as databases and miscellaneous files.
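As an illustration only (not the paper's code), a minimal Java sketch of such a data-integration layer is shown below; the interface and class names are hypothetical.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

// Illustrative sketch of the data-integration layer: each information source selected
// by the domain expert is wrapped by a connector that exposes its contents as
// plain-text documents for the knowledge base.
interface InformationSource {
    List<String> readDocuments() throws IOException;
}

// Connector for a directory of plain-text files (one document per .txt file).
class TextFileSource implements InformationSource {
    private final Path directory;

    TextFileSource(Path directory) {
        this.directory = directory;
    }

    public List<String> readDocuments() throws IOException {
        List<String> documents = new ArrayList<String>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(directory, "*.txt")) {
            for (Path file : files) {
                documents.add(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
            }
        }
        return documents;
    }
}

// The knowledge base is the union of the documents provided by all registered connectors.
class KnowledgeBase {
    private final List<InformationSource> sources = new ArrayList<InformationSource>();

    public void register(InformationSource source) {
        sources.add(source);
    }

    public List<String> allDocuments() throws IOException {
        List<String> documents = new ArrayList<String>();
        for (InformationSource source : sources) {
            documents.addAll(source.readDocuments());
        }
        return documents;
    }
}

A connector for a database source would implement the same interface, so the rest of the system can remain unaware of where the contextual documents come from.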


3.2 Extraction of terms

The objective of the Information Extraction module [4] is to identify the main terms of all the contextual information obtained from the Knowledge Base Configuration module, and to provide a list of terms to the search module. Two term extractions are executed: one for the most frequently used terms in the context (context extraction), and another for the specific terms of each subject identified in the context (subject extraction). In both cases, some text-preparation activities must be applied before the extraction of terms: (i) tokenization, the process of breaking a stream of text into words, phrases, symbols and other meaningful elements called tokens; (ii) the removal of stop words, a list of common or general terms that have little value in the text and must not be extracted; and (iii) stemming, the process of reducing inflected (or sometimes derived) words to their base form [4]. The extraction of the general terms of the context is done by calculating the weights of the terms and extracting the n terms with the highest weight, where w(i) is the idf (inverse document frequency) factor, computed as:

w(i) = log2((N - Nt + 0.5) / (Nt + 0.5)) [18]

where N is the total number of documents in the collection and Nt is the number of documents containing term t (document frequency).

Similarity measure: Okapi BM25
To measure the similarity of queries to documents, we use Okapi BM25 [7] in all our experiments, with constants k1 and b set to 1.2 and 0.75 respectively. We set k3 to 0, motivated by the assumption that each term in contemporary queries occurs only once.
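For illustration, a minimal Java sketch of the idf weight and the BM25 scoring described above is given below (illustrative only; class and method names are not from the paper, and with k3 = 0 the query-term-frequency component reduces to 1).

import java.util.*;

// Illustrative sketch of the idf weight from Section 3.2 and the Okapi BM25 score
// with k1 = 1.2, b = 0.75 and k3 = 0 (each query term assumed to occur once).
class Bm25Scorer {
    private final int totalDocuments;                 // N
    private final double averageDocLength;            // average document length
    private final Map<String, Integer> docFrequency;  // Nt for each term

    private static final double K1 = 1.2;
    private static final double B  = 0.75;

    Bm25Scorer(int totalDocuments, double averageDocLength, Map<String, Integer> docFrequency) {
        this.totalDocuments = totalDocuments;
        this.averageDocLength = averageDocLength;
        this.docFrequency = docFrequency;
    }

    // w(i) = log2((N - Nt + 0.5) / (Nt + 0.5))
    double idf(String term) {
        int nt = docFrequency.containsKey(term) ? docFrequency.get(term) : 0;
        return Math.log((totalDocuments - nt + 0.5) / (nt + 0.5)) / Math.log(2);
    }

    // BM25 score of a document for a query, given the document's term frequencies and length.
    double score(List<String> queryTerms, Map<String, Integer> docTermFrequency, int docLength) {
        double score = 0.0;
        for (String term : queryTerms) {
            Integer tf = docTermFrequency.get(term);
            if (tf == null) {
                continue;
            }
            double lengthNorm = K1 * ((1 - B) + B * docLength / averageDocLength);
            score += idf(term) * (tf * (K1 + 1)) / (tf + lengthNorm);
        }
        return score;
    }
}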

Term selection measures
Depending on the expansion method, different measures are used to select terms from a set of candidate terms. TSV: unless stated otherwise, the term selection value [7] is used for ranking terms:

TSVt = (ft / N)^(fr,t) * C(|R|, fr,t)

where ft is the number of documents in the collection in which term t occurs, N is the total number of documents in the collection, fr,t is the number of the |R| top-ranked documents in which term t occurs, and C(|R|, fr,t) denotes the binomial coefficient. The weight calculation is performed with the sublinear term frequency scaling formula:

wf(t,d) = 1 + log(tf(t,d)) if tf(t,d) > 0, and 0 otherwise [4]

After the calculation of the weights for all terms, the n terms with the highest weight are extracted for each identified subject. The extraction of the specific terms of each subject identified in the context requires further information extraction activities. The first step is a text segmentation routine that divides the contents of the contextual information into sentences. Text segmentation is a natural language processing activity that aims to identify subtopics within a document and define their limits.
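The following is a minimal Java sketch of these two weighting measures and of the selection of the n highest-weighted terms; it is illustrative only, and the class and method names are not from the paper.

import java.util.*;

// Illustrative sketch of the term-weighting measures from Section 3.2: the term
// selection value (TSV) for ranking candidate expansion terms, and the sublinear
// term-frequency weight used for the subject-specific terms.
class TermWeighting {

    // TSVt = (ft / N)^(fr,t) * C(|R|, fr,t)
    static double tsv(int ft, int totalDocuments, int frt, int topRankedCount) {
        return Math.pow((double) ft / totalDocuments, frt) * binomial(topRankedCount, frt);
    }

    // wf(t,d) = 1 + log(tf(t,d)) if tf(t,d) > 0, and 0 otherwise
    static double sublinearTf(int tf) {
        return tf > 0 ? 1.0 + Math.log(tf) : 0.0;
    }

    // Binomial coefficient C(n, k), computed iteratively.
    static double binomial(int n, int k) {
        double result = 1.0;
        for (int i = 1; i <= k; i++) {
            result *= (double) (n - k + i) / i;
        }
        return result;
    }

    // Selects the n candidate terms with the highest weight.
    static List<String> topTerms(Map<String, Double> weights, int n) {
        List<Map.Entry<String, Double>> entries =
                new ArrayList<Map.Entry<String, Double>>(weights.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Double>>() {
            public int compare(Map.Entry<String, Double> a, Map.Entry<String, Double> b) {
                return Double.compare(b.getValue(), a.getValue());
            }
        });
        List<String> top = new ArrayList<String>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}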

3.3 Searching the terms

The original query can be expanded in two ways. The first is automatic expansion, in which the original query is expanded n times, where n is equal to one (context expansion) plus the number of subjects that were identified in the selected context (subject expansion) [4]. Each expanded query is executed on the selected web search engine, and the results are presented to the user. The second mode of query expansion is the suggestion of terms, in which all extracted terms are presented as suggestions. The user selects the terms of his/her interest, which are then incorporated into the original query, performing the query expansion.
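A minimal Java sketch of the automatic expansion mode is given below; it is illustrative only, and the class and method names are not from the paper.

import java.util.*;

// Illustrative sketch of automatic expansion: the original query is expanded once with
// the general context terms and once per identified subject, giving n = 1 + number of
// subjects expanded queries, each of which is then submitted to the web search engine.
class QueryExpander {

    // Appends the given expansion terms to the original query, skipping duplicates.
    static String expand(String originalQuery, List<String> expansionTerms) {
        StringBuilder expanded = new StringBuilder(originalQuery);
        for (String term : expansionTerms) {
            if (!originalQuery.toLowerCase().contains(term.toLowerCase())) {
                expanded.append(' ').append(term);
            }
        }
        return expanded.toString();
    }

    // Builds all automatically expanded queries: one with the general context terms,
    // plus one per subject with that subject's specific terms.
    static List<String> automaticExpansion(String originalQuery,
                                           List<String> contextTerms,
                                           Map<String, List<String>> subjectTerms) {
        List<String> expandedQueries = new ArrayList<String>();
        expandedQueries.add(expand(originalQuery, contextTerms));
        for (List<String> terms : subjectTerms.values()) {
            expandedQueries.add(expand(originalQuery, terms));
        }
        return expandedQueries;
    }
}

In the suggestion-of-terms mode, the same expand step would be applied only to the terms the user has selected.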

Fig 2. System Architecture

4. System Implementation

4.1 Configuring the System
1. JDK 7 is installed on the machine.
2. NetBeans 7.4 is installed and the JDK is configured in NetBeans for compiling and running Java programs.
3. SQL Server 2008 R2 is installed.
4. A new blank project is created in NetBeans.
5. Environment variables are set for the SQL Server and JDK libraries (a minimal connection sketch follows below).
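For illustration only (not from the paper), the following sketch opens a JDBC connection from the NetBeans project to SQL Server 2008 R2. It assumes the Microsoft JDBC driver jar is on the project classpath; the host, database name and credentials are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Illustrative sketch of connecting the Java project to SQL Server through JDBC.
public class SqlServerConnection {

    public static Connection open() throws SQLException {
        // Placeholder host, port, database name and credentials.
        String url = "jdbc:sqlserver://localhost:1433;databaseName=ContextKB";
        return DriverManager.getConnection(url, "username", "password");
    }

    public static void main(String[] args) {
        try (Connection connection = open()) {
            System.out.println("Connected: " + !connection.isClosed());
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}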

5. Analyzing Results

In the automatic expansion mode, in all metrics, the query expansion with general terms of the context and the query expansion with specific terms of the subjects showed better results than those obtained with the original query. In the full-precision metric, an improvement was observed in 76.47% of the cases in at least one of the query expansions, while the original query showed better results in only 11.76% of the cases [4]. The same percentage improvement was observed when comparing the results of the original query with those of the query expanded with general terms of the context (64.71%) and with those of the queries expanded with specific terms of the subjects (64.71%). In the search-length and rank-correlation metrics, improvements were also observed, but in smaller proportions than for full precision. Unlike what happened with the automatic expansion, in the suggestion-of-terms mode all metrics showed that the query expansion gave worse results than those obtained with the original query. In the three considered metrics, the differences between the obtained percentages were large, reaching a difference of up to seven times for the search-length metric (in 47.06% of the cases the original query obtained better results, against 5.88% for the expanded query) [4][7].

6. Conclusion

Information extraction activities applied to existing resources in databases, archives and information systems can be used to make search results more contextualized and therefore more useful to users.

7. References

[1] Bhogal, J., Macfarlane, A., Smith, P. (2007) A Review of Ontology Based Query Expansion. Information Processing and Management: An International Journal.
[2] CAKE – Classifying, Associating and Knowledge DiscovEry – An Approach for Distributed Data Mining (DDM) Using Parallel Data Mining Agents (PADMAs). Web Intelligence and Intelligent Agent Technology, 2008. WI-IAT '08. IEEE/WIC/ACM International Conference on (Volume 3).
[3] Chien, B.C., Hu, C.H., Ju, M.Y. (2007) Intelligent Information Retrieval Applying Automatic Constructed Fuzzy Ontology. International Conference on Machine Learning and Cybernetics.
[4] Prates, J.C., Siqueira, S.W.M. Contextual Query Based on Segmentation and Clustering of Selected Documents for Acquiring Web Documents for Supporting Knowledge Management.
[5] Prates, J.C., Fritzen, E., Siqueira, S.W.M., Braz, M.H.L.B., de Andrade, L.C.V. Contextual Web Searches in Facebook Using Learning Materials and Discussion Messages.
[6] Dey, A.K., Abowd, G.D. (1999) Towards a Better Understanding of Context and Context-Awareness. International Symposium on Handheld and Ubiquitous Computing.
[7] Billerbeck, B., Zobel, J. (2005) Document Expansion versus Query Expansion for Ad-hoc Retrieval. October 10, 2005.
[8] Edelstein, H.A. (1999) Introduction to Data Mining and Knowledge Discovery (3rd ed.). Potomac, MD: Two Crows Corp.
[9] Experian Hitwise Searches Statistics (2010). Hearst, M.A. (1997) TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages. Computational Linguistics.
[10] Facebook (2012) Apps on Facebook.com. http://developers.facebook.com/docs/guides/canvas
[11] Han, J., Kamber, M. Data Mining: Concepts and Techniques. New York: Morgan Kaufmann.
[12] Prates, J.C., Siqueira, S.W.M. (2011) Contextual Query Based on Segmentation and Clustering. Seventeenth Americas Conference on Information Systems, Detroit, Michigan.
[13] Joho, H., Sanderson, M., Beaulieu, M. (2004) A Study of User Interaction with a Concept-Based Interactive Query Expansion Support Tool. European Conference on IR Research, April 2004, Springer, pp. 42-56.
[14] Kendall, M.G., Stuart, A. (1973) The Advanced Theory of Statistics, v. 2: Inference and Relationship. Griffin.
[15] Manning, C.D., Raghavan, P., Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press.
[16] Spink, A., Jansen, B.J. (2004) A Study of Web Search Trends. Webology, 1(2), Article 4. Available at: http://www.webology.ir/2004/v1n2/a4.html
[17] Tang, M.C., Sun, Y. (2003) Evaluation of Web-Based Search Engines Using User-Effort Measures. LIBRES Research Electronic Journal.
[18] Dinh, D., Tamine, L. Towards a Context Sensitive Approach to Searching Information Based on Domain Specific Knowledge Sources. Elsevier journal.
[19] Wang, C., Chang, G., Wang, X., Ma, Y., Ma, H. (2009) A User Motivation Model for Web Search Engine. International Conference on Hybrid Intelligent Systems.
