
Handbook of Research on Text and Web Mining Technologies

Min Song
New Jersey Institute of Technology, USA

Yi-fang Brook Wu
New Jersey Institute of Technology, USA

Volume I

Information science reference Hershey • New York

Director of Editorial Content: Kristin Klinger
Director of Production: Jennifer Neidig
Managing Editor: Jamie Snavely
Assistant Managing Editor: Carole Coulson
Typesetter: Chris Hrobak
Cover Design: Lisa Tosheff
Printed at: Yurchak Printing Inc.

Published in the United States of America by Information Science Reference (an imprint of IGI Global) 701 E. Chocolate Avenue, Suite 200 Hershey PA 17033 Tel: 717-533-8845 Fax: 717-533-8661 E-mail: [email protected] Web site: http://www.igi-global.com and in the United Kingdom by Information Science Reference (an imprint of IGI Global) 3 Henrietta Street Covent Garden London WC2E 8LU Tel: 44 20 7240 0856 Fax: 44 20 7379 0609 Web site: http://www.eurospanbookstore.com Copyright © 2009 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher. Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark. Library of Congress Cataloging-in-Publication Data Handbook of research on text and web mining techologies / Min Song and Yi-Fang Wu, editors. p. cm. Includes bibliographical references and index. Summary: "This handbook presents recent advances and surveys of applications in text and web mining of interests to researchers and endusers "--Provided by publisher. ISBN 978-1-59904-990-8 (hardcover) -- ISBN 978-1-59904-991-5 (ebook) 1. Data mining--Handbooks, manuals, etc. 2. Web databases--Handbooks, manuals, etc. I. Song, Min, 1969- II. Wu, Yi-Fang, 1970QA76.9.D343H43 2008 005.75'9--dc22 2008013118

British Cataloguing in Publication Data A Cataloguing in Publication record for this book is available from the British Library. All work contributed to this book set is original material. The views expressed in this book are those of the authors, but not necessarily of the publisher. If a library purchased a print copy of this publication, please go to http://www.igi-global.com/agreement for information on activating the library's complimentary electronic access to this publication.


Chapter XVII

Slicing and Dicing a Linguistic Data Cube

Jan H. Kroeze
University of Pretoria, South Africa

Theo J. D. Bothma
University of Pretoria, South Africa

Machdel C. Matthee
University of Pretoria, South Africa

Abstract

This chapter discusses the application of some data warehousing techniques to a data cube of linguistic data. The results of various modules of clausal analysis can be stored in a three-dimensional data cube in order to facilitate on-line analytical processing of data by means of three-dimensional arrays. Slicing is one such analytical technique, which reveals various dimensions of data and their relationships to other dimensions. By using this data warehousing facility the clause cube can be viewed or manipulated to reveal, for example, phrases and clauses, syntactic structures, semantic role frames, or a two-dimensional representation of a particular clause’s multi-dimensional analysis in table format. These functionalities are illustrated by means of the Hebrew text of Genesis 1:1-2:3. The authors trust that this chapter will contribute towards efficient storage and advanced processing of linguistic data.

INTRODUCTION

This chapter suggests a way in which data warehousing concepts may be used and adapted to store and view complex sets of linguistic data. After the concept of a three-dimensional data cube has been explained and illustrated as a suitable data structure to capture multi-dimensional linguistic data, slicing and dicing are discussed as a way in which various perspectives of this data can be revealed. Although these functionalities are illustrated by means of the Hebrew text of Genesis 1:1-2:3, linguistic data of text in any language could be explored and manipulated in a similar way.

BACKGROUND: USING A DATA CUBE TO INTEGRATE COMPLEX SETS OF LINGUISTIC DATA

The clauses constituting a text can be analysed linguistically in various ways depending on the chosen perspective of a specific researcher. These different analytical perspectives regarding a collection of clauses can be integrated into a paper-based medium as a series of two-dimensional tables, where each table represents one clause and its multi-dimensional analysis. This concept can be explained with a simplified grammatical paradigm and a very small micro-text consisting of only three sentences (e.g. Gen. 1:1a, 4c and 5a)1:

• Bre$it bara elohim et ha$amayim ve’et ha’arets (in the beginning God created the heaven and the earth)
• Vayavdel elohim ben ha’or uven haxo$ex (and God separated the light and the darkness)
• Vayiqra elohim la’or yom (and God called the light day)2

An interlinear multi-dimensional analysis of this text can be done as a series of tables (see Table 1).

The linguistic modules3 that are represented here were chosen only to illustrate the concept of an integrated structure of linguistic data, as well as the manipulation thereof, and should not be regarded as comprehensive. In more detailed analyses, additional layers, such as morphology, transliteration4 and pragmatics, could be added.

Although such a series of tables can be regarded as a databank if it is electronically available, the tables are not combined into a single coherent data structure and they do not allow for flexible analytical operations. Given the advanced ad hoc query possibilities that database management systems facilitate on highly structured data, the ability to perform similar operations on implicitly structured linguistic data becomes attractive. Such queries would be facilitated if all the separate tables could be combined into one complex data structure. This is an example of document processing that “needs database processing for storing and manipulating data” (Kroenke, 2004, p. 464).

The obvious suggestion for solving this problem would be to use a relational database to capture linguistic data, but there are some prohibiting factors. The structures of clauses differ considerably, and the result would be a very sparse database (containing many empty fields) if one were to create attributes for all possible syntactic and semantic fields. Even if this could work, an extra field would be needed to capture the word-order position of every phrase. Furthermore, relational database management systems are restricted to two dimensions: “The table in an RDBMS can only ever represent multi-dimensional data in two dimensions” (Connolly & Begg, 2005, p. 1209). Closer inspection of the above-mentioned two-dimensional clause tables reveals that they actually represent multi-dimensional data.
Table 1. A series of three two-dimensional tables, each containing a multi-dimensional linguistic analysis of one clause

Clause 1                Phrase 1          Phrase 2    Phrase 3           Phrase 4
Phonetic transcription  bre$it            bara        elohim             et ha$amayim ve’et ha’arets
Literal translation     in the beginning  he created  God                the heaven and the earth
Word groups             PP                VP          NP                 NP
Syntactic function      Adjunct           Main verb   Subject            Object
Semantic function       Time              Action      Agent              Product

Clause 2                Phrase 1          Phrase 2    Phrase 3           Phrase 4
Phonetic transcription  vayavdel          elohim      ben ha’or          uven haxo$ex
Literal translation     and he separated  God         between the light  and between the darkness
Word groups             VP                NP          PP                 PP
Syntactic function      Main verb         Subject     Complement         Complement
Semantic function       Action            Agent       Patient            Source

Clause 3                Phrase 1          Phrase 2    Phrase 3           Phrase 4
Phonetic transcription  vayikra           elohim      la’or              yom
Literal translation     and he called     God         to the light       day
Word groups             VP                NP          PP                 NP
Syntactic function      Main verb         Subject     IndObj             Complement
Semantic function       Action            Agent       Patient            Product

The various rows of each table do not represent separate records (as is typical of a two-dimensional relational database), but deeper modules of analysis, which are related to the data in the first row. A collection of interlinear tables is in fact a two-dimensional representation of three- (or multi-) dimensional linguistic data structures. Each table represents one two-dimensional “slice” of this three-dimensional structure, and the whole collection is a stack of these slices. This insight holds the key to solving the problem of capturing and processing this data.

If the data is essentially multi-dimensional, the ideal computerised data structure with which to capture it would be a multi-dimensional database. This type of data structure already exists and is usually employed in businesses’ data warehouses to enable multi-dimensional on-line analytical processing (MOLAP) (cf. Connolly & Begg, 2005; Ponniah, 2001). Data cubes are used to capture three-dimensional data structures and hyper cubes5 for multi-dimensional data structures. They are based on three-dimensional or multi-dimensional arrays. A data cube also provides a way in which the results of various divergent linguistic research projects may be integrated.

Before the implementation of these concepts in terms of programming is discussed, it should first be made clear how the linguistic data referred to above could indeed be regarded as three-dimensional. The knowledge that is represented by a collection of interlinear tables can be conceptualised three-dimensionally as a cube, subdivided into rows and columns on three dimensions. The sizes of these dimensions, however, do not have to be the same and will be determined by requirements of the unique data set. Each sub-cube is a data-container and can store one piece of information. The information cube therefore consists of a cluster of clauses and their analyses. The horizontal dimension is divided into rows representing the various clauses - each row being a unique record or clause. The vertical dimension is divided into columns and represents the various phrases in the clauses. The depth dimension represents the various modules of analysis, for example, phonetic rendering, literal translation, word groups, syntactic functions and semantic functions. The linguistic data captured in the two-dimensional tables of the micro-text above can thus be stored in a three-dimensional data-structure in the following way (see Figure 1):

Figure 1. A three-dimensional clause cube.

Such a clause data cube can be implemented on a computer using a three-dimensional array.6 A three-dimensional array is a stack of two-dimensional data variables.
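As an illustration only, the micro-text of Table 1 can be held in a nested list that behaves like a three-dimensional array. The program described in this chapter was written in Visual Basic 6; the Python sketch below is not that program. The index order cube[clause][phrase][module] and the constant names are assumptions made for the sketch.

```python
# Module (depth) indices, in the row order of Table 1. The names are illustrative.
PHONETIC, TRANSLATION, WORD_GROUP, SYNTAX, SEMANTICS = range(5)

# cube[clause][phrase][module]; the index order is an assumption of this sketch.
cube = [
    [["bre$it", "in the beginning", "PP", "Adjunct", "Time"],
     ["bara", "he created", "VP", "Main verb", "Action"],
     ["elohim", "God", "NP", "Subject", "Agent"],
     ["et ha$amayim ve'et ha'arets", "the heaven and the earth", "NP", "Object", "Product"]],
    [["vayavdel", "and he separated", "VP", "Main verb", "Action"],
     ["elohim", "God", "NP", "Subject", "Agent"],
     ["ben ha'or", "between the light", "PP", "Complement", "Patient"],
     ["uven haxo$ex", "and between the darkness", "PP", "Complement", "Source"]],
    [["vayikra", "and he called", "VP", "Main verb", "Action"],
     ["elohim", "God", "NP", "Subject", "Agent"],
     ["la'or", "to the light", "PP", "IndObj", "Patient"],
     ["yom", "day", "NP", "Complement", "Product"]],
]

n_clauses = len(cube)
n_phrases = len(cube[0])
n_modules = len(cube[0][0])

print((n_phrases, n_clauses, n_modules))  # (4, 3, 5)
print(cube[1][2][SEMANTICS])              # Patient (clause 2, phrase 3)
```

Each cell is addressed by three indices, so a single lookup such as cube[1][2][SEMANTICS] returns one piece of information from one sub-cube.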

Some programming languages, such as Visual Basic 6, also allow the use of multi-dimensional arrays (with four or more dimensions), which could represent a hyper cube of clauses, but because of the huge space implications for the computer’s memory7 and the difficulty of visualizing four or more dimensions, this chapter deals with three dimensions only. Since it is possible to declare the exact number of rows, columns and depth members of a three-dimensional array, enough members can be created on the depth dimension to store all modules of clause analyses.
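The effect of declaring dimension sizes, and of adding a further dimension, can be made concrete with a little arithmetic. The sketch below is a hedged Python illustration: make_cube and cell_count are hypothetical helpers, and the fourth dimension of size 10 is an arbitrary figure chosen only to show how the cell count multiplies.

```python
from math import prod


def make_cube(*dims, fill=""):
    """Return a nested list with the given dimension sizes, every cell set to fill."""
    if len(dims) == 1:
        return [fill] * dims[0]
    return [make_cube(*dims[1:], fill=fill) for _ in range(dims[0])]


def cell_count(*dims):
    """Number of cells in an array with the given dimension sizes."""
    return prod(dims)


# An empty container sized for the micro-text: 3 clauses, 4 phrases, 5 modules.
micro = make_cube(3, 4, 5)

print(cell_count(3, 4, 5))      # 60
print(cell_count(3, 4, 5, 10))  # 600, with an arbitrary fourth dimension of size 10
```

Every additional dimension multiplies the number of cells by its size, which is why the memory cost of a hyper cube grows so quickly.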

PROCESSING THE INFORMATION IN A CLAUSE CUBE

Combining repetition control structures such as nested loops with three-dimensional arrays makes it possible to process the stored information in an efficient manner. Using three- or multi-dimensional tables to represent abstract data is not only a tool to store information, but also an important intermediate step in creating computerized visualizations of this information (cf. Card et al., 1999).

Koutsoukis et al. (1999) differentiate between manipulation and viewing functions performed on multi-dimensional data. Slicing, rotating and nesting are viewing functions, while drilling-down and rolling-up are manipulation functions. A slice is a two-dimensional layer of the data and implies that the dimension that is being sliced is dropped. To rotate or pivot the cube means to reveal another perspective or view that consists of a different combination of dimensions. Some authors use “slicing-and-dicing” as one concept, while others, like Koutsoukis et al. (1999, p. 8), regard dicing as a synonym for rotation. This chapter uses dicing to indicate the retrieval of subsections of a slice of data. Nesting is “to display values from one dimension within another dimension” (Koutsoukis et al., 1999, p. 8). Drilling-down is the revelation of more detailed data, linked to a specific cell, on the deeper levels of a hierarchical dimension, while rolling-up (or drilling-up, consolidation, aggregation) refers to summarised data on the higher levels of a hierarchical dimension. In this way the three-dimensional array facilitates actions that are typical of data warehousing and on-line analytical processing (OLAP).

In this chapter rotation, slicing and dicing, as well as simple searching functions on the clause cube, will be discussed in more detail. Nesting is probably not applicable to linguistic data, and rolling-up and drilling-down can only be explained by means of hierarchical analyses, such as syntactic tree diagrams. These more complex operations, including searches on more than one parameter and fuzzy searches, as well as the ordering and filtering of the sub-arrays of the clause cube, fall outside the scope of this chapter.8
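The combination of nested loops and a three-dimensional array can be sketched as follows. This is a minimal Python illustration using the micro-text of Table 1, not the Visual Basic 6 code of the chapter; the function name interlinear and the index order cube[clause][phrase][module] are assumptions.

```python
MODULES = ["Phonetic transcription", "Literal translation", "Word groups",
           "Syntactic function", "Semantic function"]

# cube[clause][phrase][module], holding the micro-text of Table 1.
cube = [
    [["bre$it", "in the beginning", "PP", "Adjunct", "Time"],
     ["bara", "he created", "VP", "Main verb", "Action"],
     ["elohim", "God", "NP", "Subject", "Agent"],
     ["et ha$amayim ve'et ha'arets", "the heaven and the earth", "NP", "Object", "Product"]],
    [["vayavdel", "and he separated", "VP", "Main verb", "Action"],
     ["elohim", "God", "NP", "Subject", "Agent"],
     ["ben ha'or", "between the light", "PP", "Complement", "Patient"],
     ["uven haxo$ex", "and between the darkness", "PP", "Complement", "Source"]],
    [["vayikra", "and he called", "VP", "Main verb", "Action"],
     ["elohim", "God", "NP", "Subject", "Agent"],
     ["la'or", "to the light", "PP", "IndObj", "Patient"],
     ["yom", "day", "NP", "Complement", "Product"]],
]


def interlinear(cube):
    """Visit every cell with three nested loops; return one line per clause and module."""
    lines = []
    for c, clause in enumerate(cube):          # clause dimension
        for m, module in enumerate(MODULES):   # module (depth) dimension
            cells = []
            for phrase in clause:              # phrase dimension
                cells.append(phrase[m])
            lines.append(f"Clause {c + 1} | {module}: " + " | ".join(cells))
    return lines


for line in interlinear(cube)[:5]:   # the five module rows of the first clause
    print(line)
```

The order of the loops determines the order of presentation; nesting them differently yields the same cells grouped along another dimension.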

Rotation

Rotation can be regarded as a computerized version of the human ability to reflect on problem domains from various perspectives. “Different external views can be achieved … by applying rotational transformations to a multi-dimensional array” (Glasgow & Malton, 1994, p. 24). Viewing the clause cube from the front reveals the phonetic representation of the individual clauses of the text. Retrieving these elements may be used to display the phonetic rendering of the textual corpus. If the cube is rotated to show the top side, the first clause’s multi-modular analysis is revealed. The upside down order is due to the structure and rotation of the cube. A more logical order can be obtained by dicing the separate nuggets of information by means of array processing and presenting them in the required order, or by slicing the cube from the bottom (see below). Similarly, rotating the cube to display the original bottom side as the front side will reveal the last clause’s multi-modular analysis, presenting the information in an expected, logical order. Looking at the original right side of the cube, however, does not reveal any meaningful perspective (unless the researcher wants to focus, for some reason, on the last constituent of each clause, for example in a study on word order). The original left side is similar, but reveals data about the first element of each clause. This information could be used for studies in pragmatics on fronting of clausal elements serving as a topic or focus. The original backside is again very meaningful, from a semantic perspective, because it reveals the combinations of semantic functions per clause. This information can be used in a study on semantic frameworks, for example, to construct an ontological dictionary such as WordNet or WordNet++ (cf. Dehne et al., 2000), and to create a conceptual data model by the COLOR-X method (cf. Dehne et al., 2001).
Rotating the cube from its original position in a clockwise manner towards the original backside reveals the semantic role frameworks of the clauses, with the hind part foremost, however. Rotating it head over heels toward the original backside reveals the same information but in a different, upside down, order. The correct order can be revealed by slicing (see below).
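The outer faces described in this section can be approximated in code as projections of the array. The following Python sketch uses the micro-text of Table 1 and assumes the index order cube[clause][phrase][module]; the function names are hypothetical. Except for the last line, it ignores the mirrored or upside down ordering that a physical rotation would produce.

```python
# cube[clause][phrase][module], holding the micro-text of Table 1.
cube = [
    [["bre$it", "in the beginning", "PP", "Adjunct", "Time"],
     ["bara", "he created", "VP", "Main verb", "Action"],
     ["elohim", "God", "NP", "Subject", "Agent"],
     ["et ha$amayim ve'et ha'arets", "the heaven and the earth", "NP", "Object", "Product"]],
    [["vayavdel", "and he separated", "VP", "Main verb", "Action"],
     ["elohim", "God", "NP", "Subject", "Agent"],
     ["ben ha'or", "between the light", "PP", "Complement", "Patient"],
     ["uven haxo$ex", "and between the darkness", "PP", "Complement", "Source"]],
    [["vayikra", "and he called", "VP", "Main verb", "Action"],
     ["elohim", "God", "NP", "Subject", "Agent"],
     ["la'or", "to the light", "PP", "IndObj", "Patient"],
     ["yom", "day", "NP", "Complement", "Product"]],
]


def front(cube):
    """First module of every clause: the phonetic text."""
    return [[phrase[0] for phrase in clause] for clause in cube]


def back(cube):
    """Last module of every clause: the semantic role frameworks."""
    return [[phrase[-1] for phrase in clause] for clause in cube]


def top(cube):
    """First clause, one row per module."""
    return [list(row) for row in zip(*cube[0])]


def bottom(cube):
    """Last clause, one row per module."""
    return [list(row) for row in zip(*cube[-1])]


def left(cube):
    """First phrase of every clause, all modules."""
    return [clause[0] for clause in cube]


print(front(cube)[0])
print(back(cube)[1])                         # ['Action', 'Agent', 'Patient', 'Source']
print([frame[::-1] for frame in back(cube)][0])  # first frame with the hind part foremost
```

Because each view is only a different pairing of indices, no data is moved; the same cells are simply read in another order.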


Slicing

Rotation is a relatively easy way to demonstrate the various perspectives that a researcher can glean from a multi-dimensional data set. However, the discussion above illustrates the fact that rotation can be confusing because the ordering of constituents differs due to the fact that top can become bottom, left can become right, et cetera, depending on the manner in which the cube is spun. Slicing is better in this regard because a meaningful easy-to-understand plane can be chosen and all the records can be viewed in the same order. The clause cube shown in Figure 1 could, for example, be sliced from the top to show the three clauses’ multi-modular analyses, which brings us back to where we started, namely the two-dimensional representation9 of multi-modular clausal data (although in a different order of presentation when left in the default data cube ordering). A slice is a “two-dimensional plane of the cube” (Ponniah, 2001, p. 362). The designer of the graphical interface for the output of a query actually has the freedom to place data elements wherever they will appear in a most user-friendly way. They do not have to be displayed in a fixed and rigid order that represents their position in the data cube. It would be very easy to change the order of the rows in these slices to a more user-friendly version of the display to show the phonetic rendering in the top row and the semantic functions in the bottom row. As indicated above, this option is only one of many possibilities offered by the clause cube. Another advantage of slicing is that it can reveal the elements inside the cube that cannot be seen by rotating it (like the hidden multi-dimensional analysis of the second clause in Figure 1). In larger cubes containing hundreds or thousands of clausal analyses, a large number of constituents will be hidden inside the cube. The more members each dimension has, the more data will be out of direct sight.
Slicing can also be used to reveal a specific, required perspective that is hidden inside the cube. Say, for example, a researcher wants to see all the syntactic frameworks of the micro-text. Even in the simple 4x3x5 cube of Figure 1 this perspective cannot be acquired by looking at the six outer sides of the cube. One can only see the syntactic frameworks of the first and last clauses, which would not be satisfactory had the clause cube contained many clauses. However, this perspective can be obtained by slicing off the first three planes from the front side and looking at the fourth layer to reveal the syntactic frameworks of all the clauses in the cube. Similarly, slicing off four layers from the front will reveal the semantic function frameworks of all clauses in the cube. Slicing off the first two layers will reveal all the combinations of word groups, which may be relevant for a morpho-syntactic study. Slicing off the first layer only, reveals the literal translation of the text. It should already be clear by now that a multi-dimensional data structure provides much more versatility in data viewing and manipulation functions than mere two-dimensional tables. Slicing is not only more flexible and satisfactory than rotating, but is also closer to the manner in which a computer processes a three-dimensional array. There is, of course, not a real cube that can be rotated inside the computer’s memory,10 but there are millions of memory spaces that can be numbered and filled and called up in any required order. Any slice can be acquired relatively easily by using a repetition control structure (for-loop) containing the specific number that represents the required slice as a constant index in the array reference.11 Slicing can also be used as an option to rotation: slicing off and viewing the external layer on every side of the cube is the equivalent of rotating the cube. 
This is exactly how “rotation” is implemented in a three-dimensional array in the computer’s memory. Valuable slicing options in this problem space are slicing the cube from the front to reveal the Hebrew text (phonetically), literal translation, word group combinations, syntactic frameworks and semantic frameworks; and slicing the cube from the top to reveal multi-modular analyses of subsequent clauses. Slicing from the sides may be valuable in studies on word order and pragmatics.
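A slice obtained by holding one index constant inside a loop can be sketched as follows. The Python below is an illustration under assumptions (index order cube[clause][phrase][module], hypothetical function names) and uses the micro-text of Table 1.

```python
SYNTAX, SEMANTICS = 3, 4   # module indices in the row order of Table 1

# cube[clause][phrase][module], holding the micro-text of Table 1.
cube = [
    [["bre$it", "in the beginning", "PP", "Adjunct", "Time"],
     ["bara", "he created", "VP", "Main verb", "Action"],
     ["elohim", "God", "NP", "Subject", "Agent"],
     ["et ha$amayim ve'et ha'arets", "the heaven and the earth", "NP", "Object", "Product"]],
    [["vayavdel", "and he separated", "VP", "Main verb", "Action"],
     ["elohim", "God", "NP", "Subject", "Agent"],
     ["ben ha'or", "between the light", "PP", "Complement", "Patient"],
     ["uven haxo$ex", "and between the darkness", "PP", "Complement", "Source"]],
    [["vayikra", "and he called", "VP", "Main verb", "Action"],
     ["elohim", "God", "NP", "Subject", "Agent"],
     ["la'or", "to the light", "PP", "IndObj", "Patient"],
     ["yom", "day", "NP", "Complement", "Product"]],
]


def slice_module(cube, m):
    """Hold the module index m constant and loop over clauses and phrases."""
    plane = []
    for clause in cube:
        row = []
        for phrase in clause:
            row.append(phrase[m])   # m is the constant index
        plane.append(row)
    return plane


def slice_clause(cube, c):
    """Hold the clause index c constant; rows are modules, columns are phrases."""
    n_modules = len(cube[c][0])
    return [[phrase[m] for phrase in cube[c]] for m in range(n_modules)]


# Syntactic frameworks of all clauses: the fourth layer from the front.
for frame in slice_module(cube, SYNTAX):
    print(frame)

# The "hidden" second clause, as an interlinear table.
for row in slice_clause(cube, 1):
    print(row)
```

slice_module(cube, SYNTAX) corresponds to slicing off the first three layers from the front, and slice_clause(cube, 1) retrieves the analysis of the second clause, which no outer face of the cube shows.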

Dicing

In this chapter the term dicing is used to indicate the subdivision of data slices into smaller pieces. Dicing can be used to retrieve very specific required data. One could, for example, retrieve only syntactic functions and their related semantic functions in order to study the mapping of these linguistic modules. In the micro-text above one would discover that the semantic function of patient may either be mapped on the syntactic function of complement or indirect object. Dicing may also be used to reorder a set of related data into a logical order on the user interface in order to present user-friendly information. In fact, slicing is actually also acquired by means of iterative sets of dicing. Dicing requires knowledge of the structure of the data cube (implemented as a three-dimensional array).
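The mapping of semantic functions onto syntactic functions mentioned above can be retrieved with a small dicing routine. The sketch below is an assumption-laden Python illustration (index order cube[clause][phrase][module], hypothetical function name role_mapping) applied to the micro-text of Table 1.

```python
from collections import defaultdict

SYNTAX, SEMANTICS = 3, 4   # module indices in the row order of Table 1

# cube[clause][phrase][module], holding the micro-text of Table 1.
cube = [
    [["bre$it", "in the beginning", "PP", "Adjunct", "Time"],
     ["bara", "he created", "VP", "Main verb", "Action"],
     ["elohim", "God", "NP", "Subject", "Agent"],
     ["et ha$amayim ve'et ha'arets", "the heaven and the earth", "NP", "Object", "Product"]],
    [["vayavdel", "and he separated", "VP", "Main verb", "Action"],
     ["elohim", "God", "NP", "Subject", "Agent"],
     ["ben ha'or", "between the light", "PP", "Complement", "Patient"],
     ["uven haxo$ex", "and between the darkness", "PP", "Complement", "Source"]],
    [["vayikra", "and he called", "VP", "Main verb", "Action"],
     ["elohim", "God", "NP", "Subject", "Agent"],
     ["la'or", "to the light", "PP", "IndObj", "Patient"],
     ["yom", "day", "NP", "Complement", "Product"]],
]


def role_mapping(cube):
    """Dice out the (semantic, syntactic) cell pairs and group them by semantic function."""
    mapping = defaultdict(set)
    for clause in cube:
        for phrase in clause:
            mapping[phrase[SEMANTICS]].add(phrase[SYNTAX])
    return mapping


mapping = role_mapping(cube)
print(sorted(mapping["Patient"]))   # ['Complement', 'IndObj']
print(sorted(mapping["Agent"]))     # ['Subject']
```

Only two of the five modules are read for each phrase, which is what distinguishes this dice from a full slice.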

Searching

Simple search functions can be used to look up clauses or phrases. If a specific clause’s array index (which acts as a primary key) is known, one can use it to search for the clause. One can also search for examples of specific elements, such as rare syntactic or semantic functions. When a function has to search through the whole multi-dimensional array to find all possible examples, execution of the program must be paused after each hit to allow the user to study a relevant example before moving on to the next one.
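One way to pause after each hit is to let the search hand back one result at a time. The Python sketch below does this with a generator; it is an illustration rather than the chapter's Visual Basic 6 implementation, and the function name find and the index order cube[clause][phrase][module] are assumptions.

```python
SYNTAX, SEMANTICS = 3, 4   # module indices in the row order of Table 1

# cube[clause][phrase][module], holding the micro-text of Table 1.
cube = [
    [["bre$it", "in the beginning", "PP", "Adjunct", "Time"],
     ["bara", "he created", "VP", "Main verb", "Action"],
     ["elohim", "God", "NP", "Subject", "Agent"],
     ["et ha$amayim ve'et ha'arets", "the heaven and the earth", "NP", "Object", "Product"]],
    [["vayavdel", "and he separated", "VP", "Main verb", "Action"],
     ["elohim", "God", "NP", "Subject", "Agent"],
     ["ben ha'or", "between the light", "PP", "Complement", "Patient"],
     ["uven haxo$ex", "and between the darkness", "PP", "Complement", "Source"]],
    [["vayikra", "and he called", "VP", "Main verb", "Action"],
     ["elohim", "God", "NP", "Subject", "Agent"],
     ["la'or", "to the light", "PP", "IndObj", "Patient"],
     ["yom", "day", "NP", "Complement", "Product"]],
]


def find(cube, module, value):
    """Yield (clause number, phrase number), counted from 1, for each matching cell."""
    for c, clause in enumerate(cube, start=1):
        for p, phrase in enumerate(clause, start=1):
            if phrase[module] == value:
                yield c, p   # execution is suspended here until the next hit is requested


hits = find(cube, SYNTAX, "Complement")
print(next(hits))   # (2, 3)
print(next(hits))   # (2, 4)
print(list(find(cube, SEMANTICS, "Source")))   # [(2, 4)]
```

Direct access by a known index needs no search at all: cube[1] returns the second clause immediately.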

APPLICATION: SLICING AND DICING A CLAUSE CUBE OF GENESIS 1:1-2:3

The principles discussed above were applied to the Hebrew text of Genesis 1:1-2:3. The program was created in Visual Basic 6 (VB6). The database was included in the program and consists of a clause cube comprising the analyses of all clauses containing a main verb in Genesis 1:1-2:3. The linguistic modules that were analysed are:

• Phonetic transcription of phrases12
• Literal translation of phrases
• Identification of phrase types
• Syntactic functions
• Semantic functions (based on Dik, 1997a, 1997b).

These analyses were done by one of the authors, based on his personalised and tacit knowledge of Biblical Hebrew. Not everybody will necessarily agree with these categories and analyses; however, the analysis itself is not the main focus of this chapter. The primary goal is to illustrate how existing, integrated knowledge can be retrieved in various informative ways.


Embedded clauses have been indicated as a unit in the main clause and separately analysed in a subsequent row.13 Embedded phrases containing an infinitive or participle have not been analysed in more detail. The size of the vertical dimension was set to 108 to allow space for the analyses of all the clauses in the corpus. The size of the horizontal dimension had to be enlarged to five to facilitate the analysis of a clause with five phrases in the rest of the data set. No clause in the data set had more than five phrases. The size of the depth dimension was set to six to make provision for all five linguistic modules, as well as additional space to capture the unique verse number of each clause, e.g. Gen01v01a, as a user-friendly primary key.

The viewing and manipulation processes performed on the Genesis 1 clause cube reveal that it is not only possible to view the stored data in a typical interlinear manner, but that any meaningful perspective on the data can be acquired relatively easily. Once the data have been captured in a data structure that represents its natural multi-dimensionality,14 various queries can be answered by using array-processing functions.15

A few examples (screen shots) of the perspectives that are facilitated by slicing the Genesis 1 clause cube are shown below (see Figures 2-4). Figure 2 shows the interface that the user may use to scroll forward or backward through the stack of two-dimensional analyses to study any clause’s multi-modular analysis. If the clause number is known, it can be used to display that clause directly. The verse number can also be used to access data directly. The clause cube can be searched on a specific parameter. The “Scroll through slice of syntactic frameworks” button, shown in Figure 3, is used to scroll through the slice that reveals the syntactic structures of all 108 clauses in the cube (six per screen). After observing six structures on a screen, the user may press the same button again to view the next six structures. The “Scroll through slice of semantic frameworks” button is used to scroll through the slice that reveals the combinations of semantic functions in all 108 clauses in the cube (six per screen) (see Figure 4). Clicking on the same button again reveals the next six frameworks until one reaches the end of the databank.
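The dimension sizes and the scrolling behaviour described above can be sketched in a few lines. The following Python is an illustration only: the helper names are hypothetical, the frameworks shown are those of the three micro-text clauses of Table 1 rather than the full corpus, and the labels Gen01v04c and Gen01v05a are formed by analogy with the Gen01v01a example and are therefore assumptions.

```python
def pad_clause(phrases, width=5, empty=""):
    """Pad a clause's list of cells with empty cells up to the fixed dimension size."""
    return phrases + [empty] * (width - len(phrases))


def screens(rows, per_screen=6):
    """Yield successive groups of rows, one group per screen."""
    for start in range(0, len(rows), per_screen):
        yield rows[start:start + per_screen]


# Verse labels act as user-friendly keys; the second and third are assumed forms.
labels = ["Gen01v01a", "Gen01v04c", "Gen01v05a"]
frames = [
    ["Adjunct", "Main verb", "Subject", "Object"],
    ["Main verb", "Subject", "Complement", "Complement"],
    ["Main verb", "Subject", "IndObj", "Complement"],
]

padded = [pad_clause(frame) for frame in frames]
row_of = {label: i for i, label in enumerate(labels)}   # verse label -> clause index

print(padded[row_of["Gen01v05a"]])
print(108 * 5 * 6)                            # 3240 cells in a 108 x 5 x 6 cube
print(len(list(screens(list(range(108))))))   # 18 screens of six clauses each
```

Padding every clause to five cells is what produces the empty elements (sparsity) that the fixed dimension sizes imply.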

Figure 2. A slice of the Genesis 1:1-2:3 clause cube that reveals the multi-modular analysis of Gen. 1:17a-18a.

Figure 3. The interface that is used to reveal the syntactic structures of all 108 clauses in the cube (six per screen).

Figure 4. A screen shot of semantic role frameworks extracted from the clause cube.

CONCLUSION

A multi-dimensional clause cube can facilitate the linguistic analysis with which any exegetical process should commence, which in turn can benefit a multi-dimensional approach to biblical exegesis (cf. Van der Merwe, 2002). It also facilitates a format in which the biblical text is processed for readers, that is “succinctly enough to be handled by the short-term memory”, thus enhancing the success of the communication process (ibid., p. 94). The Genesis 1:1-2:3 clause cube illustrated that linguistic data stored in a data cube can be viewed and manipulated with multi-dimensional array processing to answer a vast number of queries about the data and relationships between data on various linguistic levels. This implies that linguistic data have been transformed into information, which can again be used to facilitate knowledge acquisition and sharing. Using a small text corpus, this experiment demonstrated how data warehousing technology may be adapted and applied to linguistic data scenarios. In future work computational linguists should find a way to integrate data from existing linguistic databases into consolidated data marts or data warehouses. More advanced processing on these databanks should also be researched, as well as the visualisation of linguistic patterns hidden within such clause cubes.

REFERENCES

Card, S. K., Mackinlay, J. D., & Shneiderman, B. (1999). Readings in information visualization: Using vision to think. San Francisco, CA: Morgan Kaufmann.

Connolly, T. M., & Begg, C. E. (2005). Database systems: A practical approach to design, implementation, and management, 4th ed. Essex: Pearson/Addison Wesley.

Dehne, F., Steuten, A., & Van de Riet, A. P. (2000). Linguistic and graphical tools for the COLOR-X-Method (Rapportnr IR-482). Amsterdam, Netherlands: Vrije Universiteit.

Dehne, F., Steuten, A., & Van de Riet, A. P. (2001). WordNet++: A lexicon for the COLOR-X-Method. Data and Knowledge Engineering, 38, 3-29.

Dik, S. C. (1997a). The theory of Functional Grammar, Part 1, The structure of the clause, 2nd ed. (edited by Kees Hengeveld). Berlin: Mouton de Gruyter.

Dik, S. C. (1997b). The theory of Functional Grammar, Part 2, Complex and derived constructions (edited by Kees Hengeveld). Berlin: Mouton de Gruyter.

Glasgow, J., & Malton, A. (1994). A semantics for model-based spatial reasoning (Tech. Rep. No. 1994-360). Kingston, Ontario: Queen’s University, Department of Computing and Information Science. Retrieved December 6, 2007, from www.cs.queensu.ca/TechReports/Reports/1994-360.pdf

Groves, J. A. (1989). On computers and Hebrew Morphology. In E. Talstra (Ed.), Computer assisted analysis of Biblical texts: Papers read at the workshop on occasion of the tenth anniversary of the “Werkgroep Informatica”, Faculty of Theology, Vrije Universiteit, Amsterdam, November, 5-6, 1987 (pp. 45-86). Amsterdam: Free University Press.

Koutsoukis, N. S., Mitra, G., & Lucas, C. (1999). Adapting on-line analytical processing for decision modelling: the interaction of information and decision technologies. Decision Support Systems, 26, 1-30.

Kroenke, D. M. (2004). Database processing: Fundamentals, design and implementation, 9th ed. Upper Saddle River, NJ: Pearson.

Kroenke, D. M. (2005). Database concepts, 2nd ed. Upper Saddle River, NJ: Pearson.

Kroeze, J. H. (2004). Towards a multidimensional linguistic database of Biblical Hebrew clauses. Journal of Northwest Semitic Languages, 30(2), 99-120.

Kroeze, J. H. (2006, June). Building and displaying a Biblical Hebrew linguistics data cube using XML. Paper presented at the Israeli Seminar on Computational Linguistics (ISCOL), Haifa, Israel. Retrieved December 5, 2007, from http://mila.cs.technion.ac.il/english/events/ISCOL2006/

Kroeze, J. H. (2007a). Round-tripping Biblical Hebrew linguistic data. In M. Khosrow-Pour (Ed.), Proceedings of 2007 Information Resources Management Association, International Conference, Vancouver, British Columbia, Canada, May 19-23, 2007. Managing worldwide operations and communications with information technology (pp. 1010-1012). Hershey, PA: IGI Publishing.

Kroeze, J. H. (2007b). A computer-assisted exploration of the semantic role frameworks in Genesis 1:1-2:3. Journal of Northwest Semitic Languages (JNSL), 33(1), 55-76.

Ponniah, P. (2001). Data warehousing fundamentals: A comprehensive guide for IT professionals. New York, NY: John Wiley.

Rob, P., & Coronel, C. (2004). Database systems: Design, implementation, and management, 6th ed. Boston, MA: Course Technology.

Van der Merwe, C. H. J. (2002). The Bible and hypertext technology: Challenges for maximizing the use of a new type of technology in Biblical studies. Journal of Northwest Semitic Languages, 28, 87-102.


KEY TERMS

Clause Cube: A clause cube is a three-dimensional data structure that integrates related linguistic data from various language modules.

Data Cube: A data cube is a multi-dimensional data structure that integrates related data.

Data Warehousing: Data warehousing is the collection, cleaning and reformatting of large amounts of existing data into complex, multi-dimensional data structures, in order to facilitate data mining and exploration (finding patterns and trends hidden within the data). In a data warehouse of linguistic data (e.g. a clause cube) the analyses of various language modules are consolidated in one data structure in order to facilitate the exploration of patterns in and across the interrelated levels.

Dicing: Dicing may be used as a synonym for rotation, but with reference to a clause cube it refers to the extraction of detailed data “hidden” within the cube.

OLAP: On-line analytical processing refers to interactive computer processing to analyse data that has been stored in a database or data warehouse to reveal different and multi-dimensional views (Connolly & Begg, 2005, p. 1205). With reference to a clause cube it pertains to the advanced processing and comparison of linguistic data collected from various language modules.

Rotation: Rotation refers to the presentation of various sides or views of a data cube. With reference to a clause cube it refers to the data shown on the “external” sides of the data structure.

Slicing: Slicing refers to the extraction of a subset of data stored in a three-dimensional data cube. With reference to a clause cube a slice may refer, for example, to one clause’s multi-modular analysis represented as a two-dimensional table.

Endnotes

1 Examples from the Hebrew Bible are used because this article forms part of a series of related articles (see Kroeze, 2004, 2006, 2007a, 2007b). These three clauses were chosen because all of them have four phrases and because they represent different syntactic structures. Many of the other clauses have less than four phrases, which would imply empty cells. Only one clause in Gen. 1:1-2:3 has five phrases.

2 See below for a discussion of the phonetic transcription used to render the Hebrew text.

3 Cf. Van der Merwe (2002). The term module is preferred here to refer to the different layers of linguistic analysis, because level is used in data cube terminology to refer to the members of a hierarchical dimension (cf. Ponniah, 2001).

4 A transliteration is a precise rendering of text written in one alphabet by means of another alphabet. The transcription given in this chapter is a phonetic rendering, which cannot be used to mechanically reconstruct the Hebrew text.

5 Cf. Kroenke (2004).

6 Compare Kroeze (2004) for a detailed discussion on the design and implementation of a clause cube using a three-dimensional array, and Kroeze (2006, 2007a) for a discussion of the use of XML for permanent storage of a clause cube.

7 “As the number of dimensions increases, the number of the cube’s cells increases exponentially” (Connolly & Begg, 2005, p. 1209).

8 Compare Kroeze (2007b) for a discussion of semantic role frameworks extracted from the clause cube as one example of advanced processing that may be facilitated by the clausal data cube and array technology.

9 Cf. Kroenke (2005), who discusses two-dimensional projections of three dimensions of student data.

10 A cube is a “conceptual representation of multidimensional data .... A MOLAP system stores data in an MDBMS, using propriety matrix and array technology to simulate this multidimensional cube” (Rob & Coronel, 2004, p. 587).

11 “[T]he dimension(s) that are held constant in a cube are called slices” (Kroenke, 2004, p. 554).

12 It should be possible to use Hebrew characters by means of Unicode because Visual Basic 6 uses Unicode to represent character strings. A phonetic transcription, however, makes this study more accessible for a wider audience. The same ideas could be applied in any language, and knowledge of Hebrew writing should not be a prerequisite for participating in the academic debate on the validity of this concept. An alternative could be to use the Westminster or Michigan-Claremont transliteration (see Groves, 1989).

13 Instead, a fourth dimension could have been used to capture and represent data of embedded clauses. However, it has been decided to view them as separate clauses, in order to keep the conceptualisation simpler and to minimise sparsity (empty elements in the multi-dimensional array).

14 “Multi-dimensional structures are best visualized as cubes of data, and cubes within cubes of data” (Connolly & Begg, 2005, p. 1209).

15 A well-designed data cube “obviates the need for multi-table joins and provides quick and direct access to arrays of data, thus significantly speeding up execution of multi-dimensional queries” (Connolly & Begg, 2005, p. 1211).
