RESEARCH PROPOSAL

Multimodal Co-located Collaboration

By Edward Tse

THE UNIVERSITY OF CALGARY
CALGARY, ALBERTA
JANUARY 2006


Abstract

This research is concerned with the design and development of technologies to support multimodal co-located collaboration, a largely unexplored area in Human-Computer Interaction. People naturally perform multimodal interactions in everyday real world settings, as they collaborate over visual surfaces for both mundane and critical tasks. For example, ethnographic studies of military command and control, air traffic control, airplane flight decks and underground subway routing have shown that team members often use multiple hands, gestures and speech simultaneously in their communications and interactions. While a new generation of research technologies now supports co-located collaboration, these technologies do not yet directly leverage such rich multimodal interaction. Even though a rich behavioural foundation is emerging that informs the design of co-located collaborative technologies, most systems still typically limit input to a single finger or pointer. While a few now consider richer touch input such as gestural interaction, speech is ignored. This problem is partly caused by the fact that single point interaction is the only input easily accessible to researchers through existing toolkits and input APIs. Thus, researchers would have to “reinvent the wheel” to achieve rich gesture and speech interactions for multiple people. Finally, co-located collaborative systems designers do not have a corpus of systematic guidelines to inform the use of rich multimodal interaction over a large shared display.

In this research, I will distil existing theories, models and ethnographic studies on co-present collaboration into behavioural foundations that describe the individual and group benefits of using gesture and speech multimodal input in a large display co-located setting. Next, I will develop a toolkit that will facilitate rapid prototyping of multimodal co-located applications over large digital displays. Finally, using these applications, I will conduct a number of studies exploring design in a multimodal setting. I will use study results to validate or refute my design premises and to refine the guidelines for designers of future multimodal co-located systems. Anticipated contributions are: a distillation of behavioural foundations outlining the individual and group benefits of multimodal interaction in a co-located setting, an input toolkit allowing researchers to rapidly explore rich multimodal interactions in a co-located environment, and the development and evaluation of several multimodal co-located applications built atop commercial applications and from the ground up. These evaluations will be used to form and/or validate a set of design implications.


1 Research Proposal

This research is concerned with the design and development of information technologies to support multimodal co-located collaboration over large wall and table displays. By multimodal input, I mean interaction using rich hand gestures and speech. By co-located collaboration, I mean small groups of two to four people working together. By large display interaction, I mean display technologies designed to be viewed by multiple people (e.g., projectors, plasma displays).

Consider everyday life. Co-located collaborators often work on artefacts placed atop physical tabletops, such as maps containing rich geospatial information. Their work is very nuanced, where people use gestures and speech in subtle ways as they interact with artefacts on the table and communicate with one another. With the advent of large multi-touch surfaces, researchers are now applying knowledge of co-located tabletop interaction to create appropriate technical innovations in digital table design. My research focus is on advancing our understanding of multimodal co-located interaction, specifically on the feasibility and potential benefits and problems of multimodal co-located input. My motivation for this thesis can be summarized as follows:

1. Co-located collaborators can leverage the power of digital displays for saving and distributing annotations, for receiving real-time updates, and for exploring and updating large amounts of data in real time.
2. Multimodal interaction allows people to interact with a digital surface using the same hand gestures and speech utterances that they use in the physical environment.
3. An important side effect of multimodal interaction is that it provides improved awareness to people working together in a co-located environment.

To investigate this thesis, my research will first examine the work practices and findings reported in various ethnographic studies of safety critical environments, e.g., air traffic control, military command and control, underground subway traffic management and hospital emergency rooms. I will examine what types of speech and gesture interactions are used, how they are performed (e.g., simultaneously vs. sequentially), how conflicts are handled (e.g., turn taking protocols), and how these natural activities can be supported by technology. Ultimately, this research will involve investigating and bridging a range of perspectives: human-computer interaction, human factors, social factors/psychology, cognitive psychology and technological applications.

My research addresses the following fundamental limitations now found in the co-located setting:

1. Traditional desktop computers are unsatisfying for highly collaborative situations involving multiple co-located people exploring and problem-solving over rich digital spatial information.
2. Even if a large high resolution display is available, one person’s standard window/icon/mouse interaction – optimized for small screens and individual performance – becomes awkward and hard to see and comprehend by others involved in the collaboration.
3. Ethnographic studies illustrate how the ‘single user’ assumptions inherent in current large display input devices limit collaborators who are accustomed to using multiple fingers and two-handed gestures, often in concert with speech.

In this research proposal, I first set the scene by briefly summarizing existing research on multimodal and co-located collaboration through several ethnographic studies and technology implementations. Second, I describe the context of this research as it relates to the field of human-computer interaction. Third, I outline the specific research problems that I will investigate and a corresponding list of objectives that I will address.

I then conclude with a discussion of my progress so far, and the anticipated significance of this research.

1.1 Terminology

This section clarifies terms and phrases used in this research proposal to avoid ambiguity.

Interaction: When I use the term interaction, I am referring particularly to the actions people use to communicate and collaborate with each other in small groups of two to five people.

Co-Located Interaction: These are interactions that occur in the same enclosed space over a plurality of digital wall and table surfaces.

Multimodal Interaction: This describes the explicit natural actions performed by people working together in a co-located environment. Examples include hand and arm gestures, speech acts, eye gaze, and torso orientation.


Multimodal Input: This term specifically refers to a computing system being aware of and responding to the natural multimodal interactions of multiple people through any type of sensing technology.

Postures: Postures are the natural hand and arm configurations that can be observed at any instant in time. For example, a fist is a hand posture that can be recognized at some moment in time.

Gestures: While most gesture recognition engines focus on supporting the complex movements of a single point, gestures in this thesis represent the simple movement of postures. The focus is on simple affine transformations of postures (scale, rotation, translation, and being placed on or lifted from a digital surface). More complex gesture movements can be achieved by creating sequences of simple gestures (a small illustrative sketch follows this list of terms).

Speech: This term specifically refers to the natural verbal communication that occurs in small group interaction.

Gaze: The locative references that are not covered by hand and arm gestures. Examples include eye gaze, torso orientation and head movements.
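To make the posture/gesture terminology concrete, the following is a minimal sketch (written in Python purely for illustration; the type names are my own assumptions and do not belong to any existing toolkit) of one plausible way to represent a posture observed at an instant, the simple affine motions of that posture, and a more complex gesture expressed as a sequence of simple ones.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Tuple


class Posture(Enum):
    """Hand/arm configurations observable at an instant (hypothetical set)."""
    ONE_FINGER = auto()
    TWO_FINGERS = auto()
    FIVE_FINGERS = auto()
    FIST = auto()
    FLAT_HAND = auto()
    WHOLE_ARM = auto()


class Motion(Enum):
    """Simple affine changes of a posture between samples."""
    PLACED = auto()      # posture newly placed on the surface
    LIFTED = auto()      # posture lifted from the surface
    TRANSLATED = auto()  # posture moved across the surface
    ROTATED = auto()     # posture rotated in place
    SCALED = auto()      # contacts spread apart or pinched together


@dataclass
class PostureSample:
    """One observation of a posture at a moment in time."""
    posture: Posture
    centroid: Tuple[float, float]  # surface coordinates
    angle: float                   # orientation in degrees
    extent: float                  # rough size of the contact
    time: float                    # timestamp in seconds


@dataclass
class SimpleGesture:
    """A simple gesture: one posture plus one affine motion."""
    posture: Posture
    motion: Motion


# A more complex gesture can be described as a sequence of simple ones,
# e.g. "place five fingers, then rotate them" might be written as:
rotate_map = [
    SimpleGesture(Posture.FIVE_FINGERS, Motion.PLACED),
    SimpleGesture(Posture.FIVE_FINGERS, Motion.ROTATED),
]

if __name__ == "__main__":
    for step in rotate_map:
        print(step.posture.name, step.motion.name)
```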

1.1.1 What This Thesis is Not Exploring

The focus of this thesis is on the interaction issues of multimodal input and not on multimodal output. While it is likely that I will use both visual displays and audio as output, I will not explore the use of haptic force-feedback or olfactory output devices. This thesis is about the interaction possibilities of multimodal input rather than the development of robust commercial recognition systems. Thus, whenever possible I will use existing recognition systems rather than trying to build and refine my own recognition engine from the ground up. For example, I will likely use an existing speech recognition engine rather than build one myself. Also, I will only explore the natural gestures that occur in everyday conversations and collaborations. This means that I will try to avoid the use of complex gesture patterns that would be completely imperceptible in a regular conversation setting. Finally, this thesis is a research exploration into multimodal co-located interaction. I will not attempt to deliver products that are up to the standards of a commercial or industrial system.


Figure 2. Paper maps preferred over electronic displays in military command and control [McGee, 2001]. (Left) State of the art military command and control systems in action; (right) what commanders prefer.

1.2 Background

People naturally use speech and gestures in their everyday communications over artefacts. Consequently, researchers are now becoming interested in exploiting speech and gestures in computer supported cooperative work systems. In this section, I provide a brief background to some of the ethnographic studies, mostly drawn from observations of safety critical environments, which form the motivations and foundations for my work in multimodal co-located interaction. Next, I extract implications for the design of multimodal co-located systems. Finally, I review technological explorations of multimodal and co-located systems.

1.2.1 Ethnographic and Empirical Studies

Ethnographic studies of mission critical environments such as military command posts, air traffic control centers and hospital emergency rooms have shown that paper media such as maps and flight strips are preferred even when digital counterparts are available [Cohen, 2002; Cohen, 1997; Chin, 2003; Hutchins, 2000]. For example, Cohen et al.’s ethnographic studies illustrate why paper maps on a tabletop were preferred over electronic displays by Brigadier Generals in military command and control situations [Cohen, 2002]. The ‘single user’ assumptions inherent in the electronic display’s input device and its software limited commanders, as they were accustomed to using multiple fingers and two-handed gestures to mark (or pin) points and areas of interest with their fingers and hands, often in concert with speech [Cohen, 2002; McGee, 2001].

Figure 1. Brigadier Generals using a map simultaneously with rich hand postures. From [Cohen, 2002].

Several ethnographic researchers have focused on how gesture and speech provide improved awareness to group members in a co-located environment. Proponents of multimodal interfaces argue that the standard windows/icons/menu/pointing interaction style does not reflect how people work with highly visual interfaces in the everyday world [Cohen, 2002]. Results of empirical studies indicate that the combination of gesture and speech is more efficient and natural. For example, comparisons of speech/gestures vs. speech-only interaction by individuals performing map-based tasks showed that multimodal input resulted in more efficient use of speech (23% fewer spoken words), 35% fewer disfluencies (content self corrections, false starts, verbatim repetitions, spoken pauses, etc.), 36% fewer task performance errors, and 10% faster task performance [Oviatt, 1997]. These empirical and ethnographic studies provide motivation for multimodal support in co-located environments and have consequently led to specific design implications.

1.2.2 Implications for Design

In this thesis I focus on group interaction theories that specifically handle issues of group communication, gesture and speech activity, and apply them to the design of a digital tabletop. This section begins with low level implications that deal specifically with the mechanics of gesture and speech input and then moves into high level theories influencing group work.

Deixis: speech refined by gestures. Deictic references are speech terms (‘this’, ‘that’, etc.) whose meanings are disambiguated by spatial gestures (e.g., pointing to a location). A typical deictic utterance is “Put that…” (points to item) “there…” (points to location) [Bolt, 1980]. Deixis often makes communication more efficient since complex locations and object descriptions can be replaced in speech by a simple gesture. For example, contrast the ease of understanding a person pointing to this sentence while saying ‘this sentence here’ to the utterance ‘the 5th sentence in the paragraph starting with the word deixis located in the middle of page 3’. Furthermore, when speech and gestures are used as multimodal input to a computer, Bolt states [1980] and Oviatt confirms [1997] that such input provides individuals with a briefer, syntactically simpler and more fluent means of input than speech alone. (A small worked example of resolving deictic references appears later in this subsection.)

Complementary modes. Speech and gestures are strikingly distinct in the information each transmits. For example, studies show that speech is less useful for describing locations and objects that are perceptually accessible to the user, with other modes such as pointing and gesturing being far more appropriate [Bolt, 1980; Oviatt, 1999]. Similarly, speech is more useful than gestures for specifying abstract or discrete actions (e.g., “Fly to Boston”).

Simplicity, efficiency, and errors. Empirical studies of speech/gestures vs. speech-only interaction by individuals performing map-based tasks showed that parallel speech/gestural input yields a higher likelihood of correct interpretation than recognition based on a single input mode [Oviatt, 1999]. This includes more efficient use of speech (23% fewer spoken words), 35% fewer disfluencies (content self corrections, false starts, verbatim repetitions, spoken pauses, etc.), 36% fewer task performance errors, and 10% faster task performance [Oviatt, 1997].

Natural interaction. During observations of people using highly visual surfaces such as maps, people were seen to interact with the map very heavily through both speech and gestures. The symbiosis between speech and gestures is verified in the strong user preferences stated by people performing map-based tasks: 95% preferred multimodal interaction vs. 5% who preferred pen only. No one preferred a speech-only interface [Oviatt, 1999].

Gaze awareness. People monitor the gaze of a collaborator [Heath, 1991; Gutwin, 2004]. It lets one know where others are looking and where they are directing their attention. It helps monitor what others are doing. It serves as visual evidence to confirm that others are looking in the right place or are attending to one’s own acts. It even serves as a deictic reference by having it function as an implicit pointing act [Clark, 1996]. Gaze awareness happens easily and naturally in a co-located tabletop setting, as people are seated in a way where they can see each other’s eyes and determine where they are looking on the tabletop.
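As a small worked illustration of deixis and complementary modes, the sketch below pairs deictic words in a recognized utterance with the pointing events closest in time on the surface. This is a hedged, hypothetical example: the time window, event format and function names are my own assumptions rather than any existing system's behaviour, and a real integrator would also consume each pointing event at most once and keep per-person streams separate.

```python
from typing import List, Optional, Tuple

# Each pointing event: (timestamp in seconds, (x, y) surface location).
PointEvent = Tuple[float, Tuple[float, float]]

DEICTIC_WORDS = {"this", "that", "here", "there"}
WINDOW = 1.5  # seconds: how far a pointing act may precede/accompany a word


def resolve_deixis(words: List[Tuple[float, str]],
                   points: List[PointEvent]
                   ) -> List[Tuple[str, Optional[Tuple[float, float]]]]:
    """Attach the nearest-in-time pointing location to each deictic word."""
    resolved = []
    for t_word, word in words:
        location = None
        if word.lower() in DEICTIC_WORDS:
            # Choose the pointing event closest in time, within the window.
            candidates = [p for p in points if abs(p[0] - t_word) <= WINDOW]
            if candidates:
                location = min(candidates, key=lambda p: abs(p[0] - t_word))[1]
        resolved.append((word, location))
    return resolved


if __name__ == "__main__":
    # "Put that there": two deictic words, two touches on the table.
    utterance = [(0.0, "put"), (0.4, "that"), (1.2, "there")]
    touches = [(0.5, (120.0, 80.0)), (1.3, (400.0, 220.0))]
    for word, loc in resolve_deixis(utterance, touches):
        print(word, "->", loc)
```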


Mechanics of collaboration. In terms of the low level mechanics, Wu breaks gestural interaction into three phases: gesture registration (the starting posture), gesture relaxation (the dynamic phase) and gesture termination [Wu, 2006]. Thus gestural interaction should not require rigid postures to be held continuously; rather, systems should be flexible about what happens between the starting posture and gesture termination (a minimal sketch of this three-phase lifecycle appears at the end of this subsection). McNeill, drawing on cognitive science, argues that gesture and speech originate from the same cognitive system in the human mind, and identifies several different types of gestures: deictic, iconic, cohesive, beat and metaphoric [McNeill, 1992]. This shows how the deictic pointing gestures supported by current point-and-click interfaces encompass a very small portion of the gestures that people use in everyday conversation. Consequently, a system needs to understand how rich gestures are used in accordance with speech so that the gesture type can be determined.

Consequential communication. Gutwin describes how speech and gestural acts provide awareness to group members through consequential communication [Gutwin, 2004; Segal, 1994]. For example, researchers have noticed that people will often verbalize their current actions aloud (e.g., “I am moving this box”) for a variety of reasons [Hutchins, 1997; Heath, 1991; Segal, 1994]:
• to make others aware of actions that may otherwise be missed,
• to forewarn others about the action they are about to take,
• to serve as an implicit request for assistance,
• to allow others to coordinate their actions with one’s own,
• to reveal the course of reasoning,
• to contribute to a history of the decision making process.

Distributed cognition. Clark presents a theoretical foundation of communication, where it serves as an activity for building and using common ground [Clark, 1996]. While much of human-computer interaction is focused on understanding cognition and factors within an individual, both Clark and Hollan emphasize a need to understand distributed cognition in a team setting [Hollan, 2000]. This means that researchers should consider the whole team as a cognitive system and use their communicative acts to understand patterns of information flow within the system [Hutchins, 2000].
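To illustrate the registration/relaxation/termination idea referred to above, here is a minimal sketch of a per-gesture lifecycle tracker. The class and its states are my own hypothetical simplification of the phases described in [Wu, 2006], not an implementation of that work.

```python
from enum import Enum, auto


class Phase(Enum):
    IDLE = auto()          # nothing on the surface yet
    REGISTERED = auto()    # starting posture recognized (e.g., a flat hand)
    RELAXED = auto()       # dynamic phase: the hand may change shape freely
    TERMINATED = auto()    # contact lifted; gesture complete


class GestureLifecycle:
    """Tracks one gesture through registration, relaxation and termination."""

    def __init__(self) -> None:
        self.phase = Phase.IDLE
        self.registered_posture = None

    def posture_down(self, posture: str) -> None:
        """The starting posture selects the gesture; later shapes need not match."""
        if self.phase is Phase.IDLE:
            self.registered_posture = posture
            self.phase = Phase.REGISTERED

    def posture_moved(self, posture: str) -> None:
        """During relaxation the posture may drift without cancelling the gesture."""
        if self.phase in (Phase.REGISTERED, Phase.RELAXED):
            self.phase = Phase.RELAXED

    def posture_up(self) -> None:
        if self.phase in (Phase.REGISTERED, Phase.RELAXED):
            self.phase = Phase.TERMINATED


if __name__ == "__main__":
    g = GestureLifecycle()
    g.posture_down("flat hand")    # registration
    g.posture_moved("loose hand")  # relaxation: rigid posture no longer required
    g.posture_up()                 # termination
    print(g.registered_posture, g.phase.name)
```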

1.2.3 Technological Explorations

Multimodal Single User: Technological explorations of single user multimodal interaction began as early as 1980 with Bolt’s Put That There system. Individuals could interact with a large display via speech commands qualified by deictic reference, e.g., “Put that…” (points to item) “there…” (points to location) [Bolt, 1980]. Bolt argues and Oviatt confirms [Oviatt, 1999] that this multimodal input provides individuals with a briefer, syntactically simpler and more fluent means of input than speech alone. While Bolt’s Put That There system focused on deictic gestures in mid air, researchers have also explored direct touch manipulation using single touch surfaces such as the Smart Board by Smart Technologies (www.smarttech.com). McGee explored single point multimodal interaction over physical maps laid over a vertical Smart Board [McGee, 2001]. Similarly, Magerkurth explored single point multimodal interaction in a game environment over a Smart Board laid horizontally on a table surface [Magerkurth, 2004]. Although the figures in both these papers show multiple people, the systems only supported a single input point at any time, thus collaborators had to take turns using the system.

Figure 3. Multimodal technological explorations: (left) Put That There; (right) STARS tabletop gaming.

Single Display Groupware Toolkits: The increased interest in developing applications supporting multiple users over a single shared display has led to the development of several toolkits to support the rapid prototyping of Single Display Groupware (SDG) applications. Collaboration on a single display using multiple mice and keyboards allows researchers to rapidly explore co-located collaboration using low cost input devices. The Multiple Input Devices (MID) Toolkit allowed input from multiple mice connected to the same computer to be recognized as separate streams, the most basic task required in all multi-user applications [Bederson, 1998]. The SDG Toolkit extended the principles of the MID Toolkit by automatically drawing multiple cursors and providing a way to rapidly prototype multi-user analogues of single user widgets [Tse, 2005]. Recent interest in large displays has led to the development of input toolkits to support rapid prototyping over large surfaces (e.g., the Diamond Touch Toolkit and DViT Toolkit [Tse, 2005]) and toolkits to support the manipulation of objects over a table surface (e.g., DiamondSpin [Vernier, 2004], the Buffer Framework [Miede, 2006]).

Tabletop Input Devices: While most touch-sensitive display surfaces only allow a single point of contact, rich gestural interaction can only be achieved through digital surfaces that support richer multi-touch interactions. However, the few surfaces that do provide multi-touch have limitations. Some, like SmartSkin [Rekimoto, 2002], are generally unavailable. Some limit what is sensed: SmartBoard’s DViT (www.smarttech.com/dvit) and Han’s Frustrated Total Internal Reflection system [Han, 2005] recognize multiple touches, but cannot identify which touch is associated with which person. Others have display constraints: MERL’s DiamondTouch [Dietz, 2001] identifies multiple people, knows the areas of the table they are touching, and can approximate the relative force of their touches, but is currently limited to front projection over a relatively small surface. My research will need to work around these limitations to explore rich multi-user gesture and speech interactions on a digital surface.

Rich Gesture Input: Beyond the simple deictic reference, researchers have explored multi-finger and whole hand rich gestural input, often without the use of speech. Baudel explored the remote control of digital artefacts using a single hand connected to a data glove; by performing a sideways pulling gesture with the data glove a person could advance to the next slide of a presentation [Baudel, 1993]. Recently, researchers have explored rich gestural interaction directly on a tabletop surface, which provides the added benefit of gestures that are augmented with spatial references. Wu’s tabletop gestures included: a whole arm to sweep artefacts aside, a hand to rotate the table, a two finger rotation gesture, two arms moving together to gather artefacts, and the back of a hand to show hidden information [Wu, 2003]. These gestures form the basis of the design of the rich multimodal gestures in my thesis, as they produce improved awareness for group members: each gesture results in meaningful actions on the digital surface.

Distributed Co-located Systems: Researchers have also explored applications that span multiple computers in the same co-located area. Roomware environments enhance typical artefacts in a room such as tables, walls and chairs with digital interactive surfaces. The BEACH architecture supported communication between multiple input devices (e.g., interactive wall and table displays, Personal Digital Assistants and personal tablets) [Tandler, 2003]. Tang also explored the concept of mixed presence groupware, which allows groups of co-located individuals to work with other groups of co-located individuals located over a distance [Tang, 2005].

Important distinctions between previous work and my future research directions:


Keyboard and mouse interaction is unsatisfying for highly collaborative applications, especially those that occur in co-located environments [Gutwin, 2004]. While single point touch interaction over a Smart Board is an improvement, due to the visible movements of people’s arms and bodies, it still reduces the expressive capabilities of people’s hands and arms to deictic pointers. Designers of multimodal co-located systems need to consider multiple people simultaneously using rich bimanual, multi-postured gesture interaction, and the system must be aware of alouds: the meaningful speech phrases not directed to any individual member but used to provide awareness of users’ current actions and intentions. However, the fundamental problem is that current input technologies either do not support these rich multimodal interactions or they require programmers to develop such software from the ground up. These hurdles must be overcome before even the most basic multimodal co-located applications can be developed and the empirical work can be carried out to understand the nuances of, and potential solutions for, multimodal co-located interaction.
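One minimal technical requirement implied by the discussion above is that every gesture and speech event carry the identity of the person who produced it, so that simultaneous streams remain distinguishable rather than being collapsed into a single pointer. The sketch below shows one plausible event representation and dispatcher; the names and structure are my own assumptions, not the actual APIs of the MID, SDG, or DiamondTouch toolkits cited above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class InputEvent:
    """A single multimodal input event, always tagged with its producer."""
    user_id: int   # which collaborator produced the event
    modality: str  # "gesture" or "speech"
    payload: str   # e.g. a posture name or a recognized phrase
    time: float    # timestamp in seconds


@dataclass
class MultiUserDispatcher:
    """Routes events to handlers while keeping per-user streams separate."""
    handlers: List[Callable[[InputEvent], None]] = field(default_factory=list)
    streams: Dict[int, List[InputEvent]] = field(default_factory=dict)

    def subscribe(self, handler: Callable[[InputEvent], None]) -> None:
        self.handlers.append(handler)

    def post(self, event: InputEvent) -> None:
        # Keep a per-user history so simultaneous activity is never merged.
        self.streams.setdefault(event.user_id, []).append(event)
        for handler in self.handlers:
            handler(event)


if __name__ == "__main__":
    dispatcher = MultiUserDispatcher()
    dispatcher.subscribe(lambda e: print(f"user {e.user_id}: {e.modality} -> {e.payload}"))
    # Two people acting at (nearly) the same time remain separate streams.
    dispatcher.post(InputEvent(1, "gesture", "five fingers placed", 10.0))
    dispatcher.post(InputEvent(2, "speech", "label this unit", 10.1))
    print({u: len(evts) for u, evts in dispatcher.streams.items()})
```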

1.3 Research Context

This research investigates multimodal co-located collaboration. Figure 2 illustrates how this research fits into the broad context of human-computer interaction (HCI). Within HCI, my research is contained in computer-supported cooperative work (CSCW). The next refinement narrows my primary focus to technologies that support co-located collaborative work. My focus can be further streamlined to include only those co-located collaborative work practices that use multimodal gesture and speech input. My research builds upon the lessons learned from ethnographic studies of safety critical applications and applies these lessons to the design of general multimodal co-located systems. At the beginning, I will narrow my research to explore multimodal co-located interaction over interactive tabletops, leaving open the possibility of exploring multimodal co-located wall or tablet interaction.


Figure 2: The context of my research

1.4 Research Objectives

I will address the above mentioned problems by linking previous literature and ethnographic studies to my own studies of co-located multimodal environments. This synthesis will be used to inform the design and development of several multimodal co-located systems. However, as the research problems described above are inter-related and their respective research findings unknown, the findings obtained for each research problem may affect what and how the next research step should proceed. Therefore, the objectives described below are subject to revision depending on the outcome of the preceding research stage.

1. I will distill existing theories and ethnographic studies into a set of behavioural foundations that inform the design of multimodal co-located systems and list individual and group benefits.

This objective will be achieved by performing a survey of existing theories of team work and ethnographic research in safety critical environments. I will examine interaction in real world situations, paying particular attention to the speech and gesture acts used, to produce a list of individual and group benefits of multimodal co-located interaction. This summary will outline some of the benefits and provide motivation for adding multimodal interaction to co-located environments. It will form the basis of the design of my multimodal co-located applications and will be used in the evaluation of the multimodal co-located systems in this thesis.

2. I will develop a toolkit to allow rapid prototyping of multimodal co-located interactive systems.

Using some of the experience gained from my Master’s Thesis [Tse, 2005] on building toolkits to support application development using multiple mice and keyboards, I will develop a software toolkit that will facilitate the rapid prototyping of responsive and demonstrative multimodal gesture and speech applications in a co-located environment. This objective consists of three sub-goals:

  1. I will develop a gesture recognizer that recognizes different hand postures (e.g., arm, hand, five fingers, fist) and their respective dynamic movements (e.g., two fingers moving apart) for multiple people (up to four) on a co-located tabletop display.

  2. I will develop a multimodal integrator that accepts both speech and gesture commands from multiple people and is able to integrate these commands on a single computer. I will use existing speech recognition technology to recognize voice commands. This toolkit will support multiple computers over a network because hardware and software limitations often require that multiple large displays or input devices be controlled by separate computers.

  3. I will develop tools to simplify the adaptation of commercial single user applications to a multimodal co-located environment.

This will allow one to rapidly prototype rich multimodal applications without the need to develop a working commercial system from the ground up, and will facilitate the exploration and further understanding of multimodal co-located application development. To begin, this toolkit will be designed to support gestures on existing tabletop input devices (e.g., Diamond Touch, Smart DViT); however, this infrastructure may be extended to support other multimodal input devices at a later time. To evaluate the toolkit, I will build the applications described in Objectives 3 and 4, and I will have others develop multimodal co-located systems using my toolkit.

3. I will develop and evaluate multimodal co-located wrappers over existing commercial applications to further my understanding and inform the design of true multi-user multimodal interactive systems.


Using the design implications and behavioural foundations developed in Objective 1 and the prototyping toolkit developed in Objective 2, I will develop several multi-user, multimodal co-located interface wrappers atop existing commercial applications on an interactive tabletop display. By studying existing commercial applications I will be able to rapidly prototype rich multimodal applications that would otherwise be impossible for me to develop from the ground up. User studies of these systems will also provide an opportunity to observe how people naturally mitigate interference and handle turn taking when interacting with single user applications over a multimodal co-located tabletop display. They will also be used to evaluate how the design implications provided in Objective 1 hold up when moved out of the physical world into the realm of a digital tabletop. All of these observations will be used to inform the design of a true multi-user multimodal system.

4. I will develop true multi-user multimodal co-located systems and evaluate the technical and behavioural nuances of multimodal co-located systems development. I will develop and evaluate different techniques to deal with these nuances to inform the design of future multimodal co-located systems.

Again, using the toolkit developed in Objective 2, I will create several applications that explore new interaction possibilities available exclusively in a multi-user multimodal co-located environment. I will have groups of people perform collaborative tasks in this environment, paying particular attention to the inter-person behaviours and technical nuances of multimodal co-located application development. Using the list of nuances, I will evaluate techniques from the existing literature used to mitigate these problems. I may also develop my own interaction techniques and compare them against commonly accepted approaches. For example, if one of the nuances of multimodal co-located interaction turns out to be the need to manage when speech recognition is activated, I will evaluate existing techniques used for speech recognition (e.g., push to talk, look to talk) and possibly develop a new technique to mitigate the issue in the co-located environment.

Some of the research directions that I am considering for Objective 3 include examining how a multimodal co-located interactive system will influence or affect the natural interactions that occur in co-present meetings. For example, alouds are high level spoken utterances made by the performer of an action, meant for the benefit of the group but not directed to any one individual in the group [Heath, 1991]. Since spoken commands are directed to the computer they are not truly alouds; thus we do not know which of the behavioural affordances normally provided by alouds will be achieved in the multimodal speech recognition environment. Furthermore, I am interested in examining techniques to manage when speech utterances are meant for the computer versus when they are meant for collaborators. The research that I am capable of exploring will depend heavily on the capabilities provided by the multimodal co-located toolkit described in Objective 2.

1.5 Current Status

Much of Objective 1 has been completed. I have written a paper that outlines a list of behavioural foundations describing the individual and group benefits of, and implications for the design of, gesture and speech interaction in an interactive co-located environment (see Appendix B). This initial list will be further expanded in my thesis to include ethnographic studies of other collaborative environments (e.g., NASA Control Centres, Hospital Surgery Rooms) and my own anecdotal experiences from application development in Objectives 3 and 4.

Parts of Objective 2 have been completed. I have developed a toolkit called the Diamond Touch Gesture Engine that allows different hand postures (e.g., hand, five fingers, fist) and their respective movements (e.g., two fingers moving together) to be detected from multiple people on a tabletop display (a rough illustrative sketch of this kind of posture classification appears below). I have used this gesture engine and the Microsoft Speech Application Programmers’ Interface (Microsoft SAPI) to prototype speech and gesture applications that interact with existing commercial applications (e.g., Google Earth and Blizzard’s Warcraft III, see Appendix B). This toolkit needs to be improved to provide more reliable recognition of gesture movement (e.g., rotation of five fingers) and input across multiple computers. I have begun work on a toolkit, the Centralized External Input (CEXI) Toolkit, that allows input to be sent across different computers over a local area network. The next step will be to combine the Diamond Touch Gesture Engine, Microsoft SAPI and the CEXI Toolkit into a unified infrastructure for exploring multimodal co-located interaction.

I have explored multi-user multimodal co-located wrappers for three commercial single user applications. My experiences from adapting Google Earth, Warcraft III and The Sims have allowed me to focus my efforts on providing rich multimodal gesture and speech interactivity rather than building a truly useful application. These application wrappers have generated significant interest in my research from industry (e.g., PB Faradyne) and government agencies (e.g., Disaster Services of the City of Calgary), as they are a compelling way of illustrating how existing commercial applications would work on a tabletop surface using gesture and speech.
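As a rough illustration of the kind of classification a gesture engine performs, the function below guesses a posture from the number and spread of contact points reported by a touch surface. This is a hypothetical sketch under my own assumptions, not the Diamond Touch Gesture Engine's actual algorithm or API, and the thresholds are invented for the example.

```python
from math import dist
from typing import List, Tuple

Point = Tuple[float, float]


def classify_posture(contacts: List[Point]) -> str:
    """Guess a hand posture from contact count and spatial spread (in mm).

    Thresholds are illustrative only; a real engine would be tuned to the
    hardware and would also track postures over time.
    """
    if not contacts:
        return "none"
    spread = max((dist(a, b) for a in contacts for b in contacts), default=0.0)
    if len(contacts) == 1:
        return "one finger"
    if len(contacts) == 2:
        return "two fingers"
    if len(contacts) >= 5 and spread < 120:
        return "five fingers"
    if spread >= 300:
        return "whole arm"
    if spread < 60:
        return "fist"
    return "flat hand"


if __name__ == "__main__":
    print(classify_posture([(0, 0)]))                        # one finger
    print(classify_posture([(0, 0), (40, 5)]))               # two fingers
    print(classify_posture([(0, 0), (20, 30), (40, 45),
                            (60, 50), (80, 40)]))            # five fingers
    print(classify_posture([(0, 0), (150, 20), (320, 60)]))  # whole arm
```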


A large amount of work remains to be completed in Objective 4. In particular, I am actively searching for instances of multi-user multimodal gestures in the current literature, and I am looking for an application domain in which to develop a prototype true multi-user application. My hope is to gauge the interest of industry partners who can tell me about the interactions and issues that they are facing. Using this information I will begin the exploration of true multi-user multimodal systems and their subsequent evaluation.

I anticipate several avenues that can be explored as this work unfolds. While it is unlikely that all these avenues will be pursued in the context of this research, they describe a broader research agenda. Mixed presence groupware explores groups of co-located individuals working together with other distant groups; such systems could leverage multimodal interaction to improve awareness for remote participants. Gaze, head and torso tracking would allow a richer set of deictic gestures, where one could specify areas of interest by orienting one’s head and torso toward a digital representation of a remote participant. Finally, multimodal co-located interaction could be used to explore the movement of digital information from a tabletop display to a peripheral wall display and vice versa, thus allowing digital content to move seamlessly in the digital work environment.

1.6 Conclusion

This thesis argues that single point touch interaction on a large display reduces the expressive capabilities of people’s hands and arms to simple deictic pointers. Richer multimodal interaction that is aware of the hand postures, movements and speech acts that people naturally perform in a co-located environment will not only provide improved group awareness, but will also improve the accuracy and effectiveness of collaborations in co-located environments. The related work has shown that there is a wealth of ethnographic, theoretical and technical research that has investigated and argued for the benefits of multimodal interaction in a co-located environment. This proposal has identified a largely unexploited area in Human-Computer Interaction: multimodal co-located collaboration.

The research I propose in this document aims to ground the individual and group benefits of multimodal interaction in the co-located setting. The contributions offered by this research are: an improved understanding of the benefits and tradeoffs of multimodal input in a co-located setting, a toolkit that allows rapid prototyping of multimodal interactive systems, the exploration of multi-user multimodal co-located wrappers around existing single user applications, and the design and evaluation of true multimodal co-located systems with the goal of understanding the nuances and design implications of effective multimodal co-located interaction.

1.7 References

1. Baudel, T. and Beaudouin-Lafon, M. (1993). Charade: Remote control of objects using free-hand gestures. Communications of the ACM, 36(7), pp. 28-35.
2. Bederson, B. and Hourcade, J. (1999). Architecture and implementation of a Java package for Multiple Input Devices (MID). HCIL Technical Report No. 9908. http://www.cs.umd.edu.hcil.
3. Bolt, R.A. (1980). Put-that-there: Voice and gesture at the graphics interface. Proc. ACM Conf. Computer Graphics and Interactive Techniques (Seattle), pp. 262-270.
4. Clark, H. (1996). Using Language. Cambridge University Press.
5. Cohen, P. (2000). Speech can’t do everything: A case for multimodal systems. Speech Technology Magazine, 5(4).
6. Cohen, P.R., Johnston, M., McGee, D., Oviatt, S., Pittman, J., Smith, I., Chen, L. and Clow, J. (1997). QuickSet: Multimodal interaction for distributed applications. Proc. ACM Multimedia, pp. 31-40.
7. Cohen, P.R., Coulston, R. and Krout, K. (2002). Multimodal interaction during multiparty dialogues: Initial results. Proc. IEEE Int’l Conf. Multimodal Interfaces, pp. 448-452.
8. Chin, T. (2003). Doctors pull plug on paperless system. American Medical News, Feb 17, 2003. http://amaassn.org/amednews/2003/02/17/bil20217.htm
9. Dietz, P. and Leigh, D. (2001). DiamondTouch: A multi-user touch technology. Proc. ACM Symposium on User Interface Software and Technology (UIST), pp. 219-226.
10. Gutwin, C. and Greenberg, S. (2004). The importance of awareness for team cognition in distributed collaboration. In E. Salas and S. Fiore (Eds.), Team Cognition: Understanding the Factors that Drive Process and Performance, APA Press, pp. 177-201.
11. Han, J. (2005). Low-cost multi-touch sensing through frustrated total internal reflection. Proc. UIST 2005, pp. 115-118.
12. Heath, C.C. and Luff, P. (1991). Collaborative activity and technological design: Task coordination in London Underground control rooms. Proc. ECSCW, pp. 65-80.
13. Hollan, J., Hutchins, E. and Kirsh, D. (2000). Distributed cognition: Toward a new foundation for human-computer interaction. ACM TOCHI, 7(2), pp. 174-196.
14. Hutchins, E. and Palen, L. (1997). Constructing meaning from space, gesture, and speech. In Discourse, Tools, and Reasoning: Essays on Situated Cognition. Springer-Verlag, Heidelberg, pp. 23-40.
15. Hutchins, E. (2000). The cognitive consequences of patterns of information flow. Intellectica, 2000/1, 30, pp. 53-74.
16. Tandler, P. (2003). The BEACH application model and software framework for synchronous collaboration in ubiquitous computing environments. Journal of Systems & Software, Special Edition on Application Models and Programming Tools for Ubiquitous Computing, October 2003.
17. Isenberg, T., Miede, A. and Carpendale, S. (2006). A buffer framework for supporting responsive interaction in information visualization interfaces. Proc. C5 2006 (Berkeley, California, January 26-27, 2006), Los Alamitos, CA.
18. Magerkurth, C., Memisoglu, M., Engelke, T. and Streitz, N. (2004). Towards the next generation of tabletop gaming experiences. Proc. Graphics Interface 2004 (London, Ontario), Canadian Human-Computer Communications Society, pp. 73-80.
19. McGee, D.R. and Cohen, P.R. (2001). Creating tangible interfaces by augmenting physical objects with multimodal language. Proc. ACM Conf. Intelligent User Interfaces, pp. 113-119.
20. McNeill, D. (1992). Hand and Mind: What Gestures Reveal About Thought. University of Chicago Press, Chicago.
21. Oviatt, S. (1997). Multimodal interactive maps: Designing for human performance. Human-Computer Interaction, 12.
22. Oviatt, S.L. (1999). Ten myths of multimodal interaction. Communications of the ACM, 42(11), pp. 74-81.
23. Rekimoto, J. (2002). SmartSkin: An infrastructure for freehand manipulation on interactive surfaces. Proc. ACM CHI 2002.
24. Segal, L. (1994). Effects of checklist interface on non-verbal crew communications. NASA Ames Research Center, Contractor Report 177639.
25. Shen, C., Vernier, F.D., Forlines, C. and Ringel, M. (2004). DiamondSpin: An extensible toolkit for around-the-table interaction. Proc. ACM Conf. Human Factors in Computing Systems (CHI), pp. 167-174.
26. Tang, A. (2005). Embodiments in Mixed Presence Groupware. MSc Thesis, Department of Computer Science, University of Calgary, Calgary, Alberta, Canada. Defended January 19, 2005.
27. Tse, E. (2004). The Single Display Groupware Toolkit. MSc Thesis, Department of Computer Science, University of Calgary, Calgary, Alberta, Canada, November.
28. Wu, M. and Balakrishnan, R. (2003). Multi-finger and whole hand gestural interaction techniques for multi-user tabletop displays. Proc. ACM Symposium on User Interface Software and Technology (UIST) (Vancouver, Canada), ACM Press, pp. 193-202.
29. Wu, M., Shen, C., Ryall, K., Forlines, C. and Balakrishnan, R. (2006). Gesture registration, relaxation, and reuse for multi-point direct-touch surfaces. Proc. IEEE Tabletop 2006 (Adelaide, South Australia), pp. 183-190.


Appendix A. PhD Timeline

Included below is a rough schedule of upcoming events to indicate anticipated timelines and when deliverables will be completed.

2006

February: Final version of the research proposal completed and submitted to the committee.

March: Written candidacy examination; begin work on a tool to simplify the process of creating multimodal wrappers around existing single user applications (Thesis Objective 3).

April: Oral candidacy examination. Objective is to submit a paper to the ACM Symposium on User Interface Software and Technology (UIST) regarding the multimodal wrappers around existing single user applications.

May: Begin internship at Mitsubishi Electric Research Laboratories (MERL). Work on true multi-user multimodal systems (Thesis Objective 4) and begin to study the use of true multimodal systems.

May – September: Objective is to submit a paper to the Conference on Human Factors in Computing Systems (CHI) regarding studies of the usage of true multi-user multimodal systems. I plan to present my current work to the supervisory committee for approval and future directions.

September – December: Continue the future work and directions provided by the supervisory committee at MERL.

2007

January: Directions meeting with the supervisory committee to examine current progress and directions. Begin work on writing my PhD thesis, tie up loose ends in my research, and publish any papers that remain to be published about my work.

February: Begin final studies and systems to complete the requirements and goals of Thesis Objective 4.

September: Begin writing the PhD thesis.

2008

March: PhD thesis completed and submitted to committee for approval.

April: PhD oral defense.


Appendix B. Multimodal Co-Located Wrappers Paper

Reference: Tse, E., Shen, C., Greenberg, S. and Forlines, C. (2006). Enabling Interaction with Single User Applications through Speech and Gestures on a Multi-User Tabletop. Proceedings of AVI 2006, Venice, Italy. To appear.


Enabling Interaction with Single User Applications through Speech and Gestures on a Multi-User Tabletop

Edward Tse (1,2), Chia Shen (1), Saul Greenberg (2) and Clifton Forlines (1)

(1) Mitsubishi Electric Research Laboratories, 201 Broadway, Cambridge, MA, 02139, USA, +1 617 621-7500
(2) University of Calgary, 2500 University Dr. N.W., Calgary, Alberta, T2N 1N4, Canada, +1 403 220-6087

[shen, forlines]@merl.com and [tsee, saul]@cpsc.ucalgary.ca

ABSTRACT

Co-located collaborators often work over physical tabletops with rich geospatial information. Previous research shows that people use gestures and speech as they interact with artefacts on the table and communicate with one another. With the advent of large multi-touch surfaces, developers are now applying this knowledge to create appropriate technical innovations in digital table design. Yet they are limited by the difficulty of building a truly useful collaborative application from the ground up. In this paper, we circumvent this difficulty by: (a) building a multimodal speech and gesture engine around the Diamond Touch multi-user surface, and (b) wrapping existing, widely-used off-the-shelf single-user interactive spatial applications with a multimodal interface created from this engine. Through case studies of two quite different geospatial systems – Google Earth and Warcraft III – we show the new functionalities, feasibility and limitations of leveraging such single-user applications within a multi-user, multimodal tabletop. This research informs the design of future multimodal tabletop applications that can exploit single-user software conveniently available in the market. We also contribute (1) a set of technical and behavioural affordances of multimodal interaction on a tabletop, and (2) lessons learnt from the limitations of single user applications.

Categories and Subject Descriptors
H5.2 [Information interfaces and presentation]: User Interfaces – Interaction Styles.

General Terms
Design, Human Factors

Keywords
Tabletop interaction, visual-spatial displays, multimodal speech and gesture interfaces, computer supported cooperative work.

1. INTRODUCTION

Traditional desktop computers are unsatisfying for highly collaborative situations involving multiple co-located people exploring and problem-solving over rich spatial information. These situations include mission critical environments such as military command posts and air traffic control centers, in which paper media such as maps and flight strips are preferred even when digital counterparts are available [4][5]. For example, Cohen et al.’s ethnographic studies illustrate why paper maps on a tabletop were preferred over electronic displays by Brigadier Generals in military command and control situations [4]. The ‘single user’ assumptions inherent in the electronic display’s input device and its software limited commanders, as they were accustomed to using multiple fingers and two-handed gestures to mark (or pin) points and areas of interest with their fingers and hands, often in concert with speech [4][16].

While there are many factors promoting rich information use on physical tables over desktop computers, e.g., insufficient screen real estate and low image resolution of monitors, an often overlooked problem with a personal computer is that most digital systems are designed within single-user constraints. Only one person can easily see and interact with information at a given time. While another person can work with it through turn-taking, the system is blind to this fact. Even if a large high resolution display is available, one person’s standard window/icon/mouse interaction – optimized for small screens and individual performance – becomes awkward and hard to see and comprehend by others involved in the collaboration [12].

For a computer system to be effective in such collaborative situations, the group needs at least: (a) a large and convenient display surface, (b) input methods that are aware of multiple people, and (c) input methods that leverage how people interact and communicate over the surface via gestures and verbal utterances [4][18]. For point (a), we argue that a digital tabletop display is a conducive form factor for collaboration since it lets people easily position themselves in a variety of collaborative postures (side by side, kitty-corner, round table, etc.) while giving all equal and simultaneous opportunity to reach into and interact over the surface. For points (b+c), we argue that multimodal gesture and speech input benefits collaborative tabletop interaction: reasons will be summarized in Section 2.

The natural consequence of these arguments is that researchers are now concentrating on specialized multi-user, multimodal digital tabletop applications affording visual-spatial interaction. However, several limitations make this a challenging goal:

1. Hardware Limitations. Most touch-sensitive display surfaces only allow a single point of contact. The few surfaces that do provide multi-touch have serious limitations. Some, like SmartSkin [20], are generally unavailable. Others limit what is sensed: SmartBoard’s DViT (www.smarttech.com/dvit) currently recognizes a maximum of 2 touches and the touch point size, but cannot identify which touch is associated with which person. Some have display constraints: MERL’s DiamondTouch [6] identifies multiple people, knows the areas of the table they are touching, and can approximate the relative force of their touches; however, the technology is currently limited to front projection and their surfaces are relatively small. Consequently, most research systems limit interaction to a single touch/user, or by having people interact indirectly through PDAs, mice, and tablets (e.g., [16]).

2. Software Limitations. It is difficult and expensive to build a truly useful collaborative multimodal spatial application from the ground up (e.g., Quickset [5]). As a consequence, most research systems are ‘toy’ applications that do not afford the rich information and/or interaction possibilities expected in well-developed commercial products.

The focus of this paper is on wrapping existing single user geospatial applications within the multi-user, multimodal tabletop setting. Just as screen/window sharing systems let distributed collaborators share views and interactions with existing familiar single user applications [9], we believe that embedding familiar single-user applications within a multi-user multimodal tabletop setting – if done suitably – can benefit co-located workers. The remainder of this paper develops this idea in three ways. First, we analyze and summarize the behavioural foundations motivating why collaborators should be able to use both speech and gestures atop tables. Second, we briefly present our Gesture Speech Infrastructure used to add multimodal, multi-user functionality to existing commercial spatial applications. Third, through case studies of two different systems – Google Earth and Warcraft III – we analyze the feasibility and limitations of leveraging such single-user applications within a multi-user, multimodal tabletop.
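[Editorial illustration.] To give a flavour of what "wrapping" a single-user application might involve, the following is a hedged sketch under my own assumptions; it is not the Gesture Speech Infrastructure described in this paper, and the injector class is a stub standing in for a real input-injection API rather than any actual library. The idea it illustrates is that a fused speech-plus-gesture command can be translated into the ordinary mouse and keyboard events a single-user application already understands.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class FusedCommand:
    """A speech phrase optionally qualified by a touch location on the table."""
    user_id: int
    phrase: str                          # e.g. "fly to boston" or "zoom in"
    location: Optional[Tuple[int, int]]  # screen coordinates of a touch, if any


class SingleUserInjector:
    """Stub standing in for an OS-level mouse/keyboard injection facility."""

    def click(self, x: int, y: int) -> None:
        print(f"[inject] mouse click at ({x}, {y})")

    def type_text(self, text: str) -> None:
        print(f"[inject] keystrokes: {text!r}")


def wrap(command: FusedCommand, injector: SingleUserInjector) -> None:
    """Map a multimodal command onto single-user input, one command at a time.

    Because the underlying application is single-user, the wrapper serializes
    commands; simultaneous commands from different people must take turns.
    """
    if command.location is not None:
        # A deictic gesture becomes an ordinary click at that spot.
        injector.click(*command.location)
    if command.phrase:
        # A spoken command becomes the keystrokes the application expects.
        injector.type_text(command.phrase)


if __name__ == "__main__":
    injector = SingleUserInjector()
    wrap(FusedCommand(1, "zoom in", (512, 384)), injector)
    wrap(FusedCommand(2, "fly to boston", None), injector)
```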

2. BEHAVIOURAL FOUNDATIONS

This section reviews related research and summarizes it in the form of a set of behavioural foundations.

2.1 Individual Benefits

Proponents of multimodal interfaces argue that the standard windows/icons/menu/pointing interaction style does not reflect how people work with highly visual interfaces in the everyday world [4]. They state that the combination of gesture and speech is more efficient and natural. We summarize below some of the many benefits gesture and speech input provides to individuals.

Deixis: speech refined by gestures. Deictic references are speech terms (‘this’, ‘that’, etc.) whose meanings are qualified by spatial gestures (e.g., pointing to a location). This was exploited in the Put-That-There multimodal system [1], where individuals could interact with a large display via speech commands qualified by deictic reference, e.g., “Put that…” (points to item) “there…” (points to location). Bolt argues [1] and Oviatt confirms [18] that this multimodal input provides individuals with a briefer, syntactically simpler and more fluent means of input than speech alone. Studies also show that parallel recognition of two input signals by the system yields a higher likelihood of correct interpretation than recognition based on a single input mode [18].

Complementary modes. Speech and gestures are strikingly distinct in the information each transmits, how it is used during communication, the way it interoperates with other communication modes, and how it is suited to particular interaction styles. For example, studies clearly show performance benefits when people indicate spatial objects and locations – points, paths, areas, groupings and containment – through gestures instead of speech [17][18][5][3]. Similarly, speech is more useful than gestures for specifying abstract actions.

Simplicity, efficiency, and errors. Empirical studies of speech/gestures vs. speech-only interaction by individuals performing map-based tasks showed that multimodal input resulted in more efficient use of speech (23% fewer spoken words), 35% less disfluencies (content self corrections, false starts, verbatim repetitions, spoken pauses, etc.), 36% fewer task performance errors, and 10% faster task performance [18].

Rich gestures and hand postures. Unlike the current deictic ‘pointing’ style of mouse-based and pen based systems, observations of people working over maps showed that people used different hand postures as well as both hands coupled with speech in very rich ways [4].

Natural interaction. During observations of people using highly visual surfaces such as maps, people were seen to interact with the map very heavily through both speech and gestures. The symbiosis between speech and gestures is verified in the strong user preferences stated by people performing map-based tasks: 95% preferred multimodal interaction vs. 5% preferred pen only. No one preferred a speech only interface [18].

2.2 Group Benefits

Spatial information placed atop a table typically serves as conversational prop to the group, creating a common ground that informs and coordinates their joint actions [2]. Rich collaborative interactions over this information often occur as a direct result of workspace awareness: the up-to-the-moment understanding one person has of another person’s interaction with the shared workspace [11]. This includes awareness of people, how they interact with the workspace, and the events happening within the workspace over time. As outlined below, many behavioural factors comprising the mechanics of collaboration [19] require speech and gestures to contribute to how collaborators maintain and exploit workspace awareness over tabletops.

Alouds. These are high level spoken utterances made by the performer of an action meant for the benefit of the group but not directed to any one individual in the group [13]. This ‘verbal shadowing’ becomes the running commentary that people commonly produce alongside their actions. For example, a person may say something like “I am moving this box” for a variety of reasons:
• to make others aware of actions that may otherwise be missed,
• to forewarn others about the action they are about to take,
• to serve as an implicit request for assistance,
• to allow others to coordinate their actions with one’s own,
• to reveal the course of reasoning,
• to contribute to a history of the decision making process.
When working over a table, alouds can help others decide when and where to direct their attention, e.g., by glancing up and looking to see what that person is doing in more detail [11].

Gestures as intentional communication. In observational studies of collaborative design involving a tabletop drawing surface, Tang noticed that over one third of all activities consisted of intentional gestures [23]. These intentional gestures serve many communication roles [19], including:
• pointing to objects and areas of interest within the workspace,
• drawing of paths and shapes to emphasise content,
• giving directions,
• indicating sizes or areas,
• acting out operations.

Deixis also serves as a communication act since collaborators can disambiguate one’s speech and gestural references to objects and spatial locations [19]. An example is one person telling another person “This one” while pointing to a specific object. Deixis often makes communication more efficient since complex locations and object descriptions can be replaced in speech by a simple gesture. For example, contrast the ease of understanding a person pointing to this sentence while saying ‘this sentence here’ to the utterance ‘the 4th sentence in the paragraph starting with the word deixis located in the middle of the column on page 3’.

Gestures as consequential communication. Consequential communication happens as one watches the bodies of others moving around the work surface [22][19]. Many gestures are consequential rather than intentional communication. For example, as one person moves her hand in a grasping posture towards an object, others can infer where her hand is heading and what she likely plans to do. Gestures are also produced as part of many mechanical actions, e.g., grasping, moving, or picking up an object: this also serves to emphasize actions atop the workspace. If accompanied by speech, it also serves to reinforce one’s understanding of what that person is doing.

Simultaneous activity. Given good proximity to the work surface, participants often work simultaneously over tables. For example, Tang observed that approximately 50-70% of people’s activities around the tabletop involved simultaneous access to the space by more than one person [23].

Gaze awareness. People monitor the gaze of a collaborator [13][14][11]. It lets one know where others are looking and where they are directing their attention. It helps one check what others are doing. It serves as visual evidence to confirm that others are looking at the right place or are attending to one’s own acts. It even serves as a deictic reference by having it function as an implicit pointing act. While gaze awareness is difficult to support in distributed groupware technology [14], it happens easily and naturally in the co-located tabletop setting [13][11].

2.3 Implications
The above points clearly suggest the benefits of supporting multimodal gesture and speech input on a multi-user digital table. Not only is this a good way to support individual work over spatially located visual artefacts, but intermixed speech and gestures comprise part of the glue that makes tabletop collaboration effective. Taken together, gestures and speech coupled with gaze awareness support a rich multi-person choreography of often simultaneous collaborative acts over visual information. Collaborators’ intentional and consequential gestures, gaze movements and verbal alouds indicate intentions, reasoning, and actions. Participants monitor these acts to help coordinate actions and to regulate their access to the table and its artefacts. Participants’ simultaneous activities promote interaction ranging from loosely coupled semi-independent tabletop activities to a tightly coordinated dance of dependent activities. While supporting these acts is a good goal for digital table design, these acts will clearly be compromised if we restrict a group to traditional single-user mouse and keyboard interaction. In the next section, we describe an infrastructure that lets us create a speech and gesture multimodal and multi-user wrapper around these single-user systems. As we will see in the following case studies, such wrappers afford a subset of the benefits of multimodal interaction.

3. GESTURE SPEECH INFRASTRUCTURE
Our infrastructure is illustrated in Fig. 1. A standard Windows computer drives our infrastructure software, as described below. The table is a 42” MERL DiamondTouch surface [6] with a 4:3 aspect ratio; a digital projector casts a 1280x1024 pixel image on the table’s surface. This table is multi-touch sensitive, where contact is presented through the DiamondTouch SDK as an array of horizontal and vertical signals, touch points and bounding boxes (Fig. 1, row 5). The table is also multi-user, as it distinguishes signals from up to four people. While our technology uses the DiamondTouch, the theoretical motivations, strategies developed, and lessons learnt should apply to other touch or vision based surfaces that offer similar multi-user capabilities.

Figure 1. The Gesture Speech Infrastructure

Speech Recognition. For speech recognition, we exploit available technology: noise canceling headset microphones for capturing speech input, and the Microsoft Speech Application Programmers’ Interface (Microsoft SAPI) (Fig. 1, rows 4+5). SAPI provides an n-best list of matches for the current recognition hypothesis. Due to the one-user-per-computer limitation in Microsoft SAPI, only one headset can be attached to our main computer. We add an additional computer for each additional headset, which collects and sends speech commands to the primary computer (Fig. 1, right side, showing a 2nd headset).

Gesture Engine. Since recognizing gestures from multiple people on a tabletop is still an emerging research area [25][26], we could not use existing third-party gesture recognizers. Consequently, we developed our own DiamondTouch gesture recognition engine to convert the raw touch information produced by the DiamondTouch SDK into a number of rotation and table-size independent features (Fig. 1, rows 4+5 middle). Using a univariate Gaussian clustering algorithm, features from a single input frame are compared against a number of pre-trained hand and finger postures. By examining multiple frames over time, we capture dynamic information such as a hand moving up, or two fingers moving closer together or farther apart. This allows applications to be developed that understand both different hand postures and dynamic movements over the DiamondTouch.

Input Translation and Mapping. To interact with existing single user applications, we first use the GroupLab WidgetTap toolkit [8] to determine the location and size of the GUI elements within it. We then use the Microsoft SendInput facility to relay the gesture and speech input actions to the locations of the mapped UI elements (Fig. 1, rows 1, 2 and 3). Thus speech and gestures are mapped and transformed into one or more traditional GUI actions as if the user had performed the interaction sequence via the mouse and keyboard. The consequence is that the application appears to directly understand the spoken command and gestures. Section 5.5 elaborates further on how this mapping is done. If the application allows us to do so, we also hide the GUI elements so they do not clutter up the display. Of importance is that application source code is neither required nor modified.
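To make the gesture engine’s per-frame classification step concrete, the following is a minimal sketch of how frame features might be scored against pre-trained postures using univariate Gaussians, as described above. It is illustrative only: the feature names, training values and code structure are assumptions, not the actual engine implementation.

    # Sketch (not the actual engine): classifying a touch frame against pre-trained
    # postures using per-feature univariate Gaussians.
    # Feature names and training values here are illustrative assumptions only.
    import math

    # Pre-trained posture models: for each posture, a (mean, std dev) per feature.
    POSTURES = {
        "one_finger":  {"contact_area": (1.2, 0.4),  "bbox_diagonal": (0.05, 0.02), "touch_count": (1.0, 0.3)},
        "five_finger": {"contact_area": (6.0, 1.5),  "bbox_diagonal": (0.30, 0.08), "touch_count": (5.0, 0.8)},
        "flat_hand":   {"contact_area": (14.0, 3.0), "bbox_diagonal": (0.35, 0.10), "touch_count": (1.0, 0.5)},
    }

    def gaussian_log_likelihood(x, mean, std):
        """Log of the univariate normal density of x."""
        return -0.5 * math.log(2 * math.pi * std * std) - ((x - mean) ** 2) / (2 * std * std)

    def classify_frame(features):
        """Return the posture whose per-feature Gaussians best explain this frame."""
        best_posture, best_score = None, float("-inf")
        for posture, model in POSTURES.items():
            score = sum(gaussian_log_likelihood(features[name], mean, std)
                        for name, (mean, std) in model.items())
            if score > best_score:
                best_posture, best_score = posture, score
        return best_posture

    # Example frame computed from one DiamondTouch input frame (values are made up):
    print(classify_frame({"contact_area": 5.4, "bbox_diagonal": 0.28, "touch_count": 5}))

In this style of sketch, the dynamic gestures mentioned above (e.g., two fingers moving apart) would be detected by comparing the classified postures and feature values across successive frames.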

4. GOOGLE EARTH and WARCRAFT III Our case studies leverage the power of two commercial single user geospatial applications: Google Earth (earth.google.com) and Blizzard’s Warcraft (www.blizzard.com/war3). The following sections briefly describe their functionality and how our multimodal interface interacts with them. While the remainder of this paper primarily focuses on two people working over these applications, many of the points raised apply equally to groups of three or four.

4.1 Google Earth Google Earth is a free desktop geospatial application that allows one to search, navigate, bookmark, and annotate satellite imagery of the entire planet using a keyboard and mouse. Its database contains detailed satellite imagery with layered geospatial data (e.g., roads, borders, accommodations, etc). It is highly interactive, with compelling real time feedback during panning, zooming and ‘flying’ actions, as well as the ability to tilt and rotate the scene and view 3D terrain or buildings. Previously visited places can be bookmarked, saved, exported and imported using the places feature. One can also measure the distance between any two points on the globe. Table 1 provides a partial list of how we mapped Google Earth onto our multimodal speech and gesture system, while Fig. 2 illustrates Google Earth running on our multimodal, multi user table. Due to reasons that will be explained in §5.4, almost all speech and gesture actions are independent of one another and immediately invoke an action after being issued. Exceptions are ‘Create a path / region’ and ‘measure distance’, where the system waits for finger input and an ‘ok’ or ‘cancel’ utterance (Fig. 1).
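As an aside, the distinction in Table 1 between discrete zooming and continuous zooming ("done rapidly") can be illustrated with a small design sketch of how repeated two-finger zoom gestures might be interpreted. This is a hypothetical sketch, not our actual gesture engine; the threshold value and callback names are assumptions.

    # Sketch (assumed design): deciding between discrete and continuous zoom from
    # two-finger spread/pinch events, following the Table 1 mapping where two rapid
    # zoom gestures switch to continuous zoom until release.
    import time

    RAPID_REPEAT_SECONDS = 0.5   # assumed threshold for "done rapidly"

    class ZoomGestureMapper:
        def __init__(self, zoom_discrete, zoom_continuous):
            self.zoom_discrete = zoom_discrete        # callback: one zoom step in/out
            self.zoom_continuous = zoom_continuous    # callback: start continuous zoom
            self.last_zoom_time = 0.0

        def on_two_finger_gesture(self, direction):
            """direction is 'apart' (zoom in) or 'together' (zoom out)."""
            now = time.time()
            if now - self.last_zoom_time < RAPID_REPEAT_SECONDS:
                self.zoom_continuous(direction)       # two zooms in quick succession
            else:
                self.zoom_discrete(direction)
            self.last_zoom_time = now

    # Example with stub callbacks:
    mapper = ZoomGestureMapper(lambda d: print("discrete zoom", d),
                               lambda d: print("continuous zoom", d))
    mapper.on_two_finger_gesture("apart")
    mapper.on_two_finger_gesture("apart")   # rapid repeat triggers continuous zoom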

Table 1. The Speech/Gesture interface to Google Earth

Speech commands:
• Fly to, e.g., Boston, Paris: Navigates to location
• Places, e.g., MERL: Flies to custom-created places
• Navigation panel: Toggles 3D navigation controls, e.g., rotate
• Layer: Toggles a layer, e.g., bars, banks
• Undo layer: Removes last layer
• Reorient: Returns to the default upright orientation
• Create a path … Ok: Creates a path that can be travelled in 3D
• Tour last path: Does a 3D flyover of the previously drawn path
• Bookmark: Pin + save current location
• Create a region: Highlight via semi-transparent region
• Last bookmark: Fly to last bookmark
• Next bookmark: Fly to previous bookmark
• Measure distance: Measures the shortest distance between two points

Gesture commands:
• One finger move / flick: Pans map directly / continuously
• One finger double tap: Zoom in 2x at tapped location
• Two fingers, spread apart: Zoom in
• Two fingers, brought together: Zoom out
• Either zoom gesture done rapidly: Continuous zoom in / out until release
• One hand: 3D tilt down
• Five fingers: 3D tilt up

Figure 2. Google Earth on a table.

4.2 Warcraft III

Warcraft III is a real time strategy game. It implements a command and control scenario over a geospatial landscape. The landscape is presented in two ways: a detailed view that can be panned, and a small inset overview. No continuous zooming features are available like those in Google Earth. Within this setting, a person can create units comprising semi-autonomous characters, and direct characters and units to perform a variety of actions (e.g., move, build, attack). While Google Earth is about navigating an extremely large and detailed map, Warcraft is about giving people the ability to manage, control and reposition different units over a geospatial area.

Table 2 shows how we mapped Warcraft III onto speech and gestures, while Fig. 3 illustrates two people interacting with it on a table. Unlike Google Earth, and again for reasons that will be discussed in §5.4, Warcraft’s speech and gesture commands are often intertwined. For example, a person may tell a unit to attack, where the object to attack can be specified before, during or even after the speech utterance.

5. ANALYSIS and GUIDELINES
From our experiences implementing multi-user multimodal wrappers for Google Earth and Warcraft III, we encountered a number of limitations that influenced our wrapper design, as outlined below. When possible, we present solutions to mitigate these limitations, which can also guide the design of future multi-user multimodal interactions built atop single user applications.

This section is loosely structured as follows. The first three subsections raise issues that are primarily a consequence of constraints raised by how the single user application produces visual output: upright orientation, full screen views, and feedthrough. The remaining subsections are a consequence of constraints raised by how the application considers user input: interacting speech and gestures, mapping, and turn taking.

Table 2. The Speech/Gesture interface to Warcraft III

Speech commands:
• Unit <#>: Selects a numbered unit, e.g., one, two
• Attack / attack here [point]: Selected units attack a pointed-to location
• Build here [point]: Build object at current location, e.g., farm, barracks
• Move / move here [point]: Move to the pointed-to location
• [area] Label as unit <#>: Adds a character to a unit group
• Stop: Stop the current action
• Next worker: Navigate to the next worker

Gesture commands:
• One hand: Pans map directly
• One finger: Selects units & locations
• Two fingers: Context-dependent move or attack
• Two sides of hand: Select multiple workers in an area

Figure 3. Two people interacting with Warcraft III.

5.1 Upright Orientation
Most single user systems are designed for an upright display rather than a table. Thus all display items and GUI widgets are oriented in a single direction usually convenient for the person seated at the ‘bottom’ edge of the display, but upside down for the person seated across from them. As illustrated in the upside down inset figure, a screenshot from Google Earth, problems introduced include reduced text readability (but see [24]), difficulties in comprehending incorrectly oriented 3D views, inhibiting people from claiming ownership of work areas [15], and preventing people from naturally adjusting orientation as part of their collaborative process [15]. Similarly, the layout of items on the surface usually favors a single orientation, which has implications for how people can see and reach distant items if they want to perform gestures over them.

Warcraft III maintains a strictly upright orientation; while people can pan, they cannot rotate the landscape. Critical interface features, such as the overview map, are permanently positioned at the bottom left corner, which is inconvenient for a person seated to the right who wishes to navigate using the overview map. Google Earth has similar constraints: its navigation panel (exposed by a speech command) is at the very bottom, making its tilt GUI control awkward to use for anyone but the upright user. While Google Earth allows the map to be rotated, text labels atop the map are not rotated. In both systems, 3D perspective is oriented towards the upright user. A tilted 3D image is the norm in Warcraft III. While Google Earth does provide controls to adjust the 3D tilt of a building on the map, the viewpoint always remains set for the upright user. Some of these problems are not solvable as they are inherent to the single user application, although people can choose to work side by side on the bottom edge. However, speech appears to be an ideal input modality for solving problems arising from input orientation and reach, since users can sit around any side of the table to issue commands (vs. reach, touch or type).

5.2 Full Screen Views
Many applications provide a working area typically surrounded by a myriad of GUI widgets (menus, palettes, etc.). While these controls are reasonable for a single user, multiple people working on a spatial landscape expect to converse over the scene itself. Indeed, one of the main motivations for a multimodal system is to minimize these GUI elements. Fortunately, many single user applications provide a ‘full screen’ view, where content fills the entire screen and GUI widgets are hidden. The trade-off is that only a few basic actions are allowed, usually through direct manipulation or keyboard shortcuts (although some applications provide hooks through accessibility APIs).

Because Warcraft III is designed as a highly interactive game, it already exploits a full screen view in which all commands are accessible through keyboard shortcuts or direct manipulation. Thus speech/gesture can be directly mapped to keyboard/mouse commands. In contrast, Google Earth contains traditional GUI menus and sidebars: 42% of the screen real estate is consumed by GUI items on a 1024x768 screen! While these elements can be hidden by toggling the application into full screen mode, much of Google Earth’s functionality is only accessible through these menus and sidebars. Our solution uses full screen mode, in which we map multimodal commands to action macros that first expose a hidden menu or sidebar, perform the necessary action on it (via WidgetTap and SendInput), and then hide the menu or sidebar (see §5.5). When this stream of interface actions is executed as a single step, the interface elements and intermediate inputs are effectively hidden.

5.3 Feedback and Feedthrough
Feedback of actions is important for single user systems. Feedthrough (the visible consequence of another person’s actions) is just as important if the group is to comprehend what another person is doing [7]. True groupware systems can be constructed to regulate the feedback and feedthrough so it is appropriate to the acting user and the viewing participants. Within single user systems, we can only use what is provided. Fortunately, both Google Earth and Warcraft III are highly interactive, immediately responding to all user commands in a very visual and often compelling manner. Panning in both produces an immediate response, as does zooming or issuing a ‘Fly to’ command in Google Earth. Warcraft III visually marks all selections, reinforcing the meaning of a gestural act. Warcraft III also gives verbal feedback. For example, if one says the ‘Move here’ or ‘Attack here’ voice command and points to a location (Table 2), the units will respond with a prerecorded utterance such as “yes, master” and will then move to the specified location.


In both systems, some responses are animated over time. For example, ‘Fly to’ ‘Calgary’ from a distant location will begin an animated flyover by first zooming out of the current location, flying towards Calgary, and zooming into the centre of the city. Similarly, panning contains some momentum in Google Earth, thus a flick gesture on the tabletop will send the map continually panning in the direction of the flick. In Warcraft III, if one instructs ‘Unit one, build farm’, it takes time for that unit to run to that location and to build the farm. These animations provide excellent awareness to the group, for the feedthrough naturally emphasises individual actions [12].


Animations over time also provide others with the ability to interrupt or modify the ongoing action. For example, animated flyovers, continuous zooming or continuous panning in Google Earth can be interrupted by a collaborator at any point by touching on the table surface. Similarly a ‘stop’ voice command in Warcraft III can interrupt any unit’s action at any time. Feedback, even when it is missing, is also meaningful as it indicates that the system is waiting for further input. For example, if one says ‘Unit one move’ to Warcraft III, the group will see unit one selected and a cross hair indicating that it is waiting for a location to move to, but nothing will actually happen until one points to the surface. This also provides others with the ability to interrupt, and even to take over the next part of the dialog (§5.6).

5.4 Interacting Speech and Gestures
Ideally, we would like to have the system respond to interacting and possibly overlapping speech and gesture acts, e.g., ‘Put that’ ‘there’ [1]. This is how deixis and consequential communication work. It may even be possible to have multiple people contribute to command construction through turn taking (see §5.6). However, the design of the single user application imposes restrictions on how this can be accomplished.

Google Earth only allows one action to be executed at a time; no other action can be executed until that action is completed. For example, if a person performs simultaneous keyboard and mouse interactions, only the keyboard commands will be performed. The design consequence is that we had to map most spoken and gestural actions into separate commands in Google Earth (Table 1). As mentioned, with the exception of the ‘create a path/region’ and ‘measure distance’ commands, gestures and speech do not interact directly. Some gesture and speech commands move or zoom to a location. Other speech commands operate in the context of the current location, usually the center of the screen. For example, ‘bookmark’ only acts on the screen center; while a person can position the map so the location is at its center, they cannot say ‘Bookmark’ and point to a location off to the side.

In contrast, Warcraft III is designed to be used with the keyboard and mouse in tandem, i.e., it can react to keyboard and mouse commands simultaneously. This makes it possible to use intermixed speech and deixis for directing units. Our mapping uses speech in place of keyboard commands, and gesture in place of mouse commands, e.g., saying ‘Unit 1, move here’ while pointing to a location. By understanding the sometimes subtle input constraints of the single user application, a designer can decide if and where intermixing of speech and gestures via mapping is possible.
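To illustrate the kind of intermixing Warcraft III permits, here is a minimal fusion sketch: a deictic spoken phrase is held as a pending act until a touch point arrives within a short time window, and the two are then forwarded together. The class name, time window and callback signature are illustrative assumptions rather than our infrastructure’s actual code.

    # Minimal fusion sketch (an assumed design, not our actual infrastructure):
    # a spoken command like "move here" is held as a pending act until a touch
    # point arrives within a short time window, then both are combined.
    import time

    PENDING_WINDOW_SECONDS = 3.0  # assumed window for pairing speech with a point

    class MultimodalFuser:
        def __init__(self, send_to_application):
            self.send = send_to_application   # callback that forwards a complete command
            self.pending = None               # (user, utterance, timestamp) awaiting a location

        def on_speech(self, user, utterance):
            if "here" in utterance:           # deictic phrase: wait for a gesture
                self.pending = (user, utterance, time.time())
            else:                             # self-contained phrase: forward immediately
                self.send(user, utterance, location=None)

        def on_touch(self, user, x, y):
            if self.pending and time.time() - self.pending[2] < PENDING_WINDOW_SECONDS:
                speaker, utterance, _ = self.pending
                self.pending = None
                # Note: the pointing user need not be the speaker (see §5.6 on assistance).
                self.send(speaker, utterance, location=(x, y))
            else:
                self.send(user, "select", location=(x, y))

    # Example usage with a stub application sink:
    fuser = MultimodalFuser(lambda user, cmd, location: print(user, cmd, location))
    fuser.on_speech("user1", "unit one move here")
    fuser.on_touch("user2", 320, 240)   # completes the pending command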

5.5 Mapping
Complementary Modes. Our behavioural foundations state that speech and gesture differ in their ability to transmit and communicate information, and in how they interact to preserve simplicity and efficiency [17][5][3]. Within Google Earth and Warcraft III (Tables 1 & 2), we reserve gestures primarily for spatial manipulations: navigation, deixis and selections. ‘Abstract’ commands are moved onto the speech channel.

Mapping of Gestures. Many systems rely on abstract gestures to invoke (i.e., mode change into) commands. For example, a two-fingered gesture invokes an ‘Annotate’ mode in Wu’s example application [25]. Yet our behavioural foundations state that people working over a table should be able to easily understand other people’s rich gestural acts and hand postures as both consequential communication and as communicative acts. This strongly suggests that our vocabulary of postures and dynamics must reflect people’s natural gestures as much as possible (a point also advocated in [25][26]). Because we reserve gestures for spatial manipulations, very little learning is needed: panning by dragging one’s finger or hand across the surface is easily understood by others, as is the surface stretching metaphor used in spreading apart or narrowing two fingers to activate discrete or continuous zooming in Google Earth. Pointing to indicate deictic references, and using the sides of two hands to select a group of objects in Warcraft III, is also well understood [17][5][3]. Because most of these acts work over a location, gaze awareness becomes highly meaningful. However, the table’s input constraints can restrict what we would like to do. For example, an upwards hand tilt movement would be a natural way to tilt the 3D map of Google Earth, but this posture is not recognized by the DiamondTouch table. Instead, we resort to a more abstract one hand / five finger gesture set to tilt the map up and down (Table 1).

Mapping of Speech. A common approach to wrapping speech atop single user systems is to do a 1:1 mapping of speech onto system-provided command primitives. This is inadequate for a multi-user setting: a person should be able to rapidly issue semantically meaningful commands to the table, and should easily understand the meaning of other people’s spoken commands within the context of the visual landscape and their gestural acts. In other words, speech is intended not only for the control of the system, but also for the benefit of one’s collaborators. If speech were too low level, the other participants would have to consciously reconstruct the intention of the user. The implication is that speech commands must be constructed so that they become meaningful ‘alouds’. Within Google Earth, we simplified many commands by collapsing a long sequential interaction flow into a macro invoked by a single well-formed utterance (Table 1). For example, with a keyboard and mouse, flying to Boston while in full screen mode requires the user to: 1) use the tool menu to open a search sidebar, 2) click on the search textbox, 3) use the keyboard to type in ‘Boston, MA’ followed by the return key, and 4) use the tool menu to close the search sidebar. Instead, a person simply speaks the easily understood two-part utterance ‘Fly to’ ‘Boston’. We also created ‘new’ commands that make sense within a multimodal multi-user setting, but that are not provided by the base system. For example, we added the ability for anyone to undo layer operations (which add geospatial information to the map) by creating an ‘Undo Layer’ command (Table 1). Under the covers, our mapping module remembers the last layer invoked and toggles the correct checkbox in the GUI to turn it off.

Intermixing of Speech and Gesture. We explained previously that a strength of multimodal interaction is that speech and gestures can interact to provide a rich and expressive language for interaction and collaboration. Because of its ability to execute simultaneous commands, Warcraft III provides a good example of how speech and gesture can be mapped to interact over a single user application. Our Warcraft III speech vocabulary was constructed as easily understood phrases: nouns such as ‘unit one’, verbs such as ‘move’, action phrases such as ‘build farm’ (Table 2). These speech phrases are usually combined with gestures describing locations and selections to complete the action sequence. For example, a person may select a unit, and then say ‘Build Barracks’ while pointing to the location where it should be built. This intermixing not only makes input simple and efficient, but makes the action sequence easier for others to understand.
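The ‘Fly to Boston’ macro described above can be sketched as follows. The helper functions stand in for WidgetTap lookups and synthesized SendInput events; their names and signatures are hypothetical placeholders, not the real toolkit APIs.

    # Sketch of macro expansion for a spoken command, assuming hypothetical helpers
    # find_widget(), click() and send_keys() that stand in for WidgetTap lookups
    # and synthesized SendInput events; these names are illustrative, not real APIs.

    def find_widget(name):                 # placeholder: locate a GUI element by name
        print(f"[lookup] {name}")
        return name

    def click(widget):                     # placeholder: synthesize a mouse click
        print(f"[click] {widget}")

    def send_keys(text):                   # placeholder: synthesize keystrokes
        print(f"[type] {text}")

    def fly_to(place):
        """Expand the spoken 'Fly to <place>' into the equivalent GUI action sequence."""
        click(find_widget("tools_menu_search"))   # 1) expose the hidden search sidebar
        click(find_widget("search_textbox"))      # 2) focus the search textbox
        send_keys(place + "\n")                   # 3) type the place name, press return
        click(find_widget("tools_menu_search"))   # 4) hide the sidebar again

    # A speech recognizer hypothesis such as "fly to Boston" would invoke:
    fly_to("Boston, MA")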

5.6 Turn taking
Single user applications expect only a single stream of input coming from a single person. In a multi-user setting, these applications cannot disambiguate what commands come from what person, nor can they make sense of overlapping commands and/or command fragments that arise from simultaneous user activities. In shared window systems, confusion arising from simultaneous user input across workstations is often regulated through a turn taking wrapper interposed between the multiple workstation input streams and the single user application [9][10]. Akin to a switch, this wrapper regulates user pre-emption so that only one workstation’s input stream is selected and sent to the underlying application. The wrapper could embody various turn taking protocols, e.g., explicit release (a person explicitly gives up the turn), pre-emptive (a new person can grab the turn), pause detection (explicit release when the system detects a pause in the current turn-holder’s activity), queue or round-robin (people can ‘line up’ for their turns), central moderator (a chairperson assigns turns), and free floor (anyone can input at any time, but the group is expected to regulate their turns using social protocol) [10]. In the distributed setting of shared window systems, technical enforcement of turn taking is often touted since interpersonal awareness is inadequate to effectively use social mediation. Our two case studies reveal far richer opportunities for social regulation of turn-taking in tabletop multimodal environments.

Ownership through Awareness. We noticed that unlike the distance-separated users of shared window systems, co-located tabletop users were aware of the moment-by-moment actions of others and thus were far better able to use social protocol to mediate their interactions. Alouds arising from speaking into the headset let others know that one had just issued a command so they could reconstruct its purpose; thus people are unlikely to verbally overlap one another, or to unintentionally issue a conflicting command. Through consequential communication, people see that one is initiating, continuing or completing a gestural act; this strongly suggests one’s momentary ‘ownership’ of the table and thus regulates how people time appropriate opportunities for taking over.

The real time visual feedback and feedthrough provided by both Google Earth and Warcraft emphasises who is in control, what is happening, when the consequences of their act are completed, and when it is appropriate to intercede.

Interruptions. We noticed that awareness not only lets people know who is in control, but also provides excellent opportunities for interruptions. That is, a person may judge moments where they can stop, take over and/or fine-tune another person’s actions. Eye gaze and consequential communication help people mutually understand when this is about to happen, enabling cooperation rather than conflict. We already described how animations initiated by user actions (e.g., unit movement in Warcraft or the animated flyovers in Google Earth) can be stopped or redirected by a spoken command (‘Stop’) or a gestural command (touching the surface).

Assistance. Awareness also provides opportunities for people to offer assistance. Indeed, the interruptions mentioned above are likely a form of assistance, i.e., to repair or correct an action initiated by another person. Assistance also occurs when multiple people interleave their speech and gestures to compose a single command. For example, we previously mentioned in §5.5 how multimodal commands in Warcraft III are actually phrases, where phrases are chained together to compose a full command. As one person starts a command (‘unit one’, ‘move’) another can continue by pointing to the place where it should move to. Similarly, the ‘create a path’ and ‘create a region’ spoken commands in Google Earth expect a series of points: all members of the group can contribute these points through touch gestures.

The Mode problem. In spite of the above, people can only work within the current mode of the single user application. While one can take over (through turn taking) actions within a mode, two people cannot work in different modes at the same time. For example, in Warcraft III it is not possible for multiple people to control different units simultaneously.

In summary, while our experiences with our case studies suggest that social regulation of turn taking suffices for two people working over a multimodal, multi-user tabletop (since the group has enough information to regulate themselves), there could be situations in which technical mediation is desired. Examples could include larger groups (to avoid accidental command overlap and interruptions), participants with different roles, or conflict situations. This proved fairly easy to do by incorporating a turn taking layer into the Application Mapping module in our infrastructure (Fig. 1). This module already knows which user is trying to interact with the system by touch or speech, and can detect when multiple people are contending for the turn. Decision logic or coordination policies [10][21] can then decide which input to forward to the application, and which to ignore (or queue for later). The logic could enforce turn taking policies at different levels of granularity, as sketched after the list below.
• Floor control dictates turns at a person level, i.e., a person is in control of all interaction until that turn is relinquished to someone else.
• Input control: one input modality has priority over another modality, e.g., gesture takes priority over speech commands.
• Mode control enforces turn taking at a finer granularity. If the system detects that a person has issued a command that enters a mode, it blocks or queues all other input until the command is complete and the mode is exited. For example, if a person opens the navigation panel or begins a tour flyover in Google Earth, all input is blocked until the flyover is completed.



• Command control considers turn taking within command composition. If the system detects that a person has issued a phrase initiating a command, it may restrict completion of that command to that person, e.g., if a person selects a character in Warcraft III, the system may temporarily block others from issuing commands to that character. Alternately, other people may be allowed to interleave a subset of command phrases to that character, e.g., while they can gesture to enter points into Google Earth’s ‘Create a Path’ command, only the initiator can complete that command with the spoken ‘Ok’.
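A minimal sketch of such a turn-taking layer follows, assuming each input event arrives tagged with its user and modality. The class structure and policy encodings are illustrative assumptions; a real implementation would also need queuing, timeouts and command-level tracking.

    # Sketch of a turn-taking layer that could sit in front of the application
    # mapping module; the class and field names are illustrative assumptions.

    class TurnTakingLayer:
        def __init__(self, policy="floor"):
            self.policy = policy        # "floor", "input", "mode", or "command"
            self.turn_holder = None     # user currently holding the floor
            self.mode_locked_by = None  # user who entered a blocking mode

        def admit(self, event):
            """Return True if the event should be forwarded to the application."""
            user, modality = event["user"], event["modality"]   # modality: "speech"/"gesture"

            if self.policy == "floor":
                if self.turn_holder in (None, user):
                    self.turn_holder = user
                    return True
                return False                       # someone else holds the floor

            if self.policy == "input":
                # e.g., gestures take priority; a fuller version would queue speech
                # only while a gesture is actually in progress.
                return modality == "gesture"

            if self.policy == "mode":
                if self.mode_locked_by not in (None, user):
                    return False                   # block/queue input while another user is mid-mode
                if event.get("enters_mode"):
                    self.mode_locked_by = user
                if event.get("exits_mode"):
                    self.mode_locked_by = None
                return True

            return True                            # "command" and free-floor cases left to social protocol

    # Example: two people contending under floor control.
    layer = TurnTakingLayer(policy="floor")
    print(layer.admit({"user": "u1", "modality": "speech"}))   # True, u1 takes the floor
    print(layer.admit({"user": "u2", "modality": "gesture"}))  # False, u1 still holds it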

6. CONCLUSIONS
This paper described how and why we enabled speech and gestural interaction with commercial single user applications on a multi-user tabletop. By surveying the literature on groupware and multimodal interactions, we presented key behavioural affordances that motivate and inform the use of multimodal, multi-user tabletop interaction. These behavioural affordances were applied in practice by wrapping two existing geospatial systems (Google Earth and Warcraft III) atop a common Gesture Speech Infrastructure. From our experiences, we derived a detailed but generalized analysis of issues and workarounds, which in turn provides guidance to future developers of this class of systems.

This work represents an important first step in bringing multimodal multi-user interaction to a table display. By leveraging the power of popular single user applications, we bring a visual and interactive richness to tabletop interaction that cannot be achieved by a simple research prototype. Consequently, demonstrations of our systems to the creators of Google Earth, real world users of geospatial systems including NYPD officers with the Real Time Crime Center, and Department of Defence members have evoked overwhelmingly positive and enthusiastic comments, e.g., “How could it be any more intuitive?” For our next steps, we are studying ‘true’ multi-user, multimodal tabletop systems that will serve as stand-alone applications, and as an interactive layer placed atop single-user systems.

Illustrative video: Visit http://grouplab.cpsc.ucalgary.ca/tabletop/

Acknowledgements. We are grateful for the support from our sponsors: ARDA, NGA, NSERC, Alberta Ingenuity and iCORE.

7. REFERENCES
[1] Bolt, R.A. Put-that-there: Voice and gesture at the graphics interface. Proc ACM Conf. Computer Graphics and Interactive Techniques, Seattle, 1980, 262-270.
[2] Clark, H. Using language. Cambridge Univ. Press, 1996.
[3] Cohen, P. Speech can’t do everything: A case for multimodal systems. Speech Technology Magazine, 5(4), 2000.
[4] Cohen, P.R., Coulston, R. and Krout, K. Multimodal interaction during multiparty dialogues: Initial results. Proc IEEE Int’l Conf. Multimodal Interfaces, 2002, 448-452.
[5] Cohen, P.R., Johnston, M., McGee, D., Oviatt, S., Pittman, J., Smith, I., Chen, L. and Clow, J. QuickSet: Multimodal interaction for distributed applications. Proc ACM Multimedia, 1997, 31-40.

[6] Dietz, P. and Leigh, D. DiamondTouch: a multi-user touch technology. Proc ACM UIST, 2001, 219-226.
[7] Dix, A., Finlay, J., Abowd, G. and Beale, R. Human-Computer Interaction. 2nd ed. Prentice Hall, 1998.
[8] Greenberg, S. and Boyle, M. Customizable physical interfaces for interacting with conventional applications. Proc ACM UIST, 2002, 31-40.
[9] Greenberg, S. Sharing views and interactions with single-user applications. Proc ACM COIS, 1990, 227-237.
[10] Greenberg, S. Personalizable groupware: Accommodating individual roles and group differences. Proc ECSCW, 1991, 17-32.
[11] Gutwin, C. and Greenberg, S. The importance of awareness for team cognition in distributed collaboration. In E. Salas, S. Fiore (Eds) Team Cognition: Understanding the Factors that Drive Process and Performance, APA Press, 2004, 177-201.
[12] Gutwin, C. and Greenberg, S. Design for individuals, design for groups: Tradeoffs between power and workspace awareness. Proc ACM CSCW, 1998, 207-216.
[13] Heath, C.C. and Luff, P. Collaborative activity and technological design: Task coordination in London Underground control rooms. Proc ECSCW, 1991, 65-80.
[14] Ishii, H., Kobayashi, M. and Grudin, J. Integration of interpersonal space and shared workspace: ClearBoard design and experiments. ACM TOIS, 11(4), 1993, 349-375.
[15] Kruger, R., Carpendale, M.S.T., Scott, S. and Greenberg, S. Roles of orientation in tabletop collaboration: Comprehension, coordination and communication. J CSCW, 13(5-6), 2004, 501-537.
[16] McGee, D.R. and Cohen, P.R. Creating tangible interfaces by augmenting physical objects with multimodal language. Proc ACM Conf Intelligent User Interfaces, 2001, 113-119.
[17] Oviatt, S.L. Ten myths of multimodal interaction. Comm. ACM, 42(11), 1999, 74-81.
[18] Oviatt, S. Multimodal interactive maps: Designing for human performance. Human-Computer Interaction, 12, 1997.
[19] Pinelle, D., Gutwin, C. and Greenberg, S. Task analysis for groupware usability evaluation: Modeling shared-workspace tasks with the mechanics of collaboration. ACM TOCHI, 10(4), 2003, 281-311.
[20] Rekimoto, J. SmartSkin: An infrastructure for freehand manipulation on interactive surfaces. Proc ACM CHI, 2002.
[21] Ringel-Morris, M., Ryall, K., Shen, C., Forlines, C. and Vernier, F. Beyond social protocols: Multi-user coordination policies for co-located groupware. Proc ACM CSCW, 2004, 262-265.
[22] Segal, L. Effects of checklist interface on non-verbal crew communications. NASA Ames Research Center, Contractor Report 177639, 1994.
[23] Tang, J. Findings from observational studies of collaborative work. Int. J. Man-Machine Studies, 34(2), 1991, 143-160.
[24] Wigdor, D. and Balakrishnan, R. Empirical investigation into the effect of orientation on text readability in tabletop displays. Proc ECSCW, 2005.
[25] Wu, M., Shen, C., Ryall, K., Forlines, C. and Balakrishnan, R. Gesture registration, relaxation, and reuse for multi-point direct-touch surfaces. IEEE Int’l Workshop Horizontal Interactive Human-Computer Systems (TableTop), 2006.
[26] Wu, M. and Balakrishnan, R. Multi-finger and whole hand gestural interaction techniques for multi-user tabletop displays. Proc ACM UIST, 2003, 193-202.


Appendix C. Multimodal Games Paper

Reference: Tse, E., Greenberg, S., Shen, C. and Forlines, C. (2006) Multimodal Multiplayer Tabletop Gaming. Pervasive Games Workshop, Dublin, Ireland, to appear.


MULTIMODAL MULTIPLAYER TABLETOP GAMING
Edward Tse (1,2), Saul Greenberg (2), Chia Shen (1), Clifton Forlines (1)
(1) Mitsubishi Electric Research Laboratories, [shen, forlines]@merl.com
(2) University of Calgary, Alberta, Canada, [tsee, saul]@cpsc.ucalgary.ca
Jointly available as Report 2006-823-16, Department of Computer Science, University of Calgary, Alberta, CANADA and MERL Technical Report TR2006009, Mitsubishi Electric Research Laboratories, 201 Broadway, Cambridge, MA, 02139, USA. February, 2006.

Abstract
There is a large disparity between the rich physical interfaces of co-located arcade games and the generic input devices seen in most home console systems. In this paper we argue that a digital table is a conducive form factor for general co-located home gaming as it affords: (a) seating in collaboratively relevant positions that give all equal opportunity to reach into the surface and share a common view, (b) rich whole handed gesture input normally only seen when handling physical objects, (c) the ability to monitor how others use space and access objects on the surface, and (d) the ability to communicate to each other and interact atop the surface via gestures and verbal utterances. Our thesis is that multimodal gesture and speech input benefits collaborative interaction over such a digital table. To investigate this thesis, we designed a multimodal, multiplayer gaming environment that allows players to interact directly atop a digital table via speech and rich whole hand gestures. We transform two commercial single player computer games, representing a strategy and simulation game genre, to work within this setting.

1. Introduction
Tables are a pervasive component in many real-world games. Players sit around a table playing board games; even though most require turn-taking, the ‘inactive’ player remains engaged and often has a role to play (e.g., the ‘banker’ in Monopoly; the chess player who continually studies the board). In competitive game tables, such as air hockey and foosball, players take sides and play directly against each other – both are highly aware of what the other is doing (or about to do), which affects their individual play strategies. Construction games such as Lego® invite children to collaborate while building structures and objects (here, the floor may serve as a ‘table’). The dominant pattern is that tabletop games invite co-located interpersonal play, where players are engaged with both the game and each other. People are tightly coupled in how they monitor the game surface, and each other’s actions [10]. There is much talk between players, ranging from exclamations to taunts to instructions and encouragement. Since people sit around a digital table, they can monitor both the artefacts on the digital display as well as the gestures of others.

Oddly, most home-based computer games do not support this kind of play. Consider the dominant game products: desktop computer games, and console games played on a television. Desktop computers are largely constructed as a single user system: the size of the screen, the standard single mouse and keyboard, and how people orient computers on a desk impede how others can join in. Consequently, desktop computer games are typically oriented for a single person playing either alone, or with remotely located players. If other co-located players are present, they normally have to take turns using the game, or work ‘over the shoulder’ where one person controls the game while others offer advice. Either way, the placement and relatively small size of the monitor usually means that co-located players have to jockey for space [7]. Console games are better at inviting co-located collaboration. Televisions are larger and are usually set up in an area that invites social interaction, meaning that a group of people can easily see the surface. Interaction is not limited to a single input device; indeed four controllers are the standard for most commercial consoles. However, co-located interaction is limited. On some games, people take turns at playing game rounds. Other games allow players to interact simultaneously, but they do so by splitting the screen,
providing each player with one’s own custom view onto the play. People sit facing the screen rather than each other. Thus the dominant pattern is that co-located people tend to be immersed in their individual view into the game at the expense of the social experience.

We believe that a digital table can offer a better social setting for gaming when compared to desktop and console gaming. Of course, this is not a new idea. Some vendors of custom video arcade games (e.g., as installed in video arcades, bars, and other public places) use a tabletop format, typically with controls placed either side by side or opposite one another. Other manufacturers create special purpose digital games that can be placed atop a flat surface. The pervasive gaming community has shown a growing interest in bringing physical devices and objects into the gaming environment. For example, Magerkurth [12] tracked tangible pieces placed atop a digital tabletop. Akin to physical devices in arcades, the physical manipulation of game pieces supports rich visceral and gestural affordances (e.g., holding a gun). Yet to our knowledge, no one has yet analyzed the relevant behavioural foundations behind tabletop gaming and how that can influence game design.

Our goal in this paper is to take on this challenge. First, we summarize the behavioural foundations of how people work together over shared visual surfaces. As we will see, good collaboration relies on at least: (a) people sharing a common view, (b) direct input methods that are aware of multiple people, (c) people’s ability to monitor how others directly access objects on the surface, and (d) how people communicate to each other and interact atop the surface via gestures and verbal utterances. From these points, we argue that the digital tabletop is a conducive form factor for co-located game play as it lets people easily position themselves in a variety of collaborative postures (side by side, kitty-corner, round table, etc.) while giving all equal and simultaneous opportunity to reach into and interact over the surface. We also argue that multimodal gesture and speech input benefits collaborative tabletop interaction. Second, we apply this knowledge to the design of a multimodal, multiplayer gaming environment that allows people to interact directly atop a digital table via speech and gestures, where we transform single player computer games to work within this setting via our Gesture Speech Infrastructure [18].

2. Behavioural Foundations
The rich body of research on how people interact over horizontal and vertical surfaces agrees that spatial information placed atop a table typically serves as a conversational prop to the group. In turn, this creates a common ground that informs and coordinates their joint actions [2]. Rich collaborative interactions over this information often occur as a direct result of workspace awareness: the up-to-the-moment understanding one person has of another person’s interaction with the shared workspace [10]. This includes awareness of people, how they interact with the workspace, and the events happening within the workspace over time. As summarized below, key behavioural factors contribute to how collaborators maintain workspace awareness by monitoring others’ gestures, speech and gaze [10].

2.1 Gestures
Gestures as intentional communication. In observational studies of collaborative design involving a tabletop drawing surface, Tang noticed that over one third of all activities consisted of intentional gestures [17]. These intentional gestures serve many communication roles [15], including: pointing to objects and areas of interest within the workspace, drawing of paths and shapes to emphasise content, giving directions, indicating sizes or areas, and acting out operations.

Rich gestures and hand postures. Observations of people working over maps showed that people used different hand postures as well as both hands coupled with speech in very rich ways [4]. These animated gestures and postures are easily understood as they are often consequences of how one manipulates or refers to the surface and its objects, e.g., grasping, pushing, and pointing postures.

Gestures as consequential communication. Consequential communication happens as one watches the bodies of others moving around the work surface [16][15]. Many gestures are consequential rather than intentional communication. For example, as one person moves her hand in a grasping posture towards an object, others can infer where her hand is heading and what she plans to do. Gestures are also produced as part of many mechanical actions, e.g., grasping, moving, or picking up an object: this also serves to emphasize actions atop the workspace. If accompanied by speech, it also serves to reinforce one’s understanding of what that person is doing.

Gestures as simultaneous activity. Given good proximity to the work surface, participants often gesture simultaneously over tables. For example, Tang observed that approximately 50-70% of people’s activities around the tabletop involved simultaneous access to the space by more than one person, and that many of these activities were accompanied by a gesture of one type or another.

2.2 Speech and alouds
Talk is fundamental to interpersonal communication. It serves many roles: to inform, to debate, to taunt, to command, to give feedback [2]. Speech also provides awareness through alouds. Alouds are high level spoken utterances made by the performer of an action meant for the benefit of the group but not directed to any one individual in the group [11]. This ‘verbal shadowing’ becomes the running commentary that people commonly produce alongside their actions. When working over a table, alouds can help others decide when and where to direct their attention, e.g., by glancing up and looking to see what that person is doing in more detail [10]. For example, a person may say something like “I am moving this car” for a variety of reasons:

• to make others aware of actions that may otherwise be missed,
• to forewarn others about the action they are about to take,
• to serve as an implicit request for assistance,
• to allow others to coordinate their actions with one’s own,
• to reveal the course of reasoning,
• to contribute to a history of the decision making process.

2.3 Combination: Gestures and Speech
Deixis: speech refined by gestures. Deictic references are speech terms (‘this’, ‘that’, etc.) whose meanings are disambiguated by spatial gestures (e.g., pointing to a location). A typical deictic utterance is “Put that…” (points to item) “there…” (points to location) [1]. Deixis often makes communication more efficient since complex locations and object descriptions can be replaced in speech by a simple gesture. For example, contrast the ease of understanding a person pointing to this sentence while saying ‘this sentence here’ to the utterance ‘the 5th sentence in the paragraph starting with the word deixis located in the middle of page 3’. Furthermore, when speech and gestures are used as multimodal input to a computer, Bolt states [1] and Oviatt confirms [13] that such input provides individuals with a briefer, syntactically simpler and more fluent means of input than speech alone.

Complementary modes. Speech and gestures are strikingly distinct in the information each transmits. For example, studies show that speech is less useful for describing locations and objects that are perceptually accessible to the user, with other modes such as pointing and gesturing being far more appropriate [3,5,13]. Similarly, speech is more useful than gestures for specifying abstract or discrete actions (e.g., Fly to Boston).

Simplicity, efficiency, and errors. Empirical studies of speech/gestures vs. speech-only interaction by individuals performing map-based tasks showed that parallel speech/gestural input yields a higher likelihood of correct interpretation than recognition based on a single input mode [14]. This includes more efficient use of speech (23% fewer spoken words), 35% fewer disfluencies (content self corrections, false starts, verbatim repetitions, spoken pauses, etc.), 36% fewer task performance errors, and 10% faster task performance [14].

Natural interaction. During observations of people using highly visual surfaces such as maps, people were seen to interact with the map very heavily through both speech and gestures. The symbiosis between speech and gestures is verified in the strong user preferences stated by people performing map-based tasks: 95% preferred multimodal interaction vs. 5% preferred pen only. No one preferred a speech-only interface [13].

2.4 Gaze awareness
People monitor the gaze of a collaborator [11,10]. It lets one know where others are looking and where they are directing their attention. It helps monitor what others are doing. It serves as visual evidence to confirm that others are looking in the right place or are attending to one’s own acts. It even serves as a deictic reference by having it function as an implicit pointing act [2]. Gaze awareness happens easily and naturally in a co-located tabletop setting, as people are seated in a way where they can see each other’s eyes and determine where they are looking on the tabletop.

2.5 Implications
The above points, while oriented toward any co-located interaction, clearly motivate digital multiplayer tabletop gaming using gesture and speech input. Intermixed speech and gestures comprise part of the glue that makes tabletop collaboration effective. Multimodal input is a good way to support individual play over visual game artefacts. Taken together, gestures and speech coupled with gaze awareness support a rich choreography of simultaneous collaborative acts over games. Players’ intentional and consequential gestures, gaze movements and verbal alouds indicate intentions, reasoning, and actions. People monitor these acts to help coordinate actions and to regulate their access to the game and its artefacts. Simultaneous activities promote interaction ranging from loosely coupled semi-independent tabletop activities to a tightly coordinated dance of dependent activities.

It also explains the weaknesses of existing games. For example, the seating position of console game players and the detachment of one’s input from the display means that gestures are not really part of the play, consequential communication is hidden, and gaze awareness is difficult to exploit. Because of split screens, speech acts (deixis, alouds) are decoupled from the artefacts of interest. In the next section, we apply these behavioural foundations to ‘redesign’ two existing single player games. As we will see, we create a wrapper around these games that affords multimodal speech and gesture input, and multiplayer capabilities.

3. Warcraft III and The Sims
To illustrate our behavioural foundations in practice, we implemented multiplayer multimodal wrappers atop the two commercial single player games illustrated in Figure 1: Warcraft III (a command and control strategy game) and The Sims (a simulation game). We chose to use existing games for three reasons. First, they provide a richness and depth of game play that could not be realistically achieved in a research prototype. Second, our focus is on designing rich multimodal interactions; this is where we wanted to concentrate our efforts rather than on a fully functional game system.

Figure 1. Two People Interacting with (left) Warcraft III, (right) The Sims

Finally, we could explore the effects of multimodal input on different game genres simply by wrapping different commercial products. The two games we chose are described below.

Warcraft III, by Blizzard Inc., is a real time strategy game that portrays a command and control scenario over a geospatial landscape. The game visuals include a detailed view of the landscape that can be panned, and a small inset overview of the entire scene. Similar to other strategy games, a person can create units comprising semi-autonomous characters, and then direct characters and units to perform a variety of actions, e.g., move, build, attack. Warcraft play is all about a player developing strategies to manage, control and reposition different units over a geospatial area.

The Sims, by Electronic Arts Inc., is a real time domestic simulation game. It implements a virtual home environment where simulated characters (the Sims) live. The game visuals include a landscape presented as an isometric projection of the property and the people who live in it. Players can either control character actions (e.g., shower, play games, sleep) or modify the layout of their virtual homes (e.g., create a table). Game play is about creating a domestic environment nurturing particular lifestyles.

Both games are intended for single user play. By wrapping them in a multimodal, multiuser digital tabletop environment, we repurpose them as games for collaborative play. This is described next.

4. Multiplayer Multimodal Interactions over the Digital Table
For the remainder of this paper, we use these two games as case studies of how the behavioural foundations of Section 2 motivated the design and illustrated the benefits of the rich gestures and multimodal speech input added through our multiplayer wrapper. Tse et al. [18] provides technical aspects of how we created these multi-player wrappers, while Dietz et al. [6] describes the DiamondTouch hardware we used to afford a multiplayer touch surface.

4.1 Meaningful Gestures
We added a number of rich hand gestures to players’ interactions with both Warcraft III and The Sims. The important point is that a gesture is not only recognized as input, but is easily understood as a communicative act providing explicit and consequential information of one’s actions to the other players. We emphasise that our choice of gestures is not arbitrary. Rather, we examined the rich multimodal interactions reported in ethnographic studies of brigadier generals in real world military command and control situations [4].

Figure 2. The Sims: five-finger grabbing gesture (left), and fist stamping gesture (right)

Figure 3. Warcraft III, 2-hand region selection gesture (left), and 1-hand panning gesture (right)

To illustrate, observations revealed that multiple controllers would often use two hands to bracket a region of interest. We replicated this gesture in our tabletop wrapper. Figure 3 (left) and Figure 1 (left) show a Warcraft III player selecting six friendly units within a particular region of the screen using a two-handed selection gesture, while Figure 3 (right) shows a one-handed panning gesture similar to how one moves a paper map on a table. Similarly, a sampling of other gestures includes:
• a 5-finger grabbing gesture to reach, pick up, move and place items on a surface (Figure 2, left),
• a fist gesture mimicking the use of a physical stamp to paste object instances on the terrain (Figures 1+2, right),
• pointing for item selection (Figure 1 left, Figure 4).

Table 1. The Speech and Gesture Interface to Warcraft III and The Sims

Speech Commands in Warcraft III:
  Unit <#>: selects a numbered unit, e.g., one, two
  Attack / attack here [point]: selected units attack a pointed-to location
  Build here [point]: build an object at the current location, e.g., farm, barracks
  Move / move here [point]: move to the pointed-to location
  [area] Label as unit <#>: adds a character to a unit group
  Stop: stop the current action
  Next worker: navigate to the next worker

Speech Commands in The Sims:
  Rotate: rotates the canvas clockwise 90 degrees
  Zoom: zooms the canvas to one of three discrete levels
  Floor: moves the current view to a particular floor
  Return to Neighbourhood: allows a saved home to be loaded
  Create here [points / fists] okay: creates object(s) at the current location, e.g., table, pool, chair
  Delete [point]: removes an object at the current location
  Walls: shows / hides walls from the current view

4.2 Meaningful Speech

A common approach to wrapping speech atop single user systems is to do a 1:1 mapping of speech onto system-provided command primitives (e.g., saying 'X', the default keyboard shortcut to attack). This is inadequate for a multiplayer setting. If speech is too low level, the other players would have to consciously reconstruct the intention of the player. As with gestures, speech serves as a communicative act (a meaningful 'aloud') that must be informative. Thus a player's speech commands must be constructed so that (a) a player can rapidly issue commands to the game table, and (b) their meaning is easily understood by other players within the context of the visual landscape and the player's gestures. In other words, speech is intended not only for the control of the system, but also for the benefit of one's collaborators. To illustrate, our Warcraft III speech vocabulary was constructed using easily understood phrases: nouns such as 'unit one', verbs such as 'move', and action phrases such as 'build farm' (Table 1). Internally, these were remapped onto the game's lower level commands. As described in the next section, these speech phrases are usually combined with gestures describing locations and selections to complete the action sequence. While these speech phrases are easily learnt, we added a second display to the side of the table that listed all available speech utterances; this display also provided visual feedback of how the system understood the auditory commands by highlighting the best match.
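As a rough illustration of this remapping, the sketch below translates a recognized phrase into the low-level keystrokes a single-user game already understands. The phrase set mirrors Table 1, but the keystroke strings and the send_keys helper are hypothetical placeholders rather than the actual wrapper code described in [18].

    # Illustrative sketch: remapping meaningful speech phrases onto the
    # low-level keyboard shortcuts of the underlying single-user game.
    # The keystroke values and send_keys() are hypothetical placeholders.

    WARCRAFT_PHRASES = {
        "unit one":       ["1"],        # select control group 1
        "unit two":       ["2"],
        "move here":      ["m"],        # location comes from a pointing gesture
        "attack here":    ["a"],
        "build farm":     ["b", "f"],   # open build menu, choose farm
        "build barracks": ["b", "r"],
        "stop":           ["s"],
    }

    def send_keys(keys):
        # Placeholder: in the real wrapper these would be injected as
        # synthetic keyboard events into the game window.
        print("injecting keystrokes:", keys)

    def handle_utterance(utterance):
        """Translate a recognized phrase into game-level keystrokes."""
        keys = WARCRAFT_PHRASES.get(utterance.lower())
        if keys is None:
            return False     # unknown phrase; ignore, or flag it on the side display
        send_keys(keys)
        return True

    handle_utterance("Build Barracks")   # -> injecting keystrokes: ['b', 'r']

The side display listing the available utterances could be driven from the same phrase dictionary, so that what the recognizer accepts and what the players see stay in step.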

4.3 Combining Gesture and Speech Together

The speech and gesture commands of Warcraft III and The Sims are often intertwined. For example, in Warcraft III a person may tell a unit to attack, where the object to attack can be specified before, during or even after the speech utterance. As mentioned in Section 2, speech and gestures can interact to provide a rich and expressive language for interaction and collaboration, e.g., through deixis. Figure 1 gives several examples, where deictic speech acts are accompanied by one and two-finger gestures and by fist stamping; all gestures indicate locations not provided by the speech act. Further combinations are illustrated in Table 1. For example, a person may select a unit and then say 'Build Barracks' while pointing to the location where it should be built. This intermixing not only makes input simple and efficient, but also makes the action sequence easier for others to understand.

Figure 4. Warcraft III: 1-finger multimodal gesture (left), and 2-finger multimodal gesture (right)

These multimodal commands greatly simplify the player's task of understanding the meaning of an overloaded hand posture. A user can easily distinguish different meanings for a single finger using utterances such as 'unit two, move here' and 'next worker, build a farm here' (Figure 4, left). We should mention that the constraints and offerings of the actual commercial single player game significantly influence the gestures and speech acts that can appropriately be added to it via our wrapper. For example, continuous zooming is ideally done by gestural interaction (e.g., a narrowing of a two-handed bounding box). However, since The Sims provides only three discrete levels of zoom, it was appropriate to provide a meaningful aloud for zooming. Table 1 shows how we mapped Warcraft III and The Sims onto speech and gestures, while Figure 1 illustrates two people interacting with it on a table.

4.4 Feedback and Feedthrough

For all players, game feedback reinforces what the game understands. While feedback is usually intended for the player who did the action, it becomes feedthrough when others see and understand it. Feedback and feedthrough are conveyed by the visuals (e.g., the arrows surrounding the pointing finger in Figure 4, the bounding box in Figure 3 left, the panning surface in Figure 3 right). As well, each game provides its own auditory feedback to spoken commands: saying 'Unit One Move Here' in Warcraft III results in an in-game character responding with phrases such as 'Yes, Master' or 'Right Away' if the phrase is understood (Figure 4). Similarly, saying 'Create a tree' in The Sims results in a click sound.

4.5 Awareness and Gaze

Because most of these acts work over a spatial location, awareness becomes rich and highly meaningful. By overhearing alouds, by observing players moving their hands onto the table (consequential communication), and by observing players' hand postures and the resulting feedback (feedthrough), participants can easily determine the modes, actions and consequences of other people's actions. Gestures and speech are meaningful as they are designed to mimic what is seen and understood in physical environments; this meaning simplifies communication [2]. As one player visually tracks what another is doing, that other player is aware of where the first is looking and gains a consequential understanding of how their own actions are being understood.
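To suggest how the intermixing of Section 4.3 could be realized, the sketch below fuses a deictic speech command with the most recent pointing gesture when the two arrive within a short time window. The Fuser class, its event names and the 1.5 second window are assumptions for illustration, not the fusion mechanism actually used in the wrapper.

    # Illustrative sketch of speech + gesture fusion by time proximity.
    # A deictic phrase such as "move here" is completed by the most recent
    # pointing gesture within a short window; values are assumptions.
    import time

    FUSION_WINDOW = 1.5   # seconds: the gesture may come shortly before or after the speech

    class Fuser:
        def __init__(self):
            self.last_point = None       # (timestamp, x, y)
            self.pending_speech = None   # (timestamp, command)

        def on_point(self, x, y):
            self.last_point = (time.time(), x, y)
            self._try_fuse()

        def on_speech(self, command):
            self.pending_speech = (time.time(), command)
            self._try_fuse()

        def _try_fuse(self):
            if not (self.last_point and self.pending_speech):
                return
            t_point, x, y = self.last_point
            t_speech, command = self.pending_speech
            if abs(t_point - t_speech) <= FUSION_WINDOW:
                # e.g., inject the command's keystrokes plus a click at (x, y)
                print(f"execute '{command}' at ({x}, {y})")
                self.last_point = self.pending_speech = None

    f = Fuser()
    f.on_speech("move here")   # waits for a location...
    f.on_point(320, 180)       # ...which arrives just after the utterance

Because the location can arrive before, during or after the utterance, the fusion step simply keeps the most recent of each token and pairs them when both are present.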

4.6 Multiplayer Interaction

Finally, our wrapper transforms a single player game into a multi-user one, where players can interact over the surface. Yet this comes at a cost, for single player games are not designed with this in mind. Single player games expect only a single stream of input coming from a single person. In a multiplayer setting, these applications cannot disambiguate which commands come from which person, nor can they make sense of overlapping commands and/or command fragments that arise from simultaneous user activities. To regulate this, we borrow ideas from shared window systems. To avoid confusion arising from simultaneous user input across workstations, a turn taking wrapper is interposed between the multiple workstation input streams and the single user application [8]. Akin to a switch, this wrapper regulates user pre-emption so that only one workstation's input stream is selected and sent to the underlying application. The wrapper could embody various turn taking protocols, e.g., explicit release (a person explicitly gives up the turn), pre-emptive (a new person can grab the turn), pause detection (explicit release when the system detects a pause in the current turn-holder's activity), queue or round-robin (people can 'line up' for their turns), central moderator (a chairperson assigns turns), and free floor (anyone can input at any time, but the group is expected to regulate their turns using social protocol) [9]. In the distributed setting of shared window systems, turn taking is implemented at quite gross levels (e.g., your turn, my turn). Our two case studies reveal far richer opportunities in tabletop multimodal games for social regulation by micro turn-taking. That is, speech and gestural tokens can be interleaved so that actions appear to be near-simultaneous. For example, Figure 1 (left) shows micro turn-taking in Warcraft III. One person says 'label as unit one' with a two hand side selection, and the other person then immediately directs that unit to move to a new location. Informal observations of people playing together using the multimodal wrappers of Warcraft III and The Sims showed that natural social protocols mitigated most negative effects of micro turn-taking over the digital table. Players commented that they felt more engaged and entertained after playing on the tabletop as compared to their experiences playing these games on a desktop computer.
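A minimal sketch of such an input switch is given below. It sits between per-player input streams and the single-user game and implements a simple free-floor policy at the granularity of whole commands, i.e., micro turn-taking. The event structure, player identifiers and forward callback are hypothetical; the real wrapper [18] differs in its details.

    # Illustrative sketch of a micro turn-taking switch interposed between
    # multiple players' input streams and a single-user game. The event
    # values and forward() are placeholders, not the actual wrapper of [18].

    class TurnTakingSwitch:
        def __init__(self, forward):
            self.forward = forward     # sends an event to the single-user game
            self.holder = None         # player currently holding the floor

        def on_command_start(self, player):
            """A player begins a speech/gesture command."""
            if self.holder is None:
                self.holder = player   # free floor: first to act takes the turn
                return True
            return self.holder == player   # others wait; social protocol handles conflicts

        def on_event(self, player, event):
            if self.holder == player:
                self.forward(event)    # only the turn-holder's stream reaches the game

        def on_command_end(self, player):
            """Command complete: release the floor so turns can interleave finely."""
            if self.holder == player:
                self.holder = None

    switch = TurnTakingSwitch(forward=lambda e: print("to game:", e))
    switch.on_command_start("Ann"); switch.on_event("Ann", "label as unit one")
    switch.on_command_end("Ann")
    switch.on_command_start("Bob"); switch.on_event("Bob", "unit one, move here")
    switch.on_command_end("Bob")

Other protocols from the list above (round-robin, central moderator, and so on) could be substituted by changing only the policy in on_command_start; the rest of the switch stays the same.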

5. Summary and Conclusion

While video gaming has become quite pervasive in our society, there is still a large gulf between the technologies and experiences of arcade gaming versus home console gaming. Console games and computers need to support a variety of applications and games, thus they use generic input devices (e.g., controllers, keyboard and mouse) that can be easily repurposed. Yet generic input devices fail to produce meaningful gestures and gaze awareness for people playing together, for two reasons. First, everyone is looking at a common screen rather than at each other, thus gaze awareness has the added cost of looking away from the screen. Second, generic input devices lock people's hands and arms into relatively similar hand postures and spatial locations, thus people fail to produce useful awareness information in a collaborative setting. Conversely, arcade games often use dedicated tangible input devices (e.g., gun, racing wheel, motorcycle) to provide the behavioural and visceral affordances of gestures on real world objects for a single specialized game. Yet specialized tangible input devices (e.g., power glove, steering wheel) are expensive: they only work with a small number of games, and several input devices must be purchased if multiple people are to play together. Even when meaningful gestures can be created with these tangible input devices, people are still looking at a screen rather than at each other; the spatial cues of gestures are lost since they are performed in mid air rather than on the display surface.

This paper contributes multimodal co-located tabletop interaction as a new genre of home console gaming: an interactive platform where multiple people can play together using a digital surface with the rich hand gestures normally only seen in arcade games with specialized input devices. Our behavioural foundations show that allowing people to monitor the digital surface and the gesture and speech acts of collaborators produces an engaging and visceral experience for all those involved. Our application of multimodal co-located input to command and control (Warcraft III) and home planning (The Sims) scenarios shows that single user games can be easily repurposed for different game genres. Consequently, this work bridges the gulf between arcade gaming and home console gaming by providing new and engaging experiences on a multiplayer multimodal tabletop display. Unlike special purpose arcade games, a single digital table can become a pervasive element in a home setting, allowing co-located players to play different game genres atop it using their own bodies as input devices.

6. References
[1] Bolt, R.A. Put-that-there: Voice and gesture at the graphics interface. Proc ACM Conf. Computer Graphics and Interactive Techniques, Seattle, 1980, 262-270.
[2] Clark, H. Using Language. Cambridge Univ. Press, 1996.
[3] Cohen, P.R., Johnston, M., McGee, D., Oviatt, S., Pittman, J., Smith, I., Chen, L. and Clow, J. QuickSet: Multimodal interaction for distributed applications. Proc. ACM Multimedia, 1997, 31-40.
[4] Cohen, P.R., Coulston, R. and Krout, K. Multimodal interaction during multiparty dialogues: Initial results. Proc IEEE Int'l Conf. Multimodal Interfaces, 2002, 448-452.
[5] Cohen, P.R. Speech can't do everything: A case for multimodal systems. Speech Technology Magazine, 5(4), 2000.
[6] Dietz, P.H. and Leigh, D.L. DiamondTouch: A multi-user touch technology. Proc ACM UIST, 2001, 219-226.
[7] Greenberg, S. Designing computers as public artifacts. International Journal of Design Computing, Special Issue on Design Computing on the Net (DCNet'99), University of Sydney, 1999.
[8] Greenberg, S. Sharing views and interactions with single-user applications. Proc ACM COIS, 1990, 227-237.
[9] Greenberg, S. Personalizable groupware: Accommodating individual roles and group differences. Proc ECSCW, 1991, 17-32.
[10] Gutwin, C. and Greenberg, S. The importance of awareness for team cognition in distributed collaboration. In E. Salas and S. Fiore (Eds.), Team Cognition: Understanding the Factors that Drive Process and Performance, APA Press, 2004, 177-201.
[11] Heath, C.C. and Luff, P. Collaborative activity and technological design: Task coordination in London Underground control rooms. Proc ECSCW, 1991, 65-80.
[12] Magerkurth, C., Memisoglu, M., Engelke, T. and Streitz, N. Towards the next generation of tabletop gaming experiences. Proc. Graphics Interface, 2004, 73-80.
[13] Oviatt, S.L. Ten myths of multimodal interaction. Comm. ACM, 42(11), 1999, 74-81.
[14] Oviatt, S. Multimodal interactive maps: Designing for human performance. Human-Computer Interaction, 12, 1997.
[15] Pinelle, D., Gutwin, C. and Greenberg, S. Task analysis for groupware usability evaluation: Modeling shared-workspace tasks with the mechanics of collaboration. ACM TOCHI, 10(4), 2003, 281-311.
[16] Segal, L. Effects of checklist interface on non-verbal crew communications. NASA Ames Research Center, Contractor Report 177639, 1994.
[17] Tang, J. Findings from observational studies of collaborative work. Int. J. Man-Machine Studies, 34(2), 1991.
[18] Tse, E., Shen, C., Greenberg, S. and Forlines, C. Enabling interaction with single user applications through speech and gestures on a multi-user tabletop. MERL Technical Report TR2005-130, Cambridge, MA, 2005.

Candidacy Examination Reading List for Edward Tse

Human Computer Interaction: Design and Evaluation Methodologies
This section contains foundational material for Human Computer Interaction, stressing evaluation techniques.
1. Dix, A., Finlay, J., Abowd, G., & Beale, R. (1998). Human Computer Interaction, 2nd ed. Toronto: Prentice-Hall.
   a. Ch. 4: Usability paradigms and principles (pp. 143-177)
   b. Ch. 5: The design process (pp. 178-221)
   c. Ch. 11: Evaluation techniques (pp. 405-442)
2. Helander, M., ed. (1988). Handbook of Human-Computer Interaction. New York: North Holland.
   a. Ch. 42: T.K. Landauer. Research methods in human-computer interaction (pp. 905-928)
3. Nielsen, J. (1993). Usability Engineering. New York: Morgan Kaufmann Publishers.
   a. Ch. 2: What Is Usability? (pp. 23-48)
   b. Ch. 4: Usability Engineering Lifecycle
   c. Ch. 5: Usability Heuristics (pp. 115-163)
   d. Ch. 6: Usability Testing
   e. Ch. 7: Usability Assessment Methods beyond Testing
4. Strauss, A. and Corbin, J. (1998). Basics of Qualitative Research: Techniques and Procedures for Developing Grounded Theory, Second Edition. SAGE Publications.
   a. Ch. 6: Basic Operations: Asking Questions and Making Comparisons (pp. 73-86)
   b. Ch. 7: Analytic Tools (pp. 87-100)
   c. Ch. 8: Open Coding (pp. 101-122)
   d. Ch. 9: Axial Coding (pp. 123-142)
   e. Ch. 10: Selective Coding (pp. 143-162)
   f. Ch. 11: Coding for Process (pp. 163-179)


Computer Supported Cooperative Work: Background, Theories and Methods

General background
1. Dix, Alan; Finlay, Janet; Abowd, Gregory; Beale, Russel. (1998). Human Computer Interaction, Second Edition. Chapters 13, 14. Prentice Hall International.
2. Ellis, C.; Gibbs, S.; Rein, G. (1992). Groupware: Some Issues and Experiences. In Baecker, R. (ed.), Readings in Computer Supported Cooperative Work, pp. 9-28. Morgan Kaufmann Publishers.
3. Grudin, J. Groupware and Social Dynamics: Eight Challenges for Developers. Communications of the ACM, 37(1), 92-105, 1994.
4. Baecker, R. (1992). The Future of Groupware for CSCW. In Baecker, R. (ed.), Readings in Computer Supported Cooperative Work, pp. 851-854. Morgan Kaufmann Publishers.

Theories and models of Small Group Interaction
5. Clark, H. Using Language. Cambridge Univ. Press, 1996.
   a. Ch. 6: Signaling
   b. Ch. 8: Grounding
6. Hollan, J., Hutchins, E., & Kirsh, D. Distributed Cognition: Toward a New Foundation for Human Computer Interaction. ACM TOCHI, 7(2), June 2000, pp. 174-196.
7. McGrath, J. (1984). Groups: Interaction and Performance. Englewood, NJ: Prentice-Hall.
   a. Ch. 3: Methods for the Study of Groups (pp. 28-40)
   b. Ch. 4: A Typology of Groups (pp. 41-50)
   c. Ch. 5: A Typology of Tasks (pp. 53-66)
8. Gutwin, C., and Greenberg, S. The importance of awareness for team cognition in distributed collaboration. In E. Salas, S. Fiore (Eds.), Team Cognition: Understanding the Factors that Drive Process and Performance, APA Press, 2004, 177-201.
9. Gutwin, C., & Greenberg, S. (2000). The Mechanics of Collaboration: Developing Low Cost Usability Evaluation Methods for Shared Workspaces. IEEE 9th International Workshop on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE'00), June 14-16, NIST, Gaithersburg, MD, USA.



Co-located and Multimodal Interaction

Behavioural Foundations
10. Bekker, M.M., Olson, J.S., & Olson, G.M. (1995). Analysis of gesture in face-to-face design teams provides guidance for how to use groupware in design. In Proceedings of the Symposium on Designing Interactive Systems 1995, pp. 157-166.
11. Bentley, R., Hughes, J., Randall, D., Rodden, T., Sawyer, P., Shapiro, D. and Sommerville, I. (1992). Ethnographically-informed Systems Design for Air Traffic Control. In Proceedings of Computer-Supported Cooperative Work (CSCW) 1992, pp. 123-129.
12. Buxton, W.A.S. Chunking and phrasing and the design of human-computer dialogues. In Human-Computer Interaction: Toward the Year 2000, Morgan Kaufmann Publishers Inc., 1995, pp. 494-499.
13. Cohen, P. Speech can't do everything: A case for multimodal systems. Speech Technology Magazine, 5(4), 2000.
14. Cohen, P.R., Coulston, R. and Krout, K. Multimodal interaction during multiparty dialogues: Initial results. Proc IEEE Int'l Conf. Multimodal Interfaces, 2002, 448-452.
15. Heath, C.C. and Luff, P. Collaborative activity and technological design: Task coordination in London Underground control rooms. Proc ECSCW, 1991, 65-80.
16. Hutchins, E., and Palen, L. Constructing Meaning from Space, Gesture, and Speech. In Discourse, Tools, and Reasoning: Essays on Situated Cognition. Heidelberg, Germany: Springer-Verlag, 1997, pp. 23-40.
17. Hutchins, E. (1995). Cognition in the Wild. MIT Press, Cambridge, MA.
   a. Ch. 4: Organization of Team Performances (pp. 175-228)
   b. Ch. 5: Communication (pp. 229-262)
   c. Ch. 9: Cultural Cognition (pp. 353-374)
18. Hutchins, E. (2000). The Cognitive Consequences of Patterns of Information Flow. Intellectica, 2000/1, 30, pp. 53-74.
19. Kruger, R., Carpendale, M.S.T., Scott, S.D., Greenberg, S. Roles of Orientation in Tabletop Collaboration: Comprehension, Coordination and Communication. Journal of Computer Supported Collaborative Work, 13(5-6), 2004, pp. 501-537.
20. Luff, P., Heath, C., and Greatbatch, D. (1992). Tasks-in-Interaction: Paper and Screen Based Documentation in Collaborative Activity. Proceedings of Computer-Supported Cooperative Work '92, pp. 163-170.
21. McNeill, D. (1992). Hand and Mind: What Gestures Reveal About Thought. University of Chicago Press, Chicago.
   a. Ch. 1: Images, Inside and Out
   b. Ch. 2: Conventions, Gestures, and Signs
22. Oviatt, S. Multimodal interactive maps: Designing for human performance. Human-Computer Interaction, 12, 1997.
23. Oviatt, S.L. Ten myths of multimodal interaction. Comm. ACM, 42(11), 1999, 74-81.



24. Ryall, K., Forlines, C., Shen, C., Ringel-Morris, M. Exploring the Effects of Group Size and Table Size on Interactions with Tabletop Shared-Display Groupware. ACM Conference on Computer Supported Cooperative Work (CSCW), pp. 284-293, November 2004 (ACM Press).
25. Scott, S.D., Carpendale, M.S.T., Inkpen, K.M. Territoriality in Collaborative Tabletop Workspaces. In Proceedings of the ACM Conference on Computer-Supported Cooperative Work (CSCW) '04, 2004, pp. 294-303.
26. Segal, L. Effects of checklist interface on non-verbal crew communications. NASA Ames Research Center, Contractor Report 177639, 1994.
27. Sellen, A., Harper, R. (2002). The Myth of the Paperless Office. Cambridge, MA: MIT Press.
   a. Ch. 5: Paper in Support of Working Together
28. Stewart, J., Bederson, B.B., and Druin, A. (1999). Single Display Groupware: A Model for Co-present Collaboration. Proceedings of Human Factors in Computing Systems 1999 (CHI 99), pp. 286-293.
29. Tang, J.C. (1991). Findings from observational studies of collaborative work. International Journal of Man-Machine Studies, 34, pp. 143-160.

Technologies
30. Bier, E.A., & Freeman, S. (1991). MMM: A User Interface Architecture for Shared Editors on a Single Screen. In Proceedings of the Symposium on User Interface Software Technology (UIST) '91, pp. 79-86.
31. Bolt, R.A. Put-that-there: Voice and gesture at the graphics interface. Proc ACM Conf. Computer Graphics and Interactive Techniques, Seattle, 1980, 262-270.
32. Bolt, R.A. and Herranz, E. (1992). Two-handed gesture in multi-modal natural dialog. In Proceedings of the 5th Annual ACM Symposium on User Interface Software and Technology (Monterey, California, November 15-18, 1992), UIST '92. ACM Press, New York, NY, 7-14.
33. Bricker, L.J., Bennett, M.J., Fujioke, E., & Tanimoto, S.L. (1999). Colt: A System for Developing Software that Supports Synchronous Collaborative Activities. In Proceedings of Educational Media '99, pp. 587-592.
34. Buxton, W., Fitzmaurice, G.W., Balakrishnan, R., and Kurtenbach, G. (2000). Large Displays in Automotive Design. IEEE Computer Graphics and Applications, 20(4), pp. 68-75.
35. Carroll, J.M. (2002). Human-Computer Interaction in the New Millennium. Toronto, ON: ACM Press.
   a. Ch. 25: N.A. Streitz, P. Tandler, C. Müller-Tomfelde, & S. Konomi. Roomware: Towards the Next Generation of Human-Computer Interaction Based on an Integrated Design of Real and Virtual Worlds (pp. 553-578)
36. Cohen, P.R., Johnston, M., McGee, D., Oviatt, S., Pittman, J., Smith, I., Chen, L. and Clow, J. QuickSet: Multimodal interaction for distributed applications. Proc. ACM Multimedia, 1997, 31-40.
37. Corradini, A., Wesson, R.M. and Cohen, P.R. A Map-Based System Using Speech and 3D Gestures for Pervasive Computing. Proceedings of the 4th IEEE International Conference on Multimodal Interfaces, IEEE Computer Society, 2002, pp. 191.


38. Dietz, P. and Leigh, D. (2001). DiamondTouch: A Multi-User Touch Technology. In Proceedings of the ACM Symposium on User Interface Software and Technology (UIST) '01, pp. 219-226.
39. Inkpen, K., Hawkey, K., Kellar, M., Mandryk, R., Parker, K., Reilly, D., Scott, S., & Whalen, T. (2005). Exploring Display Factors that Influence Co-Located Collaboration: Angle, Size, Number, and User Arrangement. In Proceedings of HCI International 2005, July 22-27, 2005, Las Vegas, NV.
40. Ishii, H., Kobayashi, M. and Grudin, J. Integration of interpersonal space and shared workspace: ClearBoard design and experiments. ACM TOIS, 11(4), 1993, 349-375.
41. Krueger, M.W., Gionfriddo, T., and Hinrichsen, K. (1985). VIDEOPLACE: An Artificial Reality. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (San Francisco, California), CHI '85. ACM Press, New York, NY, 35-40.
42. Magerkurth, C., Memisoglu, M., Engelke, T. and Streitz, N. Towards the next generation of tabletop gaming experiences. Proceedings of the 2004 Conference on Graphics Interface (London, Ontario, Canada), Canadian Human-Computer Communications Society, 2004, pp. 73-80.
43. Myers, B., Malkin, R., Bett, M., Waibel, A., Bostwick, B., Miller, R., Yang, J., Denecke, M., Seemann, E., Zhu, J., Hong Peck, C., Kong, D., Nichols, J., Scherlis, B. Flexi-modal and Multi-Machine User Interfaces. In Proceedings of the Fourth IEEE International Conference on Multimodal Interfaces, October 14-16, 2002, Pittsburgh, PA, pp. 343-348.
44. Rekimoto, J. and Saitoh, M. (1999). Augmented Surfaces: A Spatially Continuous Work Space for Hybrid Computing Environments. Proceedings of Human Factors in Computing Systems 1999 (CHI 99), pp. 378-385.
45. Shen, C., Lesh, N., Vernier, F., Forlines, C., & Frost, J. (2002). Sharing and Building Digital Group Histories. Proceedings of the ACM Conference on Computer Supported Cooperative Work (CSCW), pp. 324-333.
46. Ullmer, B. & Ishii, H. (1997). The metaDESK: Models and Prototypes for Tangible User Interfaces. In Proceedings of the Symposium on User Interface Software Technology (UIST) '97, pp. 223-232.
47. Wu, M., Shen, C., Ryall, K., Forlines, C., Balakrishnan, R. Gesture Registration, Relaxation, and Reuse for Multi-Point Direct-Touch Surfaces. IEEE International Workshop on Horizontal Interactive Human-Computer Systems (TableTop), January 2006, pp. 183-190 (IEEE Computer Society P2494, ISBN 0-7695-2494-X).


Edward Tse Written Departmental Exam

This examination is open book: you may peruse those materials found in your reading list, your MSc thesis, and your proposal. At your discretion, you may use other related HCI/CSCW material found in the HCI library. However, you may not search the Internet. Although the examination should be doable in 3-4 hours, you have 6 hours total. You may take a 5 minute bathroom/stretch break every hour. As well, you may take a 30 to 60 minute break for lunch, which will not be included in your total examination time. Because the questions are fairly open-ended, you should limit the detail provided so that it is appropriate to the time available to you. Watch your time. Cite your reading list if needed (author/date is fine as we have your reading list).


Question 1. Computer Supported Cooperative Work

Consider the following short episode of two people working together over a table. John and Mary are sorting some photos, where their goal is to decide which ones to use as part of a slide show. John walks away from the table, selects and lifts a photo album off a nearby shelf, and takes a few photos out of it. He then brings both the photo album and the photos back to the table. He puts the photo album on the floor next to him. He then stacks all but one of the selected photos on the table near him, and then places the remaining photo in the middle of the table while saying "what do you think?" As he does these acts, Mary is sifting through the photos on the table, placing ones she thinks interesting in a pile in the centre. She occasionally glances up to see what John is doing, and turns to look at the photo when he places it on the table.
a) Use McGrath's Typology of Tasks to characterize this episode.
b) Analyze this episode by using either Clark's Theory of Common Ground or Gutwin's Mechanics of Collaboration.


Question 2. The HCI Development Process

Scenario. An oil company in Calgary currently uses Google Earth to examine the terrain surrounding potential oil exploration sites. They use Google Earth from a desktop machine to make preliminary decisions about planning (e.g., vehicle access to the site, forecasting how the site would have to be cleared, possible road construction, nearby accommodations and supplies, etc.). The company has seen your GoogleEarth tabletop prototype, and is considering hiring you as an HCI consultant to redevelop the wrapper to fit their particular needs (they have their own technical team that will do the actual programming).

On the following page are two possible tasks A and B related to this company's needs. Choose either A or B, but not both. Regardless of the task chosen, please note the following:
• There are many possible methods that you can use as part of your answer. Your job is to choose and recommend one that you feel may be appropriate to this situation (i.e., do NOT try to describe all possible methods).
• We are not looking for a particular method; rather, what we are looking for is your rationale for why you chose this method and how you would use it in practice.
• Your answer should concentrate on your methodologies; the GoogleEarth wrapper scenario is just used to motivate the question.


Answer either A or B but not both.

A. At this preliminary stage, they have asked you to briefly rough out an interface development strategy that identifies the particular requirements this team may have of a tabletop map-viewing system. These requirements will then be used to develop and evaluate some preliminary low fidelity design prototypes. The company has already said that you can have full access to the team who is currently doing this work. The budget is not large, so the methods used must be cost-effective.
Outline this development strategy. For each stage in this strategy, clearly indicate:
• the goal of the stage
• the method used
• the expected outcome
• the expected time frame (i.e., a rough schedule)
• a rationale explaining why this method was chosen (i.e., the benefits of this method to this particular setting) and roughly how you expect you would apply it to this situation.
You must detail a development strategy for requirements discovery that at the very least goes up to but not including low fidelity prototype development. If you have sufficient time, you may add low fidelity prototyping/testing to your strategy.

OR

B. At this preliminary stage, they have asked you to briefly rough out a process that will evaluate how their group uses your current implementation to do their tasks. The outcome of this evaluation will then be used to help the decision makers consider if your multimodal tabletop approach is worth pursuing, and what modifications should be done. The company has already said that you can have full access to the team who is currently doing this work. The budget is not large, so the methods used must be cost-effective.
Outline the methodology you will use to evaluate the group's use of your current implementation. Clearly indicate:
• the method
• a step by step detailed description of how you would apply it to this particular situation
• what data you would collect (if any)
• how you would analyze it
• the expected time frame (i.e., a rough schedule)
• a rationale explaining why this method was chosen


Question 3. Co-located and multimodal interaction

Answer either A or B but not both.

A.

Comparing technologies

i) In terms of input and output capabilities, what are the primary limitations of the existing Smart 4-camera DViT system vs. the MERL DiamondTouch surface vs. a multiple mice system?
ii) Based on your knowledge of the above limitations (and ignoring technical workarounds that may be developed in the future), compare these systems in terms of their effects on how people will be able to interact over the surface. You are expected to structure your answer in terms of the key behavioural concepts found on your reading list.

OR

B.

Analyzing systems
Choose one example digital tabletop / large screen system from your reading list. Critique the system design from a theoretical perspective of behavioural concepts found on your reading list.


Edward Tse Written Departmental Exam Take-Home (72 hours)

Open book, 72 hours (3 full days). You may use material from your reading list and other sources you may be aware of, and you may build upon the material already presented in your research proposal. The library of Saul Greenberg will be available to you if you require it. Feel free to use any library and the World Wide Web to search for information, e.g., the HCI Bibliography or the ACM Digital Library. However, you are not to consult other experts in the area. References to portions of this material used to inform your discussion must be included, and direct extracts from these works must be quoted (i.e., as in a normal scientific paper).
Format: Use the ACM CHI format for your paper (http://www.chi2006.org/docs/chi2006pubsformat.doc). Don't worry about minor deviations from this format. The paper should be around 10 pages long (it is probably too short if it is less than 8 pages, and too long if it is more than 15 pages). Pages used for listing references are not included in this count.

In your proposal, you say that you "focus on group interaction theories that specifically handle issues of group communication, gesture and speech activity and apply them to the design of a digital tabletop [… going from …] low level implications that deal specifically with the mechanics of gesture and speech input and then moves into high level theories influencing group work."

Write a conference-style paper titled: A theoretical framework for co-located tabletop interaction: From low-level mechanics to high-level theories.

Your goal in this paper is to briefly summarize and inter-relate the various group interaction theories you have read about from your reading list. You may include knowledge acquired from other sources, but we do not expect nor recommend that you read new material during this examination period. The theories of interest are those that you believe directly relate to co-located digital tabletop interaction. The obvious ones of interest include:
a) Mechanics of Collaboration
b) Workspace Awareness
c) Clark's Grounding Theory
d) Distributed Cognition
Your task is to assemble and synthesize this knowledge into a single theoretical framework. While we do expect you to briefly summarize the various theories you wish to include, your main effort should be on discussing the inter-relations between these theories and how these can collectively be viewed as components of a unifying theoretical framework (conversely, you can argue why they should not be considered in a single framework if that is your belief). Cite the literature as needed. Feel free to use summary tables and/or diagrams that show how one theory and/or its components relate to others (although these should be explained in text). You should pay particular attention to the logic of argumentation. You may assume that the reader has a reasonable knowledge of CSCW and groupware, and of multimodal interaction.


Open Book Exam, Edward Tse

Question 1: Computer Supported Cooperative Work

a) McGrath's Typology of Tasks

McGrath's Typology of Tasks describes categories of tasks that groups can engage in [McGrath, 1984]. These categories are useful for understanding the differences and effects of task types in studying group behavior. The eight categories of tasks provided by McGrath are: planning tasks, creativity tasks, intellective tasks, decision making tasks, cognitive conflict tasks, mixed motive tasks, competitive tasks, and psycho-motor tasks.

John and Mary's photo sorting most closely matches the description of McGrath's decision making tasks. John and Mary's goal is to decide which photos are to be used as part of a slide show, thus sorting their photos will let them decide which to include. In this task, there is no absolute right or wrong answer (unlike an intellective task); the preferred or agreed upon selection is the right one. The set of photos placed in the centre of the table represents the photos that Mary and John prefer. The choices that John and Mary make may draw from their social and cultural values. For example, John may choose photos that are related to hockey because he knows that the people watching the slideshow are avid hockey fans.

Decision making tasks often rely on social comparison or social influence processes. For example, John explicitly asks Mary what she thinks about his photo addition, providing her the opportunity to influence and/or approve his decision. Consensus in this task is attained by sharing relevant information. For example, by placing the interesting photos in the centre of the table, Mary and John are sharing the photos that they feel should be in the slide show.

Decision making tasks are somewhere in the middle of the conflict versus cooperation continuum. This is appropriate for Mary and John's task because they need to work together and resolve conflicts to come up with a slide show that they are happy with. Their task lies deep on the conceptual end of the conceptual versus behavioral continuum. The task is not the simple mechanical act of selecting photos for a slide show but rather a rich group decision that involves thinking about the higher level content of each photo and the audience that they will be presenting to. Finally, John and Mary's photo sorting task falls into the choose quadrant of McGrath's group tasks because they both need to choose photos from their larger collection and agree upon which photos to include in the slide show.

b) Gutwin's Mechanics of Collaboration

Gutwin's Mechanics of Collaboration is a conceptual framework for understanding "the low level actions and interactions that must be carried out to complete a task in a shared manner" [Gutwin, 2000]. The seven major activities include: explicit communication, consequential communication, coordination of action, planning, monitoring, assistance and protection.

The following analysis describes how each activity takes place in the short episode of John and Mary working together.

Explicit Communication describes the intentional acts that people often perform over artifacts in a shared space. For example, when John moves a photo into the middle of the table and asks Mary what she thinks about it, John is explicitly providing information to Mary with the intention of having her respond.

Consequential Communication describes the information unintentionally provided by people as they go about their group activities and is important for shared group activities. Consequential communication can be provided by the manipulation of artifacts. For example, when John flips through a photo album and selects photos to place on the table, he is providing consequential communication to Mary about photos that he may want to bring to her attention in the near future. Consequential communication can also be provided by the actions of a person's embodiment in the workspace. For example, when John gets up and walks from the table to the shelf to pick up a photo album, he is unintentionally giving off information to Mary that he is unavailable to look at any photos that she might want to show him.

Coordination of Action and Planning describes how people organize their actions in a shared workspace so that they do not conflict with others, and how they divide the task as they go along. For example, when Mary places piles of photos close to where she is working she is avoiding conflicts with John by not working in the same space, and she is also dividing up the task of sorting photos for the slideshow by working on a separate pile of photos.

Monitoring and Assistance describes the ability of people to monitor what others are doing in the shared space and their ability to help one another when needed. For example, when Mary glances up to see what John is doing and turns to look at the photo when he places it on the table, she is monitoring John's activities so that when he asks "what do you think" she is able to provide an answer to John's call for assistance.

Protection describes actions people use to prevent others from accidentally changing or destroying their work. By keeping independent piles, John and Mary avoid destroying the intermediate photo sorting that each collaborator needs to do before they contribute their photos to the group pile in the middle of the table. The group pile is also monitored by both John and Mary. Thus, when John adds a photo into the pile he asks Mary for her opinion and approval.

Question 2. The HCI Development Process

While there are many different development strategies for identifying the requirements needed by the new system, I believe that an interview process would be well suited for our situation. Interviews provide the benefit of being able to speak with individual users in a natural and unstructured way. Interviews allow opinions and comments to be recorded without the social stigma of providing a response that might be evaluated by peers (as is the case for focus groups). Interviews allow unscripted discussion since they do not need to follow a rigorous script; this allows interviewees to discuss topics that may have been overlooked in the script. The interviewer has the ability to redirect the conversation if the discussion goes off topic. Finally, the structure of interviews allows both quantitative and qualitative results to be extracted to help inform the design of a new system.

To clarify my description of the HCI development process, I will assume that the system will be used by a single person rather than a group. I will also assume that interviews can be performed during business hours inside the company building, since I have full access to the team who is currently doing this work.

Step 1: Identify the users of the system (1 week)
The goal of this stage is to determine who should be interviewed. This is an important step since interviewing a large number of people can be quite time consuming. The purpose of this step is to identify the potential users of the system and to gain an initial understanding of questions that might be suited for the interview script. I would achieve this by asking managers or employees about who currently uses the Google Earth system and who might potentially be using the system in the future. If the manager points out several employees, it would be helpful to ask those employees who else might be a good candidate for an interview.

Step 2: Develop the interview script (1 week)
The goal of the interview script is to provide a number of open ended questions to promote discussion and to keep the conversation on track during the interview process. With open ended questions, users are "encouraged to explain themselves in depth, often leading to colorful quotes that can be used to enliven reports and presentations to management" [Nielsen, 1993]. This step is important to ensure that key system requirements can be extracted from the interviews, and the script may be changed during the interview step. For example, if the first interview raised a number of comments about the system's integration with external databases, questions could be added asking about the tasks that are done with external databases and Google Earth. I would achieve this by using some of the preliminary information obtained in Step 1 to create a set of open ended and neutral questions that represent core issues that are important to the requirements of the system. For example, I might ask questions about a person's role within the oil company, the tasks they perform with Google Earth, the problems they experience with the current system and the features that they might like to see in the future. To provide some quantitative results, demographic information could also be included in the interview. For example, I could ask questions about their age, educational background, and use of gesture and speech on a computer.

Step 3: Arrange meetings with system users (1 week)
The goal of this step is to prepare for the interview with each potential system user. This is important to ensure that the interview process goes smoothly and to ensure minimal interruptions to the daily work practices of employees in the oil company. The benefit of doing interviews is that, unlike questionnaires, once people have agreed to do an interview they are likely to complete it. I would achieve this by first finding a location within the company building that would be suitable for performing interviews. An ideal location would be inaudible to other members of the team so that their comments would not be influenced by how they felt their peers would judge their responses. Then, I would contact the system users and schedule appropriate times for interviews. Before any of the interviews begin, I need to ensure that I have purchased all of the appropriate materials. For example, if I were to perform audio recording of each interview for later review I would need to purchase audio tapes and a recorder.

Step 4: Conduct the interviews (3 weeks)
The goal of this step is to extract what each potential user feels are the system requirements. This is the core information from which the requirements will be extracted. The purpose of the interviewer is to promote natural discussion about the system and to ensure that the interview stays focused. I would achieve this by first thanking each user for taking part in the interview, preparing them for the format of the interview, and asking for permission to record audio. I would then ask each question in the script out loud, providing an opportunity to answer in between each question. I would allow the discussion to focus on issues that the user felt were important, but if the discussion went too far off topic I would try to redirect the conversation to focus on the core issues relating to the requirements. I would keep notes of participant responses and important comments.

Step 5: Analyze the results (2 weeks)
The goal of this step is to categorize, quantify and prioritize the system requirements extracted from the interview process. The purpose is to produce a list of requirements that will inform the design of the new system. I would achieve this by first reviewing my field notes and clarifying any gaps in my understanding by returning to the audio recordings. Then, I would extract common trends in the requirements and tasks from each interview. These trends would form the core set of requirements, which I would prioritize based on company needs and subjective user needs. To further inform the design of the system I would then analyze the demographic information of potential users to try to develop a profile of a typical system user. Finally, I would compose a report summarizing the findings obtained from the interviews. This report would form the basis for the design and subsequent low fidelity prototyping of the system.

Question 3a: Co-located and multimodal interaction, Comparing Technologies

i) Limitations of three existing systems

There are a number of software limitations involved with interacting with multiple point input, as existing operating systems limit interaction to a single point [Tse, 2004]. However, I will assume that we are moving beyond the software limitations, and consequently are using software toolkits (e.g., the Diamond Touch Gesture Engine) that support multiple points of contact and provide graphical visualizations. Also, while issues such as cost and transportability are important in practice, they are not input or output limitations.

Limitations of the input hardware capabilities

Input:
  DViT 4 Camera System: input surface must be rectilinear; cannot identify unique users; supports a maximum of two touch points; asynchronous recognition of point size; requires occasional cleaning of camera lenses; slow frame rate (25 fps on a 107" display); camera noise introduces jittery input; insensitive to point pressure; cannot distinguish similar hand postures (hand side vs. palm down).
  MERL DiamondTouch: input surface must be rectilinear; limited size; users must be tethered or seated on a pad; cannot always distinguish between two hands from the same user; only four users can be detected; front projected only; insensitive to point pressure.
  Multiple Mice: not aware of different hand postures; fights with the system cursor (causes erroneous input to system windows); gestures are complex, unnatural movements of the mouse; insensitive to pressure.

Output feedback provided by the input hardware:
  DViT 4 Camera System: no explicit force feedback, all display items feel the same when touched; soft touches produce no audio feedback; no hardware visual feedback (e.g., LEDs).
  MERL DiamondTouch: no explicit force feedback (see DViT); soft touches produce no audio feedback; no hardware visual feedback (e.g., LEDs).
  Multiple Mice: input is relative; there is no direct manipulation of objects on the digital surface.

ii) Effects of input devices on key Behavioral Concepts

In this section I describe both the individual and group effects on interaction over a digital surface. These effects are based on the behavioral affordances that summarize key behavioral concepts found on my reading list [Tse, 2005, AVI].

Individual Benefits

The primary benefit for the individual is the ability to provide naturally rich hand postures and movements, as described in the table below. Rich hand gestures and postures allow people to interact with digital content in the same way that they interact with artifacts in the real world. For example, one can pick up a digital artifact using five fingers and place it down on the digital surface just as one would pick up and move a physical object. This also produces a more engaging experience in gaming environments [Tse, 2006, Pervasive Games].

Rich Hand Postures and Movements and Natural Interactions [Cohen, 2002]
  Smart DViT: limited, but can identify point sizes and can distinguish between a finger and a whole hand.
  MERL DiamondTouch: somewhat supported, as rich multi finger and whole handed postures and movements are possible, e.g., five fingers lifting from the table, two hand sides for making a selection.
  Multiple Mice: not possible, since one's hands are tied to holding a mouse and rich hand postures and movements are not recognized.

Group Benefits

The provision of rich whole hand gestures provides a number of benefits for group collaboration, as people produce more awareness information for others. Interaction with a keyboard and a mouse is inherently private, as the hands and arms are locked into a relatively fixed position. Interaction over a large display with arms and hands makes actions publicly visible to others around a tabletop surface [Tse, 2006, Pervasive Games].

Deixis and Explicit Communication [Pinelle, 2003]
  Smart DViT: provided through direct touch manipulations, e.g., moving a photo on a table.
  MERL DiamondTouch: provided through direct touch manipulations, e.g., panning a map using a single finger.
  Multiple Mice: harder to determine, since the hand is not directly in contact with the digital artifact.

Consequential Communication [Gutwin, 2004]
  Smart DViT: if all interactions are done with a single finger then this produces less consequential communication, as the meaning of the single point is overloaded.
  MERL DiamondTouch: the provision of numerous rich gestural interactions makes it possible for interactions to provide rich consequential communication.
  Multiple Mice: since all mouse interactions use the same hand posture, they do not provide a lot of consequential communication for others.

Simultaneous Activity [Tang, 1991]
  Smart DViT: the limitation of only two inputs severely limits the types of simultaneous activities possible; for example, it is not possible to have two people use both hands for interaction on the Smart DViT. Furthermore, the inability to identify which user is interacting further limits one's ability to match deictic speech and gesture actions, e.g., we do not know whose gesture belongs to a speech command.
  MERL DiamondTouch: the identification of multiple people allows one to clearly distinguish between the simultaneous actions of multiple people. Simultaneous bimanual interactions are limited by the fact that only four users can be detected; thus, if a different user is used for each hand, only two users can interact with both hands on the DiamondTouch surface.
  Multiple Mice: multiple mice fully support a large number of simultaneous users and interactions over a shared digital surface.

Gaze Awareness [Gutwin, 2004]
  Smart DViT: while people interact on the DViT, people can monitor their hand and arm movements as well as their body positions and orientations.
  MERL DiamondTouch: because of the tethering requirement of the DiamondTouch, gaze awareness is mostly limited to observing the hand and arm movements of people around the table.
  Multiple Mice: in an upright desktop scenario, looking at a co-located collaborator has the disadvantage of losing the context of the shared display; on a small screen it can be difficult to distinguish the exact object that another person is looking at.

Monitoring [Gutwin, 2004]
  Smart DViT: people can monitor the actions of a collaborator by looking across the digital surface.
  MERL DiamondTouch: people can monitor actions easily due to the close spatial proximity of collaborators.
  Multiple Mice: it takes explicit effort to monitor the activities of a collaborator.

A Theoretical Framework for Co-located Tabletop Interaction: From Low Level Mechanics to High Level Theories

Edward Tse
University of Calgary
2500 University Dr. N.W., Calgary, Alberta, Canada T2N 1N4
[email protected]
(403) 210-9502

ABSTRACT

Co-located collaborators often work over physical tabletops using multimodal combinations of speech, gesture and attention over artifacts. With the advent of large digital multi-touch surfaces, researchers are actively exploring co-located collaboration over digital artifacts on a table top. However, there are numerous individual and group mechanics and high level theories that need to be understood in order to be able to design tabletop applications that support existing practices and improve collaborative work. In this paper, I present the Individual, Group, and Reason (IGR) theoretical framework that provides a basic foundational understanding of the subtle nuances of collaborative work that are often overlooked by tabletop designers accustomed to developing within the context of a personal computer. This framework breaks collaborative activity into low level individual and group mechanical actions and explains their use in a collaborative system using high level theories describing group activity. We apply the IGR framework by presenting implications for design and a tabletop system evaluation example.

Author Keywords

Single Display Groupware, Multimodal Interaction, Co-located Collaboration, Theoretical Frameworks, Communications Theory, Distributed Cognition

ACM Classification Keywords

H5.2. Information interfaces and presentation (e.g., HCI): User Interfaces.

INTRODUCTION

Traditional desktop computers are unsatisfying for highly collaborative situations involving multiple co-located people exploring and problem solving over artifacts. These situations include safety critical situations such as military command and control, air traffic control, and hospital emergency rooms where paper media such as maps and flight strips are preferred even when digital counterparts are available [McGee, 01, Bentley, 92, Chin, 03]. For example, McGee et al.'s observational studies illustrate why paper maps on a tabletop were preferred over electronic displays by Brigadier Generals in military command and control situations [McGee, 01]. The 'single user' assumptions inherent in the electronic display's input device and its software limited commanders, as they were accustomed to using multiple fingers and two-handed gestures to mark (or pin) points and areas of interest with their fingers and hands, often in concert with speech [Cohen 02, Oviatt, 99].

Many technical implementations of co-located systems have failed simply because their system design prohibited the natural interactions of multiple people [Tang, 06]. For example, early work on digital tabletops automatically rotated objects, but this protocol disrupted the fundamental role that subtle rotation variations play in coordinating collaboration, namely that the orientation of objects helps define individual and group working areas [Kruger, 04]. Similarly, early attempts to develop computer-based policies for coordination were limited due to overly rigid or poorly integrated protocols. Recent efforts in developing coordination mechanisms for tabletop interfaces have recognized the importance of existing work practices and social protocols [Tse, 04, Scott, 04].

The fundamental problem is that we do not have a foundational understanding of the basic natural interactions that people perform in a co-located environment. We do not foundationally understand:
• the actions that individuals use for communication;
• the actions that groups use to work together;
• the reasons behind these individual and group actions.

In this paper, I provide a foundation through the IGR Framework that details the natural individual and group mechanical actions that people perform in co-located environments.


This framework summarizes theories, empirical results, observational studies, and technical investigations found in the areas of psychology, anthropology, ethnography, distributed groupware, and co-located collaboration. The IGR Framework contributes a foundational background for understanding the subtle intricacies of designing interactive digital tabletop systems. I begin with an introduction to the framework, followed by a description of individual mechanical actions. I then describe the mechanical actions that are better understood within the context of a group. Next, I describe the reasons behind these mechanical actions using high level theories. I then describe the implications for interactive tabletop systems design and explain common pitfalls in applying the theoretical framework. Finally, I illustrate how the framework can be applied to the evaluation and iterative refinement of existing tabletop applications.

THE IGR THEORETICAL FRAMEWORK

The Individual, Group and Reason (IGR) Framework focuses on providing a foundational understanding of the natural and explicit interactions that people perform in co-located environments. It ignores subconscious actions such as breathing and blinking that do not directly influence a group's collaborative effort. Individual and group mechanical actions are often mistaken for subconscious acts because people take these natural interactions for granted and do not recognize their relevance to collaboration. This framework explicitly lists key activities that a tabletop designer might otherwise overlook.

To illustrate the concepts of this paper, I will use the following short episode, typical of two people working together over a physical table (Figure 1): John and Mary are sorting photos, where their goal is to decide which ones to use as part of a slide show. John walks away from the table, selects and lifts a photo album off a nearby shelf, and takes a few photos out of it. He then brings both the photo album and the photos back to the table. He puts the photo album on the floor next to him. He then stacks all but one of the selected photos on the table near him, and pushes the remaining photo to the middle of the table while saying "what do you think about this photo?" As he does these acts, Mary is sifting through the photos on the table, placing ones she finds interesting in a pile in the centre. She occasionally glances up to see what John is doing, and turns to look at the photo when he places it on the table.

INDIVIDUAL MECHANICAL ACTIONS

There are numerous actions that people perform with their arms, mouths, heads and bodies during everyday communication. These low-level actions are the building blocks from which all communication is formed. This section summarizes mechanical actions involving speech, gesture, attention, and their respective multimodal combinations.

Figure 1. John (middle) and Mary (right) working on a table near a shelf full of photo albums (left).

Speech

In speech recognition systems, speech actions are often divided into discrete command lists or free speech dictation. This overly broad categorization limits our understanding of the different speech acts that can occur in a co-located space. Some high level spoken utterances are omitted from this section; they will be discussed later in the context of group mechanical actions and underlying theories.

Direct Artifact Indication: Speech actions can directly refer to an object in the shared environment. For example, Mary could refer to "the photo album on the floor next to John". Speech often uses terms such as "there", "that" or "this" to complete a locative action such as pointing at an artifact in the collaborative environment [Clark, 96, Bolt, 80]. For example, when John says "this photo" he is completing the indication started by his hand gesture over the photo with a speech act.

Indirect Artifact Indication: Speech actions need not refer to objects in the collaborative environment. They often represent objects that the speaker has seen or heard about in the past. There are two methods of indirectly indicating artifacts using speech: icons and symbols [Clark, 96]. Icons represent an object perceptually, often by demonstrating a physical property of the object [Clark, 96]. For example, John could mimic the sound of a loud motorcycle engine to indicate that he is talking about a Harley Davidson. Symbols are speech acts that indicate a specific artifact or expression [Clark, 96]. The simplest example is the use of spoken language (e.g., "photo", "table"). However, there are non-language speech symbols as well. For example, if John makes a wolf whistling sound after Mary walks by, this could be interpreted to mean "how beautiful". If Mary consequently made a clicking noise with her tongue, this could be interpreted to mean "shame on you".

Expression: People vary characteristics of their speech using tone, timing and volume to add expression to their speech acts [Clark, 96]. For example, if John hesitates to say "what… do you… think… about this photo", this might indicate to Mary that John is not sure if the photo should be added. Similarly, if Mary screamed "HEY!!" when John moved the photo to the middle of the table, this would indicate that Mary does not approve of John's action.

Gesture

Gestures on a computer are often understood to be the click and drag movements that one can perform with a mouse or stylus. This overly broad simplification limits our understanding of the natural interactions that people perform with their hands and arms in a co-located setting.

Direct Artifact Indication: The most obvious forms of direct artifact indication are pointing and touching; these are the fundamental metaphors used in current operating systems and most digital tabletop applications. However, there are many other ways of directly indicating objects, such as rotating, sweeping, grabbing, pushing, kicking, punching, body checking, and jumping [Clark, 96, Kruger, 04]. For example, Mary can indicate that a set of photos belongs to her by using her arms to sweep them closer to her side of the table. John can body check a sealed door to indicate that it won't open.

Indirect Artifact Indication: Similar to speech acts, there are two types of indirect artifact indications: icons and symbols [Clark, 96, McNeill, 92]. Icons demonstrate an object perceptually through instruments such as drawing, measuring, acting and mimicking [Clark, 96, McNeill, 92]. For example, John could describe the length of the fish he caught using two hands spread apart. Mary could demonstrate a volleyball spike by simulating the movements with an imaginary ball. Iconic gestures or demonstrations are often characterized by three distinct phases: preparation, stroke and recovery [Clark, 96, Wu, 06, Buxton, 00]. For example, the preparation phase of Mary's volleyball gesture is when she bends her knees and prepares to jump. This is followed by the jump and spike of the imaginary volleyball (the stroke). Finally, Mary returns to her upright position; this marks the recovery.

Symbols have specific meanings; the simplest example is written language. However, there are also non-language symbols that are often culturally specific. For example, putting the middle finger over the index finger means "May I be protected" in England, Scandinavia, and Yugoslavia, but "I am breaking a friendship" in Turkey and Corfu, and "May I have good luck" in North America. Other examples include: thumbs up ("I approve") and an index finger to protruding lips ("be quiet") [Clark, 96].

Expression: These are gestures that do not represent an artifact or the properties of an artifact. McNeill describes beat gestures as those where the hand moves with the rhythmic pulsation of speech [McNeill, 92]. For example, John adds expression to his comments by moving his hand up and down during each point that he wants to describe to Mary. Mary can contrast two points by raising her left hand for the first point and raising her right hand for the second point [Bekker, 95].

Attention

This section covers the other explicit actions that people perform that do not involve speech or gesture. Attention is a very important cue that helps provide a context for speech and gesture activity. While always-on speech and gesture recognition would inhibit the natural speech and gesture actions that people perform together, attention detection could help a computer understand when speech and gesture actions are meant for the computer vs. another person.

Artifact Indication: Artifacts or groups of artifacts can be directly indicated through gaze, head position, torso orientation, and space occupation [Clark, 96]. If John says "this chair is too short", Mary knows that he is referring to the chair he is currently sitting in. These attention cues are also useful for determining who a particular speech or gesture act is directed to [Clark, 96]. For example, when John asks "what do you think", his torso and head turn towards Mary. Similarly, there are symbolic indications involving body posture and head gaze. A nod signifies approval and a shoulder shrug says "I don't know". These symbols are also culturally dependent. A bow in North America signifies "thank you" whereas a bow in Japan signifies "hello" [Clark, 96].

Expression: People's facial gestures and body positions can help add expression to their speech and gesture actions [Clark, 96]. For example, if John is slouching in his chair with a dejected look on his face, Mary might ask "what's wrong?"

Combining Speech, Gesture and Attention

The majority of co-located activities involve combinations of gesture, speech and attention. For example, when John places the photo in the middle of the table and asks Mary what she thinks, he is simultaneously gesturing over the photo with his hands, speaking to the gesture with his comment, and looking at the photo in the middle of the table.

Multimodal Preference: Ethnographic studies of Brigadier Generals in military command and control situations have shown that the majority (69%) of their speech utterances are coupled with gestures that often involve multiple fingers and two hands placed on top of a paper map [Cohen, 02]. Similarly, observations of people using digital maps showed that 95% preferred multimodal interaction while 5% preferred pen only. No one preferred a speech-only interface [Oviatt, 97].

Complementary Modes: Speech and gestures are strikingly distinct in the information that each transmits, how it is used during communication, the way it interoperates with other communication modes, and how it is suited to particular interaction styles. For example, studies clearly show performance benefits when people indicate artifacts – points, paths, areas, groupings and containment – through gestures instead of speech [Cohen, 00, Oviatt, 97].

Simplicity, Efficiency and Errors: Empirical studies of speech/gesture vs. speech-only interaction by individuals performing map-based tasks showed that multimodal input results in more efficient use of speech (23% fewer spoken words), 35% fewer disfluencies (content self-corrections, false starts, verbatim repetitions, spoken pauses, etc.), 36% fewer task performance errors, and 10% faster task performance [Oviatt, 97].

Gestures and Speech are a Single System: McNeill argues that gesture and speech are closely linked in our minds and should be viewed as aspects of a single cognitive process [McNeill, 92]. Research on people's speaking patterns indicates:
• Gestures occur only during speech: people almost never gesture while listening, and 90% of a speaker's gestures occur only when the speaker is actually saying something.
• Gestures are co-expressive: both speech and gesture express the same or closely related meaning.
• Gestures are often synchronous: the stroke of a gesture often occurs simultaneously with key speech utterances.
Physiological evidence also reveals that gestures and speech develop together in children and break down together in aphasia [McNeill, 92].
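The preparation, stroke and recovery phasing described above, together with the synchrony of strokes with key speech utterances, suggests that even a crude motion segmenter can give a system useful structure to align with speech. The following Python sketch labels a stream of tracked hand samples by phase; the type names and speed thresholds are hypothetical illustrations chosen for this example, not values taken from any cited system.

import math
from dataclasses import dataclass
from typing import List, Tuple

STROKE_SPEED = 0.60   # metres per second; hypothetical threshold for a stroke
REST_SPEED = 0.15     # below this the hand is treated as at rest

@dataclass
class HandSample:
    t: float   # time in seconds
    x: float   # tracked hand position over the table, in metres
    y: float

def label_phases(samples: List[HandSample]) -> List[Tuple[float, str]]:
    """Label each sample 'rest', 'preparation', 'stroke' or 'recovery'.

    Slow motion before a fast burst counts as preparation, the burst itself
    as the stroke, and slow motion after the burst as recovery.
    """
    labels: List[Tuple[float, str]] = []
    seen_stroke = False
    for prev, cur in zip(samples, samples[1:]):
        dt = max(cur.t - prev.t, 1e-6)
        speed = math.hypot(cur.x - prev.x, cur.y - prev.y) / dt
        if speed >= STROKE_SPEED:
            phase, seen_stroke = "stroke", True
        elif speed <= REST_SPEED:
            phase, seen_stroke = "rest", False
        else:
            phase = "recovery" if seen_stroke else "preparation"
        labels.append((cur.t, phase))
    return labels

A recognizer built this way could, for example, time-stamp each stroke and pair it with speech heard in the same moment, in keeping with the synchrony point above.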

GROUP MECHANICAL ACTIONS

Some mechanical actions that people naturally perform in a co-located environment are better understood within the context of a group. This section provides a framework for understanding workspace division, task division and publicized actions. Since seating arrangement and display size are factors that influence mechanical actions, they are discussed within the context of group mechanical actions.

Workspace Division

Rigidly defined personal and group territories overlook the many subtle nuances of space use on a tabletop. Designers can improve the design of tabletop applications by understanding the ways in which people naturally divide the workspace when collaborating over a physical table surface.

Spatial Partitioning and Territoriality: When multiple people interact simultaneously over a shared surface there is a risk of interference: the obstruction of another person's current working area. Empirical studies have shown that people on digital surfaces naturally mitigate interference by working apart from one another. This spatial partitioning is highly dependent on people's seating position and orientation [Tse, 04]. Scott extends the traditional notion of personal and group space [Tang, 91, Morris, 04] with personal, group and storage territories (Figure 2). Territories differ from personal space in that they have both spatial and functional properties. The spatial properties of territories can be influenced by the number of collaborators, seating arrangement, table size, task activities and visible barriers [Scott, 04].

Figure 2. Territories in collaborative work [Scott, 04].

Personal territories support obtaining and reserving resources. They are often located directly in front of where a person is seated [Scott, 04, Tang, 91]. For example, Mary and John have collections of photos in their personal territories, and they avoid performing actions in each other's space.

Group territories are where most task activities occur and cover the remaining space not used by personal or storage territories. These territories support obtaining resources, protecting work, handing off and depositing artifacts [Scott, 04, Gutwin, 00]. For example, Mary and John can deposit photos in the middle of the table to be included in the slide show. They both monitor the group territory to protect their selections. Group territories do not always need to be stationary. For example, the Lazy Susan seen in most Chinese restaurants is a rotating group space that allows people around the dinner table to reach items that are far away [Hinrichs, 05].

Storage territories are mobile spaces that are used to store and reserve shared task resources [Scott, 04]. For example, the photo album that moved from the nearby shelf to the floor beside John and Mary would be a shareable storage territory.

Workspace Customization and History: In a shared environment people often customize a workspace to better suit their collaborative tasks. As mentioned earlier, people change the position, shape and size of personal, group and storage territories depending on their situations and tasks. Studies have shown that this spatial positioning is important in reducing the cognitive load of collaborators [Hutchins, 95, Hollan, 00]. For example, carpenters often make recently used tools available for reuse by keeping them on hand, with infrequently used but valuable tools located farther away. This relieves the need to remember where the recently used tools are kept in the workshop [Greenberg, 93]. John's interactions would be limited if his photo album automatically returned to the shelf each time he picked up a photo from it. People are constantly reorganizing space to enhance performance. Intelligent spatial arrangements can be used to [Hollan, 00]:
• simplify choice: electrical tape over a broken light switch
• simplify perception: similar jigsaw pieces placed together
• simplify internal computation: locating carpenters' tools
Physical objects support simplification through a history of use [Hollan, 00]. For example, a new paperback book opens to the last place that you stopped reading, and well worn paths often indicate shortcuts or safe routes.

Ownership of Artifacts: "Although groupware systems often have a strong notion of ownership – where the system dictates access control and restricts who can do what – in real life, ownership and control is a socially mediated process determined by implicit subtleties such as proximity and history of use" [Kruger, 04, Gutwin, 00]. Kruger describes how the orientation of artifacts supports the coordination of personal and group territories [Kruger, 04]. For example, all of the photos in John's personal territory are oriented towards his seating position, while photos in the middle of the table are in a compromise orientation that both John and Mary can see. Rigid access control over who could access what photos would limit this natural social mediation.
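One way a tabletop toolkit might keep track of the territorial mechanics above is sketched in Python below. Scott describes territories behaviourally rather than as software objects, so the structure and names here are assumptions made for illustration. Consistent with the socially mediated ownership just described, the interference check only reports a potential intrusion so it can be publicized; it does not block the touch.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Territory:
    owner: str                                  # e.g. "John", "Mary" or "group"
    kind: str                                   # "personal", "group" or "storage"
    bounds: Tuple[float, float, float, float]   # (x0, y0, x1, y1) in table coordinates

    def contains(self, x: float, y: float) -> bool:
        x0, y0, x1, y1 = self.bounds
        return x0 <= x <= x1 and y0 <= y <= y1

def interference_candidates(territories: List[Territory], who: str,
                            x: float, y: float) -> List[Territory]:
    """Personal territories of *other* people that a touch at (x, y) lands in."""
    return [t for t in territories
            if t.kind == "personal" and t.owner != who and t.contains(x, y)]

A system could, for instance, softly highlight a returned territory so that the owner notices the reach, leaving the resolution to the group's social protocols.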


Task Division

Observational studies have shown that people often work simultaneously over physical table surfaces. For example, Tang observed that approximately 50-70% of people's activities around the tabletop involved simultaneous access to the space by more than one participant [Tang, 91]. People naturally divide tasks to support parallel operation.

Divide and Conquer: If a task is divisible, people can separate parts of the task into independent chunks [Tse, 04, Tang, 06]. For example, John and Mary can easily divide the task of selecting photos by working in their own personal territories. If John and Mary were using a map to plan their next vacation together it would be harder to divide the task into independent chunks.

Roles: It is typical to see rigidly defined roles in real world collaborative situations [Cohen, 02, Hutchins, 95, Heath, 91, Segal, 92]. Social organization helps to reduce conflict and makes it easier to predict the future actions of others [Gutwin, 00, Hutchins, 95]. Consider a naval ship example: a crew member might be assigned to report when the ship reaches a certain depth. This crew member has the role of a daemon waiting for specified conditions before performing a particular action. A phone talker on a ship listens to phone calls and waits for pauses in bridge activity to provide new notifications. This phone talker has the role of a buffer for information received over the phone [Hutchins, 95].

Mixed Focus Collaboration: The transition between group work and individual work has been of particular interest to the research community. Hancock described collaboration as a complex adaptive system where designers need to consider sudden and unpredictable shifts between individual and group work [Hancock, 06]. Tang identified six styles of collaboration (same problem same area, one working another actively monitoring, same problem different area, one working another peripherally glancing, one working another not working, and different problems) and revealed that the style of collaboration is directly related to how closely people stand together around the table [Tang, 06, Inkpen, 05]. For example, when John places the photo in the middle of the table and asks Mary for assistance, he gradually transitions from a same problem different area style to a one working another actively monitoring style of collaboration.

Publicized Actions

People often perform actions in a co-located space for the benefit of the group. These public actions are expressed through requests for validation and assistance, spoken alouds and gestures. Public actions are important because they are the primary means for providing workspace awareness to other collaborators.

Validation and Assistance: During conversation, people provide cues to show they understood what was said. Most question and answer pairs include implicit validation [Clark, 96]. For example, if Mary responded "it's good" to John's question "what do you think about this photo?", this would validate that John's question was understood. There are other forms of validation provided by non-language means; for example, a nod signifies "I understand", and the request "pass the pepper please" can be answered by completing the task of providing the pepper. Distance reaching is often made into an explicit request for assistance because asking is less disruptive than reaching over other people's personal and group territories. If a person does not understand or if they require assistance they can explicitly ask or break the discourse [Clark, 96]. For example, if Mary did not answer John's question in a reasonable amount of time, he would assume that she did not understand. To provide assistance it is important to monitor what others are doing to understand their current task state [Gutwin, 00]. For example, John may see that Mary is busy sorting her own photos and will ask for assistance when she returns to the table.

Alouds: Alouds are high level spoken utterances made by the performer of an action for the benefit of the group, but not directed to any one individual in the group [Heath, 91]. This 'verbal shadowing' becomes a running commentary that people commonly produce alongside their actions. For example, John may say "I'm adding this photo" for a variety of reasons:
• To make Mary aware of the actions that he is doing
• To forewarn Mary about the addition of a photo
• To allow Mary to coordinate her actions with his own
• To reveal his course of reasoning
• To contribute to a history of the decision making process
People's alouds are sensitive to the context of the collaborative environment. For example, John would not say "I'm adding this photo" if Mary was getting a photo album from the shelf; instead he would wait until Mary returned to the table before asking her opinion about the new photo.

Gestures: As people gesture around a tabletop they make their actions visible and public. Contrast this with an individual working on a desktop computer, where their hands are bound to the keyboard and the mouse. This limited movement produces almost no awareness information for others [Gutwin, 04]. Consequential communication is the direct result of manipulating artifacts on a table [Gutwin, 00]. For example, when John moves the photo to the middle of the table he is providing consequential communication through the movement and reorientation of the photo.

Workspace Awareness: People maintain a shared mental model of the tasks and work of others through workspace awareness. Awareness elements include answers to the questions:
• Who: presence, identity, authorship
• What: action, intention, artifact
• Where: location, gaze, view, and reach
Awareness information is updated by monitoring the individual mechanical actions of others and the changes they make to artifacts in the shared workspace [Gutwin, 04]. While feedback is taken for granted in the physical world, it needs to be explicitly added to digital table displays. Designers need to pay particular attention to the visual, auditory, haptic and other sensory cues found in real life interaction [Gutwin, 04]. For example, when John flips through a photo album he feels the edge of the pages with his hands, Mary can see both the pages and John's hand flipping through the album, and Mary can hear the sounds made when a page is flipped. Manipulating the artifacts also produces feedthrough that Mary can see, such as a new photo placed in the middle of the table.

Designers of interactive digital tabletop applications should also be sensitive to the effects that private (invisible) spaces have on workspace awareness. If each person's workspace is completely invisible to others around a table, there is a risk that the spatial references made by gestures, speech and attention will be meaningless to collaborators [Gutwin, 00, Tse, 06]. For example, if John and Mary were sorting photos using augmented reality goggles, John might gesture over a set of digital photos that are invisible to Mary, and his gesture would not make sense to her.
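Workspace awareness updates can be thought of as a stream of who/what/where records that a system redistributes to the group as feedthrough. The Python sketch below is a hypothetical rendering of that idea; the field names mirror the awareness questions listed above rather than the API of any cited toolkit.

import time
from dataclasses import dataclass, field

@dataclass
class AwarenessEvent:
    """A single who/what/where update broadcast to everyone around the table."""
    who: str        # presence and identity, e.g. "John"
    what: str       # action or intention, e.g. "moved photo_17"
    where: tuple    # location on the table, e.g. (x, y)
    timestamp: float = field(default_factory=time.time)

def feedthrough(event: AwarenessEvent) -> str:
    """Render an event as the kind of feedthrough a display might show or speak."""
    x, y = event.where
    return f"{event.who} {event.what} at ({x:.0f}, {y:.0f})"

# Example: John's consequential communication when he slides a photo to the centre.
print(feedthrough(AwarenessEvent("John", "moved photo_17 to the group territory", (512, 384))))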

REASONS UNDERLYING MECHANICAL ACTIONS

There are two high level theories that provide an explanation for many of the low level mechanics shown above: Common Ground and Distributed Cognition. These theories should help the reader understand why people perform the individual and group mechanical actions of co-located collaboration.

Common Ground

Shared understandings of context, environment and situations form the basis of a group's common ground. A fundamental purpose behind all communication is the increase of common ground. This is achieved by obtaining closure on a group's joint actions [Clark, 96]. For example, Mary's response to John's question provides closure to his question and increases their collective common ground. In a collaborative setting, an increased amount of common ground results in smoother collaboration: it is easier to work with close friends than with complete strangers because you share more common ground with your friends. Many of the individual mechanical actions generate group awareness that is used to update and improve common ground. Furthermore, many of the group mechanical actions are attempts to coordinate activities within the co-located environment. This results in improved common ground within the group.

Tracks: Conversations can be broken up into two tracks: track one describes the business of the conversation and track two describes the efforts made to improve communication [Clark, 96]. For example, if we consider this paper as a conversation between me, the writer, and you, the reader, the scenario of John and Mary is a track two communication attempting to establish common ground to improve your understanding of the IGR Framework (the business). Most of the individual and group mechanics described in this paper fall into the category of track two communications.

Insufficient Common Ground: Errors in co-located collaborative work are often the direct result of insufficient common ground. Interference results from the inability to predict where others are going to be working. People accidentally destroy other people's work by failing to recognize the personal territories that the group has established. Confusion about the current state of the task results from a failure to monitor the shared workspace. Frustration occurs when people assume that others share the same common ground that they have.

Distributed Cognition

While cognitive psychology focuses specifically on the processes within an individual's mind, Hollan and Hutchins challenge designers to consider the collaborative group as one distributed cognitive system. For example, consider communication aboard a navy ship where people have very distinct roles and tasks. In navigation, the outcomes that matter to the ship are not determined by the cognitive processes of any single person. Instead, the navigational team must communicate and work together to determine the current location of the ship. Different people understand different parts of the system. Knowledge about the environment is shared between people through communication. This redundant distribution of knowledge adds flexibility and robustness, so the system does not fail on account of a single person. Every action can be considered a piece of the bigger computation that manages the navigation of the ship [Hollan, 00, Hutchins, 95].

Below is a summary of the core principles of distributed cognition and their direct relationship to group mechanical actions.

Structure in the Environment: People use the layout of artifacts in co-located environments to offload cognitive effort that they would otherwise have to expend remembering things internally. This explains why workspace division is so important to collaborative activity.

Coordination: The parallel activities of multiple people will not be effective unless there is good coordination of their activities. Task division allows people to use social organization to perform parallel activities. However, the coordination of parallel activities requires effort. Publicized actions constitute the track two activities that groups use to improve common ground.

SUMMARY

In summary, Common Ground and Distributed Cognition are the reasons that direct group mechanical actions. The group mechanical actions of task division, workspace division and publicized actions are supported by individual mechanical actions (Figure 3). These individual mechanical actions support direct indication, indirect indication and expression.

Figure 3. Diagram of the IGR Framework.

IMPLICATIONS TO DESIGN

In order for designers to create tabletop systems that are effective for co-located collaboration, they must first understand the natural behaviors of individuals, groups and their respective reasons. This is the fundamental implication of the IGR Framework. I illustrate this point with several examples from the literature.

RNT: Kruger defined an interaction technique called Rotate'N'Translate (RNT) that allows individuals to seamlessly rotate and translate objects on a tabletop surface using a single point of contact [Kruger, 04]. When compared with the traditional windows metaphor of separate translation areas and rotation handles, RNT provides a much closer mapping to physical reality. As mentioned earlier, rotation plays an important role in distributed cognition and common ground and is one of the fundamental tools used to divide shared workspaces.
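To make the mapping concrete, the Python sketch below implements one step of a simplified single-contact rotate and translate. It is not Kruger's RNT algorithm, which models an opposing pseudo-physical current; it only reproduces the qualitative behaviour that dragging toward or away from an item's centre translates it, while dragging sideways swings it around, roughly like pulling a sheet of paper by one corner.

import math
from dataclasses import dataclass

@dataclass
class TableItem:
    cx: float       # centre of the item on the table surface
    cy: float
    angle: float    # orientation in radians

def rotate_and_translate(item: TableItem, grab, finger):
    """Advance the item by one frame of a single-contact drag.

    `grab` is where the finger was on the previous frame and `finger` is
    where it is now. Returns the new grab point for the next frame.
    """
    gx, gy = grab
    fx, fy = finger

    # How far the contact direction (centre -> finger) has swung this frame.
    a_old = math.atan2(gy - item.cy, gx - item.cx)
    a_new = math.atan2(fy - item.cy, fx - item.cx)
    d_theta = a_new - a_old
    item.angle += d_theta

    # Rotate the old grab point about the centre by the same amount ...
    dx, dy = gx - item.cx, gy - item.cy
    rx = item.cx + dx * math.cos(d_theta) - dy * math.sin(d_theta)
    ry = item.cy + dx * math.sin(d_theta) + dy * math.cos(d_theta)

    # ... then translate so the grabbed spot ends up under the finger again.
    item.cx += fx - rx
    item.cy += fy - ry
    return (fx, fy)

With this formulation a purely radial drag produces no rotation (pure translation), while a tangential drag both turns and carries the item, which is the property that lets people pass and reorient items toward one another with one finger.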

Interface Currents: Hinrichs defined a new interaction technique known as interface currents. Currents support the movement of objects over a distance using a conveyor metaphor similar to the boat trains seen in some sushi restaurants [Hinrichs, 05]. Items flow along a common track and can be removed or added at any time. Currents support Distributed Cognition by supporting workspace division through personal, group and storage territories. For example, items placed in a current become smaller, making currents well suited for storage territories.

Multimodal Interaction: Tse illustrated interaction distinct from the simple single point interactions seen in most interactive tabletop systems [Tse, 06]. Rich multi-finger and whole-handed gestures along with speech recognition on a tabletop system improved Common Ground and Distributed Cognition by better supporting the publication of individual mechanical actions [Tse, 06, Wu, 06]. Figure 4 shows two people interacting with a Command and Control game where one person uses two hands side by side and the speech command "Label as Unit 1" to select an area, while the other person specifies a location with the "move here" speech utterance.

Figure 4. Multimodal Tabletop Interaction [Tse, 06].
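The pattern behind this example is late fusion: a deictic speech command is resolved against the speaker's most recent gesture inside a short time window. The Python sketch below illustrates that pattern only; the event types, field names and window size are assumptions for illustration, not the fusion engine of the system shown in Figure 4.

from dataclasses import dataclass
from typing import List, Optional

FUSION_WINDOW = 1.5   # seconds; hypothetical, tuned empirically in real systems

@dataclass
class GestureEvent:
    t: float
    user: str
    kind: str       # e.g. "point" or "two_hand_area"
    payload: dict   # the location or region the gesture produced

@dataclass
class SpeechEvent:
    t: float
    user: str
    phrase: str     # e.g. "move here" or "label as unit 1"

def fuse(speech: SpeechEvent, gestures: List[GestureEvent]) -> Optional[dict]:
    """Pair a speech command with the same speaker's freshest nearby gesture."""
    candidates = [g for g in gestures
                  if g.user == speech.user and abs(speech.t - g.t) <= FUSION_WINDOW]
    if not candidates:
        return None   # deictic speech with no gesture is ambiguous; ask or ignore
    gesture = max(candidates, key=lambda g: g.t)
    return {"user": speech.user, "command": speech.phrase, "target": gesture.payload}

In the Figure 4 example, "move here" would fuse with the speaker's latest pointing gesture, while "Label as Unit 1" would fuse with the two-handed area gesture made by the other person.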

Display and Input Technology Factors: Numerous papers have been published concerning the benefits and tradeoffs of different display orientations, sizes, seating arrangements and input technologies, e.g., [Inkpen, 05, Scott, 04, Ryall, 04]. To create effective tabletop systems, designers need to understand the tasks that people will do on the tabletop so they can select appropriate technologies and interaction metaphors to maximize the distributed cognition and common ground needs of the team.

CAUTIONS FOR DESIGNERS

A diligent understanding of the individual and group mechanical actions that people naturally perform in a co-located environment requires an equally diligent implementation by designers. There is ultimately a risk of developing interaction techniques that show the designer is aware of individual and group mechanical actions but do not actually support a group's collaborative activity.

The Implicit to Explicit Trap: There is often a risk that designers will attempt to support the natural activities of collaborators by making explicit what is implicit in people's everyday co-located work [Dix, 98]. For example, if Mary and John were using a tabletop system that required them to specify the owner of each photo, this would diminish the effectiveness of their collaboration because the computer is requiring them to repeat common ground that the group already shares. This makes it difficult to establish closure and hinders the establishment of common ground. Sometimes people do make implicit actions explicit in the real world to support coordination and collaboration. For example, a Lazy Susan in a Chinese restaurant is an implicit concept made explicit to support the reach of artifacts on a table. An understanding of the benefits and tradeoffs of a particular implementation can be attained by examining its benefits against the reason section of the IGR Framework. Designers can learn a lot by simply looking at how people interact in the real world, since people are experts at adapting to their surroundings [Greenberg, 93].

More is Not Always Better: No current tabletop system supports all the individual and group mechanics mentioned in the framework. While the interaction techniques mentioned in the implications may benefit from particular individual and group mechanics, there is a risk that additional technology may require behaviors that hinder Distributed Cognition and Common Ground. For example, people using speech recognition technology require a way to specify that a verbal command is meant for a collaborator rather than the computer. One simple solution is to add the attention modality to the system through eye tracking. By monitoring when a person is looking at the digital table vs. looking at another person, the computer can determine when to turn speech recognition on and off. While this seems like a good idea, people may not look at the tabletop until the stroke phase of their gesture [Oh, 02, Clark, 96]. For example, John is only looking at the table when he says "this photo". If speech recognition were turned on only when he looked at the table, the computer would miss half of John's phrase. Forcing John to always look at the table when he is using speech limits the communication and establishment of common ground with Mary. Mary is also looking at the photo but not saying anything; her validations may be incorrectly interpreted as commands to the computer. More modalities are better only if we clearly understand their subtle interplay in everyday natural interaction.
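One way around this pitfall, sketched below in Python under the assumption of a continuously running recognizer, is to keep recognizing at all times but only commit recently heard words once gaze arrives at the table, with a short look-back so the start of the phrase is not lost. The class and constants are hypothetical; the point is the buffering pattern, not a particular speech or gaze API.

from collections import deque
from dataclasses import dataclass
from typing import Deque, List

LOOKBACK = 2.0   # seconds of speech retained even while gaze is away (hypothetical)

@dataclass
class Word:
    t: float
    user: str
    text: str

class GazeGatedSpeech:
    """Buffered look-to-talk: recognize continuously, commit retroactively."""

    def __init__(self) -> None:
        self.buffer: Deque[Word] = deque()

    def on_word(self, word: Word) -> None:
        """Called for every recognized word, whether or not gaze is on the table."""
        self.buffer.append(word)
        # Drop words that are already too old to ever be committed.
        while self.buffer and word.t - self.buffer[0].t > LOOKBACK:
            self.buffer.popleft()

    def on_gaze_at_table(self, user: str, t: float) -> List[Word]:
        """Release this user's recent words to the command interpreter."""
        released = [w for w in self.buffer if w.user == user and t - w.t <= LOOKBACK]
        self.buffer = deque(w for w in self.buffer
                            if w.user != user or t - w.t > LOOKBACK)
        return released

In the episode above, John's "this photo" spoken just before he glances down would still reach the interpreter, while Mary's validations, spoken without her gaze ever turning to the table, would expire in the buffer instead of being treated as commands.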

USING THE IGR FRAMEWORK FOR EVALUATION

To illustrate the use of the IGR framework in practice, I present the following scenario (Figure 5): Mary and John are using an interactive digital tabletop to select photos for a slide show. They bring their photos from home on digital media (e.g., a USB key) where each photo's file name is an annotated caption. Mary and John load their photos onto the digital tabletop. The computer automatically organizes the photos into two large rings around the table; the outer ring shows photos whose filenames start with A to M and the inner ring shows photos whose filenames start with N to Z. All the photos on the table slowly rotate clockwise so they are accessible. Photos are automatically rotated around the circle so they are upright for the person seated at the closest edge of the table. John and Mary can both move photos around the tabletop simultaneously using a single finger, and their actions result in direct changes to the size, position and orientation of the photo. John pushes a photo towards the centre of the table and asks Mary "what do you think about this photo?"

Figure 5. John (left) and Mary (right) working on a digital tabletop.

Reasons: Comparing this digital scenario to the physical scenario described earlier, we see that the digital scenario has many weaknesses. Automatically rotating all of the photos on the digital tabletop prevents Mary and John from using the position of the photos to support Distributed Cognition. The inability to permanently place a photo on the table also prevents Mary and John from establishing common ground regarding their working areas. In fact, the automatic combining of both sets of photos into one giant list sorted by filename eliminates the common ground that Mary and John started with, namely their separate sets of photos.

Group Mechanics: Division of the workspace is hindered since John and Mary have lost the ability to use photo orientation to establish personal and group territories. In fact, many of Mary's photos are oriented upside down for her, making them harder to see. Furthermore, the automatic rotation of all photos prevents coordination through personal, group and storage territories. John's act of moving a photo towards the centre of the table is less visible to Mary since there is a large amount of motion already on the screen. The direct visual feedback of moving a photo is beneficial but could be further improved with subtle auditory feedback. Finally, the automatic photo rotation scheme prevents task division, as people cannot establish space in which to do individual work.

Individual Mechanics: This system only recognizes the input of a single finger. Rich gestures, such as selecting a group of photos using two hands, are not supported. The system would benefit from speech commands such as "stop moving" or "pass me those photos". Automated photo rotation also lessens the awareness information provided by monitoring John and Mary's eye gaze.

CONCLUSION

Designers need to understand the natural interactions that people perform in everyday collaborative work in order to design effective co-located tabletop systems. The Individual, Group and Reason (IGR) Theoretical Framework presents a foundational understanding of natural co-located activities. It summarizes the low level individual mechanical actions of speech, gesture, attention and their respective combinations. It provides a framework for understanding how people naturally coordinate their actions by dividing workspaces, how they distribute tasks to support parallel activity, and how they publicize their actions to improve group awareness. Coordination, parallel activity and awareness are tools that people naturally use to improve common ground so they can work more effectively as a distributed cognitive system.

FUTURE WORK

This paper presents a foundational understanding of the basic mechanics and high level theories behind co-located tabletop collaboration. It is not designed to be a comprehensive survey of all theories, ethnographic studies, and empirical and observational studies related to co-located interaction. Rather, it provides a basic understanding for designers who are not familiar with designing co-located interactive tabletop systems. Further empirical and observational studies will help designers understand how to create digital tabletop systems that are truly beneficial for co-located collaborators.

ACKNOWLEDGMENTS

Saul Greenberg, Chia Shen, Sheelagh Carpendale, and Carey Williamson are all connected to the conspiracy of my abduction from March 6 to 9, 2006. This research was done with Mitsubishi Electric Research Laboratories. Sponsors include: ARDA/NGA, NSERC, Alberta Ingenuity and iCORE. Opinions in this paper do not reflect the views and opinions of any governmental agencies.

REFERENCES

1. Bekker, M.M., Olson, J.S. and Olson, G.M. (1995). Analysis of gesture in face-to-face design teams provides guidance for how to use groupware in design. Proc. Designing Interactive Systems (DIS) 1995, pp. 157-166.
2. Bentley, R., Hughes, J., Randall, D., Rodden, T., Sawyer, P., Shapiro, D. and Sommerville, I. (1992). Ethnographically-informed Systems Design for Air Traffic Control. Proc. Computer-Supported Cooperative Work (CSCW) 1992, pp. 123-129.
3. Bolt, R.A. (1980). Put-that-there: Voice and gesture at the graphics interface. Proc. ACM Conf. Computer Graphics and Interactive Techniques, Seattle, pp. 262-270.
4. Buxton, W.A.S. (1995). Chunking and phrasing and the design of human-computer dialogues. In Human-Computer Interaction: Toward the Year 2000, Morgan Kaufmann Publishers Inc., pp. 494-499.
5. Chin, T. (2003). Doctors Pull Plug on Paperless System. American Medical Association News, Feb 17, 2003. http://www.ama-assn.org/amednews/2003/02/17/
6. Clark, H. (1996). Using Language. Cambridge University Press.
7. Cohen, P. (2000). Speech can't do everything: A case for multimodal systems. Speech Technology Magazine, 5(4).
8. Cohen, P.R., Coulston, R. and Krout, K. (2002). Multimodal interaction during multiparty dialogues: Initial results. Proc. IEEE Int'l Conf. Multimodal Interfaces, pp. 448-452.
9. Dix, A., Finlay, J., Abowd, G. and Beale, R. (1998). Human-Computer Interaction, Second Edition. Chapters 13-14. Prentice Hall International.
10. Greenberg, S. (1993). The Computer User as Toolsmith: The Use, Reuse, and Organization of Computer-based Tools. Cambridge University Press.
11. Gutwin, C. and Greenberg, S. (2004). The importance of awareness for team cognition in distributed collaboration. In E. Salas and S. Fiore (Eds.), Team Cognition: Understanding the Factors that Drive Process and Performance, APA Press, pp. 177-201.
12. Gutwin, C. and Greenberg, S. (2000). The Mechanics of Collaboration: Developing Low Cost Usability Evaluation Methods for Shared Workspaces. Proc. IEEE 9th International Workshop on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE'00), June 14-16, NIST, Gaithersburg, MD, USA.
13. Hancock, M. and Carpendale, S. (2006). The Complexities of Computer-Supported Collaboration. Technical Report 2006-812-05, University of Calgary, Alberta, Canada.
14. Heath, C.C. and Luff, P. (1991). Collaborative activity and technological design: Task coordination in London Underground control rooms. Proc. ECSCW 1991, pp. 65-80.
15. Hinrichs, U., Carpendale, S. and Scott, S. (2005). Interface Currents: Supporting Fluent Collaboration on Tabletop Displays. Proc. Smart Graphics 2005.
16. Hollan, J., Hutchins, E. and Kirsh, D. (2000). Distributed Cognition: Toward a New Foundation for Human-Computer Interaction. ACM Transactions on Computer-Human Interaction, 7(2), pp. 174-196.
17. Hutchins, E. (1995). Cognition in the Wild. MIT Press, Cambridge, MA.
18. Inkpen, K., Hawkey, K., Kellar, M., Mandryk, R., Parker, K., Reilly, D., Scott, S. and Whalen, T. (2005). Exploring Display Factors that Influence Co-Located Collaboration: Angle, Size, Number, and User Arrangement. Proc. HCI International 2005, July 22-27, Las Vegas, NV.
19. Kruger, R., Carpendale, M.S.T., Scott, S.D. and Greenberg, S. (2004). Roles of Orientation in Tabletop Collaboration: Comprehension, Coordination and Communication. Journal of Computer Supported Cooperative Work, 13(5-6), pp. 501-537.
20. McGee, D. and Cohen, P. (2001). Creating Tangible Interfaces by Augmenting Physical Objects with Multimodal Language. Proc. IUI 2001, pp. 113-119.
21. McNeill, D. (1992). Hand and Mind: What Gestures Reveal About Thought. University of Chicago Press, Chicago.
22. Morris, M., Ryall, K., Shen, C., Forlines, C. and Vernier, F. (2004). Beyond "Social Protocols": Multi-User Coordination Policies for Co-located Groupware. Proc. ACM CSCW 2004, pp. 266-265.
23. Oh, A., Fox, H., Kleek, M., Adler, A., Gajos, K., Morency, L. and Darrell, T. (2002). Evaluating Look-to-Talk: A Gaze-Aware Interface in a Collaborative Environment. Extended Abstracts of ACM CHI 2002.
24. Oviatt, S. (1997). Multimodal interactive maps: Designing for human performance. Human-Computer Interaction, 12.
25. Oviatt, S.L. (1999). Ten myths of multimodal interaction. Communications of the ACM, 42(11), pp. 74-81.
26. Ryall, K., Forlines, C., Shen, C. and Morris, M. (2004). Exploring the Effects of Group Size and Table Size on Interactions with Tabletop Shared-Display Groupware. Proc. CSCW 2004, pp. 284-293.
27. Scott, S.D., Carpendale, M.S.T. and Inkpen, K.M. (2004). Territoriality in Collaborative Tabletop Workspaces. Proc. ACM CSCW 2004, pp. 294-303.
28. Segal, L. (1994). Effects of checklist interface on non-verbal crew communications. NASA Ames Research Center, Contractor Report 177639.
29. Tang, A. (2006). Group Cohesion on Collaborative Tabletop Displays. Proc. ACM CHI 2006. To appear.
30. Tang, J.C. (1991). Findings from observational studies of collaborative work. International Journal of Man-Machine Studies, 34, pp. 143-160.
31. Tse, E., Histon, J., Scott, S. and Greenberg, S. (2004). Avoiding Interference: How People Use Spatial Separation and Partitioning in SDG Workspaces. Proc. ACM CSCW 2004.
32. Tse, E. (2006). Enabling Interaction with Single User Applications through Speech and Gestures on a Multi-User Tabletop. Proc. AVI 2006. To appear.
33. Wu, M., Shen, C., Ryall, K., Forlines, C. and Balakrishnan, R. (2006). Gesture Registration, Relaxation, and Reuse for Multi-Point Direct-Touch Surfaces. Proc. IEEE International Workshop on Horizontal Interactive Human-Computer Systems (TableTop 2006), pp. 183-190.
