Human Computation and Crowdsourcing: Works in Progress and Demonstration Abstracts AAAI Technical Report CR-13-01

Inserting Micro-Breaks into Crowdsourcing Workflows

Jeffrey M. Rzeszotarski1,2, Ed Chi1, Praveen Paritosh1, Peng Dai1
1 Google, Mountain View, CA 94043, USA
2 Human-Computer Interaction Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Abstract

Participants in human computation workflows may become fatigued or bored over long working hours. This leads to a slump in motivation and morale, which in the long run reduces both productivity and work quality. In this paper we present an initial investigation into ways to alleviate worker fatigue and boredom by employing micro-breaks that give workers timely relaxation during long sequences of tasks. We experimentally test micro-breaks on Amazon’s Mechanical Turk, showing that micro-breaks can significantly improve worker retention rate as task batches reach hours in length, and appear to increase overall worker engagement and commitment to their work.

Introduction

In crowdsourcing task markets such as Amazon’s Mechanical Turk or oDesk, hundreds of thousands of workers take on tasks that are difficult for a computer to solve. Workers on such markets can often perform a large number of tasks in one huge batch, which in many ways is efficient and beneficial because workers build expertise. On the other hand, many workers choose to work long hours, perhaps for reasons such as economics, competitive pressure from peers, or bad habits. We intuitively know that high task loads and long working hours can introduce negative side effects such as boredom and fatigue. For example, imagine an easy task that has workers tag images. A worker can easily tag a handful of images, or even a few more, without trouble. But suppose a worker has been tagging continuously for four hours. She has become extremely bored, and her reduced attention may not even meet the demands of such a simple task. Likewise, another worker may take shortcuts by cutting and pasting sets of tags. These working strategies are potential coping mechanisms for reduced cognitive resources and boredom.

While we now know how to mitigate quality control issues in crowdsourcing systems using CAPTCHA questions, gold standards, and majority voting (Bernstein et al. 2010; Callison-Burch 2009), it is much less clear how to deal with other performance issues such as reduced cognitive abilities brought on by fatigue and boredom. We know that fatigue, both physical and cognitive, can affect workers doing large batches of tasks, not only risking strain injuries and reduced well-being, but also producing lower quality, unreliable data as a result (Krueger 1989). Even worse, many quality control measures for crowdsourcing tasks are guided by consensus, which may be undermined by unreliable worker performance.

We propose a method that operates within a human computation system to help mitigate fatigue and boredom through breaks that are similar in size and scope to the other human computation tasks. Imagine a worker has been rating the quality of search results for 15 minutes. All of a sudden she gets a micro-break, and instead of evaluating another page she sees encouraging text and a leaderboard showing how well she is doing. Or, instead, she is rewarded with a funny video for 30 seconds. Such interruptions do incur a cost in context switching, as the worker must return to her task afterwards (Wylie and Allport 2000). Yet the interruption refreshes the worker.

Study

We performed an initial experimental investigation that explores whether implementing micro-breaks in a crowdsourcing market (Amazon’s Mechanical Turk) improves worker engagement. We implemented a system for delivering micro-breaks in a controlled fashion using a break scheduler built with Google AppEngine. Research shows that breaks can ameliorate boredom, but that their effects vary. Breaks may have more impact if they are similar to the task, and in general the benefit of a break roughly matches the length of the break (Henning et al. 1989). But even small breaks of three to thirty seconds can have a benefit. We developed two different breaks based

Copyright © 2013, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.


different from the game, introducing switching costs and making the break less effective. The image identification task is markedly different. Micro-breaks seem to have a negative effect. Instead of keeping people around, they seem to be pushing people away. This is possibly due to the rapid, perceptual nature of the task, as many tasks can be completed in a short amount of time with low effort. Because even the game requires some effort, it might only serve as a distraction. Our analysis shows that the negative effect on retention is in fact significant in the game condition. Overall, we notice slight time and accuracy improvements with breaks, though no significant changes.

Task        Intercept              Game break             Comic break
Wikipedia    1.6802 (p<0.001***)    1.0924 (p=0.002**)     1.5185 (p<0.001***)
Merging      2.5416 (p<0.001***)    0.2837 (p=0.454)       1.0354 (p=0.006**)
Images       3.8622 (p<0.001***)   -0.7662 (p=0.042*)      0.4013 (p=0.285)

Table 1: Negative binomial regression coefficient estimates for game and comic conditions.

on these findings with a goal of maximizing potential effects and exploring the design space of micro-breaks. In our first break condition we give workers a game where they can choose to risk part of their current earnings for a fair chance of more payout. We adjust the odds so that the expected extra earnings are zero, yet because participants may win from time to time, the gambler’s fallacy may provide a degree of extrinsic motivation. In the second break condition we give workers an eye-catching comic to read. We chose these breaks because one is primarily motivation- and judgment-based while the other is non-directed and cognitive. This may cause different interaction effects with differing tasks.

Our tasks include a Wikipedia article evaluation task that requires reading comprehension, a knowledge base entity merging task that requires studying and making a judgment, and an image subject identification task that needs only simple, binary judgments in rapid succession. We populated queues with at least an hour of work so as to induce boredom and fatigue. We also introduced gold standard data so that we could measure work quality. We recruited 30 unique workers per condition (3 tasks × 3 break conditions: game, comic, and a no-break control) to stay and do as many copies of the different tasks as we had. This gives us 3 × 3 × 30 = 270 unique worker submissions. We adjusted the payment rate and had breaks appear within tasks so that the compensation was equal between conditions. We examined the retention rate of the workers to measure break effectiveness. If more workers stayed to do more tasks in a condition, we suggest that it was more engaging.
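As a rough illustration of the game break's zero-expectation property (a sketch only, not the implementation used in the study), suppose a worker risks a stake for a chance at a prize. The win probability that makes the expected change in earnings zero is stake / (stake + prize). The function names, the cent-based amounts, and the single stake/prize structure below are all assumptions made for illustration; the paper does not describe the actual game mechanics.

```python
import random
from fractions import Fraction


def fair_win_probability(stake_cents: int, prize_cents: int) -> Fraction:
    """Win probability p solving p * prize - (1 - p) * stake = 0.

    Hypothetical helper: the study's real odds scheme is not specified.
    """
    if stake_cents <= 0 or prize_cents <= 0:
        raise ValueError("stake and prize must be positive")
    return Fraction(stake_cents, stake_cents + prize_cents)


def expected_net_change(stake_cents: int, prize_cents: int, p_win: Fraction) -> Fraction:
    """Expected change in earnings for one play of the game."""
    return p_win * prize_cents - (1 - p_win) * stake_cents


def play_game_break(earnings_cents: int, stake_cents: int, prize_cents: int,
                    rng: random.Random) -> int:
    """Simulate one game break and return the worker's new earnings."""
    p_win = fair_win_probability(stake_cents, prize_cents)
    if rng.random() < float(p_win):
        return earnings_cents + prize_cents  # worker wins the prize
    return earnings_cents - stake_cents      # worker loses the stake


if __name__ == "__main__":
    p = fair_win_probability(5, 15)
    print("win probability:", p)
    print("expected net change:", expected_net_change(5, 15, p))
    print("earnings after one break:", play_game_break(100, 5, 15, random.Random(0)))
```

With exact fractions the expected net change is exactly zero for any positive stake and prize, so on average the game neither raises nor lowers pay, while individual workers still win occasionally.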

Discussion

This work makes several contributions. We propose introducing micro-breaks to crowdsourcing workflows to overcome boredom and fatigue. We describe the implementation of such a system, and provide example breaks. We conduct experiments demonstrating that introducing micro-breaks into a workflow can make a difference in worker retention rates and slightly improve accuracy and work time. We find that the difference in retention varies depending upon the type of task workers were doing and the kind of micro-break they received. Tasks and breaks that were more closely aligned in content seemed to perform better, and the results suggest that increasing the complexity of a task improved the effectiveness of the micro-breaks. In the future, we plan to study more thoroughly how reward amount and the nature of the task relate to the effectiveness of micro-breaks. We suppose that micro-breaks are more useful when they are relatively less competitive with the main task and cognitively challenging. We plan to apply machine learning and decision-theoretic models (Dai et al. 2010) to optimally schedule breaks and suggest break types.
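One way to read the Table 1 estimates is as multiplicative effects. The sketch below assumes the usual log link for negative binomial regression, the no-break control as the reference level, and an outcome that counts tasks completed per worker; none of these modeling details are stated explicitly in the paper, so the numbers it prints are illustrative only. Under those assumptions, exp(coefficient) is the ratio of the expected count in a break condition to the expected count in the control.

```python
import math

# Coefficient estimates copied from Table 1.
TABLE_1 = {
    "Wikipedia": {"intercept": 1.6802, "game": 1.0924, "comic": 1.5185},
    "Merging": {"intercept": 2.5416, "game": 0.2837, "comic": 1.0354},
    "Images": {"intercept": 3.8622, "game": -0.7662, "comic": 0.4013},
}


def rate_ratio(coefficient: float) -> float:
    """Multiplicative effect on the expected count, assuming a log link."""
    return math.exp(coefficient)


def predicted_mean(task: str, condition: str) -> float:
    """Expected count for a task and condition ('control', 'game', 'comic').

    Assumes dummy coding with the control condition as the baseline.
    """
    coefs = TABLE_1[task]
    linear = coefs["intercept"]
    if condition != "control":
        linear += coefs[condition]
    return math.exp(linear)


if __name__ == "__main__":
    for task, coefs in TABLE_1.items():
        for condition in ("game", "comic"):
            ratio = rate_ratio(coefs[condition])
            print(f"{task:10s} {condition:6s} rate ratio = {ratio:.2f}")
```

A ratio above 1 corresponds to workers completing more tasks than in the control, and a ratio below 1 (as for the game break on the image task) to fewer; only the coefficients marked significant in Table 1 should be read as reliable effects.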

Results

For the Wikipedia rating task, we observed that both the game and comic micro-breaks influenced workers to perform more work compared to the control baseline. More than 20% of workers were still around compared to the control case after doing half of the tasks, without any increase in their total pay. Negative binomial regression (used because of the exponential distributions) suggests that the breaks had a significant effect (Table 1). In the entity merging tasks we see something slightly different. While the comic micro-break still has an impact, the game condition converges to the control baseline. This is confirmed by a negative binomial regression. We surmise that this is due to the cognitive nature of the task being too

References

Bernstein, M.S., Little, G., Miller, R.C., et al. 2010. Soylent: a word processor with a crowd inside. In Proc. UIST ’10, 313–322.
Callison-Burch, C. 2009. Fast, cheap, and creative: Evaluating translation quality using Amazon’s Mechanical Turk. In Proc. Conference on Empirical Methods in NLP, 286–295.
Dai, P., Mausam, and Weld, D.S. 2010. Decision-theoretic control of crowd-sourced workflows. In Proc. AAAI ’10.
Henning, R., Sauter, S., Salvendy, G., and Kreig, E. 1989. Microbreak length, performance, and stress in a data entry task. Ergonomics 32, 7, 855–864.
Krueger, G. 1989. Sustained work, fatigue, sleep loss and performance: A review of the issues. Work & Stress 3, 2, 129–141.
Wylie, G. and Allport, A. 2000. Task switching and the measurement of “switch costs”. Psychological Research 63, 3-4.
