CHI 2003 Feature: Testing... 1 2 3 4 5 ... Testing...
Source: UN, 1 May 2003
Submitted by
Larry Constantine
Usability can sometimes be more about belief than about evidence or engineering, with usability testing heading the list as a central tenet of the dogma of modern practice. One disgruntled participant in a recent conference even commented: 'It is unbelievable that an instructor at CHI would question the importance of user research and usability testing'. Yet, precisely because of its leading role, it is important for the profession to question the dogma of usability testing and for professionals to keep abreast of new developments and changing perspectives.
The received view in our field holds that testing is the yellow-brick road to usability, that testing is always a good idea, and that a small number of tests is enough to catch most of the problems. How many is enough and how much is most? Since Jakob Nielsen and Thomas Landauer published their now-classic work in 1993 (see resources at end), the widely accepted answer has been that 5 subjects are probably enough to uncover roughly 80% of the usability defects. Beyond that point the return-on-investment from added testing falls off steadily. Many organizations stake their reputations and the success of their products on some variant of this formula.
This particular piece of the received wisdom of usability was revisited in a panel at the recent CHI 2003 conference in Fort Lauderdale, Florida. Called, appropriately enough, "The Magic Number 5," the panel's acknowledged aim was to be the last panel of its kind and to lay the issue to rest once and for all.
Moderated by Nigel Bevan of Serco Usability Systems in the UK, the panel brought together Carol Barnum of Southern Polytechnic State University, Gilbert Cockton of the University of Sunderland in the UK, Rolf Molich of DialogDesign in Denmark, Jared Spool of User Interface Engineering, and Dennis Wixon of Microsoft. Jakob Nielsen of the Nielsen Norman Group, who was listed on the program but failed to show for the panel, was a persistent presence nonetheless.
True to their aim, the panel reviewed and summarised relevant work already reported elsewhere and previously discussed. As an erstwhile firm believer in the small numbers approach, Jared Spool started the panel by recounting how his views had changed after an experience with one client who insisted on testing with at least 18 users. Expecting to uncover fewer and fewer problems as testing progressed, Spool and company were surprised that new problems were still showing up at about the same pace after 16 users. Their experience was supported by Rolf Molich, whose well-known CUE (Comparative Usability Evaluation) studies have the same software evaluated by a number of different usability testing labs. In the second study in that series, for example, the 7 teams returned almost completely different, mostly non-overlapping results. Of the total 310 usability problems uncovered, 75% were identified by but a single testing team and missed by the others, and only one problem showed up in the findings of every team.
Far from being a scientific and reproducible procedure, as it is touted by many professionals and regarded by many managers, usability testing now appears to be a highly variable art in which the results depend on who is testing what by which protocol with which particular subjects. It is quite possible that for some systems being evaluated by some procedures, no matter how many subjects you test, you will continue to uncover new and significant problems. The problems you find will be different from those you would find with other users and different from those that another tester would uncover with the same number of users.
The Nielsen and Landauer work is often summarised as a set of curves plotting the discovery of defects versus the number of test subjects and the relative return-on-investment with each additional subject. The latter curve, which peaks around 5 users, was repeatedly referred to by Spool and other panelists as the 'parabola of optimism'. Not only do recent experiences and research call the numbers into question, but even the underlying statistical assumptions were challenged by Gilbert Cockton. Indeed, if there is a curve of diminishing returns that gradually levels off with more and more test subjects – and that itself is no longer beyond question – the shape of the curve may be unique to every product and even every test protocol.
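To see why the shape of the curve matters so much, here is a minimal sketch of the kind of discovery model usually attributed to Nielsen and Landauer: if every user independently stumbles on any given problem with probability p, the expected share of problems seen after n users is 1 - (1 - p)^n. The function names are invented for illustration, and the value 0.31 is the average commonly quoted from Nielsen's own writing, not a figure from the panel; the lower values of p are purely hypothetical stand-ins for a product whose problems are harder to hit.

```python
import math


def proportion_found(n_users: int, p: float) -> float:
    """Expected share of problems seen at least once after n_users sessions.

    Assumes (hypothetically) that every user hits every problem
    independently with the same probability p.
    """
    return 1.0 - (1.0 - p) ** n_users


def users_needed(target: float, p: float) -> int:
    """Smallest number of users whose expected coverage reaches target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


if __name__ == "__main__":
    # 0.31 is the commonly quoted average; 0.10 and 0.05 are illustrative only.
    for p in (0.31, 0.10, 0.05):
        share = proportion_found(5, p)
        needed = users_needed(0.80, p)
        print(f"p={p:.2f}: 5 users find {share:.0%}; 80% needs {needed} users")
```

Under these assumptions, 5 users reach roughly 84% coverage when p is 0.31, but only about 41% when p is 0.10, where 16 users are needed to pass 80%. The magic number, in other words, is only as good as the assumed p, which is exactly the assumption the panel called into question.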
Not every panelist was on the attack. As Carol Barnum highlighted in her defence of Nielsen-style discount usability, effective testing with small numbers requires clear, well-defined test scenarios based on specific testing objectives and using carefully chosen test subjects. All too often in practice, test scenarios are overly vague and open-ended and test subjects are really just the people who happen to show up at the laboratory door.
The most impassioned advocacy came from Dennis Wixon, who kicked off a lively exchange by arguing that size is not what matters and that identifying the total set of problems is irrelevant. The true goal of testing is not finding defects but fixing them. Wixon championed the RITE approach (Rapid Iterative Testing and Evaluation) currently in favor at Microsoft. The technique uses a short find-fix cycle that echoes the daily-build philosophy pioneered by David Cutler on the Windows NT project. Every day the latest version of the software is tested with another user. Problems are fixed immediately and then the system is re-evaluated with another test the next day. This approach, Wixon claimed, continually answers the question 'Does the system as modified actually work for users?'
Interestingly, neither the panel nor the audience challenged the concept of repeatedly testing a changing system with changing test subjects. If, as the panel's own evidence clearly suggests, each test and subject is in some respects unique and the testing process may not converge even over substantial numbers of subjects using a stable system, the repeated testing of an ever-changing system is more about the illusion of rapid progress than the reality.
Nevertheless, rapid iteration struck a responsive chord with the audience. Speaking from the floor, Robin Jeffries, Distinguished Engineer at Sun Microsystems, drew applause when she advocated that professionals 'test early, test often, and test iteratively'. Repeated with variation – and invariably to enthusiastic audience response – this theme became a veritable mantra for the session.
One major conclusion from the panel and the conference might be that usability testing is so entrenched in the canon of usability practice that no amount of counter-evidence will shake the faith of its true believers. An unbiased reading of the research results would suggest that no amount of testing is enough, a conclusion already well-established a quarter century ago within the software quality movement. The focus of software quality improvement is now on reducing the so-called injection rate, that is, avoiding problems in the first place. There is a limit to how many defects can be uncovered, cataloged, and analysed in a given number of sessions no matter what protocols one follows or how rapidly one iterates. The more problems lurking in the system to be tested, the more hopeless the fate of those who put their faith in testing.
Among the panelists, only Rolf Molich took that next logical step to question the very role of usability testing. He shared a hopeful vision of the future in which robust and disciplined design processes avoided most usability problems from the outset. In his vision, usability testing facilities would gradually fall into disuse and ultimately be abandoned to gather dust.
Amen, brother.
As I listened in on the buzz among attendees leaving the session, it became even clearer that Molich was addressing a small sect that believes in usability by design while Wixon was preaching to the choir. The widespread belief in the power of testing remains safely unshaken by the facts.
Larry Constantine, Constantine & Lockwood Ltd
RESOURCES
Constantine, L. L. "Testing, Testing, One, Two." forUSE: The Electronic Newsletter of Usage-Centered Design, #11. [http://www.foruse.com/newsletter/foruse11.htm]
Medlock, M. C., Wixon D., Terrano, M., Romero R., Fulton B. (2002). "Using the RITE Method to improve products: a definition and a case study." Usability Professionals Association (UPA2002), Orlando, FL, July 2002. [http://www.microsoft.com/usability/Playtest/Publications/Using%20the%20RITE%20Method%20to%20improve%20products.doc;%20a%20definition%20and%20a%20case%20study.doc]
Molich, Rolf, Nigel Bevan, Ian Curson, Scott Butler, Erika Kindlund, Dana Miller & Jurek Kirakowski. (1998). Comparative Evaluation of Usability Tests. Proceedings of the Usability Professionals Association (UPA98) Conference, Washington, DC. [http://www.dialogdesign.dk/tekster/cue1/cue1paper.doc]
Molich, Rolf, Ann Thomsen, Barbara Karyukina, Lars Schmidt, Meghan Ede, Wilma van Oel, & Meeta Arcuri. (1999). Comparative Evaluation of Usability Tests. CHI99 Extended Abstracts 83-84. [http://www.dialogdesign.dk/tekster/cue2/abstract.doc]
Nielsen, J., and Landauer, T. K. 1993. A mathematical model of the finding of usability problems. Proceedings ACM/IFIP INTERCHI'93 Conference (Amsterdam, The Netherlands, April 24-29), 206-213. [See http://www.useit.com/alertbox/20000319.html and http://www.useit.com/papers/heuristic/heuristic_evaluation.html]
Woolrych, A. and Cockton, G., "Why and When Five Test Users Aren't Enough," in Proceedings of IHM-HCI 2001 Conference: Volume 2, eds. J. Vanderdonckt, A. Blandford, and A. Derycke, Cépaduès-Éditions: Toulouse, 105-108, 2001. [http://osiris.sunderland.ac.uk/~cs0gco/fiveusers.doc]