Graphical user interfaces (GUIs) make applications easier to learn and use. At the same time, they make application design, construction, and especially testing more difficult because user-directed dialogs increase the number of potential execution paths. This paper considers a subset of GUI-based application testing: how to exercise an application like a novice user. We discuss different solutions and a specific implementation that uses genetic algorithms to automatically generate user events in an unpredictable yet controlled manner, producing novice-like test scripts.
KEYWORDS
Automated test generation, dialog model specification, genetic algorithms, software engineering test process.
INTRODUCTION
The role of the user interface has become increasingly important to the success of computer applications. Graphical interfaces make complex applications visually attractive and accessible to a wide range of users.
Users can exercise an interactive application in many different ways. In reactive, GUI-based applications, even more options are available because multiple widgets and paths can be active concurrently. This causes problems as an application is tested for failures, especially when it is made available to a large user community. Unsophisticated and novice users often exercise applications in ways that the designer, the developer, and the tester did not anticipate.
Tools and techniques to improve testing for program failures (e.g., capture/playback tools) are becoming more widely available.
As the tools have been deployed, the prime beneficiaries have been experts. An expert user or tester usually follows a predictable path through an application to accomplish a familiar task. A developer knows where to probe to find the 'weak points' in an application.
As a result, applications contain state transitions that work well for predicted usage patterns but become unstable when given to novice users. Novices follow unexpected paths and do things 'no one would ever do.' Such program failures are hard to generate, reproduce, diagnose, and predict. Current methods (e.g., recruiting naive users, beta testing) are manual, costly, and usually occur after development is finished.
This paper presents a technique to move novice-like testing earlier in the overall system test process. We use genetic algorithms as a repeatable technique for generating user events that drive conventional automated test tools. Reprogramming the genetic algorithm reward system can mimic different forms of novice user behavior. Our prototype implementation works at run-time and is independent of application design and development tools.
Figure 1. Test Process Framework
This paper deals with one aspect of the overall test process: how to automate generation of test scripts that exhibit novice user characteristics as part of Design Test. Novice-like testing is often the target of beta programs and can involve many people (e.g., Microsoft's 400,000 beta copies of Windows95). When this type of novice testing is done, it is designed to find program failures rather than to determine usability characteristics.
GUI-based applications present special problems in test design. Older interactive applications often embedded command hierarchies directly in the structure of the program. The user then followed the command tree while navigating from one function to another. The test designer could assume that certain logical conditions held because the user could not deviate from the program-controlled sequence.
Multiple dialog sequences are available concurrently in a GUI-based application. The application becomes reactive and paths more unpredictable. Therefore, a test designer must consciously assure that the program maintains logical relationships through the state changes needed to control:
The general problem of test automation for GUI-based applications has received some attention. The principal effort has been in the area of automated record/playback tools [16]. Such tools accept input at two levels:
Test automation tools work well when given to an expert and poorly when given to a novice because:
Testing can be conducted on a number of different levels to discover application failures and to measure and verify acceptable performance. The most common characterization of testing describes phases where:
Figure 2. Different Paths through an Application
As shown in Figure 2, there are three paths that can be taken to perform a task during system test:
Novice testing is often ignored. At worst, some applications have been released in anticipation that users will find errors and accept fixes in the next version. Beta programs find some failures caused by novices, but beta users are often quite literate in an application domain. Some companies recruit large numbers of naive users to test true novice behavior prior to beta release.
All three approaches to novice testing are costly. Not only do they occur late in the cycle, but a novice's actions are also hard to replicate. Novices wander through convoluted paths that only rigorous keystroke recording can capture. Large numbers of novice keystroke files become difficult to manage and upgrade to the next version.
We chose to automate generation of novice test scripts to address these problems. Automation requires a script environment to record and playback sessions and a method to generate user events in a novice-like manner for those scripts. Our approach assumes the existence of an automated record/playback tool. The automation effort focused on user event generation.
This section contains a brief review and analysis of three techniques designers and developers use to specify user interface dialog and control its state.
A GUI toolkit does not manage state information. Therefore, both logical relationships and concurrency must be managed in programmer maintained code. Application code grows more complex as the number of relationships increases. In addition, the way in which a programmer manages logical relationships and concurrency is different from but interleaved with the algorithmic code that performs a requested function.
Because the code is an integral part of the specification and state information cannot be derived from static code analysis, any automated test generation scheme needs to be able to obtain current user interface and state information from the application code itself.
A user interface management system (UIMS) makes this possible: its dialog specification language captures state information. The language is translated into an informal model that is interpreted to control program execution. Because they contain state information, UIMS specifications are inherently more complete than GUI widget hierarchies and more suitable for use in generating test scripts.
However, few applications are specified with a UIMS. Applying any automated test generation technique to a manually developed application requires reverse engineering to derive the UIMS model. This is a daunting task: a working program often contains 'features' and loopholes that are difficult to capture in a more abstract UIMS model.
In contrast to the informal models in UIMSs, formal models can be validated prior to execution. Theorem proving techniques can be used to determine correctness for text-based formal specifications [5]. Graphical formal models like Petri nets can be analyzed mathematically [17] or via discrete event simulation [7].
Formal models have been used as direct input to the test process, especially for safety critical applications. Two examples are path models built from source code to generate test data values [6] and finite state models to reduce the number of required tests [2]. Little has been done to automate the generation of the scripts themselves.
Petri nets have been used as a type of process model applied to dialog [15, 19]. Palanque and Bastide have documented reasonably complex interfaces based on Petri nets. The nets proved to be an effective specification technique. However, they were manually verified and translated into an executable application. Manual translation of a formal model to an executing program can contain errors and deviations from the specification.
Both user interface management systems and formal process models contain the information necessary to generate test scripts automatically from the specification itself. But neither technique is used widely. As a result, we would have needed to build a reverse engineering tool to recreate either a UIMS or formal model specification from an existing program. This approach would make the automated test generator itself difficult to use and error prone. Formal models also suffer from a lack of tools to translate the abstract specification into an executable form.
Therefore, we chose to work with the application itself as the dialog specification. Moreover, we chose to use the executing application rather than the source code. Source code analysis techniques cannot deduce program and dialog state changes that occur during execution. The state changes allow each new step in the script to send syntactically correct input to an active part of the application.
Given that an executing program specifies the application user interface dialog, we needed to develop methods to:
The prototype uses standard tools and techniques wherever possible. The prototype is application independent and needs no special application dialog design or code structure.
Commercial test tools provide a mechanism to drive a GUI-based application from captured keystrokes. Automation requires a method to generate keystrokes.
Analyzing user behavior led to the conclusion that emulating novice user behavior requires a way to 'remember' success. Both novices and experts use an application to perform tasks. Novices explore to learn the semantics of individual functions and how to combine sequences of functions into meaningful work. Experts have already discovered successful paths through an application and rely on past experience to accomplish new tasks.
Random number generators alone are inadequate because they do not rely on past history to govern future choices. Genetic algorithms do rely on past history. Success has been reported in applying genetic algorithms to hardware test sequence generation [18]. Therefore, we chose to use genetic algorithms to simulate novice user events.
Genetic algorithms [11] can be programmed to simulate a pseudo-natural selection process. In its simplest form, a genetic algorithm manipulates a table (or pool) of random numbers. Each row in the table represents a different gene. The individual components of a gene are called alleles and contain a numeric genetic 'code.' The interpretation of the allele values varies according to application.
Allele values start as random numbers that define an initial 'genetic code'. The genetic algorithm lets genes that contain 'better' alleles survive to compete against new genes in subsequent generations.
During a run through multiple generations to determine the best genes, the gene pool contains the same number of genes that have the same number of alleles. The number of genes in the pool and the number of alleles in a gene can vary from run to run.
A basic genetic algorithm evaluates the fitness of each gene, lets the best-scoring genes survive, recombines surviving genes through crossover, and mutates some allele values to produce the next generation.
Gene crossover styles, mutation rates, and death rates can be programmed and varied. These techniques are useful in determining how many genes survive into a new generation, how existing allele values are swapped among genes, and how new random allele values are inserted into surviving or new genes. Each generation is guaranteed to produce a set of genes that survive. Allowing a genetic algorithm user to vary survival techniques lets the user tune the algorithm to a particular problem.
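The loop described above can be sketched compactly. The following Python fragment is illustrative only (our prototype is written in Modula-3); the fitness function is a placeholder standing in for a real reward system, and the pool size, gene length, and mutation rate are arbitrary example values:

```python
import random

POOL_SIZE = 8        # number of genes in the pool (fixed for a run)
GENE_LEN = 10        # number of alleles per gene (fixed for a run)
MUTATION_RATE = 0.1  # chance that an allele is replaced by a new random value
GENERATIONS = 20

def random_gene():
    # Alleles start as random numbers; their interpretation is
    # application-specific (e.g., mapping values to user events).
    return [random.random() for _ in range(GENE_LEN)]

def fitness(gene):
    # Placeholder reward system: in a real run, a gene is scored by
    # the user events it generates, not by its raw allele values.
    return sum(gene)

def evolve(pool):
    # Let the better-scoring half of the pool survive (the death rate),
    # then fill the pool back up with crossed-over, mutated children.
    pool.sort(key=fitness, reverse=True)
    survivors = pool[: POOL_SIZE // 2]
    children = []
    while len(survivors) + len(children) < POOL_SIZE:
        a, b = random.sample(survivors, 2)
        cut = random.randrange(1, GENE_LEN)   # single-point crossover
        child = a[:cut] + b[cut:]
        for i in range(GENE_LEN):             # occasional mutation
            if random.random() < MUTATION_RATE:
                child[i] = random.random()
        children.append(child)
    return survivors + children

pool = [random_gene() for _ in range(POOL_SIZE)]
for _ in range(GENERATIONS):
    pool = evolve(pool)
best = max(pool, key=fitness)
```

Varying the crossover point, mutation rate, and survivor fraction is how a user tunes the algorithm to a particular problem, as noted above.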
Genetic algorithms are not an effective way to explore all possible paths in a dialog sequence. Instead, a genetic algorithm uses previous sets of random numbers as the basis for new ones. This means that the method used to compute the 'best results' becomes the key factor in any application that uses genetic algorithms. As applied to user interface event generation, 'best results' meant designing an algorithm that represents how novice users learn to use an application.
At run-time, different GUIs require different implementations to capture the current state of the user interface as the Application Under Test executes. Our prototype, whose architecture is shown in Figure 3, works with applications built with Motif 1.2 and X11R5. All software components are written as objects in Modula-3 [3].
Figure 3. Run-time Architecture
The test script generator (XTest) can be applied to different Applications Under Test (AUT). During execution,
XProbe, TestDriver, and TestPort use standard Motif and X communications with a protocol based on the editres protocol. This approach proved superior to other UNIX techniques for sharing information across process spaces like shared memory, rpc servers, pipes, and shared files.
XTest controls the AUT through the standard XSendEvent mechanism. While this technique works, XSendEvent operates strictly at the keystroke level. Therefore, XTracer generates test scripts at the keystroke level. A reasonable extension is to implement a reverse protocol to pass widget-level information to the application under test and to generate a more readable test script based on widget names.
The editres protocol asks the AUT to put some of its process-specific information into a form for use by TestDriver. The prototype requires only one slight modification to the AUT source code. Two procedure calls must be inserted to establish the TestPort callback and to provide the ID of its top level window. These calls are executed only once during AUT initialization, and all other processing happens transparently. A preferable approach would require no source code change.
Figure 4. Observe and Control Loop
After all the genes in the pool are tried, the genetic algorithm in GenePool (refer back to Figure 3) lets the winning genes survive, generates new genes, and mutates the pool before proceeding to the next generation. The script that corresponds to the top scoring gene in the last generation is output via XTracer.
We based our prototype reward system on the observation that a novice user often learns how to use an application via controlled exploration. A novice starts one function in a dialog sequence and experiments with a number of different parameters. In this way, the novice uses localized parameter settings to understand the overall effect of a single function. This is only one of the possible characterizations of novice user behavior.
To implement this reward system, we set the weight for all user events to zero except one. A gene receives a positive score each time its allele value(s) generate input for a widget (e.g., entering data into a text field, choosing an item from a list, selecting a radio button) that has the same window name as the last active window name. No additional score is generated to differentiate among the possible types of widgets on the window. The net result is that the longer a gene stays on the same window, the higher its score and the better its odds of survival.
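A minimal sketch of this reward system in Python (illustrative only; the prototype is Modula-3, and the traces below are hypothetical):

```python
def score_gene(event_trace):
    """Score one gene's run: +1 each time an event lands on a widget
    in the same window as the previous event.  All other user events
    carry zero weight, so genes that linger on one window score highest."""
    score = 0
    last_window = None
    for window, widget in event_trace:
        if window == last_window:
            score += 1
        last_window = window
    return score

# A trace that stays on one dialog outscores one that hops between windows.
stayer = [("Table Dialog", "text"), ("Table Dialog", "text"),
          ("Table Dialog", "OK")]
hopper = [("Table Dialog", "text"), ("Open Dialog", "OK"),
          ("Root", "open")]
```

Here `score_gene(stayer)` yields 2 while `score_gene(hopper)` yields 0, so the gene that experiments within a single dialog survives.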
Our interface strategy lets the tester control when deviations occur: a DEVIATE command can be inserted at arbitrary script locations. The script can then continue in either of the two modes shown in Figure 5. Pullback mode rewards genes for returning to the original script, while meander mode allows the activity to wander indefinitely. Even though pullback mode returns to the expert script, it will generally not generate the same results because additional functions are exercised.
Figure 5. Deviation Modes
(# expert script, in the form of a list of window, widget pairs with
pullback #)
("SGE: On Version Dialog" "Cancel")
("Style Guide Example - <New File>" "file")
("Root" "open")
("SGE: Open Dialog" "sb_text" "solardat")
("SGE: Open Dialog" "OK")
Deviate
("Root" "dataTable")
("SGE: Table Dialog" "text" 4 "30")
("SGE: Table Dialog" "text" 5 "40")
("SGE: Table Dialog" "OK")
The implementation of meander mode is simple: execute the expert script and turn control over to the genetic algorithm when the DEVIATE command is encountered. The reward system then identifies genes that stay on the same window.
Pullback mode relies on the ability to look ahead to the next command in the expert script when processing a DEVIATE command. To implement pullback mode, the reward system additionally scores genes for generating events that return the dialog to the window named in that next command, so the script can resume from the expert script's expected state.
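One plausible shape for pullback scoring is sketched below in Python (illustrative only; the prototype is Modula-3, and the bonus weight and traces are hypothetical, not the prototype's actual values):

```python
def pullback_score(event_trace, target_window):
    """Hypothetical pullback reward: staying on one window still scores,
    and a bonus is paid when the trace reaches the window named by the
    next command in the expert script, pulling the dialog back on course."""
    score = 0
    last_window = None
    for window, widget in event_trace:
        if window == last_window:
            score += 1            # same-window reward, as in meander mode
        if window == target_window:
            score += 5            # illustrative pullback bonus weight
        last_window = window
    return score

# After DEVIATE, suppose the expert script's next command is on the
# Table Dialog; genes that find their way back there outscore genes
# that wander off toward other windows.
back = [("Root", "dataTable"), ("Table Dialog", "text"),
        ("Table Dialog", "OK")]
away = [("Root", "open"), ("Open Dialog", "sb_text")]
```

With these weights, the `back` trace scores 11 (two pullback bonuses plus one same-window point) while `away` scores 0, so genes that return to the expert script dominate the pool.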
We first tried our reward system to see if the genes could learn to simulate novice-like behavior in a standalone manner. The test script was a single DEVIATE command. We varied the genetic algorithm parameters to let the algorithm itself generate novice-like events. At best, the resulting scripts seemed more chimpanzee-like than novice-like. Getting a script to accomplish anything meaningful was unlikely. This occurred because we could not provide any application semantic knowledge based on GUI widgets alone. A higher level specification (as found in a UIMS or process modeler) is needed to insure that a particular run can even open a file.
We then used meander mode and inserted a DEVIATE command at the end of an existing expert script. In this way, we were able to open files and do other activities before turning control over to the genetic algorithm. The results were better but could be attributed more to starting with something already done than to the genetic algorithm.
Pullback mode produced the best results. This occurred because pullback mode forces the script back to a state in which the script performs some meaningful activity. We were able to insert more than one DEVIATE command in a script to insure that the application could continue to operate through multiple encounters with unpredictable user events.
We have been able to evaluate the 'novice-ness' of the resulting scripts only on an informal basis. We asked other group members to observe a set of automatically generated scripts and collected their feedback. The scripts used with pullback mode were judged to be the best representation of their understanding of novice user behavior.
Using automated script generation decreases the total number of scripts that need to be saved and modified as application versions change. Only the parameters that govern genetic algorithm execution are saved. Therefore, new novice scripts can be regenerated for a new user interface and application. The regeneration process does not guarantee identical scripts because the application dialog state changes when the user interface changes.
Test script configuration management. Our approach can be used to generate a large number of novice test scripts quickly. Measures must be developed to determine when enough novice testing has occurred.
Test results evaluation and comparison. The results of each novice test script must be evaluated to insure that the system has worked properly. At a minimum, the novice tests can be used to insure that the application does not break. The applications we tested with the prototype did not fail, which increased our confidence in them. We used manual observation to determine that the applications continued to work properly.
More work is required to develop effective ways for determining that a script produces the proper results. A simple comparison of the results produced by a companion script is inadequate even in pullback mode. The deviation may have deleted or edited data in a way that makes direct results comparison impossible.
Emulation and evaluation of more types of novice user behavior. Additional genetic scoring algorithms and reward systems will expand the repertoire of characterizations of novice user behavior beyond our learning-by-experimentation style.
Care must be taken to formally evaluate the results of any automated techniques to insure their value and validity. The context for such comparisons is well documented (for a good example, see [8]). Usability testing facilities offer a solid technology foundation for conducting real versus automated novice evaluations.
Integration with automated test tools. Given the current implementation of the novice test script generator, the use of genetic algorithms could be incorporated as a new command in any existing widget-based test tool.
Higher level user interface specifications. In the long term, higher level dialog models than a GUI widget tree should be used to generate both application user interface code and test scripts. Automatic UI code generation will decrease user interface state management errors (although application code errors will still occur). Such specifications should be analyzable for usability for experts or novices. Testing will still be needed across multiple user skill levels to determine if program failures occur whether the application is driven by real or simulated events.
Our technique works best as a companion to automated test tools and expert test scripts. Expert users still must generate complex scripts that exercise an application thoroughly. Genetic algorithms provide a controllable method of emulating novice input events to test an application in an unexpected, but not purely random, way. Including automated novice testing early in the development process should improve overall application quality.
Rob Jasper and Dan Murphy of the Boeing Commercial Airplane Group provided insight into the issues involved with testing and genetic algorithms. Keith Butler of Boeing Information and Support Services helped mold early drafts. The SIGCHI referees provided excellent comments as part of their reviews.
Toward Automatic Generation of Novice User Test Scripts
David J. Kasik and Harry G. George
Boeing Commercial Airplane Group
P.O. Box 3707, Mail Stop 6H-WT
Seattle WA 98124 USA
+1 206 234 0575
kasik@ata.ca.boeing.com