Stories Out Loud

An Investigation into the Grouping of Numbers, and the Implications for the Usability of Speech Recognition Systems

MSc Human Computer Interaction
1994-1995

Abstract

This project was undertaken to discover if the way in which numbers are formatted on paper could be improved to aid readability, and if so, could this lead to the improved usability of an automated telephone service that uses speech recognition.

The first experiment of this project investigated how people group numbers when reading them over the telephone. The findings were used to test usability in a second experiment, where catalogue items were ordered over the telephone. Both experiments used the Wizard of Oz technique to simulate a speech recognition system.

It was found that groups of two and three digits were the most common. The groups were either all the same size, or if equal groups were not possible, groups of the same size were positioned together.

Using this format on paper improved the subjects' attitude toward the telephone service in the second experiment, although there were no significant improvements in the time taken or the number of errors made.

The First Experiment - Introduction

The following experiment aimed to discover how people group numbers when reading them aloud over the telephone. If a common grouping pattern could be found, this may help to improve the readability of numbers presented on paper, and therefore improve the usability of automated telephone services that require such numbers to be read.

From the findings of previous studies, the hypotheses to be tested in this study are:

The most popular group size is about three.

The size of group increases as the length of the number increases.

The observation in previous studies that the actual digits and their order may cause specific groupings has not been tested in this experiment. It was felt that this would make the experimental design too complicated. It would also be unrealistic, for example, to group credit card numbers depending on the actual digits. However, the effect is commented upon in the results of this experiment.

It was felt that the context of the number may affect the way in which the user read the number. For example, a six digit date may always be grouped into three groups of two (e.g. 10 12 95), but a six digit telephone number may always be grouped into two groups of three (e.g. 421 346). To test this, the control context of 'reference number' was used. For each type of number, there was a number of the same length under the title of 'reference number'. The term 'reference number' was felt to be ambiguous enough to refer to all other types of numbers that were not otherwise mentioned, such as invoice numbers and customer numbers.
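The idea that context may fix the grouping can be illustrated with a minimal sketch. The lookup table below is hypothetical and contains only the two six digit examples given above; the names SIX_DIGIT_PATTERNS and group_by_context are assumptions made for illustration and are not part of the experimental materials.

```python
# Hypothetical table: context -> group sizes, taken only from the two
# six digit examples in the text (10 12 95 and 421 346).
SIX_DIGIT_PATTERNS = {
    "date": (2, 2, 2),
    "telephone number": (3, 3),
}


def group_by_context(digits, context):
    """Split a six digit string according to the pattern for its context."""
    pattern = SIX_DIGIT_PATTERNS[context]
    if sum(pattern) != len(digits):
        raise ValueError("pattern does not cover every digit")
    groups, start = [], 0
    for size in pattern:
        groups.append(digits[start:start + size])
        start += size
    return " ".join(groups)


print(group_by_context("101295", "date"))              # 10 12 95
print(group_by_context("421346", "telephone number"))  # 421 346
```

A 'reference number' has no entry in such a table, which is the sense in which it served as the control context.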

All numbers had some form of context as it was felt extremely unlikely that a number would ever be read over the telephone, when using a commercial service, without some form of context attached.

The First Experiment - Conclusions and Discussions

Group Sizes

Looking at the objective data it appears that there is a common pattern in the way subjects read sequences of digits, namely in groups of two and/or three digits. The groups are either all equal in size, or where that is not possible, the groups of the same size are positioned together. The common pattern for four digit numbers is slightly different, namely two groups of two digits or one group of four.
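The pattern described above can be sketched in code. The following is a minimal illustration rather than a procedure used in the experiment: the function names candidate_groupings and format_digits are assumptions, and the sketch simply lists every way of dividing a number into twos and/or threes with groups of the same size kept together, plus the single group of four noted for four digit numbers.

```python
def candidate_groupings(length):
    """Return group-size patterns for a number with `length` digits.

    Only groups of two and three are used, and groups of the same size
    are kept next to each other (e.g. 2-2-3 or 3-2-2, never 2-3-2).
    """
    patterns = []
    for threes in range(length // 3 + 1):
        remainder = length - 3 * threes
        if remainder % 2 != 0:
            continue  # the leftover digits cannot be split into twos
        twos = remainder // 2
        if twos and threes:
            # Mixed sizes: keep equal-sized groups together, in either order.
            patterns.append([2] * twos + [3] * threes)
            patterns.append([3] * threes + [2] * twos)
        elif twos or threes:
            # All groups are the same size.
            patterns.append([2] * twos + [3] * threes)
    if length == 4:
        # Special case reported for four digit numbers: one group of four.
        patterns.append([4])
    return patterns


def format_digits(digits, pattern, separator=" "):
    """Split a digit string into groups whose sizes follow `pattern`."""
    if sum(pattern) != len(digits):
        raise ValueError("pattern does not cover every digit")
    groups, start = [], 0
    for size in pattern:
        groups.append(digits[start:start + size])
        start += size
    return separator.join(groups)


for pattern in candidate_groupings(7):
    print(format_digits("4213465", pattern))
# 42 13 465
# 421 34 65
```

For a six digit number the sketch yields only 2-2-2 and 3-3, the two equal-sized patterns.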

These findings correspond to the visual short-term memory studies of Zhang and Simon (1985) and Ingber (1995), who found that the number of items that could be stored in short-term memory was about three.

They also correspond well to the work of Wickelgren (1967) who found that lists of digits were best remembered if rehearsed in groups of three; and the work of Muller and Schumann (1894) whose rhythm studies used groups of two to improve recall.

It was suggested in Wickelgren's (1967) paper, when looking at the number of errors made by subjects, that group size may increase with number length. This was not supported by the finding of this study as the group size of two was the most popular for all the number lengths.

The findings of this experiment therefore proved the hypothesis that the most popular group size is around three, but the hypothesis that the size of group increases as the length of the number increases could not be proved.

Context

There were some differences due to the context of the numbers. These differences occurred within four, six, eight, nine, eleven and fourteen digit numbers. There were no differences due to context when comparing six, seven, thirteen, fifteen and sixteen digit numbers; however, none were expected.

Four Digit Numbers

The differences between four digit numbers lay in the similarity of four digit times, dates and telephone codes on the one hand, and of four digit pin numbers and reference numbers on the other. Two groups of two were the most popular pattern across all contexts, but were more popular for the time, date and telephone code than for the pin number and reference number.

This could be explained by the fact that four digit dates, such as expiry dates, are normally broken into twos, e.g. 02/95, as are four digit times, e.g. 12:30. It was, however, a surprise that telephone codes would be more popularly grouped into twos as four digit codes have never been grouped this way.

Four digit pin numbers are usually shown as one group of four, and it may be that reference numbers were considered in the same vein by subjects. This would explain why a group of four was more popular in these two contexts.

Six Digit Numbers

The difference could be seen when the similarities between dates and sort codes were compared with the similarities between telephone numbers and reference numbers. Groups of two were more popular for dates and sort codes, whereas groups of three were more popular for telephone numbers and reference numbers.

This could be explained by the fact that dates and sort codes are usually broken into groups of two, e.g. 12/10/95 and 10-20-30 respectively. Telephone numbers and reference numbers are not regularly shown in any particular format.

Eight Digit Numbers

The difference between an eight digit bank account number and an eight digit reference number was the larger number of groups of two and four in the bank account number. This could be explained by the fact that most bank account numbers are eight digits long. Subjects may have anticipated the number of digits and grouped them evenly as a result. There would have been no indication as to the length of the reference number without counting the digits.

A possible experiment for further study to investigate this point could be to have subjects read numbers without being told what sort of numbers they were. The subjects could then be asked to guess the context of the numbers they had just read.

Nine Digit Numbers

The difference between the nine digit bank account number and nine digit reference number was in the different grouping patterns. However, no clear explanation for the difference could be found.

Eleven Digit Numbers

The difference between the eleven digit telephone number and eleven digit reference number was in the size of the first group. Telephone numbers had more grouping patterns beginning with a group of four. This could be explained by the convention of four digit telephone codes at the beginning of a telephone number.

Fourteen Digit Numbers

The difference between the fourteen digit card number and the fourteen digit reference number was that groups of two and four were more popular in the reference number, but one and three in the card number. No particular explanation could be thought of for this.

Unexplained Context Differences

The unexplained differences due to the contexts of nine and fourteen digit numbers may be due to the actual digit sequence in the number being read. This would correspond to the studies of Chapanis and Moulden (1990) who found that some sequences of two and three digit numbers were easier to remember than others. It may be that one or other of the nine or fourteen digit numbers contained more memorable sequences of digits at different positions within the whole number.

This difference due to digit sequence also corresponds to the work of Wickelgren (1967) who suggested that there were item-to-item associations which could improve memory recall.

The Second Experiment - Introduction

The previous study showed that there was a common pattern in the way people group sequences of digits. However, it was not clear as to whether grouping numbers in accordance with this pattern would improve the readability of the numbers.

If the readability is improved, this may cause the subjects to make fewer mistakes, and therefore take less time, when using a speech recognition system that involves reading numbers over the telephone. They may also have a more favourable attitude towards the system if they do not have to concentrate so hard when reading the numbers.

The tasks involved in the previous study were artificial in that the numbers were not read within the context of a meaningful task. This study aimed to find whether there was an effect on usability due to the format of numbers, and used a meaningful task of the sort that may be encountered in real life.

Therefore, the hypothesis to be tested was:

Formatting numbers in accordance with the findings of the previous study affects the usability of a speech recognition system.

The Second Experiment - Conclusions and Discussions

Time and Errors

The task where the numbers were grouped into twos and threes did result in a slightly lower average task completion time, but not significantly so. The small difference may have been due to the consistency of the Wizard. It may have been that the Wizard did not consistently play error messages and was more lenient than a speech recogniser would have been.

It is difficult to be objective when judging whether to play an error message, especially when it is obvious the subject is having trouble finding the information they need to read. This effect could be removed by using a real speech recogniser; unfortunately, one was not available for this experiment.

From the comments made by subjects, and the observations of the wizard, there does appear to have been a learning effect between tasks. It would have therefore been beneficial to offer the subjects a practice task. An example of one of the problems was where two subjects made up a delivery date before realising it was written down for them.

A practice task would have reduced this learning effect, since becoming more familiar with a task is likely to have had an effect on the time taken to complete the task and the number of errors made. The order in which the subjects carried out the tasks was randomised to reduce such an effect, but considering the small number of subjects it may still have been relevant.

It may have been that the task was not complicated enough to show the differences. The task was deliberately kept simple so that errors made would have been because of the nature of the number formats, not because of navigation errors and problems with the actual task. However, a more complicated task could have been used provided the subjects were given practice tasks to familiarise themselves with the procedure.

The number of errors made in the task where the format was groups of two and three was slightly lower than for the other two tasks, but not significantly so. The errors made were categorised to help ascertain what sort of problems the subjects were having.

The total number of errors made was small, so it was difficult to come to any conclusions on the nature of the errors. It could be said, however, that if a subject read the wrong number, this was a navigation problem in finding the number to read, not an effect due to its format.

The most noticeable difference was in the number of times a subject started to read the number again from the beginning. This was once for the groups of two and three, but five times for the other two tasks. This may suggest that the grouping helped subjects to keep their place in the number, so that they were able to continue when they made a mistake and did not have to begin again from the beginning.

The fact that two of the subjects thought the computer did not recognise them as well in the tasks where they had made more mistakes, may suggest that a speech recogniser can appear to be more accurate if the task of reading the numbers is made easier.

Attitudes and Preferences

By formatting the numbers in groups of two and three digits, the subjects' attitude to the speech recognition system was significantly improved.

The improvement in attitude may be because the formatting of twos and threes corresponds to the subject's voluntary grouping of numbers. It can be seen from the proportion of numbers that were grouped according to the way they were formatted on paper, that the subject did not try to override the formatting of twos and threes, but did when the numbers were not formatted in twos and threes.
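The proportion referred to above can be expressed as a simple calculation. The sketch below is an illustration only: the function names are assumptions, the readings are invented examples rather than data from this experiment, and a reading is counted as following the printed format when its group sizes match those on paper.

```python
def group_sizes(text):
    """Return the size of each space-separated group in a number."""
    return [len(group) for group in text.split()]


def followed_format(printed, spoken):
    """True if the spoken grouping has the same group sizes as the printed one."""
    return group_sizes(printed) == group_sizes(spoken)


def proportion_followed(readings):
    """Proportion of (printed, spoken) pairs where the printed format was kept."""
    if not readings:
        return 0.0
    matches = sum(followed_format(printed, spoken) for printed, spoken in readings)
    return matches / len(readings)


# Invented example readings, not experimental data.
readings = [
    ("42 13 46", "42 13 46"),  # printed in twos, read in twos
    ("42134 6", "421 346"),    # printed as -5-1-, regrouped into threes
]
print(proportion_followed(readings))  # 0.5
```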

The results for numbers not formatted in twos and threes were not significantly different to those with no formatting. This implies that bad formatting is not better than no formatting at all. This experiment does not distinguish which of the numbers were formatted particularly badly. It is to be expected that dates and times that are grouped any other way than into twos would be confusing, as the numbers do not correspond to the month, year etc. It is not clear whether the grouping of the other numbers is as confusing.

In the count of the subjects' preferred formats, a format other than twos and threes appeared only for the telephone number and the credit card number. The -5-6- format of the telephone number could be expected to be preferred some of the time as it is the standard way in which a telephone number is presented, although only two people preferred it. This suggests a case where a standard format that people are used to is not the best.

The sixteen digit credit card number was subject to the most criticism. The all twos format was preferred overall, but five people preferred the -5-6-5- format, and eight people said they would have preferred groups of three or four. Interestingly, no-one liked the -6-2- format of the customer number and only three subjects would have changed the all twos format into threes or fours. The catalogue number, which was in threes, was not criticised at all.

This suggests that smaller numbers are more acceptably grouped into twos, but as the length of the number increases larger groups are preferred. This corresponds to Wickelgren's (1967) suggestion that as the number length increases, the group sizes become larger to decrease the number of groups. This is in contrast to the findings of the previous study, where larger groups did not become more popular as the number length increased.

The criticism of the card number corresponds more to the findings of the previous study's question on the preferred size of group, where three was the most popular with four second. Groups of five and six do appear to be too large as twos are preferred in this study, but further study would have to be carried out to find the preferences between groups of two, three and four in relation to the length of the number to be read.

Dates and times are by far most suitably grouped into twos. This is confirmed by the results of the questionnaire in this study, and by the comments made by subjects who said they found it confusing when dates and times were grouped in any other way.

This experiment therefore proves the hypothesis that formatting numbers in accordance with the findings of the previous study affects the usability of a speech recognition system.

Readability may be further improved by considering larger group sizes for longer numbers. Time and error rate improvements may be shown by using a larger group of subjects and more complicated tasks.

Final Conclusion

The first experiment investigated how subjects grouped numbers when reading them over the telephone. It was found that there was a common pattern, namely groups of two and/or three digits. The groups were either equal in size, or where that was not possible, groups of the same size were positioned together.

Overall, the experiment ran smoothly. The lesson to be learnt, however, was that a full pilot study should have been carried out. In this way the procedure and suitability of the materials used could have been tested fully, and the wizard could have had more practice. I found that it was not sufficient to look at each piece of the experiment individually, and that it was very difficult to predict all possible problems.

The second experiment aimed to establish if formatting numbers on paper in accordance with the findings of the first study, would improve the readability of the numbers, and therefore improve the usability of an automated telephone service that required subjects to read numbers.

It was found that a subject's attitude towards an automated telephone service could be significantly improved by formatting the numbers to be read into groups of two and/or three digits. No significant improvement could be shown in the time taken or the number of errors made, although further studies using a larger group of subjects and more complicated tasks may prove more successful.

Again, the lesson to be learnt from this experiment was in respect of having a pilot study. In addition it would have been desirable to have had two experimenters - one playing the wizard and one attending to the subject.

Overall, it was felt that this project was a success and that recommendations can be made regarding the formatting of numbers.

References

Chapanis & Moulden (1990) - Chapanis, A. & Moulden, J. V., Short-Term Memory for Numbers, Human Factors, Vol. 32 pp 123-137, 1990

Ingber (1995) - Ingber, L., Statistical Mechanics of Neocortical Interactions: Constraints on 40 Hz Models of Short-Term Memory, to be published in Physical Review E, 1995, WWW URL http://www.ingber.com/smni95_stm40hz.pdf

Muller & Schumann (1894) - see Woodworth, R. S., Experimental Psychology, Holt, New York, 1938, pp 28-30

Wickelgren (1967) - Wickelgren, W. A., Rehearsal Grouping and Hierarchical Organisations of Serial Position Cues in Short-Term Memory, Quarterly Journal of Experimental Psychology, Vol. 19 pp 97-102, 1967

Zhang & Simon (1985) - Zhang, G. & Simon, H. A., STM Capacity for Chinese Words and Idioms: Chunking and Acoustical Loop Hypothesis, Memory and Cognition, Vol. 13 pp 193-201, 1985