Gary Perlman, University of California, San Diego
Note: This paper appeared as:
Perlman, G. (1983)
Data Analysis in the UNIX Environment.
pp. 130-138
in
K. W. Heiner, R. S. Sacher, & J. W. Wilkinson (Eds.)
Computer Science and Statistics: Proceedings of the 14th
Symposium on the Interface,
July 5-7, 1982,
Springer-Verlag.
The descriptions of the programs are of little more than historical curiosity because there are now many more programs, best described at the |STAT home page: www.acm.org/~perlman/stat/. However, this paper has the best description of the ideas behind the |STAT file formats and the principles of automatic interfaces for implicit design specification.
In this article, I will discuss how the UNIX (TM Bell Telephone Laboratories) operating system (Ritchie & Thompson, 1974) supports data analysis, and how some of my data analysis programs (Perlman, 1980) developed under the influence of the UNIX environment. I begin with an overview of UNIX, and how its support of process control affects data analysis. Then I describe some programs developed for data analysis on the UNIX operating system. Finally, I discuss how many aspects of data analysis, particularly the specification of experimental designs, can be automated.
UNIX is an interactive operating system and as such is ideal for data analysis. Users generally sit in front of a terminal and repeatedly specify a program, its input, and to where its output should be directed. Users can immediately observe intermediate results of programs, and can make decisions based on them. One common decision is to use another program on such results.
It is possible to combine simple UNIX programs into more complicated ones. The implementors of UNIX made it possible to direct the output of one program into the input of another program. This linking of outputs to inputs is called "pipelining." Programs in a pipeline are connected with the pipe symbol, the vertical bar. For example, several programs can be joined together with a command like:
IO DATA | PROGRAM1 | PROGRAM2 | PROGRAM3

Here, a series of transformations of the file DATA is initiated by an input/output program called IO. (I will use the convention of capitalizing the names of programs and files, though in actual use, they would probably be lower case.) The output from IO is used as input to PROGRAM1 whose output is used as input (piped) to PROGRAM2 whose output is piped finally to PROGRAM3. In effect, PROGRAM1, PROGRAM2, and PROGRAM3 are combined to make a more complex program.
The ability to pipeline programs has led to a philosophy of modular program design unique to and ubiquitous in UNIX: to build program tools that each accomplish one task well and produce outputs convenient for input to other programs. Convenience here means output stripped of unnecessary labeling information, and easily parsed by other programs. In UNIX, there are dozens of programs with general uses, such as ones to search through or sort files, count lines, words, and characters in files, and so on. The philosophy is to have programs that can be used in a variety of contexts. This philosophy has been followed in the design of data analysis programs I will describe later, especially those for applying pre-analysis transformations.
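The philosophy pays off in practice. For example, counting how many lines of a data file mention a given string needs no special program; two standard UNIX tools combine in a pipeline (the file name DATA and the string "hard" here are hypothetical):

grep hard DATA | wc -l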
Doing data analysis in a general programming environment overcomes some problems associated with many statistical packages. In a general environment like UNIX, users have easy access to all the programs that come with the system, as well as those locally developed. That means people doing data analysis in UNIX can use system editors, printing programs, and perhaps most importantly, they can control data analysis with the general command language used to control all program execution in UNIX. The UNIX command line interpreter is called the shell and it simplifies file input and output. It also provides the pipelining facilities mentioned earlier. The UNIX shell is a high level programming language in which users can write shell "scripts" that allow program control with primitives for condition testing and iteration. In short, doing data analysis in a general programming environment provides many resources lacking or weak in statistical packages.
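As a small illustration of that control, the following is a minimal shell script sketch (the file names and output naming convention are hypothetical) that runs the same analysis program over several data files, saving a summary of each:

# run the desc program on each data file, saving each summary
for f in expt1 expt2 expt3
do
    desc < $f > $f.out
done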
While UNIX allows easy access to a wide variety of programs peripherally related to data analysis, such is not the case with most statistical packages. Their individual programs usually have many general utilities, such as data transformation, editing, and printing subroutines, built in, making them larger than necessary, and hence less portable to small computers. This is necessary for statistical packages expected to run on a variety of operating systems where it is not clear what utilities are available. It is also a remnant of the 1960s batch processing philosophy under which the major statistical packages were first developed.
Under the UNIX philosophy of modular design, data are transformed by specialized programs before being input to specific data analysis programs. The result is that data transforming functions are independent of analysis programs and are applicable in a wider variety of contexts.
Recall that the UNIX design philosophy of modularity is in support of pipelines of simple uni-purpose tools. An application of this philosophy to data analysis is to first transform data and pipe the transformed data to specific analysis programs. The following represents an example of just such a use.
IO DATA | TRANSFORM | ANALYZE

Writing programs to be used in pipelines requires that their outputs be uncluttered and easily parsed by programs downstream. This requirement is satisfied by having all programs process data as human-readable character files with fields separated by white space (blanks or tabs). While such free format slows processing, the slowdown is negligible compared to the time required for complex analyses. Free format also costs storage space, but space limitations do not affect the majority of users, and as the cost of storage goes down, this factor is even less of a problem. The requirement of putting data in plain files, rather than binary code or even the traditional fixed-column FORTRAN formats, makes the files easier for people to read, and increases their acceptability to all UNIX programs, such as system-wide editors and printing programs.
Constructing pipelines of programs efficiently requires that much of the user's task be automated. If a special control language file were needed for each program in a pipeline, constructing analysis commands would be prohibitively time consuming, and error prone. It would also obliterate the advantage UNIX offers as an interactive system. In many cases, there is no need for a control language when programs are made intelligent enough to determine mundane properties of their inputs automatically. Delimiting fields with white space allows programs to count the number of fields per line, and UNIX file handling makes it easy to tell when the end of a file has been reached, which makes it easy to count lines as well as columns.
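For instance, because fields are delimited by white space, a standard UNIX tool can report the number of fields on every line without any control language. The following sketch, using the standard awk and sort programs on a hypothetical file DATA, prints the distinct field counts found in the file, so a consistent file prints a single number:

awk '{ print NF }' DATA | sort -u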
The programs have been designed to be easy to use, and small enough to fit on minicomputers such as DEC PDP-11/45s, 11/34s, and 11/23s. I will first describe the format prescribed for the programs, and how people use the programs in the UNIX environment. The programs are of several different types:
Data Transformation. These programs are useful for changing the format of data files, for transforming data, and for filtering unwanted data. In addition, one program is useful for monitoring the progress of the data transformations.
Data Validation. These include programs for checking the number of lines and columns in data files, their types (e.g., alphanumeric, integer), and their ranges.
Descriptive Statistics. These procedures include both numerical statistics, and simple graphical displays. There are procedures for single distributions, paired data, and multivariate cases.
Inferential Statistics. These include multivariate linear regression and analysis of variance. Some simple inferential statistics are also incorporated into the descriptive statistics programs, but are used less often.
Pre-Processing Programs |
---|---
ABUT | abut files
DM | conditional transformations of data
IO | control/monitor file input/output
TRANSPOSE | transpose matrix-type file
VALIDATA | verify data file consistency

Analysis Programs |
---|---
ANOVA | anova with repeated measures
BIPLOT | bivariate plotting + summary statistics
CORR | linear correlation + summary statistics
DESC | statistics, frequency tables, histograms
PAIR | bivariate statistics + scatterplots
REGRESS | multivariate linear regression
The command:

PROGRAM < FILE

indicates to UNIX (really it indicates to the shell, which controls input and output) that the input to the program named PROGRAM should be read from the file named FILE rather than the terminal keyboard. Analogously, the output from a program can be saved in a file by redirecting it to a file with the ">" symbol. Thus,
PROGRAM < INPUT > OUTPUT

indicates that the program PROGRAM should read its input from the file INPUT and put its output into a new file called OUTPUT. If the file INPUT does not exist, an error message will be printed. If the file OUTPUT exists, then whatever was in that file before will be destroyed. A mistake novices and experts alike should avoid is a command like:
PROGRAM < DATA > DATA

which might be used to replace the contents of DATA with whatever PROGRAM does to it. The effect of such a command is to destroy the contents of DATA before PROGRAM has a chance to read it. For a safer method of input and output, see the later discussion of the IO program.
The output from one program can be made the input to another program without the need for temporary files. This action is called "pipelining" or "piping" and is used to create one complex function from a series of simple ones. The vertical bar, or "pipe" symbol, | , is placed between programs to pass the output from one into the other. Thus,
PROGRAM1 < INPUT | PROGRAM2

tells UNIX to run the program PROGRAM1 on the file INPUT and feed the output to PROGRAM2. In this case, the final output would be printed on the terminal screen because the final output from PROGRAM2 is not redirected. Redirection could be accomplished with a command line like:
PROGRAM1 < INPUT | PROGRAM2 > OUTPUT

In general, only one input redirection is allowed, and only one output redirection is allowed, and the latter must follow the former. Any number of programs can be joined with piping.
In general, UNIX programs do not know if their input is coming from a terminal keyboard or from a file or pipeline. Nor do they generally know where their output is destined. One of the features of UNIX is that the output from one program can be used as input to another via a pipeline making it possible to make complex programs from simple ones without touching their program code. Pipelining makes it desirable to keep the outputs of programs clean of annotations so that they can be read by other programs. This has the unfortunate result that the outputs of many UNIX programs are cryptic and have to be read with a legend. The advantages of pipelining will be made clear in the examples of later sections.
Data are most easily manipulated and analyzed if they are in what I will call the master data file format. The key ideas of the format of the master data file are simplicity and self documentation, and are derived from relational databases. The reason for this is to make transformation of data easy, and to be able to use a wide variety of programs to operate on a master data file. Each line of a master data file has the same number of alphanumeric fields. For readability, the fields can be separated by any amount of white space (blank spaces or tabs), and, in general, blank lines are ignored. Each line of a master data file corresponds to the data collected on one trial or series of trials of an experiment. Along with the data, a set of fields describes the conditions under which those data were obtained. Usually, a master data file contains all the data for an experiment. However, in many cases, a user would not want all the data from this file to be input to a program. Some parts may be of particular interest for a specific statistical test, or some data may need to be transformed before input to a data analysis program. In a later section, I will expand on the notion of a master data file, and show how it can be used to convey experimental design information implicitly.
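A small hypothetical example may make the format concrete. Each line below records one trial: a subject name, the condition under which the trial was run, and the score obtained.

fred easy 12
fred hard 17
ethel easy 10
ethel hard 15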
The strategy I will use here is to describe the programs and give examples of how they are used. This is meant only to make the ideas of analysis with these programs familiar and should not be used as a substitute for the manual entries on individual programs. Only a few of the capabilities of the programs are described.
A typical analysis therefore has the form:

TRANSFORM < DATA | ANALYZE > OUTPUT

where TRANSFORM is some program to transform data from an input file DATA, and the output from TRANSFORM is piped to an analysis program, ANALYZE, whose output is directed to an output file, OUTPUT.
ABUT joins corresponding lines of the files named in its command line, in effect placing the files beside each other. The command:

ABUT FILE1 FILE2 FILE3 > FILE123

would create FILE123 with its first line the first lines of FILE1, FILE2, and FILE3, in order. Successive lines of FILE123 would have the data from the corresponding lines of the named files joined together.
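For instance, if FILE1 contained the two lines "a b" and "c d", and FILE2 contained the lines "1" and "2" (hypothetical contents), then ABUT FILE1 FILE2 would produce:

a b 1
c d 2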
Catenate Files. IO has a similar function to ABUT. Instead of, in effect, placing the files named beside each other, IO places them one after another. A user may want to analyze the data from several files, and this can be accomplished with a command like:
IO FILE1 FILE2 FILE3 | PROGRAM

Monitoring Input and Output. IO can also be used to monitor the flow of data between programs. When called with the -m flag, it acts as a meter of input and output flow, printing the percentage of its input that has been processed. (In UNIX, it is common for options to be specified to programs by preceding letters with a dash to distinguish them from file names.) The program can also be used in the middle and at the end of a pipeline to monitor the absolute flow of data, printing a special character for every block of data processed.
Input and Output Control. Finally, it can be used as a safe form of controlling i/o to files, creating temporary files and copying rather than automatically overwriting output files. For example, IO can be used to sort a file onto itself with one pipeline using the standard UNIX SORT program:
IO -m FILE | SORT | IO -m FILE

The above command would replace FILE with a sorted version of itself. Because the monitor (-m) flag is used, the user would see an output like:
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% ==========

The percentages show the flow from FILE into the SORT program (percentages are possible because IO knows the length of FILE), and the equal signs indicate a flow of about a thousand bytes each coming out of the SORT program. The command:
SORT FILE > FILE

would destroy the contents of FILE before SORT had a chance to read it. The command with IO is both safer and acts as a meter showing input and output progress.
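Without IO, the traditional safe method is a two-step command using a temporary file (the standard UNIX sort and mv programs are shown here in lower case), which offers no progress metering:

sort FILE > /tmp/sorted
mv /tmp/sorted FILE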
Saving Intermediate Transformations. In a command like:
IO FILE1 FILE2 | PGM1 | PGM2 | IO OUT

the intermediate results before PGM1 and before PGM2 are lost. IO can be used to save them by diverting a copy of its input to a file before continuing a pipeline:
IO FILE1 FILE2 | PGM1 | IO SAVE | PGM2 | IO OUT

Any of these calls to IO could be made metering versions by using the -m option flag.
TRANSPOSE interchanges the rows and columns of its input. If a file called FILE contains the matrix:

1  2  3  4
5  6  7  8
9 10 11 12

then the command:
TRANSPOSE FILE

will print:
1 5  9
2 6 10
3 7 11
4 8 12

TRANSPOSE is useful as a pre/post-processor for the data manipulator, DM, as it reads from the standard input stream when no files are supplied.
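For example, because DM operates on columns, a computation over the rows of a file can be had by transposing first and, if desired, transposing back. The following hypothetical pipeline adds the first two rows of FILE element by element and prints the sums as a single row:

TRANSPOSE FILE | DM x1+x2 | TRANSPOSE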
DM allows users to access the columns of each line of its input. Numerical values of fields on a line can be accessed by the letter "x" followed by the column number. Character strings can be accessed by the letter "s" followed by the column number. Consider for example the following contents of the file EX1:
12 45.2 red ***
10 42 blue ---
8 39 green ---
6 22 orange ***

The first line of EX1 has four columns, or fields. In this line, x1 is the number 12, and s1 is the string "12". DM distinguishes between numbers and strings (the latter are enclosed in quotes) and only numbers can be involved in algebraic expressions.
Column extraction. Simple column extraction can be accomplished by listing the string variables for the columns desired. To print, in order, the second, third, and first columns from the file EX1, one would use the call to DM:
DM s2 s3 s1 < EX1

This would have the effect of reordering the columns and removing column 4.
Algebraic Expressions. DM can be used to produce algebraic combinations of columns. For example, the following call to DM will print the first column, the sum of the first two columns, and the square root of the second column.
DM x1 x1+x2 "sqrt(x2)" < EX1

Note that the parentheses in the third expression require quotes around the whole expression. This is because parentheses are special characters in the shell. If a string in either of the columns were not a number, DM would print an error message and stop.
Conditional Operations. DM can be used to filter out lines that are not wanted. A simple example would be to print only those lines with stars in them.
DM "if * C INPUT then INPUT else NEXT" < EX1The above call to DM has one expression (in quotes to overcome problems with special characters and spaces inserted for readability). The conditional has the syntax "if-then-else." between the "if" and "then" is a condition that is tested, in this case, testing if the one-character string * is in INPUT, a special string holding the input line. If the condition is true, the expression between the "then" and "else" parts is printed, in this example, the input line, INPUT. If the condition is not true, then the expression after the "else" part is printed, in this case, NEXT is a special control variable that is not printed and causes the next line to be read.
DM "if !(x3>0 & x3<100) then INPUT else NEXT"If non-numerical data appeared in column three, DM would report an error.
Summary Statistics. DESC prints a variety of statistics, including order statistics. Under option, DESC will print a t-test for any specified null mean.
Frequency Tables. DESC can print frequency tables, or tables of proportions, with cumulative entries if requested. These tables will be formed based on the data so they are a reasonable size, or they can be formatted by the user who can specify the minimum value of the first interval, and the interval width. For example, the following command would print a table of cumulative frequencies and proportions (cfp) in a table with a minimum interval value of zero (m0) and an interval width of ten (i10).
DESC i10 m0 cfp < DATA
Histograms. If requested, DESC will print a histogram with the same format as would be obtained with options used to control frequency tables. For example, the following command would print a histogram of its input by choosing an appropriate interval width for bins.
DESC h < DATA

The format of the histogram, as well as tables, can be controlled by setting options. The following line sets the minimum of the first bin to zero, and the interval width to ten, an appropriate histogram for grading exams.
DESC h i10 m0 < GRADES
Summary Statistics. From PAIR's input, minimums, maximums, means, and standard deviations are printed for both columns as well as for their difference. Also printed are the correlation of the two columns and the regression equation relating them. The simplest use of PAIR is with no arguments. To analyze a data file of lines of X-Y pairs, the following command will in most cases be satisfactory:
PAIR < DATA

Often the paired data to be analyzed are in two files, each variable occupying a single column. These can be joined with ABUT and input to PAIR via a pipe:
ABUT VAR1 VAR2 | PAIR

Or perhaps the two variables occupy two columns in a master data file. If the variables of interest are in columns four and six, the following command would produce the paired data analysis:
IO DATA | DM s4 s6 | PAIR

Scatterplots. With the "p" or "b" options, a scatterplot of the two variables can be printed. Alternatively, PAIR has an alias, called BIPLOT, which lets the user get a bivariate plot of a data file of X-Y pairs:
BIPLOT < DATA
Suppose you had a design in which you presented problems to subjects. These problems varied in difficulty (easy/hard), and in length (short/medium/long). The dependent measure is time to solve the problem, with a time limit of five minutes. Your data file would have lines like this:
fred easy medium 5
ethel hard long 2

In the first column is a string that identifies the level of the random factor (here, subject name), followed by strings indicating the levels of the independent factors, followed by the dependent measure (here, solution time). A data file holding lines like those above would be analyzed with:
ANOVA subject difficulty length time < DATA

Individual factors can be ignored by excluding their columns from the analysis:
DM s1 s2 s4 < DATA | ANOVA subject difficulty time

Similarly, different factors can be used as the random factor. This is common in psycholinguistic experiments in which both subjects and items are random variables.
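For example, if each line of DATA also carried an item name in a fifth column (a hypothetical layout), an items analysis could be run by moving that column to the front, where the random factor is expected, and dropping the subject column:

DM s5 s2 s3 s4 < DATA | ANOVA item difficulty length time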
REGRESS, like ANOVA, is called with a name for each column of its input. A file of five columns might be analyzed with:

REGRESS bp age height risk salary < DATA

If only a few columns were of interest, they could be extracted with DM:
DM x1 x3 x5 < DATA | REGRESS bp height salary

PAIR: Paired Data Comparison. PAIR can be used to compare two distributions of paired data. Often, the two columns of interest are pulled out of a master data file with DM. The following command takes columns 5 and 3 from DATA and inputs them to PAIR:
DM x5 x3 < DATA | PAIR

For its two-column input, PAIR will print a t-test on the differences of the two columns, which is equivalent to a paired t-test. PAIR will also print a regression equation relating the two, along with a significance test on their correlation, which is equivalent to testing the slope against zero.
Most statistical programs offer similar capabilities, but they differ in how easily they are used. One purpose of this paper is to introduce a system that is both easy to use and powerful in the class of designs it can specify. My goal is to motivate the implementation of more easily used statistical programs by presenting a convenient interface between users and programs.
The MDF (master data file) format provides useful terms in which people can think about their data. Each line corresponds to a trial on which data were collected. Each column corresponds to a variable, such as an independent or dependent variable.
No Grouping Factors. In the simplest case, an MDF holds one line per observation, with an index identifying the source of the data and the datum itself:

fred 68
judy 62
bob 67
jane 67

One Grouping Factor. To compare data from two groups, there must be some way of distinguishing them. This is done by using strings to indicate levels of factors in a column, as in the following example where each person's sex is included as an index to distinguish groups.
fred male 68
judy female 62
bob male 67
jane female 67

Factorial Designs. It is a simple generalization to higher-order designs. For each factor, a column is added holding levels for that factor. In the following example, there are two factors: sex, with {male, female} as levels, and difficulty, with {easy, hard} as levels.
bob male easy 56
bob male hard 67
jane female easy 63
jane female hard 70

Different Types of Factors. All types of designs can be coded in the same system, and design information is implicit in the relation of the columns holding levels of independent variables to that of the random factor. By assuming that the indexes for the random factor fall in the first column of an MDF, the type of a factor can be inferred to be between groups, within groups, or nested within some other factor. In the previous example, the difficulty factor can be inferred to be a within-groups factor because each level of the random factor is paired with all levels of that factor. Of course, the sex factor is between subjects, and a computer program can infer this because each level of the random variable is paired with exactly one level of the sex factor.
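The inference itself is mechanical. The following sketch, written in standard awk rather than with the programs described here, classifies the factor in column two relative to the random factor in column one; it is simplified in that it does not verify that a within-groups factor is fully crossed:

awk '
    # count the distinct column-2 levels paired with each unit in column 1
    !seen[$1, $2]++ { nlevels[$1]++ }
    END {
        for (u in nlevels)
            if (nlevels[u] > 1) { print "within groups"; exit }
        print "between groups"
    }
' DATA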
Nested factors can also be easily coded. In the following example, gymnastic event, indexed in column three, is nested in gender.
fred male rings 9.4
bob male horse 8.2
judy female vault 8.8
jane female bar 9.2

Unequal Group Sizes. To code unequal group sizes on between-groups factors, all that is needed is more data for one group than for another. In the following example, the male group has three members compared to two for the female group.
fred male 69
bill male 62
bob male 78
jane female 65
judy female 74

Replications. Replications are coded by having more than one line with the same leading indexes. This is shown in the next example. One person has three replications while another has two.
john male 5
john male 3
john male 8
jane female 10
jane female 12

Missing Data. Many statistical programs require fixed-column input formats that treat blank fields as zeros, and so must handle missing data with special values that the programs recode as missing. This is clumsy and can lead to errors. Missing values are a special case of multiple observations for which zero replications are collected: in the MDF format, a value is missing if there is no line in the MDF containing it.
Categorical Data. Even though the methods of analysis of numerical data and categorical data differ greatly, the sorts of designs from which they are obtained are often similar. It is not uncommon to collect both numerical and categorical data during one experiment. Categorical data are coded in the same way as numerical data, except that categorical dependent measures can be coded by arbitrary strings. These strings, like those specifying levels of variables, can be used in printouts to make outputs more easily interpreted. For example, the following data might have been obtained from a questionnaire.
john question-1 agree
john question-2 disagree
jane question-1 agree
jane question-2 disagree
Information about the variables in an MDF can be supplied in a series of column tags at the top of the file, one line per attribute; the attributes are described below, and an example appears in the table that follows them.

Name. Analysis programs with access to meaningful variable names can use them in messages and outputs so that users can more easily interpret results. Adding names promotes a master data file to a full relational database, making possible powerful database operations. If unspecified, default names can be chosen based on column number and type.
Type. A variable can be of type unit (from which individual scores are collected), independent, dependent, or covariate. Intelligent programs with type information can determine the desired analysis by the relationships between pairs of columns in an MDF. This allows automated design interpretation and removes the earlier restriction that columns be ordered.
Selection. The levels of unit and independent variables are either selected according to some fixed criterion or by random sampling. Programs with information about how levels of these variables are selected can choose the correct error terms for significance tests. Most often, a unit variable is random while all independent variables are fixed, and both are useful default assumptions.
Scale. How data are analyzed depends on the scale of measurement of all variables in a design. Variables can have levels defined by name alone (nominal), or there may be an inherent ordering of levels (ordinal), or differences between levels may be meaningful (interval), or the ratio of levels may be meaningful (ratio). Depending on the scale of measurement, data may be analyzed with cross-tabulations, order statistics, or parametric methods. If unspecified, it is reasonable to assume unit and independent variables are nominal while dependent measures are at least on an interval scale if their range is numerical.
Range. The range of allowable values of a variable is useful for checking the validity of inputs. Ranges can be specified by individual values or by implied subranges in the case of variables at least on an ordinal scale.
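A range tag invites automatic checking of the sort VALIDATA performs. As an illustrative sketch only (standard awk, not part of the programs described here), the following flags every line whose fourth column falls outside an allowable range of 1 to 10:

awk '$4 < 1 || $4 > 10 { print "line " NR ": out of range: " $0 }' DATA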
Covariates. Similarly, covariates are included as columns in the MDF, and are tagged with type covariate. An analysis of covariance program would then know to first remove any variability on the data attributable to the covariate before performing analyses for dependent measures. Variables labeled covariate can also be coded as independent variables measured on interval or ratio scales.
Other Than One Random Factor. Experimental designs with any number of random factors can be coded by tagging the columns corresponding to those variables random. Designs with no random factors are identified by having no random variables.
A master data file with column tags for a design with one random factor, two independent variables (one with unequal cell sizes), one covariate, and two dependent measures. At the top of the file are column tags for column name, type, selection, scale, and allowable range.

student | class | problem | skill | complete? | time
---|---|---|---|---|---
unit | independent | independent | covariate | dependent | dependent
random | fixed | fixed | | |
nominal | nominal | ordinal | interval | nominal | ratio
1-5 | {a,b} | {1-3} | {1-10} | {yes,no} | {1-5}
1 | a | 1 | 4 | no | 2
1 | a | 1 | 4 | no | 3
1 | a | 2 | 4 | yes | 1
1 | a | 3 | 4 | yes | 4
2 | a | 1 | 7 | no | 4
2 | a | 2 | 7 | yes | 2
2 | a | 3 | 7 | no | 5
3 | b | 1 | 2 | yes | 3
3 | b | 2 | 2 | yes | 1
3 | b | 3 | 2 | yes | 3
4 | b | 1 | 10 | yes | 1
4 | b | 1 | 10 | yes | 2
4 | b | 2 | 10 | no | 2
4 | b | 3 | 10 | no | 3
4 | b | 3 | 10 | yes | 4
5 | b | 1 | 8 | yes | 5
5 | b | 2 | 8 | no | 4
By incorporating column tags, most designs can be represented, and design information can be interpreted by data analysis routines instead of requiring a person to describe the structure of the input. This last property is perhaps the system's most important virtue because it reduces the probability of design specification errors, and can also be used to force the use of appropriate statistical procedures.
Ritchie, D. M., & Thompson, K. The UNIX time-sharing system. Communications of the ACM, 1974, 17, 365-375.
The research reported here was conducted under Contract N00014-79-C-0323, NR 157-437 with the Personnel and Training Research Programs of the Office of Naval Research, and was sponsored by the Office of Naval Research and the Air Force Office of Scientific Research.