Gary Perlman, University of California, San Diego
Note: This paper appeared as:
Perlman, G. (1983)
Data Analysis in the UNIX Environment.
pp. 130-138
in
K. W. Heiner, R. S. Sacher, & J. W. Wilkinson (Eds.)
Computer Science and Statistics: Proceedings of the 14th
Symposium on the Interface,
July 5-7, 1982,
Springer-Verlag.
The descriptions of the programs are of little more than historical curiosity because there are now many more programs, best described at the |STAT home page: www.acm.org/~perlman/stat/. However, this paper has the best description of the ideas behind the |STAT file formats and the principles of automatic interfaces for implicit design specification.
In this article, I will discuss how the UNIX (TM Bell Telephone Laboratories) operating system (Ritchie & Thompson, 1974) supports data analysis, and how some of my data analysis programs (Perlman, 1980) developed under the influence of the UNIX environment. I begin with an overview of UNIX, and how its support of process control affects data analysis. Then I describe some programs developed for data analysis on the UNIX operating system. Finally, I discuss how many aspects of data analysis, particularly the specification of experimental designs, can be automated.
UNIX is an interactive operating system and as such is ideal for data analysis. Users generally sit in front of a terminal and repeatedly specify a program, its input, and to where its output should be directed. Users can immediately observe intermediate results of programs, and can make decisions based on them. One common decision is to use another program on such results.
It is possible to combine simple UNIX programs into more complicated ones. The implementors of UNIX made it possible to direct the output of one program into the input of another program. This linking of outputs to inputs is called "pipelining." Programs in a pipeline are connected with the pipe symbol, the vertical bar. For example, several programs can be joined together with a command like:
IO DATA | PROGRAM1 | PROGRAM2 | PROGRAM3

Here, a series of transformations of the file DATA is initiated by an input/output program called IO. (I will use the convention of capitalizing the names of programs and files, though in actual use, they would probably be lower case.) The output from IO is used as input to PROGRAM1 whose output is used as input (piped) to PROGRAM2 whose output is piped finally to PROGRAM3. In effect, PROGRAM1, PROGRAM2, and PROGRAM3 are combined to make a more complex program.
The ability to pipeline programs has led to a philosophy of modular program design unique to and ubiquitous in UNIX: to build program tools that each accomplish one task well and produce outputs convenient for input to other programs. Convenience here means output stripped of unnecessary labeling information, and easily parsed by other programs. In UNIX, there are dozens of programs with general uses, such as ones to search through or sort files, count lines, words, and characters in files, and so on. The philosophy is to have programs that can be used in a variety of contexts. This philosophy has been followed in the design of data analysis programs I will describe later, especially those for applying pre-analysis transformations.
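The philosophy pays off in practice. For example, counting how many lines of a data file mention a given string needs no special program; two standard UNIX tools combine in a pipeline (the file name DATA and the string "hard" here are hypothetical):

grep hard DATA | wc -l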
Doing data analysis in a general programming environment overcomes some problems associated with many statistical packages. In a general environment like UNIX, users have easy access to all the programs that come with the system, as well as those locally developed. That means people doing data analysis in UNIX can use system editors, printing programs, and perhaps most importantly, they can control data analysis with the general command language used to control all program execution in UNIX. The UNIX command line interpreter is called the shell and it simplifies file input and output. It also provides the pipelining facilities mentioned earlier. The UNIX shell is a high level programming language in which users can write shell "scripts" that allow program control with primitives for condition testing and iteration. In short, doing data analysis in a general programming environment provides many resources lacking or weak in statistical packages.
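As a small illustration of that control, the following is a minimal shell script sketch (the file names and output naming convention are hypothetical) that runs the same analysis program over several data files, saving a summary of each:

# run the desc program on each data file, saving each summary
for f in expt1 expt2 expt3
do
    desc < $f > $f.out
done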
While UNIX allows easy access to a wide variety of programs peripherally related to data analysis, such is not the case with most statistical packages. Their individual programs usually have many general utilities, such as data transformation, editing, and printing subroutines, built in, making them larger than necessary, and hence less portable to small computers. This is necessary for statistical packages expected to run on a variety of operating systems where it is not clear what utilities are available. It is also a remnant of the 1960s batch processing philosophy under which the major statistical packages were first developed.
Under the UNIX philosophy of modular design, data are transformed by specialized programs before being input to specific data analysis programs. The result is that data transforming functions are independent of analysis programs and are applicable in a wider variety of contexts.
Recall that the UNIX design philosophy of modularity is in support of pipelines of simple uni-purpose tools. An application of this philosophy to data analysis is to first transform data and pipe the transformed data to specific analysis programs. The following represents an example of just such a use.
IO DATA | TRANSFORM | ANALYZE

Writing programs to be used in pipelines requires that their outputs be uncluttered and easily parsed by programs downstream. This requirement is satisfied by having all programs process data as human-readable character files with fields separated by white space (blanks or tabs). While such free format slows processing, the slowdown is negligible compared to the time required for complex analyses. Free format also costs storage space, but space limitations do not affect the majority of users, and as the cost of storage goes down, this factor is even less of a problem. The requirement of putting data in plain files, rather than binary code or even the traditional fixed-column FORTRAN formats, makes the files easier for people to read, and increases their acceptability to all UNIX programs, such as system-wide editors and printing programs.
Constructing pipelines of programs efficiently requires that much of the user's task be automated. If a special control language file were needed for each program in a pipeline, constructing analysis commands would be prohibitively time consuming, and error prone. It would also obliterate the advantage UNIX offers as an interactive system. In many cases, there is no need for a control language when programs are made intelligent enough to determine mundane properties of their inputs automatically. Delimiting fields with white space allows programs to count the number of fields per line, and UNIX file handling makes it easy to tell when the end of a file has been reached, which makes it easy to count lines as well as columns.
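For instance, because fields are delimited by white space, a standard UNIX tool can report the number of fields on every line without any control language. The following sketch, using the standard awk and sort programs on a hypothetical file DATA, prints the distinct field counts found in the file, so a consistent file prints a single number:

awk '{ print NF }' DATA | sort -u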
The programs have been designed to be easy to use, and small enough to fit on minicomputers such as DEC PDP-11/45s, 11/34s, and 11/23s. I will first describe the format prescribed for the programs, and how people use the programs in the UNIX environment. The programs are of several different types:
Data Transformation. These programs are useful for changing the format of data files, for transforming data, and for filtering unwanted data. In addition, one program is useful for monitoring the progress of the data transformations.
Data Validation. These include programs for checking the number of lines and columns in data files, their types (e.g., alphanumeric, integer), and their ranges.
Descriptive Statistics. These procedures include both numerical statistics, and simple graphical displays. There are procedures for single distributions, paired data, and multivariate cases.
Inferential Statistics. These include multivariate linear regression and analysis of variance. Some simple inferential statistics are also incorporated into the descriptive statistics programs, but are used less often.
Pre-Processing Programs |
---|---
ABUT | abut files
DM | conditional transformations of data
IO | control/monitor file input/output
TRANSPOSE | transpose matrix-type file
VALIDATA | verify data file consistency

Analysis Programs |
---|---
ANOVA | anova with repeated measures
BIPLOT | bivariate plotting + summary statistics
CORR | linear correlation + summary statistics
DESC | statistics, frequency tables, histograms
PAIR | bivariate statistics + scatterplots
REGRESS | multivariate linear regression
The command:

PROGRAM < FILE

indicates to UNIX (really it indicates to the shell, which controls input and output) that the input to the program named PROGRAM should be read from the file named FILE rather than the terminal keyboard. Analogously, the output from a program can be saved in a file by redirecting it to a file with the ">" symbol. Thus,
PROGRAM < INPUT > OUTPUT

indicates that the program PROGRAM should read its input from the file INPUT and put its output into a new file called OUTPUT. If the file INPUT does not exist, an error message will be printed. If the file OUTPUT exists, then whatever was in that file before will be destroyed. A mistake novices and experts alike should avoid is a command like:
PROGRAM < DATA > DATA

which might be used to replace the contents of DATA with whatever PROGRAM does to it. The effect of such a command is to destroy the contents of DATA before PROGRAM has a chance to read it. For a safer method of input and output, see the later discussion of the IO program.
The output from one program can be made the input to another program without the need for temporary files. This action is called "pipelining" or "piping" and is used to create one complex function from a series of simple ones. The vertical bar, or "pipe" symbol, | , is placed between programs to pass the output from one into the other. Thus,
PROGRAM1 < INPUT | PROGRAM2

tells UNIX to run the program PROGRAM1 on the file INPUT and feed the output to PROGRAM2. In this case, the final output would be printed on the terminal screen because the final output from PROGRAM2 is not redirected. Redirection could be accomplished with a command line like:
PROGRAM1 < INPUT | PROGRAM2 > OUTPUT

In general, only one input redirection is allowed, and only one output redirection is allowed, and the latter must follow the former. Any number of programs can be joined with piping.
In general, UNIX programs do not know if their input is coming from a terminal keyboard or from a file or pipeline. Nor do they generally know where their output is destined. One of the features of UNIX is that the output from one program can be used as input to another via a pipeline making it possible to make complex programs from simple ones without touching their program code. Pipelining makes it desirable to keep the outputs of programs clean of annotations so that they can be read by other programs. This has the unfortunate result that the outputs of many UNIX programs are cryptic and have to be read with a legend. The advantages of pipelining will be made clear in the examples of later sections.
Data are most easily manipulated and analyzed if they are in what I will call the master data file format. The key ideas of the format of the master data file are simplicity and self documentation, and are derived from relational databases. The reason for this is to make transformation of data easy, and to be able to use a wide variety of programs to operate on a master data file. Each line of a master data file has the same number of alphanumeric fields. For readability, the fields can be separated by any amount of white space (blank spaces or tabs), and, in general, blank lines are ignored. Each line of a master data file corresponds to the data collected on one trial or series of trials of an experiment. Along with the data, a set of fields describes the conditions under which those data were obtained. Usually, a master data file contains all the data for an experiment. However, in many cases, a user would not want all the data from this file to be input to a program. Some parts may be of particular interest for a specific statistical test, or some data may need to be transformed before input to a data analysis program. In a later section, I will expand on the notion of a master data file, and show how it can be used to convey experimental design information implicitly.
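A small hypothetical example may make the format concrete. Each line below records one trial: a subject name, the condition under which the trial was run, and the score obtained.

fred easy 12
fred hard 17
ethel easy 10
ethel hard 15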
The strategy I will use here is to describe the programs and give examples of how they are used. This is meant only to make the ideas of analysis with these programs familiar and should not be used as a substitute for the manual entries on individual programs. Only a few of the capabilities of the programs are described.
A typical analysis therefore has the form:

TRANSFORM < DATA | ANALYZE > OUTPUT

where TRANSFORM is some program to transform data from an input file DATA, and the output from TRANSFORM is piped to an analysis program, ANALYZE, whose output is directed to an output file, OUTPUT.
ABUT joins corresponding lines of the files named in its command line, in effect placing the files beside each other. The command:

ABUT FILE1 FILE2 FILE3 > FILE123

would create FILE123 with its first line the first lines of FILE1, FILE2, and FILE3, in order. Successive lines of FILE123 would have the data from the corresponding lines of the named files joined together.
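For instance, if FILE1 contained the two lines "a b" and "c d", and FILE2 contained the lines "1" and "2" (hypothetical contents), then ABUT FILE1 FILE2 would produce:

a b 1
c d 2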
Catenate Files. IO has a similar function to ABUT. Instead of, in effect, placing the files named beside each other, IO places them one after another. A user may want to analyze the data from several files, and this can be accomplished with a command like:
IO FILE1 FILE2 FILE3 | PROGRAM

Monitoring Input and Output. IO can also be used to monitor the flow of data between programs. When called with the -m flag, it acts as a meter of input and output flow, printing the percentage of its input that has been processed. (In UNIX, it is common for options to be specified to programs by preceding letters with a dash to distinguish them from file names.) The program can also be used in the middle and at the end of a pipeline to monitor the absolute flow of data, printing a special character for every block of data processed.
Input and Output Control. Finally, it can be used as a safe form of controlling i/o to files, creating temporary files and copying rather than automatically overwriting output files. For example, IO can be used to sort a file onto itself with one pipeline using the standard UNIX SORT program:
IO -m FILE | SORT | IO -m FILE

The above command would replace FILE with a sorted version of itself. Because the monitor (-m) flag is used, the user would see an output like:
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% ==========

The percentages show the flow from FILE into the SORT program (percentages are possible because IO knows the length of FILE), and the equal signs indicate a flow of about a thousand bytes each coming out of the SORT program. The command:
SORT FILE > FILE

would destroy the contents of FILE before SORT had a chance to read it. The command with IO is both safer and acts as a meter showing input and output progress.
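Without IO, the traditional safe method is a two-step command using a temporary file (the standard UNIX sort and mv programs are shown here in lower case), which offers no progress metering:

sort FILE > /tmp/sorted
mv /tmp/sorted FILE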
Saving Intermediate Transformations. In a command like:
IO FILE1 FILE2 | PGM1 | PGM2 | IO OUT

the intermediate results before PGM1 and before PGM2 are lost. IO can be used to save them by diverting a copy of its input to a file before continuing a pipeline:
IO FILE1 FILE2 | PGM1 | IO SAVE | PGM2 | IO OUT

Any of these calls to IO could be made metering versions by using the -m option flag.
TRANSPOSE interchanges the rows and columns of its input. If a file called FILE contains the matrix:

1  2  3  4
5  6  7  8
9 10 11 12

then the command:
TRANSPOSE FILE

will print:
1 5  9
2 6 10
3 7 11
4 8 12

TRANSPOSE is useful as a pre/post-processor for the data manipulator, DM, as it reads from the standard input stream when no files are supplied.
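For example, because DM operates on columns, a computation over the rows of a file can be had by transposing first and, if desired, transposing back. The following hypothetical pipeline adds the first two rows of FILE element by element and prints the sums as a single row:

TRANSPOSE FILE | DM x1+x2 | TRANSPOSE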
DM allows users to access the columns of each line of its input. Numerical values of fields on a line can be accessed by the letter "x" followed by the column number. Character strings can be accessed by the letter "s" followed by the column number. Consider for example the following contents of the file EX1:
12 45.2 red ***
10 42 blue ---
8 39 green ---
6 22 orange ***

The first line of EX1 has four columns, or fields. In this line, x1 is the number 12, and s1 is the string "12". DM distinguishes between numbers and strings (the latter are enclosed in quotes) and only numbers can be involved in algebraic expressions.
Column extraction. Simple column extraction can be accomplished by listing the string variables for the columns desired. To print, in order, the second, third, and first columns from the file EX1, one would use the call to DM:
DM s2 s3 s1 < EX1

This would have the effect of reordering the columns and removing column 4.
Algebraic Expressions. DM can be used to produce algebraic combinations of columns. For example, the following call to DM will print the first column, the sum of the first two columns, and the square root of the second column.
DM x1 x1+x2 "sqrt(x2)" < EX1

Note that the parentheses in the third expression require quotes around the whole expression. This is because parentheses are special characters in the shell. If a string in either of the columns were not a number, DM would print an error message and stop.
Conditional Operations. DM can be used to filter out lines that are not wanted. A simple example would be to print only those lines with stars in them.
DM "if * C INPUT then INPUT else NEXT" < EX1The above call to DM has one expression (in quotes to overcome problems with special characters and spaces inserted for readability). The conditional has the syntax "if-then-else." between the "if" and "then" is a condition that is tested, in this case, testing if the one-character string * is in INPUT, a special string holding the input line. If the condition is true, the expression between the "then" and "else" parts is printed, in this example, the input line, INPUT. If the condition is not true, then the expression after the "else" part is printed, in this case, NEXT is a special control variable that is not printed and causes the next line to be read.
DM "if !(x3>0 & x3<100) then INPUT else NEXT"If non-numerical data appeared in column three, DM would report an error.
Summary Statistics. DESC prints a variety of statistics, including order statistics. Under option, DESC will print a t-test for any specified null mean.
Frequency Tables. DESC can print frequency tables, or tables of proportions, with cumulative entries if requested. These tables will be formed based on the data so they are a reasonable size, or they can be formatted by the user who can specify the minimum value of the first interval, and the interval width. For example, the following command would print a table of cumulative frequencies and proportions (cfp) in a table with a minimum interval value of zero (m0) and an interval width of ten (i10).
DESC i10 m0 cfp < DATA
Histograms. If requested, DESC will print a histogram with the same format as would be obtained with options used to control frequency tables. For example, the following command would print a histogram of its input by choosing an appropriate interval width for bins.
DESC h < DATA

The format of the histogram, as well as tables, can be controlled by setting options. The following line sets the minimum of the first bin to zero, and the interval width to ten, an appropriate histogram for grading exams.
DESC h i10 m0 < GRADES
Summary Statistics. From PAIR's input, minimums, maximums, means, and standard deviations are printed for both columns as well as for their difference. Also printed are the correlation of the two columns and the regression equation relating them. The simplest use of PAIR is with no arguments. To analyze a data file of lines of X-Y pairs, the following command will in most cases be satisfactory:
PAIR < DATA

Often the paired data to be analyzed are in two files, each variable occupying a single column. These can be joined with ABUT and input to PAIR via a pipe:
ABUT VAR1 VAR2 | PAIR

Or perhaps the two variables occupy two columns in a master data file. If the variables of interest are in columns four and six, the following command would produce the paired data analysis:
IO DATA | DM s4 s6 | PAIR

Scatterplots. With the "p" or "b" options, a scatterplot of the two variables can be printed. Alternatively, PAIR has an alias, called BIPLOT, which lets the user get a bivariate plot of a data file of X-Y pairs:
BIPLOT < DATA
Suppose you had a design in which you presented problems to subjects. These problems varied in difficulty (easy/hard), and in length (short/medium/long). The dependent measure is time to solve the problem, with a time limit of five minutes. Your data file would have lines like this:
fred easy medium 5
ethel hard long 2

In the first column is a string that identifies the level of the random factor (here, subject name), followed by strings indicating the levels of the independent factors, followed by the dependent measure (here, solution time). A data file holding lines like those above would be analyzed with:
ANOVA subject difficulty length time < DATA

Individual factors can be ignored by excluding their columns from the analysis:
DM s1 s2 s4 < DATA | ANOVA subject difficulty time

Similarly, different factors can be used as the random factor. This is common in psycholinguistic experiments in which both subjects and items are random variables.
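For example, if each line of DATA also carried an item name in a fifth column (a hypothetical layout), an items analysis could be run by moving that column to the front, where the random factor is expected, and dropping the subject column:

DM s5 s2 s3 s4 < DATA | ANOVA item difficulty length time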
REGRESS, like ANOVA, is called with a name for each column of its input. A file of five columns might be analyzed with:

REGRESS bp age height risk salary < DATA

If only a few columns were of interest, they could be extracted with DM:
DM x1 x3 x5 < DATA | REGRESS bp height salary

PAIR: Paired Data Comparison. PAIR can be used to compare two distributions of paired data. Often, the two columns of interest are pulled out of a master data file with DM. The following command takes columns 5 and 3 from DATA and inputs them to PAIR:
DM x5 x3 < DATA | PAIR

For its two-column input, PAIR will print a t-test on the differences of the two columns, which is equivalent to a paired t-test. PAIR will also print a regression equation relating the two, along with a significance test on their correlation, which is equivalent to testing the slope against zero.
Most statistical programs offer similar capabilities, but they differ in how easily they are used. One purpose of this paper is to introduce a system that is both easy to use and powerful in the class of designs it can specify. My goal is to motivate the implementation of more easily used statistical programs by presenting a convenient interface between users and programs.
The MDF (master data file) format provides useful terms in which people can think about their data. Each line corresponds to a trial on which data were collected. Each column corresponds to a variable, such as an independent or dependent variable.
No Grouping Factors. In the simplest case, an MDF holds one line per observation, with an index identifying the source of the data and the datum itself:

fred 68
judy 62
bob 67
jane 67

One Grouping Factor. To compare data from two groups, there must be some way of distinguishing them. This is done by using strings to indicate levels of factors in a column, as in the following example where each person's sex is included as an index to distinguish groups.
fred male 68
judy female 62
bob male 67
jane female 67

Factorial Designs. It is a simple generalization to higher-order designs. For each factor, a column is added holding levels for that factor. In the following example, there are two factors: sex, with {male, female} as levels, and difficulty, with {easy, hard} as levels.
bob male easy 56
bob male hard 67
jane female easy 63
jane female hard 70

Different Types of Factors. All types of designs can be coded in the same system, and design information is implicit in the relation of the columns holding levels of independent variables to that of the random factor. By assuming that the indexes for the random factor fall in the first column of an MDF, the type of a factor can be inferred to be between groups, within groups, or nested within some other factor. In the previous example, the difficulty factor can be inferred to be a within-groups factor because each level of the random factor is paired with all levels of that factor. Of course, the sex factor is between subjects, and a computer program can infer this because each level of the random variable is paired with exactly one level of the sex factor.
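The inference itself is mechanical. The following sketch, written in standard awk rather than with the programs described here, classifies the factor in column two relative to the random factor in column one; it is simplified in that it does not verify that a within-groups factor is fully crossed:

awk '
    # count the distinct column-2 levels paired with each unit in column 1
    !seen[$1, $2]++ { nlevels[$1]++ }
    END {
        for (u in nlevels)
            if (nlevels[u] > 1) { print "within groups"; exit }
        print "between groups"
    }
' DATA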
Nested factors can also be easily coded. In the following example, gymnastic event, indexed in column three, is nested in gender.
fred male rings 9.4
bob male horse 8.2
judy female vault 8.8
jane female bar 9.2

Unequal Group Sizes. To code unequal group sizes on between-groups factors, all that is needed is more data for one group than for another. In the following example, the male group has three members compared to two for the female group.
fred male 69
bill male 62
bob male 78
jane female 65
judy female 74

Replications. Replications are coded by having more than one line with the same leading indexes. This is shown in the next example. One person has three replications while another has two.
john male 5
john male 3
john male 8
jane female 10
jane female 12

Missing Data. Many statistical programs require fixed-column input formats that treat blank fields as zeros, and so must handle missing data with special values that the programs recode as missing. This is clumsy and can lead to errors. Missing values are a special case of multiple observations for which zero replications are collected: in the MDF format, a value is missing if there is no line in the MDF containing it.
Categorical Data. Even though the methods of analysis of numerical data and categorical data differ greatly, the sorts of designs from which they are obtained are often similar. It is not uncommon to collect both numerical and categorical data during one experiment. Categorical data are coded in the same way as numerical data, except that categorical dependent measures can be coded by arbitrary strings. These strings, like those specifying levels of variables, can be used in printouts to make outputs more easily interpreted. For example, the following data might have been obtained from a questionnaire.
john question-1 agree
john question-2 disagree
jane question-1 agree
jane question-2 disagree
Information about the variables in an MDF can be supplied in a series of column tags at the top of the file, one line per attribute; the attributes are described below, and an example appears in the table that follows them.

Name. Analysis programs with access to meaningful variable names can use them in messages and outputs so that users can more easily interpret results. Adding names promotes a master data file to a full relational database, making possible powerful database operations. If unspecified, default names can be chosen based on column number and type.
Type. A variable can be of type unit (from which individual scores are collected), independent, dependent, or covariate. Intelligent programs with type information can determine the desired analysis by the relationships between pairs of columns in an MDF. This allows automated design interpretation and removes the earlier restriction that columns be ordered.
Selection. The levels of unit and independent variables are either selected according to some fixed criterion or by random sampling. Programs with information about how levels of these variables are selected can choose the correct error terms for significance tests. Most often, a unit variable is random while all independent variables are fixed, and both are useful default assumptions.
Scale. How data are analyzed depends on the scale of measurement of all variables in a design. Variables can have levels defined by name alone (nominal), or there may be an inherent ordering of levels (ordinal), or differences between levels may be meaningful (interval), or the ratio of levels may be meaningful (ratio). Depending on the scale of measurement, data may be analyzed with cross-tabulations, order statistics, or parametric methods. If unspecified, it is reasonable to assume unit and independent variables are nominal while dependent measures are at least on an interval scale if their range is numerical.
Range. The range of allowable values of a variable is useful for checking the validity of inputs. Ranges can be specified by individual values or by implied subranges in the case of variables at least on an ordinal scale.
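A range tag invites automatic checking of the sort VALIDATA performs. As an illustrative sketch only (standard awk, not part of the programs described here), the following flags every line whose fourth column falls outside an allowable range of 1 to 10:

awk '$4 < 1 || $4 > 10 { print "line " NR ": out of range: " $0 }' DATA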
Covariates. Similarly, covariates are included as columns in the MDF, and are tagged with type covariate. An analysis of covariance program would then know to first remove any variability on the data attributable to the covariate before performing analyses for dependent measures. Variables labeled covariate can also be coded as independent variables measured on interval or ratio scales.
Other Than One Random Factor. Experimental designs with any number of random factors can be coded by tagging the columns corresponding to those variables random. Designs with no random factors are identified by having no random variables.
A master data file with column tags for a design with one random factor, two independent variables (one with unequal cell sizes), one covariate, and two dependent measures. At the top of the file are column tags for column name, type, selection, scale, and allowable range.

student | class | problem | skill | complete? | time
---|---|---|---|---|---
unit | independent | independent | covariate | dependent | dependent
random | fixed | fixed | | |
nominal | nominal | ordinal | interval | nominal | ratio
1-5 | {a,b} | {1-3} | {1-10} | {yes,no} | {1-5}
1 | a | 1 | 4 | no | 2
1 | a | 1 | 4 | no | 3
1 | a | 2 | 4 | yes | 1
1 | a | 3 | 4 | yes | 4
2 | a | 1 | 7 | no | 4
2 | a | 2 | 7 | yes | 2
2 | a | 3 | 7 | no | 5
3 | b | 1 | 2 | yes | 3
3 | b | 2 | 2 | yes | 1
3 | b | 3 | 2 | yes | 3
4 | b | 1 | 10 | yes | 1
4 | b | 1 | 10 | yes | 2
4 | b | 2 | 10 | no | 2
4 | b | 3 | 10 | no | 3
4 | b | 3 | 10 | yes | 4
5 | b | 1 | 8 | yes | 5
5 | b | 2 | 8 | no | 4
By incorporating column tags, most designs can be represented, and design information can be interpreted by data analysis routines instead of requiring a person to describe the structure of the input. This last property is perhaps the system's most important virtue because it reduces the probability of design specification errors, and can also be used to force the use of appropriate statistical procedures.
Ritchie, D. M., & Thompson, K. The UNIX time-sharing system. Communications of the ACM, 1974, 17, 365-375.
The research reported here was conducted under Contract N00014-79-C-0323, NR 157-437 with the Personnel and Training Research Programs of the Office of Naval Research, and was sponsored by the Office of Naval Research and the Air Force Office of Scientific Research.