|STAT Statistical Data Analysis
Free Data Analysis Programs for UNIX and DOS
by Gary Perlman
|STAT was an invention of necessity -- my necessity -- although I have tried to allow others to benefit from the effort. And it is highly skewed to the procedures used in experimental psychology. If I needed a statistical procedure or a feature, it would not mean much effort for me to add it, and then others could use it. I have never had the ambition of writing a complete package -- all that might mean is to have what other packages have -- although I have thought about more general models for design specification and analysis.
I was at UCSD in the late 1970s. Our lab ran UNIX (version 6) on a PDP 11/45 with 256K of memory (128K for UNIX). The UCSD Burroughs mainframe had the BMDO and BMDP packages (among others). We used BMD-O8V and BMD-P2V for most of our ANOVA needs, but it was inconvenient to print out data, punch it onto cards in another building, and wait for a job to finish and print out. We were, after all, accustomed to an interactive time-sharing system!
BMD-P2V offered many new capabilities, but we found the program difficult to use. Data had to be placed in exact column positions and read with the equivalent of Fortran format statements. Errors were common. I tried to humanize BMD-P2V by rewriting their manual. It was my first real introduction to unusable software where errors could have career-ending effects.
Our lab obtained BMD-P for our PDP 11/45. The initial excitement turned to dismay, as the programs were far too large to fit in 128K of memory. The program worked by overlaying -- loading parts, unloading, loading other parts -- which was time-consuming. Obtaining the mean of two numbers took several minutes of CPU time, and on a loaded time-sharing system, that could span 10-15 minutes and degrade performance for all other users.
People wrote their own programs in C to do the most mundane analyses. We shared code, and some programs became used by more than their original authors.
In the fall of 1979, I got an idea about how to automatically choose bin sizes to make a good-looking histogram, and desc was born (actually, it was first called hist). Over time, many stats were added, and desc was clearly a cut above the rest of the little programs.
pair followed soon after. Don Gentner had written a paired data analysis program, but the output was inconsistent, especially in alignment. I rewrote it to align similar items, add more stats, and eventually added error handling (which was missing from most of these home-grown programs), and more options. I wrote a character-based bivariate plotting program, biplot, which eventually merged into pair as the -p option.
Next was dm, originally called the data massager but later the data manipulator for public distribution (interactive mode could have greeted analysts with "Welcome to the massage parlor," but I don't think it ever did). The major effort was in learning about building a parser with yacc (Yet Another Compiler-Compiler). Jay McClelland helped motivate me to add string handling to the numeric functions. (He argued that it would be much easier for me to change the code than for him to, and I made most of the changes in our lab while he looked over my shoulder.)
By the summer of 1980, I took on the task of writing anova, adapting as much of the methods from Keppel's Design and Analysis book as I could. Jay McClelland had written dt, the Data Tabulator, which was highly influential in the design of anova and how it simplified the specification of the relationship of the data to the experimental design.
regress (initially called corr) followed later in 1980, as did many of the data validation facilities. Only a few programs, mostly for data manipulation, were added during my remaining time at UCSD:
While at UCSD, I started distributing the package (via magnetic tape). One request came from the Hospital for Sick Children in Toronto. "Yow!" I thought. "If there are bugs in the programs and these doctors are basing decisions on the results, they might say 'Well, these results from Gary Perlman's programs are clear. Off with Timmy's leg.'" I would wake up at night and think about doing more testing. It made me much more serious about testing, especially after making changes. It also made me reluctant to make changes.
Having been a professor who taught software engineering for 12 years, I've told that story in every class in which I discussed testing and the potential impact of poorly-tested software.
At the Wang Institute, I began adding to the package in 1985, particularly non-parametric/rank-based statistics:
The programs have remained remarkably stable over their first 20 years. Few features were added, and most changes were package-wide (e.g., common help options across all programs). Fortunately, none of the programs were changed for computational errors.
While I was teaching at the Wang Institute, my wife-to-be was at Cornell in Psychology graduate school. Fred Horan, a department programmer, worked on porting the package to the PC using the Lattice C compiler. After I helped with some of the stickier problems, the package was running with no problems.
The policy for |STAT arose from experiences with well-meaning people sending in enhancements to various programs. Each enhancement was accompanied by an enthusiastic message that indicated a real sense of cooperation and pride. Unfortunately, they also included bugs -- computational errors -- that could result in an incorrect decision. After about ten contributions, none of which had ever been included because none had ever been correct, the non-redistribution policy was born.
I have often considered distributing the validation suite with the software, so that the software could be checked after compilation. Once someone suggested using a new version of a compiler for a huge performance improvement; the results came back in one tenth the time -- wrong results, but fast! Usually, however, portability problems are clear, and it would be a large additional effort to prepare the scripts for distribution, so that feature may need to wait for the need.
I ran a contest with some students, and although they had some good ideas (okay, I'm lying here, but you want to encourage students), they all would require some effort on my part to protect the name from infringers. Continuing the package's tradition of never doing more work than is necessary for my needs (or my wife's, depending on how forcefully she expresses her needs), I decided that a name like "STAT" would be ideal because it was generic and unprotectable so I would not need to worry about protection, infringement, etc. I kept the pipe symbol, although that resulted in some people interpreting the pipe symbol as an uppercase "I", and citing |STAT as ISTAT.
Some people ask me how to pronounce "|STAT". The pipe symbol is silent. Now you know. Some people like to call it "pipe stat", which is probably better than a silent symbol, but leans toward something that I might need to protect, so I have not considered it seriously.
© 1986 Gary Perlman