|STAT Statistical Data Analysis
Free Data Analysis Programs for UNIX and DOS
by Gary Perlman
There are several classes of data manipulation programs.
repeat: repeat a string or file
repeat can repeat strings
or lines in a file as many times as requested. It helps generate labels for
datasets, or feed a program like dm that needs input to produce
output. The following will repeat the file data 10 times.
repeat -n 10 data
The following will repeat its input series of 20 numbers 15 times.
series 1 20 | repeat -n 15
Strings can be repeated using the echo command. The following will repeat the string hello 100 times.
echo hello | repeat -n 100
series: generate a linear series
series generates a
linear series of numbers between two values. By default, successive values differ by
one, but the increment can be changed. The following produces a series of 10
numbers, 1 to 10, one per line.
series 1 10
The following produces the same series, but in reverse order; the start of the series can be greater than the end.
series 10 1
Non-integral series can be created by supplying an optional increment.
series 0 1 .1
produces the series:
0 .1 .2 .3 .4 .5 .6 .7 .8 .9 1
except that each value is on its own line. The output from series can be transformed with dm to produce other than linear series. Here is an exponential series:
series 1 10 | dm "exp(x1)"
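The rules described above (unit steps by default, a descending series when the start exceeds the end, an optional fractional increment) can be sketched in Python. This is only an illustration of the behavior, not the implementation used by series; the function name linear_series and the rounding to 10 decimal places are assumptions made here to keep floating-point noise out of the values.

```python
import math

def linear_series(start, end, step=1.0):
    """Return the values from start to end (inclusive) in increments of step.

    Hypothetical helper: the direction is inferred from start and end,
    so step is treated as a positive size.
    """
    step = abs(step)
    if step == 0:
        raise ValueError("step must be non-zero")
    direction = 1 if end >= start else -1
    # Number of whole steps that fit; the small tolerance absorbs float error.
    count = math.floor(abs(end - start) / step + 1e-9)
    # Compute each value from the start instead of accumulating sums.
    return [round(start + direction * i * step, 10) for i in range(count + 1)]

print(linear_series(1, 10))      # ascending, like: series 1 10
print(linear_series(10, 1))      # descending, like: series 10 1
print(linear_series(0, 1, 0.1))  # fractional increment, like: series 0 1 .1
```

Computing each value as start plus a multiple of the step, rather than adding the step repeatedly, keeps rounding error from building up over a long series.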
probdist: generate random numbers
probdist can generate
random numbers for several probability distributions. The following will
generate 100 random numbers from the uniform distribution (between 0 and 1).
probdist random uniform 100
This can be transformed using dm to get random numbers with other ranges. The following will produce 100 random integers uniformly distributed between 10 and 29.
probdist random uniform 100 | dm "floor(x1*20+10)"
The following generates numbers from a one-trial binomial distribution with probability 0.5.
probdist random uniform 100 | dm "if x1 > .5 then 1 else 0"
probdist also has a binomial distribution built in, so the following would be equivalent to the previous example:
probdist rand binomial 1 1/2 100
The random number generator can be seeded. The following will seed the random number generator with 143 and generate 100 normally distributed z values.
probdist -s 143 random normal 100
The seeding option is useful when a random sequence must be repeated. The random normal numbers have a mean of 0 and a standard deviation of 1, so dm can help create different random normal distributions. The following samples a normal distribution with mean 100 and standard deviation 15.
probdist random normal 100 | dm "x1*15+100"
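The two rescaling tricks above, floor(x1*20+10) for integers from 10 to 29 and x1*15+100 for a normal distribution with mean 100 and standard deviation 15, can be sketched in Python. The helper names are hypothetical, and Python's random module stands in for probdist here, so a seed of 143 makes the Python sequence repeatable but will not reproduce probdist's numbers.

```python
import math
import random

def uniform_integers(n, low, high, rng):
    """Return n integers uniformly distributed on low..high inclusive."""
    span = high - low + 1
    # rng.random() lies in [0, 1), so floor(u * span + low) lies in low..high.
    return [math.floor(rng.random() * span + low) for _ in range(n)]

def scaled_normals(n, mean, sd, rng):
    """Return n normal deviates rescaled from z values as z * sd + mean."""
    return [rng.gauss(0.0, 1.0) * sd + mean for _ in range(n)]

rng = random.Random(143)  # seeding makes the sequence repeatable
ints = uniform_integers(100, 10, 29, rng)
scores = scaled_normals(100, 100, 15, rng)
print(min(ints), max(ints))
print(sum(scores) / len(scores))
```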
abut: number lines, recycle files
abut can number input
lines in files using the -n option, or cycle through input files as
many times as is necessary to match the length of longer files. The latter
case is common in creating input files for programs like anova and
contab, which have input data tagged with regular patterns of labels.
File1   File2   Data
large   easy    12
small   easy    23
        hard    34
        hard    45
                56
                67
                78
                89
For the above input file configuration, the command
abut -nc File1 File2 Data
would produce the following by recycling the smaller files.
1 large easy 12
2 small easy 23
3 large hard 34
4 small hard 45
5 large easy 56
6 small easy 67
7 large hard 78
8 small hard 89
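The recycling behavior can be sketched in Python with itertools.cycle. This is an illustration of the idea rather than abut's own code; the function name abut_recycle is hypothetical, and the sketch assumes every input column has at least one line.

```python
from itertools import cycle

def abut_recycle(*columns, number=True):
    """Paste columns side by side, recycling shorter ones to the longest length."""
    longest = max(len(col) for col in columns)
    cyclers = [cycle(col) for col in columns]
    rows = []
    for line_no in range(1, longest + 1):
        fields = [next(c) for c in cyclers]
        if number:
            # Prepend the line number, as the -n option does.
            fields.insert(0, str(line_no))
        rows.append(" ".join(fields))
    return rows

file1 = ["large", "small"]
file2 = ["easy", "easy", "hard", "hard"]
data = ["12", "23", "34", "45", "56", "67", "78", "89"]
rows = abut_recycle(file1, file2, data)
for row in rows:
    print(row)
```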
dm: number lines
dm can number its input lines with its
special variables INLINE, which always contains the input line number,
and INPUT, which always contains the current input line.
dm INLINE INPUT < data
dm: conditional algebraic combinations of columns
dm can
produce algebraic combinations of columns. The following command reads from
data and produces the ratio of columns 2 and 1 with column 3 added on.
dm x2/x1+x3 < data
Transformations can be based on conditions. For example, if x1, the value in column 1, in the above example is 0, then dm will exit after producing an error message like:
dm: division by zero. input line 12 expr[1].
To avoid this problem, the following will do the division only if x1 is non-zero.
dm "if x1 then x2/x1+x3 else 0" < data
probdist: probability/statistic conversion
probdist can
convert probabilities to distribution statistics and vice versa as seen
in tables at the end of most statistics textbooks. Many distributions are
supported, including: the normal z, binomial, chi-square, F, and t. The
following will print the two-tailed probability of an obtained t statistic of
2.5 with 20 degrees of freedom.
probdist prob t 20 2.5
0.021234
Similarly, the following will print the two-tailed probability of an F ratio of 6.25 with 1 and 20 degrees of freedom.
probdist prob F 1 20 6.25
0.021234
These results are the same because of the relationship between the t and F distributions.
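The relationship in question is the standard result that the square of a t statistic with a given number of degrees of freedom follows an F distribution with 1 and that many degrees of freedom. Written out for this example (the subscript notation is an assumption of this note, not the handbook's):

```latex
t_{\nu}^{2} \sim F_{1,\nu}
\quad\Longrightarrow\quad
P\bigl(|t_{20}| \ge 2.5\bigr)
  = P\bigl(F_{1,20} \ge 2.5^{2}\bigr)
  = P\bigl(F_{1,20} \ge 6.25\bigr)
```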
The following prints the critical value (also called the quantile) in the chi-square distribution with 5 degrees of freedom to obtain a significance level of .05.
probdist crit chisq 5 .05
11.070498
Both probabilities and critical values in the normal z distribution are based on the lower tail of the distribution, accumulating from -oo toward +oo, so the z value that produces the .05 level is obtained with the following.
probdist crit z .05
-1.644854
The critical value for the 99th percentile is found with the following.
probdist crit z .99
2.326348
Binomial distribution critical values are treated differently from those of the continuous distributions. For the binomial distribution based on five trials and a probability of success of one half, the critical value for a one-tailed test at the .05 level is:
probdist crit binomial 5 1/2 .05
5
even though the probability of 5 successes is proportionally much less than .05:
probdist prob binomial 5 1/2 5
0.031250
This is because the binomial distribution is discrete. Not only are critical values conservative, sometimes there may be no possible value; there is no way to get a less probable event than five out of five successes:
probdist crit binomial 5 1/2 .01
6
Here, probdist is returning an impossible value (one with zero probability).
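The binomial results above can be checked by direct computation. The following Python sketch assumes the one-tailed test is an upper-tail test and that the critical value is the smallest count whose upper-tail probability does not exceed the significance level; that reading matches the three outputs shown, but it is not taken from probdist's source, and the function names are hypothetical.

```python
from math import comb

def upper_tail(n, p, k):
    """P(X >= k) for X distributed as Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def critical_value(n, p, alpha):
    """Smallest k with P(X >= k) <= alpha, or n + 1 when no count qualifies."""
    for k in range(n + 1):
        if upper_tail(n, p, k) <= alpha:
            return k
    return n + 1

print(upper_tail(5, 0.5, 5))         # 0.03125
print(critical_value(5, 0.5, 0.05))  # 5
print(critical_value(5, 0.5, 0.01))  # 6
```

Four or more successes out of five has probability 6/32 = 0.1875, which exceeds .05, so 5 is the first count that qualifies even though its own probability, 1/32, is well below .05.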
ranksort: convert data to ranks
ranksort can rank order
data from numerical data columns. For the input:
1 95 4.3
2 113 5.2
3 89 4.5
4 100 5.0
5 89 4.5
ranksort would produce:
1 3 1
2 5 5
3 1.5 2.5
4 4 4
5 1.5 2.5
The ties in the second and third columns get the average rank of the values for which they are tied. Once data are ranksorted, further ranksorting has no effect. With rank orders within columns, rank order statistics (e.g., Spearman rank order correlation, average group rank) can be computed by parametric programs like pair or regress.
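The average-rank treatment of ties can be sketched in Python for a single column. This is a minimal illustration, not ranksort's implementation, and the function name average_ranks is hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Find the end of the run of values tied with the one at pos.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Positions pos..end are 0-based, so their ranks are pos+1..end+1.
        shared = (pos + 1 + end + 1) / 2
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks

print(average_ranks([95, 113, 89, 100, 89]))     # [3.0, 5.0, 1.5, 4.0, 1.5]
print(average_ranks([4.3, 5.2, 4.5, 5.0, 4.5]))  # [1.0, 5.0, 2.5, 4.0, 2.5]
```

Applying the function to its own output returns the same ranks, which is the sense in which further ranksorting has no effect.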
maketrix: form a matrix format file
maketrix reads its
data, one whitespace separated string at a time from its free format input, and
produces a multicolumn output.
series 1 20 | maketrix 5
The above produces a five column output.
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
perm: permute lines
perm, with no options, randomizes its
input lines. It can randomize output from programs like series.
series 1 20 | perm
A subset of this permutation is a sample without replacement. The following is a sample of size 10 from the file data.
perm < data | dm "if INLINE <= 10 then INPUT else EXIT"
perm can be supplied a seed for its random number generator, to replicate a random permutation.
series 1 20 | perm -s 5762 | maketrix 5
The above produces (with my system's random number generator):
18 7 10 13 2
14 11 19 15 20
1 3 9 6 16
8 17 12 5 4
perm can also put its lines in alphabetical or numerical order. For example, the output from the previous example could be put into ascending order (according to the first number on each line) with:
series 1 20 | perm -s 5762 | maketrix 5 | perm -n
This produces:
1 3 9 6 16
8 17 12 5 4
14 11 19 15 20
18 7 10 13 2
dsort: sort data lines by multiple keys
The last example of the
perm filter showed how lines can be ordered according to the numerical
value in the first column. dsort can sort lines based on numerical or
alphabetical values in any column. For example, the following command sorts
the previous example matrix in ascending order of the values in the third
column.
series 1 20 | perm -s 5762 | maketrix 5 | dsort -n 3
This produces:
1 3 9 6 16
18 7 10 13 2
8 17 12 5 4
14 11 19 15 20
If there were ties in a column, dsort could sort by additional key columns.
transpose: transpose matrix format file
transpose flips
rows and columns in its input. For the input:
1 2 3 4
5 6 7 8
9 10 11 12
transpose produces:
1 5 9
2 6 10
3 7 11
4 8 12
The input to transpose does not have to be regular, nor does it have to be numerical.
one two three
four five
six
seven eight
nine ten eleven
For the above input, transpose produces the following.
one four six seven nine
two five eight ten
three eleven
Note that with regular inputs, the transposition of a transposition yields the original. This is not necessarily so with data as in the above input and output. The above output piped through another pass of transpose produces a result different from the original input.
one two three
four five eleven
six eight
seven ten
nine
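The behavior with irregular input can be sketched in Python. The sketch assumes that a short row contributes nothing to later columns and that each output row is packed to the left. That rule reproduces both outputs shown above, but it is not taken from transpose's source. The function name transpose_ragged is hypothetical.

```python
def transpose_ragged(rows):
    """Transpose rows of fields; short rows contribute nothing to later columns."""
    width = max((len(row) for row in rows), default=0)
    return [[row[col] for row in rows if col < len(row)] for col in range(width)]

original = [
    ["one", "two", "three"],
    ["four", "five"],
    ["six"],
    ["seven", "eight"],
    ["nine", "ten", "eleven"],
]
once = transpose_ragged(original)
twice = transpose_ragged(once)
for row in twice:
    print(" ".join(row))
```

Because the fields are packed to the left, eleven moves from the third column into the second on the way back, which is why the second pass does not restore the original.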
reverse: reverse lines, columns, characters
reverse can
reverse the lines, fields, or characters in its input. It can provide easier
access to the last lines in a file, or the last columns on lines. To get the
last 10 lines in a file, we can reverse the file, get the first 10 lines, and
then reverse those 10 lines.
reverse < data | dm "if INLINE GT 10 then EXIT else INPUT" | reverse
To get the last two columns in a file is easier.
reverse -f < data | dm s2 s1
Here, dm is used for column extraction, and rather than call reverse a second time, what were the last two columns before reversal are listed in the opposite order.
colex: reorder columns, reformat columns
colex is a
column extraction program that shares some of the functionality of dm
and reverse. colex is faster and has a simpler syntax than
dm and has data formatting capabilities. Suppose a matrix dataset
with 10 columns is created with the following.
series 1 50 | maketrix 10
colex can extract the last five columns followed by the first five with the command:
series 1 50 | maketrix 10 | colex 6-10 1 2 3 4 5
Either ranges of columns or single columns can be given. The above command produces:
6   7   8   9   10  1   2   3   4   5
16  17  18  19  20  11  12  13  14  15
26  27  28  29  30  21  22  23  24  25
36  37  38  39  40  31  32  33  34  35
46  47  48  49  50  41  42  43  44  45
Note in the previous example how the numbers line up on the left, rather than in the customary format that lines up the unit digits. This is because colex puts tabs between columns, and it is not a problem because |STAT programs read data in free format. colex can print its columns in several numerical formats as well as the default string format. The numerical formatting can round values to some number of decimal places (such as zero, for whole numbers). The option -F 4i would tell colex to format all the columns as integers, each four spaces wide, and the -t option would tell colex not to place a tab between columns. A format can also be assigned to individual columns by placing the format before each range of columns. For example, the following variation on the previous command would print columns 6-10 in a money format with two digits after the decimal point, and columns 1-5 as integers four wide.
series 1 50 | maketrix 10 | colex -t 6.2n6-10 4i1-5
  6.00  7.00  8.00  9.00 10.00   1   2   3   4   5
 16.00 17.00 18.00 19.00 20.00  11  12  13  14  15
 26.00 27.00 28.00 29.00 30.00  21  22  23  24  25
 36.00 37.00 38.00 39.00 40.00  31  32  33  34  35
 46.00 47.00 48.00 49.00 50.00  41  42  43  44  45
dm: reorder columns
dm, like colex, can reorder
columns. However, it does not allow the specification of ranges of columns.
The above example of colex could be done with dm with similar
results.
series 1 50 | maketrix 10 | dm s6 s7 s8 s9 s10 s1 s2 s3 s4 s5
abut: paste corresponding lines from files
abut can join
data in separate files beside one another. In the usual case, abut
takes N files with K lines and produces 1 file with K lines. Suppose the files
height and weight contain the respective heights and weights
of the same people. Each line in each file contains one height or weight.
These could be plotted with the plotting option on the pair program
with the following command.
abut height weight | pair -p
dm: conditional data extraction
dm can extract subsets of
its input, either by columns or by lines. To extract columns of data, each
extracted column is specified with the number of the column preceded by the
letter s. The following extracts columns 8, 2, and 11, in that order.
dm s8 s2 s11
dm can extract lines of data by using its built-in line skipping expression SKIP. The following will extract lines 50 to 100.
dm "if INLINE >= 50 & INLINE <= 100 then INPUT else SKIP"
Line extraction is more awkward than column extraction, but column extraction is the more common need.
colex: quick column extraction
colex can extract
individual columns, or ranges of columns. For column extraction, it is easier
to use and faster than dm. The following extracts, in order, columns
8, 2, and 11.
colex 8 2 11
linex: line extraction
linex can extract individual lines
(by number), or ranges of lines. The following extracts, in order, lines 8, 2,
and 11.
linex 8 2 11
To extract lines 50 to 100, you could type:
linex 50-100
or you could even extract them in reverse order:
linex 100-50
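The way single line numbers and ranges, including descending ranges such as 100-50, expand into an ordered list of lines can be sketched in Python. The function names are hypothetical, and silently skipping line numbers beyond the end of the input is an assumption of this sketch, not documented linex behavior.

```python
def expand_line_specs(specs):
    """Expand specs such as '8' or '100-50' into an ordered list of line numbers."""
    wanted = []
    for spec in specs:
        first, sep, last = spec.partition("-")
        lo = int(first)
        hi = int(last) if sep else lo
        # A range whose start exceeds its end is walked downward.
        step = 1 if hi >= lo else -1
        wanted.extend(range(lo, hi + step, step))
    return wanted

def extract_lines(lines, specs):
    """Return the requested lines (numbered from 1), in the order requested."""
    return [lines[n - 1] for n in expand_line_specs(specs) if 1 <= n <= len(lines)]

lines = ["line %d" % n for n in range(1, 121)]
print(extract_lines(lines, ["8", "2", "11"]))
print(extract_lines(lines, ["100-50"])[:3])
```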
validata: data validation
validata will report for its
input the number of columns, data-types of columns, and for columns with
numerical values, the maxima and minima. validata reports any
inconsistencies in the number of columns in its input. Floating point numbers
can be entered in scientific notation. For the input:
1 2 3
4 5 6
7 2E2 end
5 1e-3
validata's output is:
validata: Variable number of columns at line 4
Col   N  NA alnum alpha   int float other type    min    max
  1   4   0     4     0     4     4     0 int       1      7
  2   4   0     3     0     2     4     0 float 0.001    200
  3   3   0     3     1     2     2     0 alnum     3      6
dm: conditional data validation
dm can find exceptional
cases in its input. A simple case is non-numerical input, which can be checked
with dm's number function.
dm "if !number(s1) then 'bad input on line' else SKIP" INLINE
dm can check for specific values, ranges of values, or specific relations of values. The following prints all lines in data with the string bad in them.
dm "if 'bad' C INPUT then INPUT else SKIP"
The input line number could be prepended.
dm INLINE "if 'bad' C INPUT then INPUT else SKIP"
This is possible because dm will produce no output for skipped lines, regardless of expression order. The following prints all lines where column 3 is greater than column 2.
dm "if x3 > x2 then INPUT else SKIP"
dm can print lengths of strings and check for numerical fields:
dm len(s1) number(s1)
will print the length of column 1 strings, and report if they are numerical (0 for non-numbers, 1 for integers, 2 for real numbers, 3 for exponential scientific notation numbers).
© 1986 Gary Perlman