Updated 2013-01-18 21:24:42 by pooryorick

TR - Testing some data for normality (whether they are normally distributed or not) is often needed in statistics. Here, I have implemented one of these normality tests, called the Lilliefors test [1]. It is based on the famous Kolmogorov-Smirnov-Test [2].

I know there are better tests than this, so feel free to add Shapiro-Wilks, Anderson-Darling and the like ...

The procedure looks like this (and modelled after the same function in R [3]):
proc LillieforsFit {values} {
    # calculate the goodness of fit according to Lilliefors
    # (goodness of fit to a normal distribution)
    # values -> list of values to be tested for normality
    # (these values are sampled counts)

    # calculate standard deviation and mean of the sample:
    set n [llength $values]
    if {$n < 5} {error "insufficient data (n<5)"}
    set sd [sd $values]
    set mean [mean $values]
    # sort the sample for further processing:
    set values [lsort -real $values]
    # standardize the sample data (Z-scores):
    foreach x $values {lappend stdData [expr {($x - $mean)/double($sd)}]}

    # compute the value of the distribution function at every sampled point:
    foreach x $stdData {lappend expData [pnorm $x]}

    # compute D+:
    set i 0
    foreach x $expData {
        incr i
        lappend dplus [expr {$i/double($n)-$x}]
    set dplus [lindex [lsort -real $dplus] end]

    # compute D-:
    set i 0
    foreach x $expData {
        incr i
        lappend dminus [expr {$x-($i-1)/double($n)}]
    set dminus [lindex [lsort -real $dminus] end]
    # calculate the test statistic D
    # by finding the maximal vertical difference
    # between the samle and the expectation:
    set D [expr {$dplus > $dminus ? $dplus : $dminus}]
    # we now use the modified statistic Z,
    # because D is only reliable 
    # if the p-value is smaller than 0.1
    ###if {$n < 30} {puts "warning: n=$n<30"}
    return [expr {$D * (sqrt($n) - 0.01 + 0.831/sqrt($n))}]

This function uses a procedure called pnorm which is the cdf (=cumulative distribution function) of the standard normal distribution (see also gaussian Distribution). It looks like this:
proc pnorm {x} {
    # cumulative distribution function (cdf)
    # for the standard normal distribution like in the statistical software 'R'
    # (mean=0 and sd=1)
    # x -> value for which the cdf should be calculated
    set sum [expr {double($x)}]
    set oldSum 0.0
    set i 1
    set denom 1.0
    while {$sum != $oldSum} {
        set oldSum $sum
        incr i 2
        set denom [expr {$denom*$i}]
        #puts "$i - $denom"
        set sum [expr {$oldSum + pow($x,$i)/$denom}]
    return [expr {0.5 + $sum * exp(-0.5 * $x*$x - 0.91893853320467274178)}]

This procedure is accurate until the last digit for number up to 6 or so. There is a quicker alternative if you can afford to loose some accuracy:
proc pnorm_quicker {x} {
    # cumulative distribution function (cdf)
    # for the standard normal distribution like in 'R'
    # (mean=0 and sd=1)
    # quicker than the former but not so accurate
    # x -> value for which the cdf should be calculated
    set n [expr {abs($x)}]
    set n [expr {1.0 + $n*(0.04986735 + $n*(0.02114101 + $n*(0.00327763 \
        + $n*(0.0000380036 + $n*(0.0000488906 + $n*0.000005383)))))}]
    set n [expr {1.0/pow($n,16)}]
    if {$x >= 0} {
        return [expr {1 - $n/2.0}]
    } else {
        return [expr {$n/2.0}]

Now, an example: We have 32 birthweight data (taken from Henry C. Thode: "Testing for normality") here as weights in ounces. We want to know if these are normally distributed:
puts [LillieforsFit {72 112 111 107 119  92 126  80  81  84 115 118 128 128 123 116 125 126 122 \
126 127 86 142 132  87 123 133 106 103 118 114 94}]

We get a value for D which is 0.82827415657 or approx. 0.828. Now we just need to compare this value with the value 0.895. This is the D value for the 0.05 critical level (alpha, type one error). Since our value is less the critical value, we might assume normality (or strictly speaking: under the assumption that our data are normally distributed, there is a possibility of more than 5% (about 9.5% here), that the test detects a difference in the sample by pure chance). (... I always get confused at this point ...)

Other critical levels are:
array set levels {
D       % 
0.741   20
0.775   15
0.819   10
0.895    5
1.035    1

Further reference: http://www.jstatsoft.org/v11/i04/v11i04.pdf (Evaluating the Normal distribution)

AM (28 february 2006) It seems to me worthwhile to incorporate this in the statistics module of Tcllib. TR No problem with me. Just go ahead. Or should I do this myself?

AM We will need some streamlining - let me do it ;).

By the way: the article mentions that the proposed C functions must use extended precision (long doubles) to achieve the claimed accuracy. As Tcl does not support that precision, how large is the effect on the accuracy?

TR Taking the above paper and calculating pnorm-values, I get this (Tcl 8.4):
x     pnorm (Tcl above, reality below)
0.1   0.539827837277
4.5   0.999996602327
5.6   0.999999989282
6.7   0.99999999999
7.8   1.0

-1.1  0.135666060946
-3.3  0.000483424142384
-5.5  0.0000000189895630887 (converted from 1.89895630887e-08)

I can see differences at x=7.8 and x=-5.5 but these are quite small. I also wonder, why the result for x=-3.3 has more significant digits than the other values.

AM I would not worry about the number of digits printed - that has a lot to do with trailing zeros ... If you want to compare things in detail, you should use a definite format, like %20.14g or the like.

TR Oh, I was wrong! You were hinting me in the right direction, thanks. Each of the above numbers has 12 significant digits, in the correct mathematical definition. So all is well here.

AM (29 march 2006) I have (finally) incorporated the code in the statistics package of Tcllib.