Substantively Weighted Least Squares (SWLS)

EKB I have just been introduced to substantively weighted analytical techniques (SWAT), and in particular substantively weighted least squares (SWLS), through the book What Works: A New Approach to Program and Policy Analysis [L1 ]. It is also discussed in an online paper, "Best Practices Research: A Methodological Guide for the Perplexed" [L2 ]. It's a very interesting technique for policy analysis, and I created a Tcl implementation of it. (There's also an R language implementation available from the author at [L3 ].)

The motivation for SWLS is that in many studies that include a linear regression, the residuals are more interesting than the fitted line. For example, when fitting a variable that measures the performance of government agencies against explanatory variables such as staffing levels, financial resources, and so on, the usual regression coefficients describe the performance of the average agency at given values of those variables. But what you're usually interested in is the agencies that perform above average given their level of resources, so that you can try to identify what they are doing that makes them better. SWLS (and SWAT more generally) is a systematic way to look for patterns among the above- or below-average observations within a data set: an ordinary, equally weighted regression is run first, the observations whose jackknifed residuals clear a threshold are kept at full weight, and the regression is then repeated while the weight on the remaining observations is stepped down, as sketched below.
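
To make the reweighting concrete, here is a minimal sketch (separate from the implementation below) of the weight a flagged, below-threshold point carries in each of the ten successive fits:

 # Sketch only: the weight carried by a point whose jackknifed residual
 # falls below the threshold (0.7) on the first, equally weighted fit
 set swls_beta 10
 set wt 1.0
 for {set run 1} {$run <= $swls_beta} {incr run} {
     puts "Run $run: flagged points weighted [format %.1f $wt]"
     set wt [expr {$wt - 1.0/$swls_beta}]
 }
 # Run 1: weighted 1.0; Run 2: 0.9; ...; Run 10: 0.1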

The code is below, and a sample data file, used in the What Works book, is at the end of this page.

Note: The results produced by this code agree with the results in What Works except for a few percent difference in the bounds on the 95% confidence intervals. I can't track down the source of the discrepancy... EKB OK, I tracked down the discrepancy: they were using the asymptotic value of the inverse t-distribution for a very large number of degrees of freedom (1.96), rather than the value for the 42 degrees of freedom of the sample data file. The script below finds the appropriate t value for the degrees of freedom in the problem.
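
For example, once the script below has been sourced, the critical value it uses for the sample file's 42 degrees of freedom comes out near 2.018 rather than 1.96 (digits truncated):

 % swat::tstat 42
 2.018...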

Another note: Since the original post, I have improved the filltext proc so that its output is laid out more nicely and is more informative.

 package require Tcl 8.4
 package require Tk
 package require math::linearalgebra 1.0
 package require math::statistics 0.1.1


 namespace eval swat {
    variable epsilon 1.0e-7
    variable swls_threshold 0.7
    variable swls_beta 10
    
    namespace import \
        ::math::linearalgebra::mkMatrix \
        ::math::linearalgebra::mkVector \
        ::math::linearalgebra::mkIdentity \
        ::math::linearalgebra::mkDiagonal \
        ::math::linearalgebra::getrow \
        ::math::linearalgebra::setrow \
        ::math::linearalgebra::getcol \
        ::math::linearalgebra::setcol \
        ::math::linearalgebra::getelem \
        ::math::linearalgebra::setelem \
        ::math::linearalgebra::dotproduct \
        ::math::linearalgebra::matmul \
        ::math::linearalgebra::add \
        ::math::linearalgebra::sub \
        ::math::linearalgebra::solveGauss
    
    # NOTE: The authors of math::linearalgebra forgot to
    #   export ::math::linearalgebra::transpose
        
    ########################################
    #
    # t-stats
    #
    ########################################
    # swat::tstat n ?alpha?
    # Returns the critical value of Student's t distribution with n degrees
    # of freedom for a two-sided test at significance alpha (default 0.05),
    # i.e. the inverse CDF evaluated at 1 - alpha/2
    # Iterates until the CDF of the result is within epsilon of the target
    proc tstat {n {alpha 0.05}} {
        variable epsilon
        variable tvals
        
        if {[info exists tvals($n:$alpha)]} {
            return $tvals($n:$alpha)
        }
        
        set deltat [expr {100 * $epsilon}]
        # Two-sided test: put alpha/2 in each tail
        set ptarg [expr {1.000 - $alpha/2.0}]
        set maxiter 100
        
        # Initial value for t
        set t 2.0
        
        set niter 0
        while {abs([::math::statistics::cdf-students-t $n $t] - $ptarg) > $epsilon} {
            set pstar [::math::statistics::cdf-students-t $n $t]
            set pl [::math::statistics::cdf-students-t $n [expr {$t - $deltat}]]
            set ph [::math::statistics::cdf-students-t $n [expr {$t + $deltat}]]
            
            set t [expr {$t + 2.0 * $deltat * ($ptarg - $pstar)/($ph - $pl)}]
            
            incr niter
            if {$niter == $maxiter} {
                error "swat::tstat: Did not converge after $niter iterations"
                return {}
            }
        }
        
        # Cache the result to shorten the call in future
        set tvals($n:$alpha) $t
        
        return $t
    }
        
    ########################################
    #
    # Weighted Least Squares
    #
    ########################################
    # swat::wls w [list y x's] w [list y x's] ...
    # Returns a list of:
    #   adjusted R-squared
    #   coefficients of the x's in the fit (constant last)
    #   jackknifed residuals
    #   standard errors of the coefficients
    #   95% confidence bounds for the coefficients
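    # Example usage (hypothetical numbers; each weight is followed by its
    # point, given as {y x1 x2}):
    #   swat::wls 1.0 {2.0 1.0 3.0} 1.0 {2.5 1.5 2.0} 0.9 {1.0 0.5 4.0} \
    #             1.0 {3.0 2.0 1.0} 1.0 {2.2 1.2 2.5} 0.8 {1.5 0.7 3.5}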
    proc wls {args} {
        
        # Fill the matrices of x & y values, and weights
        # For n points, k coefficients
        
        # The number of points is equal to half the arguments (n weights, n points)
        set n [expr {[llength $args]/2}]
        
        set firstloop true
        # Accumulate the weighted sum of y values for the weighted mean
        set ysum 0
        # Add up the weights
        set wtsum 0
        # Count over rows (points) as you go
        set point 0
        foreach {wt pt} $args {
            
            # Check inputs
            if {[string is double $wt] == 0} {
                error "swat::wls: Weight \"$wt\" is not a number"
                return {}
            }
            
            ## -- Check dimensions, initialize
            if {$firstloop} {
                # k = num of vals in pt = 1 + number of x's (because of constant)
                set k [llength $pt]
                if {$n <= $k + 1} {
                    error "swat::wls: Too few degrees of freedom: $k variables but only $n points"
                    return {}
                }
                set X [mkMatrix $n $k]
                set y [mkVector $n]
                set I_x [mkIdentity $k]
                set I_y [mkIdentity $n]
                
                set firstloop false
                
            } else {
                # Have to have same number of x's for all points
                if {$k != [llength $pt]} {
                    error "swat::wls: Point \"$pt\" has wrong number of values (expected $k)"
                    # Clean up
                    return {}
                }
            }
            
            ## -- Extract values from set of points
            # Set this point's y value in the y vector
            set yval [expr {double([lindex $pt 0])}]
            setelem y $point $yval
            set ysum [expr {$ysum + $wt * $yval}]
            set wtsum [expr {$wtsum + $wt}]
            # Add x-values to the x-matrix
            set xrow [lrange $pt 1 end]
            # Add the constant (value = 1.0)
            lappend xrow 1.0
            setrow X $point $xrow
            # Create list of weights & square root of weights
            lappend w [expr {double($wt)}]
            lappend sqrtw [expr {sqrt(double($wt))}]
            
            incr point
            
        }

        set ymean [expr {double($ysum)/$wtsum}]
        set W [mkDiagonal $w]
        set sqrtW [mkDiagonal $sqrtw]
        
        # Calculate sum of squared differences for the x's
        for {set r 0} {$r < $k} {incr r} {
            set xsqrsum 0.0
            set xmeansum 0.0
            # Calculate sum of squared differences as: sum(x^2) - (sum x)^2/n
            for {set t 0} {$t < $n} {incr t} {
                set xval [getelem $X $t $r]
                set xmeansum [expr {$xmeansum + double($xval)}]
                set xsqrsum [expr {$xsqrsum + double($xval * $xval)}]
            }
            lappend xsqr [expr {$xsqrsum - $xmeansum * $xmeansum/$n}]
        }
        
        ## -- Set up the X'WX matrix
        set XtW [matmul [::math::linearalgebra::transpose $X] $W]
        set XtWX [matmul $XtW $X]
        
        # Invert
        set M [solveGauss $XtWX $I_x]
        
        set beta [matmul $M [matmul $XtW $y]]
        
        ### -- Residuals & jackknife statistics
        # 1) Generate list of diagonals of the hat matrix
        set H [matmul $X [matmul $M $XtW]]
        for {set i 0} {$i < $n} {incr i} {
            lappend h_ii [getelem $H $i $i]
        }

        set R [matmul $sqrtW [matmul [sub $I_y $H] $y]]
        set yhat [matmul $H $y]
        
        # 2) Generate list of residuals, sum of squared residuals, r-squared
        set sstot 0.0
        set ssreg 0.0
        # Note: Relying on representation of Vector as a list for y, yhat
        foreach yval $y wt $w yhatval $yhat {
            set sstot [expr {$sstot + $wt * ($yval - $ymean) * ($yval - $ymean)}]
            set ssreg [expr {$ssreg + $wt * ($yhatval - $ymean) * ($yhatval - $ymean)}]
        }
        set r2 [expr {double($ssreg)/$sstot}]
        set adjr2 [expr {1.0 - (1.0 - $r2) * ($n - 1)/($n - $k)}]
        set sumsqresid [dotproduct $R $R]
        
        # 3) Calculate the overall variance, the variance per point & jackknifed stderr
        set s2 [expr {double($sumsqresid) / double($n - $k)}]
        for {set r 0} {$r < $n} {incr r} {
            set wt [lindex $w $r]
            set hprime [expr {sqrt($wt) * (1.0 - [lindex $h_ii $r])}]
            set resid [getelem $R $r]
            set newjns2i [expr {(($n - $k) * $s2 - double($resid * $resid)/$hprime)/($n - $k - 1)}]
            lappend jns2i $newjns2i
            lappend jacknife [expr {double($resid)/sqrt($newjns2i * $hprime)}]
        }
        
        ### -- Confidence intervals for coefficients
        set tvalue [tstat [expr {$n - $k}]]
        for {set i 0} {$i < $k} {incr i} {
            set stderr [expr {sqrt($s2 * [getelem $M $i $i])}]
            set mid [lindex $beta $i]
            lappend stderrs $stderr
            lappend confinterval [list [expr {$mid - $tvalue * $stderr}] [expr {$mid + $tvalue * $stderr}]]
        }

        return [list $adjr2 $beta $jacknife $stderrs $confinterval]
    }
    
    ########################################
    #
    # Substantively Weighted Least Squares
    #
    ########################################
    # swat::swls [list y x's] [list y x's] ...
    # Returns a list with one entry per run (weighting step), each containing:
    #   number of points with weight 1.0
    #   adjusted R-squared
    #   coefficients of the x's (constant last)
    #   standard errors of the coefficients
    #   95% confidence intervals for the coefficients
    #   weighted averages of the x values
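    # Example usage (hypothetical data; each point is {y x1 x2}):
    #   set runs [swat::swls {2.0 1.0 3.0} {2.5 1.5 2.0} {1.0 0.5 4.0} \
    #                        {3.0 2.0 1.0} {2.2 1.2 2.5} {1.5 0.7 3.5}]
    #   foreach run $runs {
    #       puts "[lindex $run 0] points at weight 1.0, adj. R-squared [lindex $run 1]"
    #   }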
    proc swls {args} {
        variable swls_threshold
        variable swls_beta
        
        set numpoints [llength $args]
        
        # Initialize weights to 1.0 for all points
        for {set i 0} {$i < $numpoints} {incr i} {
            set wt($i) 1.0
        }
        
        # Start of big loop over weightings
        for {set i 0} {$i < $swls_beta} {incr i} {
            # Prepare the arg to send to wls
            set wtpts {}
            set nhigh 0
            set sumwts 0
            for {set j 0} {$j < $numpoints} {incr j} {
                set pts [lindex $args $j]
                if {![info exists numxs]} {
                    set numxs [expr {[llength $pts] - 1}]
                }
                lappend wtpts $wt($j) $pts
                set sumwts [expr {$sumwts + $wt($j)}]
                for {set n 1} {$n <= $numxs} {incr n} {
                    if {$j > 0} {
                        set sumx($n) [expr {$sumx($n) + $wt($j) * [lindex $pts $n]}]
                    } else {
                        set sumx($n) [expr {$wt($j) * [lindex $pts $n]}]
                    }
                }
                if {$wt($j) == 1.0} {
                    incr nhigh
                }
            }
            set xave {}
            for {set n 1} {$n <= $numxs} {incr n} {
                lappend xave [expr {double($sumx($n))/$sumwts}]
            }
            set retval [eval wls $wtpts]
            # Adjusted R-squared is first, then coefficients, then jackknifed residuals
            set r2 [lindex $retval 0]
            lappend coeffs [list $nhigh $r2 [lindex $retval 1] [lindex $retval 3] [lindex $retval 4] $xave]
            set jacknife [lindex $retval 2]
            # Figure out the next set of weights
            for {set j 0} {$j < $numpoints} {incr j} {
                if {$i == 0 && [lindex $jacknife $j] < $swls_threshold} {
                    set wt($j) [expr {1.0 - 1.0/double($swls_beta)}]
                }
                if {$i > 0 && $wt($j) < 1.0} {
                    set wt($j) [expr {$wt($j) - 1.0/double($swls_beta)}]
                }
            }
        }
        return $coeffs
    }
 }

 ##########################
 #
 # GUI
 #
 ##########################
 
 set fileinfo(dir) [file dirname $argv0]
 set fileinfo(name) ""
 
 proc fileopen {} {
    global fileinfo
    
    set fname [tk_getOpenFile -initialdir $fileinfo(dir) -initialfile $fileinfo(name) \
            -defaultextension ".dat" -filetypes { {{Data files} {.dat}} {{Text Files} {.txt} } {{All Files} *} }]
    
    if {$fname == ""} {return}
    
    if {[catch {open $fname r} fhndl]} {
        tk_messageBox -icon error -message "Error opening file [file tail $fname]: $fhndl"
        return
    }
    
    set names {}
    set points {}

    set firstline true
    while {[gets $fhndl line] != -1} {
        # Ignore blank lines & comments
        if {[string trim $line] == ""} {continue}
        if {[string index [string trimleft $line] 0] == "#"} {continue}
        
        if {$firstline} {
            set names [split $line "\t"]
            set firstline false
            continue
        }
        
        lappend points [split $line "\t"]
    } 
    
    close $fhndl
    
    set fileinfo(dir) [file dirname $fname]
    set fileinfo(name) [file tail $fname]
    
    wm title . "$fileinfo(name) - SWLS"
    
    filltext $names $points
    
 }
 
 proc filltext {names points} {
    .t delete 1.0 end
    # Go ahead and update, since there can be a delay in loading
    update
    
    set coeffs [eval swat::swls $points]
    
    set nstart [lindex [lindex $coeffs 0] 0]
    set nend [lindex [lindex $coeffs end] 0]
    
    .t insert end "Explaining: [lindex $names 0]\n\n"
    .t insert end "Number of data points: $nstart\n"
    .t insert end "Number in set of interest: $nend\n\n"
    .t insert end [format "%-15s\t% 10s\t% 11s\t%25s\t% 10s\n\n" Variable Coeff "Std Error" "-- 95% Conf Interval --" "Ave Value"]
    
    set i 1
    foreach val $coeffs {
        set r2 [lindex $val 1]
        set xcoeffs [lindex $val 2]
        set stderrs [lindex $val 3]
        set cis [lindex $val 4]
        set xaves [lindex $val 5]
        .t insert end "== Run $i ==\n"
        incr i
        set j 1
        foreach xave $xaves xc $xcoeffs ci $cis se $stderrs {
            if {$j == [llength $xcoeffs]} {
                set lbl "Constant"
                set avestring ""
            } else {
                set lbl "[lindex $names $j]"
                set avestring [format "% 10.3f" $xave]
            }
            .t insert end [format "%-15s\t% 10.3f\t\u00B1% 10.3f\t\[% 10.3f : % 10.3f\]\t%s\n" \
                 $lbl $xc $se [lindex $ci 0] [lindex $ci 1] $avestring]
            incr j
        }
        .t insert end [format "Adj. R-squared = %.3f\n\n" $r2]
    }
 }


 wm title . "SWLS"
 
 . configure -menu .mainmenu
 menu .mainmenu
 .mainmenu add cascade -label "File" -menu .mainmenu.f
 menu .mainmenu.f -tearoff false
 .mainmenu.f add command -label "Open" -command fileopen -underline 0
 
 scrollbar .sby -command {.t yview} -orient vertical
 scrollbar .sbx -command {.t xview} -orient horizontal
 text .t -width 70 -height 20 -wrap none -yscrollcommand {.sby set} \
    -xscrollcommand {.sbx set} -font "Courier 9"
 grid .t .sby -sticky nsew
 grid .sbx -sticky nsew
 grid columnconfigure . 0 -weight 1
 grid rowconfigure . 0 -weight 1
 
 focus -force .

You can try it out using the following data file. This is a file of data on child welfare agencies, used by the original SWAT creators in the book What Works, formatted for this Tcl app.

    Organizational Learning        ACES        Instability        Ambiguity        Staff Change        Divorce Rate        Slack        Expenditure
    1141.4        1.467351        70.43141        23.57        115.5502        26.2        3.470122        3440.159
    2667        1.754386        6.407348        32.1        39.5922        27.8        4.105816        8554.186
    307.3        0.4215852        63.41155        35.14        106.0642        29.4        5.610657        2544.263
    840.7        0        16.39695        23.46        50.57774        30.8        3.848957        2366.732
    482.5        1.18499        78.90607        31.95        -1.144        20.3        3.700461        4954.157
    550        2.072846        28.10363        17.81        25.76482        22.8        2.74241        2976.427
    514.1        0.607718        22.71782        22.23        18.3592        14.9        5.063129        4905.78
    1352.5        1.470588        6.602017        65.03        -232.2549        19.6        9.178106        5970.52
    1136.5        0.8285004        123.2406        21.44        60.60778        30.8        2.495392        2567.212
    1575.4        1.660879        87.06177        31.26        86.82364        22.2        2.100618        2589.531
    1170.6        0        13.04299        37.68        31.7781        19.1        4.491279        4141.247
    1679.6        0.9624639        7.197373        21.72        79.94753        27        3.074239        3216.067
    882.3        1.212856        181.2476        16.97        45.62724        17.4        1.902308        2469.215
    1347.5        1.782531        78.82232        11.84        33.76736        26.6        1.762038        1742.745
    1353.4        1.431127        11.54932        31.1        57.34841        16.6        3.434238        2659.697
    1363.4        2.40481        13.08474        18.11        72.44474        22.5        2.563138        3296.465
    1087.9        0.538648        26.32775        23.06        166.6439        21.9        2.649425        3330.793
    712.8        0.4703669        27.16063        25.46        35.4368        15.6        4.030189        3416.769
    1865.4        2.42915        9.96554        18.14        90.42884        20.9        3.209765        4109.007
    1088.4        1.234568        38.82103        40.84        24.8976        14.2        3.488844        5275.824
    1308.1        0.6671114        76.34886        26.11        82.58347        12.7        3.641929        4926.775
    3485.4        1.17421        197.8029        25.89        80.5744        17.7        1.753225        5419.784
    1946.4        1.128159        35.4882        26.85        70.3593        14.9        4.76418        5208.218
    1116.8        0.3858025        76.16611        16.39        225.721        21.8        4.611794        2539.369
    2047.5        0.5804953        54.3063        21.42        121.5814        22        2.799651        2754.369
    893.3        2.475247        7.206166        12.36        103.7574        22.9        3.882953        2083.009
    1779.2        0        26.99598        45.8        78.2787        17.6        3.511719        3982.897
    703.4        1.557632        3.533641        49.19        -32.1768        54.9        6.609568        4082.064
    695.1        2.714932        6.912465        43.87        32.34244        19.7        6.407141        2998.161
    1287        0.257732        62.02957        36.51        17.6954        15        4.470681        6310.163
    573        0        13.15338        28.29        48.06679        26        2.264973        2759.142
    955.6        0.3322627        151.1162        32.31        50.3237        15        3.284161        5634.305
    1296.1        1.187472        70.68908        23.75        31.70579        20.6        3.450963        2992.531
    1194.5        1.574803        5.132252        15.61        54.0354        15.4        2.777402        2679.454
    4299.2        3.565225        169.9304        22.62        136.621        20.3        2.841627        3072.191
    896.4        1.259843        66.21898        23.09        35.20083        32.9        3.420866        2381.565
    815.4        4.106776        53.44192        37.42        -176.8579        23.8        6.527769        4303.525
    2262.9        0.7524454        121.2754        51.96        18.3963        14.8        2.932871        4130.657
    1108.8        0.996016        10.88008        38.61        22.00537        16.1        2.421065        3339.35
    1314.2        1.404494        27.21783        25.78        36.989        17.4        1.432419        2533.139
    1268.9        4.207574        2.172794        17.05        23.44076        16.9        2.988671        2366.595
    952.8        0.8075914        94.00198        41.9        33.35452        26.6        1.970768        2031.763
    763.2        0.4034815        121.7529        51.39        53.86072        24.4        2.710884        1620.14
    1228.6        0        11.48006        21.33        73.45351        21.8        4.641821        5069.072
    1023.5        3.527337        3.049591        20.86        84.84604        18.9        4.317273        2983.782
    1646.3        1.272669        58.24525        19.83        105.2593        17.7        2.980112        3097.782
    2483.3        1.195695        80.93693        30.97        160.9784        24.7        6.063038        5996.734
    1045.3        0.5552471        15.81082        15.11        78.10522        22.3        3.786695        2222.728
    3866.9        1.210898        70.8133        24.03        60.605        15.2        3.335582        4972.425
    1450.2        0        7.217247        10.33        163.1716        29        2.623226        1860.076

Running this data file through the script should give the following output:

 Explaining: Organizational Learning

 Number of data points: 50
 Number in set of interest: 11

 Variable                    Coeff          Std Error          -- 95% Conf Interval --         Ave Value

 == Run 1 ==
 ACES                      284.795        ±    99.831        [    83.327 :    486.263]             1.265
 Instability                 3.815        ±     2.057        [    -0.336 :      7.967]            52.237
 Ambiguity                   9.004        ±    11.259        [   -13.718 :     31.727]            28.111
 Staff Change                4.617        ±     1.758        [     1.069 :      8.165]            55.651
 Divorce Rate               -1.682        ±    14.412        [   -30.766 :     27.402]            21.712
 Slack                    -125.335        ±    88.330        [  -303.593 :     52.923]             3.643
 Expenditure                 0.294        ±     0.078        [     0.136 :      0.452]          3617.571
 Constant                 -263.506        ±   591.866        [ -1457.939 :    930.927]        
 Adj. R-squared = 0.363

 ... Runs 2-9 omitted ...

 == Run 10 ==
 ACES                      333.752        ±   117.107        [    97.421 :    570.083]             1.133
 Instability                 6.311        ±     2.421        [     1.426 :     11.197]            63.724
 Ambiguity                  -0.572        ±    11.657        [   -24.097 :     22.954]            30.631
 Staff Change                5.445        ±     2.248        [     0.908 :      9.981]            64.046
 Divorce Rate              -16.887        ±    21.001        [   -59.269 :     25.496]            23.142
 Slack                       5.579        ±   117.953        [  -232.460 :    243.618]             3.829
 Expenditure                 0.342        ±     0.107        [     0.126 :      0.558]          3835.946
 Constant                    4.187        ±   877.228        [ -1766.131 :   1774.505]        
 Adj. R-squared = 0.630