Updated 2012-06-18 12:20:20 by RLE

EKB I have just been introduced to substantively weighted analytical techniques (SWAT), and in particular substantively weighted least squares (SWLS), through the book What Works: A New Approach to Program and Policy Analysis [1]. It's also discussed in an online paper, "Best Practices Research: A Methodological Guide for the Perplexed" [2]. It's a very interesting technique for policy analysis, and I created a Tcl implementation of it. (There's also an R implementation available from the author at [3].)

The motivation for SWLS is that in many studies that include a linear regression, the residuals are more interesting than the fitted line. For example, when fitting a variable measuring the performance of a government agency against explanatory variables such as staffing levels, financial resources, and so on, the usual regression parameters tell you about the performance of the average agency at different values of those variables. But what you're usually interested in is the performance of agencies that are above average given their level of resources, so that you can try to identify what they are doing that makes them better. SWLS (and, more generally, SWAT) is a systematic way to look for patterns among above- or below-average data points within a data set.
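Mechanically, SWLS refits the regression several times, gradually pulling down the weights of the observations that are not substantively interesting while the interesting ones keep weight 1. Here is a toy sketch of just that weighting schedule in Python (`swls_weights` is a hypothetical name of my own; the 0.7 residual threshold and the 10 passes match the defaults in the script below):

```python
def swls_weights(jackknife_resids, threshold=0.7, beta=10):
    """Yield the weight vector used on each of the beta regression passes.

    Points whose jackknifed residual is below the threshold (i.e., not
    "above average") are downweighted by 1/beta on every pass; the
    interesting points keep weight 1 throughout.
    """
    wt = [1.0] * len(jackknife_resids)
    for i in range(beta):
        yield list(wt)            # weights used for this pass's regression
        for j, r in enumerate(jackknife_resids):
            if i == 0 and r < threshold:
                wt[j] = 1.0 - 1.0 / beta
            elif i > 0 and wt[j] < 1.0:
                wt[j] -= 1.0 / beta

# One "interesting" point (residual 1.2) and one ordinary point (0.3):
passes = list(swls_weights([1.2, 0.3]))
# Pass 1 uses equal weights; by pass 10 the ordinary point is down to 0.1.
```

Each successive fit is thus pulled toward the substantively interesting subset, which is what makes the sequence of coefficient estimates informative.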

The code is below, and a sample data file, used in the What Works book, is at the end of this page.

Note: The results produced by this code agree with the results in What Works except for a few percent difference in the bounds on the 95% confidence interval. I can't track down the source of the discrepancy... EKB OK, I tracked down the discrepancy. They were using the asymptotic value of the inverse t-distribution as the number of degrees of freedom gets very large (1.96), rather than the value for the 42 degrees of freedom of the sample data file. The script below finds the appropriate t value for the degrees of freedom in the problem.
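To see the kind of search the script performs, here is a small Python sketch (my own illustration, not part of the Tcl code) that recovers that asymptotic 1.96 by bisecting the normal CDF for the 1 - α/2 quantile; the Tcl proc below does the analogous search against the t CDF for the problem's actual degrees of freedom:

```python
import math

def norm_cdf(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def z_crit(alpha=0.05, eps=1e-10):
    """Two-sided critical value: the z with norm_cdf(z) = 1 - alpha/2, by bisection."""
    target = 1.0 - alpha / 2.0
    lo, hi = 0.0, 10.0
    while hi - lo > eps:
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(round(z_crit(), 2))  # prints 1.96, the large-sample value used in What Works
```

With 42 degrees of freedom the t critical value is a bit larger (about 2.02), which accounts for the few-percent difference in the confidence bounds.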

Another note: After the original post, I improved the filltext proc so that its output is a little more nicely laid out and more informative.
```
package require Tcl 8.4
package require math::linearalgebra 1.0
package require math::statistics 0.1.1

namespace eval swat {
    variable epsilon 1.0e-7
    variable swls_threshold 0.7
    variable swls_beta 10

    namespace import \
        ::math::linearalgebra::mkMatrix \
        ::math::linearalgebra::mkVector \
        ::math::linearalgebra::mkIdentity \
        ::math::linearalgebra::mkDiagonal \
        ::math::linearalgebra::getrow \
        ::math::linearalgebra::setrow \
        ::math::linearalgebra::getcol \
        ::math::linearalgebra::setcol \
        ::math::linearalgebra::getelem \
        ::math::linearalgebra::setelem \
        ::math::linearalgebra::dotproduct \
        ::math::linearalgebra::matmul \
        ::math::linearalgebra::sub \
        ::math::linearalgebra::solveGauss

    # NOTE: The authors of math::linearalgebra forgot to
    #   export ::math::linearalgebra::transpose, so it is
    #   called by its fully qualified name below

    ########################################
    #
    # t-stats
    #
    ########################################
    # swat::tstat n ?alpha?
    # Returns the critical value of Student's t distribution
    # for a two-sided test at significance alpha (default 0.05),
    # iterating until the result is within epsilon of the target
    proc tstat {n {alpha 0.05}} {
        variable epsilon
        variable tvals

        if {[info exists tvals($n:$alpha)]} {
            return $tvals($n:$alpha)
        }

        set deltat [expr {100 * $epsilon}]
        # Two-sided test: target the 1 - alpha/2 quantile
        set ptarg [expr {1.000 - $alpha/2.0}]
        set maxiter 100

        # Initial value for t
        set t 2.0

        set niter 0
        while {abs([::math::statistics::cdf-students-t $n $t] - $ptarg) > $epsilon} {
            set pstar [::math::statistics::cdf-students-t $n $t]
            set pl [::math::statistics::cdf-students-t $n [expr {$t - $deltat}]]
            set ph [::math::statistics::cdf-students-t $n [expr {$t + $deltat}]]

            # Newton step, using (ph - pl)/(2 deltat) as a numerical derivative
            set t [expr {$t + 2.0 * $deltat * ($ptarg - $pstar)/($ph - $pl)}]

            incr niter
            if {$niter == $maxiter} {
                error "swat::tstat: Did not converge after $niter iterations"
            }
        }

        # Cache the result to shorten the call in future
        set tvals($n:$alpha) $t

        return $t
    }

    ########################################
    #
    # Weighted Least Squares
    #
    ########################################
    # swat::wls w [list y x's] w [list y x's] ...
    # Returns:
    #   adjusted R-squared
    #   coefficients of x's in fit
    #   jackknifed residuals
    #   standard errors of the coefficients
    #   95% confidence bounds for coefficients
    proc wls {args} {

        # Fill the matrices of x & y values, and weights
        # For n points, k coefficients

        # The number of points is half the number of arguments
        # (n weights alternating with n points)
        set n [expr {[llength $args]/2}]

        set firstloop true
        # Sum up all y values to take an average
        set ysum 0
        set wtsum 0
        # Count over rows (points) as you go
        set point 0
        foreach {wt pt} $args {

            # Check inputs
            if {![string is double $wt]} {
                error "swat::wls: Weight \"$wt\" is not a number"
            }

            ## -- Check dimensions, initialize
            if {$firstloop} {
                # k = num of vals in pt = 1 + number of x's (because of constant)
                set k [llength $pt]
                if {$n <= $k + 1} {
                    error "swat::wls: Too few degrees of freedom: $k variables but only $n points"
                }
                set X [mkMatrix $n $k]
                set y [mkVector $n]
                set I_x [mkIdentity $k]
                set I_y [mkIdentity $n]

                set firstloop false

            } else {
                # Have to have same number of x's for all points
                if {$k != [llength $pt]} {
                    error "swat::wls: Point \"$pt\" has wrong number of values (expected $k)"
                }
            }

            ## -- Extract values from set of points
            # Make a list of y values
            set yval [expr {double([lindex $pt 0])}]
            setelem y $point $yval
            set ysum [expr {$ysum + $wt * $yval}]
            set wtsum [expr {$wtsum + $wt}]
            # Add x-values to the x-matrix
            set xrow [lrange $pt 1 end]
            # Add the constant (value = 1.0)
            lappend xrow 1.0
            setrow X $point $xrow
            # Create list of weights & square root of weights
            lappend w [expr {double($wt)}]
            lappend sqrtw [expr {sqrt(double($wt))}]

            incr point

        }

        set ymean [expr {double($ysum)/$wtsum}]
        set W [mkDiagonal $w]
        set sqrtW [mkDiagonal $sqrtw]

        # Calculate the sum of squared differences for the x's,
        # as sum(x^2) - (sum x)^2/n  (currently unused below)
        for {set r 0} {$r < $k} {incr r} {
            set xsqrsum 0.0
            set xmeansum 0.0
            for {set t 0} {$t < $n} {incr t} {
                set xval [getelem $X $t $r]
                set xmeansum [expr {$xmeansum + double($xval)}]
                set xsqrsum [expr {$xsqrsum + double($xval * $xval)}]
            }
            lappend xsqr [expr {$xsqrsum - $xmeansum * $xmeansum/$n}]
        }

        ## -- Set up the X'WX matrix
        set XtW [matmul [::math::linearalgebra::transpose $X] $W]
        set XtWX [matmul $XtW $X]

        # Invert by solving against the identity matrix
        set M [solveGauss $XtWX $I_x]

        # Coefficients: beta = (X'WX)^-1 X'W y
        set beta [matmul $M [matmul $XtW $y]]

        ### -- Residuals & jackknife statistics
        # 1) Generate list of diagonals of the hat matrix
        set H [matmul $X [matmul $M $XtW]]
        for {set i 0} {$i < $n} {incr i} {
            lappend h_ii [getelem $H $i $i]
        }

        set R [matmul $sqrtW [matmul [sub $I_y $H] $y]]
        set yhat [matmul $H $y]

        # 2) Generate list of residuals, sum of squared residuals, r-squared
        set sstot 0.0
        set ssreg 0.0
        # Note: Relying on representation of Vector as a list for y, yhat
        foreach yval $y wt $w yhatval $yhat {
            set sstot [expr {$sstot + $wt * ($yval - $ymean) * ($yval - $ymean)}]
            set ssreg [expr {$ssreg + $wt * ($yhatval - $ymean) * ($yhatval - $ymean)}]
        }
        set r2 [expr {double($ssreg)/$sstot}]
        set adjr2 [expr {1.0 - (1.0 - $r2) * ($n - 1)/($n - $k)}]
        set sumsqresid [dotproduct $R $R]

        # 3) Calculate the overall variance, the variance per point & jackknifed stderr
        set s2 [expr {double($sumsqresid) / double($n - $k)}]
        for {set r 0} {$r < $n} {incr r} {
            set wt [lindex $w $r]
            set hprime [expr {sqrt($wt) * (1.0 - [lindex $h_ii $r])}]
            set resid [getelem $R $r]
            set newjns2i [expr {(($n - $k) * $s2 - double($resid * $resid)/$hprime)/($n - $k - 1)}]
            lappend jns2i $newjns2i
            lappend jacknife [expr {double($resid)/sqrt($newjns2i * $hprime)}]
        }

        ### -- Confidence intervals for coefficients
        set tvalue [tstat [expr {$n - $k}]]
        for {set i 0} {$i < $k} {incr i} {
            set stderr [expr {sqrt($s2 * [getelem $M $i $i])}]
            set mid [lindex $beta $i]
            lappend stderrs $stderr
            lappend confinterval [list [expr {$mid - $tvalue * $stderr}] [expr {$mid + $tvalue * $stderr}]]
        }

        return [list $adjr2 $beta $jacknife $stderrs $confinterval]
    }

    ########################################
    #
    # Substantively Weighted Least Squares
    #
    ########################################
    # swat::swls [list y x's] [list y x's] ...
    # Returns, for each fit in the sequence:
    #  - # points with weight 1
    #  - adjusted R-squared
    #  - coefficients of x's
    #  - standard errors of the coefficients
    #  - 95% confidence intervals for coefficients
    #  - weighted averages of x values
    proc swls {args} {
        variable swls_threshold
        variable swls_beta

        set numpoints [llength $args]

        # Initialize weights to 1.0 for all points
        for {set i 0} {$i < $numpoints} {incr i} {
            set wt($i) 1.0
        }

        # Start of big loop over weightings
        for {set i 0} {$i < $swls_beta} {incr i} {
            # Prepare the arg to send to wls
            set wtpts {}
            set nhigh 0
            set sumwts 0
            for {set j 0} {$j < $numpoints} {incr j} {
                set pts [lindex $args $j]
                if {![info exists numxs]} {
                    set numxs [expr {[llength $pts] - 1}]
                }
                lappend wtpts $wt($j) $pts
                set sumwts [expr {$sumwts + $wt($j)}]
                for {set n 1} {$n <= $numxs} {incr n} {
                    if {$j > 0} {
                        set sumx($n) [expr {$sumx($n) + $wt($j) * [lindex $pts $n]}]
                    } else {
                        # First point: initialize the weighted sum
                        set sumx($n) [expr {$wt($j) * [lindex $pts $n]}]
                    }
                }
                if {$wt($j) == 1.0} {
                    incr nhigh
                }
            }
            set xave {}
            for {set n 1} {$n <= $numxs} {incr n} {
                lappend xave [expr {double($sumx($n))/$sumwts}]
            }
            set retval [eval wls $wtpts]
            # Adjusted R-squared is first, then coeffs, then jackknifed residuals
            set r2 [lindex $retval 0]
            lappend coeffs [list $nhigh $r2 [lindex $retval 1] [lindex $retval 3] [lindex $retval 4] $xave]
            set jacknife [lindex $retval 2]
            # Figure out the next set of weights: points whose jackknifed
            # residual falls below the threshold lose 1/swls_beta per pass
            for {set j 0} {$j < $numpoints} {incr j} {
                if {$i == 0 && [lindex $jacknife $j] < $swls_threshold} {
                    set wt($j) [expr {1.0 - 1.0/double($swls_beta)}]
                }
                if {$i > 0 && $wt($j) < 1.0} {
                    set wt($j) [expr {$wt($j) - 1.0/double($swls_beta)}]
                }
            }
        }
        return $coeffs
    }
}

##########################
#
# GUI
#
##########################

set fileinfo(dir) [file dirname $argv0]
set fileinfo(name) ""

proc fileopen {} {
    global fileinfo

    set fname [tk_getOpenFile -initialdir $fileinfo(dir) -initialfile $fileinfo(name) \
        -defaultextension ".dat" -filetypes { {{Data files} {.dat}} {{Text Files} {.txt}} {{All Files} *} }]

    if {$fname == ""} {return}

    if {[catch {open $fname r} fhndl]} {
        tk_messageBox -icon error -message "Error opening file [file tail $fname]: $fhndl"
        return
    }

    set names {}
    set points {}

    set firstline true
    while {[gets $fhndl line] != -1} {
        # Ignore blank lines & comments
        if {[string trim $line] == ""} {continue}
        if {[string index [string trimleft $line] 0] == "#"} {continue}

        if {$firstline} {
            set names [split $line "\t"]
            set firstline false
            continue
        }

        lappend points [split $line "\t"]
    }

    close $fhndl

    set fileinfo(dir) [file dirname $fname]
    set fileinfo(name) [file tail $fname]

    wm title . "$fileinfo(name) - SWLS"

    filltext $names $points

}

proc filltext {names points} {
    .t delete 1.0 end
    update

    set coeffs [eval swat::swls $points]

    set nstart [lindex [lindex $coeffs 0] 0]
    set nend [lindex [lindex $coeffs end] 0]

    .t insert end "Explaining: [lindex $names 0]\n\n"
    .t insert end "Number of data points: $nstart\n"
    .t insert end "Number in set of interest: $nend\n\n"
    .t insert end [format "%-15s\t% 10s\t% 11s\t%25s\t% 10s\n\n" Variable Coeff "Std Error" "-- 95% Conf Interval --" "Ave Value"]

    set i 1
    foreach val $coeffs {
        set r2 [lindex $val 1]
        set xcoeffs [lindex $val 2]
        set stderrs [lindex $val 3]
        set cis [lindex $val 4]
        set xaves [lindex $val 5]
        .t insert end "== Run $i ==\n"
        incr i
        set j 1
        foreach xave $xaves xc $xcoeffs ci $cis se $stderrs {
            if {$j == [llength $xcoeffs]} {
                set lbl "Constant"
                set avestring ""
            } else {
                set lbl "[lindex $names $j]"
                set avestring [format "% 10.3f" $xave]
            }
            .t insert end [format "%-15s\t% 10.3f\t\u00B1% 10.3f\t\[% 10.3f : % 10.3f\]\t%s\n" \
                $lbl $xc $se [lindex $ci 0] [lindex $ci 1] $avestring]
            incr j
        }
        .t insert end [format "Adj. R-squared = %.3f\n\n" $r2]
    }
}

wm title . "SWLS"

scrollbar .sby -command {.t yview} -orient vertical
scrollbar .sbx -command {.t xview} -orient horizontal
text .t -width 70 -height 20 -wrap none -yscrollcommand {.sby set} \
    -xscrollcommand {.sbx set} -font "Courier 9"
grid .t .sby -sticky nsew
grid .sbx -sticky nsew
grid columnconfigure . 0 -weight 1
grid rowconfigure . 0 -weight 1

focus -force .
```
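As a cross-check on the linear algebra, the core of wls — solving the weighted normal equations (X'WX)β = X'Wy — can be reproduced in a few lines of plain Python (a sketch with hypothetical names, using Gaussian elimination like the Tcl version; it makes no attempt at numerical robustness beyond partial pivoting):

```python
def wls_fit(points, weights):
    """points: list of [y, x1, x2, ...]; returns [b1, ..., bk, constant]."""
    n = len(points)
    k = len(points[0])                        # number of x's plus the constant
    X = [list(p[1:]) + [1.0] for p in points]
    y = [p[0] for p in points]
    # Form A = X'WX and b = X'Wy directly; W is diagonal, so each point
    # simply contributes with its own weight.
    A = [[sum(weights[t] * X[t][i] * X[t][j] for t in range(n))
          for j in range(k)] for i in range(k)]
    b = [sum(weights[t] * X[t][i] * y[t] for t in range(n)) for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    beta = [0.0] * k
    for r in reversed(range(k)):
        beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, k))) / A[r][r]
    return beta

# Points lying exactly on y = 2x + 1 recover those coefficients for any weights:
print(wls_fit([[3.0, 1.0], [5.0, 2.0], [7.0, 3.0], [9.0, 4.0]], [1.0, 0.5, 1.0, 2.0]))
```

The Tcl script gets the same β by inverting X'WX with solveGauss against the identity, which it also needs for the hat matrix and the standard errors.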

You can try it out using the following data file. This is a file of data on child welfare agencies used by the creators of SWAT for the book What Works, formatted for this Tcl app.
```
Organizational Learning        ACES        Instability        Ambiguity        Staff Change        Divorce Rate        Slack        Expenditure
1141.4        1.467351        70.43141        23.57        115.5502        26.2        3.470122        3440.159
2667        1.754386        6.407348        32.1        39.5922        27.8        4.105816        8554.186
307.3        0.4215852        63.41155        35.14        106.0642        29.4        5.610657        2544.263
840.7        0        16.39695        23.46        50.57774        30.8        3.848957        2366.732
482.5        1.18499        78.90607        31.95        -1.144        20.3        3.700461        4954.157
550        2.072846        28.10363        17.81        25.76482        22.8        2.74241        2976.427
514.1        0.607718        22.71782        22.23        18.3592        14.9        5.063129        4905.78
1352.5        1.470588        6.602017        65.03        -232.2549        19.6        9.178106        5970.52
1136.5        0.8285004        123.2406        21.44        60.60778        30.8        2.495392        2567.212
1575.4        1.660879        87.06177        31.26        86.82364        22.2        2.100618        2589.531
1170.6        0        13.04299        37.68        31.7781        19.1        4.491279        4141.247
1679.6        0.9624639        7.197373        21.72        79.94753        27        3.074239        3216.067
882.3        1.212856        181.2476        16.97        45.62724        17.4        1.902308        2469.215
1347.5        1.782531        78.82232        11.84        33.76736        26.6        1.762038        1742.745
1353.4        1.431127        11.54932        31.1        57.34841        16.6        3.434238        2659.697
1363.4        2.40481        13.08474        18.11        72.44474        22.5        2.563138        3296.465
1087.9        0.538648        26.32775        23.06        166.6439        21.9        2.649425        3330.793
712.8        0.4703669        27.16063        25.46        35.4368        15.6        4.030189        3416.769
1865.4        2.42915        9.96554        18.14        90.42884        20.9        3.209765        4109.007
1088.4        1.234568        38.82103        40.84        24.8976        14.2        3.488844        5275.824
1308.1        0.6671114        76.34886        26.11        82.58347        12.7        3.641929        4926.775
3485.4        1.17421        197.8029        25.89        80.5744        17.7        1.753225        5419.784
1946.4        1.128159        35.4882        26.85        70.3593        14.9        4.76418        5208.218
1116.8        0.3858025        76.16611        16.39        225.721        21.8        4.611794        2539.369
2047.5        0.5804953        54.3063        21.42        121.5814        22        2.799651        2754.369
893.3        2.475247        7.206166        12.36        103.7574        22.9        3.882953        2083.009
1779.2        0        26.99598        45.8        78.2787        17.6        3.511719        3982.897
703.4        1.557632        3.533641        49.19        -32.1768        54.9        6.609568        4082.064
695.1        2.714932        6.912465        43.87        32.34244        19.7        6.407141        2998.161
1287        0.257732        62.02957        36.51        17.6954        15        4.470681        6310.163
573        0        13.15338        28.29        48.06679        26        2.264973        2759.142
955.6        0.3322627        151.1162        32.31        50.3237        15        3.284161        5634.305
1296.1        1.187472        70.68908        23.75        31.70579        20.6        3.450963        2992.531
1194.5        1.574803        5.132252        15.61        54.0354        15.4        2.777402        2679.454
4299.2        3.565225        169.9304        22.62        136.621        20.3        2.841627        3072.191
896.4        1.259843        66.21898        23.09        35.20083        32.9        3.420866        2381.565
815.4        4.106776        53.44192        37.42        -176.8579        23.8        6.527769        4303.525
2262.9        0.7524454        121.2754        51.96        18.3963        14.8        2.932871        4130.657
1108.8        0.996016        10.88008        38.61        22.00537        16.1        2.421065        3339.35
1314.2        1.404494        27.21783        25.78        36.989        17.4        1.432419        2533.139
1268.9        4.207574        2.172794        17.05        23.44076        16.9        2.988671        2366.595
952.8        0.8075914        94.00198        41.9        33.35452        26.6        1.970768        2031.763
763.2        0.4034815        121.7529        51.39        53.86072        24.4        2.710884        1620.14
1228.6        0        11.48006        21.33        73.45351        21.8        4.641821        5069.072
1023.5        3.527337        3.049591        20.86        84.84604        18.9        4.317273        2983.782
1646.3        1.272669        58.24525        19.83        105.2593        17.7        2.980112        3097.782
2483.3        1.195695        80.93693        30.97        160.9784        24.7        6.063038        5996.734
1045.3        0.5552471        15.81082        15.11        78.10522        22.3        3.786695        2222.728
3866.9        1.210898        70.8133        24.03        60.605        15.2        3.335582        4972.425
1450.2        0        7.217247        10.33        163.1716        29        2.623226        1860.076
```

After running this data file through the script, it should give the following output:
```
Explaining: Organizational Learning

Number of data points: 50
Number in set of interest: 11

Variable                    Coeff          Std Error          -- 95% Conf Interval --         Ave Value

== Run 1 ==
ACES                      284.795        ±    99.831        [    83.327 :    486.263]             1.265
Instability                 3.815        ±     2.057        [    -0.336 :      7.967]            52.237
Ambiguity                   9.004        ±    11.259        [   -13.718 :     31.727]            28.111
Staff Change                4.617        ±     1.758        [     1.069 :      8.165]            55.651
Divorce Rate               -1.682        ±    14.412        [   -30.766 :     27.402]            21.712
Slack                    -125.335        ±    88.330        [  -303.593 :     52.923]             3.643
Expenditure                 0.294        ±     0.078        [     0.136 :      0.452]          3617.571
Constant                 -263.506        ±   591.866        [ -1457.939 :    930.927]