Another weekend fun project by [Richard Suchenwirth], 2001-02-16 - This is a toy or demo thing that is not really fit for real-life use, but I still had some fun with it - it's amazing how little code it takes! Give it a text widget and possibly a tag (my "hilite" is -bg orange, which stands out pretty clearly; a curly red underline seems not to be possible in Tk), and it will march through the text contents and highlight all those words that don't match its expectations (i.e. not in the dictionary and not resolvable by rules):
======
proc text:spell {w {tag hilite}} {
    set lineno 1
    $w tag remove $tag 1.0 end
    foreach line [split [$w get 1.0 end-1c] \n] {
        foreach {from to} [string:wordindexes $line] {
            set word [string range $line $from [expr {$to - 1}]]
            if {![spell:ok $word]} {
                $w tag add $tag $lineno.$from $lineno.$to
                update idletasks
            }
        }
        incr lineno
    }
}
======
Known bug: embedded images count as one character, but are not seen by the ''$text get'' command, so they shift the highlighting to the right.
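To try it out, something like the following should do (a minimal sketch - the tag name "hilite" and the orange background match the defaults mentioned above; re-checking the whole widget on every keystroke is wasteful for large texts, but fine for a demo):

```tcl
package require Tk
pack [text .t -wrap word] -fill both -expand 1
.t tag configure hilite -background orange
.t insert end "This sentense contains a mispeled word.\n"
text:spell .t
# naive live checking: re-run the whole check after each keystroke
bind .t <KeyRelease> {text:spell .t}
```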
The following helper produces a list of starting and ending indices of words (as defined by Tcl) in a string:
======
proc string:wordindexes s {
    set i 0
    set res {}
    foreach c [split $s ""] {
        ##DKF## Use {$c ne " " && $i eq [string wordstart $s $i]}
        ##DKF## as the test from Tcl 8.4 onwards! It's faster and less buggy
        if {$c != " " && $i == [string wordstart $s $i]} {
            lappend res $i [string wordend $s $i]
        }
        incr i
    }
    set res
}
======
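For illustration, here is the same helper written with the string operators DKF recommends for 8.4 and later (a self-contained sketch; the name string:wordindexes84 is made up for this demo). Note that a punctuation character counts as a one-character word to ''string wordstart''/''wordend'':

```tcl
proc string:wordindexes84 s {
    set res {}
    set i 0
    foreach c [split $s ""] {
        # from 8.4 on, "ne" avoids numeric conversion of the character
        if {$c ne " " && $i == [string wordstart $s $i]} {
            lappend res $i [string wordend $s $i]
        }
        incr i
    }
    return $res
}
# "hello" spans 0..5, "," spans 5..6, "world" spans 7..12:
puts [string:wordindexes84 "hello, world"] ;# -> 0 5 5 6 7 12
```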
Here comes the word checker, returning 1 or 0 depending on whether it accepts a word (replace it with your own if you have a better one - sometime in the future I will experiment with a graph parser):
======
proc spell:ok s {
    global word ;# Faster to create local alias
    if {[string length $s] < 2} {return 1}
    if {![regexp {[A-Za-z]} $s]} {return 1}
    set s [string tolower $s]
    if {[info exists word($s)]} {return 1}
    foreach sfx {s ing ed es d} {
        if {
            [regexp ^(.+)$sfx$ $s -> stem] &&
            [info exists word($stem)] &&
            [lsearch $word($stem) $sfx] >= 0
        } then {
            return 1
        }
    }
    return 0
}
======
The following two procs are for data preparation: they take a string with possible line breaks (which may be a whole text file) and extract just the words, or do a frequency count, respectively:
======
proc string:words s {
    set res {}
    foreach line [split $s \n] {
        for {set i 0} {$i < [string length $line]} {incr i} {
            if {$i == [string wordstart $line $i]} {
                set w [string range $line $i [expr {[string wordend $line $i] - 1}]]
                if {$w != " "} {lappend res $w}
                incr i [expr {[string length $w] - 1}] ;# the loop always incrs once more
            }
        }
    }
    set res
}
proc words:count s {
    foreach i [string tolower [string:words $s]] {
        if {[string length $i] > 1} {
            if {[info exists a($i)]} {
                incr a($i)
            } else {
                set a($i) 1
            }
        }
    }
    set t {}
    foreach {i n} [array get a] {lappend t [list $i $n]}
    ##DKF## Efficient in 8.4, not crippling before
    return [lsort -integer -decreasing -index 1 $t]
}
======
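A quick interactive check (assuming the two procs above have been sourced) - one-character words are dropped, case is folded, and the most frequent words come first:

```tcl
set sample "The cat sat on the mat.\nThe cat purred."
puts [words:count $sample]
;# -> {the 3} {cat 2} ... followed by the count-1 words
;# (entries with equal counts may come out in any order,
;#  since they are collected from an array)
```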
And here finally comes the "dictionary" (pretty poor yet - it fits on less than a page). It does a crude subcategorization based on possible endings (the value of each array entry), so more words are matched:
======
########## load dictionary, distinguish suffix distributions #####
foreach i {
about above after all already also always am an another any and are as at
be been before below between body both but by child children could
data different does doesn during each either empty
found for from fully given got happy happily
has have high his how however if in including into is isn it
just later legal low may maybe more must never next no none not
of on onto only or over perhaps same should since slow so some such
tcl than that the their them then there these they
this those three to too two under unless us using
was we were what whatever when where whether which while who whom whose
why with within would you zero automatic automatically
} {set ::word($i) ""}
foreach i {
add accept allow append approach argument
back book brief buffer button
call check clear click color command consist contain convert count counter
destroy display down end except exist export
fill follow form import intend key
last link list load look mark need number open order overview
pair perform pick point position print reason represent return
screen script second select shift show spell start style support
test treat unit view want word work
} {set ::word($i) "s ing ed"}
foreach i {
bind break do field find mean read see will window
} {set ::word($i) "s ing"}
foreach i {
access focus index match search
} {set ::word($i) "es ing ed"}
foreach i {
actual additional complete current definite direct exact frequent
general immediate normal occasional optional previous proper quick
recent silent symbolical total
} {set ::word($i) "ly"}
foreach i {
action application area bar bottom can case center come context
character computer content control current database effect element
error even event example first font forget format friend
get give global handler height her image information input it item
left let make menu mouse new nothing one operation option other
output package pattern procedure program real red refer region reset
resolution right selection set simple single space special standard
step stop string system table tag take text top up variable white
widget width write your
} {set ::word($i) "s"}
foreach i {
abbreviate associate change code coordinate create date declare
define delete describe determine double execute file force generate
ignore include indicate line like name note outline page remove rule
size state terminate time type use value
} {set ::word($i) "s d"}
======
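With this dictionary loaded, a few calls show the suffix rules at work (assuming ''spell:ok'' and the ''word'' array from above have been sourced):

```tcl
puts [spell:ok tcl]     ;# -> 1 (listed directly)
puts [spell:ok words]   ;# -> 1 (stem "word" allows suffix "s")
puts [spell:ok working] ;# -> 1 (stem "work" allows "ing")
puts [spell:ok xyzzy]   ;# -> 0 (no entry, no rule applies)
```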
----
''DKF:'' Modified to run faster. :^)
----
LV: Are any of you familiar enough with the Wikit code to figure out how to
add this code, so that after one edits a page there could be a button
for spell-checking the page, with the possibly misspelled words
highlighted in some manner?
----
RS: Before building this into the Wiki, remember I said this is a toy project. The problem is the dictionary, which has to be very much more comprehensive than the one above - otherwise you'll get so many false positives that it doesn't help much. So we need
* data (10,000s of frequent English words)
* an efficient access method (the one above will get slow with much data, because of the many regexps)
Highlighting via HTTP will also be very different, though not difficult: retransmit the received form contents, with dubious words braced in ... (Does HTML allow markup in a form?)
DKF: No, you can't provide HTML markup in a form. Delivering this sort of functionality would require an applet of some form (either Java or Tcl).
Also, the ''ispell'' English dictionary is 300kB long (after I extract it from its storage format), contains over 33 thousand words, omits many common prefixes and suffixes, and I still use a lot of words which it doesn't know about. I'm tempted to say that instead of writing our own spelling checker, we should just wrap up ispell instead... :^)
FYI, the shell command I used to extract it was:
======
strings /usr/common/lib/ispell/britishmed+.hash | sed '/[^A-Z]/d' | tr A-Z a-z | sort
======
----
RS: Could not find ispell on our Solaris or Linux boxes, but the old ''spell'' with a flat ASCII wordlist of 25143 words is there. Not bad. I'd still prefer a pure-Tcl solution, since my W95 box at home misses so many goodies...
[Arjen Markus] On our Solaris system we have a program "spell" - seems quite similar :-)
----
[AK]: See http://freshmeat.net/appindex/console/text%20utilities.html for several spellcheckers, especially '''pspell''' [http://pspell.sourceforge.net/], the portable spell checker interface library. Contains an ispell module, appears to handle UTF-8.
----
[LV]: the aspell/pspell project has several word lists, as do the fsf.org people. So coming up with a word list isn't the problem. However, perhaps embedding such large word lists into a Wikit would be
counter-productive...
----
NEM: Couldn't we write some Tcl scripts to trawl on-line dictionaries, writing the data to a Metakit or other Tcl database? Some intelligent language parsing could pick out endings etc. and create a nice database. It could take a while to work, though - all those HTTP requests...
----
Instead of embedding the dictionary in the Wikit, we could create a [metakit] database in a [Tequila] server...
----
More possibilities, these requiring IP connectivity:
use of Google's spell-corrector ([CL] reports that programmability
is as easy as
======
package require SOAP
SOAP::create doSpellingSuggestion \
-proxy http://api.google.com/search/beta2 \
-uri urn:GoogleSearch \
-action urn:GoogleSearchAction \
-params {key string phrase string}
puts [doSpellingSuggestion $key "mangeld word"] ;# $key holds your Google API license key
======
); or 'Net connections to several on-line dictionaries
(...)
----
One concern about using the Google web service is that, as I seem
to recall, one needs to obtain a login/password for Google and then use
that login and password as part of the interaction. Or is this
something different?
[MG] April 21st, 2004 - I looked into the Google web service recently; you do indeed need to obtain a login/password from them, and there's a limit to how much it can be used in (I believe) any 24-hour period. As a pure-Tcl alternative, though, I found that the code above (with a few small modifications, mainly for handling words with apostrophes and such) and a large wordlist (the one I have is 1.3 megabytes) work brilliantly; the code is in [Potato MUSH (MUD) Client], but I'll extract the procs I changed later and add them here, just in case anyone wants them. (The word list is available at [http://www.geocities.com/mush_code/en_us.txt], incidentally.)
----
Joachim Kock <25/04/2004> :
I don't think it is worth trying to collect words from web services or
to compile databases in fancy formats. You can probably get much better
results by using a serious external spell checker like aspell, which is
very easy to control through a pipe, or otherwise just use a binary search
through a well-prepared word list --- these can be found on the internet
(for example on the aspell site [http://aspell.sourceforge.net]
or the Excalibur site [http://www.eg.bucknell.edu/~excalibr/]) and they have been
fine-tuned over many years by clever spell-checking freaks.
The spellchecker aspell (http://aspell.sourceforge.net) is very fast and
has many features. It is easy to call it from a Tcl programme via a pipe.
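A minimal sketch of such a pipe, assuming an aspell binary with an English dictionary is installed (the "-a" flag puts aspell into its ispell-compatible line protocol; the proc name aspell:ok is made up for this example):

```tcl
proc aspell:ok {word} {
    # open a two-way pipe to aspell in ispell-compatible mode
    set f [open "|aspell -a" r+]
    fconfigure $f -buffering line
    gets $f ;# discard the version banner aspell prints first
    # a leading "^" tells aspell to check the rest of the line literally
    puts $f ^$word
    set answer [gets $f]
    close $f
    # "*" means correct; lines starting with "&" or "#" flag a misspelling
    return [expr {$answer eq "*"}]
}
```

In real use one would of course open the pipe once and feed it words as they arrive, and parse the "&" result lines for the suggestion list, rather than paying the process startup cost per word.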
Alpha(Tk) (http://alphatcl.sourceforge.net) uses aspell as spellchecker and
this is all implemented in Tcl. See the file spellcheck.tcl in the
AlphaTcl library
[http://alphatcl.cvs.sourceforge.net/alphatcl/Tcl/SystemCode/CorePackages/spellcheck.tcl].
There is also a
check-as-you-type spell checker for Alpha(Tk) where misspelled words are
underlined while a list of suggestions appears in a small auxiliary window.
This goes as fast as you can type, and ctrl-k 4 for accepting suggestion
number 4... The Tcl code is here
[http://www.cirget.uqam.ca/~kock/alpha-tcl/spell.tcl].
Alternatively, and in particular if you are not interested in suggestions
for corrections but only want a boolean, a very convenient data format is
a plain text file with one word per line, alphabetically sorted. There are
very good such wordlists available, and doing a binary search for a word is
faster than you can type it. Here is a code snippet stolen from another
Alpha(Tk) package, 'autoAccents' (this package automatically sets accents
when you type in French, or in other heavily accented languages, depending
on the supplied wordlist). The following is rather minimal:
======
proc checkWord { word } {
    set word [string tolower $word]
    # Assuming that there is a sorted wordlist here:
    set wordList /Data/dics/wordlists/BritishDictionary2.2
    # ftp://ftp.eg.bucknell.edu/pub/mac/Excalibur-dictionaries/
    set f [open $wordList r]
    set lowerlimit 0
    seek $f 0 end
    set upperlimit [tell $f]
    # ------------------
    # Rough binary search on byte offsets, to narrow the interval:
    while { $upperlimit - $lowerlimit >= 20 } {
        set mid [expr {($upperlimit + $lowerlimit) / 2}]
        seek $f $mid
        gets $f linje ;# first chunk is junk (we land mid-line)
        gets $f linje
        if { [string compare $word $linje] > 0 } {
            set lowerlimit $mid
        } else {
            set upperlimit $mid
        }
    }
    # ------------------
    # Now the goal is within the narrow interval.
    # (In very unlucky cases the goal may actually be a little after the
    # interval, but this doesn't matter because we)
    # go back a little further and read forward linearly:
    if { $lowerlimit > 20 } {
        seek $f [expr {$lowerlimit - 20}]
        gets $f linje ;# first chunk is junk
    } else {
        seek $f 0
    }
    gets $f linje
    while { [string compare $word [string trim $linje]] > 0 } {
        if { [gets $f linje] == -1 } {
            break
        }
    }
    # ------------------
    # Found the first non-smaller word.
    close $f
    if { [string equal $word [string trim $linje]] } {
        return 1
    } else {
        return 0
    }
}
======
<> Algorithm | Spell Checking