Wiki Spamming

Wiki Spamming Issue

Humans and/or robot scripts are using open-access (i.e., read-write) web sites such as wikis, forums, and blog comments to publish their site URLs and increase their Google PageRank.

See the "link spamming" section of http://en.wikipedia.org/wiki/Spamdexing

About Blog Spamming, see http://en.wikipedia.org/wiki/Blog_spam

About Google PageRank: http://www.google.com/technology/

What's Going on exactly

Study of what's going on (at the Ruby wiki, Tcl wiki, etc.): Note that this is still very rudimentary. We should collect more data on the type of spam, other wikis, occurrence, patterns, etc. That will help us know how to react, I think.

  1. It seems that the process is mostly manual at this time (2004), judging by the times at which edits occur.
  2. A fair amount of time is spent on this process (at least on the order of 10 minutes per wiki), but the reward can be huge if the spammed wiki is popular (see the PageRank algorithm).
  3. Both the site URLs and the origin IP addresses point to Pacific-Asian sources. There is probably a limited number of spammers right now. I have pointers to a German company, an English one, and one from Israel too; see the chongqed site below.
  4. Links: a) a list of spammers: http://c2.com/cgi/wiki?WikiBlackList and b) a site dedicated to the counter-offensive: http://chongqed.org/fightback.html
  5. Example of spam on the Tclers wiki: http://mini.net/tclrevs/17-128-127

Note 2: It seems to me that wiki spamming is very different from mail spamming. Let's not try to apply the same kind of solutions to both.

Solutions

The types of solutions to apply here are probably twofold: either technical (e.g., change the engine code) or community-based (ask the wiki community to clean up faster than the spammers), or even ask Google to do something about this (why not?). Details follow. See also a fairly complete listing of possible solutions at http://www.usemod.com/cgi-bin/mb.pl?WikiSpam

(ak: Third is of course the combination of technical and community solutions)

  1. Clean all pages within hours, i.e., before the Google robot comes. It is probably possible for Tclers, who are quite numerous and passionate :-) but it's more problematic for small-community wikis such as the Cory Doctorow DRM talk wiki (http://www.commonhouse.net/wiki/drm/FrontPage ), which is very interesting and useful but has no community.
  2. Ask Google to avoid wiki/blog pages: this is actually simple to achieve through robots.txt or <meta name="robots" content="nofollow"> tags. Note that some people think spammers will not care; I'm not so sure! If the process is really hand-crafted, they will check the results, especially since it seems that some intermediaries are selling this service to merchant sites. They want to be able to show results.
  3. Ask Google to crawl the wiki pages but not the external URLs: this is possible via a "redirect" technique (see the "Redirect external links" section of http://www.usemod.com/cgi-bin/mb.pl?WikiSpam and the rough sketch after this list). See also http://simon.incutio.com/archive/2003/10/13/linkRedirects
  4. Change the engine to remove entirely the possibility of adding URLs, or even comments (some blogs have done that). Problem: some blog owners think that the comments are, or can be, more valuable than their own postings. For a wiki, this runs exactly contrary to the wiki purpose; it would be the death of wikis!
  5. Change the engine to enforce a login system. Although I still think it is anti-wiki (see the opinion section below), it could be restricted to edits that contain external URLs, or the URLs could be removed. See also the next item.
  6. When new text contains external URLs, these are removed and sent to a moderator. Note that in both this case and the previous one, it might be possible to add a whitelist of sites or domains. With the login feature, logging in would only be needed to add sites/domains to the list; afterwards any type of link could be published freely... (but I still think it's anti-wiki ;-).
  7. Other technical solutions: ban IPs, block URLs by domain (e.g., .cn) or by keyword, etc. Quite a few blacklists exist and can be found via Wikipedia, usemod, chongqed, etc. Also, using a captcha limits robots but does not prevent humans from spamming (and is harmful for blind people); my personal feeling is that when the reward is so high, a human spammer will take whatever time is needed to succeed, even if it involves typing a captcha, creating an account, etc.
  8. Fight back using the http://chongqed.org method: when you find a spammer, leave a "well placed" link, with all your PageRank might, not to his site of course, but to a page at http://spammers.chongqed.org/<spammerkeyword > so that when you type his name into Google, one of the top results is hopefully a web page whose title is something like "Wiki Spammer: hakdata.de" (for example), with a carefully crafted list of his misdeeds! Neat, no? Their method really seems to be upsetting spammers, judging by the letters they receive from upset spammers: see http://chongqed.org/chongqed.com/ - isn't that hilarious? See also, from this same guy, a very angry page: http://www.casino-adv.com/chongqed.org/prevent_spam_revenge_of_the_seo.htm
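
For illustration, here is a rough Tcl sketch of the redirect technique from item 3: external URLs are rendered as links through a local redirect script instead of direct targets, so a crawler following the wiki never associates the wiki's rank with the spammed site. The proc names and the /redirect path are invented for this sketch; this is not actual wikit code.

 # Turn an external URL into a link through a hypothetical local /redirect script.
 proc FormatExternalLink {url} {
     # Minimal escaping so the target survives as a query parameter.
     set encoded [string map {% %25 & %26 + %2B ? %3F = %3D # %23} $url]
     return "<a href=\"/redirect?url=$encoded\">$url</a>"
 }

 # The /redirect endpoint itself would simply answer with an HTTP 302 to the real URL
 # (optionally refusing targets that match a local blacklist).
 proc RedirectResponse {url} {
     return "Status: 302 Found\nLocation: $url\n\n"
 }

 puts [FormatExternalLink "http://example.com/page?a=1&b=2"]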

There are many potential types of answers. Maybe a mix of solutions is a good start? I think the very first thing to have is a good wiki backup/revision system (which we have here), then maybe a good community reading Recent Changes every day, and then a blend of technical helpers and fight-back techniques... ;-). It's up to us to sort out the ones we want to apply to the Tcler's wiki. Please edit this page and add your comments.

Personal opinion

(copied from http://wiki.chongqed.org//Manni )

Ok, so I did all this research just to be able to say that (just kidding :-), but here is what I think concerning wiki spamming: CM 30 Sep 04.

  1. Wiki Spamming cannot be tolerated.
  2. Implementing login systems, Captchas, etc. means losing the fight and giving up.
  3. Wikis are the essence of what the internet should be for me: free, open, accessible by anybody.

Based on this, I propose the following for the Tclers wiki:

  1. Let's try a mix of: a) community solving (we are a big and passionate community, and we should be able to spend relatively more time than the dozen or so spammers out there in the world), b) technical goodies, and c) reactive feedback, using Google's blacklisting mechanism (no idea how that works...) and the chongqed people!
  2. Community-based solving: every day, we have to look at Recent Changes and edit the pages back when something abnormal is done. This can be facilitated by two technical goodies: alerts and easy revert links (see next section). Part of the community work will also consist in getting actual data on spammers and their clients in order to report them to Google and use their keywords to let customers know spamming was used (see last section). LV I wonder whether Google has a mechanism for negative feedback based on some input - something like eBay banning users and removing bad auctions. Perhaps if a few clients were suddenly dropped significantly in rank - based of course on valid evidence - or there were some other consequences (site removed from Google for a week and then started over...), would that deter the spamming?
  3. Technical-based solving: Currently, we already have an architecture with two domain names: wiki.tcl.tk, which is indexed by Google, and mini.net, which is not, or supposedly so. Actually, a search on tcllib reports three pages in mini.net (?) but 409 on wiki.tcl.tk (weird - why, out of thousands of pages, are only 409 in the search engine?). Let's first completely disallow history and diff links to googlebot; otherwise reverting an edit is not useful, as the robot will find all the occurrences of the links in the history (it's in fact even worse, as the spammers come back and leave even more links!). Other goodies involve smart alerts based on IP address, DNS domain, number of URLs edited (I might be a good candidate with this page), or the ratio of URL text to total text: when it's more than 95%, chances are good that we are under a spam attack (a rough sketch follows below). Etc., etc. I'm sure I haven't thought of all the possibilities. The beauty is that alert code can even be tested and run from outside the wiki machine, involving only the Tcler who is trying out his/her new ideas... :-).
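
To illustrate the "ratio of URL text to total text" alert from item 3 above, here is a rough Tcl sketch; the proc names, the regexp, and the exact threshold are only illustrative guesses, not existing wikit code.

 # Flag an edit when more than 95% of its characters belong to URLs.
 proc UrlRatio {text} {
     if {[string length $text] == 0} {return 0.0}
     set urlChars 0
     foreach url [regexp -all -inline {https?://[^\s<>"]+} $text] {
         incr urlChars [string length $url]
     }
     return [expr {double($urlChars) / [string length $text]}]
 }

 proc LooksLikeSpam {text {threshold 0.95}} {
     expr {[UrlRatio $text] > $threshold}
 }

 # A block of pasted links trips the alert; normal prose does not.
 puts [LooksLikeSpam "http://a.example.cn http://b.example.cn http://c.example.cn"]
 puts [LooksLikeSpam "A sentence about Tcl with one link: http://www.tcl.tk"]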

In conclusion: first, we have full control of the wiki engine and can incorporate whatever new function is deemed necessary to reduce the spam. Even better, the wiki users are all fans of programming and scripting! So they can, for example, provide scripts outside the wiki domain that perform analyses, trigger alerts, etc. This is a very valuable asset, and it would be a pity not to at least try some technical solutions :-).


Comments.

When dealing with the pros and cons of the proposed solutions, you can directly edit my text and add points and references. Please try to separate facts and exhaustive listings from opinions; that will help us keep this page easy to read and useful (I hope).

30-Sep-04 DaveG: Since the Wiki pages are generated each time, have the Wiki server return the NOINDEX meta tag for any pages that have been created or modified in the past, oh, 7 days. This will allow ample time for cleaning up the Wiki, and eventually the stable and vetted content will be indexed by the Googles of the net.
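
A minimal sketch of what DaveG describes, assuming the page's last-modified time is available to the page generator (how it is obtained is left out); see Joe's caveat further down about noindex removing already-indexed pages.

 # Emit a NOINDEX robots meta tag only for pages edited in the last 7 days.
 proc RobotsMetaFor {lastModified {window 604800}} {
     # 604800 seconds = 7 days.
     if {[clock seconds] - $lastModified < $window} {
         return {<meta name="robots" content="noindex,nofollow">}
     }
     return {<meta name="robots" content="index,follow">}
 }

 puts [RobotsMetaFor [expr {[clock seconds] - 3600}]]      ;# edited an hour ago
 puts [RobotsMetaFor [expr {[clock seconds] - 2592000}]]   ;# edited a month ago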

SS30Sep2004: It would not be strange if some wiki spammers are using bots to spam, or will start soon, so why not implement a security code (an image that is hard for OCR software to analyze, containing a number the user has to type into a text area) to make sure the editor is human? I did it at http://wiki.hping.org (if you want to check how it works, try to edit some page there), and in the last two days I have had no spamming problems (before this it was an everyday problem for me). I don't know whether this result is because my spammers are robots, or humans who don't want to deal with a security code, or are just unable to read English (the security-code instructions).

If this doesn't work I'll start blacklisting IP addresses. An IP will be blacklisted every time a spammer operates (and of course not just the single IP address, but the whole network or something like that). If a good user wants to edit but can't because of the blacklist, there will be an automated procedure to follow in order to unblock a given IP address even if it's blacklisted, but the procedure will take 30 seconds or so. This way wiki spam costs more for spammers:

  • They need to use real people, not bots, and people must be paid and are slow.
  • Spammers can't spam more than once from a given IP range without being blacklisted.
  • In order to reactivate an IP they have to wait: more money to pay, because they will be slower.

It's not perfect, and experimental, but wiki.hping.org reached a spam level where I needed to find a solution, because I'm the only one dealing with spam there. The Tclers wiki at least has a big community that regularly fixes the wiki.

jcw - All good points, ideas, and suggestions IMO. But we should not allow this to take the direction spam discussions sometimes go: people spending ages debating ideas (good ones, I'm sure), while no real action is taken. There is a balance between just cleaning up the cruft and creating mechanisms which do it for us (or avoid it). My vote would be to choose one of these soon:

  • mandatory "who are you" registration / cookies (das's wikit patch)
  • implementing a simple/effective 1-click last-change rollback

A refinement dkf mentioned on the chat is to enable rollback only for people who are registered (need not be mandatory).

There have been some objections to mandatory registration (and hence insisting on cookies). Are there other options which 1) we could agree on, 2) someone is willing to implement, and 3) don't need much further tweaking once adopted?

CM I'm sure an effective rollback mechanism, plus having the revision pages ignored by the googlebot (otherwise it's not useful, see above), would be a very good thing to have. I don't think people would have a problem with moderating its use (e.g., via registration/email/cookies), as it does not prevent everybody from writing on the wiki. If we really need to enforce something like a login for posting, then why not only require it when the newly published text contains external URLs? The first option (mandatory login) does restrict the wiki's usefulness IMHO.


LES: I really like what I already have in Yahoogroups: a login system and automatic moderation of every member's first posting. The group I keep there uses this system and has been free from spam for more than a year, maybe two.

That system gave me an interesting experience. We were spammed for a couple of months, then never again. I mean, I only had to approve legitimate messages since then because there have been no ill attempts. Meanwhile, I see spam grow in other lists I subscribe to. That makes me think that spammers actually keep track of what groups are moderated or not. Put any protection mechanism in place here and they will soon look for prey elsewhere.

I imagined a "trust network" system. A few wikit contributors would be considered "trusted" from the very beginning. New posters would be considered "untrusted" by default. "Untrusted" posts would be signaled somehow in Recent Changes. Say, an alert signal next to the entry. Or even retained for moderation, if you want. Any "trusted" contributor would be able to visit an administration page and change someone's status from "untrusted" to "trusted". That would require a login system and cookies, though.


jcw - Could we somehow use a graphic and then respond based on the coordinates of the click? (reminds me of those goofy dialog boxes where the dismiss button moves away as you try to click on it) It can't replace the Save button unfortunately, as that would not send back the edited text, but perhaps an extra page after the save could be used? I have no idea what the graphic should be, just wanted to pass on the basic idea...

LES - Interesting idea indeed. I know for sure that PHP can do that. But I have no idea how it works inside or how it could be implemented in wikit. But PHP is open source, you know. :-) One can look at PHP's source and see how they implement graphic coordinates. Maybe it's not even PHP-dependent. Visit this page [L1 ] and look for "coordinates". Seems very simple.

Lars H: The problem with graphic response thingies is that these make it impossible to edit Wiki pages from within a text editor (as I do right now). For some purposes, browser editing facilities suck.

jcw - Not sure: you edit, you save, as you do now, then comes up a page with a graphic? (Lars H: Comes up where? The HTTP POST action is carried out by the text editor! It certainly gets some HTML back, but it has no ability to display any graphics.) Or we could add whitelists for the regulars.

Just to take this a bit further: edits remain as is, but a changed page is flagged as unverified. A page comes up with a way for people to click on a spot which marks the page as being ok (refinement: a different spot each time). Unverified changes are revoked after a certain time (could be minutes/hours). Can be combined with other ideas. Note: this is still merely an idea: we can shoot it down, ignore it, improve it - time will tell.

DKF: This is too elaborate and likely to annoy regulars. Just do the simple thing by allowing verified users (for whom you - in theory at least - have an email address) easy access to the change tracking and reverting mechanism. If healing the wiki becomes less work than spamming it, the community will be able to hold its own against the spammers for a good while.

DRH: I suggest a whitelist of approved links. Any hyperlink not on the whitelist does not get <a> tags generated but instead appears as ordinary text. Registered users and/or moderators can visit a special page that shows all URLs in the wiki that are not on the whitelist. New URLs can be added to the whitelist with a single click from the moderator, or the wiki updates containing spam URLs can be removed with a single click.
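
A rough Tcl sketch of DRH's whitelist idea; the whitelist contents and the proc name are invented for illustration and are not part of wikit.

 set urlWhitelist {www.tcl.tk wiki.tcl.tk mini.net}

 proc RenderUrl {url} {
     global urlWhitelist
     # Extract the host part and check it against the approved list.
     if {[regexp {^https?://([^/:]+)} $url -> host]
             && [lsearch -exact $urlWhitelist $host] >= 0} {
         return "<a href=\"$url\">$url</a>"
     }
     # Anything not approved shows up as plain text, worthless for PageRank.
     return $url
 }

 puts [RenderUrl "http://www.tcl.tk/man/"]
 puts [RenderUrl "http://spam.example.cn/casino"]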

A/AK: The most useful feature for spam fighting would be to undo all changes made from a given IP address. I've just noticed (and cleared) 5 spammed pages, and the spammer's IP address was the same for them all.

jcw - Ah, good point. The same feature is on the wiki inventor's site, at http://c2.com/cgi/wiki/ - hm, yes, that might be doable from the most recent page changes wikit already saves for CVS history.
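
A sketch of what A/AK's "revert everything from one IP" feature could look like; the edit-history format and the RestoreRevision helper are hypothetical stand-ins for whatever wikit's storage actually offers, so only the selection logic is the point here.

 # Stand-in for the real storage layer: just report what would happen.
 proc RestoreRevision {page revision} {
     puts "would restore page $page to revision $revision"
 }

 proc RevertByIp {badIp edits} {
     # edits is a list of {page ip previousRevision} triples, most recent first.
     foreach edit $edits {
         foreach {page ip previousRevision} $edit break
         if {$ip eq $badIp} {
             RestoreRevision $page $previousRevision
         }
     }
 }

 # Example data, invented for illustration.
 RevertByIp 221.136.43.131 {
     {9530 221.136.43.131 17}
     {14   212.100.10.1   42}
     {0    221.136.43.131 95}
 }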

SS - just a note on coordinates: the browser sends the coordinates of <input name="foo" type="image" ...> as regular POST or GET variables foo_x and foo_y, so everything from a C CGI to a Tcl ncgi script to Tclhttpd will be able to use this stuff.


Joe(at)chongqed.org - Going by the number of individual spammers, most are probably human, but the worst of the problem is the automated spammers. Just a few automated spammers can do far more damage than all the others combined. I don't think any wiki spammers are totally automated; more likely they kind of supervise the bots. Often, even with spammers that are automated (editing lots of pages in a short period), you can see that they try several different linking methods, since not all wikis use the same syntax. Once they get it worked out they seem to let the bot do its work.

I don't think it's that spammers won't care that you have a robots.txt or meta nofollow; the problem is they will probably not notice. Human spammers likely don't look, and robot spammers would have to be programmed to look. But more important, do we know for sure that nofollow doesn't still increase the link's PageRank? It says don't follow the link, but it doesn't say that the link doesn't exist for ranking purposes. For that reason I have never suggested this solution before. If your page is still very visible in Google (i.e., you have a good PageRank), spammers aren't going to notice you have a nofollow, or they may be like me and think it may not block the help to their PageRank. For them it doesn't hurt to spam anyway, on the chance it does help. I suspect that if this did work it would be suggested on one of Google's pages as a way to lessen the effect of spam. Another similar idea I have seen thrown out is to add a noindex tag to all pages; that would hurt the wiki, since no one would be able to search it or find it in an engine anymore.

Another technical solution is to limit the number of pages a user can edit in a certain time period. A normal user shouldn't need to edit more than 1 or 2 pages in less than a minute, or edit a single page more than a few times in one minute. I have seen this sometimes called edit throttling. I don't know if any wiki has implemented it yet (a rough sketch follows below).
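
A rough Tcl sketch of the edit throttling Joe describes; the limits and the array name are invented, and this is not taken from any existing wiki engine.

 array set recentEdits {}

 # Refuse a save when the same IP has already made too many edits in the last minute.
 proc AllowEdit {ip {maxPerMinute 2}} {
     global recentEdits
     set now [clock seconds]
     # Keep only this IP's edits from the last 60 seconds.
     set recent {}
     if {[info exists recentEdits($ip)]} {
         foreach t $recentEdits($ip) {
             if {$now - $t < 60} {lappend recent $t}
         }
     }
     if {[llength $recent] >= $maxPerMinute} {
         set recentEdits($ip) $recent
         return 0   ;# throttled: too many edits, likely a bot
     }
     lappend recent $now
     set recentEdits($ip) $recent
     return 1
 }

 # The third edit from the same address inside a minute is rejected.
 puts [AllowEdit 210.82.76.16]
 puts [AllowEdit 210.82.76.16]
 puts [AllowEdit 210.82.76.16]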

Spammers are starting to log in already. It's not common yet, but I have seen at least 3 or 4 different spammers do it (2 of those last week). Unless you require a password (which could drastically hurt the wiki community), it's not going to provide much protection in the near future, since just creating a login is no problem.

I don't think a URL whitelist is anti-wiki. It may be one of the best methods to save the wiki format. A similar method was used pretty effectively (though waiting for URLs to be approved is a pain) on POPFile's wiki, until a spammer accidentally ran into a UseMod bug. See http://wiki.chongqed.org//SpamBlockLoop for a description of the problem. Back to proof that even automated spammers are watching (not counting that guy): after the URL blocking, POPFile was hit by a few spammers. They attempted to get around the block by entering their URLs in different ways and usually gave up within 3-10 tries.

I don't think DaveG's idea of returning a noindex on pages that have been edited within 7 days is a good idea. When Google sees a noindex on a page that is already indexed, it removes the page. That's Google's suggested method of removing a page from the index. It would prevent the spam from being indexed, but could leave major portions of your wiki out of Google. An active wiki will always have some pages that are edited rather frequently. Even if it's less frequently than every 7 days, the timing of Googlebot's visits could still leave pages out of the index.

Thanks for linking to us and giving such a good description of our methods and all the other good ideas.


MC GoogleBot identifies itself in the User-Agent header; instead of adding a NOINDEX meta tag to pages edited within the past 7 days, if the user-agent is GoogleBot, send back the last known good revision (if the last edit is within 7 days). No harm really, since we don't expect Google to be indexing the wiki in real time anyway, yet it still gives a reasonable window for people to clean up after vandalism.
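
A sketch of MC's idea in Tcl, with stub procs standing in for wikit's real revision storage (GetCurrentRevision and GetRevisionOlderThan are made up for this sketch):

 # Stubs standing in for the real revision store.
 proc GetCurrentRevision {page} { return "latest text of page $page" }
 proc GetRevisionOlderThan {page cutoff} { return "older, vetted text of page $page" }

 # Serve Googlebot the last revision older than 7 days when the page changed recently.
 proc PageForRequest {page userAgent lastModified} {
     set week [expr {7 * 24 * 3600}]
     set now [clock seconds]
     if {[string match -nocase "*googlebot*" $userAgent]
             && $now - $lastModified < $week} {
         return [GetRevisionOlderThan $page [expr {$now - $week}]]
     }
     return [GetCurrentRevision $page]
 }

 puts [PageForRequest 9530 "Googlebot/2.1 (+http://www.google.com/bot.html)" [clock seconds]]
 puts [PageForRequest 9530 "Mozilla/5.0" [clock seconds]]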

04oct04 jcw - It's encouraging to see how many people are trying to come up with good solutions. Some proposals (such as rejecting edits from sites in CN) are likely to only be moderately effective, and only for a short time. Some tighten edit access, which is at odds with wiki zen. Some introduce an approval mechanism, and require moderators. Some focus on quick revert, making undo's nearly effortless. Ward Cunningham, wiki's inventor, recently said that he has no good answer yet. Let's keep this going, I'm certain that the right approach will float to the top...

4thOct04 NEM - The quick undo options seem like the best solution (although you'd have to make sure you could undo the undos, just in case). I think the wiki has benefitted immensely from ease of editing by anyone, so it would be bad to start requiring logins or such. One idea that occurred to me: the spamming that I have seen on this wiki has involved a very large number of links added to pages. Perhaps a simple limit on the number of external links that can be added in one edit (a rough sketch follows below)?
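
A quick Tcl sketch of the limit NEM suggests; the regexp and the limit of 5 new links per edit are arbitrary choices for illustration.

 proc CountUrls {text} {
     llength [regexp -all -inline {https?://[^\s<>"]+} $text]
 }

 # Reject an edit that adds more than $limit external links compared to the stored page.
 proc TooManyNewLinks {oldText newText {limit 5}} {
     expr {[CountUrls $newText] - [CountUrls $oldText] > $limit}
 }

 set old "see http://www.tcl.tk"
 set new "see http://www.tcl.tk http://a.cn http://b.cn http://c.cn http://d.cn http://e.cn http://f.cn"
 puts [TooManyNewLinks $old $new]   ;# 1: six links added in one edit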

04-Oct-2004 DKF: I always saw undo being done by submitting the old version as a new version, and not by revision history pruning. Pruning just makes for abuse potential.

20041005 CMcC: There are more IP addresses for spammers to use than there are websites which employ spammers. Rather than whitelisting links, a regexp filter which blacklisted links would prevent anyone from anywhere creating a link to a known spammer's patron's website. Any edit returning a page containing a blacklisted link would fail.

This removes any economic benefit to wiki spamming. The regexps could even be shared (a special /spammers page?)

The task of creating a link blacklist would be as simple as pointing at a successful spam and blacklisting all hosts in all URLs added by that edit. An additional wrinkle would be to blacklist hosts appearing in any edit attempt which failed through blacklisting (although this could be open to abuse.)

The response is a communal one similar to page reversion, but instead of simply reverting a page, one reverts with prejudice, and the wiki develops an immune response to the toxin. The process of reducing specific hosts to regexps would be an administrative task with low frequency.


07-Oct-2004 CM: I have been studying wiki spamming a bit more, both here and elsewhere, and found the following: some examples of spammed pages, times, IP addresses, domain names, keywords, spammers' "names" and habits. Hope it'll help.

  • On the Tclers wiki very recently: page http://mini.net/tclrevs/14-30-29 (58 lines of Asian URLs were added)
  • Still on the Tclers wiki, I found that the first page (named '0') has been spammed by "ec51", with an introductory sentence from the spammer: "I am from china, I want to introduce some very good chinese sites to you ,so you can find something about china cluture,people." Nice. If you search for this on Google, you'll find approx. 489 sites that this guy has probably spammed.
  • For example, on the SBML wiki the same guy (probably), with the same kind of sentence, spammed while logged in as ec51; see [L2 ]. If one example were needed that logins do more harm than good, that's it! I actually tried to remove his links, and when asked for a username I just gave up :-(. At least we know that ec51's web site is www.ec51.com, which is in the list. I have reported his site and its variants (ec51.org, ec51.net, etc.) to http://chongqed.org/ in order to let the world know what he or his customers are doing. He has even been spamming the father of all wikis, see [L3 ]. Let's hope this one goes out of business soon. I wonder if Google will blacklist his domain with all this evidence???
  • I found a listing of spam from Sept 25th but didn't note the URL; anyway, the spammer spent half an hour editing pages, from 11:14pm to 11:44pm. It would seem strange to me for a robot to spend that much time.
  • Another example, in the wiki about Cory Doctorow's DRM talk at Microsoft Research (see [L4 ]): on Sept 19, 8 pages were spammed by 210.82.76.16 between 8:08 and 8:12. I can't believe a robot would have taken five minutes to do just 8 pages. This doesn't mean, of course, that there aren't both humans and bots acting, but wiki syntax is so diverse that it seems more probable to me that humans do it and spend as much time as needed, just because Google page ranking is so important to them.
  • A last example, of the same spammer attacking two or more wikis: I found that 221.136.40.189 or 221.136.43.131 has been posting links at rubygems (e.g., see the history of this page: http://rubygems.rubyforge.org/wiki/wiki.pl?action=history&id=How_To_Start_A_Page ), at the DRM talk wiki (http://www.commonhouse.net/wiki/drm/FrontPage?action=info ), and also at the Linuxlinks wiki (see http://www.linuxlinks.com/portal/phpwiki/index.php?pagename=HomePage ), edited by 221.136.43.131 again. In each case, he/she is publishing links to my.nbip.net/homepage/nblulei/<something>. All these pages eventually point to the "diy.qyun.net" domain, which is the one to target. Searching for this domain on Google yields about a thousand results, most of them wikis! I haven't reported qyun to chongqed.org yet, but I'm on my way! :-)

Would it be possible to have the IP address in the "revision history" listing, in order to get more data about potential spammers?

Also, I haven't noticed any nofollow tags on the /tclrevs pages. If those pages are not excluded in the global robots.txt file, we should modify the scripts to insert the tags each time a history page is generated; otherwise Google will follow those links and find the spammers' sites. Another nice way would be to redirect googlebot, only when it tries to follow the "Revisions" link, back to the actual page, so that only the current content would ever be seen by Google. Let's not try to redirect spammers' bots, that would be a lost battle, but let's at least prevent spamming from actually working, even if they don't care. This, plus reporting to chongqed.org, will increase the chances that when somebody types their names they'll land on a "This Guy is a Spammer" page instead of on their site!

Otherwise, thanks for all the comments and interest. Especially to Joe for his long comment, with some new ideas and some facts letting us know that URL blocking with a whitelist might work. Maybe we'll work out some good tools to fight spam... I like, for instance, A/AK's idea of having a way to suppress everything done by a specific IP address in one click! That would really help the cleaning dramatically!

Thanks to all for working on this issue. And let the spammers know: their links are removed quickly, spamming doesn't work here, and we will fight back! I do not intend to report previous spamming, but new spamming, yes, all of it. If they know that it's dangerous for their business, they will probably look somewhere else! :-)


Christophe! Thanks for your latest reports. Although your intention was to make us aware of ec51, I quickly noticed a couple of spam links to subdomains of freewebpages.org. I reported all of these to the administrators of freewebpages.org. We already have some spammers in our database that used their services, and all of these give just a 404 error now. So there is hope, and I guess it makes sense to report as much spam as possible. Of course, sometimes you may just be wasting your time, and surely I won't write reports to chinanet or other such spammy-as-hell providers. But sometimes, spammers seem to be stupid enough to choose a good provider. -- Manni


20041007 ECS - An easy way to find all or the most recent pages changed from an IP address would be helpful. We could then examine changed pages and revert them as needed.


As people report problems here on the Wiki, be certain not to report the spammer URLs in a format that the wiki will turn into a link. That will hopefully also reduce the benefit of filling up the wiki.


11oct04 jcw - I just found out that my .htaccess blacklisting worked for wiki access, but not for raw edits via the cgi-bin/ URL (doh!). Fixed now, so several vandals should have less success from now on. Also, if the wiki is slow (as it was until a few moments ago): this is usually caused by spiders. In this case, someone was running through all the edit URLs (which are not in the cache and cause a CGI script launch each time). (Insert comment about universes and idiots here some day...)


22oct04 DPE - Fast Fourier Transform page spammed (for all of 13 minutes before I fixed it). 9 URLs for 'www paite dot net' and 6 for 'www wjmgy dot net'. The sites and keywords are both listed on http://chongqed.org

DKF (same day) - I wonder if it would be possible to check on edit submission whether a URL listed is on the chongqed list? I suppose it could be cached locally (TTL a few hours?) if the cost of doing the remote lookup every time is too high. (That spammer also hit the Starkit - How To's page. He came from 221.219.61.102 which is in China, of course.)

RHS That's a really slick idea, DKF. If it's not convenient to get a list of links off chongqed, it might be worth asking them to implement something to make it easier (RSS, SOAP, etc.).

Of course! It's always worth asking us. Just tell us what you need. But let's keep it simple. What we already have in store is this: http://blacklist.chongqed.org -- Manni
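
A sketch of how DKF's cached lookup against blacklist.chongqed.org might look in Tcl; the assumption that the list comes back as plain text with one host per line is a guess, and the real format may well differ.

 package require http

 set blacklist {}
 set blacklistFetched 0

 # Refetch the list at most every few hours (the TTL DKF suggests).
 proc GetBlacklist {{ttl 14400}} {
     global blacklist blacklistFetched
     if {[clock seconds] - $blacklistFetched > $ttl} {
         set tok [http::geturl http://blacklist.chongqed.org/]
         set blacklist [split [string trim [http::data $tok]] \n]
         http::cleanup $tok
         set blacklistFetched [clock seconds]
     }
     return $blacklist
 }

 # Check an edit's text against the cached list before accepting the save.
 proc ContainsBlacklistedUrl {text} {
     foreach host [GetBlacklist] {
         if {$host ne "" && [string first $host $text] >= 0} {
             return 1
         }
     }
     return 0
 }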

27 Oct 2004 CM God, this is really good! Firstly, the idea of CMcC (20041005) was (IMHO) excellent: do not target IP addresses but the links themselves, as they do not vary as often and are really the goal of the spammer. In fact, when these are links to their customers' web sites, those customers will be really pissed off to see they are blacklisted. I believe that this could, little by little, put an end to wiki spamming as we know it today. Secondly, I think that designing a central site for the blacklist and a flexible API/protocol does seem like a good idea, and I support the motion of collaborating with the chongqed people on this. I'm not sure we will have much to share, as maybe some spammers target some wikis, others target blogs, etc., and they might not necessarily be the same (?). People using Movable Type already have a mechanism like this, with a list of Perl-type regexps (see [L5 ]). It's interesting to study and take those URLs into account; however, it seems they are quite different from the ones found on wikis... maybe I'm wrong. On the other hand, I strongly believe that the same spammers, spamming for the same sites, always come back to the same wikis, and so the benefit of silently ignoring their edits rapidly becomes huge, even when a site maintains its own blacklist of URLs.

I started experimenting a little with my local wikit, and something as simple as adding three lines would have prevented the last spammed pages mentioned here (including the CMcC page, which I studied intensively :-). Here is the patch:

 *** modify.tcl~ Thu Jul 10 11:54:20 2003
 --- modify.tcl  Wed Oct 27 12:07:56 2004
 ***************
 *** 108,113 ****
 --- 108,116 ----
       # avoid creating a log entry and committing if nothing changed
     set text [string trimright $text]
     if {!$changed && $text == $page} return
 +   set black {shop263|haishun|7766888|asp169|fm360|genset-sh|sec66|xhhj|cndevi|sinostrategy|paite|wjmgy}
 +   if {[regexp "http://.*.cn" $text] == 1} return
 +   if {[regexp "http://www.($black)." $text] == 1} return

     # make sure it parses before deleting old references
     set newRefs [StreamToRefs [TextToStream $text] InfoProc]

Jean-Claude, could we try this?

Maybe I was a bit extreme with the first regular expression... but I did a Google search and there was no URL of this type as of today... :-)

01nov11 jcw - The above may be wider than you intended: with unescaped dots, "http://.*.cn" matches any URL with "cn" anywhere in it, not just .cn domains; you probably want the dots escaped, as in "http://.*\.cn" and "http://www\.($black)\.". Right now, spamming seems to have gone down due to another simple measure I introduced, so I'm tempted to leave it as is while that lasts. But I agree that with the above and an external blacklist we could tackle the next level of escalation when needed.


27/11/04 - new spam attack - from China. 7 pages like 'help' vandalised. Didn't get a copy of them before they were repaired. -- CMcC

Lars H: You can still get the info (e.g. for reporting the spammer on chongqed.org) from the page revisions.

30/11/04 - more Chinese spam, on wiki gripes page.

19jan2005 jcw - Would it be an idea to add "rel=nofollow" to all wiki-generated links? See [L6 ].

DKF: But surely we want the Wiki to be strengthened by normal external links? (Definitely don't add it to normal intra-Wiki links, of course.)
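
A small sketch of how the rel="nofollow" attribute could be applied only to external links, leaving intra-wiki references alone as DKF suggests (the proc name is invented; this is not the actual wikit formatter):

 proc FormatLink {url {label ""}} {
     if {$label eq ""} {set label $url}
     if {[string match "http*://*" $url]} {
         # External link: ask search engines not to pass rank through it.
         return "<a href=\"$url\" rel=\"nofollow\">$label</a>"
     }
     # Internal page reference: a plain link, so the wiki itself still benefits.
     return "<a href=\"$url\">$label</a>"
 }

 puts [FormatLink "http://example.com/" "an external site"]
 puts [FormatLink "/14" "a wiki page"]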

I think getting Google et al. to blacklist all domains on blacklist.chongqed.org would be the most effective strategy I've heard. -MrElvey Not to mention blacklisting (i.e., not indexing) surbl.org's list of IPs! Anyone work at Google? Update this if you make a connection or send 'em feedback (e.g., how can we improve these results?)


RS 2005-01-27: Another attack, 9 pages, IP: 83.217.6.205

jcw - I wish there were a pattern. I've added a link on several pages to a prominent image describing the fact that this site uses rel=nofollow and that entries on this wiki no longer affect page rank. See The Tcler's Wiki for an example of how it looks and how to insert it elsewhere. Keep in mind that a [...] image link at the top of a wiki page becomes the page title, which is why I inserted a horizontal bar as well.


RLH I have seen on blogs recently that you have to add two numbers together before you can actually post. If that were rolled in, it would at least force a human agent to do the posting. Just a thought...
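
For what it's worth, a tiny Tcl sketch of RLH's add-two-numbers idea; how the expected answer travels between the edit form and the save request (session, hidden field, ...) is left out, and the proc names are invented.

 proc MakeChallenge {} {
     set a [expr {int(rand() * 10) + 1}]
     set b [expr {int(rand() * 10) + 1}]
     return [list "What is $a + $b?" [expr {$a + $b}]]
 }

 proc CheckAnswer {expected given} {
     expr {[string is integer -strict $given] && $given == $expected}
 }

 foreach {question answer} [MakeChallenge] break
 puts $question
 puts [CheckAnswer $answer $answer]   ;# 1 = edit accepted
 puts [CheckAnswer $answer "x"]       ;# 0 = edit rejected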


JSI - jcw, would you like to add the image to the edit screen? This way human spammers will see it regardless of which page they edit, and in my opinion the Wiki would suffer from having the image added to every page.

Yes. I just put this in for now as a stopgap measure; tweaking the wikit code for the edit screen sounds like a good idea. Will look into it in a spare moment. -jcw

JSI - I'd suggest the "footer-solution": Just add an ID to the H2-heading of the edit screen and add the image to the heading as background via CSS. This way the change to the wikit-codebase would be minimal.

Thanks for the idea. It ended up being even simpler: I did a local-mode edit of Wiki CGI settings and added the image. Voilà - trivial, once the proper approach floats to the top! -jcw

DPE The following page has been spammed and needs restoring to a previous version https://wiki.tcl-lang.org/9530


Lars H, 28 May 2005: We seem to suffer from a new wave of Wiki spamming. Characteristics so far:

  • So far only the Wiki and Searching and bookmarking URLs pages on the Tcl'ers Wiki have been affected, suggesting the spammer is going from wiki to wiki, but not very far into the individual wikis.
  • The spammer is not using keywords (such as those we've seen in the past), but instead tries to hide the spam link as a period. This is probably meant to be inconspicuous, but of course doesn't work with Wikit syntax.
  • All links have been to sites in the domain bigsitecity.com, but the purpose of this seems a bit unclear as the linked-to site doesn't seem to exist.
  • Most annoying, the spam seems to be restored within hours of its removal.

Well... if it's just two pages and a single vandal - perhaps just leave it in after a few cleanup attempts - bits are cheap. -jcw