Difference between revisions of "SoC improve the Spam-X plugin"

From GeeklogWiki
(SWOT; Links)
(rewriting this idea page again for GSoC 2011 (work in progress))
Line 3: Line 3:
 
== Introduction ==
 
== Introduction ==
  
Comment spam doesn't need an introduction - pretty much every site gets it. Geeklog ships with its own spam filter, called [http://www.geeklog.net/docs/english/spamx.html Spam-X]. This filter can easily be extended by adding modules so that it can either be updated for the spammer's latest tricks or to add support for new anti-spam services. This project is about the latter - creating modules for new services.
+
Comment spam doesn't need an introduction - pretty much every site gets it. Geeklog ships with its own spam filter, called [http://www.geeklog.net/docs/english/spamx.html Spam-X]. This filter can easily be extended by adding modules so that it can either be updated for the spammer's latest tricks or to add support for new anti-spam services.
  
  
 
== Incentive ==
 
== Incentive ==
  
Geeklog currently ships with a Spam-X module for [http://linksleeve.org/ LinkSleeve] (aka SLV). At the time, this was the only free service available that didn't require creating an account, so that it is usable "out of the box".
+
Spam-X works very well in practice. But of course, there is always room for improvement and that is what this project is about. We are looking for ways to make using the spam filter more efficient and also try to extend the filtering capabilities.
  
Over time, more anti-spam services have appeared. One goal of this project is to evaluate these services. And to be able to do that, we would need Spam-X modules to support these services.
+
From a usage point of view, the handling of long lists of blacklisted phrases and IP addresses could be improved. Also, there are new anti-spam services that we would want to try out. And finally, there's also the idea of setting up a new anti-spam service that is based on a "web of trust".
  
The second half of this project is then about creating a new anti-spam service (see details below) and compare it with the existing services.
 
  
 +
== Part 1: Usability improvements ==
  
== Part 1: New modules for existing services ==
+
''TBD''
  
=== Services ===
 
  
Here's a quick rundown of some existing anti-spam services:
+
== Part 2: A new spam filter module ==
  
==== Akismet ====
+
Geeklog currently ships with a Spam-X module for [http://linksleeve.org/ LinkSleeve] (aka SLV). At the time, this was the only free service available that didn't require creating an account, so that it is usable "out of the box".
 
 
[http://akismet.com/ Akismet] is associated with the WordPress blog platform. The service initially required a wordpress.com account, which made it not suitable for use in Geeklog (asking our users to sign up with a competitor's website would have looked odd). This requirement has since been dropped: You still need to sign up but can do so now from the Akismet homepage. The service is free (commercial options available).
 
 
 
There is already an older version of an [http://gplugs.cvs.sourceforge.net/gplugs/akismet/ Akismet module] for Spam-X. It will probably need a review to check for API changes.
 
 
 
==== Defensio ====
 
 
 
[http://defensio.com/ Defensio] is a service owned by security firm Websense. The service requires signup and is free "for all personal bloggers" (commercial options available).
 
 
 
==== Mollom ====
 
  
[http://mollom.com/ Mollom] is loosely associated with the Drupal CMS. It requires signup and offers both free and for-pay services.
+
Over time, more anti-spam services have appeared (see below for a full list). One of the most interesting - and free - anti-spam services these days is [http://mollom.com/ Mollom], which is associated with the Drupal community (but not limited to Drupal sites).  
  
One specialty of Mollom is that it has an "unsure" categorization for posts where it's not quite sure yet whether the post is spam or not. In this case, it displays a CAPTCHA, so that the poster will have to confirm that they are human.
+
One important difference between Mollom and other services is that it can return an "unsure" ranking for a comment post. This means that the post may or may not be spam - Mollom isn't sure. So what do we do? Display a CAPTCHA to the poster.
  
==== TypePad AntiSpam ====
+
This concept, however, cannot easily be integrated into Geeklog right now. First of all, Spam-X currently expects either a "thumbs up" or "thumbs down" answer, after which the comment post is either allowed or dismissed. "I don't know" simply isn't supported. So to support this third possible reply, some changes will have to be made to Spam-X itself and to the Geeklog code calling it.
  
[http://antispam.typepad.com/ TypePad AntiSpam] is, as the name implies, associated with the TypePad CMS. It is currently (still) in beta. The service requires signup and is free ("and will always be free" -- quote from the website).
+
There's also the problem that Geeklog's CAPTCHA plugin would always display a CAPTCHA to the user. So if Mollom were to return an "unsure" response, the poster would have to solve ''two'' CAPTCHAs, which would be very annoying. So we need a solution for this scenario, e.g. some communication with the CAPTCHA plugin.
  
==== Stop Forum Spam ====
+
And finally, since we're going to have to change the Spam-X API anyway, this would be a good opportunity to address a design flaw: Once a comment post is considered spam, it is deleted. There is currently no way to store the post for later review (and possible approval).
  
[http://www.stopforumspam.com/ Stop Forum Spam] is a private (one-man?) project. It does not require signup and is free.
 
  
Their API has an option to check for a poster's user name, which at first glance doesn't seem like a reliable criterion to detect spam (the API has other options as well). Michael Hampton (author of Bad Behavior) also [http://www.bad-behavior.ioerror.us/2010/02/20/stop-forum-spam/ expressed some doubts] about the service.
+
== Part 3: SWOT ==
  
=== Discussion ===
+
Most anti-spam services work on the assumption that the same sort of spam is going to hit a lot of sites. The bigger a spam wave, the more likely (and faster) it is going to be recognized by one of these services, as they get reports from sites all over the web.
 
 
Since a site would normally only use one (or maybe two) of these services, we would also need an option in the Spam-X plugin to disable modules. Currently, you can simply drop new modules into the Spam-X plugin's directory and they will be picked up automatically.
 
 
 
We would like to see a short evaluation of these services as one result of this project. For a proper comparison, the modules should probably be installed in parallel but not be used to actually delete spam. So a sort of evaluation mode could be introduced (and also added to the existing SLV module).
 
 
 
 
 
== Part 2: SWOT ==
 
 
 
The services discussed above work on the assumption that the same sort of spam is going to hit a lot of sites. The bigger a spam wave, the more likely (and faster) it is going to be recognized by one of these services, as they get reports from sites all over the web.
 
  
 
Once these services get big (i.e. the more sites they have reporting to them), there is a chance that smaller spam waves may not be recognized. So a spammer that only targets a few sites and with a low volume may get away with it. Of course, the admins of a site hit by this sort of spam will recognize it as spam and remove it. But how could they then alert other site admins?
 
Once these services get big (i.e. the more sites they have reporting to them), there is a chance that smaller spam waves may not be recognized. So a spammer that only targets a few sites and with a low volume may get away with it. Of course, the admins of a site hit by this sort of spam will recognize it as spam and remove it. But how could they then alert other site admins?
Line 73: Line 52:
 
== Level of Difficulty ==
 
== Level of Difficulty ==
  
''easy to medium''
+
''medium''
  
Implementing modules for the existing services should be relatively straightforward. Some more thought and work will be required for the SWOT implementation.
+
The usability changes and implementation of the Mollom module itself should be relatively straightforward (there already is a PHP class for Mollom). Changing the Spam-X API will be more demanding, especially since backward compatibility will be an issue. Implementing SWOT will also require more thought and work.
  
  
Line 82: Line 61:
 
* [[Dealing with Spam]] in Geeklog
 
* [[Dealing with Spam]] in Geeklog
 
* [[Filtering Spam with Spam-X]]
 
* [[Filtering Spam with Spam-X]]
 +
 +
 +
== Addendum: Other Services ==
 +
 +
(for completeness and not relevant for this project)
 +
 +
=== Akismet ===
 +
 +
[http://akismet.com/ Akismet] is associated with the WordPress blog platform. The service initially required a wordpress.com account, which made it unsuitable for use in Geeklog (asking our users to sign up with a competitor's website would have looked odd). This requirement has since been dropped: You still need to sign up but can now do so from the Akismet homepage. The service is free (payments are suggested, though, and commercial options are also available).
 +
 +
There is already an older version of an [http://gplugs.cvs.sourceforge.net/gplugs/akismet/ Akismet module] for Spam-X. It will probably need a review to check for API changes.
 +
 +
=== Defensio ===
 +
 +
[http://defensio.com/ Defensio] is a service owned by security firm Websense. The service requires signup and is free "for all personal bloggers" (commercial options available).
 +
 +
=== TypePad AntiSpam ===
 +
 +
[http://antispam.typepad.com/ TypePad AntiSpam] is, as the name implies, associated with the TypePad CMS. It is currently (still) in beta. The service requires signup and is free ("and will always be free" -- quote from the website).
 +
 +
=== Stop Forum Spam ===
 +
 +
[http://www.stopforumspam.com/ Stop Forum Spam] is a private (one-man?) project. It does not require signup and is free.
 +
 +
Their API has an option to check for a poster's user name, which at first glance doesn't seem like a reliable criterion to detect spam (the API has other options as well). Michael Hampton (author of Bad Behavior) also [http://www.bad-behavior.ioerror.us/2010/02/20/stop-forum-spam/ expressed some doubts] about the service.
  
  
 
[[Category:Summer of Code]] [[Category:Development]]
 
[[Category:Summer of Code]] [[Category:Development]]

Revision as of 13:16, 5 March 2011

(This is an idea page for the Google Summer of Code)

Introduction

Comment spam doesn't need an introduction - pretty much every site gets it. Geeklog ships with its own spam filter, called Spam-X. This filter can easily be extended by adding modules, either to keep up with the spammers' latest tricks or to add support for new anti-spam services.


Incentive

Spam-X works very well in practice. But of course, there is always room for improvement and that is what this project is about. We are looking for ways to make using the spam filter more efficient and also try to extend the filtering capabilities.

From a usage point of view, the handling of long lists of blacklisted phrases and IP addresses could be improved. Also, there are new anti-spam services that we would want to try out. And finally, there's also the idea of setting up a new anti-spam service that is based on a "web of trust".


Part 1: Usability improvements

TBD


Part 2: A new spam filter module

Geeklog currently ships with a Spam-X module for LinkSleeve (aka SLV). When the module was added, this was the only free service available that didn't require creating an account, which makes it usable "out of the box".

Over time, more anti-spam services have appeared (see below for a full list). One of the most interesting - and free - anti-spam services these days is Mollom, which is associated with the Drupal community (but not limited to Drupal sites).

One important difference between Mollom and other services is that it can return an "unsure" ranking for a comment post. This means that the post may or may not be spam - Mollom isn't sure. So what do we do? Display a CAPTCHA to the poster.

This concept, however, cannot easily be integrated into Geeklog right now. First of all, Spam-X currently expects either a "thumbs up" or "thumbs down" answer, after which the comment post is either allowed or dismissed. "I don't know" simply isn't supported. So to support this third possible reply, some changes will have to be made to Spam-X itself and to the Geeklog code calling it.
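
As a rough illustration of the three-way answer described above, here is a minimal sketch. Geeklog and Spam-X are written in PHP; Python is used here only to keep the example short. The names Verdict, combine and legacy_to_verdict are hypothetical and not part of the existing Spam-X API.

```python
from enum import Enum


class Verdict(Enum):
    """Hypothetical three-way result a spam module could return."""
    HAM = "ham"        # "thumbs up"
    UNSURE = "unsure"  # "I don't know"
    SPAM = "spam"      # "thumbs down"


# Strictness ranking used when several modules disagree (an assumption).
_STRICTNESS = {Verdict.HAM: 0, Verdict.UNSURE: 1, Verdict.SPAM: 2}


def legacy_to_verdict(is_spam: bool) -> Verdict:
    """Adapt an existing yes/no module to the three-way result."""
    return Verdict.SPAM if is_spam else Verdict.HAM


def combine(verdicts) -> Verdict:
    """Return the strictest verdict reported by any module.

    With no modules installed, the post is treated as ham.
    """
    result = Verdict.HAM
    for verdict in verdicts:
        if _STRICTNESS[verdict] > _STRICTNESS[result]:
            result = verdict
    return result
```

An adapter like legacy_to_verdict is one possible way to keep older yes/no modules working while newer modules return the third value.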

There's also the problem that Geeklog's CAPTCHA plugin would always display a CAPTCHA to the user. So if Mollom were to return an "unsure" response, the poster would have to solve two CAPTCHAs, which would be very annoying. So we need a solution for this scenario, e.g. some communication with the CAPTCHA plugin.
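
One possible shape for that communication, sketched under assumptions: the CAPTCHA plugin tells the caller whether the poster has already solved a CAPTCHA for this submission, and an "unsure" answer only triggers a challenge if that has not happened yet. The function name next_step and the string values are hypothetical (Python is used for illustration only; Geeklog is PHP).

```python
def next_step(verdict: str, captcha_solved: bool) -> str:
    """Decide what to do with a comment post.

    verdict        -- "ham", "unsure" or "spam" (hypothetical values)
    captcha_solved -- True if the poster already solved a CAPTCHA
                      for this submission (assumed to be reported
                      by the CAPTCHA plugin)
    """
    if verdict == "spam":
        return "reject"
    if verdict == "unsure" and not captcha_solved:
        # Only now ask the poster to prove they are human.
        return "challenge"
    # Ham, or unsure with a CAPTCHA already solved: no second CAPTCHA.
    return "accept"
```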

And finally, since we're going to have to change the Spam-X API anyway, this would be a good opportunity to address a design flaw: Once a comment post is considered spam, it is deleted. There is currently no way to store the post for later review (and possible approval).
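
Storing posts for later review could look roughly like the following sketch. It keeps held posts in memory only; a real implementation would presumably use a database table. The class name Quarantine and its methods are hypothetical (Python is used for illustration only).

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Quarantine:
    """In-memory holding area for posts flagged as spam."""
    held: Dict[int, Tuple[str, str]] = field(default_factory=dict)
    next_id: int = 1

    def hold(self, comment: str, reason: str) -> int:
        """Store a flagged post instead of deleting it; return its id."""
        comment_id = self.next_id
        self.held[comment_id] = (comment, reason)
        self.next_id += 1
        return comment_id

    def approve(self, comment_id: int) -> str:
        """Release a held post and return its text for publishing."""
        comment, _reason = self.held.pop(comment_id)
        return comment

    def discard(self, comment_id: int) -> None:
        """Delete a held post for good (no error if already gone)."""
        self.held.pop(comment_id, None)
```

An admin would then review the held posts and either approve or discard each one.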


Part 3: SWOT

Most anti-spam services work on the assumption that the same sort of spam is going to hit a lot of sites. The bigger a spam wave, the more likely (and faster) it is going to be recognized by one of these services, as they get reports from sites all over the web.

Once these services get big (i.e. the more sites they have reporting to them), there is a chance that smaller spam waves may not be recognized. So a spammer that only targets a few sites and with a low volume may get away with it. Of course, the admins of a site hit by this sort of spam will recognize it as spam and remove it. But how could they then alert other site admins?

Another use case: At BarCamp Stuttgart 2008, there was a report by participants about a poster who was very active in some loosely connected blogs. He posted comments that were more or less on topic but always included a link to his (unrelated) services. The participants expressed that they would be willing to trust other bloggers who already identified this sort of "borderline spam".

Spam: Web of Trust (SWOT) by Michael Jervis provides a framework for this sort of trust relationship in spam reports. The idea is that a website provides an RSS feed of the spam that it has identified and blocked. Other site admins who trust the owner of this site can then subscribe to this feed and won't need to take care of the same sort of spam. They can then publish their own feed, and so on, building an entire web of trusted feeds that would allow for quick propagation of information about spammers.

We would like to see this concept implemented as a module for Spam-X.

  • A site owner should be able to subscribe to other SWOT feeds.
  • Not all "locally" blocked spam should go into a SWOT feed automatically (e.g. a site may have very strict rules to not allow posts in other languages, but such a feed would not be very useful for other sites).
  • It should be possible to publish more than one SWOT feed, e.g. for different levels of filtering or different criteria.
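
The requirements above could be sketched as follows. This is an illustration only: the actual SWOT feed format is not described here, so the example simply puts one blocked pattern into each RSS item's description element. All function names, the "feeds" tagging scheme and the example data are hypothetical (Python is used for illustration; Geeklog is PHP).

```python
import xml.etree.ElementTree as ET


def select_for_feed(blocked, feed_name):
    """Pick only those locally blocked patterns tagged for a given feed."""
    return [entry["pattern"] for entry in blocked if feed_name in entry["feeds"]]


def build_feed(title, patterns):
    """Serialize patterns as a bare-bones RSS 2.0 document (assumed layout)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    for pattern in patterns:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "description").text = pattern
    return ET.tostring(rss, encoding="unicode")


def read_feed(xml_text):
    """Extract the patterns from a feed produced by build_feed."""
    root = ET.fromstring(xml_text)
    return [item.findtext("description") for item in root.iter("item")]


def is_blocked(comment, trusted_feeds):
    """Check a comment against every feed the site owner subscribed to."""
    for feed in trusted_feeds:
        for pattern in read_feed(feed):
            if pattern and pattern in comment:
                return True
    return False
```

Tagging each blocked entry with the feeds it belongs to is one way to keep purely local rules out of a published feed and to publish several feeds with different levels of filtering.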


Level of Difficulty

medium

The usability changes and implementation of the Mollom module itself should be relatively straightforward (there already is a PHP class for Mollom). Changing the Spam-X API will be more demanding, especially since backward compatibility will be an issue. Implementing SWOT will also require more thought and work.


Further Reading

  • Dealing with Spam in Geeklog
  • Filtering Spam with Spam-X


Addendum: Other Services

(for completeness and not relevant for this project)

Akismet

Akismet is associated with the WordPress blog platform. The service initially required a wordpress.com account, which made it unsuitable for use in Geeklog (asking our users to sign up with a competitor's website would have looked odd). This requirement has since been dropped: You still need to sign up but can now do so from the Akismet homepage. The service is free (payments are suggested, though, and commercial options are also available).

There is already an older version of an Akismet module for Spam-X. It will probably need a review to check for API changes.

Defensio

Defensio is a service owned by security firm Websense. The service requires signup and is free "for all personal bloggers" (commercial options available).

TypePad AntiSpam

TypePad AntiSpam is, as the name implies, associated with the TypePad CMS. It is currently (still) in beta. The service requires signup and is free ("and will always be free" -- quote from the website).

Stop Forum Spam

Stop Forum Spam is a private (one-man?) project. It does not require signup and is free.

Their API has an option to check for a poster's user name, which at first glance doesn't seem like a reliable criterion to detect spam (the API has other options as well). Michael Hampton (author of Bad Behavior) also expressed some doubts about the service.