Difference between revisions of "SoC improve the Spam-X plugin"

From GeeklogWiki
(SWOT; Links)
(language tweaks and clarifications; added a link to the CAPTCHA plugin announcement)
 
(8 intermediate revisions by the same user not shown)
Line 3: Line 3:
 
== Introduction ==
 
== Introduction ==
  
Comment spam doesn't need an introduction - pretty much every site gets it. Geeklog ships with its own spam filter, called [http://www.geeklog.net/docs/english/spamx.html Spam-X]. This filter can easily be extended by adding modules so that it can either be updated for the spammer's latest tricks or to add support for new anti-spam services. This project is about the latter - creating modules for new services.
+
Comment spam doesn't need an introduction - pretty much every site gets it. Geeklog ships with its own spam filter, called [http://www.geeklog.net/docs/english/spamx.html Spam-X]. This filter can easily be extended with modules, either to keep up with spammers' latest tricks or to add support for new anti-spam services.
  
  
 
== Incentive ==
 
== Incentive ==
  
Geeklog currently ships with a Spam-X module for [http://linksleeve.org/ LinkSleeve] (aka SLV). At the time, this was the only free service available that didn't require creating an account, so that it is usable "out of the box".
+
Spam-X works very well in practice. There is always room for improvement of course and that is what this project is about. We are looking for ways to make using the spam filter more efficient (for the site admin) and also try to extend the filtering capabilities.
  
Over time, more anti-spam services have appeared. One goal of this project is to evaluate these services. And to be able to do that, we would need Spam-X modules to support these services.
+
From a usage point of view, the handling of long lists of blacklisted phrases and IP addresses could be improved. Also, there are new anti-spam services that we would want to try out.
  
The second half of this project is then about creating a new anti-spam service (see details below) and comparing it with the existing services.
 
  
 +
== Part 1: Usability improvements ==
  
== Part 1: New modules for existing services ==
+
In addition to using external services (see below) that rate posts as spam or not spam, site owners can use their own blacklists to fine-tune filtering of spam specifically for their own site. As a result, however, you'll often end up with long lists of blacklist entries. There are two obvious problems with this:
  
=== Services ===
+
# you don't know whether or not such a rule is still valid, i.e. used to filter spam
 +
# the long lists are hard to use and maintain
  
Here's a quick rundown of some existing anti-spam services:
+
We have (raw and unfinished) patches for both of these issues ([http://project.geeklog.net/tracking/view.php?id=1076 #1076] and [http://project.geeklog.net/tracking/view.php?id=1077 #1077]). So the minimal goal for this part would be to finish and implement these changes.
  
==== Akismet ====
+
However, we would also welcome new ideas on how to better handle these issues. Surely, having a sortable list of several hundred entries is not the only possible solution to this? Here's a chance for a student (i.e. you) to come up with a clever idea that sets your proposal apart from the others.
  
[http://akismet.com/ Akismet] is associated with the WordPress blog platform. The service initially required a wordpress.com account, which made it not suitable for use in Geeklog (asking our users to sign up with a competitor's website would have looked odd). This requirement has since been dropped: You still need to sign up but can do so now from the Akismet homepage. The service is free (commercial options available).
+
Other UI improvements that we're looking for in this part of the project (and that should be easy to implement) would ensure consistency with the "look and feel" of the rest of Geeklog. Currently, the Spam-X plugin sticks out a bit, both in the way the admin panels work and in how they look. This would make a good first task in the project.
  
There is already an older version of an [http://gplugs.cvs.sourceforge.net/gplugs/akismet/ Akismet module] for Spam-X. It will probably need a review to check for API changes.
 
  
==== Defensio ====
+
== Part 2: A new spam filter module and API changes ==
  
[http://defensio.com/ Defensio] is a service owned by security firm Websense. The service requires signup and is free "for all personal bloggers" (commercial options available).
+
Geeklog currently ships with a Spam-X module for [http://linksleeve.org/ LinkSleeve] (aka SLV). At the time, this was the only free service available that didn't require creating an account, so that it is usable "out of the box".
  
==== Mollom ====
+
Over time, more anti-spam services have appeared ([[Other anti-spam services|see list]]). One of the most interesting - and free - anti-spam services these days is [http://mollom.com/ Mollom], which is associated with the Drupal community (but not limited to Drupal sites).
  
[http://mollom.com/ Mollom] is loosely associated with the Drupal CMS. It requires signup and offers both free and for-pay services.
+
One important difference between Mollom and other services is that it can return an "unsure" ranking for a comment post. This means that the post may or may not be spam - Mollom isn't sure. So what do we do? Display a CAPTCHA to the poster.
  
One specialty of Mollom is that it has an "unsure" categorization for posts where it's not quite sure yet whether the post is spam or not. In this case, it displays a CAPTCHA, so that the poster will have to confirm that they are human.
+
This concept, however, cannot easily be integrated into Geeklog right now. First of all, Spam-X currently expects either a "thumbs up" or "thumbs down" answer, after which the comment post is either allowed or dismissed. "I don't know" simply isn't supported. So to support this third possible reply, some changes will have to be made to Spam-X itself and to the Geeklog code calling it.
  
==== TypePad AntiSpam ====
+
There's also the problem that Geeklog's CAPTCHA plugin would always display a CAPTCHA to the user. So if Mollom were to return an "unsure" response, the poster would have to solve ''two'' CAPTCHAs, which would be very annoying. We need a solution for this scenario, e.g. some communication with the CAPTCHA plugin.
  
[http://antispam.typepad.com/ TypePad AntiSpam] is, as the name implies, associated with the TypePad CMS. It is currently (still) in beta. The service requires signup and is free ("and will always be free" -- quote from the website).
+
And finally, since we're going to have to change the Spam-X API anyway, this would be a good opportunity to address a design flaw of the API: Once a comment post is considered spam, it is deleted. There is currently no way to store the post for later review (and possible approval). Fixing this is not as simple as it may seem, though, since currently Spam-X simply doesn't know where the post came from - it could be a comment or a story submission, both of which would have to be treated differently.
  
==== Stop Forum Spam ====
+
Backward compatibility has to be considered. There is third-party code out there that uses the current Spam-X API that we don't want to break. An easy, though not the only, way out would be to introduce new API functions.
  
[http://www.stopforumspam.com/ Stop Forum Spam] is a private (one-man?) project. It does not require signup and is free.
 
 
Their API has an option to check for a poster's user name, which at first glance doesn't seem like a reliable criterion to detect spam (the API has other options as well). Michael Hampton (author of Bad Behavior) also [http://www.bad-behavior.ioerror.us/2010/02/20/stop-forum-spam/ expressed some doubts] about the service.
 
 
=== Discussion ===
 
 
Since a site would normally only use one (or maybe two) of these services, we would also need an option in the Spam-X plugin to disable modules. Currently, you can simply drop new modules into the Spam-X plugin's directory and they will be picked up automatically.
 
 
We would like to see a short evaluation of these services as one result of this project. For a proper comparison, the modules should probably be installed in parallel but not be used to actually delete spam. So a sort of evaluation mode could be introduced (and also added to the existing SLV module).
 
 
 
== Part 2: SWOT ==
 
 
The services discussed above work on the assumption that the same sort of spam is going to hit a lot of sites. The bigger a spam wave, the more likely (and faster) it is going to be recognized by one of these services, as they get reports from sites all over the web.
 
 
Once these services get big (i.e. the more sites they have reporting to them), there is a chance that smaller spam waves may not be recognized. So a spammer that only targets a few sites and with a low volume may get away with it. Of course, the admins of a site hit by this sort of spam will recognize it as spam and remove it. But how could they then alert other site admins?
 
 
Another use case: At BarCamp Stuttgart 2008, there was a report by participants about a poster who was very active in some loosely connected blogs. He posted comments that were more or less on topic but always included a link to his (unrelated) services. The participants expressed that they would be willing to trust other bloggers who already identified this sort of "borderline spam".
 
 
[http://swot.fuckingbrit.com/ Spam: Web of Trust] (SWOT) by Michael Jervis provides a framework for this sort of trust relationship in spam reports. The idea is that a website provides an RSS feed of the spam that it identified and blocks. Other site admins who trust the owner of this site can then subscribe to this feed and won't need to take care of the same sort of spam. They can then publish their own feed, and so on, building an entire web of trusted feeds that would allow for quick propagation of information about spammers.
 
 
We would like to see this concept implemented as a module for Spam-X.
 
 
* A site owner should be able to subscribe to other SWOT feeds.
 
* Not all "locally" blocked spam should go into a SWOT feed automatically (e.g. a site may have very strict rules to not allow posts in other languages, but such a feed would not be very useful for other sites).
 
* It should be possible to publish more than one SWOT feed, e.g. for different levels of filtering or different criteria.
 
  
 +
== Level of Difficulty ==
  
== Level of Difficulty ==
+
''low to medium''
  
''easy to medium''
+
The usability changes and implementation of the Mollom module itself should be relatively straightforward (there already is a PHP class for Mollom). Changing the Spam-X API will be more demanding, especially since backward compatibility will be an issue.
  
Implementing modules for the existing services should be relatively straightforward. Some more thought and work will be required for the SWOT implementation.
+
''Possible mentors:'' [http://www.geeklog.net/users.php?mode=profile&uid=11721 Tom Homer], [http://www.geeklog.net/users.php?mode=profile&uid=408 Dirk Haun]
  
  
Line 82: Line 57:
 
* [[Dealing with Spam]] in Geeklog
 
* [[Dealing with Spam]] in Geeklog
 
* [[Filtering Spam with Spam-X]]
 
* [[Filtering Spam with Spam-X]]
 +
* [[Other anti-spam services]]
 +
* [http://www.geeklog.net/article.php/2010051308024941 CAPTCHA plugin for Geeklog]
  
  
 
[[Category:Summer of Code]] [[Category:Development]]
 
[[Category:Summer of Code]] [[Category:Development]]

Latest revision as of 09:05, 29 March 2013

(This is an idea page for the Google Summer of Code)

Introduction

Comment spam doesn't need an introduction - pretty much every site gets it. Geeklog ships with its own spam filter, called Spam-X. This filter can easily be extended with modules, either to keep up with spammers' latest tricks or to add support for new anti-spam services.


Incentive

Spam-X works very well in practice. There is always room for improvement of course and that is what this project is about. We are looking for ways to make using the spam filter more efficient (for the site admin) and also try to extend the filtering capabilities.

From a usage point of view, the handling of long lists of blacklisted phrases and IP addresses could be improved. Also, there are new anti-spam services that we would want to try out.


Part 1: Usability improvements

In addition to using external services (see below) that rate posts as spam or not spam, site owners can use their own blacklists to fine-tune filtering of spam specifically for their own site. As a result, however, you'll often end up with long lists of blacklist entries. There are two obvious problems with this:

  1. you don't know whether or not such a rule is still valid, i.e. used to filter spam
  2. the long lists are hard to use and maintain

We have (raw and unfinished) patches for both of these issues (#1076 and #1077). So the minimal goal for this part would be to finish and implement these changes.
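
One conceivable way to tell whether a blacklist rule is still in use is to record when it last matched. The following is only a minimal sketch of that idea, written in Python for illustration (Geeklog itself is PHP). It is not necessarily what the patches do, and the names BlacklistEntry, check_post and stale_entries, as well as the 180-day cutoff, are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class BlacklistEntry:
    """One blacklist rule plus usage statistics (hypothetical structure)."""
    pattern: str
    hits: int = 0
    last_hit: Optional[datetime] = None


def check_post(text: str, entries: List[BlacklistEntry], now: datetime) -> bool:
    """Return True if an entry matches; record the match on that entry."""
    lowered = text.lower()
    for entry in entries:
        if entry.pattern.lower() in lowered:
            entry.hits += 1
            entry.last_hit = now
            return True
    return False


def stale_entries(entries: List[BlacklistEntry], now: datetime,
                  max_age: timedelta = timedelta(days=180)) -> List[BlacklistEntry]:
    """Entries that never matched, or have not matched within max_age."""
    return [e for e in entries
            if e.last_hit is None or now - e.last_hit > max_age]
```

With such statistics, the admin panel could list only the stale entries as candidates for removal, which also addresses the length of the lists.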

However, we would also welcome new ideas on how to better handle these issues. Surely, having a sortable list of several hundred entries is not the only possible solution to this? Here's a chance for a student (i.e. you) to come up with a clever idea that sets your proposal apart from the others.

Other UI improvements that we're looking for in this part of the project (and that should be easy to implement) would ensure consistency with the "look and feel" of the rest of Geeklog. Currently, the Spam-X plugin sticks out a bit, both in the way the admin panels work and in how they look. This would make a good first task in the project.


Part 2: A new spam filter module and API changes

Geeklog currently ships with a Spam-X module for LinkSleeve (aka SLV). At the time, this was the only free service available that didn't require creating an account, so that it is usable "out of the box".

Over time, more anti-spam services have appeared (see list). One of the most interesting - and free - anti-spam services these days is Mollom, which is associated with the Drupal community (but not limited to Drupal sites).

One important difference between Mollom and other services is that it can return an "unsure" ranking for a comment post. This means that the post may or may not be spam - Mollom isn't sure. So what do we do? Display a CAPTCHA to the poster.

This concept, however, cannot easily be integrated into Geeklog right now. First of all, Spam-X currently expects either a "thumbs up" or "thumbs down" answer, after which the comment post is either allowed or dismissed. "I don't know" simply isn't supported. So to support this third possible reply, some changes will have to be made to Spam-X itself and to the Geeklog code calling it.
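
To illustrate what a third possible reply could look like, here is a small sketch in Python (for illustration only, since Geeklog is PHP). The Verdict type, the action names, and the rule for combining the answers of several modules are all assumptions, not part of the current Spam-X API.

```python
from enum import Enum
from typing import Iterable


class Verdict(Enum):
    """Hypothetical three-state result of a spam check."""
    HAM = "ham"
    SPAM = "spam"
    UNSURE = "unsure"


def combine(verdicts: Iterable[Verdict]) -> Verdict:
    """Merge the answers of several modules (assumed rule):
    any SPAM wins, otherwise any UNSURE wins, otherwise HAM."""
    seen = set(verdicts)
    if Verdict.SPAM in seen:
        return Verdict.SPAM
    if Verdict.UNSURE in seen:
        return Verdict.UNSURE
    return Verdict.HAM


def action_for(verdict: Verdict) -> str:
    """Map a verdict to what the calling code would do with the post."""
    if verdict is Verdict.HAM:
        return "accept"
    if verdict is Verdict.SPAM:
        return "reject"
    return "challenge"  # e.g. show a CAPTCHA to the poster
```

The calling code would then have to handle three outcomes instead of two, which is the change to Geeklog itself mentioned above.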

There's also the problem that Geeklog's CAPTCHA plugin would always display a CAPTCHA to the user. So if Mollom were to return an "unsure" response, the poster would have to solve two CAPTCHAs, which would be very annoying. We need a solution for this scenario, e.g. some communication with the CAPTCHA plugin.
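
The rule that a poster should never have to solve more than one CAPTCHA can be written down in a few lines. This is only a sketch of the decision itself, in Python for illustration. How Spam-X and the CAPTCHA plugin would actually exchange this information is left open, and the function and parameter names are hypothetical.

```python
def needs_captcha(captcha_always_on: bool,
                  filter_is_unsure: bool,
                  already_solved: bool) -> bool:
    """Decide whether the poster must (still) solve a CAPTCHA.

    captcha_always_on: the CAPTCHA plugin is configured to challenge every post
    filter_is_unsure:  the spam filter returned an "unsure" answer
    already_solved:    the poster has solved a CAPTCHA for this post already
    """
    if already_solved:
        # One solved CAPTCHA covers both the plugin and the "unsure" case.
        return False
    return captcha_always_on or filter_is_unsure
```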

And finally, since we're going to have to change the Spam-X API anyway, this would be a good opportunity to address a design flaw of the API: Once a comment post is considered spam, it is deleted. There is currently no way to store the post for later review (and possible approval). Fixing this is not as simple as it may seem, though, since currently Spam-X simply doesn't know where the post came from - it could be a comment or a story submission, both of which would have to be treated differently.
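
A review queue only works if each held post remembers what kind of post it is, so that approval can be routed to the right code. The sketch below shows this in Python for illustration. The class names, the "comment" and "story" type labels, and the handler mechanism are assumptions and do not describe existing Spam-X code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class HeldPost:
    """A post held for review, tagged with its origin (hypothetical)."""
    post_type: str  # e.g. "comment" or "story"
    content: str


class ReviewQueue:
    """In-memory stand-in for a table of posts awaiting review."""

    def __init__(self) -> None:
        self._items: Dict[int, HeldPost] = {}
        self._next_id = 1

    def hold(self, post_type: str, content: str) -> int:
        """Store a suspected spam post instead of deleting it."""
        item_id = self._next_id
        self._next_id += 1
        self._items[item_id] = HeldPost(post_type, content)
        return item_id

    def pending(self) -> List[int]:
        """IDs of posts still waiting for a decision."""
        return sorted(self._items)

    def approve(self, item_id: int,
                handlers: Dict[str, Callable[[str], None]]) -> None:
        """Publish the post through the handler for its type."""
        post = self._items.pop(item_id)
        handlers[post.post_type](post.content)

    def discard(self, item_id: int) -> None:
        """Delete the post for good."""
        del self._items[item_id]
```

Each caller (comments, story submissions, and so on) would register its own handler, so the queue itself does not need to know how a given type of post is published.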

Backward compatibility has to be considered. There is third-party code out there that uses the current Spam-X API that we don't want to break. An easy, though not the only, way out would be to introduce new API functions.
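
As a sketch of the "new API functions" approach: the old yes/no function stays in place and becomes a thin wrapper around a new three-state function. The example is in Python for illustration, all names are hypothetical, the classifier is a trivial placeholder, and mapping "unsure" to "not spam" for old callers is an assumption (chosen here because old callers delete whatever is flagged as spam).

```python
def classify(text: str) -> str:
    """New-style check returning "ham", "spam" or "unsure".

    Placeholder logic only; a real module would ask an anti-spam service.
    """
    lowered = text.lower()
    if "cheap pills" in lowered:
        return "spam"
    if "http://" in lowered:
        return "unsure"
    return "ham"


def is_spam(text: str) -> bool:
    """Old-style check with the unchanged yes/no signature.

    Existing third-party callers keep working; "unsure" is reported
    as not spam so that nothing is deleted on a mere suspicion.
    """
    return classify(text) == "spam"
```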


Level of Difficulty

low to medium

The usability changes and implementation of the Mollom module itself should be relatively straightforward (there already is a PHP class for Mollom). Changing the Spam-X API will be more demanding, especially since backward compatibility will be an issue.

Possible mentors: Tom Homer, Dirk Haun


Further Reading