SoC improve the Spam-X plugin

(This is an idea page for the Google Summer of Code)

Introduction

Comment spam doesn't need an introduction - pretty much every site gets it. Geeklog ships with its own spam filter, called Spam-X. This filter can easily be extended with modules, either to keep up with the spammers' latest tricks or to add support for new anti-spam services.

Incentive

Spam-X works very well in practice. But of course, there is always room for improvement, and that is what this project is about. We are looking for ways to make the spam filter more efficient to use (for the site admin) and to extend its filtering capabilities.

From a usage point of view, the handling of long lists of blacklisted phrases and IP addresses could be improved. Also, there are new anti-spam services that we would want to try out. And finally, there's also the idea of setting up a new anti-spam service that is based on a "web of trust".

Part 1: Usability improvements

In addition to using external services (see below) that rate posts as spam or not spam, site owners can use their own blacklists to fine-tune spam filtering for their own site. As a result, however, you'll often end up with long lists of blacklist entries. There are two obvious problems with this:

you don't know whether or not such a rule is still valid, i.e. used to filter spam
the long lists are hard to use and maintain

We have (raw and unfinished) patches for both of these issues (#1076 and #1077). So the minimal goal for this part would be to finish and implement these changes.
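One way to tell whether a blacklist rule is still in use is to record a hit count and a last-match time for each entry. The following is only a minimal sketch of that idea in Python; it is not taken from the patches mentioned above, and the names `BlacklistEntry`, `check_post` and `stale_entries` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class BlacklistEntry:
    """One blacklisted phrase plus usage statistics (hypothetical structure)."""
    phrase: str
    hits: int = 0
    last_hit: Optional[datetime] = None


def check_post(text: str, entries: List[BlacklistEntry], now: datetime) -> bool:
    """Return True if any entry matches, recording the hit on each matching entry."""
    lowered = text.lower()
    matched = False
    for entry in entries:
        if entry.phrase.lower() in lowered:
            entry.hits += 1
            entry.last_hit = now
            matched = True
    return matched


def stale_entries(entries: List[BlacklistEntry], now: datetime,
                  max_age: timedelta) -> List[BlacklistEntry]:
    """Entries that never matched, or last matched longer than max_age ago."""
    return [e for e in entries
            if e.last_hit is None or now - e.last_hit > max_age]
```

With statistics like these, the admin panel could show only the stale entries as candidates for removal instead of one long list.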

However, we would also welcome new ideas on how to better handle these issues. Surely, having a sortable list of several hundred entries is not the only possible solution to this? Here's a chance for a student (i.e. you) to come up with a clever idea that sets your proposal apart from the others.

Other UI improvements that we're looking for (and that should be easy to implement) in this part of the project would be to ensure consistency with the "look and feel" of the rest of Geeklog. Currently, the Spam-X plugin sticks out a bit, both in the way the admin panels work and in how they look. This would probably make a good first task in the project.

Part 2: A new spam filter module and API changes

Geeklog currently ships with a Spam-X module for LinkSleeve (aka SLV). When it was added, this was the only free service available that didn't require creating an account, so it was usable "out of the box".

Over time, more anti-spam services have appeared (see below for a full list). One of the most interesting - and free - anti-spam services these days is Mollom, which is associated with the Drupal community (but not limited to Drupal sites).

One important difference between Mollom and other services is that it can return an "unsure" ranking for a comment post. This means that the post may or may not be spam - Mollom isn't sure. So what do we do? Display a CAPTCHA to the poster.

This concept, however, cannot easily be integrated into Geeklog right now. First of all, Spam-X currently expects either a "thumbs up" or "thumbs down" answer, after which the comment post is either allowed or dismissed. "I don't know" simply isn't supported. So to support this third possible reply, some changes will have to be made to Spam-X itself and to the Geeklog code calling it.

There's also the problem that Geeklog's CAPTCHA plugin would always display a CAPTCHA to the user. So if Mollom were to return an "unsure" response, the poster would have to solve two CAPTCHAs, which would be very annoying. So we need a solution for this scenario, e.g. some communication with the CAPTCHA plugin.
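The third "unsure" answer and the double-CAPTCHA problem can be illustrated with a small sketch. It is written in Python for brevity and makes several assumptions: the `Verdict` enum, the `combine` rule (any "spam" wins, then "unsure") and the `captcha_already_solved` flag are all hypothetical and are not part of the current Spam-X or CAPTCHA plugin code.

```python
from enum import Enum


class Verdict(Enum):
    """Three possible answers from a spam module (hypothetical)."""
    HAM = "ham"
    SPAM = "spam"
    UNSURE = "unsure"


def combine(verdicts):
    """Merge the answers of several modules.

    Assumed rule: SPAM from any module wins; otherwise UNSURE wins over HAM.
    """
    verdicts = list(verdicts)
    if Verdict.SPAM in verdicts:
        return Verdict.SPAM
    if Verdict.UNSURE in verdicts:
        return Verdict.UNSURE
    return Verdict.HAM


def next_action(verdict, captcha_already_solved):
    """Decide what the caller should do with the post.

    captcha_already_solved stands in for whatever the CAPTCHA plugin would
    report, so that the poster is never asked to solve two CAPTCHAs.
    """
    if verdict is Verdict.SPAM:
        return "reject"
    if verdict is Verdict.UNSURE and not captcha_already_solved:
        return "show_captcha"
    return "accept"
```

For example, `next_action(combine([Verdict.HAM, Verdict.UNSURE]), False)` asks for a CAPTCHA, while the same verdict with `captcha_already_solved=True` lets the post through.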

And finally, since we're going to have to change the Spam-X API anyway, this would be a good opportunity to address a design flaw: Once a comment post is considered spam, it is deleted. There is currently no way to store the post for later review (and possible approval). This is not as simple as it may seem, though, since currently Spam-X simply doesn't know where the post came from - it could be a comment or a story submission, both of which would have to be treated differently.

Backward compatibility has to be considered. There is third-party code out there that uses the current Spam-X API that we don't want to break. An easy, though not the only, way out would be to introduce new API functions.
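As an illustration of the "new API functions" approach, the sketch below adds a new call that knows where a post came from and keeps spam for review, while the old yes/no call stays available as a thin wrapper. It is a Python sketch under stated assumptions, not the Spam-X API: `SpamFilter`, `check_with_source`, `check` and `QuarantinedPost` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class QuarantinedPost:
    """A post held back for later review (hypothetical structure)."""
    content: str
    source_type: str  # e.g. "comment" or "story"; "unknown" for old callers


@dataclass
class SpamFilter:
    is_spam: Callable[[str], bool]  # stand-in for the actual filter modules
    quarantine: List[QuarantinedPost] = field(default_factory=list)

    def check_with_source(self, content: str, source_type: str) -> bool:
        """New-style call: the caller states the source; spam is stored, not deleted."""
        if self.is_spam(content):
            self.quarantine.append(QuarantinedPost(content, source_type))
            return True
        return False

    def check(self, content: str) -> bool:
        """Old-style call kept for existing code: same yes/no answer, source unknown."""
        return self.check_with_source(content, "unknown")
```

Existing third-party code would keep calling the old function unchanged. Whether posts from such callers should be stored with an "unknown" source or deleted as before is a design decision this sketch does not settle.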

Part 3: SWOT

Most anti-spam services work on the assumption that the same sort of spam is going to hit a lot of sites. The bigger a spam wave, the more likely (and faster) it is going to be recognized by one of these services, as they get reports from sites all over the web.

Once these services get big (i.e. once they have many sites reporting to them), there is a chance that smaller spam waves will not be recognized. So a spammer who only targets a few sites, and with a low volume, may get away with it. Of course, the admins of a site hit by this sort of spam will recognize it as spam and remove it. But how could they then alert other site admins?

Another use case: At BarCamp Stuttgart 2008, there was a report by participants about a poster who was very active in some loosely connected blogs. He posted comments that were more or less on topic but always included a link to his (unrelated) services. The participants expressed that they would be willing to trust other bloggers who already identified this sort of "borderline spam".

Spam: Web of Trust (SWOT) by Michael Jervis provides a framework for this sort of trust relationship in spam reports. The idea is that a website provides an RSS feed of the spam that it identified and blocked. Other site admins who trust the owner of this site can then subscribe to this feed and won't need to take care of the same sort of spam. They can then publish their own feed, and so on, building an entire web of trusted feeds that would allow for quick propagation of information about spammers.

We would like to see this concept implemented as a module for Spam-X.

A site owner should be able to subscribe to other SWOT feeds.
Not all "locally" blocked spam should go into a SWOT feed automatically (e.g. a site may have very strict rules to not allow posts in other languages, but such a feed would not be very useful for other sites).
It should be possible to publish more than one SWOT feed, e.g. for different levels of filtering or different criteria.
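The requirements above can be sketched as follows. This is only an illustration in Python: it assumes a plain RSS 2.0 feed with one item per blocked pattern and the pattern stored in the item title, which is an assumption of this sketch and not necessarily the SWOT feed format. The names `LocalRule`, `build_feed` and `read_feed` are hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class LocalRule:
    """A locally blocked pattern (hypothetical structure)."""
    pattern: str
    share: bool  # False for site-specific rules that should stay local


def build_feed(title: str, rules: List[LocalRule]) -> str:
    """Publish only the rules marked as shareable, as an RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    for rule in rules:
        if rule.share:
            item = ET.SubElement(channel, "item")
            ET.SubElement(item, "title").text = rule.pattern
    return ET.tostring(rss, encoding="unicode")


def read_feed(xml_text: str) -> List[str]:
    """Extract the patterns from a subscribed feed's XML text."""
    root = ET.fromstring(xml_text)
    return [item.findtext("title", default="") for item in root.iter("item")]
```

Publishing more than one feed would then amount to calling `build_feed` with different subsets of rules. Fetching feeds over the network and validating content from other sites are left out of this sketch.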

Level of Difficulty

medium

The usability changes and implementation of the Mollom module itself should be relatively straightforward (there already is a PHP class for Mollom). Changing the Spam-X API will be more demanding, especially since backward compatibility will be an issue. Implementing SWOT will also require more thought and work.
