SoC spam-x overhaul

(This is an idea page for the Google Summer of Code)

Introduction

Spam-X is Geeklog's spam filter plugin. The concept is simple but has proved to be very effective:

Every post on a Geeklog site is run through Spam-X, which then returns a "thumbs-up" (not spam) or "thumbs-down" (spam detected) result to the caller. The plugin itself can easily be extended by dropping in new modules when spammers try new tricks or new methods of spam detection become available.
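To illustrate the concept, here is a minimal, language-neutral sketch in Python (Spam-X itself is a Geeklog plugin and is not written in Python). The two example modules and all names are assumptions made up for this illustration, not actual Spam-X modules.

```python
from typing import Callable, List

# A "module" is modelled here as any function that takes the post content
# and returns True when it considers the post to be spam.
SpamModule = Callable[[str], bool]


def contains_blacklisted_phrase(content: str) -> bool:
    """Hypothetical module: flags posts containing a known spam phrase."""
    phrases = ("casino", "cheap pills")
    lowered = content.lower()
    return any(phrase in lowered for phrase in phrases)


def too_many_links(content: str) -> bool:
    """Hypothetical module: flags posts with more than three links."""
    lowered = content.lower()
    return lowered.count("http://") + lowered.count("https://") > 3


def is_spam(content: str, modules: List[SpamModule]) -> bool:
    """Thumbs-down (True) as soon as any module flags the post."""
    return any(module(content) for module in modules)


modules = [contains_blacklisted_phrase, too_many_links]
print(is_spam("Visit our casino today", modules))    # flagged by the first module
print(is_spam("Thanks for the great story", modules))  # passes all modules
```

Adding a new detection method then only means appending another function to the list, which mirrors the drop-in idea described above.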

Incentive

The Spam-X plugin does have some issues, though:

No second chance for false positives: A post that is flagged as spam is deleted immediately. So if the plugin happens to flag a valid post as spam, the post is lost. This can cause some frustration for users trying to submit a valid post.
Filter list management: When following a strict filter policy, the lists per filter module tend to become long and hard to manage. The plugin also currently does not keep track of whether a filter rule is used or how often it applies to posts.

Details

The main concepts, as outlined above, are solid and should be kept:

extensible - easy to add new filter modules
simple "thumbs-up" / "thumbs-down" decision

A spam filter should be easy to use and maintain. Otherwise it won't be used.

Here are some ideas about what could be done to improve both the main functionality (i.e. catching spam posts) and the maintenance:

Use Counter

It would probably make sense to add a simple use count and a last-used timestamp to every filter rule or filter module (in case the module does not store rules in the database) to keep track of which rules / modules are actually effective.

This should be accompanied by some overview to easily find effective and ineffective rules and modules.
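A rough sketch of what such a counter could look like, written in Python for brevity. The FilterRule class and the two helper functions are hypothetical names for this illustration; in the plugin itself the count and timestamp would presumably live in the database next to each rule.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class FilterRule:
    """Hypothetical rule record: a regex plus usage statistics."""
    pattern: str
    use_count: int = 0
    last_used: Optional[datetime] = None

    def matches(self, content: str) -> bool:
        """Check the rule and record the hit if it applies."""
        hit = re.search(self.pattern, content, re.IGNORECASE) is not None
        if hit:
            self.use_count += 1
            self.last_used = datetime.now(timezone.utc)
        return hit


def unused_rules(rules: List[FilterRule]) -> List[FilterRule]:
    """Rules that never matched: candidates for removal."""
    return [rule for rule in rules if rule.use_count == 0]


def by_effectiveness(rules: List[FilterRule]) -> List[FilterRule]:
    """Most frequently matching rules first, for an overview screen."""
    return sorted(rules, key=lambda rule: rule.use_count, reverse=True)
```

An overview page could then list by_effectiveness(rules) and highlight everything returned by unused_rules(rules).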

Moderation Option

As explained above, posts that get the "thumbs-down" are deleted immediately. There could be an option to keep flagged posts in a submission queue so that they can be approved by a moderator.

There are some problems with this idea, though:

due to the amount of spam, this could lead to a very long moderation queue
Spam-X does not currently know what type the post is. In other words, Spam-X only sees the content of the post (and meta information like the HTTP headers) but it does not know whether it is a story, a comment, or some sort of post for a plugin (e.g. a forum post). This information would be needed, though, to push the post into the proper moderation queue.
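Both problems can be made concrete with a small sketch (Python, illustrative only). It assumes the post type is somehow known to the caller, which is exactly what the current interface does not provide, and it caps each queue to keep the moderation workload bounded. The class name, the cap, and the return values are assumptions, not existing Spam-X behaviour.

```python
from collections import defaultdict
from typing import DefaultDict, List, Optional


class ModerationQueues:
    """Hypothetical per-type moderation queues with a size cap."""

    def __init__(self, max_per_queue: int = 100) -> None:
        self.max_per_queue = max_per_queue
        self.queues: DefaultDict[str, List[str]] = defaultdict(list)

    def handle_flagged(self, content: str, post_type: Optional[str]) -> str:
        """Queue a flagged post if possible, otherwise delete it."""
        if post_type is None:
            # Type unknown: there is no queue to put the post in,
            # so fall back to the current behaviour.
            return "deleted"
        queue = self.queues[post_type]
        if len(queue) >= self.max_per_queue:
            # Queue is full: do not let spam volume swamp the moderators.
            return "deleted"
        queue.append(content)
        return "queued"
```

Whether a full queue should drop new posts or the oldest ones is a design decision left open here.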

API Change?

The API for the Spam-X plugin, used by Geeklog and many plugins, is the PLG_checkforSpam function. This call currently accepts two parameters: The content of the post and an integer for the action(s) to perform in case the post is considered spam.

As outlined above, it would be helpful if more information about a post were available. At the very least, the type of post (story, comment, forum post, ...) would be needed. This, however, would require an API change, which would then take some time to be picked up and used by third-party plugins and add-ons.

To consider: Can the type (reliably) be identified from the available information (e.g. HTTP headers)? If an API change is necessary, the plugin would still need to provide backward compatibility for the legacy API.
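One way to keep the legacy API working is to make the new information optional. The following Python sketch only illustrates that idea; check_for_spam is a stand-in and not the real PLG_checkforSpam, and the post_type parameter, the placeholder detector, and the returned dictionary are assumptions for this example.

```python
from typing import Dict, Optional, Union


def looks_like_spam(content: str) -> bool:
    """Placeholder detector so that the example is runnable."""
    return "cheap pills" in content.lower()


def check_for_spam(
    content: str, action: int, post_type: Optional[str] = None
) -> Dict[str, Union[bool, int, str]]:
    """Stand-in for the spam check with an optional third parameter.

    Legacy callers pass (content, action) and get the old behaviour.
    New callers add post_type and make moderation possible.
    """
    spam = looks_like_spam(content)
    if not spam:
        disposition = "accept"
    elif post_type is None:
        # Legacy call: the type is unknown, so nothing changes.
        disposition = "delete"
    else:
        # New call: the post can be routed to the matching queue.
        disposition = "moderate:" + post_type
    return {"spam": spam, "action": action, "disposition": disposition}


print(check_for_spam("Cheap pills here", 128)["disposition"])
print(check_for_spam("Cheap pills here", 128, post_type="comment")["disposition"])
```

The action value (128 is an arbitrary number here) is passed through untouched, so existing callers would not need to change anything.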

Modules

Installation of additional modules should be as simple as possible. Currently, you can simply drop new modules into the plugins/spamx directory or remove unused modules from there.

Problems with the current module concept:

Some modules consist of more than one file (e.g. SLV support) which can cause confusion with incomplete addition / removal.
Localization: Since the modules tend to be self-contained (everything in one file), they tend to use hard-coded texts.

It's worth considering a new module concept that allows for each module to have its own subdirectory. This would make it obvious which files belong together and would allow for addition of separate language files and files containing helper code.
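As a sketch of how such a layout could be scanned (Python, illustrative only): every subdirectory is treated as one module, and a module counts as complete only if it contains a main file. The file name module.php and the language/ subdirectory used in the usage note are assumed conventions, not part of the current plugin.

```python
from pathlib import Path
from typing import Dict, List, Tuple

# Hypothetical convention: every module directory must contain this file.
REQUIRED_FILES = ("module.php",)


def discover_modules(base_dir: Path) -> Tuple[Dict[str, List[str]], List[str]]:
    """Return (complete modules with their files, incomplete module names)."""
    complete: Dict[str, List[str]] = {}
    incomplete: List[str] = []
    for entry in sorted(base_dir.iterdir()):
        if not entry.is_dir():
            continue  # loose files in the base directory are ignored
        if all((entry / name).is_file() for name in REQUIRED_FILES):
            # Collect every file below the module directory, so language
            # files and helper code are listed together with the module.
            complete[entry.name] = sorted(
                path.relative_to(entry).as_posix()
                for path in entry.rglob("*")
                if path.is_file()
            )
        else:
            incomplete.append(entry.name)
    return complete, incomplete
```

With a directory slv/ containing module.php and language/english.php, the scan would report slv as complete together with both files, while a directory lacking module.php would show up in the incomplete list. The administration screen could use that list to warn about partial installations.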

UI idea: Add an option to enable / disable modules from the Spam-X administration screen.

Test Mode

Many of the current Spam-X modules allow filter rules to be written as regular expressions. These are usually hard to understand for less experienced users and can therefore cause false positives - or may simply not work at all.

It would be nice to have a way to test-drive new rules before adding them to the list of filters. This could be a screen where the user can enter a test post (e.g. copied and pasted from a real spam post) and then try out filter rules to match the post.
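The core of such a test screen could be as small as the following Python sketch. It reports whether a rule is a valid regular expression at all and, if so, which part of the test post it matches. The function and field names are hypothetical, and Python's regular expression syntax is not identical to the one used by the plugin, so this only demonstrates the workflow.

```python
import re
from typing import NamedTuple, Optional


class RuleTestResult(NamedTuple):
    valid: bool                  # does the rule compile at all?
    matched: bool                # does it match the test post?
    matched_text: Optional[str]  # the part of the post that matched
    error: Optional[str]         # compiler message for invalid rules


def try_rule(pattern: str, sample_post: str) -> RuleTestResult:
    """Try a candidate rule against a sample post without saving it."""
    try:
        compiled = re.compile(pattern, re.IGNORECASE)
    except re.error as exc:
        # A broken rule is reported instead of silently never matching.
        return RuleTestResult(False, False, None, str(exc))
    match = compiled.search(sample_post)
    if match is None:
        return RuleTestResult(True, False, None, None)
    return RuleTestResult(True, True, match.group(0), None)


result = try_rule(r"v[i1]agra", "Buy V1AGRA today")
print(result.matched, result.matched_text)
```

Showing the matched text helps less experienced users see when a rule is too broad, for example when it matches a harmless word inside a valid post.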

List Views

This is more of a technical detail: Geeklog provides list views that are sortable and can be searched. These are not currently used by the Spam-X plugin but would help make long lists of filter rules more maintainable.

Level of Difficulty

medium

The main challenges in this project are preserving ease of use and backward compatibility.

Note: In this project, we are not looking to add new methods of spam detection / prevention. The main goal is to overhaul the infrastructure to allow for future extensions. Of course, we won't stop anyone from implementing new modules as a proof-of-concept for the new infrastructure ...
