SoC spam-x overhaul
This idea page for the Google Summer of Code has been superseded by a new project idea: New modules for our Spam filter. This project outline is therefore only of historical interest, and you cannot apply for this project in future instances of the Google Summer of Code. Please see our ideas page for a current list of GSoC project ideas.
Spam-X is Geeklog's spam filter plugin. The concept is simple but has proved to be very effective:
Every post on a Geeklog site is run through Spam-X which then gives a "thumbs-up" (not spam) or "thumbs-down" (spam detected) result back to the caller. The plugin itself can easily be extended by dropping in new modules, in case the spammers are trying new tricks or new methods of spam detection become available.
The Spam-X plugin does have some issues, though:
- No second chance for false positives: A post that is flagged as spam is deleted immediately. So if the plugin happens to flag a valid post as spam, the post is lost. This can cause some frustration for users trying to submit a valid post.
- Filter list management: When following a strict filter policy, the lists per filter module tend to become long and hard to manage. The plugin also currently does not keep track of whether a filter rule is used or how often it applies to posts.
The main concepts, as outlined above, are solid and should be kept:
- extensible - easy to add new filter modules
- simple "thumbs-up" / "thumbs-down" decision
A spam filter should be easy to use and maintain. Otherwise it won't be used.
Here are some ideas about what could be changed to improve the functionality and, specifically, the maintenance:
It would make sense to add a simple use count and a last-used timestamp to every filter rule or filter module (in case the module does not store rules in the database) to keep track of which rules / modules are actually effective.
This should be accompanied by some overview to easily find effective and ineffective rules and modules.
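As a rough illustration of the bookkeeping described above, here is a minimal Python sketch (not Geeklog code). The `FilterRule` class, its field names, and the `overview` helper are assumptions made up for this example. Each rule counts its hits and remembers when it last matched, and the overview simply sorts rules by use count.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class FilterRule:
    """Hypothetical filter rule with usage bookkeeping."""
    pattern: str
    use_count: int = 0
    last_used: Optional[datetime] = None

    def matches(self, content: str) -> bool:
        # Record a hit only when the rule actually applies to the post.
        hit = re.search(self.pattern, content, re.IGNORECASE) is not None
        if hit:
            self.use_count += 1
            self.last_used = datetime.now(timezone.utc)
        return hit


def overview(rules: List[FilterRule]) -> List[FilterRule]:
    """Return rules ordered from most used to least used."""
    return sorted(rules, key=lambda r: r.use_count, reverse=True)


rules = [FilterRule(r"casino"), FilterRule(r"cheap\s+pills")]
for post in ["Buy cheap pills now", "cheap   pills here", "hello world"]:
    for rule in rules:
        rule.matches(post)

for rule in overview(rules):
    print(rule.pattern, rule.use_count, rule.last_used)
```

Rules that end up with a use count of zero and no last-used timestamp would be the first candidates for removal.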
As explained above, posts that get the "thumbs-down" are deleted immediately. There could be an option to keep flagged posts in a submission queue so that they can be approved by a moderator.
There are some problems with this idea, though:
- due to the amount of spam, this could lead to a very long moderation queue
- Spam-X does not currently know what type the post is. In other words, Spam-X only sees the content of the post (and meta information like the HTTP headers) but it does not know whether it is a story, a comment, or some sort of post for a plugin (e.g. a forum post). This information would be needed, though, to push the post into the proper moderation queue.
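To make the two problems above concrete, the following Python sketch (an illustration only, not part of Spam-X) keeps one queue per post type and caps each queue so that it cannot grow without bound. The `ModerationQueues` class, the cap of two entries, and the policy of dropping the oldest entry are all assumptions chosen for the example.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List


class ModerationQueues:
    """Hypothetical per-type queues for posts flagged as spam."""

    def __init__(self, max_per_type: int = 100) -> None:
        # A bounded deque silently drops its oldest entry when full.
        self.queues: Dict[str, Deque[str]] = defaultdict(
            lambda: deque(maxlen=max_per_type)
        )

    def hold(self, post_type: str, content: str) -> None:
        """Keep a flagged post for review instead of deleting it."""
        self.queues[post_type].append(content)

    def pending(self, post_type: str) -> List[str]:
        """List the flagged posts waiting in the queue for this type."""
        return list(self.queues[post_type])


queues = ModerationQueues(max_per_type=2)
for text in ["first", "second", "third"]:
    queues.hold("comment", text)

print(queues.pending("comment"))
print(queues.pending("story"))
```

Note that `hold` can only route a post correctly if the caller supplies the post type, which is exactly the information Spam-X does not currently receive.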
The API for the Spam-X plugin, used by Geeklog and many plugins, is the PLG_checkforSpam function. This call currently accepts two parameters: the content of the post and an integer for the action(s) to perform in case the post is considered spam.
As outlined above, it would be helpful if more information about a post were available. At the very least, the type of post (story, comment, forum post, ...) would be needed. This, however, would require an API change, and it would take some time before the new API is picked up and used by third-party plugins and add-ons.
To consider: Can the type (reliably) be identified with the available information (e.g. HTTP headers)? If an API change is necessary, the plugin would still need to provide backward compatibility for the legacy API.
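One possible way to keep the legacy API working is to add the post type as an optional trailing argument. The Python sketch below only illustrates that idea; it is not the PLG_checkforSpam implementation. The names `check_for_spam`, `looks_like_spam`, `post_type`, and `ACTION_EXAMPLE`, as well as the boolean return value, are assumptions for this example.

```python
from typing import List, Optional, Tuple

ACTION_EXAMPLE = 1  # hypothetical action code, not a real Geeklog value

# Hypothetical record of (post_type, is_spam) decisions, for illustration.
decisions: List[Tuple[str, bool]] = []


def looks_like_spam(content: str) -> bool:
    """Stand-in for running the real filter modules."""
    return "cheap pills" in content.lower()


def check_for_spam(content: str, action: int,
                   post_type: Optional[str] = None) -> bool:
    """Return True for "thumbs-down" (spam), False for "thumbs-up".

    `action` is accepted for compatibility but not interpreted here.
    `post_type` is the assumed new, optional argument.
    """
    is_spam = looks_like_spam(content)
    # Legacy callers do not pass a type, so it is recorded as "unknown".
    decisions.append((post_type or "unknown", is_spam))
    return is_spam


# A legacy two-argument call keeps working unchanged ...
legacy = check_for_spam("Buy cheap pills", ACTION_EXAMPLE)
# ... while updated callers can pass the extra information.
extended = check_for_spam("Nice article!", ACTION_EXAMPLE, post_type="comment")
print(decisions)
```

With this approach, third-party plugins that have not been updated would still get a spam decision, but their posts could not be assigned to a type-specific queue.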
Installation of additional modules should be kept as simple as possible. Currently, you can simply drop new modules into the plugins/spamx directory or remove unused modules from there.
Problems with the current module concept:
- Some modules consist of more than one file (e.g. SLV support), so it is easy to add or remove only some of a module's files by mistake.
- Localization: Since the modules tend to be self-contained (everything in one file), they tend to use hard-coded texts.
It's worth considering a new module concept that allows for each module to have its own subdirectory. This would make it obvious which files belong together and would allow for addition of separate language files and files containing helper code.
UI idea: Add an option to enable / disable modules from the Spam-X administration screen.
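A one-subdirectory-per-module layout combined with an enable / disable switch could be sketched as follows. This is a Python illustration under assumed names: the `discover_modules` function, the module directory names, and the file names are hypothetical and do not reflect the actual Spam-X layout.

```python
import tempfile
from pathlib import Path
from typing import Dict, FrozenSet, List


def discover_modules(base_dir: str,
                     disabled: FrozenSet[str] = frozenset()) -> Dict[str, List[str]]:
    """Map each enabled module (a subdirectory) to its sorted file names."""
    modules: Dict[str, List[str]] = {}
    for entry in sorted(Path(base_dir).iterdir()):
        if entry.is_dir() and entry.name not in disabled:
            modules[entry.name] = sorted(
                p.name for p in entry.iterdir() if p.is_file()
            )
    return modules


# Build a throwaway directory tree to demonstrate the idea.
layout = {
    "blacklist": ["module.txt"],
    "slv": ["language_english.txt", "module.txt"],
}
with tempfile.TemporaryDirectory() as tmp:
    for name, files in layout.items():
        (Path(tmp) / name).mkdir()
        for file_name in files:
            (Path(tmp) / name / file_name).write_text("")
    all_modules = discover_modules(tmp)
    enabled = discover_modules(tmp, disabled=frozenset({"slv"}))

print(all_modules)
print(enabled)
```

Disabling a module then only means adding its name to the disabled set (which the administration screen could store), while removing it means deleting a single directory.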
Many of the current Spam-X modules allow filter rules to be written as regular expressions. These are usually hard to understand for less experienced users and can therefore cause false positives - or may simply not work at all.
It would be nice to have a way to test-drive new rules before adding them to the filter list. This could be a screen where the user can enter a test post (e.g. copied and pasted from a real spam post) and then try out filter rules to match the post.
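The core of such a test screen could look like the following Python sketch. The `try_rule` function and its result format are assumptions for this example, and Python's regular expression syntax differs in details from the one used by the Spam-X modules. The sketch reports whether a candidate rule compiles at all and which parts of the sample post it matches.

```python
import re
from typing import Any, Dict


def try_rule(rule: str, sample: str) -> Dict[str, Any]:
    """Check that a candidate regex compiles and list what it matches."""
    try:
        compiled = re.compile(rule, re.IGNORECASE)
    except re.error as exc:
        # A broken rule is reported instead of silently never matching.
        return {"valid": False, "error": str(exc), "matches": []}
    matches = [m.group(0) for m in compiled.finditer(sample)]
    return {"valid": True, "error": None, "matches": matches}


sample = "Visit http://example.com for CHEAP pills and cheap watches"
good = try_rule(r"cheap \w+", sample)
broken = try_rule(r"cheap (pills", sample)

print(good["matches"])
print(broken["valid"], broken["error"])
```

Showing the matched text, and not just a yes / no result, helps less experienced users see when a rule is broader than intended.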
Import / Export
A common scenario for an export option is setting up a new site on which you want to reuse the spam filter setup that is already in use on another site. You can currently do that by exporting / importing the gl_spamx table, but a more comfortable way to do that would be desirable.
One way to do this would be to provide all the information in an (RSS) feed.
There is some overlap here with the SWOT project idea. However, the import / export would be a one-time operation of all the current information, while SWOT feeds (one on each site) could then be used to keep the databases in sync. Specifically, the full export feed should probably not be a SWOT feed (due to overhead and since it may contain entries that may be site-specific and shouldn't really end up in a SWOT feed).
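The one-time export / import step could be sketched as below. Note that this Python example uses JSON instead of an (RSS) feed purely to keep the illustration short. The field names `module` and `rule`, the `version` key, and both function names are assumptions and do not describe the gl_spamx table.

```python
import json
from typing import Dict, List

Rule = Dict[str, str]


def export_rules(rules: List[Rule]) -> str:
    """Serialize all filter rules of a site into one document."""
    return json.dumps({"version": 1, "rules": rules}, indent=2, sort_keys=True)


def import_rules(payload: str, existing: List[Rule]) -> List[Rule]:
    """Merge exported rules into an existing list, skipping duplicates."""
    merged = list(existing)
    seen = {(r["module"], r["rule"]) for r in existing}
    for r in json.loads(payload)["rules"]:
        key = (r["module"], r["rule"])
        if key not in seen:
            seen.add(key)
            merged.append({"module": r["module"], "rule": r["rule"]})
    return merged


site_a = [
    {"module": "blacklist", "rule": "cheap pills"},
    {"module": "blacklist", "rule": "casino"},
]
site_b = [{"module": "blacklist", "rule": "casino"}]

merged = import_rules(export_rules(site_a), site_b)
print(merged)
```

Skipping duplicates on import means the operation can be repeated safely, although keeping two sites in sync over time is still the job of the SWOT feeds mentioned above.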
This is more of a technical detail: Geeklog provides list views that are sortable and can be searched. These are not currently used by the Spam-X plugin but would help make long lists of filter rules more maintainable.
Level of Difficulty
The main challenges in this project are preserving ease of use and maintaining backward compatibility.
Note: In this project, we are not looking to add new methods of spam detection / prevention. The main goal is to overhaul the infrastructure to allow for future extensions. Of course, we won't stop anyone from implementing new modules as a proof-of-concept for the new infrastructure ...