Difference between revisions of "SoC spam-x overhaul"

Revision as of 09:37, 7 February 2009

(This is an idea page for the Google Summer of Code)

Introduction

Spam-X (http://www.geeklog.net/docs/spamx.html) is Geeklog's spam filter plugin. The concept is simple but has proved to be very effective:

Every post on a Geeklog site is run through Spam-X, which then returns a "thumbs-up" (not spam) or "thumbs-down" (spam detected) result to the caller. The plugin itself can easily be extended by dropping in new modules when spammers try new tricks or new methods of spam detection become available.
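
The thumbs-up / thumbs-down flow can be sketched roughly as follows. This is a language-neutral illustration in Python, not Geeklog's actual PHP code; the two filter modules, the word list, and the link threshold are all made-up examples.

```python
from typing import Callable, List

# Hypothetical stand-in for a filter module: it receives the post text and
# returns True if it considers the post spam.
FilterModule = Callable[[str], bool]


def contains_blacklisted_word(post: str) -> bool:
    """Example module: flag posts containing a word from a small list."""
    blacklist = {"viagra", "casino"}  # illustrative list only
    words = (w.strip(".,!?") for w in post.lower().split())
    return any(w in blacklist for w in words)


def too_many_links(post: str) -> bool:
    """Example module: flag posts with more than three links."""
    text = post.lower()
    return text.count("http://") + text.count("https://") > 3


def check_for_spam(post: str, modules: List[FilterModule]) -> bool:
    """Return True (thumbs-down) as soon as any module flags the post."""
    return any(module(post) for module in modules)


# Adding a new detection method means appending one more callable here.
MODULES: List[FilterModule] = [contains_blacklisted_word, too_many_links]
```

The caller only ever sees a single boolean, which is what keeps the concept simple; new modules extend the list without changing the caller.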


Incentive

The Spam-X plugin does have some issues, though:

  • No second chance for false positives: A post that is flagged as spam is deleted immediately. So if the plugin happens to flag a valid post as spam, the post is lost. This can cause some frustration for users trying to submit a valid post.
  • Filter list management: When following a strict filter policy, the lists per filter module tend to become long and hard to manage. The plugin also currently does not keep track of whether a filter rule is used or how often it applies to posts.


Details

The main concepts, as outlined above, are solid and should be kept:

  • extensible - easy to add new filter modules
  • simple "thumbs-up" / "thumbs-down" decision

A spam filter should be easy to use and maintain. Otherwise it won't be used.

Here are some ideas about what could be done to improve both the main functionality (i.e. catching spam posts) and the maintenance:

Use Counter

It would probably make sense to add a simple use count and a last-used timestamp to every filter rule or filter module (in case the module does not store rules in the database) to keep track of which rules / modules are actually effective.

This should be accompanied by some overview to easily find effective and ineffective rules and modules.
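
A minimal sketch of such a counter, written in Python for illustration only (Spam-X itself is PHP and would store these values in the database). The FilterRule class, its field names, and the two overview helpers are assumptions, not existing Spam-X code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class FilterRule:
    """Hypothetical filter rule carrying a use count and last-used timestamp."""
    pattern: str
    use_count: int = 0
    last_used: Optional[datetime] = None

    def matches(self, post: str, now: Optional[datetime] = None) -> bool:
        """Check the post and record the hit if the rule applies."""
        hit = self.pattern.lower() in post.lower()
        if hit:
            self.use_count += 1
            self.last_used = now or datetime.now(timezone.utc)
        return hit


def unused_rules(rules: List[FilterRule]) -> List[FilterRule]:
    """Rules that have never matched: candidates for removal."""
    return [r for r in rules if r.use_count == 0]


def most_effective(rules: List[FilterRule]) -> List[FilterRule]:
    """Rules ordered by how often they matched, most used first."""
    return sorted(rules, key=lambda r: r.use_count, reverse=True)
```

The two helper functions correspond to the overview mentioned above: one lists rules that never fired, the other ranks rules by effectiveness.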

Moderation Option

As explained above, posts that get the "thumbs-down" are deleted immediately. There could be an option to keep flagged posts in a submission queue so that they can be approved by a moderator.

There are some problems with this idea, though:

  • due to the amount of spam, this could lead to a very long moderation queue
  • Spam-X does not currently know what type the post is. In other words, Spam-X only sees the content of the post (and meta information like the HTTP headers) but it does not know whether it is a story, a comment, or some sort of post for a plugin (e.g. a forum post). This information would be needed, though, to push the post into the proper moderation queue.
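
Both problems can be made concrete with a small sketch. It is written in Python for illustration and is not Geeklog code; the ModerationQueues class, the per-queue cap, and the fallback of dropping posts whose type is unknown are assumptions chosen for the example, not decisions made on this page.

```python
from collections import defaultdict
from typing import DefaultDict, List, Optional


class ModerationQueues:
    """Hypothetical holding area for flagged posts, one queue per post type."""

    def __init__(self, max_per_queue: int = 100) -> None:
        # The cap is an assumed way to keep queues from growing without bound.
        self.max_per_queue = max_per_queue
        self.queues: DefaultDict[str, List[str]] = defaultdict(list)

    def hold(self, post: str, post_type: Optional[str]) -> bool:
        """Try to keep a flagged post for review.

        Returns False if the post is dropped, either because its type is
        unknown (no queue can be chosen) or because the queue is full.
        """
        if post_type is None:
            return False
        queue = self.queues[post_type]
        if len(queue) >= self.max_per_queue:
            return False
        queue.append(post)
        return True
```

Without a post type the sketch has no choice but to drop the post, which is exactly the limitation described in the second bullet.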

API Change?

The API for the Spam-X plugin, used by Geeklog and many plugins, is the PLG_checkforSpam function. This call currently accepts two parameters: the content of the post and an integer specifying the action(s) to perform if the post is considered spam.

As outlined above, it would be helpful if more information about a post were available. At the very least, the type of post (story, comment, forum post, ...) would be needed. This, however, would require an API change, which would then take some time to be picked up and used by third-party plugins and add-ons.

To consider: Can the type (reliably) be identified from the available information (e.g. HTTP headers)? If an API change is necessary, the plugin would still need to provide backward compatibility for the legacy API.
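
One common way to extend a call without breaking existing callers is an optional trailing parameter. The sketch below shows the idea in Python; it is not the real PLG_checkforSpam (which is PHP), and the function name, the default action value, the "unknown" fallback, and the placeholder spam test are all assumptions made for illustration.

```python
from typing import Dict, Optional, Union

UNKNOWN_TYPE = "unknown"  # assumed fallback for callers using the legacy form


def is_spam(content: str) -> bool:
    """Placeholder check; a real implementation would run the filter modules."""
    return "casino" in content.lower()


def check_for_spam(
    content: str,
    action: int = -1,
    post_type: Optional[str] = None,
) -> Dict[str, Union[bool, int, str]]:
    """Hypothetical extended call.

    The first two parameters mirror the existing API (content and action).
    The third is new and optional, so two-argument callers keep working.
    """
    return {
        "spam": is_spam(content),
        "action": action,
        "post_type": post_type if post_type is not None else UNKNOWN_TYPE,
    }
```

Legacy callers that pass only content and action still work and are treated as having an unknown post type, while updated plugins can pass the type as a third argument.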

List Views

This is more of a technical detail: Geeklog provides list views that are sortable and can be searched. These are not currently used by the Spam-X plugin but would help make long lists of filter rules more maintainable.

Modules

Installation of additional modules should be as simple as possible. Currently, you can simply drop new modules into the plugins/spamx directory or remove unused modules from there.

Problems with the current module concept:

  • Some modules consist of more than one file (e.g. SLV support), which can lead to confusion when a module is only partially added or removed.
  • Localization: Since the modules tend to be self-contained (everything in one file), they tend to use hard-coded texts (http://project.geeklog.net/tracking/view.php?id=656).

It's worth considering a new module concept that allows for each module to have its own subdirectory. This would make it obvious which files belong together and would allow for addition of separate language files and files containing helper code.

UI idea: Add an option to enable / disable modules from the Spam-X administration screen.
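
A rough sketch of the subdirectory idea, in Python for illustration only: each subdirectory is treated as one module, and a set of disabled names stands in for the setting an administration screen would store. The function names and the directory layout are hypothetical, not an existing Spam-X convention.

```python
from pathlib import Path
from typing import Dict, List, Set


def discover_modules(modules_dir: Path) -> Dict[str, List[str]]:
    """Map each module name (one subdirectory) to the files it ships."""
    modules: Dict[str, List[str]] = {}
    for entry in sorted(modules_dir.iterdir()):
        if entry.is_dir():
            modules[entry.name] = sorted(
                p.name for p in entry.iterdir() if p.is_file()
            )
    return modules


def active_modules(modules: Dict[str, List[str]], disabled: Set[str]) -> List[str]:
    """Names of discovered modules that have not been disabled."""
    return [name for name in modules if name not in disabled]
```

Because a module is identified by its directory, adding or removing it is a single operation on that directory, and disabling it does not require touching any files.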


Level of Difficulty

medium to high

The main challenges in this project are preserving ease of use and backward compatibility.