SoC spam-x overhaul

From GeeklogWiki

Latest revision as of 11:05, 6 March 2010

This idea page for the Google Summer of Code has been superseded by a new project idea: New modules for our Spam filter. This project outline is therefore of historical interest only, and you cannot apply for it in future instances of the Google Summer of Code. Please see our ideas page for a current list of GSoC project ideas.


Introduction

Spam-X is Geeklog's spam filter plugin. The concept is simple but has proved to be very effective:

Every post on a Geeklog site is run through Spam-X, which then returns a "thumbs-up" (not spam) or "thumbs-down" (spam detected) result to the caller. The plugin itself can easily be extended by dropping in new modules when spammers try new tricks or new methods of spam detection become available.
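
To illustrate the concept, here is a minimal sketch in Python (not Geeklog's actual code). The function and module names are made up for this example and do not reflect the real Spam-X module interface; the point is only that each module gives its own verdict and a single "thumbs-down" is enough to flag the post.

```python
from typing import Callable, List

# A filter module is any callable that returns True when it considers the
# post spam. This interface is an assumption made for the sketch.
FilterModule = Callable[[str], bool]


def blacklist_module(content: str) -> bool:
    """Hypothetical module: flag posts containing a blacklisted phrase."""
    return "cheap pills" in content.lower()


def too_many_links_module(content: str) -> bool:
    """Hypothetical module: flag posts with more than three links."""
    return content.lower().count("http://") > 3


def check_for_spam(content: str, modules: List[FilterModule]) -> bool:
    """Return True ("thumbs-down") as soon as any module flags the post."""
    return any(module(content) for module in modules)


# Adding a new module is just a matter of adding it to the list.
modules = [blacklist_module, too_many_links_module]
print(check_for_spam("Buy CHEAP PILLS today", modules))
print(check_for_spam("Nice article, thanks!", modules))
```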


Incentive

The Spam-X plugin does have some issues, though:

  • No second chance for false positives: A post that is flagged as spam is deleted immediately. So if the plugin happens to flag a valid post as spam, the post is lost. This can cause some frustration for users trying to submit a valid post.
  • Filter list management: When following a strict filter policy, the lists per filter module tend to become long and hard to manage. The plugin also currently does not keep track of whether a filter rule is used or how often it applies to posts.


Details

The main concepts, as outlined above, are solid and should be kept:

  • extensible - easy to add new filter modules
  • simple "thumbs-up" / "thumbs-down" decision

A spam filter should be easy to use and maintain. Otherwise it won't be used.

Here are some ideas about what could be changed to improve the functionality and, specifically, the maintenance:

Use Counter

It would make sense to add a simple use count and a last-used timestamp to every filter rule or filter module (in case the module does not store rules in the database) to keep track of which rules / modules are actually effective.

This should be accompanied by some overview to easily find effective and ineffective rules and modules.
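
A rough sketch of how such a counter could work, written in Python for brevity. The FilterRule class and the two overview helpers are assumptions for this example; the real plugin would presumably store the counter and timestamp alongside the rule in the database.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class FilterRule:
    """Hypothetical rule record with a use count and last-used timestamp."""
    pattern: str
    use_count: int = 0
    last_used: Optional[datetime] = None

    def matches(self, content: str) -> bool:
        """Check the post and record the hit if the rule applies."""
        hit = re.search(self.pattern, content, re.IGNORECASE) is not None
        if hit:
            self.use_count += 1
            self.last_used = datetime.now(timezone.utc)
        return hit


def unused_rules(rules: List[FilterRule]) -> List[FilterRule]:
    """Rules that never matched: candidates for removal."""
    return [rule for rule in rules if rule.use_count == 0]


def most_effective(rules: List[FilterRule]) -> List[FilterRule]:
    """Rules ordered by how often they matched, most used first."""
    return sorted(rules, key=lambda rule: rule.use_count, reverse=True)


rules = [FilterRule(r"casino"), FilterRule(r"v[i1]agra")]
for post in ["Best online casino", "CASINO bonus inside", "Great story!"]:
    for rule in rules:
        rule.matches(post)

for rule in most_effective(rules):
    print(rule.pattern, rule.use_count, rule.last_used)
```

With these three sample posts, the first rule is used twice and the second never, so the second would show up in the list of ineffective rules.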

Moderation Option

As explained above, posts that get the "thumbs-down" are deleted immediately. There could be an option to keep flagged posts in a submission queue so that they can be approved by a moderator.

There are some problems with this idea, though:

  • due to the amount of spam, this could lead to a very long moderation queue
  • Spam-X does not currently know what type the post is. In other words, Spam-X only sees the content of the post (and meta information like the HTTP headers) but it does not know whether it is a story, a comment, or some sort of post for a plugin (e.g. a forum post). This information would be needed, though, to push the post into the proper moderation queue.
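
The second problem can be made concrete with a small Python sketch. The ModerationQueues class and the post type names are hypothetical; the sketch only shows why the post type is needed, since without it there is no way to pick the right queue.

```python
from collections import defaultdict
from typing import DefaultDict, List


class ModerationQueues:
    """Hypothetical store with one moderation queue per post type."""

    def __init__(self) -> None:
        self._queues: DefaultDict[str, List[str]] = defaultdict(list)

    def handle_flagged(self, content: str, post_type: str,
                       keep_for_review: bool) -> str:
        """Delete a flagged post (current behaviour) or queue it for review."""
        if not keep_for_review:
            return "deleted"
        self._queues[post_type].append(content)
        return "queued"

    def pending(self, post_type: str) -> List[str]:
        """Flagged posts of one type that are waiting for a moderator."""
        return list(self._queues.get(post_type, []))


queues = ModerationQueues()
print(queues.handle_flagged("Buy now!!!", "comment", keep_for_review=True))
print(queues.handle_flagged("Buy now!!!", "comment", keep_for_review=False))
print(queues.pending("comment"))
```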

API Change?

The API for the Spam-X plugin, used by Geeklog and many plugins, is the PLG_checkforSpam function. This call currently accepts two parameters: The content of the post and an integer for the action(s) to perform in case the post is considered spam.

As outlined above, it would be helpful if more information were available about a post. At the very least, the type of post (story, comment, forum post, ...) would be needed. This, however, would require an API change, and it would take some time before the new API is picked up and used by third-party plugins and add-ons.

To consider: Can the type (reliably) be identified with the available information (e.g. HTTP headers)? If an API change is necessary, the plugin would still need to provide backward compatibility for the legacy API.
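
One common way to keep a legacy API working is to add the new information as an optional parameter. The following Python sketch shows the idea only; it is not the signature of PLG_checkforSpam, and the names check_for_spam, post_type and the "unknown" fallback are assumptions made for the example.

```python
from typing import Optional

UNKNOWN_TYPE = "unknown"  # hypothetical fallback for legacy callers


def is_spam(content: str) -> bool:
    """Placeholder detection logic, only here to make the sketch runnable."""
    return "spam" in content.lower()


def check_for_spam(content: str, action: int,
                   post_type: Optional[str] = None) -> dict:
    """Accept the two legacy parameters plus an optional post type.

    Callers that still pass only content and action keep working; their
    posts are simply treated as being of unknown type.
    """
    resolved_type = post_type if post_type is not None else UNKNOWN_TYPE
    return {
        "spam": is_spam(content),
        "action": action,
        "post_type": resolved_type,
    }


# Legacy call (two parameters) and new call (with post type) side by side.
print(check_for_spam("Nice story", 0))
print(check_for_spam("SPAM SPAM SPAM", 0, post_type="comment"))
```

Posts of unknown type could then fall back to the current behaviour, while posts with a known type could use features such as a moderation queue.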

Modules

Installation of additional modules should be kept as simple as possible. Currently, you can simply drop new modules into the plugins/spamx directory or remove unused modules from there.

Problems with the current module concept:

  • Some modules consist of more than one file (e.g. SLV support), so a module can easily end up only partially added or removed.
  • Localization: Since the modules tend to be self-contained (everything in one file), they tend to use hard-coded texts.

It's worth considering a new module concept that allows for each module to have its own subdirectory. This would make it obvious which files belong together and would allow for addition of separate language files and files containing helper code.
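
A possible discovery routine for such a layout, sketched in Python. The main file name module.py and the directory names are invented for the example; the sketch treats every subdirectory as one module and reports subdirectories without a main file as incomplete.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple

MAIN_FILE = "module.py"  # hypothetical name of each module's main file


def discover_modules(base_dir: Path) -> Tuple[List[str], List[str]]:
    """Return (complete, incomplete) module names found below base_dir."""
    complete: List[str] = []
    incomplete: List[str] = []
    for entry in sorted(base_dir.iterdir()):
        if not entry.is_dir():
            continue  # loose files are not modules in this layout
        if (entry / MAIN_FILE).is_file():
            complete.append(entry.name)
        else:
            incomplete.append(entry.name)
    return complete, incomplete


# Build a throwaway directory tree to try the routine out.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "blacklist").mkdir()
    (base / "blacklist" / MAIN_FILE).write_text("# filter code\n")
    (base / "blacklist" / "language").mkdir()  # room for language files
    (base / "halfcopied").mkdir()              # main file is missing
    complete, incomplete = discover_modules(base)

print("complete:", complete)
print("incomplete:", incomplete)
```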

UI idea: Add an option to enable / disable modules from the Spam-X administration screen.

Test Mode

Many of the current Spam-X modules allow filter rules to be written as regular expressions. These are usually hard to understand for less experienced users and can therefore cause false positives - or may simply not work at all.

It would be nice to have a way to test-drive new rules before adding them to the filter list. This could be a screen where the user can enter a test post (e.g. copied and pasted from a real spam post) and then try out filter rules to match the post.
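
The core of such a test screen could look like the following Python sketch. The function name try_rule and its return values are assumptions; the sketch distinguishes between a rule that is not a valid regular expression, a rule that does not match, and a rule that matches (showing what it matched).

```python
import re
from typing import Optional, Tuple


def try_rule(pattern: str, sample_post: str) -> Tuple[str, Optional[str]]:
    """Try a candidate rule against a sample post without saving it.

    Returns a status ("invalid", "no match" or "match") plus a detail:
    the error message for invalid rules, the matched text for matches.
    """
    try:
        compiled = re.compile(pattern, re.IGNORECASE)
    except re.error as exc:
        return "invalid", str(exc)
    match = compiled.search(sample_post)
    if match is None:
        return "no match", None
    return "match", match.group(0)


sample = "Buy V1AGRA now at http://example.com/"
for candidate in [r"v[i1]agra", r"casino", r"(unclosed"]:
    print(candidate, "->", try_rule(candidate, sample))
```

Showing the matched text helps spot rules that are too broad, which is one typical cause of false positives.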

Import / Export

A common use case for an export option is setting up a new site and wanting to reuse the spam filter setup that is already in use on another site. You can currently do that by exporting / importing the gl_spamx table, but a more convenient way to do that would be desirable.

One way to do this would be to provide all the information in an (RSS) feed.
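
As a rough illustration, here is a Python sketch that writes filter rules to a simplified RSS 2.0 document and reads them back. The mapping is an assumption made for the example (module name in the category element, rule text in the description element) and the example.com link is a placeholder; nothing here reflects the actual layout of the gl_spamx table.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

Rule = Tuple[str, str]  # (module name, rule text): hypothetical layout


def export_rules(rules: List[Rule], site_title: str) -> str:
    """Serialize all rules into a simplified RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = "https://example.com/"
    ET.SubElement(channel, "description").text = "Spam filter rules export"
    for module, value in rules:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "category").text = module
        ET.SubElement(item, "description").text = value
    return ET.tostring(rss, encoding="unicode")


def import_rules(feed_xml: str) -> List[Rule]:
    """Read the rules back from a feed produced by export_rules."""
    channel = ET.fromstring(feed_xml).find("channel")
    if channel is None:
        return []
    return [
        (item.findtext("category", ""), item.findtext("description", ""))
        for item in channel.findall("item")
    ]


rules = [("blacklist", r"v[i1]agra"), ("blacklist", "<b>free</b> & easy")]
feed = export_rules(rules, "My Geeklog Site")
print(import_rules(feed) == rules)
```

Special characters in rules, such as angle brackets and ampersands, are escaped by the XML library on export and restored on import.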

There is some overlap here with the SWOT project idea. However, the import / export would be a one-time operation covering all the current information, while SWOT feeds (one on each site) could then be used to keep the databases in sync. Specifically, the full export feed should probably not be a SWOT feed, both because of the overhead and because it may contain site-specific entries that shouldn't end up in a SWOT feed.

List Views

This is more of a technical detail: Geeklog provides list views that are sortable and can be searched. These are not currently used by the Spam-X plugin but would help make long lists of filter rules more maintainable.


Level of Difficulty

medium

The main challenges in this project are preserving ease of use and backward compatibility.

Note: In this project, we are not looking to add new methods of spam detection / prevention. The main goal is to overhaul the infrastructure to allow for future extensions. Of course, we won't stop anyone from implementing new modules as a proof-of-concept for the new infrastructure ...


Further Reading