Stomp Out Spam (SOS)

The SOS project is an amazing innovation in spam filtering technology. It will combine the best of all the spam-fighting techniques into one powerful tool. At its core is a scoring system that takes a dozen different things into account, including massively distributed peer-to-peer identification of spam. If the sender is not in my friends list, that's a couple of points. If the sender is listed in a real-time blackhole list, that's more points. If the email contains links to sites previously identified as being promoted in spam, that's more points toward the overall total. The user decides how many points each category is worth and how many points constitute spam.
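The scoring idea above can be sketched in a few lines of Python. This is only an illustration: the category names, point values and threshold are made up for the example, and the real ones would come from the user's configuration.

```python
# Hypothetical category names and point values; the user would choose these.
WEIGHTS = {
    "not_in_friends_list": 2,
    "listed_in_blackhole": 3,
    "links_to_spam_domain": 4,
}
SPAM_THRESHOLD = 5  # assumed: a total at or above this counts as spam


def spam_score(triggered, weights=WEIGHTS):
    """Sum the points for every category the message triggered."""
    return sum(weights.get(name, 0) for name in triggered)


def is_spam(triggered, weights=WEIGHTS, threshold=SPAM_THRESHOLD):
    """Compare the total against the user's chosen threshold."""
    return spam_score(triggered, weights) >= threshold
```

Setting a single category (for example the friends list check) to a value at or above the threshold makes that one test decisive on its own.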

Unlike Vipul's Razor, which relies on smaller hash codes that can be beaten and must be rotated regularly to foil spammers, SOS hashes the upper-case form of each word into one of 64 characters that never changes for that word, and then stores the entire string of characters that make up the phrases of the email as the spam signature. Then it compares dozens of random samples of the hashed phrases with your incoming mail. If any entry in the database has a high percentage of phrases that match your incoming mail, the spam score of that email increases.
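A rough Python sketch of this signature scheme follows. Several details are assumptions, since the text does not pin them down: SHA-1 is used as the per-word hash, the 64-character alphabet is arbitrary, and a "phrase" is taken to be a run of phrase_len consecutive word characters in the signature.

```python
import hashlib
import random
import string

# 64 symbols; the exact alphabet is an assumption.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def word_char(word):
    """Map the upper-cased word to one fixed character out of 64."""
    digest = hashlib.sha1(word.upper().encode("utf-8")).digest()
    return ALPHABET[digest[0] % 64]


def signature(text):
    """One character per word, in message order."""
    return "".join(word_char(w) for w in text.split())


def match_ratio(stored_sig, incoming_sig, phrase_len=4, samples=24, rng=None):
    """Fraction of randomly sampled phrases of stored_sig found in incoming_sig."""
    rng = rng or random.Random()
    if len(stored_sig) < phrase_len:
        return 1.0 if stored_sig and stored_sig in incoming_sig else 0.0
    hits = 0
    for _ in range(samples):
        start = rng.randrange(len(stored_sig) - phrase_len + 1)
        if stored_sig[start:start + phrase_len] in incoming_sig:
            hits += 1
    return hits / samples
```

A filter built this way would add points when match_ratio for any stored signature exceeds a cutoff chosen by the user.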

Additionally, when a user marks a piece of email as spam, SOS collects a list of domain names linked to or hosting graphics for that spam and allows the user to add them to the distributed list of known spam links. Email addresses are forged and disposable; the domain names being promoted by spam aren't.
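Collecting those domain names might look something like the following Python sketch. It assumes the spam body is HTML and only looks at href and src attributes; the function name and the regular expression are illustrative, not part of SOS.

```python
import re
from urllib.parse import urlparse

# Matches the value of href= or src= attributes, quoted or not.
URL_ATTR = re.compile(r'(?:href|src)\s*=\s*["\']?([^"\'\s>]+)', re.IGNORECASE)


def spam_domains(html_body):
    """Return the sorted host names that the message links to or loads images from."""
    domains = set()
    for url in URL_ATTR.findall(html_body):
        host = urlparse(url).hostname  # None for relative or mailto: links
        if host:
            domains.add(host.lower())
    return sorted(domains)
```

The user would then review this list before adding any of it to the distributed list of known spam links.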

Sender isn't in my friends list (adjust this to a high score to only accept email from friends)

Points for each word in the email that is in a database (mortgage, enlarge, inches, million, viagra, credit, naked, etc.) AFTER the spam has been de-munged. That is, B1g is known to be either Big or Blg, and V.I.A.G.R.A and V I A G R A are both recognized properly. Words with Spanish-style accents above their letters will have those letters replaced with their standard English counterparts (a common spammer trick is to replace English letters with accented ones).

Points for linking to domains known to host spam, as identified by all users through the automated peer-to-peer propagation

Points for being a close match to a message another user previously reported as spam (again peer-to-peer)

Points if I am not listed as a recipient (some list servers will do this too; either add them to your friends list or make this a low-scoring item so that list server email passes)

Points if the sender is listed in a real-time blackhole list

Points for Base64 encoding Text/HTML attachments (a spammer trick)

Optional removal of any email attachments that can contain viruses (exe, com, scr, etc).

The ability to always deliver email from friends (as long as there are no virus capable attachments)

The ability to always deliver Habeas authorized email if the sender is not listed in the Habeas Infringers List (HIL)

Blacklist of senders

Emails and URLs will be de-munged before domain names are discovered and added to the list of banned domains. A common spammer trick is to use escape characters in order to hide from filters.
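The de-munging steps mentioned in the list above (accent stripping, spaced-out letters, look-alike digits, URL escape characters) could be sketched in Python roughly as follows. The look-alike table and the function names are assumptions made for illustration; a real table would be much larger.

```python
import itertools
import re
import unicodedata
from urllib.parse import unquote

# Assumed look-alike table: each munged character maps to its possible letters.
LOOKALIKES = {"1": "il", "0": "o", "3": "e", "@": "a", "$": "s"}
# Runs of single characters split by spaces, dots or hyphens, e.g. "V.I.A.G.R.A".
SPACED = re.compile(r"\b(?:\w[ .-]){2,}\w\b")


def strip_accents(text):
    """Replace accented letters with their plain counterparts."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def demunge(text):
    """Undo %xx escapes, strip accents and rejoin spaced-out words."""
    text = strip_accents(unquote(text))
    return SPACED.sub(lambda m: re.sub(r"[ .-]", "", m.group(0)), text)


def candidates(word):
    """All lower-case readings of a word, e.g. 'B1g' -> {'big', 'blg'}."""
    options = [LOOKALIKES.get(c, c) for c in word.lower()]
    return {"".join(p) for p in itertools.product(*options)}


def listed_words(text, database):
    """Words of the de-munged text that have a reading in the word database."""
    return [w for w in demunge(text).split() if candidates(w) & database]
```

Each word returned by listed_words would earn the points the user assigned to the word database category.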

If an email is identified as spam, it is returned to the sender with an explanation of why it bounced. This vital function lets senders know that their email to you didn't make it. The bounced email will also include a code shown in a JPG graphic. If the original sender can identify the code and re-send the message, they can be added to your friends list automatically, if you have that option configured.

There will be a remotely administered whitelist of domains that are not spammers but that were caught up in an email from a spammer.

It will be impossible to take down the P2P network with a distributed denial of service attack because each user acts as a single node in a massively distributed architecture.

Administrative messages and program updates can be introduced into the net using public/private key pairs to authenticate the signature (hash) of any admin message. The administrators hash the message and encrypt the hash with their private key. When the message is distributed, each user hashes it and decrypts the distributed hash with the public key (everyone has the same public key). If the decrypted hash matches the message hash, then it's an authentic message from the administrators.
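The hash-then-sign check described above can be shown with a toy Python example. This uses textbook RSA with a small, unpadded key built from two known Mersenne primes, purely so the arithmetic is visible with the standard library alone; it is not secure, and a real deployment would use a proper cryptographic library. All names here are illustrative.

```python
import hashlib

# Toy key for illustration only: small modulus, no padding, NOT secure.
P = 2**61 - 1                      # Mersenne prime
Q = 2**89 - 1                      # Mersenne prime
N = P * Q                          # public modulus
E = 65537                          # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (needs Python 3.8+)


def message_hash(message):
    """Hash the message and reduce it so it fits under the toy modulus."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message):
    """Administrators: encrypt the hash with the private key."""
    return pow(message_hash(message), D, N)


def verify(message, signed_hash):
    """Users: decrypt with the public key and compare to their own hash."""
    return pow(signed_hash, E, N) == message_hash(message)
```

A node would forward or apply an admin message only when verify returns True for the signature that travels with it.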

The number of users who must report email with a similar signature as spam before the filter will assign points to it can be configured.
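As a small illustration of that setting, the Python sketch below counts distinct reporters per signature. The class name and the default of three reporters are assumptions, not values from the project.

```python
from collections import defaultdict


class ReportTally:
    """Track which users reported each spam signature."""

    def __init__(self, min_reporters=3):
        self.min_reporters = min_reporters   # configurable threshold
        self._reporters = defaultdict(set)   # signature -> set of user ids

    def report(self, signature, user_id):
        # A set ignores repeat reports from the same user.
        self._reporters[signature].add(user_id)

    def assigns_points(self, signature):
        """True once enough different users have reported this signature."""
        return len(self._reporters.get(signature, ())) >= self.min_reporters
```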

Abusers are people trying to abuse the network with false spam reports. They can be tracked because IP addresses are appended to a message at each hop, even if the offender prepends other IPs to the message before sending it. We can simply query every server in the list; each will report whether it truly handled that message, and the offender will be revealed. Abusers can be blocked by IP ranges, and messages generated by abusers can be recalled (undone).

Every user will be identified by their external IP address as reported to them by other users outside their domain/proxy/firewall. Additionally, every user will be randomly assigned a 16-character ID that will be hashed with their reports but not known to most other users; this will aid in message authentication by the admins.

The number of spam reports per user per day can be limited to keep abusers from quickly overwhelming the system with false spam reports.
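A per-user daily cap like this could be kept with a simple counter, as in the Python sketch below. The class name and the default limit of 20 are hypothetical.

```python
from collections import defaultdict
from datetime import date


class ReportLimiter:
    """Allow each user a fixed number of spam reports per calendar day."""

    def __init__(self, max_per_day=20):
        self.max_per_day = max_per_day
        self._counts = defaultdict(int)  # (user_id, day) -> reports so far

    def allow(self, user_id, today=None):
        """Record the report and return True, or return False if over the cap."""
        key = (user_id, today or date.today())
        if self._counts[key] >= self.max_per_day:
            return False
        self._counts[key] += 1
        return True
```

Because the day is part of the key, each user's allowance resets when the date changes.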

Each node will be hardened against a DoS attack: able to maintain large numbers of connections without going down, and able to learn which attackers to add to a list that it automatically ignores.

If abuse becomes a problem we can go to a scoring system where your spam reports only start to count after you have proven your ability to correctly identify spam, and still limit the number of reports a single user can generate in a day.

DESCRIPTION OF P2P: Messages to add/remove high-uptime servers in the global list are sent over the P2P network. Each node contains a complete list of servers. If we envision the list of nodes as a square, each node, upon receipt of a new message, communicates that message to the node to its left (N-1), to its right (N+1), to the node above it (N - sqrt(number of nodes)) and to the node below it (N + sqrt(number of nodes)), wrapping around the beginning and end of the list as necessary. Additionally, one percent of the nodes, those whose position ends in 00, will take the portion of their position number that comes before the trailing zeros, add and subtract one, re-apply the zeros, and communicate with those two nodes. For example, the node at position 99,000 will take the 99 (the portion of the number before the zeros), add 1 to get 100, re-apply the zeros to get 100,000, and call that node. It will also call node 98,000. This will move messages to all sections of the list very quickly regardless of how large the list gets.
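The neighbor rules above can be written out as a short Python sketch. Two points are assumptions: the square's side is taken as the integer square root of the node count (the text does not say what happens when the count is not a perfect square), and the long-range positions wrap around the list the same way the grid neighbors do.

```python
import math


def grid_neighbors(n, total):
    """Left, right, above and below node n in a wrapped square of total nodes."""
    side = math.isqrt(total)  # assumed rounding for non-square totals
    return [(n - 1) % total, (n + 1) % total,
            (n - side) % total, (n + side) % total]


def long_range_neighbors(n, total):
    """Extra links for positions ending in 00, e.g. 99000 -> 98000 and 100000."""
    if n == 0 or n % 100 != 0:
        return []
    digits = str(n)
    zeros = len(digits) - len(digits.rstrip("0"))  # count of trailing zeros
    scale = 10 ** zeros
    head = n // scale                              # the part before the zeros
    return [((head - 1) * scale) % total, ((head + 1) * scale) % total]
```

For the example in the text, long_range_neighbors(99000, 1000000) gives positions 98000 and 100000.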

REQUIRED JAVA COMPONENTS
