Subject: Re: is this a good time
Author: lksjdflsjdf
Date Posted: 06:06:01 01/01/08 Tue
In reply to: drterter's message, "Re: is this a good time" on 06:05:10 01/01/08 Tue
>>>August 2002
>>>
>>>(This article describes the spam-filtering techniques
>>>used in the spamproof web-based mail reader we built
>>>to exercise Arc. An improved algorithm is described in
>>>Better Bayesian Filtering.)
>>>
>>>I think it's possible to stop spam, and that
>>>content-based filters are the way to do it. The
>>>Achilles heel of the spammers is their message. They
>>>can circumvent any other barrier you set up. They have
>>>so far, at least. But they have to deliver their
>>>message, whatever it is. If we can write software that
>>>recognizes their messages, there is no way they can
>>>get around that.
>>>
>>>To the recipient, spam is easily recognizable. If you
>>>hired someone to read your mail and discard the spam,
>>>they would have little trouble doing it. How much do
>>>we have to do, short of AI, to automate this process?
>>>
>>>I think we will be able to solve the problem with
>>>fairly simple algorithms. In fact, I've found that you
>>>can filter present-day spam acceptably well using
>>>nothing more than a Bayesian combination of the spam
>>>probabilities of individual words. Using a slightly
>>>tweaked (as described below) Bayesian filter, we now
>>>miss less than 5 per 1000 spams, with 0 false
>>>positives.
>>>
>>>The statistical approach is not usually the first one
>>>people try when they write spam filters. Most hackers'
>>>first instinct is to try to write software that
>>>recognizes individual properties of spam. You look at
>>>spams and you think, the gall of these guys to try
>>>sending me mail that begins "Dear Friend" or has a
>>>subject line that's all uppercase and ends in eight
>>>exclamation points. I can filter out that stuff with
>>>about one line of code.
>>>
>>>And so you do, and in the beginning it works. A few
>>>simple rules will take a big bite out of your incoming
>>>spam. Merely looking for the word "click" will catch
>>>79.7% of the emails in my spam corpus, with only 1.2%
>>>false positives.
>>>
>>>I spent about six months writing software that looked
>>>for individual spam features before I tried the
>>>statistical approach. What I found was that
>>>recognizing that last few percent of spams got very
>>>hard, and that as I made the filters stricter I got
>>>more false positives.
>>>
>>>False positives are innocent emails that get
>>>mistakenly identified as spams. For most users,
>>>missing legitimate email is an order of magnitude
>>>worse than receiving spam, so a filter that yields
>>>false positives is like an acne cure that carries a
>>>risk of death to the patient.
>>>
>>>The more spam a user gets, the less likely he'll be to
>>>notice one innocent mail sitting in his spam folder.
>>>And strangely enough, the better your spam filters
>>>get, the more dangerous false positives become,
>>>because when the filters are really good, users will
>>>be more likely to ignore everything they catch.
>>>
>>>I don't know why I avoided trying the statistical
>>>approach for so long. I think it was because I got
>>>addicted to trying to identify spam features myself,
>>>as if I were playing some kind of competitive game
>>>with the spammers. (Nonhackers don't often realize
>>>this, but most hackers are very competitive.) When I
>>>did try statistical analysis, I found immediately that
>>>it was much cleverer than I had been. It discovered,
>>>of course, that terms like "virtumundo" and "teens"
>>>were good indicators of spam. But it also discovered
>>>that "per" and "FL" and "ff0000" are good indicators
>>>of spam. In fact, "ff0000" (html for bright red) turns
>>>out to be as good an indicator of spam as any
>>>pornographic term.
>>>
>>>
>>>_ _ _
>>>
>>>
>>>Here's a sketch of how I do statistical filtering. I
>>>start with one corpus of spam and one of nonspam mail.
>>>At the moment each one has about 4000 messages in it.
>>>I scan the entire text, including headers and embedded
>>>html and javascript, of each message in each corpus. I
>>>currently consider alphanumeric characters, dashes,
>>>apostrophes, and dollar signs to be part of tokens,
>>>and everything else to be a token separator. (There is
>>>probably room for improvement here.) I ignore tokens
>>>that are all digits, and I also ignore html comments,
>>>not even considering them as token separators.
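For anyone who wants to play with this, here is a rough Python sketch of the tokenizing rule just quoted. It is my reading of the paragraph, not the essay's Arc/Lisp code: the name `tokenize` and both regexes are my own, I treat "alphanumeric" as ASCII only, and I lowercase here because the next paragraph says case is ignored.

```python
import re

# Assumption: HTML comments look like <!-- ... --> and may span lines.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Token characters per the essay: alphanumerics, dashes, apostrophes, dollar signs.
TOKEN = re.compile(r"[A-Za-z0-9$'-]+")

def tokenize(text):
    """Split raw message text into lowercase tokens."""
    # Delete comments outright, so they do not even act as separators.
    text = HTML_COMMENT.sub("", text)
    tokens = []
    for tok in TOKEN.findall(text):
        if tok.isdigit():  # ignore tokens that are all digits
            continue
        tokens.append(tok.lower())
    return tokens
```

Note that a word split by a comment, like `cl<!-- x -->ick`, comes back as the single token `click`, which I take to be the point of not treating comments as separators.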
>>>
>>>I count the number of times each token (ignoring case,
>>>currently) occurs in each corpus. At this stage I end
>>>up with two large hash tables, one for each corpus,
>>>mapping tokens to number of occurrences.
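The counting step is small enough to sketch too. This is only an illustration in Python: `count_tokens` is a made-up name, and the default whitespace split is a placeholder for whatever tokenizer you actually use.

```python
from collections import Counter

def count_tokens(messages, tokenize=str.split):
    """Return a table mapping each token to its total occurrences in a corpus."""
    counts = Counter()
    for msg in messages:
        # Case is ignored, so fold everything to lowercase before counting.
        counts.update(tok.lower() for tok in tokenize(msg))
    return counts
```

You would call it once on the spam corpus and once on the nonspam corpus to get the two tables. A `Counter` returns 0 for tokens it has never seen, which plays the role of the `(or (gethash word good) 0)` default in the Lisp.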
>>>
>>>Next I create a third hash table, this time mapping
>>>each token to the probability that an email containing
>>>it is a spam, which I calculate as follows [1]:
>>>(let ((g (* 2 (or (gethash word good) 0)))
>>>      (b (or (gethash word bad) 0)))
>>>   (unless (< (+ g b) 5)
>>>     (max .01
>>>          (min .99 (float (/ (min 1 (/ b nbad))
>>>                             (+ (min 1 (/ g ngood))
>>>                                (min 1 (/ b nbad)))))))))
>>>
>>>where word is the token whose probability we're
>>>calculating, good and bad are the hash tables I
>>>created in the first step, and ngood and nbad are the
>>>number of nonspam and spam messages respectively.
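Since the line wrapping mangles the Lisp, here is the same calculation as I read it, transliterated into Python. The function name and the use of plain dicts are my choices, and it assumes `ngood` and `nbad` are both greater than zero.

```python
def word_probability(word, good, bad, ngood, nbad):
    """Spam probability for one token, or None if the token is too rare."""
    g = 2 * good.get(word, 0)  # nonspam count, doubled to bias against false positives
    b = bad.get(word, 0)       # spam count
    if g + b < 5:              # same test as (unless (< (+ g b) 5) ...)
        return None
    spam_ratio = min(1.0, b / nbad)
    good_ratio = min(1.0, g / ngood)
    # Clamp to [.01, .99] so a token seen in only one corpus is never absolute proof.
    return max(0.01, min(0.99, spam_ratio / (good_ratio + spam_ratio)))
```

One detail worth noticing: the code as quoted keeps any token whose (doubled) total is at least five, which is slightly more lenient than the "more than five times" wording in the prose that follows.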
>>>
>>>I explained this as code to show a couple of important
>>>details. I want to bias the probabilities slightly to
>>>avoid false positives, and by trial and error I've
>>>found that a good way to do it is to double all the
>>>numbers in good. This helps to distinguish between
>>>words that occasionally do occur in legitimate email
>>>and words that almost never do. I only consider words
>>>that occur more than five times in total (actually,
>>>because of the doubling, occurring three times in
>>>nonspam mail would be enough). And then there is the
>>>question of what probability to assign to words that
>>>occur in one corpus but not the other. Again by trial
>>>and error I chose .01 and .99. There may be room for
>>>tuning here, but as the corpus grows such tuning will
>>>happen automatically anyway.
>>>
>>>The especially observant will notice that while I
>>>consider each corpus to be a single long stream of
>>>text for purposes of counting occurrences, I use the
>>>number of emails in each, rather than their combined
>>>length, as the divisor in calculating spam
>>>probabilities. This adds another slight bias to
>>>protect against false positives.
>>>
>>>When new mail arrives, it is scanned into tokens, and
>>>the most interesting fifteen tokens, where interesting
>>>is measured by how far their spam probability is from
>>>a neutral .5, are used to calculate the probability
>>>that the mail is spam. If probs is a list of the
>>>fifteen individual probabilities, you calculate the
>>>combined probability thus:
>>>(let ((prod (apply #'* probs)))
>>>  (/ prod
>>>     (+ prod
>>>        (apply #'* (mapcar #'(lambda (x) (- 1 x))
>>>                           probs)))))
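The combining formula above reads the same way in Python. This is a direct transliteration; only the function name is mine, and `math.prod` needs Python 3.8 or later.

```python
from math import prod

def combined_probability(probs):
    """Combine per-token spam probabilities into one probability for the message."""
    p_spam = prod(probs)                 # product of the probabilities
    p_ham = prod(1 - p for p in probs)   # product of their complements
    return p_spam / (p_spam + p_ham)
```

Equal and opposite evidence cancels: a .99 token and a .01 token together come out at an even .5.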
>>>
>>>One question that arises in practice is what
>>>probability to assign to a word you've never seen,
>>>i.e. one that doesn't occur in the hash table of word
>>>probabilities. I've found, again by trial and error,
>>>that .4 is a good number to use. If you've never seen
>>>a word before, it is probably fairly innocent; spam
>>>words tend to be all too familiar.
>>>
>>>There are examples of this algorithm being applied to
>>>actual emails in an appendix at the end.
>>>
>>>I treat mail as spam if the algorithm above gives it a
>>>probability of more than .9 of being spam. But in
>>>practice it would not matter much where I put this
>>>threshold, because few probabilities end up in the
>>>middle of the range.
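Putting the last few paragraphs together, the classification step might look roughly like this in Python. The constants are the essay's numbers (.4 for unseen words, fifteen tokens, .9 threshold); the function names are mine, and counting each distinct token once is my assumption, since the quoted text does not say how repeats are handled.

```python
from math import prod

UNKNOWN = 0.4       # probability assigned to never-seen tokens
THRESHOLD = 0.9     # mail scoring above this is treated as spam
N_INTERESTING = 15  # how many tokens take part in the calculation

def spam_probability(tokens, word_probs):
    """Score a message from its fifteen most interesting tokens."""
    # Assumption: each distinct token counts once; sorting first keeps ties deterministic.
    probs = [word_probs.get(t, UNKNOWN) for t in sorted(set(tokens))]
    # "Interesting" means far from a neutral .5, in either direction.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:N_INTERESTING]
    spam = prod(top)
    ham = prod(1 - p for p in top)
    return spam / (spam + ham)

def is_spam(tokens, word_probs):
    return spam_probability(tokens, word_probs) > THRESHOLD
```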
>>>
>>>
>>>_ _ _
>>>
>>>
>>>One great advantage of the statistical approach is
>>>that you don't have to read so many spams. Over the
>>>past six months, I've read literally thousands of
>>>spams, and it is really kind of demoralizing. Norbert
>>>Wiener said if you compete with slaves you become a
>>>slave, and there is something similarly degrading
>>>about competing with spammers. To recognize individual
>>>spam features you have to try to get into the mind of
>>>the spammer, and frankly I want to spend as little
>>>time inside the minds of spammers as possible.
>>>
>>>But the real advantage of the Bayesian approach, of
>>>course, is that you know what you're measuring.
>>>Feature-recognizing filters like SpamAssassin assign a
>>>spam "score" to email. The Bayesian approach assigns
>>>an actual probability. The problem with a "score" is
>>>that no one knows what it means. The user doesn't know
>>>what it means, but worse still, neither does the
>>>developer of the filter. How many points should an
>>>email get for having the word "sex" in it? A
>>>probability can of course be mistaken, but there is
>>>little ambiguity about what it means, or how evidence
>>>should be combined to calculate it. Based on my
>>>corpus, "sex" indicates a .97 probability of the
>>>containing email being a spam, whereas "sexy"
>>>indicates .99 probability. And Bayes' Rule, equally
>>>unambiguous, says that an email containing both words
>>>would, in the (unlikely) absence of any other
>>>evidence, have a 99.97% chance of being a spam.
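The 99.97% figure is easy to check by hand. Here is the arithmetic in Python, using only the two probabilities quoted above (the variable names are mine):

```python
p_sex, p_sexy = 0.97, 0.99

both_spam = p_sex * p_sexy             # 0.9603
both_ham = (1 - p_sex) * (1 - p_sexy)  # 0.0003
combined = both_spam / (both_spam + both_ham)

print(f"{combined:.2%}")  # prints 99.97%
```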
>>>
>>>Because it is measuring probabilities, the Bayesian
>>>approach considers all the evidence in the email, both
>>>good and bad. Words that occur disproportionately
>>>rarely in spam (like "though" or "tonight" or
>>>"apparently") contribute as much to decreasing the
>>>probability as bad words like "unsubscribe" and
>>>"opt-in" do to increasing it. So an otherwise innocent
>>>email that happens to include the word "sex" is not
>>>going to get tagged as spam.
>>>
>>>Ideally, of course, the probabilities should be
>>>calculated individually for each user. I get a lot of
>>>email containing the word "Lisp", and (so far) no spam
>>>that does. So a word like that is effectively a kind
>>>of password for sending mail to me. In my earlier
>>>spam-filtering software, the user could set up a list
>>>of such words and mail containing them would
>>>automatically get past the filters. On my list I put
>>>words like "Lisp" and also my zipcode, so that
>>>(otherwise rather spammy-sounding) receipts from
>>>online orders would get through. I thought I was being
>>>very clever, but I found that the Bayesian filter did
>>>the same thing for me, and moreover discovered a
>>>lot of words I hadn't thought of.
>>>
>>>When I said at the start that our filters let through
>>>less than 5 spams per 1000 with 0 false positives, I'm
>>>talking about filtering my mail based on a corpus of
>>>my mail. But these numbers are not misleading, because
>>>that is the approach I'm advocating: filter each
>>>user's mail based on the spam and nonspam mail he
>>>receives. Essentially, each user should have two
>>>delete buttons, ordinary delete and delete-as-spam.
>>>Anything deleted as spam goes into the spam corpus,
>>>and everything else goes into the nonspam corpus.
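The two-button idea maps onto very little code. A hypothetical sketch in Python follows; the `Corpus` and `Mailbox` classes and their method names are invented for illustration and are not from the essay.

```python
from collections import Counter

class Corpus:
    """Token counts plus a message count, which the probability step divides by."""
    def __init__(self):
        self.counts = Counter()
        self.messages = 0

    def add(self, tokens):
        self.counts.update(tokens)
        self.messages += 1

class Mailbox:
    """Per-user training state driven by the two delete buttons."""
    def __init__(self):
        self.spam = Corpus()
        self.nonspam = Corpus()

    def delete(self, tokens):
        # Ordinary delete: the message counts as nonspam.
        self.nonspam.add(tokens)

    def delete_as_spam(self, tokens):
        self.spam.add(tokens)
```

The essay says everything not deleted as spam goes into the nonspam corpus, so in a real reader mail the user keeps would go through `nonspam.add` as well.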
>>>
>>>You could start users with a seed filter, but
>>>ultimately each user should have his own per-word
>>>probabilities based on the actual mail he receives.
>>>This (a) makes the filters more effective, (b) lets
>>>each user decide their own precise definition of spam,
>>>and (c) perhaps best of all makes it hard for spammers
>>>to tune mails to get through the filters. If a lot of
>>>the brain of the filter is in the individual
>>>databases, then merely tuning spams to get through the
>>>seed filters won't guarantee anything about how well
>>>they'll get through individual users' varying and much
>>>more trained filters.
>>>
>>>Content-based spam filtering is often combined with a
>>>whitelist, a list of senders whose mail can be
>>>accepted with no filtering. One easy way to build such
>>>a whitelist is to keep a list of every address the
>>>user has ever sent mail to. If a mail reader has a
>>>delete-as-spam button then you could also add the from
>>>address of every email the user has deleted as
>>>ordinary trash.
>>>
>>>I'm an advocate of whitelists, but more as a way to
>>>save computation than as a way to improve filtering. I
>>>used to think that whitelists would make filtering
>>>easier, because you'd only have to filter email from
>>>people you'd never heard from, and someone sending you
>>>mail for the first time is constrained by convention
>>>in what they can say to you. Someone you already know
>>>might send you an email talking about sex, but someone
>>>sending you mail for the first time would not be
>>>likely to. The problem is, people can have more than
>>>one email address, so a new from-address doesn't
>>>guarantee that the sender is writing to you for the
>>>first time. It is not unusual for an old friend
>>>(especially if he is a hacker) to suddenly send you an
>>>email with a new from-address, so you can't risk false
>>>positives by filtering mail from unknown addresses
>>>especially stringently.
>>>
>>>In a sense, though, my filters do themselves embody a
>>>kind of whitelist (and blacklist) because they are
>>>based on entire messages, including the headers. So to
>>>that extent they "know" the email addresses of trusted
>>>senders and even the routes by which mail gets from
>>>them to me. And they know the same about spam,
>>>including the server names, mailer versions, and
>>>protocols.
>>>
>>>
>>>_ _ _
>>>
>>>
>>>If I thought that I could keep up current rates of
>>>spam filtering, I would consider this problem solved.
>>>But it doesn't mean much to be able to filter out most
>>>present-day spam, because spam evolves. Indeed, most
>>>antispam techniques so far have been like pesticides
>>>that do nothing more than create a new, resistant
>>>strain of bugs.
>>>
>>>I'm more hopeful about Bayesian filters, because they
>>>evolve with the spam. So as spammers start using
>>>"c0ck" instead of "cock" to evade simple-minded spam
>>>filters based on individual words, Bayesian filters
>>>automatically notice. Indeed, "c0ck" is far more
>>>damning evidence than "cock", and Bayesian filters
>>>know precisely how much more.
>>>
>>>Still, anyone who proposes a plan for spam filtering
>>>has to be able to answer the question: if the spammers
>>>knew exactly what you were doing, how well could they
>>>get past you? For example, I think that if
>>>checksum-based spam filtering becomes a serious
>>>obstacle, the spammers will just switch to mad-lib
>>>techniques for generating message bodies.
>>>
>>>To beat Bayesian filters, it would not be enough for
>>>spammers to make their emails unique or to stop using
>>>individual naughty words. They'd have to make their
>>>mails indistinguishable from your ordinary mail. And
>>>this I think would severely constrain them. Spam is
>>>mostly sales pitches, so unless your regular mail is
>>>all sales pitches, spams will inevitably have a
>>>different character. And the spammers would also, of
>>>course, have to change (and keep changing) their whole
>>>infrastructure, because otherwise the headers would
>>>look as bad to the Bayesian filters as ever, no matter
>>>what they did to the message body. I don't know enough
>>>about the infrastructure that spammers use to know how
>>>hard it would be to make the headers look innocent,
>>>but my guess is that it would be even harder than
>>>making the message look innocent.
>>>
>>>Assuming they could solve the problem of the headers,
>>>the spam of the future will probably look something
>>>like this:
>>>Hey there. Thought you should check out the following:
>>>
>>>http://www.27meg.com/foo
>>>
>>>because that is about as much sales pitch as
>>>content-based filtering will leave the spammer room to
>>>make. (Indeed, it will be hard even to get this past
>>>filters, because if everything else in the email is
>>>neutral, the spam probability will hinge on the url,
>>>and it will take some effort to make that look
>>>neutral.)
>>>
>>>Spammers range from businesses running so-called
>>>opt-in lists who don't even try to conceal their
>>>identities, to guys who hijack mail servers to send
>>>out spams promoting porn sites. If we use filtering to
>>>whittle their options down to mails like the one
>>>above, that should pretty much put the spammers on the
>>>"legitimate" end of the spectrum out of business; they
>>>feel obliged by various state laws to include
>>>boilerplate about why their spam is not spam, and how
>>>to cancel your "subscription," and that kind of text
>>>is easy to recognize.
>>>
>>>(I used to think it was naive to believe that stricter
>>>laws would decrease spam. Now I think that while
>>>stricter laws may not decrease the amount of spam that
>>>spammers send, they can certainly help filters to
>>>decrease the amount of spam that recipients actually
>>>see.)
>>>
>>>All along the spectrum, if you restrict the sales
>>>pitches spammers can make, you will inevitably tend to
>>>put them out of business. That word business is an
>>>important one to remember. The spammers are
>>>businessmen. They send spam because it works. It works
>>>because although the response rate is abominably low
>>>(at best 15 per million, vs 3000 per million for a
>>>catalog mailing), the cost, to them, is practically
>>>nothing. The cost is enormous for the recipients,
>>>about 5 man-weeks for each million recipients who
>>>spend a second to delete the spam, but the spammer
>>>doesn't have to pay that.
>>>
>>>Sending spam does cost the spammer something, though.
>>>[2] So the lower we can get the response rate--
>>>whether by filtering, or by using filters to force
>>>spammers to dilute their pitches-- the fewer
>>>businesses will find it worth their while to send spam.
>>>
>>>The reason the spammers use the kinds of sales pitches
>>>that they do is to increase response rates. This is
>>>possibly even more disgusting than getting inside the
>>>mind of a spammer, but let's take a quick look inside
>>>the mind of someone who responds to a spam. This
>>>person is either astonishingly credulous or deeply in
>>>denial about their sexual interests. In either case,
>>>repulsive or idiotic as the spam seems to us, it is
>>>exciting to them. The spammers wouldn't say these
>>>things if they didn't sound exciting. And "thought you
>>>should check out the following" is just not going to
>>>have nearly the pull with the spam recipient as the
>>>kinds of things that spammers say now. Result: if it
>>>can't contain exciting sales pitches, spam becomes
>>>less effective as a marketing vehicle, and fewer
>>>businesses want to use it.
>>>
>>>That is the big win in the end. I started writing spam
>>>filtering software because I didn't want to have to look
>>>at the stuff anymore. But if we get good enough at
>>>filtering out spam, it will stop working, and the
>>>spammers will actually stop sending it.
>>>
>>>
>>>_ _ _
>>>
>>>
>>>Of all the approaches to fighting spam, from software
>>>to laws, I believe Bayesian filtering will be the
>>>single most effective. But I also think that the more
>>>different kinds of antispam efforts we undertake, the
>>>better, because any measure that constrains spammers
>>>will tend to make filtering easier. And even within
>>>the world of content-based filtering, I think it will
>>>be a good thing if there are many different kinds of
>>>software being used simultaneously. The more different
>>>filters there are, the harder it will be for spammers
>>>to tune spams to get through them.
>>>
>>>
>>>
>>>Appendix: Examples of Filtering
>>>
>>>Here is an example of a spam that arrived while I was
>>>writing this article. The fifteen most interesting
>>>words in this spam are:
>>>qvp0045
>>>indira
>>>mx-05
>>>intimail
>>>$7500
>>>freeyankeedom
>>>cdo
>>>bluefoxmedia
>>>jpg
>>>unsecured
>>>platinum
>>>3d0
>>>qves
>>>7c5
>>>7c266675
>>>
>>>The words are a mix of stuff from the headers and from
>>>the message body, which is typical of spam. Also
>>>typical of spam is that every one of these words has a
>>>spam probability, in my database, of .99. In fact
>>>there are more than fifteen words with probabilities
>>>of .99, and these are just the first fifteen seen.
>>>
>>>Unfortunately that makes this email a boring example
>>>of the use of Bayes' Rule. To see an interesting
>>>variety of probabilities we have to look at this
>>>actually quite atypical spam.
>>>
>>>The fifteen most interesting words in this spam, with
>>>their probabilities, are:
>>>madam 0.99
>>>promotion 0.99
>>>republic 0.99
>>>shortest 0.047225013
>>>mandatory 0.047225013
>>>standardization 0.07347802
>>>sorry 0.08221981
>>>supported 0.09019077
>>>people's 0.09019077
>>>enter 0.9075001
>>>quality 0.8921298
>>>organization 0.12454646
>>>investment 0.8568143
>>>very 0.14758544
>>>valuable 0.82347786
>>>
>>>This time the evidence is a mix of good and bad. A
>>>word like "shortest" is almost as much evidence for
>>>innocence as a word like "madam" or "promotion" is for
>>>guilt. But still the case for guilt is stronger. If
>>>you combine these numbers according to Bayes' Rule,
>>>the resulting probability is .9027.
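You can reproduce that number from the fifteen probabilities listed above. The sketch below does it in Python using the log-odds form of the same formula, which is algebraically equivalent to the product version and a bit friendlier to floating point; that rewrite is mine, not the essay's.

```python
from math import exp, log

# The fifteen probabilities from the table above, in the order listed.
probs = [
    0.99, 0.99, 0.99, 0.047225013, 0.047225013,
    0.07347802, 0.08221981, 0.09019077, 0.09019077, 0.9075001,
    0.8921298, 0.12454646, 0.8568143, 0.14758544, 0.82347786,
]

# prod(p) / (prod(p) + prod(1 - p)) is the same as 1 / (1 + exp(-sum of log-odds)).
log_odds = sum(log(p / (1 - p)) for p in probs)
combined = 1 / (1 + exp(-log_odds))

print(round(combined, 4))
```

It comes out at roughly .903, in line with the .9027 quoted, and just over the .9 threshold.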
>>>
>>>"Madam" is obviously from spams beginning "Dear Sir or
>>>Madam." They're not very common, but the word "madam"
>>>never occurs in my legitimate email, and it's all
>>>about the ratio.
>>>
>>>"Republic" scores high because it often shows up in
>>>Nigerian scam emails, and also occurs once or twice in
>>>spams referring to Korea and South Africa. You might
>>>say that it's an accident that it thus helps identify
>>>this spam. But I've found when examining spam
>>>probabilities that there are a lot of these accidents,
>>>and they have an uncanny tendency to push things in
>>>the right direction rather than the wrong one. In this
>>>case, it is not entirely a coincidence that the word
>>>"Republic" occurs in Nigerian scam emails and this
>>>spam. There is a whole class of dubious business
>>>propositions involving less developed countries, and
>>>these in turn are more likely to have names that
>>>specify explicitly (because they aren't) that they are
>>>republics.[3]
>>>
>>>On the other hand, "enter" is a genuine miss. It
>>>occurs mostly in unsubscribe instructions, but here is
>>>used in a completely innocent way. Fortunately the
>>>statistical approach is fairly robust, and can
>>>tolerate quite a lot of misses before the results
>>>start to be thrown off.
>>>
>>>For comparison, here is an example of that rare bird,
>>>a spam that gets through the filters. Why? Because by
>>>sheer chance it happens to be loaded with words that
>>>occur in my actual email:
>>>perl 0.01
>>>python 0.01
>>>tcl 0.01
>>>scripting 0.01
>>>morris 0.01
>>>graham 0.01491078
>>>guarantee 0.9762507
>>>cgi 0.9734398
>>>paul 0.027040077
>>>quite 0.030676773
>>>pop3 0.042199217
>>>various 0.06080265
>>>prices 0.9359873
>>>managed 0.06451222
>>>difficult 0.071706355
>>>
>>>There are a couple pieces of good news here. First,
>>>this mail probably wouldn't get through the filters of
>>>someone who didn't happen to specialize in programming
>>>languages and have a good friend called Morris. For
>>>the average user, all the top five words here would be
>>>neutral and would not contribute to the spam
>>>probability.
>>>
>>>Second, I think filtering based on word pairs (see
>>>below) might well catch this one: "cost effective",
>>>"setup fee", "money back" -- pretty incriminating
>>>stuff. And of course if they continued to spam me (or
>>>a network I was part of), "Hostex" itself would be
>>>recognized as a spam term.
>>>
>>>Finally, here is an innocent email. Its fifteen most
>>>interesting words are as follows:
>>>continuation 0.01
>>>describe 0.01
>>>continuations 0.01
>>>example 0.033600237
>>>programming 0.05214485
>>>i'm 0.055427782
>>>examples 0.07972858
>>>color 0.9189189
>>>localhost 0.09883721
>>>hi 0.116539136
>>>california 0.84421706
>>>same 0.15981844
>>>spot 0.1654587
>>>us-ascii 0.16804294
>>>what 0.19212411
>>>
>>>Most of the words here indicate the mail is an
>>>innocent one. There are two bad smelling words,
>>>"color" (spammers love colored fonts) and "California"
>>>(which occurs in testimonials and also in menus in
>>>forms), but they are not enough to outweigh obviously
>>>innocent words like "continuation" and "example".
>>>
>>>It's interesting that "describe" rates as so
>>>thoroughly innocent. It hasn't occurred in a single
>>>one of my 4000 spams. The data turns out to be full of
>>>such surprises. One of the things you learn when you
>>>analyze spam texts is how narrow a subset of the
>>>language spammers operate in. It's that fact, together
>>>with the equally characteristic vocabulary of any
>>>individual user's mail, that makes Bayesian filtering
>>>a good bet.
>>>
>>>Appendix: More Ideas
>>>
>>>One idea that I haven't tried yet is to filter based
>>>on word pairs, or even triples, rather than individual
>>>words. This should yield a much sharper estimate of
>>>the probability. For example, in my current database,
>>>the word "offers" has a probability of .96. If you
>>>based the probabilities on word pairs, you'd end up
>>>with "special offers" and "valuable offers" having
>>>probabilities of .99 and, say, "approach offers" (as
>>>in "this approach offers") having a probability of .1
>>>or less.
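Turning a token stream into word pairs takes one line. A small Python sketch, with `word_pairs` being my name for it; the pairs would then be counted and scored exactly like single tokens.

```python
def word_pairs(tokens):
    """Return each adjacent pair of tokens joined into a single feature string."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
```

So `["this", "approach", "offers"]` yields "this approach" and "approach offers", which is how the innocent use of "offers" would get its own, much lower, probability.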
>>>
>>>The reason I haven't done this is that filtering based
>>>on individual words already works so well. But it does
>>>mean that there is room to tighten the filters if spam
>>>gets harder to detect. (Curiously, a filter based on
>>>word pairs would be in effect a Markov-chaining text
>>>generator running in reverse.)
>>>
>>>Specific spam features (e.g. not seeing the
>>>recipient's address in the to: field) do of course
>>>have value in recognizing spam. They can be considered
>>>in this algorithm by treating them as virtual words.
>>>I'll probably do this in future versions, at least for
>>>a handful of the most egregious spam indicators.
>>>Feature-recognizing spam filters are right in many
>>>details; what they lack is an overall discipline for
>>>combining evidence.
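One way the "virtual words" idea could work, sketched in Python. Everything here is hypothetical: the function, its parameters, and the marker token are made up, and the only feature checked is the one the essay mentions (recipient missing from the to: field).

```python
# Hypothetical marker; "*" is a separator character under the essay's
# tokenizing rule, so no real token can collide with it.
NOT_IN_TO = "*recipient-not-in-to*"

def add_virtual_words(tokens, recipient, to_field):
    """Append a virtual token when the recipient's address is absent from to:."""
    out = list(tokens)
    if recipient.lower() not in to_field.lower():
        out.append(NOT_IN_TO)
    return out
```

The virtual token then gets counted in both corpora and picks up a probability like any other word, so the feature is weighed as evidence instead of being given a hand-picked score.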
>>>
>>>Recognizing nonspam features may be more important
>>>than recognizing spam features. False positives are
>>>such a worry that they demand extraordinary measures.
>>>I will probably in future versions add a second level
>>>of testing designed specifically to avoid false
>>>positives. If a mail triggers this second level of
>>>filters it will be accepted even if its spam
>>>probability is above the threshold.
>>>
>>>I don't expect this second level of filtering to be
>>>Bayesian. It will inevitably be not only ad hoc, but
>>>based on guesses, because the number of false
>>>positives will not tend to be large enough to notice
>>>patterns. (It is just as well, anyway, if a backup
>>>system doesn't rely on the same technology as the
>>>primary system.)
>>>
>>>Another thing I may try in the future is to focus
>>>extra attention on specific parts of the email. For
>>>example, about 95% of current spam includes the url of
>>>a site they want you to visit. (The remaining 5% want
>>>you to call a phone number, reply by email or to a US
>>>mail address, or in a few cases to buy a certain
>>>stock.) The url is in such cases practically enough by
>>>itself to determine whether the email is spam.
>>>
>>>Domain names differ from the rest of the text in a
>>>(non-German) email in that they often consist of
>>>several words stuck together. Though computationally
>>>expensive in the general case, it might be worth
>>>trying to decompose them. If a filter has never seen
>>>the token "xxxporn" before it will have an individual
>>>spam probability of .4, whereas "xxx" and "porn"
>>>individually have probabilities (in my corpus) of
>>>.9889 and .99 respectively, and a combined probability
>>>of .9998.
>>>
>>>I expect decomposing domain names to become more
>>>important as spammers are gradually forced to stop
>>>using incriminating words in the text of their
>>>messages. (A url with an ip address is of course an
>>>extremely incriminating sign, except in the mail of a
>>>few sysadmins.)
>>>
>>>It might be a good idea to have a cooperatively
>>>maintained list of urls promoted by spammers. We'd
>>>need a trust metric of the type studied by Raph Levien
>>>to prevent malicious or incompetent submissions, but
>>>if we had such a thing it would provide a boost to any
>>>filtering software. It would also be a convenient
>>>basis for boycotts.
>>>
>>>Another way to test dubious urls would be to send out
>>>a crawler to look at the site before the user looked
>>>at the email mentioning it. You could use a Bayesian
>>>filter to rate the site just as you would an email,
>>>and whatever was found on the site could be included
>>>in calculating the probability of the email being a
>>>spam. A url that led to a redirect would of course be
>>>especially suspicious.
>>>
>>>One cooperative project that I think really would be a
>>>good idea would be to accumulate a giant corpus of
>>>spam. A large, clean corpus is the key to making
>>>Bayesian filtering work well. Bayesian filters could
>>>actually use the corpus as input. But such a corpus
>>>would be useful for other kinds of filters too,
>>>because it could be used to test them.
>>>
>>>Creating such a corpus poses some technical problems.
>>>We'd need trust metrics to prevent malicious or
>>>incompetent submissions, of course. We'd also need
>>>ways of erasing personal information (not just
>>>to-addresses and ccs, but also e.g. the arguments to
>>>unsubscribe urls, which often encode the to-address)
>>>from mails in the corpus. If anyone wants to take on
>>>this project, it would be a good thing for the world.
>>>
>>>Appendix: Defining Spam
>>>
>>>I think there is a rough consensus on what spam is,
>>>but it would be useful to have an explicit definition.
>>>We'll need to do this if we want to establish a
>>>central corpus of spam, or even to compare spam
>>>filtering rates meaningfully.
>>>
>>>To start with, spam is not unsolicited commercial
>>>email. If someone in my neighborhood heard that I was
>>>looking for an old Raleigh three-speed in good
>>>condition, and sent me an email offering to sell me
>>>one, I'd be delighted, and yet this email would be
>>>both commercial and unsolicited. The defining feature
>>>of spam (in fact, its raison d'etre) is not that it is
>>>unsolicited, but that it is automated.
>>>
>>>It is merely incidental, too, that spam is usually
>>>commercial. If someone started sending mass email to
>>>support some political cause, for example, it would be
>>>just as much spam as email promoting a porn site.
>>>
>>>I propose we define spam as unsolicited automated
>>>email. This definition thus includes some email that
>>>many legal definitions of spam don't. Legal
>>>definitions of spam, influenced presumably by
>>>lobbyists, tend to exclude mail sent by companies that
>>>have an "existing relationship" with the recipient.
>>>But buying something from a company, for example, does
>>>not imply that you have solicited ongoing email from
>>>them. If I order something from an online store, and
>>>they then send me a stream of spam, it's still spam.
>>>
>>>Companies sending spam often give you a way to
>>>"unsubscribe," or ask you to go to their site and
>>>change your "account preferences" if you want to stop
>>>getting spam. This is not enough to stop the mail from
>>>being spam. Not opting out is not the same as opting
>>>in. Unless the recipient explicitly checked a clearly
>>>labelled box (whose default was no) asking to receive
>>>the email, then it is spam.
>>>
>>>In some business relationships, you do implicitly
>>>solicit certain kinds of mail. When you order online,
>>>I think you implicitly solicit a receipt, and
>>>notification when the order ships. I don't mind when
>>>Verisign sends me mail warning that a domain name is
>>>about to expire (at least, if they are the actual
>>>registrar for it). But when Verisign sends me email
>>>offering a FREE Guide to Building My E-Commerce Web
>>>Site, that's spam.
>>>
>>>Notes:
>>>
>>>[1] The examples in this article are translated into
>>>Common Lisp for, believe it or not, greater
>>>accessibility. The application described here is one
>>>that we wrote in order to test a new Lisp dialect
>>>called Arc that is not yet released.
>>>
>>>[2] Currently the lowest rate seems to be about $200
>>>to send a million spams. That's very cheap, 1/50th of
>>>a cent per spam. But filtering out 95% of spam, for
>>>example, would increase the spammers' cost to reach a
>>>given audience by a factor of 20. Few can have margins
>>>big enough to absorb that.
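The numbers in note [2] check out. Here is the arithmetic in Python, using only the figures quoted in the note (variable names are mine):

```python
cost_per_million = 200.0  # dollars to send a million spams
cents_per_spam = cost_per_million / 1_000_000 * 100  # about 0.02 cents, i.e. 1/50 of a cent

filtered = 0.95  # fraction of spam the filters remove
# Only 1 in 20 messages gets through, so reaching the same audience
# means sending (and paying for) about 20 times as many.
cost_multiplier = 1 / (1 - filtered)
```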
>>>
>>>[3] As a rule of thumb, the more qualifiers there are
>>>before the name of a country, the more corrupt the
>>>rulers. A country called The Socialist People's
>>>Democratic Republic of X is probably the last place in
>>>the world you'd want to live.
>>>
>>>Thanks to Sarah Harlin for reading drafts of this;
>>>Daniel Giffin (who is also writing the production Arc
>>>interpreter) for several good ideas about filtering
>>>and for creating our mail infrastructure; Robert
>>>Morris, Trevor Blackwell and Erann Gat for many
>>>discussions about spam; Raph Levien for advice about
>>>trust metrics; and Chip Coldwell and Sam Steingold for
>>>advice about statistics.
>>>
>>> You'll find this essay and 14 others in Hackers &
>>>Painters.