Subject: Re: Hello, where is this
Author: ASDASD
Date Posted: 23:47:14 12/31/07 Mon
In reply to: ASDASD's message, "Re: Hello, where is this" on 23:46:42 12/31/07 Mon

August 2002

(This article describes the spam-filtering techniques used in the
spamproof web-based mail reader we built to exercise Arc. An improved
algorithm is described in Better Bayesian Filtering.)

I think it's possible to stop spam, and that content-based filters
are the way to do it. The Achilles heel of the spammers is their
message. They can circumvent any other barrier you set up. They have
so far, at least. But they have to deliver their message, whatever it
is. If we can write software that recognizes their messages, there is
no way they can get around that.

To the recipient, spam is easily recognizable. If you hired someone
to read your mail and discard the spam, they would have little
trouble doing it. How much do we have to do, short of AI, to automate
this process?

I think we will be able to solve the problem with fairly simple
algorithms. In fact, I've found that you can filter present-day spam
acceptably well using nothing more than a Bayesian combination of the
spam probabilities of individual words. Using a slightly tweaked (as
described below) Bayesian filter, we now miss less than 5 per 1000
spams, with 0 false positives.

The statistical approach is not usually the first one people try when
they write spam filters. Most hackers' first instinct is to try to
write software that recognizes individual properties of spam. You
look at spams and you think, the gall of these guys to try sending me
mail that begins "Dear Friend" or has a subject line that's all
uppercase and ends in eight exclamation points. I can filter out that
stuff with about one line of code.

And so you do, and in the beginning it works. A few simple rules will
take a big bite out of your incoming spam. Merely looking for the
word "click" will catch 79.7% of the emails in my spam corpus, with
only 1.2% false positives.

I spent about six months writing software that looked for individual
spam features before I tried the statistical approach. What I found
was that recognizing that last few percent of spams got very hard,
and that as I made the filters stricter I got more false positives.

False positives are innocent emails that get mistakenly identified as
spams. For most users, missing legitimate email is an order of
magnitude worse than receiving spam, so a filter that yields false
positives is like an acne cure that carries a risk of death to the
patient.

The more spam a user gets, the less likely he'll be to notice one
innocent mail sitting in his spam folder. And strangely enough, the
better your spam filters get, the more dangerous false positives
become, because when the filters are really good, users will be more
likely to ignore everything they catch.

I don't know why I avoided trying the statistical approach for so
long. I think it was because I got addicted to trying to identify
spam features myself, as if I were playing some kind of competitive
game with the spammers. (Nonhackers don't often realize this, but
most hackers are very competitive.) When I did try statistical
analysis, I found immediately that it was much cleverer than I had
been. It discovered, of course, that terms like "virtumundo" and
"teens" were good indicators of spam. But it also discovered that
"per" and "FL" and "ff0000" are good indicators of spam. In fact,
"ff0000" (html for bright red) turns out to be as good an indicator
of spam as any pornographic term.

_ _ _

Here's a sketch of how I do statistical filtering. I start with one
corpus of spam and one of nonspam mail. At the moment each one has
about 4000 messages in it. I scan the entire text, including headers
and embedded html and javascript, of each message in each corpus. I
currently consider alphanumeric characters, dashes, apostrophes, and
dollar signs to be part of tokens, and everything else to be a token
separator. (There is probably room for improvement here.) I ignore
tokens that are all digits, and I also ignore html comments, not even
considering them as token separators.

I count the number of times each token (ignoring case, currently)
occurs in each corpus. At this stage I end up with two large hash
tables, one for each corpus, mapping tokens to number of occurrences.
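
As a rough illustration of the tokenizing and counting just
described, here is a minimal Common Lisp sketch. The names tokenize
and count-tokens are placeholders of mine, it skips the html-comment
handling, and it is not the actual code used:

;; Token characters per the text above: alphanumerics, dashes,
;; apostrophes, and dollar signs; everything else separates tokens.
(defun token-char-p (c)
  (or (alphanumericp c) (member c '(#\- #\' #\$))))

(defun tokenize (text)
  "Split TEXT into downcased tokens, dropping all-digit tokens."
  (let ((tokens '()) (current '()))
    (flet ((flush ()
             (when current
               (let ((tok (string-downcase
                           (coerce (nreverse current) 'string))))
                 (unless (every #'digit-char-p tok)
                   (push tok tokens)))
               (setf current '()))))
      (loop for c across text
            do (if (token-char-p c)
                   (push c current)
                   (flush)))
      (flush))
    (nreverse tokens)))

(defun count-tokens (messages)
  "Map each token to its number of occurrences across MESSAGES."
  (let ((table (make-hash-table :test #'equal)))
    (dolist (msg messages table)
      (dolist (tok (tokenize msg))
        (incf (gethash tok table 0))))))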

Next I create a third hash table, this time mapping each token to the
probability that an email containing it is a spam, which I calculate
as follows [1]:

(let ((g (* 2 (or (gethash word good) 0)))
      (b (or (gethash word bad) 0)))
  (unless (< (+ g b) 5)
    (max .01
         (min .99 (float (/ (min 1 (/ b nbad))
                            (+ (min 1 (/ g ngood))
                               (min 1 (/ b nbad)))))))))

where word is the token whose probability we're calculating, good and
bad are the hash tables I created in the first step, and ngood and
nbad are the number of nonspam and spam messages respectively.

I explained this as code to show a couple of important details. I
want to bias the probabilities slightly to avoid false positives, and
by trial and error I've found that a good way to do it is to double
all the numbers in good. This helps to distinguish between words that
occasionally do occur in legitimate email and words that almost never
do. I only consider words that occur more than five times in total
(actually, because of the doubling, occurring three times in nonspam
mail would be enough). And then there is the question of what
probability to assign to words that occur in one corpus but not the
other. Again by trial and error I chose .01 and .99. There may be
room for tuning here, but as the corpus grows such tuning will happen
automatically anyway.
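
For experimenting with the expression above, it can be wrapped in a
function; word-spam-probability is a made-up name of mine and the
counts in the comment are purely hypothetical:

(defun word-spam-probability (word good bad ngood nbad)
  "The expression above as a function; returns NIL when the
(doubled-good) total count falls below 5."
  (let ((g (* 2 (or (gethash word good) 0)))
        (b (or (gethash word bad) 0)))
    (unless (< (+ g b) 5)
      (max .01
           (min .99 (float (/ (min 1 (/ b nbad))
                              (+ (min 1 (/ g ngood))
                                 (min 1 (/ b nbad))))))))))

;; Hypothetical counts: a token seen 3 times in 4000 good messages
;; and 20 times in 4000 spams gives g = 6, b = 20, so the result is
;; 0.005 / (0.0015 + 0.005), roughly .77.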

The especially observant will notice that while I consider each
corpus to be a single long stream of text for purposes of counting
occurrences, I use the number of emails in each, rather than their
combined length, as the divisor in calculating spam probabilities.
This adds another slight bias to protect against false positives.

When new mail arrives, it is scanned into tokens, and the most
interesting fifteen tokens, where interesting is measured by how far
their spam probability is from a neutral .5, are used to calculate
the probability that the mail is spam. If probs is a list of the
fifteen individual probabilities, you calculate the combined
probability thus:

(let ((prod (apply #'* probs)))
  (/ prod (+ prod (apply #'* (mapcar #'(lambda (x)
                                         (- 1 x))
                                     probs)))))

One question that arises in practice is what probability to assign to
a word you've never seen, i.e. one that doesn't occur in the hash
table of word probabilities. I've found, again by trial and error,
that .4 is a good number to use. If you've never seen a word before,
it is probably fairly innocent; spam words tend to be all too
familiar.
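
Here is a minimal sketch of that selection step, assuming a hash
table probs-table built as in the previous section. The name
interesting-probs, the deduplication of tokens, and the use of sort
are my own assumptions, not details from the article:

(defun interesting-probs (tokens probs-table &optional (n 15))
  "Probabilities of the N tokens farthest from a neutral .5;
unseen tokens count as .4."
  (let ((probs (mapcar (lambda (tok)
                         (or (gethash tok probs-table) .4))
                       (remove-duplicates tokens :test #'equal))))
    (subseq (sort probs #'> :key (lambda (p) (abs (- p .5))))
            0 (min n (length probs)))))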

There are examples of this algorithm being applied to actual emails
in an appendix at the end.

I treat mail as spam if the algorithm above gives it a probability of
more than .9 of being spam. But in practice it would not matter much
where I put this threshold, because few probabilities end up in the
middle of the range.
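
Putting the pieces together, a hedged end-to-end sketch might look
like this: combined-probability just names the combining expression
quoted earlier, spam-p applies the .9 threshold from the text, and it
leans on the tokenize and interesting-probs sketches above (all the
function names are my own):

(defun combined-probability (probs)
  "The combining expression quoted above, as a function."
  (let ((prod (apply #'* probs)))
    (/ prod (+ prod (apply #'* (mapcar #'(lambda (x) (- 1 x))
                                       probs))))))

(defun spam-p (message probs-table)
  "True when MESSAGE's combined probability exceeds the .9 threshold."
  (> (combined-probability
      (interesting-probs (tokenize message) probs-table))
     .9))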

_ _ _

One great advantage of the statistical approach is that you don't
have to read so many spams. Over the past six months, I've read
literally thousands of spams, and it is really kind of demoralizing.
Norbert Wiener said if you compete with slaves you become a slave,
and there is something similarly degrading about competing with
spammers. To recognize individual spam features you have to try to
get into the mind of the spammer, and frankly I want to spend as
little time inside the minds of spammers as possible.

But the real advantage of the Bayesian approach, of course, is that
you know what you're measuring. Feature-recognizing filters like
SpamAssassin assign a spam "score" to email. The Bayesian approach
assigns an actual probability. The problem with a "score" is that no
one knows what it means. The user doesn't know what it means, but
worse still, neither does the developer of the filter. How many
points should an email get for having the word "sex" in it? A
probability can of course be mistaken, but there is little ambiguity
about what it means, or how evidence should be combined to calculate
it. Based on my corpus, "sex" indicates a .97 probability of the
containing email being a spam, whereas "sexy" indicates .99
probability. And Bayes' Rule, equally unambiguous, says that an email
containing both words would, in the (unlikely) absence of any other
evidence, have a 99.97% chance of being a spam.
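
As a quick check, plugging .97 and .99 into the combining expression
given earlier reproduces that figure:

(let ((probs '(.97 .99)))
  (let ((prod (apply #'* probs)))
    (/ prod (+ prod (apply #'* (mapcar #'(lambda (x) (- 1 x))
                                       probs))))))
;; = (.97 * .99) / (.97 * .99 + .03 * .01)
;; = .9603 / .9606, which is about .9997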

Because it is measuring probabilities, the Bayesian approach
considers all the evidence in the email, both good and bad. Words
that occur disproportionately rarely in spam (like "though" or
"tonight" or "apparently") contribute as much to decreasing the
probability as bad words like "unsubscribe" and "opt-in" do to
increasing it. So an otherwise innocent email that happens to include
the word "sex" is not going to get tagged as spam.

Ideally, of course, the probabilities should be calculated
individually for each user. I get a lot of email containing the word
"Lisp", and (so far) no spam that does. So a word like that is
effectively a kind of password for sending mail to me. In my earlier
spam-filtering software, the user could set up a list of such words
and mail containing them would automatically get past the filters. On
my list I put words like "Lisp" and also my zipcode, so that
(otherwise rather spammy-sounding) receipts from online orders would
get through. I thought I was being very clever, but I found that the
Bayesian filter did the same thing for me, and moreover discovered a
lot of words I hadn't thought of.

When I said at the start that our filters let through less than 5
spams per 1000 with 0 false positives, I'm talking about filtering my
mail based on a corpus of my mail. But these numbers are not
misleading, because that is the approach I'm advocating: filter each
user's mail based on the spam and nonspam mail he receives.
Essentially, each user should have two delete buttons, ordinary
delete and delete-as-spam. Anything deleted as spam goes into the
spam corpus, and everything else goes into the nonspam corpus.

You could start users with a seed filter, but ultimately each user
should have his own per-word probabilities based on the actual mail
he receives. This (a) makes the filters more effective, (b) lets each
user decide their own precise definition of spam, and (c) perhaps
best of all makes it hard for spammers to tune mails to get through
the filters. If a lot of the brain of the filter is in the individual
databases, then merely tuning spams to get through the seed filters
won't guarantee anything about how well they'll get through
individual users' varying and much more trained filters.

Content-based spam filtering is often combined with a whitelist, a
list of senders whose mail can be accepted with no filtering. One
easy way to build such a whitelist is to keep a list of every address
the user has ever sent mail to. If a mail reader has a delete-as-spam
button then you could also add the from address of every email the
user has deleted as ordinary trash.
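
A minimal sketch of that bookkeeping; the names *whitelist*,
note-trusted-address, and whitelisted-p are inventions of mine, not
part of the software described:

;; Addresses the user has sent mail to, or whose mail the user has
;; deleted as ordinary (non-spam) trash, skip filtering entirely.
(defvar *whitelist* (make-hash-table :test #'equalp))

(defun note-trusted-address (address)
  (setf (gethash address *whitelist*) t))

(defun whitelisted-p (address)
  (gethash address *whitelist*))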

I'm an advocate of whitelists, but more as a way to save computation
than as a way to improve filtering. I used to think that whitelists
would make filtering easier, because you'd only have to filter email
from people you'd never heard from, and someone sending you mail for
the first time is constrained by convention in what they can say to
you. Someone you already know might send you an email talking about
sex, but someone sending you mail for the first time would not be
likely to. The problem is, people can have more than one email
address, so a new from-address doesn't guarantee that the sender is
writing to you for the first time. It is not unusual for an old
friend (especially if he is a hacker) to suddenly send you an email
with a new from-address, so you can't risk false positives by
filtering mail from unknown addresses especially stringently.

In a sense, though, my filters do themselves embody a kind of
whitelist (and blacklist) because they are based on entire messages,
including the headers. So to that extent they "know" the email
addresses of trusted senders and even the routes by which mail gets
from them to me. And they know the same about spam, including the
server names, mailer versions, and protocols.

_ _ _

If I thought that I could keep up current rates of spam filtering, I
would consider this problem solved. But it doesn't mean much to be
able to filter out most present-day spam, because spam evolves.
Indeed, most antispam techniques so far have been like pesticides
that do nothing more than create a new, resistant strain of bugs.

I'm more hopeful about Bayesian filters, because they evolve with the
spam. So as spammers start using "c0ck" instead of "cock" to evade
simple-minded spam filters based on individual words, Bayesian
filters automatically notice. Indeed, "c0ck" is far more damning
evidence than "cock", and Bayesian filters know precisely how much
more.

Still, anyone who proposes a plan for spam filtering has to be able
to answer the question: if the spammers knew exactly what you were
doing, how well could they get past you? For example, I think that if
checksum-based spam filtering becomes a serious obstacle, the
spammers will just switch to mad-lib techniques for generating
message bodies.

To beat Bayesian filters, it would not be enough for spammers to make
their emails unique or to stop using individual naughty words. They'd
have to make their mails indistinguishable from your ordinary mail.
And this I think would severely constrain them. Spam is mostly sales
pitches, so unless your regular mail is all sales pitches, spams will
inevitably have a different character. And the spammers would also,
of course, have to change (and keep changing) their whole
infrastructure, because otherwise the headers would look as bad to
the Bayesian filters as ever, no matter what they did to the message
body. I don't know enough about the infrastructure that spammers use
to know how hard it would be to make the headers look innocent, but
my guess is that it would be even harder than making the message look
innocent.

Assuming they could solve the problem of the headers, the spam of the
future will probably look something like this:

Hey there. Thought you should check out the following:

http://www.27meg.com/foo

because that is about as much sales pitch as content-based filtering
will leave the spammer room to make. (Indeed, it will be hard even to
get this past filters, because if everything else in the email is
neutral, the spam probability will hinge on the url, and it will take
some effort to make that look neutral.)

Spammers range from businesses running so-called opt-in lists who
don't even try to conceal their identities, to guys who hijack mail
servers to send out spams promoting porn sites. If we use filtering
to whittle their options down to mails like the one above, that
should pretty much put the spammers on the "legitimate" end of the
spectrum out of business; they feel obliged by various state laws to
include boilerplate about why their spam is not spam, and how to
cancel your "subscription," and that kind of text is easy to
recognize.

(I used to think it was naive to believe that stricter laws would
decrease spam. Now I think that while stricter laws may not decrease
the amount of spam that spammers send, they can certainly help
filters to decrease the amount of spam that recipients actually see.)

All along the spectrum, if you restrict the sales pitches spammers
can make, you will inevitably tend to put them out of business. That
word business is an important one to remember. The spammers are
businessmen. They send spam because it works. It works because
although the response rate is abominably low (at best 15 per million,
vs 3000 per million for a catalog mailing), the cost, to them, is
practically nothing. The cost is enormous for the recipients, about 5
man-weeks for each million recipients who spend a second to delete
the spam, but the spammer doesn't have to pay that.

Sending spam does cost the spammer something, though. [2] So the
lower we can get the response rate-- whether by filtering, or by
using filters to force spammers to dilute their pitches-- the fewer
businesses will find it worth their while to send spam.

The reason the spammers use the kinds of sales pitches that they do
is to increase response rates. This is possibly even more disgusting
than getting inside the mind of a spammer, but let's take a quick
look inside the mind of someone who responds to a spam. This person
is either astonishingly credulous or deeply in denial about their
sexual interests. In either case, repulsive or idiotic as the spam
seems to us, it is exciting to them. The spammers wouldn't say these
things if they didn't sound exciting. And "thought you should check
out the following" is just not going to have nearly the pull with the
spam recipient as the kinds of things that spammers say now. Result:
if it can't contain exciting sales pitches, spam becomes less
effective as a marketing vehicle, and fewer businesses want to use
it.

That is the big win in the end. I started writing spam filtering
software because I didn't want to have to look at the stuff anymore.
But if we get good enough at filtering out spam, it will stop
working, and the spammers will actually stop sending it.

_ _ _

Of all the approaches to fighting spam, from software to laws, I
believe Bayesian filtering will be the single most effective. But I
also think that the more different kinds of antispam efforts we
undertake, the better, because any measure that constrains spammers
will tend to make filtering easier. And even within the world of
content-based filtering, I think it will be a good thing if there are
many different kinds of software being used simultaneously. The more
different filters there are, the harder it will be for spammers to
tune spams to get through them.

Appendix: Examples of Filtering

Here is an example of a spam that arrived while I was writing this
article. The fifteen most interesting words in this spam are:

qvp0045
indira
mx-05
intimail
$7500
freeyankeedom
cdo
bluefoxmedia
jpg
unsecured
platinum
3d0
qves
7c5
7c266675

The words are a mix of stuff from the headers and from the message
body, which is typical of spam. Also typical of spam is that every
one of these words has a spam probability, in my database, of .99. In
fact there are more than fifteen words with probabilities of .99, and
these are just the first fifteen seen.

Unfortunately that makes this email a boring example of the use of
Bayes' Rule. To see an interesting variety of probabilities we have
to look at this actually quite atypical spam.

The fifteen most interesting words in this spam, with their
probabilities, are:
madam 0.99
promotion 0.99
republic 0.99
shortest 0.047225013
mandatory 0.047225013
standardization 0.07347802
sorry 0.08221981
supported 0.09019077
people's 0.09019077
enter 0.9075001
quality 0.8921298
organization 0.12454646
investment 0.8568143
very 0.14758544
valuable 0.82347786

This time the evidence is a mix of good and bad. A word like
"shortest" is almost as much evidence for innocence as a word like
"madam" or "promotion" is for guilt. But still the case for guilt is
stronger. If you combine these numbers according to Bayes' Rule, the
resulting probability is .9027.

"Madam" is obviously from spams beginning "Dear Sir or Madam."
They're not very common, but the word "madam" never occurs in my
legitimate email, and it's all about the ratio.

"Republic" scores high because it often shows up in Nigerian scam
emails, and also occurs once or twice in spams referring to Korea and
South Africa. You might say that it's an accident that it thus helps
identify this spam. But I've found when examining spam probabilities
that there are a lot of these accidents, and they have an uncanny
tendency to push things in the right direction rather than the wrong
one. In this case, it is not entirely a coincidence that the word
"Republic" occurs in Nigerian scam emails and this spam. There is a
whole class of dubious business propositions involving less developed
countries, and these in turn are more likely to have names that
specify explicitly (because they aren't) that they are republics.[3]

On the other hand, "enter" is a genuine miss. It occurs mostly in
unsubscribe instructions, but here it is used in a completely
innocent way. Fortunately the statistical approach is fairly robust,
and can tolerate quite a lot of misses before the results start to be
thrown off.

For comparison, here is an example of that rare bird, a spam that
gets through the filters. Why? Because by sheer chance it happens to
be loaded with words that occur in my actual email:

perl 0.01
python 0.01
tcl 0.01
scripting 0.01
morris 0.01
graham 0.01491078
guarantee 0.9762507
cgi 0.9734398
paul 0.027040077
quite 0.030676773
pop3 0.042199217
various 0.06080265
prices 0.9359873
managed 0.06451222
difficult 0.071706355

There are a couple pieces of good news here. First, this mail
probably wouldn't get through the filters of someone who didn't
happen to specialize in programming languages and have a good friend
called Morris. For the average user, all the top five words here
would be neutral and would not contribute to the spam probability.

Second, I think filtering based on word pairs (see below) might well
catch this one: "cost effective", "setup fee", "money back" -- pretty
incriminating stuff. And of course if they continued to spam me (or a
network I was part of), "Hostex" itself would be recognized as a spam
term.

Finally, here is an innocent email. Its fifteen most interesting
words are as follows:

continuation 0.01
describe 0.01
continuations 0.01
example 0.033600237
programming 0.05214485
i'm 0.055427782
examples 0.07972858
color 0.9189189
localhost 0.09883721
hi 0.116539136
california 0.84421706
same 0.15981844
spot 0.1654587
us-ascii 0.16804294
what 0.19212411

Most of the words here indicate the mail is an innocent one. There
are two bad smelling words, "color" (spammers love colored fonts) and
"California" (which occurs in testimonials and also in menus in
forms), but they are not enough to outweigh obviously innocent words
like "continuation" and "example".

It's interesting that "describe" rates as so thoroughly innocent. It
hasn't occurred in a single one of my 4000 spams. The data turns out
to be full of such surprises. One of the things you learn when you
analyze spam texts is how narrow a subset of the language spammers
operate in. It's that fact, together with the equally characteristic
vocabulary of any individual user's mail, that makes Bayesian
filtering a good bet.

Appendix: More Ideas

One idea that I haven't tried yet is to filter based on word pairs,
or even triples, rather than individual words. This should yield a
much sharper estimate of the probability. For example, in my current
database, the word "offers" has a probability of .96. If you based
the probabilities on word pairs, you'd end up with "special offers"
and "valuable offers" having probabilities of .99 and, say, "approach
offers" (as in "this approach offers") having a probability of .1 or
less.
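
A small sketch of generating such pairs from a token list; the name
token-pairs is made up, building on the tokenize sketch earlier:

(defun token-pairs (tokens)
  "Adjacent pairs of TOKENS, so two-word phrases become tokens in
their own right."
  (loop for (a b) on tokens
        while b
        collect (concatenate 'string a " " b)))

;; (token-pairs (tokenize "this approach offers special offers"))
;; => ("this approach" "approach offers" "offers special"
;;     "special offers")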

The reason I haven't done this is that filtering based on individual
words already works so well. But it does mean that there is room to
tighten the filters if spam gets harder to detect. (Curiously, a
filter based on word pairs would be in effect a Markov-chaining text
generator running in reverse.)

Specific spam features (e.g. not seeing the recipient's address in
the to: field) do of course have value in recognizing spam. They can
be considered in this algorithm by treating them as virtual words.
I'll probably do this in future versions, at least for a handful of
the most egregious spam indicators. Feature-recognizing spam filters
are right in many details; what they lack is an overall discipline
for combining evidence.
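
As a sketch of the virtual-word idea, synthetic tokens could simply
be added to a message's token list before scoring; the particular
feature test and token spelling below are illustrative guesses, not
features from the article:

(defun add-virtual-words (tokens recipient to-field)
  "Add a synthetic token when the recipient's address is missing from
the To: field, so the feature gets a learned probability like a word."
  (if (search recipient to-field)
      tokens
      (cons "*recipient-not-in-to-field*" tokens)))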

Recognizing nonspam features may be more important than recognizing
spam features. False positives are such a worry that they demand
extraordinary measures. I will probably in future versions add a
second level of testing designed specifically to avoid false
positives. If a mail triggers this second level of filters it will be
accepted even if its spam probability is above the threshold.

I don't expect this second level of filtering to be Bayesian. It will
inevitably be not only ad hoc, but based on guesses, because the
number of false positives will not tend to be large enough to notice
patterns. (It is just as well, anyway, if a backup system doesn't
rely on the same technology as the primary system.)

Another thing I may try in the future is to focus extra attention on
specific parts of the email. For example, about 95% of current spam
includes the url of a site they want you to visit. (The remaining 5%
want you to call a phone number, reply by email or to a US mail
address, or in a few cases to buy a certain stock.) The url is in
such cases practically enough by itself to determine whether the
email is spam.

Domain names differ from the rest of the text in a (non-German) email
in that they often consist of several words stuck together. Though
computationally expensive in the general case, it might be worth
trying to decompose them. If a filter has never seen the token
"xxxporn" before it will have an individual spam probability of .4,
whereas "xxx" and "porn" individually have probabilities (in my
corpus) of .9889 and .99 respectively, and a combined probability of
.9998.
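
One rough way to attempt that decomposition is to greedily match the
longest known sub-tokens from the probability table; decompose-token
and the greedy strategy are my own assumptions about how it might be
done:

(defun decompose-token (token probs-table)
  "Greedily split a run-together token into the longest known
sub-tokens found in PROBS-TABLE; return NIL if any part is unknown."
  (let ((parts '())
        (start 0)
        (len (length token)))
    (loop while (< start len)
          do (let ((match nil))
               ;; try the longest known piece starting at START first
               (loop for end from len downto (1+ start)
                     for piece = (subseq token start end)
                     when (gethash piece probs-table)
                       do (setf match piece) (return))
               (if match
                   (progn (push match parts)
                          (incf start (length match)))
                   (return-from decompose-token nil))))
    (nreverse parts)))

;; With "xxx" and "porn" in the table, (decompose-token "xxxporn" table)
;; => ("xxx" "porn"); an unknown remainder returns NIL and the token
;; keeps its default .4 probability.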

I expect decomposing domain names to become more important as
spammers are gradually forced to stop using incriminating words in
the text of their messages. (A url with an ip address is of course an
extremely incriminating sign, except in the mail of a few sysadmins.)

It might be a good idea to have a cooperatively maintained list of
urls promoted by spammers. We'd need a trust metric of the type
studied by Raph Levien to prevent malicious or incompetent
submissions, but if we had such a thing it would provide a boost to
any filtering software. It would also be a convenient basis for
boycotts.

Another way to test dubious urls would be to send out a crawler to
look at the site before the user looked at the email mentioning it.
You could use a Bayesian filter to rate the site just as you would an
email, and whatever was found on the site could be included in
calculating the probability of the email being a spam. A url that led
to a redirect would of course be especially suspicious.

One cooperative project that I think really would be a good idea
would be to accumulate a giant corpus of spam. A large, clean corpus
is the key to making Bayesian filtering work well. Bayesian filters
could actually use the corpus as input. But such a corpus would be
useful for other kinds of filters too, because it could be used to
test them.

Creating such a corpus poses some technical problems. We'd need trust
metrics to prevent malicious or incompetent submissions, of course.
We'd also need ways of erasing personal information (not just
to-addresses and ccs, but also e.g. the arguments to unsubscribe
urls, which often encode the to-address) from mails in the corpus. If
anyone wants to take on this project, it would be a good thing for
the world.

Appendix: Defining Spam

I think there is a rough consensus on what spam is, but it would be
useful to have an explicit definition. We'll need to do this if we
want to establish a central corpus of spam, or even to compare spam
filtering rates meaningfully.

To start with, spam is not unsolicited commercial email. If someone
in my neighborhood heard that I was looking for an old Raleigh
three-speed in good condition, and sent me an email offering to sell
me one, I'd be delighted, and yet this email would be both commercial
and unsolicited. The defining feature of spam (in fact, its raison
d'etre) is not that it is unsolicited, but that it is automated.

It is merely incidental, too, that spam is usually commercial. If
someone started sending mass email to support some political cause,
for example, it would be just as much spam as email promoting a porn
site.

I propose we define spam as unsolicited automated email. This
definition thus includes some email that many legal definitions of
spam don't. Legal definitions of spam, influenced presumably by
lobbyists, tend to exclude mail sent by companies that have an
"existing relationship" with the recipient. But buying something from
a company, for example, does not imply that you have solicited
ongoing email from them. If I order something from an online store,
and they then send me a stream of spam, it's still spam.

Companies sending spam often give you a way to "unsubscribe," or ask
you to go to their site and change your "account preferences" if you
want to stop getting spam. This is not enough to stop the mail from
being spam. Not opting out is not the same as opting in. Unless the
recipient explicitly checked a clearly labelled box (whose default
was no) asking to receive the email, then it is spam.

In some business relationships, you do implicitly solicit certain
kinds of mail. When you order online, I think you implicitly solicit
a receipt, and notification when the order ships. I don't mind when
Verisign sends me mail warning that a domain name is about to expire
(at least, if they are the actual registrar for it). But when
Verisign sends me email offering a FREE Guide to Building My
E-Commerce Web Site, that's spam.

Notes:

[1] The examples in this article are translated into Common Lisp for,
believe it or not, greater accessibility. The application described
here is one that we wrote in order to test a new Lisp dialect called
Arc that is not yet released.

[2] Currently the lowest rate seems to be about $200 to send a
million spams. That's very cheap, 1/50th of a cent per spam. But
filtering out 95% of spam, for example, would increase the spammers'
cost to reach a given audience by a factor of 20. Few can have
margins big enough to absorb that.

[3] As a rule of thumb, the more qualifiers there are before the name
of a country, the more corrupt the rulers. A country called The
Socialist People's Democratic Republic of X is probably the last
place in the world you'd want to live.

Thanks to Sarah Harlin for reading drafts of this; Daniel Giffin (who
is also writing the production Arc interpreter) for several good
ideas about filtering and for creating our mail infrastructure;
Robert Morris, Trevor Blackwell and Erann Gat for many discussions
about spam; Raph Levien for advice about trust metrics; and Chip
Coldwell and Sam Steingold for advice about statistics.

You'll find this essay and 14 others in Hackers & Painters.