Subject: Re: Now, what is this about?
Author: ghjghj
Date Posted: 05:28:57 01/01/08 Tue
In reply to: fgjfjjkj's message, "Re: Now, what is this about?" on 05:28:15 01/01/08 Tue

August 2002

(This article describes the spam-filtering techniques used in the spamproof web-based mail reader we built to exercise Arc. An improved algorithm is described in Better Bayesian Filtering.)

I think it's possible to stop spam, and that content-based filters are the way to do it. The Achilles heel of the spammers is their message. They can circumvent any other barrier you set up. They have so far, at least. But they have to deliver their message, whatever it is. If we can write software that recognizes their messages, there is no way they can get around that.

To the recipient, spam is easily recognizable. If you hired someone to read your mail and discard the spam, they would have little trouble doing it. How much do we have to do, short of AI, to automate this process?

I think we will be able to solve the problem with fairly simple algorithms. In fact, I've found that you can filter present-day spam acceptably well using nothing more than a Bayesian combination of the spam probabilities of individual words. Using a slightly tweaked (as described below) Bayesian filter, we now miss less than 5 per 1000 spams, with 0 false positives.

The statistical approach is not usually the first one people try when they write spam filters. Most hackers' first instinct is to try to write software that recognizes individual properties of spam. You look at spams and you think, the gall of these guys to try sending me mail that begins "Dear Friend" or has a subject line that's all uppercase and ends in eight exclamation points. I can filter out that stuff with about one line of code.

And so you do, and in the beginning it works. A few simple rules will take a big bite out of your incoming spam. Merely looking for the word "click" will catch 79.7% of the emails in my spam corpus, with only 1.2% false positives.

I spent about six months writing software that looked for individual spam features before I tried the statistical approach. What I found was that recognizing the last few percent of spams got very hard, and that as I made the filters stricter I got more false positives.

False positives are innocent emails that get mistakenly identified as spams. For most users, missing legitimate email is an order of magnitude worse than receiving spam, so a filter that yields false positives is like an acne cure that carries a risk of death to the patient.

The more spam a user gets, the less likely he'll be to notice one innocent mail sitting in his spam folder. And strangely enough, the better your spam filters get, the more dangerous false positives become, because when the filters are really good, users will be more likely to ignore everything they catch.

I don't know why I avoided trying the statistical approach for so long. I think it was because I got addicted to trying to identify spam features myself, as if I were playing some kind of competitive game with the spammers. (Nonhackers don't often realize this, but most hackers are very competitive.) When I did try statistical analysis, I found immediately that it was much cleverer than I had been. It discovered, of course, that terms like "virtumundo" and "teens" were good indicators of spam. But it also discovered that "per" and "FL" and "ff0000" are good indicators of spam. In fact, "ff0000" (html for bright red) turns out to be as good an indicator of spam as any pornographic term.

_ _ _

Here's a sketch of how I do statistical filtering. I start with one corpus of spam and one of nonspam mail. At the moment each one has about 4000 messages in it. I scan the entire text, including headers and embedded html and javascript, of each message in each corpus. I currently consider alphanumeric characters, dashes, apostrophes, and dollar signs to be part of tokens, and everything else to be a token separator. (There is probably room for improvement here.) I ignore tokens that are all digits, and I also ignore html comments, not even considering them as token separators.
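
To make those tokenization rules concrete, here is a minimal sketch in Common Lisp, the language of the essay's other examples. The function names are mine, not from the actual Arc application:

;; Alphanumerics, dashes, apostrophes, and dollar signs count as token
;; characters; everything else separates tokens. HTML comments are cut
;; out entirely (so they don't even act as separators), and all-digit
;; tokens are dropped at the end.
(defun token-char-p (c)
  (or (alphanumericp c) (member c '(#\- #\' #\$))))

(defun strip-html-comments (text)
  (let ((start (search "<!--" text)))
    (if (null start)
        text
        (let ((end (search "-->" text :start2 (+ start 4))))
          (if (null end)
              (subseq text 0 start)
              (strip-html-comments
               (concatenate 'string
                            (subseq text 0 start)
                            (subseq text (+ end 3)))))))))

(defun tokenize (text)
  (loop with clean = (strip-html-comments text)
        with tokens = '()
        with current = '()
        for c across clean
        do (if (token-char-p c)
               (push c current)
               (when current
                 (push (coerce (nreverse current) 'string) tokens)
                 (setf current nil)))
        finally
           (when current
             (push (coerce (nreverse current) 'string) tokens))
           (return (remove-if (lambda (tok) (every #'digit-char-p tok))
                              (nreverse tokens)))))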

I count the number of times each token (ignoring case, currently) occurs in each corpus. At this stage I end up with two large hash tables, one for each corpus, mapping tokens to number of occurrences.
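
With that tokenizer, the counting pass is short. A sketch, again with names of my own choosing:

;; One hash table per corpus, mapping each downcased token to the number
;; of times it occurs anywhere in that corpus.
(defun count-tokens (messages)
  (let ((counts (make-hash-table :test #'equal)))
    (dolist (msg messages counts)
      (dolist (tok (tokenize msg))
        (incf (gethash (string-downcase tok) counts 0))))))

;; e.g. (defparameter *good* (count-tokens nonspam-messages))
;;      (defparameter *bad*  (count-tokens spam-messages))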

Next I create a third hash table, this time mapping each token to the probability that an email containing it is a spam, which I calculate as follows [1]:

(let ((g (* 2 (or (gethash word good) 0)))
      (b (or (gethash word bad) 0)))
  (unless (< (+ g b) 5)
    (max .01
         (min .99 (float (/ (min 1 (/ b nbad))
                            (+ (min 1 (/ g ngood))
                               (min 1 (/ b nbad)))))))))

where word is the token whose probability we're calculating, good and bad are the hash tables I created in the first step, and ngood and nbad are the number of nonspam and spam messages respectively.

I explained this as code to show a couple of important details. I want to bias the probabilities slightly to avoid false positives, and by trial and error I've found that a good way to do it is to double all the numbers in good. This helps to distinguish between words that occasionally do occur in legitimate email and words that almost never do. I only consider words that occur more than five times in total (actually, because of the doubling, occurring three times in nonspam mail would be enough). And then there is the question of what probability to assign to words that occur in one corpus but not the other. Again by trial and error I chose .01 and .99. There may be room for tuning here, but as the corpus grows such tuning will happen automatically anyway.
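
Looping that formula over every token seen in either corpus yields the third table. This is a sketch built on the count tables above, not the essay's actual code:

;; The per-word formula, as in the essay; returns NIL for words with
;; fewer than five (weighted) occurrences.
(defun word-probability (word good bad ngood nbad)
  (let ((g (* 2 (or (gethash word good) 0)))
        (b (or (gethash word bad) 0)))
    (unless (< (+ g b) 5)
      (max .01
           (min .99 (float (/ (min 1 (/ b nbad))
                              (+ (min 1 (/ g ngood))
                                 (min 1 (/ b nbad))))))))))

;; Third table: token -> spam probability, over both corpora.
(defun build-probs (good bad ngood nbad)
  (let ((probs (make-hash-table :test #'equal)))
    (flet ((note (word)
             (let ((p (word-probability word good bad ngood nbad)))
               (when p (setf (gethash word probs) p)))))
      (maphash (lambda (w c) (declare (ignore c)) (note w)) good)
      (maphash (lambda (w c) (declare (ignore c)) (note w)) bad))
    probs))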

The especially observant will notice that while I consider each corpus to be a single long stream of text for purposes of counting occurrences, I use the number of emails in each, rather than their combined length, as the divisor in calculating spam probabilities. This adds another slight bias to protect against false positives.

When new mail arrives, it is scanned into tokens, and the most interesting fifteen tokens, where interesting is measured by how far their spam probability is from a neutral .5, are used to calculate the probability that the mail is spam. If probs is a list of the fifteen individual probabilities, you calculate the combined probability thus:

(let ((prod (apply #'* probs)))
  (/ prod (+ prod (apply #'* (mapcar #'(lambda (x)
                                         (- 1 x))
                                     probs)))))
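
Wrapped as a function (my wrapper, same arithmetic):

(defun combined-probability (probs)
  (let ((prod (apply #'* probs)))
    (/ prod (+ prod (apply #'* (mapcar (lambda (x) (- 1 x)) probs))))))

;; e.g. (combined-probability '(.99 .99 .01 .9 .2))  =>  ~0.9955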

One question that arises in practice is what probability to assign to a word you've never seen, i.e. one that doesn't occur in the hash table of word probabilities. I've found, again by trial and error, that .4 is a good number to use. If you've never seen a word before, it is probably fairly innocent; spam words tend to be all too familiar.

There are examples of this algorithm being applied to actual emails in an appendix at the end.

I treat mail as spam if the algorithm above gives it a probability of more than .9 of being spam. But in practice it would not matter much where I put this threshold, because few probabilities end up in the middle of the range.
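
Tying the pieces together, scoring a new message might look like the sketch below. The names are mine; *probs* is the table from build-probs, unseen words get the .4 above, and anything over .9 counts as spam:

(defvar *probs* (make-hash-table :test #'equal))

(defun interestingness (p)
  (abs (- p .5)))

(defun spam-probability (message)
  (let* ((probs (mapcar (lambda (tok)
                          (or (gethash (string-downcase tok) *probs*) .4))
                        (tokenize message)))
         (top (subseq (sort (copy-list probs) #'> :key #'interestingness)
                      0 (min 15 (length probs)))))
    (combined-probability top)))

(defun spam-p (message)
  (> (spam-probability message) .9))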

_ _ _

One great advantage of the statistical approach is that you don't have to read so many spams. Over the past six months, I've read literally thousands of spams, and it is really kind of demoralizing. Norbert Wiener said if you compete with slaves you become a slave, and there is something similarly degrading about competing with spammers. To recognize individual spam features you have to try to get into the mind of the spammer, and frankly I want to spend as little time inside the minds of spammers as possible.

But the real advantage of the Bayesian approach, of course, is that you know what you're measuring. Feature-recognizing filters like SpamAssassin assign a spam "score" to email. The Bayesian approach assigns an actual probability. The problem with a "score" is that no one knows what it means. The user doesn't know what it means, but worse still, neither does the developer of the filter. How many points should an email get for having the word "sex" in it? A probability can of course be mistaken, but there is little ambiguity about what it means, or how evidence should be combined to calculate it. Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam, whereas "sexy" indicates .99 probability. And Bayes' Rule, equally unambiguous, says that an email containing both words would, in the (unlikely) absence of any other evidence, have a 99.97% chance of being a spam.
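
That 99.97% figure falls straight out of the same combining rule:

(/ (* .97 .99)
   (+ (* .97 .99) (* (- 1 .97) (- 1 .99))))
;; => 0.99969, i.e. the 99.97% chance quoted above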

Because it is measuring probabilities, the Bayesian approach considers all the evidence in the email, both good and bad. Words that occur disproportionately rarely in spam (like "though" or "tonight" or "apparently") contribute as much to decreasing the probability as bad words like "unsubscribe" and "opt-in" do to increasing it. So an otherwise innocent email that happens to include the word "sex" is not going to get tagged as spam.

Ideally, of course, the probabilities should be calculated individually for each user. I get a lot of email containing the word "Lisp", and (so far) no spam that does. So a word like that is effectively a kind of password for sending mail to me. In my earlier spam-filtering software, the user could set up a list of such words and mail containing them would automatically get past the filters. On my list I put words like "Lisp" and also my zipcode, so that (otherwise rather spammy-sounding) receipts from online orders would get through. I thought I was being very clever, but I found that the Bayesian filter did the same thing for me, and moreover discovered a lot of words I hadn't thought of.

When I said at the start that our filters let through less than 5 spams per 1000 with 0 false positives, I'm talking about filtering my mail based on a corpus of my mail. But these numbers are not misleading, because that is the approach I'm advocating: filter each user's mail based on the spam and nonspam mail he receives. Essentially, each user should have two delete buttons, ordinary delete and delete-as-spam. Anything deleted as spam goes into the spam corpus, and everything else goes into the nonspam corpus.
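
Mechanically, the two delete buttons are trivial; a sketch with hypothetical names:

(defvar *spam-corpus* '())
(defvar *nonspam-corpus* '())

(defun delete-as-spam (message)
  (push message *spam-corpus*))

(defun delete-as-trash (message)
  (push message *nonspam-corpus*))

;; The probability table can then be rebuilt from the two corpora
;; periodically (say, nightly) with count-tokens and build-probs.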

You could start users with a seed filter, but ultimately each user should have his own per-word probabilities based on the actual mail he receives. This (a) makes the filters more effective, (b) lets each user decide their own precise definition of spam, and (c) perhaps best of all makes it hard for spammers to tune mails to get through the filters. If a lot of the brain of the filter is in the individual databases, then merely tuning spams to get through the seed filters won't guarantee anything about how well they'll get through individual users' varying and much more trained filters.

Content-based spam filtering is often combined with a whitelist, a list of senders whose mail can be accepted with no filtering. One easy way to build such a whitelist is to keep a list of every address the user has ever sent mail to. If a mail reader has a delete-as-spam button then you could also add the from address of every email the user has deleted as ordinary trash.
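
A sketch of such a whitelist, with hypothetical names (an equalp hash table makes the address lookup case-insensitive):

(defvar *whitelist* (make-hash-table :test #'equalp))

(defun note-correspondent (address)
  ;; Call this for every outgoing mail, and for the from address of
  ;; every mail deleted as ordinary trash.
  (setf (gethash address *whitelist*) t))

(defun classify (message from-address)
  (cond ((gethash from-address *whitelist*) :nonspam)
        ((spam-p message) :spam)
        (t :nonspam)))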

I'm an advocate of whitelists, but more as a way to save computation than as a way to improve filtering. I used to think that whitelists would make filtering easier, because you'd only have to filter email from people you'd never heard from, and someone sending you mail for the first time is constrained by convention in what they can say to you. Someone you already know might send you an email talking about sex, but someone sending you mail for the first time would not be likely to. The problem is, people can have more than one email address, so a new from-address doesn't guarantee that the sender is writing to you for the first time. It is not unusual for an old friend (especially if he is a hacker) to suddenly send you an email with a new from-address, so you can't risk false positives by filtering mail from unknown addresses especially stringently.

In a sense, though, my filters do themselves embody a kind of whitelist (and blacklist) because they are based on entire messages, including the headers. So to that extent they "know" the email addresses of trusted senders and even the routes by which mail gets from them to me. And they know the same about spam, including the server names, mailer versions, and protocols.

_ _ _

If I thought that I could keep up current rates of spam filtering, I would consider this problem solved. But it doesn't mean much to be able to filter out most present-day spam, because spam evolves. Indeed, most antispam techniques so far have been like pesticides that do nothing more than create a new, resistant strain of bugs.

I'm more hopeful about Bayesian filters, because they evolve with the spam. So as spammers start using "c0ck" instead of "cock" to evade simple-minded spam filters based on individual words, Bayesian filters automatically notice. Indeed, "c0ck" is far more damning evidence than "cock", and Bayesian filters know precisely how much more.

Still, anyone who proposes a plan for spam filtering has to be able to answer the question: if the spammers knew exactly what you were doing, how well could they get past you? For example, I think that if checksum-based spam filtering becomes a serious obstacle, the spammers will just switch to mad-lib techniques for generating message bodies.

To beat Bayesian filters, it would not be enough for spammers to make their emails unique or to stop using individual naughty words. They'd have to make their mails indistinguishable from your ordinary mail. And this I think would severely constrain them. Spam is mostly sales pitches, so unless your regular mail is all sales pitches, spams will inevitably have a different character. And the spammers would also, of course, have to change (and keep changing) their whole infrastructure, because otherwise the headers would look as bad to the Bayesian filters as ever, no matter what they did to the message body. I don't know enough about the infrastructure that spammers use to know how hard it would be to make the headers look innocent, but my guess is that it would be even harder than making the message look innocent.

Assuming they could solve the problem of the headers, the spam of the future will probably look something like this:

Hey there. Thought you should check out the following:
http://www.27meg.com/foo

because that is about as much sales pitch as content-based filtering will leave the spammer room to make. (Indeed, it will be hard even to get this past filters, because if everything else in the email is neutral, the spam probability will hinge on the url, and it will take some effort to make that look neutral.)

Spammers range from businesses running so-called opt-in lists who don't even try to conceal their identities, to guys who hijack mail servers to send out spams promoting porn sites. If we use filtering to whittle their options down to mails like the one above, that should pretty much put the spammers on the "legitimate" end of the spectrum out of business; they feel obliged by various state laws to include boilerplate about why their spam is not spam, and how to cancel your "subscription," and that kind of text is easy to recognize.

(I used to think it was naive to believe that stricter laws would decrease spam. Now I think that while stricter laws may not decrease the amount of spam that spammers send, they can certainly help filters to decrease the amount of spam that recipients actually see.)

All along the spectrum, if you restrict the sales pitches spammers can make, you will inevitably tend to put them out of business. That word business is an important one to remember. The spammers are businessmen. They send spam because it works. It works because although the response rate is abominably low (at best 15 per million, vs 3000 per million for a catalog mailing), the cost, to them, is practically nothing. The cost is enormous for the recipients, about 5 man-weeks for each million recipients who spend a second to delete the spam, but the spammer doesn't have to pay that.

Sending spam does cost the spammer something, though. [2] So the lower we can get the response rate -- whether by filtering, or by using filters to force spammers to dilute their pitches -- the fewer businesses will find it worth their while to send spam.

The reason the spammers use the kinds of sales pitches that they do is to increase response rates. This is possibly even more disgusting than getting inside the mind of a spammer, but let's take a quick look inside the mind of someone who responds to a spam. This person is either astonishingly credulous or deeply in denial about their sexual interests. In either case, repulsive or idiotic as the spam seems to us, it is exciting to them. The spammers wouldn't say these things if they didn't sound exciting. And "thought you should check out the following" is just not going to have nearly the pull with the spam recipient as the kinds of things that spammers say now. Result: if it can't contain exciting sales pitches, spam becomes less effective as a marketing vehicle, and fewer businesses want to use it.

That is the big win in the end. I started writing spam filtering software because I didn't want to have to look at the stuff anymore. But if we get good enough at filtering out spam, it will stop working, and the spammers will actually stop sending it.

_ _ _

Of all the approaches to fighting spam, from software to laws, I believe Bayesian filtering will be the single most effective. But I also think that the more different kinds of antispam efforts we undertake, the better, because any measure that constrains spammers will tend to make filtering easier. And even within the world of content-based filtering, I think it will be a good thing if there are many different kinds of software being used simultaneously. The more different filters there are, the harder it will be for spammers to tune spams to get through them.

Appendix: Examples of Filtering

Here is an example of a spam that arrived while I was writing this article. The fifteen most interesting words in this spam are:

qvp0045
indira
mx-05
intimail
$7500
freeyankeedom
cdo
bluefoxmedia
jpg
unsecured
platinum
3d0
qves
7c5
7c266675

The words are a mix of stuff from the headers and from the message body, which is typical of spam. Also typical of spam is that every one of these words has a spam probability, in my database, of .99. In fact there are more than fifteen words with probabilities of .99, and these are just the first fifteen seen.

Unfortunately that makes this email a boring example of the use of Bayes' Rule. To see an interesting variety of probabilities we have to look at this actually quite atypical spam.

The fifteen most interesting words in this spam, with their probabilities, are:

madam 0.99
promotion 0.99
republic 0.99
shortest 0.047225013
mandatory 0.047225013
standardization 0.07347802
sorry 0.08221981
supported 0.09019077
people's 0.09019077
enter 0.9075001
quality 0.8921298
organization 0.12454646
investment 0.8568143
very 0.14758544
valuable 0.82347786

This time the evidence is a mix of good and bad. A word like "shortest" is almost as much evidence for innocence as a word like "madam" or "promotion" is for guilt. But still the case for guilt is stronger. If you combine these numbers according to Bayes' Rule, the resulting probability is .9027.

"Madam" is obviously from spams beginning "Dear Sir or Madam." They're not very common, but the word "madam" never occurs in my legitimate email, and it's all about the ratio.

"Republic" scores high because it often shows up in Nigerian scam emails, and also occurs once or twice in spams referring to Korea and South Africa. You might say that it's an accident that it thus helps identify this spam. But I've found when examining spam probabilities that there are a lot of these accidents, and they have an uncanny tendency to push things in the right direction rather than the wrong one. In this case, it is not entirely a coincidence that the word "Republic" occurs in Nigerian scam emails and this spam. There is a whole class of dubious business propositions involving less developed countries, and these in turn are more likely to have names that specify explicitly (because they aren't) that they are republics.[3]

On the other hand, "enter" is a genuine miss. It occurs mostly in unsubscribe instructions, but here is used in a completely innocent way. Fortunately the statistical approach is fairly robust, and can tolerate quite a lot of misses before the results start to be thrown off.

For comparison, here is an example of that rare bird, a spam that gets through the filters. Why? Because by sheer chance it happens to be loaded with words that occur in my actual email:

perl 0.01
python 0.01
tcl 0.01
scripting 0.01
morris 0.01
graham 0.01491078
guarantee 0.9762507
cgi 0.9734398
paul 0.027040077
quite 0.030676773
pop3 0.042199217
various 0.06080265
prices 0.9359873
managed 0.06451222
difficult 0.071706355

There are a couple pieces of good news here. First, this mail probably wouldn't get through the filters of someone who didn't happen to specialize in programming languages and have a good friend called Morris. For the average user, all the top five words here would be neutral and would not contribute to the spam probability.

Second, I think filtering based on word pairs (see below) might well catch this one: "cost effective", "setup fee", "money back" -- pretty incriminating stuff. And of course if they continued to spam me (or a network I was part of), "Hostex" itself would be recognized as a spam term.

Finally, here is an innocent email. Its fifteen most interesting words are as follows:

continuation 0.01
describe 0.01
continuations 0.01
example 0.033600237
programming 0.05214485
i'm 0.055427782
examples 0.07972858
color 0.9189189
localhost 0.09883721
hi 0.116539136
california 0.84421706
same 0.15981844
spot 0.1654587
us-ascii 0.16804294
what 0.19212411

Most of the words here indicate the mail is an innocent one. There are two bad smelling words, "color" (spammers love colored fonts) and "California" (which occurs in testimonials and also in menus in forms), but they are not enough to outweigh obviously innocent words like "continuation" and "example".

It's interesting that "describe" rates as so thoroughly innocent. It hasn't occurred in a single one of my 4000 spams. The data turns out to be full of such surprises. One of the things you learn when you analyze spam texts is how narrow a subset of the language spammers operate in. It's that fact, together with the equally characteristic vocabulary of any individual user's mail, that makes Bayesian filtering a good bet.

Appendix: More Ideas

One idea that I haven't tried yet is to filter based on word pairs, or even triples, rather than individual words. This should yield a much sharper estimate of the probability. For example, in my current database, the word "offers" has a probability of .96. If you based the probabilities on word pairs, you'd end up with "special offers" and "valuable offers" having probabilities of .99 and, say, "approach offers" (as in "this approach offers") having a probability of .1 or less.

The reason I haven't done this is that filtering based on individual words already works so well. But it does mean that there is room to tighten the filters if spam gets harder to detect. (Curiously, a filter based on word pairs would be in effect a Markov-chaining text generator running in reverse.)
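
Generating the pairs is a one-liner on top of the tokenizer; a sketch:

(defun bigrams (tokens)
  ;; Adjacent pairs joined with a space, so they can share the same
  ;; hash tables as single-word tokens.
  (loop for (a b) on tokens
        while b
        collect (concatenate 'string a " " b)))

;; e.g. (bigrams '("special" "offers" "inside"))
;;      => ("special offers" "offers inside")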

Specific spam features (e.g. not seeing the recipient's address in the to: field) do of course have value in recognizing spam. They can be considered in this algorithm by treating them as virtual words. I'll probably do this in future versions, at least for a handful of the most egregious spam indicators. Feature-recognizing spam filters are right in many details; what they lack is an overall discipline for combining evidence.

Recognizing nonspam features may be more important than recognizing spam features. False positives are such a worry that they demand extraordinary measures. I will probably in future versions add a second level of testing designed specifically to avoid false positives. If a mail triggers this second level of filters it will be accepted even if its spam probability is above the threshold.

I don't expect this second level of filtering to be Bayesian. It will inevitably be not only ad hoc, but based on guesses, because the number of false positives will not tend to be large enough to notice patterns. (It is just as well, anyway, if a backup system doesn't rely on the same technology as the primary system.)

Another thing I may try in the future is to focus extra attention on specific parts of the email. For example, about 95% of current spam includes the url of a site they want you to visit. (The remaining 5% want you to call a phone number, reply by email or to a US mail address, or in a few cases to buy a certain stock.) The url is in such cases practically enough by itself to determine whether the email is spam.

Domain names differ from the rest of the text in a (non-German) email in that they often consist of several words stuck together. Though computationally expensive in the general case, it might be worth trying to decompose them. If a filter has never seen the token "xxxporn" before it will have an individual spam probability of .4, whereas "xxx" and "porn" individually have probabilities (in my corpus) of .9889 and .99 respectively, and a combined probability of .9998.
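
The same combining rule produces that number:

(/ (* .9889 .99)
   (+ (* .9889 .99) (* (- 1 .9889) (- 1 .99))))
;; => 0.99989, the .9998 quoted above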

I expect decomposing domain names to become more important as spammers are gradually forced to stop using incriminating words in the text of their messages. (A url with an ip address is of course an extremely incriminating sign, except in the mail of a few sysadmins.)

It might be a good idea to have a cooperatively maintained list of urls promoted by spammers. We'd need a trust metric of the type studied by Raph Levien to prevent malicious or incompetent submissions, but if we had such a thing it would provide a boost to any filtering software. It would also be a convenient basis for boycotts.

Another way to test dubious urls would be to send out a crawler to look at the site before the user looked at the email mentioning it. You could use a Bayesian filter to rate the site just as you would an email, and whatever was found on the site could be included in calculating the probability of the email being a spam. A url that led to a redirect would of course be especially suspicious.

One cooperative project that I think really would be a good idea would be to accumulate a giant corpus of spam. A large, clean corpus is the key to making Bayesian filtering work well. Bayesian filters could actually use the corpus as input. But such a corpus would be useful for other kinds of filters too, because it could be used to test them.

Creating such a corpus poses some technical problems. We'd need trust metrics to prevent malicious or incompetent submissions, of course. We'd also need ways of erasing personal information (not just to-addresses and ccs, but also e.g. the arguments to unsubscribe urls, which often encode the to-address) from mails in the corpus. If anyone wants to take on this project, it would be a good thing for the world.

Appendix: Defining Spam

I think there is a rough consensus on what spam is, but it would be useful to have an explicit definition. We'll need to do this if we want to establish a central corpus of spam, or even to compare spam filtering rates meaningfully.

To start with, spam is not unsolicited commercial email. If someone in my neighborhood heard that I was looking for an old Raleigh three-speed in good condition, and sent me an email offering to sell me one, I'd be delighted, and yet this email would be both commercial and unsolicited. The defining feature of spam (in fact, its raison d'etre) is not that it is unsolicited, but that it is automated.

It is merely incidental, too, that spam is usually commercial. If someone started sending mass email to support some political cause, for example, it would be just as much spam as email promoting a porn site.

I propose we define spam as unsolicited automated email. This definition thus includes some email that many legal definitions of spam don't. Legal definitions of spam, influenced presumably by lobbyists, tend to exclude mail sent by companies that have an "existing relationship" with the recipient. But buying something from a company, for example, does not imply that you have solicited ongoing email from them. If I order something from an online store, and they then send me a stream of spam, it's still spam.

Companies sending spam often give you a way to "unsubscribe," or ask you to go to their site and change your "account preferences" if you want to stop getting spam. This is not enough to stop the mail from being spam. Not opting out is not the same as opting in. Unless the recipient explicitly checked a clearly labelled box (whose default was no) asking to receive the email, then it is spam.

In some business relationships, you do implicitly solicit certain kinds of mail. When you order online, I think you implicitly solicit a receipt, and notification when the order ships. I don't mind when Verisign sends me mail warning that a domain name is about to expire (at least, if they are the actual registrar for it). But when Verisign sends me email offering a FREE Guide to Building My E-Commerce Web Site, that's spam.

Notes:

[1] The examples in this article are translated into Common Lisp for, believe it or not, greater accessibility. The application described here is one that we wrote in order to test a new Lisp dialect called Arc that is not yet released.

[2] Currently the lowest rate seems to be about $200 to send a million spams. That's very cheap, 1/50th of a cent per spam. But filtering out 95% of spam, for example, would increase the spammers' cost to reach a given audience by a factor of 20. Few can have margins big enough to absorb that.

[3] As a rule of thumb, the more qualifiers there are before the name of a country, the more corrupt the rulers. A country called The Socialist People's Democratic Republic of X is probably the last place in the world you'd want to live.

Thanks to Sarah Harlin for reading drafts of this; Daniel Giffin (who is also writing the production Arc interpreter) for several good ideas about filtering and for creating our mail infrastructure; Robert Morris, Trevor Blackwell and Erann Gat for many discussions about spam; Raph Levien for advice about trust metrics; and Chip Coldwell and Sam Steingold for advice about statistics.

You'll find this essay and 14 others in Hackers & Painters.
