Subject: Re: Now, what is this about?
Author: asdasda
Date Posted: 05:27:11 01/01/08 Tue
In reply to: l;aksdjalsjkda's message, "Now, what is this about?" on 05:25:51 01/01/08 Tue
>August 2002
>
>(This article describes the spam-filtering techniques
>used in the spamproof web-based mail reader we built
>to exercise Arc. An improved algorithm is described in
>Better Bayesian Filtering.)
>
>I think it's possible to stop spam, and that
>content-based filters are the way to do it. The
>Achilles heel of the spammers is their message. They
>can circumvent any other barrier you set up. They have
>so far, at least. But they have to deliver their
>message, whatever it is. If we can write software that
>recognizes their messages, there is no way they can
>get around that.
>
>To the recipient, spam is easily recognizable. If you
>hired someone to read your mail and discard the spam,
>they would have little trouble doing it. How much do
>we have to do, short of AI, to automate this process?
>
>I think we will be able to solve the problem with
>fairly simple algorithms. In fact, I've found that you
>can filter present-day spam acceptably well using
>nothing more than a Bayesian combination of the spam
>probabilities of individual words. Using a slightly
>tweaked (as described below) Bayesian filter, we now
>miss less than 5 per 1000 spams, with 0 false
>positives.
>
>The statistical approach is not usually the first one
>people try when they write spam filters. Most hackers'
>first instinct is to try to write software that
>recognizes individual properties of spam. You look at
>spams and you think, the gall of these guys to try
>sending me mail that begins "Dear Friend" or has a
>subject line that's all uppercase and ends in eight
>exclamation points. I can filter out that stuff with
>about one line of code.
>
>And so you do, and in the beginning it works. A few
>simple rules will take a big bite out of your incoming
>spam. Merely looking for the word "click" will catch
>79.7% of the emails in my spam corpus, with only 1.2%
>false positives.
>
>I spent about six months writing software that looked
>for individual spam features before I tried the
>statistical approach. What I found was that
>recognizing that last few percent of spams got very
>hard, and that as I made the filters stricter I got
>more false positives.
>
>False positives are innocent emails that get
>mistakenly identified as spams. For most users,
>missing legitimate email is an order of magnitude
>worse than receiving spam, so a filter that yields
>false positives is like an acne cure that carries a
>risk of death to the patient.
>
>The more spam a user gets, the less likely he'll be to
>notice one innocent mail sitting in his spam folder.
>And strangely enough, the better your spam filters
>get, the more dangerous false positives become,
>because when the filters are really good, users will
>be more likely to ignore everything they catch.
>
>I don't know why I avoided trying the statistical
>approach for so long. I think it was because I got
>addicted to trying to identify spam features myself,
>as if I were playing some kind of competitive game
>with the spammers. (Nonhackers don't often realize
>this, but most hackers are very competitive.) When I
>did try statistical analysis, I found immediately that
>it was much cleverer than I had been. It discovered,
>of course, that terms like "virtumundo" and "teens"
>were good indicators of spam. But it also discovered
>that "per" and "FL" and "ff0000" are good indicators
>of spam. In fact, "ff0000" (html for bright red) turns
>out to be as good an indicator of spam as any
>pornographic term.
>
>
>_ _ _
>
>
>Here's a sketch of how I do statistical filtering. I
>start with one corpus of spam and one of nonspam mail.
>At the moment each one has about 4000 messages in it.
>I scan the entire text, including headers and embedded
>html and javascript, of each message in each corpus. I
>currently consider alphanumeric characters, dashes,
>apostrophes, and dollar signs to be part of tokens,
>and everything else to be a token separator. (There is
>probably room for improvement here.) I ignore tokens
>that are all digits, and I also ignore html comments,
>not even considering them as token separators.
>
>I count the number of times each token (ignoring case,
>currently) occurs in each corpus. At this stage I end
>up with two large hash tables, one for each corpus,
>mapping tokens to number of occurrences.
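The tokenizing and counting steps above can be sketched in Python (the essay's own code is Lisp; the regular expression and function names here are illustrative, not the author's):

```python
import re
from collections import Counter

# Token characters per the text: alphanumerics, dashes, apostrophes,
# and dollar signs; everything else is a separator.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

def tokenize(text):
    # Html comments are dropped entirely, not even treated as
    # separators, so text on either side of one fuses into one token.
    text = COMMENT_RE.sub("", text)
    tokens = TOKEN_RE.findall(text.lower())        # case is ignored
    return [t for t in tokens if not t.isdigit()]  # skip all-digit tokens

def count_corpus(messages):
    # One table per corpus, mapping each token to its occurrence count.
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts
```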
>
>Next I create a third hash table, this time mapping
>each token to the probability that an email containing
>it is a spam, which I calculate as follows [1]:
>(let ((g (* 2 (or (gethash word good) 0)))
>      (b (or (gethash word bad) 0)))
>  (unless (< (+ g b) 5)
>    (max .01
>         (min .99 (float (/ (min 1 (/ b nbad))
>                            (+ (min 1 (/ g ngood))
>                               (min 1 (/ b nbad)))))))))
>
>where word is the token whose probability we're
>calculating, good and bad are the hash tables I
>created in the first step, and ngood and nbad are the
>number of nonspam and spam messages respectively.
>
>I explained this as code to show a couple of important
>details. I want to bias the probabilities slightly to
>avoid false positives, and by trial and error I've
>found that a good way to do it is to double all the
>numbers in good. This helps to distinguish between
>words that occasionally do occur in legitimate email
>and words that almost never do. I only consider words
>that occur more than five times in total (actually,
>because of the doubling, occurring three times in
>nonspam mail would be enough). And then there is the
>question of what probability to assign to words that
>occur in one corpus but not the other. Again by trial
>and error I chose .01 and .99. There may be room for
>tuning here, but as the corpus grows such tuning will
>happen automatically anyway.
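In Python, the same per-word calculation reads as follows (a sketch under the assumptions just described; `good` and `bad` are the count tables, `ngood` and `nbad` the message counts):

```python
def word_probability(word, good, bad, ngood, nbad):
    # Double the good counts to bias against false positives.
    g = 2 * good.get(word, 0)
    b = bad.get(word, 0)
    if g + b < 5:  # too rare to trust; no probability assigned
        return None
    return max(0.01, min(0.99,
                         min(1.0, b / nbad) /
                         (min(1.0, g / ngood) + min(1.0, b / nbad))))
```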
>
>The especially observant will notice that while I
>consider each corpus to be a single long stream of
>text for purposes of counting occurrences, I use the
>number of emails in each, rather than their combined
>length, as the divisor in calculating spam
>probabilities. This adds another slight bias to
>protect against false positives.
>
>When new mail arrives, it is scanned into tokens, and
>the most interesting fifteen tokens, where interesting
>is measured by how far their spam probability is from
>a neutral .5, are used to calculate the probability
>that the mail is spam. If probs is a list of the
>fifteen individual probabilities, you calculate the
>combined probability thus:
>(let ((prod (apply #'* probs)))
>  (/ prod (+ prod (apply #'* (mapcar #'(lambda (x) (- 1 x))
>                                     probs)))))
>
>One question that arises in practice is what
>probability to assign to a word you've never seen,
>i.e. one that doesn't occur in the hash table of word
>probabilities. I've found, again by trial and error,
>that .4 is a good number to use. If you've never seen
>a word before, it is probably fairly innocent; spam
>words tend to be all too familiar.
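The classification step can be put together in the same Python sketch. The .4 default and the top-fifteen selection are as described; taking each distinct token once is an assumption, since the text doesn't say how duplicates are handled:

```python
UNKNOWN = 0.4  # probability assigned to a never-before-seen word

def interesting(token_probs, tokens, n=15):
    # Rank tokens by how far their spam probability sits from a
    # neutral .5, and keep the n most extreme.
    probs = [token_probs.get(t, UNKNOWN) for t in set(tokens)]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return probs[:n]

def combined_probability(probs):
    # Bayesian combination of the individual probabilities.
    prod, inv = 1.0, 1.0
    for p in probs:
        prod *= p
        inv *= 1.0 - p
    return prod / (prod + inv)

def is_spam(token_probs, tokens, threshold=0.9):
    return combined_probability(interesting(token_probs, tokens)) > threshold
```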
>
>There are examples of this algorithm being applied to
>actual emails in an appendix at the end.
>
>I treat mail as spam if the algorithm above gives it a
>probability of more than .9 of being spam. But in
>practice it would not matter much where I put this
>threshold, because few probabilities end up in the
>middle of the range.
>
>
>_ _ _
>
>
>One great advantage of the statistical approach is
>that you don't have to read so many spams. Over the
>past six months, I've read literally thousands of
>spams, and it is really kind of demoralizing. Norbert
>Wiener said if you compete with slaves you become a
>slave, and there is something similarly degrading
>about competing with spammers. To recognize individual
>spam features you have to try to get into the mind of
>the spammer, and frankly I want to spend as little
>time inside the minds of spammers as possible.
>
>But the real advantage of the Bayesian approach, of
>course, is that you know what you're measuring.
>Feature-recognizing filters like SpamAssassin assign a
>spam "score" to email. The Bayesian approach assigns
>an actual probability. The problem with a "score" is
>that no one knows what it means. The user doesn't know
>what it means, but worse still, neither does the
>developer of the filter. How many points should an
>email get for having the word "sex" in it? A
>probability can of course be mistaken, but there is
>little ambiguity about what it means, or how evidence
>should be combined to calculate it. Based on my
>corpus, "sex" indicates a .97 probability of the
>containing email being a spam, whereas "sexy"
>indicates .99 probability. And Bayes' Rule, equally
>unambiguous, says that an email containing both words
>would, in the (unlikely) absence of any other
>evidence, have a 99.97% chance of being a spam.
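The 99.97% figure follows directly from the combining formula sketched earlier; a quick check:

```python
p_sex, p_sexy = 0.97, 0.99
prod = p_sex * p_sexy              # 0.9603
inv = (1 - p_sex) * (1 - p_sexy)   # 0.03 * 0.01 = 0.0003
combined = prod / (prod + inv)     # 0.9603 / 0.9606
print(round(combined, 4))          # → 0.9997
```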
>
>Because it is measuring probabilities, the Bayesian
>approach considers all the evidence in the email, both
>good and bad. Words that occur disproportionately
>rarely in spam (like "though" or "tonight" or
>"apparently") contribute as much to decreasing the
>probability as bad words like "unsubscribe" and
>"opt-in" do to increasing it. So an otherwise innocent
>email that happens to include the word "sex" is not
>going to get tagged as spam.
>
>Ideally, of course, the probabilities should be
>calculated individually for each user. I get a lot of
>email containing the word "Lisp", and (so far) no spam
>that does. So a word like that is effectively a kind
>of password for sending mail to me. In my earlier
>spam-filtering software, the user could set up a list
>of such words and mail containing them would
>automatically get past the filters. On my list I put
>words like "Lisp" and also my zipcode, so that
>(otherwise rather spammy-sounding) receipts from
>online orders would get through. I thought I was being
>very clever, but I found that the Bayesian filter did
>the same thing for me, and moreover discovered a
>lot of words I hadn't thought of.
>
>When I said at the start that our filters let through
>less than 5 spams per 1000 with 0 false positives, I'm
>talking about filtering my mail based on a corpus of
>my mail. But these numbers are not misleading, because
>that is the approach I'm advocating: filter each
>user's mail based on the spam and nonspam mail he
>receives. Essentially, each user should have two
>delete buttons, ordinary delete and delete-as-spam.
>Anything deleted as spam goes into the spam corpus,
>and everything else goes into the nonspam corpus.
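A minimal sketch of that two-button training loop (hypothetical class and method names; tokenization simplified to whitespace splitting here):

```python
from collections import Counter

class UserCorpora:
    # Each user accumulates personal spam and nonspam corpora,
    # fed by the two delete buttons.
    def __init__(self):
        self.good, self.bad = Counter(), Counter()
        self.ngood, self.nbad = 0, 0

    def delete(self, message):          # ordinary delete -> nonspam corpus
        self.good.update(message.lower().split())
        self.ngood += 1

    def delete_as_spam(self, message):  # -> spam corpus
        self.bad.update(message.lower().split())
        self.nbad += 1
```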
>
>You could start users with a seed filter, but
>ultimately each user should have his own per-word
>probabilities based on the actual mail he receives.
>This (a) makes the filters more effective, (b) lets
>each user decide their own precise definition of spam,
>and (c) perhaps best of all makes it hard for spammers
>to tune mails to get through the filters. If a lot of
>the brain of the filter is in the individual
>databases, then merely tuning spams to get through the
>seed filters won't guarantee anything about how well
>they'll get through individual users' varying and much
>more trained filters.
>
>Content-based spam filtering is often combined with a
>whitelist, a list of senders whose mail can be
>accepted with no filtering. One easy way to build such
>a whitelist is to keep a list of every address the
>user has ever sent mail to. If a mail reader has a
>delete-as-spam button then you could also add the from
>address of every email the user has deleted as
>ordinary trash.
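Both whitelist-building signals described above fit in a few lines (hypothetical names; a sketch, not a specification):

```python
class Whitelist:
    # Addresses the user has written to, plus senders whose mail the
    # user deleted as ordinary trash rather than as spam.
    def __init__(self):
        self.addresses = set()

    def note_outgoing(self, to_addresses):
        self.addresses.update(a.lower() for a in to_addresses)

    def note_ordinary_delete(self, from_address):
        self.addresses.add(from_address.lower())

    def trusted(self, from_address):
        return from_address.lower() in self.addresses
```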
>
>I'm an advocate of whitelists, but more as a way to
>save computation than as a way to improve filtering. I
>used to think that whitelists would make filtering
>easier, because you'd only have to filter email from
>people you'd never heard from, and someone sending you
>mail for the first time is constrained by convention
>in what they can say to you. Someone you already know
>might send you an email talking about sex, but someone
>sending you mail for the first time would not be
>likely to. The problem is, people can have more than
>one email address, so a new from-address doesn't
>guarantee that the sender is writing to you for the
>first time. It is not unusual for an old friend
>(especially if he is a hacker) to suddenly send you an
>email with a new from-address, so you can't risk false
>positives by filtering mail from unknown addresses
>especially stringently.
>
>In a sense, though, my filters do themselves embody a
>kind of whitelist (and blacklist) because they are
>based on entire messages, including the headers. So to
>that extent they "know" the email addresses of trusted
>senders and even the routes by which mail gets from
>them to me. And they know the same about spam,
>including the server names, mailer versions, and
>protocols.
>
>
>_ _ _
>
>
>If I thought that I could keep up current rates of
>spam filtering, I would consider this problem solved.
>But it doesn't mean much to be able to filter out most
>present-day spam, because spam evolves. Indeed, most
>antispam techniques so far have been like pesticides
>that do nothing more than create a new, resistant
>strain of bugs.
>
>I'm more hopeful about Bayesian filters, because they
>evolve with the spam. So as spammers start using
>"c0ck" instead of "cock" to evade simple-minded spam
>filters based on individual words, Bayesian filters
>automatically notice. Indeed, "c0ck" is far more
>damning evidence than "cock", and Bayesian filters
>know precisely how much more.
>
>Still, anyone who proposes a plan for spam filtering
>has to be able to answer the question: if the spammers
>knew exactly what you were doing, how well could they
>get past you? For example, I think that if
>checksum-based spam filtering becomes a serious
>obstacle, the spammers will just switch to mad-lib
>techniques for generating message bodies.
>
>To beat Bayesian filters, it would not be enough for
>spammers to make their emails unique or to stop using
>individual naughty words. They'd have to make their
>mails indistinguishable from your ordinary mail. And
>this I think would severely constrain them. Spam is
>mostly sales pitches, so unless your regular mail is
>all sales pitches, spams will inevitably have a
>different character. And the spammers would also, of
>course, have to change (and keep changing) their whole
>infrastructure, because otherwise the headers would
>look as bad to the Bayesian filters as ever, no matter
>what they did to the message body. I don't know enough
>about the infrastructure that spammers use to know how
>hard it would be to make the headers look innocent,
>but my guess is that it would be even harder than
>making the message look innocent.
>
>Assuming they could solve the problem of the headers,
>the spam of the future will probably look something
>like this:
>Hey there. Thought you should check out the following:
>
>http://www.27meg.com/foo
>
>because that is about as much sales pitch as
>content-based filtering will leave the spammer room to
>make. (Indeed, it will be hard even to get this past
>filters, because if everything else in the email is
>neutral, the spam probability will hinge on the url,
>and it will take some effort to make that look
>neutral.)
>
>Spammers range from businesses running so-called
>opt-in lists who don't even try to conceal their
>identities, to guys who hijack mail servers to send
>out spams promoting porn sites. If we use filtering to
>whittle their options down to mails like the one
>above, that should pretty much put the spammers on the
>"legitimate" end of the spectrum out of business; they
>feel obliged by various state laws to include
>boilerplate about why their spam is not spam, and how
>to cancel your "subscription," and that kind of text
>is easy to recognize.
>
>(I used to think it was naive to believe that stricter
>laws would decrease spam. Now I think that while
>stricter laws may not decrease the amount of spam that
>spammers send, they can certainly help filters to
>decrease the amount of spam that recipients actually
>see.)
>
>All along the spectrum, if you restrict the sales
>pitches spammers can make, you will inevitably tend to
>put them out of business. That word business is an
>important one to remember. The spammers are
>businessmen. They send spam because it works. It works
>because although the response rate is abominably low
>(at best 15 per million, vs 3000 per million for a
>catalog mailing), the cost, to them, is practically
>nothing. The cost is enormous for the recipients,
>about 5 man-weeks for each million recipients who
>spend a second to delete the spam, but the spammer
>doesn't have to pay that.
>
>Sending spam does cost the spammer something, though.
>[2] So the lower we can get the response rate--
>whether by filtering, or by using filters to force
>spammers to dilute their pitches-- the fewer
>businesses will find it worth their while to send spam.
>
>The reason the spammers use the kinds of sales pitches
>that they do is to increase response rates. This is
>possibly even more disgusting than getting inside the
>mind of a spammer, but let's take a quick look inside
>the mind of someone who responds to a spam. This
>person is either astonishingly credulous or deeply in
>denial about their sexual interests. In either case,
>repulsive or idiotic as the spam seems to us, it is
>exciting to them. The spammers wouldn't say these
>things if they didn't sound exciting. And "thought you
>should check out the following" is just not going to
>have nearly the pull with the spam recipient as the
>kinds of things that spammers say now. Result: if it
>can't contain exciting sales pitches, spam becomes
>less effective as a marketing vehicle, and fewer
>businesses want to use it.
>
>That is the big win in the end. I started writing spam
>filtering software because I didn't want to have to
>look at the stuff anymore. But if we get good enough at
>at the stuff anymore. But if we get good enough at
>filtering out spam, it will stop working, and the
>spammers will actually stop sending it.
>
>
>_ _ _
>
>
>Of all the approaches to fighting spam, from software
>to laws, I believe Bayesian filtering will be the
>single most effective. But I also think that the more
>different kinds of antispam efforts we undertake, the
>better, because any measure that constrains spammers
>will tend to make filtering easier. And even within
>the world of content-based filtering, I think it will
>be a good thing if there are many different kinds of
>software being used simultaneously. The more different
>filters there are, the harder it will be for spammers
>to tune spams to get through them.
>
>
>
>Appendix: Examples of Filtering
>
>Here is an example of a spam that arrived while I was
>writing this article. The fifteen most interesting
>words in this spam are:
>qvp0045
>indira
>mx-05
>intimail
>$7500
>freeyankeedom
>cdo
>bluefoxmedia
>jpg
>unsecured
>platinum
>3d0
>qves
>7c5
>7c266675
>
>The words are a mix of stuff from the headers and from
>the message body, which is typical of spam. Also
>typical of spam is that every one of these words has a
>spam probability, in my database, of .99. In fact
>there are more than fifteen words with probabilities
>of .99, and these are just the first fifteen seen.
>
>Unfortunately that makes this email a boring example
>of the use of Bayes' Rule. To see an interesting
>variety of probabilities we have to look at this
>actually quite atypical spam.
>
>The fifteen most interesting words in this spam, with
>their probabilities, are:
>madam 0.99
>promotion 0.99
>republic 0.99
>shortest 0.047225013
>mandatory 0.047225013
>standardization 0.07347802
>sorry 0.08221981
>supported 0.09019077
>people's 0.09019077
>enter 0.9075001
>quality 0.8921298
>organization 0.12454646
>investment 0.8568143
>very 0.14758544
>valuable 0.82347786
>
>This time the evidence is a mix of good and bad. A
>word like "shortest" is almost as much evidence for
>innocence as a word like "madam" or "promotion" is for
>guilt. But still the case for guilt is stronger. If
>you combine these numbers according to Bayes' Rule,
>the resulting probability is .9027.
>
>"Madam" is obviously from spams beginning "Dear Sir or
>Madam." They're not very common, but the word "madam"
>never occurs in my legitimate email, and it's all
>about the ratio.
>
>"Republic" scores high because it often shows up in
>Nigerian scam emails, and also occurs once or twice in
>spams referring to Korea and South Africa. You might
>say that it's an accident that it thus helps identify
>this spam. But I've found when examining spam
>probabilities that there are a lot of these accidents,
>and they have an uncanny tendency to push things in
>the right direction rather than the wrong one. In this
>case, it is not entirely a coincidence that the word
>"Republic" occurs in Nigerian scam emails and this
>spam. There is a whole class of dubious business
>propositions involving less developed countries, and
>these in turn are more likely to have names that
>specify explicitly (because they aren't) that they are
>republics.[3]
>
>On the other hand, "enter" is a genuine miss. It
>occurs mostly in unsubscribe instructions, but here is
>used in a completely innocent way. Fortunately the
>statistical approach is fairly robust, and can
>tolerate quite a lot of misses before the results
>start to be thrown off.
>
>For comparison, here is an example of that rare bird,
>a spam that gets through the filters. Why? Because by
>sheer chance it happens to be loaded with words that
>occur in my actual email:
>perl 0.01
>python 0.01
>tcl 0.01
>scripting 0.01
>morris 0.01
>graham 0.01491078
>guarantee 0.9762507
>cgi 0.9734398
>paul 0.027040077
>quite 0.030676773
>pop3 0.042199217
>various 0.06080265
>prices 0.9359873
>managed 0.06451222
>difficult 0.071706355
>
>There are a couple pieces of good news here. First,
>this mail probably wouldn't get through the filters of
>someone who didn't happen to specialize in programming
>languages and have a good friend called Morris. For
>the average user, all the top five words here would be
>neutral and would not contribute to the spam
>probability.
>
>Second, I think filtering based on word pairs (see
>below) might well catch this one: "cost effective",
>"setup fee", "money back" -- pretty incriminating
>stuff. And of course if they continued to spam me (or
>a network I was part of), "Hostex" itself would be
>recognized as a spam term.
>
>Finally, here is an innocent email. Its fifteen most
>interesting words are as follows:
>continuation 0.01
>describe 0.01
>continuations 0.01
>example 0.033600237
>programming 0.05214485
>i'm 0.055427782
>examples 0.07972858
>color 0.9189189
>localhost 0.09883721
>hi 0.116539136
>california 0.84421706
>same 0.15981844
>spot 0.1654587
>us-ascii 0.16804294
>what 0.19212411
>
>Most of the words here indicate the mail is an
>innocent one. There are two bad smelling words,
>"color" (spammers love colored fonts) and "California"
>(which occurs in testimonials and also in menus in
>forms), but they are not enough to outweigh obviously
>innocent words like "continuation" and "example".
>
>It's interesting that "describe" rates as so
>thoroughly innocent. It hasn't occurred in a single
>one of my 4000 spams. The data turns out to be full of
>such surprises. One of the things you learn when you
>analyze spam texts is how narrow a subset of the
>language spammers operate in. It's that fact, together
>with the equally characteristic vocabulary of any
>individual user's mail, that makes Bayesian filtering
>a good bet.
>
>Appendix: More Ideas
>
>One idea that I haven't tried yet is to filter based
>on word pairs, or even triples, rather than individual
>words. This should yield a much sharper estimate of
>the probability. For example, in my current database,
>the word "offers" has a probability of .96. If you
>based the probabilities on word pairs, you'd end up
>with "special offers" and "valuable offers" having
>probabilities of .99 and, say, "approach offers" (as
>in "this approach offers") having a probability of .1
>or less.
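Generating word-pair features is a one-line extension of the tokenizer (a sketch; the essay leaves the details of pair filtering open):

```python
def word_pairs(tokens):
    # Pair each token with its successor, so "special offers" becomes
    # a single feature with its own spam probability.
    return [a + " " + b for a, b in zip(tokens, tokens[1:])]
```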
>
>The reason I haven't done this is that filtering based
>on individual words already works so well. But it does
>mean that there is room to tighten the filters if spam
>gets harder to detect. (Curiously, a filter based on
>word pairs would be in effect a Markov-chaining text
>generator running in reverse.)
>
>Specific spam features (e.g. not seeing the
>recipient's address in the to: field) do of course
>have value in recognizing spam. They can be considered
>in this algorithm by treating them as virtual words.
>I'll probably do this in future versions, at least for
>a handful of the most egregious spam indicators.
>Feature-recognizing spam filters are right in many
>details; what they lack is an overall discipline for
>combining evidence.
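One way to fold such features into the same machinery, as suggested above, is to emit invented tokens that can't collide with real words (the feature checked and the token spelling here are illustrative assumptions):

```python
def virtual_tokens(headers, recipient):
    # Emit pseudo-words for header-level spam features; the asterisks
    # keep them from colliding with tokens drawn from real text.
    tokens = []
    if recipient.lower() not in headers.get("To", "").lower():
        tokens.append("*recipient-not-in-to*")
    return tokens
```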
>
>Recognizing nonspam features may be more important
>than recognizing spam features. False positives are
>such a worry that they demand extraordinary measures.
>I will probably in future versions add a second level
>of testing designed specifically to avoid false
>positives. If a mail triggers this second level of
>filters it will be accepted even if its spam
>probability is above the threshold.
>
>I don't expect this second level of filtering to be
>Bayesian. It will inevitably be not only ad hoc, but
>based on guesses, because the number of false
>positives will not tend to be large enough to notice
>patterns. (It is just as well, anyway, if a backup
>system doesn't rely on the same technology as the
>primary system.)
>
>Another thing I may try in the future is to focus
>extra attention on specific parts of the email. For
>example, about 95% of current spam includes the url of
>a site they want you to visit. (The remaining 5% want
>you to call a phone number, reply by email or to a US
>mail address, or in a few cases to buy a certain
>stock.) The url is in such cases practically enough by
>itself to determine whether the email is spam.
>
>Domain names differ from the rest of the text in a
>(non-German) email in that they often consist of
>several words stuck together. Though computationally
>expensive in the general case, it might be worth
>trying to decompose them. If a filter has never seen
>the token "xxxporn" before it will have an individual
>spam probability of .4, whereas "xxx" and "porn"
>individually have probabilities (in my corpus) of
>.9889 and .99 respectively, and a combined probability
>of .9998.
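A greedy decomposition plus the usual combining rule reproduces the numbers above (a sketch; a real decomposer would have to handle ambiguous splits):

```python
def split_token(token, vocab):
    # Greedy longest-prefix split of a glued token such as "xxxporn";
    # returns None when no full decomposition into known words exists.
    if token == "":
        return []
    for i in range(len(token), 0, -1):
        if token[:i] in vocab:
            rest = split_token(token[i:], vocab)
            if rest is not None:
                return [token[:i]] + rest
    return None

def combine(probs):
    # Same Bayesian combination used for whole messages.
    prod, inv = 1.0, 1.0
    for p in probs:
        prod *= p
        inv *= 1.0 - p
    return prod / (prod + inv)

vocab = {"xxx": 0.9889, "porn": 0.99}
parts = split_token("xxxporn", vocab)            # ["xxx", "porn"]
spamminess = combine([vocab[w] for w in parts])  # combined spam probability
```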
>
>I expect decomposing domain names to become more
>important as spammers are gradually forced to stop
>using incriminating words in the text of their
>messages. (A url with an ip address is of course an
>extremely incriminating sign, except in the mail of a
>few sysadmins.)
>
>It might be a good idea to have a cooperatively
>maintained list of urls promoted by spammers. We'd
>need a trust metric of the type studied by Raph Levien
>to prevent malicious or incompetent submissions, but
>if we had such a thing it would provide a boost to any
>filtering software. It would also be a convenient
>basis for boycotts.
>
>Another way to test dubious urls would be to send out
>a crawler to look at the site before the user looked
>at the email mentioning it. You could use a Bayesian
>filter to rate the site just as you would an email,
>and whatever was found on the site could be included
>in calculating the probability of the email being a
>spam. A url that led to a redirect would of course be
>especially suspicious.
>
>One cooperative project that I think really would be a
>good idea would be to accumulate a giant corpus of
>spam. A large, clean corpus is the key to making
>Bayesian filtering work well. Bayesian filters could
>actually use the corpus as input. But such a corpus
>would be useful for other kinds of filters too,
>because it could be used to test them.
>
>Creating such a corpus poses some technical problems.
>We'd need trust metrics to prevent malicious or
>incompetent submissions, of course. We'd also need
>ways of erasing personal information (not just
>to-addresses and ccs, but also e.g. the arguments to
>unsubscribe urls, which often encode the to-address)
>from mails in the corpus. If anyone wants to take on
>this project, it would be a good thing for the world.
>
>Appendix: Defining Spam
>
>I think there is a rough consensus on what spam is,
>but it would be useful to have an explicit definition.
>We'll need to do this if we want to establish a
>central corpus of spam, or even to compare spam
>filtering rates meaningfully.
>
>To start with, spam is not unsolicited commercial
>email. If someone in my neighborhood heard that I was
>looking for an old Raleigh three-speed in good
>condition, and sent me an email offering to sell me
>one, I'd be delighted, and yet this email would be
>both commercial and unsolicited. The defining feature
>of spam (in fact, its raison d'etre) is not that it is
>unsolicited, but that it is automated.
>
>It is merely incidental, too, that spam is usually
>commercial. If someone started sending mass email to
>support some political cause, for example, it would be
>just as much spam as email promoting a porn site.
>
>I propose we define spam as unsolicited automated
>email. This definition thus includes some email that
>many legal definitions of spam don't. Legal
>definitions of spam, influenced presumably by
>lobbyists, tend to exclude mail sent by companies that
>have an "existing relationship" with the recipient.
>But buying something from a company, for example, does
>not imply that you have solicited ongoing email from
>them. If I order something from an online store, and
>they then send me a stream of spam, it's still spam.
>
>Companies sending spam often give you a way to
>"unsubscribe," or ask you to go to their site and
>change your "account preferences" if you want to stop
>getting spam. This is not enough to stop the mail from
>being spam. Not opting out is not the same as opting
>in. Unless the recipient explicitly checked a clearly
>labelled box (whose default was no) asking to receive
>the email, then it is spam.
>
>In some business relationships, you do implicitly
>solicit certain kinds of mail. When you order online,
>I think you implicitly solicit a receipt, and
>notification when the order ships. I don't mind when
>Verisign sends me mail warning that a domain name is
>about to expire (at least, if they are the actual
>registrar for it). But when Verisign sends me email
>offering a FREE Guide to Building My E-Commerce Web
>Site, that's spam.
>
>Notes:
>
>[1] The examples in this article are translated into
>Common Lisp for, believe it or not, greater
>accessibility. The application described here is one
>that we wrote in order to test a new Lisp dialect
>called Arc that is not yet released.
>
>[2] Currently the lowest rate seems to be about $200
>to send a million spams. That's very cheap, 1/50th of
>a cent per spam. But filtering out 95% of spam, for
>example, would increase the spammers' cost to reach a
>given audience by a factor of 20. Few can have margins
>big enough to absorb that.
>
>[3] As a rule of thumb, the more qualifiers there are
>before the name of a country, the more corrupt the
>rulers. A country called The Socialist People's
>Democratic Republic of X is probably the last place in
>the world you'd want to live.
>
>Thanks to Sarah Harlin for reading drafts of this;
>Daniel Giffin (who is also writing the production Arc
>interpreter) for several good ideas about filtering
>and for creating our mail infrastructure; Robert
>Morris, Trevor Blackwell and Erann Gat for many
>discussions about spam; Raph Levien for advice about
>trust metrics; and Chip Coldwell and Sam Steingold for
>advice about statistics.
>
> You'll find this essay and 14 others in Hackers &
>Painters.