
Blue Forest http://www.lslnet.com at 12:08 on July 28, 2006


CCERT Chinese spam filtering rule set Chinese_rules.cf

Source: http://www.ccert.edu.cn/spam/sa/Chinese_rules.htm

[b]CCERT Chinese spam filtering rule set Chinese_rules.cf[/b]


1. What is Chinese_rules.cf

2. Theoretical background of Chinese_rules.cf

3. Framework for the generation and use of Chinese_rules.cf

4. Matching speed of Chinese_rules.cf

5. Accuracy of Chinese_rules.cf

6. Using Chinese_rules.cf


-------------------------------------------------------------

[b]1. What is Chinese_rules.cf[/b]

Chinese_rules.cf is a set of spam filtering rules for SpamAssassin, the widely used free spam filtering system. Previously there was no Chinese rule set, so SpamAssassin's filtering accuracy on Chinese mail was not high. Drawing on its latest and richest sample data, the CCERT Anti-Spam Research Group released Chinese_rules.cf, the first Chinese spam filtering rule set for SpamAssassin. The rule set is updated weekly, so its timeliness is very good.
Chinese_rules.cf is the Chinese spam filtering rule set listed on the official SpamAssassin website. It is also the first result returned when searching for "Chinese spam filtering" on Google, Yahoo, Baidu, and MSN Search.



[b]2. Theoretical background of Chinese_rules.cf[/b]

Chinese_rules.cf is a set of mail content filtering rules. Current mail content filtering technologies fall into two types: rule-based methods and statistical methods. Rule-based methods look for specific terms and patterns, such as the word "free" in the subject. Statistical methods treat filtering as a two-class mail classification problem, with a classifier trained on samples of spam and normal mail. The statistical method most commonly used in spam filtering is the Bayesian method.
The advantage of rule-based methods is that rules can be shared, so they generalize well: rules written by one person can serve many other people and many servers. The drawback is that they are slow to update. Because rules are generally written by hand, new rules can barely keep pace with new spam; in other words, their timeliness is poor.
The advantage of statistical methods is that the classifier is trained automatically. As long as the training samples are updated, the classifier is updated too, so it can keep pace with new spam and its timeliness is good. The drawback is that a classifier is hard to share: a classifier that some users train on their own mail samples may be ineffective for other users, so it generalizes poorly.
Chinese_rules.cf uses a new kind of statistics-based rule: its rules are generated automatically by statistical methods. It combines the advantages of both approaches. Because it is rule-based, it generalizes well; because the rules are generated automatically by statistical methods, its timeliness is also strong. Table 1 compares Chinese_rules.cf with the traditional methods.


Table 1. Chinese_rules.cf compared with traditional methods

-----------------------------------------------------------------
| Method            | Timeliness | Generalizability |
-----------------------------------------------------------------
| Rule-based        | Poor       | Good             |
| Statistics-based  | Good       | Poor             |
| Chinese_rules.cf  | Good       | Good             |
-----------------------------------------------------------------


The CCERT anti-spam group has been in operation since 1998 and receives a large number of spam complaints every day, so it holds the latest and most abundant sample data. Chinese_rules.cf is generated automatically by statistical methods from this sample database.
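To make the idea of statistically generated rules more concrete, here is a minimal Python sketch. It is not CCERT's actual algorithm, which this article does not describe. The function name `generate_rules`, the whitespace tokenization, the add-one smoothing, and the log-odds scoring are all assumptions made for illustration, and real Chinese mail would need a different tokenization.

```python
import math
from collections import Counter

def generate_rules(spam_subjects, ham_subjects, max_rules=500, min_score=0.5):
    """Hypothetical sketch: turn token statistics into (token, score) rules.

    A token becomes a rule when it appears in a clearly larger share of spam
    subjects than of normal ones; the score is a smoothed log-odds ratio.
    """
    # Count each token at most once per message.
    spam_counts = Counter(tok for s in spam_subjects for tok in set(s.split()))
    ham_counts = Counter(tok for s in ham_subjects for tok in set(s.split()))
    n_spam, n_ham = len(spam_subjects), len(ham_subjects)

    rules = []
    for tok, c_spam in spam_counts.items():
        # Add-one smoothing so tokens unseen in normal mail do not divide by zero.
        p_spam = (c_spam + 1) / (n_spam + 2)
        p_ham = (ham_counts.get(tok, 0) + 1) / (n_ham + 2)
        score = math.log(p_spam / p_ham)
        if score >= min_score:
            rules.append((tok, round(score, 2)))

    # Keep the strongest rules, capped at max_rules.
    rules.sort(key=lambda r: r[1], reverse=True)
    return rules[:max_rules]

# Toy sample database (invented for this example).
spam = ["free invoice offer", "free offer today", "cheap invoice"]
ham = ["meeting today", "project report", "free seminar today"]
rules = dict(generate_rules(spam, ham))
print(rules)
```

In this toy data, "free" and "today" also occur in normal mail, so they do not become rules, while "invoice", "offer", and "cheap" do. Rerunning the function on a refreshed sample database yields a refreshed rule set, which is the timeliness property described above.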



[b]3. Framework for the generation and use of Chinese_rules.cf[/b]

Figure 1 shows the framework for generating and using Chinese_rules.cf. First, CCERT's spam handling service and user feedback are used to maintain an up-to-date and comprehensive database of spam and normal mail samples. Statistical methods then automatically generate the rule set Chinese_rules.cf from these samples. Because the sample database is current, the timeliness of Chinese_rules.cf is very strong. The rule set is published on the CCERT web page as an external service provided by CCERT. Users (servers) everywhere can download Chinese_rules.cf from the CCERT page, so Chinese_rules.cf generalizes well.

[img]http://www.ccert.edu.cn/spam/sa/framework.jpg[/img]
Figure 1. Framework for the generation and use of Chinese_rules.cf



[b]4. Matching speed of Chinese_rules.cf[/b]

The number of rules in Chinese_rules.cf is generally kept at around 500. That figure might raise doubts about the matching speed of Chinese_rules.cf. Careful analysis and test results show that its matching performance is still relatively high, for two reasons. First, the rules in Chinese_rules.cf are simple: each is a relatively short string with no wildcards in the middle, and such matching is much faster than matching complicated rules. Second, 90% of the rules in Chinese_rules.cf are mail subject rules and only 10% are body rules. Because the subject is usually short, Chinese_rules.cf matches faster as a result.
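The following Python sketch illustrates why short literal rules applied mostly to the subject are cheap to evaluate. It only mimics the general idea of summing rule scores and comparing the total with a threshold. The rules, scores, and the names `RULES`, `score_mail`, and `is_spam` are hypothetical and are not taken from Chinese_rules.cf or from SpamAssassin's own code.

```python
# Hypothetical rules: (mail part, plain substring, score).
# None of these come from the real Chinese_rules.cf.
RULES = [
    ("subject", "invoice", 1.5),
    ("subject", "free offer", 1.2),
    ("body", "click here", 0.8),
]

def score_mail(subject, body, rules=RULES):
    """Sum the scores of all rules whose substring occurs in the named part."""
    parts = {"subject": subject.lower(), "body": body.lower()}
    # A plain substring test needs no wildcard handling or backtracking.
    return sum(score for part, needle, score in rules if needle in parts[part])

def is_spam(subject, body, threshold=2.0):
    """Flag the mail when the summed score reaches the threshold."""
    return score_mail(subject, body) >= threshold

total = score_mail("Free offer: invoice attached", "Please click here")
print(total, is_spam("Free offer: invoice attached", "Please click here"))
print(is_spam("Meeting notes", "See the agenda"))
```

Most of the work here is a substring search over a subject line of a few dozen characters, so the cost per mail grows only slowly with the number of rules.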
The above is a theoretical analysis of the performance. In a test on a PC (P4 2.8 GHz CPU), we matched Chinese_rules.cf (21 Dec 2004 version) against 178,482 mails; matching a mail of average size 5.0K took only 0.04 seconds. This is a very good result: if the average size of a mail on a server is 5.0K (without attachments), a single PC can process 2.16 million mails per day. A typical campus mail server sends and receives around 300,000 mails per day. In other words, a PC comparable to existing mail server hardware has enough processing capacity to meet the demand of spam filtering.
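The daily capacity figure follows directly from the measured time per mail. The short Python calculation below reproduces it; the variable names are chosen for this example, and the 300,000 figure is the typical daily volume quoted above.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds
seconds_per_mail = 0.04          # measured average for a 5.0K mail
daily_load = 300_000             # typical daily volume quoted in the text

# Mails one PC can scan in a day if it does nothing else.
capacity_per_day = round(SECONDS_PER_DAY / seconds_per_mail)

# Share of the day spent scanning at the typical load.
busy_fraction = daily_load / capacity_per_day

print(capacity_per_day)
print(f"{busy_fraction:.1%}")
```

At the quoted load, scanning would occupy roughly 14% of one PC's time, which leaves a wide margin.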




[b]5. Accuracy of Chinese_rules.cf[/b]

Each version of Chinese_rules.cf comes with accuracy test results. For example, the test results for the Dec 21, 2004 version are as follows:

Table 2. Chinese_rules.cf, 2005 Jan 2
-------------------------------------------------------------------
| Threshold | Spam detection rate | Normal mail misjudgment rate |
|           | (16,729 total)      | (93,655 total)               |
-------------------------------------------------------------------
| 0.5       | 95.0%               | 5.1%                         |
| 1.0       | 92.9%               | 1.6%                         |
| 1.5       | 90.4%               | 0.4%                         |
| 2.0       | 87.9%               | 0.1%                         |
| 2.5       | 84.5%               | 0.0%                         |
| 3.0       | 81.1%               | 0.0%                         |
| 3.5       | 76.6%               | 0.0%                         |
| 4.0       | 72.4%               | 0.0%                         |
| 4.5       | 67.0%               | 0.0%                         |
-------------------------------------------------------------------

Scanning a mail of average size 5007.55 bytes takes 0.04 seconds (P4 2.8 GHz CPU).

Table 2 shows the results of testing Chinese_rules.cf alone, with no other rules in use. In practice, Chinese_rules.cf is usually used together with SpamAssassin's default rules. Because some of SpamAssassin's default rules also help detect Chinese spam, actual performance is better than these experimental results.
Note: for a mail server handling more than 400,000 mails a day, tolerable performance is a normal mail misjudgment rate below 5% and a spam detection rate above 90%.
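As a worked example of reading Table 2 against the tolerance stated above, the Python sketch below lists the thresholds that satisfy both conditions. The data is copied from Table 2; the function name `acceptable_thresholds` is an assumption for this example, and the result applies only to the standalone test, not to a deployment that also uses SpamAssassin's default rules.

```python
# (threshold, spam detection rate %, normal mail misjudgment rate %) from Table 2
TABLE_2 = [
    (0.5, 95.0, 5.1), (1.0, 92.9, 1.6), (1.5, 90.4, 0.4),
    (2.0, 87.9, 0.1), (2.5, 84.5, 0.0), (3.0, 81.1, 0.0),
    (3.5, 76.6, 0.0), (4.0, 72.4, 0.0), (4.5, 67.0, 0.0),
]

def acceptable_thresholds(rows, min_detection=90.0, max_misjudgment=5.0):
    """Return thresholds with detection above and misjudgment below the limits."""
    return [
        threshold
        for threshold, detection, misjudgment in rows
        if detection > min_detection and misjudgment < max_misjudgment
    ]

candidates = acceptable_thresholds(TABLE_2)
print(candidates)
```

With these limits, only the thresholds 1.0 and 1.5 qualify: 0.5 misjudges slightly too much normal mail, and 2.0 and above detect less than 90% of spam.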



[b]6. Using Chinese_rules.cf[/b]

After downloading Chinese_rules.cf, place it in the directory where SpamAssassin stores its rules (usually /usr/share/spamassassin). The wget command to download it is as follows:

# wget -N -P /usr/share/spamassassin www.ccert.edu.cn/spam/sa/Chinese_rules.cf

Each time Chinese_rules.cf is updated, the SpamAssassin process must be restarted so that it reloads the rules. If you use spamd, the usual way to restart it is:

# ps -ax | grep spamd
Find the PID of the spamd process, then
# kill -HUP PID
If you use mimedefang, restart mimedefang instead. Assuming the mimedefang startup script is /etc/init.d/init-script, the command is as follows:

# /etc/init.d/init-script restart

CCERT updates the rule set and the corresponding scores weekly, using the spam samples handled by the CCERT anti-spam service over the preceding six months. Keeping Chinese_rules.cf up to date gives better filtering results. Placing the download command and the mimedefang restart command above in crontab and running them periodically provides automatic updates. For example, to update once a month, add the following line to root's crontab:

0 0 1 * * wget -N -P /usr/share/spamassassin www.ccert.edu.cn/spam/sa/Chinese_rules.cf; /etc/init.d/init-script restart


For more information, see:
http://www.ccert.edu.cn/spam/sa/Chinese_rules.htm

-------------------------------------------------------------


