Monday, 3 December 2012

Poison Attacks Against Machine Learning (SVM Poisoning)


With AI systems becoming more common, we have to start worrying about security. A network intrusion may be all the more serious if it is a neural net that is affected. New results indicate that it may be easier than we thought to provide data to a learning program that causes it to learn the wrong things.
If you like sci-fi you will have seen or read scenarios where the robot or computer, always evil, is defeated by being asked a logical problem that has no solution, or is distracted by being asked to compute Pi to a billion billion digits. The key idea is that, given machine intelligence, the trick to defeating it is to feed it the wrong data.

Security experts call the idea of breaking a system by feeding it the wrong data a poison attack, and it is a good description. Can poison attacks be applied to AI systems in reality?
Support Vector Machines (SVMs) are fairly simple learning devices. They use examples to make classifications or decisions. Although still regarded as an experimental technique, SVMs are used in security settings to detect abnormal behavior such as fraud and anomalous credit card use, and even to weed out spam.
SVMs learn by being shown examples of the sorts of things they are supposed to detect. Normally this training occurs once, before they are used for real. However, there are lots of situations in which the nature of the data changes over time. For example, spam changes its nature as spammers think up new ideas and change what they do in response to the detection mechanisms. As a result, it is not unusual for an SVM to continue to learn while it's doing the job for real, and this is where the opportunity for a poison attack arises.
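To make the "keeps learning on the job" idea concrete, here is a minimal sketch of a linear classifier that takes one hinge-loss update per incoming example. It is only an illustration: the class name OnlineLinearSVM, the two features and the tiny message stream are all made up for this example, and real deployments would use a proper SVM library.

```python
from typing import List, Sequence


class OnlineLinearSVM:
    """Linear classifier updated one example at a time on the hinge loss."""

    def __init__(self, n_features: int, learning_rate: float = 0.5,
                 reg: float = 0.0) -> None:
        self.w: List[float] = [0.0] * n_features
        self.b = 0.0
        self.learning_rate = learning_rate
        self.reg = reg  # L2 penalty strength; 0.0 disables weight shrinkage

    def score(self, x: Sequence[float]) -> float:
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def predict(self, x: Sequence[float]) -> int:
        return 1 if self.score(x) >= 0 else -1

    def partial_fit(self, x: Sequence[float], y: int) -> None:
        """Take one sub-gradient step for a single example, y in {-1, +1}."""
        margin = y * self.score(x)
        shrink = 1.0 - self.learning_rate * self.reg
        self.w = [wi * shrink for wi in self.w]
        if margin < 1:  # inside the margin or misclassified: move toward y
            self.w = [wi + self.learning_rate * y * xi
                      for wi, xi in zip(self.w, x)]
            self.b += self.learning_rate * y


# Hypothetical features: [suspicious-word flag, known-contact flag]; +1 = spam.
model = OnlineLinearSVM(n_features=2)
stream = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 0.0], 1), ([0.0, 1.0], -1)]
for features, label in stream:
    model.partial_fit(features, label)  # the model shifts with every message
    print(model.w, model.b)
```

Every call to partial_fit moves the weights, so whoever can get examples into that stream has a say in what the model becomes.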
Three researchers, Battista Biggio (Italy), Blaine Nelson and Pavel Laskov (Germany), have found a way to feed an SVM with data specially designed to increase the error rate of the machine as much as possible with only a few data points.
The approach assumes that the attacker knows the learning algorithm being employed and has access to the same data. Less realistically, it assumes that the attacker has access to the original training data. This is unlikely, but the original training data could be approximated by a sample from the population.
With all of this data, the attacker can manipulate the optimal SVM solution by inserting crafted attack points. As the researchers say:
"the proposed method breaks new ground in optimizing the impact of data-driven attacks against kernel-based learning algorithms and emphasizes the need to consider resistance against adversarial training data as an important factor in the design of learning algorithms."
What they discovered is that their method was capable of having a surprisingly large impact on the performance of the SVMs tested. They also point out that it could be possible to direct the induced errors so as to produce particular types of error. For example, a spammer could send some poisoned data so as to evade detection in the future. The biggest practical difficulty in using such methods is that, in most cases, the attacker doesn't control the labeling of the data points (i.e. spam or not spam) used in the training. A custom solution would have to be designed to compromise the labeling algorithm.
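The effect of a single crafted point can be sketched with a toy model. This is not the researchers' method: it uses a one-dimensional, hard-margin SVM, where the boundary is simply the midpoint between the closest opposite-class points, and a brute-force search over candidate positions stands in for a real optimization. All numbers, names and candidate positions are invented for illustration, and the sketch assumes the injected point ends up labeled as ham, which is exactly the labeling difficulty described above.

```python
from typing import List, Tuple

Point = Tuple[float, int]  # (feature value, label); -1 = ham, +1 = spam


def fit_threshold(points: List[Point]) -> float:
    """1-D hard-margin SVM: the boundary is the midpoint between the closest
    opposite-class points. Assumes every -1 point lies below every +1 point."""
    highest_neg = max(x for x, y in points if y == -1)
    lowest_pos = min(x for x, y in points if y == 1)
    if highest_neg >= lowest_pos:
        raise ValueError("classes are not separable in this toy model")
    return (highest_neg + lowest_pos) / 2


def error_rate(threshold: float, points: List[Point]) -> float:
    """Fraction of points misclassified by 'spam if x > threshold'."""
    wrong = sum(1 for x, y in points if (1 if x > threshold else -1) != y)
    return wrong / len(points)


# Made-up training and validation data.
train = [(1.0, -1), (2.0, -1), (3.0, -1), (7.0, 1), (8.0, 1), (9.0, 1)]
validation = [(2.5, -1), (3.5, -1), (4.5, -1), (5.5, 1), (6.5, 1), (7.5, 1)]

clean_threshold = fit_threshold(train)
clean_error = error_rate(clean_threshold, validation)

# Attacker: try positions for ONE injected point (assumed to be labeled ham)
# and keep the position that hurts validation accuracy the most.
candidates = [3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5]


def damage(x_attack: float) -> float:
    return error_rate(fit_threshold(train + [(x_attack, -1)]), validation)


best_attack = max(candidates, key=damage)
poisoned_threshold = fit_threshold(train + [(best_attack, -1)])
poisoned_error = error_rate(poisoned_threshold, validation)

print("clean:", clean_threshold, clean_error)
print("poisoned:", best_attack, poisoned_threshold, poisoned_error)
```

In this toy the injected point becomes the new support vector on the ham side, dragging the boundary toward the spam class so that borderline spam slips through, which is the "evade detection in the future" scenario in miniature.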
It seems that hacking might be about to get even more interesting.
from http://www.i-programmer.info/news/105-artificial-intelligence/4526-poison-attacks-against-machine-learning.html
--------------------------------------------------------------------
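It suddenly strikes me that what started a few years ago has now reached another turning point.
Six years ago, knowing how to use squid was enough to get around the restrictions easily; then came socks, encrypted socks and DNS restoration, and after that ssh and openvpn.
Now the blocking side has started using machine learning.
So this becomes a question of strategy:
  1. Blocking research targets public protocols, and it is a black box.
  2. Every circumvention technique gets fed back into that blocking black box. The adversary works in the dark while the public is out in the open, which is clearly not an advantageous position.
So any publicly known circumvention strategy will eventually be defeated by blocking technology.
I think publicly sharing hosts files and similar tricks is plain idiotic.
Circumvention in the future will have to be done by underground groups with one-way contact, no overlap between them, and tunnels over custom binary protocols that change frequently.
Only that way does it stand a chance.
btw, I also think everyone should study SVM poisoning. It will be a necessary survival skill in the future.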
-----------------------------------------------------------------------
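The president of Beijing University of Posts and Telecommunications (BUPT) is known as the father of the Great Firewall. Although he says he has stepped back from its day-to-day management, he undoubtedly still has a major influence on the firewall's technical development. A Google Scholar search shows that President Fang kept writing throughout the year now drawing to a close and is a named author on more than ten papers. Some of them deserve attention, for example "A 1/(1-1/k)-Approximation Algorithm for Regular Expression Grouping", "Research on Early Warning of Online Public Opinion in Unconventional Crisis Events Based on Bayesian Network Modeling (PDF)", "A Density-Estimation-Based Method for Mining Feature Clusters in Social Networks", "Predictive Modeling of the Number of News Topics for New-Media Events", and "Network Traffic Classification: Research Progress and Outlook (PDF)". The last paper surveys several traffic classification methods that are still at the experimental stage, including classification based on host behavior (suited to backbone networks) and classification based on machine learning, among them support vector machines (SVMs) with accuracy reaching 95% and neural network analysis; machine learning can also analyze encrypted traffic. The paper also mentions the changing encrypted traffic of Skype and the obfuscated traffic of Tor, and notes that real-time traffic classification in high-speed network environments is very challenging.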