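Monitoring My Phone: Where Does All the Data Go?
"People today are almost completely transparent. I think to myself, Ma Huateng must be looking at our WeChat every day, because he can see all of it, whenever he likes. These problems are very serious." (Li Shufu)
The phone we use every day may be busier than we imagine. While we chat on WeChat, shop on Taobao, watch videos on Douyin, or even when the darn phone sits on standby doing nothing at all, some apps quietly exchange data with their servers. This data includes WeChat chat history, location, contacts, call logs, QQ messages, and even the contents of text messages...
I have long wanted to know where my data goes. Which apps keep uploading data? Which companies collect it?
A while ago I read a blog post, "Tracking my phone's silent connections", in which the author, Kushal, routed his phone through a WireGuard VPN for one week, captured every request between the phone and remote servers, and then tallied which companies' servers the phone had been quietly connecting to.
Inspired by Kushal, I decided to capture my own phone's traffic by deploying ss (Shadowsocks).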
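Monitoring setup
Equipment
- An Android phone in daily use x1
- A cloud server hosted in China x1
Proxy approach
The phone talks to many different servers. How can all of those connections be captured? My first thought was that the phone reaches the internet through a Wi-Fi router, so capturing packets on the router would be fairly easy. That approach, however, misses the phone's cellular traffic.
So I set up a proxy service on a cloud server and configured the phone's client to send all of its traffic through that server in global proxy mode. Every request can then be observed on the server side.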
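Deploying the service
To keep browsing fast and the connection pleasant to use, I recommend choosing a server inside China. First, install Docker on the proxy server:
$ sudo apt-get -y install docker.io
Starting the ss Docker container
According to the ss documentation, starting ss with the -v flag (verbose mode) is all it takes to get a detailed log. tmux keeps the service running in the background, and the output is appended (>>) to logs.txt.
$ tmux
$ sudo docker run -t --name ss -p 9000:9000 mritd/shadowsocks -s "-s 0.0.0.0 -p 9000 -m aes-256-cfb -k yourpassword --fast-open -v" >> logs.txt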
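Phone client
Install an ss or ShadowsocksR (SSR) client on the phone, enter the proxy server's address, port, password, and encryption method, and set the proxy mode to global.
Then, on the server, use tail -f to follow the log file in real time. If entries appear, the service is working.
$ tail -f logs.txt
When WeChat is used on the phone, its connections show up in the log.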
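To turn logs.txt into something that can be counted, each line has to be reduced to a timestamp, a host, and a port. The sketch below assumes a verbose log line of the form "2019-03-01 08:15:02 INFO: connect to host:port"; that shape is an assumption, not something shown in this article, so check it against your own log and adjust the pattern. The names parse_line and parse_log are made up for illustration.

```python
import re
from datetime import datetime

# Assumed line shape (not taken from the article), for example:
#   " 2019-03-01 08:15:02 INFO: connect to szshort.weixin.qq.com:443"
# IPv6 literals are not handled by this simple pattern.
LINE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
    r".*?connect to (?P<host>[^\s:]+):(?P<port>\d+)"
)

def parse_line(line):
    """Return a dict with time, host and port, or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "time": datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S"),
        "host": m.group("host"),
        "port": int(m.group("port")),
    }

def parse_log(lines):
    """Parse an iterable of log lines, silently dropping lines that do not match."""
    return [rec for rec in map(parse_line, lines) if rec is not None]
```

With a real file this would be called as parse_log(open("logs.txt", encoding="utf-8", errors="replace")).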
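Data processing
DNS resolution
DNS (Domain Name System) is a distributed database that maps domain names and IP addresses to each other. Most of the captured records are domain names, so they first have to be resolved to IP addresses.
import socket

def domain_to_ip(domain):
    # Resolve a domain name to one IPv4 address using the system resolver.
    return socket.gethostbyname(domain)
For example, resolving the IP address of www.baidu.com:
>>> domain_to_ip('www.baidu.com')
'14.215.177.38'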
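With hundreds of thousands of records, the same domain appears again and again, and some names no longer resolve. A minimal sketch of a batch resolver that looks each distinct domain up once and records failures as None follows. The function name resolve_all and the injectable resolver parameter are illustrative choices, not part of the original code.

```python
import socket

def resolve_all(domains, resolver=socket.gethostbyname):
    """Resolve each distinct domain once; failed lookups map to None."""
    cache = {}
    for domain in domains:
        if domain in cache:
            continue  # already resolved (or already failed)
        try:
            cache[domain] = resolver(domain)
        except OSError:  # socket.gaierror is a subclass of OSError
            cache[domain] = None
    return cache
```

Passing a different resolver makes the function easy to try without network access. Note that a name resolved today may map to a different address than it did when the log was recorded.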
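IP geolocation database
I recommend ip2region, an open-source library that maps IP addresses to regions. It claims 99.9% accuracy and offers three fast query algorithms: binary search, B-tree, and in-memory search.
>>> result = ipgeo.find('www.baidu.com')
>>> print(result)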
{'ip': '14.215.177.38', 'city_id': 2140, 'country': '中国', 'province': '广东省', 'city': '广州市', 'operator': '电信'}
Saving the data
df.to_csv(out_csv, index=False)
print('saved to {}'.format(out_csv))
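Data visualization
After more than ten days of logging, I had collected 280059 records in total.
Next, I visualize the data with Pyecharts. Echarts is an open-source JavaScript data visualization library from Baidu, and Pyecharts is a Python library for generating Echarts charts.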
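Major internet companies
The request counts show that my Android phone (with Google services installed), on a network inside China, still sent the most requests to Google.
Next come WeChat and QQ, which I use every day. I also watch videos on Bilibili, so Bilibili ranks third, orz...
My phone has QQ Input Method installed, yet there were 1952 requests to sougou.com. Only after reading the user agreement did I find that "QQ Input Method" is client software published by Sogou with Tencent's approval.
Apps such as Meituan and Amap (Gaode Maps) are ones I rarely use, yet their network requests are unusually active. I have no idea what they are quietly up to.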
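A per-company ranking like this one comes down to mapping each requested host to an owner and counting. The sketch below is one way to do that with only the standard library. The suffix table is hypothetical and deliberately tiny, and it is not the mapping used by the author's code.

```python
from collections import Counter

# Hypothetical suffix-to-company table; extend it to match your own logs.
COMPANY_SUFFIXES = {
    "google.com": "Google",
    "googleapis.com": "Google",
    "qq.com": "Tencent",
    "bilibili.com": "Bilibili",
}

def company_of(host):
    """Map a host to a company by domain suffix, or 'other' if unknown."""
    host = host.lower().rstrip(".")
    for suffix, company in COMPANY_SUFFIXES.items():
        # Match the domain itself or any subdomain, but not look-alikes
        # such as "notqq.com".
        if host == suffix or host.endswith("." + suffix):
            return company
    return "other"

def count_by_company(hosts):
    """Count requests per company for an iterable of host names."""
    return Counter(company_of(h) for h in hosts)
```

Calling count_by_company(hosts).most_common() then gives the ranking that can be fed to a bar chart.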
Night-time activity ranking
Filtering for activity between 00:00 and 06:00 shows that connections to *.qq.com are consistently the most numerous.
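The night-time filter can be sketched as follows, assuming each record is a (datetime, host) pair in local time. Treating 06:00 itself as outside the window is an assumption, and the function names are illustrative.

```python
from collections import Counter
from datetime import datetime, time

def is_night(ts, start=time(0, 0), end=time(6, 0)):
    """True if ts falls in the half-open window [start, end)."""
    return start <= ts.time() < end

def night_ranking(records, top=10):
    """records: iterable of (datetime, host) pairs; returns the busiest hosts at night."""
    hosts = (host for ts, host in records if is_night(ts))
    return Counter(hosts).most_common(top)
```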
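Global distribution
Distribution across Chinese provinces
My traffic mostly goes to places such as Guangdong, Shanghai, and Beijing. Taiwan ranks so high because Google's servers are located there: DNS lookups for Google's domains all pointed to Taiwan.
Telecom operators
Server port statistics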
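A port tally can be produced from "host:port" strings with a few lines of standard-library Python. This is a sketch under the assumption that the destinations are available in that form. The function name is made up for illustration.

```python
from collections import Counter

def port_counts(endpoints):
    """Count destination ports from 'host:port' strings; skip malformed entries."""
    counts = Counter()
    for endpoint in endpoints:
        # rpartition splits on the last colon, so the port is always the tail.
        host, sep, port = endpoint.rpartition(":")
        if sep and host and port.isdigit():
            counts[int(port)] += 1
    return counts
```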
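Other
Among the OnePlus phone's network requests I found some that were sent to oppo servers. Apparently oppo does contract work not only on the hardware but on the software too.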
[('epoch.cdo.oppomobile.com', 208),
('gslb.cdo.oppomobile.com', 38),
('istore.oppomobile.com', 38),
('opsapi.store.oppomobile.com', 34),
('api.cdo.oppomobile.com', 22),
('message.pull.oppomobile.com', 21),
('st.pull.oppomobile.com', 13),
('cdopic0.oppomobile.com', 9),
('newds01.myoppo.com', 9),
('httpdns.push.oppomobile.com', 4),
('conn1.oppomobile.com', 1),
('iopen.cdo.oppomobile.com', 1)]
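Finally
Li Shufu, founder and chairman of Geely Holding Group, once said: "People today are almost completely transparent. I think to myself, Ma Huateng must be looking at our WeChat every day, because he can see all of it, whenever he likes. These problems are very serious."
Full code
https://github.com/wangshub/tracking-my-phone
- For more detailed data, consider using mitmproxy as the proxy. It can capture HTTPS traffic and provides a Python API.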
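As a rough idea of what the mitmproxy route could look like, here is a minimal addon sketch that tallies requests per destination host. It relies on mitmproxy's request and done event hooks and on the pretty_host attribute of a request. The class name and the top-20 cut-off are arbitrary choices for illustration, and this is not part of the author's repository.

```python
from collections import Counter

class HostCounter:
    """mitmproxy addon: tally requests per destination host."""

    def __init__(self):
        self.hosts = Counter()

    def request(self, flow):
        # mitmproxy calls this hook once per client request.
        # pretty_host prefers the Host header over the raw address.
        self.hosts[flow.request.pretty_host] += 1

    def done(self):
        # Called when mitmproxy shuts down: print the busiest hosts.
        for host, n in self.hosts.most_common(20):
            print(host, n)

# mitmproxy loads addons from this module-level list.
addons = [HostCounter()]
```

Saved under a file name of your choice (host_counter.py here is hypothetical), it would be loaded with mitmdump -s host_counter.py.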
Reference links
- Tracking my phone's silent connections
- ip2region: Ip2region is an offline IP location library
- Python Data Analysis Library
- Pyecharts: A Python Echarts Plotting Library.
from https://github.com/wangshub/tracking-my-phone
------
Tracking my phone's silent connections
My phone has more friends than I do. It talks to more peers (computers) than the number of human beings I talk to on average. In this age of smartphones and mobile apps for everything from A to Z, we depend on these technologies. At the same time, we don't know much about what goes on inside these computers, equipped with powerful cameras, GPS, and microphones, that we carry all the time. All these apps talk to their respective servers (or can we call them masters?), but there is no easy way to track them.
These questions bothered me for a long time: I wanted to see the servers my phone is connecting to, and I wanted to block those connections as I wish. However, I never managed to work on this. A few weeks ago, I finally sat down to build a system, reusing already available open source projects and tools, that will allow me to track what my phone is doing. Maybe not in full detail, but it should at least shed some light on the network traffic from the phone.
Initial trial
I tried to create a wifi hotspot at home using a Raspberry Pi and then started capturing all the packets from the device using standard tools (dumpcap) and later reading through the logs using Wireshark. This procedure meant that I could only capture when I am connected to the network at home. What about when I am not at home?
Next round
This time I took a slightly different approach. I chose algo to create a VPN server. Using WireGuard, it became straightforward to connect my iPhone to the VPN. This setup also makes it very easy to capture all the traffic from the phone on the VPN server. A few days into the experiment, Kashmir started posting her experiment named Life Without the Tech Giants, in which she blocked all the services from 5 big technology companies. With her help, I contacted Dhruv Mehrotra, the technologist behind the story. After talking to him, I felt that I was going in the right direction. He already posted details on how they did the blocking, and you can try that at home :)
Looking at the data after 1 week
After capturing the data for the first week, I moved the captured pcap files onto my computer and wrote some Python code to load the data into a SQLite database, which lets me query it much faster.
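The post does not show that loading code. As a rough sketch of the idea, the snippet below stores already-extracted connection tuples in SQLite using the standard sqlite3 module. The table layout, column names, and function name are assumptions for illustration, and reading the pcap files themselves is left out.

```python
import sqlite3

def load_connections(rows, path=":memory:"):
    """Store (timestamp, dst_ip, dst_port, protocol) tuples in a SQLite table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS connections ("
        "ts TEXT, dst_ip TEXT, dst_port INTEGER, protocol TEXT)"
    )
    conn.executemany("INSERT INTO connections VALUES (?, ?, ?, ?)", rows)
    # An index on the destination speeds up per-server aggregation.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_dst ON connections (dst_ip)")
    conn.commit()
    return conn
```

A per-destination count is then a single query: SELECT dst_ip, COUNT(*) FROM connections GROUP BY dst_ip ORDER BY COUNT(*) DESC.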
Domain Name System (DNS) data
The Domain Name System (DNS) is a decentralized system which translates human-memorable domain names (like kushaldas.in) into Internet Protocol (IP) addresses (like 192.168.1.1). Computers talk to each other using these IP addresses, so we don't have to remember all those addresses ourselves. When developers build their applications for the phone, they generally use domain names to specify where the app should connect.
If I plot all the different domains (including any subdomain) which got queried at least 10 times in a week, we see the following graph.
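The "at least 10 times" cut can be expressed in a few lines. This is a sketch, assuming the queried names have already been extracted from the capture into a list of strings. The function name and the normalization (lower-casing, stripping the trailing dot) are illustrative choices.

```python
from collections import Counter

def frequent_domains(queried_names, min_count=10):
    """Return (name, count) pairs for names queried at least min_count times."""
    # DNS names are case-insensitive and may carry a trailing root dot.
    counts = Counter(name.lower().rstrip(".") for name in queried_names)
    return [(name, n) for name, n in counts.most_common() if n >= min_count]
```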
The first thing to notice is how the phone is trying to find servers from Apple, which makes sense as this is an iPhone. I use the mobile Twitter app a lot, so we also see many queries related to Twitter. Lookout deserves a special mention: it was suggested to me by friends who understand these technologies and security better than I do. Google takes the 3rd position. I do sometimes watch YouTube videos, but the phone also queried many other Google domains.
There are also many queries to the Akamai CDN service, and I could not find any easy way to identify those hosts. The same goes for Amazon AWS related hosts. If you know any better way, please drop me a note.
You can see that a lot of data analytics companies were also queried. dev.appboy.com is a major one, and thankfully algo already blocks that domain at the DNS level. I don't know which app is trying to connect to which servers. I found out about a few of the apps on my phone by searching the client lists of the above-mentioned analytics companies. In the coming months, I will start blocking those hosts/domains one by one and see which apps stop working.
Looking at data flow
The number of DNS queries is an easy start, but next I wanted to learn more about the actual servers my phone is talking to. The paranoid part of me was pushing to discover these servers.
If we plot all of the major companies the phone is talking to, we get the following graph.
Apple leads the chart with 44% of all the connections, 495225 in total. Twitter is in second place, and Edgecastcdn is in third. My phone talked to Google servers 67344 times, roughly one seventh as often as it talked to Apple's.
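The figures quoted above can be cross-checked with a little arithmetic. The sketch below uses only the numbers from the post. The implied total is approximate because the 44% share is itself rounded.

```python
apple = 495225       # connections to Apple (from the post)
google = 67344       # connections to Google (from the post)
apple_share = 0.44   # Apple's stated share of all connections

ratio = apple / google                  # how many times more often than Google
implied_total = apple / apple_share     # rough total number of connections
google_share = google / implied_total   # Google's implied share

print(f"Apple/Google ratio: {ratio:.2f}")
print(f"Implied total connections: {implied_total:,.0f}")
print(f"Implied Google share: {google_share:.1%}")
```

The ratio comes out a little above 7, which matches the "7 times" in the text, and Google's implied share is about 6%.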
In the next graph, I removed the big players (including Google and Amazon). Then, I can see that analytics companies like nflxso.net and mparticle.com have 31% of the connections, which is a lot. Most probably I will start by blocking these two. The 3 other CDN companies, Akamai, Cloudfront, and Cloudflare, have 8%, 7%, and 6% respectively. Do I know what these companies are tracking? Nope, and that is scary enough that one of my friends commented “It makes me think about throwing my phone in the garbage.”
What about encrypted vs unencrypted traffic? Which protocols are being used? I tried to answer the first question, and the answer looks like the following graph. Maybe the number will come down if I refine the query and add other parameters. That is a future task.
What next?
As I said earlier, I am working on a set of tools that can be deployed on the VPN server and that will give people a user-friendly way to monitor and block/unblock traffic from their phones. The major part of the work is to make sure that the whole thing is easy to deploy and can be used by someone with less technical knowledge.
How can you help?
The biggest thing we need is the knowledge of “How to analyze the data we are capturing?”. It is one thing to make reports for personal use, but trying to help others is an entirely different game. We will, of course, need all sorts of contributions to the project. Before anything else, we will have to bring the scattered code we have into a proper project structure. Keep following this blog for more updates and details about the project.
Note to self
Do not try to read data after midnight, or else I will again mistake a local address for some random dynamic address in Bangkok and freak out (thank you reverse-dns).
from https://kushaldas.in/posts/tracking-my-phone-s-silent-connections.html