Mobile apps are notorious for collecting a wealth of private information from users. Despite significant effort from the research community in developing privacy leak detection tools based on data flow tracking inside the app or through network traffic analysis, it is still unclear whether apps and ad libraries can hide the fact that they are leaking private information. In fact, all existing analysis tools have limitations: data flow tracking suffers from imprecisions that cause false positives, as well as false negatives when the data flow from a source of private information to a network sink is interrupted; on the other hand, network traffic analysis cannot handle encryption or custom encoding.
We propose a new approach to privacy leak detection that is not affected by such limitations, and it is also resilient to obfuscation techniques, such as encoding, formatting, encryption, or any other kind of transformation performed on private information before it is leaked. Our work is based on black-box differential analysis, and it works in two steps: first, it establishes a baseline of the network behavior of an app; then, it modifies sources of private information, such as the device ID and location, and detects leaks by observing deviations in the resulting network traffic.
The basic concept of black-box differential analysis is not novel, but, unfortunately, it is not practical enough to precisely analyze modern mobile apps. In fact, their network traffic contains many sources of non-determinism, such as random identifiers, timestamps, and server-assigned session identifiers, which, when not handled properly, cause too much noise to correlate output changes with input changes.
The main contribution of this work is to make black-box differential analysis practical when applied to modern Android apps. In particular, we show that the network-based non-determinism can often be explained and eliminated, and it is thus possible to reliably use variations in the network traffic as a strong signal to detect privacy leaks. We implemented this approach in a tool, called Agrigento, and we evaluated it on more than one thousand Android apps. Our evaluation shows that our approach works well in practice and outperforms current state-of-the-art techniques. We conclude our study by discussing several case studies that show how popular apps and ad libraries currently exfiltrate data by using complex combinations of encoding and encryption mechanisms that other approaches fail to detect. Our results show that these apps and libraries seem to deliberately hide their data leaks from current approaches and clearly demonstrate the need for an obfuscation-resilient approach such as ours.