Guidelines for Protecting User Privacy in WIDE Traffic Traces

October 14, 1999
The WIDE Project, MAWI Working-Group

The WIDE Project has handled requests to collect traffic data on the
WIDE network. However, we currently handle raw data and lack a
systematic way to protect user privacy. This guideline defines
procedures for removing private information from traffic traces in
order to produce data sets that are easy to handle. The WIDE Project
intends to make a series of traffic data sets publicly available via
anonymous FTP.

In short, current traffic traces are poisonous and therefore must be
handled with care and used only under restrictions. We are trying to
make traffic traces easier to handle by removing the poisonous
elements from the data sets.

1. Removing Privacy Information

Motivation

There is always a risk of accidents when we handle raw traffic data
that includes private information. In addition, standardizing the
procedures for providing traffic traces considerably reduces the work
required of both providers and users.

The goal of these procedures is to provide generic data sets that are
useful for many research areas without jeopardizing user privacy.
(Requests for specific detailed information are handled case by case,
as they have been in the past.)

Traffic data conforming to this guideline can be used for research
without permission from the WIDE Project.

Protecting User Privacy

There are two issues regarding user privacy.

1. Removing user data:
   Leave only protocol headers and remove protocol payload, which
   could contain user data.
2. Providing anonymity:
   Scramble addresses that could be used to identify a user.

See the Appendices for details.

2. Making WIDE traffic traces open to the public

Motivation

It is essential to the Internet research community that the latest
traffic data sets be easily accessible to researchers. Research based
on open data sets can be confirmed or analyzed further by others,
which leads to deeper studies.
WIDE is unique in that the WIDE network is a large testbed that still
carries real user traffic. Other networks carry commercial traffic
these days and thus will not be able to make their traces open to the
public. WIDE networks also carry new or experimental technologies such
as IPv6 and DiffServ.

Use of trace data

Traffic data may be used only for research purposes; any other use is
prohibited. Care should be taken not to infringe on users' privacy.

Target Traffic

Protocol: IP version 4 and IP version 6.
Sampling points: several points within the WIDE backbone.
(We will not provide information about their specific locations.)

Appendix A  Rules for Removing Payload

As a general rule, remove the payload of TCP and UDP, which contains
users' private information. If another protocol header is carried on
top of a TCP or UDP header and that inner header does not contain
private user information, the inner header may be kept. If it is
difficult to judge whether a header contains private user information,
the header should be removed as a precaution.

Appendix B  Rules for Address Scrambling

Provide anonymity to individuals and organizations by scrambling the
source and destination addresses in IP headers.

There are two levels of address scrambling. Choose an appropriate
method according to the purpose.

(1) Full scrambling
    Scramble IP addresses by mapping each IP address to another IP
    address using a hash function.
(2) Scrambling with address prefix preserved
    When two IP addresses share a common address prefix, they are
    mapped to addresses that share a common prefix of the same length.

Note that, although method (2) preserves routing information, it
carries a risk of being reverse-engineered (e.g., using well-known
server addresses as a clue). Choose an appropriate method according to
the importance of anonymity in the trace and the purpose of the data
set.

Exceptional addresses

Addresses not containing user identifiers may be left without scrambling.
Those addresses include broadcast addresses, multicast addresses, and
private addresses. In the case of IPv6, link-local addresses and
site-local addresses could contain user MAC addresses, and
solicited-node multicast addresses contain the low-order bits of
global addresses. Therefore, these addresses should be scrambled as
well.

IP addresses contained in upper-layer headers

IP addresses can also appear in upper-layer protocol headers (e.g.,
ICMP, DNS). These addresses must be scrambled in the same manner, or
removed.

MAC addresses

Many link-layer headers (e.g., Ethernet headers) contain MAC
addresses. A MAC address contains vendor and model information, which
could be part of user privacy or lead to a security hole. However, a
trace contains only the MAC addresses of machines directly connected
to the segment, and backbone networks usually have no ordinary users
on the local segment. As long as there are no ordinary users on the
local segment, MAC addresses may be left unscrambled, because this
guideline concerns only user privacy. (The privacy or security of
network nodes is outside the scope of this guideline.)

Unit of address scrambling

There are several choices for the scope over which an address mapping
is kept consistent.

(1) An address is mapped to the same address only within a single TCP
    session in a single data set.
(2) All occurrences of an address are mapped to a single address
    within a data set.
(3) All occurrences of an address are mapped to a single address even
    across different data sets.

Wider consistency is convenient for users, but it also makes reverse
engineering easier. Method (2) is recommended.

IP/TCP options

IP options can contain IP addresses. Addresses in IP options should be
scrambled in the same manner as other addresses. Otherwise, IP options
should be replaced by NOP options, or removed. TCP options, on the
other hand, do not contain private information. Because TCP options
carry information that is useful for analyzing TCP behavior, it is
recommended to leave them intact.
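
The two scrambling methods and the recommended per-data-set
consistency can be illustrated with a short Python sketch. This is
only an illustration under stated assumptions and is not the algorithm
used by wide-tcpdpriv: the function names (new_dataset_key,
full_scramble, prefix_scramble), the use of HMAC-SHA256 as the keyed
hash, and the explicit list of exempt IPv4 ranges are all hypothetical
choices made for this example. The sketch handles IPv4 only.

```python
import hashlib
import hmac
import ipaddress
import secrets

# Assumed list of exceptional IPv4 ranges that are left unscrambled:
# private networks, multicast, and the limited broadcast address.
EXEMPT_NETS = [ipaddress.ip_network(n) for n in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
    "224.0.0.0/4", "255.255.255.255/32")]


def new_dataset_key() -> bytes:
    """Return a fresh random key; one key per data set gives consistency (2)."""
    return secrets.token_bytes(32)


def is_exempt(ip: ipaddress.IPv4Address) -> bool:
    """True if the address falls in one of the assumed exempt ranges."""
    return any(ip in net for net in EXEMPT_NETS)


def full_scramble(addr: str, key: bytes) -> str:
    """Method (1): keyed hash of the whole address, truncated to 32 bits."""
    ip = ipaddress.IPv4Address(addr)
    if is_exempt(ip):
        return str(ip)
    digest = hmac.new(key, ip.packed, hashlib.sha256).digest()
    return str(ipaddress.IPv4Address(digest[:4]))


def prefix_scramble(addr: str, key: bytes) -> str:
    """Method (2): flip each bit according to a keyed hash of the bits before it."""
    ip = ipaddress.IPv4Address(addr)
    if is_exempt(ip):
        return str(ip)
    bits = format(int(ip), "032b")
    out = 0
    for i in range(32):
        # The flip for bit i depends only on input bits 0..i-1, so two
        # addresses sharing a k-bit prefix receive identical flips for
        # their first k+1 bits and keep a common prefix of exactly k bits.
        flip = hmac.new(key, bits[:i].encode(), hashlib.sha256).digest()[0] & 1
        out = (out << 1) | (int(bits[i]) ^ flip)
    return str(ipaddress.IPv4Address(out))


if __name__ == "__main__":
    key = new_dataset_key()  # discard after the data set has been processed
    for a in ("192.0.2.1", "192.0.2.200", "10.1.2.3"):
        print(a, full_scramble(a, key), prefix_scramble(a, key))
```

Because the same key is used for every packet of one data set and a
new key is drawn for the next, all occurrences of an address map to a
single address within a data set but not across data sets. Two
limitations of the sketch should be kept in mind: the truncated hash
in full_scramble can map two different addresses to the same value,
and a scrambled address can happen to fall inside one of the exempt
ranges.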

Timestamp

As a general rule, time information should not be modified.

File Size

If a data set is too large, it is recommended to divide it into
smaller files (about 100 MB each).

Exceptional cases

Sometimes it is necessary to access raw addresses in order to analyze
routing protocols or DNS. Such requests will be handled case by case.
Even then, care should be taken not to put user privacy at risk, for
example by extracting only the packets that use the target protocol.

Standard Tool

wide-tcpdpriv is our standard tool for removing private information
from a tcpdump output file. wide-tcpdpriv is derived from tcpdpriv,
written by Greg Minshall, with the default settings changed to meet
the above requirements.

wide-tcpdpriv is available from:
    ftp://ftp.csl.sony.co.jp/pub/kjc/wide-tcpdpriv.tar.gz

Usage:
    % tcpdpriv [-w outputfile] [-r inputfile]
or
    % tcpdpriv < inputfile > outputfile

Appendix C  Traffic Data Description Format

Traffic Data Description Format

Description:
    (A general description of the traffic trace, e.g., "one hour of
    all TCP traffic between the University of Southern California and
    the rest of the world"; "HTTP server logs for a departmental
    server".)

Data Format:
    [ ] tcpdump binary
    [ ] tcpdump ascii
    [ ] other (            )

Measurement Information:
    Start Date and Time:
    Duration:       hours       minutes
    Contact information:
        e-mail:
    Other measurement details: (if any)

Protocol:
    [ ] IPv4
    [ ] IPv6
    [ ] other (            )

Privacy:
    [ ] wide-tcpdpriv default setting
    [ ] other
        payload deletion:
            [ ] TCP/UDP payload deleted
            List of protocols whose headers are not deleted
            [                    ]
        address scrambling method:
            [ ] no scrambling
            [ ] full scrambling
            [ ] prefix preserved
            [ ] other (            )
        address mapping consistency:
            [ ] session only
            [ ] (subdivided) file
            [ ] entire data set
            [ ] other (            )

Restrictions:
    (Whether the trace may be redistributed without permission, who to
    contact for permission.
    All traces in the archive are unrestricted as to what use may be
    made of them (for example, there is no requirement that
    simulations made using the traces be published in the open
    literature).)
    [ ] redistributable
    [ ] other (described below)

Distribution:
    File Information
        compressed size:          Bytes
        uncompressed size:        Bytes
        compression method: [ ] gzip  [ ] other (            )
        if data set is divided into multiple files:
            number of files:
            average file size:
    URL:

Acknowledgments:
    (Who captured the trace, how the trace should be acknowledged in
    publications, who to contact with questions regarding the trace.)

Publications:
    (Publications available that have already studied this trace, if
    any.)

Related:
    (Available related software and traces, if any.)