Guidelines for Protecting User Privacy in WIDE Traffic Traces

				October 14, 1999
				The WIDE Project, MAWI Working-Group

	The WIDE Project has dealt with requests for collecting
	traffic data at a WIDE network.  However, we currently handle
	raw data and lack a systematic way to protect user privacy.
	In this guideline, we define procedures to remove privacy
	information from traffic traces in order to make
	easy-to-handle data sets.

	The WIDE Project intends to make a series of traffic data sets
	publicly available via anonymous FTP.
	
		In short, current traffic traces are poisonous, and
		thus must be handled with care and used only under
		restrictions.
		We are trying to make traffic traces easier to handle
		by removing the poisonous factors from data sets.

1. Removing Privacy Information

Motivation

	There is always a risk of accidents when we handle raw traffic
	data including privacy information.
	Also, standardizing procedures for providing traffic traces
	considerably reduces required work for both providers and
	users.

	The goal of these procedures is to provide generic data sets
	useful for many research areas without jeopardizing user
	privacy.  
	(Requests for specific detailed information are handled case
	by case, as they have been in the past.)

	Traffic data conforming to this guideline can be used for
	research without permission from the WIDE Project.

Protecting User Privacy

	There are 2 issues regarding user privacy.

	1. Removing user data:
		Leave only protocol headers and remove protocol
		payload which could contain user data.
	2. Providing anonymity:
		Scramble addresses which could be used to identify a
		user.

	See Appendices for details.

2. Making WIDE traffic traces open to the public

Motivation

	It is essential to the Internet research community that the
	latest traffic data sets be easily accessible to researchers.

	Research based on open data sets can be confirmed or analyzed
	further by someone else, which leads to deeper studies.

	WIDE is unique in that the WIDE network is a large testbed
	that still carries real user traffic.  Other networks carry
	commercial traffic these days, and thus will not be able to
	make their traces open to the public.
	WIDE networks also carry new or experimental technologies such 
	as IPv6 and DiffServ.

Use of trace data

	Traffic data can be used only for research purposes.
	Use for any purpose other than research is prohibited.
	Care should be taken not to trespass upon users' privacy.

Target Traffic

	Protocol: IP version 4 and IP version 6.

	Sampling points: several points within the WIDE backbone.
		(we will not provide information regarding their
		specific places.)

Appendix A  Rules for Removing Payload

	As a general rule, remove TCP and UDP payloads, which may
	contain users' private information.

	If another protocol header exists on top of a TCP or UDP
	header and the inner header does not contain private user
	information, the inner header may be kept.
	If it is difficult to judge whether a header contains private
	user information or not, the header should be removed as a
	precaution.
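
	The general rule above can be sketched as follows.  This is
	only an illustration, not the wide-tcpdpriv implementation:
	the function name strip_payload is hypothetical, the input is
	assumed to be a well-formed IPv4 packet starting at the IP
	header, and the handling of protocols other than TCP and UDP
	(keep only the IP header) is an assumed precautionary policy.
	Length fields in the headers are left untouched.

```python
TCP, UDP = 6, 17  # IP protocol numbers


def strip_payload(ip_packet: bytes) -> bytes:
    """Keep the IPv4 header and the TCP/UDP header; drop what follows."""
    ihl = (ip_packet[0] & 0x0F) * 4      # IPv4 header length in bytes
    proto = ip_packet[9]                 # IPv4 protocol field
    if proto == TCP and len(ip_packet) >= ihl + 13:
        tcp_len = (ip_packet[ihl + 12] >> 4) * 4   # TCP data offset
        return ip_packet[:ihl + tcp_len]
    if proto == UDP:
        return ip_packet[:ihl + 8]       # UDP header is always 8 bytes
    # Assumed policy: for anything else, keep only the IP header.
    return ip_packet[:ihl]
```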

Appendix B  Rules for address scrambling

	Provide anonymity to individuals and organizations by
	scrambling source and destination addresses in IP headers.

	There are 2 levels in address scrambling.  Choose an
	appropriate method according to the purpose of the data set.

		(1) Full scrambling
		    Scramble IP addresses by mapping an IP address to
		    another IP address using a hash function.

		(2) Scrambling with address prefix preserved
		    When 2 IP addresses have a common address prefix,
		    they are mapped to addresses with a common address
		    prefix of the same length.

	Note that, although it preserves routing information, method
	(2) risks being reverse-engineered (e.g., using well-known
	server addresses as a clue).
	Choose an appropriate method according to the importance of
	anonymity in the trace and the purpose of the data set.
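
	The two levels can be sketched for IPv4 as follows.  This is
	an illustration under stated assumptions, not the algorithm
	used by wide-tcpdpriv: the function names and the secret key
	parameter are hypothetical, and HMAC-SHA-256 is chosen here
	only as one possible keyed hash.  The prefix-preserving
	variant decides, bit by bit, whether to flip a bit based on
	the bits that precede it, so two addresses sharing a k-bit
	prefix map to addresses sharing exactly a k-bit prefix.

```python
import hashlib
import hmac
import ipaddress


def full_scramble(addr: str, key: bytes) -> str:
    """Method (1): map an IPv4 address through a keyed hash."""
    packed = ipaddress.IPv4Address(addr).packed
    digest = hmac.new(key, packed, hashlib.sha256).digest()
    return str(ipaddress.IPv4Address(digest[:4]))


def prefix_preserving_scramble(addr: str, key: bytes) -> str:
    """Method (2): flip each bit depending only on the bits before it."""
    value = int(ipaddress.IPv4Address(addr))
    result = 0
    for i in range(32):
        prefix = value >> (32 - i)               # the i leading bits
        msg = bytes([i]) + prefix.to_bytes(4, "big")
        flip = hmac.new(key, msg, hashlib.sha256).digest()[0] & 1
        bit = (value >> (31 - i)) & 1
        result = (result << 1) | (bit ^ flip)
    return str(ipaddress.IPv4Address(result))
```

	Note that the truncated hash in full_scramble is not a
	permutation, so two distinct addresses can occasionally map
	to the same output; the prefix-preserving sketch is
	one-to-one.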

  Exceptional addresses
	Addresses not containing user identifiers may be left without
	scrambling.  Those addresses include broadcast addresses,
	multicast addresses, and private addresses.

	In the case of IPv6, link-local addresses and site-local
	addresses could contain user MAC addresses.  Solicited-node
	multicast addresses contain lower bits of global addresses.
	Therefore, these addresses should be scrambled as well.
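
	A minimal sketch of this classification, using the Python
	standard ipaddress module, is shown below.  The function name
	needs_scrambling is hypothetical, and two points are
	assumptions of this sketch: "private addresses" is taken to
	mean the RFC 1918 ranges, and only the limited broadcast
	address 255.255.255.255 is recognized, since directed
	broadcast addresses cannot be identified without knowing the
	subnet mask.

```python
import ipaddress

RFC1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]
SOLICITED_NODE = ipaddress.ip_network("ff02::1:ff00:0/104")
LIMITED_BROADCAST = ipaddress.IPv4Address("255.255.255.255")


def needs_scrambling(addr: str) -> bool:
    """Return False only for addresses that carry no user identifier."""
    ip = ipaddress.ip_address(addr)
    if ip.version == 4:
        if ip == LIMITED_BROADCAST or ip.is_multicast:
            return False
        return not any(ip in net for net in RFC1918)
    # IPv6: link-local and site-local addresses may embed a MAC address,
    # and solicited-node multicast carries low bits of a global address.
    if ip.is_link_local or ip.is_site_local or ip in SOLICITED_NODE:
        return True
    return not ip.is_multicast
```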

  IP addresses contained in upper-layer headers
	IP addresses could be contained in upper-protocol headers
	(e.g., ICMP, DNS).  These addresses must be scrambled in the
	same manner, or removed.

  MAC addresses
	Many link-layer headers (e.g., Ethernet headers) contain
	MAC addresses.
	A MAC address contains vendor and model information which
	could be part of user privacy or lead to a security hole.
	However, a trace contains only MAC addresses of machines
	directly connected to the segment, and backbone networks
	usually have no ordinary users on the local segment.

	As long as there is no ordinary user on the local segment,
	MAC addresses may be left without scrambling because this
	guideline is only for user privacy.
	(privacy or security of network nodes is out of scope of this
	guideline.)
	
Unit of address scrambling
	There are several choices regarding the scope within which
	the address mapping is kept consistent.
		(1) an address is mapped to the same address only
		    within a single TCP session in a single data set.
		(2) all occurrences of an address are to be mapped to
		    a single address within a data set.
		(3) all occurrences of an address are to be mapped to
		    a single address even in different data sets.
	A wider scope of consistency is convenient for users but it
	also makes reverse engineering easier.
	Method (2) is recommended.
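
	One way to picture the scope of consistency is a mapping
	table whose lifetime matches the chosen unit.  The sketch
	below is an assumption for illustration only (the class name
	AddressMapper is hypothetical, it handles IPv4 only, and it
	does not preserve prefixes): each new address is assigned a
	fresh random replacement, and repeated occurrences reuse it.

```python
import ipaddress
import secrets


class AddressMapper:
    """Table-based mapping that is consistent for the mapper's lifetime."""

    def __init__(self) -> None:
        self._table = {}    # original address -> scrambled address
        self._used = set()  # scrambled values already handed out

    def map(self, addr: str) -> str:
        if addr not in self._table:
            while True:
                candidate = secrets.randbits(32)
                if candidate not in self._used:
                    break
            self._used.add(candidate)
            self._table[addr] = str(ipaddress.IPv4Address(candidate))
        return self._table[addr]
```

	For method (2), one mapper would be created per data set and
	discarded afterwards; keeping the same table across data sets
	would correspond to method (3).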

IP/TCP options
	IP options can contain IP addresses.  Addresses in IP
	options should be scrambled in the same manner.  Otherwise, IP
	options should be replaced by NOP options, or removed.
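
	Replacing IP options by NOP options can be sketched as
	follows.  The function name nop_ip_options is hypothetical,
	the input is assumed to be an IPv4 packet starting at the IP
	header, and this sketch does not recompute the header
	checksum.

```python
NOP = 0x01  # IPv4 "No Operation" option type


def nop_ip_options(ip_packet: bytes) -> bytes:
    """Overwrite every IPv4 option byte with NOP, keeping the length."""
    ihl = (ip_packet[0] & 0x0F) * 4      # IPv4 header length in bytes
    if ihl < 20 or len(ip_packet) < ihl:
        raise ValueError("malformed or truncated IPv4 header")
    out = bytearray(ip_packet)
    out[20:ihl] = bytes([NOP]) * (ihl - 20)   # options follow the fixed header
    return bytes(out)
```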

	On the other hand, TCP options do not contain privacy
	information.  Because TCP options carry information useful for
	analyzing TCP behavior, it is recommended to leave them intact.

Timestamp
	As a general rule, time information should not be modified.

File Size
	If a data set is too large, it is recommended to divide it
	into smaller files (about 100MB each).
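
	When dividing a tcpdump binary file, each piece needs its own
	copy of the file header so that it can be read on its own.
	The sketch below assumes the classic pcap layout (a 24-byte
	file header followed by records with a 16-byte record header)
	and works on an in-memory image for brevity; the function name
	split_pcap is hypothetical, and a real tool would stream the
	file instead.

```python
import struct


def split_pcap(data: bytes, max_bytes: int) -> list:
    """Split a classic pcap image into pieces of at most max_bytes.

    A single record larger than the limit still gets its own piece.
    """
    file_header = data[:24]
    magic = struct.unpack("<I", file_header[:4])[0]
    endian = "<" if magic in (0xA1B2C3D4, 0xA1B23C4D) else ">"
    pieces, current, pos = [], bytearray(file_header), 24
    while pos + 16 <= len(data):
        # Captured length is the third 32-bit field of the record header.
        incl_len = struct.unpack(endian + "I", data[pos + 8:pos + 12])[0]
        record = data[pos:pos + 16 + incl_len]
        if len(current) > 24 and len(current) + len(record) > max_bytes:
            pieces.append(bytes(current))
            current = bytearray(file_header)
        current += record
        pos += 16 + incl_len
    if len(current) > 24:
        pieces.append(bytes(current))
    return pieces
```

	For files of about 100MB each, max_bytes would be set to
	100 * 1024 * 1024.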

Exceptional cases
	Sometimes, it is necessary to access raw addresses in order to
	analyze routing protocols or DNS.
	Such requests will be processed case by case.
	Still, care should be taken not to risk user privacy, for
	example, by extracting only packets using the target protocol.

Standard Tool
	wide-tcpdpriv is our standard tool to remove privacy
	information from a tcpdump output file.
	wide-tcpdpriv is derived from tcpdpriv written by Greg
	Minshall, and the default settings are changed to meet the
	above requirements.
	wide-tcpdpriv is available from:
	ftp://ftp.csl.sony.co.jp/pub/kjc/wide-tcpdpriv.tar.gz

	Usage:
		% tcpdpriv [-w outputfile] [-r inputfile]
		or
		% tcpdpriv < inputfile > outputfile

Appendix C  Traffic Data Description Format


			Traffic Data Description Format

	Description:
	    (A general description of the traffic trace (e.g., "one
	    hour of all TCP traffic between the University of Southern
	    California and the rest of the world"; "HTTP server logs
	    for a departmental server").)


	Data Format:
		[ ] tcpdump binary  [ ] tcpdump ascii  [ ] other (	)

	Measurement Information:
		Start Date and Time:
		Duration:	hours		minutes
		Contact information:			e-mail:
		Other measurement details: (if any)

	Protocol:
		[ ] IPv4  [ ] IPv6 [ ] other (		)

	Privacy:
		[ ] wide-tcpdpriv default setting
		[ ] other
		    payload deletion:
			[ ] TCP/UDP payload deleted
			List of protocols whose headers are not deleted
			[						]

		    address scrambling method:
			[ ] no scrambling
			[ ] full scrambling  [ ] prefix preserved
			[ ] other (			)
		    address mapping consistency:
			[ ] session only  [ ] (subdivided) file  
			[ ] entire data set  [ ] other (		)

	Restrictions:
	    (Whether the trace may be redistributed without
	    permission, who to contact for permission.
	    All traces in the archive are unrestricted as to what use
	    may be made of them (for example, there is no requirement
	    that simulations made using the traces be published in the
	    open literature). )

		[ ] redistributable  [ ] other (described below)


	Distribution:
		File Information
			compressed size:		Bytes  
			uncompressed size:		Bytes
			compression method:
				[ ] gzip  [ ] other (		)
			if data set is divided into multiple files:
				number of files:
				average file size:
		URL:

	Acknowledgments:
	    (Who captured the trace, how the trace should be
	    acknowledged in publications, who to contact with
	    questions regarding the trace.)


	Publications: 
	    (Publications available that have already studied this
	    trace, if any.)

	Related: (Available related software and traces, if any.)