• expressive/anonymous,
• valid/fantasy,
• active/inactive,
• positive/offensive,
• popular/isolated,
• sociable/unsociable, and
• traceable/untraceable.
Our approach to labeling the training data was empirically driven:
we experimented with formulae to best match the collected data. The
features are determined using a mixture of ad hoc automated techniques, ranging from checking the validity of the address to comparing the terms and language used against a list of known terms.
For each feature pair, a profile is awarded a normalized score
between 0 and 1. Using mixtures of these features, we were then able
to classify each profile along three scales. (See Figure 1.)
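As a rough illustration of the scoring step, the sketch below maps a raw count onto a score between 0 and 1. The function name `normalized_score`, the clamping rule, and the example field counts are assumptions made for illustration; the article does not state its formulae.

```python
def normalized_score(raw: float, max_raw: float) -> float:
    """Map a raw feature count onto [0, 1] (hypothetical scaling rule)."""
    if max_raw <= 0:
        raise ValueError("max_raw must be positive")
    # Clamp so that outliers cannot push the score outside [0, 1].
    return min(max(raw / max_raw, 0.0), 1.0)


# Assumed example: a profile that fills in 6 of 12 optional fields
# scores 0.5 on the expressive/anonymous pair.
expressive = normalized_score(6, 12)
```

A score near 1 would correspond to the first term of a pair (for example, expressive) and a score near 0 to the second (anonymous).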
[Figure 1 diagram labels: Honest/Dishonest (Expressive/Anonymous, Valid/Fantasy, Traceable/Untraceable); Accountable/Unaccountable (Active/Inactive, Popular/Isolated, Positive/Offensive, Sociable/Unsociable); Real/Fake (?).]
friends and view number), and positive use of language (which we
later see is not a strong indicator of validity).
3. Real/fake. We define a fake profile as one intended to make people
believe that the profile belongs to some real person who has no actual connection to the profile. On the other hand, a real profile is one
controlled by the identity presented.
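One way to picture how feature scores could be mixed into the honest and accountable scales is sketched below. The feature groupings follow the definitions in the numbered list, but the equal weighting, the helper `scale_score`, and the sample values are assumptions; the article does not give the actual mixture, and the real/fake scale is left out because its inputs are not specified.

```python
from statistics import mean

# Hypothetical grouping of feature scores (each already in [0, 1]) into the
# honest and accountable scales; equal weighting is an assumption.
HONEST_FEATURES = ("valid", "traceable", "expressive")
ACCOUNTABLE_FEATURES = ("active", "sociable", "popular", "positive")


def scale_score(features: dict[str, float], names: tuple[str, ...]) -> float:
    """Average the named feature scores into a single scale score."""
    return mean(features[name] for name in names)


# Made-up feature scores for a single profile.
profile = {
    "valid": 1.0, "traceable": 0.5, "expressive": 0.75,
    "active": 0.25, "sociable": 0.5, "popular": 0.25, "positive": 1.0,
}
honest = scale_score(profile, HONEST_FEATURES)            # 0.75
accountable = scale_score(profile, ACCOUNTABLE_FEATURES)  # 0.5
```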
Social Network Analysis
Because social networks are examples of small-world networks [13],
the community can be modeled as a network N = (I, F), where I
represents the individuals (nodes) and F represents the friend links (edges).
Social network analysis can be used to describe the properties of
this network structure as well as the characteristics of a specific individual in that network. These include a profile's connectivity and the
amounts and types of interactions with other members of the community, which can reveal information about the validity of an identity.
To capture this information in a form that can be used by our classifier, we analyzed measurable characteristics of nodes, including out-degree,
in-degree, overlap (mutual friends), centrality, and isolation.
These were tagged as accountability attributes and used to identify relationships between these properties and the type of identity. We also measured the similarity between individuals and their network of friends,
for both self-described data and extracted personality factors.
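The node-level measurements named above can be sketched on a small directed friend graph. The adjacency-set representation, the toy data, and the helper names are illustrative assumptions, not the authors' implementation.

```python
# Directed friend graph as adjacency sets: node -> set of nodes it links to.
# The toy data and helper names are illustrative assumptions.
friends = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a"},
    "d": set(),
}


def out_degree(graph, node):
    """Number of friend links leaving the node."""
    return len(graph[node])


def in_degree(graph, node):
    """Number of friend links pointing at the node."""
    return sum(1 for targets in graph.values() if node in targets)


def mutual_friends(graph, u, v):
    """Overlap: nodes that both u and v link to."""
    return graph[u] & graph[v]


def is_isolated(graph, node):
    """A node with no incoming and no outgoing links."""
    return out_degree(graph, node) == 0 and in_degree(graph, node) == 0
```

In this toy graph, `mutual_friends(friends, "a", "b")` is `{"c"}` and only node `"d"` is isolated.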
These data were generated by incorporating the identity features of
the top 40 friends within the system. Our data set initially contained
more than 4.8 million profiles, which was reduced to 2.2 million nodes
with 2.4 billion edges between them after removing mutual friends.
This suggests the probability that a friend of a friend will become a
friend is much higher than a stranger becoming a new friend.
Our analysis revealed that the sample network contains many high-degree nodes that are strongly clustered, with an average of 1,010
friends for public profiles and 5,792 for band profiles. Real-popular
nodes showed a high out-degree distribution, although out-degree
analysis alone could not distinguish a fake profile from a real-local one. From
this analysis, we defined the popularity factor in our classifier as the
centrality measurement, that is, the accumulated out-degree and in-degree of a node.
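A minimal sketch of such a popularity factor follows, assuming that "accumulation" means a plain sum of the two degrees and that the result is scaled by the largest total so it lands between 0 and 1 like the other feature scores. Both choices, and the function name `popularity`, are assumptions rather than the article's stated method.

```python
def popularity(out_deg: dict[str, int], in_deg: dict[str, int]) -> dict[str, float]:
    """Sum out- and in-degree per node, then scale by the largest total."""
    nodes = out_deg.keys() | in_deg.keys()
    totals = {n: out_deg.get(n, 0) + in_deg.get(n, 0) for n in nodes}
    top = max(totals.values(), default=0)
    # Assumed normalization: divide by the maximum so scores fall in [0, 1].
    return {n: (t / top if top else 0.0) for n, t in totals.items()}


# Toy degree counts (made up for illustration).
scores = popularity({"a": 2, "b": 1}, {"a": 2, "c": 1})
```

Here node `"a"` has the largest combined degree and therefore receives the top score of 1.0.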
The distribution of training node (known profiles) positions within
the network structure shown in Figure 2 illustrates several key observations. The real-popular nodes are more closely linked to each other,
Figure 1: Identity model based on personality factors.
1. Honest/dishonest. It can be argued that there is a trade-off between
privacy and honesty in online presentation, depending on context
[10, 16]. We define an honest person as one whose information is valid
(exists and is reasonably acceptable), traceable (includes a Web address, school name, and photograph), and expressive (a count of information revealed).
2. Accountable/unaccountable. Accountable/unaccountable can be
defined based on a user’s level of activity (blog and group membership), sociability (number of comments), popularity (number of
Figure 2: The network structure of the training set.