DATA MINING ONLINE IDENTITIES
By Roya Feizy, Ian Wakeman, and Dan Chalmers
How sure are you that your friends are who they say they are? In real life, unless you are the target of some form of espionage, you can usually be fairly certain that you know who your friends are, because you have a history of shared interests and experiences. Likewise, most people can tell, just by using common sense, if someone is trying to sell them on a product, idea, or candidate. When we interact with people face-to-face, we continuously reevaluate whether something just seems off, based on body language and other social and cultural cues.
These identity validation questions have a long history in computer
science and translate directly to the pervasive computing context, where
there is a widespread view that access control mechanisms will use some
form of computational trust [8]. One example of this paradigm is the set
of social networks embodied in Web sites such as MySpace and Facebook. If a person can show proof that he or she is responsible for an online identity through standard public key cryptography, then his or her
information and relationships can be used to calculate a level of trust.
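The article does not specify a particular cryptographic scheme, so the following is only a hedged sketch of the idea: a profile owner signs a verifier's challenge with a private key, and anyone holding the matching public key can check the signature. The RSA parameters below are insecurely small toy values chosen purely for readability, and the `FriendID` in the challenge is illustrative.

```python
import hashlib

# Toy RSA parameters -- insecurely small, for illustration only.
p, q = 61, 53
n = p * q                               # public modulus
e = 17                                  # public exponent
d = pow(e, -1, (p - 1) * (q - 1))       # private exponent (Python 3.8+)

def sign(message: bytes) -> int:
    """Owner signs a hash of the challenge with the private key."""
    h = int.from_bytes(hashlib.sha256(message).digest(), "big") % n
    return pow(h, d, n)

def verify(message: bytes, signature: int) -> bool:
    """Anyone holding the public key (n, e) can check the signature."""
    h = int.from_bytes(hashlib.sha256(message).digest(), "big") % n
    return pow(signature, e, n) == h

challenge = b"prove you control profile FriendID=12345"
assert verify(challenge, sign(challenge))
assert not verify(b"a different challenge", sign(challenge))
```

A signature that verifies ties the online identity to possession of the private key; the trust calculation can then proceed over that identity's profile data and relationships.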
The millions of social network users and billions of connections
between them make it non-trivial to formalize an automated approach
to differentiate fact from fiction in self-described identities online. An
identity may be part of a role-playing game [1], or it may be an impersonation, either for play or for more nefarious purposes, such as fraud.
However, each of these identities still has associated profile data and is
embedded within a social network.
How can we be sure with whom we are interacting and whether these
individuals and groups are being truthful in the online identities they
present to the rest of the community? What tools and techniques can be
used to gather, organize, and explore the available data for informing the
level of trust that should be granted to an individual? Can we verify the validity of an identity automatically, based on the displayed information?
To tackle these questions, we use a machine learning approach to
look at traces of people’s identities left behind on online social networking sites and evaluate the validity of those identities. We train classifier-based models on profiles with known identities (real or fake).
We also use data mining techniques and social network analysis to
extract significant patterns in the data and network structure and
improve the classifier during the cycle of development.
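The specific network measures are not named at this point in the article; as one hedged sketch of the kind of structural pattern social network analysis can extract, the snippet below computes a user's friend count (degree) and local clustering coefficient, i.e. how interconnected that user's friends are, from a hypothetical adjacency-list graph.

```python
def clustering_coefficient(graph: dict, node: str) -> float:
    """Fraction of pairs of `node`'s friends who are friends themselves."""
    friends = graph[node]
    k = len(friends)
    if k < 2:
        return 0.0
    links = sum(1 for i, a in enumerate(friends)
                for b in friends[i + 1:] if b in graph[a])
    return 2 * links / (k * (k - 1))

# Hypothetical undirected friendship graph, stored as adjacency lists.
graph = {
    "alice": ["bob", "carol", "dave"],
    "bob":   ["alice", "carol"],
    "carol": ["alice", "bob"],
    "dave":  ["alice"],
}

degree = len(graph["alice"])                  # alice has 3 friends
cc = clustering_coefficient(graph, "alice")   # 1 friend pair (bob-carol) of 3
print(degree, round(cc, 3))                   # -> 3 0.333
```

Features like these can be fed back into the classifier alongside the self-reported profile attributes during each development iteration.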
We evaluate our algorithm on features collected from 2.2 million MySpace user profiles. Our results indicate that by utilizing people’s
online, self-reported information, network of friends, and interactions,
we are able to provide evidence for deciding the level of trust with
which to imbue individuals in making access control decisions in a
manner that is both accurate and scalable.
Social Network Data Collection
To obtain our sample data set, we customized a robust crawler to accumulate personal and relational information from MySpace profiles within three main categories: 1) public (personal pages), 2) private (pages with limited biographical data), and 3) bands (data related to musical artists). (See Table 1.)

Table 1: Number of collected profiles by each category.
Each seed identity was chosen by selection of a random FriendID
(MySpace members’ unique number). We then crawled pages up to a
depth of two degrees (link to the friends and the friends of friends),
targeting the top 40 friends of each individual.
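The crawl described above amounts to a bounded breadth-first traversal. The sketch below assumes a hypothetical `fetch_top_friends` helper standing in for the actual MySpace page scraping, which the article does not detail; the hard-coded sample graph exists only to make the snippet runnable.

```python
from collections import deque

def fetch_top_friends(friend_id: int) -> list:
    """Hypothetical stand-in for scraping a profile's Top Friends list."""
    sample = {1: [2, 3], 2: [4], 3: [4, 5]}
    return sample.get(friend_id, [])[:40]   # target at most the top 40 friends

def crawl(seed: int, max_depth: int = 2) -> set:
    """Breadth-first crawl from a seed FriendID out to two degrees."""
    seen, queue = {seed}, deque([(seed, 0)])
    while queue:
        fid, depth = queue.popleft()
        if depth == max_depth:
            continue                        # friends of friends is the limit
        for friend in fetch_top_friends(fid):
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, depth + 1))
    return seen

print(sorted(crawl(1)))   # -> [1, 2, 3, 4, 5]: seed, friends, friends of friends
```

Bounding both the depth (two degrees) and the fan-out (top 40 friends per profile) keeps the crawl tractable despite MySpace's scale.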
We conducted a qualitative study to manually identify the true identity of three types of users for our classifier training data:
• Real (popular): official profiles representing famous people. These are obviously well-connected profiles, which might skew the experimental results; therefore we also collected a number of local users.
• Real (local): current students at the University of Sussex who responded to a survey (118 responses from 2,019 emails), verified that their profiles belonged to them, and rated their own level of honesty.
• Fake (impersonator): users who fabricated a real person’s persona, reusing almost the same information, such as name and pictures. We identified fakes manually, for instance by knowing of another, real profile for the same person (457 participants).
The known data (real or fake) was used as the training and testing set,
while the remaining unknown dataset was used to investigate appropriate pre-processing algorithms for the classifier (Table 1).
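The article does not name the classifier at this point, so the following is only a minimal sketch of the labeled-training idea: a nearest-centroid model fit on real/fake feature vectors, which can then score unknown profiles. The two-dimensional features and their values are hypothetical placeholders for the richer profile attributes described below.

```python
def centroid(rows):
    """Mean feature vector of one labeled class."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical profile features, e.g. (friend_count / 100, profile_completeness).
real = [(0.9, 0.8), (0.7, 0.9), (0.8, 0.7)]   # labeled real profiles
fake = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3)]   # labeled fake profiles

c_real, c_fake = centroid(real), centroid(fake)

def classify(features):
    """Assign an unknown profile to the nearer class centroid."""
    return "real" if dist2(features, c_real) < dist2(features, c_fake) else "fake"

print(classify((0.85, 0.75)))   # -> real
print(classify((0.15, 0.25)))   # -> fake
```

Holding out part of the labeled set for testing, as the study does, gives an unbiased estimate of how such a model will behave on the large unknown dataset.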
To aid the machine learning classifier, we developed a series of attributes to describe each individual profile, devising a preprocessing algorithm to derive a set of personality features from the raw profile data. The set of personality features includes: