Friendship also relies on some degree of similarity: the characteristics
of individuals should be, on some level, similar to those of the people
who connect with them. This similarity measurement reveals information
about the context of links between users and the correlation between
their identity types.
The following formula was used to compare I, which represents the
value of an individual's personality features, with F, which indicates
the average value of the personality attributes of that individual's
friends. The average similarity was calculated by dividing the smaller
of the two values by the larger. (See Equation 1.)
We learned that the choice of friends influences an individual's
determined identity type. We took the square root of friends' attributes
because they are less significant than the individual's own attributes.
(See Equation 2.)
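The similarity measure described above can be sketched in a few lines of Python. This is only an illustration: the function name and the handling of the all-zero case are assumptions, and scores are assumed to be non-negative.

```python
def similarity(individual: float, friends_avg: float) -> float:
    """Percentage similarity: the smaller value divided by the larger, times 100."""
    low = min(individual, friends_avg)
    high = max(individual, friends_avg)
    if high == 0:
        # Assumption: with non-negative scores, two zeros count as identical.
        return 100.0
    return low / high * 100.0


# Hypothetical scores: an individual at 2.0 whose friends average 4.0.
print(similarity(2.0, 4.0))
```

The measure is symmetric, so swapping the two arguments gives the same result.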
To identify patterns within our data and to improve our classifier
model in parallel, we used a supervised learning approach to train and
test the model. We evaluated four classifier models to determine which
would classify the data most efficiently and operate most effectively
given our problem definition: decision tree, rule learner, Naive Bayes,
and nearest neighbor.
Given the large size of our sample population and the number of
features used to describe each data point, we first used principal
component analysis [22] to reduce the number of dimensions required to
cluster the data before attempting learning. This redundancy-reduction
technique examines the correlation between features within the training
data set and generates the main components with minimal loss of
information. The result of this analysis indicates which factors or
components are most significant when examining personal information to
predict the truth about identities.
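To make the idea of a principal component concrete, the sketch below finds only the first component of a small data set by power iteration on the covariance matrix. It is not the tooling used in the study; the function name and the toy data are assumptions, and a real analysis would normally rely on a numerical library.

```python
import math


def first_principal_component(rows: list[list[float]], iterations: int = 200) -> list[float]:
    """Approximate the unit direction of greatest variance by power iteration."""
    n, d = len(rows), len(rows[0])
    # Center each feature on its mean.
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d); requires at least two rows.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Start from a uniform unit vector. Limitation: if this start vector is
    # exactly orthogonal to the leading component, the iteration cannot find it.
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break  # no variance along the current direction
        vec = [v / norm for v in nxt]
    return vec


# Toy data in which the second feature is always twice the first.
toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
print(first_principal_component(toy))
```

Because the two toy features are perfectly correlated, a single component pointing along the direction (1, 2) captures all of the variance, which is the kind of redundancy the technique removes.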
Using the attributes that were identified as the principal components,
two-thirds of our known data, consisting of both the raw data and the
extracted personality attributes, was used as the training set (to build
a model), and one-third was used as the test set (to measure the model).
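A two-thirds/one-third split of this kind might look like the following sketch. The text does not say whether records were shuffled before splitting, so the seeded shuffle, like the function name, is an assumption.

```python
import random


def split_two_thirds(records: list, seed: int = 0) -> tuple[list, list]:
    """Shuffle a copy of the records, then split 2/3 for training and 1/3 for testing."""
    shuffled = records[:]
    # Assumption: a seeded shuffle, so the split is random but repeatable.
    random.Random(seed).shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


# Nine hypothetical record IDs.
train, test = split_two_thirds(list(range(9)))
print(len(train), len(test))
```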
Several validation schemes can be used to estimate the performance of a
learner, such as simple validation, regression performance, and t-test.
In our experiments, we applied the cross-validation operator, which
evaluates the learning method from the training set and applies the
average absolute and squared errors to the test set to predict the
unknown labels.
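As a rough illustration of how cross-validation partitions data, the sketch below generates the index sets for k contiguous folds. The number of folds used in the experiments is not stated, so k is a hypothetical parameter, and this helper is not the operator referred to above.

```python
def k_fold_indices(n_samples: int, k: int) -> list[tuple[list[int], list[int]]]:
    """Return (train_indices, test_indices) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train_idx, test_idx))
        start += size
    return folds


# Ten hypothetical samples split into three folds.
for train_idx, test_idx in k_fold_indices(10, 3):
    print(test_idx)
```

Each sample appears in exactly one test fold, so every record is used for both training and testing across the k rounds.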
Determining the performance accuracy of the learner produces a
confusion matrix, an evaluation technique that tabulates the
true-positive (TP), true-negative (TN), false-positive (FP), and
false-negative (FN) values, where:
• TP is correct classification of correct data: real correctly tagged as real
• TN is correct classification of incorrect data: fake correctly tagged as fake
• FP is incorrect classification of incorrect data: fake incorrectly tagged as real
• FN is incorrect classification of correct data: real incorrectly tagged as fake
❝Despite the important role of honesty
in social interaction, the honesty value
(quality in content) has less effect than
accountability value (quantity in interaction)
for identity prediction in MySpace.❞
\[
\mathrm{Similarity}(I, F) = \frac{\min(I, F)}{\max(I, F)} \times 100 \tag{1}
\]
\[
I = \sum_{a=1}^{n} i(a), \qquad F = \frac{1}{f} \sum_{i=1}^{f} \left( \sum_{a=1}^{n} f(a) \right)
\]
This formula allowed us to uncover which identity elements are more
important in choosing a friend. Our analysis of the pre-classified
attributes showed that being traceable, valid, and positive is not as
important as being active, sociable, and popular. This similarity
measurement reveals that people place more value on accountability than
on honesty because honesty is a less visible characteristic.
Using an individual's personality attributes and their relation within
the larger social network, we were able to compute our reality algorithm
for determining the likelihood that an identity is valid. In our identity
model, R(x) refers to a real person, F(x) to a fake persona, and a
represents each attribute extracted from their profiles. Our algorithm
calculated values for each, indicating the likelihood that an identity is
either real or fake based on the level of accountability and honesty.
To calculate R(x), the sum of the honesty H(a) and accountability A(a)
values is added to the square root of the average of top friends'
attributes. F(x) is calculated in the same way from the dishonesty D(a)
and unaccountability U(a) values. Top friends' personality values are
weighted according to the results of our experiment on the similarity
attribute.
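\[
\begin{aligned}
H(a) &= \operatorname{avg}(\text{expressive} + \text{valid} + \text{traceable}) \\
A(a) &= \operatorname{avg}(\text{active} + \text{popular} + \text{sociable} + \text{positive}) \\
D(a) &= \operatorname{avg}(\text{anonymous} + \text{fantasy} + \text{untraceable}) \\
U(a) &= \operatorname{avg}(\text{inactive} + \text{isolated} + \text{unsociable} + \text{offensive}) \\
R(x) &= H(a) + A(a) + \sqrt{\operatorname{avg}_{\text{top friends}}\bigl(H(a) + A(a)\bigr)} \\
F(x) &= D(a) + U(a) + \sqrt{\operatorname{avg}_{\text{top friends}}\bigl(D(a) + U(a)\bigr)}
\end{aligned}
\tag{2}
\]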
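One possible reading of the two scores, sketched in Python, is shown here. It assumes each profile is a dictionary of non-negative numeric attribute values keyed by the attribute names above, and that the friends' contribution is the square root of the top friends' average combined score; the function names and the no-friends fallback are illustrative assumptions rather than the original implementation.

```python
import math
from statistics import mean

HONESTY = ("expressive", "valid", "traceable")
ACCOUNTABILITY = ("active", "popular", "sociable", "positive")
DISHONESTY = ("anonymous", "fantasy", "untraceable")
UNACCOUNTABILITY = ("inactive", "isolated", "unsociable", "offensive")


def _pair_sum(profile: dict, first: tuple, second: tuple) -> float:
    """Average of the first attribute group plus average of the second."""
    return mean(profile[k] for k in first) + mean(profile[k] for k in second)


def _score(profile: dict, top_friends: list, first: tuple, second: tuple) -> float:
    own = _pair_sum(profile, first, second)
    if not top_friends:
        # Assumption: with no top friends, only the individual's values count.
        return own
    friends_avg = mean(_pair_sum(f, first, second) for f in top_friends)
    # Friends are down-weighted by taking the square root of their average.
    return own + math.sqrt(friends_avg)


def reality_score(profile: dict, top_friends: list) -> float:
    """R(x): honesty plus accountability, plus the friends' contribution."""
    return _score(profile, top_friends, HONESTY, ACCOUNTABILITY)


def fakeness_score(profile: dict, top_friends: list) -> float:
    """F(x): dishonesty plus unaccountability, plus the friends' contribution."""
    return _score(profile, top_friends, DISHONESTY, UNACCOUNTABILITY)


# Hypothetical profiles with every attribute set to the same value.
all_keys = HONESTY + ACCOUNTABILITY + DISHONESTY + UNACCOUNTABILITY
person = {k: 1.0 for k in all_keys}
friend = {k: 2.0 for k in all_keys}
print(reality_score(person, [friend]), fakeness_score(person, [friend]))
```

Comparing the two scores for the same profile then gives a simple indication of whether the identity looks more real or more fake.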
The metrics for performance evaluation can be calculated as:
Precision = TP / (TP + FP) and Recall = TP / (TP + FN).
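The confusion-matrix counts and the two metrics can be computed as in the sketch below, which uses the standard definitions: precision is TP / (TP + FP) and recall is TP / (TP + FN). The label strings "real" and "fake", the function names, and the sample labels are assumptions for illustration.

```python
def confusion_counts(actual: list[str], predicted: list[str]) -> dict[str, int]:
    """Count TP, TN, FP, and FN, treating "real" as the positive class."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, guess in zip(actual, predicted):
        if truth == "real":
            counts["TP" if guess == "real" else "FN"] += 1
        else:
            counts["TN" if guess == "fake" else "FP"] += 1
    return counts


def precision(counts: dict[str, int]) -> float:
    """Share of profiles tagged as real that are actually real."""
    tagged_real = counts["TP"] + counts["FP"]
    return counts["TP"] / tagged_real if tagged_real else 0.0


def recall(counts: dict[str, int]) -> float:
    """Share of actually real profiles that were tagged as real."""
    actually_real = counts["TP"] + counts["FN"]
    return counts["TP"] / actually_real if actually_real else 0.0


# Hypothetical labels for six profiles.
actual = ["real", "real", "real", "fake", "fake", "fake"]
predicted = ["real", "fake", "fake", "real", "fake", "fake"]
counts = confusion_counts(actual, predicted)
print(counts, precision(counts), recall(counts))
```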