Table 2 shows the performance of the nearest neighbor learner on the
pre-classified data. The highlighted cells on the diagonal represent the
correct (TP and TN) predictions, with an average accuracy of 84.6 percent.
Accuracy: 84.59%

                                 Predicted
Actual Identity   real-popular   real-local   fake     class recall
real-popular      377            15           25       90.41%
real-local        19             62           37       52.54%
fake              26             31           400      87.53%
class precision   89.34%         57.41%       86.58%
Table 2: Confusion matrix of Nearest Neighbor learner over
training dataset.
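The class recall and class precision figures in Table 2 can be recomputed from the raw counts. The following is a minimal sketch, assuming rows hold the actual identity and columns the predicted one; the function and variable names are illustrative and not from the original study. Pooling the counts this way gives an accuracy of about 84.58 percent, marginally below the reported 84.59 percent; the gap may come from averaging over validation folds, which is an assumption here.

```python
# Counts from Table 2. Assumption: rows = actual identity, columns = predicted.
LABELS = ["real-popular", "real-local", "fake"]
MATRIX = [
    [377, 15, 25],   # actual real-popular
    [19, 62, 37],    # actual real-local
    [26, 31, 400],   # actual fake
]


def class_recall(matrix, i):
    """Share of actual class i that was predicted as class i."""
    return matrix[i][i] / sum(matrix[i])


def class_precision(matrix, j):
    """Share of predictions for class j that were actually class j."""
    return matrix[j][j] / sum(row[j] for row in matrix)


def accuracy(matrix):
    """Share of all samples that lie on the diagonal."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


if __name__ == "__main__":
    for k, label in enumerate(LABELS):
        print(f"{label}: recall={class_recall(MATRIX, k):.2%}, "
              f"precision={class_precision(MATRIX, k):.2%}")
    print(f"accuracy={accuracy(MATRIX):.2%}")
```

The low recall and precision for real-local (both below 60 percent) show that most of the remaining errors involve that class.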
We validated several learners on both the original data (profile content such as age, gender, and location) and the pre-classified data (extracted
personality factors such as valid, popular, and traceable) in order to achieve
more accurate predictions. Evaluating both of these sets allows us to
compare the average performance improvement across all three
inputs for the machine learning models.
Our results showed that overall performance on the pre-classified data is higher than on the original data, while incorporating
social network data improves performance further (see Table 3).
Although the pre-classified data carries less diverse information, it is
much faster to process and predicts more accurately than the original
data: 83.65 percent average accuracy versus 65.98 percent.
                      Decision   Rule       Nearest    Naïve     Overall
                      Tree       Learner    Neighbor   Bayes     Accuracy
Original data         69.25%     66.63%     67.05%     60.99%    65.98%
Pre-classified data   86.10%     85.89%     84.59%     78.03%    83.65%
Learner accuracy      77.68%     76.26%     75.82%     69.51%
Table 3: Average learner performance comparison when using
original data and pre-classified data.
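The "Overall Accuracy" column and the "Learner accuracy" row of Table 3 are consistent with plain arithmetic means of the per-learner figures. The sketch below checks this; the dictionary layout and function names are illustrative assumptions, and only the percentages come from the table.

```python
from statistics import mean

# Per-learner accuracy (percent) from Table 3.
ACCURACY = {
    "original": {
        "Decision Tree": 69.25,
        "Rule Learner": 66.63,
        "Nearest Neighbor": 67.05,
        "Naive Bayes": 60.99,
    },
    "pre-classified": {
        "Decision Tree": 86.10,
        "Rule Learner": 85.89,
        "Nearest Neighbor": 84.59,
        "Naive Bayes": 78.03,
    },
}


def overall_accuracy(dataset):
    """Mean accuracy over all learners for one input dataset."""
    return mean(ACCURACY[dataset].values())


def learner_accuracy(learner):
    """Mean accuracy of one learner over both input datasets."""
    return mean(ACCURACY[dataset][learner] for dataset in ACCURACY)


if __name__ == "__main__":
    for dataset in ACCURACY:
        print(f"{dataset}: {overall_accuracy(dataset):.2f}%")
    for learner in ACCURACY["original"]:
        print(f"{learner}: {learner_accuracy(learner):.2f}%")
    gain = overall_accuracy("pre-classified") - overall_accuracy("original")
    print(f"gain: {gain:.2f} percentage points")
```

Under this reading, the move from original to pre-classified data is worth roughly 17.7 percentage points of average accuracy.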
Real vs. Fake
Our results have three important implications. First, they confirm
our assumption that the levels of honesty and accountability are
strongly correlated when distinguishing real from fake personas. As
shown in Figure 3, real nodes are more often associated with both
higher accountability and higher honesty, while fake users have lower
values for both attributes.
Second, our identity model (Figure 1) identifies four possible
types of identity representation: honest and accountable (HA), dishonest and unaccountable (DU), honest and unaccountable (HU), and dishonest and accountable (DA). The fraction and frequency of each type
of identity representation are shown in Figure 4.
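The four identity types can be read as quadrants of the honesty and accountability plane. The sketch below is only an illustration: it assumes both values are scores normalized to the range 0 to 1 and split at a hypothetical threshold of 0.5. Neither the scale nor the threshold is taken from the study.

```python
def identity_type(honesty, accountability, threshold=0.5):
    """Map two scores to one of the four types: HA, HU, DA, or DU.

    Assumption: scores lie in [0, 1] and `threshold` separates
    "high" from "low"; both choices are hypothetical.
    """
    honest = honesty >= threshold
    accountable = accountability >= threshold
    return ("H" if honest else "D") + ("A" if accountable else "U")


if __name__ == "__main__":
    # Hypothetical example profiles as (honesty, accountability) pairs.
    profiles = [(0.9, 0.8), (0.2, 0.1), (0.7, 0.3), (0.4, 0.6)]
    for honesty, accountability in profiles:
        print(honesty, accountability, identity_type(honesty, accountability))
```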
The results support our assumptions that:
1. HA is highly correlated with real-popular users;
2. DU is highly correlated with fake users;
3. HU is mainly correlated with real-local users; and
4. DA is mainly correlated with real-local and fake users.
Figure 3: Relationship between honesty and accountability
to determine the type of identity.
Figure 4: Four-dimensional identity representation.
Finally, our results show that, despite the important role of honesty in
social interaction, the honesty value (quality of content) has less effect
than the accountability value (quantity of interaction) on identity prediction in MySpace.
Friend or Foe?
Our work draws upon research in computer science, statistics,
sociology, and psychology. Previous work similar to ours has looked at
the validity of self-described online identities [6, 18]. Some of these
studies showed promise in predicting personality traits with high
accuracy [7], but were not able to determine whether the predicted traits are real
or fabricated.
The social network analysis we performed was inspired by previous studies on understanding properties of social network structures
[3], especially connectivity [11], interactions [4, 14, 15], and behavior