APRIL 2016 | VOL. 59 | NO. 4 | COMMUNICATIONS OF THE ACM 89
used as inputs to a transaction as being controlled by the
same user, has already been used and explored in previous
work, and exploits an inherent property of the Bitcoin protocol. The second is new and based on the so-called change
addresses; in contrast to the first, it exploits a current idiom
of use in the Bitcoin network rather than an inherent property. As such, it is less robust in the face of changing patterns
within the network, but—as we especially see in Section 5—
it can provide insight into the Bitcoin network that the first
heuristic does not.
4. 1. Our heuristics
Heuristic 1. The first heuristic, in which we link together
addresses used as input to the same transaction, has
already been used many times in previous work.
1, 9, 10, 12
For completeness, we nevertheless present it here as
Heuristic 1: if two (or more) addresses are used as inputs
to the same transaction, then they are controlled by the
same user.
Using this heuristic, we partitioned the network into
5. 5 million clusters of users. By naming these clusters—
using the data collection described in Section 3—we
observed that some of them corresponded to the same
user; for example, there were 20 clusters that we tagged
as being controlled by Mt. Gox. (This is not surprising, as
many big services appear to spread their funds across a
number of distinct addresses to minimize the risk in case
anyone gets compromised.) Factoring in “sink” addresses
that have to date never sent any bitcoins (and thus did not
get clustered using this heuristic) yields at most 6,595,564
distinct users, although we consider this number a quite
large upper bound.
Heuristic 2. Although Heuristic 1 already yields a useful clustering of users, restricting ourselves to only this
heuristic does not tell the whole story. To further collapse users, our second heuristic focuses on the role of
change addresses within the Bitcoin system. A similar
heuristic was explored by Androulaki et al.
1 (who called
them “shadow” addresses), although there are a number
of important differences. In particular, their definition
of shadow addresses relied upon assumptions that may
have held at the time of their work, but no longer hold at
present. For example, they assumed that users rarely issue
transactions to two different users, which is a frequent
occurrence today (e.g., payouts from mining pools, or bets
on gambling sites).
One of the defining features of the Bitcoin protocol is
the way that bitcoins must be spent. When the bitcoins
redeemed as the output of a transaction are spent, they
must be spent all at once: the only way to divide them is
through the use of a change address, in which the excess
from the input address is sent back to the sender. In
one idiom of use, the change address is created internally by the Bitcoin client and never re-used; as such, a
user is unlikely to give out this change address to other
users (e.g., for accepting payments), and in fact might
not even know the address unless he inspects the block
chain. If we can identify change addresses, we can therefore potentially cluster not only the input addresses for a
Non-bank exchanges. In contrast, most of the fixed-rate
exchanges did not function as banks, and are instead
intended for one-time conversions. We therefore were able
to participate in fewer transactions with these exchanges,
although we again tried to transact with most of the major
ones at least once (eight in total).
Vendors. We purchased goods, both physical and digital,
from a wide variety of vendors. Many of the vendors we interacted with did not use an independent method for accepting
bitcoins, but relied instead on the BitPay payment gateway
(and one used WalletBit as a payment gateway). We also kept a
wallet with Silk Road, which allowed us to tag their addresses
without making any purchases.
Gambling. We kept accounts with five poker sites, and
transacted with eight sites offering mini-games and/or
lotteries.
Miscellaneous. Four of the additional services we interacted with were mix or laundry services: when provided with
an output address, they promised to send to that address
coins that had no association with the ones sent to them; the
more sophisticated ones offered to spread the coins out over
various transactions and over time. One of these, BitMix,
simply stole our money, while Bitcoin Laundry twice sent us
our own coins back, indicating we were possibly their only
customer at that time. We also interacted with Bit Visitor, a
site that paid users to visit certain sites; Bitcoin Advertisers,
which provided online advertizing; CoinAd, which gave out
free bitcoins; Coinapult, which forwarded bitcoins to an email
address, where they could then be redeemed; and finally,
Wikileaks, with whom we donated to both their public donation address and two one-time addresses generated for us
via their IRC channel.
3. 2. From other sources
In addition to our own transactions, many users publicly
claim their own addresses; for example, charities providing donation addresses, or LulzSec claiming their address
on Twitter. While we did not attempt to collect all such
instances, many of these tags are conveniently collected
at blockchain.info/tags, including both addresses provided in users’ signatures for Bitcoin forums, as well as
self-submitted tags. We collected all of these tags—over
5000 in total—keeping in mind that the ones that were
not self-submitted (and even the ones that were) could
be regarded as less reliable than the ones we collected
ourselves.
Finally, we searched through the Bitcoin forums (in
particular, bitcointalk.org) looking for addresses associated with major thefts, or now-defunct services such as
Tradehill and GLBSE. Again, these sources are less reliable,
so we consequently labeled users only for addresses for
which we could gain some confidence through manual due
diligence.
4. ADDRESS CLUSTERING
In this section, we present two heuristics for linking addresses
controlled by the same user, with the goal of collapsing the
many addresses seen in the block chain into larger entities.
The first heuristic, in which we treat different addresses