1. Agrawal, S., Goyal, N. Analysis of Thompson sampling
for the multi-armed bandit problem. J. Machine
Learning Research: Workshop and Conference
Proceedings 23 (2012); http://jmlr.org/proceedings/
2. Fink, D. A compendium of conjugate priors. Montana
State University, 1997; http://www.johndcook.com/
3. Goodman, N.D. The principles and practice of
probabilistic programming. In Proceedings of the
40th Annual ACM SIGPLAN-SIGACT Symposium on
Principles of Programming Languages, 2013; http://
4. Norvig, P. Machine learning for programming.
InfoQ, 2015; https://www.infoq.com/
5. Parr, T., Vinju, J. Towards a universal code formatter
through machine learning. In Proceedings of the ACM
SIGPLAN International Conference on Software
Language Engineering, 2016; http://dl.acm.org/
6. Paulos, J.A. The mathematics of changing your mind.
New York Times Sunday Book Review; http://www.
7. Raychev, V., Bielik, P., Vechev, M. Probabilistic model
for code with decision trees. In Proceedings of
the ACM SIGPLAN International Conference on
Object-oriented Programming, Systems, Languages,
and Applications, 2016; http://dl.acm.org/citation.
8. Schwarz, K. Darts, dice, and coins: Sampling from a
discrete distribution, 2011; http://www.keithschwarz.
9. Scibior, A., Ghahramani, Z., Gordon, A.D. Practical
probabilistic programming with monads. In
Proceedings of the 2015 ACM SIGPLAN Symposium
on Haskell; http://dl.acm.org/citation.cfm?id=2804317.
10. Silver, D., et al. Mastering the game of Go with deep
neural networks and tree search, 2016; https://blog.
11. Stucchio, C. Bayesian bandits—Modifying click-throughs with statistics, 2013; https://www.
12. Wikipedia. Conjugate prior: Discrete distributions;
Erik Meijer has been working on “Democratizing the
Cloud” for the past 15 years. He is known for his work
on the Haskell, C#, Visual Basic, and Dart programming
languages, among others, as well as for his contributions
to LINQ and the Reactive Framework (Rx).
Copyright held by owner/author.
Publication rights licensed to ACM. $15.00.
Determine the optimal title for this article that would maximize click-through on the CACM website. That is, should we use “Probabilistic Programming for Dummies” instead of the current title, “Making Money Using Math”?
In this case, we create the model shown in Figure 7: the set of all users as a conditional distribution of a user clicking on the article given the title served,

User ∈ ℙ(Click|Title)
Note that we do not want to make any a priori assumptions about the underlying distributions, other than the frequentist stream of clicks received given the frequentist stream of titles served to the users.
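As a minimal sketch (in Python rather than the article's own notation), such a user can be simulated as a function from titles to Bernoulli click samples; the titles and click-through rates below are hypothetical stand-ins for real user behavior:

```python
import random

# Hypothetical per-title click-through rates; in reality these are unknown
# and only observable through the stream of clicks.
TRUE_CTR = {
    "Probabilistic Programming for Dummies": 0.04,
    "Making Money Using Math": 0.07,
}

def user(title):
    """User ∈ ℙ(Click|Title): sample one click (True/False) given a title."""
    return random.random() < TRUE_CTR[title]
```

Each call to `user` draws one click from the conditional distribution; the agent only ever sees these samples, never the underlying rates.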
The agent in this case wants to find out over time which possible title for a story will generate the most clicks from the users; hence, we will model the agent by a higher-order function that takes the users and from that creates a distribution of titles to serve.
Mathematicians call this implementation of user a Bayesian bandit,11 and they leverage the fact that the beta distribution is the conjugate prior of the Bernoulli distribution.12 They call the variant of the run function we will be using Thompson sampling.1
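A minimal sketch of why this conjugacy matters: with a Beta(a, b) prior on a title's click probability, the posterior after each Bernoulli observation is again a beta distribution, so updating reduces to incrementing one of two counters. The `update` helper below is illustrative, not from the article:

```python
# Beta-Bernoulli conjugacy: if the prior over a title's click probability
# is Beta(a, b), the posterior after observing a click (True) or no click
# (False) is again a beta distribution, with one count incremented.
def update(a, b, clicked):
    """Posterior parameters after one Bernoulli observation."""
    return (a + 1, b) if clicked else (a, b + 1)

# Starting from a uniform prior Beta(1, 1), three clicks and one miss:
a, b = 1, 1
for clicked in [True, True, False, True]:
    a, b = update(a, b, clicked)
# Posterior is Beta(4, 2), with posterior mean a / (a + b) = 4/6
```

No integration or numerical inference is needed; the full posterior is carried by the two counts.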
When viewed from a computer scientist’s point of view, the operational solution is relatively straightforward. We convert the user behavior that returns a distribution of clicks, user ∈ Title→ℙ(Click), into a function Title→ℙ(Title&Clickℝ) that returns a distribution of pairs of titles and clicks, using run as explained earlier (this corresponds to the beta-distribution part of the algorithm; we do not track the “uncertainty” about ℙ(Click), but we can easily compute that together with the click probability if that is useful). A small tweak is needed in that we are interested only in clicks that are true, and not in those that are false (this is the Bernoulli part of the algorithm).
This allows us to observe how the probability that the user will click on each title evolves over time as we see more clicks from the users. Whenever we need to produce a new title, we use the Title for which the most recent Title&Clickℝ has the highest probability (this is the Thompson-sampling part of the algorithm). In other words, the Bayesian bandit is essentially a merge sort over the reified underlying probability distributions of the clicks from the users.
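The whole loop can be sketched in Python; the `thompson` helper, the titles, and the simulated click-through rates below are hypothetical, with per-title beta posteriors standing in for the reified Title&Clickℝ distributions:

```python
import random

def thompson(titles, user, rounds=1000, rng=random):
    """Thompson sampling over a Beta(1, 1) prior per title."""
    posterior = {t: [1, 1] for t in titles}  # [alpha, beta] per title
    for _ in range(rounds):
        # Sample a plausible click probability from each title's posterior...
        samples = {t: rng.betavariate(a, b) for t, (a, b) in posterior.items()}
        # ...serve the title whose sampled probability is highest...
        title = max(samples, key=samples.get)
        # ...and update that title's posterior with the observed click.
        if user(title):
            posterior[title][0] += 1
        else:
            posterior[title][1] += 1
    return posterior
```

Over many rounds the posterior of the best-performing title concentrates, and that title ends up being served almost exclusively, while weaker titles still receive occasional exploratory traffic.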
The computational model underneath modern applications such as self-driving cars, speech and image recognition, personalized recommendations, and so on is changing from classical deterministic computations to probabilistic machine-learned models. Currently, building such applications requires deep knowledge of and expertise in the underlying details of the ML algorithms, using custom tools.
Probabilistic programming aims to democratize the application of machine learning by allowing regular programmers to incorporate ML in general-purpose programming languages without friction. As illustrated in this article, from a semantics point of view, a probabilistic language simply adds the probability monad to the set of ambient side effects and leverages Bayes’ rule to compose and condition probability distributions. Efficiently implementing probabilistic languages and providing the proper software-engineering tooling, however, will keep compiler and programming-language experts busy for a long time.
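To make the probability-monad remark concrete, here is a minimal sketch of a discrete probability monad in Python; representing a distribution as a dict of outcome-to-probability, with `unit` and `bind` as illustrative names not taken from the article:

```python
def unit(x):
    """return/unit: the point distribution concentrated on x."""
    return {x: 1.0}

def bind(dist, f):
    """bind: push each outcome through a kernel f(outcome) -> distribution,
    weighting the resulting outcomes by the original probability."""
    out = {}
    for x, p in dist.items():
        for y, q in f(x).items():
            out[y] = out.get(y, 0.0) + p * q
    return out

# A fair coin, followed by a second flip whose bias depends on the first:
coin = {"H": 0.5, "T": 0.5}
second = bind(coin, lambda c: {"H": 0.9, "T": 0.1} if c == "H"
                              else {"H": 0.1, "T": 0.9})
# by symmetry, second is again the fair distribution {"H": 0.5, "T": 0.5}
```

Composition of distributions is ordinary monadic `bind`; conditioning (Bayes' rule) would be an additional operation that renormalizes after discarding outcomes inconsistent with an observation.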
Figure 7. Set of all users as a conditional distribution of a user clicking on the article given the title.