developer community, the full potential of the technology has not yet been
reached. Thanks to the foundational
nature of LINQ, there is still enormous
potential for its mapping scenarios
outside object-relational (O/R), especially in the area of big data.
The advent of big data makes it
more important than ever for programmers to have a single abstraction
that allows them to process, transform, compose, query, analyze, and
compute across at least three different dimensions: volume, big or small,
ranging from billions of items to a
handful of results; variety in models, structured or unstructured, flat
or nested; and velocity, streaming or
persisted, push or pull. As a result, we
see a mind-blowing number of new
data models, query languages, and
execution fabrics. LINQ can virtualize all these aspects behind a single
abstraction.

Figure 1. Example pie chart.
Figure 2. Relational algebra operators.

Take, for example, Apache’s Hadoop
ecosystem. It comes with at least eight
external DSLs (domain-specific languages) or APIs: a set of low-level Java interfaces for MapReduce computations;
Cascading, a “data-processing definition language, implemented as a simple
Java API;” Flume, a “simple and flexible
architecture based on streaming data
flows;” Pig, a “high-level language for
expressing data analysis programs;”
HiveQL, an “SQL-like language for easy
data summarization, ad hoc queries,
and the analysis of large data sets;” CQL,
a “proposed language for data management in Cassandra;” Oozie, an XML-based “coordinator engine specialized
in running workflows based on time
and data triggers;” and Avro, a schema
language for data serialization.
To create an end-to-end application, programmers need to use several of these external DSLs in addition
to a general-purpose programming
language such as Java to glue everything together. If data comes from an
external RDBMS (relational database
management system) or push-based
source, then even more DSLs such as
SQL or StreamBase are required. Using LINQ and C# or Visual Basic, on
the other hand, programmers can use
internal DSLs to program against any
shape or form of data inside a general-purpose OO (object-oriented) language that comes with tooling (Visual
Studio or cross-platform solutions
from Xamarin such as MonoDevelop,
MonoTouch for iPhone, or Mono for
Android) and an extensive collection of
standard libraries (.NET Framework).

Standard Query Operators and LINQ
Assume that given a file of text—say,
words.txt—you need to count the
number of distinct words in that file,
find the five most common ones, and
visualize the result in a pie chart. If you
think about this for a minute, it becomes clear that this is really an exercise in transforming collections. This is
exactly the kind of task for which LINQ
was designed. To keep things simple,
we have implemented this example using LINQ to Objects to process the data
in memory; however, with minimal
modification the same code runs on
LINQ to HPC (high-performance computing) over terabytes of data stored in
commodity clusters.
The standard File.ReadAllText
method provides the content of the file
as a single giant string. You first need
to chop up this string into individual
words by breaking it at delimiter characters such as space, comma, period,
etc. Once you have a list of words, you
need to clean it up, removing all empty words. Finally, normalize all words
to lowercase.
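
The three steps just described (split, remove empty words, lowercase) can be sketched with standard query operators as follows. This is a minimal illustration rather than a definitive implementation: the class name WordSplitter, the method name SplitWords, and the exact delimiter set are assumptions, since the text names only space, comma, and period before "etc."

```csharp
using System;
using System.Linq;

public static class WordSplitter
{
    // Assumed delimiter set; the article lists only space, comma, and period.
    private static readonly char[] Delimiters =
        { ' ', ',', '.', ';', ':', '!', '?', '\t', '\r', '\n' };

    // Hypothetical helper: turns raw text into a cleaned array of words.
    public static string[] SplitWords(string text)
    {
        return text
            .Split(Delimiters)                        // chop the string at delimiter characters
            .Where(word => word.Length > 0)           // remove the empty words left by adjacent delimiters
            .Select(word => word.ToLowerInvariant())  // normalize every word to lowercase
            .ToArray();
    }
}
```

A caller would pass in the file content, for example WordSplitter.SplitWords(File.ReadAllText("words.txt")). Passing StringSplitOptions.RemoveEmptyEntries to Split would drop the empty entries in one call; the explicit Where is kept here only to mirror the steps as the text lists them.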