Stevens Institute of Technology Scaling Big Data Mining Infrastructure Paper Read the paper by Lin and Ryaboy: Scaling Big Data Mining Infrastructure: The Twitter Experience (available on CANVAS), and write and submit a short (~1/2 page) report Scaling Big Data Mining Infrastructure:
The Twitter Experience
Jimmy Lin and Dmitriy Ryaboy
The analytics platform at Twitter has experienced tremendous growth over the past few years in terms of size, complexity, number of users, and variety of use cases. In this
paper, we discuss the evolution of our infrastructure and the
development of capabilities for data mining on “big data”.
One important lesson is that successful big data mining in
practice is about much more than what most academics
would consider data mining: life “in the trenches” is occupied
by much preparatory work that precedes the application of
data mining algorithms and followed by substantial effort to
turn preliminary models into robust solutions. In this context, we discuss two topics: First, schemas play an important role in helping data scientists understand petabyte-scale
data stores, but they’re insufficient to provide an overall “big
picture” of the data available to generate insights. Second,
we observe that a major challenge in building data analytics
platforms stems from the heterogeneity of the various components that must be integrated together into production
workflows—we refer to this as “plumbing”. This paper has
two goals: For practitioners, we hope to share our experiences to flatten bumps in the road for those who come after
us. For academic researchers, we hope to provide a broader
context for data mining in production environments, pointing out opportunities for future work.
The analytics platform at Twitter has experienced tremendous growth over the past few years in terms of size, complexity, number of users, and variety of use cases. In 2010,
there were approximately 100 employees in the entire company and the analytics team consisted of four people—the
only people to use our 30-node Hadoop cluster on a daily basis. Today, the company has over one thousand employees.
There are thousands of Hadoop nodes across multiple datacenters. Each day, around one hundred terabytes of raw
data are ingested into our main Hadoop data warehouse;
engineers and data scientists from dozens of teams run tens
of thousands of Hadoop jobs collectively. These jobs accomplish everything from data cleaning to simple aggregations and report generation to building data-powered products to training machine-learned models for promoted products, spam detection, follower recommendation, and much,
much more. We’ve come a long way, and in this paper, we
share experiences in scaling Twitter’s analytics infrastructure over the past few years. Our hope is to contribute to a
set of emerging “best practices” for building big data analytics platforms for data mining from a case study perspective.
A little about our backgrounds: The first author is an Associate Professor at the University of Maryland who spent an
extended sabbatical from 2010 to 2012 at Twitter, primarily working on relevance algorithms and analytics infrastructure. The second author joined Twitter in early 2010 and
was first a tech lead, then the engineering manager of the
analytics infrastructure team. Together, we hope to provide
a blend of the academic and industrial perspectives—a bit
of ivory tower musings mixed with “in the trenches” practical advice. Although this paper describes the path we have
taken at Twitter and is only one case study, we believe our
recommendations align with industry consensus on how to
approach a particular set of big data challenges.
The biggest lesson we wish to share with the community
is that successful big data mining is about much more than
what most academics would consider data mining. A significant amount of tooling and infrastructure is required to
operationalize vague strategic directives into concrete, solvable problems with clearly-defined metrics of success. A data
scientist spends a significant amount of effort performing
exploratory data analysis to even figure out “what’s there”;
this includes data cleaning and data munging not directly
related to the problem at hand. The data infrastructure engineers work to make sure that productionized workflows operate smoothly, efficiently, and robustly, reporting errors and
alerting responsible parties as necessary. The “core” of what
academic researchers think of as data mining—translating
domain insight into features and training models for various
tasks—is a comparatively small, albeit critical, part of the
overall insight-generation lifecycle.
In this context, we discuss two topics: First, with a certain
amount of bemused ennui, we explain that schemas play an
important role in helping data scientists understand petabyte-scale data stores, but that schemas alone are insufficient to provide an overall “big picture” of the data available and how they can be mined for insights. We’ve frequently observed that data scientists get stuck before they
even begin—it’s surprisingly difficult in a large production
environment to understand what data exist, how they are
structured, and how they relate to each other. Our discussion is couched in the context of user behavior logs, which
comprise the bulk of our data. We share a number of examples, based on our experience, of what doesn’t work and
how to fix it.
Second, we observe that a major challenge in building data
analytics platforms comes from the heterogeneity of the various components that must be integrated together into production workflows. Much complexity arises from impedance
Volume 14, Issue 2
mismatches at the interface between different components.
A production system must run like clockwork, splitting out
aggregated reports every hour, updating data products every
three hours, generating new classifier models daily, etc. Getting a bunch of heterogeneous components to operate as a
synchronized and coordinated workflow is challenging. This
is what we fondly call “plumbing”—the not-so-sexy pieces of
software (and duct tape and chewing gum) that ensure everything runs together smoothly is part of the “black magic”
of converting data to insights.
This paper has two goals: For practitioners, we hope to
share our experiences to flatten bumps in the road for those
who come after us. Scaling big data infrastructure is a complex endeavor, and we point out potential pitfalls along the
way, with possible solutions. For academic researchers, we
hope to provide a broader context for data mining in production environments—to help academics understand how their
research is adapted and applied to solve real-world problems
at scale. In addition, we identify opportunities for future
work that could contribute to streamline big data mining
HOW WE GOT HERE
We begin by situating the Twitter analytics platform in
the broader context of “big data” for commercial enterprises.
The “fourth paradigm” of how big data is reshaping the physical and natural sciences  (e.g., high-energy physics and
bioinformatics) is beyond the scope of this paper.
The simple idea that an organization should retain data
that result from carrying out its mission and exploit those
data to generate insights that benefit the organization is of
course not new. Commonly known as business intelligence,
among other monikers, its origins date back several decades.
In this sense, the “big data” hype is simply a rebranding of
what many organizations have been doing all along.
Examined more closely, however, there are three major
trends that distinguish insight-generation activities today
from, say, the 1990s. First, we have seen a tremendous explosion in the sheer amount of data—orders of magnitude
increase. In the past, enterprises have typically focused on
gathering data that are obviously valuable, such as business
objects representing customers, items in catalogs, purchases,
contracts, etc. Today, in addition to such data, organizations also gather behavioral data from users. In the online
setting, these include web pages that users visit, links that
they click on, etc. The advent of social media and usergenerated content, and the resulting interest in encouraging
such interactions, further contributes to the amount of data
that is being accumulated.
Second, and more recently, we have seen increasing sophistication in the types of analyses that organizations perform
on their vast data stores. Traditionally, most of the information needs fall under what is known as online analytical
processing (OLAP). Common tasks include ETL (extract,
transform, load) from multiple data sources, creating joined
views, followed by filtering, aggregation, or cube materialization. Statisticians might use the phrase descriptive statistics
to describe this type of analysis. These outputs might feed
report generators, front-end dashboards, and other visualization tools to support common “roll up” and “drill down”
operations on multi-dimensional data. Today, however, a
new breed of “data scientists” want to do far more: they
are interested in predictive analytics. These include, for ex-
ample, using machine learning techniques to train predictive models of user behavior—whether a piece of content is
spam, whether two users should become “friends”, the likelihood that a user will complete a purchase or be interested
in a related product, etc. Other desired capabilities include
mining large (often unstructured) data for statistical regularities, using a wide range of techniques from simple (e.g.,
k-means clustering) to complex (e.g., latent Dirichlet allocation or other Bayesian approaches). These techniques might
surface “latent” facts about the users—such as their interest
and expertise—that they do not explicitly express.
To be fair, some types of predictive analytics have a long
history—for example, credit card fraud detection and market basket analysis. However, we believe there are several
qualitative differences. The application of data mining on
behavioral data changes the scale at which algorithms need
to operate, and the generally weaker signals present in such
data require more sophisticated algorithms to produce insights. Furthermore, expectations have grown—what were
once cutting-edge techniques practiced only by a few innovative organizations are now routine, and perhaps even necessary for survival in today’s competitive environment. Thus,
capabilities that may have previously been considered luxuries are now essential.
Finally, open-source software is playing an increasingly
important role in today’s ecosystem. A decade ago, no
credible open-source, enterprise-grade, distributed data analytics platform capable of handling large data volumes existed. Today, the Hadoop open-source implementation of
MapReduce  lies at the center of a de facto platform
for large-scale data analytics, surrounded by complementary
systems such as HBase, ZooKeeper, Pig, Hive, and many
others. The importance of Hadoop is validated not only
by adoption in countless startups, but also the endorsement
of industry heavyweights such as IBM, Microsoft, Oracle,
and EMC. Of course, Hadoop is not a panacea and is not
an adequate solution for many problems, but a strong case
can be made for Hadoop supplementing (and in some cases,
replacing) existing data management systems.
Analytics at Twitter lies at the intersection of these three
developments. The social media aspect is of course obvious. Like Facebook , LinkedIn, and many other companies, Twitter has eschewed, to the extent practical, costly
proprietary systems in favor of building around the Hadoop
open-source platform. Finally, like other organizations, analytics at Twitter range widely in sophistication, from simple
aggregations to training machine-learned models.
THE BIG DATA MINING CYCLE
In production environments, effective big data mining at
scale doesn’t begin or end with what academics would consider data mining. Most of the research literature (e.g., KDD
papers) focus on better algorithms, statistical models, or
machine learning techniques—usually starting with a (relatively) well-defined problem, clear metrics for success, and
existing data. The criteria for publication typically involve
improvements in some figure of merit (hopefully statistically
significant): the new proposed method is more accurate, runs
faster, requires less memory, is more robust to noise, etc.
In contrast, the problems we grapple with on a daily basis are far more “messy”. Let us illustrate with a realistic
but hypothetical scenario. We typically begin with a poorly
formulated problem, often driven from outside engineering
Volume 14, Issue 2
and aligned with strategic objectives of the organization,
e.g., “we need to accelerate user growth”. Data scientists
are tasked with executing against the goal—and to operationalize the vague directive into a concrete, solvable problem requires exploratory data analysis. Consider the following sample questions:
When do users typically log in and out?
What features of the product do they use?
Do different groups of users behave differently?
Do these activities correlate with engagement?
What network features correlate with activity?
How do activity profiles of users change over time?
Before beginning exploratory data analysis, the data scientist needs to know what data are available and how they are
organized. This fact may seem obvious, but is surprisingly
difficult in practice. To understand why, we must take a
slight detour to discuss service architectures.
Service Architectures and Logging
Web- and internet-based products today are typically designed as composition of services: rendering a web page might
require the coordination of dozens or even hundreds of component services that provide different content. For example,
Twitter is powered by many loosely-coordinated services,
for adding and removing followers, for storing and fetching
tweets, for providing user recommendations, for accessing
search functionalities, etc. Most services are not aware of the
implementation details of other services and all services communicate through well-defined and stable interfaces. Typically, in this design, each service performs its own logging—
these are records of requests, along with related information
such as who (i.e., the user or another service) made the request, when the request occurred, what the result was, how
long it took, any warnings or exceptions that were thrown,
etc. The log data are independent of each other by design,
the side effect of which is the creation of a large number of
isolated data stores which need to be composed to reconstruct a complete picture of what happened during processing of even a single user request.
Since a single user action may involve many services, a
data scientist wishing to analyze user behavior must first
identify all the disparate data sources involved, understand
their contents, and then join potentially incompatible data
to reconstruct what users were doing. This is often easier
said than done! In our Hadoop data warehouse, logs are
all deposited in the /logs/ directory, with a sub-directory
for each log “category”. There are dozens of log categories,
many of which are named after projects whose function is
well-known but whose internal project names are not intuitive. Services are normally developed and operated by
different teams, which may adopt different conventions for
storing and organizing log data. Frequently, engineers who
build the services have little interaction with the data scientists who analyze the data—so there is no guarantee that
fields needed for data analysis are actually present (see Section 4.5). Furthermore, services change over time in a number of ways: functionalities evolve; two services merge into
one; a single service is replaced by two; a service becomes
obsolete and is replaced by another; a service is used in
ways for which it was not originally intended. These are
all reflected in idiosyncrasies present in individual service
logs—these comprise the “institutional” knowledge that data
scientists acquire over time, but create steep learning curves
for new data scientists.
The net effect is that data scientists expend a large amount
of effort to understand the data available to them, before
they even begin any meaningful analysis. In the next section, we discuss some solutions, but these challenges are far
from solved. There is an increasing trend to structure teams
that are better integrated, e.g., involving data scientists in
the design of a service to make sure the right data are captured, or even having data scientists embedded within individual product teams, but such organizations are the exception, not the rule. A fundamental tradeoff in organizational
solutions is development speed vs. ease of analysis: incorporating analytics considerations when designing and building
services slows down development, but failure to do so results
in unsustainable downstream complexity. A good balance
between these competing concerns is difficult to strike.
Exploratory Data Analysis
Exploratory data analysis always reveals data quality issues. In our recollection, we have never encountered a large,
real-world dataset that was directly usable without data
cleaning. Sometimes there are outright bugs, e.g., inconsistently formatted messages or values that shouldn’t exist. There are inevitably corrupt records—e.g., a partially
written message caused by a premature closing of a file handle. Cleaning data often involves sanity checking: a common
technique for a service that logs aggregate counts as well as
component counts is to make sure the sum of component
counts matches the aggregate counts. Another is to compute various frequencies from the raw data to make sure the
numbers seem “reasonable”. This is surprisingly difficult—
identifying values that seem suspiciously high or suspiciously
low requires experience, since the aggregate behavior of millions of users is frequently counter-intuitive. We have encountered many instances in which we thought that there
must have been data collection errors, and only after careful
verification of the data generation and import pipeline were
we confident that, indeed, users did really behave in some
Sanity checking frequently reveals abrupt shifts in the
characteristics of the data. For example, in a single day, the
prevalence of a particular type of message might decrease or
increase by orders of magnitude. This is frequently an artifact of some system change—for example, a new feature just
having been rolled out to the public or a service endpoint
that has been deprecated in favor of a new system. Twitter
has gotten sufficiently complex that it is no longer possible for any single individual to keep track of every product
rollout, API change, bug fix release, etc., and thus abrupt
changes in datasets appear mysterious until the underlying
causes are understood, which is often a time-consuming activity requiring cross-team cooperation.
Even when the logs are “correct”, there are usually a host
of outliers caused by “non-typical” use cases, most often attributable to non-human actors in a human domain. For
example, in a query log, there will be robots responsible
for tens of thousands of queries a day, robots who issue
queries thousands of terms long, etc. Without discarding
these outliers, any subsequent analysis will produce skewed
results. Although over time, a data scientist gains experi-
Volume 14, Issue 2
ence in data cleaning and the process becomes more routine,
there are frequently surprises and new situations. We have
not yet reached the point where data cleaning can be performed automatically.
Purchase answer to see full
Delivering a high-quality product at a reasonable price is not enough anymore.
That’s why we have developed 5 beneficial guarantees that will make your experience with our service enjoyable, easy, and safe.
You have to be 100% sure of the quality of your product to give a money-back guarantee. This describes us perfectly. Make sure that this guarantee is totally transparent.
Each paper is composed from scratch, according to your instructions. It is then checked by our plagiarism-detection software. There is no gap where plagiarism could squeeze in.
Thanks to our free revisions, there is no way for you to be unsatisfied. We will work on your paper until you are completely happy with the result.
Your email is safe, as we store it according to international data protection rules. Your bank details are secure, as we use only reliable payment systems.