Data Science at Scale 2013 Summer School Paper List
Date | Discussion Leader | Paper
29 May | Ayan | The Fourth Paradigm: Data-Intensive Scientific Discovery, edited by Tony Hey, Stewart Tansley, and Kristin Tolle. Chapter 1
29 May | Sidharth | Snap-together visualization: a user interface for coordinating visualizations via relational schemata. Chris North, Ben Shneiderman. Proceedings of the Working Conference on Advanced Visual Interfaces (AVI 2000), 128-135
29 May | Wathsala | Toward measuring visualization insight. Chris North. IEEE Computer Graphics and Applications 26(3), May 2006, 6-9
5 June | Yu Su | The Google File System. Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. SOSP '03: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, 29-43
5 June | Boonth | MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean, Sanjay Ghemawat. Communications of the ACM 51(1), January 2008, 107-113
11 June | Kien | Provenance for computational tasks: A survey. J. Freire, D. Koop, E. Santos, C.T. Silva. Computing in Science & Engineering 10(3), 11-21
11 June | Kien | The open provenance model core specification (v1.1). L. Moreau, B. Clifford, J. Freire, J. Futrelle, Y. Gil, P. Groth, N. Kwasnikowska, et al. Future Generation Computer Systems 27(6), 743-756
26 June | Peter | Data-intensive spatial filtering in large numerical simulation datasets. Kalin Kanov, Randal Burns, Greg Eyink, Charles Meneveau, Alexander Szalay. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, Article 60
26 June | Will | Extreme Data-Intensive Scientific Computing. Alexander Szalay. Computing in Science & Engineering 13(6)
3 July | Ayun | Design Study Methodology: Reflections from the Trenches and the Stacks. Michael Sedlmair, Miriah Meyer, Tamara Munzner. IEEE Transactions on Visualization and Computer Graphics 18(12), December 2012
3 July | Chris | The challenge of information visualization evaluation. Catherine Plaisant. AVI '04: Proceedings of the Working Conference on Advanced Visual Interfaces, 109-116
10 July | Wathsala | Interactive exploration and analysis of large-scale simulations using topology-based data segmentation. P.-T. Bremer, Gunther Weber, Julien Tierny, Valerio Pascucci, Marc Day, John Bell. IEEE Transactions on Visualization and Computer Graphics 17(9), 2011, 1307-1324
10 July | Sidharth | Efficient Data Restructuring and Aggregation for I/O Acceleration in PIDX. Sidharth Kumar, Venkatram Vishwanath, Philip Carns, Joshua A. Levine, Robert Latham, Giorgio Scorzelli, Hemanth Kolla, et al. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, Article 50. IEEE Computer Society Press, 2012
Other papers:
Large-scale Incremental Processing Using Distributed Transactions and Notifications
by Daniel Peng and Frank Dabek
Updating an index of the web as documents are crawled requires
continuously transforming a large repository of existing documents as
new documents arrive. This task is one example of a class of data
processing tasks that transform a large repository of data via small,
independent mutations. These tasks lie in a gap between the
capabilities of existing infrastructure. Databases do not meet the
storage or throughput requirements of these tasks: Google's indexing
system stores tens of petabytes of data and processes billions of
updates per day on thousands of machines. MapReduce and other
batch-processing systems cannot process small updates individually as
they rely on creating large batches for efficiency.
We have built Percolator, a system for incrementally processing updates
to a large data set, and deployed it to create the Google web search
index. By replacing a batch-based indexing system with an indexing
system based on incremental processing using Percolator, we process the
same number of documents per day, while reducing the average age of
documents in Google search results by 50%.
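To make the notification model concrete, here is a minimal single-process sketch in the spirit of the paper: writes to a watched column trigger observer callbacks, which can cascade further writes. The names and structure are illustrative, not Percolator's actual API; the real system layers ACID transactions and distributed notifications over Bigtable.

```python
# Toy sketch of Percolator-style incremental processing: writes to a
# watched column trigger observers, which cascade further work.
# Single-process and non-transactional; illustrative only.

table = {}        # (row, column) -> value
observers = {}    # column -> list of callback(row, value)

def watch(column, callback):
    observers.setdefault(column, []).append(callback)

def write(row, column, value):
    table[(row, column)] = value
    for callback in observers.get(column, []):
        callback(row, value)   # notification: run dependent work

# Example: re-index a document whenever its raw contents change.
def index_document(row, contents):
    write(row, "index:words", sorted(set(contents.split())))

watch("raw:contents", index_document)
write("doc1", "raw:contents", "the quick brown fox")
print(table[("doc1", "index:words")])   # ['brown', 'fox', 'quick', 'the']
```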
Bigtable:
A Distributed Storage System for Structured Data
by Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber
Bigtable is a distributed storage system for
managing structured data
that is designed to scale to a very large size:
petabytes of data across thousands of commodity servers. Many
projects at Google store data in Bigtable, including web indexing,
Google Earth, and Google Finance. These applications place very
different demands on Bigtable, both in terms of data size (from URLs
to web pages to satellite imagery) and latency requirements (from
backend bulk processing to real-time data serving). Despite these
varied demands, Bigtable has successfully provided a flexible,
high-performance solution for all of these Google products. In this
paper we describe the simple data model provided by Bigtable, which
gives clients dynamic control over data layout and format, and we
describe the design and implementation of Bigtable.
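The data model described in the paper is, at heart, a sparse, sorted, multi-dimensional map from (row key, column key, timestamp) to an uninterpreted value. A toy in-memory rendering of that interface (method names are ours, not Bigtable's):

```python
import bisect

class ToyBigtable:
    # Sparse sorted map: (row key, column key, timestamp) -> value.
    def __init__(self):
        self.rows = []   # sorted row keys, enabling range scans
        self.data = {}   # row -> {column -> {timestamp -> value}}

    def put(self, row, col, ts, value):
        if row not in self.data:
            bisect.insort(self.rows, row)
            self.data[row] = {}
        self.data[row].setdefault(col, {})[ts] = value

    def get(self, row, col):
        # Return the most recent version, as clients usually want.
        versions = self.data.get(row, {}).get(col, {})
        return versions[max(versions)] if versions else None

    def scan(self, start_row, end_row):
        # Rows stay sorted, so a range scan is a contiguous slice.
        lo = bisect.bisect_left(self.rows, start_row)
        hi = bisect.bisect_left(self.rows, end_row)
        return [(r, self.data[r]) for r in self.rows[lo:hi]]

t = ToyBigtable()
t.put("com.cnn.www", "contents:", 1, "<html>v1</html>")
t.put("com.cnn.www", "contents:", 2, "<html>v2</html>")
print(t.get("com.cnn.www", "contents:"))   # latest version: v2
```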
Spanner:
Google's Globally-Distributed Database
by James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford
Spanner is Google's scalable,
multi-version, globally-distributed, and
synchronously-replicated database. It is the first system to
distribute data at global scale and support externally-consistent
distributed transactions. This paper describes how Spanner is
structured, its feature set, the rationale underlying various design
decisions, and a novel time API that exposes clock uncertainty. This
API and its implementation are critical to supporting external
consistency and a variety of powerful features: non-blocking reads in
the past, lock-free read-only transactions, and atomic schema
changes, across all of Spanner.
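The novel time API returns an uncertainty interval rather than a single instant, and external consistency follows from "commit wait": a transaction's timestamp becomes visible only after the clock uncertainty has passed. A rough sketch under an invented uncertainty bound (real TrueTime derives its bound from GPS receivers and atomic clocks):

```python
import time

EPSILON = 0.007   # assumed clock uncertainty in seconds (invented value)

def tt_now():
    # TrueTime-style interval: true time lies within [earliest, latest].
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit():
    # Choose a timestamp no earlier than "now" could possibly be.
    _, latest = tt_now()
    commit_ts = latest
    # Commit wait: block until commit_ts is definitely in the past
    # everywhere, so timestamp order agrees with real-time order.
    while tt_now()[0] <= commit_ts:
        time.sleep(EPSILON / 2)
    return commit_ts

print("committed at", commit())
```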
Dynamo: Amazon's Highly Available Key-value Store
by Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan
Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan
Sivasubramanian, Peter Vosshall and Werner Vogels
Reliability at massive scale is one of the biggest challenges we face
at Amazon.com, one of the largest e-commerce operations in the world;
even the slightest outage has significant financial consequences and
impacts customer trust. The Amazon.com platform, which provides
services for many web sites worldwide, is implemented on top of an
infrastructure of tens of thousands of servers and network components
located in many datacenters around the world. At this scale, small and
large components fail continuously and the way persistent state is
managed in the face of these failures drives the reliability and
scalability of the software systems.
This paper presents the design and implementation of Dynamo, a highly
available key-value storage system that some of Amazon's core services
use to provide an "always-on" experience. To achieve this level of
availability, Dynamo sacrifices consistency under certain failure
scenarios. It makes extensive use of object versioning and
application-assisted conflict resolution in a manner that provides a
novel interface for developers to use.
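The object versioning the paper mentions is built on vector clocks: each node increments its own counter when it coordinates a write, and two versions conflict when neither clock dominates the other, leaving the application to merge them. A compact sketch:

```python
def dominates(a, b):
    # Vector clock a dominates b if it is >= on every node counter.
    return all(a.get(n, 0) >= c for n, c in b.items())

def reconcile(versions):
    # Keep only versions not dominated by another; if more than one
    # survives, they are true conflicts for the application to merge.
    frontier = []
    for v, clock in versions:
        if not any(dominates(c2, clock) and c2 != clock
                   for _, c2 in versions):
            frontier.append((v, clock))
    return frontier

v1 = ("cart:[book]",     {"nodeA": 1})
v2 = ("cart:[book,pen]", {"nodeA": 2})               # supersedes v1
v3 = ("cart:[book,mug]", {"nodeA": 1, "nodeB": 1})   # concurrent with v2
print(reconcile([v1, v2, v3]))   # v2 and v3 survive: a real conflict
```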
The Fourth Paradigm: Data-Intensive Scientific Discovery
edited by Tony Hey, Stewart Tansley, and Kristin Tolle
A collection of essays that expands on the vision of pioneering computer
scientist Jim Gray for a new, fourth paradigm of discovery based on
data-intensive science.
GrayWulf: Scalable Clustered Architecture for Data Intensive Computing
by Szalay, A.S.; Bell, G.; VandenBerg, J.; Wonders, A.; Burns, R.; Fay, D.;
Heasley, J.; Hey, T.; Nieto-Santisteban, M.; Thakar, A.; van Ingen, C.;
Wilton, R.
Data intensive computing presents a significant challenge for
traditional supercomputing architectures that maximize FLOPS since CPU
speed has surpassed IO capabilities of HPC systems and Beowulf
clusters. We present GrayWulf, a three-tier commodity-component cluster
architecture designed for a range of data-intensive computations
operating on petascale data sets. The design goal is a
balanced system in terms of IO performance and memory size, according
to Amdahl's laws. The hardware currently installed at JHU exceeds one
petabyte of storage and has 0.5 bytes/sec of I/O and 1 byte of memory
for each CPU cycle. The GrayWulf provides almost an order of magnitude
better balance than existing systems. The paper covers its architecture
and reference applications. The software design is presented in a
companion paper.
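Amdahl's balance rules make "balanced" measurable: roughly one bit of I/O per second and one byte of memory for each instruction per second. A back-of-the-envelope check for a hypothetical blade (all hardware numbers invented for illustration):

```python
# Amdahl balance ratios for a hypothetical storage blade (made-up specs).
cpu_hz   = 2 * 2.5e9    # 2 cores * 2.5 GHz, ~instructions per second
io_Bps   = 1.25e9       # 1.25 GB/s sequential read from SSDs
memory_B = 16 * 2**30   # 16 GiB of RAM

amdahl_io     = (io_Bps * 8) / cpu_hz   # bits of I/O per CPU cycle
amdahl_memory = memory_B / cpu_hz       # bytes of memory per CPU cycle

# Amdahl's rules of thumb call for ~1 bit/cycle of I/O and ~1 byte/cycle
# of memory; GrayWulf reports about 0.5 bytes/sec of I/O per cycle.
print(f"I/O balance:    {amdahl_io:.2f} bits per cycle")
print(f"memory balance: {amdahl_memory:.2f} bytes per cycle")
```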
Designing and mining multi-terabyte astronomy archives: the Sloan Digital Sky Survey
by Alexander S. Szalay, Peter Z. Kunszt, Ani Thakar, Jim Gray, Don Slutz, Robert J. Brunner
The next-generation astronomy digital
archives will cover most of the sky at fine resolution in many
wavelengths, from X-rays, through ultraviolet, optical, and infrared.
The archives will be stored at diverse geographical locations. One of
the first of these projects, the Sloan Digital Sky Survey (SDSS) is
creating a 5-wavelength catalog over 10,000 square degrees of the sky
(see http://www.sdss.org/). The 200 million objects in the
multi-terabyte database will have mostly numerical attributes in a 100+
dimensional space. Points in this space have highly correlated
distributions.
The archive will enable astronomers to explore the data
interactively. Data access will be aided by multidimensional spatial and
attribute indices. The data will be partitioned in many ways. Small tag
objects consisting of the most popular attributes will accelerate
frequent searches. Splitting the data among multiple servers will allow
parallel, scalable I/O and parallel data analysis. Hashing techniques
will allow efficient clustering, and pair-wise comparison algorithms
that should parallelize nicely. Randomly sampled subsets will allow
debugging otherwise large queries at the desktop. Central servers will
operate a data pump to support sweep searches touching most of the data.
The anticipated queries will require special operators related to
angular distances and complex similarity tests of object properties,
like shapes, colors, velocity vectors, or temporal behaviors. These
issues pose interesting data management challenges.
SkyQuery: A Web Service Approach to Federate Databases
by Tanu Malik, Alex S. Szalay, Tamas Budavari, Ani R. Thakar
Traditional science searched for new objects and phenomena that led to
discoveries. Tomorrow's science will combine the large pool of
information in scientific archives and make discoveries. Scientists are
currently keen to federate together the existing scientific databases.
The major challenge in building a federation of these autonomous and
heterogeneous databases is system integration. Ineffective integration
will result in defunct federations and underutilized scientific data.
Astronomy, in particular, has many autonomous archives spread over the
Internet. It is now seeking to federate these, with minimal effort,
into a Virtual Observatory that will solve complex distributed
computing tasks such as answering federated spatial join queries.
In this paper, we present SkyQuery, a successful prototype of an
evolving federation of astronomy archives. It interoperates using the
emerging Web services standard. We describe the SkyQuery architecture
and show how it efficiently evaluates a probabilistic federated spatial
join query.
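The heart of a federated spatial join is a cross-match: pairing objects from different archives whose angular separation falls below a tolerance. A simplified sketch using the haversine great-circle formula; SkyQuery's actual join is probabilistic and distributed across web-service endpoints, and the archive contents below are invented.

```python
from math import radians, degrees, sin, cos, asin, sqrt

def angular_sep_deg(ra1, dec1, ra2, dec2):
    # Haversine formula on the celestial sphere; inputs in degrees.
    ra1, dec1, ra2, dec2 = map(radians, (ra1, dec1, ra2, dec2))
    h = sin((dec2 - dec1) / 2) ** 2 + \
        cos(dec1) * cos(dec2) * sin((ra2 - ra1) / 2) ** 2
    return degrees(2 * asin(sqrt(h)))

def cross_match(archive_a, archive_b, tol_deg=1.0 / 3600):
    # Naive O(n*m) spatial join: pair objects within tol (1 arcsec here).
    # Real systems index the sphere (e.g. zones/HTM) to prune candidates.
    return [(a_id, b_id)
            for a_id, ra_a, dec_a in archive_a
            for b_id, ra_b, dec_b in archive_b
            if angular_sep_deg(ra_a, dec_a, ra_b, dec_b) <= tol_deg]

sdss  = [("sdss-1", 180.00000, 2.00000)]
first = [("first-7", 180.00010, 2.00005)]
print(cross_match(sdss, first))   # matched within 1 arcsecond
```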
PISTON: A Portable Cross-Platform Framework for Data-Parallel Visualization Operators
by Li-ta Lo, Christopher Sewell, and James Ahrens
Due to the wide variety of current and next-generation supercomputing
architectures, the development of high-performance parallel
visualization and analysis operators frequently requires re-writing the
underlying algorithms for many different platforms. In order to
facilitate portability, we have devised a framework for creating such
operators that employs the data-parallel programming model. By writing
the operators using only data-parallel primitives (such as scans,
transforms, stream compactions, etc.), the same code may be compiled to
multiple targets using architecture-specific backend implementations of
these primitives. Specifically, we make use of and extend NVIDIA's
Thrust library, which provides CUDA and OpenMP backends. Using this
framework, we have implemented isosurface, cut surface, and threshold
operators, and have achieved good parallel performance on two different
architectures (multi-core CPUs and NVIDIA GPUs) using the exact same
operator code. We have applied these operators to several large, real
scientific data sets, and have open-source released a beta version of
our code base.
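The paper's approach composes visualization operators from data-parallel primitives. A threshold operator, for instance, is a stream compaction, which decomposes into a map (flag the passing cells), an exclusive scan (compute output slots), and a scatter. A NumPy sketch of that composition, with NumPy standing in for Thrust's CUDA and OpenMP backends:

```python
import numpy as np

def exclusive_scan(x):
    # Prefix sum with the identity shifted in; a core parallel primitive.
    out = np.cumsum(x)
    return np.concatenate(([0], out[:-1]))

def threshold(values, lo):
    # Stream compaction: keep values >= lo, built only from
    # map / scan / scatter so each stage could run data-parallel.
    flags = (values >= lo).astype(np.int64)        # map: 0/1 predicate
    slots = exclusive_scan(flags)                  # scan: output positions
    out = np.empty(int(flags.sum()), dtype=values.dtype)
    out[slots[flags == 1]] = values[flags == 1]    # scatter survivors
    return out

cells = np.array([0.2, 0.9, 0.4, 0.95, 0.7])
print(threshold(cells, 0.5))   # [0.9  0.95 0.7 ]
```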
Indexing and Parallel Query Processing Support for Visualizing Climate Datasets
by Y. Su, G. Agrawal, and J. Woodring
With increasing emphasis on analysis of large-scale scientific data,
and with growing dataset sizes, a number of new challenges are arising.
Particularly, novel data management solutions are needed, which can
work together with the existing tools. This paper examines indexing
support for high-level queries (primarily those for subsetting) on
array-based scientific datasets. This work is motivated by the
limitations arising in visualizing climate datasets (stored in
NetCDF) using tools like ParaView. We have developed a new indexing
strategy which can help support a variety of subsetting queries over
these datasets, including those requiring subsetting over
dimensions/coordinates and those involving variable values. Our
approach is based on bitmaps, but involves use of two-level indices and
careful partitioning based on query profiles. We also show how our
indexing support can be used for subsetting operations executed in
parallel. We compare our solutions against a number of other solutions,
and demonstrate that our method is more effective.
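The bitmap idea fits in a few lines: each value bin owns a bitmap marking the cells that fall in it, and a range query ORs the bitmaps of overlapping bins before an exact check on the boundary bins. A single-level sketch (the paper adds a second index level and query-profile-driven partitioning, and the toy field below is invented):

```python
import numpy as np

def build_bitmap_index(values, bin_edges):
    # One bitmap per bin: bit i is set if cell i falls in that bin.
    bins = np.digitize(values, bin_edges)   # bin id per cell
    return {b: (bins == b) for b in np.unique(bins)}

def range_query(index, values, lo, hi, bin_edges):
    # OR together candidate bins, then filter edge bins exactly.
    candidates = np.zeros(len(values), dtype=bool)
    for b, bitmap in index.items():
        b_lo = bin_edges[b - 1] if b > 0 else -np.inf
        b_hi = bin_edges[b] if b < len(bin_edges) else np.inf
        if b_hi > lo and b_lo < hi:   # bin overlaps [lo, hi)
            candidates |= bitmap
    hits = candidates & (values >= lo) & (values < hi)
    return np.nonzero(hits)[0]

temps = np.array([271.3, 285.0, 290.2, 268.9, 301.7])   # toy field (K)
edges = np.array([270.0, 280.0, 290.0, 300.0])
idx = build_bitmap_index(temps, edges)
print(range_query(idx, temps, 280.0, 295.0, edges))     # cells 1 and 2
```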
In-situ Sampling of a Large-Scale Particle Simulation for Interactive Visualization and Analysis
by Jonathan L. Woodring, James P. Ahrens, Jeannette A. Figg, Joanne R. Wendelberger, and Katrin Heitmann
We describe a simulation-time random sampling of a large-scale particle
simulation, the RoadRunner Universe MC3 cosmological simulation, for
interactive post-analysis and visualization. Simulation data generation
rates will continue to be far greater than storage bandwidth rates by
many orders of magnitude. This implies that only a very small fraction
of data generated by a simulation can ever be stored and subsequently
post-analyzed. The limiting factors in this situation are similar to
the problem in many population surveys: there aren't enough human
resources to query a large population. To cope with the lack of
resources, statistical sampling techniques are used to create a
representative data set of a large population. Following this analogy,
we propose to store a simulation-time random sampling of the particle
data for post-analysis, with level-of-detail organization, to cope with
the bottlenecks. A sample is stored directly from the simulation in a
level-of-detail format for post-visualization and analysis, which
amortizes the cost of post-processing and reduces workflow time.
Additionally by sampling during the simulation, we are able to analyze
the entire particle population to record full population statistics and
quantify sample error.
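The core pattern is a per-particle random draw at output time, with running statistics accumulated over the full population so sample error can be quantified afterwards. A sketch of that pattern (function and variable names are ours, not the paper's):

```python
import random

def sample_timestep(particles, rate=0.01, rng=random.Random(42)):
    # In-situ pass: every particle contributes to the full-population
    # statistics, but only a random fraction is written to storage.
    n = 0
    mean = 0.0
    sample = []
    for p in particles:
        n += 1
        mean += (p - mean) / n   # streaming mean over the whole population
        if rng.random() < rate:
            sample.append(p)     # stored for post-hoc analysis
    return sample, {"count": n, "mean": mean}

population = [random.gauss(0.0, 1.0) for _ in range(100_000)]
sample, stats = sample_timestep(population)
est = sum(sample) / len(sample)
print(f"stored {len(sample)} of {stats['count']} particles; "
      f"sample mean {est:+.3f} vs true mean {stats['mean']:+.3f}")
```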
Revisiting Wavelet Compression for Large-Scale Climate Data using JPEG 2000 and Ensuring Data Precision
by J. Woodring, S. Mniszewski, C. Brislawn, D. DeMarle and J. Ahrens
We revisit wavelet compression by using a standards-based method to
reduce large-scale data sizes for production scientific computing. Many
of the bottlenecks in visualization and analysis come from limited
bandwidth in data movement, from storage to networks. The majority of
the processing time for visualization and analysis is spent reading or
writing large-scale data or moving data from a remote site in a
distance scenario. Using wavelet compression in JPEG 2000, we provide a
mechanism to vary data transfer time versus data quality, so that a
domain expert can improve data transfer time while quantifying
compression effects on their data. By using a standards-based method,
we are able to provide scientists with the state-of-the-art wavelet
compression from the signal processing and data compression community,
suitable for use in a production computing environment. To quantify
compression effects, we focus on measuring bit rate versus maximum
error as a quality metric to provide precision guarantees for
scientific analysis on remotely compressed POP (Parallel Ocean Program)
data.
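The quality-versus-size trade can be tried end to end in a few lines: transform, discard small coefficients, reconstruct, and report the maximum pointwise error. A toy version with PyWavelets; note that JPEG 2000 proper uses the CDF 9/7 wavelet with an embedded coder, so the 'db4' hard-thresholding here is only a stand-in, and the input field is synthetic.

```python
import numpy as np
import pywt

signal = np.cumsum(np.random.default_rng(0).normal(size=4096))  # toy field

coeffs = pywt.wavedec(signal, "db4", level=5)
flat = np.concatenate([np.abs(c) for c in coeffs])
cutoff = np.quantile(flat, 0.95)   # keep roughly 5% of coefficients

kept = [pywt.threshold(c, cutoff, mode="hard") for c in coeffs]
recon = pywt.waverec(kept, "db4")[: len(signal)]

max_err = np.max(np.abs(signal - recon))  # the precision-guarantee metric
nonzero = sum(int(np.count_nonzero(c)) for c in kept)
print(f"kept {nonzero}/{len(flat)} coefficients, max abs error {max_err:.4f}")
```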
VisIO: Enabling Interactive Visualization of Ultra-Scale, Time Series Data via High-Bandwidth Distributed I/O Systems
by Christopher Mitchell, James Ahrens, and Jun Wang
Petascale simulations compute at resolutions ranging into billions of
cells and write terabytes of data for visualization and analysis.
Interactive visualization of this time series is a desired step before
starting a new run. The I/O subsystem and associated network often are a
significant impediment to interactive visualization of time-varying
data, as they are not configured or provisioned to provide necessary I/O
read rates. In this paper, we propose a new I/O library for
visualization applications: VisIO. Visualization applications commonly
use N-to-N reads within their parallel-enabled readers, which provides an
incentive for a shared-nothing approach to I/O, similar to other
data-intensive approaches such as Hadoop. However, unlike other
data-intensive applications, visualization requires: (1) interactive
performance for large data volumes, (2) compatibility with MPI and POSIX
file system semantics for compatibility with existing infrastructure,
and (3) use of existing file formats and their stipulated data
partitioning rules. VisIO provides a mechanism for using a non-POSIX
distributed file system to provide linear scaling of I/O bandwidth. In
addition, we introduce a novel scheduling algorithm that helps to
co-locate visualization processes on nodes with the requested data.
Testing using VisIO integrated into ParaView was conducted using the
Hadoop Distributed File System (HDFS) on TACC's Longhorn cluster. A
representative dataset, VPIC, read across 128 nodes showed a 64.4% read
performance improvement compared to the provided Lustre installation.
Also tested was a dataset representing a global ocean salinity
simulation, which showed a 51.4% improvement in read performance over
Lustre when using our VisIO system. VisIO provides powerful
high-performance I/O services to visualization applications, allowing
for interactive performance with ultra-scale, time-series data.
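The co-location idea reduces to a greedy matching: ask the file system where each reader's block lives (HDFS exposes block locations) and place the reader on a free node holding a replica, falling back to any free node. A sketch with invented data structures; the real scheduler is more involved.

```python
def schedule(readers, block_locations, nodes):
    # readers: reader_id -> file block it will read
    # block_locations: block -> set of nodes holding a replica
    # Greedy co-location: prefer a free node that already has the data.
    free = set(nodes)
    placement = {}
    for reader, block in readers.items():
        local = block_locations.get(block, set()) & free
        node = min(local) if local else min(free)   # fall back to any node
        placement[reader] = node
        free.discard(node)
    return placement

readers = {"vis-rank-0": "blk_01", "vis-rank-1": "blk_02",
           "vis-rank-2": "blk_03"}
locations = {"blk_01": {"n1", "n3"}, "blk_02": {"n2"}, "blk_03": {"n1"}}
print(schedule(readers, locations, ["n1", "n2", "n3", "n4"]))
# rank-0 takes n1, rank-1 takes n2; rank-2's only replica node is busy,
# so it falls back to a remote read from n3.
```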
Jitter-Free Co-Processing on a Prototype Exascale Storage Stack
by J. Bent, S. Faibish, J. Ahrens, G. Grider, J. Patchett, and J. Woodring
In the petascale era, the storage stack used by the extreme scale high
performance computing community is fairly homogeneous across sites. On
the compute edge of the stack, file system clients or IO forwarding
services direct IO over an interconnect network to a relatively small
set of IO nodes. These nodes forward the requests over a secondary
storage network to a spindle-based parallel file system. Unfortunately,
this architecture will become unviable in the exascale era. As the
density growth of disks continues to outpace increases in their
rotational speeds, disks are becoming increasingly cost-effective for
capacity but decreasingly so for bandwidth. Fortunately, new storage
media such as solid state devices are filling this gap; although not
cost-effective for capacity, they are so for performance. This suggests
that the storage stack at exascale will incorporate solid state storage
between the compute nodes and the parallel file systems. There are
three natural places into which to position this new storage layer:
within the compute nodes, the IO nodes, or the parallel file system. In
this paper, we argue that the IO nodes are the appropriate location for
HPC workloads and show results from a prototype system that we have
built accordingly. Running a pipeline of computational simulation and
visualization, we show that our prototype system reduces total time to
completion by up to 30%.
On the Role of Burst Buffers in Leadership-class Storage Systems
by Liu, N; Cope, J; Carns, P; Carothers, C; Ross, R; Grider, G; Crume, A; and Maltzahn, C
The largest-scale high-performance computing (HPC) systems are stretching
parallel file systems to their limits in terms of aggregate bandwidth
and numbers of clients. To further sustain the scalability of these
file systems, researchers and HPC storage architects are exploring
various storage system designs. One proposed storage system design
integrates a tier of solid-state burst buffers into the storage system
to absorb application I/O requests. In this paper, we simulate and
explore this storage system design for use by large-scale HPC systems.
First, we examine application I/O patterns on an existing large-scale
HPC system to identify common burst patterns. Next, we describe
enhancements to the CODES storage system simulator to enable our burst
buffer simulations. These enhancements include the integration of a
burst buffer model into the I/O forwarding layer of the simulator, the
development of an I/O kernel description language and interpreter, the
development of a suite of I/O kernels that are derived from observed
I/O patterns, and fidelity improvements to the CODES models. We
evaluate the I/O performance for a set of multiapplication I/O
workloads and burst buffer configurations. We show that burst buffers
can accelerate the application perceived throughput to the external
storage system and can reduce the amount of external storage bandwidth
required to meet a desired application perceived throughput goal.
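The bandwidth argument behind burst buffers is simple arithmetic: the buffer tier must absorb the peak checkpoint burst, while external storage only needs to drain at the long-run average rate. A back-of-the-envelope sketch (all numbers invented for illustration):

```python
# Sizing sketch for a burst buffer tier (all numbers hypothetical).
checkpoint_bytes  = 200e12   # 200 TB written per checkpoint burst
burst_seconds     = 300      # app tolerates a 5-minute write pause
checkpoint_period = 3600     # one checkpoint per hour

bb_bandwidth  = checkpoint_bytes / burst_seconds      # absorb the burst
ext_bandwidth = checkpoint_bytes / checkpoint_period  # drain on average

print(f"burst buffer tier: {bb_bandwidth / 1e9:.0f} GB/s")
print(f"external storage:  {ext_bandwidth / 1e9:.0f} GB/s "
      f"({bb_bandwidth / ext_bandwidth:.0f}x less)")
```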
Low-Power Amdahl-Balanced Blades for Data Intensive Computing
by Alexander S. Szalay, Gordon C. Bell, H. Howie Huang, Andreas Terzis, Alainna White
Enterprise and scientific data sets double every year, forcing similar
growths in storage size and power consumption. As a consequence,
current system architectures used to build data warehouses are about to
hit a power consumption wall. In this paper we propose an alternative
architecture comprising a large number of so-called Amdahl blades that
combine energy-efficient CPUs with solid state disks to increase
sequential read I/O throughput by an order of magnitude while keeping
power consumption constant. We also show that while keeping the total
cost of ownership constant, Amdahl blades offer five times the
throughput of a state-of-the-art computing cluster for data-intensive
applications. Finally, using the scaling laws originally postulated by
Amdahl, we show that systems for data-intensive computing must maintain
a balance between low power consumption and per-server throughput to
optimize performance per watt.
HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads
by Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, Alexander Rasin
The production environment for analytical data management applications
is rapidly changing. Many enterprises are shifting away from deploying
their analytical databases on high-end proprietary machines, and moving
towards cheaper, lower-end, commodity hardware, typically arranged in a
shared-nothing MPP architecture, often in a virtualized environment
inside public or private "clouds". At the same time, the amount of
data that needs to be analyzed is exploding, requiring hundreds to thousands
of machines to work in parallel to perform the analysis. There tend to
be two schools of thought regarding what technology to use for data
analysis in such an environment. Proponents of parallel databases argue
that the strong emphasis on performance and efficiency of parallel
databases makes them well-suited to perform such analysis. On the other
hand, others argue that MapReduce-based systems are better suited due to
their superior scalability, fault tolerance, and flexibility to handle
unstructured data. In this paper, we explore the feasibility of building
a hybrid system that takes the best features from both technologies;
the prototype we built approaches parallel databases in performance and
efficiency, yet still yields the scalability, fault tolerance, and
flexibility of MapReduce-based systems.
Hive: A Petabyte Scale Data Warehouse Using Hadoop
by Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad
Chakka, Ning Zhang, Suresh Antony, Hao Liu and Raghotham Murthy
The size of data sets being collected and analyzed in the industry for
business intelligence is growing rapidly, making traditional
warehousing solutions prohibitively expensive. Hadoop is a
popular open-source map-reduce implementation which is being used
in companies like Yahoo, Facebook etc. to store and process
extremely large data sets on commodity hardware. However, the
map-reduce programming model is very low level and requires developers
to write custom programs which are hard to maintain and reuse. In this
paper, we present Hive, an open-source data warehousing solution
built on top of Hadoop. Hive supports queries expressed in a SQL-like
declarative language, HiveQL, which are compiled into map-reduce
jobs that are executed using Hadoop. In addition, HiveQL enables users
to plug in custom map-reduce scripts into queries. The language
includes a type system with support for tables containing primitive
types, collections like arrays and maps, and nested compositions of
the same. The underlying IO libraries can be extended to query data in
custom formats. Hive also includes a system catalog, the Metastore,
that contains schemas and statistics, which are useful in data
exploration, query optimization and query compilation. In Facebook,
the Hive warehouse contains tens of thousands of tables and stores
over 700TB of data and is being used extensively for both reporting
and ad-hoc analyses by more than 200 users per month.
A Comparison of Approaches to Large-Scale Data Analysis
by Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker
There is currently considerable enthusiasm around the MapReduce (MR)
paradigm for large-scale data analysis [17]. Although the basic control
flow of this framework has existed in parallel SQL database management
systems (DBMS) for over 20 years, some have called MR a dramatically new
computing model [8, 17]. In this paper, we describe and compare both
paradigms. Furthermore, we evaluate both kinds of systems in terms of
performance and development complexity. To this end, we define a
benchmark consisting of a collection of tasks that we have run on an
open source version of MR as well as on two parallel DBMSs. For each
task, we measure each system's performance for various degrees of
parallelism on a cluster of 100 nodes. Our results reveal some
interesting trade-offs. Although the process to load data into and tune
the execution of parallel DBMSs took much longer than the MR system, the
observed performance of these DBMSs was strikingly better. We speculate
about the causes of the dramatic performance difference and consider
implementation concepts that future systems should take from both kinds
of architectures.
Data Management Challenges of Data-Intensive Scientific Workflows
by Ewa Deelman and Ann Chervenak
Scientific workflows play an important role in today's science. Many
disciplines rely on workflow technologies to orchestrate the execution
of thousands of computational tasks. Much research to-date focuses on
efficient, scalable, and robust workflow execution, especially in
distributed environments. However, many challenges remain in the area of
data management related to workflow creation, execution, and result
management. In this paper we examine some of these issues in the context
of the entire workflow lifecycle.
DISC: A System for Distributed Data Intensive Scientific Computing
by George Kola, Tevfik Kosar, Jaime Frey, Miron Livny, Robert Brunner, Michael Remijan
The increasing computation and data requirements of scientific
applications have necessitated the use of distributed resources owned
by collaborating parties. While existing distributed systems work well
for computation that requires limited data movement, they fail in
unexpected ways when the computation accesses, creates, and moves large
amounts of data over wide-area networks. In this work, we analyzed the
problems with existing systems and used the result of this analysis to
design our own system. Realizing that it takes a long while for a new
system to stabilize, we tried our best to reuse existing components. We
added new components only when we could not get by with adding features
to existing ones. We used our system to successfully process three
terabytes of DPOSS image data in under a week by using idle CPUs in
desktops and commodity clusters in the UW-Madison Computer Science
Department and Starlight.
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
by Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica
We present Mesos, a platform for sharing commodity clusters between
multiple diverse cluster computing frameworks, such as Hadoop and MPI.
Sharing improves cluster utilization and avoids per-framework data
replication. Mesos shares resources in a fine-grained manner, allowing
frameworks to achieve data locality by taking turns reading data stored
on each machine. To support the sophisticated schedulers of today's
frameworks, Mesos introduces a distributed two-level scheduling
mechanism called resource offers. Mesos decides how many resources to
offer each framework, while frameworks decide which resources to accept
and which computations to run on them. Our results show that Mesos can
achieve near-optimal data locality when sharing the cluster among
diverse frameworks, can scale to 50,000 (emulated) nodes, and is
resilient to failures.
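Resource offers invert the usual scheduler relationship: Mesos decides which framework is offered what, and each framework's own scheduler decides what to accept. A toy round of that two-level protocol; the framework policies below are invented stand-ins.

```python
def mesos_round(free_slots, frameworks):
    # free_slots: node -> available CPU cores.
    # Each framework scheduler is offered resources and returns the
    # subset it accepts; Mesos only decides who gets offered what.
    launched = []
    for name, scheduler in frameworks:
        offer = dict(free_slots)
        accepted = scheduler(offer)   # framework-side decision
        for node, cores in accepted.items():
            free_slots[node] -= cores
            launched.append((name, node, cores))
        free_slots = {n: c for n, c in free_slots.items() if c > 0}
    return launched

def hadoop_sched(offer):
    # Accept one core on a node with local data (pretend n1 has it).
    return {"n1": 1} if offer.get("n1", 0) >= 1 else {}

def mpi_sched(offer):
    # Accept only if a 2-core gang fits on a single node.
    for node, cores in offer.items():
        if cores >= 2:
            return {node: 2}
    return {}

print(mesos_round({"n1": 2, "n2": 4}, [("hadoop", hadoop_sched),
                                       ("mpi", mpi_sched)]))
```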
Parallel Visualization on Large Clusters using MapReduce
by Vo, H.T.; Bronson, J.; Summa, B.; Comba, J.L.D.; Freire, J.; Howe, B.; Bremer, P.-T.; Silva, C.T.
Large-scale visualization systems are typically designed to efficiently
"push" datasets through the graphics hardware. However, exploratory
visualization systems are increasingly expected to support scalable data
manipulation, restructuring, and querying capabilities in addition to core
visualization algorithms. We posit that new emerging abstractions for parallel
data processing, in particular computing clouds, can be leveraged to support
large-scale data exploration through visualization. In this paper, we take a
first step in evaluating the suitability of the MapReduce framework to
implement large-scale visualization techniques. MapReduce is a lightweight,
scalable, general-purpose parallel data processing framework increasingly
popular in the context of cloud computing. Specifically, we implement and
evaluate a representative suite of visualization tasks (mesh rendering,
isosurface extraction, and mesh simplification) as MapReduce programs, and
report quantitative performance results applying these algorithms to realistic
datasets. For example, we perform isosurface extraction of up to 16 isovalues
for volumes composed of 27 billion voxels, simplification of meshes with 30GBs
of data and subsequent rendering with image resolutions up to 80000² pixels.
Our results indicate that the parallel scalability, ease of use, ease of access
to computing resources, and fault-tolerance of MapReduce offer a promising
foundation for a combined data manipulation and data visualization system
deployed in a public cloud or a local commodity cluster.
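Rendering fits the model naturally: the map phase keys each primitive by the image tile it covers, the shuffle groups fragments per tile, and the reduce phase composites each tile with a z-buffer. A miniature version over colored points, as a stand-in for the paper's mesh pipelines:

```python
from collections import defaultdict

TILE = 4   # tile width/height in pixels

def map_phase(points):
    # points: (x, y, depth, color). Key each fragment by its image tile.
    for x, y, depth, color in points:
        yield (x // TILE, y // TILE), (x, y, depth, color)

def reduce_phase(tile, fragments):
    # Composite one tile: keep the nearest fragment per pixel (z-buffer).
    nearest = {}
    for x, y, depth, color in fragments:
        if (x, y) not in nearest or depth < nearest[(x, y)][0]:
            nearest[(x, y)] = (depth, color)
    return tile, nearest

shuffle = defaultdict(list)
pts = [(1, 1, 0.9, "red"), (1, 1, 0.2, "blue"), (9, 2, 0.5, "green")]
for key, value in map_phase(pts):
    shuffle[key].append(value)   # the shuffle stage groups by tile
for tile, frags in shuffle.items():
    print(reduce_phase(tile, frags))
# tile (0,0): pixel (1,1) keeps blue (closer); tile (2,0): (9,2) is green
```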
Provenance for computational tasks: A survey
by J Freire, D Koop, E Santos, CT Silva
The problem of systematically capturing and managing provenance for computational tasks has recently received significant attention because of its relevance to a wide range of domains and applications. The authors give an overview of important concepts related to provenance management, so that potential users can make informed decisions when selecting or designing a provenance solution.
The open provenance model core specification (v1. 1)
by L Moreau, B Clifford, J Freire, J Futrelle, Y Gil, P Groth, N Kwasnikowska ...
The Open Provenance Model is a model of provenance that is designed to meet the following requirements: (1) Allow provenance information to be exchanged between systems, by means of a compatibility layer based on a shared provenance model. (2) Allow developers to build and share tools that operate on such a provenance model. (3) Define provenance in a precise, technology-agnostic manner. (4) Support a digital representation of provenance for any “thing”, whether produced by computer systems or not. (5) Allow multiple levels of description to coexist. (6) Define a core set of rules that identify the valid inferences that can be made on provenance representation. This document contains the specification of the Open Provenance Model (v1.1) resulting from a community effort to achieve inter-operability in the Provenance Challenge series.
Using NoSQL Databases for Streaming Network Analysis
by Brian Wylie, Daniel Dunlavy, Warren Davis IV, Jeff Baumes
The high-volume, low-latency world of network traffic presents significant
obstacles for complex analysis techniques. The unique challenge of adapting
powerful but high-latency models to real-time network streams is the basis of
our cyber security project. In this paper we discuss our use of NoSQL databases
in a framework that enables the application of computationally expensive models
against a real-time network data stream. We describe how this approach
transforms the highly constrained (and sometimes arcane) world of real-time
network analysis into a more developer friendly model that relaxes many of the
traditional constraints associated with streaming data. Our primary use of the
system is for conducting streaming text analysis and classification activities
on a network link receiving ~200,000 emails per day.
Interpreting the Data: Parallel Analysis with Sawzall
by Rob Pike, Sean Dorward, Robert Griesemer, Sean Quinlan
Very large data sets often have a flat but regular structure and span multiple
disks and machines. Examples include telephone call records, network logs, and
web document repositories. These large data sets are not amenable to study
using traditional database techniques, if only because they can be too large to
fit in a single relational database. On the other hand, many of the analyses
done on them can be expressed using simple, easily distributed computations:
filtering, aggregation, extraction of statistics, and so on.
We present a system for automating such analyses. A filtering phase, in which a
query is expressed using a new programming language, emits data to an
aggregation phase. Both phases are distributed over hundreds or even thousands
of computers. The results are then collated and saved to a file. The design --
including the separation into two phases, the form of the programming language,
and the properties of the aggregators -- exploits the parallelism inherent in
having data and computation distributed across many machines.
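The two-phase shape is easy to mimic: a per-record filter program emits (table, key, value) tuples into named aggregator tables, and only the commutative aggregates ever need to be collated across machines. A compact sketch with an invented record format:

```python
from collections import Counter

# Phase 1: filtering. One record in, zero or more (table, key, value)
# emissions out; trivially parallel across machines.
def filter_program(record):
    origin, bytes_sent = record
    yield ("count", origin, 1)
    yield ("traffic", origin, bytes_sent)

# Phase 2: aggregation. Tables fold emissions with a fixed operator
# (sum here), so partial results from many machines combine freely.
tables = {"count": Counter(), "traffic": Counter()}
log = [("kr", 125), ("us", 70), ("kr", 300)]
for record in log:
    for table, key, value in filter_program(record):
        tables[table][key] += value

print(dict(tables["count"]))     # {'kr': 2, 'us': 1}
print(dict(tables["traffic"]))   # {'kr': 425, 'us': 70}
```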
Dremel: Interactive Analysis of Web-Scale Datasets
by Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis
Dremel is a scalable, interactive ad-hoc query system for analysis of read-only
nested data. By combining multi-level execution trees and columnar data layout,
it is capable of running aggregation queries over trillion-row tables in
seconds. The system scales to thousands of CPUs and petabytes of data, and has
thousands of users at Google. In this paper, we describe the architecture and
implementation of Dremel, and explain how it complements MapReduce-based
computing. We present a novel columnar storage representation for nested
records and discuss experiments on few-thousand node instances of the system.
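Columnar storage is the heart of the design: each field is stored contiguously, so an aggregation reads only the columns it names. A much-simplified sketch for flat records; Dremel's real representation adds repetition and definition levels to encode nesting, which this omits.

```python
from collections import Counter

def to_columns(records):
    # Strip records into one array per field (flat records only;
    # Dremel adds repetition/definition levels for nested data).
    columns = {}
    for rec in records:
        for field, value in rec.items():
            columns.setdefault(field, []).append(value)
    return columns

records = [
    {"doc_id": 10, "lang": "en", "views": 120},
    {"doc_id": 20, "lang": "en", "views": 45},
    {"doc_id": 30, "lang": "de", "views": 7},
]
cols = to_columns(records)

# SELECT SUM(views) reads a single column; the rest stays on disk.
print(sum(cols["views"]))      # 172
# SELECT lang, COUNT(*) GROUP BY lang touches only the "lang" column.
print(Counter(cols["lang"]))   # {'en': 2, 'de': 1}
```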