Data Science at Scale 2013 Summer School Paper List
Date | Discussion Leader | Paper
29 May | Ayan | The Fourth Paradigm: Data-Intensive Scientific Discovery, edited by Tony Hey, Stewart Tansley, and Kristin Tolle. Chapter 1
29 May | Sidharth | Snap-together visualization: a user interface for coordinating visualizations via relational schemata. Chris North, Ben Shneiderman. Proceedings of the Working Conference on Advanced Visual Interfaces (AVI 2000), 128-135
29 May | Wathsala | Toward measuring visualization insight. Chris North. IEEE Computer Graphics and Applications 26(3), May 2006, 6-9
5 June | Yu Su | The Google File System. Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. SOSP '03: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, 29-43
5 June | Boonth | MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean, Sanjay Ghemawat. Communications of the ACM 51(1), January 2008, 107-113
11 June | Kien | Provenance for computational tasks: A survey. J. Freire, D. Koop, E. Santos, C.T. Silva. Computing in Science & Engineering 10(3), 11-21
11 June | Kien | The open provenance model core specification (v1.1). L. Moreau, B. Clifford, J. Freire, J. Futrelle, Y. Gil, P. Groth, N. Kwasnikowska, et al. Future Generation Computer Systems 27(6), 743-756
26 June | Peter | Data-intensive spatial filtering in large numerical simulation datasets. Kalin Kanov, Randal Burns, Greg Eyink, Charles Meneveau, Alexander Szalay. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, Article 60
26 June | Will | Extreme Data-Intensive Scientific Computing. Alexander Szalay. Computing in Science & Engineering 13(6)
3 July | Ayun | Design Study Methodology: Reflections from the Trenches and the Stacks. Michael Sedlmair, Miriah Meyer, Tamara Munzner. IEEE Transactions on Visualization and Computer Graphics 18(12), December 2012
3 July | Chris | The challenge of information visualization evaluation. Catherine Plaisant. AVI '04: Proceedings of the Working Conference on Advanced Visual Interfaces, 109-116
10 July | Wathsala | Interactive exploration and analysis of large-scale simulations using topology-based data segmentation. P.-T. Bremer, Gunther Weber, Julien Tierny, Valerio Pascucci, Marc Day, John Bell. IEEE Transactions on Visualization and Computer Graphics 17(9), 2011, 1307-1324
10 July | Sidharth | Efficient Data Restructuring and Aggregation for I/O Acceleration in PIDX. Sidharth Kumar, Venkatram Vishwanath, Philip Carns, Joshua A. Levine, Robert Latham, Giorgio Scorzelli, Hemanth Kolla, et al. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, Article 50. IEEE Computer Society Press, 2012
Other papers:
Large-scale Incremental Processing Using Distributed Transactions and Notifications
by Daniel Peng and Frank Dabek
Updating an index of the web as documents are crawled requires
continuously transforming a large repository of existing documents as
new documents arrive. This task is one example of a class of data
processing tasks that transform a large repository of data via small,
independent mutations. These tasks lie in a gap between the
capabilities of existing infrastructure. Databases do not meet the
storage or throughput requirements of these tasks: Google's indexing
system stores tens of petabytes of data and processes billions of
updates per day on thousands of machines. MapReduce and other
batch-processing systems cannot process small updates individually as
they rely on creating large batches for efficiency.
We have built Percolator, a system for incrementally processing updates
to a large data set, and deployed it to create the Google web search
index. By replacing a batch-based indexing system with an indexing
system based on incremental processing using Percolator, we process the
same number of documents per day, while reducing the average age of
documents in Google search results by 50%.
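To make the notification model concrete, here is a minimal single-process sketch in the spirit of the paper: writes to a watched column trigger observer callbacks, which can cascade further writes. The names and structure are illustrative, not Percolator's actual API; the real system layers ACID transactions and distributed notifications over Bigtable.

```python
# Toy sketch of Percolator-style incremental processing: writes to a
# watched column trigger observers, which cascade further work.
# Single-process and non-transactional; illustrative only.

table = {}        # (row, column) -> value
observers = {}    # column -> list of callback(row, value)

def watch(column, callback):
    observers.setdefault(column, []).append(callback)

def write(row, column, value):
    table[(row, column)] = value
    for callback in observers.get(column, []):
        callback(row, value)   # notification: run dependent work

# Example: re-index a document whenever its raw contents change.
def index_document(row, contents):
    write(row, "index:words", sorted(set(contents.split())))

watch("raw:contents", index_document)
write("doc1", "raw:contents", "the quick brown fox")
print(table[("doc1", "index:words")])   # ['brown', 'fox', 'quick', 'the']
```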
Bigtable:
A Distributed Storage System for Structured Data
by Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber
Bigtable is a distributed storage system for
managing structured data
that is designed to scale to a very large size:
petabytes of data across thousands of commodity servers. Many
projects at Google store data in Bigtable, including web indexing,
Google Earth, and Google Finance. These applications place very
different demands on Bigtable, both in terms of data size (from URLs
to web pages to satellite imagery) and latency requirements (from
backend bulk processing to real-time data serving). Despite these
varied demands, Bigtable has successfully provided a flexible,
high-performance solution for all of these Google products. In this
paper we describe the simple data model provided by Bigtable, which
gives clients dynamic control over data layout and format, and we
describe the design and implementation of Bigtable.
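The data model described in the paper is, at heart, a sparse, sorted, multi-dimensional map from (row key, column key, timestamp) to an uninterpreted value. A toy in-memory rendering of that interface (method names are ours, not Bigtable's):

```python
import bisect

class ToyBigtable:
    # Sparse sorted map: (row key, column key, timestamp) -> value.
    def __init__(self):
        self.rows = []   # sorted row keys, enabling range scans
        self.data = {}   # row -> {column -> {timestamp -> value}}

    def put(self, row, col, ts, value):
        if row not in self.data:
            bisect.insort(self.rows, row)
            self.data[row] = {}
        self.data[row].setdefault(col, {})[ts] = value

    def get(self, row, col):
        # Return the most recent version, as clients usually want.
        versions = self.data.get(row, {}).get(col, {})
        return versions[max(versions)] if versions else None

    def scan(self, start_row, end_row):
        # Rows stay sorted, so a range scan is a contiguous slice.
        lo = bisect.bisect_left(self.rows, start_row)
        hi = bisect.bisect_left(self.rows, end_row)
        return [(r, self.data[r]) for r in self.rows[lo:hi]]

t = ToyBigtable()
t.put("com.cnn.www", "contents:", 1, "<html>v1</html>")
t.put("com.cnn.www", "contents:", 2, "<html>v2</html>")
print(t.get("com.cnn.www", "contents:"))   # latest version: v2
```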
Spanner:
Google's Globally-Distributed Database
by James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford
Spanner is Google's scalable,
multi-version, globally-distributed, and
synchronously-replicated database. It is the first system to
distribute data at global scale and support externally-consistent
distributed transactions. This paper describes how Spanner is
structured, its feature set, the rationale underlying various design
decisions, and a novel time API that exposes clock uncertainty. This
API and its implementation are critical to supporting external
consistency and a variety of powerful features: non-blocking reads in
the past, lock-free read-only transactions, and atomic schema
changes, across all of Spanner.
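The novel time API returns an uncertainty interval rather than a single instant, and external consistency follows from "commit wait": a transaction's timestamp becomes visible only after the clock uncertainty has passed. A rough sketch under an invented uncertainty bound (real TrueTime derives its bound from GPS receivers and atomic clocks):

```python
import time

EPSILON = 0.007   # assumed clock uncertainty in seconds (invented value)

def tt_now():
    # TrueTime-style interval: true time lies within [earliest, latest].
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit():
    # Choose a timestamp no earlier than "now" could possibly be.
    _, latest = tt_now()
    commit_ts = latest
    # Commit wait: block until commit_ts is definitely in the past
    # everywhere, so timestamp order agrees with real-time order.
    while tt_now()[0] <= commit_ts:
        time.sleep(EPSILON / 2)
    return commit_ts

print("committed at", commit())
```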
Dynamo: Amazon's Highly Available Key-value Store
by Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan
Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan
Sivasubramanian, Peter Vosshall and Werner Vogels
Reliability at massive scale is one of the biggest challenges we face
at Amazon.com, one of the largest e-commerce operations in the world;
even the slightest outage has significant financial consequences and
impacts customer trust. The Amazon.com platform, which provides
services for many web sites worldwide, is implemented on top of an
infrastructure of tens of thousands of servers and network components
located in many datacenters around the world. At this scale, small and
large components fail continuously and the way persistent state is
managed in the face of these failures drives the reliability and
scalability of the software systems.
This paper presents the design and implementation of Dynamo, a highly
available key-value storage system that some of Amazon's core services
use to provide an "always-on" experience. To achieve this level of
availability, Dynamo sacrifices consistency under certain failure
scenarios. It makes extensive use of object versioning and
application-assisted conflict resolution in a manner that provides a
novel interface for developers to use.
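The object versioning the paper mentions is built on vector clocks: each node increments its own counter when it coordinates a write, and two versions conflict when neither clock dominates the other, leaving the application to merge them. A compact sketch:

```python
def dominates(a, b):
    # Vector clock a dominates b if it is >= on every node counter.
    return all(a.get(n, 0) >= c for n, c in b.items())

def reconcile(versions):
    # Keep only versions not dominated by another; if more than one
    # survives, they are true conflicts for the application to merge.
    frontier = []
    for v, clock in versions:
        if not any(dominates(c2, clock) and c2 != clock
                   for _, c2 in versions):
            frontier.append((v, clock))
    return frontier

v1 = ("cart:[book]",     {"nodeA": 1})
v2 = ("cart:[book,pen]", {"nodeA": 2})               # supersedes v1
v3 = ("cart:[book,mug]", {"nodeA": 1, "nodeB": 1})   # concurrent with v2
print(reconcile([v1, v2, v3]))   # v2 and v3 survive: a real conflict
```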
The Fourth Paradigm: Data-Intensive Scientific Discovery
edited by Tony Hey, Stewart Tansley, and Kristin Tolle
A collection of essays that expands on the vision of pioneering computer
scientist Jim Gray for a new, fourth paradigm of discovery based on
data-intensive science.
GrayWulf: Scalable Clustered Architecture for Data Intensive Computing
by Szalay, A.S.; Bell, G.; VandenBerg, J.; Wonders, A.; Burns, R.; Fay, D.;
Heasley, J.; Hey, T.; Nieto-Santisteban, M.; Thakar, A.; van Ingen, C.;
Wilton, R.
Data intensive computing presents a significant challenge for
traditional supercomputing architectures that maximize FLOPS since CPU
speed has surpassed IO capabilities of HPC systems and Beowulf
clusters. We present GrayWulf, a three-tier commodity-component cluster
architecture designed for a range of data-intensive computations
operating on petascale data sets. The design goal is a
balanced system in terms of IO performance and memory size, according
to Amdahl's laws. The hardware currently installed at JHU exceeds one
petabyte of storage and has 0.5 bytes/sec of I/O and 1 byte of memory
for each CPU cycle. The GrayWulf provides almost an order of magnitude
better balance than existing systems. The paper covers its architecture
and reference applications. The software design is presented in a
companion paper.
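Amdahl's balance rules make "balanced" measurable: roughly one bit of I/O per second and one byte of memory for each instruction per second. A back-of-the-envelope check for a hypothetical blade (all hardware numbers invented for illustration):

```python
# Amdahl balance ratios for a hypothetical storage blade (made-up specs).
cpu_hz   = 2 * 2.5e9    # 2 cores * 2.5 GHz, ~instructions per second
io_Bps   = 1.25e9       # 1.25 GB/s sequential read from SSDs
memory_B = 16 * 2**30   # 16 GiB of RAM

amdahl_io     = (io_Bps * 8) / cpu_hz   # bits of I/O per CPU cycle
amdahl_memory = memory_B / cpu_hz       # bytes of memory per CPU cycle

# Amdahl's rules of thumb call for ~1 bit/cycle of I/O and ~1 byte/cycle
# of memory; GrayWulf reports about 0.5 bytes/sec of I/O per cycle.
print(f"I/O balance:    {amdahl_io:.2f} bits per cycle")
print(f"memory balance: {amdahl_memory:.2f} bytes per cycle")
```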
Designing and mining multi-terabyte astronomy archives: the Sloan Digital Sky Survey
by Alexander S. Szalay, Peter Z. Kunszt, Ani Thakar, Jim Gray, Don Slutz, Robert J. Brunner
The next-generation astronomy digital
archives will cover most of the sky at fine resolution in many
wavelengths, from X-rays, through ultraviolet, optical, and infrared.
The archives will be stored at diverse geographical locations. One of
the first of these projects, the Sloan Digital Sky Survey (SDSS) is
creating a 5-wavelength catalog over 10,000 square degrees of the sky
(see http://www.sdss.org/). The 200 million objects in the
multi-terabyte database will have mostly numerical attributes in a 100+
dimensional space. Points in this space have highly correlated
distributions.
The archive will enable astronomers to explore the data
interactively. Data access will be aided by multidimensional spatial and
attribute indices. The data will be partitioned in many ways. Small tag
objects consisting of the most popular attributes will accelerate
frequent searches. Splitting the data among multiple servers will allow
parallel, scalable I/O and parallel data analysis. Hashing techniques
will allow efficient clustering, and pair-wise comparison algorithms
that should parallelize nicely. Randomly sampled subsets will allow
debugging otherwise large queries at the desktop. Central servers will
operate a data pump to support sweep searches touching most of the data.
The anticipated queries will require special operators related to
angular distances and complex similarity tests of object properties,
like shapes, colors, velocity vectors, or temporal behaviors. These
issues pose interesting data management challenges.
SkyQuery: A Web Service Approach to Federate Databases
by Tanu Malik, Alex S. Szalay, Tamas Budavari, Ani R. Thakar
Traditional science searched for new objects and phenomena that led to
discoveries. Tomorrow's science will combine the large pool of
information in scientific archives and make discoveries. Scientists are
currently keen to federate together the existing scientific databases.
The major challenge in building a federation of these autonomous and
heterogeneous databases is system integration. Ineffective integration
will result in defunct federations and underutilized scientific data.
Astronomy, in particular, has many autonomous archives spread over the
Internet. It is now seeking to federate these, with minimal effort,
into a Virtual Observatory that will solve complex distributed
computing tasks such as answering federated spatial join queries.
In this paper, we present SkyQuery, a successful prototype of an
evolving federation of astronomy archives. It interoperates using the
emerging Web services standard. We describe the SkyQuery architecture
and show how it efficiently evaluates a probabilistic federated spatial
join query.
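The heart of a federated spatial join is a cross-match: pairing objects from different archives whose angular separation falls below a tolerance. A simplified sketch using the haversine great-circle formula; SkyQuery's actual join is probabilistic and distributed across web-service endpoints, and the archive contents below are invented.

```python
from math import radians, degrees, sin, cos, asin, sqrt

def angular_sep_deg(ra1, dec1, ra2, dec2):
    # Haversine formula on the celestial sphere; inputs in degrees.
    ra1, dec1, ra2, dec2 = map(radians, (ra1, dec1, ra2, dec2))
    h = sin((dec2 - dec1) / 2) ** 2 + \
        cos(dec1) * cos(dec2) * sin((ra2 - ra1) / 2) ** 2
    return degrees(2 * asin(sqrt(h)))

def cross_match(archive_a, archive_b, tol_deg=1.0 / 3600):
    # Naive O(n*m) spatial join: pair objects within tol (1 arcsec here).
    # Real systems index the sphere (e.g. zones/HTM) to prune candidates.
    return [(a_id, b_id)
            for a_id, ra_a, dec_a in archive_a
            for b_id, ra_b, dec_b in archive_b
            if angular_sep_deg(ra_a, dec_a, ra_b, dec_b) <= tol_deg]

sdss  = [("sdss-1", 180.00000, 2.00000)]
first = [("first-7", 180.00010, 2.00005)]
print(cross_match(sdss, first))   # matched within 1 arcsecond
```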
PISTON: A Portable Cross-Platform Framework for Data-Parallel Visualization Operators
by Li-ta Lo, Christopher Sewell, and James Ahrens
Due to the wide variety of current and next-generation supercomputing
architectures, the development of high-performance parallel
visualization and analysis operators frequently requires re-writing the
underlying algorithms for many different platforms. In order to
facilitate portability, we have devised a framework for creating such
operators that employs the data-parallel programming model. By writing
the operators using only data-parallel primitives (such as scans,
transforms, stream compactions, etc.), the same code may be compiled to
multiple targets using architecture-specific backend implementations of
these primitives. Specifically, we make use of and extend NVIDIA's
Thrust library, which provides CUDA and OpenMP backends. Using this
framework, we have implemented isosurface, cut surface, and threshold
operators, and have achieved good parallel performance on two different
architectures (multi-core CPUs and NVIDIA GPUs) using the exact same
operator code. We have applied these operators to several large, real
scientific data sets, and have open-source released a beta version of
our code base.
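The paper's approach composes visualization operators from data-parallel primitives. A threshold operator, for instance, is a stream compaction, which decomposes into a map (flag the passing cells), an exclusive scan (compute output slots), and a scatter. A NumPy sketch of that composition, with NumPy standing in for Thrust's CUDA and OpenMP backends:

```python
import numpy as np

def exclusive_scan(x):
    # Prefix sum with the identity shifted in; a core parallel primitive.
    out = np.cumsum(x)
    return np.concatenate(([0], out[:-1]))

def threshold(values, lo):
    # Stream compaction: keep values >= lo, built only from
    # map / scan / scatter so each stage could run data-parallel.
    flags = (values >= lo).astype(np.int64)        # map: 0/1 predicate
    slots = exclusive_scan(flags)                  # scan: output positions
    out = np.empty(int(flags.sum()), dtype=values.dtype)
    out[slots[flags == 1]] = values[flags == 1]    # scatter survivors
    return out

cells = np.array([0.2, 0.9, 0.4, 0.95, 0.7])
print(threshold(cells, 0.5))   # [0.9  0.95 0.7 ]
```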
Indexing and Parallel Query Processing Support for Visualizing Climate Datasets
by Y. Su, G. Agrawal, and J. Woodring
With increasing emphasis on analysis of large-scale scientific data,
and with growing dataset sizes, a number of new challenges are arising.
Particularly, novel data management solutions are needed, which can
work together with the existing tools. This paper examines indexing
support for high-level queries (primarily those for subsetting) on
array-based scientific datasets. This work is motivated by the
limitations arising in visualizing climate datasets (stored in
NetCDF) using tools like ParaView. We have developed a new indexing
strategy which can help support a variety of subsetting queries over
these datasets, including those requiring subsetting over
dimensions/coordinates and those involving variable values. Our
approach is based on bitmaps, but involves use of two-level indices and
careful partitioning based on query profiles. We also show how our
indexing support can be used for subsetting operations executed in
parallel. We compare our solutions against a number of other solutions,
and demonstrate that our method is more effective.
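The bitmap idea fits in a few lines: each value bin owns a bitmap marking the cells that fall in it, and a range query ORs the bitmaps of overlapping bins before an exact check on the boundary bins. A single-level sketch (the paper adds a second index level and query-profile-driven partitioning, and the toy field below is invented):

```python
import numpy as np

def build_bitmap_index(values, bin_edges):
    # One bitmap per bin: bit i is set if cell i falls in that bin.
    bins = np.digitize(values, bin_edges)   # bin id per cell
    return {b: (bins == b) for b in np.unique(bins)}

def range_query(index, values, lo, hi, bin_edges):
    # OR together candidate bins, then filter edge bins exactly.
    candidates = np.zeros(len(values), dtype=bool)
    for b, bitmap in index.items():
        b_lo = bin_edges[b - 1] if b > 0 else -np.inf
        b_hi = bin_edges[b] if b < len(bin_edges) else np.inf
        if b_hi > lo and b_lo < hi:   # bin overlaps [lo, hi)
            candidates |= bitmap
    hits = candidates & (values >= lo) & (values < hi)
    return np.nonzero(hits)[0]

temps = np.array([271.3, 285.0, 290.2, 268.9, 301.7])   # toy field (K)
edges = np.array([270.0, 280.0, 290.0, 300.0])
idx = build_bitmap_index(temps, edges)
print(range_query(idx, temps, 280.0, 295.0, edges))     # cells 1 and 2
```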
In-situ Sampling of a Large-Scale Particle Simulation for Interactive Visualization and Analysis
by Jonathan L. Woodring, James P. Ahrens, Jeannette A. Figg, Joanne R. Wendelberger, and Katrin Heitmann
We describe a simulation-time random sampling of a large-scale particle
simulation, the RoadRunner Universe MC3 cosmological simulation, for
interactive post-analysis and visualization. Simulation data generation
rates will continue to be far greater than storage bandwidth rates by
many orders of magnitude. This implies that only a very small fraction
of data generated by a simulation can ever be stored and subsequently
post-analyzed. The limiting factors in this situation are similar to
the problem in many population surveys: there aren't enough human
resources to query a large population. To cope with the lack of
resources, statistical sampling techniques are used to create a
representative data set of a large population. Following this analogy,
we propose to store a simulation-time random sampling of the particle
data for post-analysis, with level-of-detail organization, to cope with
the bottlenecks. A sample is stored directly from the simulation in a
level-of-detail format for post-visualization and analysis, which
amortizes the cost of post-processing and reduces workflow time.
Additionally by sampling during the simulation, we are able to analyze
the entire particle population to record full population statistics and
quantify sample error.
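The core pattern is a per-particle random draw at output time, with running statistics accumulated over the full population so sample error can be quantified afterwards. A sketch of that pattern (function and variable names are ours, not the paper's):

```python
import random

def sample_timestep(particles, rate=0.01, rng=random.Random(42)):
    # In-situ pass: every particle contributes to the full-population
    # statistics, but only a random fraction is written to storage.
    n = 0
    mean = 0.0
    sample = []
    for p in particles:
        n += 1
        mean += (p - mean) / n   # streaming mean over the whole population
        if rng.random() < rate:
            sample.append(p)     # stored for post-hoc analysis
    return sample, {"count": n, "mean": mean}

population = [random.gauss(0.0, 1.0) for _ in range(100_000)]
sample, stats = sample_timestep(population)
est = sum(sample) / len(sample)
print(f"stored {len(sample)} of {stats['count']} particles; "
      f"sample mean {est:+.3f} vs true mean {stats['mean']:+.3f}")
```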
Revisiting Wavelet Compression for Large-Scale Climate Data using JPEG 2000 and Ensuring Data Precision
by J. Woodring, S. Mniszewski, C. Brislawn, D. DeMarle and J. Ahrens
We revisit wavelet compression by using a standards-based method to
reduce large-scale data sizes for production scientific computing. Many
of the bottlenecks in visualization and analysis come from limited
bandwidth in data movement, from storage to networks. The majority of
the processing time for visualization and analysis is spent reading or
writing large-scale data or moving data from a remote site in a
distance scenario. Using wavelet compression in JPEG 2000, we provide a
mechanism to vary data transfer time versus data quality, so that a
domain expert can improve data transfer time while quantifying
compression effects on their data. By using a standards-based method,
we are able to provide scientists with the state-of-the-art wavelet
compression from the signal processing and data compression community,
suitable for use in a production computing environment. To quantify
compression effects, we focus on measuring bit rate versus maximum
error as a quality metric to provide precision guarantees for
scientific analysis on remotely compressed POP (Parallel Ocean Program)
data.
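The quality-versus-size trade can be tried end to end in a few lines: transform, discard small coefficients, reconstruct, and report the maximum pointwise error. A toy version with PyWavelets; note that JPEG 2000 proper uses the CDF 9/7 wavelet with an embedded coder, so the 'db4' hard-thresholding here is only a stand-in, and the input field is synthetic.

```python
import numpy as np
import pywt

signal = np.cumsum(np.random.default_rng(0).normal(size=4096))  # toy field

coeffs = pywt.wavedec(signal, "db4", level=5)
flat = np.concatenate([np.abs(c) for c in coeffs])
cutoff = np.quantile(flat, 0.95)   # keep roughly 5% of coefficients

kept = [pywt.threshold(c, cutoff, mode="hard") for c in coeffs]
recon = pywt.waverec(kept, "db4")[: len(signal)]

max_err = np.max(np.abs(signal - recon))  # the precision-guarantee metric
nonzero = sum(int(np.count_nonzero(c)) for c in kept)
print(f"kept {nonzero}/{len(flat)} coefficients, max abs error {max_err:.4f}")
```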
VisIO: Enabling Interactive Visualization of Ultra-Scale, Time Series Data via High-Bandwidth Distributed I/O Systems
by Christopher Mitchell, James Ahrens, and Jun Wang
Petascale simulations compute at resolutions ranging into billions of
cells and write terabytes of data for visualization and analysis.
Interactive visualization of this time series is a desired step before
starting a new run. The I/O subsystem and associated network often are a
significant impediment to interactive visualization of time-varying
data, as they are not configured or provisioned to provide necessary I/O
read rates. In this paper, we propose a new I/O library for
visualization applications: VisIO. Visualization applications commonly
use N-to-N reads within their parallel-enabled readers, which provides an
incentive for a shared-nothing approach to I/O, similar to other
data-intensive approaches such as Hadoop. However, unlike other
data-intensive applications, visualization requires: (1) interactive
performance for large data volumes, (2) compatibility with MPI and POSIX
file system semantics for compatibility with existing infrastructure,
and (3) use of existing file formats and their stipulated data
partitioning rules. VisIO provides a mechanism for using a non-POSIX
distributed file system to provide linear scaling of I/O bandwidth. In
addition, we introduce a novel scheduling algorithm that helps to
co-locate visualization processes on nodes with the requested data.
Testing using VisIO integrated into ParaView was conducted using the
Hadoop Distributed File System (HDFS) on TACC's Longhorn cluster. A
representative dataset, VPIC, read across 128 nodes showed a 64.4% read
performance improvement compared to the provided Lustre installation.
Also tested was a dataset representing a global ocean salinity
simulation, which showed a 51.4% improvement in read performance over
Lustre when using our VisIO system. VisIO provides powerful
high-performance I/O services to visualization applications, allowing
for interactive performance with ultra-scale, time-series data.
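The co-location idea reduces to a greedy matching: ask the file system where each reader's block lives (HDFS exposes block locations) and place the reader on a free node holding a replica, falling back to any free node. A sketch with invented data structures; the real scheduler is more involved.

```python
def schedule(readers, block_locations, nodes):
    # readers: reader_id -> file block it will read
    # block_locations: block -> set of nodes holding a replica
    # Greedy co-location: prefer a free node that already has the data.
    free = set(nodes)
    placement = {}
    for reader, block in readers.items():
        local = block_locations.get(block, set()) & free
        node = min(local) if local else min(free)   # fall back to any node
        placement[reader] = node
        free.discard(node)
    return placement

readers = {"vis-rank-0": "blk_01", "vis-rank-1": "blk_02",
           "vis-rank-2": "blk_03"}
locations = {"blk_01": {"n1", "n3"}, "blk_02": {"n2"}, "blk_03": {"n1"}}
print(schedule(readers, locations, ["n1", "n2", "n3", "n4"]))
# rank-0 takes n1, rank-1 takes n2; rank-2's only replica node is busy,
# so it falls back to a remote read from n3.
```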
Jitter-Free Co-Processing on a Prototype Exascale Storage Stack
by J. Bent, S. Faibish, J. Ahrens, G. Grider, J. Patchett, and J. Woodring
In the petascale era, the storage stack used by the extreme scale high
performance computing community is fairly homogeneous across sites. On
the compute edge of the stack, file system clients or IO forwarding
services direct IO over an interconnect network to a relatively small
set of IO nodes. These nodes forward the requests over a secondary
storage network to a spindle-based parallel file system. Unfortunately,
this architecture will become unviable in the exascale era. As the
density growth of disks continues to outpace increases in their
rotational speeds, disks are becoming increasingly cost-effective for
capacity but decreasingly so for bandwidth. Fortunately, new storage
media such as solid state devices are filling this gap; although not
cost-effective for capacity, they are so for performance. This suggests
that the storage stack at exascale will incorporate solid state storage
between the compute nodes and the parallel file systems. There are
three natural places into which to position this new storage layer:
within the compute nodes, the IO nodes, or the parallel file system. In
this paper, we argue that the IO nodes are the appropriate location for
HPC workloads and show results from a prototype system that we have
built accordingly. Running a pipeline of computational simulation and
visualization, we show that our prototype system reduces total time to
completion by up to 30%.
On the Role of Burst Buffers in Leadership-class Storage Systems
by Liu, N; Cope, J; Carns, P; Carothers, C; Ross, R; Grider, G; Crume, A; and Maltzahn, C
The largest-scale high-performance computing (HPC) systems are stretching
parallel file systems to their limits in terms of aggregate bandwidth
and numbers of clients. To further sustain the scalability of these
file systems, researchers and HPC storage architects are exploring
various storage system designs. One proposed storage system design
integrates a tier of solid-state burst buffers into the storage system
to absorb application I/O requests. In this paper, we simulate and
explore this storage system design for use by large-scale HPC systems.
First, we examine application I/O patterns on an existing large-scale
HPC system to identify common burst patterns. Next, we describe
enhancements to the CODES storage system simulator to enable our burst
buffer simulations. These enhancements include the integration of a
burst buffer model into the I/O forwarding layer of the simulator, the
development of an I/O kernel description language and interpreter, the
development of a suite of I/O kernels that are derived from observed
I/O patterns, and fidelity improvements to the CODES models. We
evaluate the I/O performance for a set of multiapplication I/O
workloads and burst buffer configurations. We show that burst buffers
can accelerate the application perceived throughput to the external
storage system and can reduce the amount of external storage bandwidth
required to meet a desired application perceived throughput goal.
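The bandwidth argument behind burst buffers is simple arithmetic: the buffer tier must absorb the peak checkpoint burst, while external storage only needs to drain at the long-run average rate. A back-of-the-envelope sketch (all numbers invented for illustration):

```python
# Sizing sketch for a burst buffer tier (all numbers hypothetical).
checkpoint_bytes  = 200e12   # 200 TB written per checkpoint burst
burst_seconds     = 300      # app tolerates a 5-minute write pause
checkpoint_period = 3600     # one checkpoint per hour

bb_bandwidth  = checkpoint_bytes / burst_seconds      # absorb the burst
ext_bandwidth = checkpoint_bytes / checkpoint_period  # drain on average

print(f"burst buffer tier: {bb_bandwidth / 1e9:.0f} GB/s")
print(f"external storage:  {ext_bandwidth / 1e9:.0f} GB/s "
      f"({bb_bandwidth / ext_bandwidth:.0f}x less)")
```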
Low-Power Amdahl-Balanced Blades for Data Intensive Computing
by Alexander S. Szalay, Gordon C. Bell, H. Howie Huang, Andreas Terzis, Alainna White
Enterprise and scientific data sets double every year, forcing similar
growths in storage size and power consumption. As a consequence,
current system architectures used to build data warehouses are about to
hit a power consumption wall. In this paper we propose an alternative
architecture comprising a large number of so-called Amdahl blades that
combine energy-efficient CPUs with solid state disks to increase
sequential read I/O throughput by an order of magnitude while keeping
power consumption constant. We also show that while keeping the total
cost of ownership constant, Amdahl blades offer five times the
throughput of a state-of-the-art computing cluster for data-intensive
applications. Finally, using the scaling laws originally postulated by
Amdahl, we show that systems for data-intensive computing must maintain
a balance between low power consumption and per-server throughput to
optimize performance per watt.
HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads
by Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, Alexander Rasin
The production environment for analytical data management applications
is rapidly changing. Many enterprises are shifting away from deploying
their analytical databases on high-end proprietary machines, and moving
towards cheaper, lower-end, commodity hardware, typically arranged in a
shared-nothing MPP architecture, often in a virtualized environment
inside public or private "clouds". At the same time, the amount of
data that needs to be analyzed is exploding, requiring hundreds to thousands
of machines to work in parallel to perform the analysis. There tend to
be two schools of thought regarding what technology to use for data
analysis in such an environment. Proponents of parallel databases argue
that the strong emphasis on performance and efficiency of parallel
databases makes them well-suited to perform such analysis. On the other
hand, others argue that MapReduce-based systems are better suited due to
their superior scalability, fault tolerance, and flexibility to handle
unstructured data. In this paper, we explore the feasibility of building
a hybrid system that takes the best features from both technologies;
the prototype we built approaches parallel databases in performance and
efficiency, yet still yields the scalability, fault tolerance, and
flexibility of MapReduce-based systems.
Hive: A Petabyte Scale Data Warehouse Using Hadoop
by Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad
Chakka, Ning Zhang, Suresh Antony, Hao Liu and Raghotham Murthy
The size of data sets being collected and analyzed in the industry for
business intelligence is growing rapidly, making traditional
warehousing solutions prohibitively expensive. Hadoop is a
popular open-source map-reduce implementation which is being used
in companies like Yahoo, Facebook etc. to store and process
extremely large data sets on commodity hardware. However, the
map-reduce programming model is very low level and requires developers
to write custom programs which are hard to maintain and reuse. In this
paper, we present Hive, an open-source data warehousing solution
built on top of Hadoop. Hive supports queries expressed in a SQL-like
declarative language, HiveQL, which are compiled into map-reduce
jobs that are executed using Hadoop. In addition, HiveQL enables users
to plug in custom map-reduce scripts into queries. The language
includes a type system with support for tables containing primitive
types, collections like arrays and maps, and nested compositions of
the same. The underlying IO libraries can be extended to query data in
custom formats. Hive also includes a system catalog, the Metastore,
that contains schemas and statistics, which are useful in data
exploration, query optimization and query compilation. In Facebook,
the Hive warehouse contains tens of thousands of tables and stores
over 700TB of data and is being used extensively for both reporting
and ad-hoc analyses by more than 200 users per month.
A Comparison of Approaches to Large-Scale Data Analysis
by Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker
There is currently considerable enthusiasm around the MapReduce (MR)
paradigm for large-scale data analysis [17]. Although the basic control
flow of this framework has existed in parallel SQL database management
systems (DBMS) for over 20 years, some have called MR a dramatically new
computing model [8, 17]. In this paper, we describe and compare both
paradigms. Furthermore, we evaluate both kinds of systems in terms of
performance and development complexity. To this end, we define a
benchmark consisting of a collection of tasks that we have run on an
open source version of MR as well as on two parallel DBMSs. For each
task, we measure each system's performance for various degrees of
parallelism on a cluster of 100 nodes. Our results reveal some
interesting trade-offs. Although the process to load data into and tune
the execution of parallel DBMSs took much longer than the MR system, the
observed performance of these DBMSs was strikingly better. We speculate
about the causes of the dramatic performance difference and consider
implementation concepts that future systems should take from both kinds
of architectures.
Data Management Challenges of Data-Intensive Scientific Workflows
by Ewa Deelman and Ann Chervenak
Scientific workflows play an important role in today's science. Many
disciplines rely on workflow technologies to orchestrate the execution
of thousands of computational tasks. Much research to-date focuses on
efficient, scalable, and robust workflow execution, especially in
distributed environments. However, many challenges remain in the area of
data management related to workflow creation, execution, and result
management. In this paper we examine some of these issues in the context
of the entire workflow lifecycle.
DISC: A System for Distributed Data Intensive Scientific Computing
by George Kola, Tevfik Kosar, Jaime Frey, Miron Livny, Robert Brunner, Michael Remijan
The increasing computation and data requirements of scientific
applications have necessitated the use of distributed resources owned
by collaborating parties. While existing distributed systems work well
for computation that requires limited data movement, they fail in
unexpected ways when the computation accesses, creates, and moves large
amounts of data over wide-area networks. In this work, we analyzed the
problems with existing systems and used the result of this analysis to
design our own system. Realizing that it takes a long while for a new
system to stabilize, we tried our best to reuse existing components. We
added new components only when we could not get by with adding features
to existing ones. We used our system to successfully process three
terabytes of DPOSS image data in under a week by using idle CPUs in
desktops and commodity clusters in the UW-Madison Computer Science
Department and Starlight.
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
by Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica
We present Mesos, a platform for sharing commodity clusters between
multiple diverse cluster computing frameworks, such as Hadoop and MPI.
Sharing improves cluster utilization and avoids per-framework data
replication. Mesos shares resources in a fine-grained manner, allowing
frameworks to achieve data locality by taking turns reading data stored
on each machine. To support the sophisticated schedulers of today's
frameworks, Mesos introduces a distributed two-level scheduling
mechanism called resource offers. Mesos decides how many resources to
offer each framework, while frameworks decide which resources to accept
and which computations to run on them. Our results show that Mesos can
achieve near-optimal data locality when sharing the cluster among
diverse frameworks, can scale to 50,000 (emulated) nodes, and is
resilient to failures.
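Resource offers invert the usual scheduler relationship: Mesos decides which framework is offered what, and each framework's own scheduler decides what to accept. A toy round of that two-level protocol; the framework policies below are invented stand-ins.

```python
def mesos_round(free_slots, frameworks):
    # free_slots: node -> available CPU cores.
    # Each framework scheduler is offered resources and returns the
    # subset it accepts; Mesos only decides who gets offered what.
    launched = []
    for name, scheduler in frameworks:
        offer = dict(free_slots)
        accepted = scheduler(offer)   # framework-side decision
        for node, cores in accepted.items():
            free_slots[node] -= cores
            launched.append((name, node, cores))
        free_slots = {n: c for n, c in free_slots.items() if c > 0}
    return launched

def hadoop_sched(offer):
    # Accept one core on a node with local data (pretend n1 has it).
    return {"n1": 1} if offer.get("n1", 0) >= 1 else {}

def mpi_sched(offer):
    # Accept only if a 2-core gang fits on a single node.
    for node, cores in offer.items():
        if cores >= 2:
            return {node: 2}
    return {}

print(mesos_round({"n1": 2, "n2": 4}, [("hadoop", hadoop_sched),
                                       ("mpi", mpi_sched)]))
```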
Parallel Visualization on Large Clusters using MapReduce
by Vo, H.T.; Bronson, J.; Summa, B.; Comba, J.L.D.; Freire, J.; Howe, B.; Bremer, P.-T.; Silva, C.T.
Large-scale visualization systems are typically designed to efficiently
"push" datasets through the graphics hardware. However, exploratory
visualization systems are increasingly expected to support scalable data
manipulation, restructuring, and querying capabilities in addition to core
visualization algorithms. We posit that new emerging abstractions for parallel
data processing, in particular computing clouds, can be leveraged to support
large-scale data exploration through visualization. In this paper, we take a
first step in evaluating the suitability of the MapReduce framework to
implement large-scale visualization techniques. MapReduce is a lightweight,
scalable, general-purpose parallel data processing framework increasingly
popular in the context of cloud computing. Specifically, we implement and
evaluate a representative suite of visualization tasks (mesh rendering,
isosurface extraction, and mesh simplification) as MapReduce programs, and
report quantitative performance results applying these algorithms to realistic
datasets. For example, we perform isosurface extraction of up to 16 isovalues
for volumes composed of 27 billion voxels, simplification of meshes with 30GBs
of data and subsequent rendering with image resolutions up to 80000² pixels.
Our results indicate that the parallel scalability, ease of use, ease of access
to computing resources, and fault-tolerance of MapReduce offer a promising
foundation for a combined data manipulation and data visualization system
deployed in a public cloud or a local commodity cluster.
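Rendering fits the model naturally: the map phase keys each primitive by the image tile it covers, the shuffle groups fragments per tile, and the reduce phase composites each tile with a z-buffer. A miniature version over colored points, as a stand-in for the paper's mesh pipelines:

```python
from collections import defaultdict

TILE = 4   # tile width/height in pixels

def map_phase(points):
    # points: (x, y, depth, color). Key each fragment by its image tile.
    for x, y, depth, color in points:
        yield (x // TILE, y // TILE), (x, y, depth, color)

def reduce_phase(tile, fragments):
    # Composite one tile: keep the nearest fragment per pixel (z-buffer).
    nearest = {}
    for x, y, depth, color in fragments:
        if (x, y) not in nearest or depth < nearest[(x, y)][0]:
            nearest[(x, y)] = (depth, color)
    return tile, nearest

shuffle = defaultdict(list)
pts = [(1, 1, 0.9, "red"), (1, 1, 0.2, "blue"), (9, 2, 0.5, "green")]
for key, value in map_phase(pts):
    shuffle[key].append(value)   # the shuffle stage groups by tile
for tile, frags in shuffle.items():
    print(reduce_phase(tile, frags))
# tile (0,0): pixel (1,1) keeps blue (closer); tile (2,0): (9,2) is green
```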
Provenance for computational tasks: A survey
by J Freire, D Koop, E Santos, CT Silva
The problem of systematically capturing and managing provenance for computational tasks has recently received significant attention because of its relevance to a wide range of domains and applications. The authors give an overview of important concepts related to provenance management, so that potential users can make informed decisions when selecting or designing a provenance solution.
The open provenance model core specification (v1. 1)
by L Moreau, B Clifford, J Freire, J Futrelle, Y Gil, P Groth, N Kwasnikowska ...
The Open Provenance Model is a model of provenance that is designed to meet the following requirements: (1) Allow provenance information to be exchanged between systems, by means of a compatibility layer based on a shared provenance model. (2) Allow developers to build and share tools that operate on such a provenance model. (3) Define provenance in a precise, technology-agnostic manner. (4) Support a digital representation of provenance for any “thing”, whether produced by computer systems or not. (5) Allow multiple levels of description to coexist. (6) Define a core set of rules that identify the valid inferences that can be made on provenance representation. This document contains the specification of the Open Provenance Model (v1.1) resulting from a community effort to achieve inter-operability in the Provenance Challenge series.
Using NoSQL Databases for Streaming Network Analysis
by Brian Wylie, Daniel Dunlavy, Warren Davis IV, Jeff Baumes
The high-volume, low-latency world of network traffic presents significant
obstacles for complex analysis techniques. The unique challenge of adapting
powerful but high-latency models to real-time network streams is the basis of
our cyber security project. In this paper we discuss our use of NoSQL databases
in a framework that enables the application of computationally expensive models
against a real-time network data stream. We describe how this approach
transforms the highly constrained (and sometimes arcane) world of real-time
network analysis into a more developer friendly model that relaxes many of the
traditional constraints associated with streaming data. Our primary use of the
system is for conducting streaming text analysis and classification activities
on a network link receiving ~200,000 emails per day.
Interpreting the Data: Parallel Analysis with Sawzall
by Rob Pike, Sean Dorward, Robert Griesemer, Sean Quinlan
Very large data sets often have a flat but regular structure and span multiple
disks and machines. Examples include telephone call records, network logs, and
web document repositories. These large data sets are not amenable to study
using traditional database techniques, if only because they can be too large to
fit in a single relational database. On the other hand, many of the analyses
done on them can be expressed using simple, easily distributed computations:
filtering, aggregation, extraction of statistics, and so on.
We present a system for automating such analyses. A filtering phase, in which a
query is expressed using a new programming language, emits data to an
aggregation phase. Both phases are distributed over hundreds or even thousands
of computers. The results are then collated and saved to a file. The design --
including the separation into two phases, the form of the programming language,
and the properties of the aggregators -- exploits the parallelism inherent in
having data and computation distributed across many machines.
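The two-phase shape is easy to mimic: a per-record filter program emits (table, key, value) tuples into named aggregator tables, and only the commutative aggregates ever need to be collated across machines. A compact sketch with an invented record format:

```python
from collections import Counter

# Phase 1: filtering. One record in, zero or more (table, key, value)
# emissions out; trivially parallel across machines.
def filter_program(record):
    origin, bytes_sent = record
    yield ("count", origin, 1)
    yield ("traffic", origin, bytes_sent)

# Phase 2: aggregation. Tables fold emissions with a fixed operator
# (sum here), so partial results from many machines combine freely.
tables = {"count": Counter(), "traffic": Counter()}
log = [("kr", 125), ("us", 70), ("kr", 300)]
for record in log:
    for table, key, value in filter_program(record):
        tables[table][key] += value

print(dict(tables["count"]))     # {'kr': 2, 'us': 1}
print(dict(tables["traffic"]))   # {'kr': 425, 'us': 70}
```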
Dremel: Interactive Analysis of Web-Scale Datasets
by Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis
Dremel is a scalable, interactive ad-hoc query system for analysis of read-only
nested data. By combining multi-level execution trees and columnar data layout,
it is capable of running aggregation queries over trillion-row tables in
seconds. The system scales to thousands of CPUs and petabytes of data, and has
thousands of users at Google. In this paper, we describe the architecture and
implementation of Dremel, and explain how it complements MapReduce-based
computing. We present a novel columnar storage representation for nested
records and discuss experiments on few-thousand node instances of the system.
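Columnar storage is the heart of the design: each field is stored contiguously, so an aggregation reads only the columns it names. A much-simplified sketch for flat records; Dremel's real representation adds repetition and definition levels to encode nesting, which this omits.

```python
from collections import Counter

def to_columns(records):
    # Strip records into one array per field (flat records only;
    # Dremel adds repetition/definition levels for nested data).
    columns = {}
    for rec in records:
        for field, value in rec.items():
            columns.setdefault(field, []).append(value)
    return columns

records = [
    {"doc_id": 10, "lang": "en", "views": 120},
    {"doc_id": 20, "lang": "en", "views": 45},
    {"doc_id": 30, "lang": "de", "views": 7},
]
cols = to_columns(records)

# SELECT SUM(views) reads a single column; the rest stays on disk.
print(sum(cols["views"]))      # 172
# SELECT lang, COUNT(*) GROUP BY lang touches only the "lang" column.
print(Counter(cols["lang"]))   # {'en': 2, 'de': 1}
```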