Summer School

Storage, IO and data analytics

JLESC Summer School

July 2-3, 2015

THURSDAY, JULY 2

9:00-13:00

An Introduction to HPC Storage and I/O

Rob Ross (ANL)

The goal of this tutorial is to serve as an introduction to the HPC storage stack. We will present a high-level overview of how HPC storage systems are organized and how they fit into HPC architectures. We will then discuss the HPC I/O software stack: its major components and how they work together to deliver high-performance, high-productivity I/O. We will highlight popular I/O interfaces for HPC applications along with their strengths and weaknesses, and we will discuss common performance pitfalls and how to avoid them.
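One classic pitfall of the kind the tutorial discusses is issuing many small writes instead of a few large ones. The following minimal sketch (illustrative only, plain Python file I/O standing in for a parallel file system) contrasts the two access patterns; the function names are ours, not from any library:

```python
# Illustrative sketch: many small unbuffered writes vs. one aggregated write.
# On a parallel file system the small-write pattern is a common performance killer.
import os
import tempfile

def write_small(path, records):
    # One write call per record: many small requests hit the storage system.
    with open(path, "wb", buffering=0) as f:
        for r in records:
            f.write(r)

def write_aggregated(path, records):
    # Aggregate records in memory first, then issue a single large write.
    with open(path, "wb") as f:
        f.write(b"".join(records))

records = [b"x" * 64 for _ in range(10000)]
with tempfile.TemporaryDirectory() as d:
    p_small = os.path.join(d, "small.bin")
    p_agg = os.path.join(d, "agg.bin")
    write_small(p_small, records)
    write_aggregated(p_agg, records)
    # Both patterns produce identical file contents; only the request
    # pattern (and thus the performance) differs.
    assert os.path.getsize(p_small) == os.path.getsize(p_agg)
```

In real HPC codes this aggregation is typically done for you by collective I/O layers (e.g. MPI-IO) rather than by hand.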

14:00-18:00

Big data technology

Gabriel Antoniu (INRIA)
Rosa M Badia (BSC)

The goal of this tutorial is to give an introduction to Big Data and Data Science (challenges, applications, etc.) and then explore technologies used to handle big data, such as MapReduce (and its successors), Hadoop, Spark, and Stratosphere/Flink. The tutorial will also cover the PyCOMPSs programming model and its integration with dataClay and other new storage technologies. This theoretical tutorial will be complemented by a hands-on session on Hadoop on Friday afternoon.
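The essence of the MapReduce model covered in the tutorial can be sketched in a few lines of plain Python (the function names below are illustrative, not any framework's API): a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group.

```python
# Minimal word-count sketch of the MapReduce model in pure Python.
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    # map: emit (word, 1) for every word in every document
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # shuffle: sort intermediate pairs and group them by key
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [v for _, v in group]

def reduce_phase(grouped):
    # reduce: sum the counts for each word
    return {key: sum(values) for key, values in grouped}

docs = ["big data big compute", "data science"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'compute': 1, 'data': 2, 'science': 1}
```

Frameworks such as Hadoop and Spark apply exactly this pattern, but distribute the map and reduce phases across a cluster and handle the shuffle over the network.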

FRIDAY, JULY 3

9:00-13:00

Data analytics

Morris Riedel, Markus Goetz, Christian Bodenstein (JUELICH/UoIceland)

The goal of this tutorial is to introduce participants to two concrete and widely used data analytics techniques for analyzing ‘big data’ in scientific and engineering applications. After a brief introduction to the general use of machine learning, data mining, and statistics in data analytics, we start with the ‘clustering’ technique, which partitions datasets into previously unknown subgroups (i.e. clusters). From the broad class of available methods we focus on the density-based spatial clustering of applications with noise (DBSCAN) algorithm, which also enables the identification of outliers or interesting anomalies. A parallel and scalable DBSCAN implementation, based on MPI/OpenMP and the Hierarchical Data Format (HDF), will be used during the hands-on session with various interesting datasets. The second technique we cover is ‘classification’, in which groups already exist and new data is examined to determine to which existing group it belongs. As one of the best out-of-the-box methods for classification we focus on the Support Vector Machine (SVM) algorithm, including selected kernel methods. A parallel and scalable SVM implementation, based on MPI, will be used during the hands-on session with a couple of challenging datasets. Both HPC implementations will be compared with solutions based on high-throughput computing (e.g. MapReduce, Hadoop, Spark/MLlib) and with serial approaches (R, Octave, Matlab, etc.).
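To make the clustering technique concrete, here is a tiny serial DBSCAN sketch in pure Python (our own toy implementation, not the tutorial's MPI/OpenMP code): points with at least `min_pts` neighbours within distance `eps` seed clusters that grow through density-connected neighbours, while isolated points are labelled noise (-1), which is how the algorithm identifies outliers.

```python
# Toy serial DBSCAN: labels[i] is the cluster id of point i, or -1 for noise.
import math

def region_query(points, i, eps):
    # All points within eps of point i (including i itself).
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)   # None = not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = -1           # noise (may later become a border point)
            continue
        cluster += 1                 # start a new cluster from this core point
        labels[i] = cluster
        seeds = list(neighbours)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                seeds.extend(j_neighbours)   # j is a core point: keep growing
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
print(dbscan(pts, eps=2.0, min_pts=2))  # [0, 0, 0, 1, 1, -1]
```

The parallel implementations used in the hands-on session follow the same logic but partition the points across processes; the expensive part to distribute is the neighbourhood search, which is quadratic in this naive sketch.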

14:00-18:00

Big Data Processing Using Hadoop: An Introductory Tutorial

The goal of this tutorial is to serve as a first step towards exploring the Hadoop platform and to provide a short introduction to working with big data in Hadoop. We will first present MapReduce as an important programming model for Big Data processing in the Cloud. The Hadoop ecosystem and some of its major features will then be discussed. Finally, we will discuss several approaches and methods used to optimize the performance of Hadoop in the Cloud. Several hands-on exercises will be provided to study the operation of the Hadoop platform along with the implementation of MapReduce applications.
In general, the tutorial will include short presentations and practical sessions. All practical sessions will use a VM, which will be provided during the training.
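As a taste of the kind of MapReduce application covered in the hands-on sessions, a word count can be written as a Hadoop Streaming-style mapper and reducer. The sketch below is our own illustrative example (combined into one file for brevity; with Hadoop Streaming the mapper and reducer would normally be separate executables that read stdin and write tab-separated key/value pairs):

```python
# Hadoop Streaming-style word count: mapper emits "word\t1" lines,
# reducer sums counts for each word, relying on Hadoop's sort-by-key shuffle.
import sys

def mapper(lines):
    # emit one "word\t1" record per word
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Hadoop delivers mapper output sorted by key, so equal words are adjacent.
    current, total = None, 0
    for line in lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

if __name__ == "__main__":
    stage = mapper if (len(sys.argv) < 2 or sys.argv[1] == "map") else reducer
    for out in stage(sys.stdin):
        print(out)
```

Such scripts are typically launched on a cluster with the Hadoop Streaming jar (passing them as the `-mapper` and `-reducer` programs), but they can be tested locally by piping the mapper's sorted output into the reducer.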