Keynotes | Big Data Mining

Jiawei Han, Abel Bliss Professor of Computer Science, University of Illinois at Urbana-Champaign

Title: Challenging Problems for Scalable Mining of Heterogeneous Social and Information Networks

Jiawei Han is Abel Bliss Professor of Computer Science at the University of Illinois at Urbana-Champaign. His research covers data mining, information network analysis, database systems, and data warehousing, with over 600 journal and conference publications. He has chaired or served on many program committees of international conferences, including as PC co-chair for the KDD, SDM, and ICDM conferences and as Americas Coordinator for the VLDB conferences. He served as the founding Editor-in-Chief of ACM Transactions on Knowledge Discovery from Data and is Director of the Information Network Academic Research Center, supported by the U.S. Army Research Lab. He is a Fellow of ACM and IEEE, and received the 2004 ACM SIGKDD Innovations Award, the 2005 IEEE Computer Society Technical Achievement Award, the 2009 IEEE Computer Society W. Wallace McDowell Award, and the 2011 Daniel C. Drucker Eminent Faculty Award at UIUC. His book “Data Mining: Concepts and Techniques” is widely used as a textbook worldwide.

Challenging Problems for Scalable Mining of Heterogeneous Social and Information Networks by Jiawei Han at BigMine-13

Abstract: In today’s interconnected world, social and informational entities are linked to one another, forming gigantic, integrated social and information networks. By structuring these data objects into multiple types, such networks become semi-structured heterogeneous social and information networks. Most real-world applications that handle big data, including social media and social networks, medical information systems, online e-commerce systems, and database systems, can be structured into typed, heterogeneous social and information networks. For example, in a medical care network, objects of multiple types, such as patients, doctors, diseases, and medications, and links such as visits, diagnoses, and treatments are intertwined, providing rich information and forming heterogeneous information networks. Effective analysis of large-scale heterogeneous social and information networks poses an interesting but critical challenge.

In this talk, we present a set of data mining scenarios in heterogeneous social and information networks and show that mining typed, heterogeneous networks is a new and promising frontier in data mining research. However, such mining may raise serious challenges for scalable computation. We identify a set of these problems and call for serious study of them. They include how to efficiently compute (1) meta path-based similarity search, (2) rank-based clustering, (3) rank-based classification, (4) meta path-based link/relationship prediction, and (5) topical hierarchies from heterogeneous information networks. We introduce some recent efforts, discuss the trade-offs between query-independent pre-computation and query-dependent online computation, and point out some promising research directions.
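To make item (1) concrete, here is a minimal sketch of one published meta path-based similarity measure, PathSim, over the symmetric meta path Author-Paper-Venue-Paper-Author. The abstract does not say which measure the talk covers, so the choice of PathSim is an assumption, and the toy bibliographic data and function names are hypothetical.

```python
from collections import defaultdict

def commuting_matrix(author_papers, paper_venue):
    """Count Author-Paper-Venue-Paper-Author path instances for every author pair."""
    # Half path: how many papers each author has in each venue.
    author_venue = {}
    for author, papers in author_papers.items():
        counts = defaultdict(int)
        for paper in papers:
            counts[paper_venue[paper]] += 1
        author_venue[author] = counts
    # Full path: join the half path with its reverse on the shared venue.
    m = {}
    for a, venues_a in author_venue.items():
        m[a] = {}
        for b, venues_b in author_venue.items():
            m[a][b] = sum(c * venues_b.get(v, 0) for v, c in venues_a.items())
    return m

def pathsim(m, x, y):
    """PathSim score: 2 * M[x][y] / (M[x][x] + M[y][y]), or 0.0 if both have no paths."""
    denom = m[x][x] + m[y][y]
    return 2 * m[x][y] / denom if denom else 0.0

# Hypothetical toy network (assumption, for illustration only).
author_papers = {"ann": ["p1", "p2"], "bob": ["p3", "p4"], "cat": ["p5"]}
paper_venue = {"p1": "KDD", "p2": "KDD", "p3": "KDD", "p4": "VLDB", "p5": "VLDB"}

m = commuting_matrix(author_papers, paper_venue)
for other in ("bob", "cat"):
    print("ann", other, round(pathsim(m, "ann", other), 3))
```

Materializing `m` for all pairs ahead of time corresponds to query-independent pre-computation; computing only the rows a query touches is the query-dependent online alternative, and the all-pairs loop above is exactly the part that stops scaling on large networks.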

Christos Faloutsos, Professor at Carnegie Mellon University

Title: Large Graph Mining – Patterns, tools and cascade analysis

Christos Faloutsos is a Professor at Carnegie Mellon University. He has received the Presidential Young Investigator Award from the National Science Foundation (1989), the Research Contributions Award at ICDM 2006, the SIGKDD Innovations Award (2010), nineteen “best paper” awards (including two “test of time” awards), and four teaching awards. He is an ACM Fellow and has served as a member of the executive committee of SIGKDD. He has published over 200 refereed articles, 11 book chapters, and one monograph. He holds six patents and has given over 35 tutorials and over 15 invited distinguished lectures. His research interests include data mining for graphs and streams, fractals, database performance, and indexing for multimedia and bio-informatics data.

Large Graph Mining – Patterns, tools and cascade analysis by Christos Faloutsos at BigMine-13

Abstract: What do graphs look like? How do they evolve over time? How do influence, news, and viruses propagate over time? We present a long list of static and temporal laws, and some recent observations on real graphs. For tools, we present an overview of the PEGASUS system, which is designed for handling billion-node graphs and runs on top of Hadoop. Finally, for cascades and propagation, we show how to measure the connectivity of a graph and how to achieve near-optimal immunization to slow down virus propagation.
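The abstract does not name the connectivity measure it uses. One measure commonly used in epidemic-threshold analysis is the largest eigenvalue of the adjacency matrix, and the sketch below estimates it by power iteration on a small undirected graph. Treating this as the intended measure is an assumption, and the function names and the star-graph example are hypothetical.

```python
import math

def leading_eigenvalue(adj, iterations=200):
    """Estimate the largest eigenvalue of a symmetric 0/1 adjacency matrix.

    Power iteration runs on A + I so that it also converges on bipartite
    graphs, where A alone has eigenvalues of equal magnitude and opposite sign.
    """
    n = len(adj)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [v[i] + sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T A v recovers the eigenvalue of A itself.
    av = [sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * av[i] for i in range(n))

def remove_node(adj, k):
    """Return the adjacency matrix with node k (row and column) deleted."""
    n = len(adj)
    return [[adj[i][j] for j in range(n) if j != k] for i in range(n) if i != k]

# Hypothetical example: a star with hub 0 and three leaves.
star = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
print(leading_eigenvalue(star))                  # close to sqrt(3)
print(leading_eigenvalue(remove_node(star, 0)))  # hub removed, no edges remain
```

Under this reading, an immunization strategy would pick the nodes whose removal lowers the leading eigenvalue the most; in the star, removing the hub drops it to zero. The dense matrix loop here is only for illustration and would not scale to billion-node graphs.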

Hong Cheng, Assistant Professor at the Chinese University of Hong Kong

Title: Processing Reachability Queries with Realistic Constraints on Massive Networks

Hong Cheng is an Assistant Professor in the Department of Systems Engineering and Engineering Management at the Chinese University of Hong Kong. She received her Ph.D. degree from the University of Illinois at Urbana-Champaign in 2008. Her research interests include data mining, database systems, and machine learning. She received research paper awards at ICDE’07, SIGKDD’06, and SIGKDD’05, and the certificate of recognition for the 2009 SIGKDD Doctoral Dissertation Award. She is a recipient of the 2010 Vice-Chancellor’s Exemplary Teaching Award at the Chinese University of Hong Kong.

Processing Reachability Queries with Realistic Constraints on Massive Networks by Hong Cheng at BigMine-13

Abstract: Massive graphs are ubiquitous in various application domains, such as social networks, road networks, communication networks, biological networks, RDF graphs, and so on. Such graphs are massive (for example, with hundreds of millions of nodes and edges or even more) and contain rich information (for example, node/edge weights, labels and textual contents). In such massive graphs, an important class of problems is to process various graph structure related queries. Graph reachability, as an example, asks whether a node can reach another in a graph. However, the large graph scale presents new challenges for efficient query processing.

In this talk, I will introduce two new yet important types of graph reachability queries: weight-constraint reachability, which imposes an edge weight constraint on the answer path, and k-hop reachability, which imposes a length constraint on the answer path. With such realistic constraints, we can find more meaningful and practically feasible answers. These two reachability queries have wide applications in many real-world problems, such as QoS routing and trip planning.
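The two query types can be illustrated with a plain breadth-first search. This is a baseline sketch of what the queries ask, not the method presented in the talk, which the abstract does not describe. The adjacency-list layout, the function names, the reading of k-hop as "at most k edges", and the example graph are all assumptions.

```python
from collections import deque

def k_hop_reachable(adj, source, target, k):
    """True if target can be reached from source using at most k edges."""
    if source == target:
        return True
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond the hop budget
        for nxt, _weight in adj.get(node, []):
            if nxt == target:
                return True
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return False

def weight_constrained_reachable(adj, source, target, allowed):
    """True if a path exists on which every edge weight satisfies allowed(weight)."""
    if source == target:
        return True
    seen = {source}
    frontier = deque([source])
    while frontier:
        node = frontier.popleft()
        for nxt, weight in adj.get(node, []):
            if not allowed(weight) or nxt in seen:
                continue
            if nxt == target:
                return True
            seen.add(nxt)
            frontier.append(nxt)
    return False

# Hypothetical directed graph: node -> list of (neighbor, edge weight).
adj = {"a": [("b", 5)], "b": [("c", 2)], "c": [("d", 7)]}

print(k_hop_reachable(adj, "a", "d", 2))                                # False
print(k_hop_reachable(adj, "a", "d", 3))                                # True
print(weight_constrained_reachable(adj, "a", "d", lambda w: w >= 2))    # True
print(weight_constrained_reachable(adj, "a", "d", lambda w: w >= 3))    # False
```

Each call walks the graph from scratch, which is the cost that becomes prohibitive at hundreds of millions of nodes and edges.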

Xavier Amatriain, Director of Personalization Science and Engineering, Netflix

Title: Big & Personal: the data and the models behind Netflix recommendations

Xavier Amatriain is Director of Personalization Science and Engineering at Netflix, where he leads the work on the next generation of recommendation algorithms. He works at the crossroads of machine learning research, large-scale software engineering, and product innovation. Previously, he was a Research Scientist and Professor focused on Recommender Systems and neighboring areas such as Data Mining and Machine Learning. He has authored more than 50 papers in books, journals, and international conferences.

Big & Personal: the data and the models behind Netflix recommendations by Xavier Amatriain at BigMine-13

Abstract: Since the Netflix $1 million Prize, announced in 2006, our company has been known for having personalization at the core of our product. Even at that point in time, the dataset that we released was considered “large”, and we stirred innovation in the (Big) Data Mining research field. Our current product offering is focused on instant video streaming, and our data is now many orders of magnitude larger. Not only do we have many more users in many more countries, but we also receive many more streams of data. Besides the ratings, we now also use information such as what our members play, browse, or search.

In this talk I will discuss the different approaches we follow to deal with these large streams of data in order to extract information for personalizing our service. I will describe some of the machine learning models used, as well as the architectures that allow us to combine complex offline batch processes with real-time data streams.