Data Mining: The Textbook

Data Mining: The Textbook, Springer, May 2015

Charu C. Aggarwal.

Comprehensive textbook on data mining: Table of Contents

PDF Download Link (Free for computers connected to subscribing institutions only)

Buy hard-cover or PDF (PDF has embedded links for navigation on e-readers)

Buy low-cost paperback edition (Instructions for computers connected to subscribing institutions only)

The emergence of data science as a discipline requires the development of a book that goes beyond the traditional focus of books on fundamental data mining problems. More emphasis needs to be placed on advanced data types such as text, time series, discrete sequences, spatial data, graph data, and social networks. This comprehensive data mining book covers the different aspects of data mining, starting from the fundamentals and then moving on to the complex data types and their applications. Therefore, this book may be used for both introductory and advanced data mining courses. The chapters of this book fall into one of three categories:

The fundamental chapters: Data mining has four main problems, which correspond to clustering, classification, association pattern mining, and outlier analysis. These chapters comprehensively discuss a wide variety of methods for these problems.

Domain chapters: These chapters discuss the specific methods used for different domains of data such as text data, time-series data, sequence data, graph data, and spatial data.

Application chapters: These chapters study important applications such as stream mining, Web mining, ranking, recommendations, social networks, and privacy preservation. The domain chapters also have an applied flavor.

The book carefully balances mathematical details and intuition. It contains the necessary mathematical details for professors and researchers, but it is presented in a simple and intuitive style to improve accessibility for students and industrial practitioners. Numerous illustrations, examples, and exercises are included with an emphasis on semantically interpretable examples.

Cost-effective methods for obtaining electronic and hardcopy versions

The book is available in both hardcopy (hardcover) and electronic versions. The hardcover is available through all the usual channels (e.g., Amazon, Barnes and Noble), in Kindle format, and also directly from Springer in hardcopy and PDF format. The good thing about Springer is that electronic versions are often widely accessible at no cost to subscribing institutions, which is particularly convenient for students. My understanding is that a very large fraction of universities in North America, Europe, Australia, and New Zealand are subscribers, and a rapidly increasing number of universities in Asia are also subscribing.

The electronic version is available at the following SpringerLink pointer. For subscribing institutions, click from a computer directly connected to your institution's network to download the book for free. Springer uses the domain name of your computer to regulate access. To be eligible, your institution must subscribe to "e-book package english (Computer Science)" or "e-book package english (full collection)". If your institution is eligible, you will see a (free) 'Download Book' button. Otherwise, you will see a (paid) 'Get Access' button. Sometimes you may be able to download it from your library's e-collection, even when it is not Web-accessible from your institution.

For those who prefer desk copies to electronic books, there are some very cost-effective methods to obtain a paperback MyCopy edition for $25 or less (subscribing institutions only). If you have ever published an article (even a journal article) with Springer, you are also entitled to an additional 40% author discount on any Springer book (including the $25 paperback edition) using the approach described here.

In general, for electronic versions, I highly recommend buying the PDF directly from Springer over Amazon's Kindle edition. The PDF has embedded links that allow navigation on an e-reader, and it will take about 18 MB on your device. In addition, a single PDF can be used on any device or computer. Since the PDFs are fully produced by Springer (unlike the Kindle edition, where Amazon plays a role in conversion), the look and feel is fully controlled by the author and publisher. This makes the PDF version of better quality than the Amazon Kindle edition.

About the Author

Charu Aggarwal is a Distinguished Research Staff Member (DRSM) at the IBM T. J. Watson Research Center in Yorktown Heights, New York. He completed his B.S. at IIT Kanpur in 1993 and his Ph.D. at the Massachusetts Institute of Technology in 1996. He has worked extensively in the field of data mining, with particular interests in data streams, privacy, uncertain data, and social network analysis. He has published 14 books (3 authored and 11 edited) and over 250 papers in refereed venues, and has applied for or been granted over 80 patents. His h-index is 70. Because of the commercial value of the above-mentioned patents, he has received several invention achievement awards and has thrice been designated a Master Inventor at IBM. He is a recipient of an IBM Corporate Award (2003) for his work on bio-terrorist threat detection in data streams, a recipient of the IBM Outstanding Innovation Award (2008) for his scientific contributions to privacy technology, and a recipient of an IBM Research Division Award (2008) for his scientific contributions to data stream research. He has received two best paper awards and an EDBT Test-of-Time Award (2014). He has served as the general or program co-chair of the IEEE Big Data Conference (2014), the ICDM Conference (2015), the ACM CIKM Conference (2015), and the KDD Conference (2016). He also co-chaired the data mining track at the WWW Conference 2009. He served as an associate editor of the IEEE Transactions on Knowledge and Data Engineering from 2004 to 2008. He is an associate editor of the ACM Transactions on Knowledge Discovery and Data Mining Journal, an action editor of the Data Mining and Knowledge Discovery Journal, an associate editor of the IEEE Transactions on Big Data, and an associate editor of the Knowledge and Information Systems Journal. He is editor-in-chief of ACM SIGKDD Explorations. He is a fellow of SIAM (2015), the ACM (2013), and the IEEE (2010) for "contributions to knowledge discovery and data mining techniques."

Solution Manual for Book

The solution manual for the book is available here from Springer. There is a link for the solution manual on this page. If you are an instructor, then you can obtain a copy. Please do not ask me directly for a copy of the solution manual. It can only be distributed by Springer.

Resources for book

The resources for this book will grow over time. I have not yet found time to prepare teaching slides, but I will add them over time. I will also make the PowerPoint figures of the book available soon. Meanwhile, I have added links to various sites on the Internet where software is available for related material. If you use the book and prepare slides, please try to share them on the Internet. I can add a link from my site like the links below (with your name acknowledged, of course).

Chapter 1: An Introduction to Data Mining

Data Sets from UCI Machine Learning Repository

Open Source Data Mining Software (WEKA Workbench)

Scikit-learn Tools (Python)

Apache Mahout Machine Learning Library

Spider Machine Learning Library (MATLAB)

KDnuggets: Resources in Data Mining

IBM SPSS Software Suite

Chapter 2: Data Preparation

IBM SPSS Data Preparation

Data Preparator from Bobbie Stewart

Scikit-learn Data Preparation (Python)

Scikit-learn Dimensionality Reduction (Python)

PCA, SVD, and eigen-decomposition implementation by redSVD

Various forms of matrix decomposition implementations including SVD by ALGLIB

Haar Wavelet Implementation by Tom Gibara

Multidimensional Scaling (MDS) implementation from University of Konstanz

ISOMAP from Stanford University

Weka Matrix class for Matrix Operations such as SVD

Chapter 3: Similarity and Distances

ISOMAP from Stanford University

Dynamic Time Warping by D. Ellis

Edit distance at Rosetta Code

Longest Common Subsequence at Rosetta Code

Chapters 4 and 5: Association Pattern Mining

IBM Quest Data Generator

SPMF Frequent Pattern Mining Implementations

Open Source Data Mining Software (WEKA Workbench)

FIMI Implementations on Frequent Pattern Mining

Implementations by Christian Borgelt

Chapters 6 and 7: Data Clustering

Scikit-learn Data Clustering (Python)

Open Source Data Mining Software (WEKA Workbench)

Apache Mahout Machine Learning Library (Clustering)

Spider Machine Learning Library (MATLAB)

Open source clustering software

Spectral clustering in MATLAB

Nonnegative matrix factorization in Python

OpenSubspace for high-dimensional clustering

IBM SPSS Software Suite

R-archive network

mlpack in C++

Chapters 8 and 9: Outlier Analysis

Scikit-learn Outlier Detection (Python)

Open Source Data Mining Software (WEKA Workbench)

IBM SPSS Software Suite

R-archive network

Chapters 10 and 11: Data Classification

Scikit-learn Data Classification and Regression (Python)

Open Source Data Mining Software (WEKA Workbench)

Apache Mahout Machine Learning Library (Classification)

Spider Machine Learning Library (MATLAB)

IBM SPSS Software Suite

LibSVM for Support Vector Machines

mlpack in C++

ENTOOL for Ensemble Learning and Classification

R-archive network with lots of classification and ensemble methods

Chapter 12: Data Streams

IBM InfoSphere Streams Platform

Reservoir Sampling Implementation

Sketch and Lossy counting implementations

MADlib sketch and Flajolet-Martin implementation

CluStream Implementation

MOA Toolkit for Massive Online Analytics

Chapter 13: Mining Text Data

Bow Toolkit and Data Sets

CRAN text mining packages

Co-clustering implementation

Stanford Topic Modeling

Topic Modeling by David Blei

MALLET topic modeling

Gensim topic modeling in Python

SVMperf Software for scalable text classification

Multinomial Bayes model for classification

Chapter 14: Mining Time-Series Data

Time-series forecasting in R from CRAN

Zaitun open source software

Gait-CAD MATLAB toolbox for clustering, classification, and regression

Cronos open source time-series package


UCR time-series clustering and classification page

Chapter 15: Mining Discrete Sequences

GSP Sequential Pattern Mining from Weka

SPMF Sequential Pattern Mining Implementations

Open Source Data Mining Software (WEKA Workbench)

HMM Implementation in R

Sequence mining for computational biology

Chapter 16: Mining Spatial Data

Spatial data mining software

Spatial econometrics software

Trajectory mining software

A collection of spatial and trajectory software

Chapter 17: Mining Graph Data

Ullmann algorithm for subgraph isomorphism

Frequent subgraph mining toolbox

Cheminformatics toolbox for kernels

Python Graph Implementations

Chapter 18: Mining Web Data

SNAP: Stanford Network Analysis Project

PageRank implementation for very large graphs

HITS implementation

Python Graph Implementations

Apache Mahout Recommender Systems

Large scale collaborative filtering

Scikit recommender systems in Python

Chapter 19: Social Network Analysis

SNAP: Stanford Network Analysis Project

Kernighan-Lin Implementation

Girvan-Newman Algorithm

METIS implementation

Spectral clustering in MATLAB

Weka Package for Semisupervised Learning and Collective Classification

NetKit-SRL for Collective Classification

Link Prediction Method (Lpmade)

Chapter 20: Privacy-Preserving Data Mining

Open source implementation of several anonymization algorithms