Machine Learning with IBM PowerAI: Getting Started with Image Classification (Part 1)

IBM Power Systems


Image classification has become one of the key pilot use-cases for demonstrating machine learning. In this short article, I attempt to write about how to implement such a solution using IBM PowerAI, and compare GPU and CPU performances while running this on IBM Power Systems.

Artificial Intelligence

Artificial Intelligence is currently seen as a branch of computer science that deals with making computers perform tasks like visual recognition, speech identification, cognitive decision-making, language translation etc, which are traditionally attributed to human intelligence.

Machine Learning

Machine Learning, commonly viewed as an application of Artificial Intelligence, deals with giving the systems an ability to learn and improve with experience, without explicitly coding all tasks.

Deep Learning

Deep Learning is a subset of Machine Learning where the systems can learn with labelled training data (supervised) or unlabeled training data (unsupervised). Deep Learning typically uses a hierarchical level of artificial neural networks to carry out a task.

Artificial Neural Networks

Artificial Neural Networks are systems inspired by biological neural networks and can perform certain tasks like image classification with amazing accuracy. For example, for image classification, a set of images of an animal are provided with labeling. This is the training data. The Artificial Neural Network, over a series of steps (or layers), helps the system learn the ability to classify unlabeled images (An image of an Orangutan in the example shown in this article) as belonging to a certain group while coming up with accuracy scores.

There are several applications of deep learning for your business, ranging from cellphone personal assistants to self-driving cars where rapidly changing patterns are used to classify objects in real-time.

What is IBM PowerAI?

IBM PowerAI software lets you easily run all the popular machine learning frameworks with minimal effort on your IBM POWER9 servers which contain a GPU. CPUs were designed and built for serial processing and contain a small number of cores, whereas GPUs can contain thousands of smaller cores and rely on parallel processing of tasks. Tasks meant for machine learning are key applications of GPUs. Check out the IBM Power System AC922 servers, touted as one of the best servers in the market for running enterprise AI tasks. IBM PowerAI currently includes the following frameworks;


Current setup

For this demo, I used a container on a VM running Ubuntu on Power (ppc64le), hosted on Nimbix Cloud.

A Container is a running instance of an image. An image is a template which contains the OS, Software and application code, all bundled in one file. Images are defined using a Dockerfile, which is a list of steps to configure the image. The Dockerfile is built to create an image, and the image is run to get a running container. To run the image, you need to have Docker Engine installed and configured on the VM.

Here is the Dockerfile I used, written by Indrajit Poddar. This is taken from this Github page.

This builds an image with Jupyter Notebook, iTorch Kernel (we’ll discuss this in the second part) and some base TensorFlow examples.

TensorFlow is an open source, scalable library for Machine Learning applications, and is based on the concept of a data flow graph which can be built and executed. A graph can contain two components, nodes and edges (or tensors). It comes with a Python API, and is easy to assemble a net, assign parameters and run your training models.

The steps below were demonstrated by Indrajit Poddar. He has built a test image on Nimbix Cloud which will run the aforementioned services when deployed, in a few minutes.

The following command is used to verify if the GPU is attached to the container.

root@JARVICENAE-0A0A1841:/usr/lib/nvidia-384# nvidia-smi
Thu Feb 1 23:45:11 2018
+ — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — -+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
| — — — — — — — — — — — — — — — -+ — — — — — — — — — — — + — — — — — — — — — — — +
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| 0 Tesla P100-SXM2… Off | 00000003:01:00.0 Off | 0 |
| N/A 40C P0 42W / 300W | 299MiB / 16276MiB | 0% Default |
+ — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — -+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
+ — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — -+

I see an Nvidia Tesla P100 GPU attached. The following command shows the installed Jupyter Notebook instances and the associated tokens that will be used for authentication later.

root@JARVICENAE-0A0A1841:/usr/lib/nvidia-384# jupyter notebook list
Currently running servers: :: /opt/DL/torch :: /opt/DL/caffe-ibm :: /opt/DL/tensorflow

Starting Image Classification

What is Caffe?

Caffe (Convolutional Architecture for Fast Feature Embedding) was developed at the Berkeley Vision and Learning Center. It is an open source framework for performing tasks like image classification. It supports CUDA, Convolutional Neural Networks, has pre-trained models, and is therefore a good choice for this demo.

We’ll use Python to perform all the tasks. The steps below were done via Jupyter Notebook. First, let’s set up Python, Numpy, and Matplotlib.

import numpy as np
import matplotlib.pyplot as plt
# display plots in this notebook
%matplotlib inline
# set display defaults
plt.rcParams[‘figure.figsize’] = (10, 10) # large images
plt.rcParams[‘image.interpolation’] = ‘nearest’ # don’t interpolate: show square pixels
plt.rcParams[‘image.cmap’] = ‘gray’ # use grayscale output rather than a (potentially misleading) color heatmap
# Then, we load Caffe. The caffe module needs to be on the Python path;
# we’ll add it here explicitly.
import sys
caffe_root = ‘../’ # this file should be run from {caffe_root}/examples (otherwise change this line)
sys.path.insert(0, caffe_root + ‘python’)
import caffe

What is Caffenet?

Caffenet is a convolutional neural network written to interface with CUDA, with the primary aim of classifying images. Caffenet is a variant of Alexnet. A presentation from 2015 by the creators of Alexnet is here. In the code below, we download a pre-trained model.

import os
if os.path.isfile(caffe_root + ‘models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel’):
print ‘CaffeNet found.’
print ‘Downloading pre-trained CaffeNet model…’
!../scripts/ ../models/bvlc_reference_caffenet

Here is the output.

CaffeNet found.
Downloading pre-trained CaffeNet model...
…100%, 232 MB, 42746 KB/s, 5 seconds passed

Then, we load Caffe in CPU mode and work with input preprocessing.

model_def = caffe_root + ‘models/bvlc_reference_caffenet/deploy.prototxt’
model_weights = caffe_root + ‘models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel’
net = caffe.Net(model_def, # defines the structure of the model
model_weights, # contains the trained weights
caffe.TEST) # use test mode (e.g., don’t perform dropout)

Caffenet’s ‘’ is used. This is the default transformer used in all examples. It creates a transformed mean value for an image based on the input provided. Caffenet is setup to get input images in the BGR format with values in the range 0 to 255. Transformation to load images with values in the range of 0 to 1 in RGB format, as input needed for Matplotlib, are performed.

# load the mean ImageNet image (as distributed with Caffe) for subtraction
mu = np.load(caffe_root + ‘python/caffe/imagenet/ilsvrc_2012_mean.npy’)
mu = mu.mean(1).mean(1) # average over pixels to obtain the mean (BGR) pixel values
print ‘mean-subtracted values:’, zip(‘BGR’, mu)
# create transformer for the input called ‘data’
transformer ={‘data’: net.blobs[‘data’].data.shape})
transformer.set_transpose(‘data’, (2,0,1)) # move image channels to outermost dimension
transformer.set_mean(‘data’, mu) # subtract the dataset-mean value in each channel
transformer.set_raw_scale(‘data’, 255) # rescale from [0, 1] to [0, 255]
transformer.set_channel_swap(‘data’, (2,1,0)) # swap channels from RGB to BGR

In other words, computers can now learn to classify an image by first converting the image to an array of RGB values. Then, these values are scanned to look for patterns of values that already match another image in a pre-trained model. While comparing, confidence metrics are generated which show how accurate the classification was.

Here is the output.

mean-subtracted values: [(‘B’, 104.0069879317889), (‘G’, 116.66876761696767), (‘R’, 122.6789143406786)]


Here, we set the default size of the images. This can be changed later depending on your input.

50, # batch size
3, # 3-channel (BGR) images
720, 720) # image size is 720x720

Next, we load the image of an Orangutan from the Wiki Commons library.

# download the image
my_image_url = “" # paste your URL here
!wget -O image.jpg $my_image_url
# transform it and copy it into the net
image =‘image.jpg’)
transformed_image = transformer.preprocess(‘data’, image)

Here is the output.

--2018-02-02 00:27:52--
Resolving (, 2620:0:863:ed1a::2:b
Connecting to (||:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1443340 (1.4M) [image/jpeg]
Saving to: 'image.jpg'
image.jpg           100%[===================>]   1.38M  5.25MB/s    in 0.3s
2018-02-02 00:27:54 (5.25 MB/s) - 'image.jpg' saved [1443340/1443340]

Now, let’s classify the image.

# copy the image data into the memory allocated for the net
net.blobs[‘data’].data[…] = transformed_image
# perform classification
output = net.forward()
​output_prob = output[‘prob’][0] # the output probability vector for the first image in the batch
​print ‘predicted class is:’, output_prob.argmax()

The output was ‘predicted class is: 281’.

The output above classifies the image into class 281. Let’s load the ImageNet labels and view the output.

# load ImageNet labels
labels_file = caffe_root + ‘data/ilsvrc12/synset_words.txt’
if not os.path.exists(labels_file):
labels = np.loadtxt(labels_file, str, delimiter=’\t’)
print ‘output label:’, labels[output_prob.argmax()]

Here’s the output. The class was correct!

output label: n02480495 orangutan, orang, orangutang, Pongo pygmaeus

The following code helps you come up with other top classes.

# sort top five predictions from softmax output
top_inds = output_prob.argsort()[::-1][:5] # reverse sort and take five largest items
print ‘probabilities and labels:’
zip(output_prob[top_inds], labels[top_inds])

Here is the output.

probabilities and labels:
[(0.96807814, 'n02480495 orangutan, orang, orangutang, Pongo pygmaeus'),
(0.030588957, 'n02492660 howler monkey, howler'),
(0.00085891742, 'n02493509 titi, titi monkey'),
(0.00015429058, 'n02493793 spider monkey, Ateles geoffroyi'),
(7.259626e-05, 'n02488291 langur')]

Analyzing GPU Performance

Here is the time taken to perform the classification on the CPU only mode.

%timeit net.forward()

Here is the output.

OUTPUT: 1 loop, best of 3: 3.06 s per loop

Three seconds per loop is quite long. Let’s switch to GPU mode and perform the same.

caffe.set_device(0) # if we have multiple GPUs, pick the first one
net.forward() # run once before timing to set up memory
%timeit net.forward()

Here is the output.

OUTPUT: 1 loop, best of 3: 11.4 ms per loop

That is an improvement of 3048.6 milliseconds! This concludes the first part of this blog. I apologize for grammatical errors, if any.

In the next part, we will take a look at how to train your own model using NVIDIA Digits and how to use Torch.

If you’ve enjoyed this piece, go ahead, give it a clap 👏🏻 (you can clap more than once)! You can also share it somewhere online so others can read it too.

Author: Upendra Rajan

Machine Learning with IBM PowerAI: Getting Started with Image Classification (Part 1) was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

Statistical Significance Explained

What does it mean to prove something with data?

As the dean at a major university, you receive a concerning report showing your students get an average of 6.80 hours of sleep per night compared to the national college average of 7.02 hours. The student body president is worried about the health of students and points to this study as proof that homework must be reduced. The university president on the other hand dismisses the study as nonsense: “Back in my day we got four hours of sleep a night and considered ourselves lucky.” You have to decide if this is a serious issue. Fortunately, you’re well-versed in statistics and finally see a chance to put your education to use!

How can we decide if this is meaningful?

Statistical significance is one of those terms we often hear without really understanding. When someone claims data proves their point, we nod and accept it, assuming statisticians have done complex operations that yielded a result which cannot be questioned. In fact, statistical significance is not a complicated phenomenon requiring years of study to master, but a straightforward idea that everyone can — and should — understand. Like with most technical concepts, statistical significance is built on a few simple ideas: hypothesis testing, the normal distribution, and p values. In this article, we will briefly touch on all of these concepts (with further resources provided) as we work up to solving the conundrum presented above.

The first idea we have to discuss is hypothesis testing, a technique for evaluating a theory using data. The “hypothesis” refers to the researcher’s initial belief about the situation before the study. This initial theory is known as the alternative hypothesis and the opposite is known as the null hypothesis. In our example these are:

  • Alternative Hypothesis: The average amount of sleep by students at our university is below the national average for college student.
  • Null Hypothesis: The average amount of sleep by students at our university is not below the national average for college students.

Notice how careful we have to be about the wording: we are looking for a very specific effect, which needs to be formalized in the hypotheses so after the fact we cannot claim to have been testing something else! (This is an example of a one-sided hypothesis test because we are concerned with a change in only one direction.) Hypothesis tests are one of the foundations of statistics and are used to assess the results of most studies. These studies can be anything from a medical trial to assess drug effectiveness to an observational study evaluating an exercise plan. What all studies have in common is that they are concerned with making comparisons, either between two groups or between one group and the entire population. In the medical example, we might compare the average time to recover between groups taking two different drugs, or, in our problem as dean, we want to compare sleep between our students and all the students in the country.

The testing part of hypothesis tests allows us to determine which theory, the null or alternative, is better supported by the evidence. However, before we can get to testing our data, we need to talk about two more crucial ideas.

The second building block of statistical significance is the normal distribution, also called the Gaussian or bell curve. The normal distribution is used to represent how data from a process is distributed and is defined by the mean, given the Greek letter μ (mu), and the standard deviation, given the letter σ (sigma). The mean shows the location of the center of the data and the standard deviation is the spread in the data.

Normal Distribution with mean μ and standard deviation σ

The application of the normal distribution comes from assessing data points in terms of the standard deviation. We can determine how anomalous a data point is based on how many standard deviations it is from the mean. The normal distribution has the following helpful properties:

  • 68% of data is within ± 1 standard deviations from the mean
  • 95% of data is within ± 2 standard deviations from the mean
  • 99.7% of data is within ± 3 standard deviations from the mean

If we have a normal distribution for a statistic, we can characterize any point in terms of standard deviations from the mean. For example, average female height in the US is 65 inches (5′ 5″) with a standard deviation of 4 inches. If we meet a new acquaintance who is 73 inches tall, we can say she is two standard deviations above the mean and is in the tallest 2.5% of females. (2.5% of females will be shorter than μ — 2σ (57 in) and 2.5% will be taller than μ+2σ).

In statistics, instead of saying our data is two standard deviations from the mean, we assess it in terms of a z-score, which just represents the number of standard deviations a point is from the mean. Conversion to a z-score is done by subtracting the mean of the distribution from the data point and dividing by the standard deviation. In the height example, you can check that our friend would have a z-score of 2. If we do this to all the data points the new distribution is called the standard normal with a mean of 0 and a standard deviation of 1 as shown below.

Transformation from normal (right) to standard normal distribution (left). (Source)

Every time we do a hypothesis test, we assume the data being measured come from a form of the normal distribution. This will always be an estimate because real-world data never perfectly follows a normal distribution. Generally, as the number of data points increases, the distribution gets closer and closer to normal. Nonetheless, we need to remember it is still an approximation! Assuming a normal distribution means we can determine how likely or unlikely the result we observe in a study is. The higher or lower the z-score, the more unlikely the result is to happen by chance and the more likely the result is meaningful. To quantify just how meaningful the results are, we use one more concept.

The final core idea is that of p-values. A p-value is the probability that an observed result in a study occurred at random. Say we are measuring differences in IQ between people in the US states of Florida and Washington. If we measure higher average IQs in Washington and our p-value is 0.346 this indicates there is a 34.6% chance these results occurred at random. The lower the p-value, the more meaningful the result because it is less likely to be caused by noise.

Whether or not the result can be called statistically significant depends on the p-value (known as alpha) we establish for significance before we begin the experiment . If the observed p-value is less than alpha, then the results are statistically significant. We need to choose alpha before the experiment because if we waited until after, we could just select a number that proves our results are significant no matter what the data shows!

Choosing a p-value after the study in one good way to “Lie with Statistics”

The choice of alpha depends on the situation and the field of study, but the most commonly used value is 0.05, corresponding to a 5% chance the results occurred at random. In my lab, I see values from 0.1 to 0.001 commonly in use. As an extreme example, the physicists who discovered the Higgs Boson particle used a p-value of 0.0000003, or a 1 in 3.5 million chance the discovery occurred because of noise. (Statisticians are loathe to admit that a p-value of 0.05 is arbitrary. R.A. Fischer, the father of modern statistics, choose a p-value of 0.05 for indeterminate reasons and it stuck)!

To get from a z-score on the normal distribution to a p-value, we can use a table or statistical software like R. The result will show us the probability of a z-score lower than the calculated value. For example, with a z-score of 2, the p-value is 0.977, which means there is only a 2.3% probability we observe a z-score higher than 2 at random.

The percentage of the distribution below a z-score of 2 is 97.7%

As a summary so far, we have covered three ideas:

  1. Hypothesis Testing: A technique used to test a theory
  2. Normal Distribution: An approximate representation of the data in a hypothesis test.
  3. p-value: The probability the observed result in the data occurred at random.

Now, let’s put the pieces together in our example. Here are the basics:

  • Students across the country average 7.02 hours of sleep per night according to the National Sleep Foundation
  • In a poll of 202 students at our university the average hours of sleep per night was 6.90 hours with a standard deviation of 0.82 hours.
  • Our alternative hypothesis is the average sleep of students at our university is below the national average for college students.
  • We will use an alpha value of 0.05 which means the results are significant f the p-value is below 0.05.

First, we need to convert our measurement into a z-score, or the number of standard deviations it is away from the mean. We do this by subtracting the population mean (the national average) from our measured value and dividing by the standard deviation over the square root of the number of samples. (As the number of samples increases, the standard deviation and hence the variation decreases. We account for this by dividing the standard deviation by the square root of the number of samples.)

Conversion to z-score

The z-score is called our test-statistic. Once we have a test-statistic, we can use a table or a programming language such as R to calculate the p-value. I use code here not to intimidate but to show how easy it is to implement our solution with free tools! (# are comments and bold is output)

# Calculate the results
z_score = (6.90 - 7.02) / (0.84 / sqrt(202))
p_value = pnorm(z_score)
# Print our results
sprintf('The p-value is %0:5f for a z-score of %0.5f.', p_value, z_score)
"The p-value is 0.02116 for a z-score of -2.03038."

Based on the p-value of 0.02116, we can reject the null hypothesis. (Statisticians like us to say reject the null rather than accept the alternative.) There is statistically significant evidence our students get less sleep on average than college students in the US at a significance level of 0.05. The p-value shows there is a 2.12% chance that our results occurred at random. In this battle of the presidents, the student was right.

Before we ban all homework, we need to be careful not to assign too much to this result. Notice that our p-value, 0.02116, would not be significant if we had used a threshold of 0.01. Someone who wants to prove the opposite point in our study can simply manipulate the p-value. Anytime we examine a study, we should think about the p-value in addition to the conclusion. Further, this was an observational study, which means there is only evidence for correlation and not causation. We showed there is a correlation between students at our school and less average sleep, but not that going to our school causes a decrease in sleep. There could be other factors at play that affect sleep and only a randomized controlled study is able to prove causation.

As with most technical concepts, statistical significance is not that complex and is just a combination of many small ideas. Most of the trouble comes with learning the vocabulary! Once you put the pieces together, you can start applying these statistical concepts. As you learn the basics of stats, you become better prepared to view studies and the news with a healthy skepticism. You can see what the data actually says rather than what someone tells you it means. The best tactic against dishonest politicians and corporations is a skeptical, well-educated public!

As always, I welcome constructive criticism and feedback. I can be reached on Twitter @koehrsen_will.

Statistical Significance Explained was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

How to Transform Boring and Dry Reports with Data Visualization

Studies show that one of the most fundamental ways to help people today cope with information overload is to visualize it. In layman’s terms, this means drawing it out as a graph, plotting it on a map or even using data to create an interactive diagram…

Combined outlier detection with dplyr and ruler

Overview of simple outlier detection methods with their combination using dplyr and ruler packages.


During the process of data analysis one of the most crucial steps is to identify and account for outliers, observations that have essentially different nature than most other observations. Their presence can lead to untrustworthy conclusions. The most complicated part of this task is to define a notion of “outlier”. After that, it is straightforward to identify them based on given data.

There are many techniques developed for outlier detection. Majority of them deal with numerical data. This post will describe the most basic ones with their application using dplyr and ruler packages.

After reading this post you will know:

  • Most basic outlier detection techniques.
  • A way to implement them using dplyr and ruler.
  • A way to combine their results in order to obtain a new outlier detection method.
  • A way to discover notion of “diamond quality” without prior knowledge of this topic (as a happy consequence of previous point).


We will perform an analysis with the goal to find not typical diamonds listed in diamonds dataset from ggplot2 package. Here one observation represents one diamond and is stored as a row in data frame.

The way we will do that is by combining different outlier detection techniques to identify rows which are “strong outliers”, i.e. which might by considered outliers based on several methods.

Packages required for this analysis:


Outlier detection methods

To do convenient outlier detection with ruler it is better to define notion of non-outlier in form of the rule “Observation is not an outlier if …”. This way actual outliers are considered as rule breakers, objects of interest of ruler package. Note that definition of non-outlier is essentially a definition of outlier because of total two possibilities.


Z-score, also called a standard score, of an observation is [broadly speaking] a distance from the population center measured in number of normalization units. The default choice for center is sample mean and for normalization unit is standard deviation.

Observation is not an outlier based on z-score if its absolute value of default z-score is lower then some threshold (popular choice is 3).

Here is the function for identifying non-outliers based on z-score:

isnt_out_z <- function(x, thres = 3, na.rm = TRUE) {
  abs(x - mean(x, na.rm = na.rm)) <= thres * sd(x, na.rm = na.rm)

It takes a numeric vector as input and returns logical vector of the same length indicating whether input value is a non-outlier.

Z-score with MAD

Median Absolute Deviation is a robust normalization unit based on median as a population center. In order to use MAD “as a consistent estimator for the estimation of the standard deviation” one takes its value multiplied by a factor. This way base R function mad is implemented.

Observation is not an outlier based on MAD if its absolute value of z-score with median as center and MAD as normalization unit is lower then some threshold (popular choice is 3).

isnt_out_mad <- function(x, thres = 3, na.rm = TRUE) {
  abs(x - median(x, na.rm = na.rm)) <= thres * mad(x, na.rm = na.rm)

Tukey’s fences

Tukey’s fences is a technique used in box plots. The non-outlier range is defined with \([Q_1 – k(Q_3 – Q_1),~ Q_3 + k(Q_3 – Q_1)]\), where \(Q_1\) and \(Q_3\) are the lower and upper quartiles respectively, \(k\) – some nonnegative constant (popular choice is 1.5).

Observation is not an outlier based on Tukey’s fences if its value lies in non-outlier range.

isnt_out_tukey <- function(x, k = 1.5, na.rm = TRUE) {
  quar <- quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)
  iqr <- diff(quar)
  (quar[1] - k * iqr <= x) & (x <= quar[2] + k * iqr)

Mahalanobis distance

All previous approaches were created for univariate numerical data. To detect outliers in multivariate case one can use Mahalanobis distance to reduce to univariate case and then apply known techniques.

Observation is not an outlier based on Mahalanobis distance if its distance is not an outlier.

maha_dist <- . %>% select_if(is.numeric) %>%
    mahalanobis(center = colMeans(.), cov = cov(.))

isnt_out_maha <- function(tbl, isnt_out_f, ...) {
  tbl %>% maha_dist() %>% isnt_out_f(...)

This function takes as input a data frame of interest (with possible non-numeric columns which are ignored) and function performing univariate outlier detection. It returns a logical vector of the same length as number of rows in input data frame.

To read more about practical usefulness of Mahalanobis distance in detecting outliers go to Steffen’s very helpful post.

Using dplyr and ruler

Definition of non-outlier row

Package ruler, based on dplyr grammar of data manipulation, offers tools for validating the following data units: data as a whole, group [of rows] as a whole, column as a whole, row as a whole, cell. Our primary interest is row as a whole. However, using this framework, we can construct several approaches for definition of the non-outlier row:

  1. Row is not an outlier based on some column if it doesn’t contain outlier (computed based on the target column) on the intersection with that column. In other words, first a univariate outlier detection is performed based solely on data from target column and then all rows containing non-outliers are named non-outlier rows.
  2. Row is not an outlier based on Mahalanobis distance if its distance (computed based on the selected numeric columns) is not an outlier.
  3. Row is not an outlier based on grouping if it is a part of a non-outlier group [of rows]. A group [of rows] is not an outlier if its summary value is not an outlier among summary values of other groups.

Note that all listed approached depend on the choice of the univariate outlier detection method. We will use all three previously listed univariate techniques.

isnt_out_funs <- funs(
  z = isnt_out_z,
  mad = isnt_out_mad,
  tukey = isnt_out_tukey


In ruler framework rules are defined in packs (to learn more go to ruler README and vignettes).

Column based non-outlier rows

For diamonds dataset rules for column based non-outlier rows can be defined based on 7 numeric columns and 3 presented univariate detection methods. There is a convenient way of computing all them at once using scoped variant of dplyr::transmute():

diamonds %>% transmute_if(is.numeric, isnt_out_funs)
## # A tibble: 53,940 x 21
##   carat_z depth_z table_z price_z   x_z   y_z   z_z carat_mad depth_mad
##     <lgl>   <lgl>   <lgl>   <lgl> <lgl> <lgl> <lgl>     <lgl>     <lgl>
## 1    TRUE    TRUE    TRUE    TRUE  TRUE  TRUE  TRUE      TRUE      TRUE
## 2    TRUE    TRUE    TRUE    TRUE  TRUE  TRUE  TRUE      TRUE      TRUE
## 4    TRUE    TRUE    TRUE    TRUE  TRUE  TRUE  TRUE      TRUE      TRUE
## 5    TRUE    TRUE    TRUE    TRUE  TRUE  TRUE  TRUE      TRUE      TRUE
## # ... with 5.394e+04 more rows, and 12 more variables: table_mad <lgl>,
## #   price_mad <lgl>, x_mad <lgl>, y_mad <lgl>, z_mad <lgl>,
## #   carat_tukey <lgl>, depth_tukey <lgl>, table_tukey <lgl>,
## #   price_tukey <lgl>, x_tukey <lgl>, y_tukey <lgl>, z_tukey <lgl>

The result has outputs for 21 methods. Their names are of the form <column name>_<method name>. So the name ‘carat_z’ is interpreted as result of univariate method with name ‘z’ for column with name ‘carat’.

Mahalanobis based non-outlier rows

To define non-outlier rows based on Mahalanobis distance one should apply univariate method for distances computed for some subset of numeric columns. To simplify a little bit, we will one “subset” with all numeric columns and all listed methods:

diamonds %>%
  transmute(maha = maha_dist(.)) %>%
  transmute_at(vars(maha = maha), isnt_out_funs)
## # A tibble: 53,940 x 3
##   maha_z maha_mad maha_tukey
##    <lgl>    <lgl>      <lgl>
## 1   TRUE     TRUE       TRUE
## 2   TRUE    FALSE      FALSE
## 3   TRUE    FALSE      FALSE
## 4   TRUE     TRUE       TRUE
## 5   TRUE     TRUE       TRUE
## # ... with 5.394e+04 more rows

The result has outputs for 3 methods. Their names are considered as method names. Note that with this approach outlier rows are not only the ones far from multivariate center, but also the ones that are unnaturally close to it.

Group based non-outlier rows

Definition of non-outlier rows based on grouping depends on group summary function and univariate outlier detection method. As grouping column we will choose all non-numeric columns (cut, color and clarity) united into one called group (for later easier imputation of non-outlier rows). As reasonable summary functions we will choose mean value of some numeric column (total of 7 functions):

data_tbl <- diamonds %>%
  unite(col = "group", cut, color, clarity)

compute_group_non_outliers <- . %>%
  # Compute per group mean values of columns
  group_by(group) %>%
  summarise_if(is.numeric, mean) %>%
  ungroup() %>%
  # Detect outliers among groups
  mutate_if(is.numeric, isnt_out_funs) %>%
  # Remove unnecessary columns

data_tbl %>% compute_group_non_outliers()
## # A tibble: 276 x 22
##        group carat_z depth_z table_z price_z   x_z   y_z   z_z carat_mad
##        <chr>   <lgl>   <lgl>   <lgl>   <lgl> <lgl> <lgl> <lgl>     <lgl>
## 1  Fair_D_I1   FALSE    TRUE    TRUE    TRUE  TRUE  TRUE  TRUE     FALSE
## 2  Fair_D_IF    TRUE    TRUE    TRUE    TRUE  TRUE  TRUE  TRUE      TRUE
## 3 Fair_D_SI1    TRUE    TRUE    TRUE    TRUE  TRUE  TRUE  TRUE      TRUE
## 4 Fair_D_SI2    TRUE    TRUE    TRUE    TRUE  TRUE  TRUE  TRUE      TRUE
## 5 Fair_D_VS1    TRUE    TRUE    TRUE    TRUE  TRUE  TRUE  TRUE      TRUE
## # ... with 271 more rows, and 13 more variables: depth_mad <lgl>,
## #   table_mad <lgl>, price_mad <lgl>, x_mad <lgl>, y_mad <lgl>,
## #   z_mad <lgl>, carat_tukey <lgl>, depth_tukey <lgl>, table_tukey <lgl>,
## #   price_tukey <lgl>, x_tukey <lgl>, y_tukey <lgl>, z_tukey <lgl>

The result has outputs for 21 methods applied to the 276 groups. Their names are of the form <column name for summary function>_<method name>. So the name ‘carat_z’ is interpreted as result of method ‘z’ for summary function equal to mean value of ‘carat’ column. Column group defines names of the groupings.


Column and Mahalanobis based definition of non-outlier rows can be expressed with row packs and group based – as group packs.

row_packs_isnt_out <- row_packs(
  # Non-outliers based on some column
  column = . %>% transmute_if(is.numeric, isnt_out_funs),
  # Non-outliers based on Mahalanobis distance
  maha = . %>% transmute(maha = maha_dist(.)) %>%
    transmute_at(vars(maha = maha), isnt_out_funs)

group_packs_isnt_out <- group_packs(
  # Non-outliers based on grouping
  group = compute_group_non_outliers,
  .group_vars = "group"

Application of all those packs is called exposing process. The result is an exposure from which we can extract tidy data validation report using get_report.

# Don't remove obeyers to compute total number of applied rules
full_report <- data_tbl %>%
  expose(row_packs_isnt_out, group_packs_isnt_out,
         .remove_obeyers = FALSE) %>%

used_rules <- full_report %>%
  distinct(pack, rule)

breaker_report <- full_report %>%
  filter(!(value %in% TRUE))

used_rules contains data about all definitions of non-outlier rows applied to data. They are encoded with combination of columns pack and rule.

breaker_report contains data about data units that break certain rules. Packs column and maha has actual row numbers of data_tbl listed in id column of report (for rows which should be considered as outliers).

On the other hand, pack group defines group pack and is represented in breaker_report with id 0. To obtain row outliers based on grouping we need to expand those rows with information about rows in the data that belong to those groups. This can be done using dplyr::left_join():

group_breakers <- breaker_report %>%
  # Filter group packs
  filter(pack == "group") %>%
  # Expand rows by matching group with its rows
  select(-id) %>%
    y = data_tbl %>% transmute(var = group, id = 1:n()),
    by = "var"
  ) %>%
  select(pack, rule, var, id, value)

outliers <- bind_rows(
  breaker_report %>% filter(pack != "group"),
) %>%
  select(pack, rule, id)

# Not all group based definitions resulted with outliers
outliers %>%
  count(pack, rule) %>%
  filter(pack == "group") %>%
  print(n = Inf)
## # A tibble: 13 x 3
##     pack        rule     n
##    <chr>       <chr> <int>
##  1 group   carat_mad    37
##  2 group carat_tukey    37
##  3 group     carat_z    29
##  4 group   depth_mad  1093
##  5 group depth_tukey  1016
##  6 group     depth_z   156
##  7 group   price_mad   209
##  8 group price_tukey  1146
##  9 group     price_z    44
## 10 group   table_mad   920
## 11 group table_tukey     8
## 12 group     table_z     7
## 13 group         z_z    23

Tibble outliers contains data about outlier rows. Combination of columns pack and rule defines non-outlier/outlier definition approach and column id defines row number of input data frame that should be considered an outlier based on the definition.

Definitions with most outliers are as follows:

outliers %>%
  count(pack, rule, sort = TRUE)
## # A tibble: 37 x 3
##     pack        rule     n
##    <chr>       <chr> <int>
## 1   maha    maha_mad  6329
## 2   maha  maha_tukey  5511
## 3 column   price_mad  5386
## 4 column price_tukey  3540
## 5 column   table_mad  2560
## # ... with 32 more rows

Two out of three Mahalanobis based definition yielded the most row outliers.


Given outliers data frame, one can do whatever he/she wants to identify outliers. Here we will use the basic combination approach based on average score.

Combined outlier detection score for certain row can be defined as share of applied methods that tagged it as outlier. Alternatively one can define it just as number of those methods as it will only change absolute value of the result and not the order.

outlier_score <- outliers %>%
  group_by(id) %>%
  # nrow(used_rules) equals total number of applied methods
  summarise(score = n() / nrow(used_rules))

# Top 10 outliers
outlier_score %>% arrange(desc(score)) %>% slice(1:10)
## # A tibble: 10 x 2
##       id     score
##    <int>     <dbl>
##  1 26432 0.5777778
##  2 27416 0.5777778
##  3 27631 0.5777778
##  4 27131 0.4666667
##  5 23645 0.4222222
##  6 26445 0.4222222
##  7 26745 0.4000000
##  8 27430 0.4000000
##  9 15952 0.3777778
## 10 17197 0.3777778

Finally we will tag those rows as strong outliers which has score more than 0.2 (subjective threshold which should be researched more).

diam_tbl <- diamonds %>%
  mutate(id = 1:n()) %>%
  left_join(y = outlier_score, by = "id") %>%
    score = coalesce(score, 0),
    is_out = if_else(score > 0.2, "Outlier", "Not outlier")

# Total number of outliers
sum(diam_tbl$score > 0.2)
## [1] 161

Tibble diam_tbl is basically the diamonds but with three more columns: id for row number, score for combined outlier score and is_out for non-outlier/outlier tag.

Plots illustrating strong outliers:


plot_outliers <- function(tbl, x, y, facet_var) {
  tbl %>%
    arrange(is_out) %>%
    ggplot(aes_string(x, y, colour = "is_out")) +
      geom_point() +
      facet_wrap(facets = facet_var) +
      scale_colour_manual(values = c("#AAAAAA", "#004080")) +
      guides(colour = guide_legend(title = NULL,
                                   override.aes = list(size = 4))) +
      labs(title = paste0("Strong outliers illustration by ", facet_var)) +
      theme(legend.position = "bottom",
            legend.text = element_text(size = 14))

diam_tbl %>% plot_outliers("carat", "price", facet_var = "cut")

diam_tbl %>% plot_outliers("x", "depth", facet_var = "color")

diam_tbl %>% plot_outliers("price", "table", facet_var = "clarity")

Based on those plots we see the complicated nature of “strong outliers”. They are not necessary located “on the edge” of two-dimensional scatter plots, but most extreme cases are tagged as outliers.

Also one interesting observation: most outliers are concentrated in the combination of “Fair” cut, “J” colour and “I1” clarity which are worst options among their features. The reason of this effect is group-based definitions of non-outliers which tagged certain groups more than others:

breaker_report %>%
  filter(pack == "group") %>%
  count(var, sort = TRUE) %>%
  print(n = 10)
## # A tibble: 47 x 2
##            var     n
##          <chr> <int>
##  1   Fair_D_I1     7
##  2   Fair_J_I1     7
##  3 Fair_H_VVS1     6
##  4  Ideal_J_I1     6
##  5 Fair_J_VVS1     5
##  6 Fair_G_VVS1     4
##  7 Fair_D_VVS1     3
##  8   Fair_E_I1     3
##  9   Fair_F_I1     3
## 10   Fair_H_I1     3
## # ... with 37 more rows

Here we see that “Fair” cut is among majority of top breaker groups. There are also some interesting combinations: Fair_D_I1 (“worst”-“best”-“worst”), Fair_J_I1 (“worst”-“worst”-“worst”), Ideal_J_I1 (“best”-“worst”-“worst”).

This fact might be interpreted as suggested combined outlier detection approach discovered notion of diamond quality without prior knowledge about it.


  • Using only basic outlier detection methods one can achieve insightful results by combining them. Observations which are tagged as outlier by more than some threshold number of methods might be named as “strong outliers”. Those should be considered as outliers based on the whole data rather then on separate features.
  • With ruler combining results of several outlier detection methods is straightforward due to the format of tidy data validation report.
  • Suggested “strong outlier” observations in diamonds dataset are not only those with extreme numerical values but also ones based on quality of diamonds. This is achieved without prior knowledge of “diamond quality” notion.


## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
## Matrix products: default
## BLAS: /usr/lib/openblas-base/
## LAPACK: /usr/lib/
## locale:
##  [1] LC_CTYPE=ru_UA.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=ru_UA.UTF-8        LC_COLLATE=ru_UA.UTF-8    
##  [5] LC_MONETARY=ru_UA.UTF-8    LC_MESSAGES=ru_UA.UTF-8   
##  [7] LC_PAPER=ru_UA.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base     
## other attached packages:
## [1] bindrcpp_0.2  ruler_0.1.0   ggplot2_2.2.1 tidyr_0.7.2   dplyr_0.7.4  
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.14     knitr_1.17       bindr_0.1        magrittr_1.5    
##  [5] tidyselect_0.2.3 munsell_0.4.3    colorspace_1.3-2 R6_2.2.2        
##  [9] rlang_0.1.4      plyr_1.8.4       stringr_1.2.0    tools_3.4.3     
## [13] grid_3.4.3       gtable_0.2.0     htmltools_0.3.6  lazyeval_0.2.1  
## [17] yaml_2.1.16      rprojroot_1.2    digest_0.6.13    assertthat_0.2.0
## [21] tibble_1.3.4     bookdown_0.5     purrr_0.2.4      glue_1.2.0      
## [25] evaluate_0.10.1  rmarkdown_1.8    blogdown_0.4     labeling_0.3    
## [29] stringi_1.1.6    keyholder_0.1.1  compiler_3.4.3   scales_0.5.0    
## [33] backports_1.1.2  pkgconfig_2.0.1

How to apply Linear Regression in R

Machine Learning and Regression Machine Learning (ML) is a field of study that provides the capability to a Machine to understand data and to learn from the data. ML is not only about analytics modeling but it is end-to-end modeling that broadly involves following steps: – Defining problem statement – Data collection. – Exploring, Cleaning […]

    Related Post

    1. Linear Regression in Python; Predict The Bay Area’s Home Prices
    2. Building A Logistic Regression in Python, Step by Step
    3. Multicollinearity in R
    4. Scikit-Learn for Text Analysis of Amazon Fine Food Reviews
    5. Survival Analysis – Part I