This tutorial is provided by Guru99. Originally posted here.

Apache HADOOP is a framework used to develop data processing applications which are executed in a distributed computing environment.

In this tutorial we will learn,

Components of Hadoo…

Skip to content
# Data Science Wale

## Curated data science & artificial intelligence articles from all around the web

# Month: December 2017

# Hadoop Tutorial: Features, Components, Cluster & Topology

# How to Use googlesheets to Connect R to Google Sheets

# Getting started with dplyr in R using Titanic Dataset

**
**
# Deep Compression and Pruning for Machine Learning in AI Self-Driving Cars: Using Convolutional Neural Networks (CNN)

# CognitiveScale CEO: What to expect in AI in 2018

# Building your own blockchain in R

# Combined outlier detection with dplyr and ruler

# Prologue

# Overview

# Outlier detection methods

## Z-score

## Z-score with MAD

## Tukey’s fences

## Mahalanobis distance

# Using dplyr and ruler

## Definition of non-outlier row

## Implementation

### Column based non-outlier rows

### Mahalanobis based non-outlier rows

### Group based non-outlier rows

## Exposure

## Combination

# Conclusions

# A Simple Intro to Q-Learning in R: Floor Plan Navigation

# Reinforcement Learning: Q-Learning with the Hopping Robot

# Tiny Art in Less Than 280 Characters

This tutorial is provided by Guru99. Originally posted here.

Apache HADOOP is a framework used to develop data processing applications which are executed in a distributed computing environment.

In this tutorial we will learn,

Components of Hadoo…

Often I use R to handle large datasets, analyze the data and filter out the data I donâ€™t need.

When all this is done, I usually use write.csv() to print my data off and reopen it in Google Sheets.

My workflow would look something like this:

full_data &…

dplyr is one of the most popular r-packages and also part of tidyverse that’s been developed by Hadley Wickham. The mere fact that dplyr package is very famous means, it’s one of the most frequently used. Being a data scientist is not always about creating sophisticated models but Data Analysis (Manipulation) and Data Visualization play […]

Related Post

By Dr. Lance Eliot, the AI Trends Insider When my son was very young, he and I played a board game that he seemed specially to enjoy. It required concentration and dexterity, and a bit of imagination along with a dose of excitement. If you made a mistake during the game, there would be a […]

By Akshay Sabhikhi, CEO & Co-founder, CognitiveScale AI Will Not Be Commoditized Only one in 20 companies has extensively incorporated AI in offerings or processes. Less than 39% of all companies have an AI strategy in place. According to MIT Sloan Review, the largest companies — those with at least 100,000 employees — are the […]

Everybody is freaking out about the rise of the Bitcoin and the potential of the Blockchain technologies. The advent of cryptocurrencies, game changing use cases, disruption of established business models by disintermediation, etc..
By…

Overview of simple outlier detection methods with their combination using dplyr and ruler packages.

During the process of data analysis one of the most crucial steps is to identify and account for outliers, observations that have essentially different nature than most other observations. Their presence can lead to untrustworthy conclusions. The most complicated part of this task is **to define a notion of “outlier”**. After that, it is straightforward to identify them based on given data.

There are many techniques developed for outlier detection. Majority of them deal with numerical data. This post will describe the most basic ones with their application using dplyr and ruler packages.

After reading this post you will know:

- Most basic outlier detection techniques.
- A way to implement them using
`dplyr`

and`ruler`

. - A way to combine their results in order to obtain a new outlier detection method.
- A way to discover notion of “diamond quality” without prior knowledge of this topic (as a happy consequence of previous point).

We will perform an analysis with the goal to find not typical diamonds listed in diamonds dataset from ggplot2 package. Here one observation represents one diamond and is stored as a row in data frame.

The way we will do that is **by combining different outlier detection techniques to identify rows which are “strong outliers”**, i.e. which might by considered outliers based on several methods.

Packages required for this analysis:

```
library(dplyr)
library(tidyr)
library(ggplot2)
library(ruler)
```

To do convenient outlier detection with `ruler`

it is better to define notion of **non-outlier** in form of the rule “Observation **is not** an outlier if …”. This way actual outliers are considered as **rule breakers**, objects of interest of `ruler`

package. **Note** that definition of non-outlier is essentially a definition of outlier because of total two possibilities.

Z-score, also called a standard score, of an observation is [broadly speaking] a distance from the population center measured in number of normalization units. The default choice for center is sample mean and for normalization unit is standard deviation.

⬛ *Observation is not an outlier based on z-score* if its absolute value of default z-score is lower then some threshold (popular choice is 3).

Here is the function for identifying non-outliers based on z-score:

```
isnt_out_z <- function(x, thres = 3, na.rm = TRUE) {
abs(x - mean(x, na.rm = na.rm)) <= thres * sd(x, na.rm = na.rm)
}
```

It takes a numeric vector as input and returns logical vector of the same length indicating whether input value is a non-outlier.

Median Absolute Deviation is a robust normalization unit based on median as a population center. In order to use MAD “as a consistent estimator for the estimation of the standard deviation” one takes its value multiplied by a factor. This way base R function mad is implemented.

⬛ *Observation is not an outlier based on MAD* if its absolute value of z-score with median as center and MAD as normalization unit is lower then some threshold (popular choice is 3).

```
isnt_out_mad <- function(x, thres = 3, na.rm = TRUE) {
abs(x - median(x, na.rm = na.rm)) <= thres * mad(x, na.rm = na.rm)
}
```

Tukey’s fences is a technique used in box plots. The non-outlier range is defined with \([Q_1 – k(Q_3 – Q_1),~ Q_3 + k(Q_3 – Q_1)]\), where \(Q_1\) and \(Q_3\) are the lower and upper quartiles respectively, \(k\) – some nonnegative constant (popular choice is 1.5).

⬛ *Observation is not an outlier based on Tukey’s fences* if its value lies in non-outlier range.

```
isnt_out_tukey <- function(x, k = 1.5, na.rm = TRUE) {
quar <- quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)
iqr <- diff(quar)
(quar[1] - k * iqr <= x) & (x <= quar[2] + k * iqr)
}
```

All previous approaches were created for univariate numerical data. To detect outliers in multivariate case one can use Mahalanobis distance to reduce to univariate case and then apply known techniques.

⬛ *Observation is not an outlier based on Mahalanobis distance* if its distance is not an outlier.

```
maha_dist <- . %>% select_if(is.numeric) %>%
mahalanobis(center = colMeans(.), cov = cov(.))
isnt_out_maha <- function(tbl, isnt_out_f, ...) {
tbl %>% maha_dist() %>% isnt_out_f(...)
}
```

This function takes as input a data frame of interest (with possible non-numeric columns which are ignored) and function performing univariate outlier detection. It returns a logical vector of the same length as number of rows in input data frame.

To read more about practical usefulness of Mahalanobis distance in detecting outliers go to Steffen’s very helpful post.

Package `ruler`

, based on `dplyr`

grammar of data manipulation, offers tools for validating the following data units: data as a whole, group [of rows] as a whole, column as a whole, row as a whole, cell. Our primary interest is row as a whole. However, using this framework, we can construct several approaches for definition of the non-outlier row:

*Row is not an outlier based on some column*if it doesn’t contain outlier (computed based on the target column) on the intersection with that column. In other words, first a univariate outlier detection is performed based solely on data from target column and then all rows containing non-outliers are named non-outlier rows.*Row is not an outlier based on Mahalanobis distance*if its distance (computed based on the selected numeric columns) is not an outlier.*Row is not an outlier based on grouping*if it is a part of a non-outlier group [of rows]. A group [of rows] is not an outlier if its summary value is not an outlier among summary values of other groups.

**Note** that all listed approached depend on the choice of the univariate outlier detection method. We will use all three previously listed univariate techniques.

```
isnt_out_funs <- funs(
z = isnt_out_z,
mad = isnt_out_mad,
tukey = isnt_out_tukey
)
```

In `ruler`

framework rules are defined in packs (to learn more go to `ruler`

README and vignettes).

For `diamonds`

dataset rules for column based non-outlier rows can be defined based on 7 numeric columns and 3 presented univariate detection methods. There is a convenient way of computing all them at once using scoped variant of `dplyr::transmute()`

:

```
diamonds %>% transmute_if(is.numeric, isnt_out_funs)
## # A tibble: 53,940 x 21
## carat_z depth_z table_z price_z x_z y_z z_z carat_mad depth_mad
## <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 2 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 3 TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
## 4 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 5 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## # ... with 5.394e+04 more rows, and 12 more variables: table_mad <lgl>,
## # price_mad <lgl>, x_mad <lgl>, y_mad <lgl>, z_mad <lgl>,
## # carat_tukey <lgl>, depth_tukey <lgl>, table_tukey <lgl>,
## # price_tukey <lgl>, x_tukey <lgl>, y_tukey <lgl>, z_tukey <lgl>
```

The result has outputs for 21 methods. Their names are of the form <column name>_<method name>. So the name ‘carat_z’ is interpreted as result of univariate method with name ‘z’ for column with name ‘carat’.

To define non-outlier rows based on Mahalanobis distance one should apply univariate method for distances computed for some subset of numeric columns. To simplify a little bit, we will one “subset” with all numeric columns and all listed methods:

```
diamonds %>%
transmute(maha = maha_dist(.)) %>%
transmute_at(vars(maha = maha), isnt_out_funs)
## # A tibble: 53,940 x 3
## maha_z maha_mad maha_tukey
## <lgl> <lgl> <lgl>
## 1 TRUE TRUE TRUE
## 2 TRUE FALSE FALSE
## 3 TRUE FALSE FALSE
## 4 TRUE TRUE TRUE
## 5 TRUE TRUE TRUE
## # ... with 5.394e+04 more rows
```

The result has outputs for 3 methods. Their names are considered as method names. **Note** that with this approach outlier rows are not only the ones far from multivariate center, but also the ones that are unnaturally close to it.

Definition of non-outlier rows based on grouping depends on group summary function and univariate outlier detection method. As grouping column we will choose all non-numeric columns (`cut`

, `color`

and `clarity`

) united into one called `group`

(for later easier imputation of non-outlier rows). As reasonable summary functions we will choose mean value of some numeric column (total of 7 functions):

```
data_tbl <- diamonds %>%
unite(col = "group", cut, color, clarity)
compute_group_non_outliers <- . %>%
# Compute per group mean values of columns
group_by(group) %>%
summarise_if(is.numeric, mean) %>%
ungroup() %>%
# Detect outliers among groups
mutate_if(is.numeric, isnt_out_funs) %>%
# Remove unnecessary columns
select_if(Negate(is.numeric))
data_tbl %>% compute_group_non_outliers()
## # A tibble: 276 x 22
## group carat_z depth_z table_z price_z x_z y_z z_z carat_mad
## <chr> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 Fair_D_I1 FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## 2 Fair_D_IF TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 3 Fair_D_SI1 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 4 Fair_D_SI2 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 5 Fair_D_VS1 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## # ... with 271 more rows, and 13 more variables: depth_mad <lgl>,
## # table_mad <lgl>, price_mad <lgl>, x_mad <lgl>, y_mad <lgl>,
## # z_mad <lgl>, carat_tukey <lgl>, depth_tukey <lgl>, table_tukey <lgl>,
## # price_tukey <lgl>, x_tukey <lgl>, y_tukey <lgl>, z_tukey <lgl>
```

The result has outputs for 21 methods applied to the 276 groups. Their names are of the form <column name for summary function>_<method name>. So the name ‘carat_z’ is interpreted as result of method ‘z’ for summary function equal to mean value of ‘carat’ column. Column `group`

defines names of the groupings.

Column and Mahalanobis based definition of non-outlier rows can be expressed with row packs and group based – as group packs.

```
row_packs_isnt_out <- row_packs(
# Non-outliers based on some column
column = . %>% transmute_if(is.numeric, isnt_out_funs),
# Non-outliers based on Mahalanobis distance
maha = . %>% transmute(maha = maha_dist(.)) %>%
transmute_at(vars(maha = maha), isnt_out_funs)
)
group_packs_isnt_out <- group_packs(
# Non-outliers based on grouping
group = compute_group_non_outliers,
.group_vars = "group"
)
```

Application of all those packs is called exposing process. The result is an exposure from which we can extract tidy data validation report using `get_report`

.

```
# Don't remove obeyers to compute total number of applied rules
full_report <- data_tbl %>%
expose(row_packs_isnt_out, group_packs_isnt_out,
.remove_obeyers = FALSE) %>%
get_report()
used_rules <- full_report %>%
distinct(pack, rule)
breaker_report <- full_report %>%
filter(!(value %in% TRUE))
```

`used_rules`

contains data about all definitions of non-outlier rows applied to data. They are encoded with combination of columns `pack`

and `rule`

.

`breaker_report`

contains data about data units that break certain rules. Packs `column`

and `maha`

has actual row numbers of `data_tbl`

listed in `id`

column of report (for rows which should be considered as outliers).

On the other hand, pack `group`

defines group pack and is represented in `breaker_report`

with `id`

0. To obtain row outliers based on grouping we need to expand those rows with information about rows in the data that belong to those groups. This can be done using `dplyr::left_join()`

:

```
group_breakers <- breaker_report %>%
# Filter group packs
filter(pack == "group") %>%
# Expand rows by matching group with its rows
select(-id) %>%
left_join(
y = data_tbl %>% transmute(var = group, id = 1:n()),
by = "var"
) %>%
select(pack, rule, var, id, value)
outliers <- bind_rows(
breaker_report %>% filter(pack != "group"),
group_breakers
) %>%
select(pack, rule, id)
# Not all group based definitions resulted with outliers
outliers %>%
count(pack, rule) %>%
filter(pack == "group") %>%
print(n = Inf)
## # A tibble: 13 x 3
## pack rule n
## <chr> <chr> <int>
## 1 group carat_mad 37
## 2 group carat_tukey 37
## 3 group carat_z 29
## 4 group depth_mad 1093
## 5 group depth_tukey 1016
## 6 group depth_z 156
## 7 group price_mad 209
## 8 group price_tukey 1146
## 9 group price_z 44
## 10 group table_mad 920
## 11 group table_tukey 8
## 12 group table_z 7
## 13 group z_z 23
```

Tibble `outliers`

contains data about outlier rows. Combination of columns `pack`

and `rule`

defines non-outlier/outlier definition approach and column `id`

defines row number of input data frame that should be considered an outlier based on the definition.

Definitions with most outliers are as follows:

```
outliers %>%
count(pack, rule, sort = TRUE)
## # A tibble: 37 x 3
## pack rule n
## <chr> <chr> <int>
## 1 maha maha_mad 6329
## 2 maha maha_tukey 5511
## 3 column price_mad 5386
## 4 column price_tukey 3540
## 5 column table_mad 2560
## # ... with 32 more rows
```

Two out of three Mahalanobis based definition yielded the most row outliers.

Given `outliers`

data frame, one can do whatever he/she wants to identify outliers. Here we will use the basic combination approach based on average score.

*Combined outlier detection score* for certain row can be defined as **share of applied methods that tagged it as outlier**. Alternatively one can define it just as number of those methods as it will only change absolute value of the result and not the order.

```
outlier_score <- outliers %>%
group_by(id) %>%
# nrow(used_rules) equals total number of applied methods
summarise(score = n() / nrow(used_rules))
# Top 10 outliers
outlier_score %>% arrange(desc(score)) %>% slice(1:10)
## # A tibble: 10 x 2
## id score
## <int> <dbl>
## 1 26432 0.5777778
## 2 27416 0.5777778
## 3 27631 0.5777778
## 4 27131 0.4666667
## 5 23645 0.4222222
## 6 26445 0.4222222
## 7 26745 0.4000000
## 8 27430 0.4000000
## 9 15952 0.3777778
## 10 17197 0.3777778
```

Finally we will tag those rows as **strong outliers** which has score more than 0.2 (subjective threshold which should be researched more).

```
diam_tbl <- diamonds %>%
mutate(id = 1:n()) %>%
left_join(y = outlier_score, by = "id") %>%
mutate(
score = coalesce(score, 0),
is_out = if_else(score > 0.2, "Outlier", "Not outlier")
)
# Total number of outliers
sum(diam_tbl$score > 0.2)
## [1] 161
```

Tibble `diam_tbl`

is basically the `diamonds`

but with three more columns: `id`

for row number, `score`

for combined outlier score and `is_out`

for non-outlier/outlier tag.

Plots illustrating strong outliers:

```
theme_set(theme_bw())
plot_outliers <- function(tbl, x, y, facet_var) {
tbl %>%
arrange(is_out) %>%
ggplot(aes_string(x, y, colour = "is_out")) +
geom_point() +
facet_wrap(facets = facet_var) +
scale_colour_manual(values = c("#AAAAAA", "#004080")) +
guides(colour = guide_legend(title = NULL,
override.aes = list(size = 4))) +
labs(title = paste0("Strong outliers illustration by ", facet_var)) +
theme(legend.position = "bottom",
legend.text = element_text(size = 14))
}
diam_tbl %>% plot_outliers("carat", "price", facet_var = "cut")
```

```
diam_tbl %>% plot_outliers("x", "depth", facet_var = "color")
```

```
diam_tbl %>% plot_outliers("price", "table", facet_var = "clarity")
```

Based on those plots we see the complicated nature of “strong outliers”. They are not necessary located “on the edge” of two-dimensional scatter plots, but most extreme cases are tagged as outliers.

Also one interesting observation: most outliers are concentrated in the combination of “Fair” cut, “J” colour and “I1” clarity which are worst options among their features. The reason of this effect is group-based definitions of non-outliers which tagged certain groups more than others:

```
breaker_report %>%
filter(pack == "group") %>%
count(var, sort = TRUE) %>%
print(n = 10)
## # A tibble: 47 x 2
## var n
## <chr> <int>
## 1 Fair_D_I1 7
## 2 Fair_J_I1 7
## 3 Fair_H_VVS1 6
## 4 Ideal_J_I1 6
## 5 Fair_J_VVS1 5
## 6 Fair_G_VVS1 4
## 7 Fair_D_VVS1 3
## 8 Fair_E_I1 3
## 9 Fair_F_I1 3
## 10 Fair_H_I1 3
## # ... with 37 more rows
```

Here we see that “Fair” cut is among majority of top breaker groups. There are also some interesting combinations: `Fair_D_I1`

(“worst”-“best”-“worst”), `Fair_J_I1`

(“worst”-“worst”-“worst”), `Ideal_J_I1`

(“best”-“worst”-“worst”).

This fact might be interpreted as **suggested combined outlier detection approach discovered notion of diamond quality without prior knowledge about it**.

- Using only basic outlier detection methods one can achieve insightful results by combining them. Observations which are tagged as outlier by more than some threshold number of methods might be named as “strong outliers”. Those should be considered as outliers based on the whole data rather then on separate features.
- With
`ruler`

combining results of several outlier detection methods is straightforward due to the format of tidy data validation report. - Suggested “strong outlier” observations in
`diamonds`

dataset are not only those with extreme numerical values but also ones based on quality of diamonds. This is achieved without prior knowledge of “diamond quality” notion.

```
sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.18.so
##
## locale:
## [1] LC_CTYPE=ru_UA.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=ru_UA.UTF-8 LC_COLLATE=ru_UA.UTF-8
## [5] LC_MONETARY=ru_UA.UTF-8 LC_MESSAGES=ru_UA.UTF-8
## [7] LC_PAPER=ru_UA.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=ru_UA.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] methods stats graphics grDevices utils datasets base
##
## other attached packages:
## [1] bindrcpp_0.2 ruler_0.1.0 ggplot2_2.2.1 tidyr_0.7.2 dplyr_0.7.4
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.14 knitr_1.17 bindr_0.1 magrittr_1.5
## [5] tidyselect_0.2.3 munsell_0.4.3 colorspace_1.3-2 R6_2.2.2
## [9] rlang_0.1.4 plyr_1.8.4 stringr_1.2.0 tools_3.4.3
## [13] grid_3.4.3 gtable_0.2.0 htmltools_0.3.6 lazyeval_0.2.1
## [17] yaml_2.1.16 rprojroot_1.2 digest_0.6.13 assertthat_0.2.0
## [21] tibble_1.3.4 bookdown_0.5 purrr_0.2.4 glue_1.2.0
## [25] evaluate_0.10.1 rmarkdown_1.8 blogdown_0.4 labeling_0.3
## [29] stringi_1.1.6 keyholder_0.1.1 compiler_3.4.3 scales_0.5.0
## [33] backports_1.1.2 pkgconfig_2.0.1
```

This example is drawn from “A Painless Q-Learning Tutorial” at http://mnemstudio.org/path-finding-q-learning-tutorial.htm which explains how to manually calculate iterations using the updating equation for Q-Learning, based on the Bellman Equation (image from https://www.is.uni-freiburg.de/ressourcen/business-analytics/13_reinforcementlearning.pdf): The […]

Overview: Reinforcement learning uses “reward” signals to determine how to navigate through a system in the most valuable way. (I’m particularly interested in the variant of reinforcement learning called “Q-Learning” because the […]

Now that Twitter allows 280 characters, the code of some drawings I have made can fit in a tweet. In this post I have compiled a few of them. The first one is a cardioid inspired in string art (more info here): This other is based on Fermat’s spiral (more info here): A recurrence plot … Continue reading Tiny Art in Less Than 280 Characters →