If you have created a dataset to build a classification model, you must cleanse the data before fitting the model. The alookr package makes this fast and easy.
To illustrate basic use of the alookr package, create the data_exam dataset with the sample() function. The data_exam dataset includes 5 variables.
The variables are as follows:
id: character
year: character
count: numeric
alpha: character
flag: character
# create sample dataset
set.seed(123L)
id <- sapply(1:1000, function(x)
  paste(c(sample(letters, 5), x), collapse = ""))
year <- "2018"
set.seed(123L)
count <- sample(1:10, size = 1000, replace = TRUE)
set.seed(123L)
alpha <- sample(letters, size = 1000, replace = TRUE)
set.seed(123L)
flag <- sample(c("Y", "N"), size = 1000, prob = c(0.1, 0.9), replace = TRUE)
data_exam <- data.frame(id, year, count, alpha, flag, stringsAsFactors = FALSE)
# structure of dataset
str(data_exam)
'data.frame': 1000 obs. of 5 variables:
$ id : chr "osncj1" "rvket2" "nvesi3" "chgji4" ...
$ year : chr "2018" "2018" "2018" "2018" ...
$ count: int 3 3 10 2 6 5 4 6 9 10 ...
$ alpha: chr "o" "s" "n" "c" ...
$ flag : chr "N" "N" "N" "N" ...
# summary of dataset
summary(data_exam)
      id                year               count           alpha
 Length:1000        Length:1000        Min.   : 1.000   Length:1000
 Class :character   Class :character   1st Qu.: 3.000   Class :character
 Mode  :character   Mode  :character   Median : 6.000   Mode  :character
                                       Mean   : 5.698
                                       3rd Qu.: 8.000
                                       Max.   :10.000
     flag
 Length:1000
 Class :character
 Mode  :character
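cleanse() cleans up the dataset before fitting the classification model.
cleanse() performs the following operations:
remove variables whose unique value is one
remove variables with high unique rate
convert character variables to factor
remove variables that contain missing values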
For example, we can cleanse all variables in data_exam:
# load alookr, which provides cleanse()
library(alookr)
# cleansing dataset
newDat <- cleanse(data_exam)
─ Checking unique value ────────────── unique value is one ─
remove variables that unique value is one
● year
─ Checking unique rate ──────────────── high unique rate ─
remove variables with high unique rate= 1000(1)
● id
─ Checking character variables ──────────── categorical data ─
converts character variables to factor
● alpha
● flag
# structure of cleansing dataset
str(newDat)
'data.frame': 1000 obs. of 3 variables:
$ count: int 3 3 10 2 6 5 4 6 9 10 ...
$ alpha: Factor w/ 26 levels "a","b","c","d",..: 15 19 14 3 10 18 22 11 5 20 ...
$ flag : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 1 1 1 ...
remove variables whose unique value is one: The year variable has only one value, “2018”. A constant variable carries no information for fitting the model, so it was removed.
remove variables with high unique rate: If a categorical variable has a very large number of levels, it is not suitable for a classification model. Such a variable is most likely an identifier of the data. So cleanse() removes categorical (or character) variables with a high unique rate, defined as “number of levels / number of observations”. Here id has 1000 levels in 1000 observations, a unique rate of 1, so it was removed.
converts character variables to factor: The character variables alpha and flag are converted to factors.
For example, by changing the threshold of the unique rate, we can keep categorical data that would otherwise be removed:
# cleansing dataset
newDat <- cleanse(data_exam, uniq_thres = 0.03)
─ Checking unique value ────────────── unique value is one ─
remove variables that unique value is one
● year
─ Checking unique rate ──────────────── high unique rate ─
remove variables with high unique rate= 1000(1)
● id
─ Checking character variables ──────────── categorical data ─
converts character variables to factor
● alpha
● flag
# structure of cleansing dataset
str(newDat)
'data.frame': 1000 obs. of 3 variables:
$ count: int 3 3 10 2 6 5 4 6 9 10 ...
$ alpha: Factor w/ 26 levels "a","b","c","d",..: 15 19 14 3 10 18 22 11 5 20 ...
$ flag : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 1 1 1 ...
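The alpha variable was not removed: it has 26 levels in 1000 observations, a unique rate of 0.026, which is below the threshold of 0.03.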
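The unique-value and unique-rate checks can be approximated by hand in base R. The following is only a sketch under stated assumptions: the data frame toy and the helper unique_rate() are hypothetical names, toy is a deterministic stand-in for data_exam rather than the sampled data, and the exact comparison that cleanse() applies internally is not shown in this document.

```r
# Hypothetical stand-in for data_exam (deterministic, not the sampled data above)
toy <- data.frame(
  id    = paste0("id", 1:1000),              # one distinct value per row, like an identifier
  year  = "2018",                            # constant column
  alpha = rep(letters, length.out = 1000),   # 26 distinct values
  stringsAsFactors = FALSE
)

# Unique rate as defined in the text: number of levels / number of observations
unique_rate <- function(x) length(unique(x)) / length(x)

rates <- sapply(toy, unique_rate)
print(rates)

# Columns with a single distinct value (assumed analogue of the "unique value is one" check)
constant_cols <- names(toy)[sapply(toy, function(x) length(unique(x)) == 1)]

# Columns whose unique rate exceeds an illustrative threshold of 0.03
high_rate_cols <- names(rates)[rates > 0.03]
```

In this stand-in, alpha has a unique rate of 26/1000 = 0.026, so only id exceeds the 0.03 threshold, and year is the only constant column.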
If you do not want to apply the unique value and unique rate checks, you can set the uniq argument to FALSE:
# cleansing dataset
newDat <- cleanse(data_exam, uniq = FALSE)
─ Checking character variables ──────────── categorical data ─
converts character variables to factor
● id
● year
● alpha
● flag
# structure of cleansing dataset
str(newDat)
'data.frame': 1000 obs. of 5 variables:
$ id : Factor w/ 1000 levels "ablnc282","abqym54",..: 594 715 558 94 727 270 499 882 930 515 ...
$ year : Factor w/ 1 level "2018": 1 1 1 1 1 1 1 1 1 1 ...
$ count: int 3 3 10 2 6 5 4 6 9 10 ...
$ alpha: Factor w/ 26 levels "a","b","c","d",..: 15 19 14 3 10 18 22 11 5 20 ...
$ flag : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 1 1 1 ...
If you do not want to force type conversion of character variables to factor, you can set the char argument to FALSE:
# cleansing dataset
newDat <- cleanse(data_exam, char = FALSE)
─ Checking unique value ────────────── unique value is one ─
remove variables that unique value is one
● year
─ Checking unique rate ──────────────── high unique rate ─
remove variables with high unique rate= 1000(1)
● id
# structure of cleansing dataset
str(newDat)
'data.frame': 1000 obs. of 3 variables:
$ count: int 3 3 10 2 6 5 4 6 9 10 ...
$ alpha: chr "o" "s" "n" "c" ...
$ flag : chr "N" "N" "N" "N" ...
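For comparison, the character-to-factor conversion that is skipped here can be written in a few lines of base R. This is a minimal sketch, not the implementation used by cleanse(); the data frame toy and the variable is_chr are hypothetical names introduced for illustration.

```r
# Hypothetical small data frame with mixed column types
toy <- data.frame(
  count = c(3L, 10L, 2L),
  flag  = c("N", "Y", "N"),
  stringsAsFactors = FALSE
)

# Flag the character columns, then convert only those to factors
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

# count stays integer; flag becomes a factor
str(toy)
```

Doing the conversion yourself can be useful when you want to control level order or convert only selected columns.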
If you want to remove variables that contain missing values, set the missing argument to TRUE. The following example puts a missing value into the flag variable, which cleanse() then removes.
data_exam$flag[1] <- NA
# cleansing dataset
newDat <- cleanse(data_exam, missing = TRUE)
─ Checking missing value ───────────────── included NA ─
remove variables whose included NA
● flag
● flag
─ Checking unique value ────────────── unique value is one ─
remove variables that unique value is one
● year
─ Checking unique rate ──────────────── high unique rate ─
remove variables with high unique rate= 1000(1)
● id
─ Checking character variables ──────────── categorical data ─
converts character variables to factor
● alpha
# structure of cleansing dataset
str(newDat)
'data.frame': 1000 obs. of 2 variables:
$ count: int 3 3 10 2 6 5 4 6 9 10 ...
$ alpha: Factor w/ 26 levels "a","b","c","d",..: 15 19 14 3 10 18 22 11 5 20 ...
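The missing-value check can likewise be approximated in base R. The sketch below assumes that a variable is dropped as soon as it contains at least one NA, which matches the example above where a single NA in flag was enough; toy, has_na, and kept are hypothetical names, and cleanse() may implement the check differently.

```r
# Hypothetical small data frame; flag contains one missing value
toy <- data.frame(
  count = c(3L, 10L, 2L),
  flag  = c(NA, "Y", "N"),
  stringsAsFactors = FALSE
)

# TRUE for every column that contains at least one NA
has_na <- vapply(toy, anyNA, logical(1))

# Keep only the complete columns; drop = FALSE preserves the data frame
kept <- toy[, !has_na, drop = FALSE]
names(kept)
```

Note that this removes whole variables, not rows, so all observations are retained.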