Introduction
Package presentation
Based on data.table package, dataPreparation will allow you to do most of the painful data preparation for a data science project with a minimum amount of code.
This package is
- fast (use
data.table
and exponential search)
- RAM efficient (perform operations by reference and column-wise to avoid copying data)
- stable (most exceptions are handled)
- verbose (log a lot)
data.table
and other dependencies are handled at installation.
Main preparation steps
Before using any machine learning (ML) algorithm, one needs to prepare its data. Preparing a data set for a data science project can be long and tricky. The main steps are the followings:
- Read: load the data set (this package don’t treat this point: for csv we recommend
data.table::fread
)
- Correct: most of the times, there are some mistake after reading, wrong format… one have to correct them
- Transform: creating new features from date, categorical, character… in order to have information usable for a ML algorithm (aka: numeric or categorical)
- Filter: get rid of useless information in order to speed up computation
- Handle NA: replace missing values
- Pre model transformation: Specific manipulation for the chosen model (handling NA, discretization, one hot encoding, scaling…)
- Shape: put your data set in a nice shape usable by a ML algorithm
Here are the functions available in this package to tackle those issues:
unFactor |
generateDateDiffs |
fastFilterVariables |
fastHandleNa |
shapeSet |
findAndTransformDates |
generateFactorFromDate |
whichAreConstant |
fastDiscretization |
sameShape |
findAndTransformNumerics |
aggregateByKey |
whichAreInDouble |
fastScale |
setAsNumericMatrix |
setColAsCharacter |
generateFromFactor |
whichAreBijection |
|
one_hot_encoder |
setColAsNumeric |
generateFromCharacter |
|
|
|
setColAsDate |
fastRound |
|
|
|
setColAsFactor |
|
|
|
|
All of those functions are integrated in the full pipeline function prepareSet
.
In this tutorial we will detail all those steps and how to treat them with this package using an example data set.
Tutorial data
For this tutorial, we are going to use a messy version of adult data base.
data(messy_adult)
print(head(messy_adult, n = 4))
# date1 date2 date3 date4 num1 num2
# 1: NA 1510441200 24-Mar-2017 26-march, 2017 1.9309 0,0864
# 2: 2017-26-9 1490482800 01-Feb-2017 03-february, 2017 -0.4273 0,6345
# 3: NA 1510614000 18-Sep-2017 20-september, 2017 0.6093 1,8958
# 4: 2017-6-1 NA 25-Jun-2017 27-june, 2017 -0.5138 0,4505
# constant mail num3 age type_employer
# 1: 1 pierre.caroline@aol.com 1,9309 39 State-gov
# 2: 1 pierre.lucas@yahoo.com -0,4273 50 Self-emp-not-inc
# 3: 1 caroline.caroline@protonmail.com 0,6093 38 Private
# 4: 1 marie.caroline@gmail.com -0,5138 53 Private
# fnlwgt education education_num marital occupation
# 1: 77516 Bachelors 13 Never-married Adm-clerical
# 2: 83311 Bachelors 13 Married-civ-spouse Exec-managerial
# 3: 215646 HS-grad 9 Divorced Handlers-cleaners
# 4: 234721 11th 7 Married-civ-spouse Handlers-cleaners
# relationship race sex capital_gain capital_loss hr_per_week
# 1: Not-in-family White Male 2174 0 40
# 2: Husband White Male 0 0 13
# 3: Not-in-family White Male 0 0 40
# 4: Husband Black Male 0 0 40
# country income
# 1: United-States <=50K
# 2: United-States <=50K
# 3: United-States <=50K
# 4: United-States <=50K
We added 9 really ugly columns to the data set:
- 4 dates with various formats, or time stamps, and NAs
- 1 constant column
- 3 numeric with different decimal separator
- 1 email address
The same info can be contained in two different columns.
Shape functions
There are two types of machine learning algorithm in R: those which accept data.table and factor, those which only accept numeric matrix.
Transforming a data set into something acceptable for a machine learning algorithm could be tricky.
The shapeSet
function do it for you, you just have to choose if you want a data.table or a numerical_matrix.
First with data.table:
clean_adult = shapeSet(copy(messy_adult), finalForm = "data.table", verbose = FALSE)
# "setColAsFactor: num1 has more than 10 values, i don't transform it."
# "setColAsFactor: num2 has more than 10 values, i don't transform it."
# "setColAsFactor: age has more than 10 values, i don't transform it."
# "setColAsFactor: fnlwgt has more than 10 values, i don't transform it."
# "setColAsFactor: capital_gain has more than 10 values, i don't transform it."
# "setColAsFactor: capital_loss has more than 10 values, i don't transform it."
# "setColAsFactor: hr_per_week has more than 10 values, i don't transform it."
# "setColAsFactor: date1.Minus.date2 has more than 10 values, i don't transform it."
# "setColAsFactor: date1.Minus.date3 has more than 10 values, i don't transform it."
# "setColAsFactor: date1.Minus.analysisDate has more than 10 values, i don't transform it."
# "setColAsFactor: date2.Minus.date3 has more than 10 values, i don't transform it."
# "setColAsFactor: date2.Minus.analysisDate has more than 10 values, i don't transform it."
# "setColAsFactor: date3.Minus.analysisDate has more than 10 values, i don't transform it."
# "setColAsFactor: mail.num has more than 10 values, i don't transform it."
# "setColAsFactor: mail.order has more than 10 values, i don't transform it."
print(table(sapply(clean_adult, class)))
#
# factor integer numeric
# 12 1 15
As one can see, there only are, numeric and factors.
Now with numerical_matrix:
clean_adult <- shapeSet(copy(messy_adult), finalForm = "numerical_matrix", verbose = FALSE)
# "setColAsFactor: num1 has more than 10 values, i don't transform it."
# "setColAsFactor: num2 has more than 10 values, i don't transform it."
# "setColAsFactor: age has more than 10 values, i don't transform it."
# "setColAsFactor: fnlwgt has more than 10 values, i don't transform it."
# "setColAsFactor: capital_gain has more than 10 values, i don't transform it."
# "setColAsFactor: capital_loss has more than 10 values, i don't transform it."
# "setColAsFactor: hr_per_week has more than 10 values, i don't transform it."
# "setColAsFactor: date1.Minus.date2 has more than 10 values, i don't transform it."
# "setColAsFactor: date1.Minus.date3 has more than 10 values, i don't transform it."
# "setColAsFactor: date1.Minus.analysisDate has more than 10 values, i don't transform it."
# "setColAsFactor: date2.Minus.date3 has more than 10 values, i don't transform it."
# "setColAsFactor: date2.Minus.analysisDate has more than 10 values, i don't transform it."
# "setColAsFactor: date3.Minus.analysisDate has more than 10 values, i don't transform it."
# "setColAsFactor: mail.num has more than 10 values, i don't transform it."
# "setColAsFactor: mail.order has more than 10 values, i don't transform it."
num1
|
num2
|
age
|
type_employer?
|
type_employerFederal-gov
|
type_employerLocal-gov
|
…
|
0.59
|
-0.50
|
60
|
0
|
0
|
0
|
…
|
0.00
|
-0.60
|
25
|
0
|
0
|
0
|
…
|
0.00
|
0.48
|
26
|
0
|
0
|
0
|
…
|
0.02
|
2.83
|
28
|
0
|
0
|
0
|
…
|
-0.87
|
-0.39
|
45
|
0
|
0
|
0
|
…
|
1.20
|
-0.74
|
31
|
0
|
0
|
0
|
…
|
As one can see, with finalForm = "numerical_matrix"
every character and factor have been binarized.
Full pipeline
Doing it all with one function is possible:
To do that we will reload the ugly data set and perform aggregation.
data("messy_adult")
agg_adult <- prepareSet(messy_adult, finalForm = "data.table", key = "country", analysisDate = Sys.Date(), digits = 2)
# "prepareSet: step one: correcting mistakes."
# "fastFilterVariables: I check for constant columns."
# "fastFilterVariables: I delete 1 constant column(s) in dataSet."
# "fastFilterVariables: I check for columns in double."
# "fastFilterVariables: I check for columns that are bijections of another column."
# "fastFilterVariables: I delete 3 column(s) that are bijections of another column in dataSet."
# "unFactor: I will identify variable that are factor but shouldn't be."
# "unFactor: I unfactor mail."
# "unFactor: It took me 0.12s to unfactor 1 column(s)."
# "findAndTransformNumerics: It took me 0.14s to identify 2 numerics column(s), i will set them as numerics"
# "setColAsNumeric: I will set some columns as numeric"
# "setColAsNumeric: I will set some columns as numeric"
# "setColAsNumeric: I am doing the column num2."
# "setColAsNumeric: 0 NA have been created due to transformation to numeric."
# "setColAsNumeric: I am doing the column num3."
# "setColAsNumeric: 0 NA have been created due to transformation to numeric."
# "findAndTransformNumerics: It took me 0.05s to transform 2 column(s) to a numeric format."
# "findAndTransformDates: It took me 0.67s to identify formats"
# "findAndTransformDates: It took me 0.11s to transform 3 columns to a Date format."
# "prepareSet: step two: transforming dataSet."
# "generateDateDiffs: I will generate difference between dates."
# "generateDateDiffs: It took me 0.02s to create 6 column(s)."
# "generateFactorFromDate: I will create a factor column from each date column."
# "generateFactorFromDate: It took me 0.46s to transform 3 column(s)."
# "generateFromCharacter: it took me: 0s to transform 1 character columns into, 3 new columns."
# "aggregateByKey: I start to aggregate"
# "aggregateByKey: 164 columns have been constructed. It took 0.41 seconds. "
# "prepareSet: step three: filtering dataSet."
# "fastFilterVariables: I check for constant columns."
# "fastFilterVariables: I delete 2 constant column(s) in result."
# "fastFilterVariables: I check for columns in double."
# "fastFilterVariables: I delete 1 column(s) that are in double in result."
# "fastFilterVariables: I check for columns that are bijections of another column."
# "fastFilterVariables: I delete 35 column(s) that are bijections of another column in result."
# "prepareSet: step four: handling NA."
# "prepareSet: step five: shaping result."
# "setColAsFactor: I will set some columns to factor."
# "setColAsFactor: it took me: 0s to transform 0 column(s) to factor."
# "Transforming numerical variables into factors when length(unique(col)) <= 10."
# "setColAsFactor: nbrLines has more than 10 values, i don't transform it."
# "setColAsFactor: max.age has more than 10 values, i don't transform it."
# "setColAsFactor: type_employer.? has more than 10 values, i don't transform it."
# "setColAsFactor: type_employer.Local-gov has more than 10 values, i don't transform it."
# "setColAsFactor: type_employer.Private has more than 10 values, i don't transform it."
# "setColAsFactor: type_employer.Self-emp-not-inc has more than 10 values, i don't transform it."
# "setColAsFactor: education.11th has more than 10 values, i don't transform it."
# "setColAsFactor: education.5th-6th has more than 10 values, i don't transform it."
# "setColAsFactor: education.7th-8th has more than 10 values, i don't transform it."
# "setColAsFactor: education.Bachelors has more than 10 values, i don't transform it."
# "setColAsFactor: education.HS-grad has more than 10 values, i don't transform it."
# "setColAsFactor: education.Masters has more than 10 values, i don't transform it."
# "setColAsFactor: education.Some-college has more than 10 values, i don't transform it."
# "setColAsFactor: marital.Divorced has more than 10 values, i don't transform it."
# "setColAsFactor: marital.Married-civ-spouse has more than 10 values, i don't transform it."
# "setColAsFactor: marital.Married-spouse-absent has more than 10 values, i don't transform it."
# "setColAsFactor: marital.Never-married has more than 10 values, i don't transform it."
# "setColAsFactor: marital.Separated has more than 10 values, i don't transform it."
# "setColAsFactor: occupation.Adm-clerical has more than 10 values, i don't transform it."
# "setColAsFactor: occupation.Craft-repair has more than 10 values, i don't transform it."
# "setColAsFactor: occupation.Exec-managerial has more than 10 values, i don't transform it."
# "setColAsFactor: occupation.Handlers-cleaners has more than 10 values, i don't transform it."
# "setColAsFactor: occupation.Machine-op-inspct has more than 10 values, i don't transform it."
# "setColAsFactor: occupation.Other-service has more than 10 values, i don't transform it."
# "setColAsFactor: occupation.Prof-specialty has more than 10 values, i don't transform it."
# "setColAsFactor: occupation.Sales has more than 10 values, i don't transform it."
# "setColAsFactor: occupation.Transport-moving has more than 10 values, i don't transform it."
# "setColAsFactor: relationship.Husband has more than 10 values, i don't transform it."
# "setColAsFactor: relationship.Not-in-family has more than 10 values, i don't transform it."
# "setColAsFactor: relationship.Other-relative has more than 10 values, i don't transform it."
# "setColAsFactor: relationship.Own-child has more than 10 values, i don't transform it."
# "setColAsFactor: relationship.Unmarried has more than 10 values, i don't transform it."
# "setColAsFactor: relationship.Wife has more than 10 values, i don't transform it."
# "setColAsFactor: race.Asian-Pac-Islander has more than 10 values, i don't transform it."
# "setColAsFactor: race.Black has more than 10 values, i don't transform it."
# "setColAsFactor: race.Other has more than 10 values, i don't transform it."
# "setColAsFactor: race.White has more than 10 values, i don't transform it."
# "setColAsFactor: sex.Female has more than 10 values, i don't transform it."
# "setColAsFactor: sex.Male has more than 10 values, i don't transform it."
# "setColAsFactor: mean.capital_gain has more than 10 values, i don't transform it."
# "setColAsFactor: max.capital_gain has more than 10 values, i don't transform it."
# "setColAsFactor: mean.capital_loss has more than 10 values, i don't transform it."
# "setColAsFactor: max.capital_loss has more than 10 values, i don't transform it."
# "setColAsFactor: sd.capital_loss has more than 10 values, i don't transform it."
# "setColAsFactor: min.hr_per_week has more than 10 values, i don't transform it."
# "setColAsFactor: max.hr_per_week has more than 10 values, i don't transform it."
# "setColAsFactor: income.<=50K has more than 10 values, i don't transform it."
# "setColAsFactor: income.>50K has more than 10 values, i don't transform it."
# "setColAsFactor: date1.yearmonth.2017 Apr has more than 10 values, i don't transform it."
# "setColAsFactor: date1.yearmonth.2017 Aug has more than 10 values, i don't transform it."
# "setColAsFactor: date1.yearmonth.2017 Dec has more than 10 values, i don't transform it."
# "setColAsFactor: date1.yearmonth.2017 Feb has more than 10 values, i don't transform it."
# "setColAsFactor: date1.yearmonth.2017 Jan has more than 10 values, i don't transform it."
# "setColAsFactor: date1.yearmonth.2017 Jul has more than 10 values, i don't transform it."
# "setColAsFactor: date1.yearmonth.2017 Jun has more than 10 values, i don't transform it."
# "setColAsFactor: date1.yearmonth.2017 Mar has more than 10 values, i don't transform it."
# "setColAsFactor: date1.yearmonth.2017 May has more than 10 values, i don't transform it."
# "setColAsFactor: date1.yearmonth.2017 Nov has more than 10 values, i don't transform it."
# "setColAsFactor: date1.yearmonth.2017 Oct has more than 10 values, i don't transform it."
# "setColAsFactor: date1.yearmonth.2017 Sep has more than 10 values, i don't transform it."
# "setColAsFactor: date1.yearmonth.NA has more than 10 values, i don't transform it."
# "setColAsFactor: date2.yearmonth.2017 Apr has more than 10 values, i don't transform it."
# "setColAsFactor: date2.yearmonth.2017 Aug has more than 10 values, i don't transform it."
# "setColAsFactor: date2.yearmonth.2017 Dec has more than 10 values, i don't transform it."
# "setColAsFactor: date2.yearmonth.2017 Feb has more than 10 values, i don't transform it."
# "setColAsFactor: date2.yearmonth.2017 Jan has more than 10 values, i don't transform it."
# "setColAsFactor: date2.yearmonth.2017 Jul has more than 10 values, i don't transform it."
# "setColAsFactor: date2.yearmonth.2017 Jun has more than 10 values, i don't transform it."
# "setColAsFactor: date2.yearmonth.2017 Mar has more than 10 values, i don't transform it."
# "setColAsFactor: date2.yearmonth.2017 May has more than 10 values, i don't transform it."
# "setColAsFactor: date2.yearmonth.2017 Nov has more than 10 values, i don't transform it."
# "setColAsFactor: date2.yearmonth.2017 Oct has more than 10 values, i don't transform it."
# "setColAsFactor: date2.yearmonth.2017 Sep has more than 10 values, i don't transform it."
# "setColAsFactor: date2.yearmonth.NA has more than 10 values, i don't transform it."
# "setColAsFactor: date4.yearmonth.2017 Apr has more than 10 values, i don't transform it."
# "setColAsFactor: date4.yearmonth.2017 Aug has more than 10 values, i don't transform it."
# "setColAsFactor: date4.yearmonth.2017 Dec has more than 10 values, i don't transform it."
# "setColAsFactor: date4.yearmonth.2017 Feb has more than 10 values, i don't transform it."
# "setColAsFactor: date4.yearmonth.2017 Jan has more than 10 values, i don't transform it."
# "setColAsFactor: date4.yearmonth.2017 Jul has more than 10 values, i don't transform it."
# "setColAsFactor: date4.yearmonth.2017 Jun has more than 10 values, i don't transform it."
# "setColAsFactor: date4.yearmonth.2017 Mar has more than 10 values, i don't transform it."
# "setColAsFactor: date4.yearmonth.2017 May has more than 10 values, i don't transform it."
# "setColAsFactor: date4.yearmonth.2017 Nov has more than 10 values, i don't transform it."
# "setColAsFactor: date4.yearmonth.2017 Oct has more than 10 values, i don't transform it."
# "setColAsFactor: date4.yearmonth.2017 Sep has more than 10 values, i don't transform it."
# "setColAsFactor: max.mail.num has more than 10 values, i don't transform it."
# "setColAsFactor: min.mail.order has more than 10 values, i don't transform it."
# "setColAsFactor: max.mail.order has more than 10 values, i don't transform it."
# "Previous distribution of column types:"
# col_class_init
# factor numeric
# 1 125
# "Current distribution of column types:"
# col_class_end
# factor numeric
# 37 89
As one can see, every previously steps have been done.
Let’s have a look to the result
# "126 columns have been built; for 42 countries."
country
|
nbrLines
|
mean.num2
|
sd.num2
|
mean.num3
|
sd.num3
|
min.age
|
…
|
?
|
529
|
0
|
0
|
0
|
0
|
17
|
…
|
Cambodia
|
16
|
0.08
|
0.78
|
0
|
0
|
25
|
…
|
Canada
|
108
|
0
|
0
|
0
|
0
|
17
|
…
|
China
|
67
|
0
|
0
|
0
|
0
|
22
|
…
|
Columbia
|
53
|
0
|
0
|
0
|
0
|
21
|
…
|
Cuba
|
88
|
0
|
0
|
0
|
0
|
21
|
…
|