One-Hot-Encode unordered factor columns of a data.table. From ben519's "mltools" package.

one_hot(
  dt,
  cols = "auto",
  sparsifyNAs = FALSE,
  naCols = FALSE,
  dropCols = TRUE,
  dropUnusedLevels = FALSE
)



dt: A data.table.

cols: Which column(s) should be one-hot-encoded? The default, "auto", encodes all unordered factor columns.

sparsifyNAs: Should NAs be converted to 0s?

naCols: Should columns be generated to indicate the presence of NAs? This only applies to factor columns with at least one NA.

dropCols: Should the resulting data.table exclude the original columns that were one-hot-encoded?

dropUnusedLevels: Should columns for unused factor levels be dropped? With the default, FALSE, a column of all 0s is generated for each unused level.


One-hot encoding converts an unordered categorical vector (i.e. a factor) into multiple binary vectors, one per class (i.e. level) of the original vector. Each binary vector of 1s and 0s indicates the presence of its class.
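The conversion described above can be sketched in base R. This only illustrates the idea and is not how the package implements `one_hot()`; the names `color` and `indicators` are made up for this example.

```r
# Illustration only: a base-R sketch of one-hot encoding, not the mltools implementation.
color <- factor(c("red", NA, "blue", "blue"), levels = c("blue", "green", "red"))

# Build one integer indicator vector per level:
# 1 where the element equals that level, 0 where it does not, NA where the input is NA.
indicators <- sapply(levels(color), function(lv) as.integer(color == lv))

# Prefix the column names with the variable name, mirroring the naming in the examples.
colnames(indicators) <- paste0("color_", levels(color))
indicators
```

The unused level "green" still gets a column of 0s, and the NA input stays NA in every column, which corresponds to the default settings of `dropUnusedLevels` and `sparsifyNAs`.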


library(data.table)
library(mltools)  # provides one_hot()

dt <- data.table(
  ID = 1:4,
  color = factor(c("red", NA, "blue", "blue"), levels=c("blue", "green", "red"))
)

# Default: NAs stay NA, original column dropped, unused levels kept
one_hot(dt)
#>    ID color_blue color_green color_red
#> 1:  1          0           0         1
#> 2:  2         NA          NA        NA
#> 3:  3          1           0         0
#> 4:  4          1           0         0

# Convert NAs to 0s
one_hot(dt, sparsifyNAs=TRUE)
#>    ID color_blue color_green color_red
#> 1:  1          0           0         1
#> 2:  2          0           0         0
#> 3:  3          1           0         0
#> 4:  4          1           0         0

# Add an indicator column for NAs
one_hot(dt, naCols=TRUE)
#>    ID color_NA color_blue color_green color_red
#> 1:  1        0          0           0         1
#> 2:  2        1         NA          NA        NA
#> 3:  3        0          1           0         0
#> 4:  4        0          1           0         0

# Keep the original column alongside the encoded ones
one_hot(dt, dropCols=FALSE)
#>    ID color color_blue color_green color_red
#> 1:  1   red          0           0         1
#> 2:  2  <NA>         NA          NA        NA
#> 3:  3  blue          1           0         0
#> 4:  4  blue          1           0         0

# Drop the column for the unused level "green"
one_hot(dt, dropUnusedLevels=TRUE)
#>    ID color_blue color_red
#> 1:  1          0         1
#> 2:  2         NA        NA
#> 3:  3          1         0
#> 4:  4          1         0
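The examples above all use the default `cols = "auto"`. As a rough sketch of restricting the encoding to named columns, the snippet below assumes `cols` accepts a character vector of column names; the table `dt2` and its `size` column are hypothetical and not part of the original examples.

```r
library(data.table)
library(mltools)

# Hypothetical table with two unordered factor columns.
dt2 <- data.table(
  ID = 1:3,
  color = factor(c("red", "blue", "blue")),
  size = factor(c("S", "M", "S"))
)

# Encode only `color`; `size` is expected to be left as an ordinary factor column.
encoded <- one_hot(dt2, cols = "color")
names(encoded)
```

Naming the columns explicitly can be useful when some factor columns should stay as factors for a later modelling step.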