In contextual, arms are one-based. In other words, the k arms of a Bandit are indexed by 1, 2, 3, ..., k.
In the set_parameters() initialisation functions of many of contextual’s Policies, a parameter list is assigned to self$theta_to_arms, but the parameters themselves are later only accessible through self$theta. The comments in the following source code explain how this little helper variable is used:
EpsilonFirstPolicy <- R6::R6Class(
  inherit = Policy,
  public = list(
    first = NULL,
    class_name = "EpsilonFirstPolicy",
    initialize = function(epsilon = 0.1, N = 1000) {
      super$initialize()
      self$first <- ceiling(epsilon * N)
    },
    set_parameters = function(context_params) {
      self$theta_to_arms <- list('n' = 0, 'mean' = 0)
      # Here we define a list with 'n' and 'mean' theta parameters to each
      # arm through helper variable self$theta_to_arms. That is, when the
      # number of arms is 'k', the above would equal:
      # self$theta <- list('n' = rep(list(0), k), 'mean' = rep(list(0), k))
      # ... which would also work just fine, but is much less concise.
      # When assigning both to self$theta directly and via self$theta_to_arms,
      # make sure to do it in that particular order.
    },
    get_action = function(t, context) {
      if (sum_of(self$theta$n) < self$first) {
        action$choice <- sample.int(context$k, 1, replace = TRUE)
      } else {
        action$choice <- max_in(self$theta$mean)
      }
      action
    },
    set_reward = function(t, context, action, reward) {
      arm    <- action$choice
      reward <- reward$reward
      inc(self$theta$n[[arm]]) <- 1
      if (sum_of(self$theta$n) < self$first - 1) {
        inc(self$theta$mean[[arm]]) <- (reward - self$theta$mean[[arm]]) / self$theta$n[[arm]]
      }
      self$theta
    }
  )
)
Contextual is built using the R6 class system. In R6, you cannot add new attributes to R6 objects simply by assigning to $-indexed values, as you can in S3. More information on R6 in general, and on public and private members in particular, can be found in the Introduction to R6 classes vignette.
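As a minimal illustration (not contextual-specific): R6 objects are locked by default, so any field has to be declared in the class definition before it can be assigned to.

library(R6)

Counter <- R6::R6Class("Counter",
  public = list(
    n = 0,                                        # fields are declared up front
    increment = function() self$n <- self$n + 1
  )
)

counter <- Counter$new()
counter$increment()
counter$n                 # 1
# counter$extra <- 5      # error: cannot add bindings to a locked environment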
By default, contextual’s agents run in separate parallel worker processes. As these worker processes have no direct access to the R console, they cannot be tracked there. There are, however, two options to keep track of Bandit and Policy progress and their messages. The first is to set Simulator’s do_parallel argument to FALSE. The second is to set Simulator’s progress_file argument to TRUE. If TRUE, Simulator writes workers_progress.log, agents_progress.log and parallel.log files to the current working directory, allowing you to keep track of, respectively, workers, agents, and potential errors when running a Simulator in parallel mode.
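For example (a minimal sketch, assuming an existing agents list):

simulator <- Simulator$new(agents,
                           horizon       = 100L,
                           simulations   = 100L,
                           progress_file = TRUE)   # <- writes *.log files to getwd()
history   <- simulator$run()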
If one of your custom Bandits or Policies requires a specific R package, use Simulator’s include_packages option to distribute the package(s) to each of a Simulator instance’s workers:
simulator <- Simulator$new(agents,
                           horizon = 100L,
                           simulations = 100L,
                           include_packages = c("rstan", "plyr"))   # <- distributes package(s)
                                                                    #    to parallel workers
It is then good practice to use the double colon operator to access functions from any of the included packages:

# At the top of a custom Bandit or Policy R file
library(testthat)

Bandit <- R6::R6Class(
  class = FALSE,
  public = list(
    ...
    # Within a custom Bandit or Policy
    fit <- rstan::stan(...)
    ...
  )
)
Contextual makes ample use of the data.table package, which, when used the “data.table way”, can speed up in-memory data related operations substantially. For more information on data.table, see the Intro to data.table vignette and, for instance, this data.table cheat sheet.
That does not imply that this is the only way to go - contextual’s Bandits and Policies make use of basic (fast) matrix based operations as well, and its Simulator and offline Bandits have been extended to, among others, read and write to and from the super fast open-source column-oriented MonetDB database management system.
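As a small, contextual-independent illustration of the “data.table way”: a grouped aggregation expressed in a single data.table call (the agent names and rewards below are purely illustrative).

library(data.table)

dt <- data.table(agent  = rep(c("EGreedy", "UCB1"), each = 5),
                 t      = rep(1:5, 2),
                 reward = runif(10))

# mean reward per agent, computed in one grouped expression
dt[, .(mean_reward = mean(reward)), by = agent]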
A Bandit’s initialize() function is called once, before Bandit and Policy clones are distributed over a Simulator’s simulations. So, generally, it is advisable to randomize within a Bandit’s post_initialization() function, which is called right before the start of every simulation.
Here, contextual’s Simulator class ensures that every Policy and Bandit pair (bound by their shared Agent) will run their simulations on a deterministic set of seeds, ensuring both replicability and fair comparisons between agents.
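For instance, a minimal sketch of a hypothetical Bandit (not part of contextual) that draws random weights in post_initialization() instead of initialize():

RandomWeightBandit <- R6::R6Class(
  inherit = Bandit,
  class = FALSE,
  public = list(
    class_name = "RandomWeightBandit",
    weights    = NULL,
    initialize = function(k) {
      self$k <- k                        # runs once, before clones are distributed
    },
    post_initialization = function() {
      self$weights <- runif(self$k)      # runs at the start of every simulation,
    },                                   # under that simulation's own seed
    get_context = function(t) {
      list(k = self$k)
    },
    get_reward = function(t, context, action) {
      list(reward = as.double(runif(1) < self$weights[action$choice]))
    }
  )
)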
Contextual uses the public “class_name” member internally to keep track of Policies and Bandits, and to, among other things, generate names for Agents that have not been named explicitly. So don’t forget to include it in your own custom Bandits and Policies!
#' @export
LinUCBDisjointOptimizedPolicy <- R6::R6Class(     # <-- not directly accessible to contextual
  portable = FALSE,
  class = FALSE,
  inherit = Policy,
  public = list(
    alpha = NULL,
    class_name = "LinUCBDisjointOptimizedPolicy", # <-- so Bandits and Policies need class_name too
    initialize = function(alpha = 1.0) {
      super$initialize()
      self$alpha <- alpha
    },
    ...
Contextual passes a “context” list object between Bandit and Policy for each time step t. For every t = {1, ..., T}, this object contains a d dimensional feature vector or d x k dimensional matrix context$X, the current number of Bandit arms in context$k, and the current number of contextual features in context$d.
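For example, a purely illustrative context list with d = 3 contextual features and k = 4 arms might look as follows:

context <- list(
  k = 4,                                # current number of arms
  d = 3,                                # current number of contextual features
  X = matrix(runif(3 * 4), nrow = 3)    # d x k contextual feature matrix
)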
To transparently support Bandits generating either contextual feature vectors or matrices, contextual’s Policies generally use the get_arm_context(context, arm) utility function to obtain the current X for a particular arm, or the get_full_context(context) utility function to obtain the full d x k context matrix.
The functions additionally support returning a subset of all context features through their select_features arguments, and they can respectively prepend a “one hot encoded” style arm vector or diagonal arm matrix through their prepend_arm_vector arguments.
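A minimal sketch, applied to the illustrative context list above, with arguments passed by name as described (the exact signatures may differ slightly):

library(contextual)

X_full <- get_full_context(context)                    # full d x k context matrix
X_arm2 <- get_arm_context(context, 2)                  # feature vector for arm 2
X_sub  <- get_arm_context(context, 2,
                          select_features = c(1, 3))   # only features 1 and 3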
In the get_reward() function of a Bandit you can set both the reward and the optimal reward. When both have been set, contextual then knows to subtract the former from the latter to obtain the regret.
For example, see BasicBernoulliBandit’s implementation:
#' @export
BasicBernoulliBandit <- R6::R6Class(
  inherit = Bandit,
  class = FALSE,
  public = list(
    weights    = NULL,
    class_name = "BasicBernoulliBandit",
    initialize = function(weights) {
      self$weights <- weights
      self$k       <- length(self$weights)
    },
    get_context = function(t) {
      context <- list(k = self$k)
    },
    get_reward = function(t, context, action) {
      rewards     <- as.double(runif(self$k) < self$weights)
      optimal_arm <- which_max_tied(self$weights)
      reward      <- list(
        reward         = rewards[action$choice],   # <- reward here
        optimal_arm    = optimal_arm,
        optimal_reward = rewards[optimal_arm]      # <- optimal reward here
      )
    }
  )
)
…