Source: R/policy_mab_gittins_bl.R (GittinsBrezziLaiPolicy.Rd)

GittinsBrezziLaiPolicy

Algorithm based on Brezzi and Lai (2002),
"Optimal learning and experimentation in bandit problems."
Description

  The algorithm provides a closed-form approximation of the Gittins index
  as a function of the discount factor and the number of successes and
  failures associated with each arm.
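  A sketch of this closed-form approximation, following the formula in
  Brezzi and Lai (2002): the index is the posterior mean of a Beta(alpha,
  beta) arm plus an exploration bonus scaled by a piecewise function psi.
  The function names gittins_bl_index and psi_bl are illustrative, and the
  exact pseudo-count convention may differ slightly from the package's
  internal implementation.

    # Piecewise function psi(s) from Brezzi & Lai (2002)
    psi_bl <- function(s) {
      if (s <= 0.2) {
        sqrt(s / 2)
      } else if (s <= 1) {
        0.49 - 0.11 * s^(-0.5)
      } else if (s <= 5) {
        0.63 - 0.26 * s^(-0.5)
      } else if (s <= 15) {
        0.77 - 0.58 * s^(-0.5)
      } else {
        sqrt(2 * log(s) - log(log(s)) - log(16 * pi))
      }
    }

    # Approximate Gittins index for an arm with Beta(alpha, beta) beliefs:
    # posterior mean plus a bonus that shrinks as evidence (alpha + beta)
    # accumulates and as the discount factor decreases.
    gittins_bl_index <- function(alpha, beta, discount = 0.95) {
      n  <- alpha + beta
      mu <- alpha / n
      c  <- log(1 / discount)
      mu + sqrt(mu * (1 - mu) / n) * psi_bl(1 / (n * c))
    }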
Usage

  policy <- GittinsBrezziLaiPolicy$new(discount = 0.95, prior = NULL)
Arguments

  discount
    numeric; discount factor.

  prior
    numeric matrix; prior beliefs over the Bernoulli parameters governing
    each arm. Beliefs are specified per arm by a Beta distribution with
    two parameters (alpha, beta), where alpha is the number of successes
    and beta the number of failures. The matrix has dimensions arms x 2
    (alpha, beta); see the sketch below for an example.
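  For illustration, a minimal sketch of a prior for a hypothetical
  three-armed bandit (the arm count and values are illustrative):

    # 3 arms x 2 columns (alpha, beta); uniform Beta(1, 1) beliefs per arm
    prior <- matrix(1, nrow = 3, ncol = 2)
    # e.g., an optimistic belief for arm 1: 4 successes, 1 failure
    prior[1, ] <- c(4, 1)
    policy <- GittinsBrezziLaiPolicy$new(discount = 0.95, prior = prior)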
Methods

  new(discount = 0.95, prior = NULL)
    Generates and initializes a new GittinsBrezziLaiPolicy object.
  get_action(t, context)
    arguments:
      t: integer, time step t.
      context: list, containing the current context$X (d x k context
        matrix), context$k (number of arms) and context$d (number of
        context features) (as set by bandit).
    Computes which arm to play based on the current values in theta and
    the current context. Returns a named list containing action$choice,
    which holds the index of the arm to play.

  set_reward(t, context, action, reward)
    arguments:
      t: integer, time step t.
      context: list, containing the current context$X (d x k context
        matrix), context$k (number of arms) and context$d (number of
        context features) (as set by bandit).
      action: list, containing action$choice (as set by policy).
      reward: list, containing reward$reward and, if available,
        reward$optimal (as set by bandit).
    Utilizes the above arguments to update and return the set of
    parameters in list theta.

  set_parameters()
    Helper function, called during a Policy's initialisation, that assigns
    the values it finds in list self$theta_to_arms to each of the Policy's
    k arms. The parameters defined here can then be accessed by arm index
    as theta[[index_of_arm]]$parameter_name; see the sketch below.
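  A standalone sketch of the theta_to_arms mechanism described above; it
  mimics the per-arm expansion outside of any Policy class, so the
  variable names are illustrative rather than package internals:

    k <- 3                                    # number of arms
    theta_to_arms <- list(alpha = 1, beta = 1)
    theta <- rep(list(theta_to_arms), k)      # one parameter copy per arm

    # Access by arm index, as in theta[[index_of_arm]]$parameter_name:
    theta[[2]]$alpha

    # Conjugate Beta update after a Bernoulli reward of 1 for arm 2:
    theta[[2]]$alpha <- theta[[2]]$alpha + 1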
Brezzi, M., & Lai, T. L. (2002). Optimal learning and experimentation in bandit problems. Journal of Economic Dynamics and Control, 27(1), 87-108.
Implementation follows https://github.com/elarry/bandit-algorithms-simulated
See Also

  Core contextual classes: Bandit, Policy, Simulator, Agent, History, Plot

  Bandit subclass examples: BasicBernoulliBandit, ContextualLogitBandit,
  OfflineReplayEvaluatorBandit

  Policy subclass examples: EpsilonGreedyPolicy, ContextualLinTSPolicy
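Examples

  A usage sketch following contextual's usual Agent/Simulator pattern; the
  Bernoulli weights, horizon and simulation count below are illustrative.

    library(contextual)

    bandit    <- BasicBernoulliBandit$new(weights = c(0.6, 0.2, 0.2))
    policy    <- GittinsBrezziLaiPolicy$new(discount = 0.95)
    agent     <- Agent$new(policy, bandit)
    simulator <- Simulator$new(agent, horizon = 100, simulations = 100)
    history   <- simulator$run()

    plot(history, type = "cumulative")
    summary(history)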