R/policy_mab_gittins_bl.R
GittinsBrezziLaiPolicy.Rd
GittinsBrezziLaiPolicy
Algorithm based on Brezzi and Lai (2002),
"Optimal learning and experimentation in bandit problems."
The algorithm provides an approximation of the Gittins index by specifying a closed-form expression, which is a function of the discount factor and the number of successes and failures associated with each arm.
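The closed-form expression can be sketched as follows. This is an illustration, not the package source: the helper names psi() and gittins_approx() are hypothetical, and the piecewise constants in psi() are the calibration reported in Brezzi and Lai (2002); consult the paper or R/policy_mab_gittins_bl.R for the exact form used.

# Illustrative only: approximate Gittins index of a Bernoulli arm with a
# Beta(alpha, beta) posterior, discounted by factor 'discount'.
psi <- function(s) {
  # Piecewise function psi(s), constants as reported in Brezzi & Lai (2002)
  if (s <= 0.2) sqrt(s / 2)
  else if (s <= 1) 0.49 - 0.11 / sqrt(s)
  else if (s <= 5) 0.63 - 0.26 / sqrt(s)
  else if (s <= 15) 0.77 - 0.58 / sqrt(s)
  else sqrt(2 * log(s) - log(log(s)) - log(16 * pi))
}
gittins_approx <- function(alpha, beta, discount) {
  mu <- alpha / (alpha + beta)              # posterior mean
  v  <- mu * (1 - mu) / (alpha + beta + 1)  # posterior variance
  c  <- -log(discount)                      # discount rate
  # Index ~ posterior mean + posterior sd * psi(v / (c * sigma^2)),
  # with sigma^2 = mu * (1 - mu) the (estimated) reward variance.
  mu + sqrt(v) * psi(v / (c * mu * (1 - mu)))
}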
policy <- GittinsBrezziLaiPolicy$new(discount=0.95, prior=NULL)
discount
numeric; discount factor
prior
numeric matrix; prior beliefs over the Bernoulli parameters governing each arm. Beliefs are specified by a Beta distribution with two parameters (alpha, beta), where alpha = number of successes and beta = number of failures. The matrix has dimensions k (number of arms) by two (alpha, beta).
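For illustration, a uniform Beta(1,1) prior over three arms could be passed as:

prior <- matrix(c(1, 1,    # arm 1: alpha = 1 success, beta = 1 failure
                  1, 1,    # arm 2
                  1, 1),   # arm 3
                nrow = 3, ncol = 2, byrow = TRUE)
policy <- GittinsBrezziLaiPolicy$new(discount = 0.95, prior = prior)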
new(discount=0.95, prior=NULL)
Generates and initializes a new Policy object.
get_action(t, context)
arguments:
t: integer, time step t.
context: list, containing the current context$X (d x k context matrix), context$k (number of arms) and context$d (number of context features).
computes which arm to play based on the current values in named list theta and the current context. Returns a named list containing action$choice, which holds the index of the arm to play.
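A hypothetical sketch of such a method body, reusing gittins_approx() from the illustration above and assuming theta stores per-arm alpha and beta counts:

get_action = function(t, context) {
  # Compute the approximate Gittins index of every arm, play the maximum.
  index <- vapply(seq_len(context$k), function(arm) {
    gittins_approx(self$theta[[arm]]$alpha, self$theta[[arm]]$beta, self$discount)
  }, numeric(1))
  action <- list(choice = which.max(index))
  action
}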
set_reward(t, context, action, reward)
arguments:
t: integer, time step t.
context: list, containing the current context$X (d x k context matrix), context$k (number of arms) and context$d (number of context features) (as set by bandit).
action: list, containing action$choice (as set by policy).
reward: list, containing reward$reward and, if available, reward$optimal (as set by bandit).
utilizes the above arguments to update and return the set of parameters in named list theta.
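A hypothetical sketch of the corresponding update, assuming Bernoulli rewards (reward$reward equal to 1 on success, 0 on failure):

set_reward = function(t, context, action, reward) {
  # Add the observed success/failure to the played arm's Beta counts.
  arm <- action$choice
  self$theta[[arm]]$alpha <- self$theta[[arm]]$alpha + reward$reward
  self$theta[[arm]]$beta  <- self$theta[[arm]]$beta  + (1 - reward$reward)
  self$theta
}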
set_parameters()
Helper function, called during a Policy's initialisation, assigns the values it finds in list self$theta_to_arms to each of the Policy's k arms. The parameters defined here can then be accessed by arm index in the following way: theta[[index_of_arm]]$parameter_name.
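For instance, a subclass following the Beta scheme above might declare (parameter names illustrative):

self$theta_to_arms <- list("alpha" = 1, "beta" = 1)
# After initialisation, each of the k arms holds its own copy, e.g.:
# self$theta[[2]]$alpha    # alpha count of arm 2
# self$theta[[2]]$beta     # beta count of arm 2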
Brezzi, M., & Lai, T. L. (2002). Optimal learning and experimentation in bandit problems. Journal of Economic Dynamics and Control, 27(1), 87-108.
Implementation follows https://github.com/elarry/bandit-algorithms-simulated
Core contextual classes: Bandit, Policy, Simulator, Agent, History, Plot
Bandit subclass examples: BasicBernoulliBandit, ContextualLogitBandit, OfflineReplayEvaluatorBandit
Policy subclass examples: EpsilonGreedyPolicy, ContextualLinTSPolicy
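A minimal end-to-end sketch with the core classes listed above; the weights, horizon and number of simulations are arbitrary illustrative choices:

library(contextual)
bandit  <- BasicBernoulliBandit$new(weights = c(0.4, 0.5, 0.6))
policy  <- GittinsBrezziLaiPolicy$new(discount = 0.95)
agent   <- Agent$new(policy, bandit)
history <- Simulator$new(agent, horizon = 1000, simulations = 10)$run()
plot(history, type = "cumulative")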