## Introduction

This post goes over a few of the commonly used approaches to exploration that focus on action selection, and shows their strengths and weaknesses.

## Why Explore?

In order for an agent to learn how to deal optimally with all possible states, it must be exposed to as many of those states as possible. Unlike in supervised learning, an agent in RL only has access to the environment through its own actions. The agent needs the right experiences to learn a good policy, but it also needs a good policy to obtain those experiences. This is the exploration-exploitation tradeoff.

## Greedy Approach

### Explanation

All RL seeks to maximize reward over time. A greedy method takes the action which the agent estimates to be the best at the current moment; this is pure exploitation. This approach can be thought of as providing little to no exploratory potential.

### Shortcomings

The problem is that it almost always arrives at a suboptimal solution. Imagine a simple two-armed bandit problem where one arm gives a reward of 1 and the other a reward of 2. If the agent's parameters are such that it chooses the former arm first, then regardless of how complex a neural network we utilize, under a greedy approach it will never learn that the latter action is more optimal.

### Implementation

```python
# Use this for action selection.
# Q_out refers to the activation from the final layer of the Q-Network.
Q_values = sess.run(Q_out, feed_dict={inputs: [state]})
action = np.argmax(Q_values)
```

## Random Approach

### Explanation

Simply always take a random action.

### Shortcomings

On its own, random action selection ignores everything the agent has learned, so it is of little use as a long-term strategy. It can, however, be useful as an initial means of sampling from the state space in order to fill an experience buffer when using a DQN.

### Implementation

```python
# Assuming we are using an OpenAI Gym environment.
action = env.action_space.sample()

# Otherwise:
# total_actions = ??
action = np.random.randint(0, total_actions)
```

## e-greedy Approach

### Explanation

A simple combination of the greedy and random approaches yields one of the most used exploration strategies. The agent chooses what it believes to be the optimal action most of the time, but occasionally acts randomly. The epsilon parameter determines the probability of taking a random action. This is the de facto default technique in RL.

### Adjusting during training

At the start of the training process the e value is initialized to a large probability, to encourage exploration. The e value is then annealed down to a small constant (e.g. 0.1), as the agent is assumed to have learned most of what it needs about the environment. A sketch of this annealing schedule is shown after the implementation below.

### Shortcomings

This method is far from optimal: it takes into account only whether an action is currently estimated to be the most rewarding or not, and when it does explore it picks uniformly at random, ignoring the relative estimated values of the other actions.

### Implementation

```python
e = 0.1
if np.random.rand(1) < e:
    # Take a random action with probability e.
    action = env.action_space.sample()
else:
    # Otherwise act greedily with respect to the current Q estimates.
    Q_values = sess.run(Q_out, feed_dict={inputs: [state]})
    action = np.argmax(Q_values)
```
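To make the annealing schedule concrete, here is a minimal sketch of a linear decay for e. It reuses the `startE`, `endE`, and `anneling_steps` names that appear in the training code later in this post; the numeric values are illustrative assumptions, not the post's settings.

```python
# Illustrative values only.
startE = 1.0             # initial exploration probability
endE = 0.1               # final exploration probability
anneling_steps = 200000  # number of steps over which e is annealed

e = startE
stepDrop = (startE - endE) / anneling_steps

# Inside the training loop, after each environment step:
if e > endE:
    e -= stepDrop
```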
The Q-Network used for the examples in this post is shown below. `Temp` is the Boltzmann temperature parameter and `keep_per` is the dropout keep probability used for Bayesian (dropout-based) exploration.

```python
import tensorflow as tf
import tensorflow.contrib.slim as slim

class Q_Network():
    def __init__(self):
        # CartPole has 4 states -> [None,4] input.
        # self.Temp is the Boltzmann (temperature) parameter.
        # self.keep_per is the Bayesian (dropout keep probability) parameter.
        self.inputs = tf.placeholder(shape=[None, 4], dtype=tf.float32)
        self.Temp = tf.placeholder(shape=None, dtype=tf.float32)
        self.keep_per = tf.placeholder(shape=None, dtype=tf.float32)

        hidden = slim.fully_connected(self.inputs, 64, activation_fn=tf.nn.tanh, biases_initializer=None)
        hidden = slim.dropout(hidden, self.keep_per)
        self.Q_out = slim.fully_connected(hidden, 2, activation_fn=None, biases_initializer=None)
        self.predict = tf.argmax(self.Q_out, 1)
        self.Q_dist = tf.nn.softmax(self.Q_out / self.Temp)

        # Below we obtain the loss by taking the sum of squares difference
        # between the target and prediction Q values.
        self.actions = tf.placeholder(shape=[None], dtype=tf.int32)
        self.actions_onehot = tf.one_hot(self.actions, 2, dtype=tf.float32)
        self.Q = tf.reduce_sum(tf.multiply(self.Q_out, self.actions_onehot), reduction_indices=1)

        self.nextQ = tf.placeholder(shape=[None], dtype=tf.float32)
        loss = tf.reduce_sum(tf.square(self.nextQ - self.Q))
        trainer = tf.train.GradientDescentOptimizer(learning_rate=0.0005)
        self.updateModel = trainer.minimize(loss)
```
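The `Temp` and `keep_per` placeholders above drive the Boltzmann and Bayesian (dropout) exploration strategies. Below is a minimal sketch of action selection for both, assuming an open session `sess`, an instantiated `q_net = Q_Network()`, and a current observation `s`; `temp` and `keep_prob` are hypothetical values chosen for illustration, not the post's exact training code.

```python
import numpy as np

# Hypothetical values for illustration only.
temp = 1.0        # Boltzmann temperature; higher -> more uniform (exploratory) sampling
keep_prob = 0.75  # dropout keep probability; values < 1.0 keep dropout active at selection time

# --- Boltzmann action selection (sketch) ---
# Sample an action in proportion to softmax(Q / Temp) rather than taking the argmax.
Q_dist = sess.run(q_net.Q_dist,
                  feed_dict={q_net.inputs: [s], q_net.Temp: temp, q_net.keep_per: 1.0})
p = Q_dist[0].astype(np.float64)
p /= p.sum()  # guard against float32 rounding when passing probabilities to NumPy
action = np.random.choice(len(p), p=p)

# --- Bayesian dropout action selection (sketch) ---
# Keep dropout active during the forward pass so the greedy choice itself is stochastic,
# reflecting the network's uncertainty.
a = sess.run(q_net.predict,
             feed_dict={q_net.inputs: [s], q_net.keep_per: keep_prob})
action = a[0]
```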
The training setup below creates a primary and a target network, the target-update operations, and an experience buffer, then anneals e from `startE` down to `endE` over the course of training.

```python
# updateTargetGraph, updateTarget, experience_buffer, tau, startE, endE,
# anneling_steps, num_episodes, and env are assumed to be defined earlier.
tf.reset_default_graph()

# Non-stationary (primary) network and a separate target network.
q_net = Q_Network()
target_net = Q_Network()

init = tf.initialize_all_variables()
trainables = tf.trainable_variables()
targetOps = updateTargetGraph(trainables, tau)
myBuffer = experience_buffer()

# Create lists to contain total rewards and steps per episode.
jList = []
jMeans = []
rList = []
rMeans = []

with tf.Session() as sess:
    sess.run(init)
    updateTarget(targetOps, sess)
    e = startE
    stepDrop = (startE - endE) / anneling_steps
    total_steps = 0
    for i in range(num_episodes):
        s = env.reset()
        rAll = 0
        d = False
        j = 0
        while j
```