State Space

$$\gdef \sam #1 {\mathrm{softargmax}(#1)}$$ $$\gdef \vect #1 {\boldsymbol{#1}} $$ $$\gdef \matr #1 {\boldsymbol{#1}} $$ $$\gdef \E {\mathbb{E}} $$ $$\gdef \V {\mathbb{V}} $$ $$\gdef \R {\mathbb{R}} $$ $$\gdef \N {\mathbb{N}} $$ $$\gdef \relu #1 {\texttt{ReLU}(#1)} $$ $$\gdef \D {\,\mathrm{d}} $$ $$\gdef \deriv #1 #2 {\frac{\D #1}{\D #2}}$$ $$\gdef \pd #1 #2 {\frac{\partial #1}{\partial #2}}$$ $$\gdef \set #1 {\left\lbrace #1 \right\rbrace} $$

Here we present the state-space topic across the different AI fields.

  1. Abstract
  2. State space introduction
  3. State space in different fields
  4. State space in Classic AI
  5. Different state transition models
  6. Different policy learning techniques - DP and MC
  7. Different policy learning techniques - TD and more
  8. Summary

Abstract

State space introduction

  • See about purposefulness here.

State space in different fields

  • In the k-Nearest-Neighbors method, the resulting grouping depends on the metric: how do you define the distance $d(x, x')$ between $x$'s? This choice is prior knowledge (see the metric sketch after this list).
  • In decision trees, the learned decision boundary is made of axis-aligned rectangles, since the model is a nested if-else tree.
  • For more about searching the hypothesis space, look in this section, as an intro to the inductive-bias topic.
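
A quick sketch of the metric point above: with the same data and query, the 1-nearest neighbour under the Euclidean ($L_2$) and Manhattan ($L_1$) distances can be a different point, which is exactly why the choice of $d(x, x')$ is prior knowledge. The points, labels, and query below are made-up illustrative values, not taken from the notes.

```python
# Minimal k-NN sketch showing that the chosen metric changes the answer.
import numpy as np

X = np.array([[1.0, 1.0], [1.7, 0.0], [3.0, 3.0], [-2.0, 2.0]])
y = np.array(["a", "b", "a", "b"])
query = np.array([0.0, 0.0])

def knn_label(query, X, y, k, dist):
    d = np.array([dist(query, x) for x in X])
    nearest = np.argsort(d)[:k]                    # indices of the k closest points
    labels, counts = np.unique(y[nearest], return_counts=True)
    return labels[np.argmax(counts)]               # majority label among the neighbours

euclidean = lambda u, v: np.linalg.norm(u - v)     # L2 distance
manhattan = lambda u, v: np.abs(u - v).sum()       # L1 distance

print(knn_label(query, X, y, k=1, dist=euclidean)) # -> "a" (the diagonal point is closer)
print(knn_label(query, X, y, k=1, dist=manhattan)) # -> "b" (the axis-aligned point is closer)
```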

State space in Classic AI

  • Some acronyms: dynamic programming (DP), temporal difference (TD), Monte Carlo (MC). Control methods: model predictive control (MPC), interpolating control (IC).
  • The reason we plan for $N$ steps while implementing only the first one, as in MPC, is that we want to reach our goal state: planning ahead makes sure the next step we actually perform moves us toward it. Since the horizon is finite, this is a sub-optimal solution; optimally, we would consider all steps until the goal state, i.e. $N=\infty$ (see the receding-horizon sketch after this list).
  • Extensions of A*: A*$\epsilon$ is a version of A* that guarantees an approximate solution within $(1+\epsilon) \times$ the optimal cost, since it may expand not only the node with the minimal $f$-value but any node whose $f$-value is within $(1+\epsilon) \times$ the minimum. IDA* (iterative-deepening A*) is a memory-bounded variation of A*: instead of storing the whole search frontier, it repeats a depth-first search with an increasing threshold on $f$. Note also that $g, h$ should be on the same scale.
  • Changing the will: in a numeric state representation we could dedicate special features to the will, which can also act as direction features, indicating where the agent prefers to move given its current will. In a function representation, the will is changed by changing the function. In a formal-logic representation, it is changed by updating the will-specific features that indicate which directions/actions are preferred. All these representations assume that the will/request is embedded in the state, similar to a prompt to an LM (e.g. ChatGPT), where we include both the given state (a description of the problem) and what we want it to do (a description of the goal).
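
To make the receding-horizon idea above concrete, here is a minimal MPC-style sketch. It assumes a toy 1-D integrator $x' = x + a$, a quadratic cost, and a brute-force search over short action sequences; all of these are illustrative assumptions, not the setup discussed in the notes.

```python
# Receding-horizon control sketch: plan N steps ahead, execute only the first action.
import numpy as np
from itertools import product

def plan(x0, goal, N, actions=np.linspace(-1.0, 1.0, 21)):
    """Brute-force search over all N-step action sequences; returns the best one."""
    best_seq, best_cost = None, np.inf
    for seq in product(actions, repeat=N):
        x, cost = x0, 0.0
        for a in seq:
            x = x + a                               # toy linear dynamics x' = x + a
            cost += (x - goal) ** 2 + 0.1 * a ** 2  # quadratic state + action cost
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq

x, goal = 0.0, 3.0
for t in range(6):
    seq = plan(x, goal, N=2)   # plan N steps ahead...
    a = seq[0]                 # ...but execute only the first action (receding horizon)
    x = x + a
    print(f"t={t}  a={a:+.2f}  x={x:.2f}")
```

With $N=\infty$ (or a horizon long enough to reach the goal) the plan would be optimal; the short horizon is what makes the solution sub-optimal but cheap to recompute at every step.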

Different state transition models

  • RL is mainly based on the material here.
  • In Markov models: we can infer forward, using variable elimination, from previous state transitions; but we can also infer from future state transitions, as in an HMM: infer $P(x_t \mid Y)$ or $P(x_t, x_{t+1} \mid Y)$, where $Y$ contains past and future observations and the states $x$ are hidden (a forward-backward sketch follows this list).
  • Another definition: Passive RL learns from recorded actions, transitions, and rewards of a fixed policy, whereas Active RL may also explore actions that are not in the current policy, trading off exploration against exploiting the current best policy, which is all Passive RL does.
  • Note that in chess, even though my action is optimal and hence deterministic, the state I face on my next turn depends on my opponent's move, so the transition is stochastic.
  • We can regard the state-to-state transition as internal dynamics, and state-to-output as external “dynamics”.
  • In control, there is a cost/objective function that determines the goal, i.e. the destination, of the control; in RL it is simply the reward (current plus discounted future rewards). The linear-quadratic regulator (LQR), for example, has linear dynamics and a quadratic cost, and it results in an offline control policy, i.e. the action $a$ is derived directly from the state. Non-linear dynamics or cost yield an online, iterative variant (iLQR = iterative LQR). Similarly, MPC, IC, and SIC are online control solutions (an LQR sketch follows this list).
  • Controlled Markov chains are called stationary if the transition kernel $P$ does not depend on time.
  • In automata theory, the state is still a state, and the action is either none or the input.
  • See also here.
  • More on Mamba: here and here.
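
The HMM inference mentioned above can be made concrete with a small forward-backward (smoothing) sketch that computes $P(x_t \mid Y)$ from both past and future observations. The transition matrix, emission matrix, prior, and observation sequence below are made-up illustrative numbers.

```python
# Forward-backward smoothing for a 2-state HMM: P(x_t | all observations Y).
import numpy as np

A  = np.array([[0.7, 0.3],    # A[i, j] = P(x_{t+1}=j | x_t=i)
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],    # B[i, k] = P(y_t=k | x_t=i)
               [0.2, 0.8]])
pi = np.array([0.5, 0.5])     # prior over the initial hidden state
Y  = [0, 0, 1, 0, 1, 1]       # observed symbols (illustrative)

T, S = len(Y), len(pi)
alpha = np.zeros((T, S))      # alpha[t, i] = P(y_1..y_t, x_t=i)
beta  = np.zeros((T, S))      # beta[t, i]  = P(y_{t+1}..y_T | x_t=i)

alpha[0] = pi * B[:, Y[0]]
for t in range(1, T):                       # forward pass: evidence from the past
    alpha[t] = (alpha[t - 1] @ A) * B[:, Y[t]]

beta[T - 1] = 1.0
for t in range(T - 2, -1, -1):              # backward pass: evidence from the future
    beta[t] = A @ (B[:, Y[t + 1]] * beta[t + 1])

gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)   # gamma[t, i] = P(x_t=i | Y)
print(np.round(gamma, 3))
```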
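
To illustrate the LQR point above, here is a minimal finite-horizon, discrete-time LQR sketch: the feedback gains are computed offline by a backward Riccati recursion, and the action is then read directly off the state as $u_t = -K_t x_t$. The dynamics and cost matrices are illustrative assumptions (a double integrator), not a system from the notes.

```python
# Finite-horizon LQR: offline Riccati recursion, then state-feedback control.
import numpy as np

A = np.array([[1.0, 1.0],     # x_{t+1} = A x_t + B u_t  (double integrator)
              [0.0, 1.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)                 # state cost  x^T Q x
R = np.array([[0.1]])         # action cost u^T R u
N = 20                        # horizon

# Backward Riccati recursion, computed offline.
P = Q.copy()
gains = []
for _ in range(N):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ K)
    gains.append(K)
gains.reverse()               # gains[t] is the gain to apply at step t

# Forward rollout: the policy u = -K x needs no further optimization at run time.
x = np.array([[5.0], [0.0]])
for t in range(N):
    u = -gains[t] @ x
    x = A @ x + B @ u
print("final state:", x.ravel())
```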

Different policy learning techniques - DP and MC

  • Based on the RL course series here.
  • More about Monte Carlo here and here.
  • Also see this chapter in this book.

Different policy learning techniques - TD and more

  • From here: “However, unlike Monte Carlo approaches that only update value estimates at the end of an episode, TD methods continually update estimates based on other estimates, a process known as bootstrapping.” (A TD(0) sketch follows this list.)
  • See about TD here and about TD in neuroscience.
  • Also see the last file about RL in some course’s folder.
  • For another comparison of the methods, see here.
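
A minimal TD(0) sketch of the bootstrapping idea in the quote above: each update moves $V(s)$ toward the target $r + \gamma V(s')$, using the current estimate of the next state instead of waiting for the end-of-episode return as Monte Carlo does. The 5-state random walk, step size, and episode count are illustrative choices.

```python
# TD(0) value estimation on a small random walk (bootstrapped updates).
import random

N_STATES = 5                     # non-terminal states 1..5; 0 and 6 are terminal
V = [0.0] * (N_STATES + 2)       # V[0] and V[6] stay 0 (terminal placeholders)
alpha, gamma = 0.1, 1.0

for episode in range(2000):
    s = 3                                    # start in the middle state
    while 0 < s < N_STATES + 1:
        s_next = s + random.choice([-1, 1])  # uniform random-walk policy
        r = 1.0 if s_next == N_STATES + 1 else 0.0   # reward only at the right end
        # Bootstrapped TD(0) update: move V[s] toward r + gamma * V[s_next]
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next

print([round(v, 2) for v in V[1:-1]])        # true values are 1/6, 2/6, ..., 5/6
```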

Summary