Model (*Forward Part*)
A Model is owned by an Algorithm. It is responsible for the entire network (the **forward part**) for a specific problem.
1. policy(self, obs)
Define the structure of networks here. Algorithm will call this method to predict probabilities of actions.
It is optional.
2. value(self, obs)
Return: values: a dict of estimated values for the current observations and states.
For example, "q_value" and "v_value".
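As a rough illustration of the two methods described above, the following sketch uses plain Python instead of PARL. The class ``ToyModel``, its weight layout and the way ``"v_value"`` is computed are hypothetical; the sketch only shows the shape of the return values (action probabilities from ``policy``, a dict of named estimates from ``value``).

```python
import math


class ToyModel:
    """Hypothetical stand-in for a model exposing policy() and value()."""

    def __init__(self, policy_weights, value_weights):
        # One weight vector per action for the policy head (assumed layout).
        self.policy_weights = policy_weights
        # A single weight vector for the value head.
        self.value_weights = value_weights

    def policy(self, obs):
        # Linear score per action, turned into probabilities with softmax.
        scores = [sum(w * x for w, x in zip(row, obs))
                  for row in self.policy_weights]
        max_score = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - max_score) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    def value(self, obs):
        # Return a dict of estimated values, keyed by name.
        v = sum(w * x for w, x in zip(self.value_weights, obs))
        return {"v_value": v}


model = ToyModel(policy_weights=[[1.0, 0.0], [0.0, 1.0]],
                 value_weights=[0.5, 0.5])
probs = model.policy([0.2, 0.2])
values = model.value([0.2, 0.2])
```

Returning a dict from ``value`` lets a caller ask for a specific estimate by name without depending on the order of the outputs.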
Getting Started
Goal of this tutorial:
- Understand PARL's abstraction at a high level
- Train an agent to solve the Cartpole problem with Policy Gradient algorithm
This tutorial assumes that you have a basic familiarity with policy gradient methods.
First, let's build a ``Model`` that predicts an action given the observation. Because PARL is an object-oriented programming framework, we build models on top of ``parl.Model`` and implement the ``forward`` function.
Here, we construct a neural network with two fully connected layers.
.. code-block:: python

    import parl
    from parl import layers

    class CartpoleModel(parl.Model):
        def __init__(self, act_dim):
            # Hidden layer size scales with the number of actions.
            hid1_size = act_dim * 10

            self.fc1 = layers.fc(size=hid1_size, act='tanh')
            self.fc2 = layers.fc(size=act_dim, act='softmax')

        def forward(self, obs):
            # Returns one probability per action.
            out = self.fc1(obs)
            out = self.fc2(out)
            return out
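To make the computation behind this model concrete, here is a minimal plain-Python sketch of the same forward pass: a ``tanh`` hidden layer followed by a ``softmax`` output layer. It does not use PARL; the helper names (``fc``, ``softmax``, ``init_params``) and the random initialization range are assumptions made only for illustration.

```python
import math
import random


def fc(inputs, weights, biases):
    # Fully connected layer: one output per (weight row, bias) pair.
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(obs, params):
    # Hidden layer with tanh, then an output layer with softmax.
    hidden = [math.tanh(h) for h in fc(obs, params["w1"], params["b1"])]
    return softmax(fc(hidden, params["w2"], params["b2"]))


def init_params(obs_dim, act_dim, rng):
    hid_size = act_dim * 10  # same hidden size rule as CartpoleModel
    return {
        "w1": [[rng.uniform(-0.1, 0.1) for _ in range(obs_dim)]
               for _ in range(hid_size)],
        "b1": [0.0] * hid_size,
        "w2": [[rng.uniform(-0.1, 0.1) for _ in range(hid_size)]
               for _ in range(act_dim)],
        "b2": [0.0] * act_dim,
    }


params = init_params(obs_dim=4, act_dim=2, rng=random.Random(0))
act_prob = forward([0.01, -0.02, 0.03, 0.04], params)
```

The output has one entry per action and sums to one, which is what allows the agent to sample from it later.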
``Algorithm`` will update the parameters of the model passed to it. In general, we define the loss function in ``Algorithm``.
In this tutorial, we solve the `Cartpole` benchmark using the `Policy Gradient` algorithm, which is already implemented in our repository.
Thus, we can use this algorithm simply by importing it from ``parl.algorithms``.
We have also published various other algorithms in PARL; please visit this page for more detail. Those who want to implement a new algorithm can follow this tutorial.
.. code-block:: python

    model = CartpoleModel(act_dim=2)
    algorithm = parl.algorithms.PolicyGradient(model, lr=1e-3)
Note that each ``algorithm`` should implement two functions:
- ``learn``
  updates the model's parameters given transition data.
- ``predict``
  predicts an action given the current environmental state.
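The two functions above can be sketched in plain Python as follows. This is not PARL's ``PolicyGradient`` implementation; it only shows the quantity that a policy gradient ``learn`` step typically minimizes (the mean of ``-log pi(a|s) * R``) and a greedy ``predict``. The function names and the toy batch are hypothetical.

```python
import math


def predict(act_prob):
    # Greedy action: index of the largest probability.
    return max(range(len(act_prob)), key=lambda a: act_prob[a])


def policy_gradient_loss(act_probs, actions, rewards):
    # Mean over the batch of -log pi(a_t | s_t) * R_t.
    losses = [-math.log(probs[a]) * r
              for probs, a, r in zip(act_probs, actions, rewards)]
    return sum(losses) / len(losses)


# Toy batch: two transitions with two possible actions each.
batch_probs = [[0.5, 0.5], [0.25, 0.75]]
batch_actions = [0, 1]
batch_rewards = [1.0, 2.0]
loss = policy_gradient_loss(batch_probs, batch_actions, batch_rewards)
greedy_action = predict(batch_probs[1])
```

In a real algorithm the loss would be differentiated with respect to the model's parameters; the sketch stops at computing the scalar.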
Now we pass the algorithm to an agent, which interacts with the environment to generate training data. Users should build their agents on top of ``parl.Agent`` and implement four functions:
- ``build_program``
  defines the fluid programs. In general, two programs are built here, one for prediction and the other for training.
- ``learn``
  preprocesses transition data and feeds it into the training program.
- ``predict``
  feeds the current environmental state into the prediction program and returns an executable action.
- ``sample``
  is usually used for exploration; it is fed the current state and returns a sampled action.
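The difference between ``sample`` and ``predict`` can be illustrated with a small plain-Python sketch. It assumes the policy outputs a list of action probabilities; the function bodies below are hypothetical and do not touch PARL or fluid.

```python
import random


def sample(act_prob, rng):
    # Exploration: draw an action index according to its probability.
    return rng.choices(range(len(act_prob)), weights=act_prob, k=1)[0]


def predict(act_prob):
    # Exploitation: always pick the most probable action.
    return max(range(len(act_prob)), key=lambda a: act_prob[a])


rng = random.Random(0)
act_prob = [0.2, 0.8]
sampled = [sample(act_prob, rng) for _ in range(1000)]
greedy = predict(act_prob)
```

During training the agent calls ``sample`` so that less likely actions are still tried occasionally; during evaluation it calls ``predict``.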
.. code-block:: python

    import numpy as np
    import paddle.fluid as fluid
    import parl
    from parl import layers

    class CartpoleAgent(parl.Agent):
        def __init__(self, algorithm, obs_dim, act_dim):
            self.obs_dim = obs_dim
            self.act_dim = act_dim
            super(CartpoleAgent, self).__init__(algorithm)

        def build_program(self):
            # One program for prediction and one for training.
            self.pred_program = fluid.Program()
            self.train_program = fluid.Program()

            with fluid.program_guard(self.pred_program):
                obs = layers.data(
                    name='obs', shape=[self.obs_dim], dtype='float32')
                self.act_prob = self.alg.predict(obs)

            with fluid.program_guard(self.train_program):
                obs = layers.data(
                    name='obs', shape=[self.obs_dim], dtype='float32')
                act = layers.data(name='act', shape=[1], dtype='int64')
                reward = layers.data(name='reward', shape=[], dtype='float32')
                self.cost = self.alg.learn(obs, act, reward)

        def sample(self, obs):
            # Add a batch dimension, then sample an action from the policy.
            obs = np.expand_dims(obs, axis=0)
            act_prob = self.fluid_executor.run(
                self.pred_program,
                feed={'obs': obs.astype('float32')},
                fetch_list=[self.act_prob])[0]
            act_prob = np.squeeze(act_prob, axis=0)
            act = np.random.choice(range(self.act_dim), p=act_prob)
            return act

        def predict(self, obs):
            # Add a batch dimension, then take the most probable action.
            obs = np.expand_dims(obs, axis=0)
            act_prob = self.fluid_executor.run(
                self.pred_program,
                feed={'obs': obs.astype('float32')},
                fetch_list=[self.act_prob])[0]
            act_prob = np.squeeze(act_prob, axis=0)
            act = np.argmax(act_prob)
            return act

        def learn(self, obs, act, reward):
            # Reshape actions to [batch, 1] to match the 'act' data layer.
            act = np.expand_dims(act, axis=-1)
            feed = {
                'obs': obs.astype('float32'),
                'act': act.astype('int64'),
                'reward': reward.astype('float32')
            }
            cost = self.fluid_executor.run(
                self.train_program, feed=feed, fetch_list=[self.cost])[0]
            return cost
Start Training
First, let's build an ``agent``. As the code below shows, we usually build a model, then an algorithm, and finally an agent.
.. code-block:: python

    # OBS_DIM is the size of the observation vector (4 for CartPole).
    model = CartpoleModel(act_dim=2)
    alg = parl.algorithms.PolicyGradient(model, lr=1e-3)
    agent = CartpoleAgent(alg, obs_dim=OBS_DIM, act_dim=2)
Then we use this agent to interact with the environment, and run around 1000 episodes for training, after which this agent can solve the problem.
.. code-block:: python

    import gym
    import numpy as np

    # logger, calc_discount_norm_reward and GAMMA are assumed to be defined
    # as in the full QuickStart training code.

    def run_episode(env, agent, train_or_test='train'):
        obs_list, action_list, reward_list = [], [], []
        obs = env.reset()
        while True:
            obs_list.append(obs)
            if train_or_test == 'train':
                action = agent.sample(obs)   # explore while training
            else:
                action = agent.predict(obs)  # act greedily while testing
            action_list.append(action)
            obs, reward, done, info = env.step(action)
            reward_list.append(reward)
            if done:
                return obs_list, action_list, reward_list

    env = gym.make("CartPole-v0")
    for i in range(1000):
        obs_list, action_list, reward_list = run_episode(env, agent)
        if i % 10 == 0:
            logger.info("Episode {}, Reward Sum {}.".format(i, sum(reward_list)))

        batch_obs = np.array(obs_list)
        batch_action = np.array(action_list)
        batch_reward = calc_discount_norm_reward(reward_list, GAMMA)

        agent.learn(batch_obs, batch_action, batch_reward)
        if (i + 1) % 100 == 0:
            # Evaluate the greedy policy every 100 episodes.
            _, _, reward_list = run_episode(env, agent, train_or_test='test')
            total_reward = np.sum(reward_list)
            logger.info('Test reward: {}'.format(total_reward))
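The training loop relies on a helper, ``calc_discount_norm_reward``, whose body is not shown in this tutorial. A plausible reading of its name is that it computes discounted returns and normalizes them; the plain-Python sketch below works under that assumption and is not the helper shipped with the example. The function name ``calc_discount_norm_reward_sketch`` and the zero-variance guard are hypothetical.

```python
import math


def calc_discount_norm_reward_sketch(reward_list, gamma):
    # Accumulate discounted returns from the last step backwards.
    returns = [0.0] * len(reward_list)
    running = 0.0
    for i in reversed(range(len(reward_list))):
        running = gamma * running + reward_list[i]
        returns[i] = running
    # Normalize to zero mean and unit standard deviation.
    mean = sum(returns) / len(returns)
    std = math.sqrt(sum((r - mean) ** 2 for r in returns) / len(returns))
    if std == 0.0:
        # All returns are equal (e.g. a one-step episode): avoid dividing by zero.
        return [0.0 for _ in returns]
    return [(r - mean) / std for r in returns]


normalized = calc_discount_norm_reward_sketch([1.0, 1.0, 1.0], gamma=0.5)
```

Normalizing the returns keeps the scale of the learning signal roughly constant across episodes of different lengths, which tends to stabilize policy gradient training.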
.. image:: ../examples/QuickStart/performance.gif
:width: 300px
.. image:: ./images/quickstart.png
:width: 300px
In this tutorial, we have shown how to build an agent step by step to solve the `Cartpole` problem.
The whole training code can be found `here <https://github.com/PaddlePaddle/PARL/tree/develop/examples/QuickStart>`_. Try it out by running the following commands:
.. code-block:: shell

    # Install dependencies
    pip install paddlepaddle
    pip install gym
    git clone https://github.com/PaddlePaddle/PARL.git
    cd PARL
    pip install .

    # Train model
    cd examples/QuickStart/
    python train.py
*PARL is a flexible, distributed and object-oriented programming reinforcement learning framework.*
+------------------------------------------+---------------------------------------+
|   **Object Oriented Programming**        |    **Distributed Training**           |
+------------------------------------------+---------------------------------------+
|.. code-block:: python                    |.. code-block:: python                 |
|                                          |                                       |
|                                          |  # Real multi-thread programming      |
|  class MLPModel(parl.Model):             |  # without the GIL limitation         |
|    def __init__(self, act_dim):          |                                       |
|      self.fc1 = layers.fc(size=10)       |  @parl.remote_class                   |
|      self.fc2 = layers.fc(size=act_dim)  |  class HelloWorld(object):            |
|                                          |      def sum(self, a, b):             |
|    def forward(self, obs):               |          return a + b                 |
|      out = self.fc1(obs)                 |                                       |
|      out = self.fc2(out)                 |  parl.connect('localhost:8003')       |
|      return out                          |  obj = HelloWorld()                   |
|                                          |  ans = obj.sum(a, b)                  |
|  model = MLPModel(act_dim=2)             |                                       |
|  target_model = copy.deepcopy(model)     |                                       |
+------------------------------------------+---------------------------------------+
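The object-oriented example in the table creates a target model with ``copy.deepcopy``. The plain-Python sketch below shows why a deep copy is useful for that purpose: the copy holds its own parameters, so later updates to the original do not affect it. ``TinyModel`` and its list of weights are hypothetical and stand in for a real ``parl.Model``.

```python
import copy


class TinyModel:
    """Hypothetical model holding its parameters in a plain list."""

    def __init__(self, weights):
        self.weights = list(weights)

    def forward(self, obs):
        return sum(w * x for w, x in zip(self.weights, obs))


model = TinyModel([1.0, 2.0])
target_model = copy.deepcopy(model)  # independent copy of the parameters

# Updating the online model leaves the target model untouched.
model.weights[0] = 10.0
online_out = model.forward([1.0, 1.0])
target_out = target_model.forward([1.0, 1.0])
```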
.. image:: ../.github/abstractions.png
:align: center
:width: 400px
| PARL aims to build an **agent** for training algorithms to perform complex tasks.
| The main abstractions introduced by PARL that are used to build an agent recursively are the following:
* **Model** is abstracted to construct the forward network which defines a policy network or critic network given state as input.
* **Algorithm** describes the mechanism to update parameters in the *model* and often contains at least one model.
* **Agent**, a data bridge between the *environment* and the *algorithm*, is responsible for data I/O with the outside environment and describes data preprocessing before feeding data into the training process.
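The nesting of the three abstractions can be sketched with ordinary Python classes. The classes below are deliberately not subclasses of ``parl.Model``, ``parl.Algorithm`` or ``parl.Agent``; their method bodies and the string-to-float preprocessing are hypothetical and only illustrate that an agent wraps an algorithm, which in turn wraps a model.

```python
class Model:
    def forward(self, obs):
        # Toy "network": score action 0 by the first feature,
        # action 1 by its negation.
        return [obs[0], -obs[0]]


class Algorithm:
    def __init__(self, model):
        self.model = model  # an algorithm owns at least one model

    def predict(self, obs):
        return self.model.forward(obs)


class Agent:
    def __init__(self, algorithm):
        self.alg = algorithm  # an agent owns an algorithm

    def predict(self, raw_obs):
        # Data preprocessing happens in the agent, before the algorithm.
        obs = [float(x) for x in raw_obs]
        scores = self.alg.predict(obs)
        return max(range(len(scores)), key=lambda a: scores[a])


agent = Agent(Algorithm(Model()))
action = agent.predict(["0.5", "1.0"])
```

Keeping preprocessing in the agent means the model and the algorithm can be reused unchanged when the environment's raw data format changes.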
Install
PARL is distributed on PyPI and can be installed with pip:
pip install parl
or install from source:
pip install --upgrade git+https://github.com/PaddlePaddle/PARL.git
