
AI Learning Types - Reinforcement Learning

In reinforcement learning, AI systems learn by interacting with their environment and receiving feedback in the form of rewards or penalties. The AI system aims to maximise the cumulative reward over time by choosing the optimal sequence of actions.

Reinforcement learning is suitable for tasks such as game AI, robotics, and autonomous vehicles.
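The interaction loop described above can be sketched in a few lines of Python. This is only an illustration: the `Corridor` environment, its reward values and the two policies are hypothetical and were invented for this example.

```python
import random


class Corridor:
    """Hypothetical toy environment: positions 0..4, starting in the middle.

    Reaching position 4 earns +1 (a reward); reaching position 0 earns -1
    (a penalty). Either event ends the episode.
    """

    def reset(self):
        self.pos = 2
        return self.pos

    def step(self, action):
        # action is -1 (move left) or +1 (move right)
        self.pos += action
        if self.pos == 4:
            return self.pos, 1.0, True
        if self.pos == 0:
            return self.pos, -1.0, True
        return self.pos, 0.0, False


def run_episode(env, policy, max_steps=50):
    """Run one episode and return the cumulative reward the agent collected."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the agent chooses an action
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward                  # feedback accumulates over time
        if done:
            break
    return total_reward


def always_right(state):
    return +1


def random_policy(state):
    return random.choice([-1, +1])


print(run_episode(Corridor(), always_right))   # reaches the goal in two steps
print(run_episode(Corridor(), random_policy))  # outcome varies from run to run
```

A learning algorithm's job is to turn something like `random_policy` into something like `always_right` using nothing but the rewards it observes.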

Examples of Reinforcement Learning

Dog training

Dogs can be trained using the same principle. A dog is given a command and, when it performs it successfully, it is rewarded with a treat, praise or petting (positive reinforcement). If the dog doesn't obey the command, the undesired behaviour is simply ignored, so it earns no reward. It may be necessary to simplify the command or break it down into smaller steps to help the dog understand what is being asked. Timing, patience, consistency and positive reinforcement are key to helping the dog understand which actions are correct.

The Stanford autonomous helicopter (Andrew Ng)

The Stanford autonomous helicopter is fitted with GPS, accelerometers, and a compass, giving it precise data about its position and orientation. Writing a program to fly such a device autonomously is challenging. Traditional supervised learning, which relies on examples that map each input directly to the correct output, becomes impractical because it's nearly impossible to specify the optimal flight behaviour in every situation.

This is where reinforcement learning offers a different approach. We let the AI fly the helicopter in a simulator, where it can crash without causing any harm. The AI has the freedom to experiment with flying the helicopter as it wishes.

When the AI executes a successful flight, we give it a positive reward, akin to saying, "well done, helicopter." Conversely, if it crashes, it receives a penalty, similar to saying, "that's not right, helicopter." The AI's task is then to learn how to maximise the 'well done, helicopter' rewards and minimise the 'that's not right, helicopter' penalties. Through reinforcement learning, Andrew Ng's team were able to develop one of the most competent autonomous helicopters in existence.
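To make the idea of learning from "well done" and "that's not right" signals concrete, here is a minimal sketch using tabular Q-learning, a standard reinforcement learning algorithm. It is not the Stanford team's method: the one-dimensional altitude simulator, the reward values and all parameter settings are assumptions made up for this illustration.

```python
import random

ACTIONS = (-1, 0, +1)      # descend, hold, climb
TARGET, CEILING = 3, 5     # hypothetical hover altitude and maximum altitude


def simulate_step(altitude, action):
    """Hypothetical simulator step: returns (new_altitude, reward, crashed)."""
    altitude = min(CEILING, altitude + action)
    if altitude <= 0:
        return 0, -10.0, True                  # "that's not right, helicopter"
    reward = 1.0 if altitude == TARGET else 0.0  # "well done, helicopter"
    return altitude, reward, False


def train(episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Learn action values by trial and error inside the simulator."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(CEILING + 1) for a in ACTIONS}
    for _ in range(episodes):
        s = rng.randint(1, CEILING)            # random starting altitude
        for _ in range(20):                    # at most 20 steps per flight
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)        # occasionally experiment
            else:
                a = max(ACTIONS, key=lambda act: q[(s, act)])
            s2, r, crashed = simulate_step(s, a)
            best_next = 0.0 if crashed else max(q[(s2, b)] for b in ACTIONS)
            # Q-learning update: move the estimate towards reward + future value
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            if crashed:
                break
            s = s2
    return q


def greedy_action(q, altitude):
    """Return the action with the highest learned value at this altitude."""
    return max(ACTIONS, key=lambda act: q[(altitude, act)])


q = train()
for altitude in range(1, CEILING + 1):
    print(altitude, greedy_action(q, altitude))
```

Because crashes only happen inside the simulator, the agent can afford to make thousands of mistakes while it works out which actions keep it near the target altitude.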

Game playing

Reinforcement learning has gained significant recognition in game-playing AI, particularly in complex games like chess. The AI system plays many games and adjusts its strategy according to whether each game ends in a win or a loss. As it continues to play, it refines its understanding of the game, keeping strategies that work and discarding those that don't. This approach has allowed AI to reach and even surpass human-level proficiency in chess, providing novel insights into strategic gameplay and offering exciting possibilities for future AI development.

One of the most notable applications of reinforcement learning in games was seen in the game of Go. In March 2016, Google DeepMind's AlphaGo defeated world champion Lee Sedol 4-1 in a historic match. AlphaGo did not rely on hand-coded Go strategies. It was first trained on records of human expert games and then improved through self-play and reinforcement learning; its successor, AlphaGo Zero (2017), learned the game from scratch through self-play alone. AlphaGo played millions of games against itself, incrementally improving as it went. The AI received positive rewards for moves that led to victory and negative rewards for moves that resulted in losses. This process allowed AlphaGo to develop novel strategies, some of which even surprised experienced Go players. This victory was a testament to the power of reinforcement learning and marked a significant milestone in artificial intelligence research.
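The self-play idea can be illustrated on a far smaller game. The sketch below is not how AlphaGo works (AlphaGo combines deep neural networks with tree search); it is a hypothetical toy in which one table of move values plays both sides of a simple stone-taking game. Players alternately take 1, 2 or 3 stones, and whoever takes the last stone wins. All names and parameter values are assumptions chosen for illustration.

```python
import random


def legal_moves(pile):
    return [n for n in (1, 2, 3) if n <= pile]


def train_self_play(games=20000, alpha=0.1, epsilon=0.2, max_pile=12, seed=0):
    """Play the game against itself and learn from wins and losses."""
    rng = random.Random(seed)
    q = {}  # (pile, stones_taken) -> estimated outcome for the player moving
    for _ in range(games):
        pile = rng.randint(1, max_pile)
        history = ([], [])                 # moves made by player 0 and player 1
        player = 0
        while pile > 0:
            moves = legal_moves(pile)
            if rng.random() < epsilon:
                move = rng.choice(moves)   # try something new
            else:
                move = max(moves, key=lambda m: q.get((pile, m), 0.0))
            history[player].append((pile, move))
            pile -= move
            if pile == 0:
                winner = player            # this player took the last stone
            player = 1 - player
        # Reward every move of the winner with +1 and of the loser with -1
        for p in (0, 1):
            outcome = 1.0 if p == winner else -1.0
            for key in history[p]:
                old = q.get(key, 0.0)
                q[key] = old + alpha * (outcome - old)
    return q


def best_move(q, pile):
    """Return the move with the highest learned value for this pile size."""
    return max(legal_moves(pile), key=lambda m: q.get((pile, m), 0.0))


q = train_self_play()
for pile in (5, 6, 7):
    print(pile, best_move(q, pile))
```

In this game the winning strategy is to leave the opponent a multiple of four stones. Nobody tells the program that rule; it can only emerge from the win and loss signals.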


Autonomous trading

Reinforcement Learning (RL) is used to create autonomous trading algorithms. Starting with no knowledge of market dynamics, the RL agent learns through trial and error, using historical market data. Each action leads to a profit or loss, providing the agent with positive or negative reinforcement. The agent's ultimate goal is to develop a policy that maximises the sum of these rewards over time. This involves a careful balance between exploration (trying new strategies) and exploitation (sticking to current best strategies). As market dynamics constantly change over time, the RL agent has to continuously adapt its policy based on new data.
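The balance between exploration and exploitation can be sketched with an epsilon-greedy rule, one common and simple way to handle it. This is not a trading system: the three "strategies" and their average payoffs are invented numbers, and real markets are far noisier and change over time.

```python
import random


def epsilon_greedy_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Repeatedly choose between strategies with unknown average payoffs.

    true_means holds the hypothetical average payoff of each strategy.
    The agent never sees these values; it only sees noisy rewards.
    """
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n          # how often each strategy was tried
    estimates = [0.0] * n     # running average payoff per strategy
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                            # explore
        else:
            arm = max(range(n), key=lambda i: estimates[i])   # exploit
        reward = rng.gauss(true_means[arm], 1.0)              # noisy payoff
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return estimates, counts, total


estimates, counts, total = epsilon_greedy_bandit([0.0, 0.5, 1.5])
print(counts)     # how often each strategy was chosen
print(estimates)  # the agent's learned payoff estimates
```

With `epsilon=0.1` the agent spends roughly a tenth of its time trying strategies at random. Setting it to zero risks locking on to a poor strategy early, while setting it too high wastes effort on strategies already known to be weak.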

An example of an AI-powered trading program is JPMorgan's LOXM, which reportedly uses reinforcement learning to execute client orders at optimal speed and price, having learned from millions of executed trades. Open-source toolkits such as OpenAI's Gym provide a framework for developing and comparing RL algorithms, and community-built environments such as 'gym-trading' build on it for creating trading bots.

Rigorous design, evaluation, and risk management are paramount when deploying RL-based trading algorithms as the agent's learning process can incur losses, and the influence of numerous factors on market dynamics may not be captured by historical data alone.

Designing an effective reward mechanism

To guide the system's learning process, developers need to construct a reward mechanism that gives the AI feedback it can use to learn and improve its performance. The challenge lies in designing rewards that lead the AI system to learn the right things over time, rather than behaviour that merely scores well.
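As a rough illustration of what reward design involves, the sketch below contrasts two hypothetical reward functions for an agent moving along a line towards a goal. The function names, the progress bonus of 0.1 and the step cost of 0.01 are all assumptions chosen for this example.

```python
def sparse_reward(position, goal):
    """+1 only when the goal is reached; no feedback otherwise."""
    return 1.0 if position == goal else 0.0


def shaped_reward(prev_position, position, goal, step_cost=0.01):
    """Goal reward plus a small bonus for getting closer, minus a step cost."""
    # Positive when the agent moved towards the goal, negative when it moved away
    progress = abs(goal - prev_position) - abs(goal - position)
    return sparse_reward(position, goal) + 0.1 * progress - step_cost


print(sparse_reward(3, 5))      # no signal yet: the goal has not been reached
print(shaped_reward(3, 4, 5))   # small positive signal for moving closer
print(shaped_reward(4, 3, 5))   # small negative signal for moving away
```

The sparse version says exactly what we want but gives the agent very little to learn from. The shaped version gives feedback on every step, but a badly chosen bonus can teach the agent to chase the bonus instead of the goal.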

AI without external rewards

In the real world, there aren't many explicit rewards for animals, but they still learn and make decisions based on their intrinsic motivations, such as joy, fear, or hunger. These internal drives guide animals even in the absence of external rewards. Similarly, unsupervised learning in AI can be seen as an attempt to mimic this intrinsic motivation. AI systems using unsupervised learning try to identify patterns, structures, or correlations in data without any explicit guidance, allowing them to adapt and make decisions even when explicit rewards are not available.

May 10, 2023
