and mobile nodes. The source node must be aware of its
own location and the location of its final destination.
There are also some routing protocols based on Machine
Learning techniques. Q-Routing, designed for packet routing
in telephone networks, is one of the earliest such works and is
based on the Q-learning technique [21]. It finds the best
routing path by considering not only the number of hops to
the destination but also the contribution of congestion, and
hence of each node's queue length, to the latency.
simulation shows that under high network load, the
algorithm outperforms the shortest-path algorithm, which
considers only the number of hops to the destination. It also
performs well under changing network topologies. Since it
uses only local information at each node and is fully
distributed, it can be adapted to wireless networks as well.
An algorithm named Dual Reinforcement Q-Routing was later
presented to make the algorithm converge faster [22].
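To make the update concrete, the following is a minimal sketch of the Q-Routing rule in [21], written in Python; the data structures, function names, and learning-rate value are our own illustrative assumptions rather than part of the original algorithm.

    # Minimal sketch of the Q-Routing update rule from [21]; names are
    # illustrative. Q[d][y] at node x estimates the total time to deliver
    # a packet bound for destination d when forwarded via neighbor y.

    def best_estimate(Q, d):
        """Neighbor minimizing the estimated delivery time to d, and that
        estimate (min_z Q[d][z]); y reports this value back upstream."""
        y = min(Q[d], key=Q[d].get)
        return y, Q[d][y]

    def q_routing_update(Q, d, y, queue_delay, tx_delay, t_neighbor, eta=0.5):
        """After x forwards a packet for d to neighbor y, y reports back
        t_neighbor = min_z Q_y[d][z]; x then moves its estimate toward the
        observed queueing delay + transmission delay + y's estimate."""
        target = queue_delay + tx_delay + t_neighbor
        Q[d][y] += eta * (target - Q[d][y])

Because the update uses only the neighbor's reported estimate and locally observed delays, each node can run it independently, which is what makes the scheme fully distributed.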
AdaR [23] is a self-configuring routing protocol for
wireless sensor networks based on reinforcement learning.
When a node forwards a packet to another according to the
current routing policy, both the action and the associated
reward are appended to the packet. When the base station
receives the packet, which then contains samples collected
along the routing path, it calculates new weights for the
linear quality function (Q) and broadcasts them throughout
the whole network to improve the forwarding policy.
However, learning the environment in this way requires
delivering packets from the sensors to a centralized base
station, which is costly in the context of UWSNs.
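The centralized step described above can be sketched as follows; the feature encoding, function names, and the LSTD-style least-squares fit are our assumptions for illustration, not necessarily AdaR's exact formulation.

    import numpy as np

    # Hedged sketch of a base-station update for a linear Q-function,
    # Q(s, a) = w . phi(s, a): each hop appends a (features, reward) sample
    # to the packet, the base station refits w and broadcasts it.

    def refit_weights(samples, gamma=0.9):
        """samples: (phi_sa, reward, phi_next) triples carried by one packet,
        where phi_next encodes the next hop's greedy state-action pair."""
        k = len(samples[0][0])
        A = np.zeros((k, k))
        b = np.zeros(k)
        for phi_sa, r, phi_next in samples:
            A += np.outer(phi_sa, phi_sa - gamma * phi_next)  # normal equations
            b += r * phi_sa
        return np.linalg.lstsq(A, b, rcond=None)[0]  # new weights to broadcast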
Ant colony optimization (ACO) is another Machine Learning
technique used in routing for mobile ad-hoc networks. ACO
is inspired by the behavior of ants, which exchange information
with each other by laying down pheromone on their way
back to the nest after they have found food. In ACO
routing protocols, packets act like artificial ants that deposit
pheromone on the nodes they visit from the source (nest) to
the sink (food), so that subsequent packets converge on
near-optimal paths. AntNet [24] is a proactive ACO-based
algorithm in which every node in the network generates a
forward ant packet at regular intervals. After the forward
ant reaches the destination, a backward ant packet is
dispatched and sent all the way back. ARA [25] is a reactive
ACO-based algorithm that divides the routing process into a
route discovery phase and a route maintenance phase. Both
protocols achieve high delivery rates and near-shortest paths.
However, they also inherit the disadvantages of
proactive and reactive routing protocols: as a proactive
protocol, AntNet introduces considerable overhead because
nodes send ant packets at regular intervals, while ARA
suffers from a long route discovery phase. POSANT, a
geographic ACO-based routing protocol, is presented in [26];
however, its assumption that GPS is available does not
hold in UWSNs.
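For concreteness, the generic ACO forwarding and reinforcement steps can be sketched as below; this is not the exact AntNet [24] or ARA [25] rule, and all names and parameter values are illustrative.

    import random

    # Generic ACO forwarding sketch: a node picks the next hop with
    # probability proportional to pheromone^alpha, and a backward ant
    # reinforces the pheromone along the path it retraces.

    def choose_next_hop(pheromone, neighbors, alpha=1.0):
        """pheromone[n] is the pheromone level toward the sink via n."""
        weights = [pheromone[n] ** alpha for n in neighbors]
        return random.choices(neighbors, weights=weights, k=1)[0]

    def reinforce_path(pheromone, path, deposit=1.0, evaporation=0.1):
        """Backward ant: evaporate all entries, then deposit on the path."""
        for n in pheromone:
            pheromone[n] *= 1.0 - evaporation
        for n in path:
            pheromone[n] = pheromone.get(n, 0.0) + deposit

Evaporation lets the network forget stale routes, which is why ACO schemes cope with topology changes at the cost of the control traffic noted above.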
1.2 Paper Overview and Contributions
In this paper, we propose an adaptive, energy-efficient,
and lifetime-aware routing protocol, QELAR, to address
the issues mentioned above. QELAR is based on Q-learning,
a reinforcement learning technique for solving sequential
decision problems. By learning the environment and
evaluating an action-value function (the Q-value), which
gives the expected reward of taking an action in a given
state, the distributed learning agent is able to make
decisions automatically.
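For concreteness, the standard one-step Q-learning backup underlying this framework is sketched below in its textbook form; QELAR's specific states, actions, and rewards are defined later in the paper, and the names and parameter values here are illustrative.

    from collections import defaultdict

    Q = defaultdict(float)  # Q[(state, action)] -> estimated expected reward

    def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
        """One-step Q-learning backup:
        Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
        best_next = max((Q[(s_next, a2)] for a2 in actions), default=0.0)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])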
We find that Q-learning performs well in UWSNs in the
following respects, exploiting or overcoming the important
properties of UWSNs:
- Low Overhead. In the underwater environment, with its
low bandwidth, high delay, and high acoustic communication
power consumption, broadcasting data throughout the
network is costly. Our proposed routing protocol is a hybrid
of proactive and reactive protocols: each node keeps routing
information only for its direct neighbors, a small subset of
the network, so the drawbacks of both proactive and reactive
routing protocols are avoided. The routing information is
updated by one-hop broadcasts rather than flooding.
- Dynamic Network Topology. In the harsh underwater
environment, link quality is not guaranteed. A node within
transmission range may become unavailable due to link
corruption, which results in topology changes; the movement
of nodes causes topology changes as well. With no GPS
available underwater, a routing protocol in such an
environment must be able to adapt to these changes. As we
show later, the Q-learning algorithm learns from the network
environment and adapts quickly to the current topology.
- Load Balance. Shortest-path routing algorithms reduce
both energy consumption and end-to-end delay by forcing
packets along the shortest path. However, nodes on the
shortest path drain their energy more quickly than others,
so shortest-path algorithms may reduce the lifetime of the
network. In contrast, once node energy is taken into
consideration in Q-learning, alternative paths can be chosen
so that network nodes are used in a fair manner.
- General Framework. Q-learning is a framework that
can be easily extended. Its behavior is largely determined
by the reward function, which gives a positive or negative
reinforcement to the agent after it makes a decision. By
judiciously designing the reward function, we make our
routing protocol energy-efficient and lifetime-aware: the
optimal policy (the selection of routing forwarders) is
derived through iterative Q-learning of the residual energy
of each node and the energy distribution among a group of
nodes (a hedged sketch of such a reward function follows
this list). Meanwhile, other factors such as end-to-end delay
and node density can easily be integrated as extensions, and
all the factors can be balanced according to our needs by
tuning the parameters in the reward function.
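As a hedged illustration of the General Framework point above, the sketch below shows one plausible shape for such a reward function; the terms, names, and weights are hypothetical stand-ins, not QELAR's actual reward function, which is derived later in the paper.

    def reward(hop_cost, residual_energy, group_energies, w1=1.0, w2=1.0):
        """Hypothetical per-hop reward (energies normalized to [0, 1]):
        penalize the hop itself, penalize forwarders with little residual
        energy, and penalize energy imbalance within the forwarder's
        neighborhood (group_energies)."""
        mean_e = sum(group_energies) / len(group_energies)
        imbalance = abs(residual_energy - mean_e)
        return -hop_cost - w1 * (1.0 - residual_energy) - w2 * imbalance

Tuning w1 and w2 corresponds to the parameter tuning mentioned in the General Framework item: larger values shift the policy away from the shortest path toward energy-balanced routes.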
More specifically, in our Q-learning-based routing
algorithm, the energy consumption of sensor nodes and the
residual-energy distribution among groups of nodes are
incorporated into the reward function as novel considerations.
With such a reward function, the forwarding policy can be
improved at runtime so as to achieve high energy efficiency
and a uniform energy distribution, and thus prolong the
network lifetime. The effects of different weights in the