Research design and methodology
In this work, we consider a task in which both a robot and a human must search for an object on a table whilst deprived of vision and hearing. The robot and the human have prior knowledge of the environmental setup, making this a specific search problem with no required mapping of the environment, also known as active localisation. In Figure 1, a human has his senses of vision and hearing impeded, making the perception of the environment partially observable and leaving only the sense of touch available for solving the task. Before each demonstration, the human volunteer is disoriented. His translational position is varied with respect to the table, although his heading remains the same (facing the table), leaving no uncertainty on his orientation. The disorientation of the human subject ensures that his believed location is uniform. At the first time step, the human’s state of mind can be considered observable. All subsequent beliefs can then be recursively estimated from the initial belief. The hearing sense is also impeded since it can facilitate localisation when no visual information is available, and the robot has no equivalent, which would give an unfair advantage to the human. By impeding hearing, we align the perceptual correspondence between the human and the robot.
It is nontrivial to have a robot learn the behaviour exhibited by humans performing this task. As we cannot encapsulate the true complexity of human thinking, we take a simplistic approach and model the human’s state through two variables, namely the human’s uncertainty about his current location and his belief of his position. The various strategies adopted by humans are modelled by building a mapping from the state variables to actions, which are the motions of the human arm. Aside from the problem of correctly approximating the belief and its evolution over time, the model needs to take into consideration that people behave very differently given the same situation. As a result, it is not a single strategy that will be transferred but rather a mixture of strategies. While this provides the robot with a rich portfolio of search strategies, appropriate methods must be developed to encode these, at times contradictory, strategies. This leads to the main scientific questions we seek to address in this work:
Do humans exhibit particular search strategies, and if so, is it feasible to learn them?
How well does a statistical controller learned from human demonstrations perform with respect to approaches which do not take into account the uncertainty directly?
Experimental setup
In the experimental setup, a group of 15 human volunteers were asked to search for a green wooden block located at a fixed position on a bare table (see Figure 2, top left). Each participant repeated the experiment ten times from each of four mean starting points with an associated small variance. The starting positions were given with respect to the location of the human’s hand (all participants were right-handed). The humans were always facing the table with their right arm stretched out in front of them. The position of their hand was then either in front of, to the left of, to the right of, or in contact with the table itself.
As covered in the ‘Background’ section, previous work has taken a probabilistic Bayesian approach to model the beliefs and intent of humans. A key finding was that humans update their beliefs using Bayes’ rule (shown so far in the discrete case). We make a similar assumption and represent the human’s location belief (where he thinks he is) by a particle filter, which is a point-mass representation of a probability density function. There is no way of knowing the human’s true belief. We therefore make the critical assumption that the belief is observable in the first time step of the search, and all subsequent beliefs are assumed correct through the application of Bayesian integration. The belief is always initialised to be uniformly distributed over the top of the table (see Figure 2, top right), and the starting position of the human’s hand is always in this area.
Before each trial, the participant was told that he/she would always be facing the same direction with respect to the table (so always facing the goal, as in the case of a door), but his/her translational starting position would vary. For instance, the table might not always be directly in front of the person, and his/her distance to the edge or corner could be varied. In Figure 2 (bottom left), we illustrate four representative recorded searches, whilst in the bottom right, we illustrate a set of trajectories which all started from the same region. One interesting aspect is the diversity present, demonstrating clearly that humans behave differently given the same situation.
Formulation
In the standard PbD formulation of this problem, a parametrised function is learned, mapping from the state $x_t$, which denotes the current position of the demonstrator’s hand, to $\dot{x}_t$, the hand’s displacement. In our case, since the environment is partially observable, we instead have a belief, a probability density function $p(x_t \mid z_{0:t})$ over the state space, conditioned on all sensing information $z$ (the subscript $0{:}t$ indicates the time slice ranging from $t=0$ to the current time $t$). We seek to learn this mapping, $f: p(x_t \mid z_{0:t}) \mapsto \dot{x}_t$, from demonstrations. During each demonstration, we record a set of variables consisting of the following:

1. $\dot{x}_t \in \mathbb{R}^3$, the velocity of the hand in Cartesian space, which is normalised.

2. $\hat{x}_t = \arg\max_{x_t} p(x_t \mid z_{0:t})$, the most likely position of the end effector, or believed position.

3. $U_t$, the level of uncertainty, which is the entropy of the belief: $H(p(x_t \mid z_{0:t}))$.
A statistical controller was learned from a data set of triples $\{(\dot{x}, \hat{x}, U)\}$, and a desired direction (normalised velocity) was obtained by conditioning on the belief and uncertainty.
Having described the experiment, we proceed to give an in-depth description of the mathematical representation of the belief, the sensing and motion models, and the uncertainty.
Belief model
A human’s belief of his location in an environment can be multimodal or unimodal, Gaussian or non-Gaussian, and may change from one distribution to another. We chose a particle filter to be able to represent such a wide range of probability distributions. A particle filter is a Bayesian probabilistic method which recursively integrates dynamics and sensing to estimate a posterior from a prior probability density. The particle filter has two elements: the first estimates a distribution over the possible next state given the dynamics, and the second corrects it by integrating sensing. Given a motion model $p(x_t \mid x_{t-1}, \dot{x}_t)$ and a sensing model $p(z_t \mid x_t)$, we recursively apply a prediction phase, where we incorporate motion to update the state, and an update phase, where the sensing data is used to compute the state’s posterior distribution. The two steps are depicted below:
$$p(x_t \mid z_{0:t-1}) = \int p(x_t \mid x_{t-1}, \dot{x}_t)\, p(x_{t-1} \mid z_{0:t-1})\, dx_{t-1}$$
(1)
$$p(x_t \mid z_{0:t}) = \frac{p(z_t \mid x_t)\, p(x_t \mid z_{0:t-1})}{p(z_t \mid z_{0:t-1})}$$
(2)
The probability distribution over the state, $p(x_t \mid z_{0:t})$, is represented by a set of weighted particles: hypothetical locations of the end effector whose density is proportional to the likelihood. The particular particle filter used was the regularised sequential importance sampling filter [23] (p. 182). Previous literature [19] has shown that there is a similarity between the Bayes update rule and the way humans integrate information over time. Under this assumption, we hypothesise that if the initial belief of the human is known, then the successive update steps of the particle filter should correspond to a good approximation of the subsequent beliefs.
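The prediction and update steps (Equations 1 and 2) can be sketched as a simple particle filter. This is a minimal illustration, assuming an additive motion model with Gaussian noise and a generic likelihood function; the regularisation step of the regularised SIS filter [23] is omitted for brevity, and the noise magnitude is a placeholder:

```python
import numpy as np

def predict(particles, velocity, dt=1.0, motion_noise=0.005):
    """Prediction phase (Eq. 1): propagate each particle through the
    motion model x_t = x_{t-1} + dx_t * dt, plus Gaussian noise."""
    return particles + velocity * dt + \
        np.random.normal(0.0, motion_noise, particles.shape)

def update(particles, weights, likelihood_fn):
    """Update phase (Eq. 2): reweight particles by the sensing
    likelihood p(z_t | x_t) and renormalise."""
    weights = weights * likelihood_fn(particles)
    return weights / weights.sum()

def resample(particles, weights):
    """Systematic resampling to avoid particle degeneracy."""
    n = len(weights)
    positions = (np.arange(n) + np.random.rand()) / n
    idx = np.minimum(np.searchsorted(np.cumsum(weights), positions), n - 1)
    return particles[idx], np.full(n, 1.0 / n)
```

In practice the likelihood function would be the JSD-based sensing model described below; any non-negative function of the particle positions fits this interface.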
Sensing and motion model
Sensing model. The sensing model gives the likelihood, $p(z_t \mid x_t)$, of a particular sensation $z_t$ given a position $x_t \in \mathbb{R}^3$. In a human’s case, the sensation of a curvature indicates the likelihood of being near an edge or a corner. However, the likelihood cannot be modelled directly from the human’s sensing information: direct access to pressure, temperature and other salient information is not available. Instead, real sensory information needs to be matched against a virtual sensation at each hypothetical location $x_t$ of a particle. Additionally, for the transfer of behaviour from human to robot to be successful, the robot should be able to perceive the same information as the human, given the same situation. An approximation of what a human or robot senses can be inferred from the end effector’s distance to particular features in the environment. In our case, four main features are present, namely corners, edges, surfaces and an additional dummy feature defining no contact, air. The choice of these features is prior knowledge given to our system and not extracted through statistical analysis of recorded trajectories. The sensing vector is $z_t = [p_c, p_e, p_s, p_a]$, where $p$ refers to probability and the subscript corresponds to the first letter of the associated feature. In Equation 3, the sensing function, $h(x_t, x_c)$, returns the probability of sensing a corner, where $x_c \in \mathbb{R}^3$ is the Cartesian position of the corner closest to $x_t$.
$$p_c = h(x_t, x_c; \beta) = \exp\left(-\left(\beta \cdot \| x_t - x_c \|\right)^2\right)$$
(3)
The exponential form of the function, h, allows the range of the sensor to be reduced. We set β>0 such that any feature which is more than 1 cm away from the end effector or hand has a probability close to zero of being sensed. The same sensing function is repeated for all feature types.
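A sketch of the sensing vector computation follows. The value of $\beta$ here (around 300 m$^{-1}$, so that the probability is near zero beyond roughly 1 cm) and the `nearest` dictionary of closest feature points are illustrative assumptions, as is treating the no-contact ‘air’ entry as the complement of the strongest contact feature:

```python
import numpy as np

def feature_probability(x, x_feature, beta=300.0):
    """Eq. 3: p = exp(-(beta * ||x - x_feature||)^2). With beta ~300 1/m
    (assumed), features further than ~1 cm are sensed with probability
    close to zero."""
    d = np.linalg.norm(np.asarray(x, float) - np.asarray(x_feature, float))
    return np.exp(-(beta * d) ** 2)

def sensing_vector(x, nearest):
    """Build z_t = [p_c, p_e, p_s, p_a] from the nearest corner, edge and
    surface points; 'air' is a dummy no-contact feature (assumption:
    complement of the strongest contact probability)."""
    p_c = feature_probability(x, nearest['corner'])
    p_e = feature_probability(x, nearest['edge'])
    p_s = feature_probability(x, nearest['surface'])
    p_a = max(0.0, 1.0 - max(p_c, p_e, p_s))
    return np.array([p_c, p_e, p_s, p_a])
```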
The sensing model takes into account the inherent uncertainty of the sensing function (3) and gives the likelihood, $p(z_t \mid x_t)$, of a position. Since the range of sensing is extremely small and the entries are probabilistic, we assume no noise in the sensor measurement. The likelihood of a hypothetical location, $x_t$, is related to the Jensen-Shannon divergence (JSD), $p(z_t \mid x_t) = 1 - \mathrm{JSD}(z_t \,\|\, \hat{z}_t)$, between the true sensing vector, $z_t$, obtained by the agent and the hypothetical sensation $\hat{z}_t$ generated at the location of a particle.
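The JSD-based likelihood can be sketched as follows. Using base-2 logarithms bounds the divergence in $[0, 1]$, so the likelihood falls in $[0, 1]$ as required; normalising the sensing vectors into distributions before comparison is an assumption of this sketch:

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence with base-2 logs, so JSD is in [0, 1].
    Inputs are normalised into probability distributions (assumption)."""
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / (b[mask] + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def likelihood(z_true, z_hyp):
    """p(z_t | x_t) = 1 - JSD(z_t || z_hat_t): identical sensations give
    likelihood ~1, maximally different ones give ~0."""
    return 1.0 - jsd(np.asarray(z_true, float), np.asarray(z_hyp, float))
```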
Motion model. The motion model is straightforward compared with the sensing model. In the robot’s case, the Jacobian gives the next Cartesian position given the current joint angles and the angular velocity of the robot’s joints. From this, the motion model is given by $p(x_t \mid x_{t-1}, \dot{x}_t)$ with $\dot{x}_t = J(q)\,\dot{q} + \epsilon$, where $q$ is the angular position of the robot’s joints, $J(q)$ is the Jacobian and $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ is white noise. The robot’s motion is very precise and its noise variance is very low. For humans, the motion model is the velocity of the hand movement provided by the tracking system.
Uncertainty
In a probability distribution framework, entropy is used to represent uncertainty. It is the expectation of a random variable’s total amount of unpredictability. The higher the entropy, the greater the uncertainty; likewise, the lower the entropy, the lesser the uncertainty. In our context, a set of weighted samples $\{w_i, x_i\}_{i=1,\dots,N}$ replaces the true probability density function of the belief, $p_u(x_t \mid z_{0:t})$. A reconstruction of the underlying probability density is achieved by fitting a Gaussian mixture model (GMM) (Equation 4) to the particles,
$$p_u(x_t \mid z_{0:t}\,;\{\pi, \mu, \Sigma\}) = \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(x_t\,; \mu_k, \Sigma_k)$$
(4)
where $K$ is the number of Gaussian components, the scalar $\pi_k$ represents the weight associated with the mixture component $k$ (indicating the component’s overall contribution to the distribution) and $\sum_{k=1}^{K} \pi_k = 1$. The parameters $\mu_k$ and $\Sigma_k$ are the mean and covariance of the normal distribution $k$.
The main difficulty here is determining the number of parameters of the density function in a computationally efficient manner. We approach this problem by finding all the modes in the particle set via mean-shift hill climbing and setting these as the means of the Gaussian functions. Their covariances are then determined by maximising the likelihood of the density function via expectation-maximisation (EM).
Given the estimated density, we can compute the upper bound of the differential entropy [24], H, which is taken to be the uncertainty U,
$$H\left(p_u(x_t \mid z_{0:t}\,;\{\pi,\mu,\Sigma\})\right) = \sum_{k=1}^{K} \pi_k \left(-\log \pi_k + \frac{1}{2}\log\left((2\pi e)^D\,|\Sigma_k|\right)\right)$$
(5)
where e is the base of the natural logarithm and D the dimension (being 3 in our case).
The reason for using the upper bound is that the exact differential entropy of a mixture of Gaussian functions has no analytical solution. When computing both the upper and lower bounds, we found that the difference between the two was insignificant, making either bound a good approximation of the true entropy. The believed location of the robot’s/human’s end effector is taken to be the mean of the Gaussian component with the highest weight $\pi_k$.
$$\hat{x}_t = \arg\max_{x_t}\, p_u(x_t \mid z_{0:t}\,;\{\pi,\mu,\Sigma\}) = \mu_{k^*}, \qquad k^* = \arg\max_k \pi_k$$
(6)
Figure 3 depicts different configurations of the modes (clusters) and believed position of the end effector (indicated by a yellow arrow).
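The uncertainty computation of Equations 5 and 6 can be sketched directly, assuming the mixture parameters $\{\pi, \mu, \Sigma\}$ have already been fitted to the particle set:

```python
import numpy as np

def entropy_upper_bound(pi, sigmas, D=3):
    """Eq. 5: upper bound on the differential entropy of a GMM,
    H <= sum_k pi_k * (-log pi_k + 0.5 * log((2*pi*e)^D * |Sigma_k|))."""
    H = 0.0
    for pi_k, S_k in zip(pi, sigmas):
        H += pi_k * (-np.log(pi_k)
                     + 0.5 * np.log((2 * np.pi * np.e) ** D
                                    * np.linalg.det(S_k)))
    return H

def believed_position(pi, mus):
    """Eq. 6: the believed location is the mean of the component
    with the largest mixture weight."""
    return mus[int(np.argmax(pi))]
```

For a single standard normal component ($K=1$, $\Sigma = I$), the bound reduces to the exact Gaussian entropy $\frac{D}{2}\log(2\pi e)$, which is a convenient sanity check.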
Model of human search
During the experiments, the recorded trajectories showed that different actions are present for the same belief and uncertainty, making the data multimodal (for a particular position and uncertainty, different velocities are present). That is, multiple actions are possible given a specific belief. This results in a one-to-many mapping, which is not a valid function, eliminating any regression technique which directly learns a nonlinear function. To accommodate this fact, we again made use of a GMM to model the human’s demonstrated searches, $\{(\dot{x}, \hat{x}, U)\}$. Using statistical models to encode control policies in robotics is quite common (see [25]).
By normalising the velocity, the amount of information to be learned was reduced. We also took into consideration that velocity is specific to embodiment capabilities: the robot might not be able to safely reproduce some of the demonstrated velocity profiles.
The training data set comprised a total of 20,000 triples $(\dot{x}, \hat{x}, U)$ from the 150 trajectories gathered from the demonstrators. The fitted GMM $p_s(\dot{x}, \hat{x}, U)$ had a total of seven dimensions: three for direction, three for position and one scalar for uncertainty. The definition of the GMM is presented in Equation 7:
$$p_s(\dot{x}, \hat{x}, U\,;\{\pi,\mu,\Sigma\}) = \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(\dot{x}, \hat{x}, U\,; \mu_k, \Sigma_k)$$
(7)
$$\mu_k = \begin{bmatrix} \mu_{\dot{x}} \\ \mu_{\hat{x}} \\ \mu_{U} \end{bmatrix} \qquad \Sigma_k = \begin{bmatrix} \Sigma_{\dot{x}\dot{x}} & \Sigma_{\dot{x}\hat{x}} & \Sigma_{\dot{x}U} \\ \Sigma_{\hat{x}\dot{x}} & \Sigma_{\hat{x}\hat{x}} & \Sigma_{\hat{x}U} \\ \Sigma_{U\dot{x}} & \Sigma_{U\hat{x}} & \Sigma_{UU} \end{bmatrix}$$
Given this generative representation of the humans’ demonstrated searches, we proceeded to select the parameters necessary to correctly represent the data. This step is known as model selection, and we used the Bayesian information criterion (BIC) to evaluate each set of parameters, which were optimised via EM.
A total of 83 Gaussian functions were used in the final model, 67 for trajectories on the table and 15 for those in the air. In Figure 4 (left), we illustrate the model learned from human demonstrations, where we plot the three-dimensional slice (the position) of the seven-dimensional GMM to give a sense of the size of the model.
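BIC-driven model selection of this kind can be sketched with scikit-learn’s `GaussianMixture`. This is an illustration only: it uses sklearn’s default EM initialisation rather than the mean-shift seeding described above, and the candidate range of $K$ is a placeholder:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_gmm(data, k_range=range(1, 16), seed=0):
    """Fit a full-covariance GMM by EM for each candidate number of
    components K and keep the model with the lowest BIC."""
    best, best_bic = None, np.inf
    for k in k_range:
        gmm = GaussianMixture(n_components=k, covariance_type='full',
                              random_state=seed).fit(data)
        bic = gmm.bic(data)
        if bic < best_bic:
            best, best_bic = gmm, bic
    return best
```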
Coastal navigation
Coastal navigation [26] is a path planning method in which the objective function (Equation 8) is composed of two terms.
$$f(x_{0:T}) = \sum_{t=0}^{T} \lambda_1 \cdot c(x_t) + \lambda_2 \cdot I(x_t)$$
(8)
The first term, $c(x_t)$, is the traditional ‘cost to go’, which penalises every step taken so as to ensure that the optimal path is the shortest; its value was simply set to 1 for all discrete states in our case. The second term, $I(x_t)$, is the information gain of a state. The information gain, $I$, of a particular state is related to how much the entropy of a probability density function (pdf), being the location’s uncertainty in our case, can be reduced. The two $\lambda$’s are scalars which weigh the influence of each term.
In our table environment, we discretised the state space, $\mathbb{R}^3$, into bins with a resolution of approximately 1 cm$^3$, giving us a total of 125,000 states. The action space was discretised to six actions, two for each dimension, meaning that all motion is parallel to the axes. For each state, $x_t$, an $I(x_t)$ value is computed by evaluating Equation 9:
$$I(x_t) = \mathbb{E}_{p(z_t \mid x_t)}\left\{H\left(p_u(x_t \mid z_{0:t})\right)\right\} - H\left(p_u(x_t \mid z_{0:t-1})\right),$$
(9)
which is essentially the difference between the expected entropy of the posterior pdf and that of the prior pdf. We set our initial pdf to be uniformly distributed, and we computed the maximum likelihood sensation for each discrete state $x_t$, which is akin to the expected sensation, or assuming that there is no uncertainty in the sensor measurement (an assumption often made throughout the literature to avoid carrying out the integral of the expectation in Equation 9). The result is the difference between the posterior pdf, given that the sensation occurred in $x_t$, and the prior pdf. The resulting cost map is illustrated in Figure 4. As expected, corners have the highest information gain, followed by edges and surfaces. We do not show the values of the table since they provided much less information gain.
The optimisation of the objective function is accomplished by running Dijkstra’s algorithm which, given a cost map, computes the shortest path to a specific target from all states. This results in a policy.
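The value-function computation over the discretised grid can be sketched as follows. The cost map is assumed to combine $\lambda_1 c(x_t) + \lambda_2 I(x_t)$ per cell (note that $I$ from Equation 9 is negative at informative states, lowering the cost there) and to have been offset so that all entries are non-negative, since Dijkstra’s algorithm requires non-negative weights; the six moves correspond to the axis-parallel action set:

```python
import heapq
import numpy as np

def dijkstra_values(cost, goal):
    """Run Dijkstra's algorithm from the goal over a 3-D grid cost map,
    returning the cost-to-go of every cell; greedily descending this
    value function from any cell yields the policy."""
    value = np.full(cost.shape, np.inf)
    value[goal] = 0.0
    heap = [(0.0, goal)]
    moves = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
             (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    while heap:
        v, cell = heapq.heappop(heap)
        if v > value[cell]:
            continue  # stale heap entry
        for m in moves:
            nb = tuple(np.add(cell, m))
            if all(0 <= nb[i] < cost.shape[i] for i in range(3)):
                nv = v + cost[nb]  # pay the cost of the cell entered
                if nv < value[nb]:
                    value[nb] = nv
                    heapq.heappush(heap, (nv, nb))
    return value
```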
Control
The standard approach to control with a GMM is to condition on the state, $\hat{x}_t$ and $U_t$ in our case, and perform inference on the resulting conditional GMM (Equation 10), which is a distribution over velocities or directions.
$$p_s(\dot{x} \mid \hat{x}, U) = \sum_{k=1}^{K} \pi_{\dot{x}\mid\hat{x},U}^{k} \cdot \mathcal{N}\left(\dot{x}\,; \mu_{\dot{x}\mid\hat{x},U}^{k}, \Sigma_{\dot{x}\mid\hat{x},U}^{k}\right)$$
(10)
The new distribution has the dimension of the output variable, the velocity (dimension 3). The variable $\dot{x}$ in $\dot{x} \mid \hat{x}, U$ indicates the predicted variable, and the variables $\hat{x}, U$ have been conditioned on. A common approach in statistical PbD methods using GMMs is to take the expectation of the conditional, known as Gaussian mixture regression (Equation 11):
$$\dot{x} = \mathbb{E}\left\{p_s(\dot{x} \mid \hat{x}, U)\right\} = \sum_{k=1}^{K} \pi_{\dot{x}\mid\hat{x},U}^{k} \cdot \mu_{\dot{x}\mid\hat{x},U}^{k}$$
(11)
The problem with this expectation approach is that it averages out opposing directions or strategies and may leave a net velocity of zero. One possibility would be to sample from the conditional; however, this can lead to non-smooth behaviour and flipping back and forth between modes, resulting in no displacement. To maintain consistency between the choices and avoid random switching, we perform a weighted expectation on the means so that directions (modes) similar to the current direction of the end effector receive a higher weight than opposing directions. For every mixture component $k$, a weight $\alpha_k$ is computed based on the distance between the current direction and the component’s mean. If the current direction agrees with the mode, the weight remains essentially unchanged; if it disagrees, a lower weight is calculated according to the equation below:
$$\alpha_k(\dot{x}) = \pi_{\dot{x}\mid\hat{x},U}^{k} \cdot \exp\left(-\left(1 - \cos\angle\left(\dot{x},\, \mu_{\dot{x}\mid\hat{x},U}^{k}\right)\right)\right)$$
(12)
Gaussian mixture regression is then performed with the normalised weights α instead of π (the initial weight obtained when conditioning).
$$\dot{x} = \mathbb{E}_{\alpha}\left\{p_s(\dot{x} \mid \hat{x}, U)\right\} = \sum_{k=1}^{K} \alpha_k(\dot{x})\, \mu_{\dot{x}\mid\hat{x},U}^{k}$$
(13)
The final output of Equation 13 gives the desired direction ($\dot{x}$ is renormalised). In the case when a mode suddenly disappears (because of a sudden change in the level of uncertainty caused by the appearance or disappearance of a feature), another present mode is selected at random. For example, when the robot has reached a corner, the level of uncertainty for this feature drops to zero; a new mode, and hence a new direction of motion, will then be computed. However, this is not enough to safely control the robot. One also needs to control the amplitude of the velocity and ensure compliant control of the end effector when in contact with the table. This behaviour is not learned here, as it is specific to the embodiment of the robot and unrelated to the search strategy. The amplitude of the velocity is computed by a proportional controller based on the believed distance to the goal,
$$\nu = \max\left(\min\left(K_p\,\|x_{\mathrm{g}} - \hat{x}\|,\ \beta_2\right),\ \beta_1\right)$$
(14)
where the $\beta$’s are the lower and upper amplitude limits, $x_{\mathrm{g}}$ is the position of the goal and $K_p$ is the proportional gain, which was tuned through trials.
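The clamped proportional controller of Equation 14 amounts to a one-liner; the gain and limit values below are placeholders, not the tuned values used in the experiments:

```python
import numpy as np

def velocity_amplitude(x_goal, x_hat, Kp=1.5, beta1=0.02, beta2=0.15):
    """Eq. 14: speed proportional to the believed distance to the goal,
    clamped to [beta1, beta2] m/s (gain and limits are placeholders)."""
    nu = Kp * np.linalg.norm(np.asarray(x_goal, float)
                             - np.asarray(x_hat, float))
    return float(np.clip(nu, beta1, beta2))
```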
As mentioned previously, compliance is the other important aspect when having the robot duplicate the search strategies. Collisions with the environment occur as a result of the uncertainty. To avoid the risk of breaking the table or the robot’s sensors, we use an impedance controller at the lowest level which outputs appropriate joint torques $\tau$. The overall control loop is depicted in Figure 5.