- Research
- Open Access
Learning search polices from humans in a partially observable context
- Guillaume de Chambrier^{1}Email author and
- Aude Billard^{1}
https://doi.org/10.1186/s40638-014-0008-1
© de Chambrier and Billard; licensee Springer. 2014
- Received: 23 April 2014
- Accepted: 30 August 2014
- Published: 7 November 2014
Abstract
Decision making and planning for which the state information is only partially available is a problem faced by all forms of intelligent entities they being either virtual, synthetic or biological. The standard approach to mathematically solve such a decisional problem is to formulate it as a partially observable decision process (POMDP) and apply the same optimisation techniques used in the Markov decision process (MDP). However, applying naively the same methodology to solve MDPs as with POMDPs makes the problem computationally intractable. To address this problem, we take a programming by demonstration approach to provide a solution to the POMDP in continuous state and action space. In this work, we model the decision making process followed by humans when searching blindly for an object on a table. We show that by representing the belief of the human’s position in the environment by a particle filter (PF) and learning a mapping from this belief to their end effector velocities with a Gaussian mixture model (GMM), we can model the human’s search process and reproduce it for any agent. We further categorize the type of behaviours demonstrated by humans as being either risk-prone or risk-averse and find that more than 70% of the human searches were considered to be risk-averse. We contrast the performance of this human-inspired search model with respect to greedy and coastal navigation search methods. Our evaluation metric is the distance taken to reach the goal and how each method minimises the uncertainty. We further analyse the control policy of the coastal navigation and GMM search models and argue that taking into account uncertainty is more efficient with respect to distance travelled to reach the goal.
Keywords
- Belief space planning
- Imitation learning
- Partially observable environment
- Search strategies in humans
Background
Acting under partial observability
Learning controllers or policies to act within a context where the state space is partially observable is of high relevance to all real robotic applications. Resulting from limited and inaccurate perceptual information, often only an approximation of the environment is available at any given time. If this inherent uncertainty is not taken into account during planning or control, there is a non-negligible risk of missing goals, getting lost and wasting valuable resources.
A common approach is to formulate the uncertainty present in both action and state as a partially observable Markov decision process (POMDP). POMDPs are an extensive area of research in the operational research, planning and decision theory community [1],[2]. The emphasis is to be able to act optimally with respect to an objective criteria when the state information is only partially available due to perceptual limitations and actions that are non-deterministic (stochastic).
The first approach to solving a POMDP is to apply value iteration (VI) [3] over the belief space (space of all possible probability distributions over the state space) as if we were solving for a standard Markov decision process (MDP). If the states, actions and observations are all discrete and the cost (or reward) function which encodes the task is the expected reward, then the overall value function is a convex combination of linear functions. In this setting, an exact solution exists [4], p. 513; however, the time and space complexity of VI in this context grows exponentially.
A popular approach to find a tractable solution to a POMDP is to reduce the size of the belief space by approximating it as a set of discrete reachable beliefs and then perform VI in this reduced space. Such methods fall under the category of point-based value iteration (PBVI) [5]. Most research has focused on determining the best set of belief points [6]-[8] to be evaluated in VI. These methods rely on exploratory/search heuristics to discover a sufficient set of probability densities or sample points to be able to construct a sufficiently accurate approximation of the belief space such that an optimal policy can be found (see [9] for a detailed review on PBVI algorithms).
Other approaches are based on compressing the belief to sufficient statistics (mean and entropy) as in [10] and thereafter to perform VI in this augmented state space. The drawback with these methods so far is that they cannot deal with both continuous state and action space (we do not consider macro/parametrised actions to be a true solution for the continuous domain). The noticeable exception is Monte Carlo POMDP [11] which represents the belief of the position of a robot by a particle filter. However, the value function is difficult to compute and requires storing belief instantiations for evaluating new unseen beliefs. The major drawback of all these approaches lies with the exploration problem which becomes infeasible as the number of states and actions increase.
Decision theoretic-based approaches have also been applied. Notable examples are [12],[13] where a decision tree graph is constructed with nodes representing beliefs (different realizations of a probability density function over the state space) and edges being actions (discrete). The actions themselves are typically macro-actions comprised of predefined start and end conditions. A planner (A* search) is used to find the appropriate set of actions to take, which follows a heuristic to find a trade-off between reducing the uncertainty and achieving the goal. If a large discrepancy exists between the estimated state and actual state, a new policy has to be re-planned. The shortcomings of these methods lie with the computational cost of constructing the search tree with particle filters (PF) for the belief nodes and the design of macro-actions. The responsiveness of these systems are bound to the computational cost and frequency of the re-planning step.
Programming by demonstration and uncertainty
Programming by demonstration (PbD) is advantageous in this context since it removes the need to perform the time-consuming exploration of the state-action tree to discover an optimal policy and does not rely on any exploration heuristics to gather a sufficient set of belief points (as in point-based value iteration methods). We expect humans to perform an informed search. In contrast to stochastic sampling methods, humans utilise past experience to evaluate the costs of their actions in the future and to guide their search. This foresight and experience are implicitly encoded in the parameters of the model we learn from the demonstrated searches.
PbD has a long history in the autonomous navigation community. In [14], behaviour primitives of the PHOENIX robot control architecture are incrementally learned from demonstrations. Two types of behaviour namely reactive and history-dependent are learned and are encoded by radial basis functions. The uncertainty is implicitly handled by directly learning the mapping between stimulus and response. In [15], the parameters of a controller which performs obstacle avoidance are learned from human demonstrations. The uncertainty is inherently handled by learning directly the relation between sensor input and control output. In [16], the objective function of a path planner is learned from human demonstrations. The objective function is a weighted sum of features corresponding to raw sensor measurements. This is another example where the partial information of the state is taken into account at the perception-action level, with the difference that instead of a policy being learned the objective function from which it is generated is learned. In [17], the authors learn how to combine low-level pre-acquired action primitives to achieve more complex tasks from human demonstrations, but they do not consider the effect of uncertainty.
Much work has been undertaken in learning reactive behaviour, history-dependent behaviour and combining multiple behaviour primitives to achieve complex behaviour. However, very few have studied the effect of uncertainty in the decision process and do not consider it during the learning or assume that it is implicitly handled. A noticeable exception is [18], in which a human expert guides the exploration of a robot in an indoor environment. The high-level actions (explore, loop closure, reach goal) taken by the human are recorded along with three different features related to the uncertainty in the map. Using SVM classification, a model is learned which indicates which type of action to take given a particular set of features. The difference with our approach is that we perform the learning in continuous action space at trajectory level and multiple actions are possible given the same state, which cannot be handled by a classifier.
Human beliefs
A crucial aspect of our work is to be able to infer the human’s location belief whilst he is searching. The work on modelling human beliefs and intentions [19],[20] has been undertaken in cognitive science. Human mind attributes, such as beliefs, desires and intentions, are not directly observable. They have to be inferred from actions. In [21], the authors present a Bayesian framework for modelling the way humans reason about and predict actions of an intentional agent. The comparison between the model and humans’ predictions yielded similar inference capabilities when asked to infer the intentions of an agent in a 2D world. This provided evidence supporting the hypothesis that humans integrate information using Bayes’ rule. Further, in [19], a similar experiment was performed in which the inference capabilities of humans, with regard to both belief and desire of an agent, were comparable to those of their Bayesian model. Our work makes the similar hypothesis that humans integrate information in a Bayesian way, however in the continuous domain. We infer the belief humans have of their location in the world during a search task.
As in our previous work [22], we learn a generative model of the human’s search behaviour in the task of finding an object on a table. We compliment this work with four additional components, namely (1) an analysis of the different types of exhibited behaviour by the human demonstrators, a learned GMM model and two other search algorithms (greedy and coastal navigation), (2) a comparison between the human learned controller (GMM) and a coastal navigation search policy in addition to greedy and hybrid controllers which have already been discussed in our previous work, (3) an analysis of variance (ANOVA) to ensure that the search experiments were statistically different and a report on the distance taken to reach the goal and (4) a comparison of the policy generated by the GMM controller and the coastal navigation algorithm, with an emphasis of the role of the uncertainty.
Methods
Research design and methodology
It is non-trivial to have a robot learn the behaviour exhibited by humans performing this task. As we cannot encapsulate the true complexity of human thinking, we take a simplistic approach and model the human’s state through two variables, namely the human’s uncertainty about his current location and the human’s belief of his position. The various strategies adopted by humans are modelled by building a mapping from the state variables to actions, which are the motion of the human arm. Aside from the problem of correctly approximating the belief and its evolution over time, the model needs to take into consideration that people behave very differently given the same situation. As a result, it is not just a single strategy that will be transferred but rather a mixture of strategies. While this will provide the robot with a rich portfolio of search strategies, appropriate methods must be developed to encode, at times, these contradictory strategies. This leads to the main scientific question we seek to address in this work:
Do humans exhibit particular search strategies, and if so, is it feasible to learn them?
How well does a statistical controller learned from human demonstrations perform with respect to approaches which do not take into account the uncertainty directly?
Experimental setup
As covered in the ‘Background’ section, previous work has taken a probabilistic Bayesian approach to model the beliefs and intent of humans. A key finding was that humans update their beliefs using Bayes’ rule (shown so far in the discrete case). We make a similar assumption and represent the human’s location belief (where he thinks he is) by a particle filter which is a point mass representation of a probability density function. There is no way of knowing the human’s belief. We make the critical assumption that the belief is observable in the first time step of the search, and all following beliefs are assumed correct through applying Bayes integration. The belief is always initialized to be uniformly distributed on top of the table (see Figure 2, top right), and the starting position of the human’s hand is always in this area.
Before each trial, the participant was told that he/she would always be facing the same direction with respect to the table (so always facing the goal, like in the case of a door), but his/her transitional starting position would vary. For instance, the table might not be always directly in front of the person and his/her distance to the edge or corner could be varied. In Figure 2 (bottom left), we illustrate four representative recorded searches, whilst in the bottom right, we illustrate a set trajectories which all started from the same region. One interesting aspect is the diversity present, demonstrating clearly that humans behave differently given the same situation.
Formulation
- 1., velocity of the hand in Cartesian space, which is normalised.${\stackrel{\u0307}{x}}_{t}\in {\mathbb{R}}^{3}$
- 2., the most likely position of the end effector or believed position.${\widehat{x}}_{t}=arg\phantom{\rule{0.3em}{0ex}}\underset{{x}_{t}}{max}p\left({x}_{t}\right|{z}_{0:t})$
- 3., the level of uncertainty which is the entropy of the belief: H(p(x _{ t }|z _{0:t})).$U\in \mathbb{R}$
A statistical controller was learned from a data set of triples $\left\{\right(x,\widehat{x},U\left)\right\}$, and a desired direction (normalised velocity) was obtained from conditioning on the belief and uncertainty.
Having described the experiment, we proceed to give an in-depth description of the mathematical representation of the belief, sensing and motion models and the uncertainty.
Belief model
The probability distribution over the state p(x_{ t }|z_{0:t}) is represented by a set of weighted particles which represent hypothetical locations of the end effector and their density which is proportional to the likelihood. The particular particle filter used was the regularised sequential importance sampling[23], p. 182. From the previous literature [19], it has been shown that there is a similarity between Bayes update rule and the way humans integrate information over time. Under this assumption, we hypothesise that if the initial belief of the human is known then the successive update steps of the particle filter should correspond to a good approximation of the next beliefs.
Sensing and motion model
The exponential form of the function, h, allows the range of the sensor to be reduced. We set β>0 such that any feature which is more than 1 cm away from the end effector or hand has a probability close to zero of being sensed. The same sensing function is repeated for all feature types.
The sensing model takes into account the inherent uncertainty of the sensing function (3) and gives the likelihood, p(z_{ t }|x_{ t }), of a position. Since the range of sensing is extremely small and entries are probabilistic, we assume no noise in the sensor measurement. The likelihood of a hypothetical location, x_{ t }, is related to Jensen-Shannon divergence (JSD), $p\left({z}_{t}|{x}_{t}\right)=1-\text{JSD}\left({z}_{t}\left|\right|{\widehat{z}}_{t}\right)$, between true sensing vector, z_{ t }, obtained by the agent and that of the hypothetical sensation ${\widehat{z}}_{t}$ generated at the location of a particle.
Motion model. The motion model is straightforward compared with the sensing model. In the robot’s case, the Jacobian gives the next Cartesian position given the current joint angles and angular velocity of the robot’s joints. From this, the motion model is given by $p\left({x}_{t}|{x}_{t-1},{\stackrel{\u0307}{x}}_{t}\right)=J\left(q\right)\stackrel{\u0307}{q}+\epsilon $ where q is the angular position of the robot’s joints, J(q) is the Jacobian and $\epsilon \sim \mathcal{N}(0,{\sigma}^{2}I)$ is white noise. The robot’s motion is very precise and its noise variance is very low. For humans, the motion model is the velocity of the hand movement provided by the tracking system.
Uncertainty
where K is the number of Gaussian components, the scalar π_{ k } represents the weight associated to the the mixture component k (indicating the component’s overall contribution to the distribution) and $\sum _{k=1}^{K}{\pi}_{k}=1$. The parameters μ_{ k } and Σ_{ k } are the mean and covariance of the normal distribution k.
The main difficulty here is determining the number of parameters of the density function in a computationally efficient manner. We approach this problem by finding all the modes in the particle set via mean-shift hill climbing and set these as the means of the Gaussian functions. Their covariances are determined by maximising the likelihood of the density function via expectation-maximisation (EM).
where e is the base of the natural logarithm and D the dimension (being 3 in our case).
Model of human search
During the experiments, the recorded trajectories show that different actions are present for the same belief and uncertainty making the data multi-modal (for a particular position and uncertainty, different velocities are present). That is, multiple actions are possible given a specific belief. This results in a one-to-many mapping which is not a valid function, eliminating any regression technique which directly learns a non-linear function. To accommodate this fact, we again made use of a GMM to model the human’s demonstrated searches, $\left\{(x,\stackrel{\u0307}{x},U)\right\}$. Using statistical models to encode control policies in robotics is quite common (see [25]).
By normalising the velocity, the amount of information to be learned was reduced. We also took into consideration that velocity is more specific to embodiment capabilities: the robot might not be able to reproduce safely some of the velocity profiles demonstrated.
Given this generative representation of the humans’ demonstrated searches, we proceeded to select the necessary parameters to correctly represent the data. This step is know as model selection, and we used Bayesian information criterion (BIC) to evaluate each set of parameters which were optimised via EM.
Coastal navigation
The first term, c(x_{ t }), is the traditional ‘cost to go’ which penalizes every step taken so as to ensure that the optimal path is the shortest. The value was simply set to 1 for all discrete states in our case. The second term, I(x_{ t }), is the information gain of a state. The information gain, I, of a particular state is related to how much the entropy of a probability density function (pdf), being the location’s uncertainty in our case, can be reduced. The two λ’s are scalars which weigh the influence of each term.
which is essentially the difference between the entropy of a prior pdf to that of a posterior pdf. We set our initial pdf to be uniformly distributed, and we computed the maximum likelihood sensation for each discrete state x_{ t } which is akin to the expected sensation or assuming that there is no uncertainty in sensor measurement (an assumption often made throughout the literature to avoid carrying out the integral of the expectation in Equation 9). The result is the difference between the posterior pdf, given that the sensation occurred in x_{ t }, and the prior pdf. The resulting cost map is illustrated in Figure 4. As expected, corners have the highest information gain followed by edges and surfaces. We do not show the values of the table since they provided much less information gain.
The optimization of the objective function is accomplished by running Dijkstra’s algorithm. This algorithm, given a cost map, computes the shortest path to a specific target from all the states. This results in a policy.
Control
where the β’s are lower and upper amplitude limits, x_{ g } is the position of the goal and K_{ p } is the proportional gain which was tuned through trials.
Results and discussion
We analysed the types of behaviour present in the human demonstration as well as in four different search algorithms, namely greedy, GMM, hybrid and coastal. A qualitative analysis of the GMM search policy (namely the different modes/decisions present) is contrasted with the coastal navigation policy. Finally, we evaluated the performance of the searches, with respect to the distance taken to reach the goal and the uncertainty profiles towards the end of the searches in five different experiments (different types of initializations).
Search and behaviour analysis
The selection of edges and corners as features as a means of classifying the type of behaviours present is not solely restricted to our search task. Salient landmarks will result in a high level of information gain, which is the case for the edge and corner (see Figure 4, right). Other tasks can use such features or variants in which the curvature is considered for representing the task space. These features are present in most settings, and high-level features can use these easily as their building blocks.
We note that the greedy search approach seeks to go directly to the goal without taking into account the uncertainty. The GMM models human search strategies. The hybrid is a combination of both the greedy and GMM method where once the uncertainty has been sufficiently minimised switches (threshold) to the greedy method for the rest of the search. The coastal navigation algorithm finds the optimal path to the goal based on an objective function which consists of a trade-off between the time taken to reach the goal and the minimisation of the uncertainty.
It can be seen that the human demonstrations have a much wider spread than those of the search algorithms. We suggest that this is due to human behaviours being optimal with respect to their own criteria as opposed to the algorithms which usually tend to only maximise a single objective function. The trajectories of the greedy and GMM methods represented by their expected features demonstrate two distinctive behaviours (in terms of expected sensation), risk-prone for the greedy and risk-averse for the GMM.
Percentage of risk-prone trajectories based on two decision criteria
Greedy | GMM | Hybrid | Coastal | Human | |
---|---|---|---|---|---|
Risk-prone (f) | 77% | 11% | 30% | 46% | 26% |
Risk-prone (r) | 78% | 12% | 24% | 45% | 7% |
We conclude that there is a strong inclination towards inferring that indeed multiple search strategies do arise in the human searches since they were extracted and encoded in the GMM model. From the risk distribution, humans have a tendency to be risk-averse.
GMM and coastal navigation policy analysis
It can be further seen that when the uncertainty tends towards its maximum value (U→1), all behaviour tends to go towards the edges and corners. As the uncertainty reduces (U→0), the vector field tends directly towards the goal. However, even at a low level of uncertainty, the behaviour at the edges and corners remains multi-modal and tends to favour remaining close to the edges and corners. This is an advantage of the GMM model. If the uncertainty has been sufficiently reduced and the true position of the end effector or hand is not near an edge, the policy dictates to go straight to the goal. This is not the case for the coastal algorithm which ignores the uncertainty and strives to remain in the proximity of corners and edges until sufficiently close. This approach could potentially lead to unnecessary travel cost which could otherwise have been avoided.
Time efficiency and uncertainty
We seek to distinguish the most efficient method in terms of two metrics, the distance taken to reach the goal and the level of uncertainty upon arriving at the goal. We report results on five different search experiments in which we compare the greedy, GMM and coastal navigation algorithms. The hybrid was not fully considered since it is a heuristic combination of the greedy and GMM methods.
Mean distance and variance taken to reach the goal for three methods in five experiments
Greedy | GMM | Coastal | |
---|---|---|---|
Uniform | 1.5396 (0.4580) | 0.9981 (0.1440) | 1.1267 (0.5678) |
#1 | 3.0205 (0.3567) | 1.8220 (0.2314) | 3.4383 (1.5044) |
#2 | 0.8025 (0.0129) | 1.4129 (0.1446) | 0.9392 (0.0126) |
#3 | 1.1429 (0.0804) | 1.8036 (0.1670) | 2.1432 (0.8136) |
#4 | 0.7505 (0.0383) | 1.3451 (0.0762) | 0.6820 (0.0094) |
ANOVA test of the null hypothesis (rejected): all searches are the same
Search method | ||||
---|---|---|---|---|
Uniform | #1 | #2 | #3 | #4 |
2.01e −06 (14) | 5.03e −07 (19) | 7.17e −11 (36) | 4.1e −06 (15) | 4.21e −16 (67) |
ANOVA between paired search methods
Greedy vs GMM | Greedy vs coastal | GMM vs coastal | |
---|---|---|---|
Uniform | 3.59e −08 (30) | 3.32e −04 (13) | 1.90e −01 (2) |
#1 | 5.80e −08 (46) | 1.88e −01 (2) | 4.58e −06 (28) |
#2 | 3.60e −08 (47) | 4.68e −04 (14) | 4.54e −06 (28) |
#3 | 3.57e −07 (37) | 2.07e −05 (23) | 1.25e −01 (2) |
#4 | 6.70e −10 (64) | 1.58e −01 (2) | 6.34e −13 (107) |
From our ANOVA analysis, we conclude that the behaviour exhibited by the three search strategies are significantly different. This is certainly the case for the greedy and GMM methods, even though in certain situations the greedy and coastal policies display similar behaviour such as in experiment #1. The reason for this is that both the greedy and coastal policies start in a situation where there are no salient features available, and their polices take the true end effector location to an even more feature deprived region. In this situation, the GMM policy is the clear winner with respect to the distance taken to reach the goal.
In experiment #2, both greedy and coastal policies perform equally well and will usually perform faster than the GMM model if the true and believed locations of the end effector do not leave the surface of the table. If this is not the case, they will both reduce the uncertainty in a very inefficient way as the modes will often change during the period of the search, where they are in contact with the table. This leads to the believed position (most likely state, ${\widehat{x}}_{t}$) varying greatly, resulting in an increased time period before the uncertainty has been narrowed down sufficiently for a contact to occur with the table (or simply by chance).
The results show which methods actively minimise the uncertainty and which methods found the goal whilst being more dependent on chance. For all the reported experiments, the GMM (learned from human searches) reaches a lower expected uncertainty than all other search algorithms. For the Uniform and #1 search experiment, all methods reach the same final uncertainty level. However, for the #2 and #4 experiments, the GMM reaches the goal with significantly lower uncertainty. It is inferred that the GMM model actively minimises the uncertainty which is also reflected in the distance taken for this method to reach the goal in comparison with the other methods.
The rows in Table 2 for the greedy (#2) and coastal navigate (#4) are an order of magnitude faster than the GMM method. However, both have a far higher level of uncertainty at the arrival which leads to the assumption that chance has a non-negligible effect on their success.
Conclusions
In this work, we have shown a novel approach in teaching a robot to act in a partially observable environment. Through having human volunteers demonstrate the task of finding an object on a table, we recorded both the inferred believed position of their hand and associated action (normalised velocity). A generative model mapping the believed end effector position to actions was learned, encapsulating this relationship. As speculated and observed, multiple strategies are present given a specific belief. This can be interpreted as the fact that humans act differently given the same situation.
The behaviour recorded from the human demonstrations, encoded as set of expected sensations, showed the presence of not only trajectories which both remained near the edge and corner features but also trajectories which remained far away. The fact of risk-prone and risk-averse behaviour was further confirmed by the overlap of the risk factor of human and GMM-generated trajectories with that of the greedy risk factor. According to the feature-based factor, more than 70% of the human search trajectories were considered to be risk-averse, whilst 93% according to the risk factor. Similarly, the GMM search trajectories showed to be 89% to 88% risk-averse.
In terms of the comparative study, the GMM controller is more adapted to dealing with situations of high uncertainty and better takes it into account than greedy or coastal planning approach. This is evident in the experiment where the believed position and true position of the end effector were significantly far apart and distant from salient areas. Future questions of scientific value to be addressed are to which extent do humans follow the reasoning of a Markov decision process in a partially observable situation where the state space is continuous (the problem has been partially addressed in [19] for discrete states and actions). A further aspect of interest is to study the situation where multiple beliefs are present and investigate how humans perform simultaneous localization and mapping as opposed to active localization which was the area of interest of this research.
Declarations
Acknowledgements
This research was supported by the European project, Flexible Skill Acquisition and Intuitive Robot Tasking for Mobile Manipulation in the Real World (First-MM), in Robotic Research.
Authors’ Affiliations
References
- Kaelbling LP, Littman ML, Cassandra AR: Planning and acting in partially observable stochastic domains. Artif Intell 1998, 101(1):99–134. 10.1016/S0004-3702(98)00023-XMATHMathSciNetView ArticleGoogle Scholar
- Smith T (2007) Probabilistic planning for robotic exploration. PhD thesis, Robotics Institute,Carnegie Mellon University, Pittsburgh, PA.Google Scholar
- Sutton RS, Barto AG: Reinforcement learning: an introduction. MIT Press, Cambridge; 1998.Google Scholar
- Thrun S, Burgard W, Fox D: Probabilistic robotics (intelligent robotics and autonomous agents). The MIT Press, Cambridge; 2005.Google Scholar
- Pineau J, Gordon G, Thrun S (2003) Point-based value iteration: an anytime algorithm for POMDPS. In: IJCAI, 1025–1030, Mexico, 9–15 August 2003.Google Scholar
- Kurniawati H, Hsu D, Lee WS (2008) SARSOP: efficient point-based POMDP planning by approximating optimally reachable belief spaces. In: Oliver Brock JT Ramos F (eds)Proceedings of robotics: science and systems (RSS), Zurich, 25–28 June 2008.Google Scholar
- Smith T, Simmons R: Heuristic search value iteration for POMDPS. In Proceedings of the 20th conference on uncertainty in artificial intelligence (UAI ’04). AUAI Press, Arlington; 2004:520–527.Google Scholar
- Shani G, Brafman RI, Shimony SE (2007) Forward search value iteration for POMDPS. In: Proceedings of the 20th international joint conference on artifical intelligence.Google Scholar
- Shani G, Pineau J, Kaplow R: A survey of point-based POMDP solvers. Autonomous Agents Multi-Agent Syst 2013, 27(1):1–51. 10.1007/s10458-012-9200-2View ArticleGoogle Scholar
- Roy N, Pineau J, Thrun S (2000) Spoken dialogue management using probabilistic reasoning. In: Iida H (ed)Proceedings of the 38th annual meeting of the association for computational linguistics, 93–100, Hong Kong, 2000.Google Scholar
- Thrun S: Monte carlo POMDPs. In Advances in neural information processing systems 12. Edited by: Solla SA, Leen TK, Müller K-R. MIT Press, Cambridge; 2000:1064–1070.Google Scholar
- Hsiao K, Kaelbling L, Lozano-Perez T (2010) Task-driven tactile exploration. In: Yoky Matsuoka HD-W Neira J (eds)Proceedings of robotics: science and systems (RSS).Google Scholar
- Hebert P, Howard T, Hudson N, Ma J, Burdick JW (2013) The next best touch for model-based localization. In: International conference on robotics and automation (ICRA), 99–106, Karlsruhe, 6–10 May 2013.Google Scholar
- Kasper M, Fricke G, Steuernagel K, von Puttkamer E: A behavior-based mobile robot architecture for learning from demonstration. Robot Autonom Syst 2001, 34(2):153–164. 10.1016/S0921-8890(00)00119-6MATHView ArticleGoogle Scholar
- Hamner B, Singh S, Scherer S: Learning obstacle avoidance parameters from operator behavior. Field Robot 2006, 23(11/12):1037–1058. 10.1002/rob.20171View ArticleGoogle Scholar
- Silver D, Bagnell JA, Stentz A: Learning from demonstration for autonomous navigation in complex unstructured terrain. IJRR 2010, 29(12):1565–1592.Google Scholar
- Nicolescu MN, Mataric MJ: Learning and interacting in human-robot domains. IEEE Trans Syst Man Cybern Syst Hum 2001, 31(5):419–430. 10.1109/3468.952716View ArticleGoogle Scholar
- Lidoris G: State estimation, planning, and behavior selection under uncertainty for autonomous robotic exploration in dynamic environments. Kassel University Press GmbH, Kassel; 2011.Google Scholar
- Bake C, Tenenbaum J, Saxe R (2011) Bayesian theory of mind: modeling joint belief-desire attribution. In: Thirty-third annual conference of the Cognitive Science Society, 2469–2474, Boston, 20 July 2011.Google Scholar
- Richardson H, Bake C, Tenenbaum J, Saxe R (2012) The development of joint belief-desire inferences. In: Proceedings of the 34th annual meeting of the Cognitive Science Society (COGSCI), Sapporo, 1 Aug 2012.Google Scholar
- Baker CL, Tenenbaum JB, Saxe RR (2006) Bayesian models of human action understanding. In: Advances in neural information processing systems 18, 99–106, Nevada, 4 December 2006.Google Scholar
- de Chambrier G, Billard A (2013) Learning search behaviour from humans. In: IEEE international conference on robotics and biomimetics (ROBIO), 573–580, Shenzhen, 12 December 2013.Google Scholar
- Arulampalam MS, Maskell S, Gordon N, Clapp T: A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans Signal Process 2002, 50(2):174–188. 10.1109/78.978374View ArticleGoogle Scholar
- Huber MF, Bailey T, Durrant-Whyte H, Hanebeck UD (2008) On entropy approximation for Gaussian mixture random vectors. In: Multisensor fusion and integration, 181–188.Google Scholar
- Billard A, Calinon S, Dillmann R, Schaal S: Robot programming by demonstration. In Springer handbook of robotics. Springer, Berlin; 2008:1371–1394. 10.1007/978-3-540-30301-5_60View ArticleGoogle Scholar
- Roy N, Burgard W, Fox D, Thrun S (1999) Coastal navigation-mobile robot navigation with uncertainty in dynamic environments. In: IEEE international conference on robotics and automation, 35–40.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.