Imagine you are part of an experiment where you are shown a set of patterns, where each pattern is a distorted version of some prototype pattern. The researchers train you to recognize and categorize the patterns. Next you are shown a new set of distorted patterns and told to classify each pattern as belonging to one of the prototypes. This is a classic experiment by Posner et al. \cite{POSNER:1967ef} designed to understand how humans learn to abstract visual stimuli and form categories. The experiment shows that categories form along a continuum rather than as distinct classes, and that sufficiently distinct members, i.e., heavily distorted patterns, are treated as individual exemplars rather than as belonging to a specific category.
For a robotic agent acting in an unstructured environment, learning object affordances is similar to the pattern experiment: there is little need for object individuation other than when interacting with unique objects; the key is rather to understand the underlying common pattern among the objects that can be categorized under a specific affordance. To differentiate and categorize objects, the agent needs to learn what to measure, and how to measure similarity to the patterns it has repeatedly observed for the different categories.
From a learning perspective this boils down to feature selection and classification. Most approaches have focused on feature selection formulated as pre- and post-conditions. In \cite{Stoytchev:2005il, Chao:2011id, Niekum:us} the authors try to learn pre- and post-conditions over the features that affect the outcomes of the actions. While successful at learning and performing manipulation, these approaches do not address more complex affordances involving human-made objects, and their understanding is limited to the setting in which the affordances were learned.
From this perspective our approach is more general and holistic: we consider feature selection and classification as integral parts of each other that can be learned irrespective of the manipulation context. To use similarity for affordance categorization, objects that are similar should lie close to each other in the feature space under the metric. To this end we learn the similarity metric from the data, that is, we learn an affordance-specific transform that pushes similar items together and dissimilar items far apart.
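As a minimal sketch, assuming a linear parameterization $W$ of the metric, as is common in metric-learning formulations (the notation is illustrative, not necessarily the exact form used in this work), the learned distance between two object feature vectors $x_i, x_j \in \mathbb{R}^D$ can be written as
\begin{equation*}
    d_W(x_i, x_j) = \| W x_i - W x_j \|_2^2,
\end{equation*}
where pairs of objects that share an affordance label should obtain a small $d_W$ and pairs that do not should obtain a large one.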
Features with general discriminative capability are high-dimensional and noisy, which distorts any distance measure computed directly on them. However, most features are redundant for determining a specific affordance category, which implies that the category-specific features inhabit a lower-dimensional space. We therefore force the transform to project onto a lower-dimensional space. In addition, we introduce a regularizer in the learning algorithm that penalizes non-sparse transforms and thereby induces feature selection. This means that we simultaneously learn to categorize objects and learn what in the category makes an object afford an action.
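For concreteness, a sketch of such an objective, where $W \in \mathbb{R}^{d \times D}$ with $d < D$ performs the projection and the regularizer is a sparsity-inducing column-wise penalty (both the loss and the penalty are illustrative assumptions, not the exact formulation used here), is
\begin{equation*}
    \min_{W \in \mathbb{R}^{d \times D}} \; \mathcal{L}_{\mathrm{metric}}(W) + \lambda \sum_{k=1}^{D} \| W_{\cdot k} \|_2,
\end{equation*}
where $\mathcal{L}_{\mathrm{metric}}$ penalizes same-affordance pairs that end up far apart and different-affordance pairs that end up close, and driving an entire column $W_{\cdot k}$ to zero corresponds to discarding input dimension $k$, which is what yields the feature selection.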
Most research on affordances argues that learning should be done through grounding in the sensorimotor system of the agent, and that only by forming its own representation will the agent be able to fully understand an affordance. Affordance recognition and discovery is therefore as much tied to interaction with the object as to visual stimuli. For example, Sahin et al. \cite{Sahin:2007gr} give an agent-centric definition of affordances that relies on the effects of interaction with the world.
Most affordance learning through interaction has so far focused on simple affordances such as pushing and rolling. For example, in Modayil et al. \cite{Modayil:2008it} an autonomous robot learns about the world through random motor babbling, observes the resulting changes in its perception of the world, and makes goal-oriented plans by matching the learned actions to a given plan. In Yuruten et al. \cite{Yuruten:2013hr} the authors teach a robot to match simple descriptive adjectives and nouns to object features using feature selection with an SVM with a radial basis function kernel.
Learning more complex tasks, such as interacting with everyday objects, is more difficult. In robotics the problem is usually approached through Learning from Demonstration (LfD), as this is also how humans learn complex tasks. One common approach is to let the robot imitate the demonstrator so as to reformulate the observation into its own sensorimotor system.
An integral part of performing these complex tasks is grasping and manipulation, where current research has mostly focused on grasping objects regardless of task context \cite{Bohg:H95zG3Ya}. However, steps have been taken towards learning task-dependent grasp selection by imitation, as in Hjelm et al. \cite{Hjelm:2015hw}.
Understanding how an object should be manipulated thus lies as much in learning complex motor commands as in recognizing how an object is best manipulated in the current context. Realizing what in the feature representation of an object makes the object afford the action is a key part of understanding that context. From this understanding it also follows that the agent can greatly reduce the space of possible manipulation actions it needs to consider, which allows it to gain a deeper understanding of the affordance.
Previous efforts towards finding affordance-bearing features have been part of grasp planning models and have focused on different forms of semantic segmentation associated with possible grasping actions. In Aleotti et al. \cite{Aleotti:2011hc} the authors segment objects into graphs where each node can be linked to specific grasps. Similarly, \cite{Faria:2014ug, Hjelm:2014uc, Antanas:2014uo} segment objects along their main axis, and into other functional areas, to find parts that can be associated with particular types of grasps.
Since many affordances cannot be explained by local features, our model breaks with the locality assumption of the grasp affordance learning approaches by assuming that objects can be described by both global and local features. We also recognize that most man-made objects within an affordance category have low variability and can be described by simple features, such as the elongatedness of the object, its volume, the color consistency across objects, or the local curvature. By discovering the affordance-bearing features across objects we have no need for a tuned segmentation algorithm, since our approach grounds the affordance in the sensory input and, in the case of local features, automatically induces a natural segmentation.
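As an illustrative example of such a simple feature (the computation is an assumption for exposition, not necessarily the descriptor used here), elongatedness can be summarized by the ratio of the two largest eigenvalues $\lambda_1 \geq \lambda_2$ of the covariance of the object's point cloud,
\begin{equation*}
    e = \frac{\lambda_1}{\lambda_2},
\end{equation*}
so that a large $e$ indicates an elongated object.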
The approaches most similar to ours are those of Hermans et al. \cite{Hermans:2011vz} and Sun \cite{JieSun:2010kv}. In Hermans et al. the authors teach a robot to recognize key attributes of objects such as size, shape, color, material, and weight. Affordances are then classified with the attributes as inputs and compared against an SVM trained directly on the feature space. The direct approach performs comparably or better, which is explained by the feature space containing information not directly explainable as any specific semantic attribute. A similar conclusion can be drawn from the results in \cite{JieSun:2010kv}. We want to remark that if affordances should be grounded in the sensory input of the agent, then imposing a human structure on top of how objects are interpreted might not help the agent but rather confuse it, especially when the individual features are extremely noisy. Language is a powerful way to communicate and dissect knowledge; however, as their experiments show, what is right for one sensorimotor system might not be right for another.
Like the aforementioned papers we also make use of discretization when creating certain features, but since discretization always means a trade-off between discarding and keeping information, we avoid it when possible. This means that we let the algorithm choose what is needed for recognition. Contrary to \cite{Hermans:2011vz, JieSun:2010kv}, the information explaining the affordance is not hidden in the object category or attributes but is reflected in the transform and is directly relatable to the input.
As mentioned above, we are not interested in object categorization and individuation, as we consider this a hindrance to truly understanding affordances. In humans there are cases where damage to the primary visual cortex affected object recognition but not manipulation \cite{Goodale:1991hc}. This suggests that discriminative features used in recognition might be redundant for affordance recognition. We instead want to formulate the understanding around the feature transform. This removes the need for object category knowledge and enables the agent to, in the case of a local feature, map the affordance to a specific portion of the object. The transform can also help the agent understand which affordances are most similar by comparing the distances between the transforms. Another benefit of learning the similarity measures is that the human demonstrator can show the robot which objects are similar, creating an intra-category ordering, as opposed to the robot using a predefined distance measure to formulate the initial similarity.
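As a sketch of such a comparison, assuming each affordance $a$ is associated with its own learned transform $W_a$, one illustrative choice (not necessarily the measure used in this work) is the Frobenius distance
\begin{equation*}
    s(a, b) = \| W_a - W_b \|_F,
\end{equation*}
where a small $s(a, b)$ indicates that the two affordances weight the feature space in a similar way.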
\begin{comment}
\cite{Nikandrova:2015uu} \cite{Thomaz:2009uk} \cite{Griffith:2009cm} ONE IDEA IS TO USE SEVERAL BOW REPRESENTATIONS AT DIFFERENT RESOLUTIONS!!
In the future we can expect that skill transfer will happen from robot to human; this will further the need for the agent's understanding of
Skill transfer from human to human works in the same manner as in the above experiment. We show some objects, explain the underlying invariant features and similarities, and how the action works. The learner can then use the acquired skills to explore and act in the environment, using the invariant knowledge to extrapolate and solve tasks. An important point is that the transfer is not an exact copy, due to the difference in semantic and sensory capabilities between teacher and learner.
In the future it is reasonable to assume that this skill transfer will also happen between humans and robots. Communication would be greatly enhanced by giving the robot the ability to ground the sensory input in semantic meaning, that is, to find the invariant features of the group and form metrics over these to discriminate which objects belong to the group and which do not. Furthermore, utilizing this learning strategy, as opposed to downloading an already trained classifier, would lead to greater adaptability and autonomy, as the knowledge would be grounded in the robot's own experience.
If we go even further we can assume that in the future skill transfer will also happen from robot to human. Here semantic meaning grounded in sensory input will be the only way to transfer knowledge, as there is as yet no known way of programming a human brain to learn complex tasks.
Previous approaches to learning affordances have focused on....
Most of the above-mentioned approaches focus on learning affordances from large datasets, do not discover the invariant features of the object, and shift the focus from grounding to model accuracy. These models, while useful for detection, will not be very helpful in human-robot interaction.
Learn the eigenspectrum, the dominant ones; plot Renaud's.
How different is the representation for the different tasks? Projection of data points as an overlay. Which tasks are correlated?
\end{comment}