Semantic Segmentation from Limited Training Data
arXiv:1709.07665v1 [cs.RO] 22 Sep 2017

A. Milan1,3, T. Pham1,3, K. Vijay1,3, D. Morrison1,2, A.W. Tow1,2, L. Liu3, J. Erskine1,2, R. Grinover1,2, A. Gurman1,2, T. Hunn1,2, N. Kelly-Boxall1,2, D. Lee1,2, M. McTaggart1,2, G. Rallos1,2, A. Razjigaev1,2, T. Rowntree1,3, T. Shen1,2, R. Smith1,2, S. Wade-McCue1,2, Z. Zhuang1,4, C. Lehnert2, G. Lin1,3, I. Reid1,3, P. Corke1,2, and J. Leitner1,2

Abstract— We present our approach for robotic perception in cluttered scenes that led to winning the recent Amazon Robotics Challenge (ARC) 2017. Next to small objects with shiny and transparent surfaces, the biggest challenge of the 2017 competition was the introduction of unseen categories. In contrast to traditional approaches which require large collections of annotated data and many hours of training, the task here was to obtain a robust perception pipeline with only a few minutes of data acquisition and training time. To that end, we present two strategies that we explored. One is a deep metric learning approach that works in three separate steps: semantic-agnostic boundary detection, patch classification and pixel-wise voting. The other is a fully-supervised semantic segmentation approach with efficient dataset collection. We conduct an extensive analysis of the two methods on our ARC 2017 dataset. Interestingly, only a few examples of each class are sufficient to fine-tune even very deep convolutional neural networks for this specific task.

This research was supported by the Australian Research Council Centre of Excellence for Robotic Vision (ACRV) (project number CE140100016). The participation at the ARC was supported by Amazon Robotics LLC. Contact: trung.pham@adelaide.edu.au
1 Authors are with the Australian Centre for Robotic Vision (ACRV).
2 Authors are with the Queensland University of Technology (QUT).
3 Authors are with the University of Adelaide.
4 Authors are with the Australian National University (ANU).

Fig. 1. An example of semantic segmentation for object picking. (a) Input image; (b) ground truth; (c) result of the Deep Metric Learning approach; (d) result of the fully-supervised semantic segmentation approach. The tote segment is not visualised. White diamonds mark unseen items.

I. INTRODUCTION

Robotic solutions have been utilised in industry for many years. However, their use is typically restricted to known and structured environments with pre-defined actions. Examples include parts assembly in the automotive industry or produce sorting in the grocery sector. Manipulating individual items in a large warehouse is a more complex task for machines and to date remains unsolved. Even though the environment can be controlled to a certain degree, e.g. the storage system can be designed in a certain way to facilitate recognition, the sheer number of items to be handled poses non-trivial challenges. In addition, the items are often placed in narrow bins to save space, thus partial or even full occlusion must be addressed from both the perception and the manipulation side.

Quantitatively evaluating robotic solutions is not a trivial task. To allow for a fair comparison and also to advance the state of the art in autonomous warehouse manipulation, in 2017 Amazon organised the third Amazon Robotics Challenge (ARC), previously known as the Amazon Picking Challenge. Sixteen qualified teams competed on the tasks of stowing and picking common objects. The first one consisted of emptying a crowded tote into a self-designed
storage system, while the latter required the robot to pack three different orders into differently sized cardboard boxes, consisting of 2, 3, and 5 items, respectively. The two tasks were combined in the final round. Both the number of items in the bins as well as their appearance made the perception task slightly more difficult compared to previous editions. However, the biggest change was the replacement of the training objects by new, unseen ones, which were presented to the participants only 45 minutes prior to the start of each task. These conditions require very robust solutions that are not fully over-fitted to the training set, but also models that can be quickly adapted to new categories.

In this work we present two different approaches that address both challenges, which we examined during the development phase. Our initial strategy was to entirely bypass the need for data collection and model training for new categories. To that end, we first learn a feature embedding [1] which transforms image patches of the same object into low-dimensional points that are close by, and patches of different objects into points that are far apart in the feature space. Classification can then be performed by a simple nearest-neighbour look-up. The patches are generated by a class-agnostic RGB-D segmentation [2]. To obtain the final segmentation map, a pixel-wise voting scheme is performed on all segments.

As our second approach, we explore a fully-supervised semantic segmentation solution based on deep neural networks. It is common practice to fine-tune existing models to specific tasks [3], [4], [5], [6]. While this is fairly straightforward in cases where all categories are known and defined beforehand, it is not such a common strategy for tasks where the amount of available training data is very limited. Nonetheless, it turns out that our choice of a deep architecture, RefineNet [4], is able to adapt to new categories using only very few training examples. To collect these examples, we follow an efficient data collection strategy, capturing seven views of each new object and using semi-automatic segmentation. The CNN is then fine-tuned for only a few minutes and reaches a level of performance that is sufficiently accurate for autonomous object manipulation. Fig. 1 illustrates typical semantic segmentation results using the two approaches.

In summary, our main contributions are as follows.
• We present the full perception approach of the winning team of the Amazon Robotics Challenge 2017.
• We examine a deep metric learning approach that does not require any on-the-fly training and can easily handle unseen categories.
• We adapt RefineNet, a deep convolutional neural network, to the specific task of autonomous picking and demonstrate how such deep architectures can be fine-tuned with very few training examples.
• We conduct a series of experiments and quantitatively compare the two orthogonal approaches.
• We compile and release a manually labelled dataset for 40 training objects and 16 validation objects, annotated with pixel-wise semantic segmentation masks.

After reviewing related work in the next section, we first provide a general overview of our system and hardware in Sec. III and then present our two perception strategies in Sec. IV and Sec. V, respectively.
Finally, we discuss our experiments in Sec. VI.

II. RELATED WORK

The body of literature on robotic vision is vast. Here we only concentrate on relevant work concerning object perception for bin-picking applications, in particular focusing on the specific task of stowing and picking items within the Amazon Challenge series. Perception is considered the most challenging part of the problem, as clearly indicated in the survey by Correll et al. [7], which was conducted for the first Amazon Picking Challenge (APC) in 2015.

RBO [8], the winning team of the first competition, approached the object segmentation problem using one RGB-D sensor and without employing any deep learning techniques. Rather, manually designed colour and geometry features were used to learn feature histograms and then estimate per-pixel posterior probabilities for each object class. The maximum across all classes then yields the most likely object for each image region. A 3D bounding box is then fitted to the hypothesis to determine the grasp point. Their perception pipeline is described in detail in [9].

Team MIT [10] used depth measurements to register previously scanned object models to the obtained point cloud in order to determine the object pose. In their 2016 approach [11], the scene was measured from 15 to 18 views to produce a very dense point cloud. Similar to their previous solution, the objects were then registered to the point cloud using a modified ICP algorithm. In addition, semantic segmentation of each of the views was performed using a VGG-style architecture [12]. To train the deep model, the team required over 100,000 labelled training images, which were collected and annotated semi-automatically using background subtraction. This strategy is reminiscent of our data collection approach described in Sec. V-A, but in our case we require three orders of magnitude fewer data samples.

Contrary to the above approach, the winning solution in 2016, designed by Team Delft [13], did not rely on pixel-wise segmentation prediction, but rather used Faster R-CNN [14], an object detection algorithm. The bounding boxes only provide a coarse estimate of the object extents. To obtain a more accurate object localisation, Hernandez et al. [13] use depth cameras to register known object geometry to the point cloud, while rejecting potentially wrong measurements using heuristics. Note that our second approach presented in this paper does not rely on depth sensors but rather uses RGB information only.

Team NimbRo [3], [15] integrated both object detection and semantic segmentation into their vision system. The former is based on DenseCap [16] and adapted for the task at hand, while the latter uses the OverFeat architecture [17], [6] to learn strong features. Both solutions incorporate depth information obtained from three sources: two depth sensors and a stereo RGB setup.

Our first approach is inspired by recent work on deep metric learning [18], [1], [19], [20], which shows that a feature embedding learned on seen categories generalises well to unseen object classes. The global loss [1] uses the batch statistics to overcome the limitations of the triplet network. Lifted structure embedding [18] uses a structured loss objective on the pairwise distance matrix to improve the embedding, while [19] employs a global clustering quality metric in the loss function. In this work, we use [1] to learn a feature embedding and then use a nearest-neighbour classifier to recover the class of the input image.
The second strategy that we develop falls into the category of fully convolutional networks (FCNs). Since they were first introduced for semantic segmentation by Long et al. [21], FCNs have been explored for this task in various ways. DeepLab [22] uses so-called atrous (or dilated) convolutions to prevent excessive downscaling of the original input image. The Pyramid Scene Parsing Network (PSPNet) [23] introduces a special pooling module to aggregate context from different levels of the image pyramid. Mask R-CNN [24] is an extension of the widely known Faster R-CNN detection framework [14] that includes an instance-level segmentation prediction. In this work, we build on RefineNet [4], a multi-path refinement network that offers a good trade-off between model complexity and training efficiency. Interestingly, it can be fine-tuned to new categories using only a few training samples. This approach is described in more detail in Sec. V.

III. SYSTEM OVERVIEW

The main focus of this paper is on the perception system behind Cartman [25], the robot that won the 2017 Amazon Robotics Challenge. For completeness, we briefly describe the hardware as well as the integration of our perception solution into the system, but refer the reader to [25] for further details.

Cartman is a Cartesian manipulator composed of a three-axis gantry that moves an end-effector comprising a gripper and a sucker. The objective of Cartman is to move items between a storage system, a stow tote and packing boxes, positioned on the floor below. The gripper and sucker are positioned opposite one another; one of the wrist motors enables switching between the two tools.

Cartman uses two Intel RealSense SR300 cameras for perception. The primary camera is mounted to the underside of Cartman's wrist, enabling vision into the storage system, stow tote or packing boxes beneath. A secondary camera is mounted to the side of Cartman, looking across the top of the storage system towards a red sheet. The secondary camera allows for photographing a grasped item against a plain background. Scales are positioned beneath the storage system and stow tote to provide additional feedback through measurement of weight deltas pre- and post-grasp. The scales allow errors in perception to be caught by ensuring that the measured weight change matches the weight of the intended item.

Within the software system behind Cartman, semantic segmentation provides the link between a raw image of the scene and grasp synthesis. Grasp synthesis provides a pose at which Cartman should position its end-effector to enable a grasp to be performed. Grasp synthesis assumes that all visible items in the scene have been segmented. With an item segment provided, a hierarchy of approaches, each relying on a different level of available depth information, is used to select an appropriate grasp pose. More details of the grasp synthesis approach can be found in [25], [26].
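To make the weight-based cross-check concrete, the following minimal Python sketch compares the measured weight delta against the catalogue weight of the intended item. All names, and the 10 g tolerance, are illustrative assumptions and not the team's actual implementation.

def grasp_weight_check(weight_before_g, weight_after_g, item_weights_g, intended_item, tol_g=10.0):
    """Verify a grasp via the scales beneath the storage system or tote.

    weight_before_g / weight_after_g: scale readings (grams) before and after the grasp.
    item_weights_g: dict mapping item name -> known item weight in grams.
    tol_g: hypothetical tolerance; the actual threshold is not stated in the paper.
    """
    delta = weight_before_g - weight_after_g   # weight removed from the container
    # Accept the grasp only if the removed weight matches the intended item.
    return abs(delta - item_weights_g[intended_item]) <= tol_g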
IV. DEEP METRIC LEARNING

Deep learning based object recognition approaches have shown great success [27], [12], [28], [14], [22]. Such deep models, however, usually require a large collection of ground-truth data for training. Recognising unseen objects with only a few sample images available remains a difficult task. Here we pursue a deep metric learning approach to recognising unseen objects that does not require re-training the system. The idea is to learn a feature embedding via a deep neural network that transforms images of different views of the same object into similar low-dimensional features.

Fig. 3. Overview of the Deep Metric Learning approach.

A. Geometric Segmentation

Object contour detection and object segmentation in images are dual problems. One can easily obtain a segmentation if a contour is available, and vice versa. As contours are independent of semantic classes, contour detection approaches are able to generalise to unseen categories and novel scenes without any re-training. This property is well suited to the picking challenge, where unknown object categories are introduced during the competition.

The adoption of Convolutional Neural Networks (CNNs) has led to significant progress in many computer vision tasks such as object detection [14] and semantic segmentation [21]. Modern approaches to low-level tasks such as contour detection also achieve impressive performance [29], [2]. Here we resort to the convolutional oriented boundaries (COB) network [2] to predict object boundaries directly from RGB-D images. In particular, the COB network predicts multi-scale, multi-orientation contour probability maps, which are then combined into a single Ultrametric Contour Map (UCM) [30], a hierarchical representation of the image. Figure 2 illustrates multiple segmentation hypotheses at multiple scales. One drawback of this approach is that it is not obvious which level will yield the optimal segmentation. We bypass this issue by passing all regions at all levels to the object classification step (see Section IV-B), and then use pixel voting (see Section IV-C) to arrive at the final semantic segmentation for the entire image. In practice, we remove regions that are either above or below certain sizes, i.e., regions bigger than 50% or smaller than 0.1% of the image size are rejected.

Fig. 2. An example of semantic-agnostic object segmentation. Top row: input colour, raw depth and inpainted depth images. Bottom row: predicted boundary map and segmentations at two different scales. Different colours encode different object candidate regions, rather than semantic classes.

As our robot is equipped with a colour and a depth camera for perception, we use RGB-D images for boundary detection. We adopt the COB model trained on the NYU RGB-D dataset [31], which somewhat resembles our bin-picking data. No fine-tuning is performed. This COB model takes RGB and HHA [32] images as input. The HHA image encodes depth, height above ground and angle to gravity, and is normalised to [0, 255] so that it is consistent with the RGB image before being passed through the network. In our picking task, the camera is mounted at the top, looking downward onto the objects in the tote. Therefore we set the height above ground to a constant value (1 in our case).
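To make the region-pruning rule above concrete, the short Python/NumPy sketch below keeps only candidate segments whose area lies between 0.1% and 50% of the image. The mask format and function name are illustrative assumptions, not the actual pipeline code.

import numpy as np

def filter_candidate_regions(masks, min_frac=0.001, max_frac=0.5):
    """Prune candidate regions from the segmentation hierarchy by relative area.

    masks: list of boolean H x W arrays, one per candidate region,
           gathered from all levels of the UCM hierarchy.
    """
    kept = []
    for mask in masks:
        area_frac = mask.sum() / mask.size     # fraction of the image covered
        if min_frac <= area_frac <= max_frac:  # reject <0.1% or >50% regions
            kept.append(mask)
    return kept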
B. Feature Embedding

The output segments from the COB can be very noisy, varying from a small part of an object to a segment spanning multiple objects. Hence, it is important to learn a robust classifier that can assign these noisy segments to the correct object categories. Traditional losses for classification such as softmax and binary cross entropy (BCE) do not suit the task at hand because they require the number of categories to be defined and fixed a priori. To overcome this, we use a deep metric learning approach that has shown promising results in handling unseen categories [20], [1], [18]. As opposed to standard classifiers that learn class-specific information, the metric learning approach attempts to learn the general concept of distance metrics in the embedding space and can thus generalise well to unseen categories. In addition, a simple nearest-neighbour classifier on the learned embedding space can be used to classify the given objects.

In this paper, we employ the deep metric learning approach proposed in [1] for learning the embedding space. This method learns a convolutional neural network (CNN) F : R^{n×n} → R^d that directly maps an input image x_i ∈ R^{n×n} to a low-dimensional embedding space F(x_i) ∈ R^d, where images belonging to the same category are mapped to nearby points and images belonging to different categories are mapped far apart. To learn such a mapping, the authors employ a Triplet Network that consists of three identical branches of CNNs that share the same weights, as shown in Fig. 4. The input data for the Triplet Network consists of triplets of samples, where each triplet includes an anchor x^a, a positive x^p and a negative x^n such that x^a and x^p belong to the same category and x^n belongs to a different category. The network is trained using a triplet loss J^t that aims to separate the similar pairs (x_i^a, x_i^p) from the dissimilar pairs (x_i^a, x_i^n) by a margin m [33]:

J^t(x_i^a, x_i^p, x_i^n) = \max\left(0,\; 1 - \frac{\|F(x_i^a) - F(x_i^n)\|_2}{\|F(x_i^a) - F(x_i^p)\|_2 + m}\right).    (1)

Fig. 4. A Triplet Network with a global loss. The figure is adapted from [1]. It is important to point out that the network is trained on images of seen categories only.

However, it was shown in [19], [1] that the triplet network ignores the global structure of the embedding space and can lead to sub-optimal results. To overcome this, a global loss was proposed in [19], [1] that uses the statistics of the samples in the batch to improve the embedding. Specifically, it assumes that the distances between the similar pairs, d_i^+ = \|F(x_i^a) - F(x_i^p)\|_2^2 / 4, and the dissimilar pairs, d_i^- = \|F(x_i^a) - F(x_i^n)\|_2^2 / 4, each follow a distribution, and the objective is to minimise the overlap J^g between the two distributions:

J^g(\{x_i^a, x_i^p, x_i^n\}_{i=1}^N) = (\sigma^{2+} + \sigma^{2-}) + \lambda \max\left(0,\; \mu^+ - \mu^- + t\right),    (2)

where N is the number of samples in the mini-batch, t is the margin that separates the means of the two distributions, and \mu^{\pm} and \sigma^{2\pm} denote the mean and variance of the similar and dissimilar distance distributions, respectively. In this paper, we use a weighted combination of the triplet and the global loss defined by

J^c(\{x_i^a, x_i^p, x_i^n\}_{i=1}^N) = J^g(\{x_i^a, x_i^p, x_i^n\}_{i=1}^N) + \alpha \sum_{i=1}^N J^t(x_i^a, x_i^p, x_i^n),    (3)

where \alpha is set to 0.8 in our experiments.

Implementation and training details: For training the feature embedding model, we initialise the network with ImageNet pre-trained GoogLeNet [34] weights and randomly initialise the final fully connected layer, similar to [18]. The learning rate for the randomly initialised fully connected layer is multiplied by 10 to achieve faster convergence. We use the images from our dataset consisting of 42 categories (40 training categories + tote + unlabelled) to generate 800K triplets and train the network for 20 epochs. The images are first resized to 256 × 256 and then randomly cropped to 227 × 227. Similar to [1], we set the margin for the triplet and global loss to 0.2 and 0.01, respectively. We start with an initial learning rate of 0.1 and gradually decrease it by a factor of 2 after every 3 epochs. We use a weight decay of 0.0005 for all of our experiments.
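As a concrete illustration of Eqs. (1)-(3), the following PyTorch-style sketch computes the combined loss on a mini-batch of embedded triplets, using m = 0.2, t = 0.01 and alpha = 0.8 as stated above. The weight lambda inside the global loss is left as a parameter since its value is not given here; all function and variable names are illustrative.

import torch

def combined_loss(fa, fp, fn, m=0.2, t=0.01, alpha=0.8, lam=1.0):
    """Combined triplet + global loss of Eqs. (1)-(3).

    fa, fp, fn: (N, d) tensors holding the embeddings F(x^a), F(x^p), F(x^n).
    lam: the lambda weight of Eq. (2); its value is an assumption here.
    """
    d_an = (fa - fn).norm(dim=1)                 # ||F(x^a) - F(x^n)||_2
    d_ap = (fa - fp).norm(dim=1)                 # ||F(x^a) - F(x^p)||_2

    # Eq. (1): margin-ratio triplet loss, summed over the mini-batch
    j_t = torch.clamp(1.0 - d_an / (d_ap + m), min=0.0).sum()

    # Eq. (2): global loss on the distributions of squared pair distances
    d_pos = d_ap.pow(2) / 4.0                    # d_i^+
    d_neg = d_an.pow(2) / 4.0                    # d_i^-
    j_g = (d_pos.var() + d_neg.var()) + lam * torch.clamp(d_pos.mean() - d_neg.mean() + t, min=0.0)

    # Eq. (3): weighted combination
    return j_g + alpha * j_t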
C. Pixel Voting

As described above, the COB network produces an entire hierarchy of image segmentations for each image. Consequently, each pixel can be part of multiple segments. To resolve this ambiguity, a pixel voting scheme is used in the deep metric learning pipeline to merge the labels of the multiple segmentation proposals into a single segmentation map. In particular, the COB and feature embedding steps produce approximately 100 binary segmentation masks, each of which is assigned the k nearest labels (we used k = 3 in our experiments). Because many segmentation masks overlap and the k nearest labels may not belong to the same class, multiple labels are associated with the same pixels. The pixel voting method iterates over each label of every mask and accumulates a pixel-wise tally of how many times a class is associated with a pixel. Then, a list of expected classes, maintained by the robot system, is used to remove tallies for classes that are known to be absent from the image. Finally, for each pixel, the class with the highest tally is used as the label for that pixel. The output after pixel voting is a single segmentation map that can be directly used by the robot for manipulating a particular object.
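A minimal sketch of this voting procedure is given below: every candidate mask contributes its k = 3 nearest-neighbour labels to a per-pixel tally, votes for classes that are not on the robot's expected-item list are discarded, and the argmax over the remaining tallies yields the final label map. Function and variable names are illustrative, not the actual implementation.

import numpy as np

def pixel_voting(masks, mask_labels, num_classes, expected_classes):
    """masks: list of boolean H x W arrays from the COB hierarchy.
    mask_labels: list of label lists, the k = 3 nearest classes per mask.
    expected_classes: ids of the classes known to be present in the scene.
    """
    h, w = masks[0].shape
    tally = np.zeros((num_classes, h, w), dtype=np.int32)
    for mask, labels in zip(masks, mask_labels):
        for c in labels:                      # each nearest-neighbour label casts a vote
            tally[c][mask] += 1
    for c in range(num_classes):              # drop votes for classes known to be absent
        if c not in expected_classes:
            tally[c] = 0
    return tally.argmax(axis=0)               # final per-pixel label map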
V. FULLY-SUPERVISED SEMANTIC SEGMENTATION

The above approach conceptually fits well with the ARC 2017 competition rules. It can segment objects without any notion of semantics and it does not require any training to handle novel objects. However, it also has two major drawbacks. First, the RGB-D boundary detection is rather slow even on modern GPUs, owing to the computation of additional geometric features (i.e., HHA features) before the input is passed to the boundary network; moreover, it requires an inpainting procedure to produce dense depth maps. Second, and perhaps more importantly, the entire segmentation pipeline consists of multiple sequential steps, so an error made during boundary detection cannot be corrected later on.

To remedy these shortcomings, we explore a second scene understanding approach based on semantic segmentation which can be trained end-to-end. To that end, we adopt the recently developed RefineNet architecture [4] for our purpose. In a nutshell, RefineNet is a deep (fully) convolutional neural network (CNN) which exploits feature maps at different levels of detail to produce high-resolution semantic maps. A high-level overview is illustrated in Fig. 5. Compared to previous approaches that tackled high-resolution semantic segmentation, such as DeepLab [22], RefineNet reduces the memory consumption and yields more accurate results on a number of public benchmarks. Its key idea is the adaptation of the identity mapping paradigm [36], which allows for effective gradient flow and consequently effective and efficient end-to-end training of very deep networks.

Fig. 5. A schematic overview of RefineNet, adapted from [4]. ResNet [35] feature maps of varying resolutions are gradually combined through multi-path refinement to arrive at a refined high-resolution pixel-wise prediction.

A. Fast Data Collection

The 2017 Amazon Robotics Challenge (ARC) required robots to pick from a set of 50% known and 50% unknown items. The unknown item set was provided 45 minutes before an official run and was available for the first 30 minutes of that time. The challenge here is that standard data collection and annotation approaches (such as manual annotation of cluttered scenes) are infeasible in the available time.

As a compromise to cluttered scenes, we opted to capture images of each new item without clutter, but with as many other commonalities to the final environment as possible. To achieve this, each item was placed in the Amazon-provided tote with the camera mounted on the robot's wrist at the same height above the scene as during a run. Each item was manually cycled through a number of positions and orientations to capture some of the variations the network would need to handle. As further described in Section VI-C, we chose to capture seven images of each unseen item. Note that we also experimented with a turntable solution. However, we found manually placing and manipulating the items within the actual scene, i.e. the tote or storage system, to be both more efficient and to yield more reliable training data for our purpose.

To speed up the annotation of these images, we employ the same RefineNet architecture as outlined above, but trained on only two classes for binary foreground/background segmentation. After each image is captured, the network outputs two segments for background and foreground, providing a segment of the item in the scene. To further accelerate the data collection task, we parallelise this approach by placing two items in the tote at a time such that they do not overlap in the image. When successful, the foreground mask contains two segments horizontally adjacent to one another (cf. Fig. 6) that can easily be separated into two connected components. Labels are automatically assigned based on the assumption that the items in the scene match those read out by a human operator from an item list. During the data capture process, another human operator visually verified each segment and class label, manually correcting any that are unsatisfactory for direct addition to the training set. After a few practice runs, a team of four members is able to capture 7 images of 16 items in approximately 4 minutes. An additional 3 minutes are taken by one human to finalise the manual check and correction procedure.

Fig. 6. Exemplar results of our automatic segmentation. A failure example is shown on the right, where the red part of the item is erroneously considered as background. Such failure cases are corrected manually by a human operator.

B. Implementation and training details

Training is performed in two steps. We first train the model on our dataset with 41 training categories (40 objects and one background class) for 100 epochs, initialised with pre-trained ResNet-101 ImageNet weights. Note that the final softmax layer contains 16 (or 10 for the stow task) placeholder entries for unseen categories. Upon collecting the data as described above, we fine-tune the model using all available training data for 20 epochs within the available time frame. We use four NVIDIA GTX 1080Ti GPUs for training. A batch size of 1 and a learning rate of 1e-4 are used for the initial fine-tuning stage; a batch size of 32 and a learning rate of 1e-5 are used for the final fine-tuning stage. It is important to note that we also exploit the available information about the presence or absence of items in the scene. The final prediction is taken as the argmax not over all 57 classes, but only over the set of categories that are assumed to be present.
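The restriction of the final prediction to the classes assumed to be present can be sketched as a simple masking of the network's class scores before the argmax. The NumPy snippet below is an illustrative version of this step, not the exact pipeline code.

import numpy as np

def predict_present_only(scores, present_classes):
    """scores: (C, H, W) per-class scores from the segmentation network,
    with C covering all 57 classes. present_classes: ids of the items
    expected in the current bin or tote (plus background).
    """
    masked = np.full(scores.shape, -np.inf)     # suppress absent classes
    idx = list(present_classes)
    masked[idx] = scores[idx]
    return masked.argmax(axis=0)                # (H, W) label map over present classes only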
VI. EXPERIMENTS

A. Dataset

The 2017 contest challenged the teams by operating on a training and a competition item set. The former contained 40 items physically provided by Amazon Robotics to each team three months before the event, while the latter was revealed just 45 minutes before each run. Using the provided 40 training items and a curated set of 16 items to simulate the competition set, we produced a dataset to benchmark various vision systems. The unseen item set was selected to reflect the properties of the training items and consisted mainly of typical household objects found in stores across Australia.

The dataset is composed of a train and a test set; the training set is further split into seen and unseen subsets. The seen-items training set comprises 137 images that contain between 0 and 20 items; each of the 40 items in this set appears between 11 and 76 times. The unseen-items training set comprises 120 images that contain 2 items each; each of the 16 items appears 15 times. The test set comprises 67 images that contain between 2 and 20 items, with an approximately equal split of seen and unseen items in each image. The dataset was captured using the hardware setup described in Sec. III and contains per-pixel labelled RGB images alongside aligned depth images.

B. Evaluation Criteria

The most common measures for evaluating semantic segmentation vision systems are Intersection over Union (IOU) and F1. We argue that these metrics are sub-optimal when benchmarking semantic segmentation systems for use in robotic applications. Within the context of our robot, Cartman, the semantic segmentation system links a raw image of the scene to a grasp point selection system. Our grasp point selection system assumes that every pixel of a provided segment belongs to the target item. Grasp points are generated using heuristics such as the distance from the segment edge and the direction of surface normals. With our grasp point selection system in mind, receiving a segment that contains pixels of neighbouring items may result in the robot grasping the wrong item entirely. As depicted in Fig. 7, the measures of IOU and F1 do not capture the desirability of a given segment for this type of grasp selection system.

Fig. 7. An example illustrating the importance of different measures for grasping applications. Top (F0.5 = 63%, F1 = 72%, IOU = 57%): the object of interest, here the tube socks, is under-segmented and the robot may pick a wrong item that is contained in the segmented region. Bottom (F0.5 = 83%, F1 = 66%, IOU = 50%): only a part of the entire object is segmented correctly, yielding lower F1 and intersection-over-union (IOU) measures. It is evident, however, that this segmentation is more suitable for determining a correct grasp point and successfully manipulating the object. We argue that the F0.5 measure is far more informative than the common IOU or F1. Note that precision would also be indicative of success in this example, but should not be used in isolation because it loses information about recall.

This point highlights that the choice of metric is critical and provides our justification for why the F0.5 measure, defined as

F_{0.5} = \frac{1.25 \cdot \mathrm{precision} \cdot \mathrm{recall}}{0.25 \cdot \mathrm{precision} + \mathrm{recall}},    (4)

is used throughout this paper. F0.5 differs from F1 by weighting recall less than precision, better differentiating desirable from undesirable item segments.
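For reference, the F0.5 measure of Eq. (4) can be computed per item segment as in the short sketch below (Eq. (4) is the general F-beta score with beta = 0.5); the function name and the epsilon guard are illustrative.

import numpy as np

def f_half_score(pred_mask, gt_mask, eps=1e-9):
    """F0.5 between a predicted and a ground-truth binary mask (Eq. 4).

    Precision is weighted higher than recall, so segments that spill onto
    neighbouring items are penalised more than segments that are too small.
    """
    tp = np.logical_and(pred_mask, gt_mask).sum()
    precision = tp / (pred_mask.sum() + eps)
    recall = tp / (gt_mask.sum() + eps)
    return 1.25 * precision * recall / (0.25 * precision + recall + eps)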
C. Results

We report results for both deep metric learning (DML) and fully-supervised semantic segmentation on 67 cluttered scenes with an approximately even mix of seen and unseen items. As shown in Fig. 8, both approaches improve with an increasing number of unseen item images. Our experiments showed that the main bottleneck behind the underwhelming performance of DML is the COB step, which provides rather noisy object segmentations; given perfect segments, DML achieves an F0.5 score of about 0.85. However, with the fully-supervised approach being both more efficient and more accurate for the relevant case of multiple training examples, it was selected for use in the final competition system.

Fig. 8. We report the F0.5 score of both fully-supervised segmentation (RefineNet) and our deep metric learning approach with respect to the number of unseen item images used for training. We find that both improve with the number of images, but see a significant difference in absolute performance between the two. Interestingly, DML clearly outperforms the fully-supervised CNN for one training example, but does not quite reach its performance when enough training data are available.

We capture seven images of each item from the competition set before each official run to fine-tune our semantic segmentation approach. Seven images were found to provide the best balance between the time needed to capture the data and overall performance, as shown in Fig. 9. While the time frame is sufficiently large to collect 12 images per item and gain a slight improvement in performance, we prefer to keep a safety buffer for re-collecting the data and re-training the model a second time in case something goes wrong. All following experiments are conducted using the RefineNet model fine-tuned on seven images per unseen object.

Fig. 9. Each line represents the entire time needed to collect n images for all unseen items and train a model to reach a certain F0.5 score. Our operating mode of training with 7 images for each item is highlighted in bold red [25].

We perform two additional tests that help to characterise the semantic segmentation network. Firstly, Fig. 10 shows how the number of appearances of an item in the training set influences the F0.5 score of that item across the test set. We find no correlation between the two, indicating that performance is rather a function of an item's appearance, and not necessarily of how many training examples of the item are available. Secondly, we analyse in Fig. 11 how the number of objects in a scene impacts the mean F0.5 score of the network on each image in our test set. As expected, the performance of semantic segmentation degrades monotonically as more and more objects are present and the scene becomes more cluttered.

Fig. 10. A detailed analysis of per-class performance (seen and unseen items) as a function of the number of available training samples. See text for details.

Fig. 11. We report how the number of items in a scene changes the F0.5 score of RefineNet on our test set.

Table I provides a quantitative overview of the two methods discussed above. Even though our chosen approach (FSS) yields higher accuracy, the DML method is a better candidate when fewer training images or less time for data collection are available. Note that most of the time spent on DML is actually due to depth inpainting and HHA feature computation for the boundary detection step. The classification alone is very efficient because it consists of extracting features and finding a nearest neighbour.

TABLE I
QUANTITATIVE COMPARISON OF OUR APPROACHES.

Method   Input   Training (Offl. / Onl.)   Prediction   F0.5 (1 img. / 7 imgs.)
DML      RGB-D   24h / 0s                  8s           0.57 / 0.59
FSS      RGB     1h / 10m                  0.2s         0.53 / 0.62
VII. CONCLUSIONS

We presented two segmentation approaches that were developed to win the 2017 Amazon Robotics Challenge. One follows a deep metric learning strategy that does not need to be retrained to handle new object categories; the other is a state-of-the-art semantic segmentation CNN which can be fine-tuned using very few training examples. To obtain the training data, we also introduced an effective data collection scheme. Finally, we investigated the importance of segmentation measures in the context of robotic manipulation.

REFERENCES

[1] V. Kumar B G, G. Carneiro, and I. Reid, “Learning local image descriptors with deep siamese and triplet convolutional networks by minimising global loss functions,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[2] K. Maninis, J. Pont-Tuset, P. Arbeláez, and L. V. Gool, “Convolutional oriented boundaries: From image segmentation to high-level tasks,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017.
[3] M. Schwarz, A. Milan, C. Lenz, A. Munoz, A. S. Periyasamy, M. Schreiber, S. Schüller, and S. Behnke, “NimbRo Picking: Versatile part handling for warehouse automation,” in IEEE International Conference on Robotics and Automation (ICRA), 2017.
[4] G. Lin, A. Milan, C. Shen, and I. Reid, “RefineNet: Multi-path refinement networks for high-resolution semantic segmentation,” in CVPR, 2017.
[5] P. O. Pinheiro and R. Collobert, “From image-level to pixel-level labeling with convolutional networks,” in CVPR, 2015.
[6] F. Husain, H. Schulz, B. Dellen, C. Torras, and S. Behnke, “Combining semantic and geometric features for object class segmentation of indoor scenes,” IEEE Robotics and Automation Letters, vol. 2, no. 1, pp. 49–55, 2016.
[7] N. Correll, K. E. Bekris, D. Berenson, O. Brock, A. Causo, K. Hauser, K. Okada, A. Rodriguez, J. M. Romano, and P. R. Wurman, “Analysis and observations from the first amazon picking challenge,” IEEE Transactions on Automation Science and Engineering, 2016.
[8] C. Eppner, S. Höfer, R. Jonschkowski, R. Martín-Martín, A. Sieverling, V. Wall, and O. Brock, “Lessons from the Amazon Picking Challenge: Four aspects of building robotic systems,” in Robotics: Science and Systems (RSS), 2016.
[9] R. Jonschkowski, C. Eppner, S. Höfer, R. Martín-Martín, and O. Brock, “Probabilistic multi-class segmentation for the amazon picking challenge,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct. 2016.
[10] K.-T. Yu, N. Fazeli, N. Chavan-Dafle, O. Taylor, E. Donlon, G. D. Lankenau, and A. Rodriguez, “A summary of team MIT’s approach to the Amazon Picking Challenge 2015,” arXiv:1604.03639, 2016.
[11] A. Zeng, K.-T. Yu, S. Song, D. Suo, E. Walker Jr, A. Rodriguez, and J. Xiao, “Multi-view self-supervised deep learning for 6D pose estimation in the Amazon Picking Challenge,” in IEEE International Conference on Robotics and Automation (ICRA), 2017.
[12] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
[13] C. Hernandez, M. Bharatheesha, W. Ko, H. Gaiser, J. Tan, K. van Deurzen, M. de Vries, B. V. Mil, J. van Egmond, R. Burger, M. Morariu, J. Ju, X. Gerrmann, R. Ensing, J. van Frankenhuyzen, and M. Wisse, “Team Delft’s robot winner of the Amazon Picking Challenge 2016,” arXiv:1610.05514, 2016.
[14] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 91–99.
[15] M. Schwarz, A. Milan, A. S. Periyasamy, and S. Behnke, “RGB-D object detection and semantic segmentation for autonomous manipulation in clutter,” The International Journal of Robotics Research.
[16] J. Johnson, A. Karpathy, and L. Fei-Fei, “DenseCap: Fully convolutional localization networks for dense captioning,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[17] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “OverFeat: Integrated recognition, localization and detection using convolutional networks,” arXiv:1312.6229, 2013.
[18] H. O. Song, Y. Xiang, S. Jegelka, and S. Savarese, “Deep metric learning via lifted structured feature embedding,” 2016.
[19] H. O. Song, S. Jegelka, V. Rathod, and K. Murphy, “Deep metric learning via facility location,” in CVPR, 2017.
[20] B. Harwood, V. Kumar B G, G. Carneiro, I. Reid, and T. Drummond, “Smart mining for deep metric learning,” in The IEEE International Conference on Computer Vision (ICCV), October 2017.
[21] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.
[22] G. Papandreou, L.-C. Chen, K. P. Murphy, and A. L. Yuille, “Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation,” in ICCV, Dec. 2015.
[23] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in CVPR, 2017.
[24] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, “Mask R-CNN,” CoRR, vol. abs/1703.06870, 2017.
[25] D. Morrison, A. W. Tow, M. McTaggart, et al., “Cartman: The low-cost cartesian manipulator that won the amazon robotics challenge,” Australian Centre for Robotic Vision, Tech. Rep. ACRV-TR-2017-01, 2017.
[26] M. McTaggart, R. Smith, et al., “Mechanical design of a cartesian manipulator for warehouse pick and place,” Australian Centre for Robotic Vision, Tech. Rep. ACRV-TR-2017-02, 2017.
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012, pp. 1097–1105.
[28] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “OverFeat: Integrated recognition, localization and detection using convolutional networks,” CoRR, vol. abs/1312.6229, 2013.
[29] S. Xie and Z. Tu, “Holistically-nested edge detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2015.
[30] P. Arbelaez, “Boundary extraction in natural images using ultrametric contour maps,” in 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW’06), June 2006, pp. 182–182.
[31] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from RGBD images,” in ECCV, 2012.
[32] S. Gupta, R. Girshick, P. Arbelaez, and J. Malik, “Learning rich features from RGB-D images for object detection and segmentation,” in ECCV, 2014.
[33] P. Wohlhart and V. Lepetit, “Learning descriptors for object recognition and 3d pose estimation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[35] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
[36] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in ECCV, 2016, pp. 630–645.