Many of the tasks a humanoid robot might perform in a factory, warehouse, or home require an understanding of the geometric and semantic properties of its surroundings: the shapes of the objects it interacts with and the context in which they appear. Engineers at Boston Dynamics have shared details about how their Atlas robot “sees” the world through a flexible, adaptive perception system.

Image source: Boston Dynamics

Even a seemingly simple task, such as picking up a car part and installing it in the right place, is broken down into several steps, each requiring detailed knowledge of the surrounding space. First, Atlas detects and identifies the object. Many parts in the factory are shiny, or dark and low-contrast, making them difficult for the robot’s cameras to distinguish. Next, the robot needs to determine where the object is so it can grasp it: the desired part may be lying on a table, sitting in a container with limited clearance, and so on. Once Atlas has secured the part, it must determine where to install it and how to move it into position.

Ultimately, Atlas must place the object at a specific location with high precision: even a few centimeters of deviation can leave the part seated incorrectly or cause it to fall. To cope with this, Atlas must be able to adjust its actions when something goes wrong. For example, if a placement fails and the part drops to the floor, the robot can find it and pick it up again using its vision system.
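
As a rough illustration of this recovery behavior, here is a minimal Python sketch of the task as a tiny state machine; the step names and transition logic are assumptions for illustration, not Boston Dynamics’ actual task representation.

```python
# Hypothetical task steps and a toy recovery rule: a failed placement triggers
# re-detection of the dropped part and another grasp attempt.
from enum import Enum, auto

class Step(Enum):
    DETECT = auto()
    LOCALIZE = auto()
    GRASP = auto()
    PLACE = auto()
    RECOVER_DROPPED = auto()
    DONE = auto()

def next_step(step: Step, success: bool) -> Step:
    if step is Step.PLACE and not success:
        return Step.RECOVER_DROPPED          # part fell: find it on the floor and re-grasp
    if step is Step.RECOVER_DROPPED:
        return Step.GRASP if success else Step.DETECT
    order = [Step.DETECT, Step.LOCALIZE, Step.GRASP, Step.PLACE, Step.DONE]
    return order[order.index(step) + 1] if success and step in order[:-1] else step

print(next_step(Step.PLACE, success=False))   # Step.RECOVER_DROPPED
```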

These tasks require new methods and touch the entire Atlas perception stack, which combines well-calibrated sensors, machine-learning models, a state estimation system, and more. Perception begins with understanding what is around the robot and whether there are obstacles in its path. To identify surrounding objects, Boston Dynamics engineers use a detection system that describes each object with an identifier, a bounding box, and points of interest.
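
As a rough sketch of what such a detection record might contain, the following Python snippet defines a hypothetical structure with an identifier, a bounding box, and 2D points of interest; the field names and values are illustrative assumptions, not Boston Dynamics’ actual data format.

```python
# A minimal, hypothetical detection record: identifier, bounding box, keypoints.
from dataclasses import dataclass, field

@dataclass
class Detection:
    class_id: str                                   # object identifier, e.g. "parts_rack"
    bbox: tuple                                     # (x_min, y_min, x_max, y_max) in pixels
    keypoints: list = field(default_factory=list)   # 2D points of interest (u, v)
    confidence: float = 0.0                         # detector score in [0, 1]

# Example: a rack detected with its four visible corners as points of interest.
rack = Detection(
    class_id="parts_rack",
    bbox=(412, 96, 1180, 840),
    keypoints=[(430, 110), (1165, 118), (1172, 822), (425, 830)],
    confidence=0.93,
)
print(rack.class_id, len(rack.keypoints), "keypoints")
```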

During operation in an automobile manufacturing facility, Atlas detects the racks where different car parts are stored. Racks come in different shapes and sizes, so the robot must know not only their type but also their location to avoid collisions. In addition to detecting racks, the robot identifies their corners as points of interest, which lets it align what it sees with its internal map.
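
To illustrate how detected corner keypoints could be matched against a stored map, here is a minimal sketch of a standard 2D rigid alignment (Kabsch/Procrustes), assuming the corners have already been projected into the floor plane; the actual alignment procedure used on Atlas is not public.

```python
# Rigid 2D alignment between detected rack corners and the corners in an internal map.
import numpy as np

def align_2d(detected: np.ndarray, mapped: np.ndarray):
    """Return rotation R (2x2) and translation t mapping detected -> mapped points."""
    d_mean, m_mean = detected.mean(axis=0), mapped.mean(axis=0)
    H = (detected - d_mean).T @ (mapped - m_mean)   # cross-covariance of centered points
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                        # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = m_mean - R @ d_mean
    return R, t

detected_corners = np.array([[0.1, 0.0], [1.6, 0.1], [1.5, 2.1], [0.0, 2.0]])  # metres
map_corners      = np.array([[3.0, 1.0], [4.5, 1.1], [4.4, 3.1], [2.9, 3.0]])  # metres
R, t = align_2d(detected_corners, map_corners)
print("rotation:\n", R, "\ntranslation:", t)
```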

The key points come in two types of 2D pixel locations: outer (shown in green) and inner (shown in red). Outer points mark structures that must be avoided during operation. Inner points are more varied and numerous: they can capture the layout of shelves on a rack or the positions of boxes, and they allow individual objects to be localized precisely. To classify large objects and predict the locations of these points of interest, Atlas uses a lightweight network architecture that trades off runtime performance against perception quality, a balance that matters for the robot’s maneuverability.
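
As a sketch of what a lightweight keypoint predictor might look like, the PyTorch module below uses depthwise-separable convolutions to keep latency low and outputs one heatmap per keypoint class; the architecture, layer sizes, and channel counts are assumptions, since the real network design has not been published.

```python
# A small, hypothetical keypoint head illustrating the speed/accuracy trade-off.
import torch
import torch.nn as nn

class LightweightKeypointHead(nn.Module):
    def __init__(self, in_channels: int = 64, num_keypoint_classes: int = 2):
        super().__init__()
        # Depthwise-separable convolutions keep parameter count and latency low.
        self.depthwise = nn.Conv2d(in_channels, in_channels, 3, padding=1, groups=in_channels)
        self.pointwise = nn.Conv2d(in_channels, 32, 1)
        self.act = nn.ReLU(inplace=True)
        # One heatmap per keypoint class (e.g. "outer" vs "inner" points of interest).
        self.heatmaps = nn.Conv2d(32, num_keypoint_classes, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        x = self.act(self.pointwise(self.depthwise(features)))
        return self.heatmaps(x)                   # (batch, classes, H, W) keypoint heatmaps

head = LightweightKeypointHead()
dummy_features = torch.randn(1, 64, 60, 80)       # backbone features for a 480x640 image
print(head(dummy_features).shape)                 # torch.Size([1, 2, 60, 80])
```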

Before manipulating objects within these detected areas, Atlas determines each object’s location using a keypoint-based object localization module. The system estimates the object’s position and orientation relative to other objects nearby. The localization module takes the inner and outer points of interest from the detection pipeline and aligns them against a prior model of how those points are distributed in space. Kinematic odometry, which tracks the robot’s own motion, is also used to consolidate object positions over time and make the predicted keypoint locations more reliable.
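
One standard way to recover an object’s pose from 2D keypoints and a prior 3D model of those points is a perspective-n-point (PnP) solve; the sketch below uses OpenCV’s generic solvePnP with made-up rack corner coordinates and camera intrinsics, and does not include the kinematic-odometry fusion described above.

```python
# Keypoint-based localization sketch: recover object pose in the camera frame via PnP.
import numpy as np
import cv2

# 3D points of interest in the object (rack) frame, metres (illustrative values).
model_points = np.array([
    [0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 2.0, 0.0], [0.0, 2.0, 0.0],
], dtype=np.float64)

# Corresponding detected 2D keypoints in the image, pixels (illustrative values).
image_points = np.array([
    [430.0, 110.0], [1165.0, 118.0], [1172.0, 822.0], [425.0, 830.0],
], dtype=np.float64)

# Assumed pinhole camera intrinsics for a 1280x960 image.
K = np.array([[900.0,   0.0, 640.0],
              [  0.0, 900.0, 480.0],
              [  0.0,   0.0,   1.0]])

ok, rvec, tvec = cv2.solvePnP(model_points, image_points, K, distCoeffs=None)
print("success:", ok)
print("object position in camera frame (m):", tvec.ravel())
```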

Atlas’s object manipulation skills depend on the precision of its online perception of the surrounding space. The SuperTracker pose tracking system fuses several streams of information, including robot kinematics and computer vision. Kinematic information from Atlas’s joint encoders gives the position of the robot’s grippers in space, so incorporating it helps the system cope when the target object is occluded or outside the cameras’ field of view.
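
A much-simplified sketch of the fusion idea: when the grasped object is hidden from the cameras, its pose can still be propagated through forward kinematics (gripper pose chained with the grasp transform), and a visual estimate is preferred whenever it is available and confident. The function names, threshold, and selection logic are assumptions, not the actual SuperTracker implementation.

```python
# Toy kinematics/vision fusion for a grasped object, using 4x4 homogeneous transforms.
import numpy as np

def object_pose_from_kinematics(gripper_in_world: np.ndarray,
                                object_in_gripper: np.ndarray) -> np.ndarray:
    """Chain transforms: world <- gripper <- object."""
    return gripper_in_world @ object_in_gripper

def fuse_object_pose(visual_pose, visual_confidence, kinematic_pose,
                     confidence_threshold=0.5):
    """Prefer the visual estimate when it is available and confident enough."""
    if visual_pose is not None and visual_confidence >= confidence_threshold:
        return visual_pose
    return kinematic_pose            # fall back to kinematics when the object is hidden

gripper_in_world = np.eye(4); gripper_in_world[:3, 3] = [0.6, 0.1, 1.2]
object_in_gripper = np.eye(4); object_in_gripper[:3, 3] = [0.0, 0.0, 0.15]
kinematic_pose = object_pose_from_kinematics(gripper_in_world, object_in_gripper)
pose = fuse_object_pose(visual_pose=None, visual_confidence=0.0,
                        kinematic_pose=kinematic_pose)
print("object position (m):", pose[:3, 3])
```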

When an object is in the cameras’ field of view, Atlas uses an object pose estimation model based on render-and-compare, which estimates pose from monocular images. The model is trained on a large synthetic dataset and generalizes to new objects without retraining, given only their CAD models. When initialized with a pose estimate, the model iteratively refines that pose to minimize the discrepancy between the rendered CAD model and the camera image. Alternatively, the system can be initialized from a 2D region of interest, such as an object mask; a set of pose hypotheses is then generated and scored to select the best one. Atlas’s pose estimation has proven to work reliably on hundreds of factory objects that were modeled and textured in advance.
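
To make the hypothesis-selection step concrete, here is a deliberately toy sketch: each candidate pose is “rendered” by a stub, compared to the observed object mask with intersection-over-union, and the best-scoring hypothesis is kept. The real system renders a textured CAD model and refines the pose iteratively; none of that machinery is reproduced here.

```python
# Toy render-and-compare hypothesis scoring with mask IoU.
import numpy as np

def render_stub(pose_xy, size=20, shape=(120, 160)):
    """Toy renderer: a filled square whose position stands in for a pose hypothesis."""
    mask = np.zeros(shape, dtype=bool)
    x, y = pose_xy
    mask[y:y + size, x:x + size] = True
    return mask

def iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

observed_mask = render_stub((70, 40))           # stand-in for a detected object mask
hypotheses = [(50, 30), (68, 42), (90, 60)]     # candidate poses (toy 2D stand-ins)
scores = [iou(render_stub(h), observed_mask) for h in hypotheses]
best = hypotheses[int(np.argmax(scores))]
print("best hypothesis:", best, "score:", max(scores))
```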

SuperTracker receives these visual estimates of object pose in 3D. In manipulation scenarios, visual pose estimates can be ambiguous, for example because of partial visibility or poor lighting, so dedicated filters are used to check the plausibility of each estimate before it is accepted.
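
The article does not say what these filters are, so the sketch below only illustrates one plausible kind of check: a consistency gate that rejects a visual pose update when it jumps too far from the pose predicted by kinematics and prior tracking.

```python
# A simple, hypothetical consistency gate for visual pose updates.
import numpy as np

def gate_visual_update(predicted_position: np.ndarray,
                       visual_position: np.ndarray,
                       max_jump_m: float = 0.05) -> bool:
    """Accept the visual estimate only if it stays within max_jump_m of the prediction."""
    return float(np.linalg.norm(visual_position - predicted_position)) <= max_jump_m

predicted = np.array([0.60, 0.10, 1.35])
ambiguous = np.array([0.60, 0.42, 1.35])          # e.g. an aliased pose under poor lighting
print(gate_visual_update(predicted, ambiguous))   # False -> fall back to the prediction
```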

Precision manipulation that chains multiple actions also depends on accurate calibration of the sensors and vision systems. The engineers note that careful calibration is a key enabler of high-performance manipulation and perception-driven autonomy.

Going forward, Boston Dynamics plans to improve Atlas’s precision and adaptability. The team is working toward a single base model for Atlas and intends to go beyond what has been achieved so far, toward a system in which perception and action are no longer separate processes.
