Antoni Rosinol

San Francisco, California, United States
20K followers · 500+ connections

Experience

  • StackAI

    Cambridge, Massachusetts, United States

  • Pasadena, California, United States

  • Zürich Area, Switzerland

  • Zürich, Switzerland

  • Zürich Area, Switzerland

Volunteer Experience

  • Reviewer for Conferences and Journals

    IEEE

- Present · 6 years 7 months

    Science and Technology

  • GoPro 12 days of cause

    GoPro

1 month

    Poverty Alleviation

  • Swissnex Higher Education Fair in Singapore

    Swissnex

1 month

    Education

Publications

  • NeRF-SLAM: Real-Time Dense Monocular SLAM with Neural Radiance Fields

    arXiv

    We propose a novel geometric and photometric 3D mapping pipeline for accurate and real-time scene reconstruction from monocular images. To achieve this, we leverage recent advances in dense monocular SLAM and real-time hierarchical volumetric neural radiance fields. Our insight is that dense monocular SLAM provides the right information to fit a neural radiance field of the scene in real-time, by providing accurate pose estimates and depth-maps with associated uncertainty. With our proposed uncertainty-based depth loss, we achieve not only good photometric accuracy, but also great geometric accuracy. In fact, our proposed pipeline achieves better geometric and photometric accuracy than competing approaches (up to 179% better PSNR and 86% better L1 depth), while working in real-time and using only monocular images.

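    The uncertainty-based depth loss mentioned above amounts, in spirit, to inverse-variance weighting of the rendered depth against the SLAM depth. A minimal PyTorch sketch, assuming per-ray depth means and variances from the SLAM front-end (names and shapes are illustrative, not the authors' code):

    ```python
    import torch

    def uncertainty_depth_loss(rendered_depth: torch.Tensor,
                               slam_depth: torch.Tensor,
                               slam_depth_var: torch.Tensor,
                               eps: float = 1e-6) -> torch.Tensor:
        """Per-ray squared depth error, weighted by inverse SLAM depth variance."""
        # Confident (low-variance) depths dominate; noisy ones are down-weighted.
        weights = 1.0 / (slam_depth_var + eps)
        return (weights * (rendered_depth - slam_depth) ** 2).mean()
    ```
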
  • Probabilistic Volumetric Fusion for Dense Monocular SLAM

    IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

    We present a novel method to reconstruct 3D scenes from images by leveraging deep dense monocular SLAM and fast uncertainty propagation. The proposed approach is able to 3D-reconstruct scenes densely, accurately, and in real time while being robust to the extremely noisy depth estimates coming from dense monocular SLAM. Unlike previous approaches, which either use ad-hoc depth filters or estimate the depth uncertainty from RGB-D cameras' sensor models, our probabilistic depth uncertainty derives directly from the information matrix of the underlying bundle adjustment problem in SLAM. We show that the resulting depth uncertainty provides an excellent signal for weighting the depth maps for volumetric fusion. Without our depth uncertainty the resulting mesh is noisy and riddled with artifacts, while our approach generates an accurate 3D mesh with significantly fewer artifacts. We provide results on the challenging EuRoC dataset, and show that our approach achieves 92% better accuracy than directly fusing depths from monocular SLAM, and up to 90% improvement compared to the best competing approach.

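    The key mechanism, weighting each depth observation by its bundle-adjustment-derived uncertainty during volumetric fusion, can be pictured as a running weighted average over TSDF voxels. A minimal NumPy illustration (assumed interfaces, not the paper's code):

    ```python
    import numpy as np

    def fuse_tsdf(tsdf, weight, sdf_obs, depth_var, trunc=0.05, eps=1e-6):
        """One weighted TSDF update for a batch of observed voxels.

        tsdf, weight : per-voxel running state (same shape as sdf_obs)
        sdf_obs      : observed signed distances for those voxels
        depth_var    : variance of the depth estimate behind each observation
        """
        sdf_obs = np.clip(sdf_obs, -trunc, trunc)  # standard TSDF truncation
        w_obs = 1.0 / (depth_var + eps)            # uncertainty-derived weight
        tsdf_new = (weight * tsdf + w_obs * sdf_obs) / (weight + w_obs)
        return tsdf_new, weight + w_obs
    ```
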
  • Smooth Mesh Estimation from Depth Data using Non-Smooth Convex Optimization

    IROS

    Meshes are commonly used as 3D maps since they encode the topology of the scene while being lightweight. Unfortunately, 3D meshes are mathematically difficult to handle directly because of their combinatorial and discrete nature. Therefore, most approaches generate 3D meshes of a scene after fusing depth data using volumetric or other representations. Nevertheless, volumetric fusion remains computationally expensive both in terms of speed and memory. In this paper, we leapfrog these intermediate representations and build a 3D mesh directly from a depth map and the sparse landmarks triangulated with visual odometry. To this end, we formulate a non-smooth convex optimization problem that we solve using a primal-dual method. Our approach generates a smooth and accurate 3D mesh that substantially improves the state-of-the-art on direct mesh reconstruction while running in real-time.

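    For context, the primal-dual method referenced above belongs to the Chambolle-Pock family for non-smooth convex problems of the form min_x F(Kx) + G(x); a textbook form of the iteration is below (the paper's specific operators K, F, and G are not reproduced here):

    ```latex
    % Chambolle-Pock primal-dual iteration for  min_x F(Kx) + G(x),
    % with step sizes sigma, tau > 0 and over-relaxation theta in [0, 1]:
    \begin{aligned}
    y^{k+1} &= \operatorname{prox}_{\sigma F^{*}}\left(y^{k} + \sigma K \bar{x}^{k}\right),\\
    x^{k+1} &= \operatorname{prox}_{\tau G}\left(x^{k} - \tau K^{\top} y^{k+1}\right),\\
    \bar{x}^{k+1} &= x^{k+1} + \theta\left(x^{k+1} - x^{k}\right).
    \end{aligned}
    ```
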
  • Kimera: from SLAM to Spatial Perception with 3D Dynamic Scene Graphs

    Humans are able to form a complex mental model of the environment they move in. This mental model captures geometric and semantic aspects of the scene, describes the environment at multiple levels of abstraction (e.g., objects, rooms, buildings), and includes static and dynamic entities and their relations (e.g., a person is in a room at a given time). In contrast, current robots' internal representations still provide a partial and fragmented understanding of the environment, either in the form of a sparse or dense set of geometric primitives (e.g., points, lines, planes, voxels) or as a collection of objects. This paper attempts to reduce the gap between robot and human perception by introducing a novel representation, a 3D Dynamic Scene Graph (DSG), that seamlessly captures metric and semantic aspects of a dynamic environment. A DSG is a layered graph where nodes represent spatial concepts at different levels of abstraction, and edges represent spatio-temporal relations among nodes. Our second contribution is Kimera, the first fully automatic method to build a DSG from visual-inertial data. Kimera includes state-of-the-art techniques for visual-inertial SLAM, metric-semantic 3D reconstruction, object localization, human pose and shape estimation, and scene parsing. Our third contribution is a comprehensive evaluation of Kimera in real-life datasets and photo-realistic simulations, including a newly released dataset, uHumans2, which simulates a collection of crowded indoor and outdoor scenes. Our evaluation shows that Kimera achieves state-of-the-art performance in visual-inertial SLAM, estimates an accurate 3D metric-semantic mesh model in real-time, and builds a DSG of a complex indoor environment with tens of objects and humans in minutes. Our final contribution shows how to use a DSG for real-time hierarchical semantic path-planning. The core modules in Kimera are open-source.

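    The layered-graph idea is easy to picture as a data structure: nodes carry a layer tag and attributes, edges carry a relation and an optional time stamp. An illustrative Python sketch (field names are assumptions, not the Kimera API):

    ```python
    from dataclasses import dataclass, field
    from typing import Dict, List, Optional

    @dataclass
    class DsgNode:
        node_id: str
        layer: str                     # e.g. "object", "place", "room", "building", "agent"
        attributes: dict = field(default_factory=dict)   # pose, semantic label, ...

    @dataclass
    class DsgEdge:
        source: str
        target: str
        relation: str                  # e.g. "contains", "adjacent_to", "is_in"
        stamp: Optional[float] = None  # time stamp for dynamic (agent) relations

    @dataclass
    class DynamicSceneGraph:
        nodes: Dict[str, DsgNode] = field(default_factory=dict)
        edges: List[DsgEdge] = field(default_factory=list)

        def add_node(self, node: DsgNode) -> None:
            self.nodes[node.node_id] = node

        def add_edge(self, edge: DsgEdge) -> None:
            self.edges.append(edge)
    ```
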
  • 3D Dynamic Scene Graphs: Actionable Spatial Perception with Places, Objects, and Humans

    Robotics: Science and Systems (RSS), 2020

    Video:
    https://youtu.be/SWbofjhyPzI

    We present a unified representation for actionable spatial perception: 3D Dynamic Scene Graphs. Scene graphs are directed graphs where nodes represent entities in the scene (e.g. objects, walls, rooms), and edges represent relations (e.g. inclusion, adjacency) among nodes. Dynamic scene graphs (DSGs) extend this notion to represent dynamic scenes with moving agents (e.g. humans, robots), and to include actionable information that supports planning and decision-making (e.g. spatio-temporal relations, topology at different levels of abstraction). Our second contribution is to provide the first fully automatic Spatial PerceptIon eNgine (SPIN) to build a DSG from visual-inertial data. We integrate state-of-the-art techniques for object and human detection and pose estimation, and we describe how to robustly infer object, robot, and human nodes in crowded scenes. To the best of our knowledge, this is the first paper that reconciles visual-inertial SLAM and dense human mesh tracking. Moreover, we provide algorithms to obtain hierarchical representations of indoor environments (e.g. places, structures, rooms) and their relations. Our third contribution is to demonstrate the proposed spatial perception engine in a photo-realistic Unity-based simulator, where we assess its robustness and expressiveness. Finally, we discuss the implications of our proposal on modern robotics applications. 3D Dynamic Scene Graphs can have a profound impact on planning and decision-making, human-robot interaction, long-term autonomy, and scene prediction.

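    One reason the hierarchy is "actionable" is hierarchical planning: search the coarse room layer first, then refine only through places inside the rooms on that route. A toy sketch using networkx (an illustration of the idea, not the paper's planner):

    ```python
    import networkx as nx

    def hierarchical_plan(room_graph, place_graph, place_to_room, start, goal):
        """Plan on the room layer, then refine on the pruned place layer."""
        room_route = set(nx.shortest_path(room_graph,
                                          place_to_room[start],
                                          place_to_room[goal]))
        # Keep only places whose room lies on the coarse route.
        allowed = {p for p, r in place_to_room.items() if r in room_route}
        return nx.shortest_path(place_graph.subgraph(allowed), start, goal)
    ```
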
  • Kimera: an Open-Source Library for Real-Time Metric-Semantic Localization and Mapping

    IEEE Intl. Conf. on Robotics and Automation (ICRA), 2020

    Video:
    https://youtu.be/-5XxXRABXJs

    Code:
    https://github.com/MIT-SPARK/Kimera

    We provide an open-source C++ library for real-time metric-semantic visual-inertial Simultaneous Localization And Mapping (SLAM). The library goes beyond existing visual and visual-inertial SLAM libraries (e.g., ORB-SLAM, VINS-Mono, OKVIS, ROVIO) by enabling mesh reconstruction and semantic labeling in 3D. Kimera is designed with modularity in mind and has four key components: a visual-inertial odometry (VIO) module for fast and accurate state estimation, a robust pose graph optimizer for global trajectory estimation, a lightweight 3D mesher module for fast mesh reconstruction, and a dense 3D metric-semantic reconstruction module. The modules can be run in isolation or in combination, hence Kimera can easily fall back to a state-of-the-art VIO or a full SLAM system. Kimera runs in real-time on a CPU and produces a 3D metric-semantic mesh from semantically labeled images, which can be obtained by modern deep learning methods. We hope that the flexibility, computational efficiency, robustness, and accuracy afforded by Kimera will build a solid basis for future metric-semantic SLAM and perception research, and will allow researchers across multiple areas (e.g., VIO, SLAM, 3D reconstruction, segmentation) to benchmark and prototype their own efforts without having to start from scratch.

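    Schematically, the four modules compose as below; the interfaces are invented for illustration only (the actual library is C++, linked above):

    ```python
    class KimeraLikePipeline:
        """Toy composition of the four modules named in the abstract."""

        def __init__(self, vio, pgo, mesher, semantic_recon):
            self.vio, self.pgo = vio, pgo
            self.mesher, self.semantic_recon = mesher, semantic_recon

        def process(self, frame, imu_measurements, semantic_labels=None):
            odom = self.vio.update(frame, imu_measurements)        # fast local state estimate
            trajectory = self.pgo.update(odom)                     # globally consistent poses
            mesh = self.mesher.update(trajectory, odom.landmarks)  # lightweight 3D mesh
            if semantic_labels is not None:                        # optional dense module
                self.semantic_recon.integrate(trajectory, frame, semantic_labels)
            return trajectory, mesh
    ```

    Because each stage only consumes the previous stage's output, any subset can run in isolation, which is the fall-back behavior the abstract describes.
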
  • Primal-dual mesh convolutional neural networks

    Advances in Neural Information Processing Systems (NeurIPS 2020)

    Video: https://youtu.be/O8lgQgqQeuo
    Code: https://github.com/MIT-SPARK/PD-MeshNet
    Paper: https://arxiv.org/abs/2010.12455

    Recent works in geometric deep learning have introduced neural networks that allow performing inference tasks on three-dimensional geometric data by defining convolution, and sometimes pooling, operations on triangle meshes. These methods, however, either consider the input mesh as a graph, and do not exploit specific geometric properties of meshes for feature aggregation and downsampling, or are specialized for meshes, but rely on a rigid definition of convolution that does not properly capture the local topology of the mesh. We propose a method that combines the advantages of both types of approaches, while addressing their limitations: we extend a primal-dual framework drawn from the graph-neural-network literature to triangle meshes, and define convolutions on two types of graphs constructed from an input mesh. Our method takes features for both edges and faces of a 3D mesh as input and dynamically aggregates them using an attention mechanism. At the same time, we introduce a pooling operation with a precise geometric interpretation that allows handling variations in the mesh connectivity by clustering mesh faces in a task-driven fashion. We provide theoretical insights into our approach using tools from the mesh-simplification literature. In addition, we experimentally validate our method on the tasks of shape classification and shape segmentation, where we obtain comparable or superior performance to the state of the art.

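    One plausible construction of the two graphs built from a triangle mesh, a face-adjacency graph and a graph over mesh edges, is sketched below (an assumption for illustration; see the paper for its exact primal/dual convention):

    ```python
    from collections import defaultdict
    from itertools import combinations

    def mesh_graphs(faces):
        """faces: list of (i, j, k) vertex-index triangles."""
        edge_to_faces = defaultdict(list)
        for f_idx, (i, j, k) in enumerate(faces):
            for e in ((i, j), (j, k), (k, i)):
                edge_to_faces[tuple(sorted(e))].append(f_idx)

        # Graph 1: faces are nodes, connected when they share a mesh edge.
        face_adj = [tuple(fs) for fs in edge_to_faces.values() if len(fs) == 2]

        # Graph 2: mesh edges are nodes, connected when they bound the same face.
        face_to_edges = defaultdict(list)
        for e, fs in edge_to_faces.items():
            for f_idx in fs:
                face_to_edges[f_idx].append(e)
        edge_adj = [pair for edges in face_to_edges.values()
                    for pair in combinations(edges, 2)]
        return face_adj, edge_adj
    ```
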
  • Incremental Visual-Inertial 3D Mesh Generation with Structural Regularities

    IEEE Intl. Conf. on Robotics and Automation (ICRA), 2019

    Project Page: https://www.mit.edu/~arosinol/research/struct3dmesh.html

    Visual-Inertial Odometry (VIO) algorithms typically rely on a point cloud representation of the scene that does not model the topology of the environment. A 3D mesh instead offers a richer, yet lightweight, model. Nevertheless, building a 3D mesh out of the sparse and noisy 3D landmarks triangulated by a VIO algorithm often results in a mesh that does not fit the real scene. In order to regularize the mesh, previous approaches decouple state estimation from the 3D mesh regularization step, and either limit the 3D mesh to the current frame [1], [2] or let the mesh grow indefinitely [3], [4]. We propose instead to tightly couple mesh regularization and state estimation by detecting and enforcing structural regularities in a novel factor-graph formulation. We also propose to incrementally build the mesh by restricting its extent to the time-horizon of the VIO optimization; the resulting 3D mesh covers a larger portion of the scene than a per-frame approach while its memory usage and computational complexity remain bounded. We show that our approach successfully regularizes the mesh, while improving localization accuracy, when structural regularities are present, and remains operational in scenes without regularities.

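    The structural-regularity factors the abstract describes can be pictured as residuals that penalize landmarks for deviating from detected planes, added to the factor graph alongside the usual VIO terms. A hedged sketch (the parameterization is illustrative, not the paper's):

    ```python
    import numpy as np

    def point_on_plane_residual(landmark_xyz, plane_normal, plane_distance):
        """Signed distance of a landmark to the plane n . x = d (n unit-norm)."""
        n = np.asarray(plane_normal, dtype=float)
        n /= np.linalg.norm(n)
        return float(n @ np.asarray(landmark_xyz, dtype=float) - plane_distance)
    ```
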
  • Densifying Sparse VIO: a Mesh-based approach using Structural Regularities

    MSc. Thesis, ETH Zürich

    The ideal vision system for an autonomous robot would not only provide the robot’s position and orientation (localization), but also an accurate and complete model of the scene (mapping). While localization information allows for controlling the robot, a map of the scene allows for collision-free navigation; combined, a robot can achieve full autonomy.
    Visual Inertial Odometry (VIO) algorithms have shown impressive localization results in recent years. Unfortunately, typical VIO algorithms use a point cloud to represent the scene, which is hardly usable for other tasks such as obstacle avoidance or path planning.
    In this work, we explore the possibility of generating a dense and consistent model of the scene by using a 3D mesh, while making use of structural regularities to improve both mesh and pose estimates. Our experimental results show that we can achieve 26% more accurate pose estimates than state-of-the-art VIO algorithms when enforcing structural constraints, while also building a 3D mesh that provides a denser and more accurate map of the scene than a classical point cloud.

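    A basic ingredient of such mesh-based densification is triangulating the tracked 2D features of each frame and lifting that connectivity onto the 3D landmarks. A minimal sketch with SciPy (illustrative, not the thesis code):

    ```python
    import numpy as np
    from scipy.spatial import Delaunay

    def mesh_from_tracked_features(pixels_2d, landmarks_3d):
        """pixels_2d: (N, 2) image points; landmarks_3d: (N, 3) triangulated points.

        Returns (N, 3) vertices and (M, 3) triangle indices: the 2D Delaunay
        connectivity reused as the 3D mesh topology.
        """
        tri = Delaunay(np.asarray(pixels_2d, dtype=float))
        return np.asarray(landmarks_3d, dtype=float), tri.simplices
    ```
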
  • Ultimate SLAM? Combining Events, Images, and IMU for Visual SLAM in HDR and High-Speed Scenarios

    IEEE Robotics and Automation Letters

    ** RA-L Best Paper Award Honorable Mention **

    Video:
    https://www.youtube.com/watch?v=jIvJuWdmemE&feature=youtu.be

    Event cameras are bio-inspired vision sensors that output pixel-level brightness changes instead of standard intensity frames. These cameras do not suffer from motion blur and have a very high dynamic range, which enables them to provide reliable visual information during high-speed motions or in scenes characterized by high dynamic range. However, event cameras output little information when the amount of motion is limited, such as in the case of almost still motion. Conversely, standard cameras provide instant and rich information about the environment most of the time (in low-speed and good lighting scenarios), but they fail severely in the case of fast motions, or difficult lighting such as high dynamic range or low light scenes. In this paper, we present the first state estimation pipeline that leverages the complementary advantages of these two sensors by fusing in a tightly-coupled manner events, standard frames, and inertial measurements. We show on the publicly available Event Camera Dataset that our hybrid pipeline leads to an accuracy improvement of 130% over event-only pipelines, and 85% over standard-frames-only visual-inertial systems, while still being computationally tractable. Furthermore, we use our pipeline to demonstrate - to the best of our knowledge - the first autonomous quadrotor flight using an event camera for state estimation, unlocking flight scenarios that were not reachable with traditional visual-inertial odometry, such as low-light environments and high-dynamic range scenes.

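    One common ingredient of frame-plus-event fusion is accumulating events over a short time window into an "event frame" on which standard feature tracking can run alongside regular intensity frames. A minimal sketch (an assumption for illustration, not the authors' pipeline):

    ```python
    import numpy as np

    def accumulate_events(events, height, width):
        """events: iterable of (t, x, y, polarity) with polarity in {-1, +1}."""
        frame = np.zeros((height, width), dtype=np.float32)
        for _, x, y, p in events:
            frame[y, x] += p            # signed brightness-change count per pixel
        peak = np.abs(frame).max()
        return frame / peak if peak > 0 else frame
    ```
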

Projects

  • Visual Odometry Pipeline

    Implementation of a feature-based monocular visual odometry pipeline with bundle adjustment optimization.
    Open-source code available on GitHub:
    https://github.com/ToniRV/visual-odometry-pipeline

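    A minimal two-view step of such a feature-based monocular pipeline (ORB features, essential-matrix RANSAC, pose recovery) might look as follows: a sketch assuming OpenCV and known intrinsics K; the linked repository is the actual implementation.

    ```python
    import cv2
    import numpy as np

    def two_view_pose(img0, img1, K):
        """Relative pose (R, t up to scale) between two grayscale frames."""
        orb = cv2.ORB_create(2000)
        kp0, des0 = orb.detectAndCompute(img0, None)
        kp1, des1 = orb.detectAndCompute(img1, None)
        matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des0, des1)
        pts0 = np.float32([kp0[m.queryIdx].pt for m in matches])
        pts1 = np.float32([kp1[m.trainIdx].pt for m in matches])
        # Essential matrix with RANSAC rejects outlier matches.
        E, _ = cv2.findEssentialMat(pts0, pts1, K, cv2.RANSAC, 0.999, 1.0)
        _, R, t, _ = cv2.recoverPose(E, pts0, pts1, K)
        return R, t
    ```
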
  • Robotics Programming Lab

    In this laboratory class we used the Thymio robot developed by EPFL to implement the following algorithms (a toy sketch of item 1 follows the list):
    1) PID position control using wheel odometry
    2) Tangent Bug obstacle avoidance algorithm
    3) A* path planning algorithm
    4) Object recognition with a Primesense 1.09 using Spin Image descriptors (in C++ and ROS)
    5) Particle filter localization with a Primesense 1.09 used as a laser scanner (in C++ and ROS)
    6) A final search-and-rescue task integrating all the algorithms: the robot had to localize, plan a path, stop at a given location, locate and recognize objects, and navigate its way towards the goal, while avoiding obstacles.
    The whole project amounted to almost 6,000 lines of code.
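
    As referenced in item 1 above, a toy PID position controller (gains and interfaces are made up, not the lab's code):

    ```python
    class PID:
        """Textbook PID: u = kp*e + ki*integral(e) + kd*de/dt."""

        def __init__(self, kp, ki, kd):
            self.kp, self.ki, self.kd = kp, ki, kd
            self.integral = 0.0
            self.prev_error = None

        def step(self, error, dt):
            self.integral += error * dt
            deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
            self.prev_error = error
            return self.kp * error + self.ki * self.integral + self.kd * deriv
    ```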

Honors & Awards

  • IEEE/CVF WACV 2023 – Best Paper Award Honorable Mention

    IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

    “Probabilistic Volumetric Fusion for Dense Monocular SLAM”, Antoni Rosinol, John J. Leonard, Luca Carlone

  • Rafael Del Pino Fellowship

    Rafael Del Pino Foundation

    Awarded to the top 2% of applicants (500+ applications).

  • DRF PhD Founder Track Fellow

    Dorm Room Fund

    Dorm Room Fund is a student-run VC backed by First Round Capital investing in student founders across the US.

  • IEEE RA-L – Best Paper Award Honorable Mention

    IEEE Robotics and Automation Letters

    "UltimateSLAM" received the IEEE Robotics and Automation Letters 2018 Best Paper Award Honorable Mention during the award session at the ICRA 2019 conference in Montreal. It ranked 2nd out of 720 papers published by RAL in 2018.

  • La Caixa Fellowship

    La Caixa Foundation

    Ranked 1st among candidates in the Science and Engineering track (446 applications).

  • Siemens - Future Makers AI Challenge – 1st Prize

    Massachusetts Institute of Technology

    http://news.mit.edu/2018/mit-grad-students-mannai-rosinol-vidal-win-siemens-futuremakers-first-prize-0904

  • HackUPC - Financial Chatbot Award – 1st Prize

    HackUPC

    During the HackUPC 2017 hackathon (700 hackers), we created ALDA, a chatbot that lets you analyze your spending in a convenient, effortless way. You can ask it how much you spend on Uber, or analyze your current spending on different subscription-based services.
    https://hackupc.com

  • IROS - Autonomous Drone Race – 2nd Prize

    IROS

    The IROS 2017 Autonomous Drone Race in Vancouver is a technical challenge, open worldwide to robotics researchers, to showcase their solutions for the autonomous flight of agile drones in complex scenes.

  • Zeno Karl Schindler Fellowship

    Zeno Karl Schindler Foundation

  • International Self-Driving Car Competition – 1st Prize

    University of Arizona

    Best Overall and Best Object Classification.

    The competition consisted of controlling an autonomous car equipped with a LIDAR, two lateral RGB cameras, and a 2D laser scanner, with the objective of recreating the real world as accurately as possible. http://catvehiclechallenge.org/

    https://www.youtube.com/watch?v=ZtSKxJJSUqE&t=160

  • ETHZ - From Science to Startup – 3rd Prize

    ETHZ Entrepreneur Club

  • ETHZ - Autonomous Search and Rescue – 1st Prize

    Robotics Programming Lab

    Search-and-rescue competition using an RGB-D camera on top of a Thymio ground robot with peripheral laser sensors: the robot had to localize itself, plan a path to the given goal, and label objects placed in the scene, all while avoiding obstacles.

  • ETHZ - Entrepreneur Award – 3rd Prize

    ETH Entrepreneur Club

    Our startup Velohub, and its flagship product Blinkers, received the 3rd prize of the ETH Entrepreneur Club (3,000 CHF).

Languages

  • English

    Full professional proficiency

  • Spanish

    Native or bilingual proficiency

  • French

    Full professional proficiency

  • Catalan

    Native or bilingual proficiency
