News
- Awarded the Amazon AI PhD Fellowship (2025-2027) (Aug 2025).
- COLLAGE accepted to CoRL 2025 (Aug 2025).
- Started PhD (CS) at the University of Texas at Austin (Aug 2024).
- Won the DataComp challenge at ICCV 2023 (Oct 2023).
- Completed MS in Computer Science at UCSD and joined TikTok/ByteDance Seed team as a Research Engineer (June 2023).
- GraphIRL accepted for oral presentation (top 6%) at CoRL 2022 (Aug 2022).
- Interning at TikTok/ByteDance (June 2022).
- TOT accepted to CVPR 2022 (Feb 2022).
- LAV accepted to CVPR 2021 (Feb 2021).
|
Publications and Preprints
*: Equal Contribution. †: Equal Advising.
|
|
COLLAGE: Adaptive Fusion-based Retrieval for Augmented Policy Learning
Sateesh Kumar,
Shivin Dass,
Georgios Pavlakos†,
Roberto Martín-Martín†
Conference on Robot Learning (CoRL), 2025
project page
/
arXiv
COLLAGE is a few-shot imitation learning method that adaptively fuses demonstrations from multiple similarity modalities. It estimates each modality’s usefulness by training a policy on its retrieved data and measuring how probable the target actions are under that policy.
|
|
Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters
Weizhi Wang,
Khalil Mrini,
Linjie Yang,
Sateesh Kumar,
Xifeng Yan,
Heng Wang
arXiv, 2024
project page
/
arXiv
This paper introduces a method that uses finetuned multimodal language models to filter image–text pairs more accurately than traditional metrics like CLIPScore. It defines multiple specialized quality metrics and builds instruction-tuning data, guided by stronger models such as GPT-4, to train the models to score data effectively. The result is a reliable and efficient filter that better evaluates image–text alignment.
|
|
The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering
Haichao Yu,
Yu Tian,
Sateesh Kumar,
Linjie Yang,
Heng Wang
International Conference on Computer Vision (ICCV) DataComp Workshop, 2023
(Ranked 1st in DataComp challenge)
arXiv
We introduce a three-stage filtering strategy for enhancing model performance. The three stages are single-modality filtering, cross-modality filtering, and data distribution alignment. The proposed approach significantly surpasses previous methods on the DataComp benchmark.
|
|
Graph Inverse Reinforcement Learning from Diverse Videos
Sateesh Kumar,
Jonathan Zamora*,
Nicklas Hansen*,
Rishabh Jangir,
Xiaolong Wang
Conference on Robot Learning (CoRL), 2022 (Oral)
project page
/
arXiv
GraphIRL is a self-supervised method for learning a task reward solely from videos.
We build an object-centric graph abstraction from video demonstrations and then learn an embedding space that captures task progression in a self-supervised manner by exploiting the temporal cue in the videos.
|
|
Unsupervised Action Segmentation by Joint Representation Learning and Online Clustering
Sateesh Kumar*,
Sanjay Haresh*,
Awais Ahmed,
Andrey Konin,
M. Zeeshan Zia,
Quoc-Huy Tran
CVPR, 2022
project page
/
arXiv
We propose temporal optimal transport for jointly learning representations and performing online clustering in an unsupervised manner.
The approach learns prototype vectors via backpropagation. The prototype vectors are initialized at random and act as cluster centroids.
|
|
Learning by Aligning Video in Time
Sateesh Kumar*,
Sanjay Haresh*,
Huseyin Coskun,
Shahram N. Syed,
Andrey Konin,
M. Zeeshan Zia,
Quoc-Huy Tran
CVPR, 2021
project page
/
arXiv
We propose alignment as a pretext task for self-supervised video representation learning.
The proposed approach leverages differentiable dynamic time warping for learning global alignment across pairs of videos.
|
|
Towards Anomaly Detection in Dashcam Videos
Sateesh Kumar*,
Sanjay Haresh*,
M. Zeeshan Zia,
Quoc-Huy Tran
IV, 2020
talk
/
arXiv
We collect a video dataset of road-based anomalies. We propose an object-object interaction reasoning approach for detecting anomalies without additional supervision.
We experiment with reconstruction-based and one-class classification-based approaches.
|
|