Pre-extracted features for the dataset, provided for accessibility and quick iteration.


Download with --parts features/omnivore_video

These are extracted with the same code as Ego4D and hence share the same format: see Ego4D's documentation. See the Feature Extraction README if you are interested in contributing another model.

How Features are Extracted (What is Input to the Model)

Here is how each video is extracted:

  • Features are extracted for each take and camera (cam_id) and camera stream (stream_id)
  • A stride of 16/30 seconds is used, with a window size of 32/30 seconds.
    • If the total duration is not evenly divisible by the stride, the last [n - 32/30, n) seconds of video (where n is the duration) is used as the final window.
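The windowing described above can be sketched in a few lines of Python. This is an illustrative reconstruction of the stated stride/window/final-window rule, not the actual extraction code; the `window_bounds` helper is hypothetical.

```python
STRIDE = 16 / 30   # seconds between window start times
WINDOW = 32 / 30   # seconds covered by each window

def window_bounds(duration):
    """Return the (start, end) time in seconds of each feature window.

    Windows advance by STRIDE; if the duration is not an exact multiple
    of the stride, one final window covering [duration - WINDOW, duration)
    is appended, per the rule described above.
    """
    bounds = []
    start = 0.0
    while start + WINDOW <= duration:
        bounds.append((start, start + WINDOW))
        start += STRIDE
    # Append a trailing window if the regular grid did not reach the end.
    if not bounds or bounds[-1][1] < duration:
        bounds.append((duration - WINDOW, duration))
    return bounds
```

For a 5-second clip this yields 8 regularly strided windows plus one final window ending exactly at 5.0 s.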

What Features are Available

Currently we only extract features from Omnivore Swin-L's video head (omnivore_video) using a window size of ~32 frames (more accurately 32/30 seconds).

How to Read the Features

Download with --parts features/omnivore_video.

Once downloaded, each feature will be available under <download-dir>/features/<take_uid>_<cam_id>_<stream_id>.pt. Use torch.load to load each file.


  • <download-dir>: the directory you downloaded the data to
  • <take_uid>: the identifier for the take
  • <cam_id>: the identifier for the camera, e.g. aria01, cam01, etc. This is the same ID as in the captures.json or takes.json files
  • <stream_id>: the identifier for the video stream. For GoPro cameras this is always 0; for Aria it is always rgb, as we do not currently extract features from the SLAM (L/R) or eye cameras
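Putting the pieces above together, building the path and loading a file might look like the following sketch. The `feature_path` helper and the take UID are hypothetical (take UIDs come from takes.json); only the `<take_uid>_<cam_id>_<stream_id>.pt` naming is from the docs, and the actual load requires the downloaded data plus PyTorch, so it is shown commented out.

```python
import os

def feature_path(download_dir, take_uid, cam_id, stream_id):
    # Naming convention from the docs: <take_uid>_<cam_id>_<stream_id>.pt
    return os.path.join(download_dir, "features", f"{take_uid}_{cam_id}_{stream_id}.pt")

# Hypothetical take UID, for illustration only.
path = feature_path("/data/egoexo", "some-take-uid", "aria01", "rgb")

# import torch
# features = torch.load(path)  # one feature vector per extracted window
```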

For training purposes, we recommend pre-processing the features into an HDF5 dataset: see the function save_ego4d_features_to_hdf5 (you will need to modify it) and LabelledFeatureDset for usage during training. You can refer to clep as an example.