Expert Commentary
- To download transcriptions, use `--parts annotations`. You can use `--benchmarks expert_commentary` to filter for just the transcription file.
- To download all data for expert commentary, use `--parts expert_commentary` (and `--parts annotations` for the train/val split transcription file).
Please refer to the Jupyter notebook tutorial for an overview of the available annotation files and how to use them.
We enlist subject domain experts (e.g. professional chefs, coaches, performers) to provide commentary on the takes in our dataset. Unlike prior datasets, which primarily focus on what is happening, expert commentary additionally provides insights on the why of the demonstration, and how well the demonstration is performed.
Each take may be commentated on by several different in-domain experts, each of whom may have different styles or aspects they prioritize; the expert providing a particular commentary is identifiable by a unique 6-character ID.
The expert commentary consists of three types of data:
- Commentaries: As experts watch each take, they pause the video as desired to record verbal commentary. Experts were instructed to keep commentaries retrospective -- i.e. they should not reference events that happen after the frame on which they've paused. We provide the video timestamp at which the expert paused the video, as well as the order in which the commentaries were recorded. We also provide both the original audio recordings and auto-transcriptions generated with Whisper.
- Spatial drawings: For each commentary, experts were given the option to draw on the video player to provide visual grounding to their comments. When available, we provide the stroke paths as a temporal sequence of pixel coordinates.
- Proficiency: After watching each take, experts provided a single proficiency score for the entire demonstration on a scale from 1-10. Experts were encouraged to utilize the full dynamic range across all the takes they viewed, but were otherwise given the freedom to define the scaling themselves.
Annotations Structure
The annotations for Expert Commentary are present in two forms:
- An aggregation of all of the commentary transcriptions in a single JSON file (per train/val split)
  - Available via `--parts annotations --benchmarks expert_commentary` (note the `--benchmarks` flag is optional)
- A separate directory containing all the audio recordings and metadata associated with a particular annotation instance from an expert annotator
  - Available via `--parts expert_commentary`
The first form contains only the transcriptions of the commentaries. The second form, in addition to the transcriptions, contains the audio recordings, spatial drawing data, and the proficiency score (along with the reason why) per annotation instance.
The second form is significantly larger than the first in terms of file size, which is one of the reasons why the annotations are structured into two parts. The other reason for having the first form is convenience: it serves as an easy-to-use index into the second form, i.e. by reading a single file you can get a list of all the commentaries present for one of the train/val splits.
First Form Structure (`expert_commentary_<split>.json`)
We transcribe each annotation and aggregate these transcriptions into a single
JSON file per split (`--parts annotations`). The test set is redacted due to
the Proficiency benchmark task. This
single annotation file (per split) does not include spatial drawing data or
proficiency information, but can be used to refer to the more granular
annotation data as described below.
Sample JSON
Here is a JSON snippet showing a sample annotation from the `expert_commentary_<split>.json` file:
{
  "ds": "YYYYDDMM_HH:MM:SS", // when the data was exported
  "annotations": {
    "<take_uid1>": [
      {
        // take identifiers
        "take_name": "cmu_bike15_4",
        "take_uid": "816c4bd2-a5ba-4e40-854f-c98d771d1060",
        // task information for the take
        "task_id": 4004,
        "task_name": "Clean and Lubricate the Chain",
        // where the commentary is located relative to annotations/expert_commentary/
        "commentary": "cmu_bike15_4/qfbn5o", // last 6 characters (after the slash) are the expert's uid
        "commentary_data": [
          {
            "recording": "4.webm", // associated audio file
            // where in the video the commentary was made
            "video_time": 27.833885,
            // a transcription of the commentary
            "text": " Our mechanic is using a 15mm ratcheting combination wrench, which will be a great tool for the job and will make this job quicker and more efficient.",
            // how long the audio recording is (approximately)
            "duration_approx": 12.76620000000298,
            // whether there was an error transcribing the audio
            "error": false
          },
          ...
        ]
      },
      ...
    ],
    "<take_uid2>": [ ... ],
    ...
  }
}
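For example, assuming the train split's file has been downloaded to annotations/expert_commentary_train.json (the exact path depends on your download root), a minimal Python sketch for iterating over this file could look like this:

import json

# Assumed location after downloading with --parts annotations; adjust to your download root.
split_file = "annotations/expert_commentary_train.json"

with open(split_file) as f:
    data = json.load(f)

print("exported on:", data["ds"])
for take_uid, expert_entries in data["annotations"].items():
    for entry in expert_entries:
        # "commentary" is the path relative to annotations/expert_commentary/;
        # the part after the slash is the expert's 6-character uid.
        print(entry["take_name"], "annotated by", entry["commentary"].split("/")[-1])
        for c in entry["commentary_data"]:
            if not c["error"]:
                print(f"  [{c['video_time']:.1f}s] {c['text'].strip()}")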
Second Form Structure (folder per annotation instance)
For audio recordings, proficiency, and spatial drawing data, you will need to refer to the `expert_commentary` folder under the `annotations` folder (you will need to download with `--parts expert_commentary`). To read the spatial drawings correctly, please refer to the function `get_paths_for_commentary_time` in the Python module `ego4d.egoexo.expert_commentary` (example usage is shown in the tutorial notebook).
This folder is structured as follows:
|-- annotations/expert_commentary
    |-- [take_name]
        |-- [expert_uid]
            |-- recordings
                |-- 0.webm
                |-- 1.webm
                ...
            |-- data.json
            |-- transcriptions.json
        |-- [expert_uid]
        ...
    |-- [take_name]
    ...
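As a rough sketch (assuming the data has been downloaded so that this tree sits under annotations/expert_commentary in your download root), the layout above can be traversed like this:

from pathlib import Path

# Assumed download root; adjust to wherever --parts expert_commentary was downloaded.
root = Path("annotations/expert_commentary")

for take_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    for expert_dir in sorted(p for p in take_dir.iterdir() if p.is_dir()):
        recordings = sorted((expert_dir / "recordings").glob("*.webm"))
        print(f"{take_dir.name} / expert {expert_dir.name}: {len(recordings)} recording(s)")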
The contents under each `annotations/expert_commentary/[take_name]/[expert_uid]` directory comprise an expert commentary annotation, consisting of the following files:
- `recordings/[x].webm`: Audio recordings of a commentary. Here [x] encodes the order in which the recordings were recorded, in real time.
- `data.json`: Contains expert commentary metadata, spatial drawing paths, and proficiency score/explanations. Some of the more relevant fields:
  - `user_id`: The 6-character unique ID of the expert providing the commentary
  - `video_name`: The video name of the take commentated on by the expert
  - `annotations`: A list containing information per commentary
    - `recording_path`: Name of the corresponding recording file (`[x].webm`)
    - `duration_approx`: Length of the recording in seconds
    - `video_time`: Timestamp in the video when the expert paused to provide commentary
    - `events`: Data for reconstructing the spatial drawings. The `paths` field contains pairs of data points, whose `from` and `to` fields provide the (x, y) coordinates and timestamps of the start and end of each stroke.
  - `proficiency`: Demonstration proficiency information
    - `rating`: A score from 1-10
    - `why`: A short text explanation of why the `rating` was chosen
- `transcriptions.json`: Auto-transcriptions of the recording files, using Whisper. Note that these may contain errors, as auto-transcriptions may not be perfect. Each entry in the transcription file is indexed by the recording file name (`[x].webm`), with the following fields:
  - `text`: The text of the transcription
  - `language`: The language code of the transcription. All our experts are native English speakers, so this should be `en` for all transcriptions.
  - `error`: A boolean flag to mark whether there was an error
  - `error_desc`: A description of the error if `error` is `true` (`traceback.format_exc()`)
  - NOTE: this file is aggregated into the first form, i.e. you do not need to read this file if you are reading the first form's file.
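Putting the fields above together, here is a minimal sketch of reading a single annotation instance. The take/expert pair is the hypothetical example from the sample JSON earlier (substitute names found in your download), and the exact on-disk layout should be verified against the tutorial notebook:

import json
from pathlib import Path

# Hypothetical take/expert pair taken from the sample JSON above; substitute your own.
ann_dir = Path("annotations/expert_commentary/cmu_bike15_4/qfbn5o")

with open(ann_dir / "data.json") as f:
    data = json.load(f)
with open(ann_dir / "transcriptions.json") as f:
    transcriptions = json.load(f)

# Overall proficiency score and the expert's explanation for it.
prof = data["proficiency"]
print(f"{data['video_name']} rated {prof['rating']}/10: {prof['why']}")

# Per-commentary metadata, joined with the matching transcription entry
# (transcriptions.json is indexed by the recording file name, e.g. "0.webm").
for commentary in data["annotations"]:
    rec = commentary["recording_path"]
    transcript = transcriptions.get(rec, {})
    text = "<transcription error>" if transcript.get("error") else transcript.get("text", "").strip()
    print(f"  [{commentary['video_time']:.1f}s] ({rec}, ~{commentary['duration_approx']:.1f}s) {text}")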