Field categories organize the YouTube Extended Dataset’s 24 fields into logical groupings for easier understanding and implementation.
Channel Fields
YouTube channel metadata and coverage information.
Field | Type | Description |
---|---|---|
channel_id | STRING | Immutable Babbl Labs internal unique identifier (UUID) |
channel_custom_url | STRING | Mutable custom URL for channel |
channel_name | STRING | Channel title from YouTube metadata |
channel_description | STRING | Channel description from YouTube metadata (default: NONE) |
channel_locale | STRING | Geographic country code (ISO 3166-1 alpha-2) (default: NONE) |
channel_published_at | FLOAT | Unix timestamp when channel was created |
channel_coverage_initiated_at | FLOAT | Unix timestamp when we initiated coverage |
Video Fields
Video-level metadata and publication information.
Field | Type | Description |
---|---|---|
video_published_at | FLOAT | Unix timestamp when video was originally published |
video_title | STRING | Video title from YouTube metadata |
video_description | STRING | Video description from YouTube metadata (default: NONE) |
video_language | STRING | Video language code (ISO 639-1) |
Processing Fields
Data processing timeline and model versioning information.
Field | Type | Description |
---|---|---|
downloaded_at | FLOAT | Unix timestamp when we downloaded video |
transcribed_at | FLOAT | Unix timestamp when we transcribed video |
transcription_version_tag | STRING | Identifier for transcription model used |
recorded_at | FLOAT | Unix timestamp when video was added to dataset |
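Because the processing fields are Unix timestamps stored as FLOAT, the delay between pipeline steps can be derived by subtraction. The following is a minimal sketch, not part of the dataset tooling: it assumes the timestamps are expressed in seconds (the unit is not stated here) and that a row is available as a plain dictionary. The helper names `to_utc` and `processing_lags` and the sample values are hypothetical.

```python
from datetime import datetime, timezone


def to_utc(ts: float) -> datetime:
    """Convert a Unix timestamp (assumed to be in seconds) to an aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


def processing_lags(row: dict) -> dict:
    """Return the delay, in seconds, between publication, download, and transcription."""
    return {
        "publish_to_download": row["downloaded_at"] - row["video_published_at"],
        "download_to_transcribe": row["transcribed_at"] - row["downloaded_at"],
    }


# Hypothetical sample row; the values are illustrative only.
row = {
    "video_published_at": 1700000000.0,
    "downloaded_at": 1700003600.0,
    "transcribed_at": 1700004500.0,
    "recorded_at": 1700004560.0,
}

lags = processing_lags(row)
published = to_utc(row["video_published_at"])
print(published.isoformat(), lags)
```

Converting to timezone-aware UTC datetimes avoids silently applying the local timezone of the machine that runs the analysis.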
Segment Fields
Transcript segment identification, timing, and content information.
Field | Type | Description |
---|---|---|
segment_id | STRING | Unique identifier for transcript segment across all videos |
segment_start_ts | FLOAT | Starting point in seconds from video start |
segment_end_ts | FLOAT | End point in seconds from video start |
segment_text | STRING | Complete verbatim transcript text for segment (5-60s of speech) |
segment_char_start | INT | Character index in the full video transcript where the segment starts |
segment_char_end | INT | Character index in the full video transcript where the segment ends |
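The timing and character fields can be combined to compute a segment's length and to locate its text inside the full transcript. This is a sketch under stated assumptions: the full transcript is available as a single string, and segment_char_end is treated as an exclusive index, which is not specified here and should be checked against real data. The function names and sample values are hypothetical.

```python
def segment_duration(seg: dict) -> float:
    """Length of the segment in seconds, from its start and end offsets."""
    return seg["segment_end_ts"] - seg["segment_start_ts"]


def slice_transcript(full_transcript: str, seg: dict) -> str:
    """Extract the segment text from the full transcript.

    Assumption: segment_char_end is exclusive (Python slice semantics).
    """
    return full_transcript[seg["segment_char_start"]:seg["segment_char_end"]]


# Hypothetical transcript and segment; the values are illustrative only.
full_transcript = "Welcome back to the show. Today we talk about earnings."
seg = {
    "segment_start_ts": 12.5,
    "segment_end_ts": 49.5,
    "segment_text": "Today we talk about earnings.",
    "segment_char_start": 26,
    "segment_char_end": 55,
}

duration = segment_duration(seg)
recovered = slice_transcript(full_transcript, seg)
```

If the recovered slice does not match segment_text on real rows, the end index is probably inclusive and the slice should extend one character further.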
Speaker Fields
Speaker identification and context information. All four fields default to NONE.
Field | Type | Description |
---|---|---|
speaker_name | STRING | Name of speaker if identifiable (default: NONE) |
speaker_associated_entity | STRING | Entity/company speaker is associated with (default: NONE) |
speaker_position | STRING | Known title of speaker (default: NONE) |
speaker_role_context | STRING | Role within video context (HOST, GUEST, etc.) (default: NONE) |
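Since every speaker field can fall back to NONE, downstream code usually needs to tell a missing value apart from a real one. The sketch below assumes the default is delivered as the literal string "NONE"; if the export uses a null value instead, the check for `None` already covers it. The `normalize_speaker` helper and the sample row are hypothetical.

```python
# Assumption: the NONE default is delivered as this literal string.
NONE_SENTINEL = "NONE"

SPEAKER_FIELDS = (
    "speaker_name",
    "speaker_associated_entity",
    "speaker_position",
    "speaker_role_context",
)


def normalize_speaker(row: dict) -> dict:
    """Map NONE defaults (or absent keys) to Python None for the speaker fields."""
    out = {}
    for field in SPEAKER_FIELDS:
        value = row.get(field)
        out[field] = None if value is None or value == NONE_SENTINEL else value
    return out


# Hypothetical row: the speaker is a known host with no recorded title or entity.
row = {
    "speaker_name": "Jane Doe",
    "speaker_associated_entity": "NONE",
    "speaker_position": "NONE",
    "speaker_role_context": "HOST",
}

speaker = normalize_speaker(row)
```

Normalizing once at load time keeps later filters simple, for example selecting only segments where `speaker["speaker_name"] is not None`.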
Transcript Metrics
Complete transcript statistics and duration information.
Field | Type | Description |
---|---|---|
transcript_total_char_count | INT | Total characters in complete video transcript |
transcript_total_duration | FLOAT | Total video duration in seconds |
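The two transcript metrics support simple derived statistics, such as average transcript density and the share of a transcript that one segment covers. This is an illustrative sketch only: the function names and sample values are hypothetical, and the segment share assumes segment_char_end minus segment_char_start gives the segment's character length.

```python
def chars_per_second(row: dict) -> float:
    """Average number of transcript characters per second of video."""
    duration = row["transcript_total_duration"]
    if duration <= 0:
        # Guard against division by zero for empty or malformed rows.
        return 0.0
    return row["transcript_total_char_count"] / duration


def segment_share(row: dict) -> float:
    """Fraction of the full transcript (by characters) covered by this segment."""
    total = row["transcript_total_char_count"]
    if total <= 0:
        return 0.0
    return (row["segment_char_end"] - row["segment_char_start"]) / total


# Hypothetical row; the values are illustrative only.
row = {
    "transcript_total_char_count": 3000,
    "transcript_total_duration": 600.0,
    "segment_char_start": 0,
    "segment_char_end": 150,
}

density = chars_per_second(row)
share = segment_share(row)
```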
Data Types & Constraints
Type | Format | Example |
---|---|---|
STRING | Variable-length text | "CNBC" |
FLOAT | Floating-point number | 49.5 |
INT | Integer number | 3543 |
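All fields ending in _at are Unix timestamps stored as FLOAT; segment_start_ts and segment_end_ts are instead offsets in seconds from the start of the video. Fields with a NONE default take the value NONE when the information is unavailable. Geographic codes follow ISO 3166-1 alpha-2, and language codes follow ISO 639-1.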
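The three data types map naturally onto Python's built-in types, which makes a lightweight type check on incoming rows straightforward. The sketch below is one possible approach, not an official validator: `TYPE_MAP`, `SCHEMA`, and `validate_row` are hypothetical names, the schema covers only a few fields for brevity, and the check is deliberately strict (an integer in a FLOAT field is reported as an error).

```python
# Python equivalents for the three dataset types.
TYPE_MAP = {"STRING": str, "FLOAT": float, "INT": int}

# Partial schema for illustration; extend it with the remaining fields as needed.
SCHEMA = {
    "channel_name": "STRING",
    "channel_locale": "STRING",
    "segment_start_ts": "FLOAT",
    "transcript_total_char_count": "INT",
}


def validate_row(row: dict) -> list:
    """Return a list of type errors for the row (empty if every field is valid)."""
    errors = []
    for field, type_name in SCHEMA.items():
        if field not in row:
            errors.append(f"{field}: missing")
            continue
        value = row[field]
        expected = TYPE_MAP[type_name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(
                f"{field}: expected {type_name}, got {type(value).__name__}"
            )
    return errors


# Sample rows built from the example values in the table; "US" is illustrative.
good = {
    "channel_name": "CNBC",
    "channel_locale": "US",
    "segment_start_ts": 49.5,
    "transcript_total_char_count": 3543,
}
bad = dict(good, transcript_total_char_count="3543")

good_errors = validate_row(good)
bad_errors = validate_row(bad)
```

Returning a list of messages instead of raising on the first problem lets a loader report every bad field in a row at once.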