Field categories organize the Extended Dataset’s 24 fields into logical groupings for easier understanding and implementation.

Channel Fields

YouTube channel metadata and coverage information.
| Field | Type | Description |
| --- | --- | --- |
| channel_id | STRING | Immutable Babbl Labs internal unique identifier (UUID) |
| channel_custom_url | STRING | Mutable custom URL for channel |
| channel_name | STRING | Channel title from YouTube metadata |
| channel_description | STRING | Channel description from YouTube metadata (default: NONE) |
| channel_locale | STRING | Geographic country code (ISO 3166-1 alpha-2) (default: NONE) |
| channel_published_at | FLOAT | Unix timestamp when channel was created |
| channel_coverage_initiated_at | FLOAT | Unix timestamp when we initiated coverage |
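
As a hedged illustration of how the two channel timestamps might be consumed, the sketch below converts them to timezone-aware datetimes and measures how long a channel existed before coverage began. It assumes the values are seconds since the Unix epoch (the table does not say seconds versus milliseconds). The sample record and the helper name `unix_to_utc` are hypothetical.

```python
from datetime import datetime, timezone


def unix_to_utc(ts: float) -> datetime:
    """Convert a Unix timestamp (assumed to be in seconds) to an aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


# Hypothetical channel record; every value is invented for illustration.
channel = {
    "channel_id": "00000000-0000-0000-0000-000000000000",
    "channel_name": "Example Channel",
    "channel_locale": "US",
    "channel_published_at": 1262304000.0,           # 2010-01-01T00:00:00Z
    "channel_coverage_initiated_at": 1577836800.0,  # 2020-01-01T00:00:00Z
}

created = unix_to_utc(channel["channel_published_at"])
covered = unix_to_utc(channel["channel_coverage_initiated_at"])

# Whole days the channel existed before coverage started.
coverage_gap_days = (covered - created).days
print(created.isoformat(), covered.isoformat(), coverage_gap_days)
```

Using aware UTC datetimes avoids the local-timezone shift that a naive `datetime.fromtimestamp(ts)` call would introduce.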

Video Fields

Video-level metadata and publication information.
| Field | Type | Description |
| --- | --- | --- |
| video_published_at | FLOAT | Unix timestamp when video was originally published |
| video_title | STRING | Video title from YouTube metadata |
| video_description | STRING | Video description from YouTube metadata (default: NONE) |
| video_language | STRING | Video language code (ISO 639-1) |

Processing Fields

Data processing timeline and model versioning information.
| Field | Type | Description |
| --- | --- | --- |
| downloaded_at | FLOAT | Unix timestamp when we downloaded video |
| transcribed_at | FLOAT | Unix timestamp when we transcribed video |
| transcription_version_tag | STRING | Identifier for transcription model used |
| recorded_at | FLOAT | Unix timestamp when video was added to dataset |
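
The processing timestamps lend themselves to simple latency arithmetic. The following is a minimal sketch, assuming all values are in seconds and that events normally occur in the order published, downloaded, transcribed, recorded. That ordering is an assumption, not something the table guarantees, and a negative lag would indicate it does not hold. The function name `processing_lags` and the sample row are hypothetical.

```python
from typing import Dict


def processing_lags(row: Dict[str, float]) -> Dict[str, float]:
    """Return elapsed seconds between consecutive pipeline events.

    A negative value means the assumed event ordering did not hold for this row.
    """
    return {
        "publish_to_download": row["downloaded_at"] - row["video_published_at"],
        "download_to_transcribe": row["transcribed_at"] - row["downloaded_at"],
        "transcribe_to_record": row["recorded_at"] - row["transcribed_at"],
    }


# Hypothetical row; the timestamps are invented for illustration.
row = {
    "video_published_at": 1700000000.0,
    "downloaded_at": 1700003600.0,
    "transcribed_at": 1700004500.0,
    "recorded_at": 1700004560.0,
}

lags = processing_lags(row)
print(lags)
```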

Segment Fields

Transcript segment identification, timing, and content information.
| Field | Type | Description |
| --- | --- | --- |
| segment_id | STRING | Unique identifier for transcript segment across all videos |
| segment_start_ts | FLOAT | Starting point in seconds from video start |
| segment_end_ts | FLOAT | End point in seconds from video start |
| segment_text | STRING | Complete verbatim transcript text for segment (5-60s of speech) |
| segment_char_start | INT | Character index in the full video transcript where the segment starts |
| segment_char_end | INT | Character index in the full video transcript where the segment ends |
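
One way to work with segment rows is to model them as a small record type. The sketch below is illustrative only: the `Segment` class, the `in_playback_order` helper, and all sample values are hypothetical, and the table does not state whether `segment_char_end` is inclusive or exclusive. It orders segments by start offset, derives each duration, and checks it against the 5-60 second range mentioned for `segment_text`.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    """Hypothetical container mirroring the segment fields in the table above."""
    segment_id: str
    segment_start_ts: float  # seconds from video start
    segment_end_ts: float    # seconds from video start
    segment_text: str
    segment_char_start: int
    segment_char_end: int

    @property
    def duration(self) -> float:
        """Length of the segment in seconds."""
        return self.segment_end_ts - self.segment_start_ts


def in_playback_order(segments: List[Segment]) -> List[Segment]:
    """Return the segments sorted by their start offset within the video."""
    return sorted(segments, key=lambda s: s.segment_start_ts)


# Invented sample data; the character indices are illustrative only.
segments = [
    Segment("seg-b", 12.5, 30.0, "Second segment text.", 20, 40),
    Segment("seg-a", 0.0, 12.5, "First segment text.", 0, 19),
]

ordered = in_playback_order(segments)
durations = [s.duration for s in ordered]

# Flag whether each duration falls in the documented 5-60 second range.
within_expected_length = [5.0 <= d <= 60.0 for d in durations]
print([s.segment_id for s in ordered], durations, within_expected_length)
```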

Speaker Fields

Speaker identification and context information (fields with NONE defaults).
| Field | Type | Description |
| --- | --- | --- |
| speaker_name | STRING | Name of speaker if identifiable (default: NONE) |
| speaker_associated_entity | STRING | Entity/company speaker is associated with (default: NONE) |
| speaker_position | STRING | Known title of speaker (default: NONE) |
| speaker_role_context | STRING | Role within video context (HOST, GUEST, etc.) (default: NONE) |
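
Because every speaker field can default to NONE, consumers may want to normalize missing values before use. The sketch below assumes the default could arrive either as the literal string `"NONE"` or as a null value. How the dataset actually encodes it is not stated here, so treat both cases as assumptions. The helper names `normalize_none` and `speaker_info` are hypothetical.

```python
from typing import Any, Dict, Optional

SPEAKER_FIELDS = (
    "speaker_name",
    "speaker_associated_entity",
    "speaker_position",
    "speaker_role_context",
)


def normalize_none(value: Any) -> Optional[str]:
    """Map a missing value to Python None.

    Assumption: a missing value is either absent/null or the literal string "NONE".
    """
    if value is None or value == "NONE":
        return None
    return value


def speaker_info(row: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Extract the four speaker fields, with missing values normalized to None."""
    return {name: normalize_none(row.get(name)) for name in SPEAKER_FIELDS}


# Invented sample row: one field set to "NONE", one field absent entirely.
row = {
    "speaker_name": "Jane Doe",
    "speaker_associated_entity": "NONE",
    "speaker_role_context": "GUEST",
}

info = speaker_info(row)
print(info)
```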

Transcript Metrics

Complete transcript statistics and duration information.
| Field | Type | Description |
| --- | --- | --- |
| transcript_total_char_count | INT | Total characters in complete video transcript |
| transcript_total_duration | FLOAT | Total video duration in seconds |
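
The transcript totals can serve as denominators for locating a segment within its video. The sketch below is a hedged example with hypothetical helper names (`relative_position`, `char_share`) and invented values. It computes how far into the video a segment starts and what share of the transcript's characters it covers, treating the character span as `end - start`, which is an assumption about how the indices are defined.

```python
def relative_position(segment_start_ts: float, transcript_total_duration: float) -> float:
    """Fraction of the video elapsed when the segment starts (0.0 to 1.0)."""
    if transcript_total_duration <= 0:
        raise ValueError("transcript_total_duration must be positive")
    return segment_start_ts / transcript_total_duration


def char_share(char_start: int, char_end: int, total_char_count: int) -> float:
    """Share of the full transcript's characters covered by one segment.

    Assumes the span length is char_end - char_start.
    """
    if total_char_count <= 0:
        raise ValueError("total_char_count must be positive")
    return (char_end - char_start) / total_char_count


# Invented values: a segment starting 150 s into a 600 s video,
# spanning characters 500 to 1000 of a 4000-character transcript.
position = relative_position(150.0, 600.0)
share = char_share(500, 1000, 4000)
print(position, share)
```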

Data Types & Constraints

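| Type | Format | Example |
| --- | --- | --- |
| STRING | Variable-length text | "CNBC" |
| FLOAT | Floating-point number | 49.5 |
| INT | Integer number | 3543 |

All fields ending in `_at` use Unix timestamp format; `segment_start_ts` and `segment_end_ts` are offsets in seconds from the start of the video. Fields with a NONE default take that value when the information is unavailable. Geographic codes follow ISO 3166-1 alpha-2; language codes follow ISO 639-1.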
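
The type and format constraints can be checked mechanically before loading rows. The following is a minimal sketch, not an official validator: `SCHEMA` covers only a handful of fields, the `validate` function is hypothetical, and the regular expressions check only the shape of a code (two uppercase letters for countries, two lowercase letters for languages, following the usual ISO casing conventions), not membership in the ISO lists. The casing used by the dataset itself is an assumption.

```python
import re
from typing import Any, Dict, List

EXPECTED_TYPES = {"STRING": str, "FLOAT": float, "INT": int}

# Illustrative subset of the schema; extend with the remaining fields as needed.
SCHEMA = {
    "channel_name": "STRING",
    "channel_locale": "STRING",
    "video_language": "STRING",
    "segment_start_ts": "FLOAT",
    "transcript_total_char_count": "INT",
}

COUNTRY_CODE = re.compile(r"[A-Z]{2}")   # shape of an ISO 3166-1 alpha-2 code
LANGUAGE_CODE = re.compile(r"[a-z]{2}")  # shape of an ISO 639-1 code


def is_missing(value: Any) -> bool:
    """Assumption: missing values are null or the literal string "NONE"."""
    return value is None or value == "NONE"


def validate(row: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the row passed."""
    problems: List[str] = []
    for name, type_name in SCHEMA.items():
        value = row.get(name)
        if is_missing(value):
            continue
        expected = EXPECTED_TYPES[type_name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{name}: expected {type_name}, got {type(value).__name__}")

    locale = row.get("channel_locale")
    if isinstance(locale, str) and not is_missing(locale) and not COUNTRY_CODE.fullmatch(locale):
        problems.append(f"channel_locale: {locale!r} is not a two-letter country code")

    language = row.get("video_language")
    if isinstance(language, str) and not is_missing(language) and not LANGUAGE_CODE.fullmatch(language):
        problems.append(f"video_language: {language!r} is not a two-letter language code")
    return problems


good_row = {
    "channel_name": "CNBC",
    "channel_locale": "US",
    "video_language": "en",
    "segment_start_ts": 49.5,
    "transcript_total_char_count": 3543,
}
bad_row = dict(good_row, channel_locale="USA", segment_start_ts="49.5")

good_problems = validate(good_row)
bad_problems = validate(bad_row)
print(good_problems)
print(bad_problems)
```

Note that this strict check rejects an integer supplied for a FLOAT field; loosen the comparison if the data source serializes whole-number floats as integers.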