The YouTube Extended Dataset provides granular, transcript-level data from YouTube financial media. It offers near-complete segment-by-segment coverage (approximately 85%), with detailed speaker information and comprehensive metadata. The dataset is well suited to deep linguistic analysis, speaker tracking, and fine-grained content analysis across YouTube’s financial media landscape.

Complete Data Dictionary

| Field | Type | Description |
| --- | --- | --- |
| segment_id | UUID | Primary Key - Globally unique identifier for each transcript segment |
| video_id | UUID | Unique YouTube video identifier |
| channel_id | UUID | Immutable Babbl Labs internal unique identifier for channel |
| channel_uri | UUID | Immutable YouTube’s unique identifier for channel |
| channel_custom_url | STRING | Mutable custom URL for channel |
| channel_name | STRING | Channel title from YouTube metadata |
| channel_description | STRING | Channel description from YouTube metadata |
| channel_locale | ENUM | Channel geographic country (ISO 3166-1 alpha-2) |
| channel_published_at | TIMESTAMP | Timestamp when channel was created |
| channel_coverage_initiated_at | TIMESTAMP | Timestamp when we initiated coverage |
| video_title | STRING | Video title from YouTube metadata |
| video_description | STRING | Video description from YouTube metadata |
| video_language | STRING | Video language code (ISO 639-1) |
| video_published_dt | TIMESTAMP | Timestamp when video was originally published |
| video_download_dt | TIMESTAMP | Timestamp when we downloaded the video |
| video_transcribed_at | TIMESTAMP | Timestamp when we transcribed the video |
| video_in_dataset_at | TIMESTAMP | Timestamp when video was included in dataset |
| model_transcription_tag | STRING | Identifier for transcription model used |
| segment_start | FLOAT | Starting point of segment in seconds from video start |
| segment_end | FLOAT | End point of segment in seconds from video start |
| segment_start_char | INT | Character index where segment starts in transcript |
| segment_end_char | INT | Character index where segment ends in transcript |
| segment_text | STRING | Complete verbatim transcript text for this segment |
| speaker_name | STRING | Name of speaker (optional - may be null) |
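
As a rough illustration of how the segment-level fields might be used together, the Python sketch below totals speaking time per speaker and rebuilds a video's text in time order. It relies only on field names from the data dictionary above. The sample rows, the `UNKNOWN` label for null speakers, the helper names, and joining segments with a single space are all assumptions made for illustration, not documented behavior of the dataset.

```python
from collections import defaultdict
from typing import Optional, TypedDict


class Segment(TypedDict):
    """Subset of the data dictionary fields used in this sketch."""
    segment_id: str
    video_id: str
    segment_start: float  # seconds from video start
    segment_end: float    # seconds from video start
    segment_text: str
    speaker_name: Optional[str]  # optional, may be null


def speaking_time_by_speaker(segments: list[Segment]) -> dict[str, dict[str, float]]:
    """Sum segment durations per video and speaker (hypothetical helper)."""
    totals: dict[str, dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for seg in segments:
        # Assumption: group segments with a null speaker under "UNKNOWN".
        speaker = seg["speaker_name"] or "UNKNOWN"
        duration = seg["segment_end"] - seg["segment_start"]
        totals[seg["video_id"]][speaker] += duration
    return {video: dict(by_speaker) for video, by_speaker in totals.items()}


def rebuild_transcript(segments: list[Segment], video_id: str) -> str:
    """Join one video's segments ordered by segment_start (hypothetical helper)."""
    ordered = sorted(
        (seg for seg in segments if seg["video_id"] == video_id),
        key=lambda seg: seg["segment_start"],
    )
    # Assumption: a single space between segments; the real transcript
    # layout is defined by segment_start_char / segment_end_char.
    return " ".join(seg["segment_text"] for seg in ordered)


# Made-up rows for illustration only; real IDs are UUIDs per the table.
rows: list[Segment] = [
    {"segment_id": "s2", "video_id": "v1", "segment_start": 4.5,
     "segment_end": 9.0, "segment_text": "Rates held steady.",
     "speaker_name": None},
    {"segment_id": "s1", "video_id": "v1", "segment_start": 0.0,
     "segment_end": 4.5, "segment_text": "Welcome back.",
     "speaker_name": "Host"},
]

print(speaking_time_by_speaker(rows))
print(rebuild_transcript(rows, "v1"))
```

Because `speaker_name` may be null, any per-speaker aggregation needs an explicit policy for unattributed segments. The sketch groups them under one label, but dropping them may suit some analyses better.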