Overview

The Extended Dataset provides granular, transcript-level data from YouTube financial media. It offers near-complete segment-by-segment coverage (approximately 85%), with detailed speaker information and comprehensive metadata. The dataset is suited to deep linguistic analysis, speaker tracking, and fine-grained content analysis across YouTube’s financial media landscape.

Complete Data Dictionary

Field	Type	Description
segment_id	UUID	Primary key; globally unique identifier for each transcript segment
video_id	UUID	Unique YouTube video identifier
channel_id	UUID	Immutable Babbl Labs internal identifier for the channel
channel_uri	UUID	YouTube’s immutable unique identifier for the channel
channel_custom_url	STRING	Mutable custom URL for channel
channel_name	STRING	Channel title from YouTube metadata
channel_description	STRING	Channel description from YouTube metadata
channel_locale	ENUM	Channel country code (ISO 3166-1 alpha-2)
channel_published_at	TIMESTAMP	Timestamp when the channel was created
channel_coverage_initiated_at	TIMESTAMP	Timestamp when we initiated coverage of the channel
video_title	STRING	Video title from YouTube metadata
video_description	STRING	Video description from YouTube metadata
video_language	STRING	Video language code (ISO 639-1)
video_published_dt	TIMESTAMP	Timestamp when the video was originally published
video_download_dt	TIMESTAMP	Timestamp when we downloaded the video
video_transcribed_at	TIMESTAMP	Timestamp when we transcribed the video
video_in_dataset_at	TIMESTAMP	Timestamp when the video was included in the dataset
model_transcription_tag	STRING	Identifier for the transcription model used
segment_start	FLOAT	Start of the segment, in seconds from the start of the video
segment_end	FLOAT	End of the segment, in seconds from the start of the video
segment_start_char	INT	Character index where the segment starts in the transcript
segment_end_char	INT	Character index where the segment ends in the transcript
segment_text	STRING	Complete verbatim transcript text for this segment
speaker_name	STRING	Name of the speaker (optional; may be null)
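
As an illustration of how these fields fit together, the sketch below totals speaking time per speaker within one video. It assumes each row has already been loaded as a Python dict keyed by the field names above. The function name `speaking_time_by_speaker`, the `"UNKNOWN"` placeholder label, and the sample rows are hypothetical and are not part of the dataset or of any Babbl Labs API.

```python
from collections import defaultdict


def speaking_time_by_speaker(segments):
    """Sum segment durations (in seconds) per speaker.

    Each segment is assumed to be a dict with at least the fields
    segment_start, segment_end and speaker_name from the data dictionary.
    Because speaker_name may be null, segments without a speaker are
    grouped under the hypothetical label "UNKNOWN".
    """
    totals = defaultdict(float)
    for seg in segments:
        duration = seg["segment_end"] - seg["segment_start"]
        speaker = seg.get("speaker_name") or "UNKNOWN"
        totals[speaker] += duration
    return dict(totals)


# Made-up sample rows, limited to the fields this sketch needs.
sample_segments = [
    {"segment_start": 0.0, "segment_end": 4.5, "speaker_name": "Jane Doe",
     "segment_text": "Welcome back to the show."},
    {"segment_start": 4.5, "segment_end": 10.0, "speaker_name": None,
     "segment_text": "Thanks for having me."},
    {"segment_start": 10.0, "segment_end": 12.0, "speaker_name": "Jane Doe",
     "segment_text": "Let's talk about earnings."},
]

totals = speaking_time_by_speaker(sample_segments)
print(totals)
```

In practice the rows would first be filtered to a single video_id so that the totals describe one video rather than the whole dataset.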
