Field Categories

Field categories organize the Extended Dataset’s 24 fields into logical groupings for easier understanding and implementation.

Channel Fields

YouTube channel metadata and coverage information.

Field	Type	Description
channel_id	STRING	Immutable Babbl Labs internal unique identifier (UUID)
channel_custom_url	STRING	Mutable custom URL for channel
channel_name	STRING	Channel title from YouTube metadata
channel_description	STRING	Channel description from YouTube metadata (default: NONE)
channel_locale	STRING	Geographic country code (ISO 3166-1 alpha-2) (default: NONE)
channel_published_at	FLOAT	Unix timestamp when channel was created
channel_coverage_initiated_at	FLOAT	Unix timestamp when we initiated coverage
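
As a minimal sketch of working with the two channel timestamps, the snippet below converts them to timezone-aware datetimes. It assumes the Unix timestamps are expressed in seconds; the `to_utc` helper and the sample record are hypothetical and their values are invented for illustration.

```python
from datetime import datetime, timezone


def to_utc(ts: float) -> datetime:
    """Convert a Unix timestamp (ASSUMED to be in seconds) to an aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


# Hypothetical channel record; the values are made up for illustration.
channel = {
    "channel_name": "Example Channel",
    "channel_published_at": 1136073600.0,
    "channel_coverage_initiated_at": 1704067200.0,
}

published = to_utc(channel["channel_published_at"])
coverage_start = to_utc(channel["channel_coverage_initiated_at"])

# Days between channel creation and the start of coverage.
days_before_coverage = (coverage_start - published).days
print(published.isoformat(), coverage_start.isoformat(), days_before_coverage)
```

Keeping the datetimes timezone-aware avoids silently mixing local time with UTC when comparing against other sources.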

Video Fields

Video-level metadata and publication information.

Field	Type	Description
video_published_at	FLOAT	Unix timestamp when video was originally published
video_title	STRING	Video title from YouTube metadata
video_description	STRING	Video description from YouTube metadata (default: NONE)
video_language	STRING	Video language code (ISO 639-1)
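
Because video_language is described as an ISO 639-1 code, a cheap shape check can catch obviously malformed values before they reach downstream filters. This is only a sketch: the `looks_like_iso639_1` helper is hypothetical, it tests for exactly two letters, and it does not confirm that the code is actually registered in ISO 639-1.

```python
import re

# Two ASCII letters; case is ignored because the casing of stored values is not documented here.
_TWO_LETTERS = re.compile(r"[A-Za-z]{2}")


def looks_like_iso639_1(code: str) -> bool:
    """Return True if `code` has the two-letter shape of an ISO 639-1 code."""
    return bool(_TWO_LETTERS.fullmatch(code))


print(looks_like_iso639_1("en"))
print(looks_like_iso639_1("eng"))
```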

Processing Fields

Data processing timeline and model versioning information.

Field	Type	Description
downloaded_at	FLOAT	Unix timestamp when we downloaded video
transcribed_at	FLOAT	Unix timestamp when we transcribed video
transcription_version_tag	STRING	Identifier for transcription model used
recorded_at	FLOAT	Unix timestamp when video was added to dataset
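
The processing timestamps can be combined to estimate how long each stage took for a given video. The sketch below assumes the timestamps are in seconds and that the stages occur in the order publish, download, transcribe, record; neither assumption is stated in the table, and the `processing_lags` helper and sample values are hypothetical.

```python
def processing_lags(record: dict) -> dict:
    """Return the elapsed time between consecutive processing stages.

    ASSUMPTION: timestamps are in seconds and stages run in the order
    publish -> download -> transcribe -> record.
    """
    return {
        "publish_to_download": record["downloaded_at"] - record["video_published_at"],
        "download_to_transcribe": record["transcribed_at"] - record["downloaded_at"],
        "transcribe_to_record": record["recorded_at"] - record["transcribed_at"],
    }


# Hypothetical record with invented timestamps.
record = {
    "video_published_at": 1000.0,
    "downloaded_at": 4600.0,
    "transcribed_at": 5200.0,
    "recorded_at": 5260.0,
}

lags = processing_lags(record)
print(lags)
```

A negative lag would indicate that the assumed stage order does not hold for that record.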

Segment Fields

Transcript segment identification, timing, and content information.

Field	Type	Description
segment_id	STRING	Unique identifier for transcript segment across all videos
segment_start_ts	FLOAT	Starting point in seconds from video start
segment_end_ts	FLOAT	End point in seconds from video start
segment_text	STRING	Complete verbatim transcript text for segment (5-60s of speech)
segment_char_start	INT	Character index in the full video transcript where the segment starts
segment_char_end	INT	Character index in the full video transcript where the segment ends
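
To illustrate how the timing and character-index fields relate, the following sketch computes a segment's duration and recovers its text from a full transcript. It assumes the character range is zero-based and half-open (the end index is exclusive), which the table does not specify; the helper names, the transcript string, and the segment values are all invented for illustration.

```python
def segment_duration(segment: dict) -> float:
    """Length of the segment in seconds."""
    return segment["segment_end_ts"] - segment["segment_start_ts"]


def slice_segment_text(full_transcript: str, segment: dict) -> str:
    """Cut the segment out of the full transcript.

    ASSUMPTION: indices are zero-based and segment_char_end is exclusive.
    """
    return full_transcript[segment["segment_char_start"]:segment["segment_char_end"]]


# Hypothetical transcript and segment; all values are made up.
full_transcript = "Welcome back to the show. Today we talk about rates."
segment = {
    "segment_id": "seg-0001",
    "segment_start_ts": 12.0,
    "segment_end_ts": 49.5,
    "segment_text": "Today we talk about rates.",
    "segment_char_start": 26,
    "segment_char_end": 52,
}

print(segment_duration(segment))
print(slice_segment_text(full_transcript, segment) == segment["segment_text"])
```

If the slice does not match segment_text on real data, the indexing convention is probably different from the one assumed here.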

Speaker Fields

Speaker identification and context information. Every field in this category defaults to NONE.

Field	Type	Description
speaker_name	STRING	Name of speaker if identifiable (default: NONE)
speaker_associated_entity	STRING	Entity/company speaker is associated with (default: NONE)
speaker_position	STRING	Known title of speaker (default: NONE)
speaker_role_context	STRING	Role within video context (HOST, GUEST, etc.) (default: NONE)
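
Since every speaker field may hold the NONE default, consumers usually need to normalize it before filtering or grouping. The sketch below assumes the default arrives either as the literal string "NONE" or as a null value; the actual serialization is not documented here, so treat `normalize_none` and the sample record as hypothetical.

```python
from typing import Optional

SPEAKER_FIELDS = (
    "speaker_name",
    "speaker_associated_entity",
    "speaker_position",
    "speaker_role_context",
)


def normalize_none(value: Optional[str]) -> Optional[str]:
    """Map the NONE default to Python's None.

    ASSUMPTION: the default is delivered as the string "NONE" or as null.
    """
    if value is None or value == "NONE":
        return None
    return value


# Hypothetical record with invented values.
raw = {
    "speaker_name": "Jane Doe",
    "speaker_associated_entity": "NONE",
    "speaker_position": "NONE",
    "speaker_role_context": "GUEST",
}

clean = {field: normalize_none(raw.get(field)) for field in SPEAKER_FIELDS}
known = [field for field, value in clean.items() if value is not None]
print(known)
```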

Transcript Metrics

Complete transcript statistics and duration information.

Field	Type	Description
transcript_total_char_count	INT	Total characters in complete video transcript
transcript_total_duration	FLOAT	Total video duration in seconds
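
The transcript totals make it possible to express a segment as a share of the whole video. The following is a rough sketch rather than a prescribed calculation; `segment_share` and the sample numbers are hypothetical.

```python
def segment_share(segment: dict, totals: dict) -> dict:
    """Fraction of the video's duration and transcript characters covered by one segment."""
    duration = segment["segment_end_ts"] - segment["segment_start_ts"]
    chars = segment["segment_char_end"] - segment["segment_char_start"]
    total_duration = totals["transcript_total_duration"]
    total_chars = totals["transcript_total_char_count"]
    return {
        # Guard against empty transcripts to avoid division by zero.
        "time_share": duration / total_duration if total_duration else 0.0,
        "char_share": chars / total_chars if total_chars else 0.0,
    }


# Hypothetical values for illustration only.
totals = {"transcript_total_char_count": 3543, "transcript_total_duration": 600.0}
segment = {
    "segment_start_ts": 12.0,
    "segment_end_ts": 49.5,
    "segment_char_start": 0,
    "segment_char_end": 500,
}

share = segment_share(segment, totals)
print(share)
```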

Data Types & Constraints

Type	Format	Example
STRING	Variable-length text	`"CNBC"`
FLOAT	Floating-point number	`49.5`
INT	Integer number	`3543`

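Fields whose names end in _at are Unix timestamps, while segment_start_ts, segment_end_ts, and transcript_total_duration are measured in seconds relative to the video. Fields with a NONE default hold NONE when the information is unavailable. Geographic codes follow ISO 3166-1 alpha-2, and language codes follow ISO 639-1.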
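
One way to apply the three data types is a lightweight check that each value in a record has the documented type. The sketch below covers only a subset of the fields, maps STRING, FLOAT, and INT to Python's `str`, `float`, and `int`, and accepts an integer where FLOAT is expected; these mappings and the `type_errors` helper are assumptions, not part of the dataset specification.

```python
# Subset of the documented fields, mapped to the Python types assumed to match.
SCHEMA = {
    "channel_name": str,             # STRING
    "video_language": str,           # STRING
    "downloaded_at": float,          # FLOAT
    "segment_start_ts": float,       # FLOAT
    "segment_char_start": int,       # INT
    "transcript_total_char_count": int,  # INT
}


def type_errors(record: dict) -> list:
    """Return the names of present fields whose values do not match SCHEMA."""
    errors = []
    for field, expected in SCHEMA.items():
        if field not in record:
            continue
        value = record[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool):
            errors.append(field)
        elif expected is float and isinstance(value, int):
            continue  # ASSUMPTION: whole numbers are acceptable for FLOAT fields
        elif not isinstance(value, expected):
            errors.append(field)
    return errors


# Hypothetical records; values mirror the examples in the table above.
good = {"channel_name": "CNBC", "segment_start_ts": 49.5, "transcript_total_char_count": 3543}
bad = {"channel_name": "CNBC", "segment_char_start": "3543"}

print(type_errors(good))
print(type_errors(bad))
```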
