YouTube Datasets Collection

Babbl Labs offers social video intelligence datasets that combine metadata, transcript analysis, entity recognition, and sentiment analysis across thousands of channels and millions of videos from YouTube, the world’s largest video platform.

Quick Setup Guide

Get started with S3 access and helper functions in minutes.

Choose Your Dataset

We offer two complementary datasets designed for different research needs and use cases.

YouTube Core Dataset

20,000+ channels monitored • 5+ years historical data • 5-10% transcript coverage • 40+ fields

The YouTube Core Dataset adds structure to YouTube audiovisual data at scale, including entity mapping, speaker identification, sentiment analysis, and financial instrument mapping. It is built around specific mentions of companies, brands, and financial instruments within YouTube content.

Key Features:
  • Named Entity Recognition - Companies, brands, people, products with precise mapping
  • Financial Instrument Mapping - FIGI identifiers and ticker symbols for public companies
  • Advanced Sentiment Analysis - Generic sentiment + overt buy/sell recommendations
  • Speaker Intelligence - Host/guest identification with corporate affiliations
  • Segment-Level Precision - Sentiment tied to specific transcript segments mentioning entities

YouTube Extended Dataset

20,000+ channels monitored • 5+ years historical data • 80%+ transcript coverage • 24 fields

The YouTube Extended Dataset provides comprehensive transcript-level data with detailed speaker tracking and granular segment analysis. Approximately 85% of spoken content is captured, along with detailed metadata and processing information.

Key Features:
  • Near Complete Transcript Coverage - 85% of spoken content captured in sequential segments
  • Temporal Precision - Start/end timestamps accurate to tenths of seconds
  • Character Positioning - Exact indices within transcript coverage
  • Comprehensive Speaker Data - Names, affiliations, roles, positions with optional handling
  • Processing Transparency - Complete audit trail of transcription and processing steps
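
To make the segment features above concrete, here is a minimal Python sketch of what a sequential transcript segment could look like and how timestamps and character indices might be used together. Every field name (`video_id`, `start_s`, `end_s`, `char_start`, `char_end`, `text`, `speaker_name`) is a hypothetical placeholder rather than the dataset's actual schema, and the sketch assumes that character indices are end-exclusive and that segments are joined with a single space.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    # Hypothetical field names, chosen for illustration only.
    video_id: str
    start_s: float          # segment start, in seconds (tenths precision)
    end_s: float            # segment end, in seconds (tenths precision)
    char_start: int         # index of first character in the full transcript
    char_end: int           # end index, assumed exclusive
    text: str
    speaker_name: Optional[str] = None  # speaker data may be missing


def rebuild_transcript(segments: List[Segment]) -> str:
    """Order segments by start time and join their text with single spaces."""
    ordered = sorted(segments, key=lambda s: s.start_s)
    return " ".join(s.text for s in ordered)


def duration_s(segment: Segment) -> float:
    """Segment length in seconds, rounded to tenths to match the timestamps."""
    return round(segment.end_s - segment.start_s, 1)
```

Under these assumptions, slicing the rebuilt transcript with `transcript[seg.char_start:seg.char_end]` returns the segment's own text, which is a quick sanity check to run when first loading the data.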

Dataset Comparison

Feature | YouTube Core Dataset | YouTube Extended Dataset
Primary Use Case | Entity sentiment analysis | Complete transcript analysis
Data Structure | Entity mentions with context | Sequential transcript segments
Field Count | 40+ fields | 24 fields
Sentiment Analysis | ✅ Multi-layered | ❌ Not included
Entity Recognition | ✅ Advanced NER + FIGI | ❌ Not included
Complete Transcripts | ❌ Context segments only | ✅ Near complete (85%)
Speaker Tracking | ✅ Basic identification | ✅ Comprehensive details
Financial Mapping | ✅ Ticker symbols + FIGI | ❌ Not included
Processing Metadata | ✅ Model versioning | ✅ Complete audit trail

Common Use Cases

Getting Started

Both datasets share the same S3 access patterns and helper functions, making it easy to work with either or both.
YouTube Extended Dataset is the Foundation: The YouTube Extended Dataset provides the transcript base (85% coverage) from which the YouTube Core Dataset is derived. The Extended Dataset contains transcripts and speaker data; the Core Dataset adds entity recognition and sentiment analysis on top of selected segments.
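
Because the Core Dataset is derived from Extended segments, one plausible way to use both together is to look up the surrounding transcript for an entity mention. The sketch below is only an illustration of that idea, not Babbl Labs' pipeline or helper API: the join keys `video_id` and `segment_index`, and the `text` field, are assumed names that may not match the real schemas.

```python
def index_segments(extended_rows):
    """Map (video_id, segment_index) to the Extended row for fast lookup.

    Both key names are hypothetical; substitute the real join fields.
    """
    return {(r["video_id"], r["segment_index"]): r for r in extended_rows}


def context_for_mention(mention, segment_index, window=1):
    """Return the text of the mentioned segment plus `window` neighbours
    on each side, skipping neighbours that are not in the index."""
    vid, i = mention["video_id"], mention["segment_index"]
    rows = [segment_index.get((vid, j)) for j in range(i - window, i + window + 1)]
    return " ".join(r["text"] for r in rows if r is not None)
```

Keying the index on a tuple keeps the lookup constant-time per mention, which matters when many Core rows point into long transcripts.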