YouTube Datasets Collection
Babbl Labs provides social video intelligence datasets that combine metadata, transcript analysis, entity recognition, and sentiment analysis across thousands of channels and millions of videos from YouTube, the world’s largest video platform.
Quick Setup Guide
Get started with S3 access and helper functions in minutes.
Choose Your Dataset
We offer two complementary datasets designed for different research needs and use cases.
YouTube Core Dataset
20,000+ channels monitored • 5+ years historical data • 5-10% transcript coverage • 40+ fields
The YouTube Core Dataset adds structure to YouTube audiovisual data at scale, including advanced entity mapping, speaker identification, sentiment analysis, and financial instrument mapping. It is built around specific mentions of companies, brands, and financial instruments within YouTube content.
Entity Recognition
Companies, brands, and financial instruments with FIGI mapping and ticker symbols.
Sentiment Analysis
Multi-layered sentiment including buy/sell signals and generic sentiment toward entities.
- Named Entity Recognition - Companies, brands, people, products with precise mapping
- Financial Instrument Mapping - FIGI identifiers and ticker symbols for public companies
- Advanced Sentiment Analysis - Generic sentiment + overt buy/sell recommendations
- Speaker Intelligence - Host/guest identification with corporate affiliations
- Segment-Level Precision - Sentiment tied to specific transcript segments mentioning entities
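As a rough illustration of how segment-level sentiment might be rolled up per financial instrument, the sketch below aggregates a handful of made-up entity mentions by ticker. The field names (`video_id`, `ticker`, `sentiment`, `signal`) and the numeric sentiment scale are assumptions for the example, not the documented schema; consult the data dictionary for the actual fields.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical Core-style rows: one row per entity mention in a transcript segment.
# All field names and values here are illustrative assumptions.
mentions = [
    {"video_id": "v1", "ticker": "AAPL", "sentiment": 0.6, "signal": "buy"},
    {"video_id": "v1", "ticker": "AAPL", "sentiment": 0.2, "signal": None},
    {"video_id": "v2", "ticker": "TSLA", "sentiment": -0.4, "signal": "sell"},
    {"video_id": "v3", "ticker": "AAPL", "sentiment": -0.2, "signal": "sell"},
]


def summarize_by_ticker(rows):
    """Return mention count, mean sentiment, and buy/sell counts per ticker."""
    scores = defaultdict(list)
    signals = defaultdict(lambda: {"buy": 0, "sell": 0})
    for row in rows:
        scores[row["ticker"]].append(row["sentiment"])
        # Only overt recommendations are counted; rows without a signal are skipped.
        if row["signal"] in ("buy", "sell"):
            signals[row["ticker"]][row["signal"]] += 1
    return {
        ticker: {
            "mentions": len(values),
            "mean_sentiment": round(mean(values), 3),
            **signals[ticker],
        }
        for ticker, values in scores.items()
    }


summary = summarize_by_ticker(mentions)
for ticker, stats in summary.items():
    print(ticker, stats)
```

Keeping generic sentiment and overt buy/sell signals as separate columns in the summary mirrors the distinction the dataset draws between the two layers.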
YouTube Extended Dataset
20,000+ channels monitored • 5+ years historical data • 80%+ transcript coverage • 24 fields
The YouTube Extended Dataset provides comprehensive transcript-level data with detailed speaker tracking and granular segment analysis. Approximately 85% of spoken content is captured, together with detailed metadata and processing information.
Near Complete Transcripts
Near complete verbatim transcripts (85% coverage) with precise timing and character positioning.
Speaker Tracking
Comprehensive speaker identification across videos with role context and affiliations.
- Near Complete Transcript Coverage - 85% of spoken content captured in sequential segments
- Temporal Precision - Start/end timestamps accurate to tenths of seconds
- Character Positioning - Exact indices within transcript coverage
- Comprehensive Speaker Data - Names, affiliations, roles, positions with optional handling
- Processing Transparency - Complete audit trail of transcription and processing steps
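To make the relationship between sequential segments, timestamps, and character indices concrete, here is a minimal sketch that orders segments by start time, joins them into one transcript, and totals speaking time per speaker. The field names (`start`, `end`, `speaker`, `text`) are assumptions, and the character offsets are computed locally only to show how such indices relate to the joined text; the dataset's own positioning fields may be defined differently.

```python
from collections import defaultdict

# Hypothetical Extended-style segments for a single video (deliberately out of order).
# Field names and values are illustrative assumptions, not the documented schema.
segments = [
    {"video_id": "v1", "start": 12.5, "end": 20.0, "speaker": "Host", "text": "Welcome back."},
    {"video_id": "v1", "start": 0.0, "end": 12.5, "speaker": "Host", "text": "Hello everyone."},
    {"video_id": "v1", "start": 20.0, "end": 31.2, "speaker": "Guest", "text": "Thanks for having me."},
]


def build_transcript(rows):
    """Join segments in time order and record each segment's character span."""
    ordered = sorted(rows, key=lambda r: r["start"])
    positioned = []
    cursor = 0
    for row in ordered:
        char_start = cursor
        char_end = char_start + len(row["text"])
        positioned.append({**row, "char_start": char_start, "char_end": char_end})
        cursor = char_end + 1  # account for the single space used as separator
    text = " ".join(row["text"] for row in ordered)
    return text, positioned


def speaking_time(rows):
    """Total seconds per speaker, rounded to tenths of a second."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["speaker"]] += row["end"] - row["start"]
    return {speaker: round(seconds, 1) for speaker, seconds in totals.items()}


transcript, positioned = build_transcript(segments)
talk_time = speaking_time(segments)
print(transcript)
print(talk_time)
```

Slicing `transcript[seg["char_start"]:seg["char_end"]]` for any positioned segment returns that segment's text, which is the property character indices are meant to guarantee.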
Dataset Comparison
Feature | YouTube Core Dataset | YouTube Extended Dataset |
---|---|---|
Primary Use Case | Entity sentiment analysis | Complete transcript analysis |
Data Structure | Entity mentions with context | Sequential transcript segments |
Field Count | 40+ fields | 24 fields |
Sentiment Analysis | ✅ Multi-layered | ❌ Not included |
Entity Recognition | ✅ Advanced NER + FIGI | ❌ Not included |
Complete Transcripts | ❌ Context segments only | ✅ Near complete (85%) |
Speaker Tracking | ✅ Basic identification | ✅ Comprehensive details |
Financial Mapping | ✅ Ticker symbols + FIGI | ❌ Not included |
Processing Metadata | ✅ Model versioning | ✅ Complete audit trail |
Common Use Cases
Choose YouTube Core Dataset For
Market Intelligence - Factor-based and quantitative trading strategies, company-level sentiment tracking, buy/sell signal analysis, brand mention monitoring, competitive intelligence, financial research.
Choose YouTube Extended Dataset For
Content Analysis - GenAI / LLM training datasets, NLP training data, linguistic research, speaker network analysis, topic modeling, conversation flow analysis, content search.
Getting Started
Both datasets share the same S3 access patterns and helper functions, making it easy to work with either or both.
S3 Setup Guide
Configure AWS S3 access to download and work with datasets.
Helper Functions
Utility functions and code examples for efficient data processing.
Summary Statistics
Overview of dataset coverage, sizes, and key insights.
Data Dictionaries
Complete field definitions and schemas for both datasets.
YouTube Extended Dataset is the Foundation: The YouTube Extended Dataset provides the comprehensive transcript foundation (85% coverage) from which the YouTube Core Dataset is derived. The YouTube Extended Dataset contains transcripts and speaker data, while the YouTube Core Dataset adds entity recognition and sentiment analysis on top of selected segments.
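Because the Core Dataset is derived from selected Extended segments, one plausible workflow is to look up the surrounding transcript for each entity mention. The sketch below assumes both datasets can be joined on a shared `(video_id, segment_id)` key; that key, the other field names, and the sample rows are hypothetical and should be checked against the data dictionaries.

```python
# Hypothetical Extended-style transcript segments.
extended = [
    {"video_id": "v1", "segment_id": 0, "text": "Hello everyone."},
    {"video_id": "v1", "segment_id": 1, "text": "Acme shares look cheap here."},
    {"video_id": "v1", "segment_id": 2, "text": "Let's move on."},
]

# Hypothetical Core-style entity mention pointing at one of those segments.
core = [
    {"video_id": "v1", "segment_id": 1, "entity": "Acme", "sentiment": "positive"},
]


def attach_context(core_rows, extended_rows, window=1):
    """Add neighbouring transcript text (+/- `window` segments) to each mention."""
    index = {(r["video_id"], r["segment_id"]): r["text"] for r in extended_rows}
    enriched = []
    for row in core_rows:
        vid, seg = row["video_id"], row["segment_id"]
        # Segments missing from the Extended data are skipped rather than raising.
        context = [
            index[(vid, i)]
            for i in range(seg - window, seg + window + 1)
            if (vid, i) in index
        ]
        enriched.append({**row, "context": " ".join(context)})
    return enriched


enriched = attach_context(core, extended)
print(enriched[0]["context"])
```

Skipping missing neighbours keeps the lookup tolerant of gaps, which matters because transcript coverage is described as near complete rather than total.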