Datasets Collection

Babbl Labs provides social video intelligence datasets that combine metadata, transcript analysis, entity recognition, and sentiment analysis across thousands of channels and millions of videos from YouTube, the world’s largest video platform.

Quick Setup Guide

Get started with S3 access and helper functions in minutes.

Choose Your Dataset

We offer two complementary datasets designed for different research needs and use cases.

Core Dataset

20,000+ channels monitored • 5+ years historical data • 5-10% transcript coverage • 40+ fields

The Core Dataset adds structure to YouTube audiovisual data at scale, including entity mapping, speaker identification, sentiment analysis, and financial instrument mapping. It is built around specific mentions of companies, brands, and financial instruments within YouTube content.

Key Features:
  • Named Entity Recognition - Companies, brands, people, products with precise mapping
  • Financial Instrument Mapping - FIGI identifiers and ticker symbols for public companies
  • Advanced Sentiment Analysis - Generic sentiment + overt buy/sell recommendations
  • Speaker Intelligence - Host/guest identification with corporate affiliations
  • Segment-Level Precision - Sentiment tied to specific transcript segments mentioning entities

Extended Dataset

20,000+ channels monitored • 5+ years historical data • 80%+ transcript coverage • 24 fields

The Extended Dataset provides comprehensive transcript-level data with detailed speaker tracking and granular segment analysis. Approximately 85% of spoken content is captured, along with detailed metadata and processing information.

Key Features:
  • Near Complete Transcript Coverage - 85% of spoken content captured in sequential segments
  • Temporal Precision - Start/end timestamps accurate to tenths of seconds
  • Character Positioning - Exact indices within transcript coverage
  • Comprehensive Speaker Data - Names, affiliations, roles, positions with optional handling
  • Processing Transparency - Complete audit trail of transcription and processing steps
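
To make the timestamp and character-index features above concrete, here is a minimal Python sketch of how one segment might be modeled and checked. Every field name (`video_id`, `start_s`, `end_s`, `char_start`, `char_end`, `speaker_name`) is a hypothetical placeholder rather than the published schema, and the half-open index convention is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    # Hypothetical field names for illustration; not the real Extended Dataset schema.
    video_id: str
    start_s: float       # segment start, in seconds (tenths-of-a-second precision)
    end_s: float         # segment end, in seconds
    char_start: int      # index of the first character in the transcript (assumed inclusive)
    char_end: int        # index one past the last character (assumed exclusive)
    speaker_name: Optional[str] = None  # speaker data may be missing


def segment_text(transcript: str, seg: Segment) -> str:
    """Return the transcript slice covered by a segment's character indices."""
    return transcript[seg.char_start:seg.char_end]


def is_sequential(segments: List[Segment]) -> bool:
    """Check that segments appear in order and do not overlap in time or text."""
    return all(
        a.end_s <= b.start_s and a.char_end <= b.char_start
        for a, b in zip(segments, segments[1:])
    )
```

With a transcript string and a list of `Segment` objects for one video, `segment_text` recovers the words spoken in each segment and `is_sequential` gives a quick sanity check before further analysis.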

Dataset Comparison

| Feature | Core Dataset | Extended Dataset |
|---|---|---|
| Primary Use Case | Entity sentiment analysis | Complete transcript analysis |
| Data Structure | Entity mentions with context | Sequential transcript segments |
| Field Count | 40+ fields | 24 fields |
| Sentiment Analysis | ✅ Multi-layered | ❌ Not included |
| Entity Recognition | ✅ Advanced NER + FIGI | ❌ Not included |
| Complete Transcripts | ❌ Context segments only | ✅ Near complete (85%) |
| Speaker Tracking | ✅ Basic identification | ✅ Comprehensive details |
| Financial Mapping | ✅ Ticker symbols + FIGI | ❌ Not included |
| Processing Metadata | ✅ Model versioning | ✅ Complete audit trail |

Common Use Cases

Getting Started

Both datasets share the same S3 access patterns and helper functions, making it easy to work with either or both.
Extended Dataset is the Foundation: The Extended Dataset provides the comprehensive transcript foundation (85% coverage) from which the Core Dataset is derived. The Extended Dataset contains transcripts and speaker data, while the Core Dataset adds entity recognition and sentiment analysis on top of selected segments.
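
Because the Core Dataset is derived from Extended segments, one plausible workflow is to attach the full segment text back onto Core entity mentions. The sketch below assumes, purely for illustration, that both datasets share `video_id` and `segment_index` keys and that Extended rows carry a `text` field. None of these names are confirmed by the documentation, so check the actual schemas before relying on them.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def attach_context(core_rows: List[Row], extended_rows: List[Row]) -> List[Row]:
    """Add the matching Extended segment text to each Core mention.

    Assumes hypothetical shared keys `video_id` and `segment_index`.
    Mentions with no matching segment get `segment_text` set to None.
    """
    # Index Extended segments by the assumed composite key for O(1) lookup.
    by_key = {(r["video_id"], r["segment_index"]): r for r in extended_rows}

    enriched = []
    for row in core_rows:
        seg = by_key.get((row["video_id"], row["segment_index"]))
        enriched.append({**row, "segment_text": seg["text"] if seg else None})
    return enriched
```

Returning `None` for unmatched mentions, rather than dropping them, keeps the Core row count stable, which makes coverage gaps between the two datasets easy to spot.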