Overview

Datasets Collection

Babbl Labs provides social video intelligence datasets that combine metadata, transcript analysis, entity recognition, and sentiment analysis across thousands of channels and millions of videos from YouTube, the world’s largest video platform.

Quick Setup Guide

Get started with S3 access and helper functions in minutes.

Choose Your Dataset

We offer two complementary datasets designed for different research needs and use cases.

Core Dataset

20,000+ channels monitored • 5+ years historical data • 5-10% transcript coverage • 40+ fields

The Core Dataset adds structure to YouTube audiovisual data at scale, including entity mapping, speaker identification, sentiment analysis, and financial instrument mapping. It is built around specific mentions of companies, brands, and financial instruments within YouTube content.

Entity Recognition

Companies, brands, and financial instruments with FIGI mapping and ticker symbols.

Sentiment Analysis

Multi-layered sentiment including buy/sell signals and generic sentiment toward entities.

Key Features:

Named Entity Recognition - Companies, brands, people, products with precise mapping
Financial Instrument Mapping - FIGI identifiers and ticker symbols for public companies
Advanced Sentiment Analysis - Generic sentiment + overt buy/sell recommendations
Speaker Intelligence - Host/guest identification with corporate affiliations
Segment-Level Precision - Sentiment tied to specific transcript segments mentioning entities
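
As an illustration of how segment-level entity sentiment might be rolled up per instrument, here is a minimal sketch. Every field name (`video_id`, `entity_name`, `ticker`, `sentiment`, `recommendation`), the -1.0 to 1.0 sentiment scale, and the "buy"/"sell" labels are assumptions made for this example and are not taken from the Core Dataset schema; the Data Dictionaries define the real fields.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntityMention:
    # Hypothetical record shape; the real Core Dataset fields may differ.
    video_id: str
    entity_name: str
    ticker: Optional[str]          # None when the entity has no mapped instrument
    sentiment: float               # assumed scale: -1.0 (negative) to 1.0 (positive)
    recommendation: Optional[str]  # assumed labels: "buy", "sell", or None


def summarize_by_ticker(mentions):
    """Count mentions, average sentiment, and tally buy/sell labels per ticker."""
    totals = defaultdict(lambda: {"mentions": 0, "sum": 0.0, "buy": 0, "sell": 0})
    for m in mentions:
        if m.ticker is None:
            continue  # skip entities that are not mapped to an instrument
        t = totals[m.ticker]
        t["mentions"] += 1
        t["sum"] += m.sentiment
        if m.recommendation in ("buy", "sell"):
            t[m.recommendation] += 1
    return {
        ticker: {
            "mentions": t["mentions"],
            "mean_sentiment": t["sum"] / t["mentions"],
            "buy": t["buy"],
            "sell": t["sell"],
        }
        for ticker, t in totals.items()
    }
```

Calling `summarize_by_ticker` on a list of such records returns one summary dictionary per ticker, which could then feed a factor or screening workflow.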

Extended Dataset

20,000+ channels monitored • 5+ years historical data • 80%+ transcript coverage • 24 fields

The Extended Dataset provides comprehensive transcript-level data with detailed speaker tracking and granular segment analysis. Approximately 85% of spoken content is captured, along with detailed metadata and processing information.

Near Complete Transcripts

Near complete verbatim transcripts (85% coverage) with precise timing and character positioning.

Speaker Tracking

Comprehensive speaker identification across videos with role context and affiliations.

Key Features:

Near Complete Transcript Coverage - 85% of spoken content captured in sequential segments
Temporal Precision - Start/end timestamps accurate to tenths of seconds
Character Positioning - Exact character indices of each segment within the transcript
Comprehensive Speaker Data - Names, affiliations, roles, positions with optional handling
Processing Transparency - Complete audit trail of transcription and processing steps
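
To show how timestamps and character indices could work together, the sketch below slices a transcript by segment and totals speaking time per speaker. The field names (`start_sec`, `end_sec`, `char_start`, `char_end`, `speaker_name`) and the half-open index convention are assumptions for illustration only; the Extended Dataset data dictionary is the authority on the actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    # Hypothetical record shape; the real Extended Dataset fields may differ.
    start_sec: float                    # segment start, in seconds
    end_sec: float                      # segment end, in seconds
    char_start: int                     # assumed inclusive index into the transcript
    char_end: int                       # assumed exclusive index into the transcript
    speaker_name: Optional[str] = None  # speaker data is described as optional


def segment_text(transcript, seg):
    """Return the slice of the transcript covered by one segment."""
    return transcript[seg.char_start:seg.char_end]


def speaking_time(segments):
    """Total seconds per speaker, rounded to tenths; missing names go to 'unknown'."""
    totals = {}
    for seg in segments:
        key = seg.speaker_name or "unknown"
        totals[key] = totals.get(key, 0.0) + (seg.end_sec - seg.start_sec)
    return {name: round(seconds, 1) for name, seconds in totals.items()}
```

Rounding to one decimal place mirrors the stated tenth-of-a-second timestamp precision.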

Dataset Comparison

Feature	Core Dataset	Extended Dataset
Primary Use Case	Entity sentiment analysis	Complete transcript analysis
Data Structure	Entity mentions with context	Sequential transcript segments
Field Count	40+ fields	24 fields
Sentiment Analysis	✅ Multi-layered	❌ Not included
Entity Recognition	✅ Advanced NER + FIGI	❌ Not included
Complete Transcripts	❌ Context segments only	✅ Near complete (85%)
Speaker Tracking	✅ Basic identification	✅ Comprehensive details
Financial Mapping	✅ Ticker symbols + FIGI	❌ Not included
Processing Metadata	✅ Model versioning	✅ Complete audit trail

Common Use Cases

Choose Core Dataset For

Market Intelligence - Factor-based and quantitative trading strategies, company-level sentiment tracking, buy/sell signal analysis, brand mention monitoring, competitive intelligence, and financial research.

Choose Extended Dataset For

Content Analysis - GenAI and LLM training datasets, NLP training data, linguistic research, speaker network analysis, topic modeling, conversation flow analysis, and content search.

Getting Started

Both datasets share the same S3 access patterns and helper functions, making it easy to work with either or both.

S3 Setup Guide

Configure AWS S3 access to download and work with datasets.

Helper Functions

Utility functions and code examples for efficient data processing.

Summary Statistics

Overview of dataset coverage, sizes, and key insights.

Data Dictionaries

Complete field definitions and schemas for both datasets.

Extended Dataset is the Foundation: The Extended Dataset provides the comprehensive transcript foundation (85% coverage) from which the Core Dataset is derived. The Extended Dataset contains transcripts and speaker data, while the Core Dataset adds entity recognition and sentiment analysis on top of selected segments.
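
The relationship described above, where entity rows are built on top of selected transcript segments, can be pictured with a deliberately naive sketch. This is not the actual Babbl Labs pipeline: plain substring matching stands in for real entity recognition, and the dictionary keys (`video_id`, `text`, `entity_name`, `ticker`, `context`) are hypothetical names chosen for the example.

```python
def select_entity_segments(segments, entity_tickers):
    """Keep only segments that mention a known entity, one row per match.

    segments: list of dicts with hypothetical keys 'video_id' and 'text'.
    entity_tickers: dict mapping an entity name to its ticker symbol.
    """
    rows = []
    for seg in segments:
        lowered = seg["text"].lower()
        for name, ticker in entity_tickers.items():
            # Naive case-insensitive substring match, standing in for real NER.
            if name.lower() in lowered:
                rows.append({
                    "video_id": seg["video_id"],
                    "entity_name": name,
                    "ticker": ticker,
                    "context": seg["text"],
                })
    return rows
```

Segments without a match are dropped, which is one way to picture why the Core Dataset covers a much smaller share of each transcript than the Extended Dataset.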
