Prerequisites

Before you begin, make sure you have Python installed on your system along with the following packages:
pip install boto3 pandas matplotlib seaborn tqdm python-dotenv

Complete Dependencies List

For full analysis capabilities, your scripts will use the following imports (all provided by the packages installed above):
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import boto3
from io import StringIO
from typing import Tuple, Optional, List
from os import getenv
import warnings
from dotenv import load_dotenv
import tempfile
from concurrent.futures import ThreadPoolExecutor
from tqdm.notebook import tqdm

Authentication

You’ll receive AWS credentials from Babbl Labs to access your dataset. These credentials include:
  • Access Key ID: Your unique AWS access key
  • Secret Access Key: Your AWS secret key
  • S3 Bucket Name: The specific bucket containing your dataset
Keep your AWS credentials secure and never commit them to version control. Store them as environment variables or in a secure configuration file.

Configuration

Create a .env file in your project directory with your credentials:
# Create .env file
ACCESS_KEY=your_access_key_here
SECRET_KEY=your_secret_key_here
S3_BUCKET_NAME=your_bucket_name_here
Then load them in your Python code:
from dotenv import load_dotenv
from os import getenv
import warnings

# Load environment variables from .env file
load_dotenv(override=True)
warnings.filterwarnings("ignore")

# AWS credentials from environment
ACCESS_KEY = getenv("ACCESS_KEY")
SECRET_KEY = getenv("SECRET_KEY")
S3_BUCKET_NAME = getenv("S3_BUCKET_NAME")
S3_REGION = "us-east-1"

# Validate credentials are loaded (rejects both missing and empty values)
if not all([ACCESS_KEY, SECRET_KEY, S3_BUCKET_NAME]):
    raise ValueError("Please set AWS credentials and bucket name in .env file.")

Alternative: Direct Configuration

If you prefer to set credentials directly in code (not recommended for production):
# AWS credentials (provided by Babbl Labs)
ACCESS_KEY = "your_access_key_here"
SECRET_KEY = "your_secret_key_here"  
S3_BUCKET_NAME = "your_bucket_name_here"
S3_REGION = "us-east-1"

Next Steps

Once your S3 connection is working:
  1. Helper Functions - Learn about utility functions for data loading and processing
  2. Summary Statistics - Explore your dataset with built-in analysis tools
  3. YouTube Core Dataset - Dive into the main YSMEI dataset features
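As a preview of the data-loading helpers, here is a hedged sketch of pulling a CSV object from the bucket into pandas, using the `StringIO` pattern from the import list above. The object key used in the usage comment is a placeholder, not a real key from the dataset, and the function names are illustrative.

```python
import pandas as pd
from io import StringIO

def csv_bytes_to_df(raw: bytes) -> pd.DataFrame:
    """Parse raw CSV bytes (as returned by an S3 GetObject body) into a DataFrame."""
    return pd.read_csv(StringIO(raw.decode("utf-8")))

def load_csv_from_s3(s3, bucket: str, key: str) -> pd.DataFrame:
    """Fetch an object from S3 and parse it as CSV.

    get_object returns a dict whose "Body" is a streaming handle;
    .read() drains it into bytes before parsing.
    """
    obj = s3.get_object(Bucket=bucket, Key=key)
    return csv_bytes_to_df(obj["Body"].read())

# Usage (key name is illustrative only):
# df = load_csv_from_s3(s3, S3_BUCKET_NAME, "example.csv")
```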