AI Discourse at Scale: Topic, Sentiment, and Community Dynamics Across Reddit
DSAN 6000 Big Data Analytics - Fall 2024
Team Members
- Mandy Sun
- Erika Atoma
- Yu-Chien (Violet) Lin
- Anna Hyunjung Kim
Project Overview
High-Level Problem Statement:
As AI adoption accelerated at an unprecedented pace, online communities became one of the most immediate reflections of how society interpreted this change. Yet while AI tools spread rapidly, far less is understood about how everyday users talked about them: what excited them, what worried them, and how discussions evolved across different AI-focused spaces.
Our core problem is to uncover how AI discourse formed, shifted, and diversified within Reddit communities during the peak of the AI boom. By analyzing subreddit trajectories, engagement dynamics, sentiment change, and topical divergence, we aim to understand what digital conversations can reveal about public perception, technological anxiety, and community-driven innovation.
Dataset
Source: Reddit Comments and Submissions
Time Period: June 1, 2023 to July 31, 2024
Data Scale and Size:
| Data Type | Total Rows | Size (GB) |
|---|---|---|
| Comments | 3,675,768,958 | 7.7 |
| Submissions | 567,890,869 | 0.384 |

Subreddits:
| Thematic Category | Included Subreddits |
|---|---|
| General Discussion & Public Sentiment | AskReddit |
| AI Product & Model Communities | ChatGPT, ClaudeAI, PerplexityAI, OpenAI, GPT4 |
| Open-Source Model Development & Creative Tools | LocalLLaMA, StableDiffusion |
| Technical, Research, and Employment Communities | MachineLearning, datascience, programming, computerscience |
| Societal, Political, and Futurist Perspectives | Futurology, neoliberal, singularity |

Filtered Data Scale:
| subreddit | num_comments | num_submissions | total_rows |
|---|---|---|---|
| AskReddit | 55851868 | 2686082 | 58537950 |
| neoliberal | 4673171 | 35352 | 4708523 |
| ChatGPT | 1748893 | 145862 | 1894755 |
| singularity | 1166552 | 37112 | 1203664 |
| Futurology | 898463 | 15884 | 914347 |
| StableDiffusion | 763492 | 74526 | 838018 |
| LocalLLaMA | 425095 | 31020 | 456115 |
| OpenAI | 337768 | 25574 | 363342 |
| programming | 325140 | 31396 | 356536 |
| datascience | 193664 | 24717 | 218381 |
| MachineLearning | 146901 | 37191 | 184092 |
| ClaudeAI | 59790 | 5097 | 64887 |
| computerscience | 31181 | 7504 | 38685 |
| GPT4 | 4009 | 1623 | 5632 |
| PerplexityAI | 33 | 28 | 61 |
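As a quick consistency check on the filtered counts above, the sketch below recomputes each subreddit's total as comments plus submissions and derives its share of the filtered corpus. The counts are copied from the figures above; the variable and function names are illustrative and are not taken from the project code.

```python
# Filtered counts copied from the figures above: (num_comments, num_submissions).
FILTERED_COUNTS = {
    "AskReddit": (55_851_868, 2_686_082),
    "neoliberal": (4_673_171, 35_352),
    "ChatGPT": (1_748_893, 145_862),
    "singularity": (1_166_552, 37_112),
    "Futurology": (898_463, 15_884),
    "StableDiffusion": (763_492, 74_526),
    "LocalLLaMA": (425_095, 31_020),
    "OpenAI": (337_768, 25_574),
    "programming": (325_140, 31_396),
    "datascience": (193_664, 24_717),
    "MachineLearning": (146_901, 37_191),
    "ClaudeAI": (59_790, 5_097),
    "computerscience": (31_181, 7_504),
    "GPT4": (4_009, 1_623),
    "PerplexityAI": (33, 28),
}


def total_rows(counts):
    """Return comments + submissions for each subreddit."""
    return {name: comments + subs for name, (comments, subs) in counts.items()}


def shares(totals):
    """Return each subreddit's fraction of all filtered rows."""
    grand_total = sum(totals.values())
    return {name: rows / grand_total for name, rows in totals.items()}


totals = total_rows(FILTERED_COUNTS)
grand_total = sum(totals.values())
row_shares = shares(totals)

# Print subreddits from largest to smallest share.
for name in sorted(totals, key=totals.get, reverse=True):
    print(f"{name:<16} {totals[name]:>12,} {row_shares[name]:7.2%}")
print(f"{'TOTAL':<16} {grand_total:>12,}")
```

AskReddit accounts for roughly 84% of the filtered rows, so cross-subreddit comparisons are likely to need per-subreddit normalization rather than raw counts.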
Business Questions
We address 10 business questions spanning three analytical approaches:
Exploratory Data Analysis (EDA):
- What percentage of high-scoring submissions generate low-quality discussions (few comments, but a high share of controversial ones)?
- How has the popularity and sentiment around different types of generative AI (e.g., image, text, music, code) evolved across Reddit subreddits over time, and what patterns emerge in user engagement and comment dynamics?
- When is the optimal time to post content on AI-related subreddits to maximize engagement and content quality?
- What are the behavioral patterns of the top 1% contributors across the AI/tech sphere? Are top contributors specialists or generalists? Which subreddit combinations are most frequently co-visited? How long have users been active?
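To make the first EDA question concrete, here is a minimal sketch of how the percentage could be computed once each submission has been joined with its comments. The thresholds (`HIGH_SCORE`, `FEW_COMMENTS`, `HIGH_CONTROVERSY`) and the `Submission` record are hypothetical; the project's actual cutoffs are not stated here. The `controversial_share` field assumes the per-comment `controversiality` flag has already been averaged per submission.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    score: int
    num_comments: int
    controversial_share: float  # fraction of comments flagged controversial


# Hypothetical thresholds, chosen only for illustration.
HIGH_SCORE = 100
FEW_COMMENTS = 10
HIGH_CONTROVERSY = 0.2


def is_low_quality(sub):
    """Few comments AND a high share of controversial comments."""
    return (sub.num_comments <= FEW_COMMENTS
            and sub.controversial_share >= HIGH_CONTROVERSY)


def low_quality_share(submissions):
    """Fraction of high-scoring submissions whose discussion is low quality."""
    high_scoring = [s for s in submissions if s.score >= HIGH_SCORE]
    if not high_scoring:
        return 0.0
    return sum(is_low_quality(s) for s in high_scoring) / len(high_scoring)


sample = [
    Submission(500, 4, 0.50),    # high score, low-quality discussion
    Submission(250, 120, 0.05),  # high score, healthy discussion
    Submission(150, 8, 0.00),    # high score, few comments, not controversial
    Submission(900, 6, 0.25),    # high score, low-quality discussion
    Submission(3, 2, 1.00),      # low score: excluded from the denominator
]

print(f"{100 * low_quality_share(sample):.1f}% of high-scoring submissions")
```

The denominator is restricted to high-scoring submissions, which matches the wording of the question; changing the thresholds changes the answer, so any reported percentage should state them.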
Natural Language Processing (NLP):
- What are the dominant topics discussed in AI-related Reddit posts?
- What sentiment patterns characterize each AI and tech subreddit when focusing specifically on high-engagement users?
- How has sentiment toward different generative AI tools (e.g., ChatGPT, Midjourney, Sora) shifted over time across subreddits?
- What share (e.g., percentage) of the words used to describe AI are pessimistic?
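The last NLP question reduces to a ratio of lexicon hits to total tokens. The sketch below shows that calculation in plain Python. The `PESSIMISTIC_WORDS` lexicon and the tokenizer are placeholders rather than the project's actual resources, and the sketch counts every token in a comment instead of only the words that describe AI.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon; a real analysis would use a published
# sentiment lexicon or a curated word list.
PESSIMISTIC_WORDS = frozenset(
    {"scary", "dangerous", "threat", "useless", "worried", "doom"}
)

TOKEN_RE = re.compile(r"[a-z']+")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return TOKEN_RE.findall(text.lower())


def pessimistic_share(comments):
    """Fraction of all tokens that appear in the pessimistic lexicon."""
    counts = Counter()
    for comment in comments:
        counts.update(tokenize(comment))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    hits = sum(n for word, n in counts.items() if word in PESSIMISTIC_WORDS)
    return hits / total


sample_comments = [
    "AI is scary and dangerous",  # 5 tokens, 2 lexicon hits
    "This model is great",        # 4 tokens, 0 lexicon hits
    "Worried about jobs",         # 3 tokens, 1 lexicon hit
]

print(f"{pessimistic_share(sample_comments):.1%} pessimistic tokens")
```

Removing stop words before counting would raise the share, so the choice of denominator should be reported alongside the result.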
Machine Learning (ML):
- Can we predict the comment score of a new AI-related post?
- Can we predict whether an AI-related Reddit post will go viral?
See BUSINESS_QUESTIONS.md for detailed technical approaches.
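The virality question needs a binary target before any classifier can be trained. The sketch below shows one possible labeling rule, which is an assumption rather than the project's definition: a post is "viral" if its score is at or above a chosen percentile of observed scores. The function names and the 90th-percentile default are hypothetical.

```python
import math


def percentile_cutoff(scores, pct):
    """Nearest-rank percentile: the smallest score such that at least
    pct percent of the values are less than or equal to it."""
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def label_viral(scores, pct=90.0):
    """Return 1 for posts at or above the percentile cutoff, else 0."""
    cutoff = percentile_cutoff(scores, pct)
    return [int(score >= cutoff) for score in scores]


scores = [1, 2, 3, 5, 8, 13, 21, 34, 55, 1200]
cutoff = percentile_cutoff(scores, 90.0)
labels = label_viral(scores, pct=90.0)

for score, label in zip(scores, labels):
    print(f"score={score:>5} viral={label}")
```

To avoid leaking information into evaluation, the cutoff should be computed on the training split only and then applied unchanged to validation and test data.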
Methodology
Data Processing Infrastructure
- Platform: Apache Spark cluster on AWS EC2
- Processing: Distributed computing with PySpark
- Storage: Amazon S3 for data lake architecture
- Scale: About 4.2 billion raw rows (comments plus submissions), filtered to roughly 70 million rows for analysis
Analysis Pipeline
- Data Acquisition & Filtering (Milestone 0)
  - Copy full Reddit dataset from source bucket
  - Filter to subreddits of interest
  - Generate dataset statistics
- Exploratory Data Analysis (Milestone 1) → EDA Page
  - Statistical analysis
  - Temporal patterns
  - Community comparisons
- Natural Language Processing (Milestone 2) → NLP Page
  - Sentiment analysis
  - Topic modeling
  - Text mining
- Machine Learning (Milestone 3) → ML Page
  - Feature engineering
  - Model training and evaluation
  - Prediction and classification
- Final Analysis (Milestone 4) → Conclusion
  - Synthesis of findings
  - Business recommendations
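The filtering and statistics steps of Milestone 0 can be sketched in plain Python over JSON-lines records. This is an illustration of the logic only, not the project's code. It assumes each record carries a `subreddit` field that matches the names listed in the Dataset section exactly, including case. On the Spark cluster the same predicate would typically be written as a DataFrame filter using `Column.isin`.

```python
import json
from collections import Counter

# Subreddit names taken from the Dataset section.
SUBREDDITS_OF_INTEREST = frozenset({
    "AskReddit", "ChatGPT", "ClaudeAI", "PerplexityAI", "OpenAI", "GPT4",
    "LocalLLaMA", "StableDiffusion", "MachineLearning", "datascience",
    "programming", "computerscience", "Futurology", "neoliberal",
    "singularity",
})


def filter_records(lines, keep=SUBREDDITS_OF_INTEREST):
    """Yield parsed records whose subreddit is in the keep set."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("subreddit") in keep:
            yield record


# Tiny in-memory stand-in for a JSON-lines dump (hypothetical rows).
raw_lines = [
    '{"subreddit": "ChatGPT", "body": "hello"}',
    '{"subreddit": "aww", "body": "cute"}',
    '{"subreddit": "LocalLLaMA", "body": "quantized"}',
    '',
    '{"subreddit": "ChatGPT", "body": "again"}',
]

kept = list(filter_records(raw_lines))
stats = Counter(record["subreddit"] for record in kept)

for name, count in stats.most_common():
    print(f"{name}: {count}")
```

Using a generator keeps memory use flat for large inputs, and the per-subreddit `Counter` corresponds to the "Generate dataset statistics" step.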
Repository
- GitHub: https://github.com/gu-dsan6000/fall-2025-project-team05
- Documentation: All code, data outputs, and analysis documentation available in repository
Last updated: Dec 10, 2025