AI Discourse at Scale: Topic, Sentiment, and Community Dynamics Across Reddit

DSAN 6000 Big Data Analytics - Fall 2025

Team Members

Mandy Sun
Erika Atoma
Yu-Chien (Violet) Lin
Anna Hyunjung Kim

Project Overview

High-Level Problem Statement:

As AI adoption accelerated, online communities became one of the most immediate reflections of how society interpreted the change. Yet while AI tools spread rapidly, far less is understood about how everyday users talked about them: what excited them, what worried them, and how discussions evolved across different AI-focused spaces.

Our core problem is to uncover how AI discourse formed, shifted, and diversified within Reddit communities during the peak of the AI boom. By analyzing subreddit trajectories, engagement dynamics, sentiment change, and topical divergence, we aim to understand what digital conversations can reveal about public perception, technological anxiety, and community-driven innovation.

Dataset

Source: Reddit Comments and Submissions
Time Period: June 1, 2023 to July 31, 2024
Data Scale and Size:

Data Type	Total Rows	Size (GB)
Comments	3,675,768,958	7.7
Submissions	567,890,869	0.384

Subreddits:

Thematic Category	Included Subreddits
General Discussion & Public Sentiment	AskReddit
AI Product & Model Communities	ChatGPT, ClaudeAI, PerplexityAI, OpenAI, GPT4
Open-Source Model Development & Creative Tools	LocalLLaMA, StableDiffusion
Technical, Research, and Employment Communities	MachineLearning, datascience, programming, computerscience
Societal, Political, and Futurist Perspectives	Futurology, neoliberal, singularity

Filtered Data Scale:

subreddit	num_comments	num_submissions	total_rows
AskReddit	55851868	2686082	58537950
neoliberal	4673171	35352	4708523
ChatGPT	1748893	145862	1894755
singularity	1166552	37112	1203664
Futurology	898463	15884	914347
StableDiffusion	763492	74526	838018
LocalLLaMA	425095	31020	456115
OpenAI	337768	25574	363342
programming	325140	31396	356536
datascience	193664	24717	218381
MachineLearning	146901	37191	184092
ClaudeAI	59790	5097	64887
computerscience	31181	7504	38685
GPT4	4009	1623	5632
PerplexityAI	33	28	61
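
The filter-and-count step behind the table above can be sketched in plain Python. This is a minimal illustration of the logic only, not the project's PySpark job: the record layout (dicts with `subreddit` and `kind` keys) and the function name `count_rows_by_subreddit` are assumptions made for this example.

```python
from collections import Counter

# Subreddits of interest, taken from the thematic table in this document.
SUBREDDITS = {
    "AskReddit", "ChatGPT", "ClaudeAI", "PerplexityAI", "OpenAI", "GPT4",
    "LocalLLaMA", "StableDiffusion", "MachineLearning", "datascience",
    "programming", "computerscience", "Futurology", "neoliberal", "singularity",
}


def count_rows_by_subreddit(records):
    """Count comments and submissions per subreddit of interest.

    `records` is an iterable of dicts with hypothetical keys:
    "subreddit" (str) and "kind" ("comment" or "submission").
    Returns {subreddit: {"num_comments", "num_submissions", "total_rows"}}.
    """
    comments = Counter()
    submissions = Counter()
    for rec in records:
        sub = rec["subreddit"]
        if sub not in SUBREDDITS:
            continue  # drop rows outside the selected communities
        if rec["kind"] == "comment":
            comments[sub] += 1
        else:
            submissions[sub] += 1
    return {
        sub: {
            "num_comments": comments[sub],
            "num_submissions": submissions[sub],
            "total_rows": comments[sub] + submissions[sub],
        }
        for sub in sorted(set(comments) | set(submissions))
    }
```

Subreddit matching here is case-sensitive. At the scale reported above, the same logic would more plausibly run as a filter followed by a grouped count on the Spark cluster.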

Business Questions

We address 10 business questions spanning three analytical approaches:

Exploratory Data Analysis (EDA):

What percentage of high-scoring submissions generate low-quality discussions (few comments and high controversiality)?
How has the popularity and sentiment around different types of generative AI (e.g., image, text, music, code) evolved across Reddit subreddits over time, and what patterns emerge in user engagement and comment dynamics?
When is the optimal time to post content on AI-related subreddits to maximize engagement and content quality?
What are the behavioral patterns of the top 1% contributors across the AI/tech sphere? Are top contributors specialists or generalists? Which subreddit combinations are most frequently co-visited? How long have users been active?
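
The first EDA question only becomes measurable once "high score", "few comments", and "high controversiality" are given thresholds. The sketch below shows one possible operationalization. The thresholds, the field names (`score`, `comment_flags`), and the function name are illustrative assumptions, not the definitions used in the project.

```python
def low_quality_share(submissions, min_score=100, max_comments=5,
                      min_controversial_ratio=0.5):
    """Percentage of high-scoring submissions with low-quality discussion.

    Each submission is a dict with hypothetical keys:
      "score": int, the submission score
      "comment_flags": list of 0/1 controversiality flags, one per comment
    A discussion counts as low quality when it has at least one but at most
    `max_comments` comments AND at least `min_controversial_ratio` of them
    are flagged controversial. Submissions without comments are not counted
    as low quality. Returns None if no submission reaches `min_score`.
    """
    high = [s for s in submissions if s["score"] >= min_score]
    if not high:
        return None
    low_quality = 0
    for s in high:
        flags = s["comment_flags"]
        if not flags or len(flags) > max_comments:
            continue
        if sum(flags) / len(flags) >= min_controversial_ratio:
            low_quality += 1
    return 100.0 * low_quality / len(high)
```

Because the result depends heavily on the chosen cutoffs, it is worth reporting the percentage for several threshold settings, not a single one.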

Natural Language Processing (NLP):

What are the dominant topics discussed in AI-related Reddit posts?
What sentiment patterns characterize each AI and tech subreddit when focusing specifically on high-engagement users?
How has sentiment toward different generative AI tools (e.g., ChatGPT, Midjourney, Sora) shifted over time across subreddits?
What share (e.g., percentage) of the words used to describe AI are pessimistic?
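
The last NLP question, the share of pessimistic words, reduces to a lexicon lookup over tokenized text. The sketch below is a minimal version under stated assumptions: the six-word lexicon, the regex tokenizer, and the function name `pessimistic_share` are all placeholders for illustration, and a real analysis would substitute an established sentiment lexicon.

```python
import re

# Tiny illustrative lexicon (an assumption for this example only).
PESSIMISTIC_WORDS = {"scary", "dangerous", "threat", "worried", "useless", "doom"}

# Lowercase alphabetic tokens, keeping apostrophes inside words.
TOKEN_RE = re.compile(r"[a-z']+")


def pessimistic_share(texts, lexicon=PESSIMISTIC_WORDS):
    """Return the percentage of tokens across `texts` found in `lexicon`."""
    total = 0
    hits = 0
    for text in texts:
        tokens = TOKEN_RE.findall(text.lower())
        total += len(tokens)
        hits += sum(1 for tok in tokens if tok in lexicon)
    return 100.0 * hits / total if total else 0.0
```

For example, `pessimistic_share(["AI is scary", "I am worried about AI", "Great tool"])` counts 2 lexicon hits among 10 tokens. A simple lookup like this ignores negation ("not scary"), so the result is at best a rough upper-level indicator.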

Machine Learning (ML):

Can we predict the comment score of a new AI-related post?
Can we predict whether an AI-related Reddit post will go viral?
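
Predicting virality first requires a binary label. This document does not say how "viral" is defined, so the sketch below assumes one common choice: a post is viral if its score falls in the top fraction of all posts. The function names and the `top_fraction` parameter are hypothetical.

```python
import math


def viral_threshold(scores, top_fraction=0.01):
    """Lowest score still inside the top `top_fraction` of posts
    (nearest-rank method)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    ranked = sorted(scores, reverse=True)
    # Number of posts to label viral; always at least one.
    k = max(1, math.ceil(top_fraction * len(ranked)))
    return ranked[k - 1]


def label_viral(scores, top_fraction=0.01):
    """Return 1 for posts at or above the viral threshold, else 0."""
    cutoff = viral_threshold(scores, top_fraction)
    return [1 if s >= cutoff else 0 for s in scores]
```

With tied scores at the cutoff, slightly more than `top_fraction` of posts receive the viral label. A percentile-based label like this produces a heavily imbalanced target, which matters when choosing evaluation metrics for the classifier.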

See BUSINESS_QUESTIONS.md for detailed technical approaches.

Methodology

Data Processing Infrastructure

Platform: Apache Spark cluster on AWS EC2
Processing: Distributed computing with PySpark
Storage: Amazon S3 for data lake architecture
Scale: About 4.2 billion raw rows, filtered to roughly 70 million rows across the 15 selected subreddits

Analysis Pipeline

Data Acquisition & Filtering (Milestone 0)
- Copy full Reddit dataset from source bucket
- Filter to subreddits of interest
- Generate dataset statistics
Exploratory Data Analysis (Milestone 1) → EDA Page
- Statistical analysis
- Temporal patterns
- Community comparisons
Natural Language Processing (Milestone 2) → NLP Page
- Sentiment analysis
- Topic modeling
- Text mining
Machine Learning (Milestone 3) → ML Page
- Feature engineering
- Model training and evaluation
- Prediction and classification
Final Analysis (Milestone 4) → Conclusion
- Synthesis of findings
- Business recommendations

Repository

GitHub: https://github.com/gu-dsan6000/fall-2025-project-team05
Documentation: All code, data outputs, and analysis documentation available in repository

Last updated: Dec 10, 2025