Evaluate Subreddit Posts in Bulk Using GPT4 Prompting
Reddit is a rich source of user-generated content. People share opinions, ask questions, and discuss a wide range of topics. Analyzing subreddit posts reveals what users care about, emerging trends, and common pain points. By combining this data with OpenAI's language models, you can quickly generate summaries, topic ideas, or keyword suggestions tailored to your goals.
In this tutorial, you will:
- Fetch posts from a subreddit using the Reddit API (via asyncpraw).
- Filter results by an optional keyword to narrow the scope.
- Send aggregated content to OpenAI's API to summarize or generate content ideas.
- View and save the results in a convenient format for further analysis.
Using Google Colab, you can iterate on prompts, conduct SEO research, and generate content outlines for blog posts, newsletters, or study materials.
Requirements
- Google Colab: A browser-based environment where you can run Python code directly (https://colab.research.google.com/).
- OpenAI API Key: Sign up at OpenAI's platform to access their API.
- Reddit API Credentials:
  - Sign in to Reddit and go to https://www.reddit.com/prefs/apps
  - Create a new "personal use script" to obtain your Client ID and Client Secret.
Getting Started in Google Colab
- Open a New Notebook: Open Google Colab and create a new notebook.
- Install Required Libraries: In a cell, run a pip install for the packages the script imports, for example: !pip install asyncpraw nest_asyncio pandas openai. This installs all the packages needed for this tutorial.
- Set Up Your Secrets: Provide your OpenAI API key and Reddit credentials either directly in the script or by loading them from a separate JSON file.
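If you prefer the JSON-file option, the loading step could look roughly like the sketch below. The file name secrets.json and the helper load_secrets are assumptions made for illustration; only the three key names come from the script in this tutorial.

```python
import json
from pathlib import Path

# Key names match the `secrets` dict used by the tutorial script.
REQUIRED_KEYS = ("openai_key", "reddit_clientid", "reddit_secret")


def load_secrets(path="secrets.json"):
    """Read credentials from a JSON file and fail early if any are missing.

    `path` and this helper are hypothetical; adapt them to your own setup.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if not data.get(key)]
    if missing:
        raise ValueError(f"Missing values in {path}: {', '.join(missing)}")
    # Return only the keys the script needs, ignoring anything else in the file.
    return {key: data[key] for key in REQUIRED_KEYS}


# Example usage (requires a secrets.json file uploaded to the Colab session):
# secrets = load_secrets("secrets.json")
```

Keeping credentials in a separate file makes it easier to share the notebook without sharing your keys.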
The Script
The complete script follows. Run it in your Google Colab environment by placing it into a cell after installation and updating your credentials and parameters as needed. The code assumes you ran the installation step first.
Important: Before running, fill in your openai_key, reddit_clientid, reddit_secret, and specify common_subreddit_name and prompt. Optionally, provide a keyword in optional_keyword_search.
import asyncpraw
import nest_asyncio
import pandas as pd
import asyncio
from datetime import datetime
from openai import OpenAI
import json
import os
# Fill in your credentials here or load them from a file
secrets = {
    "openai_key": "",       # Your OpenAI Key
    "reddit_clientid": "",  # Your Reddit Client ID
    "reddit_secret": ""     # Your Reddit Secret
}
openai_secret = secrets["openai_key"]
reddit_clientid = secrets["reddit_clientid"]
reddit_secret = secrets["reddit_secret"]
# Prompt and configuration
prompt = "" # e.g. "Summarize these posts with a focus on SEO insights."
common_subreddit_name = "" # e.g. "AskReddit"
optional_keyword_search = "" # e.g. "marketing", leave blank if not needed
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
cache_file = "reddit_cache.json"
# Load cache if it exists
if os.path.exists(cache_file):
    with open(cache_file, "r") as f:
        df_cache = json.load(f)
else:
    df_cache = {}
nest_asyncio.apply()
async def fetch_posts(subreddit_name, limit=500):  # 500 is around the limit where Reddit doesn't 429
    reddit = asyncpraw.Reddit(client_id=reddit_clientid,
                              client_secret=reddit_secret,
                              user_agent=user_agent)
    subreddit = await reddit.subreddit(subreddit_name)
    posts = []
    count = 0
    after = None
    while count < limit:
        fetched = 0  # Number of posts returned in this batch
        async for post in subreddit.new(limit=100, params={'after': after}):
            posts.append([datetime.fromtimestamp(post.created_utc).isoformat(), post.title, post.selftext])
            count += 1
            fetched += 1
            after = post.name  # Fullname of the last post seen, used for paginated fetching
            if count >= limit:
                break
        if fetched == 0:
            break  # The subreddit has no more posts; stop instead of looping forever
    await reddit.close()
    return pd.DataFrame(posts, columns=['Date', 'Title', 'Content'])
def filter_df(df, optional_keyword_search):
    if optional_keyword_search:
        df = df[df['Title'].str.contains(optional_keyword_search, case=False, na=False) |
                df['Content'].str.contains(optional_keyword_search, case=False, na=False)]
    return df
def openai_api(prompt):
    client = OpenAI(
        api_key=openai_secret,
        # The client appends /chat/completions itself, so base_url stops at /v1
        base_url="https://api.openai.com/v1"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return completion.choices[0].message.content
async def main():
    try:
        subreddit_name = common_subreddit_name
        # Check cache
        if subreddit_name in df_cache:
            df = pd.DataFrame(df_cache[subreddit_name])
        else:
            df = await fetch_posts(subreddit_name)
            df_cache[subreddit_name] = df.to_dict(orient='records')
            with open(cache_file, "w") as f:
                json.dump(df_cache, f)
        if optional_keyword_search:
            df = filter_df(df, optional_keyword_search)
        if df.empty:
            print("No posts found.")
        else:
            df_content = '. '.join(df['Content'].tolist())
            full_prompt = f"{prompt}. The content to evaluate is: {df_content}"
            output = openai_api(full_prompt)
            print(f"\nGPT Response for subreddit: {subreddit_name}")
            if optional_keyword_search:
                print(f"Filtering for keyword: '{optional_keyword_search}'")
            print(output)
            print("\nRaw Post Data:")
            display(df)  # In Colab, display DataFrame nicely
            # Optionally save the raw data
            df.to_csv('reddit_posts.csv', index=False)
            print("Raw data saved as 'reddit_posts.csv'.")
    except Exception as e:
        print(f"An error occurred: {e}")
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
loop.run_until_complete(main())
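The script joins every post body into a single prompt, and models accept only a finite amount of input, so a busy subreddit may produce a prompt that is too long. The sketch below shows one possible way to cap the prompt size. The helper build_prompt and its max_chars parameter are hypothetical and not part of the script above, and a character budget is only a rough stand-in for a token limit.

```python
def build_prompt(instruction, post_bodies, max_chars=20000):
    """Join post bodies after the instruction, stopping before max_chars is exceeded.

    `max_chars` is an illustrative character budget, not an official limit.
    """
    header = f"{instruction}. The content to evaluate is: "
    budget = max_chars - len(header)
    kept = []
    for body in post_bodies:
        body = body.strip()
        if not body:
            continue  # link and image posts have an empty selftext
        cost = len(body) + (2 if kept else 0)  # 2 accounts for the '. ' separator
        if cost > budget:
            break  # stop at the first post that no longer fits
        kept.append(body)
        budget -= cost
    return header + ". ".join(kept)


# Small demonstration with made-up post bodies.
print(build_prompt("Summarize these posts", ["First post", "", "Second post"]))
```

Inside main(), a call such as build_prompt(prompt, df['Content'].tolist()) could stand in for the two lines that build full_prompt, if you find that requests fail because the input is too large.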
How It Works
- Data Fetching: Using asyncpraw, the script connects to Reddit and fetches posts from the specified subreddit. It fetches posts in increments of 100; use the script's limit parameter to control the total number retrieved.
- Filtering by Keyword: If you supply a keyword in optional_keyword_search, the script narrows the results to posts that contain this keyword in either the title or the content.
- OpenAI Integration: The combined post content is fed into the OpenAI API along with your chosen prompt. For example, if you ask for "SEO-related insights," the model will highlight relevant keywords or suggest topics.
- Output: The model's response is printed to the console, along with the raw Reddit data. A CSV of the raw data is also saved for your records. In Colab, display(df) shows a nicely formatted DataFrame. You can use the CSV file for further analysis outside of Colab.
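To make the keyword step concrete, here is a small pure-Python sketch of the same title-or-content rule, applied to made-up sample posts. It uses a plain case-insensitive substring test, which is an assumption: the script's pandas str.contains call treats the keyword as a regular expression by default, so the two can differ for keywords containing special characters. The helper name matches_keyword is hypothetical.

```python
def matches_keyword(post, keyword):
    """Return True if keyword appears in the post's title or content, ignoring case."""
    if not keyword:
        return True  # an empty keyword means no filtering
    needle = keyword.lower()
    title = (post.get("Title") or "").lower()
    content = (post.get("Content") or "").lower()
    return needle in title or needle in content


# Made-up sample rows using the same column names as the script's DataFrame.
sample_posts = [
    {"Title": "Best marketing books?", "Content": "Looking for recommendations."},
    {"Title": "Weekly thread", "Content": "Share your Marketing wins here."},
    {"Title": "Cat pictures", "Content": ""},
]

filtered = [post for post in sample_posts if matches_keyword(post, "marketing")]
print([post["Title"] for post in filtered])
```

The first two sample posts match, one through its title and one through its content, while the third is dropped.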
Running the Script
- Ensure you've filled in your openai_key, reddit_clientid, reddit_secret, common_subreddit_name, and prompt.
- Run the code cell. Colab will handle each step, and once complete you will see the GPT output and a preview of the fetched Reddit data.
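A quick pre-flight check can catch an empty setting before any API call is made. The following is a minimal sketch; the helper missing_settings is hypothetical, and it assumes you collect the five values into a dictionary first.

```python
# Names of the settings the tutorial asks you to fill in before running.
REQUIRED_SETTINGS = ("openai_key", "reddit_clientid", "reddit_secret",
                     "common_subreddit_name", "prompt")


def missing_settings(settings, required=REQUIRED_SETTINGS):
    """Return the names of required settings that are absent or blank."""
    missing = []
    for name in required:
        value = settings.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


# Example with placeholder values; the prompt has been left empty on purpose.
example = {
    "openai_key": "placeholder-key",
    "reddit_clientid": "placeholder-id",
    "reddit_secret": "placeholder-secret",
    "common_subreddit_name": "AskReddit",
    "prompt": "",
}
print("Missing settings:", missing_settings(example))
```

If the returned list is not empty, fill in those values before running the main cell.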
Next Steps
- Refine Your Prompt: Experiment with different prompts to generate outlines, topic clusters, or long-form summaries.
- Adjust the Subreddit and Keywords: Try different subreddits or keywords to gather content that aligns with your interests or SEO niche.
- Use Results for Analysis: Download reddit_posts.csv from the left file explorer in Colab and analyze it locally. The GPT response can be copy-pasted into your notes or integrated into a content strategy.
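As one example of local analysis, the sketch below counts posts per calendar day using only the standard library. It assumes the CSV has the Date column written by the script, holding ISO 8601 timestamps; the helper name posts_per_day is hypothetical.

```python
import csv
from collections import Counter


def posts_per_day(csv_file):
    """Count rows per calendar day from the 'Date' column of an open CSV file."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        day = row["Date"][:10]  # 'YYYY-MM-DD' prefix of the ISO timestamp
        counts[day] += 1
    return dict(sorted(counts.items()))


# Example usage once reddit_posts.csv has been downloaded:
# with open("reddit_posts.csv", newline="", encoding="utf-8") as f:
#     print(posts_per_day(f))
```

The same pattern extends to other simple summaries, such as counting how often a term appears in the Title column.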
With this pipeline, you have a flexible tool for mining Reddit for insights and leveraging advanced language models to interpret large amounts of user-generated content.
- Evaluate Subreddit Posts in Bulk Using GPT4 Prompting - December 12, 2024