Evaluate Subreddit Posts in Bulk Using GPT4 Prompting

Reddit is a rich source of user-generated content. People share opinions, ask questions, and discuss a wide range of topics. Analyzing subreddit posts reveals what users care about, emerging trends, and common pain points. Paired with OpenAI’s language models, this data lets you quickly generate summaries, topic ideas, or keyword suggestions tailored to your goals.

In this tutorial, you will:

Fetch posts from a subreddit using the Reddit API (via asyncpraw).
Filter results by an optional keyword to narrow the scope.
Send aggregated content to OpenAI’s API to summarize or generate content ideas.
View and save the results in a convenient format for further analysis.

Using Google Colab, you can iterate on prompts, conduct SEO research, and generate content outlines for blog posts, newsletters, or study materials.

Requirements

Google Colab:
A browser-based environment where you can run Python code directly.
(https://colab.research.google.com/)
OpenAI API Key:
Sign up at OpenAI’s platform to access their API.
Reddit API Credentials:
- Sign in to Reddit and go to https://www.reddit.com/prefs/apps
- Create a new “personal use script” to obtain your Client ID and Client Secret.

Getting Started in Google Colab

Open a New Notebook:
Open Google Colab and create a new notebook.
Install Required Libraries:
In a cell, run:

!pip install asyncpraw openai nest_asyncio pandas

This will install all necessary packages for this tutorial.
Set Up Your Secrets:
Provide your OpenAI API key and Reddit credentials either directly in the script or by loading them from a separate JSON file.
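
If you prefer to keep credentials out of the notebook, one option is to load them from a JSON file. The sketch below is a minimal example, assuming a file called secrets.json (a hypothetical name) that holds the same three keys the script uses. The helper load_secrets is illustrative and is not part of the script below.

```python
import json

# The three keys the tutorial script expects in its `secrets` dictionary
REQUIRED_KEYS = ("openai_key", "reddit_clientid", "reddit_secret")

def load_secrets(path="secrets.json"):
    """Read credentials from a JSON file and check that every key is present and non-empty.

    The default file name is an assumption made for this example.
    """
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    missing = [key for key in REQUIRED_KEYS if not data.get(key)]
    if missing:
        raise ValueError(f"Missing or empty keys in {path}: {', '.join(missing)}")
    # Return only the expected keys so stray entries are ignored
    return {key: data[key] for key in REQUIRED_KEYS}
```

After uploading the file to Colab, secrets = load_secrets("secrets.json") could stand in for the hard-coded dictionary.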

The Script

The complete script follows. Run it in your Google Colab environment by placing it into a cell after installation and updating your credentials and parameters as needed. The code assumes you ran the installation step first.

Important: Before running, fill in your openai_key, reddit_clientid, reddit_secret, and specify common_subreddit_name and prompt. Optionally, provide a keyword in optional_keyword_search.

import asyncpraw
import nest_asyncio
import pandas as pd
import asyncio
from datetime import datetime
from openai import OpenAI
import json
import os

# Fill in your credentials here or load them from a file
secrets = {
    "openai_key": "",          # Your OpenAI Key
    "reddit_clientid": "",     # Your Reddit Client ID
    "reddit_secret": ""         # Your Reddit Secret
}

openai_secret = secrets["openai_key"]
reddit_clientid = secrets["reddit_clientid"]
reddit_secret = secrets["reddit_secret"]

# Prompt and configuration
prompt = ""                  # e.g. "Summarize these posts with a focus on SEO insights."
common_subreddit_name = ""   # e.g. "AskReddit"
optional_keyword_search = "" # e.g. "marketing", leave blank if not needed

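# Reddit's API rules ask for a unique, descriptive user agent rather than a browser string.
# Placeholder value: replace the app name and username with your own.
user_agent = 'script:subreddit-gpt-eval:v1.0 (by u/your_reddit_username)'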

cache_file = "reddit_cache.json"

# Load cache if it exists
if os.path.exists(cache_file):
    with open(cache_file, "r") as f:
        df_cache = json.load(f)
else:
    df_cache = {}

nest_asyncio.apply()

async def fetch_posts(subreddit_name, limit=500):  # Around 500 is where Reddit starts returning 429s
    reddit = asyncpraw.Reddit(client_id=reddit_clientid,
                              client_secret=reddit_secret,
                              user_agent=user_agent)
    posts = []
    try:
        subreddit = await reddit.subreddit(subreddit_name)
        # asyncpraw paginates internally (up to 100 posts per request), so this loop
        # ends cleanly even when the subreddit has fewer than `limit` posts.
        async for post in subreddit.new(limit=limit):
            # created_utc is a Unix timestamp; fromtimestamp converts it to the machine's local time
            posts.append([datetime.fromtimestamp(post.created_utc).isoformat(), post.title, post.selftext])
    finally:
        await reddit.close()  # Always release the HTTP session, even after an error

    return pd.DataFrame(posts, columns=['Date', 'Title', 'Content'])

def filter_df(df, optional_keyword_search):
    if optional_keyword_search:
        df = df[df['Title'].str.contains(optional_keyword_search, case=False, na=False) |
                 df['Content'].str.contains(optional_keyword_search, case=False, na=False)]
    return df

def openai_api(prompt):
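    # The client's default base URL already points at the OpenAI API. Passing the full
    # chat/completions endpoint as base_url would produce an invalid request path.
    client = OpenAI(api_key=openai_secret)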

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": prompt}
        ]
    )

    return completion.choices[0].message.content

async def main():
    try:
        subreddit_name = common_subreddit_name

        # Check cache
        if subreddit_name in df_cache:
            df = pd.DataFrame(df_cache[subreddit_name])
        else:
            df = await fetch_posts(subreddit_name)
            df_cache[subreddit_name] = df.to_dict(orient='records')
            with open(cache_file, "w") as f:
                json.dump(df_cache, f)

        if optional_keyword_search:
            df = filter_df(df, optional_keyword_search)

        if df.empty:
            print("No posts found.")
        else:
            df_content = '. '.join(df['Content'].tolist())
            full_prompt = f"{prompt}. The content to evaluate is: {df_content}"
            output = openai_api(full_prompt)

            print(f"\nGPT Response for subreddit: {subreddit_name}")
            if optional_keyword_search:
                print(f"Filtering for keyword: '{optional_keyword_search}'")
            print(output)
            print("\nRaw Post Data:")
            display(df)  # In Colab, display DataFrame nicely

            # Optionally save the raw data
            df.to_csv('reddit_posts.csv', index=False)
            print("Raw data saved as 'reddit_posts.csv'.")

    except Exception as e:
        print(f"An error occurred: {e}")

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
loop.run_until_complete(main())
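
The script above joins every post body into a single prompt. With several hundred posts, that string may exceed the model's context window. One possible workaround, sketched here under assumptions, is to split the texts into batches that stay under a character budget. The helper chunk_texts and the 12,000-character default are hypothetical choices for illustration, not values from the script or from OpenAI.

```python
def chunk_texts(texts, max_chars=12000, separator="\n\n"):
    """Group texts into batches whose joined length stays within max_chars.

    A single text longer than max_chars is truncated so that it still fits.
    The default budget is an arbitrary assumption; tune it for your model.
    """
    batches = []
    current = []
    current_len = 0
    for text in texts:
        text = (text or "").strip()
        if not text:
            continue  # Skip empty bodies, such as link or image posts
        text = text[:max_chars]
        # Length this text adds to the current batch, including a separator if needed
        added = len(text) + (len(separator) if current else 0)
        if current and current_len + added > max_chars:
            # The current batch is full: store it and start a new one
            batches.append(separator.join(current))
            current, current_len = [text], len(text)
        else:
            current.append(text)
            current_len += added
    if current:
        batches.append(separator.join(current))
    return batches
```

Each batch could then be passed to openai_api in turn, for example as chunk_texts(df['Content'].tolist()), with a final request that summarizes the partial results.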

How It Works

Data Fetching:
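Using asyncpraw, the script connects to Reddit and fetches the newest posts from the specified subreddit. Posts arrive in batches of up to 100 per request; the limit parameter of fetch_posts (500 by default) controls the total number retrieved. Results are cached per subreddit in reddit_cache.json, so a second run for the same subreddit reuses the saved posts. Delete that file to force a fresh fetch.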
Filtering by Keyword:
If you supply a keyword in optional_keyword_search, the script narrows the results to posts that contain this keyword in either the title or the content.
OpenAI Integration:
The post bodies (the Content column, without titles) are joined into one string and sent to the OpenAI API along with your chosen prompt. For example, if you ask for “SEO-related insights,” the model will highlight relevant keywords or suggest topics.
Output:
The model’s response is printed to the console, along with the raw Reddit data. A CSV of the raw data is also saved for your records. In Colab, display(df) shows a nicely formatted DataFrame. You can use the CSV file for further analysis outside of Colab.
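
To make the keyword filtering step concrete, the sketch below applies the same title-or-content match to a small made-up DataFrame. It differs from the script in one respect, which is an assumption on my part rather than the script's behavior: it passes regex=False so the keyword is matched literally. By default, pandas treats the pattern as a regular expression, so a keyword such as "c++" would raise an error. The function name filter_by_keyword and the sample rows are illustrative only.

```python
import pandas as pd

def filter_by_keyword(df, keyword):
    """Keep rows whose Title or Content contains keyword (case-insensitive, literal match)."""
    if not keyword:
        return df
    in_title = df["Title"].str.contains(keyword, case=False, na=False, regex=False)
    in_content = df["Content"].str.contains(keyword, case=False, na=False, regex=False)
    return df[in_title | in_content]

# Made-up rows for illustration only
sample = pd.DataFrame({
    "Title": ["Best C++ books?", "Weekly marketing thread", "Image post"],
    "Content": ["Looking for recommendations.", "Share your wins.", None],
})

print(filter_by_keyword(sample, "c++")["Title"].tolist())   # ['Best C++ books?']
print(filter_by_keyword(sample, "WINS")["Title"].tolist())  # ['Weekly marketing thread']
```

The na=False argument treats a missing body as a non-match, so posts without text are dropped only when their title does not match either.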

Running the Script

Ensure you’ve filled in your openai_key, reddit_clientid, reddit_secret, common_subreddit_name, and prompt.
Run the code cell. Colab will handle each step, and once complete you will see the GPT output and a preview of the fetched Reddit data.
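
A quick pre-flight check can catch an empty setting before any API call is made. This is a minimal sketch, not part of the original script: the helper missing_settings is hypothetical, and the values shown are placeholders.

```python
def missing_settings(settings):
    """Return the names of settings whose value is None, empty, or only whitespace."""
    return [
        name for name, value in settings.items()
        if value is None or not str(value).strip()
    ]

# Placeholder values for illustration; in the notebook, pass the script's own variables.
settings = {
    "openai_key": "sk-placeholder",
    "reddit_clientid": "placeholder-id",
    "reddit_secret": "placeholder-secret",
    "common_subreddit_name": "AskReddit",
    "prompt": "",
}

missing = missing_settings(settings)
if missing:
    print("Fill in before running:", ", ".join(missing))
```

With the placeholder values above, only prompt is reported as missing.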

Next Steps

Refine Your Prompt:
Experiment with different prompts to generate outlines, topic clusters, or long-form summaries.
Adjust the Subreddit and Keywords:
Try different subreddits or keywords to gather content that aligns with your interests or SEO niche.
Use Results for Analysis:
Download reddit_posts.csv from the left file explorer in Colab and analyze it locally. The GPT response can be copy-pasted into your notes or integrated into a content strategy.
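
As a starting point for local analysis, the sketch below counts posts per calendar day from the saved CSV. It assumes the file has the Date, Title, and Content columns written by the script; the function name posts_per_day is a hypothetical helper.

```python
import pandas as pd

def posts_per_day(csv_path="reddit_posts.csv"):
    """Return a Series mapping each calendar date to the number of posts created that day."""
    df = pd.read_csv(csv_path)
    # The script stores ISO-formatted timestamps; keep only the date part
    dates = pd.to_datetime(df["Date"]).dt.date
    return dates.value_counts().sort_index()
```

Calling posts_per_day() on the downloaded file gives a rough view of how active the subreddit was over the fetched period.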

With this pipeline, you have a flexible tool for mining Reddit for insights and leveraging advanced language models to interpret large amounts of user-generated content.

Author

Greg Bernhardt

5+ years as Sr. SEO Specialist for Shopify. 25+ years of experience in web design, web development, and web marketing. Education in Information Sciences from UW-Milwaukee. Managing the largest online US physics community. Enjoy learning about AI, search engines, SEO, chrome tricks, Python, knowledge graphs, data science, and more!
