subreddit-llm

Reddit is a goldmine of user-generated content—people share their thoughts, ask questions, and discuss topics of all kinds. By analyzing subreddit posts, you can gain insights into what users care about, emerging trends, and pain points. Pair this with OpenAI’s powerful language models, and you can quickly generate summaries, topic suggestions, or keyword ideas tailored to your needs.

In this tutorial, you will:

  1. Fetch posts from a subreddit using the Reddit API (via asyncpraw).
  2. Filter the results by an optional keyword to narrow down the focus.
  3. Send the aggregated content to OpenAI’s API to summarize or generate new content ideas.
  4. View and save the results in a convenient format for further analysis.

By following these steps in Google Colab, you can easily iterate on and refine your prompts, conduct SEO research, or generate content outlines for blog posts, newsletters, or study materials.

Requirements

  • A Google account with access to Google Colab.
  • An OpenAI API key.
  • Reddit API credentials (a client ID and secret from a registered Reddit app).
  • The Python packages installed in the step below.

Getting Started in Google Colab

  1. Open a New Notebook:
    Go to Google Colab and open a new notebook.
  2. Install Required Libraries:
    In a cell, run:

    !pip install asyncpraw openai nest_asyncio pandas

    This will install all necessary packages for this tutorial.

  3. Set Up Your Secrets:
    You will need to provide your OpenAI API key and Reddit credentials. This can be done directly in the script or by loading them from a separate JSON file, as sketched below.
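    If you go the JSON route, here is a minimal sketch, assuming you have uploaded a file to your Colab session containing the three keys used later in the script (the filename secrets.json is just an example):

    import json

    # Load credentials from an uploaded JSON file (hypothetical name: secrets.json)
    with open("secrets.json", "r") as f:
        secrets = json.load(f)

    openai_secret = secrets["openai_key"]         # Your OpenAI key
    reddit_clientid = secrets["reddit_clientid"]  # Your Reddit client ID
    reddit_secret = secrets["reddit_secret"]      # Your Reddit secret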

The Script

Below is the complete script. Paste it into a cell after the installation cell, adjust your credentials and parameters as needed, and run it. The code assumes you are working in Colab and have already run the installation step.

Important: Before running, ensure you fill in your openai_key, reddit_clientid, reddit_secret, and specify common_subreddit_name and prompt. Optionally, provide a keyword in optional_keyword_search.

import asyncpraw
import nest_asyncio
import pandas as pd
import asyncio
from datetime import datetime
from openai import OpenAI
import json
import os

# Fill in your credentials here or load them from a file
secrets = {
    "openai_key": "",          # Your OpenAI Key
    "reddit_clientid": "",     # Your Reddit Client ID
    "reddit_secret": ""         # Your Reddit Secret
}

openai_secret = secrets["openai_key"]
reddit_clientid = secrets["reddit_clientid"]
reddit_secret = secrets["reddit_secret"]

# Prompt and configuration
prompt = ""                  # e.g. "Summarize these posts with a focus on SEO insights."
common_subreddit_name = ""   # e.g. "AskReddit"
optional_keyword_search = "" # e.g. "marketing", leave blank if not needed

user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

cache_file = "reddit_cache.json"

# Load cache if it exists
if os.path.exists(cache_file):
    with open(cache_file, "r") as f:
        df_cache = json.load(f)
else:
    df_cache = {}

nest_asyncio.apply()

async def fetch_posts(subreddit_name, limit=500):  # ~500 is around the limit before Reddit starts returning 429 errors
    reddit = asyncpraw.Reddit(client_id=reddit_clientid,
                              client_secret=reddit_secret,
                              user_agent=user_agent)
    subreddit = await reddit.subreddit(subreddit_name)

    posts = []
    count = 0
    after = None

    while count < limit:
        fetched_any = False  # Guard against looping forever once the subreddit runs out of posts
        async for post in subreddit.new(limit=100, params={'after': after}):
            fetched_any = True
            posts.append([datetime.fromtimestamp(post.created_utc).isoformat(), post.title, post.selftext])
            count += 1
            after = post.name  # Remember the last post seen for paginated fetching
            if count >= limit:
                break
        if not fetched_any:
            break

    await reddit.close()
    return pd.DataFrame(posts, columns=['Date', 'Title', 'Content'])

def filter_df(df, optional_keyword_search):
    if optional_keyword_search:
        df = df[df['Title'].str.contains(optional_keyword_search, case=False, na=False) |
                 df['Content'].str.contains(optional_keyword_search, case=False, na=False)]
    return df

def openai_api(prompt):
    client = OpenAI(
        api_key=openai_secret,
        base_url="https://api.openai.com/v1"  # Base URL only; the client appends /chat/completions itself
    )

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": prompt}
        ]
    )

    return completion.choices[0].message.content

async def main():
    try:
        subreddit_name = common_subreddit_name

        # Check cache
        if subreddit_name in df_cache:
            df = pd.DataFrame(df_cache[subreddit_name])
        else:
            df = await fetch_posts(subreddit_name)
            df_cache[subreddit_name] = df.to_dict(orient='records')
            with open(cache_file, "w") as f:
                json.dump(df_cache, f)

        if optional_keyword_search:
            df = filter_df(df, optional_keyword_search)

        if df.empty:
            print("No posts found.")
        else:
            df_content = '. '.join(df['Content'].tolist())
            full_prompt = f"{prompt}. The content to evaluate is: {df_content}"
            output = openai_api(full_prompt)

            print(f"\nGPT Response for subreddit: {subreddit_name}")
            if optional_keyword_search:
                print(f"Filtering for keyword: '{optional_keyword_search}'")
            print(output)
            print("\nRaw Post Data:")
            display(df)  # In Colab, display DataFrame nicely

            # Optionally save the raw data
            df.to_csv('reddit_posts.csv', index=False)
            print("Raw data saved as 'reddit_posts.csv'.")

    except Exception as e:
        print(f"An error occurred: {e}")

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
loop.run_until_complete(main())

How It Works

  1. Data Fetching:
    Using asyncpraw, the script connects to Reddit and fetches posts from the specified subreddit. It retrieves up to 500 recent posts (the default limit in fetch_posts) in batches of 100.
  2. Filtering by Keyword:
    If you supply a keyword in optional_keyword_search, the script narrows the results to posts that contain the keyword in either the title or the content (a standalone sketch of this filtering logic appears after this list).
  3. OpenAI Integration:
    The combined post content is fed into the OpenAI API along with your chosen prompt. For example, if you ask for “SEO-related insights,” the model will try to highlight relevant keywords or suggest topics.
  4. Output:
    The model’s response is printed to the console along with the raw Reddit data, and a CSV of the raw data is saved for your records. In Colab, display(df) renders the DataFrame as a nicely formatted table. You can use the CSV file for further analysis outside of Colab.
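As a standalone illustration of the filtering step, here is a minimal sketch of the same case-insensitive str.contains logic applied to a toy DataFrame (the rows are made up purely for the example):

import pandas as pd

# Toy data purely for illustration
sample = pd.DataFrame({
    'Title': ['Best marketing tips?', 'Weekly open thread'],
    'Content': ['Looking for SEO and content advice', 'Nothing relevant here']
})

keyword = 'marketing'
mask = (sample['Title'].str.contains(keyword, case=False, na=False) |
        sample['Content'].str.contains(keyword, case=False, na=False))
print(sample[mask])  # Only the first row matches the keyword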

Running the Script

  • Ensure you’ve filled in your openai_key, reddit_clientid, reddit_secret, common_subreddit_name, and prompt.
  • Just run the code cell. Colab will handle each step, and once complete, you will see the GPT output and a preview of the fetched Reddit data.

Next Steps

  • Refine Your Prompt:
    Experiment with different prompt messages to generate outlines, topic clusters, or long-form summaries.
  • Adjust the Subreddit and Keywords:
    Try different subreddits or keywords to gather content that is aligned with your interests or SEO niche.
  • Use Results for Analysis:
    Download reddit_posts.csv from the file explorer in Colab's left sidebar and analyze it locally (a small pandas sketch follows this list). The GPT response can be copied into your notes or integrated into a content strategy.
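As a starting point for local analysis, here is a small sketch, assuming the reddit_posts.csv produced by the script is in your working directory, that counts how many posts were fetched per day:

import pandas as pd

# Load the CSV exported by the script and parse the ISO dates
df = pd.read_csv('reddit_posts.csv', parse_dates=['Date'])

# Count posts per calendar day
posts_per_day = df.groupby(df['Date'].dt.date).size()
print(posts_per_day)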

With this pipeline, you have a flexible tool for mining Reddit for insights and leveraging advanced language models to make sense of large amounts of user-generated content.

Don’t forget to follow me online: LinkedIn and BlueSky

Greg Bernhardt