Reddit is a goldmine of user-generated content—people share their thoughts, ask questions, and discuss topics of all kinds. By analyzing subreddit posts, you can gain insights into what users care about, emerging trends, and pain points. Pair this with OpenAI’s powerful language models, and you can quickly generate summaries, topic suggestions, or keyword ideas tailored to your needs.
In this tutorial, you will:
- Fetch posts from a subreddit using the Reddit API (via `asyncpraw`).
- Filter the results by an optional keyword to narrow down the focus.
- Send the aggregated content to OpenAI’s API to summarize or generate new content ideas.
- View and save the results in a convenient format for further analysis.
Following these steps in Google Colab, you can easily iterate and refine your prompts, conduct SEO research, or generate content outlines for blog posts, newsletters, or study materials.
Requirements
- Google Colab: A browser-based environment where you can run Python code directly (https://colab.research.google.com/).
- OpenAI API Key: Sign up at OpenAI’s platform to access their API.
- Reddit API Credentials:
  - Sign in to Reddit and go to https://www.reddit.com/prefs/apps
  - Create a new “personal use script” to obtain your Client ID and Client Secret.
Getting Started in Google Colab
- Open a New Notebook: Go to Google Colab and open a new notebook.
- Install Required Libraries: In a cell, run the install command shown below. This will install all necessary packages for this tutorial.
- Set Up Your Secrets: You will need to provide your OpenAI API key and Reddit credentials. This can be done directly in the script or by using a separate JSON file (see the sketch after this list).
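The install cell isn’t reproduced above, but based on the imports used in the script, a minimal install cell would look something like this (the package list is inferred from the code, so adjust it if you add libraries):

```python
# Colab cell: install the libraries the script imports (asyncpraw, nest_asyncio, pandas, openai)
!pip install asyncpraw nest_asyncio pandas openai
```

If you prefer keeping credentials out of the notebook itself, one option is a small JSON file. This is only a sketch; the file name `secrets.json` and its layout are illustrative rather than part of the original script:

```python
import json

# Hypothetical secrets.json contents:
# {
#   "openai_key": "sk-...",
#   "reddit_clientid": "your-client-id",
#   "reddit_secret": "your-client-secret"
# }
with open("secrets.json") as f:
    secrets = json.load(f)  # yields the same dict shape the script expects
```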
The Script
Below is the complete script. Run it in your Google Colab environment by placing it into a cell after installation and adjusting your credentials and parameters as needed. The following code assumes you are working in Colab and have run the installation cell first.
Important: Before running, ensure you fill in your `openai_key`, `reddit_clientid`, and `reddit_secret`, and specify `common_subreddit_name` and `prompt`. Optionally, provide a keyword in `optional_keyword_search`.
```python
import asyncpraw
import nest_asyncio
import pandas as pd
import asyncio
from datetime import datetime
from openai import OpenAI
import json
import os

# Fill in your credentials here or load them from a file
secrets = {
    "openai_key": "",       # Your OpenAI Key
    "reddit_clientid": "",  # Your Reddit Client ID
    "reddit_secret": ""     # Your Reddit Secret
}

openai_secret = secrets["openai_key"]
reddit_clientid = secrets["reddit_clientid"]
reddit_secret = secrets["reddit_secret"]

# Prompt and configuration
prompt = ""                   # e.g. "Summarize these posts with a focus on SEO insights."
common_subreddit_name = ""    # e.g. "AskReddit"
optional_keyword_search = ""  # e.g. "marketing", leave blank if not needed
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
cache_file = "reddit_cache.json"

# Load cache if it exists
if os.path.exists(cache_file):
    with open(cache_file, "r") as f:
        df_cache = json.load(f)
else:
    df_cache = {}

# Allow asyncio to run inside Colab's already-running event loop
nest_asyncio.apply()

async def fetch_posts(subreddit_name, limit=500):  # 500 is around the limit where Reddit doesn't 429
    reddit = asyncpraw.Reddit(client_id=reddit_clientid,
                              client_secret=reddit_secret,
                              user_agent=user_agent)
    subreddit = await reddit.subreddit(subreddit_name)
    posts = []
    count = 0
    after = None
    while count < limit:
        async for post in subreddit.new(limit=100, params={'after': after}):
            posts.append([datetime.fromtimestamp(post.created_utc).isoformat(),
                          post.title,
                          post.selftext])
            count += 1
            if count >= limit:
                break
            after = post.name  # For paginated fetching
    await reddit.close()
    return pd.DataFrame(posts, columns=['Date', 'Title', 'Content'])

def filter_df(df, optional_keyword_search):
    # Keep rows whose title or content contains the keyword (case-insensitive)
    if optional_keyword_search:
        df = df[df['Title'].str.contains(optional_keyword_search, case=False, na=False) |
                df['Content'].str.contains(optional_keyword_search, case=False, na=False)]
    return df

def openai_api(prompt):
    client = OpenAI(api_key=openai_secret)  # uses the default OpenAI endpoint
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return completion.choices[0].message.content

async def main():
    try:
        subreddit_name = common_subreddit_name

        # Check cache
        if subreddit_name in df_cache:
            df = pd.DataFrame(df_cache[subreddit_name])
        else:
            df = await fetch_posts(subreddit_name)
            df_cache[subreddit_name] = df.to_dict(orient='records')
            with open(cache_file, "w") as f:
                json.dump(df_cache, f)

        if optional_keyword_search:
            df = filter_df(df, optional_keyword_search)

        if df.empty:
            print("No posts found.")
        else:
            df_content = '. '.join(df['Content'].tolist())
            full_prompt = f"{prompt}. The content to evaluate is: {df_content}"
            output = openai_api(full_prompt)

            print(f"\nGPT Response for subreddit: {subreddit_name}")
            if optional_keyword_search:
                print(f"Filtering for keyword: '{optional_keyword_search}'")
            print(output)

            print("\nRaw Post Data:")
            display(df)  # In Colab, display DataFrame nicely

            # Optionally save the raw data
            df.to_csv('reddit_posts.csv', index=False)
            print("Raw data saved as 'reddit_posts.csv'.")
    except Exception as e:
        print(f"An error occurred: {e}")

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
loop.run_until_complete(main())
```
How It Works
- Data Fetching: Using `asyncpraw`, the script connects to Reddit and fetches posts from the specified subreddit. It retrieves up to 500 recent posts in increments of 100 (roughly the point at which Reddit starts returning rate-limit errors).
- Filtering by Keyword: If you supply a keyword in `optional_keyword_search`, the script narrows the results to posts that contain this keyword in either the title or the content (see the example after this list).
- OpenAI Integration: The combined post content is fed into the OpenAI API along with your chosen prompt. For example, if you ask for “SEO-related insights,” the model will try to highlight relevant keywords or suggest topics.
- Output: The model’s response is printed to the console, along with the raw Reddit data. A CSV of the raw data is also saved for your records. In Colab, `display(df)` shows a nicely formatted DataFrame. You can use the CSV file for further analysis outside of Colab.
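To see the keyword filter in isolation, here is a small self-contained example. The `filter_df` function is the one from the script; the sample posts are invented purely for illustration:

```python
import pandas as pd

def filter_df(df, optional_keyword_search):
    # Keep rows where the keyword appears in the title or the content, case-insensitively
    if optional_keyword_search:
        df = df[df['Title'].str.contains(optional_keyword_search, case=False, na=False) |
                df['Content'].str.contains(optional_keyword_search, case=False, na=False)]
    return df

sample = pd.DataFrame({
    'Date': ['2024-12-01T10:00:00', '2024-12-02T11:30:00'],
    'Title': ['Best marketing tools for small teams?', 'Weekend hiking thread'],
    'Content': ['Looking for SEO and content marketing advice.', 'Share your favourite trails.'],
})

print(filter_df(sample, 'marketing'))  # keeps only the first row
```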
Running the Script
- Ensure you’ve filled in your `openai_key`, `reddit_clientid`, `reddit_secret`, `common_subreddit_name`, and `prompt`.
- Just run the code cell. Colab will handle each step, and once complete, you will see the GPT output and a preview of the fetched Reddit data.
Next Steps
- Refine Your Prompt: Experiment with different prompt messages to generate outlines, topic clusters, or long-form summaries.
- Adjust the Subreddit and Keywords: Try different subreddits or keywords to gather content that is aligned with your interests or SEO niche.
- Use Results for Analysis: Download `reddit_posts.csv` from the left file explorer in Colab and analyze it locally (a quick-start sketch follows below). The GPT response can be copy-pasted into your notes or integrated into a content strategy.
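As a starting point for that local analysis, a short pandas snippet like the following gives a quick overview of the export. This is a sketch; the column names match what the script writes to the CSV:

```python
import pandas as pd

# Load the export produced by the Colab script
df = pd.read_csv('reddit_posts.csv')

# Quick orientation: post count, date range, and the most recent titles
print(len(df), "posts")
print("From", df['Date'].min(), "to", df['Date'].max())
print(df.sort_values('Date', ascending=False)['Title'].head(10).to_string(index=False))
```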
With this pipeline, you have a flexible tool for mining Reddit for insights and leveraging advanced language models to make sense of large amounts of user-generated content.