Reddit Sentiment Analysis

CI/CD Data Engineering, ETL Automation, & Analytics

Project Outcomes:

Implemented an automated daily ETL pipeline that ingests up to 100 Reddit posts per day from the subreddit r/economics (via Reddit's public API) into a PostgreSQL database, with a Power BI report built on top to visualize the processed data.
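As a rough illustration of the transform step, here is a minimal sketch that flattens a Reddit listing response into one row per post. The nested `data` / `children` layout is how Reddit returns listings; the function name `listing_to_rows`, the chosen columns, and the sample payload are assumptions made for this example, not the project's actual code.

```python
from datetime import datetime, timezone


def listing_to_rows(listing):
    """Flatten a Reddit listing response into one dict per post."""
    rows = []
    for child in listing.get("data", {}).get("children", []):
        post = child.get("data", {})
        rows.append({
            "post_id": post["id"],
            "title": post.get("title", "").strip(),
            "score": int(post.get("score", 0)),
            "num_comments": int(post.get("num_comments", 0)),
            # Reddit reports creation time as a Unix timestamp in UTC.
            "created_at": datetime.fromtimestamp(post["created_utc"], tz=timezone.utc),
        })
    return rows


# Hypothetical sample payload, shaped like a one-post listing.
sample_listing = {
    "kind": "Listing",
    "data": {
        "children": [
            {
                "kind": "t3",
                "data": {
                    "id": "abc123",
                    "title": "  Example post title ",
                    "score": 42,
                    "num_comments": 7,
                    "created_utc": 1756425600.0,
                },
            }
        ]
    },
}

rows = listing_to_rows(sample_listing)
print(rows[0]["post_id"], rows[0]["title"], rows[0]["created_at"].isoformat())
```

Each row is then ready to be written to a table whose primary key is the post ID, which keeps repeated daily runs from storing the same post twice.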

This project started out as a very simple exercise in collecting publicly available data (I settled on Reddit because it provides a free API), transforming it, and loading it into a database. As the project progressed, however, I saw more and more potential in it and decided to apply what I had learned about the cloud to turn it into a cloud-hosted, automated ETL pipeline.

I initially implemented this as an AWS CI/CD pipeline integrated with my GitHub repo. I gained some pretty thorough hands-on experience with CodePipeline, CodeBuild, AWS Lambda, Glue, and even IAM. While AWS offered a lot of really powerful and convenient solutions, I realized after evaluating the two platforms that Azure gave me a more streamlined path for this specific project, especially when it came to integrating with Power BI for reporting. That said, the AWS experience was invaluable for understanding how to design scalable pipelines and manage cloud resources.

CodePipeline that takes code from my GitHub repo, packages it into a zip with dependencies installed, and places the file in a dedicated S3 bucket. Currently, no service is configured to use this zip (although Glue was a consideration).

Amazon S3 Bucket (essentially BLOB storage) where my code sits after CI/CD.

I found implementing this in Azure to be a lot more intuitive than on AWS. Azure Functions has a built-in capability for continuous deployment from GitHub, which I found incredibly convenient. From there, I connected my GitHub repo to Azure Functions, stored my secrets in Azure Key Vault, made sure my Python file could connect to the Reddit API and the PostgreSQL database I provisioned, and set up the automation.
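One common way to wire this up is to surface the Key Vault secrets as Function App settings, which the Python code then reads as environment variables. The sketch below assumes that approach; the setting names (`REDDIT_CLIENT_ID`, `PG_HOST`, and so on) and both helper functions are hypothetical and chosen only for illustration.

```python
import os

# Hypothetical setting names; the real project may use different ones.
REQUIRED_SETTINGS = (
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
    "PG_HOST",
    "PG_DATABASE",
    "PG_USER",
    "PG_PASSWORD",
)


def load_settings(env=None):
    """Read the required settings, failing fast if any are missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing app settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_SETTINGS}


def postgres_dsn(settings):
    """Build a keyword/value PostgreSQL connection string that requires TLS.

    Values containing spaces or quotes would need escaping; this sketch
    assumes they do not.
    """
    return (
        f"host={settings['PG_HOST']} dbname={settings['PG_DATABASE']} "
        f"user={settings['PG_USER']} password={settings['PG_PASSWORD']} "
        "sslmode=require"
    )


# Demo with placeholder values instead of real secrets.
fake_env = {name: "placeholder" for name in REQUIRED_SETTINGS}
settings = load_settings(fake_env)
print(sorted(settings))
```

Failing fast on a missing setting makes a misconfigured deployment show up as one clear exception in the logs rather than as a confusing connection error later in the run.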

I'm working within the bounds of an Azure student account, so I faced pricing decisions as I provisioned my resources. For example, when provisioning my database, I had to turn off Availability Zones in order to drive down costs. While this is fine for a personal project, it probably wouldn't fly in a formal production environment, where zone or geo redundancy becomes critical for uptime.

PostgreSQL server provisioned on Azure. Billed based on traffic, so keeping read/write frequency low can help minimize costs.

Resource group for my Azure resources related to this project.

One issue I ran into was a constant stream of module import errors, but after looking through some documentation I found a solution. Since the pandas and NumPy libraries can be somewhat heavy, it's usually best to use a remote build when deploying, so that Azure installs the dependencies during deployment instead of relying on a pre-packaged bundle. This usually works best on a Linux plan, especially Premium, since Linux deployments use Oryx to build and install these hefty libraries more smoothly.

Azure Functions instance. The file being executed loads up to 100 records into the database each time it is called (100 is Reddit’s maximum per request). The function is configured to fire once a day in order to keep costs down.
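The request cap and the daily schedule can be sketched as follows. Azure Functions timer triggers take a six-field NCRONTAB expression (seconds first) and default to UTC; the 06:00 run time, the `hot` sort order, and the helper name `build_listing_url` are assumptions for this example rather than details taken from the project.

```python
from urllib.parse import urlencode

REDDIT_MAX_LIMIT = 100  # Largest page size a single listing request returns.

# NCRONTAB fields: second minute hour day month day-of-week.
# Fires once a day; the 06:00 hour is an arbitrary assumption.
DAILY_SCHEDULE = "0 0 6 * * *"


def build_listing_url(subreddit, sort="hot", limit=REDDIT_MAX_LIMIT):
    """Build a listing URL, clamping limit to the range 1..100."""
    capped = max(1, min(int(limit), REDDIT_MAX_LIMIT))
    query = urlencode({"limit": capped})
    return f"https://oauth.reddit.com/r/{subreddit}/{sort}?{query}"


print(build_listing_url("economics"))
print(build_listing_url("economics", limit=500))  # Clamped down to 100.
```

Clamping in code keeps the function's behavior predictable if someone later raises the configured limit: the request still asks for at most one full page.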

Application Insights instance for my Azure Functions. My main way of detecting successful runs and exceptions. As shown, all of my requests were failing up until around 2:45, when I ironed out all the issues in the pipeline.

As for my modeling in Power BI, I opted at first to use DirectQuery so that Power BI would stay in sync with the state of my database at all times, querying the database live instead of importing the data; however, I soon switched to Import mode.

The reason for this is that Power BI DirectQuery doesn't support all the queries and custom columns I wanted to create, so imports ended up being the better pick.

My Power BI report as of 8/29/2025.

The model assigns posts to a series of topics, so that we can analyze how the internet is reacting to certain aspects of modern global economics.
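The write-up does not spell out how posts are assigned to topics, so the following is only one possible approach: a simple keyword match on the post title. Apart from "Recession", the topic names and all of the keywords are hypothetical, as is the `assign_topic` helper.

```python
import re

# Hypothetical topic-to-keyword mapping; the first matching topic wins.
TOPIC_KEYWORDS = {
    "Recession": ("recession", "downturn"),
    "Inflation": ("inflation", "cpi", "prices"),
    "Labor Market": ("jobs", "unemployment", "layoffs"),
}


def assign_topic(title, default="Other"):
    """Return the first topic whose keywords appear as whole words in the title."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    for topic, keywords in TOPIC_KEYWORDS.items():
        if words.intersection(keywords):
            return topic
    return default


for title in (
    "Is a recession coming?",
    "Gas prices rise again",
    "Central bank holds rates steady",
):
    print(title, "->", assign_topic(title))
```

Matching on whole words rather than substrings avoids false hits such as "jobs" inside an unrelated longer word, at the cost of missing plurals and variants that are not listed explicitly.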

I wanted to highlight this screenshot because of the interesting circumstances surrounding the “Recession” topic.

Despite making up around 1% of trending posts, the topic has very high engagement in terms of comments and upvotes, and it exceeds every other category in average engagement.

This suggests that when my database was updated, posts with the topic of “Recession” were just beginning to blow up on the subreddit.
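The comparison behind this observation, each topic's share of posts set against its average engagement, can be sketched as below. Defining engagement as upvotes plus comments is an assumption, and the four sample posts are made-up toy values for illustration, not figures from the report.

```python
from collections import defaultdict


def engagement_by_topic(posts):
    """Return each topic's share of posts and its average engagement per post."""
    totals = defaultdict(lambda: {"posts": 0, "engagement": 0})
    for post in posts:
        bucket = totals[post["topic"]]
        bucket["posts"] += 1
        # Assumption: engagement = upvotes (score) + comment count.
        bucket["engagement"] += post["score"] + post["num_comments"]
    total_posts = len(posts)
    return {
        topic: {
            "share": bucket["posts"] / total_posts,
            "avg_engagement": bucket["engagement"] / bucket["posts"],
        }
        for topic, bucket in totals.items()
    }


# Toy data only: one heavily discussed post in a rarely seen topic.
toy_posts = [
    {"topic": "Recession", "score": 900, "num_comments": 300},
    {"topic": "Inflation", "score": 120, "num_comments": 40},
    {"topic": "Inflation", "score": 80, "num_comments": 20},
    {"topic": "Other", "score": 50, "num_comments": 10},
]

summary = engagement_by_topic(toy_posts)
top_topic = max(summary, key=lambda topic: summary[topic]["avg_engagement"])
print(top_topic, summary[top_topic])
```

Because the average is taken per topic, a topic with very few posts can top the ranking on the strength of a single popular thread, which is worth keeping in mind when reading the chart.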