Building a Movie Recommendation Engine: Pilot

And who am I?

Welcome! My name is Máté, I'm a 27-year-old software developer from Hungary, currently living in the Netherlands. Over the last 5 years I have had the chance to work on the frontend of an agricultural webshop, dive deep into the Bluetooth Mesh protocol to create developer tools for it, build a library for sharing assets from presentations, and right now I'm trying to figure out how happy trees are based on 4-5 images.

I was lucky to work with lots of interesting technologies, and I realized that I want to focus on cloud and data. So my goal for this year is to become a Data Engineer. (Just like last year, when I bought a DataCamp subscription and started the Data Engineer Career Path.) The difference is that this year I will create some projects and actually write about them.

Why am I doing this?

There is nothing better on a cold, dark, rainy Dutch winter night than to cozy up with some popcorn and hot chocolate and watch a movie or a series with your significant other.

Except when all the popcorn is gone, the hot chocolate is cold and you still haven't decided what to watch on Netflix. Unfortunately, I have been in this situation too many times. So did these experiences inspire me to create an awesome app that will make the decision easier next time? Absolutely not! There are already too many very good movie recommendation sites out there.

I decided to create this project because:

  • I found a nice dataset on movie ratings between 1995 and 2023.

  • I want to learn batch data processing with Spark.

  • Next time my parents claim that their phone is listening to them, I will be able to explain to them that it is just a recommendation engine (like the one I'm building here for movies) that knows everything about them.

What will I use?

Obviously I will need some tools to create this project, and my main focus is Apache Spark, via PySpark. I also want to get more familiar with Apache Airflow, so I will use it to orchestrate a data pipeline. These two technologies were covered in great detail in the DataCamp Career Path, and they seem to be used very often across the Data Engineering field.
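
To make the orchestration part a bit more concrete, here is a minimal sketch of what such an Airflow DAG could look like. The DAG id, schedule and script path are placeholders I made up for illustration, not the final pipeline:

```python
# Minimal Airflow 2.x DAG sketch: a single task that submits a (hypothetical)
# PySpark job via spark-submit. Ids and paths are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="movielens_pipeline",      # placeholder DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,           # trigger manually while developing
    catchup=False,
) as dag:
    # Submit the (hypothetical) transformation job to the Spark cluster.
    transform_ratings = BashOperator(
        task_id="transform_ratings",
        bash_command="spark-submit /opt/jobs/transform_ratings.py",
    )
```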

The whole project will be deployed to AWS (which I will manage with Terraform). As AWS has the largest market share among cloud providers and I already have applications deployed there, it only makes sense to use it (and I've already bought an EC2 savings plan anyway).

What about the movies?

I will use the MovieLens dataset for this project. I could write a long, nice description of this dataset, but it wouldn't be any better than the one that's already available on their page, so to learn more, check it out here: https://grouplens.org/datasets/movielens/

There are three datasets available there that I'm planning to use: latest small (100,000 ratings), 25M (25,000,000 ratings) and latest full (33,000,000 ratings).

My main focus in these datasets will (probably) be the ratings, genres and tags for each film. I expect these will be sufficient for creating effective recommendations, but we will find out in later posts, so stay tuned!
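
Just to give a feel for the data, this is roughly how I plan to load those files with PySpark. The file paths are placeholders; the column names follow the documented MovieLens CSV layout:

```python
# Quick sketch of loading the MovieLens CSV files with PySpark.
# Paths are placeholders; columns follow the MovieLens documentation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("movielens-explore").getOrCreate()

# ratings.csv: userId, movieId, rating, timestamp
ratings = spark.read.csv("data/ml-25m/ratings.csv", header=True, inferSchema=True)
# movies.csv: movieId, title, genres (pipe-separated)
movies = spark.read.csv("data/ml-25m/movies.csv", header=True, inferSchema=True)
# tags.csv: userId, movieId, tag, timestamp
tags = spark.read.csv("data/ml-25m/tags.csv", header=True, inferSchema=True)

# Join ratings with titles and peek at the result.
ratings.join(movies, "movieId").select("userId", "title", "rating").show(5)
```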

What will it look like?

As I have one data source and a fairly simple data structure, I don't want an over-complicated architecture for my project. I believe that simple solutions are great solutions, as the next developer will need to understand them too. Also, since I'm still learning, I will do the best I can, but my main priority is to finish the project and learn something.

The project will have two parts. The first, and most important, is a data pipeline into which I will feed the 25M and latest full datasets. They will be split up between Spark executors, cleansed and transformed, and finally merged again. The merged data will then be split into training and validation sets to train and validate a machine learning model. The second part is a simple web application to show that the model actually works; you will be able to try it out and get some recommendations.
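
To sketch what that training and validation step could look like, here is a rough PySpark example using Spark's built-in ALS recommender. Whether ALS ends up being the final model is still open; the `ratings` DataFrame and the hyperparameters are assumptions for illustration:

```python
# Rough sketch of the train/validate step, assuming a cleaned `ratings`
# DataFrame with userId, movieId and rating columns already exists.
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

# Split the merged data into training and validation sets.
train, validation = ratings.randomSplit([0.8, 0.2], seed=42)

# ALS is Spark's built-in collaborative-filtering model; the hyperparameters
# here are placeholders to be tuned later.
als = ALS(
    userCol="userId",
    itemCol="movieId",
    ratingCol="rating",
    rank=10,
    regParam=0.1,
    coldStartStrategy="drop",  # avoid NaN predictions for unseen users/items
)
model = als.fit(train)

# Validate with RMSE on the held-out set.
predictions = model.transform(validation)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
print("Validation RMSE:", evaluator.evaluate(predictions))
```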

I expect that the model might not perform very well on the first try, or that the great people at GroupLens Research (or someone else) will create a new dataset and I will want my model to get better. So the data pipeline has to be able to accept new data sources, and the model training phase has to provide opportunities for fine-tuning.

What do I want to achieve?

As I mentioned, my main goal is to get closer to becoming a Data Engineer and to learn Spark. Can I measure those? Nah... So my end goal is a web application where the user is asked some questions like "How do you like this film?", "What's your favorite genre?" or "Which tag do you like the most?" and in the end gets some movie recommendations to watch.

Thank you for reading this far! I'm really looking forward to doing this project and I hope you'll join me on this journey (or read the later posts if I've already finished the project).