YouTube channel analytics tool
- Tech Stack
- Architecture
- How It Works
- Challenges & Resolutions
- How To Run
- Future Updates & Improvements
- Java 23 (OpenJDK)
- Spring Boot 3
- Python 3
- React 18
- Docker
- Kubernetes
The user is presented with a search bar where they can submit a YouTube channel name (e.g. @NASA). On submit, the React application validates the user's input and sends a GET request to the Spring Boot middleware, which acts as a bridge between the client and the Python web scraper.
The Spring Boot middleware uses a reactive programming model to make requests to the Python web scraper; specifically, it issues them through Spring's WebClient and composes them with Project Reactor's Flux and Mono APIs so that all the requests for a given search run in parallel.
The results of these requests are returned to the user's browser (the React application) via an open SSE (Server-Sent Events) connection.
This provides real-time updates to the client and allows for a non-blocking flow of data from the server: the client doesn't have to wait until all requests have completed on the server; it receives and displays results as they are pushed.
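The shape of that streaming endpoint, as a minimal Spring WebFlux sketch (the route, controller name, and ScraperClient helper are illustrative assumptions, not the project's actual code):

```java
import org.springframework.http.MediaType;
import org.springframework.http.codec.ServerSentEvent;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Flux;

// Illustrative sketch only: names and routes are assumptions, not the project's real code.
@RestController
public class ChannelAnalyticsController {

    private final ScraperClient scraperClient; // hypothetical wrapper around WebClient (sketched in the next section)

    public ChannelAnalyticsController(ScraperClient scraperClient) {
        this.scraperClient = scraperClient;
    }

    // Returning a Flux with a text/event-stream content type keeps an SSE
    // connection open and pushes each scraper result as soon as it arrives.
    @GetMapping(value = "/api/channel/{channelName}", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<ServerSentEvent<String>> streamChannelVideos(@PathVariable String channelName) {
        return scraperClient.fetchInParallel(channelName)            // Flux<String> of raw scraper JSON
                .map(json -> ServerSentEvent.builder(json).build()); // wrap each payload as an SSE event
    }
}
```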
One significant challenge in building this service was the third-party web scraper it relies on and the nature of web scraping itself.
Since no official API is used, some limitations apply.
For example, the scraper works like this: you give it a channel name and the number of videos you want, and it scrapes them sequentially, one at a time.
This means that it can be quite slow for a YouTube channel with a large number of videos.
To combat this, for each search the user makes, the Spring Boot service fires off several asynchronous requests in parallel, each asking for a progressively larger number of videos. The initial request (for just one video) returns quickly relative to the others, and the middleware pushes its result straight back to the UI via SSE, so the user sees some results immediately. The rest of the results continue to be processed in the background and stream through to the UI in real time.
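A rough sketch of that fan-out, assuming a hypothetical ScraperClient wrapper around WebClient with example batch sizes (none of these names, URLs, or values are taken from the project itself):

```java
import java.util.List;
import org.springframework.stereotype.Component;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

// Illustrative sketch only: the scraper URL, endpoint, and batch sizes are assumptions.
@Component
public class ScraperClient {

    private final WebClient webClient = WebClient.create("http://python-scraper:5000");

    // The smallest batch returns quickly and gives the user something to look at
    // while the larger, slower batches are still being scraped.
    private static final List<Integer> BATCH_SIZES = List.of(1, 10, 50, 200);

    public Flux<String> fetchInParallel(String channelName) {
        return Flux.fromIterable(BATCH_SIZES)
                // flatMap subscribes to every request at once, so they run in
                // parallel and results are emitted in completion order.
                .flatMap(count -> fetchVideos(channelName, count));
    }

    private Mono<String> fetchVideos(String channelName, int count) {
        return webClient.get()
                .uri("/scrape?channel={channel}&videos={count}", channelName, count)
                .retrieve()
                .bodyToMono(String.class); // raw JSON payload from the Python scraper
    }
}
```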
Docker Compose run instructions
Execute the following command at the root of this project
make
Open your browser at http://localhost:3000 to see the application running.
To shut down
make down-local
Minikube run instructions
Start Minikube
minikube start
Create the yt-chanalyzer-ns namespace
kubectl create namespace yt-chanalyzer-ns
Start the pods
kubectl apply -f kubernetes
Execute this command to see the pods starting up
kubectl get pods --watch
Expose the URL
minikube service react-chanalyzer --url -n yt-chanalyzer-ns
You will see output similar to the following
http://127.0.0.1:59153
❗ Because you are using a Docker driver on darwin, the terminal needs to be open to run it.
Copy the output address into your browser and you will see the app running
EKS deployment instructions
The following section assumes you have some familiarity with AWS and Kubernetes/EKS
Create a Kubernetes cluster (this process can take 15 - 20 mins)
eksctl create cluster --region=eu-west-2 --name=yt-chanalyzer --nodes=1 --node-type=t2.small
Switch to the correct context for your new cluster
aws eks update-kubeconfig --name yt-chanalyzer --region eu-west-2
Associate the OIDC provider
eksctl utils associate-iam-oidc-provider --cluster yt-chanalyzer --approve --region eu-west-2
Create the IAM policy
aws iam create-policy \
--policy-name AWSLoadBalancerControllerIAMPolicy \
--policy-document file://iam_policy.json
Create the IAM Service Account (replace <ACCOUNT_ID> with your account ID)
eksctl create iamserviceaccount \
--cluster=yt-chanalyzer \
--namespace=kube-system \
--region=eu-west-2 \
--name=aws-load-balancer-controller \
--role-name AmazonEKSLoadBalancerControllerRole \
--attach-policy-arn=arn:aws:iam::<ACCOUNT_ID>:policy/AWSLoadBalancerControllerIAMPolicy \
--override-existing-serviceaccounts \
--approve
Add the Helm chart repository for the controller
helm repo add eks https://aws.github.io/eks-charts
Check for updates to the Helm chart
helm repo update eks
Install the AWS Load Balancer Controller with the Helm chart (replace <VPC_ID> with the VPC ID of your Kubernetes cluster)
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
-n kube-system \
--set clusterName=yt-chanalyzer \
--set serviceAccount.create=false \
--set serviceAccount.name=aws-load-balancer-controller \
--set region=eu-west-2 \
--set vpcId=<VPC_ID>
Deploy the application
kubectl apply -f kubernetes
The application will now be live. Execute the following command to get its web address
kubectl get ingress
You will see output similar to the following
NAME                    CLASS   HOSTS   ADDRESS                                                                   PORTS   AGE
ingress-yt-chanalyzer   alb     *       k8s-ytchanal-ingressy-88dc9dd409-569757692.eu-west-2.elb.amazonaws.com   80      1m
Copy the address from the ADDRESS column into your browser (if your browser enforces HTTPS, make sure to manually change it to http, as we have not set up an SSL certificate for the application yet)
- CI/CD pipeline with Jenkins and GitHub Actions
- End-to-end testing
- Re-write the web scraper to be more efficient (I am interested in using Kotlin's skrape{it} library for this)
- Semantic versioning
- Add linters