Background

YouTube Views Predictor

A comprehensive guide to getting more views on YouTube backed by Machine Learning

This project was built by Allen Wang, Aravind Srinivasan, Kevin Yee and Ryan O’Farrell. Our scripts and models can be found here.

Input your own thumbnail and title into our model to predict views for your video here.

Over the past 5 years YouTube has paid out more than $5 billion to YouTube content creators. Popular YouTuber PewDiePie made $5 million in 2016 from YouTube alone, not including sponsorships, endorsements and other deals outside of YouTube. With more and more companies turning to YouTube influencers to capture the millennial audience, getting people to watch your videos on YouTube is becoming increasingly lucrative.

Our goal is to create a model that can help influencers predict the number of views for their next video. Content on YouTube covers a broad range of genres such as comedy, sports, fashion, gaming and fitness. Due to the sheer scale of the problem, we narrowed our scope to fitness-related videos. Fitness content is a huge part of YouTube: people are flocking to free online fitness content for advice instead of hiring an expensive personal trainer.

A person looking at related videos suggested by YouTube will first see the title and the thumbnail. If more potential views can be generated with specific titles and thumbnails, a YouTuber could use this information to generate the maximum potential views with video content they worked hard on. Therefore, our goal was to create a model using non-video features to predict the view count that fitness influencers can use to help grow their channel.

We were unable to find an appropriate dataset, so we built our own. We started from YouTube’s 8M Dataset, which contains 32 GB of pre-labeled data categorized by genre (e.g. Sports, Fashion, Movies). We kept only the videos with labels related to ‘Fitness and Gym’, which gave us 15,305 videos. To increase the size of our dataset, we then scraped the videos of every user in that initial set, which gave us 115,362 videos to work with. We scraped the following features for each video:

  • Title
  • Thumbnail
  • Description
  • Like Count
  • Dislike Count
  • View Count
  • Favorite Count
  • Comment Count
  • Date Published
  • Subscriber count of the channel
  • Number of videos posted by the channel
  • View count of the whole channel
  • Comment count of the previous video posted by the channel
  • View count of the previous video posted by the channel
  • Title of the previous video posted by the channel
  • Age of the channel

We focused on the video’s title and the thumbnail image, as those are the main features a user would see when browsing videos. We had to extract meaningful features from the thumbnail and the title to take them into account in our models.

Clickbait titles are effective on websites such as BuzzFeed, so we wanted to see the effect of clickbait titles and thumbnails on YouTube videos. Specifically, from looking at successful YouTubers focusing on fitness, we noticed a few common characteristics:

  • The title generates enthusiasm with excessive caps and exclamation points
  • The title makes guarantees and promises shortcuts
  • The title includes a list
  • The image includes a fit man or woman
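The title traits listed above can be turned into simple numeric features. The following is a minimal sketch only; the function name, the feature names and the keyword list are hypothetical and are not the features we ended up using.

```python
import re

def title_style_features(title: str) -> dict:
    """Hand-crafted proxies for the title traits listed above (illustrative only)."""
    letters = [c for c in title if c.isalpha()]
    caps_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    return {
        # Share of letters that are upper case ("excessive caps").
        "caps_ratio": caps_ratio,
        # Number of exclamation points.
        "exclamations": title.count("!"),
        # A standalone number often signals a list ("5 Exercises ...").
        "has_number": bool(re.search(r"\b\d+\b", title)),
        # Hypothetical keyword list for guarantees and shortcuts.
        "has_promise": bool(
            re.search(r"\b(guaranteed|fast|quick|easy|secret)\b", title, re.I)
        ),
    }

features = title_style_features("5 EASY Moves for Six-Pack Abs FAST!!!")
```

Features like these are cheap to compute, but they only capture surface patterns, which is one reason a learned clickbait scorer is attractive.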

We tried training neural networks on the titles and the thumbnails (more on that later), but did not get very promising results.

We decided to take a different route: using pre-trained networks as feature extractors. We found an open-sourced NSFW scorer by Yahoo as well as a clickbait scorer. We ran the clickbait scorer on the previous and current titles and the NSFW scorer on the thumbnails, which gave us new, usable features summarizing the information in each.

Our main goal is to produce a model that predicts the number of views (or the difference in views). First, we drop some outliers — that is, videos that have gone “viral,” which we define as videos with view counts exceeding 1,000,000.

We can see that this is heavily skewed, which is understandable: most YouTubers probably won’t have that many views. Also, it seems that the videos in the YouTube-8M dataset were sampled at random (i.e. not biased towards popular videos), as its objective was to tag categories given video-level information.

When we finally get to our predictors, we want to predict something that resembles a Gaussian curve. Luckily, we can apply a log transformation to ViewCount to get it to do just that.
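The outlier filter and the log transformation can be sketched in a few lines. The 1,000,000 cutoff comes from the text above; the use of `log1p` (that is, log(1 + views), which stays defined for videos with zero views) and the sample view counts are assumptions made for illustration.

```python
import math

VIRAL_THRESHOLD = 1_000_000  # cutoff for "viral" videos stated in the text

def log_view_targets(view_counts):
    """Drop viral videos, then map the remaining view counts to log space."""
    kept = [v for v in view_counts if v <= VIRAL_THRESHOLD]
    # log1p(v) = log(1 + v); an assumed variant that tolerates zero views.
    return [math.log1p(v) for v in kept]

# Made-up view counts; the last one is dropped as viral.
targets = log_view_targets([0, 150, 12_000, 3_500_000])
```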

Another quantity we can try to predict is the difference in view count. We start by removing outliers: videos whose view counts increase or decrease by more than 5x.

Notice that in our videos, the difference between views typically fluctuates around 0 — but actually centers around -1. This was calculated as:

So the entries where the percent difference is around -1 are those that have a ViewCount (of the current video) around 0. This is interesting: most of these current videos were scraped soon after they were uploaded. It could be the case that the video hasn’t been out long enough to get its “true” view count, so to speak. Therefore we probably need a feature that represents the time between when it was uploaded and when we scraped it.
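The description above is consistent with a relative change of the form (current views − previous views) / previous views, which equals -1 when the current video has no views. That exact formula is an assumption, as are the function names and the dates in this sketch, which also shows the upload-to-scrape time feature suggested above.

```python
from datetime import date

def percent_view_change(current_views, previous_views):
    """Relative change versus the channel's previous video (assumed formula)."""
    if previous_views == 0:
        return None  # undefined when the previous video has no views
    return (current_views - previous_views) / previous_views

def days_since_upload(published, scraped):
    """Days between upload and scrape: a candidate exposure-time feature."""
    return (scraped - published).days

# A fresh upload with few views so far lands close to -1.
change = percent_view_change(40, 2_000)
age = days_since_upload(date(2017, 11, 28), date(2017, 12, 1))
```

A small `age` next to a `change` close to -1 would flag videos that simply have not had time to accumulate views.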

Finally, let’s take a look at the features we’ve extracted from the title and thumbnail:

Clickbait Score

We used a pre-trained network to extract a clickbait score for each title. The clickbait score goes from 0 to 1, and the higher the score the more “clickbait-y” the title. The clickbait scores were distributed as shown below:

We were interested in seeing if YouTubers used varying levels of “clickbait-y” titles on their channel. So we calculated the difference in clickbait scores across videos for each YouTuber and plotted that distribution:

Interestingly, we see that the difference in clickbait score almost looks like a zero-mean Gaussian curve. This means that we don’t expect YouTubers to stray from their default “clickbait-iness” in their titles.
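One way to compute such per-channel differences is to sort each channel's videos by date and subtract consecutive scores. This is a sketch under that assumption; the tuple layout and the toy values are hypothetical.

```python
from collections import defaultdict

def clickbait_deltas(videos):
    """Score differences between consecutive videos of each channel.

    `videos` is an iterable of (channel_id, published, score) tuples;
    this field layout is hypothetical.
    """
    by_channel = defaultdict(list)
    for channel_id, published, score in videos:
        by_channel[channel_id].append((published, score))

    deltas = []
    for uploads in by_channel.values():
        uploads.sort()  # chronological order within a channel
        scores = [score for _, score in uploads]
        deltas.extend(b - a for a, b in zip(scores, scores[1:]))
    return deltas

# Toy data: channel "a" has three videos, channel "b" only one.
deltas = clickbait_deltas([
    ("a", 1, 0.25), ("a", 2, 0.75), ("a", 3, 0.5),
    ("b", 1, 0.5),
])
```

Channels with a single video contribute nothing, since there is no earlier title to compare against.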

Finally, we compared the clickbait scores of the bottom 10% and the top 10% of view counts:

It turns out that “clickbait-y” titles aren’t restricted to the top YouTubers, and that using them probably won’t be a foolproof way to generate more views. The overall effect of “clickbait-iness” on the view count is unclear, but we assume that this feature won’t provide much predictive power in our model.

Next, we look at an actual scatter plot between the clickbait scores and ViewCount:

From this plot, we notice that there is not much correlation between the view count and the clickbait score, which suggests that clickbait probably isn’t a prerequisite for virality.
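A scatter plot gives a visual impression; a Pearson correlation coefficient would put a number on it. The sketch below is not something we report a value for, and the sample inputs are made up.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up clickbait scores and log view counts.
r = pearson_r([0.1, 0.4, 0.7, 0.9], [8.2, 6.9, 9.1, 7.5])
```

A value near 0 would match the "not much correlation" reading, while values near 1 or -1 would indicate a strong linear relationship.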

Next, we decided to take a look at the actual words in the titles.

Common Words and N-Grams
To verify our intuition behind tricks used in titles, we decided to find the most common words and n-grams. After filtering out some common words, like ‘the’, ‘to’, ‘and’, etc., we looked at the most frequent words and trigrams.
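Counting words and n-grams needs little more than a tokenizer and a counter. This is a minimal sketch; the stopword set is an illustrative subset, the tokenization rule is an assumption, and the sample titles are invented.

```python
import re
from collections import Counter

STOPWORDS = {"the", "to", "and", "a", "of", "in", "for"}  # illustrative subset

def top_ngrams(titles, n=3, k=5):
    """Return the k most common n-grams across titles, ignoring stopwords."""
    counts = Counter()
    for title in titles:
        tokens = [
            tok for tok in re.findall(r"[a-z0-9']+", title.lower())
            if tok not in STOPWORDS
        ]
        # zip over n shifted copies of the token list yields the n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)

titles = [
    "How to Lose Belly Fat Fast",
    "Lose Belly Fat in 7 Days",
    "The Best Way to Lose Belly Fat",
]
top_trigram = top_ngrams(titles, n=3, k=1)
```

Because stopwords are removed before the n-grams are formed, an n-gram here can span words that were not adjacent in the original title.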

NSFW Score

Let’s look at the distribution of the NSFW score that we extracted from the thumbnails:

The NSFW scores are heavily skewed towards 0, and the mean is 0.107. This gets interesting when we look at the average NSFW scores of the top 10% and the bottom 10% of videos by view count.

For the top 10% the average nsfw_score is 0.158, and for the bottom 10% the average nsfw_score is 0.069. This seems to provide more predictive power than clickbait scores, and confirms what we have known all along: sex sells.
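A comparison like this amounts to ranking videos by view count and averaging the score in each 10% tail. The sketch below assumes that reading; the `(view_count, score)` layout and the sample rows are made up and do not reproduce the figures above.

```python
def decile_means(rows):
    """Mean score among the bottom 10% and top 10% of videos by view count.

    `rows` is a list of (view_count, score) pairs (hypothetical layout).
    """
    ranked = sorted(rows)            # ascending by view count
    cut = max(1, len(ranked) // 10)  # size of each 10% slice

    def mean_score(part):
        return sum(score for _, score in part) / len(part)

    return mean_score(ranked[:cut]), mean_score(ranked[-cut:])

# Ten invented videos, so each tail holds exactly one video.
rows = [
    (120, 0.04), (300, 0.10), (450, 0.02), (800, 0.07), (1500, 0.20),
    (2200, 0.05), (4000, 0.12), (9000, 0.09), (15000, 0.03), (60000, 0.25),
]
bottom, top = decile_means(rows)
```

The same helper works for the clickbait scores, since it only needs a view count and one score per video.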

Using a GradientBoostedRegressor, we plotted the feature importances:

Ultimately, it looks like past performance dictates future success. The best predictor of how well your channel will do is the number of views your previous videos have had. The suggestive nature of your thumbnail and the “clickbait-iness” of a video’s title have only marginal influence on the number of views a video can get. Finally, we used an XGBRegressor to predict the log-transformed ViewCount. We used cross-validation to get:

R² = 0.750 ± 0.007
RMSE = 0.970 ± 0.021
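For reference, both metrics can be computed per fold and then summarized across folds. This sketch assumes the ± above denotes the standard deviation across folds, which the results do not state explicitly; the fold data here is made up and unrelated to the reported numbers.

```python
import math
import statistics

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def summarize(fold_scores):
    """Mean and sample standard deviation across folds ('x ± y')."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)

# Made-up (true, predicted) log-view values for two folds.
folds = [
    ([1.0, 2.0, 3.0], [1.0, 2.0, 3.5]),
    ([2.0, 4.0, 6.0], [2.0, 4.0, 6.0]),
]
r2_mean, r2_std = summarize([r2_score(t, p) for t, p in folds])
rmse_mean, rmse_std = summarize([rmse(t, p) for t, p in folds])
```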

From our plot of predicted against true values, we can see that the model seems to fit the data well. Also, the residual plot suggests that the errors are roughly zero-mean Gaussian.

Finally, we exponentiated our output to get the true number of views:

RMSE = 8727.0 ± 100.9

This essentially means that if a YouTuber were to use our predictor, they could expect the actual view count to typically be within about 8,800 views of the model’s prediction. For an amateur YouTuber with around 1,000 views, this is sort of useless, but for a YouTuber with around 100,000 views, this starts to become useful. Ultimately, however, predicting the number of views is inherently difficult, so these results are about what we expected.
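The gap between small and large channels follows from the log target: a fixed error in log space becomes an error roughly proportional to the view count once exponentiated. The sketch below illustrates this, assuming a `log1p`/`expm1` transform pair; the 0.1 log-space miss is an arbitrary example value.

```python
import math

def views_rmse(log_true, log_pred):
    """RMSE measured in views after undoing an assumed log1p transform."""
    true_views = [math.expm1(v) for v in log_true]
    pred_views = [math.expm1(v) for v in log_pred]
    squared = [(t - p) ** 2 for t, p in zip(true_views, pred_views)]
    return math.sqrt(sum(squared) / len(squared))

# The same log-space miss (0.1) on a small and on a large video.
small = views_rmse([math.log1p(1_000)], [math.log1p(1_000) + 0.1])
large = views_rmse([math.log1p(100_000)], [math.log1p(100_000) + 0.1])
```

Here `large` is about a hundred times `small`, so an RMSE reported in views is dominated by the high-view videos in the evaluation set.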

Originally when we planned this project, we were trying to predict the number of views from the titles and thumbnails themselves. Unfortunately, we found very quickly that the number of views had more to do with the channel information itself — their typical number of views, subscriber count, etc. This section will cover various other models we experimented with to estimate the influence of titles and thumbnails on views.

Recurrent Networks/LSTMs on Titles 
Since we had two separate sequences of text, we needed to find a way to present them as an input to the network. We decided to combine the previous and current title with an obvious separating token. If there truly was a difference between the different titles, then the network should pick that up.

We used GloVe embeddings to convert each title into a sequence of vectors, then zero-padded each sequence to become the same length.
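The input preparation can be sketched as follows: join the two titles around a separator token, look up one vector per token, and zero-pad to a fixed length. The `<sep>` token, the whitespace tokenization, the zero-vector fallback for unknown words and the toy two-dimensional vectors (standing in for real GloVe embeddings) are all assumptions for illustration.

```python
SEP = "<sep>"  # hypothetical separator token

def encode_titles(prev_title, cur_title, embeddings, max_len):
    """Join two titles, look up a vector per token, zero-pad to max_len."""
    dim = len(next(iter(embeddings.values())))
    zero = (0.0,) * dim
    tokens = prev_title.lower().split() + [SEP] + cur_title.lower().split()
    # Unknown words map to the zero vector here; other fallbacks are possible.
    vectors = [embeddings.get(tok, zero) for tok in tokens[:max_len]]
    return vectors + [zero] * (max_len - len(vectors))

# Toy 2-dimensional vectors standing in for real GloVe embeddings.
toy = {"abs": (0.1, 0.9), "workout": (0.7, 0.2), SEP: (1.0, 1.0)}
seq = encode_titles("Abs Workout", "Best Abs", toy, max_len=6)
```

Every encoded pair has the same length, so a batch of them can be stacked into a single tensor for the network.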

First, we tried a fairly standard architecture for NLP tasks:

We noticed that the network trained quickly, but although the training loss was decreasing rapidly, the validation loss actually began to increase. This was a sign that the model was probably overfitting. Keeping this in mind we built a second network:

The difference was that in this new network, we added in more LSTM units per layer as well as another LSTM layer. We subjected each to more normalization so that we could avoid overfitting. We trained this for around 30 epochs and noticed that the training loss would decrease slightly, but also that the validation loss fluctuated a lot. Ultimately, there seemed to be more noise than signal just using the titles of the videos. Each epoch took a long time to train, especially with this many LSTM units, so we decided not to proceed with this route.

Convolutional Neural Network: Male-Female
The goal here was to verify if the gender of the person in the thumbnail is correlated with the number of views. We used this pre-trained CNN to extract a binary gender feature. However, like most gender classification CNNs, our network had trouble identifying faces in the thumbnail. Our network also had difficulty handling thumbnails without a person in them. Given the problems with this approach and the time it would require to extract the faces from each thumbnail, we decided not to use this as a feature extractor.

We had a lot of different ideas for the project, but our original goals were perhaps too ambitious. We were originally trying to predict the view count given only the titles and thumbnails. We were hoping that neural networks would be able to learn hidden features in the way top YouTubers wrote their titles and created thumbnails, but quickly found out that this was wishful thinking. Instead, we were able to find features that were more meaningful for a predictor than the raw titles and thumbnails, and were ultimately able to create a predictor that could be useful for moderately sized YouTube channels. Some more things that we could have tried if we had more time include:

  • Expanding to different genres
  • Applying sentiment analysis on comments to create a more robust “user profile” that could be used as a feature
  • Using sentiment analysis on comments to create a robust “reception” feature (similar to like/dislike), which could then be predicted
  • Using generative models to create comments
  • Training a CNN on the thumbnail images — since NSFW score seemed to provide more predictive power than the clickbait scores, it is possible that a CNN applied to thumbnails would have performed better than the LSTMs trained on the titles