
<html><head><meta content="text/html; charset=UTF-8" http-equiv="content-type"><style type="text/css">ol{margin:0;padding:0}table td,table th{padding:0}.c0{color:#000000;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10pt;font-family:"Arial";font-style:normal}.c1{padding-top:0pt;padding-bottom:0pt;line-height:1.15;orphans:2;widows:2;text-align:left;height:11pt}.c2{padding-top:0pt;padding-bottom:0pt;line-height:1.15;orphans:2;widows:2;text-align:left}.c5{padding-top:0pt;padding-bottom:0pt;line-height:1.15;orphans:2;widows:2;text-align:center}.c7{color:#000000;text-decoration:none;vertical-align:baseline;font-family:"Arial";font-style:normal}.c8{background-color:#ffffff;max-width:468pt;padding:72pt 72pt 72pt 72pt}.c3{font-size:10pt;font-style:italic}.c6{font-weight:700}.c4{font-size:10pt}.title{padding-top:0pt;color:#000000;font-size:26pt;padding-bottom:3pt;font-family:"Arial";line-height:1.15;page-break-after:avoid;orphans:2;widows:2;text-align:left}.subtitle{padding-top:0pt;color:#666666;font-size:15pt;padding-bottom:16pt;font-family:"Arial";line-height:1.15;page-break-after:avoid;orphans:2;widows:2;text-align:left}li{color:#000000;font-size:11pt;font-family:"Arial"}p{margin:0;color:#000000;font-size:11pt;font-family:"Arial"}h1{padding-top:20pt;color:#000000;font-size:20pt;padding-bottom:6pt;font-family:"Arial";line-height:1.15;page-break-after:avoid;orphans:2;widows:2;text-align:left}h2{padding-top:18pt;color:#000000;font-size:16pt;padding-bottom:6pt;font-family:"Arial";line-height:1.15;page-break-after:avoid;orphans:2;widows:2;text-align:left}h3{padding-top:16pt;color:#434343;font-size:14pt;padding-bottom:4pt;font-family:"Arial";line-height:1.15;page-break-after:avoid;orphans:2;widows:2;text-align:left}h4{padding-top:14pt;color:#666666;font-size:12pt;padding-bottom:4pt;font-family:"Arial";line-height:1.15;page-break-after:avoid;orphans:2;widows:2;text-align:left}h5{padding-top:12pt;color:#666666;font-size:11pt;padding-bottom:4pt;font-family
:"Arial";line-height:1.15;page-break-after:avoid;orphans:2;widows:2;text-align:left}h6{padding-top:12pt;color:#666666;font-size:11pt;padding-bottom:4pt;font-family:"Arial";line-height:1.15;page-break-after:avoid;font-style:italic;orphans:2;widows:2;text-align:left}</style></head><body class="c8 doc-content"><p class="c2"><span class="c6 c4">The Algorithm (at a glance)</span></p><p class="c2"><span class="c4">We use </span><span class="c3">Alex&rsquo;s Quick and Dirty Classifier</span><span class="c4">, a homemade machine learning algorithm that estimates the probability that a new data point belongs to a given set of previously seen </span><span class="c3">positive</span><span class="c4">&nbsp;data points. This probability is computed against a </span><span class="c3">negative</span><span class="c0">&nbsp;set for which no data points are available, so we describe that set through an assumed prior distribution instead. Our algorithm is similar to a Gaussian Mixture Model, but heavily modified and repurposed for a classification task. We hope to use this algorithm to learn a supervised classification task given only examples of the positive class (not an easy task!).</span></p><p class="c1"><span class="c0"></span></p><p class="c2"><span class="c7 c6 c4">Predicting Hit TikTok Songs</span></p><p class="c2"><span class="c0">In the case of predicting hit TikTok songs, we want to perform the supervised task of classifying songs as &ldquo;hit&rdquo; or &ldquo;not a hit.&rdquo; The crux is that we have only a small set of data points for &ldquo;hit&rdquo; songs and no data points at all for &ldquo;not a hit&rdquo; songs. There are so many non-hit songs that it would be unreasonable for us to try to collect a representative sample of them. That would amount to collecting enough data points to represent almost all the world&rsquo;s music! 
</span></p><p class="c1"><span class="c0"></span></p><p class="c2"><span class="c4">In the context of our prediction algorithm, our small set of data points for hit songs is the </span><span class="c3">positive</span><span class="c4">&nbsp;set, and the very large set of non-hit songs for which we have no data is the </span><span class="c3">negative</span><span class="c0">&nbsp;set. To address this lack of data, our algorithm assumes that any given song is, by default, not going to go viral. The songs that go viral are an extremely small subset of the songs on the internet, so it is safe to assume that the vast majority of songs will not go viral. Later, we can tweak the parameters of our model to adjust the &ldquo;strength&rdquo; of this assumption (e.g. do we assume that the average song has a 1% chance of going viral? 10% chance? 50% chance?).</span></p><p class="c1"><span class="c0"></span></p><p class="c2"><span class="c0">To predict whether a song will be a hit, we check whether it has similar qualities to songs that we know are hits. This requires a similarity metric between songs. 
To this end, we use 6 song features computed by the Spotify API: Acousticness, Danceability, Energy, Liveness, Loudness, and Tempo.</span></p><p class="c1"><span class="c0"></span></p><p class="c5"><span style="overflow: hidden; display: inline-block; margin: 0.00px 0.00px; border: 0.00px solid #000000; transform: rotate(0.00rad) translateZ(0px); -webkit-transform: rotate(0.00rad) translateZ(0px); width: 594.50px; height: 236.28px;"><img alt="" src="images/image1.png" style="width: 594.50px; height: 236.28px; margin-left: 0.00px; margin-top: 0.00px; transform: rotate(0.00rad) translateZ(0px); -webkit-transform: rotate(0.00rad) translateZ(0px);" title=""></span></p><p class="c5"><span class="c0">Figure 1: Correlation Matrix for our Predictive Features</span></p><hr style="page-break-before:always;display:none;"><p class="c1"><span class="c0"></span></p><p class="c2"><span class="c6 c4 c7">The Algorithm (implementation)</span></p><p class="c2"><span class="c0">We center 30 Gaussian kernels (unnormalized Gaussian functions with peak height 1.0, not true probability mass or density functions) on the data points corresponding to our 30 TikTok hit songs. To predict the probability of a new song being a hit, we take its n-dimensional Euclidean distance (n=6 because we have a 6-dimensional feature space) from each of the 30 points and evaluate each point&rsquo;s Gaussian at that distance x. The Gaussian is symmetric, so the sign of x does not matter. We then take the max() of these 30 outputs to get our predicted hit probability. Essentially, this method guesses the probability that a song goes viral by checking how close it is to the nearest known hit song in our dataset.</span></p><p class="c1"><span class="c0"></span></p><p class="c2"><span class="c7 c6 c4">Preliminary Findings</span></p><p class="c2"><span class="c0">Figure 2 provides a visualization of the single-feature predictor functions for each of our 6 features. 
These can be seen as average &ldquo;cross sections&rdquo; of our overall model, which is otherwise a function over a 6-dimensional space whose dimensions represent acousticness, danceability, energy, liveness, loudness, and tempo.</span></p><p class="c1"><span class="c0"></span></p><p class="c2"><span class="c4 c6">What does this mean?</span><span class="c4">&nbsp;These plots show the concentration of viral TikTok songs at different values of Acousticness, Danceability, Energy, Liveness, Loudness, and Tempo. Looking over these plots, we can make inferences about which feature values make a hit song! For example, many hit songs have Acousticness values between 0.0 and 0.15. We also see that Energy values of 0.3 and 0.6 are good, and a Danceability value of 0.2 is definitely </span><span class="c3">not</span><span class="c0">&nbsp;good. Loudness values are relatively evenly distributed for hit songs, but lower values from -7 to -2 dB are generally a safe bet for a hit song. Most importantly, a hit song should be high in Liveness (almost all of them are skewed toward high Liveness values) and its Tempo should not fall in the range between 0 and 80 BPM. Interestingly, it seems that TikTok viewers generally just don&rsquo;t like slow songs.</span></p><p class="c1"><span class="c0"></span></p><p class="c1"><span class="c0"></span></p><p class="c5"><span style="overflow: hidden; display: inline-block; margin: 0.00px 0.00px; border: 0.00px solid #000000; transform: rotate(0.00rad) translateZ(0px); -webkit-transform: rotate(0.00rad) translateZ(0px); width: 620.50px; height: 315.22px;"><img alt="" src="images/image2.png" style="width: 620.50px; height: 315.22px; margin-left: 0.00px; margin-top: 0.00px; transform: rotate(0.00rad) translateZ(0px); -webkit-transform: rotate(0.00rad) translateZ(0px);" title=""></span></p><p class="c5"><span class="c0">Figure 2: Single-Feature Predictor Functions for Acousticness, Danceability, Energy, Liveness, Loudness, and Tempo</span></p></body></html>