I. Introduction: Data
In the article "Linear extrapolation to predict the future of Despacito," we predicted the date when Despacito would become the most viewed video on YouTube. Here we deepen that study by sharing all the mathematical calculations, which are based on the least squares method. Understanding the origins of this method requires some differential calculus. If you don't know calculus, I suggest you jump to page 4, where we apply the method directly. The numerical steps were programmed in Python (Jupyter Notebook). To reproduce the results, you can download the file from my repository on GitHub.
In the article we show the results for the 5 most viewed songs on YouTube. In this tutorial, to simplify the analysis, we focus on Despacito and See You Again. As a first step, we collect the number of YouTube views for each song. We record it every 24 hours from July 4th to July 9th, which gives 6 values per song, as shown in the following table:
Our goal is to find the equation of the line that best fits the values given in Table I. Let's focus first on Despacito, which we label with the subscript $d$. The equation of the line for Despacito is given by,
$$y_{d} = m_d x + b_d. \tag 1$$
This line will not necessarily pass through all 6 values given in Table I, because the real data do not lie on a perfect line, so there will be an error. If $y$ corresponds to the real data and $y_d$ to the result obtained from the linear equation, then the error at each point is given by the difference $y_d-y$. However, since this difference can be positive or negative, errors of opposite sign could cancel each other when summed. It is therefore convenient to square it, i.e.,
$$\begin{align}\xi &=\sum_{i=1}^n (y_{d_i}-y_i)^2, \\ & = \sum_{i=1}^n (m_{d} x_i+ b_{d} - y_i)^2. \tag 2\end{align}$$
Here $\xi$ represents the total error, obtained by summing the squared error over all the points, that is, $n=6$ for the case given in Table I. Ideally we would want $\xi$ to be zero, but as we just said, that is not possible. What we look for instead is the minimum error, and since this error is a sum of squares, we call this the method of least squares. In our case, the line is defined by the variables $m_d$ and $b_d$, so we have to find the values of these variables that give the smallest possible value of $\xi$. Using differential calculus, these values are obtained by taking the first partial derivatives of $\xi$ and setting them equal to zero, i.e.,
$$\begin{align}\frac{\partial \xi}{\partial m_{d}}=0, ~~~ \frac{\partial \xi}{\partial b_{d}}=0.\tag 3 \end{align}$$
The solution of Eq. (3) is worked out on the next page.
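Before going through the algebra, Eqs. (2) and (3) can be checked numerically. The following is only a minimal sketch, not the notebook from the repository: the arrays `x` and `y` are made-up placeholder values (they are not the data of Table I), and the helper names `total_error` and `error_gradient` are introduced here just for illustration. It uses NumPy's `np.polyfit` to obtain a least-squares line and then confirms that both partial derivatives of $\xi$ are essentially zero at that line.

```python
import numpy as np

# Placeholder data, NOT the values of Table I:
# x is the day index, y is a made-up view count (in billions).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.50, 2.52, 2.55, 2.57, 2.58, 2.61])


def total_error(m, b, x, y):
    """Total squared error xi of Eq. (2) for the line y = m*x + b."""
    return np.sum((m * x + b - y) ** 2)


def error_gradient(m, b, x, y):
    """Partial derivatives of xi with respect to m and b (chain rule on Eq. (2))."""
    residual = m * x + b - y
    return 2.0 * np.sum(residual * x), 2.0 * np.sum(residual)


# np.polyfit with degree 1 returns the least-squares slope and intercept.
m_fit, b_fit = np.polyfit(x, y, 1)

xi_min = total_error(m_fit, b_fit, x, y)
dxi_dm, dxi_db = error_gradient(m_fit, b_fit, x, y)

print(f"m = {m_fit:.4f}, b = {b_fit:.4f}, xi = {xi_min:.6f}")
print(f"d(xi)/dm = {dxi_dm:.2e}, d(xi)/db = {dxi_db:.2e}")
```

Both derivatives should come out at the level of floating-point round-off, and moving `m_fit` or `b_fit` slightly in either direction should increase the value returned by `total_error`, which is what Eq. (3) expresses.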