Abstract
Farsite Group, a data science firm based in Columbus, Ohio, launched a highly visible campaign in early 2013 to use predictive analytics to forecast the winners of the 85th Annual Academy Awards. The initiative was fun and exciting for the millions of Oscar viewers, but it also illustrated how data science could be further deployed in the media and entertainment industries. This article explores the current and potential use cases for big data and predictive analytics in those industries. It further discusses how the Farsite Forecast was built, as well as how the model was iterated, how the projections performed, and what lessons were learned in the process.
Traditional Hollywood critics, such as Scott Feinberg of The Hollywood Reporter, were confident that their years of experience and relationships within the industry would translate into better insights. “I know there is no scientific way of predicting the Oscars,” Feinberg told the Wall Street Journal. He believes that data science cannot substitute for “underground intelligence” from the 6,000 Academy voters, who he notes have “subjective, whimsical preferences.” 1
However, from elections to corporate consulting, data scientists appreciate the potential of predictive analytics to improve forecasts and offer intelligent insights into industries and companies. The Oscars were another demonstration (and a fun one at that) of the applications of data science for the media and entertainment industries. Corporate decision making in all industries can benefit from data science. And the media and entertainment industries, where the majority of executives start in the mailroom and learn their jobs in apprentice-like positions, are well suited for continued deployment of data analysis.
At an industry conference in early 2012, a panel on data science and film focused almost exclusively on social listening analytics. While social listening is helpful in crafting marketing, which is one of the largest expenditures in entertainment, there are countless deployments of data science that will empower revenue growth, profit maximization, and operational efficiency.
The majority of studios and film funds have relied on Monte Carlo models for years to “predict” a likely range at the box office. These black-box models have performed poorly for studios, primarily because of the assumptions that go into them. In many cases, recency bias and the common fallacy of poorly accounting for the probability of rare events resulted in overly optimistic forecasts. Not every film musical will match the success of Chicago; one musical's breakout does not mean that the next will be as successful as, or even come within 20% of, Chicago's take. As a result of these shortcomings, some studios now understand that data science can help them improve their box office projection methods.
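To make the point about assumptions concrete, here is a minimal sketch of a Monte Carlo box office range. It is not any studio's actual model: the log-normal distribution, the function names, and the input values (a $50 million median and a spread of 0.9) are all assumptions chosen for illustration.

```python
import math
import random
import statistics


def simulate_grosses(median_gross, sigma, n_trials=10_000, seed=0):
    """Return sorted simulated grosses drawn from a log-normal distribution.

    median_gross and sigma are assumed inputs; sigma controls how heavy
    the right tail (the rare breakout hit) is.
    """
    rng = random.Random(seed)
    mu = math.log(median_gross)  # the median of a log-normal is exp(mu)
    return sorted(rng.lognormvariate(mu, sigma) for _ in range(n_trials))


def percentile(sorted_values, q):
    """Nearest-rank style percentile for 0 <= q <= 1 on a sorted list."""
    index = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[index]


# Hypothetical inputs: a $50M median gross with a wide spread.
grosses = simulate_grosses(median_gross=50e6, sigma=0.9)
low, mid, high = (percentile(grosses, q) for q in (0.10, 0.50, 0.90))
print(f"10th-90th percentile range: ${low:,.0f} to ${high:,.0f}")
print(f"median ${mid:,.0f} vs. mean ${statistics.mean(grosses):,.0f}")
```

In a skewed distribution like this one, a handful of rare hits pulls the mean well above the median. A model anchored on a recent breakout can therefore look reasonable on average while overstating what a typical film will earn.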

How the Oscar Is Won. During the Oscar season, Farsite provided daily content about the project on FarsiteForecast.com. One of the first posts illustrated the key variables and how the models worked.
Alongside the box office, studios have ample room to explore data science in “ultimates.” In addition to the box office, films earn revenue through a variety of distribution windows, including DVD sales, international sales, and cable and television broadcasts. Historically, a matrix of values was applied to the box office to derive the overall “ultimate” revenue for a film (i.e., the revenue including box office, DVD, on-demand, broadcast, etc.). However, between the decline in DVD revenue and some poorly built film finance models, many studios have been forced to restructure their financing arrangements. In addition, the relationship between the box office and ultimates is not an immutable law of nature: the fraction of ultimate revenue explained by the box office varies from film to film. The Weinstein Company was forced to reevaluate its library value in 2010 after the studio's slate couldn't meet debt obligations. This was only one example of a slate that suffered from overvaluation at the hands of bad ultimates modeling.
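The matrix approach can be sketched in a few lines. The multipliers and the box office figure below are hypothetical placeholders, not values from any studio's model; the point is only to show how sensitive the ultimate is to a single window's assumption.

```python
def ultimate_revenue(box_office, window_multipliers):
    """Box office plus each downstream window valued as a multiple of it."""
    downstream = sum(box_office * m for m in window_multipliers.values())
    return box_office + downstream


# Hypothetical multipliers, for illustration only.
legacy_matrix = {"dvd": 0.8, "international": 1.0, "television": 0.4}
# The same matrix with a weaker DVD window.
revised_matrix = {**legacy_matrix, "dvd": 0.4}

box_office = 100.0  # millions of dollars, hypothetical
legacy = ultimate_revenue(box_office, legacy_matrix)
revised = ultimate_revenue(box_office, revised_matrix)
shortfall = (legacy - revised) / legacy
print(f"legacy {legacy:.0f}M, revised {revised:.0f}M, shortfall {shortfall:.1%}")
```

Under these assumed numbers, halving the DVD multiplier alone lowers the projected ultimate by one-eighth, which is the kind of gap that can leave a slate unable to service debt sized against the older matrix.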
Going forward, data science will become integral to critical functions across the media and entertainment sector. For film, music, digital media, and television, big data and predictive analytics offer a range of value-adding applications. Through its work in retail, healthcare, and advocacy, Farsite has cultivated a series of use cases that have direct benefits to the media and entertainment space.
First, the retail data analytics world is built around location-based analysis (where people live, where they are exposed to advertising, and where they shop); that same perspective will benefit media analysis. Both studios and theater exhibitors have ticket sales data by date and time, by film, and by theater going back years. This information constitutes a meaningful data set that, with the correct model, provides great predictive value for future box office estimates by film, date, and theater. When combined with affinity information, studios can also use the resulting profile to better tailor their marketing budgets. Gone will be the days of billion-dollar P&A (prints and advertising) expenditures for studios. Instead, studios will adopt the microtargeting techniques perfected during the 2012 election cycle.
Geospatial analysis is not limited to studios and exhibitors. It can also be deployed for live-event companies that are planning a national tour for a recording artist—helping to better project what cities to play and what ticket prices to charge for maximum profit.
Beyond sales forecasting and microtargeting, given the integration of media properties across platforms, including digital, it is only big data that can build a comprehensive marketing attribution model. And in conjunction with that, media organizations deploying data science methods can use test-and-learn within these marketing attribution models to optimize viewership of content across various channels.

Box Office Gross & the Oscars: By Year. Winning an Oscar doesn't guarantee commercial success but it certainly helps. On average, films that won an Oscar have a higher box office gross as a function of production costs.
While the content development process will never be subject to the whims of an algorithm, data science can be helpful to development executives and producers. From selecting which franchises will yield the highest probability of return to determining which pilots are most likely to succeed in which time slots, the incorporation of data science will make every department and executive smarter and more empowered to succeed. Core to these successes are an internal team that understands the power data may have for its organization, a wealth of data, and a data science team like Farsite that knows both the tools and the industry.
It was the desire to promote the power of data science in media and entertainment that gave us the idea for the Oscar forecast. But why predict the Oscars? First, there is ample public data. We were able to utilize decades of motion picture data alongside decades of Academy Awards information in order to build our predictive model. In addition to the static inputs, we could layer real-time signals such as nominations and winners of other awards, including the various guild awards, the Golden Globes, and the British Academy of Film and Television Arts (BAFTA) awards.

Audience and Critics Scores for Oscar Nominees & Winners. It comes as no surprise that popular and critical acclaim is often associated with Oscar nominees and winners. A notable exception is Forrest Gump, which scored highly in audience reviews but received a low rating from critics.
We also leveraged data that's available from the online movie site Rotten Tomatoes. Through the website's API, we were able to collect information on genre, runtime, cast, audience reviews, and critics' reviews. Using API calls, we built and dynamically updated a database of films. The updating was done for the reviews of the current crop of Oscar nominees, because several were released right before the nomination deadline. The audience and critics' scores from Rotten Tomatoes turned out to be an important indicator in our Best Picture forecast. We performed an exploratory sentiment analysis of the text reviews returned by the API but found the numeric score to be a more reliable predictor. This could have been a result of our text analysis approach, but it is likely that the numeric score encapsulates the sentiment of the reviewers more accurately.
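A database that is refreshed as new reviews arrive can be sketched as follows. This is not Farsite's pipeline: the table layout, the function name, and the records are hypothetical, and the records stand in for already-parsed API responses rather than calling any real endpoint.

```python
import sqlite3


def upsert_films(connection, films):
    """Insert new films and overwrite the scores of films already stored."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS films ("
        "film_id TEXT PRIMARY KEY, title TEXT, "
        "critics_score INTEGER, audience_score INTEGER)"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO films VALUES "
        "(:film_id, :title, :critics_score, :audience_score)",
        films,
    )
    connection.commit()


connection = sqlite3.connect(":memory:")
# Placeholder record standing in for a parsed API response.
upsert_films(connection, [
    {"film_id": "f1", "title": "Example Film",
     "critics_score": 80, "audience_score": 75},
])
# A later refresh returns updated scores for the same film.
upsert_films(connection, [
    {"film_id": "f1", "title": "Example Film",
     "critics_score": 85, "audience_score": 78},
])
rows = connection.execute(
    "SELECT film_id, critics_score, audience_score FROM films"
).fetchall()
print(rows)
```

Keying each row on a stable film identifier means a refresh overwrites stale scores instead of duplicating the film, which matters for late releases whose reviews are still accumulating.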
Finally, there is an active prop-betting market for the Oscars, which historically has been a strong indicator of the winners. We acquired historical volume and price data from Intrade for the purposes of backtesting this market's accuracy and identifying the significance of trends in the market. We found that Intrade could be very predictive in some cases. For example, in the Best Picture category, for the five years before the 2012 Oscars, Intrade had settled on a favorite more than 30 days out from the event and was correct all five years. The other categories, notably Best Supporting Actor and Best Supporting Actress, were more thinly traded and were not correct in all previous years. Despite having access to historical data from Intrade, we had to deal with the fact that Intrade had very recently stopped operating in the United States. This was a point of risk for our modeling because we did not know how this change would affect the market's predictions.
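A backtest of this kind reduces to asking how often the market favorite went on to win, category by category. The sketch below assumes a simple record layout and uses placeholder rows, not real Intrade data.

```python
from collections import defaultdict


def favorite_hit_rate(records):
    """Share of past races, per category, where the market favorite won."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in records:
        totals[row["category"]] += 1
        if row["favorite"] == row["winner"]:
            hits[row["category"]] += 1
    return {category: hits[category] / totals[category] for category in totals}


# Placeholder rows, not real market data.
history = [
    {"year": 1, "category": "Best Picture", "favorite": "A", "winner": "A"},
    {"year": 2, "category": "Best Picture", "favorite": "B", "winner": "B"},
    {"year": 1, "category": "Supporting Actor", "favorite": "C", "winner": "D"},
    {"year": 2, "category": "Supporting Actor", "favorite": "E", "winner": "E"},
]
rates = favorite_hit_rate(history)
print(rates)
```

Splitting the hit rate by category is what lets a model trust the market more where it has been reliable and less where trading is thin.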
Another reason for forecasting the Oscars is that it's a highly visible event, earning the attention of both the media and entertainment industries as well as the broader movie-going public. More than 40 million Americans tune in to watch the show, which is the symbol and yearly culmination of the film industry. The industry itself is large and worthy of study. Approximately 1.4 billion tickets were sold in 2012, accounting for nearly $11 billion in domestic box office sales.
So what did we learn? Farsite was 83% accurate, correctly predicting five of the top six awards. This was in line with or above all data-driven forecasts. Crucial to our success was building a model that was iterative and could adapt to the fluid awards season. When the model was initially released, for example, Steven Spielberg's Lincoln was the leading contender for Best Picture. This was based on the early static variables, including total number of nominations for a film (12) and correlation with Best Director, where Spielberg led in the model. However, as Argo won guild award after guild award, and the betting markets adjusted to reflect new odds in Argo's favor, the model adjusted as well. In corporate data science, having a model that can incorporate new signals is important for success.
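One simple way to picture a forecast that shifts as signals arrive is a weighted score per nominee, normalized into probabilities. This is only an illustrative sketch and not the Farsite model: the softmax scoring, the weights, the nominee labels, and every signal value are assumptions.

```python
import math


def win_probabilities(signals, weights):
    """Turn per-nominee signal values into probabilities via a softmax."""
    scores = {
        nominee: sum(weights[name] * value for name, value in values.items())
        for nominee, values in signals.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {n: math.exp(s - top) for n, s in scores.items()}
    total = sum(exp_scores.values())
    return {n: e / total for n, e in exp_scores.items()}


# Hypothetical weights; a real model would fit these to historical races.
weights = {"nominations": 0.1, "guild_wins": 1.0, "market_price": 2.0}

# Invented signal values for two unnamed nominees, early and late in the season.
early = {
    "film_a": {"nominations": 12, "guild_wins": 0, "market_price": 0.6},
    "film_b": {"nominations": 7, "guild_wins": 0, "market_price": 0.2},
}
late = {
    "film_a": {"nominations": 12, "guild_wins": 0, "market_price": 0.2},
    "film_b": {"nominations": 7, "guild_wins": 3, "market_price": 0.7},
}

early_probs = win_probabilities(early, weights)
late_probs = win_probabilities(late, weights)
print(max(early_probs, key=early_probs.get), max(late_probs, key=late_probs.get))
```

With these made-up inputs the early leader is film_a, and the lead passes to film_b once guild wins and market prices move. The static inputs never change; only the new signals do.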
Data was certainly the victor in the tightest race: Best Supporting Actor. Two contenders were in a close battle for the Oscar: Tommy Lee Jones and Christoph Waltz. While most pundits had Tommy Lee Jones, Farsite's model correctly projected Christoph Waltz. One reason for this difference? Pundits discount the Golden Globes as a signal for the Oscars. While the Globes are not a strong predictor of the winner in many categories, they have a stronger relationship with the winner in the Best Supporting Actor category. As such, data science reflected what experts discounted.
An important learning point was dissecting the one missed projection. Farsite, along with many other prognosticators, had projected Steven Spielberg would win Best Director. One reason for this error related to the contenders. At the Critics Choice Awards, the Golden Globes, and the Directors Guild Awards, one man cleaned up: Ben Affleck for Argo. However, Affleck was not nominated for the Oscar for Best Director. As a result, three of the data signals for projecting Best Director were not helpful. This led to an over-reliance on the prop-betting markets, where Spielberg was the hands-down leader. At the end of the day, Ang Lee won the Oscar. Hindsight is 20/20, and we can retrofit any number of explanations. The most plausible is that Lee's film was honored with nominations and wins for technical achievement in many categories, which could have been a good indicator of his strong candidacy for Best Director.
The project was incredibly fun and instructive. In addition to the promotion of data science in the media sector, the project illustrated some of the strengths of data and gave us a few considerations for future entertainment data projects. We will continue to explore and develop thought leadership around the deployment of data science in popular culture applications—including entertainment. And we encourage the data community to continue exploring how our discipline integrates with the creative industries. Better data science deployment will only help to further support the arts as a sustainable, growing, and profitable sector.
