Research Papers

Overview

Every year, the MIT Sloan Sports Analytics Conference Research Paper Competition brings exciting, innovative insights that change the way we analyze sports. With submissions on topics ranging from the spelling bee to rugby, basketball, and more, we represent the largest forum for groundbreaking research in sports. The Research Paper Competition is an incredible opportunity to reach a diverse audience while contributing to the advancement of analytics in sports.

Previous years’ top papers were featured in top media outlets throughout the world and captured the attention of representatives from numerous professional sports teams.

For full rules of the competition, see the rules page.

We are pleased to announce the Research Papers and Posters selected for SSAC 2017:

2017 MIT Sloan Sports Analytics Conference Research Papers

Description

Download Full Paper Here

In this work we develop a novel attribute-based representation of a basketball player’s body pose during his three-point shot. The attributes are designed to capture the high-level body movements of a player during the different phases of his shot, e.g., the jump and release. Analysis is performed on 1,500 labelled three-point shots. Normalized for game context, we use Pearson’s chi-squared test to quantify differences in attribute distributions for made and missed shots, where we observe statistically significant differences in distributions of attributes describing the style of movement, e.g., walk, run, or hop, in various phases of the shot. Similarly, with fixed shot outcomes, across game contexts, we observe statistically significant differences in distributions of attributes describing the pass quality, direction of movement, and footwork. We also present a case study for Stephen Curry, where we observe that Curry moves much more than the average player in all phases of his shot and takes a higher proportion of off-balance shots compared to the average player.
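As a toy illustration of the chi-squared comparison the abstract describes, the sketch below tests whether an invented made/missed contingency table of movement styles differs significantly. The counts are fabricated, and the closed-form p-value relies on the fact that a 2×3 table has two degrees of freedom:

```python
import math

# Hypothetical counts of movement style in one phase of the shot,
# split by outcome; the paper's real attribute set is richer.
made   = {"walk": 120, "run": 260, "hop": 70}
missed = {"walk": 180, "run": 210, "hop": 60}

styles = list(made)
observed = [[made[s] for s in styles], [missed[s] for s in styles]]

def chi2_statistic(table):
    """Pearson's chi-squared statistic for an r x c contingency table."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(col) for col in zip(*table)]
    grand = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / grand
            stat += (obs - exp) ** 2 / exp
    return stat

chi2 = chi2_statistic(observed)
# For df = (2-1)*(3-1) = 2, the chi-squared survival function is exp(-x/2).
p_value = math.exp(-chi2 / 2)
print(f"chi2 = {chi2:.2f}, p = {p_value:.6f}")
```

With these made-up counts the style distributions differ sharply, so the test rejects at any conventional level.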

Speakers
Description

Download Full Paper Here

This paper develops a novel statistical method to detect abnormal performances in Major League Baseball. The career trajectory of each player’s yearly home run total is modeled as a dynamic process that randomly steps through a sequence of ability classes as the player ages. Performance levels associated with the ability classes are also modeled as dynamic processes that evolve with age. The resulting switching Dynamic Generalized Linear Model (sDGLM) models each player’s career trajectory by borrowing information over time across a player’s career and locally in time across all professional players under study. Potential structural breaks from the ability trajectory are indexed by a dynamically evolving binary status variable that flags unusually large changes to ability. We develop an efficient Markov chain Monte Carlo algorithm for Bayesian parameter estimation by augmenting a forward filtering backward sampling (FFBS) algorithm commonly used in dynamic linear models with a novel Polya-Gamma parameter expansion technique. We validate the model’s ability to detect abnormal performances by examining the career trajectories of several known PED users and by predicting home run totals for the 2006 season. The method is capable of identifying Alex Rodriguez, Barry Bonds and Mark McGwire as players whose performance increased abnormally, and the predictive performance is competitive with a Bayesian method developed by Jensen et al. (2009) and two other widely utilized forecasting systems.
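The sDGLM itself is far beyond a snippet, but the flavor of its switching indicator can be sketched as a simple rule that flags any season whose home run total jumps well above a short moving baseline. The trajectory below is illustrative (loosely shaped like a late-career surge), and the window and threshold are arbitrary choices:

```python
# Toy stand-in for the model's dynamically evolving "abnormal" flag.
hr_by_season = [16, 25, 24, 19, 9, 9, 33, 42, 40, 37, 34, 49, 73]

def flag_abnormal(seasons, window=3, threshold=1.75):
    """Flag season indices whose total exceeds `threshold` times the
    mean of the preceding `window` seasons."""
    flags = []
    for t in range(window, len(seasons)):
        baseline = sum(seasons[t - window:t]) / window
        if baseline > 0 and seasons[t] > threshold * baseline:
            flags.append(t)
    return flags

print(flag_abnormal(hr_by_season))  # indices of suspicious seasons
```

The paper's Bayesian machinery replaces this crude baseline with ability classes learned jointly across all players and across time.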

Speakers
Description

Download Full Paper Here

Major League Baseball teams are turning to analytics in an attempt to gain a number of small advantages that, in the composite, may result in significantly altering the odds of winning in their favor.  This paper addresses two related areas surrounding the evolving strategies of baseball in the wake of the sports analytics movement, and, in particular, the potential of unconventional uses of relief and starting pitchers. The first area addresses the home-field advantage and proposes a strategy for starters of visiting teams that can be used to remove roughly one half of the home team’s first-inning advantage. A byproduct of this analysis is a set of proper adjustments that must be made to calculate the true home-field advantage, which is roughly 0.429 runs/game, rather than the 0.133 runs/game suggested by the scoring data. The second area addresses pitching performance degradation for both starters and relievers and tackles the age-old question of when to remove a pitcher from a game. The strikes-to-balls ratio is tracked as a function of pitch count to identify trigger points that appear to act as thresholds in pitcher degradation. This comparative analysis of inter- and intra-pitcher performance uses data that can be easily measured during a game, and could better inform and support managers in real-time decision making for pitcher changes. Finally, this work concludes with a summary of additional pitching strategies and a list of potential bullpen tactics for future research.
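The trigger-point idea for pitcher degradation can be sketched with synthetic per-bin counts: track the strikes-to-balls ratio as pitch count rises and report the first bin where it drops below a floor. The bins, counts, and the floor of 1.0 are all assumptions, not the paper's fitted thresholds:

```python
# Synthetic strike/ball counts by pitch-count range.
bins = {          # pitch-count range -> (strikes, balls)
    "1-25":   (40, 18),
    "26-50":  (38, 20),
    "51-75":  (35, 24),
    "76-100": (28, 30),
}

def sb_ratios(counts):
    """Strikes-to-balls ratio per pitch-count bin."""
    return {rng: s / b for rng, (s, b) in counts.items()}

def trigger_point(counts, floor=1.0):
    """First pitch-count bin where the ratio falls below `floor` --
    a candidate removal threshold for the manager."""
    for rng, (s, b) in counts.items():
        if s / b < floor:
            return rng
    return None

print(sb_ratios(bins))
print(trigger_point(bins))
```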

Speakers
Description

Download Full Paper Here

Current state-of-the-art sports metrics such as “Wins Above Replacement” in baseball, “Expected Point Value” in basketball, and “Expected Goal Value” in soccer and hockey are now commonplace in performance analysis. These measures have enhanced our ability to compare and value performance in sport. But they are inherently limited because they are tied to a discrete outcome of a specific event. With the widespread (and growing) availability of player and ball tracking data comes the potential to quantitatively analyze and compare fine-grained movement patterns. An excellent example of this was the “ghosting” system developed by the Toronto Raptors to analyze player decision-making in STATS SportVU tracking data. Specifically, the Raptors created software to predict what a defensive player should have done instead of what they actually did. Motivated by the original “ghosting” work, we showcase an automatic “data-driven ghosting” method using an advanced machine learning methodology called “deep imitation learning”, applied to a season’s worth of tracking data from a recent professional soccer league. Our ghosting method, which avoids substantial manual human annotation, results in a data-driven system that allows us to answer the question “how should this player or team have played in a given game situation compared to the league average?”. In addition, by “fine-tuning” our league-average model to the tracking data from a particular team, our ghosting technique can estimate how each team might have approached the situation. Our method enables counterfactual analysis of the effectiveness of defensive positioning as both a measurable and viewable quantity for the first time.
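The paper's ghosting uses deep imitation learning; as a much simpler stand-in for the underlying idea (predicting where a league-average defender would stand in a given situation), here is a nearest-neighbor "ghost" over invented situations. The situation bank, coordinates, and k are all fabricated for illustration:

```python
import math

# (attacker position) -> (observed defender position), invented data.
league_situations = [
    ((10, 5), (12, 6)), ((11, 5), (13, 6)), ((30, 20), (28, 18)),
    ((31, 21), (29, 19)), ((10, 6), (12, 7)),
]

def ghost_position(attacker, bank, k=3):
    """Average defender position over the k most similar stored states."""
    nearest = sorted(bank, key=lambda s: math.dist(s[0], attacker))[:k]
    xs = [d[0] for _, d in nearest]
    ys = [d[1] for _, d in nearest]
    return (sum(xs) / k, sum(ys) / k)

print(ghost_position((10.5, 5.2), league_situations))
```

A deep imitation model effectively replaces this lookup with a learned function of the full multi-agent game state, and the paper's team-specific "fine-tuning" corresponds to re-training that function on one team's data.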

Speakers
Description

Download Full Paper Here

Our work explores how sports analytics can be used to broaden the application of math and statistics for youth, by teaching students to gather and analyze data directly linked to their own basketball performance. We believe this approach not only allows players and coaches to use analytics to improve individual performance, but also provides an authentic, contextualized introduction to science, technology, engineering, and mathematics (STEM) fields. To investigate the potential of sports analytics as a method to increase youth athletes’ understanding of basketball performance and training, we created an open-source shooting program that generates personalized heat maps based on user-collected data. We organized a series of free shooting clinics with local middle and high school coaches during which 98 youth participants collected their own shooting data, which was used to generate heat map representations of their performance. Youth used these heat maps to visualize their own shooting percentages and scoring efficiency from 14 locations around the court in order to inform their own training and in-game decisions. Program outcomes were evaluated through pre- and post-tests that assessed the students’ perceptions of their own interest and ability within basketball, analytics, and STEM. Our work showed that the use of analytics within youth programs increased youth confidence in basketball training and knowledge of applications of STEM and analytics within sports. We believe this approach can be integrated within a professional sports organization’s existing youth programs to promote STEM education as a public good.
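The heat-map computation at the heart of such a clinic reduces to aggregating user-collected (zone, made) records into per-zone shooting percentages. Zone names and shot records below are made up; the program's actual 14 court locations and data format are not specified here:

```python
from collections import defaultdict

# Each record: (court zone, 1 if the shot was made else 0).
shots = [
    ("left_corner_3", 1), ("left_corner_3", 0), ("left_corner_3", 1),
    ("top_of_key", 0), ("top_of_key", 0), ("top_of_key", 1),
    ("right_elbow", 1), ("right_elbow", 1),
]

def zone_percentages(records):
    """Shooting percentage per court zone."""
    made = defaultdict(int)
    attempts = defaultdict(int)
    for zone, hit in records:
        attempts[zone] += 1
        made[zone] += hit
    return {z: made[z] / attempts[z] for z in attempts}

heatmap = zone_percentages(shots)
print(heatmap)
```

Coloring each zone by its percentage then yields the heat map the students trained against.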

Speakers
Description

Download Full Paper Here

A major analytics challenge in Mixed Martial Arts (MMA) is understanding the differences between fighters that are essential for both establishing matchups and facilitating fighter analysis. Here, we model ~18,000 fighters as mixtures of 10 data-defined prototypical martial arts styles, each with characteristic ways of winning. Fighters of interest generally have few bouts, often fewer than 10, on which we can base our analysis. Accordingly, our approach balances this typically modest amount of fighter-specific data with broader patterns across fighters in order to accurately predict the performance of individual fighters. While we define prototypical styles based upon how fighters win fights, we find that styles also determine how likely they are to win. We also find that the impact of style is similar to that of experience. Reliably winning MMA fights requires dynamic striking (kicks, elbows, and knees) as well as the capacity to “go the distance”, i.e., win by decision. By contrast, fighters who favor submissions that sacrifice positional control (the guillotine choke and leg submissions) tend to win less overall. While previously under-appreciated, style has massively shaped the history of MMA; the most successful athletes in the sport (defending Ultimate Fighting Championship [UFC] champions) almost exclusively favor the high-impact styles that we uncovered.
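One common way to model entities as mixtures of data-defined prototypes is nonnegative matrix factorization; the toy below factors a fabricated fighters-by-win-method count matrix into two "styles" with the standard multiplicative updates. This is only a guess at the general shape of the approach (the paper's model, its 10 styles, and its treatment of sparse fighter records are all more sophisticated):

```python
import numpy as np

rng = np.random.default_rng(1)
# Rows: fighters; columns: wins by KO/TKO, submission, decision (invented).
V = np.array([
    [8, 0, 2],   # striker-like record
    [7, 1, 3],
    [1, 6, 1],   # grappler-like record
    [0, 7, 2],
], dtype=float)

k = 2                                 # number of prototype styles
W = rng.random((4, k)) + 0.1          # fighter-by-style loadings
H = rng.random((k, 3)) + 0.1          # style-by-win-method profiles
for _ in range(500):                  # Lee-Seung multiplicative updates
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

styles = W.argmax(axis=1)             # dominant style per fighter
print(styles)
```

On this block-structured toy matrix, the two striker-like rows share one component and the two grappler-like rows share the other.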

Speakers
Description

Download Full Paper Here

We present Possession Sketches, a new machine learning method for organizing and exploring a database of basketball player-tracks. Our method organizes basketball possessions by offensive structure. We first develop a model for populating a dictionary of short, repeated, and spatially registered actions. Each action corresponds to an interpretable type of player movement. We examine statistical patterns in these actions, and show how they can be used to describe individual player behavior. Leveraging this vocabulary of actions, we develop a hierarchical model that describes interactions between players. Our approach draws on the topic-modeling literature, extending Latent Dirichlet Allocation (LDA) through a novel representation of player movement data which uses techniques common in animation and video game design. We show that our model is able to group together possessions with similar offensive structure, allowing for efficient search and exploration of the entire database of player-tracking data. We show that our model finds repeated offensive structure in teams (e.g. strategy), providing a much more sophisticated, yet interpretable lens into basketball player-tracking data.
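The first stage of the pipeline (a dictionary of short, interpretable actions) can be sketched by segmenting a player track into fixed-length windows and labeling each by the distance it covers. The thresholds, labels, and segment length below are invented; the paper's actions are spatially registered and learned, not hand-thresholded:

```python
import math

def segment_actions(track, seg_len=4):
    """Split an (x, y) track into fixed-length segments and give each a
    coarse movement label based on straight-line distance covered."""
    actions = []
    for i in range(0, len(track) - seg_len + 1, seg_len):
        x0, y0 = track[i]
        x1, y1 = track[i + seg_len - 1]
        dist = math.hypot(x1 - x0, y1 - y0)
        if dist < 1.0:
            actions.append("hold")
        elif dist < 6.0:
            actions.append("drift")
        else:
            actions.append("cut")
    return actions

track = [(0, 0), (0.1, 0), (0.2, 0.1), (0.2, 0.1),   # standing still
         (1, 0), (3, 1), (5, 2), (8, 4)]             # sharp move
print(segment_actions(track))
```

Each possession then becomes a "document" of such action tokens, which is exactly the form an LDA-style topic model can group by offensive structure.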

Speakers
Description

Download Full Paper Here

With $57 billion allocated to it annually, sponsorship has become an integral part of the marketing mix for firms and is necessary for the survival of many sport organizations. Despite the importance of these partnerships, conditions that can jeopardize what is intended by both sides to be a long-term relationship are under-researched. Utilizing survival analysis modeling to empirically examine a longitudinal dataset of 69 global sponsorships of the Olympic Games and FIFA World Cup, this research seeks to isolate factors that may predict the dissolution of such partnerships and test a dynamic, integrated model of sponsorship decision-making. Results indicate that groups of dyadic, seller-focused, and buyer-focused factors all predict a significant amount of incremental variance in the hazard (i.e., probability) of sponsorship dissolution. Among the variables that are statistically significant predictors of sponsorship dissolution are economic conditions, such as the presence of an inflationary economy in the home country of the sponsor. For example, every one percent increase in the average annual growth rate of the consumer price index during the term of the sponsorship increases the hazard of dissolution by 28.3%. From the perspective of the sponsored property, increased clutter is also detrimental, with each additional sponsor increasing the hazard of dissolution by 46.7%, demonstrating the importance of exclusivity in global sponsorships. Consistent with past research on sponsoring brands, both congruence and high levels of brand equity reduce the hazard of dissolution, by 70.6% and 65.9%, respectively.
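The quoted percentages are the usual hazard-ratio reading of survival-model coefficients: a coefficient β maps to a (exp(β) − 1) × 100% change in the hazard per one-unit covariate increase. The arithmetic, worked backwards from the reported effects:

```python
import math

def hazard_change(beta):
    """Percent change in the hazard per one-unit covariate increase,
    under a proportional-hazards model with coefficient `beta`."""
    return (math.exp(beta) - 1) * 100

beta_cpi = math.log(1.283)          # 28.3% increase per CPI-growth point
beta_clutter = math.log(1.467)      # 46.7% increase per added sponsor
beta_congruence = math.log(1 - 0.706)  # a 70.6% *reduction*

print(round(hazard_change(beta_cpi), 1))        # 28.3
print(round(hazard_change(beta_clutter), 1))    # 46.7
print(round(hazard_change(beta_congruence), 1)) # -70.6
```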

Speakers

2017 Research Paper Posters

The 2017 Research Paper posters selected for the Conference are listed below.
Description

Download Full Paper Here

Although 3-point shooting is an essential aspect of winning games, shooting percentages have remained stagnant for decades. Here, we analyze 6 shooter factors from over 1.1 million 3-point shots captured by the Noah shooting system to quantitatively define high percentage shooting and shooter improvement. We find significant associations between all of these 6 shooter factors and shooting percentage. Furthermore, we use the interaction of these factors to define the region in the hoop where shots are guaranteed to score. Of the 6 factors, 4 are directly actionable using new technologies for instant feedback. We use machine learning to predict shooting percentage within 1.5% using only these 4 factors as input. Finally, we grouped players by their proficiency at these 4 factors and show case studies about the dissimilar training approaches that will lead to optimal improvement for two of these groups.
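The prediction step can be sketched as a regression of shooting percentage on four shot factors. Everything below is synthetic: the factor names (e.g., entry angle, depth, left-right position, consistency) are assumptions, the data is generated, and the paper's actual machine learning model is not specified here:

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 synthetic shooters x 4 standardized shot factors.
factors = rng.normal(size=(200, 4))
true_w = np.array([0.05, -0.03, 0.02, 0.04])
shooting_pct = 0.35 + factors @ true_w + rng.normal(scale=0.01, size=200)

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones(200), factors])
w, *_ = np.linalg.lstsq(X, shooting_pct, rcond=None)
pred = X @ w
mae = np.abs(pred - shooting_pct).mean()
print(f"mean absolute error: {mae:.4f}")
```

With noise on the scale of one percentage point, the fit recovers the intercept and lands well inside the paper's reported 1.5% prediction error, illustrating how few actionable factors can carry most of the signal.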

Speakers
Description

Download Full Paper Here

Common player metrics in the sports of basketball, soccer and hockey normally fit into one of two categories: offensive and defensive statistics. Comparing players becomes ambiguous across numerous metrics, even in the same category. Hence it would be ideal to be able to compare players with a meaningful statistic that encompasses some measure of both categories, while rewarding playmaking as well. We construct a directed graph from the flow of a game and then calculate a new statistic, based on the well-known PageRank algorithm, for each player in the game. Players can be compared via their “relative ranks”, which is a measure of their importance to the flow of the game, taking into account the offensive and defensive plays the player has made during the game. In this paper we explore this model, through its basic mathematical properties as well as through experimental examples, and propose it as a valid metric that could easily be implemented in mainstream sports analytics culture for any passing sport.
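A minimal version of the paper's idea: build a directed, weighted passing graph and run PageRank's power iteration over it, so players who receive the ball from important players rank highly. The three-player graph and counts are fictitious, and the paper's "relative ranks" add further game-flow detail on top of this:

```python
# Edge u -> v with weight w means u passed to v w times (invented data).
passes = {
    "A": {"B": 5, "C": 2},
    "B": {"C": 4},
    "C": {"A": 3, "B": 1},
}

def pagerank(graph, damping=0.85, iters=100):
    """Weighted PageRank by power iteration; assumes every node
    has at least one outgoing edge."""
    nodes = list(graph)
    rank = {n: 1 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for u, out in graph.items():
            total = sum(out.values())
            for v, w in out.items():
                new[v] += damping * rank[u] * w / total
        rank = new
    return rank

ranks = pagerank(passes)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Here C ranks first: both teammates funnel the ball to C, which is exactly the "importance to the flow of the game" the metric is meant to capture.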

Speakers
Description

Download Full Paper Here

Football organizations spend significant resources classifying on-field actions. Since 2014, radio-frequency identification (RFID) tracking technology has been used to continuously monitor the on-field locations of NFL players. Using spatiotemporal American football data, this research provides a framework to autonomously classify receiver routes.

Speakers
Description

Download Full Paper Here

Using new game event and location data, we introduce a player performance assessment system that supports drafting, trading, and coaching decisions in the NHL. Players who tend to play in similar locations are clustered together using machine learning techniques, which capture similarity in styles and roles. Clustering players avoids apples-to-oranges comparisons, like comparing offensive and defensive players. Within each cluster, players are ranked according to how much their actions impact their team’s chance of scoring the next goal. Our player ranking is based on assigning location-dependent values to actions. A high-resolution Markov model also pinpoints the game situations and rink locations in which players tend to take actions with exceptionally high/low values.

Speakers
Description

Download Full Paper Here

This paper seeks to outline and quantify methods to objectively rate six fundamental skills in volleyball: i) serve, ii) reception, iii) set, iv) attack, v) block, and vi) dig. While these skills are currently rated in competitive volleyball, there is no method currently in place that will consistently and objectively rate players and teams. With the ability to consistently grade these fundamentals across a large amount of data, it becomes possible to accurately predict matchups and determine player and team success.

Speakers
Description

Download Full Paper Here

Measuring defensive success in hockey is difficult due to a historical reliance on shot-based metrics. In other sports, passing data has led to major insights on defensive play. In 2015, 20 fans began tracking each pass leading to a shot to help understand the factors that lead to goals. For each shot, trackers recorded the sequence of passes which preceded it. Data collected included shot time, passer(s) and shooter, pass locations (center/left/right and offensive/defensive/neutral zone), and other known predictors of shot success (one-timers, odd-man rushes). Using this data, new metrics were developed to better evaluate the defensive impacts of both players and coaching systems. Defensive passing metrics are significant improvements upon existing defensive measures, and are both more repeatable and more predictive of future defensive results. Passing metrics have a broad range of applications at both the in-game and managerial level, and can inform tactical decisions, measure system success, and assist with player evaluation.

Speakers
Description

Download Full Paper Here

Passing is the backbone of soccer and forms the basis of important decisions made by managers and owners, such as buying players, picking offensive or defensive strategies, or even defining a style of play. These decisions can be supported by analyzing how a player performs and how his style affects team performance. The flow of a player or a team can be studied by finding unique passing motifs from the patterns in the subgraphs of a possession-passing network of soccer games. These flow motifs can be used to analyze individual players and teams based on the diversity and frequency of their involvement in different motifs. Building on the flow motif analyses, we introduce an expected goals model to measure the effectiveness of each style of play. We also make use of a novel way to represent motif data that is easy to understand and can be used to compare players, teams, and seasons. Our data set covers the last four seasons of six big European leagues, with 8,219 matches, 3,532 unique players, and 155 unique teams. We use flow motifs to analyze different events, such as the transfer of Claudio Bravo to Pep Guardiola’s Manchester City, who Jean Seri is and why he should be considered an elite midfielder, and the difference in attacking style between Lionel Messi and Cristiano Ronaldo. Ultimately, an analysis of post-Fàbregas Arsenal is conducted wherein different techniques are combined to analyze the impact the acquisitions of Mesut Özil and Alexis Sánchez had on the strategies implemented at Arsenal.
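Extracting flow motifs amounts to mapping each window of consecutive touches to a canonical pattern such as "ABAB" (a one-two exchange) or "ABCD" (a chain through four players). A minimal sketch, with a fictitious possession:

```python
def motif(touches):
    """Canonical label for a sequence of player touches: first distinct
    player becomes A, second B, and so on."""
    label, seen = "", {}
    for p in touches:
        if p not in seen:
            seen[p] = chr(ord("A") + len(seen))
        label += seen[p]
    return label

def pass_motifs(sequence, length=4):
    """All motifs over sliding windows of `length` touches
    (i.e., length - 1 consecutive passes)."""
    return [motif(sequence[i:i + length])
            for i in range(len(sequence) - length + 1)]

# Invented possession: Messi -> Iniesta -> Messi -> Suarez -> Neymar
possession = ["Messi", "Iniesta", "Messi", "Suarez", "Neymar"]
print(pass_motifs(possession))
```

Counting these labels over a season gives each player and team the motif-frequency profile the analysis is built on.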

Speakers
Description

Download Full Paper Here

We offer a new lens in this paper for viewing what it means for a team to exhibit good chemistry. Our aim is to quantify the “David Ross Effect,” or the indirect impact that an individual player can have on team wins through making their teammates better. To measure the strength of player interactions, we decompose FanGraphs’ wins-above-replacement metric, fWAR, with a spatial factor model. We then construct refinements of fWAR based on network statistics to isolate a player’s own contribution to team wins irrespective of his teammates, and his contribution adjusted for his effect on his teammates. We refer to the total network effect of a team’s players on each other as tcWAR, or team chemistry WAR. With this new metric, we document that high winning percentage teams do in fact tend to exhibit good team chemistry. A player’s net impact on his team through his teammates is what we call pcWAR, or player chemistry WAR. By constructing conditional age-position profiles for pcWAR, we show that designated hitters, relief pitchers, first basemen, and catchers make positive contributions to team chemistry at younger ages on average than other players. We then classify players based on their “intangibles,” defined by where they fall in relation to their profile. Looking at David Ross reveals a player who not only consistently outperformed his profile, but did so at a position that tends to support team chemistry more generally.

Speakers
Description

Download Full Paper Here

Judging a gymnastics routine is a noisy process, and the performance of judges varies widely. The International Federation of Gymnastics (FIG), in collaboration with Longines and the Université de Neuchâtel, is designing and implementing an improved statistical engine to analyze the performance of gymnastics judges during and after major competitions like the Olympic Games and the World Championships. The engine, called the Judge Evaluation Program (JEP), has three objectives: (1) provide constructive feedback to judges, executive committees, and national federations; (2) assign the best judges to the most important competitions; and (3) detect bias and outright cheating. In this article, using data from international competitions held during the 2013-2016 Olympic cycle, we first develop a marking score evaluating the accuracy of the marks given by judges. We then study ranking scores assessing to what extent judges rate gymnasts in the correct order, and explain why we ultimately chose not to implement them. We study outlier detection to pinpoint athletes who were poorly evaluated by judges. Finally, we discuss interesting observations and discoveries that led to recommendations to the FIG.
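As a guess at the core of such a marking score (the paper's actual definition is more refined), one can compare each judge's mark to a panel consensus, here the median, and average the absolute gaps over many routines. The marks below are synthetic, with judge 5 consistently low-balling:

```python
import statistics

# Each row: one routine's marks from a five-judge panel (invented).
routines = [
    [8.5, 8.4, 8.6, 8.5, 7.9],
    [9.1, 9.0, 9.2, 9.1, 8.5],
    [7.8, 7.7, 7.9, 7.8, 7.2],
]

def marking_scores(panel_marks):
    """Mean absolute gap between each judge's mark and the panel median,
    averaged over routines; larger = less accurate judge."""
    n_judges = len(panel_marks[0])
    gaps = [0.0] * n_judges
    for marks in panel_marks:
        consensus = statistics.median(marks)
        for j, m in enumerate(marks):
            gaps[j] += abs(m - consensus)
    return [g / len(panel_marks) for g in gaps]

scores = marking_scores(routines)
print([round(s, 2) for s in scores])
```

Judge 5 stands out immediately, which is the kind of signal that feeds the feedback and bias-detection objectives.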

Speakers
Description

Download Full Paper Here

This paper discusses how data analytics can be conducted on muscle usage in extreme racing conditions to find actionable insights for the driver. One important insight is how to minimize the driver’s muscle fatigue during a race, because IndyCar regulations forbid the use of power steering. This paper tackles two technological challenges: (1) validating the noisy signals obtained from a wearable device in extreme conditions, and (2) cultivating heterogeneous racing data to find actionable insights for the driver. First, we propose a data quality assessment technique based on supervised learning, enabling a judgment of whether data is reliable or not. This evaluation showed that the data validation method classifies data as reliable or not with 99.5% accuracy. We also provide real-time wearable monitoring tools that alert the driver when the wearable detaches from the driver’s body. Second, using the clean data from the first step, we propose a data visualization tool based on unsupervised learning that enables the driver or mechanics to discover useful feedback. We identified and demonstrated several actionable insights, such as identifying potential improvement points or potential relaxation points.
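A minimal supervised stand-in for the data-quality step: summarize each signal window with two features and label new windows reliable/unreliable with a nearest-centroid classifier. All numbers are synthetic, and the paper's actual model and feature set are not specified here:

```python
import math

def features(window):
    """(mean, standard deviation) summary of a signal window."""
    mean = sum(window) / len(window)
    var = sum((x - mean) ** 2 for x in window) / len(window)
    return (mean, math.sqrt(var))

# Labeled training windows; the wild swings mimic detached-sensor noise.
train = [
    ([0.50, 0.55, 0.48, 0.52], "reliable"),
    ([0.47, 0.53, 0.50, 0.49], "reliable"),
    ([0.05, 0.90, 0.02, 0.95], "unreliable"),
    ([0.00, 0.85, 0.10, 0.99], "unreliable"),
]

def centroids(data):
    sums, counts = {}, {}
    for window, label in data:
        f = features(window)
        s = sums.setdefault(label, [0.0, 0.0])
        s[0] += f[0]; s[1] += f[1]
        counts[label] = counts.get(label, 0) + 1
    return {lbl: (s[0] / counts[lbl], s[1] / counts[lbl]) for lbl, s in sums.items()}

def classify(window, cents):
    """Assign the label of the nearest centroid in feature space."""
    f = features(window)
    return min(cents, key=lambda lbl: math.dist(f, cents[lbl]))

cents = centroids(train)
print(classify([0.51, 0.49, 0.52, 0.50], cents))
```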

Speakers
Description

Download Full Paper Here

In this paper, we present a model for ball control in soccer based on the concepts of how long it takes a player to reach the ball (time-to-intercept) and how long it takes a player to control the ball (time-to-control). We use this model to quantify the likelihood that a given pass will succeed. We determine the free parameters of the model using tracking and event data from the 2015-2016 Premier League season. On a reserved test set, the model correctly predicts the receiving team with an accuracy of 81% and the specific receiving player with an accuracy of 68%. Though based on simple mathematical concepts, various phenomena are emergent, such as the effect of pressure on receiving a pass. Using the pass probability model, we derive a number of new metrics around passing that can be used to quantify the value of passes and the skill of receivers and defenders. Computed per team over a 38-game dataset, these metrics are found to correlate strongly with league standing at the end of the season. We believe that this model and the derived metrics will be useful for both post-match analysis and player scouting. Lastly, we apply the approach used to compute passing probabilities to calculate a pitch control function that can be used to quantify and visualize the regions of the pitch controlled by each team.
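A heavily simplified version of the model's ingredients: a straight-line time-to-intercept for each player, plus a logistic map from the defender-versus-receiver arrival-time gap to a pass success probability. The constant speed, steepness k, and positions are invented, and the real model layers time-to-control and player motion on top of this:

```python
import math

def time_to_intercept(player_pos, ball_pos, speed=7.0):
    """Seconds for a player to reach the ball at constant speed (m/s)."""
    dx = ball_pos[0] - player_pos[0]
    dy = ball_pos[1] - player_pos[1]
    return math.hypot(dx, dy) / speed

def pass_success_prob(receiver_pos, defender_pos, target, k=3.0):
    """Logistic in the defender-minus-receiver arrival-time gap:
    the bigger the receiver's head start, the safer the pass."""
    gap = (time_to_intercept(defender_pos, target)
           - time_to_intercept(receiver_pos, target))
    return 1 / (1 + math.exp(-k * gap))

p = pass_success_prob(receiver_pos=(10, 5), defender_pos=(14, 9), target=(12, 5))
print(round(p, 3))
```

Evaluating the same quantity over a grid of targets, rather than one pass, is essentially how a pitch control surface is built.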

Speakers
Description

Download Full Paper Here

Until recently, ranking the skills of golfers on the PGA TOUR was best accomplished by using imprecise summary statistics such as Driving Accuracy and Average Putts Per Round. Since 2003, the PGA TOUR, through its ShotLink Intelligence™ program, has collected detailed shot-level data, which provides coordinates of the locations of shots along with other information. Through the analysis of this data, more fine-grained and precise estimates of the skills of golfers on tour are now possible. The problem of estimating the skill of golfers in different aspects of the game given data from competitions is not simple. This work recognizes a wide array of statistical challenges associated with the problem, which a number of previous approaches have failed to adequately acknowledge. A new approach is presented which invokes comparisons of the quality of shots taken on the same hole during the same round. The comparisons are utilized in a network analysis technique, which is generalized to suit the needs of the problem. This approach is supported with empirical evidence of stronger correlations with the future success of golfers than the system currently used by the PGA TOUR.

Speakers
Description

Download Full Paper Here

Betting on the result of a soccer match is a rapidly growing market, and online real-time odds exist. Market odds for all possible score outcomes, as well as outright win, lose, and draw, are available in real time. The goal of our study is to show how a parsimonious two-parameter model can flexibly model the evolution of the market odds matrix of final scores. We provide a non-linear objective function to fit our Skellam model to the instantaneous market odds matrix. We then define the implied volatility of an EPL game and use this as a diagnostic to show how the market’s expectation changes over the course of a game. A key feature of our analysis is to use the real-time odds to re-calibrate the expected scoring rates instantaneously as events evolve in the game. This allows us to assess how market expectations change according to exogenous events such as corner kicks, goals, and red cards. A plot of the implied volatility provides a diagnostic tool to show how the market reacts to event information. In particular, we study the evolution of the odds-implied final score prediction over the course of the game. Our dynamic Skellam model fits the scoring data well in a calibration study of 1,520 EPL games from the 2012–2016 seasons. One advantage of viewing market odds through the lens of a probability model is the ability to obtain more accurate estimates of winning probabilities.
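The two parameters are the home and away Poisson scoring rates; the goal difference is then Skellam-distributed. A self-contained sketch computes win/draw/loss probabilities from a pair of rates by summing the truncated joint Poisson grid (the rates below are illustrative, whereas the paper re-calibrates them in real time from the odds matrix):

```python
import math

def poisson_pmf(lam, k):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def outcome_probs(lam_home, lam_away, max_goals=20):
    """Home win / draw / away win probabilities implied by two Poisson
    scoring rates; truncating at max_goals loses negligible mass."""
    win = draw = lose = 0.0
    for h in range(max_goals + 1):
        ph = poisson_pmf(lam_home, h)
        for a in range(max_goals + 1):
            p = ph * poisson_pmf(lam_away, a)
            if h > a:
                win += p
            elif h == a:
                draw += p
            else:
                lose += p
    return win, draw, lose

win, draw, lose = outcome_probs(1.5, 1.1)
print(f"win={win:.3f} draw={draw:.3f} lose={lose:.3f}")
```

Fitting the model to live odds amounts to choosing the two rates that best reproduce the quoted score matrix at each moment.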

Speakers
Description

Download Full Paper Here

Play-by-play is an important data source for analysis, particularly for basketball leagues that cannot afford the infrastructure for collecting video tracking data – it enables advanced metrics like adjusted plus-minus and lineup analysis like With Or Without You (WOWY). However, this analysis is not possible unless all substitutions are recorded and are correct. In this paper we use six seasons of play-by-play from the Canadian university league to derive a framework for automated cleaning of play-by-play that is littered with substitution logging errors. These errors include missing substitutions, an unequal number of players subbing in and out, a player’s substitution pattern not alternating between in/out, etc. We define features to build a prediction model for identifying correctly/incorrectly recorded substitutions and outline a simple heuristic for player activity to use for inferring the players who were not accounted for in the substitutions. We define two performance measures for objectively quantifying the effectiveness of this framework. This opens up a set of statistics that were not previously obtainable for the Canadian interuniversity league and that improve the business: coaches can improve strategy, leading to a more competitive product, and media can introduce modern statistics into their coverage to increase engagement from fans.
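A sketch of the error-detection step: walk a game's substitution log against the set of players on the court and flag entries that sub out a benched player or sub in one already playing. The event format and names are invented; the paper's framework adds learned features and repair heuristics on top of checks like this:

```python
def check_substitutions(subs, starters):
    """subs: list of (player_out, player_in) events in game order.
    Returns indices of events inconsistent with the running lineup."""
    on_court = set(starters)
    errors = []
    for i, (out_p, in_p) in enumerate(subs):
        if out_p not in on_court or in_p in on_court:
            errors.append(i)
        on_court.discard(out_p)
        on_court.add(in_p)
    return errors

starters = ["P1", "P2", "P3", "P4", "P5"]
subs = [("P3", "P6"),   # valid
        ("P3", "P7"),   # error: P3 is already off the court
        ("P6", "P3")]   # valid
print(check_substitutions(subs, starters))
```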

Speakers
Description

Download Full Paper Here

League of Legends is a popular esports title in the ‘Action Real-Time Strategy’ genre – currently boasting over 100 million monthly active users and over 130 champions playable in-game. The term “meta” is used to describe the subset of playable champions that is widely considered “viable” and/or “dominant” in professional play. This meta evolves rapidly because the game’s developer, Riot Games, continually releases new champions (every 6-10 weeks) and rebalances the relative strength of existing champions (every 2-3 weeks). To date, approaches aimed at identifying changepoints in the meta have largely relied on subjective opinion and feel. This paper therefore presents an empirical methodology for characterizing the frequency with which “meta shifts” occur over time, applying an offline agglomerative changepoint model to professional champion select data from the past three seasons. We then derive and introduce a new statistical measure called the Champion Viability Score (CVS) – which quantifies how the “priority” or “value” that professional teams place on different champions changes over time – applying an exponential smoothing approach to the time series data above. Together, these methods empower esports managers, coaches, and analysts to (1) better understand the historical viability of champions in past meta regimes, (2) better adapt and develop strategies to align with the current meta, and (3) better anticipate and prepare for inevitable changes to the future meta.
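As a guess at the shape of the CVS computation (the paper's exact formula is not given here), one can exponentially smooth a champion's per-patch pick/ban priority so that recent patches dominate the score. The alpha value and the priority series are invented:

```python
def cvs(priorities, alpha=0.4):
    """Exponentially smoothed priority, most recent patch last;
    higher = more relevant to the current meta."""
    score = priorities[0]
    for p in priorities[1:]:
        score = alpha * p + (1 - alpha) * score
    return score

rising  = [0.05, 0.10, 0.30, 0.60, 0.80]   # champion entering the meta
falling = [0.80, 0.60, 0.30, 0.10, 0.05]   # champion rotating out
print(round(cvs(rising), 3), round(cvs(falling), 3))
```

Even though both series contain the same values, smoothing ranks the rising champion well above the falling one, which is the recency-weighting the measure needs to track meta shifts.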

Speakers