Interview with Baseball Database Innovator Sean Lahman
PAC’s Sports recently interviewed award-winning database journalist and author Sean Lahman. Sean is a visionary in the field of sports statistics, and led the first significant effort to develop a database of baseball statistics that was made freely available to the general public. He created the Lahman Baseball Database, which is a collection of baseball statistics for every team and player in Major League history that also allows for the simulation and recreation of historical seasons from baseball history. His baseball website (The Baseball Archive) was formed in July of 1995, earning the distinction of being the oldest and longest running baseball website still in existence.
Sean’s efforts to document the statistical history of sports have also extended to pro football, basketball, and tennis, as he has edited or contributed to the definitive encyclopedias of these respective sports. Additionally, Sean has written numerous books; he created the annual Pro Football Prospectus in 2002 and later authored The Pro Football Historical Abstract in 2008. Sean’s football abstract won the 2008 Nelson Ross Award for “outstanding achievement in pro football research and historiography”.
Our interview covered Sean’s experiences compiling data for various sports, sabermetrics, his thoughts on some of baseball’s premier free agents, and recent developments in sports research.
Q: Can you provide some background information on how you became interested in sports statistics? Were you always a sports fan or were you more drawn to the analytical side of things?
A: I grew up following baseball and football, but baseball was always my biggest passion. Like many young fans of my generation, I was fascinated by the work of Bill James. His first book came out when I was in the 8th grade, and it opened up a whole new world to me. When I was in college, I found the book "The Hidden Game of Baseball" by John Thorn and Pete Palmer. While James was focused on current players, Thorn and Palmer spent a lot of time discussing historical figures, and their book gave a crash course in the history of statistical analysis. It introduced me to a community of people who viewed baseball research as serious business. I spent countless hours at the university library tracking down obscure research journals and started building spreadsheets to do my own analysis.
Q: Please tell us a little about the work you have done in your career thus far related to sports and the historical databases you have created; what is the purpose of them, who are the users, what has the response been to your findings?
A: Anyone who starts doing statistical analysis eventually comes up against the same problem: having raw data to work with. Folks would have to build their own, but were reluctant to share. Part of that was self-preservation. If you were the only one with the numbers you could be the go-to guy when someone else wanted answers, and parcel them out a handful at a time.
It was clear that I had some expertise at building databases, and that I was better at that than I was at analyzing the data on the back end. I figured that making a full database available would spur new work by people who were smarter than me, and so I started making my baseball database available on the web during the fall of 1994. And it didn't take long for my vision to come true. There was a resurgence of sabermetric analysis, which gave birth to things like the Baseball Prospectus. The availability of data also revolutionized baseball simulations, with games like Baseball Mogul and Out of the Park Baseball enabling you to recreate any season from history. And best of all, folks used the data to build great online encyclopedias, most notably Sean Forman's Baseball-Reference.com.
In the late 1990s I worked for Total Sports, a book publishing company founded by John Thorn, who's now the official historian of Major League Baseball. His vision was to create an encyclopedia for each sport that set the gold standard, and each of those would require a comprehensive database as its foundation. In some cases, that meant partnering with folks like Pete Palmer or David Neft, giants in the sports reference field who had built their own databases. In other cases, it meant working with source material to build new databases.
I've always been more interested in the work of compiling data than doing serious number crunching. For example, I've been working on collecting play-by-play accounts for NFL games. With a handful of other researchers, we've compiled and published accounts for about 80% of the games since 1960.
Q: I see your work has encompassed many different sports (baseball, pro and college football, pro and college basketball, auto racing, tennis, boxing and the Olympic games). Are there any particular sports you find more interesting or prefer researching more so than the others?
A: Each sport presents different challenges because the state of the historical record varies. Baseball has always been meticulously documented. Not so with other sports. The NFL didn't keep its own statistics until 1932, so the early record has had to be reconstructed by historians and researchers. Tennis is international and was loosely organized during the amateur era. Boxing was fairly well covered by the newspapers, but it wasn't uncommon for fighters to use pseudonyms for one reason or another, or to inflate their records with accounts of fights that never happened.
As I mentioned, the goal at Total Sports was to create comprehensive encyclopedias for all sports. We did that for baseball, football, and hockey, and had several others in the works when the company started having financial difficulties and closed shop. I was particularly excited about the work I'd done on tennis. I had lunch in Manhattan with Bud Collins, the colorful and legendary tennis commentator, and he lamented that nobody had ever made the effort to compile an exhaustive player register for tennis like the ones that existed for baseball. I told him I was the guy for the job, and spent about six months sifting through various archives around the world. We were about halfway there when Total Sports went bankrupt, and another publisher ended up with the remnants. They ended up producing a book that had more photographs and fewer stats -- which was a shame. It was a missed opportunity.
The printed encyclopedia is extinct now, and the online outlets aren't likely to commit the resources needed to do these kinds of projects.
Q: Sabermetrics as a concept and its use by major league baseball front offices has exploded in the past decade. Did you use any Sabermetric principles or metrics when you performed your research pertaining to baseball? Or any other type of analysis as a measure?
A: The term sabermetrics encompasses a lot of different things, but I think it boils down to the basic principle that your decision making ought to be guided by evidence. The computer age has made that easier. Many pro teams have their own in-house systems and proprietary methods. My role has largely been to help provide the raw data that makes that sort of work possible.
Q: There are a few big-name free-agent baseball players on the market this off-season, specifically Albert Pujols, Jose Reyes, Prince Fielder, and Jonathan Papelbon (who just signed with the Phillies). If you worked for a baseball front office, are there any statistics (both current and historical) that you have or would have looked at in particular before you signed them to a long-term contract? Can you project any kind of future performance for these players based on age, trends, and historical comparisons?
A: Pujols is a rare talent, and I think one of the things folks lose sight of when they look at his gaudy statistics is that he's spent his career in a ballpark that really favors pitchers. If he went to a hitter’s park like the ones in Texas or Colorado, his numbers would go through the roof. The comparable player that comes to mind is Henry Aaron, who saw his numbers surge in his mid-30s when the Braves moved from Milwaukee to Atlanta. The only question mark that I see is the cost of signing him. A team needs to avoid hamstringing itself so that it can't spend enough money on other players -- like the Rangers did with Alex Rodriguez or the Reds with Ken Griffey, Jr.
As far as Papelbon, I think there are two schools of thought with respect to closers. One is that you want a lockdown guy to come in in the ninth whenever there's a save situation. Boston has been a member of that school, and it seems so too are the Phillies. It seems to me that this signing was a knee-jerk reaction to the struggles of Brad Lidge. For a team that a) has a lot of other talent and b) can afford to overpay, it's not such a horrible move. But most analysts (and a growing number of managers) favor another school of thought, which favors a more situational approach to bullpen usage, focusing on individual matchups.
In general, I think the history suggests Reyes and Fielder are the types of players teams overpay for and end up being disappointed with. It's less about the numbers they'll produce and more about the value.
Q: Are statistics-driven metrics starting to be used more regularly in any of the other sports you have researched?
A: Most NBA teams have embraced statistical analysis as an important part of their planning process. Football has always been ahead of the curve when it comes to technology. There's less emphasis on using statistics for individual player analysis -- they still prefer to rely on film study -- but they know everything there is to know at the team level, particularly with regards to play-calling patterns.
Q: What work or research have you done in particular in helping the NBA or NFL in the field of sports statistics?
A: I wrote a book on pro football that was modeled after the Bill James Historical Abstract, and I tried to make comparisons of players across the various eras in a way that was fair and made sense. The book won an award from the Pro Football Researchers Association for the best book of the year, so I’d like to think that's one measure of its success.
Most of my work with teams has been to compile and supply data rather than analyzing it, but I've had contact with both NBA and NFL teams.
Q: Any people, athletes, or coaches you have worked with in the course of your research that you'd like to highlight or discuss?
A: In general, I always found the fringe players more interesting. Stars tend to have immersed themselves in their own mythology, and they've told the stories so often that they can't remember what's true and what's not.
I did a lot of interviews for my football book, and I spent a number of years in New York covering the Jets and Giants. Kurt Warner was a lot of fun to cover. He had such a commanding presence, and he was the kind of guy who just made everybody in the room feel better about themselves. Manute Bol was a fascinating guy, one who I wish I could have spent more time with. You'd be hard pressed to find anyone who came in contact with him who didn't feel a profound impact. I've also been fascinated by guys at the other end of the spectrum, guys like Pete Rose and Denny McLain, whose complex personalities make them so interesting to talk to.
Q: Were the interviews you conducted for your football book performed so you could get an idea of what the perception of certain players was from those they actually competed against?
A: What prompted me to write that book was my disappointment with another popular book which purported to rate the top football players of all time. I felt that if you were going to engage in that sort of ranking, there ought to be some methodology. In other words, if you're going to say Dick Butkus was the greatest linebacker of all time, you shouldn't have any difficulty saying why. But instead, that book and others like it are full of empty quotes about how tough a guy was, or saying "he changed the game" without explaining how.
When we say that Willie Mays was a great player, we can explain why. Great hitter, used his speed aggressively on the base paths, had tremendous range in the outfield, etc. We can describe how Bob Gibson intimidated hitters by throwing his fastball inside and how he had a great slider at a time when few others threw that pitch.
So I wanted to write about playing styles, and about the concrete things that distinguished football players from one another. That was particularly tough to do for linemen and defensive players, so I did interviews with former players who could help fill me in. I never saw Unitas in his prime, and while the numbers tell me he was great, I wanted folks to tell me what made him great. One former coach spent days showing me old game film and pinpointing things like Chris Hanburger's unorthodox tackling style, or the unusual stance that Randall McDaniel used.
Q: What work have you performed related to the sport of boxing?
A: We were working on a boxing encyclopedia at Total Sports, with a couple of veteran sportswriters signed up to compile the prose -- Bert Randolph Sugar and Phil Berger. When Berger died in 2001, the book got derailed and never got back on track. Heavyweight Hasim Rahman was training here in Rochester after he beat Lennox Lewis, which sparked my interest a little. I love to go to induction weekend at the Boxing Hall of Fame every summer. There's not a massive crowd like there is in Canton and Cooperstown, maybe a few hundred fans and a few dozen old boxers. It's kind of cool to rub shoulders with guys like Ken Norton and Angelo Dundee.
Q: You've expressed your appreciation for the work of Bill James. Have you ever had the opportunity to collaborate with him, share ideas? Or has he indicated he is a fan of all the research you have compiled?
A: Some of my first published writing appeared in one of Bill's books, an annual called "The Great American Baseball Stat Book" which came out in 1988 and 1989. He and I have bumped into each other at SABR events and exchanged email on a variety of subjects. But I do have to say that we also sparred a bit over his coverage of the Pete Rose case. I wrote that he'd misstated some of the evidence in making his defense of Rose, and he criticized me in his last book for doing so.
Since he joined the Red Sox, Bill's ability to collaborate or share much of his new work has been limited, and while that's resulted in two World Championships for Boston, his withdrawal has been a loss for the baseball research community.
Q: Any new recent developments in sports research or data mining related to statistics that you are aware of?
A: The explosion of data available for baseball now is mind-blowing. The PitchFX data offers such a tremendous amount of information... it'll be five years before we really start to mine the full depths of what those numbers offer. And not for nothing, but for $100 I can get access to full video from every MLB game for the whole season. When I was growing up in the 70s, we got one game on television each week. You could be a hardcore fan and never see 80% of the players actually play. I'm also beginning to work with a small group of researchers who are trying to beef up the historical minor league data. A robust dataset will really help advance the study of player development.
In addition to his frequent speaking engagements on topics such as database journalism, data mining, and open source databases, Sean currently works on data driven stories as a reporter for the Rochester Democrat and Chronicle. Additional information on Sean and his extensive work can be found on his website, http://seanlahman.com/.