What’s Driving Big Data in the Enterprise – and Why Now?

Big Data on Ulitzer

Subscribe to Big Data on Ulitzer: eMailAlertsEmail Alerts newslettersWeekly Newsletters
Get Big Data on Ulitzer: homepageHomepage mobileMobile rssRSS facebookFacebook twitterTwitter linkedinLinkedIn

BigData Authors: Zakia Bouachraoui, Carmen Gonzalez, Elizabeth White, Andreas Grabner, William Schmarzo

Related Topics: Government News, Big Data on Ulitzer

Blog Feed Post

Model building with the iris data set for Big Data

by Joseph Rickert I think we can be sure that when American botanist Edgar Anderson meticulously collected data on three species of iris in the early 1930s he had no idea that these data would produce a computational storm that would persist well into the 21st century. The calculations started, presumably by hand, when R. A. Fisher selected this data set to illustrate the techniques described in his 1936 paper on discriminant analysis. However, they really got going in the early 1970s when the pattern recognition and machine learn community began using it to test new algorithms or illustrate fundamental principles. (The earliest reference I could find was: Gates, G.W. "The Reduced Nearest Neighbor Rule". IEEE Transactions on Information Theory, May 1972, 431-433.) Since then, the data set (or one of its variations) has been used to test hundreds, if not thousands, of machine learning algorithms. The UCI Machine Learning Repository, which contains what is probably the “official” iris data set, lists over 200 papers referencing the iris data. So why has the iris data set become so popular? Like most success stories, randomness undoubtedly plays a huge part. However, Fisher’s selecting it to illustrate a discrimination algorithm brought it to peoples attention, and the fact that the data set contains three classes, only one of which is linearly separable from the other two, makes it interesting. For similar reasons, the airlines data set used in the 2009 ASA Sections on Statistical Computing and Statistical Graphics Data expo has gained a prominent place in the machine learning world and is well on its way to becoming the “iris data set for big data”. It shows up in all kinds of places. (In addition to this blog, it made its way into the RHIPE documentation and figures in several college course modeling efforts.) Some key features of the airlines data set are: It is big enough to exceed the memory of most desktop machines. (The version of the airlines data set used for the competition contained just over 123 million records with twenty-nine variables. The data set contains several different types of variables. (Some of the categorical variables have hundreds of levels.) There are interesting things to learn from the data set. (This exercise from Kane and Emerson for example) The data set is tidy, but not clean, making it an attractive tool to practice big data wrangling. (The AirTime variable ranges from -3,818 minutes to 3,508 minutes) An additional, really nice feature of the airlines data set is that it keeps getting bigger! RITA, The Research and Innovative Technology Administration Bureau of Transportation Statistics continues to collect data which can be downloaded in .csv files. For your convenience, we have a 143M+ record version of the data set Revolution Analytics test data site  which contains all of the RITA records from 1987 through the end of 2012 available for download. The following analysis from Revolution Analytics’ Sue Ranney uses this large version of the airlines data set and illustrates how a good model, driven with enough data, can reveal surprising features of a data set. # Fit a Tweedie GLM tm <- system.time( glmout >

Read the original blog entry...

More Stories By David Smith

David Smith is Vice President of Marketing and Community at Revolution Analytics. He has a long history with the R and statistics communities. After graduating with a degree in Statistics from the University of Adelaide, South Australia, he spent four years researching statistical methodology at Lancaster University in the United Kingdom, where he also developed a number of packages for the S-PLUS statistical modeling environment. He continued his association with S-PLUS at Insightful (now TIBCO Spotfire) overseeing the product management of S-PLUS and other statistical and data mining products.<

David smith is the co-author (with Bill Venables) of the popular tutorial manual, An Introduction to R, and one of the originating developers of the ESS: Emacs Speaks Statistics project. Today, he leads marketing for REvolution R, supports R communities worldwide, and is responsible for the Revolutions blog. Prior to joining Revolution Analytics, he served as vice president of product management at Zynchros, Inc. Follow him on twitter at @RevoDavid