Hello! I am a senior studying Statistics with an additional major in Human-Computer Interaction at Carnegie Mellon University.
Outside of work, I am a dancer involved in many campus organizations like KPDC, Ballroom and ONiB.
Email: zhuoyaz@andrew.cmu.edu
LinkedIn: Alice Zhang
In my Data Mining class, my project team examined a dataset extracted from the Airline On-Time Performance Data made available through the Bureau of Transportation Statistics of the U.S. Department of Transportation. We were specifically interested in commercial flight activity to and from Pittsburgh International Airport (PIT) in 2015. Since flight delays are often a costly inconvenience to passengers, our goal was to predict whether a departing flight would be delayed. We tested our predictions on a portion of the 2016 dataset, and since we only cared about flights departing from Pittsburgh, our test data was restricted to those flights. To increase our predictive accuracy, we used the R package weatherData to gather additional weather data for both the departure and arrival cities. After performing exploratory analysis, we modeled our data with random forest and adaptive boosting (AdaBoost) and found that AdaBoost achieved a lower misclassification rate, higher completeness, and higher purity than random forest did. With our AdaBoost model, we were able to achieve a testing error of 8.4%.
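For readers unfamiliar with the three metrics mentioned above, here is a minimal Python sketch of how they can be computed from true and predicted labels. It is an illustration only, not the project's code (the project was done in R). It assumes that "completeness" means the fraction of truly delayed flights that were flagged (recall) and "purity" means the fraction of flagged flights that were truly delayed (precision). The function name `classification_metrics` and the toy labels are hypothetical and are not taken from the flight data.

```python
def classification_metrics(y_true, y_pred):
    """Return (misclassification_rate, completeness, purity).

    Labels are 1 for "delayed" and 0 for "on time".
    Assumption: completeness = recall, purity = precision.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    # Count each cell of the confusion matrix that the metrics need.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    errors = fp + fn
    misclassification_rate = errors / len(y_true)
    # Guard against division by zero when a class is never seen or predicted.
    completeness = tp / (tp + fn) if (tp + fn) else float("nan")
    purity = tp / (tp + fp) if (tp + fp) else float("nan")
    return misclassification_rate, completeness, purity


# Toy labels made up for illustration (not real flight records).
toy_true = [1, 0, 0, 1, 0, 1, 0, 0]
toy_pred = [1, 0, 0, 0, 0, 1, 1, 0]
error_rate, completeness, purity = classification_metrics(toy_true, toy_pred)
print(f"error={error_rate:.3f} completeness={completeness:.3f} purity={purity:.3f}")
```

Reporting completeness and purity alongside the error rate matters when delays are the minority class, since a model that always predicts "on time" could have a low error rate while catching no delays at all.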