Alice Zhang

About Me

Hello! I am a senior studying Statistics with an additional major in Human-Computer Interaction at Carnegie Mellon University.

Outside of work, I am a dancer involved in many campus organizations like KPDC, Ballroom and ONiB.

Email: zhuoyaz@andrew.cmu.edu

LinkedIn: Alice Zhang

Statistical Graphics and Visualization

In Statistical Graphics and Visualization, I learned the pros and cons of using different graphical displays to visualize datasets and their characteristics. I gained experience working not only with univariate and bivariate data, but also with three-dimensional tools, group structure and clustering, and projections of higher-dimensional data.

In our Graphics Project, my team and I reported interesting findings through text analysis, statistical maps, time series, and networks, based on real data from the article “Where Police Have Killed Americans in 2015” and from the Guardian's reporting, combined with census data from the American Community Survey. Using the ggplot2 package in R, we explored the underlying patterns in the profiles of police killing victims, including the demographics of the victims, the methods of killing, and the time and location of the killings. We concluded that the typical victim of police killings was an armed white male in his mid-20s living in California, Florida, or Texas. We also found that, while there was no specific time with an unusual increase or decrease in police killings, the number of non-white victims was disproportionately high relative to their share of their state's population. Our poster was then presented at the CMU Statistics Department’s 50th Anniversary celebration.

Also with R and ggplot2, along with Plotly and other built-in packages, my group and I created a web-based data visualization tool. Each of our 8 graphs had interactive features that allowed users to select variables and conditions, hover over areas for additional information, and change the amount of data displayed. The graphs were then published on a dashboard built with Shiny.

Data Mining

In my Data Mining class, my project team examined a dataset extracted from the Airline On-Time Performance Data made available through the Bureau of Transportation Statistics of the U.S. Department of Transportation. We were specifically interested in the commercial flight activity to and from Pittsburgh International Airport (PIT) in 2015. Since flight delays are often a costly inconvenience to passengers, our goal was to predict whether a departing flight would be delayed. We tested our predictions on a portion of the 2016 dataset, and since we only cared about flights departing from Pittsburgh, our test data was the subset of flights departing from Pittsburgh. To increase our predictive accuracy, we used the R package weatherData to gather additional weather data for both the departure and arrival cities. After performing exploratory analysis, we modeled our data with random forest and adaptive boosting (AdaBoost) and found that AdaBoost achieved a lower misclassification rate, higher completeness, and higher purity than random forest did. With our AdaBoost model, we were able to achieve a testing error of 8.4%.
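
For readers unfamiliar with the three metrics used to compare the models, here is a minimal Python sketch of how they might be computed from predicted and actual labels. It assumes completeness and purity are meant in their usual sense (recall and precision for the "delayed" class). The function name, label strings, and sample labels are hypothetical and made up for illustration; they are not the project's code or data.

```python
def classification_metrics(actual, predicted, positive="delayed"):
    """Compute misclassification rate, completeness, and purity.

    Assumption: completeness = recall and purity = precision,
    both measured for the `positive` class.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal in length")

    pairs = list(zip(actual, predicted))
    # Delayed flights correctly predicted as delayed.
    tp = sum(a == positive and p == positive for a, p in pairs)
    # On-time flights wrongly predicted as delayed.
    fp = sum(a != positive and p == positive for a, p in pairs)
    # Delayed flights wrongly predicted as on time.
    fn = sum(a == positive and p != positive for a, p in pairs)
    # Any prediction that does not match the true label.
    errors = sum(a != p for a, p in pairs)

    return {
        "misclassification_rate": errors / len(pairs),
        "completeness": tp / (tp + fn) if (tp + fn) else float("nan"),
        "purity": tp / (tp + fp) if (tp + fp) else float("nan"),
    }


# Made-up labels for eight flights, purely for illustration.
actual = ["delayed", "on_time", "on_time", "delayed",
          "on_time", "on_time", "delayed", "on_time"]
predicted = ["delayed", "on_time", "delayed", "on_time",
             "on_time", "on_time", "delayed", "on_time"]

metrics = classification_metrics(actual, predicted)
print(metrics)
```

A lower misclassification rate means fewer wrong predictions overall, while completeness and purity separate the two kinds of mistakes: missing a real delay versus raising a false alarm.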