
Links to GitHub Repositories for past Data Science Projects
Capstone: German Traffic Sign Identification
This project uses the GTSRB and GTSDB datasets to classify 43 categories of traffic signs and to detect and classify signs within a larger image. Sign recognition reached 98% accuracy, a strong result given how similar many of the signs are. I detected the presence of a sign 93% of the time, giving a 92% overall success rate for both detecting a sign and correctly identifying which of the 43 signs it was. The false positive rate (detecting a sign that wasn't there and classifying it as one of the 43) was 0.25-0.5 signs per image.
Group Project: Tumor Classification
Group project using convolutional neural networks (CNNs) and variants to classify tumors in mammograms. Two datasets were explored: MIAS and CBIS-DDSM. On the CBIS-DDSM dataset, we achieved 92% accuracy at identifying tumors within mammograms. For the MIAS dataset, we also tried other packages to mask the images and identify tumor regions.
Hackathon: Classifying Shelter Outcomes
Group project to predict shelter outcomes for dogs and cats in Austin, TX, based on a closed Kaggle competition. We had six hours from start to finish to build a model and present results. With only eight columns of information, our predictions were reasonably successful: our submission would have placed at roughly the 60th percentile.
Classifying Subreddits Using NLP
This project uses NLP to determine whether a text post came from r/DadJokes or r/AntiJokes. Several models, text-processing steps, and engineered features were tried, reaching a final accuracy of 74% at distinguishing between the two subreddits.
Predicting House Sale Price in Ames, IA
Regression model for predicting house sale prices in Ames, IA, taken from a Kaggle competition. The focus was on feature engineering and avoiding overfitting when building the model.
SAT Scores by State and Race
This project examines SAT scores and draws top-level conclusions about trends related to race. The analysis focuses on state-by-state results from 2017-2019, using information scraped from the College Board Suite of Assessments.