This project used data collected from PhishTank to build machine learning models that classify websites as phishing or legitimate. Such models could protect users by alerting them before they visit a website that would put them at risk. Logistic regression, Naive Bayes, k-nearest neighbors, decision tree, and random forest models were trained, and attribute subset selection was performed to further improve performance.
Before subset selection, Naive Bayes had the best prediction accuracy (86.6%), but logistic regression and random forest surpassed it once the number of predictors was reduced using subset selection. Logistic regression then had the highest prediction accuracy at 89.9%, followed by random forest at 89.3%. As more people visit websites every day, it is increasingly important that antivirus and firewall software can safeguard users from malicious websites.
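
As a rough illustration of how attribute subset selection can change a classifier's accuracy, the sketch below runs a greedy forward selection around a small k-nearest neighbors classifier. It is not the procedure used in this project: the data are synthetic stand-ins rather than PhishTank features, the forward-selection strategy is an assumption (the selection method is not specified here), and the names `make_data`, `knn_predict`, `accuracy`, and `forward_select` are hypothetical helpers written for this example.

```python
import random

def make_data(n, rng):
    """Synthetic stand-in data (assumption): 2 informative and 4 noise features."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)  # 1 = "phishing", 0 = "legitimate"
        informative = [rng.gauss(2.0 * label, 1.0) for _ in range(2)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(4)]
        rows.append((informative + noise, label))
    return rows

def knn_predict(train, x, features, k=5):
    """Majority vote among the k nearest training rows, using only `features`."""
    dists = sorted(
        (sum((row[f] - x[f]) ** 2 for f in features), label)
        for row, label in train
    )
    votes = sum(label for _, label in dists[:k])
    return 1 if 2 * votes > k else 0

def accuracy(train, test, features):
    """Fraction of rows in `test` that k-NN labels correctly."""
    hits = sum(knn_predict(train, x, features) == y for x, y in test)
    return hits / len(test)

def forward_select(train, valid, n_features):
    """Greedily add the feature that most improves validation accuracy;
    stop as soon as no remaining feature improves it."""
    selected, best = [], 0.0
    while len(selected) < n_features:
        candidates = [f for f in range(n_features) if f not in selected]
        scores = {f: accuracy(train, valid, selected + [f]) for f in candidates}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best:
            break
        selected.append(f_best)
        best = scores[f_best]
    return selected, best

rng = random.Random(0)  # fixed seed so the run is repeatable
train, valid = make_data(200, rng), make_data(100, rng)
all_features = list(range(6))

baseline = accuracy(train, valid, all_features)
selected, score = forward_select(train, valid, len(all_features))

print(f"all features:      accuracy = {baseline:.3f}")
print(f"selected {selected}: accuracy = {score:.3f}")
```

Because the same validation split both guides the selection and reports the score, the selected-subset accuracy printed here is optimistic; a fair comparison between models would score the chosen subset on a separate held-out test set.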