Identifying and Reducing Internal Employee Threat for Lockheed Martin
Tools Used:
Isolation Forest algorithm, R, SAS E-miner, SAS E-guide, Tableau, Gephi
This project team consisted of me, Anas Laffet, Fernando Cuen, Kaitlyn Schroeder and Preetha Pai.
Anomaly detection can provide clues about an outlying minority class in the data. In this project, we analyze a simulated dataset of employees to identify insider threats.
Since we don’t have labels, we need to use unsupervised learning. After reading about state-of-the-art methods for anomaly detection, we chose the algorithm we thought was most promising: Isolation Forest.
The "Isolation Forest" algorithm finds anomalies by deliberately “overfitting” models that memorize each data point. Since outliers have more empty space around them, they take fewer splits to isolate, and therefore fewer steps to memorize.
The algorithm is an adaptation of random forest in which the usual decision trees are replaced by full decision trees grown with random splits (every leaf is a single data point), and we keep track of the path length between the root and each leaf (data point). The final measure for each data point is its average path length across all trees. Abnormal data points are isolated easily, so their average path length should be relatively short.
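To make the path-length idea concrete, here is a minimal sketch in plain Python. It is not the project's implementation (the project used the tools listed above), and the function names path_length and average_path_length, the synthetic two-dimensional data, and the choice of 50 trees are all assumptions made for illustration. As a simplification, it does not build whole trees: it follows one point down a fresh sequence of random splits and counts how many splits it takes to isolate that point.

```python
import random


def path_length(point, data, rng, depth=0):
    """Count the random splits needed to isolate `point` within `data`."""
    if len(data) <= 1:
        return depth
    # Only features on which the remaining points differ can be split.
    splittable = [
        d for d in range(len(point))
        if min(row[d] for row in data) < max(row[d] for row in data)
    ]
    if not splittable:
        return depth  # remaining points are identical; nothing left to split
    dim = rng.choice(splittable)
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that contains the point of interest.
    if point[dim] < split:
        side = [row for row in data if row[dim] < split]
    else:
        side = [row for row in data if row[dim] >= split]
    return path_length(point, side, rng, depth + 1)


def average_path_length(point, data, n_trees=50, seed=0):
    """Average the isolation depth over `n_trees` independent random runs."""
    rng = random.Random(seed)
    total = sum(path_length(point, data, rng) for _ in range(n_trees))
    return total / n_trees


# Synthetic data (hypothetical): a tight cluster plus one distant point.
data_rng = random.Random(42)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(100)]
outlier = (8.0, 8.0)
data = cluster + [outlier]

outlier_score = average_path_length(outlier, data)
cluster_scores = [average_path_length(p, data) for p in cluster]
typical_score = sum(cluster_scores) / len(cluster_scores)

print("average path length, outlier:", outlier_score)
print("average path length, cluster mean:", typical_score)
```

In this sketch a shorter average path length marks a more anomalous point, so the distant point should score well below the cluster average. Production implementations typically add subsampling and a normalized anomaly score, which are left out here for brevity.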
The following figure illustrates the intuition behind this algorithm:
The framework we built is described below:
The following Sway presentation walks through the whole project procedure:

lockheed_martin_project_report_final.docx