Using machine learning to identify potential malicious hostnames.

adbcode/host-check

host-check

Using machine learning to identify potential malware hostnames.

Rationale

  • Used StevenBlack's consolidated hostname blacklist (on GitHub) as the source of malicious websites, and Alexa's tracked top websites (100k in our case) as the benign set.
  • Heavy feature engineering was used to generate features from a single raw input: the hostname!
  • Tested a suite of classifiers on the final scaled numeric features:
    • Logistic Regression
    • Naïve Bayes
    • K-Nearest Neighbours (k=5)
    • Random Forest
    • Stochastic Gradient Descent
  • Narrowed tuning down to a couple of classifiers, then tested a custom ensemble of the tuned versions using soft voting.
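
The feature-engineering step above could look roughly like the sketch below. The README does not list the actual features, so the feature set here (length, label count, digit and hyphen counts, character entropy) and the function names `shannon_entropy` and `hostname_features` are illustrative assumptions, not the project's real implementation.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the characters in `text`."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def hostname_features(hostname: str) -> dict:
    """Turn one hostname into numeric features (ASSUMED, illustrative set)."""
    host = hostname.strip().lower().rstrip(".")
    labels = host.split(".")
    return {
        "length": len(host),
        "num_labels": len(labels),
        "num_digits": sum(ch.isdigit() for ch in host),
        "num_hyphens": host.count("-"),
        "longest_label": max(len(label) for label in labels),
        "entropy": shannon_entropy(host),
    }


if __name__ == "__main__":
    for name in ("example.com", "login-secure123.example.net"):
        print(name, hostname_features(name))
```

Each hostname becomes one row of numbers, which can then be scaled and passed to any of the classifiers listed above.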

Results

  • ~84% accuracy when using only features with weight > 0.01
    • ~79% accuracy with only 3 features! (significantly less time required for feature engineering)
  • Further improvements may be possible with deep learning and/or different feature extraction approaches

How to deploy

Create conda environment

  • Run `conda env create -f resources/host-check.yml`

Replicate results only

  • Unzip `resources\df_final.zip` and `resources\random_forest_final.zip`
  • Load the pickles into your own project and split off the test data using `train_test_split(X, y, test_size=0.2, random_state=42)`, where `X = df.drop(['malicious'], axis=1)` and `y = df['malicious']`
  • Test the model against the above split or with your own data!
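
The replication steps above can be sketched as follows. The pickle file names in the usage comment are assumptions (the README does not say what the archives contain), the helper names `load_model`, `evaluate` and `replicate` are hypothetical, and the sketch assumes the model was saved with the standard `pickle` module.

```python
import pickle

import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def load_model(path):
    """Load a pickled model (only unpickle files you trust)."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def evaluate(df: pd.DataFrame, model) -> float:
    """Recreate the README's 80/20 split and score `model` on the test part."""
    X = df.drop(["malicious"], axis=1)
    y = df["malicious"]
    _, X_test, _, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    return accuracy_score(y_test, model.predict(X_test))


def replicate(df_path: str, model_path: str) -> float:
    """Load the pickled DataFrame and model, then return test accuracy."""
    df = pd.read_pickle(df_path)
    model = load_model(model_path)
    return evaluate(df, model)


# Usage (ASSUMED file names after unzipping; adjust to the real ones):
# print(replicate("resources/df_final.pkl", "resources/random_forest_final.pkl"))
```

Using the same `random_state=42` matters here: it reproduces the same held-out rows, so the reported accuracy is comparable.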

Start from scratch*

TODO: Pipeline

  • Using the learnings from the evaluation in host-check.py, create a pipeline that takes fresh versions of alexa and malware_pd directly and evaluates performance.
    • Optionally with grid search built in (at the cost of a heavy computational load)!
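
One possible shape for the planned pipeline is sketched below using scikit-learn's `Pipeline` and `GridSearchCV`. This is not the project's implementation: the step names, the choice of `StandardScaler` plus a random forest, and the helpers `build_model` and `fit_and_score` are assumptions, and `X` is assumed to already hold the numeric features engineered from the hostnames.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_model(param_grid=None, cv=3):
    """Scaler + random forest, optionally wrapped in a grid search."""
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("clf", RandomForestClassifier(random_state=42)),
    ])
    if param_grid is None:
        return pipeline
    # Grid search refits the whole pipeline for every parameter combination
    # and fold, which is where the heavy computational load comes from.
    return GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")


def fit_and_score(X, y, param_grid=None):
    """Fit on an 80/20 split and return (fitted model, test accuracy)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = build_model(param_grid)
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)
```

Passing a grid such as `{"clf__n_estimators": [100, 300]}` enables the search; the `clf__` prefix routes the parameter to the classifier step. Leaving `param_grid` as `None` fits the plain pipeline.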
