This project detects fake images using machine learning: features are extracted from real and AI-generated images, and several classification models (Random Forest, SVM, and Logistic Regression) are implemented and compared to accurately distinguish real images from fake ones.
The dataset comprises approximately 3,400 real and fake images of seas, mountains, and jungles, distributed evenly. The fake images were created with AI generative models, including Stable Diffusion, DALL·E, DreamStudio, Craiyon, and Midjourney.
A few sample images are provided below:
(Table of sample images: a real and a fake example for each of the Sea, Jungle, and Mountain categories.)
In addition to the required feature extraction, precomputed deep features are already available in "features.csv", with the corresponding labels in "labels.csv".
Data Preparation
Feature Extraction
In this project, handcrafted features were extracted in addition to the provided deep features, using two commonly employed techniques: Local Binary Patterns (LBP) and the Fast Fourier Transform (FFT).
LBP is a texture descriptor that characterizes the local structure of an image by comparing the intensity of a central pixel with its surrounding neighbors. By applying LBP, the project aimed to capture textural details that contribute to understanding and classifying the images. The LBP extractor is implemented as follows:
```python
import numpy as np
from skimage import feature, io
from skimage.transform import resize

# Method of the feature-extraction class (note the self parameter)
def lbp(self, path):
    try:
        image_path = self.image_dir + path
        image = io.imread(image_path, as_gray=True)
        image = resize(image, self.image_size)
        # Uniform LBP with 8 neighbors at radius 1
        lbp = feature.local_binary_pattern(image, 8, 1, method='uniform')
        # Summarize the LBP codes as a histogram feature vector
        histogram, _ = np.histogram(lbp.ravel(), bins=np.arange(0, 59))
        return histogram
    except Exception as e:
        print(e)
        print(image_path)
        print("Something happened in LBP")
```
FFT transforms a signal from the time domain to the frequency domain, enabling
the identification of different frequency components within the signal. By using FFT,
we extracted frequency-based features that could provide insights into the underlying
patterns or characteristics of the data.
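A minimal sketch of such an FFT-based extractor is shown below; the function name, image size, and the histogram summary of the log-magnitude spectrum are illustrative assumptions rather than the project's exact implementation.

```python
import numpy as np
from skimage import io
from skimage.transform import resize

def fft_features(image_path, image_size=(128, 128), bins=64):
    """Illustrative FFT feature extractor (assumed design, not the project's exact code)."""
    image = io.imread(image_path, as_gray=True)
    image = resize(image, image_size)
    # 2-D FFT, shifted so low frequencies sit at the center of the spectrum
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    # Log-magnitude spectrum is easier to summarize than raw complex values
    magnitude = np.log1p(np.abs(spectrum))
    # Summarize the spectrum as a fixed-length histogram to use as a feature vector
    histogram, _ = np.histogram(magnitude.ravel(), bins=bins)
    return histogram
```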
In this project, the dataset has been preprocessed as below:
Handling Null Values: In any real-world dataset, there are usually a few null values.
Data Cleansing: Data cleansing is the process of identifying and correcting corrupt or inaccurate records in a dataset. It involves detecting incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting them. For instance:
A few labels are incorrect, such as "forest" or "Jungle" instead of "jungle", and "DALL.E" and other variants instead of "dalle".
Irrelevant parts of the labels, such as image file extensions and student IDs, are removed.
Standardization: The features are transformed so that each has a mean of 0 and a standard deviation of 1.
Train & Test Split: The data is split into a training set, used to fit the models, and a test set, used to evaluate them.
Labels Summary: the label "none" denotes a real image (none ≡ real).
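A minimal sketch of the preprocessing and label-handling steps above is given below; the label column name, the cleaning map, and the 80/20 split are illustrative assumptions, not the project's exact code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

features = pd.read_csv("features.csv")   # precomputed deep features
labels = pd.read_csv("labels.csv")       # corresponding labels

# Handle null values by dropping incomplete rows (replacement would also work)
mask = features.notna().all(axis=1) & labels.notna().all(axis=1)
features, labels = features[mask], labels[mask]

# Clean noisy labels: lower-case, strip file extensions, and merge synonyms
# (the column name "label" and the exact mapping are illustrative assumptions)
cleaned = (labels["label"].astype(str)
           .str.lower()
           .str.replace(r"\.(jpg|jpeg|png)$", "", regex=True)
           .replace({"forest": "jungle", "dall.e": "dalle"}))

# "none" denotes a real image; everything else was produced by a generative model
y = np.where(cleaned == "none", "real", "fake")

# Standardize the features to zero mean and unit variance
X = StandardScaler().fit_transform(features)

# Hold out a test set for evaluation (the 80/20 split is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```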
Dimension Reduction
This project uses the PCA and LOL techniques for dimensionality reduction.
Linear Optimal Low-Rank Projection (LOL):
The key intuition behind LOL is that we can jointly use the means and variances from
each class (like LDA and CCA), but without requiring more dimensions than samples
(like PCA), or restrictive sparsity assumptions. Using random matrix theory, we are
able to prove that when the data are sampled from a Gaussian, LOL finds a better
low-dimensional representation than PCA, LDA, CCA, and other linear methods.
Principal Component Analysis (PCA):
PCA is a widely used technique for dimension reduction. It identifies a new set of variables, called principal components, that are linear combinations of the original features.
These components are ordered in terms of the amount of variance they explain in the
data.
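As an illustration, reducing the standardized deep features to 3 components with scikit-learn could look like the following sketch (not the project's exact code):

```python
from sklearn.decomposition import PCA

# X is the standardized deep-feature matrix from the preprocessing sketch above
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)

# Fraction of the total variance captured by the three retained components
print(pca.explained_variance_ratio_.sum())
```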
PCA results for reducing deep features into 3 dimensions are shown below:
Classification
For classification, three models are implemented: Logistic Regression, SVM, and Random Forest.
Logistic Regression
For training the model, the "Newton-Cholesky" solver is used, which is recommended when the number of samples is much larger than the number of features.
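A minimal sketch of this step with scikit-learn, continuing from the preprocessing sketch above (the max_iter value is an assumption):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# X_train, X_test, y_train, y_test come from the preprocessing sketch above
clf = LogisticRegression(solver="newton-cholesky", max_iter=1000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```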
The classification report and the confusion matrix are shown below, demonstrating the performance of the model:
Result
(Classification reports and confusion matrices for the deep features and the handcrafted features.)
SVM
The optimization problem for the soft-margin SVM is written as follows:
$$
\begin{align*}
\min_{w, b, \xi} \quad & \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad & y_i(w^T x_i + b) \geq 1 - \xi_i, \quad i = 1, 2, \ldots, n \\
& \xi_i \geq 0, \quad i = 1, 2, \ldots, n
\end{align*}
$$
C is a hyperparameter that determines the trade-off between a lower training error and a larger margin. To choose it, we used grid search; the best C is 0.1 for the deep features and 1 for the handcrafted features.
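A hedged sketch of such a grid search with scikit-learn; the RBF kernel, the candidate C values, and the 5-fold cross-validation are assumptions:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# X_train, y_train come from the preprocessing sketch above
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}   # candidate values are an assumption
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)   # e.g. {'C': 0.1} for the deep features
best_svm = search.best_estimator_
```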
Result
(Classification reports and confusion matrices for the deep features and the handcrafted features.)
Random Forest
Random forest (or random decision forest) is an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time.
Two important hyperparameters to tune in the random forest method are the number of estimators and the maximum depth. The best hyperparameters for the deep features are found by randomized search cross-validation (RandomizedSearchCV):
Best Hyperparameters: {'n_estimators': 85, 'max_depth': 100}
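A sketch of how such a randomized search could be run with scikit-learn; the search ranges, number of iterations, and cross-validation folds are assumptions:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# X_train, y_train come from the preprocessing sketch above;
# the search ranges are illustrative assumptions
param_distributions = {
    "n_estimators": randint(10, 200),
    "max_depth": randint(5, 150),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=20,
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_)  # reported above as {'n_estimators': 85, 'max_depth': 100}
```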
The classification report and the confusion matrix are shown below, demonstrating how well the model works with the deep features:
(Classification report and confusion matrix for the deep features.)
Also, the first tree is shown below:
Clustering
For clustering, two models are implemented: Mini-Batch K-Means and the Gaussian Mixture Model.
Mini Batch K-Means
The Mini-Batch K-means algorithm is utilized as a solution to the increasing computation time of the traditional K-means algorithm when analyzing large datasets.
The clustering results for different numbers of clusters are shown below:
(Cluster visualizations for 2, 3, 6, 9, and 50 clusters, for both the deep and the handcrafted features.)
The best number of clusters is the elbow point in the plot of inertia with respect to the number of clusters:
(Inertia plots for the deep features and the handcrafted features.)
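A minimal sketch of this elbow analysis with scikit-learn's MiniBatchKMeans, using the cluster counts listed above; the batch size and plotting details are assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import MiniBatchKMeans

# X is the standardized feature matrix from the preprocessing sketch above
cluster_counts = [2, 3, 6, 9, 50]
inertias = []
for k in cluster_counts:
    kmeans = MiniBatchKMeans(n_clusters=k, batch_size=256, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

# The elbow of this curve suggests the best number of clusters
plt.plot(cluster_counts, inertias, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.show()
```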
Gaussian Mixture Model
Gaussian Mixture Models (GMMs) are powerful probabilistic models used for clustering and density estimation. By combining multiple Gaussian components, GMMs
can represent various data patterns and capture the underlying structure of the data.
The Expectation-Maximization (EM) algorithm is commonly employed to estimate the parameters of Gaussian Mixture Models (GMMs), including the mean, covariance, and cluster weights.
To find the optimal number of components in the mixture, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are commonly used measures.
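A sketch of selecting the number of components with AIC/BIC using scikit-learn's GaussianMixture; the candidate range and covariance type are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# X is the standardized feature matrix from the preprocessing sketch above;
# the candidate component counts are an assumption
component_counts = list(range(2, 11))
aics, bics = [], []
for k in component_counts:
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=42)
    gmm.fit(X)
    aics.append(gmm.aic(X))
    bics.append(gmm.bic(X))

# Lower AIC/BIC indicates a better trade-off between fit and model complexity
best_k = component_counts[int(np.argmin(bics))]
print(best_k)
```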