This project detects fake images using machine learning: features are extracted from real and AI-generated images, and several classification models (Random Forest, SVM, and Logistic Regression) are implemented and compared to accurately distinguish real images from fake ones.
The dataset comprises approximately 3,400 real and fake images of seas, mountains, and jungles, distributed evenly. The fake images were created with AI generative models, including Stable Diffusion, DALL·E, DreamStudio, Craiyon, and Midjourney.
A few sample images are provided below:
(Table of sample images: a real and a fake example for each of the Sea, Jungle, and Mountain categories.)
In addition to the required feature extraction, precomputed deep features are already available in "features.csv", with the corresponding labels in "labels.csv".
Data Preparation
Feature Extraction
In this project, handcrafted features were extracted in addition to the provided deep features, using two commonly employed techniques: Local Binary Patterns (LBP) and the Fast Fourier Transform (FFT).
LBP is a texture descriptor that characterizes the local structure of an image by comparing the intensity of a central pixel with its surrounding neighbors. By applying LBP, the project aimed to capture textural details that contribute to understanding and classifying the images. The LBP extractor is implemented as follows:
```python
import numpy as np
from skimage import feature, io
from skimage.transform import resize

# Method of the feature-extraction class (note the self parameter)
def lbp(self, path):
    try:
        image_path = self.image_dir + path
        image = io.imread(image_path, as_gray=True)
        image = resize(image, self.image_size)
        # Uniform LBP with 8 neighbors at radius 1
        lbp = feature.local_binary_pattern(image, 8, 1, method='uniform')
        # Summarize the LBP codes as a histogram feature vector
        histogram, _ = np.histogram(lbp.ravel(), bins=np.arange(0, 59))
        return histogram
    except Exception as e:
        print(e)
        print(image_path)
        print("Something happened in LBP")
```
FFT transforms a signal from the time domain to the frequency domain, enabling
the identification of different frequency components within the signal. By using FFT,
we extracted frequency-based features that could provide insights into the underlying
patterns or characteristics of the data.
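A minimal sketch of such an FFT-based extractor is shown below; the function name, image size, and the histogram summary of the log-magnitude spectrum are illustrative assumptions rather than the project's exact implementation.

```python
import numpy as np
from skimage import io
from skimage.transform import resize

def fft_features(image_path, image_size=(128, 128), bins=64):
    """Illustrative FFT feature extractor (assumed design, not the project's exact code)."""
    image = io.imread(image_path, as_gray=True)
    image = resize(image, image_size)
    # 2-D FFT, shifted so low frequencies sit at the center of the spectrum
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    # Log-magnitude spectrum is easier to summarize than raw complex values
    magnitude = np.log1p(np.abs(spectrum))
    # Summarize the spectrum as a fixed-length histogram to use as a feature vector
    histogram, _ = np.histogram(magnitude.ravel(), bins=bins)
    return histogram
```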
In this project, the dataset has been preprocessed as below:
Handling Null Values: In any real-world dataset, there are usually a few null values.
Data Cleansing: Data cleansing is the process of identifying and correcting corrupt or inaccurate records in a dataset. It involves detecting incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting them. For instance:
A few labels are incorrect, such as "forest" or "Jungle" instead of "jungle", and "DALL.E" and other variants instead of "dalle".
Irrelevant parts of the labels, such as image file extensions and student IDs, are removed.
Standardization: The features are transformed so that each has a mean of 0 and a standard deviation of 1.
Train & Test Split: The data is split into a training set, used to fit the models, and a test set, used to evaluate them.
Labels Summary: the label "none" denotes a real image (none ≡ real).
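A minimal sketch of the preprocessing and label-handling steps above is given below; the label column name, the cleaning map, and the 80/20 split are illustrative assumptions, not the project's exact code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

features = pd.read_csv("features.csv")   # precomputed deep features
labels = pd.read_csv("labels.csv")       # corresponding labels

# Handle null values by dropping incomplete rows (replacement would also work)
mask = features.notna().all(axis=1) & labels.notna().all(axis=1)
features, labels = features[mask], labels[mask]

# Clean noisy labels: lower-case, strip file extensions, and merge synonyms
# (the column name "label" and the exact mapping are illustrative assumptions)
cleaned = (labels["label"].astype(str)
           .str.lower()
           .str.replace(r"\.(jpg|jpeg|png)$", "", regex=True)
           .replace({"forest": "jungle", "dall.e": "dalle"}))

# "none" denotes a real image; everything else was produced by a generative model
y = np.where(cleaned == "none", "real", "fake")

# Standardize the features to zero mean and unit variance
X = StandardScaler().fit_transform(features)

# Hold out a test set for evaluation (the 80/20 split is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```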
Dimension Reduction
This project uses the PCA and LOL techniques for dimensionality reduction.
Linear Optimal Low-Rank Projection (LOL):
The key intuition behind LOL is that we can jointly use the means and variances from
each class (like LDA and CCA), but without requiring more dimensions than samples
(like PCA), or restrictive sparsity assumptions. Using random matrix theory, we are
able to prove that when the data are sampled from a Gaussian, LOL finds a better
low-dimensional representation than PCA, LDA, CCA, and other linear methods.
Principal Component Analysis (PCA):
PCA is a widely used technique for dimension reduction. It identifies a new set of variables, called principal components, that are linear combinations of the original features.
These components are ordered in terms of the amount of variance they explain in the
data.
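As an illustration, reducing the standardized deep features to 3 components with scikit-learn could look like the following sketch (not the project's exact code):

```python
from sklearn.decomposition import PCA

# X is the standardized deep-feature matrix from the preprocessing sketch above
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)

# Fraction of the total variance captured by the three retained components
print(pca.explained_variance_ratio_.sum())
```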
PCA results for reducing deep features into 3 dimensions are shown below:
Classification
For classification, three models are implemented: Logistic Regression, SVM, and Random Forest.
Logistic Regression
For training the model, the "Newton-Cholesky" solver is used, which is recommended when the number of samples is much larger than the number of features.
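A minimal sketch of this step with scikit-learn, continuing from the preprocessing sketch above (the max_iter value is an assumption):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# X_train, X_test, y_train, y_test come from the preprocessing sketch above
clf = LogisticRegression(solver="newton-cholesky", max_iter=1000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```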
The classification report and the confusion matrix are shown below, demonstrating the performance of the model:
Result
(Classification reports and confusion matrices for the deep features and the handcrafted features.)
SVM
The optimization problem for the soft-margin SVM is written as follows:
$$
\begin{align*}
\min_{w, b, \xi} \quad & \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad & y_i(w^T x_i + b) \geq 1 - \xi_i, \quad i = 1, 2, \ldots, n \\
& \xi_i \geq 0, \quad i = 1, 2, \ldots, n
\end{align*}
$$
C is a hyperparameter that determines the trade-off between a lower training error and a larger margin. To choose it, we used grid search; the best C is 0.1 for the deep features and 1 for the handcrafted features.
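A hedged sketch of such a grid search with scikit-learn; the RBF kernel, the candidate C values, and the 5-fold cross-validation are assumptions:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# X_train, y_train come from the preprocessing sketch above
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}   # candidate values are an assumption
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)   # e.g. {'C': 0.1} for the deep features
best_svm = search.best_estimator_
```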
Result
(Classification reports and confusion matrices for the deep features and the handcrafted features.)
Random Forest
Random forest (or random decision forest) is an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time.
Two important hyperparameters to tune in the random forest method are the number of estimators and the maximum depth. The best hyperparameters for the deep features are found by randomized search cross-validation (RandomizedSearchCV):
Best Hyperparameters: {'n_estimators': 85, 'max_depth': 100}
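A sketch of how such a randomized search could be run with scikit-learn; the search ranges, number of iterations, and cross-validation folds are assumptions:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# X_train, y_train come from the preprocessing sketch above;
# the search ranges are illustrative assumptions
param_distributions = {
    "n_estimators": randint(10, 200),
    "max_depth": randint(5, 150),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=20,
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_)  # reported above as {'n_estimators': 85, 'max_depth': 100}
```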
The classification report and the confusion matrix are shown below, demonstrating how well the model works with the deep features:
(Classification report and confusion matrix for the deep features.)
Also, the first tree is shown below:
Clustering
For clustering, two models are implemented: Mini-Batch K-Means and the Gaussian Mixture Model.
Mini Batch K-Means
The Mini-Batch K-means algorithm is utilized as a solution to the increasing computation time of the traditional K-means algorithm when analyzing large datasets.
The clustering results for different numbers of clusters are shown below:
(Cluster visualizations for 2, 3, 6, 9, and 50 clusters, for both the deep and the handcrafted features.)
The best number of clusters is the elbow point in the plot of inertia with respect to the number of clusters:
(Inertia plots for the deep features and the handcrafted features.)
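A minimal sketch of this elbow analysis with scikit-learn's MiniBatchKMeans, using the cluster counts listed above; the batch size and plotting details are assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import MiniBatchKMeans

# X is the standardized feature matrix from the preprocessing sketch above
cluster_counts = [2, 3, 6, 9, 50]
inertias = []
for k in cluster_counts:
    kmeans = MiniBatchKMeans(n_clusters=k, batch_size=256, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

# The elbow of this curve suggests the best number of clusters
plt.plot(cluster_counts, inertias, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.show()
```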
Gaussian Mixture Model
Gaussian Mixture Models (GMMs) are powerful probabilistic models used for clustering and density estimation. By combining multiple Gaussian components, GMMs
can represent various data patterns and capture the underlying structure of the data.
The Expectation-Maximization (EM) algorithm is commonly employed to estimate the parameters of Gaussian Mixture Models (GMMs), including the mean, covariance, and cluster weights.
To find the optimal number of components in the mixture, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are commonly used measures.
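A sketch of selecting the number of components with AIC/BIC using scikit-learn's GaussianMixture; the candidate range and covariance type are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# X is the standardized feature matrix from the preprocessing sketch above;
# the candidate component counts are an assumption
component_counts = list(range(2, 11))
aics, bics = [], []
for k in component_counts:
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=42)
    gmm.fit(X)
    aics.append(gmm.aic(X))
    bics.append(gmm.bic(X))

# Lower AIC/BIC indicates a better trade-off between fit and model complexity
best_k = component_counts[int(np.argmin(bics))]
print(best_k)
```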