In this problem, you are required to apply various clustering techniques to the dataset SyntheticQ1.csv, an artificial dataset containing 4 convex clusters. Each instance has two attributes (X and Y), delimited by semicolons.
Data Preprocessing
- Remove records with missing values (empty or marked with ‘?’).
- Remove records where X or Y has a negative value.
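A minimal preprocessing sketch in Python (pandas), assuming the file sits in the working directory, uses ';' as the delimiter, and names its columns X and Y; adjust the path and column names if the actual file differs:

```python
import pandas as pd

# Read the semicolon-delimited file; treat empty cells and '?' as missing values.
df = pd.read_csv("SyntheticQ1.csv", sep=";", na_values=["?", ""])

# Coerce the attributes to numeric (any leftover junk becomes NaN), then
# drop records with missing values and records where X or Y is negative.
df[["X", "Y"]] = df[["X", "Y"]].apply(pd.to_numeric, errors="coerce")
df = df.dropna(subset=["X", "Y"])
df = df[(df["X"] >= 0) & (df["Y"] >= 0)].reset_index(drop=True)
```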
Clustering and Visualization
- K-means Clustering: Apply K-means to generate 4 clusters (a combined sketch of all the clustering steps follows this list).
- Scatter Plot Visualization of the clusters.
- DBSCAN Clustering: Use ε = 0.5 and minPts = 3 on the dataset.
- Visualize DBSCAN clusters using a scatter plot.
- Single-Linkage Hierarchical Clustering: Generate 4 partitions and visualize with a scatter plot.
- Complete-Linkage Hierarchical Clustering: Generate 4 partitions and visualize with a scatter plot.
- Average-Linkage Hierarchical Clustering: Generate 4 partitions and visualize with a scatter plot.
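One way to cover all five clustering tasks is a single loop over scikit-learn estimators. The sketch below assumes the cleaned DataFrame `df` from the preprocessing step and plots each result directly in the X/Y plane (no dimensionality reduction, per the note below):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans

X = df[["X", "Y"]].values

models = {
    "K-means (k=4)": KMeans(n_clusters=4, n_init=10, random_state=0),
    "DBSCAN (eps=0.5, minPts=3)": DBSCAN(eps=0.5, min_samples=3),
    "Single linkage (4 clusters)": AgglomerativeClustering(n_clusters=4, linkage="single"),
    "Complete linkage (4 clusters)": AgglomerativeClustering(n_clusters=4, linkage="complete"),
    "Average linkage (4 clusters)": AgglomerativeClustering(n_clusters=4, linkage="average"),
}

fig, axes = plt.subplots(1, 5, figsize=(22, 4))
for ax, (name, model) in zip(axes, models.items()):
    labels = model.fit_predict(X)  # cluster label per instance (-1 marks DBSCAN noise)
    ax.scatter(X[:, 0], X[:, 1], c=labels, s=12, cmap="viridis")
    ax.set_title(name)
    ax.set_xlabel("X")
    ax.set_ylabel("Y")
plt.tight_layout()
plt.show()
```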
Comparison
- Briefly compare and explain the outcomes of all clustering techniques.
Note: Do not use dimensionality reduction during visualization.
Apply clustering techniques to the seeds.csv dataset, which contains 4 attributes related to plant seeds: length, width, asymmetry coefficient, and compactness coefficient.
- Elbow Method: Determine the optimal number of clusters (see the sketch after this list).
- K-means Clustering: Use the optimal number of clusters.
- Visualization and Heatmap:
  - Scatter plot of the clusters with dimensionality reduction.
  - Heatmap of the clustering results.
- Hierarchical Clustering: Apply hierarchical clustering with K partitions (found from the elbow method).
- Comparison: Briefly compare K-means and hierarchical clustering outcomes.
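The sketch below walks through these steps with scikit-learn and seaborn, assuming seeds.csv is comma-delimited and contains only the four feature columns; the chosen number of clusters (`k_opt = 3` here) is a placeholder to be read off the elbow plot for the real data:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

seeds = pd.read_csv("seeds.csv")
X = StandardScaler().fit_transform(seeds.values)  # scale features before clustering

# Elbow method: within-cluster sum of squares (inertia) for k = 1..10.
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia (WCSS)")
plt.title("Elbow method")
plt.show()

k_opt = 3  # hypothetical choice; read the elbow off the plot above

# K-means with the chosen k, visualised on the first two principal components.
km_labels = KMeans(n_clusters=k_opt, n_init=10, random_state=0).fit_predict(X)
pcs = PCA(n_components=2).fit_transform(X)
plt.scatter(pcs[:, 0], pcs[:, 1], c=km_labels, cmap="viridis", s=15)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title(f"K-means (k={k_opt}) on PCA projection")
plt.show()

# Heatmap of per-cluster mean feature values.
cluster_means = pd.DataFrame(X, columns=seeds.columns).groupby(km_labels).mean()
sns.heatmap(cluster_means, annot=True, cmap="coolwarm")
plt.title("Cluster profiles (standardised feature means)")
plt.show()

# Hierarchical clustering with the same number of partitions.
hc_labels = AgglomerativeClustering(n_clusters=k_opt).fit_predict(X)
plt.scatter(pcs[:, 0], pcs[:, 1], c=hc_labels, cmap="viridis", s=15)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title(f"Hierarchical clustering ({k_opt} partitions)")
plt.show()
```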
Analyze the stones.csv dataset, which includes height, width, density, compactness, and texture, with class labels (A to F).
Classification Modeling
- Split the dataset into 60% train and 40% test.
- Decision Tree: Build a model and generate a confusion matrix and classification report.
- KNN (K=5): Build a model and generate a confusion matrix and classification report.
- SVM (Polynomial kernel, degree 3): Build a model and generate a confusion matrix and classification report.
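A sketch of the three models, assuming stones.csv has the five feature columns plus a label column named "class" (rename to match the actual header):

```python
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

stones = pd.read_csv("stones.csv")
X = stones.drop(columns=["class"])
y = stones["class"]

# 60% train / 40% test, stratified so all classes A-F appear in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN (K=5)": KNeighborsClassifier(n_neighbors=5),
    "SVM (poly, degree 3)": SVC(kernel="poly", degree=3),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"=== {name} ===")
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
```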
New Entry Classification
- Classify a new entry with specified feature values using all models.
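Continuing from the sketch above, the new entry can be passed through every trained model; the feature values below are placeholders for the values specified in the assignment, and the column names are assumed to match the training features:

```python
# Placeholder feature values; substitute the values given in the assignment.
new_entry = pd.DataFrame([{
    "height": 0.0, "width": 0.0, "density": 0.0,
    "compactness": 0.0, "texture": 0.0,
}])

for name, model in models.items():
    print(f"{name}: predicted class = {model.predict(new_entry)[0]}")
```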
Using the spambase.data dataset from the UCI repository, perform 500 random splits (80% train, 20% test) and apply the following classification techniques:
- Decision Trees
- KNN
- Support Vector Machines
- Logistic Regression
- Naïve Bayes
Generate a summary table showing average precision, recall, F1 score, and accuracy across the 500 splits.
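A sketch of the 500-split experiment, assuming spambase.data has no header row and its last column is the 0/1 spam label (as in the UCI distribution). Note that 500 splits with an SVM can take a long time, so consider a smaller number of splits while testing:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv("spambase.data", header=None)
X, y = data.iloc[:, :-1].values, data.iloc[:, -1].values

models = {
    "Decision Tree": DecisionTreeClassifier(),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Naive Bayes": GaussianNB(),
}

n_splits = 500
scores = {name: [] for name in models}
for seed in range(n_splits):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    for name, model in models.items():
        y_pred = model.fit(X_tr, y_tr).predict(X_te)
        p, r, f1, _ = precision_recall_fscore_support(y_te, y_pred, average="binary")
        scores[name].append((p, r, f1, accuracy_score(y_te, y_pred)))

# Average each metric over the 500 splits and print the summary table.
summary = pd.DataFrame(
    {name: np.mean(vals, axis=0) for name, vals in scores.items()},
    index=["precision", "recall", "f1", "accuracy"],
).T
print(summary)
```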
Select a benchmark dataset, partition it into train and test sets, and apply classification techniques. For each classification technique, plot the variations in accuracy, precision, recall, and F1 score for different train/test ratios.
Dataset Source: UCI Repository
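As a sketch, the loop below reuses the `models` dictionary and `(X, y)` from the previous example as a stand-in benchmark dataset, varies the test fraction, and plots macro-averaged metrics per model; swap in whichever UCI dataset you select:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

test_sizes = [0.1, 0.2, 0.3, 0.4, 0.5]
metric_fns = {
    "accuracy": accuracy_score,
    "precision": lambda yt, yp: precision_score(yt, yp, average="macro"),
    "recall": lambda yt, yp: recall_score(yt, yp, average="macro"),
    "f1": lambda yt, yp: f1_score(yt, yp, average="macro"),
}

for name, model in models.items():
    results = {m: [] for m in metric_fns}
    for ts in test_sizes:
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=ts, random_state=0)
        y_pred = model.fit(X_tr, y_tr).predict(X_te)
        for m, fn in metric_fns.items():
            results[m].append(fn(y_te, y_pred))

    # One figure per classifier: metric scores versus training fraction.
    plt.figure()
    for m, vals in results.items():
        plt.plot([1 - ts for ts in test_sizes], vals, marker="o", label=m)
    plt.xlabel("Training fraction")
    plt.ylabel("Score")
    plt.title(name)
    plt.legend()
    plt.show()
```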