Problem Statement: Perform statistical analysis on the Iris flower dataset.
Description: The Iris flower dataset consists of 150 samples: 50 from each of 3 species of iris flower, namely setosa, versicolor and virginica. It has 4 numerical input features (sepal length, sepal width, petal length and petal width) and 1 categorical target variable (species).
Libraries Used: NumPy, pandas, SciPy, Matplotlib, scikit-learn, statsmodels, seaborn
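
As a minimal sketch, the dataset described above can be loaded and checked as follows. This assumes the copy bundled with scikit-learn (`load_iris`); the project may have read the data from a different source, and the `species` column name is chosen here for illustration.

```python
from sklearn.datasets import load_iris

# Assumption: use the copy of the dataset that ships with scikit-learn.
iris = load_iris(as_frame=True)
df = iris.frame.copy()  # 4 feature columns plus an integer "target" column

# Replace the integer target with readable species names.
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))
df = df.drop(columns="target")

print(df.shape)                      # rows and columns
print(df.dtypes)                     # four float columns, one object column
print(df["species"].value_counts())  # number of samples per species
```

The per-species counts printed at the end are a quick way to confirm whether the classes are balanced.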
What we have learned so far from this project:
- The dataset has four numerical columns and one categorical column, species, which is the target
- The dataset is balanced: every species has the same number of instances
- Petal length and petal width are very highly correlated
- The setosa species is the most easily distinguishable because its measurements are tightly clustered, with little spread
- The versicolor and virginica species are harder to tell apart because their feature values overlap
- All input features (sepal length, sepal width, petal length and petal width) are statistically significant in distinguishing the species of iris flower
- The three species (setosa, versicolor and virginica) have different petal lengths, with only partial overlap between versicolor and virginica
- We have verified that the species means are significantly different for all four input features
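
The correlation and significance findings above can be reproduced with a short script. This is a sketch under stated assumptions: the source does not say which statistical test was used, so a one-way ANOVA (`scipy.stats.f_oneway`) is assumed here, along with the scikit-learn copy of the data and a 0.05 significance level.

```python
from scipy import stats
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy of the dataset.
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target  # features as a DataFrame, species as integer codes

# Pearson correlation between petal length and petal width.
r = X["petal length (cm)"].corr(X["petal width (cm)"])
print(f"petal length vs petal width: r = {r:.2f}")

# One-way ANOVA per feature: do the three species share the same mean?
# (Assumed test; the original analysis may have used a different one.)
anova_p = {}
for feature in X.columns:
    groups = [X.loc[y == k, feature] for k in range(len(iris.target_names))]
    f_stat, p_value = stats.f_oneway(*groups)
    anova_p[feature] = p_value
    print(f"{feature}: F = {f_stat:.1f}, p = {p_value:.3g}")
```

A p-value below the chosen threshold for a feature suggests that at least one species mean differs from the others for that feature. ANOVA alone does not say which pairs differ; a post-hoc test such as Tukey's HSD (available in statsmodels) would be needed for that.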