Iris flower data set

The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by the British statistician, eugenicist, and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula, "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus". Fisher's paper was published in the Annals of Eugenics, which has led to controversy over the continued use of the Iris data set for teaching statistical techniques.
The data set consists of 50 samples from each of three species of Iris. Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
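As a rough illustration of a linear discriminant model built on the four features, the sketch below fits scikit-learn's LinearDiscriminantAnalysis to the data. This is an assumption made for illustration: it uses a modern multiclass implementation, not a reproduction of Fisher's original 1936 computation, and the variable names are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Load the 150 x 4 measurement matrix and the species labels (0, 1, 2).
X, y = load_iris(return_X_y=True)

# Fit a linear discriminant model on all four features.
# Assumption: scikit-learn's implementation stands in for Fisher's method.
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# Fraction of samples assigned to their own species. This is training
# accuracy, so it is an optimistic estimate of performance on new flowers.
train_accuracy = lda.score(X, y)
print(f"training accuracy: {train_accuracy:.3f}")

# Project the samples onto the discriminant axes.
# Three classes give at most two discriminant axes.
X_projected = lda.transform(X)
print(X_projected.shape)
```

Because the model is scored on the same samples it was fitted to, a held-out split or cross-validation would be needed for an honest estimate of accuracy.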

Use of the data set

Based on Fisher's linear discriminant model, this data set became a typical test case for many statistical classification techniques in machine learning, such as support vector machines.
Its use in cluster analysis, however, is uncommon, since the data set contains only two clusters with rather obvious separation. One cluster contains Iris setosa, while the other contains both Iris virginica and Iris versicolor and cannot be separated without the species information Fisher used. This makes the data set a good example for explaining the difference between supervised and unsupervised techniques in data mining: Fisher's linear discriminant model can be obtained only when the species of each object is known, because class labels and clusters are not necessarily the same.
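The two-cluster structure described above can be checked with a short sketch. It assumes k-means from scikit-learn as the clustering method (the text does not name one) and asks for two clusters, then cross-tabulates the clusters against the species labels, which the algorithm never sees.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Cluster the measurements without showing the algorithm the species labels.
# Assumption: k-means with k=2 is used purely as an illustrative method.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

# Cross-tabulate species (rows) against cluster ids (columns).
table = np.zeros((3, 2), dtype=int)
for species, cluster in zip(y, cluster_labels):
    table[species, cluster] += 1

for name, row in zip(iris.target_names, table):
    print(f"{name:<12} {row}")
```

The cluster ids themselves are arbitrary; what matters is which species share a column of the table.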
Nevertheless, all three species of Iris are separable in the projection onto the nonlinear and branching principal component. The data set is approximated by the closest tree, with some penalty for an excessive number of nodes and for bending and stretching. Then the so-called "metro map" is constructed. The data points are projected onto the closest node, and for each node a pie diagram of the projected points is prepared. The area of each pie is proportional to the number of projected points. The diagram shows that the great majority of the samples of the different Iris species belong to different nodes. Only a small fraction of Iris virginica is mixed with Iris versicolor. Therefore, the three species of Iris are separable by the unsupervised procedures of nonlinear principal component analysis. To discriminate between them, it is sufficient to select the corresponding nodes on the principal tree.

Data set

The data set contains 150 records under five attributes: sepal length, sepal width, petal length, petal width, and species.
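The record layout described above can be sketched as follows. The example assumes the copy of the data bundled with scikit-learn and builds one five-field record per flower; the names records and species_counts are illustrative only.

```python
from collections import Counter
from sklearn.datasets import load_iris

iris = load_iris()

# Build one record per flower: four measurements in centimeters
# (sepal length, sepal width, petal length, petal width) plus the species name.
records = [
    tuple(map(float, measurements)) + (str(iris.target_names[label]),)
    for measurements, label in zip(iris.data, iris.target)
]

# Count how many records belong to each species.
species_counts = Counter(record[4] for record in records)

print(len(records), "records with", len(records[0]), "attributes each")
print(records[0])
print(dict(species_counts))
```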

The table lists, for each record, its dataset order, sepal length, sepal width, petal length, petal width, and species. Records 1 to 50 are I. setosa, records 51 to 100 are I. versicolor, and records 101 to 150 are I. virginica.

The Iris data set is widely used as a beginner's data set for machine learning. It is included in base R and in the Python machine learning package scikit-learn, so users can access it without having to find a source for it.

The following R code illustrates usage.

  class(iris)   # [1] "data.frame"
  class(iris3)  # [1] "array"

The following Python code illustrates usage.

from sklearn.datasets import load_iris

# Load the bundled Iris data set.
iris = load_iris()

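This code gives a dictionary-like object whose data field holds the 150 × 4 array of measurements and whose target field holds the species labels, encoded as the integers 0, 1, and 2.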
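The object returned by load_iris() can be inspected field by field. The following is a minimal sketch assuming a standard scikit-learn installation; it prints the shape of the measurement array, the feature names, the species names, and the first record.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# The measurements: one row per flower, one column per feature.
print(iris.data.shape)

# Column names for the four measurements.
print(iris.feature_names)

# Species names; iris.target stores integer indices into this array.
print(list(iris.target_names))

# The first record together with its species name.
print(iris.data[0], iris.target_names[iris.target[0]])
```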