[scikit-learn] Creating dataset

Nicolas Hug niourf at gmail.com
Sun Nov 8 11:18:49 EST 2020


load_iris() reads a CSV file, and then retrieves/sets some other info, 
such as the feature names and a description of the dataset (the 
description comes from a separate file).

Then it packs everything into a Bunch object, which is basically a fancy 
dict (its keys can also be accessed as attributes): 
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/__init__.py#L63
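
As a rough illustration of those steps (this is a sketch, not the actual 
scikit-learn implementation), something like the following reads a CSV 
that uses the same layout as iris.csv and packs the result into a Bunch. 
The helper name load_my_dataset, the in-memory CSV_TEXT, and the 
description string are all made up for the example; only csv, numpy and 
sklearn.utils.Bunch are real.

```python
import csv
import io

import numpy as np
from sklearn.utils import Bunch

# Stand-in for a file on disk, using the same layout as iris.csv:
# the first row is "n_samples,n_features,<class names...>".
CSV_TEXT = """\
4,4,setosa,versicolor,virginica
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
7.0,3.2,4.7,1.4,1
6.3,3.3,6.0,2.5,2
"""


def load_my_dataset(csv_file, feature_names, descr=""):
    """Hypothetical helper: read a CSV in the iris.csv layout into a Bunch."""
    reader = csv.reader(csv_file)
    header = next(reader)
    n_samples, n_features = int(header[0]), int(header[1])
    target_names = np.array(header[2:])

    data = np.empty((n_samples, n_features))
    target = np.empty((n_samples,), dtype=int)
    for i, row in enumerate(reader):
        # Every column but the last is a feature; the last is the class index.
        data[i] = np.asarray(row[:-1], dtype=np.float64)
        target[i] = int(row[-1])

    # The metadata is not in the CSV body: it is passed in (or hard-coded)
    # and simply stored next to the arrays.
    return Bunch(
        data=data,
        target=target,
        target_names=target_names,
        feature_names=feature_names,
        DESCR=descr,
    )


my_data = load_my_dataset(
    io.StringIO(CSV_TEXT),
    feature_names=["sepal length (cm)", "sepal width (cm)",
                   "petal length (cm)", "petal width (cm)"],
    descr="Toy dataset in the iris.csv layout (made up for this example).",
)
print(my_data.data.shape)
print(my_data.target_names)
```

To read a real file, you would pass an open file object instead of the 
io.StringIO wrapper.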

You can take inspiration from the source code 
(https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/_base.py#L396) 
if you want to replicate what load_iris() or a fetch_xxx() function 
does, but you do not need a Bunch at all to follow the PCA article that 
you mentioned. As previously noted, you just need to understand what 
each piece is doing at a high level and slightly modify the input to the 
functions according to your needs.
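
For instance, here is a minimal sketch of running PCA on plain NumPy 
arrays with no Bunch involved. I don't know exactly which steps the 
article uses, so the choice of n_components=2 is an assumption, and the 
inline CSV_TEXT is a made-up stand-in for your own file in the iris.csv 
layout.

```python
import io

import numpy as np
from sklearn.decomposition import PCA

# Made-up stand-in for your own CSV file (same layout as iris.csv).
CSV_TEXT = """\
4,4,setosa,versicolor,virginica
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
7.0,3.2,4.7,1.4,1
6.3,3.3,6.0,2.5,2
"""

# Skip the header row; every remaining row is numeric.
raw = np.loadtxt(io.StringIO(CSV_TEXT), delimiter=",", skiprows=1)

X = raw[:, :-1]             # feature columns
y = raw[:, -1].astype(int)  # class index in the last column

# PCA only needs the 2D feature array, not a Bunch.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)
```

Wherever the article writes iris.data or iris.target, you would use X 
and y instead.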



On 11/8/20 2:19 PM, Mahmood Naderan wrote:
> >You need to understand what the different statements are doing; just
> >as you need to understand what processing you apply on your data
> >(whether it's preprocessing or learning) to properly use any machine
> >learning tool.
>
> I know, but the problem is that the csv file of the iris doesn't have 
> such information and as I said, I think there are some additional 
> steps that I don't know exactly what they are.
>
> For example, if you look at 
> ~/.local/lib/python3.6/site-packages/sklearn/datasets/data/iris.csv 
> you will see
>
> 150,4,setosa,versicolor,virginica
> 5.1,3.5,1.4,0.2,0
> 4.9,3.0,1.4,0.2,0
> ...
>
> So, the first line means 150 instances (rows) with 4 columns and three 
> iris types.
> However, when I use
>
> iris = load_iris()
> print(iris)
>
> I see a lot of metadata, such as:
>
> {'data': array([[5.1, 3.5, 1.4, 0.2],
>        [4.9, 3. , 1.4, 0.2],
> ...
>        [5.9, 3. , 5.1, 1.8]]), 'target': array([0, 0,...]), 'frame': 
> None, 'target_names': array(['setosa', 'versicolor', 'virginica'], 
> dtype='<U10'), 'DESCR': '.. _iris_dataset:\n\nIris plants 
> dataset\n--------------------\n\n**Data Set Characteristics:**\n\n   
>  :Number of Instances: 150 (50 in each of three classes)\n    :Number 
> of Attributes: 4 numeric, predictive attributes and the class\n   
>  :Attribute Information:\n        - sepal length in cm\n        - 
> sepal width in cm\n        - petal length in cm\n        - petal width 
> in cm\n        - class:\n                - Iris-Setosa\n               
>  - Iris-Versicolour\n      - Iris-Virginica\n                \n
>
>
> The question is how these metadata are created and stored in this package?
> I mean, what does
>
> from sklearn.datasets import load_iris
>
> do with the csv file? If I know, then I am also able to create a 
> similar dataset.
>
>
> Regards,
> Mahmood