.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/feature_selection/plot_rfe_with_cross_validation.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code, or to run this example in your
        browser via JupyterLite or Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_feature_selection_plot_rfe_with_cross_validation.py:


===================================================
Recursive feature elimination with cross-validation
===================================================

A Recursive Feature Elimination (RFE) example with automatic tuning of the
number of features selected with cross-validation.

.. GENERATED FROM PYTHON SOURCE LINES 10-14

.. code-block:: Python


    # Authors: The scikit-learn developers
    # SPDX-License-Identifier: BSD-3-Clause

.. GENERATED FROM PYTHON SOURCE LINES 15-22

Data generation
---------------

We build a classification task using 3 informative features. The introduction
of 2 additional redundant (i.e. correlated) features has the effect that the
selected features vary depending on the cross-validation fold. The remaining
features are non-informative as they are drawn at random.

.. GENERATED FROM PYTHON SOURCE LINES 22-40

.. code-block:: Python


    from sklearn.datasets import make_classification

    n_features = 15
    feat_names = [f"feature_{i}" for i in range(n_features)]

    X, y = make_classification(
        n_samples=500,
        n_features=n_features,
        n_informative=3,
        n_redundant=2,
        n_repeated=0,
        n_classes=8,
        n_clusters_per_class=1,
        class_sep=0.8,
        random_state=0,
    )

.. GENERATED FROM PYTHON SOURCE LINES 41-46

Model training and selection
----------------------------

We create the RFE object and compute the cross-validated scores. The scoring
strategy "accuracy" optimizes the proportion of correctly classified samples.

.. GENERATED FROM PYTHON SOURCE LINES 46-67

..
.. code-block:: Python


    from sklearn.feature_selection import RFECV
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold

    min_features_to_select = 1  # Minimum number of features to consider
    clf = LogisticRegression()
    cv = StratifiedKFold(5)

    rfecv = RFECV(
        estimator=clf,
        step=1,
        cv=cv,
        scoring="accuracy",
        min_features_to_select=min_features_to_select,
        n_jobs=2,
    )
    rfecv.fit(X, y)

    print(f"Optimal number of features: {rfecv.n_features_}")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Optimal number of features: 3

.. GENERATED FROM PYTHON SOURCE LINES 68-73

In the present case, the model with 3 features (which corresponds to the true
generative model) is found to be optimal.

Plot number of features VS. cross-validation scores
---------------------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 73-94

.. code-block:: Python


    import matplotlib.pyplot as plt
    import pandas as pd

    data = {
        key: value
        for key, value in rfecv.cv_results_.items()
        if key in ["n_features", "mean_test_score", "std_test_score"]
    }
    cv_results = pd.DataFrame(data)
    plt.figure()
    plt.xlabel("Number of features selected")
    plt.ylabel("Mean test accuracy")
    plt.errorbar(
        x=cv_results["n_features"],
        y=cv_results["mean_test_score"],
        yerr=cv_results["std_test_score"],
    )
    plt.title("Recursive Feature Elimination \nwith correlated features")
    plt.show()


.. image-sg:: /auto_examples/feature_selection/images/sphx_glr_plot_rfe_with_cross_validation_001.png
   :alt: Recursive Feature Elimination with correlated features
   :srcset: /auto_examples/feature_selection/images/sphx_glr_plot_rfe_with_cross_validation_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 95-102

From the plot above one can further notice a plateau of equivalent scores
(similar mean value and overlapping errorbars) for 3 to 5 selected features.
This is the result of introducing correlated features.
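
The plateau can also be made explicit in code. As a minimal sketch (not part of
the original example), the hypothetical helper ``plateau_sizes`` below keeps
every feature count whose error bar overlaps the error bar of the best mean
score. The demo arrays are made-up values chosen only to illustrate the rule;
with a fitted ``RFECV`` one would instead pass the ``n_features``,
``mean_test_score`` and ``std_test_score`` entries of ``rfecv.cv_results_``.

.. code-block:: Python

    import numpy as np


    def plateau_sizes(n_features, mean_scores, std_scores):
        """Return the feature counts whose error bar overlaps the best one.

        Hypothetical helper for illustration only; it is not part of
        scikit-learn.
        """
        n_features = np.asarray(n_features)
        mean_scores = np.asarray(mean_scores, dtype=float)
        std_scores = np.asarray(std_scores, dtype=float)

        best = np.argmax(mean_scores)
        # Lower end of the error bar around the best mean score.
        best_lower = mean_scores[best] - std_scores[best]
        # A candidate overlaps if the upper end of its bar reaches that bound.
        overlaps = mean_scores + std_scores >= best_lower
        return n_features[overlaps].tolist()


    # Made-up scores for illustration, NOT the output of the example above.
    demo_n_features = [1, 2, 3, 4, 5, 6, 7]
    demo_mean = [0.20, 0.30, 0.42, 0.41, 0.41, 0.36, 0.33]
    demo_std = [0.02, 0.02, 0.03, 0.03, 0.03, 0.02, 0.02]
    print(plateau_sizes(demo_n_features, demo_mean, demo_std))

Overlapping error bars are only a rough visual criterion, not a statistical
test, so the returned counts should be read as candidates of comparable
quality rather than as a formal equivalence class.
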
Indeed, the optimal model selected by the RFE can lie within this range (3 to 5
features), depending on the cross-validation technique. The test accuracy
decreases above 5 selected features; that is, keeping non-informative features
leads to over-fitting and is therefore detrimental to the statistical
performance of the models.

.. GENERATED FROM PYTHON SOURCE LINES 104-112

.. code-block:: Python


    import numpy as np

    for i in range(cv.n_splits):
        mask = rfecv.cv_results_[f"split{i}_support"][
            rfecv.n_features_
        ]  # mask of features selected by the RFE
        features_selected = np.ma.compressed(np.ma.masked_array(feat_names, mask=1 - mask))
        print(f"Features selected in fold {i}: {features_selected}")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Features selected in fold 0: ['feature_3' 'feature_4' 'feature_8' 'feature_14']
    Features selected in fold 1: ['feature_3' 'feature_4' 'feature_8' 'feature_14']
    Features selected in fold 2: ['feature_3' 'feature_4' 'feature_8' 'feature_14']
    Features selected in fold 3: ['feature_3' 'feature_4' 'feature_8' 'feature_14']
    Features selected in fold 4: ['feature_3' 'feature_4' 'feature_8' 'feature_14']

.. GENERATED FROM PYTHON SOURCE LINES 113-116

The selected features are the same in all five folds. This is good news: it
means that the selection is stable across folds, and it confirms that these
features are the most informative ones.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.768 seconds)


.. _sphx_glr_download_auto_examples_feature_selection_plot_rfe_with_cross_validation.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/1.7.X?urlpath=lab/tree/notebooks/auto_examples/feature_selection/plot_rfe_with_cross_validation.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: lite-badge

      ..
      .. image:: images/jupyterlite_badge_logo.svg
        :target: ../../lite/lab/index.html?path=auto_examples/feature_selection/plot_rfe_with_cross_validation.ipynb
        :alt: Launch JupyterLite
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_rfe_with_cross_validation.ipynb `

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_rfe_with_cross_validation.py `

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_rfe_with_cross_validation.zip `


.. include:: plot_rfe_with_cross_validation.recommendations


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_