Lab 4
Objectives
● Understand dimensionality reduction and why reducing the number of variables matters
● Get introduced to Scikit-learn, a Python library for data analysis and modeling
● Apply common dimensionality reduction techniques in Scikit-learn, including PCA
● Visualize the reduced data to gain insight into its characteristics
● Evaluate the effectiveness of dimensionality reduction
Required tools
Programming language: Python 3
Python libraries:
● Scikit-learn: a data analysis and modeling library that provides data mining and machine
learning algorithms for tasks such as classification, regression, clustering, and dimensionality reduction
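As a rough orientation before opening the notebook, PCA in Scikit-learn can be sketched as follows. The Iris data set, the standardization step, and the choice of two components are assumptions made for illustration only; the lab does not prescribe them.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative data set (an assumption; the lab does not specify one).
X, y = load_iris(return_X_y=True)

# Standardize so that each feature contributes on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)

# Project the 4 original features onto the first 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

# Fraction of the total variance retained by each component.
explained = pca.explained_variance_ratio_
print("Reduced shape:", X_reduced.shape)
print("Explained variance ratio:", np.round(explained, 3))
print("Cumulative explained variance:", round(float(explained.sum()), 3))
```

The two columns of `X_reduced` can be drawn as a scatter plot colored by `y` to inspect the reduced data, and the cumulative explained variance is one common (though not the only) way to judge how much information the reduction keeps.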
Resources
Scikit-learn:
PCA:
Notebooks
● Exercise: PCA from scratch.ipynb
2. Explain how a box plot can indicate whether the values of an attribute are symmetrically
distributed.
3. What are the advantages of using a box plot over a histogram when exploring data
distributions, and in what scenarios would a histogram provide more insight?
4. How does the choice of the number and location of bins affect the appearance of a histogram,
and what methods can be used to reduce this dependency?
5. How can histograms, box plots, and density plots reveal information about skewness?
Additionally, what strategies can be employed to handle skewed data?
6. When should you use equal-width bins versus equal-frequency bins? What considerations
should be taken into account when making this choice?
7. How do you interpret the results of a quantile plot? What do straight lines and deviations from
the line indicate about the distribution of the data?
8. When might stratified sampling be more beneficial than simple random sampling, and how
does it influence the representation of subgroups within a data set?
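To experiment with the binning questions above (4 and 6), the following minimal sketch compares equal-width and equal-frequency bins on the same skewed sample. The synthetic exponential data, the seed, and the number of bins are hypothetical choices for illustration, not part of the lab.

```python
import numpy as np

# Synthetic right-skewed sample (an assumption made for illustration).
rng = np.random.default_rng(0)
values = rng.exponential(scale=1.0, size=1000)

n_bins = 4

# Equal-width bins: edges evenly spaced between the minimum and maximum.
width_counts, width_edges = np.histogram(values, bins=n_bins)

# Equal-frequency bins: edges placed at quantiles, so each bin holds
# roughly the same number of observations.
freq_edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
freq_counts, _ = np.histogram(values, bins=freq_edges)

print("Equal-width edges:     ", np.round(width_edges, 2))
print("Equal-width counts:    ", width_counts)
print("Equal-frequency edges: ", np.round(freq_edges, 2))
print("Equal-frequency counts:", freq_counts)
```

Changing `n_bins` or the distribution and rerunning the sketch shows how strongly the shape of each histogram depends on the binning choice.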