Principal Components Analysis: Theory and Application

Ignacio Cao

Guardians

Driving Questions

  1. How does Principal Components Analysis address complex statistical dilemmas in multivariate datasets?
  2. What is the relationship between Singular Value Decomposition (SVD) and PCA, and how do these methods contribute to data analysis?

Project Introduction

Disciplines/Subjects: Mathematics, Linear Algebra, Statistics, Machine Learning

Key Themes: Matrix Decomposition, Dimensionality Reduction, Statistical Modeling, Real-World Applications


This project explores the application of Principal Components Analysis (PCA) as a statistical tool for dimensionality reduction in real-world datasets. Starting with the foundational theory, learners study the relationship between Singular Value Decomposition (SVD) and PCA, and how PCA can address common statistical dilemmas such as high dimensionality in data. Using Python, learners apply PCA to the "Prostate Cancer" dataset, exploring how the method extracts the components that capture the most variance across various clinical measurements and how those components can be used to predict prostate-specific antigen (PSA) levels. Through this process, learners identify and analyze the principal components, evaluate the results, and compare the PCA-derived model with traditional linear regression models. The project emphasizes both the mathematical theory behind PCA and its practical application in data science. In addition, learners write their own PCA code from scratch using SVD, reflecting on the underlying algorithm and comparing their implementation to established Python library functions.
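The relationship between SVD and PCA mentioned above can be summarized in a few lines. The notation here is an assumption made for illustration and does not come from the project materials: X is an n-by-p data matrix whose columns have been centered, U \Sigma V^T is its thin SVD with singular values \sigma_j, and S is the sample covariance matrix.

```latex
\begin{align*}
X &= U \Sigma V^{\top}, \\
S &= \frac{1}{n-1} X^{\top} X = V \left( \frac{\Sigma^{2}}{n-1} \right) V^{\top}, \\
\lambda_j &= \frac{\sigma_j^{2}}{n-1}, \qquad Z = X V = U \Sigma .
\end{align*}
```

Under these assumptions, the columns of V are the eigenvectors of S (the principal directions, or loading vectors), each eigenvalue \lambda_j is the variance explained by component j, and Z holds the principal component scores. The second line follows from substituting the SVD into X^T X and using U^T U = I.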

Core Competency

Habits of mind: Curiosity, Continuous Learning, Strive for Excellence

Transferable skills: Organizing and Representing Information, Identifying Patterns and Relationships, Modeling

Content Knowledge:

Understanding PCA as a method for dimensionality reduction and its application in machine learning.

Linking Singular Value Decomposition (SVD) theory to PCA.

Using Python or Excel for statistical analysis, including computing loading vectors, drawing biplots, and fitting regression models.

Evaluating statistical models using metrics such as R-squared and residual plots.

Reflecting on PCA algorithms and implementing them through coding.
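The content knowledge items above can be sketched together in one short Python example: PCA computed from scratch via SVD, followed by a regression on the leading component scores that is evaluated with R-squared and residuals. This is a minimal sketch, not the project's reference solution. The function names `pca_svd` and `r_squared` are hypothetical, and the data is synthetic placeholder data rather than the "Prostate Cancer" dataset.

```python
import numpy as np


def pca_svd(X, n_components):
    """PCA of a data matrix X (rows = observations) computed via SVD."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # center each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components].T  # shape (p, k): loading vectors
    scores = Xc @ loadings  # shape (n, k): component scores
    explained_variance = s[:n_components] ** 2 / (X.shape[0] - 1)
    return scores, loadings, explained_variance


def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic placeholder data (assumption): 60 observations, 4 correlated predictors.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 4)) @ rng.normal(size=(4, 4))
y_demo = X_demo @ np.array([1.0, -0.5, 0.0, 2.0]) + rng.normal(scale=0.3, size=60)

scores, loadings, explained_variance = pca_svd(X_demo, n_components=2)

# Regress the response on the first two component scores plus an intercept.
A = np.column_stack([np.ones(len(y_demo)), scores])
coef, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
y_hat = A @ coef
residuals = y_demo - y_hat  # plot these against y_hat for a residual plot
r2 = r_squared(y_demo, y_hat)
```

When comparing the output with a library implementation such as scikit-learn's `PCA`, note that each loading vector is only defined up to sign, so matching columns may differ by a factor of -1. Refitting the regression with all four components should give the same R-squared as ordinary least squares on the original predictors, which is one way to check the implementation.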