How to Conduct Principal Component Analysis (With Steps)

By Indeed Editorial Team

Published 12 October 2022

You can use a principal component analysis to analyse the interrelationships between a set of variables if you want to identify the underlying structure of those variables. It reduces the dimensionality of data while still retaining as much information as possible. Knowing how to conduct one can help you focus your data to make more robust conclusions. In this article, we explain what it is, outline how to conduct such an analysis and answer some FAQs to help you better understand the topic.

What is a principal component analysis?

A principal component analysis (PCA) is a statistical technique that you can use to transform a dataset into a new set of linearly uncorrelated variables. Each new variable is a linear combination of the original variables, and the new variables are arranged in order of how much of the variation in the original dataset they explain. The first new variable explains the most variation in the dataset, the second explains the second most and so on.

You can think of each new variable as a combination of the old variables, where the coefficients indicate how much each old variable contributes to the new one. The main aim of a PCA is to reduce the number of variables in a data set while maintaining as much information as possible. These are two reasons why you might want to conduct a PCA:

  • Make the data easier to visualise: When you have a lot of variables, conducting a PCA can help reduce the number of variables while still retaining most of the information in the dataset. This can make it easier to visualise the data and see patterns, allowing you to gain insights that might be difficult to find with a large number of variables.

  • Find hidden patterns in the data: PCA can also help you to find patterns that are hard to spot when you have a lot of variables. This is because the new variables created by the PCA are based on the relationships between the variables in the dataset, meaning that patterns that aren't obvious when looking at the individual variables can become clearer when looking at the new variables.

How to conduct a principal component analysis

These are the five steps you can follow when conducting a PCA:

1. Calculate the mean and standard deviation for each variable

Firstly, you can calculate the mean and standard deviation for each variable in your dataset and use this information to standardise the data. Standardising the data means converting all of the variables so that they're on the same scale. Performing standardisation before conducting the PCA is important because variables measured on larger scales have larger variances and can dominate the principal components. This can lead to biased results.

Once you calculate the mean and standard deviation for each variable, you can standardise the data. You do this by subtracting the mean and dividing the result by the standard deviation for each value of each variable. As a formula, it looks like this:

Z = (Value - Mean) / Standard deviation
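
The standardisation formula can be sketched in Python with NumPy. The dataset `X` below is hypothetical, and the choice of the sample standard deviation (`ddof=1`) is an assumption, as the formula works with either convention as long as you apply it consistently:

```python
import numpy as np

# Hypothetical dataset: 6 observations (rows) of 3 variables (columns).
X = np.array([
    [2.5, 2.4, 1.2],
    [0.5, 0.7, 0.3],
    [2.2, 2.9, 0.9],
    [1.9, 2.2, 1.1],
    [3.1, 3.0, 0.4],
    [2.3, 2.7, 1.0],
])

mean = X.mean(axis=0)        # mean of each variable (column)
std = X.std(axis=0, ddof=1)  # sample standard deviation (assumed convention)

# Z = (Value - Mean) / Standard deviation, applied column by column.
Z = (X - mean) / std

print(Z.mean(axis=0))         # approximately 0 for every variable
print(Z.std(axis=0, ddof=1))  # approximately 1 for every variable
```

After this step, every column of `Z` has a mean of zero and a standard deviation of one, so no variable dominates because of its units.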

2. Calculate the covariance matrix

Once you standardise your data, you can calculate the covariance matrix, which is a square matrix that tells you how each variable relates to the others. The aim is to see how the variables vary from their means in relation to one another, so you can tell whether any of them carry overlapping information. This means that the covariance matrix can help you to identify patterns in your data.

The diagonal of the covariance matrix contains the variances of the individual variables, while the other values are the covariances between pairs of variables. The matrix is a p × p symmetric matrix, where p is the number of variables (dimensions) in your dataset, so the upper and lower triangular portions mirror each other. Because you standardised the data, each variance on the diagonal equals one and each covariance is the correlation between two variables.
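
As a minimal sketch, this is how the covariance matrix might be computed with NumPy. The dataset is hypothetical and the standardisation uses the sample standard deviation, which is an assumption chosen to match the default of `np.cov`:

```python
import numpy as np

# Hypothetical dataset: 6 observations (rows) of 3 variables (columns).
X = np.array([
    [2.5, 2.4, 1.2],
    [0.5, 0.7, 0.3],
    [2.2, 2.9, 0.9],
    [1.9, 2.2, 1.1],
    [3.1, 3.0, 0.4],
    [2.3, 2.7, 1.0],
])

# Standardise each variable (sample standard deviation, an assumed convention).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# rowvar=False tells NumPy that columns, not rows, are the variables.
cov = np.cov(Z, rowvar=False)

# The same matrix written out by hand: Z^T Z / (n - 1).
n = Z.shape[0]
cov_by_hand = Z.T @ Z / (n - 1)

print(cov.shape)                     # (3, 3), one row and column per variable
print(np.allclose(cov, cov.T))       # True: the matrix is symmetric
print(np.allclose(cov, cov_by_hand)) # True: both calculations agree
```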

3. Calculate the eigenvectors and eigenvalues

Once you have the covariance matrix, you can calculate its eigenvectors and eigenvalues to identify the principal components. Principal components are the vectors that define the new coordinate system for your data. The first points in the direction along which your data varies the most, and each subsequent one points in the direction of greatest remaining variance that's perpendicular to those before it. The eigenvectors of the covariance matrix give these directions, while the eigenvalues are scalars that tell you how much variance lies along each one.

The aim of calculating these is to find the principal components so you can transform your data into a new set of variables that's easier to visualise and interpret. An eigenvector of the covariance matrix is a vector that keeps its direction when you multiply the covariance matrix by it. The result is the same vector scaled by its eigenvalue, which equals the variance of your data along that direction. You then sort the eigenvectors in descending order of their eigenvalues to rank the principal components.
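
The eigenvector step can be sketched as follows, assuming NumPy and a hypothetical dataset. `np.linalg.eigh` is used here because the covariance matrix is symmetric. It returns eigenvalues in ascending order, so the sketch reverses them:

```python
import numpy as np

# Hypothetical dataset: 6 observations (rows) of 3 variables (columns).
X = np.array([
    [2.5, 2.4, 1.2],
    [0.5, 0.7, 0.3],
    [2.2, 2.9, 0.9],
    [1.9, 2.2, 1.1],
    [3.1, 3.0, 0.4],
    [2.3, 2.7, 1.0],
])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
cov = np.cov(Z, rowvar=False)

# eigh is intended for symmetric matrices; each COLUMN of `eigenvectors`
# is one eigenvector, paired with the eigenvalue at the same index.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort from largest to smallest eigenvalue.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Check the defining property for the first principal component:
# covariance matrix times eigenvector = eigenvalue times eigenvector.
v, lam = eigenvectors[:, 0], eigenvalues[0]
print(np.allclose(cov @ v, lam * v))  # True
```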

4. Choose the principal components to keep

Next, you create a feature vector from the principal components you decide to keep. Despite its name, a feature vector is a matrix whose columns are the eigenvectors that correspond to the largest eigenvalues. You can think of this as creating a new coordinate system for your data where the principal components you kept are the axes.

The number of principal components you keep depends on how much variance you want to explain. A common rule of thumb is to keep enough components to explain at least 85% of the variance in your data. For example, if you have 10 variables, you might find that the first six principal components together explain around 85% of the variance and keep those. You discard the components with the smallest eigenvalues, as they explain the least variance.
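
A short sketch of this selection rule in Python follows. The ten eigenvalues are made up for illustration, and the 85% threshold is the rule of thumb mentioned above rather than a fixed requirement:

```python
from itertools import accumulate

# Hypothetical eigenvalues for ten standardised variables, sorted from
# largest to smallest. These numbers are illustrative only.
eigenvalues = [3.1, 2.0, 1.4, 0.9, 0.7, 0.6, 0.5, 0.4, 0.3, 0.1]
threshold = 0.85  # share of variance you want to explain

total = sum(eigenvalues)

# Share of the total variance explained by the first 1, 2, 3, ... components.
cumulative = [running / total for running in accumulate(eigenvalues)]

# Smallest number of components whose cumulative share reaches the threshold.
k = next(i + 1 for i, share in enumerate(cumulative) if share >= threshold)

print(k)  # 6
```

With these values, five components explain about 81% of the variance and six explain about 87%, so six is the smallest number that meets the threshold.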

5. Transform your data

Finally, you can transform your data into the new coordinate system that the principal components define. You can do this by multiplying your standardised data matrix by the feature vector, the matrix of eigenvectors that you created in the prior step.

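With each observation as a row, this gives you a new matrix with the same number of rows as your data but fewer columns, one for each principal component you kept. This is your transformed data and it's in a form that's easier to visualise and interpret. The formula for recasting your data along the principal component axes is often written as follows, with t indicating the transpose of a matrix. In this form, the result has one row per principal component and one column per observation, so you can transpose it to get one row per observation: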

Final data set = (Feature vector)t x (Standardised original data set)t
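
The transformation can be sketched in NumPy as shown here. The dataset and the choice to keep two components are hypothetical. The sketch computes the result both as a direct product and with the transposed formula, then checks that they agree:

```python
import numpy as np

# Hypothetical dataset: 6 observations (rows) of 3 variables (columns).
X = np.array([
    [2.5, 2.4, 1.2],
    [0.5, 0.7, 0.3],
    [2.2, 2.9, 0.9],
    [1.9, 2.2, 1.1],
    [3.1, 3.0, 0.4],
    [2.3, 2.7, 1.0],
])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

k = 2                                # number of components kept (assumed)
feature_vector = eigenvectors[:, :k] # columns are the top-k eigenvectors

# One row per observation, one column per principal component.
scores = Z @ feature_vector

# The transposed form of the formula: components in rows, observations in columns.
final_data_set = feature_vector.T @ Z.T

print(scores.shape)                           # (6, 2)
print(np.allclose(scores, final_data_set.T))  # True
```

The columns of `scores` are uncorrelated with each other, and the variance of each column equals the corresponding eigenvalue.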

Principal component analysis example

This is an example of a PCA:

A company wants to know how they can reduce the number of dimensions to more easily visualise their customer segmentation. They can do this by conducting a PCA. The company has data on the age, job, marital status and education of their customers, with the categorical variables coded as numbers so that they can calculate means and variances. Each row represents a different customer and each column represents a different variable. The company standardises the data so that each variable has a mean of zero and a standard deviation of one. They then calculate the covariance matrix, which is a 4 × 4 matrix.

Next, they calculate the eigenvectors and eigenvalues of the covariance matrix. The two eigenvectors with the largest eigenvalues are (0.57, 0.46, -0.67, -0.46) and (0.58, -0.53, -0.48, 0.61), with eigenvalues of 2.45 and 0.79 respectively. The company then chooses to keep only the first principal component as it explains the most variance in the data. They transform the data using the equation from step five and get a new matrix with just one column, which represents the first principal component. The company can now use this transformed data set to more easily visualise their customer segmentation.
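
To see how much of the variance the company keeps, you can divide each eigenvalue by the total variance. This sketch assumes the total is four, because the example has four standardised variables that each have a variance of one. The two remaining eigenvalues aren't given in the example, so they don't appear here:

```python
# Eigenvalues quoted in the example (the two largest of four).
eigenvalues_quoted = [2.45, 0.79]

# Assumption: four standardised variables, so the total variance is 4.
total_variance = 4

shares = [value / total_variance for value in eigenvalues_quoted]

print(f"{shares[0]:.0%}")  # 61%
print(f"{shares[1]:.0%}")  # 20%
```

Under that assumption, the first principal component alone explains roughly 61% of the variance, and adding the second would bring the total to about 81%.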

Frequently asked questions

These are some answers to frequently asked questions about PCAs:

Who uses PCAs?

PCA is useful for statisticians, data scientists and machine learning engineers. It's a tool that they can use for exploratory data analysis, dimensionality reduction and feature engineering. It's also useful for visualising high-dimensional data sets.

What are the limitations of PCA?

One limitation of PCA is that it can be sensitive to outliers. This means that a few extreme values in your data can have a large impact on the results of the PCA. Another limitation is that PCA only captures linear relationships between variables. If your variables are related in a non-linear way, the principal components might not describe the structure of your data well. In this case, you might want to try a different dimensionality reduction technique.
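
The sensitivity to outliers can be illustrated with a small sketch. The two-variable dataset is made up, and the sketch deliberately skips standardisation so the effect is easy to see in two dimensions. The helper `first_component` is a hypothetical name introduced for this illustration:

```python
import numpy as np

def first_component(data):
    """Return the eigenvector of the covariance matrix with the largest eigenvalue."""
    cov = np.cov(data, rowvar=False)
    values, vectors = np.linalg.eigh(cov)
    return vectors[:, np.argmax(values)]

# Hypothetical data that varies almost entirely along the first variable.
clean = np.array([
    [-2.0,  0.1],
    [-1.0, -0.1],
    [ 0.0,  0.0],
    [ 1.0,  0.1],
    [ 2.0, -0.1],
])

# The same data with a single extreme value added in the second variable.
with_outlier = np.vstack([clean, [0.0, 10.0]])

pc_clean = first_component(clean)
pc_outlier = first_component(with_outlier)

print(np.abs(pc_clean))    # dominated by the first variable
print(np.abs(pc_outlier))  # dominated by the second variable
```

One added observation is enough to swing the first principal component from the first variable to the second, which is why it can help to check for extreme values before conducting a PCA.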

What is the difference between PCA and factor analysis?

PCA and factor analysis are both dimensionality reduction techniques. They both describe your data with a smaller set of new variables, namely principal components or factors, that are linearly related to the original variables. In both cases, you keep the new variables that explain most of the variance in your data.

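The main difference between PCA and factor analysis is in what the new variables represent. In PCA, each principal component is a linear combination of the observed variables, chosen to capture as much of the total variance as possible. Factor analysis instead assumes that unobserved, or latent, factors drive the observed variables. It models only the variance that the variables share and treats the rest as unique to each variable. Both are unsupervised techniques, as neither uses a target variable. PCA is a more general technique while factor analysis might be more helpful if you have specific hypotheses about the factors in your data.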
