How to do PCA plot of HapMap data

I read many papers using PCA to show different clusters of the population but hard to see a step-by-step guide for a beginner like me. Here’re the steps I did.

I use mainly plink (version 1.9) and R (simple plot) on The Phase 2 HapMap as a PLINK fileset. The dataset can be download at http://zzz.bwh.harvard.edu/plink/res.shtml

Entire HapMap (release 23, 270 individuals, 3.96 million SNPs)
– CEU (release 23, 90 individuals, 3.96 million SNPs)
– YRI (release 23, 90 individuals, 3.88 million SNPs)
– JPT+CHB (release 23, 90 individuals, 3.99 million SNPs)

Since the dataset is already in plink format, I can simply run it against plink.

278673251 Sep 2 2008 hapmap_r23a.bed
114426472 Sep 2 2008 hapmap_r23a.bim
7470 Sep 2 2008 hapmap_r23a.fam

Step 1 :
You may get the following error when running the data directly by using plink
Error: .bim file has a split chromosome. Use –make-bed by itself to remedy this.

We can fix this by running the following:

plink --bfile hapmap_r23a --make-bed -out hapmap_r23a_sorted

 

Step 2 (QC):
You remove any individuals who have less than, say, 95% genotype data (–mind 0.05) and snps with a MAF<0.01 and a HWE P-value<0.00001 as well as a genotype failure rate greater than 0.1 are also removed.

plink --bfile hapmap_r23a_sorted --make-bed --mind 0.1 --maf 0.01 --geno 0.05 --hwe 0.00001 -out hapmap_r23a_cleaned

 

Step 3 (Prune):
First we prune the dataset to remove highly correlated variants, so that variants can be considered relatively independent, which is one of the assumptions we make in principal component analysis

plink --bfile hapmap_r23a_cleaned --indep-pairwise 50 5 0.2 --out LDprunedout
This option indicates that pruning is carried out in a window of 50 SNPs at a time, with the window sliding by 5 SNPs each time, pruning to a maximum r2 threshold of 0.2 between any two SNPs in a window (please look at http://pngu.mgh.harvard.edu/~purcell/plink/summary.shtml#prune for further information). This will generate a list of uncorrelated SNPs LDprunedout.prune.in.
Then, we extract from the original dataset from use that that LDprunedout.prune.in file.
plink --bfile hapmap_r23a_cleaned --extract LDprunedout.prune.in --make-bed --out hapmap_r23a_ldpruned

 

Step 4 (PCA)
Execute the following script to do PCA

plink --bfile hapmap_r23a_ldpruned --pca --out pca_result

Files generated:

pca_result.hh
pca_result.eigenvec
pca_result.eigenval
pca_result.log

Step 5 (Plot)

This is straightforward in R.

1. Load the .eigenvec file. E.g. “eigenvec_table <- read.table(‘pca_result.eigenvec’)”
2. Use plot() on the columns you’re interested in. Top eigenvector will be in column 3 while second eigenvector will be in column 4, etc.. Therefore, if you want a set of pairwise plots covering the top 4 eigenvectors, you can use “plot(eigenvec_table[3:6])”.

 >eigenvec_table <- read.table('pca_result.eigenvec') >plot(eigenvec_table[3:4]) 

This above plot should be fine and it is similar to the one plotted shown as follow by http://www.well.ox.ac.uk/gabriel-project using HapMap data, just no fancy graphics. Of course, you can plot beautifully by R but it is not target of this article.

 

Hope this helps!

Leave a comment