2024 VitisGen3
Genetics Workshop
Hands-On 2: Phenotype Data and GWAS
Written By: Dustin Wilkerson, PhD
VitisGen3 Postdoctoral Research Associate
Cornell Institute of Biotechnology

Introduction

This hands-on exercise is the second of three as part of the 2024 VitisGen3 Genetics Workshop. This goal of this workshop is to help VitisGen3 participating breeding programs and researchers learn how to work with their own rhAmpSeq data from acquisition to QTL. However, some of these tools and methods can also be transferable to other sequence types. Throughout all the hands-on exercises, we will be working with the sequence and phenotype data of the Vitis x doaniana 'PI 588149' x 'Chardonnay' F~1~ family from Zou et al., 2023 that was used to map a novel grapevine downy mildew resistance locus, Rpv33.

In this hands-on, we will focus first on working with phenotype data in RStudio, from basic quality checks to transformation and prepping it for trait mapping. Then, we will use those phenotypes in Tassel to perform a GWAS after filtering our sequence data and generating an MDS plot to check for contaminant samples. By the end of this hands-on you should have a functional understanding of performing a GWAS in Tassel that you can apply to your own data in the future.

Required Programs

  1. RStudio

  2. Tassel 5.0

Working with Phenotype Data

In this section, we will cover how to load data into the R environment, see if there are any outliers, check the phenotype distribution, and finish off with performing Box-Cox transformations. We will be working with two of the columns of phenotype data that were used in Zou et al., 2023 to map Rpv33.

 

Getting Started with RStudio

At the beginning of every R session, we need to set our working directory and load the files we want to work with into the R environment. In the first hands-on, you should have created a working directory for yourself on the cbsuvitisgen2 machine and we will work within the same one for this hands-on. The raw phenotypes we need should be in the "/workdir/VG3Workshop_23" folder. Let's set our working directory and load that file.

 

Plotting the Data

Now that our data is in R, we can get a sense of what we have. We will start by taking a quick look at the data itself. Next, we will plot histograms and QQ-plots to look at the distribution of each phenotype and see if they have any outliers that we need to remove.

The summary statistics tell us that we have three columns in our phenotype file with 220 taxa. Specifically, these phenotypes are grapevine downy mildew ratings, incidence from 2019 and severity from 2021. Both were rated on a percentage scale from 0-100% in a single environment without replications. Lastly, we can also see 5 and 33 missing values (NA's) for incidence and severity, respectively. As for the plots (Figure 1), there are no obvious outliers but both traits have skewed distributions although the incidence ratings are more heavily skewed.

Figure 1: Phenotype Distributions

Missing Data and Box-Cox Transformation

Now that we know what we are dealing with, we have a few things to do to get this data ready for mapping. If had been outliers, we would likely replace those data points with NAs since we don't have replications to do a mean replacement. For the missing data, there will likely be taxa (genotypes/vines) that are missing for both phenotypes and will just be tagging along for the ride and not adding anything to our analysis. Let's check to see if there are any and remove them before continuing on to phenotype transformation.

In the attempt to normalize our traits, we will perform a Box-Cox transformation to see if their distributions improve using functions from the "MASS" package. Briefly, we first need to determine the lambda value most likely normalize our phenotype which is then plugged into a formula for the actual transformation. If you want to know more about the Box-Cox transformation, you can start here: https://www.statisticshowto.com/probability-and-statistics/normal-distributions/box-cox-transformation/.

Figure 2: Finding Our Box-Cox Lambdas

Figure 3: Box-Cox Transformed Distributions

 

Making Decisions and Saving Files

While we can say that both distributions somewhat improved, I would say only the severity ratings improved enough to be considered approximately normal (Figure 3). In reality, the appropriateness of using non-normal data for GWAS and QTL mapping varies by a number of factors. For the sake of this hands-on, we will continue with the untransformed incidence ratings and use non-parametric mapping methods when necessary. As a contrast, we will be using the transformed severity ratings and models that expect more normally distributed data. Let's make the required changes to our dataframe and save it to our working directory.

 

GWAS in Tassel

This section goes through the steps to complete a GWAS in Tassel. After preparing the files we need, we will load them into Tassel, do some marker filtering, generate an MDS plot, and then finish up with the GWAS itself. If you're continuing on from the phenotype section, you will need to transfer the phenotype file from the BioHPC onto the local machine where you installed Tassel.

 

Preparing Files for Tassel

The two files we need include our phenotype file, reformatted for Tassel, and the .vcf file with the genotype calls for our mapping population, the 'PI 588149' x 'Chardonnay' F~1~ family, described in Zou et al., 2023. Let's start with reformatting the phenotype file in Excel. Since we saved the phenotypes as a tab-delimited .txt file, we need to go through Excel's Text Import Wizard in order to work with it.

  1. Open Excel

  2. Click "Open Other Workbooks"

  3. Navigate to the directory where you saved your phenotype file

  4. Click "All Excel Files" in the lower right corner and change it to "All Files (*.*)"

  5. Your phenotype file should be visible now, double-click it

  6. In the Text Import Wizard, choose "Delimited" and "My data has headers"

  7. Click "Finish"

  8. Add the top two rows of the following table to your file above the column names

    <Phenotype>  
    taxadatadata
    TaxaIDIncid_2019BCT_Sev_2021
  9. Save the file

Next, let's regenerate a .vcf file following similar steps to those in the first hands-on while connected remotely to the BioHPC. This .vcf file will include multiple runs of the parents along with only the phenotyped progeny. For ease, the family file needed for slicing the hap_genotype file is already prepared and can be found in the "/workdir/VG3Workshop_23" folder with the other provided data. Once it's generated, transfer that .vcf file to your local machine and you're ready for Tassel.

 

Opening Files in Tassel

Let's get the genotype .vcf and phenotype table open in Tassel so we can get mapping!

  1. Open Tassel on your local machine

  2. Click "File" then "Open As..." in the top left corner

  3. Leave "Format" as "Make Best Guess"

    1. For the .vcf file

      1. Select both "Sort Positions" and "Keep Depth" before clicking "Ok"

    2. For the phenotype table

      1. No selections before clicking "Ok"

  4. Select the file you want to open then click "Open"

Filtering the Genotype Table

With our genotype table in Tassel, we can see more information about it on the left side of the Tassel window. The top panel on the left side shows the files you have loaded/generated within Tassel's file structure. The middle panel shows the total number of taxa and sites along with the breakdown of sites per chromosome. The table should have 219 taxa (our 215 phenotyped taxa with two sets of genotype calls for each parent) and 1991 sites. These sites have had minimal filtering as part of genotype calling and need to paired down to informative and high quality markers for trait mapping. We will start by filtering for call rate (missingness), followed by minor allele frequency, and then for polymorphic sites. The way filtering sites works in Tassel is through a specific window accessed by clicking "Filter" followed by "Filter Genotype Table Sites". The screenshot below (Figure 4) shows the top section of the window that should open. We can do our filters by only changing the values in these first six boxes.

Figure 4: Tassel's Filter Sites Window

The following table summarizes the steps we will use to filter the sites in our genotype table. For each filter, be sure to add a unique "Filter Name" to the top box as it will be appended to the name of the newly created genotype table within Tassel's file structure. This will help you keep tabs of the effect of each filter and backtrack as needed. I also recommend setting each value back to it's default as you move to the next filter.

Filter TargetTassel ParameterValueSites Remaining
80% Call Rate (219 taxa * .8 = 175.2)Site Min Count1761856
Minor Allele Frequency > 0.05Site Min Allele Freq
Site Max Allele Freq
0.05
0.95
1776
Polymorphic SitesMin Heterozygous Proportion
Max Heterozygous Proportion
0.01
0.99
1294

 

Generating an MDS Plot

Now that the sites are filtered, we have one last thing to check before the GWAS. Plotting our genotype data as an MDS (Multi-Dimensional Scaling) plot allows us to tell a couple things about it. From a practical perspective, an MDS plot will let us check the purity of our samples and gives us a way to summarize the population structure as covariates for a GWAS. We will be looking for any progeny that are not grouping with their siblings, as either off on their own (contaminants) or grouping with the seed parent (selfs). The primary reason we recommended sequencing the parents alongside the progeny in mapping populations is for this step as it would be incredibly difficult to judge true progeny from contaminants and selfs without them. In Tassel, we can get our MDS by first calculating a distance matrix of our filtered genotype table then generating the principal components for the MDS plot.

  1. Click "Analysis", "Relatedness", then "Distance Matrix"

  2. Click "Analysis", "Relatedness", then "MDS"

  3. Click "Ok"

  4. Make sure you have the "Numerical" table starting with "MDS_PCs" selected

  5. Click "Results" then "Chart"

  6. Change "Graph Type:" to "XYScatter"

  7. Change "Y1" to "PC2"

  8. Change "X" to "PC1"

An MDS plot (similar if not the same as Figure 5) should appear, showing three main groupings and one dot a little on its own. With the "MDS_PCs" table still open, you can click on the column headers for each PC to sort the values and determine which points on the plot are which taxa. For the below plot, the two runs of the seed parent "PI 588149" are in the lower left while the two runs of the pollen parent "Chardonnay" are in the top right. I can also see that the lonely point just off the main cluster is "Oblock14-23". Unfortunately for it but luckily for us, that's the only F~1~ that needs to be dropped before we move on to GWAS.

Figure 5: MDS Plot

Filtering Taxa

Functionally, filtering taxa is similar to filtering sites. By clicking "Filter" then "Filter Genotype Table Taxa", we can access the window we need (Figure 6). In other circumstances, this would be the window where you would also filter taxa with a high proportion of missing data, however this family has very little missing data and that step isn't necessary. We can also use this window to remove specific taxa through the "Taxa List" box. A new window will open after clicking "Select" that shows "Available" taxa on the left and "Selected" taxa on the right. Before running the GWAS we need to remove the four parental runs as we no longer need them and our contaminant individual from the MDS.

Once these five individuals are in the selected column (highlight them in "Available" and click "Add ->"), click "Capture Unselected" and then "Ok". Our ready-for-GWAS genotype table now has 214 taxa and 1294 sites.

Figure 6: Tassel's Taxa Filtering Window

Running a GWAS

We are nearly ready to perform the GWAS on our two phenotype traits. We just need to make sure all the pieces are in order. The first thing to do is to generate new PCs for the genotype table we will be using for the GWAS (our first set of PCs included the parents and the contaminant, which are no longer included). Once we have that, we will first run a generalized linear model (GLM) then a mixed linear model (MLM).

  1. Generalized Linear Model

    1. Hold down the ctrl key and select:

      1. GWAS-ready genotype table

      2. That table's "MDS_PCs"

      3. Phenotype Table

    2. Click "Data" then "Intersect Join"

    3. Click "Analysis", "Association", "GLM", then "Ok"

  2. Mixed Linear Model

    1. Generate a Kinship Matrix

      1. Select GWAS-ready genotype table

      2. Click "Analysis", "Relatedness", "Kinship", then "Ok"

    2. Hold down the ctrl key and select:

      1. Intersected table used for GLM

      2. Kinship Matrix ("Centered_IBS")

    3. Click "Analysis", "Association", "MLM", "Run", then "Ok"

We now have several tables under the Results -> Association folder within Tassel. The "GLM_Stats" and "MLM_statistics" table have the results of our analyses. Tassel has built in plots for us to quickly visualize our GWAS. Let's look at all the results to see how things went by looking at the Manhattan and QQ plots.

  1. Manhattan Plot

    1. Click "Results", "Manhattan Plot", Select Trait, then "Ok"

  2. QQ-Plot

    1. Click "Results", "QQ Plot", then "Okay"

Thanks to these plots, we can see that both GLM and MLM produced supported peaks that are at high enough LODs to assume significant hits on Chromosome 9 for both phenotypes (Figure 7). In comparing the models, we can see that the GLM produced greater LOD scores than the MLM for both traits while the QQ plots don't suggest any issues.

Figure 7: Tassel Manhattan Plot

Saving Files Out of Tassel

Although the Tassel plots tell us we likely have QTL, we still need to determine the significance thresholds for our traits and produce some higher quality plots along the way. The first step is saving the necessary files out of Tassel. This can be done by clicking "File" then "Save As" while you have the tables you want to save highlighted.

Once you have them all saved, transfer the GLM and MLM_stats tables to the BioHPC where we will work with them in RStudio and make some nicer figures.

Plotting Our Results in R

In this section, we will do our multiple test corrections by calculating both the Bonferroni significance threshold and the Benjamini-Hochberg cutoff in RStudio. As our last step, we will then visualize the create some nice Manhattan and QQ-plots to visualize our results.

Reading in Our Files

As before, we will start with setting our working directory and read in our files. For the hands-on, we will focus on the MLM results but the GLM results can be worked with the same way.

 

Multiple Test Corrections

Now let's do our multiple test corrections. We will first get our Bonferroni threshold which both of our phenotypes will share as it is calculated based on the number of markers. Then, we will use a 5% false discovery rate (FDR) to get a Benjamini-Hochberg cutoff for each phenotype as it is relative to the p-values themselves. More information about these procedures was discussed in section 2 of the recorded lectures.

 

Manhattan and QQ-Plots

Now that we have our thresholds, we can put our plots together. Both Manhattan and QQ-plots can be created using the R package "qqman". However, it does require numeric chromosome labels which we can extract using a function for manipulating character strings from the "stringr" package. We will finish up this hands-on by extracting the list of significant markers for both thresholds.

Our Manhattan plot for Incid_2019 (Figure 8) includes both Bonferroni threshold (red) and Benjamini-Hochberg cutoff (blue) lines. Both include the supported peak of seven markers on Chromosome 9 (corresponding to Rpv33), however the Benjamini-Hochberg line also includes three other significant markers on chromosomes 4, 7, and 13. This is in contrast to the Manhattan plot for Sev_2021 where both thresholds only include the peak on Chromosome 9. Although this is the end of this hands-on, the next step would include looking through the annotation file of the reference genome at the positions of the QTL to see if there are any potential genes or gene families present that could be affecting our phenotype.

Figure 8: Manhattan Plot - Incid_2019

 

References