2024 VitisGen3
Genetics Workshop
Hands-On 3: Linkage Map Construction
and QTL Mapping
Written By: Dustin Wilkerson, PhD
VitisGen3 Postdoctoral Research Associate
Cornell Institute of Biotechnology

Introduction

This hands-on is the final of three exercises as part of the 2024 VitisGen3 Genetics Workshop. The goal of this workshop is to help VitisGen3 participating breeding programs and researchers learn how to work with their own rhAmpSeq data from acquisition to QTL. However, some of these tools and methods can also be transferable to other sequence types. Throughout all the hands-on exercises, we will be working with the sequence and phenotype data of the Vitis x doaniana 'PI 588149' x 'Chardonnay' F~1~ family from Zou et al., 2023 that was used to map a novel grapevine downy mildew resistance locus, Rpv33.

In this hands-on, we will focus on creating linkage maps using Lep-MAP3 (Rastas, 2017) on the BioHPC following some marker filtering in Tassel. Then, we will take a look at the quality of our linkage maps in RStudio before QTL mapping use the R package "R/qtl". By the end of this hands-on you should have a better understanding of how to create linkage maps and use them to map QTL.

Required Programs

  1. RStudio

  2. Tassel 5.0

Linkage Map Construction

In this section of the hands-on, we will start by preparing a new .vcf file on the BioHPC. Once we get it filtered down to the highest quality markers, we will run imputation to fill in any missing genotype calls. After double checking our pedigree.txt file, we will run through the 'Lep-MAP3' pipeline before formatting the map for 'R/qtl'.

 

Preparing Our VCF

As in the last hands-on, we will begin by generating a new .vcf file. This file will differ from the one created for GWAS as it will only have a single set of consensus genotype calls and won't include the contaminant sample we identified in the second hands-on. Another relevant difference is that we also care about the pedigree.txt file that 'to_lep_map.pl' creates alongside the .vcf. Slicing the hap_genotype file again to remove unneeded taxa also will save us time later when checking the pedigree.txt file for 'Lep-MAP3'. I've once again provided the family list that we will be using in the "/workdir/VG3Workshop_23" directory. Once we have the files, transfer them to a local machine to work first with the .vcf in Tassel before checking our pedigree.txt file.

 

Opening Files in Tassel

Let's get the genotype .vcf open in Tassel.

  1. Open Tassel on your local machine

  2. Click "File" then "Open As..." in the top left corner

  3. Leave "Format" as "Make Best Guess"

    1. For the .vcf file

      1. Select both "Sort Positions" and "Keep Depth" before clicking "Ok"

  4. Select the .vcf file you want to open then click "Open"

Filtering and Imputation

The genotype table should have 216 taxa and 1958 sites. These sites have had minimal filtering as part of the genotype calling step and need to filtered down to only those that are informative and high quality. We will start by splitting away the parent genotype calls before filtering the progeny for call rate (missingness), minor allele frequency, and polymorphic sites. Some additional filtering will occur later as we work through the 'Lep-MAP3' pipeline. Our last step before saving files will be to impute missing data using the LD-kNNi (Linkage Disequilibrium k-means Nearest Neighbor imputation) method described in Money et al., 2015 that has been incorporated in the Tassel software. The way filtering works in Tassel is through a specific window accessed by clicking "Filter" followed by "Filter Genotype Table Sites" (or "Filter Genotype Table Taxa") depending on our goals. The screenshots below shows the taxa filtering window (Figure 1) and the top section of the sites window (Figure 2).

Figure 1: Tassel's Filter Taxa Window

Figure 2: Tassel's Filter Sites Window

Filtering the Genotype Table:

  1. Separate the parents and progeny

    1. Create parent table

      1. Make sure unfiltered genotype table is highlighted

      2. Select "Filter", "Filter Genotype Table Taxa"

      3. Next to the "Taxa List" box, click "Select"

      4. Highlight both parents ("Chardonnay" and "PI588678"); click "Add ->"

      5. Click "Capture Selected"; then "Ok"

    2. Create progeny table

      1. Make sure unfiltered genotype table is highlighted

      2. Repeat steps 1-3 from parent table

      3. Click "Capture Unselected"; then "Ok"

  2. Filter sites from the progeny table (214 taxa; 1958 sites)

    1. The following table presents the steps for filtering sites after you access the "Filter Genotype Table Sites" window, showing the arguments and the affect of each filter so you can follow along. I like to do them one at a time so I can keep track of how many sites are filtered at each step in case there are downstream issues.

Filter TargetTassel ParameterValueSites Remaining
90% Call Rate (214 taxa * .9 = 192.6)Site Min Count1931729
Minor Allele Frequency > 0.05Site Min Allele Freq
Site Max Allele Freq
0.05
0.95
1643
Polymorphic SitesMin Heterozygous Proportion
Max Heterozygous Proportion
0.01
0.99
975
  1. Merge progeny and parent table

    1. Highlight both filtered progeny table and the unfiltered parent table

    2. Select "Data", "Merge Genotype Tables"

      1. New table should have 216 taxa; 1958 sites

    3. Redo missingness (call rate) filter from above

      1. Table should now have 216 taxa; 975 sites

  2. Impute missing sites

    1. Highlight merged and filtered genotype table

    2. Select "Impute", "LD KNNi Imputation"; Click "Ok"

  3. Save files out of Tassel

    1. For this hands-on, we will only use the imputed vcf file going forward but I always like to save a .vcf of the filtered, unimputed genotype table as well just in case.

      1. Highlight imputed genotype table

      2. Select "File", "Save as..."

      3. Name the file; Set "Format" to "VCF"; Click "Ok"

Checking the Pedigree File

The pedigree.txt file is an important input file for 'Lep-MAP3'. It provides the software information about the family and it will build onto it as we work through the pipeline. Luckily, 'to_lep_map.pl' generates a properly formatted pedigree file ending in pedigree.txt. If we had filtered any taxa from our .vcf while in Tassel, we would need to remove them from the pedigree file as well. Even though we did not, it is good to make sure that the parents are properly identified and there weren't any errors when it was generated. Since it is a tab-delimited .txt file, we will open it using Excel's Text Import Wizard.

  1. Open Excel

  2. Click "Open Other Workbooks"

  3. Navigate to the directory where you saved your pedigree file

  4. Click "All Excel Files" in the lower right corner and change it to "All Files (*.*)"

  5. Your phenotype file should be visible now, double-click it

  6. In the Text Import Wizard, choose "Delimited"

  7. Click "Finish"

The pedigree file that opens should have six rows and as many columns as taxa plus two (Figure 3). The first columns should appear as they do in Figure 3. A more complete breakdown of each row is covered in the Section 3 lecture. There are a few things we want to make sure are correct here. For the dataset provided with the hands-on, the parents are "PI588678" (seed parent) and "Chardonnay" (pollen parent) and they should be indicated as the female (2) and male (1) parents in row 5, respectively. Another thing to ensure is that the names of the parents given in row 2 of columns C and D match those given for every progeny in rows 3 and 4. Since our pedigree file is correct, we don't need to make any changes. If you aren't following along exactly or found an error, make sure you transfer the corrected file back to the BioHPC as we will need it and the imputed .vcf for 'Lep-MAP3'.

Figure 3: Pedigree File for Hands-On 3

 

Executing the Lep-MAP3 Pipeline

We are now ready to start constructing our linkage map. Make sure you have both your imputed .vcf file and the checked pedigree file in your working directory on the BioHPC. The 'Lep-MAP3' pipeline itself is fairly linear. There are a series of modules that we will go through to produce our linkage map with the occasional need to tweak some arguments depending on the map. There are four main steps in the pipeline and each are discussed briefly in the Section 3 lecture and at full in the 'Lep-Map3' documentation that can be found here. There's an active forum at this link as well if you run into any issues beyond working with your own data.

'Lep-Map3' Pipeline Modules

  1. ParentCall2

    1. Used to call (possible missing or erroneous) parental genotypes and sets up input files for rest of pipeline.

  2. Filtering2

    1. Handles additional marker filtering, we are primarily using it to filter for segregation distortion

  3. SeparateChromosome2

    1. Assigns markers to linkage groups

  4. OrderMarkers2

    1. Orders the markers within each linkage group by maximizing the likelihood of the data given the order.

We now have a linkage map (yay). The last thing to do is extract the information we need about our map so we can get it ready for 'R/qtl'. Once these files are saved, we will shift to working with them on the RStudio server.

 

Formatting the Map for R/qtl

We have a few things to do before we can get to QTL mapping. The first thing is getting it reformatted for 'R/qtl' which we can handle in RStudio. In the next code block, we will read in all the necessary files, replace the taxa and marker name placeholders with human informative names, code the genotype calls, and then reformat the map as a comma-separated .csv file. The only required package for this section is "stringr" which helps with manipulating character strings. Once it's ready, we will save it out so we can read it back in as a proper 'cross' object that 'R/qtl' can use.

 

Checking Map Quality

It's time to read our linkage map into 'R/qtl'. Before we can try mapping with it however, we want to make sure we have a good quality map. In this section, we will check the quality by first plotting the map to get a sense of the gaps and cM lengths. We will then take a look at the recombination frequency within and among linkage groups and check their marker order by comparing it to the physical positions in the reference genome.

Reading in the Linkage Map

We will then use 'read.cross' to bring our map into 'R' and address some co-locating markers.

 

Looking at the Map

Let's take a look at the map and get some summary statistics.

I see a couple things to note here. The first is that we don't have any linkage groups that are too long (all are less than 100 cM) (Figure 4) and our overall map distance is reasonable for our genome size. The longest linkage group is LG 1 at 72.1 cM which makes sense because it has the most markers (43). The largest gap in the map is on linkage group 14 at 20.5 cM however that isn't large enough to be worried about, especially since its in the middle of the LG. What does need to be addressed is that we have 21 linkage groups for our 19 Vitis chromosomes. This likely means that we have one or more chromosomes that were split by 'Lep-MAP3'. Let's see if we can't figure out which ones by cross-referencing the linkage groups with the chromosome positions in the marker names.

Figure 4: Linkage Map Plot

Looking through this table we can see that chromosomes 16 and 19 have been split into two linkage groups each: chromosome 16 on LGs 19 and 20; chromosome 18 on LGs 18 and 21. Let's see if we can figure out why they were split by looking at a few more things.

 

Recombination Frequency

The recombination frequency plot shows the pairwise LOD scores for linkage and recombination frequencies of every marker in the linkage map. The ideal plot will be brightest along the diagonal.

We can see that we have a stable map given the strength of linkage along the diagonal (Figure 5). It is also worth noting that we do not see strong linkage between our split chromosomes. This likely suggests that there weren't enough high-quality markers keep it together. Our last plot can confirm that for us.

Figure 5: Recombination Frequency Plot

Checking Marker Order

Unlike the last two plots, there aren't any functions in 'R/qtl' that we can use to generate the genetic (cM) by physical (Mb) distance plots we need to check our marker order. We will extract the information we need from the 'cross' object and look at our split LGs before plotting the rest.

Figure 6: Marker Order Plots

Figure 6 shows the genetic by physical distance plots for the four LGs that comprise chromosomes 16 (LG 19 and 20) and 18 (LG 18 and 21). We can say for sure now that there weren't enough markers to bridge the gap. For chromosome 18, the two linkage groups contain markers that are at least 15 Mb apart while chromosome 16 has a smaller gap that is closer to 5 Mb. Since we know they are from the same chromosome, we can merge them together but need to make sure we have them oriented correctly. We can do this by reversing the order of the linkage groups with negative slopes. The orientation of the linkage groups is somewhat random so your 'flip.order' may be different then mine. Once we get our LGs merged, we will check out the marker order plots for all the LGs (Figure 7) and then ensure they are all oriented to match the reference genome.

Figure 7: Final Marker Order Plots

Now that we can see all the linkage groups, it's obvious that some are nicer than others but none are bad enough to warrant any further action. We are officially done with our map QC.

 

Saving the Map

As we transition to QTL mapping, we will make one final adjustment to our linkage map that will help us interpret our QTL results. We will finish up the map quality section by renaming our linkage groups to match their marker's chromosome before saving the finished map (Figure 8).

Figure 8: Final Linkage Map

QTL Mapping

We have all the pieces we need for QTL mapping. In this section, we will combine our finished linkage map with our phenotype data (prepared in hands-on 2), calculate the genotype probabilities for our map, run our QTL models, and finish the hands-on by getting the effect sizes of our QTL. As a reminder, we are using two phenotypes that Zou et al., 2023 used to map Rpv33 in this family.

Reading in the Linkage Map

Let's get started by reading our map into RStudio.

 

Incorporating the Phenotype Data

Now let's incorporate our phenotype data in with our map. Before we can, however, we need to not only ensure that both map and phenotype data have the same set of taxa but they are also in the same order. If you have followed along from hands-on 2, feel free to use your saved phenotype file (before it was reformatted for Tassel), otherwise the one I generated is available in the "/workdir/VG3Workshop_23" directory with the other files. This phenotype file should have two traits, grapevine downy mildew incidence ratings from 2019 (Incid_2019) and severity ratings from 2021 (BCT_Sev_2021). During hands-on 2, we determined that the incidence ratings could not be normalized and were left untransformed while the severity ratings did normalize following Box-Cox transformation.

 

Calculating Genotype Probabilities

This is a crucial step in linkage mapping as it uses a model to calculate the probabilities of the underlying genotypes given the expected error rate. For rhAmpSeq, this error probability is 1e-4 as shown in the code block. The 'step' argument is also important as 'calc.genoprob' will introduce pseudo-markers every 'step = x' cM and calculate probabilities for them as well. A heads-up on 'calc.genoprob', if you end up getting slightly different results than I do in the below sections, 'calc.genoprob' is the culprit. As it's estimating probabilities each time it's run, it may produce slightly different estimates which can impact whether or not a LOD score passes our threshold. This becomes more prominent if you are dealing with weaker effect size QTL.

 

QTL Mapping

We are finally ready to do some trait mapping. As was mentioned above, we are working with two sets of ratings that have different relationships with normality. Therefore, we will need to use different types of QTL models and, as a further comparison, we will be running two QTL models for each trait. For our approximately normal, Box-Cox transformed severity ratings from 2021, we will be using the normal models from both 'scanone' (single QTL model) and 'cim' (multiple QTL model) functions. However since the incidence ratings from 2019 resisted transformation to normality, we will be using the "np" (non-parametric) and "2part" models from 'scanone'. You can find more information about these models in the 'R/qtl' manual linked here. As we are running our models for each trait, we will also be running the permutation tests to get our 5 and 10% LOD significance thresholds. We will finish up by plotting our results and identifying the peak marker and position of our QTL.

BCT_Sev_2021

Figure 9: BCT_Sev_2021 QTL Results

Incid_2019

Figure 10: Incid_2019 QTL Results

Getting QTL Effect Sizes

Let's finish up the hands-on by using 'R/qtl' to get the percentage of phenotypic variation explained by our QTL and do one final set of plots. We will need the chromosome and position of our peak markers and run a few more lines of code to get the percentage. Our last step will be visualizing our QTL in a simpler plot with just the single model and threshold next to another plot showing the relationship between the phenotype and genotype calls of the peak marker.

 

Figure 11: BCT_Sev_2021 QTL and Effect Plots

Figure 12: Incid_2019 QTL and Effect Plots

References