Details for Alphapulldown (If the copy-pasted commands do not work, use this tool to remove unwanted characters)

About:Protein-protein interaction screens and high-throughput modelling of higher-order oligomers using AlphaFold-Multimer
Added:3/24/2024 11:38:46 AM
Updated:12/17/2024 4:44:50 PM

Before you start,

1. Make sure the protein sequence names in the input fasta file have no extra annotation text. For example, use ">abc1", not ">abc1 transcription factor protein abc1".

2. If you have several hundred proteins, it might be faster to copy the database directories /home2/shared/colabfold_cache and /home2/shared/alphafold-2.3.2 to /workdir. Then replace "/home2/shared/" in the command lines with "/workdir/". It takes about 3 hours to copy the directories.

3. Only Step 3 requires a GPU. The other steps can run on any BioHPC server.

4. We have noticed that certain protein sequences crash at Steps 2 and 3. You can either remove these sequences from your fasta files or contact the developer through the project's GitHub issue page. If too many proteins fail in the pipeline, it might be easier to run colabfold multimer on all combinations.
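
The header cleanup described in item 1 can be done with a short command before Step 1. This is a minimal sketch, not part of Alphapulldown: the file name raw.fasta and the demo sequences are made up, and it assumes the sequence name is the first word after ">".

```shell
# Demo input (made-up sequences) so the sketch runs on its own.
demo=$(mktemp -d)
cat > "$demo/raw.fasta" <<'EOF'
>abc1 transcription factor protein abc1
MKTAYIAKQR
>xyz2
MSDNELQ
EOF

# Keep only the first whitespace-delimited word of each header line;
# sequence lines are copied unchanged.
awk '/^>/ {print $1; next} {print}' "$demo/raw.fasta" > "$demo/my.fasta"

# List the cleaned sequence names for a quick visual check.
grep '^>' "$demo/my.fasta"
```

For real data, replace the demo paths with your own input file and write the result to my.fasta in the working directory.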

Step 1: generate msa with colabfold (much faster than the alphafold2 based alignment pipeline)

#This step normally takes a few minutes for dozens of proteins. You can run it on any BioHPC server; it does NOT require a GPU. The actual computing happens on the colabfold cloud server, which can be busy at times.

#Combine the bait and candidate protein sequences into a single fasta file, and put it under the current directory.

#Run the following command to generate msa files. In this command, my.fasta is the input fasta file name and msas is the output directory name.

cd /workdir/$USER

singularity run --bind /home2/shared/colabfold_cache:/cache --bind $PWD --pwd $PWD /programs/colabfold-1.5.5/colabfold.sif colabfold_batch my.fasta msas --msa-only
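
#Because some sequences can fail, it may help to check that every sequence produced an alignment before moving on. The sketch below is a hypothetical helper, not an Alphapulldown command. It assumes colabfold writes one file named <sequence name>.a3m per sequence into msas; check your own output directory to confirm the naming.

```shell
# missing_outputs FASTA DIR EXT
# Print every sequence name in FASTA that has no DIR/NAME.EXT file.
missing_outputs() {
    grep '^>' "$1" | sed 's/^>//' | awk '{print $1}' | while read -r name; do
        [ -f "$2/$name.$3" ] || echo "$name"
    done
}

# Demo data (made-up names) so the sketch runs on its own.
demo=$(mktemp -d)
printf '>abc1\nMKTAYIAKQR\n>xyz2\nMSDNELQ\n' > "$demo/my.fasta"
mkdir "$demo/msas"
touch "$demo/msas/abc1.a3m"

# Collect the names that have no alignment file yet.
missing=$(missing_outputs "$demo/my.fasta" "$demo/msas" a3m)
echo "missing: $missing"
```

#On real data the call would be "missing_outputs my.fasta msas a3m". The extension is a parameter so the same helper could check other per-protein outputs, provided they follow the same naming pattern.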

Step 2: run on cbsugpu05-07

#This step can run on any BioHPC server. It does not require a GPU, but it does require the database on the GPU servers. It should take less than 1 hour for dozens of proteins.

#Run this command from the parent directory of the msas directory created in Step 1.


singularity exec --bind /home2/shared/alphafold-2.3.2:/db --bind $PWD --pwd $PWD /programs/alphapulldown-1.0.4/ap-2.0.1.sif \ \
 --fasta_paths=my.fasta \
 --data_dir=/db \
 --output_dir=msas \
 --use_precomputed_msas=True \
 --max_template_date=2024-01-01

Step 3: run on a GPU server (instructions for using two GPU units are at the end of the page)

#Use cbsugpu05 or above. The old servers (cbsugpu02-4) do not have enough GPU RAM for most multimers.

#If you ran Steps 1 and 2 on a different server, copy the generated msas directory to the GPU server.

#Create two text files, bait.txt and candidates.txt. The bait.txt file has one line, the bait protein name; the candidates.txt file has one protein name per line.
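
#Both files can be derived from the combined fasta file used in Step 1. This is a minimal sketch under stated assumptions: the bait name abc1 and the demo sequences are made up, and every sequence other than the bait is treated as a candidate.

```shell
# Demo fasta (made-up sequences) so the sketch runs on its own.
demo=$(mktemp -d)
printf '>abc1\nMKTAYIAKQR\n>xyz2\nMSDNELQ\n>xyz3\nMAAGT\n' > "$demo/my.fasta"

bait="abc1"   # made-up bait name; replace with your own

# bait.txt: a single line with the bait protein name.
echo "$bait" > "$demo/bait.txt"

# candidates.txt: every other sequence name, one per line.
grep '^>' "$demo/my.fasta" | sed 's/^>//' | awk '{print $1}' \
    | grep -vxF -e "$bait" > "$demo/candidates.txt"

cat "$demo/candidates.txt"
```

#The names in both files must match the sequence names in the fasta file exactly, so it is worth comparing them by eye before starting a long run.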

#Run the command from the current directory, which contains bait.txt, candidates.txt, the msas directory, and an empty outputdir. It could take hours or days to finish, depending on the number of candidate proteins.

mkdir outputdir
#set environment variable to enable GPU to use system memory, so that large proteins can be processed
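
#A sketch of such settings, given as an assumption rather than as this server's confirmed configuration: TF_FORCE_UNIFIED_MEMORY and XLA_PYTHON_CLIENT_MEM_FRACTION are the unified-memory variables used by the standard AlphaFold run scripts, and the SINGULARITYENV_ prefix is how Singularity passes a variable into the container. The value 4.0 is the AlphaFold default, not a tuned value.

```shell
# Assumed settings (not confirmed for this installation).
# Allow the GPU to spill into system memory.
export SINGULARITYENV_TF_FORCE_UNIFIED_MEMORY=1
# Let the process address up to 4x the GPU memory via unified memory.
export SINGULARITYENV_XLA_PYTHON_CLIENT_MEM_FRACTION=4.0
```

#Run the export lines in the same shell session as the singularity command so that the container inherits them.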

singularity exec --nv --bind /home2/shared/alphafold-2.3.2:/db --bind $PWD --pwd $PWD /programs/alphapulldown-1.0.4/ap-2.0.1.sif \
 --mode=pulldown \
 --num_cycle=3 \
 --num_predictions_per_model=1 \
 --output_path=./outputdir \
 --data_dir=/db \
 --protein_lists=bait.txt,candidates.txt

Step 4: generate a summary table

#This step can be run on any server. You just need the outputdir from the previous step.

#Run the command from the current directory, which has the outputdir under it.

singularity exec --bind $PWD --pwd $PWD /programs/alphapulldown-1.0.4/ap-2.0.1.sif --cutoff=10 --output_dir=./outputdir 

The output is a csv file that you can open in Excel: predictions_with_good_interpae.csv in ./outputdir. This page provides some guidelines on the scores.
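
If you prefer the command line to Excel, the table can be ranked by one score column. The helper below is a hypothetical sketch: it looks the column up by name in the header line, and it assumes a simple csv with no commas inside quoted fields. The column names in the demo (jobs, iptm_ptm, iptm) are assumptions, so check the header of your own file first.

```shell
# sort_csv_by FILE COLUMN
# Print the header, then the data rows sorted by COLUMN, highest value first.
sort_csv_by() {
    idx=$(head -n 1 "$1" | tr ',' '\n' | grep -nxF -e "$2" | head -n 1 | cut -d: -f1)
    [ -n "$idx" ] || { echo "column '$2' not found" >&2; return 1; }
    head -n 1 "$1"
    tail -n +2 "$1" | sort -t, -k"$idx,$idx" -g -r
}

# Demo table (made-up values) so the sketch runs on its own.
demo=$(mktemp -d)
cat > "$demo/scores.csv" <<'EOF'
jobs,iptm_ptm,iptm
bait1_and_c1,0.45,0.41
bait1_and_c2,0.82,0.80
bait1_and_c3,0.30,0.22
EOF

sort_csv_by "$demo/scores.csv" iptm

# Name of the top-ranked pair (first data row after sorting).
top=$(sort_csv_by "$demo/scores.csv" iptm | sed -n '2p' | cut -d, -f1)
```

On real data the call would be "sort_csv_by outputdir/predictions_with_good_interpae.csv <column name>".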

Step 5: create a jupyter notebook

I found the jupyter visualization tool not very useful. I prefer the plots generated by alphapickle in the next section, and I use pymol for protein structure visualization. If you struggle with this step, you can skip to the next section, "#QC plot can also be produced by alphapickle."

#run the command in the outputdir

cd outputdir

singularity exec --bind $PWD --pwd $PWD /programs/alphapulldown-1.0.4/ap-2.0.1.sif --cutoff=5.0 --output_dir=./

#after this step, you should see a file output.ipynb in the directory.

#start jupyter in the outputdir

singularity exec --bind $PWD --pwd $PWD /programs/alphapulldown-1.0.4/ap-2.0.1.sif jupyter lab --ip= --port=8017 --no-browser

#You should see a URL. Copy and paste it into a web browser, open output.ipynb in the left panel, and execute every cell in the notebook.

#QC plot can also be produced by alphapickle.

Execute in parallel on two or more GPU units on the same server:
1. Calculate the number of jobs for run_multimer, which is "number_of_baits x number_of_candidates"

2. Divide the jobs into two batches, and create two shell scripts. The job_index should be between 1 and number of jobs:
shell script 1: 
NVIDIA_VISIBLE_DEVICES=0 singularity exec --nv --nvccli --contain ... --job_index=1
NVIDIA_VISIBLE_DEVICES=0 singularity exec --nv --nvccli --contain ... --job_index=3
NVIDIA_VISIBLE_DEVICES=0 singularity exec --nv --nvccli --contain ... --job_index=5

shell script 2:
NVIDIA_VISIBLE_DEVICES=1 singularity exec --nv --nvccli --contain ... --job_index=2
NVIDIA_VISIBLE_DEVICES=1 singularity exec --nv --nvccli --contain ... --job_index=4
NVIDIA_VISIBLE_DEVICES=1 singularity exec --nv --nvccli --contain ... --job_index=6

3. Run both shell scripts simultaneously in a "screen" session
sh >& log1 &
sh >& log2 &

4. monitor whether both GPU
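
The job count in step 1 and the two batch scripts in step 2 can be generated rather than typed by hand. This is a minimal sketch: the script names gpu0.sh and gpu1.sh and the demo protein names are made up, and the command string is copied from the text above with its "..." left as a placeholder that you must replace with the full Step 3 command.

```shell
# Demo lists (made-up names) so the sketch runs on its own.
demo=$(mktemp -d)
printf 'bait1\n' > "$demo/bait.txt"
printf 'c1\nc2\nc3\nc4\nc5\nc6\n' > "$demo/candidates.txt"

# Number of jobs = number of baits x number of candidates (non-empty lines).
n_baits=$(grep -c . "$demo/bait.txt")
n_cands=$(grep -c . "$demo/candidates.txt")
n_jobs=$((n_baits * n_cands))
echo "jobs: $n_jobs"

# Placeholder: the "..." stands for the rest of the Step 3 command.
cmd='singularity exec --nv --nvccli --contain ...'

# Start with two empty scripts, one per GPU unit.
: > "$demo/gpu0.sh"
: > "$demo/gpu1.sh"

# Alternate jobs between the two GPUs: odd indices on device 0, even on device 1.
i=1
while [ "$i" -le "$n_jobs" ]; do
    gpu=$(( (i - 1) % 2 ))
    echo "NVIDIA_VISIBLE_DEVICES=$gpu $cmd --job_index=$i" >> "$demo/gpu$gpu.sh"
    i=$((i + 1))
done

cat "$demo/gpu0.sh"
```

Alternating the indices keeps the two batches the same size, to within one job, whatever the total number of jobs is.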

