The SyntenyLink algorithm lets users reconstruct the subgenomes of polyploid species more conveniently and separate the set of genes belonging to each subgenome, with the aid of reference proteomes of the polyploid species and a related ancestor.
All programs are run with command-line options on Linux or macOS. Usage and help information is built into the programs. To show it on the screen, run the program without any options::
This program detects homologs between two species with blastp and then filters the hits by bit score and e-value to remove noise. A hit is kept only if it meets two criteria: an
e-value below 1e-20, and a ratio of its bit score to the highest bit score greater than 0.6.
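The two filtering criteria above can be sketched as follows. This is a minimal illustration, not the SyntenyLink_bf implementation: it assumes that "the highest bit score" means the best bit score among all hits of the same query gene, and it uses a reduced hit layout (query, subject, e-value, bit score) with hypothetical gene names.

```python
from collections import defaultdict

E_VALUE_MAX = 1e-20        # a hit must have an e-value below this
BITSCORE_RATIO_MIN = 0.6   # a hit must exceed this fraction of the best bit score


def filter_hits(hits):
    """Keep the hits that pass both criteria.

    `hits` is a list of (query, subject, evalue, bitscore) tuples. Taking
    the best bit score per query gene is an assumption of this sketch.
    """
    best = defaultdict(float)
    for query, _subject, _evalue, bitscore in hits:
        best[query] = max(best[query], bitscore)

    return [
        (query, subject, evalue, bitscore)
        for query, subject, evalue, bitscore in hits
        if evalue < E_VALUE_MAX and bitscore / best[query] > BITSCORE_RATIO_MIN
    ]


# Hypothetical hits: geneA has three candidates, geneB has one weak hit.
example_hits = [
    ("geneA", "hit1", 1e-80, 300.0),
    ("geneA", "hit2", 1e-50, 200.0),
    ("geneA", "hit3", 1e-30, 150.0),
    ("geneB", "hit4", 1e-10, 90.0),
]
kept_hits = filter_hits(example_hits)
```

Here hit3 fails the ratio test (150 / 300 = 0.5) and hit4 fails the e-value test.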
- Usage
Reads in a data file: abc.blast.
The abc.blast file holds blast hits after running blastp between baseline species and polyploid species of interest::
Note: blastp requires the protein FASTA files of the baseline model species and the polyploid species of interest. After obtaining the BLAST output, update the gene start and end loci by adding the row start values from the bed file of each species before running SyntenyLink_bf.py.
.. Note: for chr#, a two-letter short name is used as prefix for the species; # is the chromosome number. (For example, the second chromosome of Arabidopsis thaliana should be denoted as at2.)
.. The `bed` format is defined `here <http://genome.ucsc.edu/FAQ/FAQformat.html#format1>`_, and is especially useful since there are a ton of tools that can handle bed files, most notably BEDTOOLS.
.. The xyz.bed file can be generated by parsing the .gff3 file released by the sequencing initiatives.
.. Repeat of the same gene is not allowed in the .bed file.
.. When comparing multiple genomes, simply concatenate all inter-/intra-species m8 blast output into xyz .blast file and concatenate all gene positions of different species into xyz.bed file.
To make SyntenyLink_bf generate more reasonable results, we advise restricting the number of BLASTP hits per gene to roughly the top 5.
Once abc.blast is ready, place it in the working folder. Then you can simply use::
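One way to apply the top-5 restriction before running the filter is sketched below. It assumes the BLAST hits are in the default 12-column tabular (m8) layout, where column 1 is the query and column 12 is the bit score, and that "top" means highest bit score; the function name is hypothetical.

```python
from collections import defaultdict


def top_hits_per_query(lines, limit=5):
    """Return at most `limit` tabular BLAST lines per query gene.

    Lines are ranked by bit score (column 12 of the m8 layout), highest
    first. Ranking by bit score is an assumption of this sketch.
    """
    by_query = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        by_query[fields[0]].append((float(fields[11]), line))

    kept = []
    for scored in by_query.values():
        scored.sort(key=lambda pair: pair[0], reverse=True)
        kept.extend(line for _score, line in scored[:limit])
    return kept
```

The function works on any iterable of lines, so an open file handle for abc.blast can be passed directly.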
$ python3 ./SyntenyLink_bf.pl dir/abc.blast
- Output
The execution of SyntenyLink_bf outputs one blast file abc_blast_filtered.txt, containing filtered blast hits as follows::
This program takes the output of a synteny analysis performed with DAGchainer, which builds chains of syntenic genes and computes a score for each chain. The
modified filtered blastp results (abc_blast_filtered_modified.txt) and the DAGchainer output (abc_synteny.aligncoords) are used to generate a syntelog table in which the gene chains are incorporated into their corresponding chromosomes, with redundant chains (likely due to gene
duplications or tandem or segmental duplications in the polyploid genome) placed in the “overlap” column of the syntelog table.
Reads in data file for DAGchainer: abc_blast_filtered_modified.txt.
The abc_blast_filtered_modified.txt file holds the modified abc_blast_filtered.txt file matching the input file format of DAGchainer::
Note: After obtaining the abc_blast_filtered.txt file, update the query start, query end, subject start and subject end values by incorporating the gene locus data from the bed files of the baseline species and the polyploid species of interest.
Then add the chromosome to which each gene belongs, taken from the bed file of each species, before running DAGchainer.
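The conversion described in the note can be sketched roughly as below. This is an illustration under stated assumptions: gene positions are read from the first four bed columns (chromosome, start, end, gene name), each output row follows the nine-column tab-delimited layout commonly used as DAGchainer input (chromosome, gene, start, end for each species, then the e-value), and the function names are hypothetical. Check the layout against your DAGchainer version before relying on it.

```python
def read_bed(lines):
    """Map gene name -> (chromosome, start, end) from bed lines."""
    genes = {}
    for line in lines:
        chrom, start, end, name = line.rstrip("\n").split("\t")[:4]
        genes[name] = (chrom, int(start), int(end))
    return genes


def to_dagchainer(hits, query_bed, subject_bed):
    """Build tab-delimited rows from (query, subject, evalue) hits.

    Hits whose genes are missing from either bed mapping are skipped.
    The nine-column layout is an assumption of this sketch.
    """
    rows = []
    for query, subject, evalue in hits:
        if query not in query_bed or subject not in subject_bed:
            continue
        q_chr, q_start, q_end = query_bed[query]
        s_chr, s_start, s_end = subject_bed[subject]
        fields = [q_chr, query, q_start, q_end,
                  s_chr, subject, s_start, s_end, evalue]
        rows.append("\t".join(str(field) for field in fields))
    return rows
```

Passing the e-value through as a string preserves its formatting from the BLAST output.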
Here is a typical parameter setting for generating the abc_synteny.aligncoords file::
The execution of SyntenyLink_st produces four output files: abc_synteny.success.colinear, abc_synteny.failed.colinear, abc_synteny.chains.passed and abc_synteny.all.chains.
The abc_synteny.success.colinear file is the main output. It holds the generated syntelog table, in which the gene chains are incorporated into their corresponding chromosomes, with redundant chains (likely due to gene duplications or tandem
or segmental duplications in the polyploid genome) placed in the “overlap” column::
This program identifies the main breakpoints of translocations in the abc_synteny.success.colinear file. It is designed to detect fractionation gaps larger
than a gap threshold. Synteny blocks capped by two breakpoints are grouped into a “super-synteny block”, within which gaps smaller than the gap threshold may still exist.
The input to this algorithm comprises two parameters, a gap threshold and a minimum block length, and one data file: the syntelog table generated in Step 2, in which collinear blocks are placed into the chromosomes of the organism.
Next, the gene density of the super-synteny blocks in each chromosome is calculated. Densities below a threshold value are disregarded; we use a threshold density of 0.1 here.
The chromosomes with gene density greater than 0.1 are ranked by the density of the blocks within the chromosome and assigned into candidate 𝑚 subgenomes, where 𝑚 denotes the ploidy level or the number
of subgenomes to be reconstructed.
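The idea of splitting at large gaps and then scoring blocks by density can be illustrated with the sketch below. It is not the tool's implementation: it assumes one chromosome column of the syntelog table is given as a list with "x" marking rows without a gene, it treats a run of more than gap_threshold empty rows as a breakpoint, and it defines density as genes per row within a block. All names are hypothetical.

```python
def super_blocks(column, gap_threshold, min_block_length):
    """Split a column into (start_row, end_row, density) blocks.

    A new block starts whenever the run of empty ("x") rows since the
    last gene is longer than `gap_threshold`. Blocks spanning fewer than
    `min_block_length` rows are dropped.
    """
    blocks = []
    start = None  # first gene row of the current block
    last = None   # most recent gene row of the current block
    genes = 0     # number of genes seen in the current block
    for row, cell in enumerate(column):
        if cell == "x":
            continue
        if start is not None and row - last - 1 > gap_threshold:
            blocks.append((start, last, genes))
            start, genes = None, 0
        if start is None:
            start = row
        last = row
        genes += 1
    if start is not None:
        blocks.append((start, last, genes))

    return [
        (s, e, g / (e - s + 1))
        for s, e, g in blocks
        if e - s + 1 >= min_block_length
    ]


def dense_blocks(blocks, density_threshold=0.1):
    """Keep only blocks whose density exceeds the threshold."""
    return [block for block in blocks if block[2] > density_threshold]
```

Ranking the surviving blocks by density, as described above, would then be a sort on the third element of each tuple.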
- Usage
This step takes no input parameters and reads no data files.
Note: You need to select the most suitable gap threshold and minimum block length for your data by running the gap_thresh_bl_length_param_study.py script.
Here is a typical parameter setting for obtaining the optimal gap threshold and block length values for your data::
$ python3 gap_thresh_bl_length_param_study.py
- Output
Optimal gap threshold and block length values are printed to the console.
The Super_synteny_bl_sub_placement_density.xlsx file holds the subgenome placements of super-synteny blocks based on gene density, after removing the noise taking density threshold into account::
This program uses a weighted directed graph to dynamically link the blocks produced in Step 3 that are most likely to belong to the same subgenome, combining information on fractionation
and substitution patterns with the continuity of gene chains. The algorithm takes two input parameters, a1 and a2.
- Usage
Reads in two data files: abc_synteny.all.chains and abc_blastn.blast.
Two input parameters: a1 and a2.
Note: You need to generate the abc_blastn.blast file by running nucleotide BLAST (blastn) on the CDS FASTA files of the baseline species and the polyploid species of interest.
The abc_blastn.blast file holds blast hits after running blastn between baseline species and polyploid species of interest::
The execution of SyntenyLink_wg produces one output file per subgenome in the species of interest, named Super_synteny_graph_nodes_sub{k}.xlsx, for a total of m files.
k represents the corresponding subgenome number.
The Super_synteny_graph_nodes_sub{k}.xlsx file holds the nodes belonging to each subgenome after traversing the graph::
Row start # Row end # subgenome1 subgenome2 subgenome3
0 123 N10.r N9 N8.r
124 698 N10 N8.r N9.r
699 1029 N6 N9.r N8.r
SyntenyLink_sb
::::::::::::::::::::::::::
This program retrieves genes “hidden” in small blocks that were missed in previous steps. Specifically, the chromosome blocks with densities lower than d_min, which were removed in previous steps, are considered
when generating small blocks, because these blocks may be more likely to exhibit flipping of synteny blocks between the forward and reverse strands resulting from segmental duplication. After
the small blocks are generated, they are incorporated into the corresponding super-synteny blocks, using the subgenome placement of the super-synteny blocks as
a reference.
- Usage
No input files are read.
Three input parameters: ws1, ws2 and ws3.
Note: You need to select the most suitable window size parameters for subgenome1, subgenome2 and subgenome3.
.. To run SyntenyLink_sb.py you can simply use::
.. $ python3 SyntenyLink_sb.py -ws1 -ws2 -ws3
- Output
The execution of SyntenyLink_sb produces the following output files: abc_synteny_chromosome_names.success.colinear.xlsx, subgenome_placement_blocks.all.xlsx and subgenome_placement_blocks.all_sub{k}.xlsx. There are m files of the last kind.
k represents the corresponding subgenome number.
The abc_synteny_chromosome_names.success.colinear.xlsx file holds the syntelog table with the gene IDs replaced by the names of the chromosomes to which they belong::
Row start # Row end # subgenome1 subgenome2 subgenome3
0 123 N10.r N9 N8.r
124 698 N10 N8.r N9.r
699 1029 N6 N9.r N8.r
The subgenome_placement_blocks.all.xlsx file holds the final placement of genes in subgenomes in step 5::
Row start # Row end # subgenome1 subgenome2 subgenome3
0 123 N10.r N9 N8.r
124 698 N10 N8.r N9.r
699 1029 N6 N9.r N8.r
The subgenome_placement_blocks.all_sub{k}.xlsx file holds the optimal placement of genes in subgenomes for each subgenome. Ex: subgenome_placement_blocks.all_sub1.xlsx::
Row start # Row end # subgenome1 subgenome2 subgenome3
0 75 N10.r N9 N8.r
76 78 N10.r N9.r N8.r
79 123 N10.r N9 N8.r
124 698 N10 N9.r N8.r
SyntenyLink_mn
::::::::::::::::::::::::::
This program optimizes the subgenome placement of each block (a row in the subgenome matrix) based on its neighbourhood within a subgenome,
aiming to maximize the number of contiguous blocks/rows from the same chromosome. The algorithm uses two window sizes, up and down, which set
the number of blocks above and below a given block that define its neighbourhood.
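How the two window sizes define a neighbourhood can be shown with a small sketch. It only illustrates the windowing, not the optimization SyntenyLink_mn performs: it assumes one subgenome column is given as a list of chromosome labels and simply reports the most common label among the neighbours of a block. The function name is hypothetical.

```python
from collections import Counter


def neighbourhood_majority(labels, index, window_up, window_down):
    """Return the most common label around `labels[index]`.

    The neighbourhood is the `window_up` blocks above and the
    `window_down` blocks below the given block, clipped at the ends of
    the column. The block itself is excluded. If it has no neighbours,
    its own label is returned.
    """
    first = max(0, index - window_up)
    stop = min(len(labels), index + window_down + 1)
    neighbours = labels[first:index] + labels[index + 1:stop]
    if not neighbours:
        return labels[index]
    return Counter(neighbours).most_common(1)[0][0]


# Hypothetical column: one block labelled N9 sits inside a run of N10.
column = ["N10", "N10", "N9", "N10", "N10"]
suggested = neighbourhood_majority(column, 2, window_up=2, window_down=2)
```

A block whose label disagrees with its neighbourhood majority is a candidate for reassignment, which is one way to increase the number of contiguous rows from the same chromosome.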
- Usage
No input files are read.
Two input parameters: wup and wdwn.
Note: You need to select the most suitable window up and window down parameters.
.. To run SyntenyLink_mn.py you can simply use::
.. $ python3 SyntenyLink_mn.py -wup -wdwn
- Output
The execution of SyntenyLink_mn produces two output files: Final_subgenome_placement_result.xlsx and Final_result.xlsx.
The Final_subgenome_placement_result.xlsx file holds the final placement of chromosome blocks in subgenomes::
Row start # Row end # subgenome1 subgenome2 subgenome3
0 75 N10.r N9 N8.r
76 78 N10.r N9.r N8.r
79 123 N10.r N9 N8.r
124 698 N10 N9.r N8.r
699 1035 N6 N9.r N8.r
1036 1036 N6 N9 N8.r
The Final_result.xlsx file holds the final placement of genes inside blocks in subgenomes::
subgenome1 subgenome2 subgenome3
BraA10g000880.3C x x
BraA10g000870.3C x x
BraA10g000860.3C x x
BraA10g000850.3C x x
BraA10g000830.3C BraA09g065940.3C x
BraA10g000820.3C x x
x BraA09g065960.3C x
BraA10g000790.3C x x
BraA10g000780.3C BraA09g065970.3C x
BraA10g000770.3C BraA09g065980.3C x
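A table like the one above can be summarized by counting the genes placed in each subgenome. The sketch below assumes that "x" marks a position with no gene in that subgenome and that the rows are available as whitespace-separated text lines; reading the .xlsx file itself is left out, and the function name is hypothetical.

```python
def retained_counts(rows):
    """Count non-"x" entries per subgenome column.

    `rows` is a list of whitespace-separated lines whose first line is
    the header. Treating "x" as "no gene" is an assumption of this sketch.
    """
    header, *body = [line.split() for line in rows]
    counts = dict.fromkeys(header, 0)
    for cells in body:
        for name, cell in zip(header, cells):
            if cell != "x":
                counts[name] += 1
    return counts


# The first rows of the example table above.
table = [
    "subgenome1 subgenome2 subgenome3",
    "BraA10g000880.3C x x",
    "BraA10g000830.3C BraA09g065940.3C x",
    "x BraA09g065960.3C x",
]
counts = retained_counts(table)
```

Comparing the counts across subgenomes gives a quick view of how unevenly genes were retained.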
main_script
::::::::::::::::::::::::::
This holds the main script that runs all the above scripts in order. It takes in the following parameters:
-i input_file
-g gap_threshold
-m minimum_block_length
-n number_of_subgenomes
-gt ground_truth_file
-c chains_file
-bl blastn_file
-a1 adjusment1
-a2 adjusment2
-ws1 window_size_subgenome1
-ws2 window_size_subgenome2
-ws3 window_size_subgenome3
-wup window_up
-wdwn window_down
Note: Before running main_script.py, you need to run the SyntenyLink_bf.pl and SyntenyLink_st.pl scripts.
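When the pipeline is driven from another script, the command line can be assembled from a dictionary of values, as in the sketch below. The flag names come from the list above; the helper name, the python3 interpreter and the example values are hypothetical, and whether every flag is mandatory is not stated here, so the sketch passes only the flags that are supplied.

```python
# Flags accepted by main_script.py, in the order listed above.
FLAGS = ["-i", "-g", "-m", "-n", "-gt", "-c", "-bl",
         "-a1", "-a2", "-ws1", "-ws2", "-ws3", "-wup", "-wdwn"]


def build_command(values, script="main_script.py"):
    """Return an argument list for running the main script.

    `values` maps flags to their values. Flags outside the documented
    list are rejected; supplied flags are emitted in documented order.
    """
    unknown = [flag for flag in values if flag not in FLAGS]
    if unknown:
        raise ValueError("unknown flags: " + ", ".join(unknown))

    command = ["python3", script]
    for flag in FLAGS:
        if flag in values:
            command += [flag, str(values[flag])]
    return command


# Placeholder values for illustration only.
command = build_command({"-n": 3, "-i": "abc_synteny.success.colinear"})
```

The resulting list can be passed to subprocess.run without shell quoting concerns.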