From 4475bd3be107c2e38050d6d46aea9cebbfae0086 Mon Sep 17 00:00:00 2001
From: "Thulani Hewavithana (qnm481)" <qnm481@mail.usask.ca>
Date: Fri, 7 Jul 2023 10:06:21 -0600
Subject: [PATCH] Delete README.md

---
 README.md | 423 ------------------------------------------------------
 1 file changed, 423 deletions(-)
 delete mode 100644 README.md

diff --git a/README.md b/README.md
deleted file mode 100644
index 9f7ed3d..0000000
--- a/README.md
+++ /dev/null
@@ -1,423 +0,0 @@
-SyntenyLink
-===========
-
-Overview
-========
-
-The SyntenyLink package has six major components. The SyntenyLink
-algorithm allows users to reconstruct subgenomes of polyploid
-species more conveniently and to separate the set of genes belonging to
-each subgenome in the organism with the aid of reference proteomes of
-the polyploid species and a related ancestor.
-
-All programs are executed using command line options on Linux systems or
-Mac OS. Usage and help information is built into the programs. To
-show it on the screen, users just need to run the program without
-giving any options:
-
-    $ python3 ./SyntenyLink.py
-
-All code is copiable, distributable, modifiable, and usable without any
-restrictions.
-Contact: Thulani Hewavithana, <qnm481@usask.ca>; Chu Shin
-Koh, <kevin.koh@gifs.ca>
-
-The following is the list of executable programs
-------------------------------------------------
-
-**Main programs (in the Scripts folder)**
-
-- SyntenyLink\_bf.pl
-- SyntenyLink\_st.pl
-- SyntenyLink\_mbp.py
-- SyntenyLink\_wg.py
-- SyntenyLink\_sb.py
-- SyntenyLink\_mn.py
-- main\_script.py
-- SyntenyLink\_acc.py
-- gap\_threshold\_selection.py
-- minimum\_block\_length\_selection.py
-
-Main programs
-=============
-
-SyntenyLink\_bf
----------------
-
-This program detects homologs between two species with blastp and
-applies a filtering step on bit score and e-value to remove
-noise, using two criteria: an e-value threshold of less than 1e-20
-and a ratio between the bit score and the highest bit score greater than
-0.6.
-
-\- Usage Reads in a data file: abc.blast. The abc.blast file holds blast
-hits after running blastp between the baseline species and the polyploid
-species of interest:
-
-    BraA01g000010.3C AT1G43860.1 74.194 124 14 3 80 186 231 353 9.61e-56 188
-
-Here is a typical parameter setting for generating the abc.blast file:
-
-    $ makeblastdb -in ref_pep.fa -dbtype prot -out ref_pep
-
-    $ blastall -i query_pep.fasta -p blastp -d ref_pep -m 8 -e 1e-5 -F F -v 5 -b 5 -o abc.blast -a 4
-
-Note: for blastp you need to use protein fasta files of the baseline model
-and the polyploid species of interest. After obtaining the blast output,
-update the gene start and end loci by adding the row start values from the
-bed file of each species before using SyntenyLink\_bf.pl.
-
-To make SyntenyLink\_bf generate more reasonable
-results, the number of BLASTP hits for a gene should be restricted to
-around the top 5. When you have abc.blast ready, put it in the same
-folder.
-Then you can simply use:
-
-    $ ./SyntenyLink_bf.pl dir/abc.blast
-
-\- Output The execution of SyntenyLink\_bf outputs one blast file,
-abc\_blast\_filtered.txt, containing filtered blast hits as follows:
-
-    BraA01g000010.3C AT1G43860.1 74.194 124 14 3 80 186 231 353 9.61e-56 188
-    BraA01g000010.3C AT3G04630.1 66.087 115 21 4 194 297 165 272 7.06e-33 125
-    BraA01g000010.3C AT3G04630.3 66.087 115 21 4 194 297 164 271 7.88e-33 125
-
-SyntenyLink\_st
----------------
-
-This program uses the output of a synteny analysis performed with
-DAGchainer to build chains of syntenic genes and compute the score of
-each chain. The modified version of the filtered results from blastp
-(abc\_blast\_filtered\_modified.txt) and the DAGchainer output
-(abc\_synteny.aligncoords) are used to generate a syntelog table in
-which the gene chains are incorporated into their corresponding
-chromosomes, with redundant chains (likely due to gene duplications or
-tandem or segmental duplications in the polyploid genome) placed in the
-"overlap" column of the syntelog table.
-
-Reads in data file for DAGchainer: abc\_blast\_filtered\_modified.txt.
-The abc\_blast\_filtered\_modified.txt file holds the modified
-abc\_blast\_filtered.txt file matching the input file format of
-DAGchainer:
-
-    A1 BraA01g000010.3C 2944 3050 Chr1 AT1G43860.1 16622247 16622597 9.61E-56 188
-    A1 BraA01g000010.3C 3058 3161 Chr3 AT3G04630.1 1259234 1259503 7.06E-33 125
-
-Note: After getting the abc\_blast\_filtered.txt file, update the query
-start, query end, subject start and subject end values by incorporating
-gene locus data from the bed files of the baseline species and the
-polyploid species of interest. Then add the chromosome to which each gene
-belongs, taken from the bed file of each species, before using DAGchainer.
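For reference, the two SyntenyLink\_bf filtering criteria that produce abc\_blast\_filtered.txt (e-value below 1e-20, bit score ratio above 0.6) can be sketched in Python as follows. This is only an illustration, not the code of SyntenyLink\_bf.pl: the function name `filter_blast_hits` is hypothetical, and the sketch assumes that "the highest bit score" means the best bit score among all hits of the same query gene.

```python
from collections import defaultdict


def filter_blast_hits(lines, max_evalue=1e-20, min_ratio=0.6):
    """Filter tabular (-m 8) BLAST hits given as text lines.

    Assumption: the bit score ratio is taken against the best bit score
    observed for the same query gene (column 1).
    """
    rows = [line.split() for line in lines if line.strip()]

    # Best bit score (column 12) per query gene (column 1).
    best = defaultdict(float)
    for row in rows:
        best[row[0]] = max(best[row[0]], float(row[11]))

    kept = []
    for row in rows:
        evalue = float(row[10])
        ratio = float(row[11]) / best[row[0]]
        if evalue < max_evalue and ratio > min_ratio:
            kept.append(row)
    return kept


if __name__ == "__main__":
    hits = [
        "BraA01g000010.3C AT1G43860.1 74.194 124 14 3 80 186 231 353 9.61e-56 188",
        "BraA01g000010.3C AT3G04630.1 66.087 115 21 4 194 297 165 272 7.06e-33 125",
    ]
    for row in filter_blast_hits(hits):
        print("\t".join(row))
```

With the sample hits shown earlier, the second hit is kept because 125 / 188 is about 0.66, which is above the 0.6 ratio.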
-
-Here is a typical parameter setting for generating the
-abc\_synteny.aligncoords file:
-
-    $ ./run_DAG_chainer.pl -i dir/abc_blast_filtered_modified.txt -s -I
-
-The abc\_synteny.aligncoords file holds pairwise synteny blocks after
-running DAGchainer:
-
-    ## alignment A1 vs. Chr1 Alignment #1 score = 5177.6 (num aligned pairs: 121):
-    A1 BraA01g026830.3C 17161339 17161504 Chr1 AT1G56580.1 21198405 21198568 2.180000e-109 50
-    A1 BraA01g026890.3C 17191267 17191318 Chr1 AT1G57550.1 21312544 21312593 3.270000e-24 40
-    A1 BraA01g026900.3C 17196325 17196618 Chr1 AT1G57610.3 21337612 21337818 8.840000e-106 84
-
-\- Usage Reads in two data files: abc\_blast\_filtered\_modified.txt &
-ref\_genelist.txt. The ref\_genelist.txt file holds GFF-like
-details of the baseline species:
-
-    AT1G01010 AT1G01010.1 429 Chr1_1 Chr1 1 3631 5899 AT1G01010.1 NAC domain containing protein 1
-    AT1G01020 AT1G01020.1 245 Chr1_2 Chr1 2 5928 8737 AT1G01020.1 Arv1-like protein
-    AT1G01030 AT1G01030.1 358 Chr1_4 Chr1 4 11649 13714 AT1G01030.1 AP2/B3-like transcriptional factor family protein
-
-To run SyntenyLink\_st.pl you can simply use:
-
-    $ perl SyntenyLink_st.pl -d abc_synteny.aligncoords -g ref_genelist.txt
-
-\- Output The execution of SyntenyLink\_st outputs four output files:
-abc\_synteny.success.colinear, abc\_synteny.failed.colinear,
-abc\_synteny.chains.passed & abc\_synteny.all.chains.
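To make the layout of abc\_synteny.aligncoords concrete, the following Python sketch groups the gene pairs under their chain headers. It is not part of SyntenyLink\_st.pl; the function name `read_chains` and the dictionary keys are hypothetical, and the header pattern is inferred only from the sample header shown above, so other header variants that DAGchainer may emit are not handled.

```python
import re

# Header layout taken from the sample above (an assumption about the format).
HEADER = re.compile(
    r"^## alignment (\S+) vs\. (\S+)\s+Alignment #(\d+)\s+"
    r"score = ([\d.]+)\s+\(num aligned pairs: (\d+)\)"
)


def read_chains(lines):
    """Group aligncoords gene pairs by the chain header that precedes them."""
    chains = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        match = HEADER.match(line)
        if match:
            chains.append({
                "query_chr": match[1],
                "ref_chr": match[2],
                "number": int(match[3]),
                "score": float(match[4]),
                "declared_pairs": int(match[5]),
                "pairs": [],
            })
        elif not line.startswith("#") and chains:
            fields = line.split()
            # Column 2 is the polyploid gene, column 6 the baseline gene.
            chains[-1]["pairs"].append((fields[1], fields[5]))
    return chains


if __name__ == "__main__":
    with open("abc_synteny.aligncoords") as handle:
        for chain in read_chains(handle):
            print(chain["query_chr"], chain["ref_chr"], chain["score"],
                  len(chain["pairs"]))
```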
-
-The abc\_synteny.success.colinear file is the main output file. It holds
-the generated syntelog table in which the gene chains are incorporated
-into their corresponding chromosomes, with redundant chains (likely due
-to gene duplications or tandem or segmental duplications in the
-polyploid genome) placed in the "overlap" column of the syntelog table:
-
-    BraA01g000010.3C AT1G43860.1 74.194 124 14 3 80 186 231 353 9.61e-56 188
-    BraA01g000010.3C AT3G04630.1 66.087 115 21 4 194 297 165 272 7.06e-33 125
-    BraA01g000010.3C AT3G04630.3 66.087 115 21 4 194 297 164 271 7.88e-33 125
-
-The abc\_synteny.failed.colinear file holds the chains that were removed
-because the number of remaining colinear pairs is less than 6:
-
-    A6_Chr1_4 Chr1 AT1G14070 BraA06g010220.3C
-    A6_Chr1_4 Chr1 AT1G14080 BraA06g010230.3C
-    A6_Chr1_4 Chr1 AT1G14100 x
-    A6_Chr1_4 Chr1 AT1G14110 x
-    A6_Chr1_4 Chr1 AT1G14120 x
-    A6_Chr1_4 Chr1 AT1G14130 BraA06g010280.3C
-    A6_Chr1_4 Chr1 AT1G14140 x
-    A6_Chr1_4 Chr1 AT1G14150 x
-    A6_Chr1_4 Chr1 AT1G14160 x
-    A6_Chr1_4 Chr1 AT1G14170 x
-    A6_Chr1_4 Chr1 AT1G14180 x
-    A6_Chr1_4 Chr1 AT1G14185 x
-
-The abc\_synteny.chains.passed file holds the ids of the passed
-chains, those with more than 6 collinear pairs:
-
-    A7_Chr1_1
-    A8.r_Chr1_1
-    A9.r_Chr1_1
-    A2_Chr1_1
-    A6_Chr1_1
-    A7.r_Chr1_1
-
-The abc\_synteny.all.chains file holds all chains identified by
-DAGchainer:
-
-    A1_Chr1_1 A1 BraA01g026830.3C 17161339 17161504 Chr1 AT1G56580.1 21198405 21198568 2.180000e-109 50
-    A1_Chr1_1 A1 BraA01g026890.3C 17191267 17191318 Chr1 AT1G57550.1 21312544 21312593 3.270000e-24 40
-    A1_Chr1_1 A1 BraA01g026900.3C 17196325 17196618 Chr1 AT1G57610.3 21337612 21337818 8.840000e-106 84
-
-SyntenyLink\_mbp
-----------------
-
-This program identifies the main breakpoints of translocations in the
-abc\_synteny.success.colinear file. It is designed to detect
-fractionation gaps larger than a gap threshold.
-Synteny blocks capped by
-two breakpoints are grouped as a "super-synteny block", where gaps
-smaller than the gap threshold can exist within the super block. The
-input to this algorithm includes two parameters, a gap threshold and a
-minimum block length, and a data file, the syntelog table generated in
-Step 2, where collinear blocks are placed into the chromosomes of the
-organism.
-Next, the gene density of super-synteny blocks in each chromosome is
-calculated. The algorithm disregards densities below a threshold density
-value. We selected a threshold density value of 0.1 here. The
-chromosomes with gene density greater than 0.1 are ranked by the density
-of the blocks within the chromosome and assigned to m candidate
-subgenomes, where m denotes the ploidy level or the number of subgenomes
-to be reconstructed.
-
-\- Usage No input parameters or read in data files.
-
-Note: You need to select the most suitable gap threshold and minimum
-block length for your data by running the gap\_threshold\_selection.py
-and minimum\_block\_length\_selection.py scripts.
-
-Here is a typical parameter setting for obtaining optimal gap threshold
-and block length values for your data:
-
-    $ python3 gap_threshold_selection.py -i abc_synteny.success.colinear
-
-Then get the optimal gap threshold value from the console and run the
-following command:
-
-    $ python3 minimum_block_length_selection.py -i abc_synteny.success.colinear -g <output gap threshold value>
-
-\- Output Optimal gap threshold and block length values are printed to
-the console.
-
-\- Output The execution of SyntenyLink\_mbp outputs two output files:
-Super\_synteny\_block\_output.xlsx and
-Super\_synteny\_bl\_sub\_placement\_density.xlsx.
-
-The Super\_synteny\_block\_output.xlsx file holds the generated
-super-synteny blocks before noise is removed using the density
-threshold:
-
-    Block no. Block_start Block_end Row start # Row end # # genes in block N1 N1.r N2 N2.r N3 N3.r N4 N4.r N5 N5.r N6 N6.r N7 N7.r N8 N8.r
-    1 AT1G01010 AT1G01560 0 57 58 0.0 0.0 0.0 0.3684210526315789 0.0 0.6666666666666666 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
-    2 AT1G01570 AT1G02205 58 123 66 0.0 0.0 0.43283582089552236 0.0 0.0 0.4626865671641791 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
-    3 AT1G02210 AT1G14900 124 1491 1368 0.005113221329437546 0.0 0.3951789627465303 0.005843681519357195 0.3915266617969321 0.0 0.0577063550036523 0.1891891891891892 0.0 0.0 0.07596785975164354 0.26004382761139516 0.0 0.0 0.0 0.0
-
-The Super\_synteny\_bl\_sub\_placement\_density.xlsx file holds the
-subgenome placements of super-synteny blocks based on gene density,
-after noise is removed using the density threshold:
-
-    Block no. Block_start Block_end Row start # Row end # # genes in block N1 N1.r N2 N2.r N3 N3.r N4 N4.r N5 N5.r N6 N6.r N7 N7.r N8 N8.r N9 N9.r Non_zero subgenome1 subgenome2 subgenome3
-    1 AT1G01010 AT1G11860 0 1161 1162 0 0 0 0 0 0 0 0 0.544358312 0 0 0 0 0 0.354005168 0.308354866 0 0 3 N5 N8 N8.r
-    2 AT1G11870 AT1G13420 1162 1328 167 0 0 0 0 0 0 0 0 0 0.523809524 0 0 0 0 0.386904762 0.43452381 0 0 3 N5.r N8.r N8
-    3 AT1G13430 AT1G19470 1329 1957 629 0 0 0 0 0 0 0 0 0.598412698 0 0 0 0 0 0.33968254 0.422222222 0 0 3 N5 N8.r N8
-
-SyntenyLink\_wg
----------------
-
-This program uses a weighted directed graph to dynamically link the blocks
-produced in Step 3 that are most likely to be in the same subgenome, using
-combined information on fractionation and substitution patterns as well
-as the continuity of gene chains. This algorithm takes two input
-parameters, a1 and a2.
-
-\- Usage Reads in two data files: abc\_synteny.all.chains and
-abc\_blastn.blast. Two input parameters: a1 and a2.
-
-Note: You need to generate the abc\_blastn.blast file by running nucleotide
-blast using cds fasta files of the baseline species and the polyploid
-species of interest.
-
-The abc\_blastn.blast file holds blast hits after running blastn between
-the baseline species and the polyploid species of interest:
-
-    BraA01g000010.3C AT1G43860.1 78.706 371 25 6 239 558 692 1059 7.94e-91 335
-
-Here is a typical parameter setting for generating the abc\_blastn.blast
-file:
-
-    $ makeblastdb -in ref_cds.fa -dbtype nucl -out ref_cds
-
-    $ blastall -i query_cds.fasta -p blastn -d ref_cds -m 8 -e 1e-5 -F F -v 5 -b 5 -o abc_blastn.blast -a 4
-
-Note: You need to select the most suitable a1 and a2 parameter values
-for your data; they adjust the weights of the graph.
-
-\- Output The execution of SyntenyLink\_wg outputs one file per
-subgenome in the species of interest:
-Super\_synteny\_graph\_nodes\_sub{k}.xlsx. There will be m
-files, where k is the corresponding subgenome number.
-
-The Super\_synteny\_graph\_nodes\_sub{k}.xlsx file holds the nodes
-belonging to each subgenome after traversing the graph:
-
-    Row start # Row end # subgenome1 subgenome2 subgenome3
-    0 123 N10.r N9 N8.r
-    124 698 N10 N8.r N9.r
-    699 1029 N6 N9.r N8.r
-
-SyntenyLink\_sb
----------------
-
-This program retrieves genes "hidden" in small blocks that are missed in
-previous steps. Specifically, we consider the chromosome blocks with
-densities lower than the density threshold, which are removed in previous
-steps, in the generation of small blocks, because these blocks may be more
-likely to exhibit flipping of synteny blocks between the forward and
-reverse strands resulting from segmental duplication. Following the
-generation of small blocks, small synteny blocks are incorporated into the
-corresponding super-synteny blocks, taking into account the subgenome
-placement of the super-synteny blocks as a reference.
-
-\- Usage No read in files. Three input parameters: ws1, ws2 and ws3.
-
-Note: You need to select the most suitable window size parameters for
-subgenome1, subgenome2 and subgenome3.
-
-\- Output The execution of SyntenyLink\_sb outputs the following
-files: abc\_synteny\_chromosome\_names.success.colinear.xlsx,
-subgenome\_placement\_blocks.all.xlsx and
-subgenome\_placement\_blocks.all\_sub{k}.xlsx. There will be m
-files of the last kind, where k is the corresponding subgenome number.
-
-The abc\_synteny\_chromosome\_names.success.colinear.xlsx file holds the
-gene ids replaced with the names of the chromosomes to which they
-belong:
-
-    Row start # Row end # subgenome1 subgenome2 subgenome3
-    0 123 N10.r N9 N8.r
-    124 698 N10 N8.r N9.r
-    699 1029 N6 N9.r N8.r
-
-The subgenome\_placement\_blocks.all.xlsx file holds the final placement
-of genes in subgenomes in step 5:
-
-    Row start # Row end # subgenome1 subgenome2 subgenome3
-    0 123 N10.r N9 N8.r
-    124 698 N10 N8.r N9.r
-    699 1029 N6 N9.r N8.r
-
-The subgenome\_placement\_blocks.all\_sub{k}.xlsx file holds the optimal placement of genes in subgenomes for each subgenome. Ex: subgenome\_placement\_blocks.all\_sub1.xlsx:
-
-    Row start # Row end # subgenome1 subgenome2 subgenome3
-    0 75 N10.r N9 N8.r
-    76 78 N10.r N9.r N8.r
-    79 123 N10.r N9 N8.r
-    124 698 N10 N9.r N8.r
-
-SyntenyLink\_mn
----------------
-
-This program aims to optimize the subgenome placement of each block (a
-row in the subgenome matrix) based on neighbourhood considerations in a
-subgenome, maximizing the number of continuous blocks/rows from the same
-chromosome. The algorithm uses two window sizes, up and down, which
-determine the number of blocks above and below a given block that define
-its neighbourhood.
-
-\- Usage No read in files. Two input parameters: wup and wdwn.
-
-Note: You need to select the most suitable window up and window down
-parameters.
-
-\- Output The execution of SyntenyLink\_mn outputs two output
-files: Final\_subgenome\_placement\_result.xlsx and Final\_result.xlsx.
-
-The Final\_subgenome\_placement\_result.xlsx file holds the final
-placement of chromosome blocks in subgenomes:
-
-    Row start # Row end # subgenome1 subgenome2 subgenome3
-    0 75 N10.r N9 N8.r
-    76 78 N10.r N9.r N8.r
-    79 123 N10.r N9 N8.r
-    124 698 N10 N9.r N8.r
-    699 1035 N6 N9.r N8.r
-    1036 1036 N6 N9 N8.r
-
-The Final\_result.xlsx file holds the final placement of genes inside
-blocks in subgenomes:
-
-    subgenome1 subgenome2 subgenome3
-    BraA10g000880.3C x x
-    BraA10g000870.3C x x
-    BraA10g000860.3C x x
-    BraA10g000850.3C x x
-    BraA10g000830.3C BraA09g065940.3C x
-    BraA10g000820.3C x x
-    x BraA09g065960.3C x
-    BraA10g000790.3C x x
-    BraA10g000780.3C BraA09g065970.3C x
-    BraA10g000770.3C BraA09g065980.3C x
-
-main\_script
-------------
-
-This is the main script that runs all the above scripts in order. It
-takes the following parameters: -i input\_file -g gap\_threshold -m
-minimum\_block\_length -n number\_of\_subgenomes -gt ground\_truth\_file
--c chains\_file -bl blastn\_file -a1 adjustment1 -a2 adjustment2 -ws1
-window\_size\_subgenome1 -ws2 window\_size\_subgenome2 -ws3
-window\_size\_subgenome3 -wup window\_up -wdwn window\_down
-
-Note: Before running main\_script.py, you need to run the
-SyntenyLink\_bf.pl and SyntenyLink\_st.pl scripts.
-
-To run main\_script.py you can simply use:
-
-    $ python3 main_script.py -i abc_synteny.success.colinear -g -m -n -gt abc_groundtruth.xlsx -c abc_synteny.all.chains -bl abc_blastn.blast -a1 -a1 -a1 -a2 -a2 -a2 -ws1 -ws2 -ws3 -wup -wdwn
-
-\- Output The execution of main\_script outputs all the output files of
-the above scripts in the same directory as the input file.
-
-SyntenyLink\_acc
-----------------
-
-This is the script that calculates the accuracy of the final
-placement of genes in subgenomes when there is a ground truth file.
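The accuracy calculation can be sketched in Python as below. This is a guess at the metric, not the code of SyntenyLink\_acc.py: the function name `exact_match_stats` is hypothetical, and the definition (matches divided by matches plus missing genes, skipping ground truth rows marked `x`) is an assumption chosen because it agrees with the sample output of this script, where 11490 / (11490 + 2164) is about 84.15%.

```python
def exact_match_stats(predicted, truth, missing="x"):
    """Compare one predicted subgenome column against the ground truth.

    Assumption: a ground truth gene counts as an exact match when the
    prediction holds the same gene id in the same row, and as missing
    otherwise. Rows where the ground truth itself is `missing` are skipped.
    """
    matches = 0
    missing_genes = 0
    for pred_gene, true_gene in zip(predicted, truth):
        if true_gene == missing:
            continue
        if pred_gene == true_gene:
            matches += 1
        else:
            missing_genes += 1
    total = matches + missing_genes
    percentage = 100.0 * matches / total if total else 0.0
    return matches, missing_genes, percentage


if __name__ == "__main__":
    # Hypothetical toy columns, one entry per row of the placement table.
    predicted = ["BraA10g000880.3C", "x", "BraA10g000860.3C"]
    truth = ["BraA10g000880.3C", "BraA10g000870.3C", "BraA10g000860.3C"]
    matches, missing_genes, percentage = exact_match_stats(predicted, truth)
    print(f"Exact match percentage: {percentage:.2f}%")
    print(f"Exact match number: {matches}")
    print(f"Missing genes: {missing_genes}")
```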
-
-\- Usage One read in file: abc\_groundtruth.xlsx
-
-To run SyntenyLink\_acc.py you can simply use:
-
-    $ python3 SyntenyLink_acc.py abc_groundtruth.xlsx
-
-\- Output The execution of SyntenyLink\_acc prints the subgenome
-placement accuracy of each subgenome:
-
-    Exact match percentage for subgenome1: 84.15%
-    Exact match number for subgenome1: 11490
-    Missing genes for subgenome1: 2164
-    Exact match percentage for subgenome2: 61.61%
-    Exact match number for subgenome2: 6062
-    Missing genes for subgenome2: 3778
-    Exact match percentage for subgenome3: 65.30%
-    Exact match number for subgenome3: 5635
-    Missing genes for subgenome3: 2995
-
-Example
-=======
-- 
GitLab