This notebook is a complete workflow covering the construction of gene families, the listing of candidate adjacencies, and the construction of ancestral contigs using Maximum Weight Matching.
%% Cell type:markdown id: tags:
# Section 1: Importing Libraries, Modules and Configuration File
%% Cell type:code id: tags:
``` python
# set the notebook display method
%matplotlib inline
```
%% Cell type:code id: tags:
``` python
# install tqdm-joblib
%pip install tqdm-joblib
# install Bio
%pip install Bio
```
%% Cell type:code id: tags:
``` python
# import libraries
import sys
import yaml
import time
import os
import io
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from Bio import Phylo
from scipy.spatial.distance import squareform
from tqdm import tqdm
import tqdm_joblib
from joblib import Parallel, delayed
# import modules
#? eventually the below list will be in your package
from GeneFamily import GeneFamily
from Genome import Genome
from MWMInputTreeNode import MWMInputTreeNode
from mwmatching import *
from Contig import *
```
%% Cell type:markdown id: tags:
## Parsing the Configuration File
The configuration file lists the input/output directories, the input files, and the parameters of the pipeline.

Specifications for Input:
- leaf_genome_info: the input genomes with their positions on the phylogenetic tree, e.g. *Genomes.txt* <br>
- synteny_file_name: the postfix of the SynMap output files for pairwise genome comparisons <br>
- synteny_file_path: the path to the SynMap output data <br>
- phylo_tree_path: the path to the phylogenetic tree in Newick format <br>
- jar_path: the path to the UniMoG jar file <br>

__Note__: The input genomes are listed in the *Genomes.txt* file.
It contains information about each input genome, including CoGe ID, genomeName, and ancestor. <br>
The phylogenetic relationship is in the *phyloTree* file.
It contains the unrooted phylogenetic tree in Newick format, where each tree leaf is denoted as name_CoGeID. <br>
Sample Newick tree for the monocot project: ( ( (Spirodela_51364, Acorus_54711), Dioscorea_51051), ( (Ananas_25734, Elaeis_33018), Asparagus_33908) )
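
To make the name_CoGeID leaf convention concrete, here is a minimal sketch that pulls the leaf names and CoGe IDs out of the sample tree above. The helper `leaf_labels` is hypothetical and is not part of the pipeline; it only scans for labels with a regular expression and does not parse the tree structure, which the notebook would normally handle with `Bio.Phylo`.

```python
import re

# Sample tree from the text above; leaves follow the name_CoGeID convention.
NEWICK = ("( ( (Spirodela_51364, Acorus_54711), Dioscorea_51051), "
          "( (Ananas_25734, Elaeis_33018), Asparagus_33908) )")


def leaf_labels(newick):
    """Return (name, coge_id) pairs for every leaf label of the form name_CoGeID."""
    # Hypothetical helper: a regex scan is enough for labels like Spirodela_51364,
    # but it neither validates the parentheses nor handles branch lengths.
    return [(name, int(coge_id))
            for name, coge_id in re.findall(r"([A-Za-z]+)_(\d+)", newick)]


leaves = leaf_labels(NEWICK)
for name, coge_id in leaves:
    print(name, coge_id)
```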

Specifications for Output:
- gene_list: path where the gene list output should be created <br>
- gene_family: path where the gene family output should be created <br>
- genomes: path where the genome string output should be created <br>
- mwm_input_template: path and template for the output files created by the MWM input step <br>
- mwm_output_template: path and template for the output files created by MWM <br>
- contig_template: path and template for the output files created when constructing contigs <br>
- dcj_output_path: path to the output file written when calculating the DCJ distance between ancestors <br>
- dcj_summary_path: path to the output file containing a summary of the DCJ calculations <br><br>

Global parameters:
- min_cutoff_weight: minimum similarity cutoff weight for gene family construction, e.g. *min_cutoff_weight=65*
- max_cutoff_weight: maximum similarity cutoff weight for gene family construction, e.g. *max_cutoff_weight=100*
- ws: window size, e.g. *ws=7*
- min_mwm_weight: minimum adjacency weight to be considered for maximum weight matching, e.g. *min_mwm_weight=100*
- gf1: the maximum number of genes in a gene family, e.g. *gf1=50*
- gf2: the maximum number of genes from a single genome in a gene family, e.g. *gf2=10*
- gf3: the minimum number of genomes in a gene family, e.g. *gf3=1*
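
The three gf parameters can be read as size filters on a candidate gene family. The sketch below shows one way such a check could look. The function `passes_family_filters`, the `(genome_name, gene_id)` tuple layout, and the choice to treat every limit as inclusive are assumptions made for illustration, not the behaviour of the `GeneFamily` class.

```python
from collections import Counter


def passes_family_filters(genes, gf1=50, gf2=10, gf3=1):
    """Check one candidate gene family against the gf1/gf2/gf3 limits.

    `genes` is a list of (genome_name, gene_id) pairs; this layout is an
    assumption for the example, not the pipeline's actual data structure.
    """
    per_genome = Counter(genome for genome, _ in genes)
    if len(genes) > gf1:  # gf1: maximum number of genes in the family
        return False
    if max(per_genome.values(), default=0) > gf2:  # gf2: maximum genes from one genome
        return False
    return len(per_genome) >= gf3  # gf3: minimum number of genomes represented


small = [("Acorus", "g1"), ("Ananas", "g7"), ("Ananas", "g8")]
crowded = [("Elaeis", f"g{i}") for i in range(11)]  # 11 genes from a single genome

print(passes_family_filters(small))    # within all three limits
print(passes_family_filters(crowded))  # exceeds gf2=10
```

With the default values from the list above, `small` is kept and `crowded` is rejected because one genome contributes more than gf2 genes.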
%% Cell type:code id: tags:
``` python
#print("The postfix of SynMap output files:", parsed_config['input_file']['synteny_file_name'])
```
%% Output
Project Name and Input Directory: ../inputData/project-monocots
Project Name and Output Directory: ../outputData/project-monocots
Please check required input data under the Input Directory:
1. Genome.txt with info about input genomes
2. phyloTree.txt with Newick Tree Structure
3. SynMap results between every pair of genomes
%% Cell type:markdown id: tags:
# Section 2: Constructing Gene Families from Syntenically Validated Orthologs from SynMap
To construct gene families successfully, we require:
- Successful parsing of the configuration file <br>
- Valid parameters in the configuration file
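
The second requirement can be checked up front, before any gene family is built. The sketch below is one possible sanity check over the global parameters. The function `validate_parameters` and the specific rules (the minimum cutoff may not exceed the maximum, the window size and gf values are positive integers, the MWM weight is non-negative) are assumptions for illustration; the pipeline may apply different checks.

```python
REQUIRED = ["min_cutoff_weight", "max_cutoff_weight", "ws",
            "min_mwm_weight", "gf1", "gf2", "gf3"]


def validate_parameters(params):
    """Return a list of problems found in the global parameters (empty if none)."""
    missing = [key for key in REQUIRED if key not in params]
    if missing:
        return [f"missing parameter: {key}" for key in missing]

    problems = []
    # Assumed rule: the similarity cutoffs must form a valid range.
    if params["min_cutoff_weight"] > params["max_cutoff_weight"]:
        problems.append("min_cutoff_weight exceeds max_cutoff_weight")
    # Assumed rule: the window size and family limits are positive integers.
    for key in ("ws", "gf1", "gf2", "gf3"):
        if not isinstance(params[key], int) or params[key] < 1:
            problems.append(f"{key} must be a positive integer")
    # Assumed rule: adjacency weights are non-negative.
    if params["min_mwm_weight"] < 0:
        problems.append("min_mwm_weight must be non-negative")
    return problems


# Example values taken from the parameter list in Section 1.
example = {"min_cutoff_weight": 65, "max_cutoff_weight": 100, "ws": 7,
           "min_mwm_weight": 100, "gf1": 50, "gf2": 10, "gf3": 1}
bad = dict(example, min_cutoff_weight=120)

print(validate_parameters(example))
print(validate_parameters(bad))
```

Returning a list of messages rather than raising on the first problem lets the notebook report every invalid setting in one pass.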
%% Cell type:markdown id: tags:
## Reading in Input Genome Data and Tree Structure Data
Each genome is a leaf of the input phylogenetic tree. The *get_leaves_and_tree_info()* method extracts the attributes of each genome from the input file specified by the *leaf_genome_info* parameter in the configuration file.
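
As a rough illustration of reading such a file, the sketch below parses rows holding a CoGe ID, a genome name, and an ancestor. It is not the implementation of *get_leaves_and_tree_info()*: the function `read_leaf_genomes`, the tab-separated layout, and the sample ancestor values are all assumptions, and the real *Genomes.txt* format may differ.

```python
import csv
import io

# Hypothetical file content: tab-separated CoGe ID, genome name, ancestor.
# The CoGe IDs come from the sample tree; the ancestor values are made up.
SAMPLE = "51364\tSpirodela\t1\n54711\tAcorus\t1\n51051\tDioscorea\t2\n"


def read_leaf_genomes(handle):
    """Parse rows of (CoGe ID, genome name, ancestor) into a list of dicts."""
    genomes = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:  # skip blank lines
            continue
        coge_id, name, ancestor = row
        genomes.append({"coge_id": coge_id, "name": name, "ancestor": ancestor})
    return genomes


leaves = read_leaf_genomes(io.StringIO(SAMPLE))
for leaf in leaves:
    print(leaf["name"], leaf["coge_id"], leaf["ancestor"])
```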
%% Cell type:code id: tags:
``` python
print("Visualizing the number of gene families in each chromosome of each genome")
# collect the parallel lists into a long-format table
data = {'genome': genomes, 'chromosome number': chromosomes, 'number of gene families': gene_lengths}
pddata = pd.DataFrame(data)
sns.set(style="darkgrid")
# draw one bar-chart panel per genome
g = sns.catplot(x='chromosome number', y='number of gene families', col='genome',
                data=pddata, kind="bar", aspect=.7, sharex=False, dodge=False)
g.set_xticklabels(rotation=30)
```
%% Output
Visualizing the number of gene families in each chromosome of each genome
/Users/lij313/opt/anaconda3/lib/python3.8/site-packages/seaborn/categorical.py:3803: UserWarning: Setting `sharex=False` with `color=None` may cause different levels of the `x` variable to share colors. This will change in a future version.