This is a full workflow that shows methods regarding the construction of gene families, listing of candidate adjacencies and construction of ancestral contigs using Maximum Weight Matching.
%% Cell type:markdown id: tags:
# Section 1: Importing Libraries, Modules and Configuration File
%% Cell type:code id: tags:
``` python
# setting the Notebook Display Method
%matplotlibinline
```
%% Cell type:code id: tags:
``` python
# install tqdm-joblib
%pipinstalltqdm-joblib
# install Bio
%pipinstallBio
```
%% Cell type:code id: tags:
``` python
# import libraries
importsys
importyaml
importtime
importos
importio
importpandasaspd
importseabornassns
importmatplotlib
importmatplotlib.pyplotasplt
fromBioimportPhylo
fromscipy.spatial.distanceimportsquareform
fromtqdmimporttqdm
importtqdm_joblib
fromjoblibimportParallel,delayed
# import modules
#? eventually the below list will be in your package
fromGeneFamilyimportGeneFamily
fromGenomeimportGenome
fromMWMInputTreeNodeimportMWMInputTreeNode
frommwmatchingimport*
fromContigimport*
```
%% Cell type:markdown id: tags:
## Parsing the Configuration File
The configuration file consists of a list of input/output directories, input files and parameters to the pipeline.
Specifications for Input:
- leaf_genome_info: the input genomes with their positions on the phylogenetic tree, i.e. *Genomes.txt*<br>
- synteny_file_name: the post-fix of the synmap output of pairwise genome comparisons; <br>
- synteny_file_path: the path to the synmap output data <br>
- phylo_tree_path: the path to the Newick Phylogenetic Tree structure <br>
- jar_path: the path to the UniMoG jar file <br>
__Note__: The input genomes is in the *Genomes.txt* file.
It contains the information about input genomes, including CoGe ID, genomeName, ancestor<br>
The phylogenetic relationship is in the *phyloTree* file.
It includes the unrooted phylogenetic tree in newick format where each tree leaf is denoted as name_CoGeID. <br>
Sample Newick Tree for the monocot project: ( ( (Spirodela_51364, Acorus_54711), Dioscorea_51051), ( (Ananas_25734, Elaeis_33018), Asparagus_33908) )
Specifications for Output:
- gene_list: path where the gene list output should be created <br>
- gene_family: path where the gene family output should be created <br>
- genomes: path where the genome string output should be created <br>
- mwm_input_template: path and template for the output files created from MWM Input step <br>
- mwm_output_template: path and template for the output files created from MWM <br>
- contig_template: path and template for the output files created from constructing contigs <br>
- dcj_output_path: path to the output file when calculating DCJ distance between ancestors <br>
- dcj_summary_path: path to the output file containing a summary of the DCJ calculations <br><br>
Global parameters:
- min_cutoff_weight: minimum similarity cutoff weight for gene family construction, i.e. *min_cutoff_weight=65*
- max_cutoff_weight: maximum similarity cutoff weight for gene family construction, i.e. *max_cutoff_weight=100*
- ws: window size, i.e. *ws=7*
- min_mwm_weight: minimum adjacency weight to be considered for maximum weight matching, i.e. *min_mwm_weight=100*
- gf1: maximum number of genes in a gene family, i.e. *gf1=50*
- gf2: the maximum number of genes from a genome in a gene family, i.e. *gf2=10*
- gf3: the minimum number of genomes in a gene family, i.e. *gf3=1*
#print("The postfix of SynMap output files:", parsed_config['input_file']['synteny_file_name'])
```
%% Output
Project Name and Input Directory: ../inputData/project-buxus
Project Name and Output Directory: ../outputData/project-buxus
Project Name and Input Directory: ../inputData/project-monocots
Project Name and Output Directory: ../outputData/project-monocots
Please check required input data under the Input Directory:
1. Genome.txt with info about input genomes
2. phyloTree.txt with Newick Tree Structure
3. SynMap results between every pair of genomes
%% Cell type:markdown id: tags:
# Section 2: Constructing Gene Families from syntenically validated orthologs from SynMap
In order to succesfully construct gene families, we require:
- Successful parsing of the configuration file <br>
- Valid parameters in the configuration file
%% Cell type:markdown id: tags:
## Reading in Input Genome Data and Tree Structure Data
Each genome is a leaf of the input phylogenetic tree. The *get_leaves_and_tree_info()* method extracts the following attribute of each genome from the input file specified by *leaf_genome_info* parameter in the configuration file.
print("Visualizing the number of gene families in each chromosome of each genome")
data={'genome':genomes,'chromosome number':chromosomes,'number of gene families':gene_lengths}
pddata=pd.DataFrame(data)
sns.set(style="darkgrid")
g=sns.catplot(x='chromosome number',y='number of gene families',col='genome',data=pddata,kind="bar",aspect=.7,sharex=False,dodge=False)
g.set_xticklabels(rotation=30)
```
%% Output
Visualizing the number of gene families in each chromosome of each genome
/Users/lij313/opt/anaconda3/lib/python3.8/site-packages/seaborn/categorical.py:3803: UserWarning: Setting `sharex=False` with `color=None` may cause different levels of the `x` variable to share colors. This will change in a future version.
This is a full workflow that shows methods regarding the construction of gene families, listing of candidate adjacencies and construction of ancestral contigs using Maximum Weight Matching.
%% Cell type:markdown id: tags:
# Section 1: Importing Libraries, Modules and Configuration File
%% Cell type:code id: tags:
``` python
# setting the Notebook Display Method
%matplotlibinline
```
%% Cell type:code id: tags:
``` python
# install tqdm-joblib
%pipinstalltqdm-joblib
# install Bio
%pipinstallBio
```
%% Cell type:code id: tags:
``` python
# import libraries
importsys
importyaml
importtime
importos
importio
importpandasaspd
importseabornassns
importmatplotlib
importmatplotlib.pyplotasplt
fromBioimportPhylo
fromscipy.spatial.distanceimportsquareform
fromtqdmimporttqdm
importtqdm_joblib
fromjoblibimportParallel,delayed
# import modules
#? eventually the below list will be in your package
fromGeneFamilyimportGeneFamily
fromGenomeimportGenome
fromMWMInputTreeNodeimportMWMInputTreeNode
frommwmatchingimport*
fromContigimport*
```
%% Cell type:markdown id: tags:
## Parsing the Configuration File
The configuration file consists of a list of input/output directories, input files and parameters to the pipeline.
Specifications for Input:
- leaf_genome_info: the input genomes with their positions on the phylogenetic tree, i.e. *Genomes.txt*<br>
- synteny_file_name: the post-fix of the synmap output of pairwise genome comparisons; <br>
- synteny_file_path: the path to the synmap output data <br>
- phylo_tree_path: the path to the Newick Phylogenetic Tree structure <br>
- jar_path: the path to the UniMoG jar file <br>
__Note__: The input genomes is in the *Genomes.txt* file.
It contains the information about input genomes, including CoGe ID, genomeName, ancestor<br>
The phylogenetic relationship is in the *phyloTree* file.
It includes the unrooted phylogenetic tree in newick format where each tree leaf is denoted as name_CoGeID. <br>
Sample Newick Tree for the monocot project: ( ( (Spirodela_51364, Acorus_54711), Dioscorea_51051), ( (Ananas_25734, Elaeis_33018), Asparagus_33908) )
Specifications for Output:
- gene_list: path where the gene list output should be created <br>
- gene_family: path where the gene family output should be created <br>
- genomes: path where the genome string output should be created <br>
- mwm_input_template: path and template for the output files created from MWM Input step <br>
- mwm_output_template: path and template for the output files created from MWM <br>
- contig_template: path and template for the output files created from constructing contigs <br>
- dcj_output_path: path to the output file when calculating DCJ distance between ancestors <br>
- dcj_summary_path: path to the output file containing a summary of the DCJ calculations <br><br>
Global parameters:
- min_cutoff_weight: minimum similarity cutoff weight for gene family construction, i.e. *min_cutoff_weight=65*
- max_cutoff_weight: maximum similarity cutoff weight for gene family construction, i.e. *max_cutoff_weight=100*
- ws: window size, i.e. *ws=7*
- min_mwm_weight: minimum adjacency weight to be considered for maximum weight matching, i.e. *min_mwm_weight=100*
- gf1: maximum number of genes in a gene family, i.e. *gf1=50*
- gf2: the maximum number of genes from a genome in a gene family, i.e. *gf2=10*
- gf3: the minimum number of genomes in a gene family, i.e. *gf3=1*
#print("The postfix of SynMap output files:", parsed_config['input_file']['synteny_file_name'])
```
%% Output
Project Name and Input Directory: ../inputData/project-buxus
Project Name and Output Directory: ../outputData/project-buxus
Project Name and Input Directory: ../inputData/project-monocots
Project Name and Output Directory: ../outputData/project-monocots
Please check required input data under the Input Directory:
1. Genome.txt with info about input genomes
2. phyloTree.txt with Newick Tree Structure
3. SynMap results between every pair of genomes
%% Cell type:markdown id: tags:
# Section 2: Constructing Gene Families from syntenically validated orthologs from SynMap
In order to succesfully construct gene families, we require:
- Successful parsing of the configuration file <br>
- Valid parameters in the configuration file
%% Cell type:markdown id: tags:
## Reading in Input Genome Data and Tree Structure Data
Each genome is a leaf of the input phylogenetic tree. The *get_leaves_and_tree_info()* method extracts the following attribute of each genome from the input file specified by *leaf_genome_info* parameter in the configuration file.
print("Visualizing the number of gene families in each chromosome of each genome")
data={'genome':genomes,'chromosome number':chromosomes,'number of gene families':gene_lengths}
pddata=pd.DataFrame(data)
sns.set(style="darkgrid")
g=sns.catplot(x='chromosome number',y='number of gene families',col='genome',data=pddata,kind="bar",aspect=.7,sharex=False,dodge=False)
g.set_xticklabels(rotation=30)
```
%% Output
Visualizing the number of gene families in each chromosome of each genome
/Users/lij313/opt/anaconda3/lib/python3.8/site-packages/seaborn/categorical.py:3803: UserWarning: Setting `sharex=False` with `color=None` may cause different levels of the `x` variable to share colors. This will change in a future version.