%% Cell type:markdown id: tags:
# RACCROCHE - Module 1 & 2
This notebook walks through the full workflow: constructing gene families, listing candidate adjacencies, and constructing ancestral contigs using Maximum Weight Matching.
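To make the matching step concrete, here is a minimal sketch of what a maximum weight matching selects from a set of weighted candidate adjacencies. It is an illustration only: the extremity labels (`g1_h`, `g2_t`, ...) and the weights are invented, and the exhaustive search below is a stand-in that is only practical for a handful of edges. The notebook itself relies on the imported `mwmatching` module for this step.

```python
from typing import List, Tuple

# A candidate adjacency: (extremity_a, extremity_b, weight).
Edge = Tuple[str, str, int]


def brute_force_mwm(edges: List[Edge]) -> Tuple[int, List[Edge]]:
    """Return (total weight, chosen edges) of a maximum weight matching.

    Exhaustive recursion: for each edge, either skip it or take it and
    discard every remaining edge that shares an endpoint with it.
    """
    if not edges:
        return 0, []
    (u, v, w), rest = edges[0], edges[1:]
    # Option 1: leave the first edge out of the matching.
    skip_weight, skip_edges = brute_force_mwm(rest)
    # Option 2: take the first edge; its endpoints can no longer be reused.
    compatible = [e for e in rest if u not in e[:2] and v not in e[:2]]
    take_weight, take_edges = brute_force_mwm(compatible)
    if w + take_weight > skip_weight:
        return w + take_weight, [(u, v, w)] + take_edges
    return skip_weight, skip_edges


# Hypothetical candidate adjacencies between gene-family extremities.
candidates = [
    ("g1_h", "g2_t", 3),
    ("g2_t", "g3_h", 4),
    ("g3_h", "g4_t", 3),
]
total, chosen = brute_force_mwm(candidates)
print(total, chosen)
```

Each extremity appears in at most one chosen adjacency, which is what allows the selected adjacencies to be chained into linear contigs later on.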
%% Cell type:markdown id: tags:
# Section 1: Importing Libraries, Modules and Configuration File
%% Cell type:code id: tags:
``` python
# setting the Notebook Display Method
%matplotlib inline
```
%% Cell type:code id: tags:
``` python
# install tqdm-joblib
%pip install tqdm-joblib
# install Bio
%pip install Bio
```
%% Output
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: tqdm-joblib in /student/nma904/.local/lib/python3.8/site-packages (0.0.2)
Requirement already satisfied: tqdm in /usr/local/anaconda3/lib/python3.8/site-packages (from tqdm-joblib) (4.50.2)
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: Bio in /student/nma904/.local/lib/python3.8/site-packages (1.4.0)
Requirement already satisfied: requests in /usr/local/anaconda3/lib/python3.8/site-packages (from Bio) (2.24.0)
Requirement already satisfied: biopython>=1.79 in /usr/local/anaconda3/lib/python3.8/site-packages (from Bio) (1.79)
Requirement already satisfied: mygene in /student/nma904/.local/lib/python3.8/site-packages (from Bio) (3.2.2)
Requirement already satisfied: tqdm in /usr/local/anaconda3/lib/python3.8/site-packages (from Bio) (4.50.2)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/anaconda3/lib/python3.8/site-packages (from requests->Bio) (2022.5.18.1)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/anaconda3/lib/python3.8/site-packages (from requests->Bio) (3.0.4)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/anaconda3/lib/python3.8/site-packages (from requests->Bio) (1.25.11)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/anaconda3/lib/python3.8/site-packages (from requests->Bio) (2.10)
Requirement already satisfied: numpy in /usr/local/anaconda3/lib/python3.8/site-packages (from biopython>=1.79->Bio) (1.23.1)
Requirement already satisfied: biothings-client>=0.2.6 in /student/nma904/.local/lib/python3.8/site-packages (from mygene->Bio) (0.2.6)
Note: you may need to restart the kernel to use updated packages.
%% Cell type:code id: tags:
``` python
# import libraries
import sys
import yaml
import time
import os
import io
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from Bio import Phylo
from tqdm import tqdm
import tqdm_joblib
from joblib import Parallel, delayed
# import modules
#? eventually the below list will be in your package
from GeneFamily import GeneFamily
from Genome import Genome
from MWMInputTreeNode import MWMInputTreeNode
from mwmatching import *
from Contig import *
```
%% Cell type:markdown id: tags:
## Parsing the Configuration File
The configuration file lists the input/output directories, the input files, and the parameters of the pipeline.
Specifications for Input:
- leaf_genome_info: the input genomes with their positions on the phylogenetic tree, i.e. *Genomes.txt*<br>
- synteny_file_name: the postfix of the SynMap output files for pairwise genome comparisons <br>
- synteny_file_path: the path to the SynMap output data <br>
- jar_path: the path to the UniMoG jar file <br>
__Note__: The input genomes and their phylogenetic relationships are described in the *Genomes.txt* file.
It includes the unrooted phylogenetic tree in Newick format along with the genome structure and the corresponding input files.<br>
Sample Newick tree for the monocot project: (((51364, 54711), 51051), ((25734, 33018), 33908))
Specifications for Output:
- gene_list: path where the gene list output should be created <br>
- gene_family: path where the gene family output should be created <br>
- genomes: path where the genome string output should be created <br>
- mwm_input_template: path and template for the output files created by the MWM input step <br>
- mwm_output_template: path and template for the output files created by the MWM step <br>
- contig_template: path and template for the output files created from constructing contigs <br>
- dcj_output_path: path to the output file when calculating DCJ distance between ancestors <br>
- dcj_summary_path: path to the output file containing a summary of the DCJ calculations <br><br>
Global parameters:
- min_cutoff_weight: minimum similarity cutoff weight for gene family construction
- max_cutoff_weight: maximum similarity cutoff weight for gene family construction
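As a rough illustration of how these settings could be checked once the file has been parsed into a dictionary, consider the sketch below. Only the `input_file` section name is taken from this notebook; the `parameters` section name, the helper `check_config`, and every value in `example_config` are assumptions made up for the example.

```python
from typing import Any, Dict, List

# Input keys described above; assumed here to all live under "input_file".
REQUIRED_INPUT_KEYS = (
    "leaf_genome_info",
    "synteny_file_name",
    "synteny_file_path",
    "jar_path",
)


def check_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    inputs = config.get("input_file", {})
    for key in REQUIRED_INPUT_KEYS:
        if not inputs.get(key):
            problems.append(f"missing input_file.{key}")
    # "parameters" is a hypothetical section name for the global parameters.
    params = config.get("parameters", {})
    low = params.get("min_cutoff_weight")
    high = params.get("max_cutoff_weight")
    if low is None or high is None:
        problems.append("missing min_cutoff_weight or max_cutoff_weight")
    elif low > high:
        problems.append("min_cutoff_weight exceeds max_cutoff_weight")
    return problems


# Illustrative values only; none of these come from a real project.
example_config = {
    "input_file": {
        "leaf_genome_info": "Genomes.txt",
        "synteny_file_name": ".example-postfix",
        "synteny_file_path": "path/to/synmap/output",
        "jar_path": "path/to/UniMoG.jar",
    },
    "parameters": {"min_cutoff_weight": 65, "max_cutoff_weight": 100},
}
print(check_config(example_config))
```

Collecting all problems in a list, rather than raising on the first one, lets a user fix the whole configuration in one pass.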
%% Cell type:code id: tags:
``` python
# ...
#print("The postfix of SynMap output files:", parsed_config['input_file']['synteny_file_name'])
```
%% Output
Project Name and Input Directory: ../inputData/project-buxus
Project Name and Output Directory: ../outputData/project-buxus
Project Name and Input Directory: ../inputData/project-monocots
Project Name and Output Directory: ../outputData/project-monocots
Please check required input data:
1. Genome.txt with info about input genomes and phylogenetic tree
2. SynMap results between every pair of genomes
%% Cell type:markdown id: tags:
# Section 2: Constructing Gene Families from syntenically validated orthologs from SynMap
To construct gene families successfully, we require:
- Successful parsing of the configuration file <br>
- Valid parameters in the configuration file
%% Cell type:markdown id: tags:
## Reading in Input Genome Data and Tree Structure Data
Each genome is a leaf of the input phylogenetic tree. The *get_leaves_and_tree_info()* method extracts the attributes of each genome from the input file specified by the *leaf_genome_info* parameter in the configuration file.
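To show how leaf genomes relate to the tree string, the sketch below parses a label-only Newick string into nested lists and collects its leaves. It is not the implementation of *get_leaves_and_tree_info()*; the helpers `parse_newick` and `collect_leaves` are hypothetical, and they assume a tree without branch lengths or internal node labels, like the sample monocot tree.

```python
import re
from typing import List, Union

Node = Union[str, List["Node"]]


def parse_newick(text: str) -> Node:
    """Parse a label-only Newick string into nested lists of leaf labels."""
    tokens = re.findall(r"[(),]|[^\s(),;]+", text)
    pos = 0

    def parse_node() -> Node:
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while tokens[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            pos += 1  # consume ")"
            return children
        label = tokens[pos]
        pos += 1
        return label

    return parse_node()


def collect_leaves(node: Node) -> List[str]:
    """Return the leaf labels of a parsed tree in left-to-right order."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in collect_leaves(child)]


tree = parse_newick("(((51364, 54711), 51051), ((25734, 33018), 33908))")
print(collect_leaves(tree))
```

For trees with branch lengths or labeled internal nodes, a full parser such as `Bio.Phylo.read(handle, "newick")` is the safer choice.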
%% Cell type:code id: tags:
``` python
# ...
print("Visualizing the number of gene families in each chromosome of each genome")
data = {'genome': genomes, 'chromosome number': chromosomes, 'number of genes': gene_lengths}
pddata = pd.DataFrame(data)
sns.set(style="darkgrid")
g = sns.catplot(x='chromosome number', y='number of genes', col='genome', data=pddata, kind="bar", aspect=.7, sharex=False, dodge=False)
g.set_xticklabels(rotation=30)
```
%% Output
Visualizing the number of gene families in each chromosome of each genome
/usr/local/anaconda3/lib/python3.8/site-packages/seaborn/categorical.py:3793: UserWarning: Setting `sharex=False` with `color=None` may cause different levels of the `x` variable to share colors. This will change in a future version.
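%% Cell type:code id: tags:
``` python
# ...
dcj_summary_file.write("Median Structure for Ancestor " + str(i + 1) + ": " + "%s" % median_structure[i] + "\n")
for file in dcj_output:
    path = file
    # copy the first line of each DCJ output file into the summary
    with open(path, 'r') as dcj_file:
        dcj_info = dcj_file.readlines()
    dcj_summary_file.write(dcj_info[0])
print("Summary Done")
```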
%% Output
Summary Done
...
...