Commit d3e7377e authored by Lingling Jin (lij313)

add biopython examples

parent 62317df8
LOCUS       DQ091202                2119 bp    DNA     linear   MAM 11-JUN-2009
DEFINITION  Elephas maximus HBB/D gene, complete cds.
ACCESSION   DQ091202
VERSION     DQ091202.1
KEYWORDS    .
SOURCE      Elephas maximus (Asiatic elephant)
  ORGANISM  Elephas maximus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Afrotheria; Proboscidea; Elephantidae; Elephas.
REFERENCE   1  (bases 1 to 2119)
  AUTHORS   Opazo,J.C., Sloan,A.M., Campbell,K.L. and Storz,J.F.
  TITLE     Origin and ascendancy of a chimeric fusion gene: the
            beta/delta-globin gene of paenungulate mammals
  JOURNAL   Mol. Biol. Evol. 26 (7), 1469-1478 (2009)
   PUBMED   19332641
REFERENCE   2  (bases 1 to 2119)
  AUTHORS   Sloan,A.M. and Campbell,K.L.
  TITLE     Direct Submission
  JOURNAL   Submitted (09-JUN-2005) Zoology, University of Manitoba, 190 Dysart
            Road, Winnipeg, MB R3T 2N2, Canada
FEATURES             Location/Qualifiers
     source          1..2119
                     /organism="Elephas maximus"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:9783"
     regulatory      275..278
                     /regulatory_class="CAAT_signal"
     regulatory      319..322
                     /regulatory_class="TATA_box"
     mRNA            join(<401..492,621..843,1573..>1810)
                     /product="HBB/D"
     CDS             join(401..492,621..843,1573..1701)
                     /note="delta-globin; hemoglobin adult delta-chain"
                     /codon_start=1
                     /product="HBB/D"
                     /protein_id="AAZ22675.1"
                     /translation="MVNLTAAEKTQVTNLWGKVNVKELGGEALSRLLVVYPWTRRFFE
                     HFGDLSTADAVLHNAKVLAHGEKVLTSFGEGLKHLDNLKGTFADLSELHCDKLHVDPE
                     NFRLLGNVLVIVLARHFGKEFTPDVQAAYEKVVAGVANALAHKYH"
     regulatory      1805..1810
                     /regulatory_class="polyA_signal_sequence"
ORIGIN
1 ttctgggcct cagtttcctc atttgtataa taacagaatt ggagagtaaa ttcttaagag
61 gcttaccagg ctgtaattct aaaaaaaatg cataaataaa cttgccaagg cagatgtttt
121 tagcagcaat tcctgaaaga aacgggacca ggagataagt agagaaagag tgaaggtctg
181 aaatcaaact aataagacag tcccagactg tcaaggagag gtatggctgt catcattcag
241 gcctcaccct gcagaaccac accctggcct tggccaatct gctcacaaga gcaaaaaggg
301 caggaccagg gttgggcata taaggaagag tagtgccagc tgctgtttac actcacttct
361 gacacaactg tgttgactag caactaccca atcagacacc atggtgaatc tgactgctgc
421 tgagaagaca caagtcacca acctgtgggg caaggtgaat gtgaaagagc ttggtggtga
481 ggccctgagc aggtttgtat ctaggttgca aggtagactt aaggagggtt gagtggggct
541 gggcatgtgg agacagaaca gtctcccagt ttctgacagg cactgacttc ctctgcaccs
601 tgtggtgctt tcaccttcag gctgctggtg gtctacccat ggacccggag gttctttgaa
661 cactttgggg acctgtccac tgctgacgct gtcctgcaca acgctaaagt gctggcccat
721 ggcgagaaag tgttgacctc ctttggtgag ggcctgaagc acctggacaa cctcaagggc
781 acctttgccg atctgagcga gctgcactgt gacaagctgc acgtggatcc tgagaatttc
841 agggtgagtc taggagacac tctatttttt cttttcactt tgtagtcttt cactgtgatt
901 attttgctta tttgaatttc ctctgtatct ctttttactc gactatgttt catcatttag
961 tgttttttca acttatacca ttttgtatta cttttctttc aatattcttc cttttttcct
1021 gactcacatt cttgctttat atcatgctct ttatttaatt tcctacgttt ttgctcttgc
1081 tctccctttc tcctagtttc cttccctctg aacagtaccc aaattgtgca taccacctct
1141 cgtccactat ttctgcactg gggcaaatcc ccacccctcc tccatatgag ggttggaaag
1201 gactgaatca aagaggagag gatcatggtg ctgttctaga gtatgtgatt catttcagac
1261 ttgaaggata acttgaataa tataaaatca ggagtaaatg gagaggaaag tcagtatctg
1321 agaatgaaag atcagaaggt catagacgag atggggagca gaagttacta agaaactgac
1381 cattgtggct ataattaatc acttaattag ttaattaata tgtttgttat ttattcacgt
1441 ttttcatttt ggtgggagta aatttgggct agtgtgtggg caacataaat gggtttcacc
1501 ccattgtctc agaggccaag ctggattgct ttgttaacca tgtctgtgta tgtatctacc
1561 tcttccccat agctcctggg caatgtgctg gtgattgtcc tggcccgcca ctttggcaag
1621 gaattcaccc cagatgttca ggctgcctat gagaaggttg tggcaggtgt ggcgaatgcc
1681 ctggctcaca aataccactg agatcctggc ctgttcctgg tatccatcgg aagccccatt
1741 tcccgagatg ctatctctga atttgggaaa ataatgccaa ctctcaaggg catctcttct
1801 gcctaataaa gtactttcag ctcaactttc tgattcattt attttttttc tcagtcactc
1861 ttgtggtggg ggaagttccc aaggctctat ggacagagag ctcttgtgcc ttataggaaa
1921 agttcaaggg aaattggaaa ataaagggaa ccatacacag atattaatgg gaacaattct
1981 acttcaaagg cataaagatt gggaaggttt ggcaaatagg atactggtac tacagggatt
2041 ccatgggcct caggcctaag acatagcccc agggctaact ttcagattca attccagaaa
2101 ttactcacaa aataatgga
//
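The CDS feature in the record above is given as join(401..492,621..843,1573..1701), meaning three exon segments in 1-based, inclusive coordinates. The following is a minimal sketch, using only the standard library, of how such a location could be turned into segments and a spliced length. The helper names parse_join and splice are hypothetical and are not part of Biopython, which has its own feature-location objects; the sketch ignores complement() and other location operators.

```python
import re

def parse_join(location):
    """Return (start, end) pairs, 1-based and inclusive, from a GenBank
    location string such as 'join(401..492,621..843)'.
    The partial markers '<' and '>' are simply skipped."""
    pairs = re.findall(r"<?(\d+)\.\.>?(\d+)", location)
    return [(int(start), int(end)) for start, end in pairs]

def splice(sequence, segments):
    """Concatenate the 1-based inclusive segments of a sequence string."""
    return "".join(sequence[start - 1:end] for start, end in segments)

# CDS location copied from the DQ091202 record
cds_location = "join(401..492,621..843,1573..1701)"
segments = parse_join(cds_location)

# 92 + 223 + 129 bases
spliced_length = sum(end - start + 1 for start, end in segments)

print(segments)
print(spliced_length, spliced_length // 3)
```

The spliced length is 444 bases, that is 148 codons, which matches the 147 residues in the /translation qualifier plus one stop codon.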
bash-4.4$ conda list biopython
# packages in environment at /usr/local/anaconda3:
#
# Name Version Build Channel
biopython 1.72 py36h04863e7_0
bash-4.4$
bash-4.4$ python
Python 3.6.6 |Anaconda custom (64-bit)| (default, Jun 28 2018, 17:14:51)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> from Bio.Seq import Seq
>>> my_seq = Seq( "AGTACACTGGT" )
>>> my_seq
Seq('AGTACACTGGT')
>>> type( my_seq )
<class 'Bio.Seq.Seq'>
>>> dir( my_seq )
['__add__', '__class__', '__contains__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__imul__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__module__', '__mul__', '__ne__', '__new__', '__radd__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_data', '_get_seq_str_and_check_alphabet', 'alphabet', 'back_transcribe', 'complement', 'count', 'count_overlap', 'endswith', 'find', 'lower', 'lstrip', 'reverse_complement', 'rfind', 'rsplit', 'rstrip', 'split', 'startswith', 'strip', 'tomutable', 'tostring', 'transcribe', 'translate', 'ungap', 'upper']
>>> help( Seq )
>>> my_seq.complement()
Seq('TCATGTGACCA')
>>> my_seq.reverse_complement()
Seq('ACCAGTGTACT')
>>>
bash-4.4$
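For comparison with the session above, here is a minimal plain-Python sketch of what complement() and reverse_complement() compute for an unambiguous DNA string. It is an illustration only and not how Biopython implements these methods: it handles just A, C, G and T, whereas Biopython also covers IUPAC ambiguity codes.

```python
# Translation table for the four unambiguous bases, in both cases
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def complement(dna):
    """Swap each base for its Watson-Crick partner, keeping the order."""
    return dna.translate(COMPLEMENT)

def reverse_complement(dna):
    """Complement the strand, then read it in the opposite direction."""
    return complement(dna)[::-1]

print(complement("AGTACACTGGT"))          # TCATGTGACCA
print(reverse_complement("AGTACACTGGT"))  # ACCAGTGTACT
```

Both results agree with the Seq output shown in the interactive session.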
#!/usr/bin/env python
# example inspired by example from pg 205 of Stevens & Boucher
import sys
from Bio import SeqIO
# import some other functions from the Stevens & Boucher text
from stevens_boucher import estimateCharge, estimateIsoelectric
# open the file named on the command line
fileObj = open( sys.argv[1], "r" )
# use SeqIO.read() to obtain just one sequence
protein=SeqIO.read( fileObj, "fasta" )
print( "Sequence ID: %s" % protein.id )
print( "Sequence:" )
seq = protein.seq
while seq:
    print( " "+seq[0:60] )
    seq = seq[60:]
# see pg 207 for estimateIsoelectric(); it assumes a plain string,
# so convert the Seq object with str() first
print( "Estimated Iso-electric point: %f" % estimateIsoelectric( str( protein.seq ) ) )
fileObj.close()
bash-4.4$ pydoc Bio.SeqIO.read
bash-4.4$
bash-4.4$ more one_sequence.fasta
>P00435.3 RecName: Full=Glutathione peroxidase 1; Short=GPx-1; Short=GSHPx-1; Al
tName: Full=Cellular glutathione peroxidase
MCAAQRSAAALAAAAPRTVYAFSARPLAGGEPFNLSSLRGKVLLIENVASLUGTTVRDYTQMNDLQRRLG
PRGLVVLGFPCNQFGHQENAKNEEILNCLKYVRPGGGFEPNFMLFEKCEVNGEKAHPLFAFLREVLPTPS
DDATALMTDPKFITWSPVCRNDVSWNFEKFLVGPDGVPVRRYSRRFLTIDIEPDIETLLSQGASA
bash-4.4$
bash-4.4$ chmod u+x biopython2.py
bash-4.4$
bash-4.4$ ./biopython2.py one_sequence.fasta
Sequence ID: P00435.3
Sequence:
MCAAQRSAAALAAAAPRTVYAFSARPLAGGEPFNLSSLRGKVLLIENVASLUGTTVRDYT
QMNDLQRRLGPRGLVVLGFPCNQFGHQENAKNEEILNCLKYVRPGGGFEPNFMLFEKCEV
NGEKAHPLFAFLREVLPTPSDDATALMTDPKFITWSPVCRNDVSWNFEKFLVGPDGVPVR
RYSRRFLTIDIEPDIETLLSQGASA
Estimated Iso-electric point: 7.189697
bash-4.4$
#!/usr/bin/env python
# modified version of example from pg 205 of Stevens & Boucher
import sys
from Bio import SeqIO
# import some other functions from the Stevens & Boucher text
from stevens_boucher import estimateCharge, estimateIsoelectric
fileObj = open( sys.argv[1], "r" )
for protein in SeqIO.parse( fileObj, "fasta" ):
    print( "Sequence ID: %s" % protein.id )
    print( "Sequence:" )
    seq = protein.seq
    while seq:
        print( " "+seq[0:60] )
        seq = seq[60:]
    # see pg 207 for estimateIsoelectric(); it assumes a plain string,
    # so convert the Seq object with str() first
    print( "Estimated Iso-electric point: %f" % estimateIsoelectric( str( protein.seq ) ) )
fileObj.close()
bash-4.4$ pydoc Bio.SeqIO.parse
bash-4.4$
bash-4.4$ more two_sequences.fasta
>uniprot|P00395|COX1_HUMAN Cytochrome c oxidase subunit 1 (EC 1.9.3.1) (Cytochro
me c oxidase polypeptide I).
MFADRWLFSTNHKDIGTLYLLFGAWAGVLGTALSLLIRAELGQPGNLLGNDHIYNVIVTA
HAFVMIFFMVMPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSLLLLLASAMVEA
GAGTGWTVYPPLAGNYSHPGASVDLTIFSLHLAGVSSILGAINFITTIINMKPPAMTQYQ
TPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPILYQHLFWFFGH
PEVYILILPGFGMISHIVTYYSGKKEPFGYMGMVWAMMSIGFLGFIVWAHHMFTVGMDVD
TRAYFTSATMIIAIPTGVKVFSWLATLHGSNMKWSAAVLWALGFIFLFTVGGLTGIVLAN
SSLDIVLHDTYYVVAHFHYVLSMGAVFAIMGGFIHWFPLFSGYTLDQTYAKIHFTIMFIG
VNLTFFPQHFLGLSGMPRRYSDYPDAYTTWNILSSVGSFISLTAVMLMIFMIWEAFASKR
KVLMVEEPSMNLEWLYGCPPPYHTFEEPVYMKS
>uniprot|P53551|H1_YEAST Histone H1.
MAPKKSTTKTTSKGKKPATSKGKEKSTSKAAIKKTTAKKEEASSKSYRELIIEGLTALKE
RKGSSRPALKKFIKENYPIVGSASNFDLYFNNAIKKGVEAGDFEQPKGPAGAVKLAKKKS
PEVKKEKEVSPKPKQAATSVSATASKAKAASTKLAPKKVVKKKSPTVTAKKASSPSSLTY
KEMILKSMPQLNDGKGSSRIVLKKYVKDTFSSKLKTSSNFDYLFNSAIKKCVENGELVQP
KGPSGIIKLNKKKVKLST
bash-4.4$
bash-4.4$ python ./biopython3.py two_sequences.fasta
Sequence ID: uniprot|P00395|COX1_HUMAN
Sequence:
MFADRWLFSTNHKDIGTLYLLFGAWAGVLGTALSLLIRAELGQPGNLLGNDHIYNVIVTA
HAFVMIFFMVMPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSLLLLLASAMVEA
GAGTGWTVYPPLAGNYSHPGASVDLTIFSLHLAGVSSILGAINFITTIINMKPPAMTQYQ
TPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPILYQHLFWFFGH
PEVYILILPGFGMISHIVTYYSGKKEPFGYMGMVWAMMSIGFLGFIVWAHHMFTVGMDVD
TRAYFTSATMIIAIPTGVKVFSWLATLHGSNMKWSAAVLWALGFIFLFTVGGLTGIVLAN
SSLDIVLHDTYYVVAHFHYVLSMGAVFAIMGGFIHWFPLFSGYTLDQTYAKIHFTIMFIG
VNLTFFPQHFLGLSGMPRRYSDYPDAYTTWNILSSVGSFISLTAVMLMIFMIWEAFASKR
KVLMVEEPSMNLEWLYGCPPPYHTFEEPVYMKS
Estimated Iso-electric point: 6.693665
Sequence ID: uniprot|P53551|H1_YEAST
Sequence:
MAPKKSTTKTTSKGKKPATSKGKEKSTSKAAIKKTTAKKEEASSKSYRELIIEGLTALKE
RKGSSRPALKKFIKENYPIVGSASNFDLYFNNAIKKGVEAGDFEQPKGPAGAVKLAKKKS
PEVKKEKEVSPKPKQAATSVSATASKAKAASTKLAPKKVVKKKSPTVTAKKASSPSSLTY
KEMILKSMPQLNDGKGSSRIVLKKYVKDTFSSKLKTSSNFDYLFNSAIKKCVENGELVQP
KGPSGIIKLNKKKVKLST
Estimated Iso-electric point: 10.213692
bash-4.4$
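To make the difference between reading one record and iterating over several more concrete, the following is a rough standard-library sketch of a FASTA iterator. It is not Biopython's parser: parse_fasta is a hypothetical helper that yields plain (header, sequence) tuples instead of SeqRecord objects, and the two toy records are made up for the example.

```python
import io

def parse_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA handle.
    A new record starts at each line beginning with '>'."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span many lines
    if header is not None:
        yield header, "".join(chunks)     # emit the final record

# Toy input standing in for a file such as two_sequences.fasta
example = io.StringIO(
    ">seq1 first toy record\nMFADRW\nLFSTNH\n"
    ">seq2 second toy record\nMAPKKS\n"
)
records = list(parse_fasta(example))
for header, sequence in records:
    # the first word of the header plays the role of the record ID
    print(header.split()[0], len(sequence))
```

As with SeqIO.parse(), the generator hands back one record at a time, so a large file does not have to be held in memory all at once.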
#!/usr/bin/env python
# example from pg 205 of Stevens & Boucher
from Bio import SeqIO
# SeqRecord is the final object we wish to make, and which will be output
from Bio.SeqRecord import SeqRecord
# Seq object is needed internally to make a SeqRecord
from Bio.Seq import Seq
fileObj = open( "output.fasta", "w" )
proteinSeq = "AAPRTVYAFSARPLAGGEPFNLSSLRGKVLLIENVASLUGTTVRDYTQMNDLQRRLG"
# create a Seq object
seqObj = Seq( proteinSeq)
# create a SeqRecord object from seqObj, giving it an ID and a
# description
proteinObj = SeqRecord( seqObj, id="biopython_example_4",
                        description="sequence in biopython example 4" )
# write out a list of sequence records. In this case there is only one
# record
SeqIO.write( [proteinObj], fileObj, 'fasta' )
# done
fileObj.close()
bash-4.4$ pydoc Bio.SeqIO.write
bash-4.4$ pydoc Bio.SeqRecord
bash-4.4$ pydoc Bio.Alphabet
bash-4.4$
bash-4.4$ ls output.fasta
ls: cannot access 'output.fasta': No such file or directory
bash-4.4$
bash-4.4$ chmod u+x biopython4.py
bash-4.4$
bash-4.4$ ./biopython4.py
bash-4.4$
bash-4.4$ ls -l output.fasta
-rw-r--r-- 1 kusalik faculty 101 Jan 31 19:30 output.fasta
bash-4.4$
#!/usr/bin/env python
# example from pg 206-7 of Stevens & Boucher
from Bio import SeqIO
# Entrez for obtaining records from NCBI Genbank
from Bio import Entrez
# Identify ourselves to NCBI
Entrez.email = "binfo200@cs.usask.ca"
# look in the nucleotide database, return format is FASTA,
# and the accession number to look for is DQ091202. It happens to be
# the record for the Elephas maximus HBB/D gene.
socketObj = Entrez.efetch( db="nucleotide", rettype="fasta",
id="DQ091202" )
# the above creates a socket object that works like a typical file object.
# Hence we can use SeqIO to read from it.
# read the record from the socket object connected to NCBI
dnaObj = SeqIO.read( socketObj, "fasta" )
# we're done with the socket
socketObj.close()
# show the information we've obtained
print( dnaObj.description )
print( dnaObj.seq )
# compare the result to that in the file DQ091202.gb in this directory
bash-4.4$ pydoc Bio.Entrez.efetch
bash-4.4$ pydoc socket
bash-4.4$
bash-4.4$ chmod u+x biopython5.py
bash-4.4$ ./biopython5.py
DQ091202.1 Elephas maximus HBB/D gene, complete cds
TTCTGGGCCTCAGTTTCCTCATTTGTATAATAACAGAATTGGAGAGTAAATTCTTAAGAGGCTTACCAGGCTGTAATTCTAAAAAAAATGCATAAATAAACTTGCCAAGGCAGATGTTTTTAGCAGCAATTCCTGAAAGAAACGGGACCAGGAGATAAGTAGAGAAAGAGTGAAGGTCTGAAATCAAACTAATAAGACAGTCCCAGACTGTCAAGGAGAGGTATGGCTGTCATCATTCAGGCCTCACCCTGCAGAACCACACCCTGGCCTTGGCCAATCTGCTCACAAGAGCAAAAAGGGCAGGACCAGGGTTGGGCATATAAGGAAGAGTAGTGCCAGCTGCTGTTTACACTCACTTCTGACACAACTGTGTTGACTAGCAACTACCCAATCAGACACCATGGTGAATCTGACTGCTGCTGAGAAGACACAAGTCACCAACCTGTGGGGCAAGGTGAATGTGAAAGAGCTTGGTGGTGAGGCCCTGAGCAGGTTTGTATCTAGGTTGCAAGGTAGACTTAAGGAGGGTTGAGTGGGGCTGGGCATGTGGAGACAGAACAGTCTCCCAGTTTCTGACAGGCACTGACTTCCTCTGCACCSTGTGGTGCTTTCACCTTCAGGCTGCTGGTGGTCTACCCATGGACCCGGAGGTTCTTTGAACACTTTGGGGACCTGTCCACTGCTGACGCTGTCCTGCACAACGCTAAAGTGCTGGCCCATGGCGAGAAAGTGTTGACCTCCTTTGGTGAGGGCCTGAAGCACCTGGACAACCTCAAGGGCACCTTTGCCGATCTGAGCGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAATTTCAGGGTGAGTCTAGGAGACACTCTATTTTTTCTTTTCACTTTGTAGTCTTTCACTGTGATTATTTTGCTTATTTGAATTTCCTCTGTATCTCTTTTTACTCGACTATGTTTCATCATTTAGTGTTTTTTCAACTTATACCATTTTGTATTACTTTTCTTTCAATATTCTTCCTTTTTTCCTGACTCACATTCTTGCTTTATATCATGCTCTTTATTTAATTTCCTACGTTTTTGCTCTTGCTCTCCCTTTCTCCTAGTTTCCTTCCCTCTGAACAGTACCCAAATTGTGCATACCACCTCTCGTCCACTATTTCTGCACTGGGGCAAATCCCCACCCCTCCTCCATATGAGGGTTGGAAAGGACTGAATCAAAGAGGAGAGGATCATGGTGCTGTTCTAGAGTATGTGATTCATTTCAGACTTGAAGGATAACTTGAATAATATAAAATCAGGAGTAAATGGAGAGGAAAGTCAGTATCTGAGAATGAAAGATCAGAAGGTCATAGACGAGATGGGGAGCAGAAGTTACTAAGAAACTGACCATTGTGGCTATAATTAATCACTTAATTAGTTAATTAATATGTTTGTTATTTATTCACGTTTTTCATTTTGGTGGGAGTAAATTTGGGCTAGTGTGTGGGCAACATAAATGGGTTTCACCCCATTGTCTCAGAGGCCAAGCTGGATTGCTTTGTTAACCATGTCTGTGTATGTATCTACCTCTTCCCCATAGCTCCTGGGCAATGTGCTGGTGATTGTCCTGGCCCGCCACTTTGGCAAGGAATTCACCCCAGATGTTCAGGCTGCCTATGAGAAGGTTGTGGCAGGTGTGGCGAATGCCCTGGCTCACAAATACCACTGAGATCCTGGCCTGTTCCTGGTATCCATCGGAAGCCCCATTTCCCGAGATGCTATCTCTGAATTTGGGAAAATAATGCCAACTCTCAAGGGCATCTCTTCTGCCTAATAAAGTACTTTCAGCTCAACTTTCTGATTCATTTATTTTTTTTCTCAGTCACTCTTGTGGTGGGGGAAGTTCCCAAGGCTCTATGGACAGAGAGCTCTTGTGCCTTATAGGAAAAGTTCAAGGGAAATTGGAAAATAAAGGGAACCATACACAGATATTAATGGGAACAATTCTACTTCAAAGGCATAAAGATT
GGGAAGGTTTGGCAAATAGGATACTGGTACTACAGGGATTCCATGGGCCTCAGGCCTAAGACATAGCCCCAGGGCTAACTTTCAGATTCAATTCCAGAAATTACTCACAAAATAATGGA
bash-4.4$ seqret DQ091202.gb
Reads and writes (returns) sequences
output sequence(s) [dq091202.fasta]: FASTA::stdout
>DQ091202 DQ091202.1 Elephas maximus HBB/D gene, complete cds.
ttctgggcctcagtttcctcatttgtataataacagaattggagagtaaattcttaagag
gcttaccaggctgtaattctaaaaaaaatgcataaataaacttgccaaggcagatgtttt
tagcagcaattcctgaaagaaacgggaccaggagataagtagagaaagagtgaaggtctg
aaatcaaactaataagacagtcccagactgtcaaggagaggtatggctgtcatcattcag
gcctcaccctgcagaaccacaccctggccttggccaatctgctcacaagagcaaaaaggg
caggaccagggttgggcatataaggaagagtagtgccagctgctgtttacactcacttct
gacacaactgtgttgactagcaactacccaatcagacaccatggtgaatctgactgctgc
tgagaagacacaagtcaccaacctgtggggcaaggtgaatgtgaaagagcttggtggtga
ggccctgagcaggtttgtatctaggttgcaaggtagacttaaggagggttgagtggggct
gggcatgtggagacagaacagtctcccagtttctgacaggcactgacttcctctgcaccs
tgtggtgctttcaccttcaggctgctggtggtctacccatggacccggaggttctttgaa
cactttggggacctgtccactgctgacgctgtcctgcacaacgctaaagtgctggcccat
ggcgagaaagtgttgacctcctttggtgagggcctgaagcacctggacaacctcaagggc
acctttgccgatctgagcgagctgcactgtgacaagctgcacgtggatcctgagaatttc
agggtgagtctaggagacactctattttttcttttcactttgtagtctttcactgtgatt
attttgcttatttgaatttcctctgtatctctttttactcgactatgtttcatcatttag
tgttttttcaacttataccattttgtattacttttctttcaatattcttccttttttcct
gactcacattcttgctttatatcatgctctttatttaatttcctacgtttttgctcttgc
tctccctttctcctagtttccttccctctgaacagtacccaaattgtgcataccacctct
cgtccactatttctgcactggggcaaatccccacccctcctccatatgagggttggaaag
gactgaatcaaagaggagaggatcatggtgctgttctagagtatgtgattcatttcagac
ttgaaggataacttgaataatataaaatcaggagtaaatggagaggaaagtcagtatctg
agaatgaaagatcagaaggtcatagacgagatggggagcagaagttactaagaaactgac
cattgtggctataattaatcacttaattagttaattaatatgtttgttatttattcacgt
ttttcattttggtgggagtaaatttgggctagtgtgtgggcaacataaatgggtttcacc
ccattgtctcagaggccaagctggattgctttgttaaccatgtctgtgtatgtatctacc
tcttccccatagctcctgggcaatgtgctggtgattgtcctggcccgccactttggcaag
gaattcaccccagatgttcaggctgcctatgagaaggttgtggcaggtgtggcgaatgcc
ctggctcacaaataccactgagatcctggcctgttcctggtatccatcggaagccccatt
tcccgagatgctatctctgaatttgggaaaataatgccaactctcaagggcatctcttct
gcctaataaagtactttcagctcaactttctgattcatttattttttttctcagtcactc
ttgtggtgggggaagttcccaaggctctatggacagagagctcttgtgccttataggaaa
agttcaagggaaattggaaaataaagggaaccatacacagatattaatgggaacaattct
acttcaaaggcataaagattgggaaggtttggcaaataggatactggtactacagggatt
ccatgggcctcaggcctaagacatagccccagggctaactttcagattcaattccagaaa
ttactcacaaaataatgga
bash-4.4$
#!/usr/bin/env python
# example from pg 207 of Stevens & Boucher
from Bio import SeqIO
# ExPASy for obtaining records from ExPASy/Swiss-Prot
from Bio import ExPASy
# setup a socket object to get the record for P68871 (Human HBB,
# Hemoglobin subunit beta)
socketObj = ExPASy.get_sprot_raw( "P68871" )
# the above creates a socket object that works like a typical file object.
# Hence we can use SeqIO to read from it.
# read the record from the socket object connected to ExPASy/Swiss-Prot
proteinObj = SeqIO.read( socketObj, "swiss" )
# we're done with the socket
socketObj.close()
# show the information we've obtained
print( proteinObj.description )
print( proteinObj.seq )
# compare the result to that in the file P68871.sp in this directory
bash-4.4$ pydoc Bio.ExPASy.get_sprot_raw
bash-4.4$ python biopython6.py
RecName: Full=Hemoglobin subunit beta; AltName: Full=Beta-globin; AltName: Full=Hemoglobin beta chain; Contains: RecName: Full=LVV-hemorphin-7; Contains: RecName: Full=Spinorphin;
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
bash-4.4$
bash-4.4$ seqret P68871.sp
Reads and writes (returns) sequences
output sequence(s) [hbb_human.fasta]: FASTA::stdout
>HBB_HUMAN P68871 Hemoglobin subunit beta (Beta-globin) (Hemoglobin beta chain) (LVV-hemorphin-7) (Spinorphin)
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
KEFTPPVQAAYQKVVAGVANALAHKYH
bash-4.4$
>P00435.3 RecName: Full=Glutathione peroxidase 1; Short=GPx-1; Short=GSHPx-1; AltName: Full=Cellular glutathione peroxidase
MCAAQRSAAALAAAAPRTVYAFSARPLAGGEPFNLSSLRGKVLLIENVASLUGTTVRDYTQMNDLQRRLG
PRGLVVLGFPCNQFGHQENAKNEEILNCLKYVRPGGGFEPNFMLFEKCEVNGEKAHPLFAFLREVLPTPS
DDATALMTDPKFITWSPVCRNDVSWNFEKFLVGPDGVPVRRYSRRFLTIDIEPDIETLLSQGASA
\ No newline at end of file
# Functions from the Stevens & Boucher text
# pg 203-4 estimateCharge( sequence, pH )
def estimateCharge( sequence, pH ):
    """ Using pKa values estimate the charge of a sequence of
    amino acids at a given pH """
    pKaDict = { '+': 8.0, '-': 3.1, 'K': 10.0, 'C': 8.5,
                'H': 6.5, 'E': 4.4, 'Y': 10.0, 'D': 4.4, 'R': 12.0 }
    isAcid = { '+': False, '-': True, 'C': True, 'H': False,
               'K': False, 'E': True, 'Y': True, 'R': False, 'D': True }
    total = 0.0
    upper_case_seq = sequence.upper() # make sure the sequence is in uppercase
    for aminoAcid in upper_case_seq:
        pKa = pKaDict.get( aminoAcid )
        if pKa is not None:
            # Henderson-Hasselbalch: ratio of dissociated to undissociated
            # form is 10^(pH - pKa)
            r = 10.0 ** (pH - pKa)
            dissociated = r/(r+1.0)
            if isAcid[aminoAcid]:
                charge = -1.0 * dissociated
            else:
                charge = 1.0 - dissociated
            total += charge
    return total
# pg 204-5 estimateIsoelectric( sequence )
def estimateIsoelectric( sequence ):
    """ Estimate the charge neutral pH of a protein sequence.
    This is just a guess as pKa values will vary according to
    protein sequence, conformation, and conditions. Assumes that
    sequence is a string. """
    sequence = '+' + sequence + '-' # assumes sequence is a string
    bestValue = 0.0
    minCharge = estimateCharge( sequence, bestValue )
    increment = 7.0
    while abs( minCharge ) > 0.001:
        pHtest = bestValue + increment
        charge = estimateCharge( sequence, pHtest )
        if abs( charge ) < abs( minCharge ):
            minCharge = charge
            bestValue = pHtest
        else:
            increment = abs( increment )/2.0
            if minCharge < 0.0:
                increment *= -1
    return bestValue
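The charge estimate above rests on the Henderson-Hasselbalch relation, in which the ratio of dissociated to undissociated groups is 10 raised to (pH - pKa). The short sketch below isolates that step so it can be checked by hand. The function names fraction_dissociated and group_charge are hypothetical and chosen only for this illustration; the pKa values are taken from the dictionary above.

```python
def fraction_dissociated(pH, pKa):
    """Fraction of groups that have lost their proton at the given pH."""
    ratio = 10.0 ** (pH - pKa)       # [dissociated] / [undissociated]
    return ratio / (ratio + 1.0)

def group_charge(pH, pKa, is_acid):
    """Average charge of one ionisable group.
    Acids go from 0 to -1 on dissociation, bases from +1 to 0."""
    fraction = fraction_dissociated(pH, pKa)
    return -fraction if is_acid else 1.0 - fraction

# At pH equal to the pKa exactly half of the groups are dissociated
print(fraction_dissociated(4.4, 4.4))         # 0.5

# Glutamate side chain (pKa 4.4, acid) is almost fully negative at pH 7
print(round(group_charge(7.0, 4.4, True), 3))

# Lysine side chain (pKa 10.0, base) is almost fully positive at pH 7
print(round(group_charge(7.0, 10.0, False), 3))
```

Summing group_charge over every ionisable residue and both termini gives the same kind of total that estimateCharge() returns.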
>uniprot|P00395|COX1_HUMAN Cytochrome c oxidase subunit 1 (EC 1.9.3.1) (Cytochrome c oxidase polypeptide I).
MFADRWLFSTNHKDIGTLYLLFGAWAGVLGTALSLLIRAELGQPGNLLGNDHIYNVIVTA
HAFVMIFFMVMPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSLLLLLASAMVEA
GAGTGWTVYPPLAGNYSHPGASVDLTIFSLHLAGVSSILGAINFITTIINMKPPAMTQYQ
TPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPILYQHLFWFFGH
PEVYILILPGFGMISHIVTYYSGKKEPFGYMGMVWAMMSIGFLGFIVWAHHMFTVGMDVD
TRAYFTSATMIIAIPTGVKVFSWLATLHGSNMKWSAAVLWALGFIFLFTVGGLTGIVLAN
SSLDIVLHDTYYVVAHFHYVLSMGAVFAIMGGFIHWFPLFSGYTLDQTYAKIHFTIMFIG
VNLTFFPQHFLGLSGMPRRYSDYPDAYTTWNILSSVGSFISLTAVMLMIFMIWEAFASKR
KVLMVEEPSMNLEWLYGCPPPYHTFEEPVYMKS
>uniprot|P53551|H1_YEAST Histone H1.
MAPKKSTTKTTSKGKKPATSKGKEKSTSKAAIKKTTAKKEEASSKSYRELIIEGLTALKE
RKGSSRPALKKFIKENYPIVGSASNFDLYFNNAIKKGVEAGDFEQPKGPAGAVKLAKKKS
PEVKKEKEVSPKPKQAATSVSATASKAKAASTKLAPKKVVKKKSPTVTAKKASSPSSLTY
KEMILKSMPQLNDGKGSSRIVLKKYVKDTFSSKLKTSSNFDYLFNSAIKKCVENGELVQP
KGPSGIIKLNKKKVKLST