Programming Lab 6 Report

[Figure: gene tree of homeobox homologs from N. vectensis, B. belcheri, D. gigantea, H. vulgaris, A. planci, M. yessoensis, and H. sapiens, with tips annotated by Pfam domain content (for example pfam00046 Homeobox, pfam04617 Hox9_act, pfam05920 Homeobox_KN, pfam13293 DUF4074).]
What information should be in
your results?
To get answers:
Identify homologs → align homologs → produce gene
tree → reconcile gene tree and species tree
How has domain content changed with gene family
evolution?
What figures should be in your
results?
Getting ready for the methods:
– Conceptual map
– Project Repository
Analysis Pipeline
Song et al. 2015
Cylinder shapes indicate data, arrows data flows, rectangular shapes
programs (process, name).
In Class Activity (8 points)
• Draw the pipeline that you used to identify
and analyze your gene (Labs 5-8)
• Data at each step (cylinders)
• Programs (Process + name)
• Arrows (data flow)
• Use https://sketch.io/sketchpad/, for example.
– Project Repository
• You will be given a “Project Repository”
• This is your project’s pipeline.
• For extra credit:
– All of the commands present
– All of the data files and result files present
– Someone could just clone, run the commands
with the provided data, and get the resulting
output!
– Nothing extra!!
– (Right now, this is spread out among labs 3-6).
Results: Describe the outcomes of your analyses. Supplement your description with
figures and tables, and refer to these in your text.
• Your results section should have:
• Clear and complete narrative description, point by point, of each result. Walk the
reader through the results, what they mean, and how they are to be interpreted.
• Accurate and scientifically sound interpretation of your
results using appropriate technical terms.



• At least two figures and/or tables that appropriately illuminate the results. Relevant
figures or tables from your analysis are placed either on separate pages after the
main text or within the text. Figures are numbered consecutively, and each figure is
accompanied by a legend. Figures are referred to directly in the main text. Axes are
labeled with units. Tables should be numbered consecutively, referred to directly in
the main text, and each table should include a title. Example: “Our results strongly
suggest that Gremlin production increases as an exponential function of the number
of times Mogwai are fed after midnight (Figure 1).”
A legend is written for each figure. A legend is a complete description of the figure
and can stand alone from the text.
Just provide the main results that shed light onto your research question. Not
every detail. Keep focused on your research question as described in your
Introduction (and adjust your question, if necessary).
Results: (for starters, to be expanded).
– How many homologs? Average percent identity. Length of the alignment.
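These three starter numbers can be computed directly from an alignment. The sketch below is a minimal illustration, not part of the lab pipeline: the function names and the toy alignment are hypothetical, and it assumes percent identity is counted only over columns where both sequences have a residue (other definitions use a different denominator and give different values).

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent identity over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def alignment_summary(aligned):
    """aligned: dict mapping sequence name -> aligned sequence (equal lengths)."""
    lengths = {len(s) for s in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    identities = [
        percent_identity(a, b) for a, b in combinations(aligned.values(), 2)
    ]
    mean_identity = sum(identities) / len(identities) if identities else 0.0
    return {
        "n_sequences": len(aligned),
        "alignment_length": lengths.pop(),
        "mean_percent_identity": mean_identity,
    }


# Toy alignment (hypothetical data, for illustration only)
toy = {"seq1": "GEW-A", "seq2": "GTWKA", "seq3": "GTY-A"}
summary = alignment_summary(toy)
print(summary)
```

For a real report, the same function could be fed the aligned sequences read from the alignment file produced earlier in the pipeline.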

Figures:

Rooted Gene tree, with domain content
Rooted Gene tree, with bootstrap support
Gene tree: notung reconciliation
Gene tree: notung reconciliation displayed as rechplyovisu
Motifs and Domains
Examples of protein domains
This protein (hemocyanin) has two distinct domains (blue and green)
which are connected by a short linker (red).
The amino acid chain of hemocyanin can be represented like this:
[Figure: linear diagram of the chain from residue 1 to residue 394, marking the residues that form the “green” domain, the “blue” domain, and the linker.]
This enzyme (laccase) has three distinct domains (each colored
differently).
[Figures: ribbon diagram of laccase; space-filling diagram of laccase.]
Domains are functional elements of proteins.
Some examples of biochemical functions of domains:
• An enzyme’s catalytic domain has the function of catalyzing
the conversion of a reactant into a product.
• A structural protein domain has the function of influencing the
shape of a cell.
• The binding domain of a transport protein has the function of
carrying a ligand from one location to another.
Ribbon diagrams of β-propeller proteins containing 4-8 blades, each made up of WD40
domains. (Jawad and Paoli 2002)
Domain architectures of the different MAGUK classes.
The membrane-associated guanylate kinases (MAGUK)
are a superfamily of proteins. The MAGUKs are defined
by their inclusion of PDZ, SH3 and GUK domains,
although many of them also contain regions homologous
to CaMKII, WW and L27 domains.
de Mendoza et al. 2010
The PDZ domain is a common structural domain of 80-90 amino acids found in the signaling proteins of bacteria, yeast, plants,
viruses and animals. Proteins containing PDZ domains play a key
role in anchoring receptor proteins in the membrane to
cytoskeletal components.
de Mendoza et al. 2010
The SH3 domain is a distinct motif that binds target proteins,
including proteins associated with the actin cytoskeleton,
through sequences containing proline and hydrophobic amino
acids.
de Mendoza et al. 2010
MAGUK (membrane-associated guanylate kinase)
The GuK domain in MAGUK proteins is enzymatically inactive;
instead, the domain mediates protein-protein interactions and
associates intramolecularly with the SH3 domain.
de Mendoza et al. 2010
[Figure legend: Guanylate kinase, PDZ domain, SH3 domain]
Why is it useful to identify motifs and domains in
families of proteins?
• To identify the functionally important residues and patterns in
a given domain.
• To predict the function of a new protein by comparing its
sequence to the sequences of domains with known functions.
• To evaluate evidence for patterns of homology among
orthologs and paralogs (i.e., partial homology).
• To trace the evolution of protein function.
New genes from parts of old genes
But how could this happen?
Structural modules and their domain origins:
• EGF domain: epidermal growth factor (EGF)
• fibronectin finger domain: fibronectin
• plasminogen kringle domain
• vit. K-dependent calcium-binding domain (osteocalcin)
• trypsin-like serine protease
Mosaic proteins: tissue plasminogen activator, prothrombin, urokinase, plasminogen
Source: Sylvia Nagl
One way to make a new gene:
Use the parts of old genes.
When domains are repeatedly found in diverse proteins they
are called structural modules. Long structural modules are
likely homologous. (Shorter structural modules may be
convergent.)
Domains may contain motifs.
Hemocyanin as an example:
The “green” domain of hemocyanin contains this copper-binding motif:
H-X3-H-X22-37-H
location of copper-binding motif
location of copper-binding motif
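A spaced motif like this one can be searched for with an ordinary regular expression. The sketch below assumes the notation is read as "H, any 3 residues, H, any 22 to 37 residues, H" (the subscripts were flattened in the slide text, so this reading is an assumption). The function name and the toy sequence are hypothetical; the toy sequence is not real hemocyanin.

```python
import re

# Assumed reading of H-X3-H-X22-37-H:
# H, any 3 residues, H, any 22 to 37 residues, H
COPPER_MOTIF = re.compile(r"H.{3}H.{22,37}H")


def find_motif(seq):
    """Return the 1-based (start, end) of the first match, or None."""
    m = COPPER_MOTIF.search(seq)
    return (m.start() + 1, m.end()) if m else None


# Toy sequence built to contain one match (not a real protein)
toy = "MK" + "H" + "AAA" + "H" + "G" * 25 + "H" + "LL"
print(find_motif(toy))
```

A match only says the spacing pattern is present; whether the histidines actually bind copper is a separate, structural question.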
A group of proteins that share a domain in common
constitute a FAMILY. Family members are evolutionarily
related (homologous) and their domains have sequence
similarity.
Family members can share a domain in common in a number of ways:
• A domain may extend essentially across the length of a protein.
• Domains may contain highly related stretches of amino acids that form only a subset of each protein’s sequence.
• A domain may be repeated within a single protein.
(Figure 8.2 from Bioinformatics and Functional Genomics by J. Pevsner)
How are domains identified and
classified?
(Ponting and Russell 2002)
A. By sequence- or structure-based families related by common
ancestry (homology).
B. By function.
C. By shared GC content.
D. By numbers of duplications and losses, as determined by
reconciliation of gene and species trees.
• Identification and classification by function?
• Most domain families contain representatives with
different functions. Providing a standard definition of
function is difficult.
• Identification and classification of domains by sequence
(homology or structure).
How are motifs and domains
identified in protein families?
By aligning family members in a global
multiple sequence alignment.
Motifs and/or domains can then be
identified as conserved regions of the
alignment. Sometimes it is easy to align
the sequences, and these conserved
regions are obvious and can be identified
“by eye.”
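The "by eye" search for conserved regions can also be done programmatically. The sketch below is one simple way, under stated assumptions: a column counts as conserved when a single residue reaches a chosen fraction of the sequences (default: all of them). The function names and the threshold parameter are hypothetical, and the four rows are short excerpts of rows from the alignment used in this section.

```python
from collections import Counter


def conserved_columns(rows, min_fraction=1.0):
    """0-based indices of columns where one residue (gaps excluded)
    is present in at least min_fraction of the sequences."""
    n = len(rows)
    hits = []
    for i, column in enumerate(zip(*rows)):
        counts = Counter(c for c in column if c != "-")
        if counts and counts.most_common(1)[0][1] / n >= min_fraction:
            hits.append(i)
    return hits


def runs(indices):
    """Group sorted column indices into (start, end) runs of adjacent columns."""
    out = []
    for i in indices:
        if out and i == out[-1][1] + 1:
            out[-1] = (out[-1][0], i)
        else:
            out.append((i, i))
    return out


# Excerpts (15 columns) from four rows of the example alignment
rows = [
    "NSWLFFPFHRWYLYF",
    "GSWLFFPFHRWYLYF",
    "SCWLFFPWHRMYLYF",
    "ASWLFLPFHRYYLYF",
]
cols = conserved_columns(rows)
print(runs(cols))
```

Lowering `min_fraction` (for example to 0.75) tolerates a few substitutions per column, which is closer to how conserved blocks are usually judged in larger families.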
Below is an example of a multiple sequence alignment of a family of proteins.
A conserved copper-binding motif is known to exist in these proteins. Examine the
alignment carefully: can you identify the region containing the motif?

Seq1 APIPPPDLKSCGVAHIDDKGTEVSY--SCCPPVPDDIDSVPYYKFPPMTKLR-IRPPAHA 57
Seq2 APAPPPDLSSCSIARINEN-QVVPY--SCCAPKPDDMEKVPYYKFPSMTKLR-VRQPAHE 56
Seq3 APVPIPDLTKCVI-P---PSGAPVP-INCCPPFSK--DIIDFKYP-SFEKLR-VRPAAQL 51
Seq4 SPISPPDLSKCVP-PSDLPSGTTPPNINCCPPYST--KITDFKFP-SNQPLR-VRQAAHL 55
Seq5 APIQAPDLGDCHQ-PVDVPATAPAI--NCCPTYSAGTVAVDFAPPPASSPLR-VRPAAHL 56
Seq6 APILAPDLSTCGP-PADLPASARPT--VCCPPYQS--TIIDFKLPPRSAPLR-VRPAAHL 54
Seq7 APIQAPDISKCG--TATVPDGVTPT--NCCPPVTT--KIIDFQLPSSGSPMR-TRPAAHL 53
Seq8 APIQAPEISKCVVPPADLPPGAVVD--NCCPPVAS--NIVDYKIP-VVTTMK-VRPAAHT 54
Seq9 APIL-PDVEKCTLSDALWDGSVGDH---CCPPPFDLNITKDFEFKNYHNHVKKVRRPAHK 56

Seq1 A--DEEYVAKYQLATSRMRELDK-DPFDPLGFKQQANIHCAYCNGAYKIGGK---ELQVH 111
Seq2 A--NEEYIAKYNLAISRMKDLDKTQPLNPIGFKQQANIHCAYCNGAYRIGGK---ELQVH 111
Seq3 V--DDDYFAKYNKALELMRALPDDDPRS---FSQQAKIHCAYCVGGYKQLGYPEIELSVH 106
Seq4 V--DNEFLEKYKKATELMKALPSNDPRN---FTQQANIHCAYCDGAYSQIGFPDLKLQVH 110
Seq5 A--DRAYLAKYERAVSLMKKLPADDPRS---FEQQWRVHCAYCDGAYDQVGFPGLEIQIH 111
Seq6 V--DADYLAKYKKAVELMRALPADDPRN---FVQQAKVHCAYCDGAYDQIGFPDLEIQIH 109
Seq7 V--SKEYLAKYKKAIELQKALPDDDPRS---FKQQANVHCTYCQGAYDQVGYTDLELQVH 108
Seq8 M--DKDAIAKFARAVDLMRALPGDDPRN---FYQQALVHCAYCNGGYDQVNFPDQEIQVH 109
Seq9 AYEDQEWLNDYKRAIAIMKSLPMSDPRS---HMQQARVHCAYCDGSYPVLGHNDTRLEVH 113

Seq1 FSWLFFPFHRWYLYFYERILGSLINDPTFALPYWNWDHPKGMRIPPMFDREGSSLYDEKR 171
Seq2 NSWLFFPFHRWYLYFHERIVGKFIDDPTFALPYWNWDHPKGMRFPAMYDREGTSLFDVTR 171
Seq3 NSWLFLAFHRWYIYFYERILGSLINDPTFAIPFWNFDAPDGMQIPSIFTNPNSSLYDLKR 166
Seq4 GSWLFFPFHRWYLYFYERILGSLINDPTFALPFWNYDAPDGMQLPTIYADKASPLYDELR 170
Seq5 SCWLFFPWHRMYLYFHERILGKLIGDETFALPFWNWDAPDGMSFPAMYANRWSPLYDPRR 171
Seq6 NSWLFFPWHRFYLYSNERILGKLIGDDTFALPFWNWDAPGGMQFPSIYTDPSSSLYDKLR 169
Seq7 ASWLFLPFHRYYLYFNERILAKLIDDPTFALPYWAWDNPDGMYMPTIYASSPSSLYDEKR 168
Seq8 NSWLFFPFHRWYLYFYERILGKLIGDPSFGLPFWNWDNPGGMVLPDFLNDSTSSLYDSNR 169
Seq9 ASWLFPSFHRWYLYFYERILGKLINKPDFALPYWNWDHRDGMRIPEIFKEMDSPLFDPNR 173

Seq1 NQNHRNGTIIDLGHFGKDVRTPQL------
Seq2 DQSHRNGAVIDLGFFGNEVETTQL------
Seq3 DSRHQPPRIIDLNYNKDTEDPGPNYPPSAE
Seq4 NASHQPPTLIDLNFCDIGSDIDRN------
Seq5 NQAHLPPFPLDLDYSGTDTNIPKD------
Seq6 DAKHQPPTLIDLDYNGTDPTFSPE------
Seq7 NAKHLPPTVIDLDYDGTEPTIPDD------
Seq8 NQSHLPPVVVDLGYNGADTDVTDQ------
Seq9 NTNHLD-KMMNLSFVSDEEGSDVN----ED

(See next slide for answer.)
The copper-binding motif is within the red
box. It is located within a conserved
section of sequence which is marked with
a yellow box (note the “ * : . ” symbols
below the alignment which indicate
conserved residues).
Sometimes (as in this example) it is easy
to align family members and identify
conserved regions that are likely to be
important to the function of the protein.
However, for distantly related sequences,
it may be very difficult to even align the
sequences properly, let alone detect
conserved sequence patterns. These
situations require the use of sensitive
statistical methods.
Protein motifs and domains are consensus
sequence patterns.
Motif: a short conserved sequence pattern; can be just a few
amino acid residues, up to ~20.
Y-X-Y
and
C-X4-C-X12-H-X3-H
Domain: a longer conserved sequence pattern which adopts a
particular three-dimensional structure and is an independent
functional and structural unit; typically 40-700 residues.
Example of a two-domain protein:
This protein (troponin C) is composed
of a single amino acid chain, but
each half of the chain forms an
independent structural and functional
unit: a domain.
NOTE: Many short motifs are NOT specific to a particular
protein family. Thus, their occurrence does not indicate
homology.
Example:
protein kinase C phosphorylation site has this 3-residue motif:
S/T – X – R/K
(S or T, followed by any residue, followed by R or K)
This is a common motif that occurs in many unrelated proteins.
These represent evolutionary convergence for common
function.
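A rough calculation shows why a 3-residue motif like this cannot be taken as evidence of homology: it is expected to match by chance in almost any protein. The sketch below assumes all 20 amino acids are equally likely and independent at every position, which real proteins do not satisfy, so the numbers are only an order-of-magnitude guide. The function name and the example protein length of 500 are hypothetical.

```python
def expected_chance_matches(seq_length, allowed_per_position, alphabet_size=20):
    """Return (probability of a match at one start position,
    expected number of matches in a sequence of seq_length),
    assuming uniform, independent residue frequencies."""
    p = 1.0
    for k in allowed_per_position:
        p *= k / alphabet_size
    windows = max(seq_length - len(allowed_per_position) + 1, 0)
    return p, windows * p


# S/T - X - R/K: 2 allowed residues, then any of 20, then 2 allowed residues
p, expected = expected_chance_matches(500, [2, 20, 2])
print(p, expected)
```

Under these assumptions the per-position probability is about 0.01, so a 500-residue protein is expected to contain roughly five matches whether or not it has anything to do with protein kinase C.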
Motifs and domains are FUNCTIONAL elements of proteins.
Some examples of biochemical functions of domains:
• An enzyme’s catalytic domain has the function of catalyzing
the conversion of a reactant into a product.
• A structural protein domain has the function of influencing the
shape of a cell.
• The binding domain of a transport protein has the function of
carrying a ligand from one location to another.
Some examples of the functions of motifs:
• The Y’s of this tyrosine motif have the function of interacting
with specific residues of a protein to stabilize its structure.
Y-X-Y
• The H’s and C’s of this zinc finger motif have the function of
binding zinc ions.
C-X4-C-X12-H-X3-H
How are motifs and domains in protein families
represented?
1. Regular expressions/patterns
A multiple sequence alignment is converted to a consensus sequence
called a regular expression or pattern. Example:
Multiple sequence alignment:
seq1 GEW
seq2 GTW
seq3 GTY
seq4 GRW
seq5 GKW
seq6 GAW
Regular expression: G-X-[WY]
(G, followed by any residue, followed by W or Y)
Interpreting regular expressions:
Example:
E-X(2)-[FHM]-X(4)-{P}-L
Interpretation:
First residue of the pattern is E;
followed by any 2 residues;
followed by F, or H, or M;
followed by any 4 residues;
followed by any residue except P;
followed by L.
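The interpretation rules above map directly onto the regular expression syntax of most programming languages. The sketch below is a minimal translator for the subset of the notation used on these slides (single residues, X, [..], {..}, and repeat counts in parentheses). The function name is hypothetical, and it does not cover the full PROSITE pattern syntax.

```python
import re

# One pattern element: [ABC], {ABC}, or a single letter, with optional (n) or (n,m)
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a hyphen-separated pattern such as E-X(2)-[FHM] to a Python regex."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unrecognized pattern element: {element!r}")
        core, low, high = m.groups()
        if core in ("X", "x"):
            piece = "."                       # any residue
        elif core.startswith("{"):
            piece = "[^" + core[1:-1] + "]"   # any residue except those listed
        else:
            piece = core                      # literal residue or [..] choice
        if low is not None:
            piece += "{" + low + ("," + high if high else "") + "}"
        parts.append(piece)
    return "".join(parts)


regex = prosite_to_regex("E-X(2)-[FHM]-X(4)-{P}-L")
print(regex)  # E.{2}[FHM].{4}[^P]L
print(bool(re.search(regex, "AAEGGHKKKKAL")))
```

The second print checks a made-up sequence in which E is followed by two residues, H, four residues, a non-proline residue, and L.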
Limitations of regular expressions:
They do not capture how often each residue occurs at each position of the
multiple sequence alignment. For instance, in the above example, we
don’t know how often F, H, and M each occur at the 4th position in this
motif. H may be much more common than F or M, but we have no way of
knowing this from the regular expression.
Which sequence does not
contain the motif
defined by the regular
expression A-R-[ND]-C(2)-E?
A. ARNCCE
B. ARDCCE
C. ARNDCE
How are motifs and domains in protein families represented?
• Regular expressions,
• PSSMs, profiles, and profile hidden Markov models: numerical
representations of a multiple sequence alignment that contain
information about the probability of observing a specific residue at
a given location in the alignment.
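As a small illustration of such a numerical representation, the sketch below turns the six-sequence GEW/GTW/... alignment from the regular-expression example into per-column residue frequencies and a log-odds scoring matrix. The pseudocount of 1 and the uniform background frequency of 1/20 are simplifying assumptions, the function names are hypothetical, and gaps are not handled. Profile HMMs add insert and delete states on top of this idea and are not shown.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(alignment):
    """Observed residue frequencies for each alignment column."""
    return [
        {aa: n / len(column) for aa, n in Counter(column).items()}
        for column in zip(*alignment)
    ]


def log_odds_pssm(alignment, pseudocount=1.0, background=1 / 20):
    """Log2 odds of each residue at each column versus a uniform background."""
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


alignment = ["GEW", "GTW", "GTY", "GRW", "GKW", "GAW"]
freqs = column_frequencies(alignment)
pssm = log_odds_pssm(alignment)
print(freqs[2])  # frequencies of W and Y at the third position
print(round(pssm[0]["G"], 2), round(pssm[0]["A"], 2))
```

Unlike G-X-[WY], the matrix records that W is far more common than Y at the third position, which is exactly the information the regular expression discards.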
Logogram
Determining if a sequence of interest contains a motif or
domain represented by a probabilistic model:
We would use the profile to “scan” the new proteinʼs sequence:
X X X X X X X X X X… → score1: calculate score for occurrence of motif beginning at residue 1
X X X X X X X X X X… → score2: calculate score for occurrence of motif beginning at residue 2
.
. Continue scanning until end of sequence is reached.
.
…X X X X X X X X X X → scoreN: calculate score for occurrence of motif at last possible position
The highest scoring location is the most likely position of the motif/domain
in the sequence.
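The scanning procedure can be sketched in a few lines. The example below builds a small log-odds matrix from the three-column GEW/GTW/... alignment used earlier (pseudocount 1, uniform background: both are assumptions), slides it along a made-up query sequence, and reports the highest-scoring window. The function names and the query are hypothetical; real tools add gap handling and statistical significance (E-values), which are omitted here.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0, background=1 / 20):
    """Log2-odds score for every residue at every alignment column."""
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def scan(sequence, pssm):
    """Score every window of the sequence; scores[i] is the window starting at i."""
    width = len(pssm)
    return [
        sum(col[aa] for col, aa in zip(pssm, sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]


pssm = build_pssm(["GEW", "GTW", "GTY", "GRW", "GKW", "GAW"])
sequence = "MKLGTWAAP"  # made-up query
scores = scan(sequence, pssm)
best = max(range(len(scores)), key=scores.__getitem__)
print(best + 1, sequence[best:best + len(pssm)])
```

Here `scores[0]` plays the role of score1 and `scores[-1]` the role of scoreN.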
Databases of motifs and domains
The following are databases of regular expressions, PSSMs, profiles, and/or profile
HMMs derived from alignments of motifs and domains found in protein families. You
can submit a protein sequence to any of these databases in order to determine if the
sequence contains one of the motifs or domains represented in the database.
Pfam (http://pfam.xfam.org)
Uses profile HMMs.
Two-part database: Pfam-A (curated) and Pfam-B (automatically generated).
InterPro (http://www.ebi.ac.uk/interpro/)
An integrated database designed to unify multiple databases, including PROSITE,
Pfam, PRINTS, ProDom, SMART, and others.
Note: searching InterPro may produce different results than searching the individual
databases that are part of InterPro.
CD-SEARCH (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi)
Uses profiles. Includes the SMART and Pfam databases.
SMART (http://smart.embl-heidelberg.de/)
Uses profile HMMs.
Alignments of domains checked manually by curators.
Identify putative function of
protein query sequence with the CD-Search tool
RPS-BLAST output
Drosophila_melanogaster_Discs_large_5_Q9VKG8	2d1dd0159d53e777a070d719964b0545	1916	Pfam	PF00625	Guanylate kinase	1773	1903	3.3E-11	T	21-10-2020	IPR008145	Guanylate kinase/L-type calcium channel beta subunit
[Figure legend: Guanylate kinase, PDZ domain, SH3 domain]
Questions about your Gene
Family (for your paper)
• Does domain content vary between
orthologs and paralogs?
• Does domain content vary between
cnidarians and bilaterians?
• What does the domain content reveal
about the potential functional roles of your
proteins?
• What does the domain content reveal
about the evolution of functional roles?
In Lab…
• You will be using RPS-BLAST to search
for domains in your sequences using
Pfam-A profiles
• You will be visualizing changes in domain
content on your phylogeny.
Questions for you to look up
about your gene family…
• What domains do members of your gene
family contain?
• What is the function of each of these
domains?
• What are the PFAM accessions of these
domains?
