Slides 1

Uploaded by

Phlip Ong
Copyright
© © All Rights Reserved
Me and my bioinformatics hat!

Dr Colin Bingle

Senior Lecturer
Academic Unit of Respiratory Medicine
LU108
[email protected]
I am a respiratory cell biologist

I am interested in components of the pulmonary innate immune response, in normal and diseased lung
My studies involve understanding the following questions

• Why are certain genes expressed in certain cells?
• What factors control the expression of these genes?
• How are these genes processed?
• What do the gene products do?
• These are all questions that involve gene hunting, transcriptional and proteomic analysis
Genome is fixed – Cells are dynamic
• A genome is static
– Every cell in our body has a copy of the same genome

• A cell is dynamic
– Responds to external conditions
– Most cells follow a cell cycle of division

• Cells differentiate during development and can alter their transcriptome/proteome
1. Gene Hunting
2. Transcriptome and Proteome
Outline - I will use examples of my own research to illustrate the ways in which genes are increasingly identified and studied.
(Essentially) gone are the days when studies were initiated from traditional experimental techniques.
Predominantly, genes are now identified based on where and when they are expressed, or by purely bioinformatic approaches.
1. Gene Hunting
2. Transcriptome and Proteome
We will study how genes are identified - basic techniques
What constitutes the transcriptome and the proteome, how they vary and how they can be studied
You will see how such techniques can be used in individual projects
The six steps at which eukaryotic gene expression can be controlled can be divided between the genomic/transcriptomic and the proteomic levels.
Genomics, transcriptomics and proteomics should be considered as different spokes of the same wheel: information from each provides support to the others.
[Figure: wheel with spokes labelled Genomics, Transcriptomics and Proteomics]
My work principally involves the identification and study of novel genes and the comparative analysis of well established genes.

Such genes have been identified through a variety of approaches.

These include:

EST analysis
SAGE and array studies
Proteomics
De novo gene predictions followed by cloning

Do not trust everything you read

There is much to learn and many genes to discover!


AN EXERCISE IN FUNCTIONAL GENOMICS: THE
BIOLOGY OF HE4
For the past few years we have been studying the biology of two low molecular weight antiproteinases which play a role in host defence in the airways.
These proteins, elafin and secretory leukocyte proteinase inhibitor (SLPI), are members of the whey acidic protein (WAP) / four-disulphide core family of proteins.
WAP domains comprise units of approximately 50 amino acids that include 8 conserved cysteines.
WAP proteins are typically small secretory proteins.
In SLPI and elafin the precise arrangement of the WAP domain confers the antiproteinase activity.
SLPI and elafin are co-localised within 75 kb of each other on chromosome 20.
Because of this observation we have been looking to see if they are co-ordinately regulated.
In searching the surrounding regions of chromosome 20 we were able to locate a number of additional WAP domain-containing proteins, including HE4.
WAP domains are present in multiple proteins - and
individual proteins can contain more than 1 domain
The human WFDC protein locus on chromosome 20 contains at
least 15 genes
Most are completely uncharacterised (other than by PCR)

HE4
De novo gene prediction
Output often requires manual annotation and editing!
However, it can be used as the basis for subsequent PCR and cloning
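
To give a feel for what the simplest layer of de novo prediction involves, here is a minimal sketch of an open reading frame (ORF) scan. It is a toy only: real gene predictors also model exons, splice sites and both strands, which is one reason their output needs manual editing. The function name `find_orfs`, the `min_codons` threshold and the toy sequence are all hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=5):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    Only the three forward reading frames are scanned; introns and the
    reverse strand are ignored (an assumption made to keep the sketch short).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                # Keep the ORF only if it has enough codons before the stop
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Hypothetical toy sequence containing one short ORF in frame 2
toy_sequence = "CCATGGCTGCTTGTCGCCTATAAGG"
predicted = find_orfs(toy_sequence)
print(predicted)
```

Each predicted ORF is only a candidate; as the slide says, it becomes useful as the basis for primer design, PCR and cloning.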
HE4 was identified in 1991 as a human epididymis-specific cDNA by the use of differential screening.
Sequence analysis revealed that HE4 shared sequence homology with WAP proteins, and it was therefore suggested to be an antiproteinase!
There was no functional evidence for this.
Subsequently it was used as an epididymal marker and cloned from dog and rabbit.
More recently HE4 was identified as a putative marker of human ovarian tumours by a variety of expression array studies.
These papers used three different techniques: DNA and oligonucleotide arrays, as well as Serial Analysis of Gene Expression (SAGE).

These unbiased approaches all identify genes based on differential expression in tissues or cells.
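
One common way to put a number on "differential expression" is a log2 ratio between two conditions. The sketch below is illustrative only: the gene names and signal values are hypothetical, and the pseudocount and two-fold cut-off are assumptions, not values taken from the studies mentioned here.

```python
import math


def log2_fold_change(sample, reference, pseudocount=1.0):
    """log2 ratio of two expression values; the pseudocount avoids log(0)."""
    return math.log2((sample + pseudocount) / (reference + pseudocount))


# Hypothetical signal intensities: (tumour, normal tissue)
expression = {
    "GENE_A": (799.0, 99.0),
    "GENE_B": (120.0, 120.0),
    "GENE_C": (49.0, 199.0),
}

fold_changes = {
    gene: log2_fold_change(tumour, normal)
    for gene, (tumour, normal) in expression.items()
}

# Assumed cut-off: at least a two-fold change in either direction
differential = sorted(g for g, lfc in fold_changes.items() if abs(lfc) >= 1.0)
print(differential)
```

In practice replicate measurements and a statistical test would be needed before calling a gene differentially expressed; a bare ratio is only a first filter.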
A DNA microarray can allow us to observe a genome’s gene expression program.
Each cell in our bodies expresses a specific set of genes according to a precisely controlled genetic script that gives that cell its distinctive design and functional capabilities.
The gene expression program that unfolds during a developmental, physiological or pathological process can be read as a kind of script for that process.
Uses known or unknown cDNA or oligo sequence
Serial Analysis of Gene Expression, or SAGE, is also designed to gain a quantitative measure of gene expression. The SAGE technique itself includes several steps utilizing molecular biological, DNA sequencing and bioinformatics techniques. These steps are used to produce small "tags", which are then, in some manner, assigned gene descriptions.
Generates data on unknown sequences
Three principles underlie the SAGE methodology:
1) A short sequence tag (10-14 bp) contains sufficient information to uniquely identify a transcript, provided that the tag is obtained from a unique position within each transcript;
2) Sequence tags can be linked together to form long serial molecules that can be cloned and sequenced; and
3) Quantitation of the number of times a particular tag is observed provides the expression level of the corresponding transcript.

http://www.embl-heidelberg.de/info/sage/
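
The first and third SAGE principles can be sketched in a few lines. This is a simplified illustration, not the laboratory protocol: it assumes the tag is taken directly 3' of the last CATG site in each transcript (CATG being the site of NlaIII, commonly used as the anchoring enzyme), it uses a 10 bp tag, and the transcript sequences and function names are hypothetical.

```python
from collections import Counter

ANCHOR_SITE = "CATG"  # assumed anchoring-enzyme site (NlaIII)
TAG_LENGTH = 10       # lower end of the 10-14 bp range


def extract_tag(transcript):
    """Return the tag immediately 3' of the last anchor site, or None."""
    pos = transcript.rfind(ANCHOR_SITE)
    if pos == -1:
        return None  # no anchor site, so no tag can be produced
    start = pos + len(ANCHOR_SITE)
    tag = transcript[start:start + TAG_LENGTH]
    return tag if len(tag) == TAG_LENGTH else None


def count_tags(transcripts):
    """Count how often each tag is observed across a pool of transcripts."""
    tags = (extract_tag(t) for t in transcripts)
    return Counter(tag for tag in tags if tag is not None)


# Hypothetical mRNA pool: three copies of one transcript, one of another
gene_a = "GGACATGTTTCCCAAAGGGTTTAAA"
gene_b = "TTCATGACGTACGTACGTAAAA"
tag_counts = count_tags([gene_a, gene_a, gene_a, gene_b])
print(tag_counts)
```

The tag count stands in for the expression level; assigning each tag to a gene is a separate database lookup that is not shown.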
HE4 may be differentially expressed in distinct types of ovarian
cancers
Searching Ensembl with the term HE4 provides access to expression and sequence information as well as to gene structure predictions

Much of the data is derived from other public databases and represents the “best effort” of the sequencing and annotation communities - hence it is not always correct!

Transcript cDNA Sequence

Total length: 564 bp No. Exons: 4

>ENST00000217425
CCTGCACCCCGCCCGGGCATAGCACCATGCCTGCTTGTCGCCTAGGCCCG
CTAGCCGCCGCCCTCCTCCTCAGCCTGCTGCTGTTCGGCTTCACCCTAGT
CTCAGGCACAGGAGCAGAGAAGACTGGCGTGTGCCCCGAGCTCCAGGCTG
ACCAGAACTGCACGCAAGAGTGCGTCTCGGACAGCGAATGCGCCGACAAC
CTCAAGTGCTGCAGCGCGGGCTGTGCCACCTTCTGCTCTCTGCCCAATGA
TAAGGAGGGTTCCTGCCCCCAGGTGAACATTAACTTTCCCCAGCTCGGCC
TCTGTCGGGACCAGTGCCAGGTGGACAGCCAGTGTCCTGGCCAGATGAAA
TGCTGCCGCAATGGCTGTGGGAAGGTGTCCTGTGTCACTCCCAATTTCTG
AGCTCCAGCCACCACCAGGCTGAGCAGTGAGGAGAGAAAGTTTCTGCCTG
GCCCTGCATCTGGTTCCAGCCCACCTGCCCTCCCCTTTTTCGGGACTCTG
TATTCCCTCTTGGGCTGACCACAGCTTCTCCCTTTCCCAACCAATAAAGT
AACCACTTTCAGCA
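
A quick sanity check on a transcript like this is to translate its open reading frame and compare the result with the known protein. The sketch below is illustrative: it uses only the first 50 nucleotides of ENST00000217425 as printed above and the standard genetic code, and it assumes that the first ATG is the start codon. The helper name `translate_from_first_atg` is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_from_first_atg(cdna):
    """Translate from the first ATG until a stop codon or the sequence end."""
    start = cdna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(cdna) - 2, 3):
        amino_acid = CODON_TABLE[cdna[i:i + 3]]
        if amino_acid == "*":
            break  # stop codon reached
        protein.append(amino_acid)
    return "".join(protein)


# First line (50 nt) of the transcript sequence shown above
first_50_nt = "CCTGCACCCCGCCCGGGCATAGCACCATGCCTGCTTGTCGCCTAGGCCCG"
peptide = translate_from_first_atg(first_50_nt)
print(peptide)  # MPACRLGP
```

The eight residues obtained agree with the start of the HUMAN row in the multiple alignment shown later in these slides.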
Output of BLAST analysis of HE4 vs human EST database
Mouse-over to show defline and scores. Click to show alignments

BLASTN 2.2.1 [Apr-13-2001]

Reference:
Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

RID: 1002902122-2668-16916

Query= gi|32050|emb|X63187.1|HSHE4MR H.sapiens HE4 mRNA for extracellular proteinase inhibitor homologue (583 letters)

Database: GenBank Human EST entries
3,832,541 sequences; 1,821,805,599 total letters
Distribution of 194 Blast Hits on the
Query Sequence
EST analysis of non-human databases allows the identification of HE4 sequences from mouse, rat and pig

Multiple alignment reveals a high degree of sequence similarity between species. It is also clear that the two WAP domains in the mouse and rat proteins are separated by a “linker” region not found in the other species. This may have functional relevance
|HUMAN| 1 MPACRLGPLAAALLLSLLLFG-FTLVSGTGAEKTGVCPELQADQNCTQECVSDSECADNL
|DOG| 1 MPASRPGPLAGALLLGLLLG--LPRVPGTEVEKPGVCPQVSVDLNCTQDCVSDAQCADNL
|PIG| 1 MPACRLGLLVASLLLGLLLG--LPPPTGTGAEKSGVCPAVEVDMNCTQECLSDADCADNL
|RABBIT| 1 MPASRLVLLGAVLLLGLLLLLELPPVTGTGADKPGVCPQVSVDLNCTQDCRADQDCAENL
|MOUSE| 1 MPACRPCLLAAGLLLGLLCGT-PISATGTDAEKPGECPQVEPITDCVLDCTLDKDCADNR
|RAT| 1 -------------LLGLLLFT-PLSATGTRAEKPGVCPQVEPITDCVKACILDNDCQDNY

|HUMAN| 60 KCCSAGCATFCSLPN---------------------------------------------
|DOG| 59 KCCQAGCATICHLPN---------------------------------------------
|PIG| 59 KCCKAGCVTICQMPN---------------------------------------------
|RABBIT| 61 KCCRAGCSAICSIPN---------------------------------------------
|MOUSE| 60 KCCQAGCSSVCSKPNGPSEGELSGTDTKLSETGTTTQSAGLDHTTKPPGGQVSTKPPAVT
|RAT| 47 KCCQAGCGSVCSKNGPLSEGKLS-----RTATGTTTLSAGLARTSPLSRGQVSTKPPVVT

|HUMAN| 75 -------DKEGSCPQVNINFPQLGLCRDQCQVDSQCPGQMKCCRNGCGKVSCVTPNF
|DOG| 74 -------EKEGSCPQVNTDFPQLGLCQDQCQVDSHCPGLLKCCYNGCGKVSCVTPIF
|PIG| 74 -------EKEGSCPQVDIAFPQLGLCLDQCQVDSQCPGQLKCCRNGCGKVSCVTPVF
|RABBIT| 76 -------EKEGSCP--SIDFPQLGICQDLCQVDSQCPGKMKCCLNGCGKVSCVTPNF
|MOUSE| 120 REGLGVREKQGTCP--SVDIPKLGLCEDQCQVDSQCSGNMKCCRNGCGKMACTTPKF
|RAT| 102 KE-GGNGEKQGTCP--SVDFPKLGLCEDQCQMDSQCSGNMKCCRNGCGKMGCTTPKF
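
The conservation visible in the alignment can also be checked programmatically. The sketch below takes the six rows of the last alignment block above, copied as printed, and reports the columns where every species has a cysteine, plus a simple pairwise identity. The helper names are hypothetical, and skipping gapped positions when computing identity is an assumption of this sketch rather than a standard definition.

```python
# Rows of the final alignment block, copied from the alignment above
ALIGNMENT_BLOCK = {
    "HUMAN":  "-------DKEGSCPQVNINFPQLGLCRDQCQVDSQCPGQMKCCRNGCGKVSCVTPNF",
    "DOG":    "-------EKEGSCPQVNTDFPQLGLCQDQCQVDSHCPGLLKCCYNGCGKVSCVTPIF",
    "PIG":    "-------EKEGSCPQVDIAFPQLGLCLDQCQVDSQCPGQLKCCRNGCGKVSCVTPVF",
    "RABBIT": "-------EKEGSCP--SIDFPQLGICQDLCQVDSQCPGKMKCCLNGCGKVSCVTPNF",
    "MOUSE":  "REGLGVREKQGTCP--SVDIPKLGLCEDQCQVDSQCSGNMKCCRNGCGKMACTTPKF",
    "RAT":    "KE-GGNGEKQGTCP--SVDFPKLGLCEDQCQMDSQCSGNMKCCRNGCGKMGCTTPKF",
}


def conserved_columns(rows, residue):
    """Column indices (0-based) where every row carries the given residue."""
    length = len(rows[0])
    return [col for col in range(length) if all(r[col] == residue for r in rows)]


def percent_identity(seq_a, seq_b):
    """Identity over columns where neither sequence has a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


rows = list(ALIGNMENT_BLOCK.values())
cysteine_columns = conserved_columns(rows, "C")
human_vs_mouse = percent_identity(ALIGNMENT_BLOCK["HUMAN"], ALIGNMENT_BLOCK["MOUSE"])

print(len(cysteine_columns), "cysteine columns conserved in all six species")
print(f"Human vs mouse identity in this block: {human_vs_mouse:.1f}%")
```

For this block the cysteine count comes out at eight, which is consistent with the eight conserved cysteines of a WAP domain described earlier.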
We characterised the gene and could show that the human gene can undergo complex alternative splicing which potentially generates a number of distinct protein products (Oncogene, 2002)

[Figure: exon structure of the human HE4 gene (exons 1, 2, 3b, 3a, 3, 4b, 4a, 4, 5; values shown beneath the exons: 127, 124, >331, 129, 136, >128, 290, 153, 162), with the full-length (FL) transcript and splice variants V1-V4, and the encoded protein regions labelled SP, N-WAP, C-WAP and Unique]

Each of these will likely have a distinct function!


In fact it may be more complicated than this!


The HE4 gene probably has at least 3 promoter regions


Each promoter region will be differentially regulated
[Figure: HE4 exon structure (exons 1, 2, 3b, 3a, 3, 4b, 4a, 4, 5) with splice variants V1-V4 and their N-WAP and C-WAP domains]
Gene identification now often uses a combination of bioinformatic and “wet” laboratory techniques.

These rely on:

Genomic,
Transcriptomic and
Proteomic methodologies
These studies have added greatly to the understanding of the function of HE4.

But all of the genomics-based information still required many additional functional studies, not least to determine whether the protein has true antiproteinase activity.

We also require information on the relative levels of expression of the different isoforms, as well as specific functional information on them.
In this study the authors set out to identify mediators
found in the vomeronasal organ (VNO), located at the
base of the nasal septum, responsible for mediating
pheromone information in mice
• It was known that something in soiled bedding could mediate gene expression (using a transcription factor, c-fos) in the female VNO.
• Such activity was gland specific and could be purified using HPLC
• The protein identified was not previously described
• This shows a recent example of the identification of a novel family of completely unknown proteins present within the “finished” and annotated human genome.
• There are still more to be discovered
