For a given gene sequence, how do we find the 5' transcription start site? What is the percent similarity to the consensus initiator sequence responsible for transcription initiation? How do we identify and mark the binding site for TFIIB?
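One way to approach the similarity question is to slide a consensus motif along the sequence and report the best-matching window. The sketch below assumes the mammalian initiator consensus YYANWYY (A at position +1) and the upstream TFIIB recognition element consensus SSRCGCC. Both consensus strings and the names `percent_match` and `best_site` are illustrative assumptions, not part of the assignment. A high-scoring window is only a candidate site and would need to be checked against annotation.

```python
# IUPAC nucleotide codes used by the consensus strings below.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "N": "ACGT",
}

# Assumed consensus motifs (illustrative; check against your course notes).
INR = "YYANWYY"     # initiator element, A is the +1 position
BRE_U = "SSRCGCC"   # upstream TFIIB recognition element

def percent_match(window, consensus):
    """Percentage of positions in `window` that the consensus allows."""
    hits = sum(base in IUPAC[code] for base, code in zip(window, consensus))
    return 100.0 * hits / len(consensus)

def best_site(seq, consensus):
    """Return (0-based start, window, percent match) of the best window.

    Ties are resolved in favour of the leftmost window.
    """
    seq = seq.upper()
    k = len(consensus)
    candidates = [
        (percent_match(seq[i:i + k], consensus), -i, seq[i:i + k])
        for i in range(len(seq) - k + 1)
    ]
    score, neg_start, window = max(candidates)
    return -neg_start, window, score
```

For example, `best_site(sequence, INR)` gives the position and percent match of the most initiator-like window in `sequence`, and `best_site(sequence, BRE_U)` does the same for the assumed TFIIB element.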
CS444: BIOINFORMATICS (Assignment 1 - Lab)
(To be submitted handwritten)
The following transcript was found to be abundant in a human patient’s blood sample.
>Example
ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTC
AAGGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCC
TGTCCTTCCCCACCACCAAGACCTACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGG
CCACGGCAAGAAGGTGGCCGACGCGCTGACCAACGCCGTGGCGCACGTGGACGACATGCCCAACGCGCTG
TCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGGTCAACTTCAAGCTCCTAAGCCACT
GCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGCCTCCCTGGACAA
GTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCCATGCTT
CTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAA
AGTCTGAGTGGGCGGCA
Q1:
Which BLAST program should we use in this case?
Sol:
Q2:
What are the names and accession numbers of the top ten hits from your BLAST search?
Sol:
Q3:
What are the percent identities for the top five hits?
Sol:
Q4:
How many identical and non-identical nucleotides are there in your top hit compared to your last reported hit?
Sol:
Q5:
What is the “Official Symbol” and “Official Full Name” for this gene?
Sol:
Q6:
What is the “Lineage” for this gene?
Sol:
Q7:
What chromosome is this gene located on?
Sol:
Q8:
How many exons are annotated for this gene?
Sol:
Q9:
What is the function of the encoded protein?
Sol:
Q10:
Does the protein have a role in human disease(s)? If so, what diseases?
Sol:
CS444: BIOINFORMATICS (Assignment 1)
Q1: What is the complement to the DNA sequence given below?
5’-ACCAAACAAAGTTGGGTAAGGATAGATCAATCAATGATCATATTCTAGTACACTTAGGATTCAAGATCCT
ATTATCAGGGACAAGAGCAGGATTAGGGATATCCGAGATGGCCACACTTTTGAGGAGCTTAGCATTGTTC
AAAAGAAACAAGGACAAACCACCCATTACATCAGGATCCGGTGGAGCCATCAGAGGAATCAAACACATTA
TTATAGTACCAATTCCTGGAGATTCCTCAATTACCACTCGATCCAGACTACTGGACCGGTTGGTCAGGTT
AATTGGAAACCCGGATGTGAGCGGGCCCAAACTAACAGGGGCACTAATAGGTATATTATCCTTATTTGTG
GAGTCTCCAGGTCAATTGATTCAGAGGATCACCGATGACCCTGACGTTAGCATCAGGCTGTTAGAGGTTG
TTCAGAGTGACCAGTCACAATCTGGCCTTACCTTCGCATCAAGAGGTACCAACATGGAGGATGAGGCGGA
CCAATACTTTTCACATGATGATCCAAGCAGTAGTGATCAATCCAGGTCCGGATGGTTCGAGAACAAGGAA
ATCTCAGATATTGAAGTGCAAGACCCTGAGGGATTCAACATGATTCTGGGTACCATTCTAGCCCAGATCT
GGGTCTTGCTCGCAAAGGCGGTTACGGCCCCAGACACGGCAGCTGATTCGGAGCTAAGAAGGTGGATAAA
GTACACCCAACAAAGAAGGGTAGTTGGTGAATTTAGATTGGAGAGAAAATGGTTGGATGTGGTGAGGAAC
AGGATTGCCGAGGACCTCTCTTTACGCCGATTCATGGTGGCTCTAATCCTGGATATCAAGAGGACACCCG
GGAACAAACCTAGGATTGCTGAAATGATATGTGACATTGATACATATATCGTAGAGGCAGGATTAGCCAG
TTTTATCCTGACTATTAAGTTTGGGATAGAAACTATGTATCCTGCTCTTGGACTGCATGAATTTGCTGGT
GAGTTATCCACACTTGAGTCCTTGATGAATCTTTACCAGCAAATGGGAGAAACTGCACCCTACATGGTAA-3’
Q2: What is the mRNA sequence of the given DNA sequence in Q1?
Q3: What is the protein sequence formed from the mRNA sequence of Q2?
Q4: What mRNA sequence will be encoded if every “AT” in the Q1 DNA sequence is mutated to “TA”?
Q5: What protein sequence is translated from the new mRNA sequence of Q4?
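The operations in Q1 to Q5 can be sketched in a few lines of Python, which may help when checking handwritten answers. This is a minimal sketch under stated assumptions: the given strand is treated as the coding (sense) strand, so the mRNA is the same sequence with T replaced by U; translation reads codons from the first base in frame 0 and stops at the first stop codon. If your course uses a different convention (for example, treating the given strand as the template, or starting at the first AUG), adjust accordingly. All function names are illustrative.

```python
# Base-wise complement (not the reverse complement).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(dna):
    return dna.upper().translate(COMPLEMENT)

def transcribe(coding_strand):
    """mRNA for a coding-strand DNA sequence: replace T with U."""
    return coding_strand.upper().replace("T", "U")

# Standard genetic code, codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(mrna, start=0):
    """Translate from `start`, stopping at the first stop codon."""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

def mutate_at_to_ta(dna):
    """Replace every non-overlapping "AT" with "TA", scanning left to right."""
    return dna.upper().replace("AT", "TA")
```

With these helpers, Q4 and Q5 correspond to `translate(transcribe(mutate_at_to_ta(dna)))`, again under the coding-strand assumption.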
(a) BLAST
What is the impact on
• the speed of the heuristic
• the number of false negatives
• the number of false positives
of the following changes in BLAST parameters:
(i) an increase/decrease in w, where w is the word length
(ii) an increase/decrease in T, where T is the minimum score a word must reach against a query word, under a pair-score matrix, to be added to that query word's list of matching words
(iii) an increase/decrease in S, where S is the threshold score an alignment must reach after extension
(b) The higher the accuracy required of DNA sequences, the more time-consuming database construction becomes. What is done to reduce this time? Does this introduce errors? Explain how accuracy is then improved.
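The roles of w and T in part (a) can be explored with a toy seeding step. The sketch below builds, for each length-w word of a query, the list of words that score at least T against it. It assumes a simple nucleotide scoring of +1 for a match and -1 for a mismatch, which is an illustrative choice and not BLAST's actual scoring; the function names are hypothetical.

```python
from itertools import product

def neighbourhood(word, T, match=1, mismatch=-1, alphabet="ACGT"):
    """All words of the same length that score at least T against `word`."""
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        score = sum(match if a == b else mismatch
                    for a, b in zip(word, candidate))
        if score >= T:
            hits.append("".join(candidate))
    return hits

def seed_words(query, w, T):
    """Map each length-w word of the query to its neighbourhood list."""
    return {
        query[i:i + w]: neighbourhood(query[i:i + w], T)
        for i in range(len(query) - w + 1)
    }

# Compare list sizes for the same word under different thresholds.
for T in (3, 1, -3):
    print(T, len(neighbourhood("ACG", T)))
```

Running the loop with different values of w and T shows how the number of seed words to look up changes, which is a useful starting point for reasoning about speed and missed or spurious hits.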
(a) In order to calculate a multiple sequence alignment for N sequences, how many pairwise alignments have to be calculated?
(b) Align the following using “star alignment” showing all intermediate steps:
S1= ATTCGGATT
S2= ATCCGGATT
S3= ATGGAATTTT
S4= ATGTTGTT
S5= AGTCAGG
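The first step of a star alignment, comparing every pair of sequences and picking the centre, can be sketched as follows. This sketch assumes unit-cost edit distance (match 0, mismatch 1, gap 1) as the pairwise measure and picks the sequence with the smallest total distance to the others. Your course may prescribe a different scoring scheme, in which case the centre may differ. The merging of pairwise alignments into the final multiple alignment is not shown.

```python
from itertools import combinations

seqs = {
    "S1": "ATTCGGATT",
    "S2": "ATCCGGATT",
    "S3": "ATGGAATTTT",
    "S4": "ATGTTGTT",
    "S5": "AGTCAGG",
}

def edit_distance(a, b):
    """Unit-cost edit distance by dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # gap in b
                           cur[j - 1] + 1,             # gap in a
                           prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = cur
    return prev[-1]

def pairwise_distances(seqs):
    """Distance for every unordered pair: N(N-1)/2 entries for N sequences."""
    return {
        (a, b): edit_distance(seqs[a], seqs[b])
        for a, b in combinations(seqs, 2)
    }

def choose_centre(seqs):
    """Return (centre name, total distance per sequence)."""
    totals = {name: 0 for name in seqs}
    for (a, b), d in pairwise_distances(seqs).items():
        totals[a] += d
        totals[b] += d
    return min(totals, key=totals.get), totals

centre, totals = choose_centre(seqs)
print(centre, totals)
```

The number of entries returned by `pairwise_distances` also illustrates the pair count asked about in part (a).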
(a) You have a protein of unknown function from a bacterium. You have made a knockout mutant, but the bacteria die immediately without the corresponding gene. You have sequenced the protein. What steps would you take to guess the function of the protein? What kind of information would you look for?
(a) What is the difference between spotted and oligonucleotide microarrays?
(b) What is a probe? How are probes for microarrays designed?
(c) What is a probeset? What is probeset summarization and why do we need it?
(d) If a gene is shown to be induced four-fold in a microarray experiment, what would be the log2-transformed expression ratio?
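The log2 transformation in part (d) is a one-line calculation. The sketch below assumes the expression ratio is defined as treated intensity divided by control intensity; the function name is illustrative.

```python
import math

def log2_ratio(treated, control):
    """log2 of the expression ratio treated/control."""
    return math.log2(treated / control)

print(log2_ratio(4.0, 1.0))   # four-fold induction
print(log2_ratio(1.0, 4.0))   # four-fold repression
```

One reason the log2 scale is commonly used is that induction and repression of the same magnitude become symmetric around zero.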
(a) Why do you have to normalize microarray data to compare two conditions? Explain two normalization techniques that can be used here.
(b) Describe and discuss specific problems likely to appear on a microarray. Describe and discuss what measures can be taken, from a data analysis point of view, to reduce or eliminate their effects.
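Two commonly used normalization approaches can be sketched in plain Python to make the idea concrete: global median centring of log ratios, and quantile normalization across arrays. This is a minimal sketch on small lists, assuming no missing values and ignoring ties; the function names are illustrative, and the techniques expected in your answer may differ.

```python
from statistics import median

def median_centre(log_ratios):
    """Shift log ratios so that their median is zero."""
    m = median(log_ratios)
    return [x - m for x in log_ratios]

def quantile_normalise(columns):
    """Give every array (column) the same distribution of values.

    `columns` is a list of equal-length lists, one per array. Each value is
    replaced by the mean of the values holding the same rank across arrays.
    Ties are broken by position, which is a simplification.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [
        sum(col[r] for col in sorted_cols) / len(columns) for r in range(n)
    ]
    result = []
    for col in columns:
        order = sorted(range(n), key=col.__getitem__)
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        result.append(out)
    return result
```

For example, `quantile_normalise([[5, 2, 3], [4, 1, 6]])` returns two arrays that contain the same set of values, while each gene keeps its rank within its own array.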
(a) What is the output obtained from an RNA-seq experiment? Why do you have to remove rRNA and tRNA before performing RNA-seq?
(b) Why is mapping of RNA-seq reads more difficult than mapping re-sequencing reads or ChIP-seq reads? Explain.
(c) What is a Phred quality score? Explain its use in an RNA-seq experiment.
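The Phred score Q is related to the base-calling error probability P by Q = -10 log10(P). The sketch below decodes a FASTQ quality string and applies a simple read filter. It assumes the Phred+33 (Sanger) character encoding; the function names and the threshold of 20 are illustrative assumptions.

```python
def phred_to_error_prob(q):
    """Error probability for Phred score q: P = 10^(-q/10)."""
    return 10 ** (-q / 10)

def decode_quality(qual_string, offset=33):
    """Convert a FASTQ quality string to Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual_string]

def passes_filter(qual_string, min_mean_q=20):
    """Keep a read only if its mean Phred score reaches the threshold."""
    scores = decode_quality(qual_string)
    return sum(scores) / len(scores) >= min_mean_q

print(decode_quality("I5!"))
print(phred_to_error_prob(30))
```

In an RNA-seq workflow, scores like these are typically used to trim or discard low-quality reads before mapping.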