Delft - Leiden BioInformatics Master track

Molecular Computational Biology 2009

Exercises / Assignments

The answers to the problems formulated in the assignments may either be sent to a.p.gultyaev /at/ biology.leidenuniv.nl or be handed in after the lecture hours. Deadlines for submissions in spring 2009 are indicated below. Timely submission of correct assignment solutions can contribute up to 2 points to the total exam mark (in other words, without assignments the maximum mark is 8).

Important: please don't send attachments or hyperlinks in your e-mails (they will not be considered): it is enough to describe briefly what has been done and what kind of result is obtained.

Assignments 1-6: deadline March 10, 2009

(0.15 pt) The genome of human coronavirus NL63 has the accession number NC_005831. How many nucleotides does it contain? How many amino acids are in the protein annotated as “replicase polyprotein 1ab”, encoded in this genome?
(0.15 pt) Using the protein-protein BLAST program (blastp), determine the species that has the most similar homolog of human interferon-gamma (NP_000610). Give the accession number of this homolog and the number of amino acid differences compared to human interferon-gamma.
(0.15 pt) The following DNA sequence fragment, containing some mutation, was isolated from a patient:

tttgctccccgcgcgctgtttttctcagtgactttcagcgggcggaaaag
(a) In which gene is the mutation located? On which chromosome? How many nucleotides are changed?
(b) Using the annotation given for the corresponding sequence database entry, could you indicate possible diseases caused by mutations in this gene?
(0.15 pt) What is the most efficient strategy to quickly determine the difference (number of amino acid substitutions) between homologous proteins from two strains of influenza virus? For instance, determine the number of substitutions in the polymerase PB1 from the strain that caused the death of a veterinarian during the outbreak of bird flu in 2003 in The Netherlands (strain A/Netherlands/219/03) as compared to the homologous protein from the strain isolated from a 1918 pandemic victim who had been interred in Alaska permafrost since November 1918 (strain A/Brevig Mission/1/1918). How many of these substitutions are conservative according to the default substitution matrix (BLOSUM62) used in BLAST programs for proteins?
(0.15 pt) One of the 8 RNA segments of the influenza A genome codes for a polymerase called PB1 of about 750 amino acids. It has recently been determined that the 5'-proximal part of this RNA segment contains an overlapping open reading frame (ORF) coding for another protein, PB1-F2, of about 90 amino acids. However, for many influenza A virus strains the information about this protein is still missing in GenBank. Using the tool ORF Finder (www.ncbi.nlm.nih.gov/Tools/), determine the size of the PB1-F2 protein encoded by the PB1 segment from the strain A/Netherlands/219/03 (accession AY340083). Using one of the BLAST versions provided by ORF Finder, determine the strain that has the most similar putative PB1-F2 to that from A/Netherlands/219/03.
(0.15 pt) Using BLAST options and the amino acid sequence of the protein Dicer from Arabidopsis thaliana (accession Q9SP32), retrieve putative (partial) plant Dicer mRNAs from the database of expressed sequence tags (EST). Which organisms have the putative Dicer proteins with the highest sequence similarity to that from A. thaliana (give three names and accession numbers of BLAST hits)? (NB. The EST database contains "raw" nucleotide sequences, and its entries do not include features such as coding sequences.)
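The counting asked for in the influenza PB1 assignment above (substitutions, and how many of them are conservative) can be illustrated with a minimal Python sketch. It assumes the two sequences are already aligned (equal length, "-" for gaps) and takes "conservative" to mean a positive score in the substitution matrix, which is how BLAST's alignment view marks positions with "+". The function name and the tiny matrix excerpt are illustrative only; the excerpt holds just four BLOSUM62 entries, so real sequences need the full matrix.

```python
def count_substitutions(aligned_a, aligned_b, scores):
    """Return (substitutions, conservative substitutions) for two aligned sequences.

    `scores` maps residue pairs such as ("K", "R") to substitution scores.
    Pairs missing from `scores` are counted as non-conservative.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    total = conservative = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == b or "-" in (a, b):
            continue  # identities and gap columns are not substitutions
        total += 1
        score = scores.get((a, b), scores.get((b, a)))
        if score is not None and score > 0:
            conservative += 1
    return total, conservative


# Excerpt of BLOSUM62, sufficient for the toy alignment below only.
BLOSUM62_SUBSET = {("K", "R"): 2, ("I", "V"): 3, ("D", "E"): 2, ("L", "D"): -4}

total, conservative = count_substitutions("MKIDL-A", "MRVEDGA", BLOSUM62_SUBSET)
print(total, conservative)  # 4 substitutions, 3 of them conservative
```

In the toy alignment, K/R, I/V and D/E score positively, L/D scores negatively, and the gap column is ignored.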
Assignments 7-10: deadline March 20, 2009
(0.25 pt) Using the ClustalW program, available at the EBI website (www.ebi.ac.uk/services/), calculate a multiple alignment of five homologous Hfq proteins from the following organisms: Escherichia coli (accession NP_418593), Neisseria gonorrhoeae (YP_207484), Nitrosomonas europaea (Q82V23), Legionella pneumophila (Q5ZZK1) and Bacillus subtilis (NP_389616). How many amino acid residues are completely conserved in all five sequences? What is the length of the longest stretch of conserved amino acids? Give this motif in single-letter code.
(NB. The easiest input for ClustalW is a FASTA format file of sequences, prepared in advance. For this assignment, the default parameters of ClustalW are sufficient.)
(0.25 pt) Recently a so-called minor spliceosome (which catalyses the splicing of atypical introns) has been identified in a number of organisms. In order to establish the evolutionary history of the minor spliceosome, BLAST searches for minor spliceosome-specific proteins were used. Explore the usefulness of the PSI-BLAST program for finding (distant) homologs of one of the human minor spliceosome-specific proteins (accession NP_078847):
- How many hits does PSI-BLAST iteration 1 yield?
- How many hits does PSI-BLAST iteration 2 yield? How many new hits with an E-value better than the threshold are found? Give the accession number and organism name for the best of these new hits.
- Does PSI-BLAST iteration 3 yield new hits with an E-value better than the threshold? Explain the result.
(0.15 pt) Using the database of protein profiles PROSITE (www.expasy.org/prosite), determine whether the amino acid sequence

vkpklplipghegvgvieevgpgvt
contains a consensus pattern. Give the description of this consensus pattern.
Using the database of protein profiles PROSITE, determine the positions of conserved motifs (profiles) in one of the human proteins involved in RNA splicing, 9G8 (accession NP_001026854). Using the resources of the Entrez system, determine how many exons are in the 9G8 gene, and which of these exons contain the sequences encoding the motifs found.
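To make the notion of a PROSITE consensus pattern in the two assignments above more concrete, the sketch below translates the PROSITE pattern syntax into a Python regular expression and scans a sequence with it. It handles the common elements only (single residues, `x`, `[..]`, `{..}`, repeat counts `(n)` and `(n,m)`, and the `<` / `>` anchors outside brackets). The function name `prosite_to_regex` and the toy sequence are illustrative. The example uses the N-glycosylation site pattern N-{P}-[ST]-{P} (PS00001), not the pattern asked for in the assignment.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern such as 'N-{P}-[ST]-{P}' into a regex string."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        anchor_start = element.startswith("<")  # pattern must start at the N-terminus
        anchor_end = element.endswith(">")      # pattern must end at the C-terminus
        element = element.strip("<>")
        # Split off an optional repetition suffix: (n) or (n,m).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core.lower() == "x":
            core = "."                          # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"      # any residue except those listed
        # Single letters and [..] classes carry over unchanged.
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if anchor_start else "") + core + ("$" if anchor_end else ""))
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}")
# Report 1-based start positions, as sequence databases do.
hits = [(m.start() + 1, m.group()) for m in re.finditer(regex, "AANGSAANPSA")]
print(regex, hits)
```

For a lowercase sequence such as the one given in the assignment, convert it with `.upper()` first or pass `re.IGNORECASE`. Note that `re.finditer` reports non-overlapping matches only.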
Assignments 11-12: deadline April 20, 2009
IMPORTANT: Exercise 11 should preferably be performed on one of the University computers (in any case, a University e-mail address should be entered in the server input form), because the PSIPRED server description distinguishes between academic and commercial users. For academic users there are no complications (no registration, passwords, etc.): the result is simply sent to your e-mail address. Other e-mail addresses (e.g. private accounts, especially Hotmail and the like) may be treated as commercial ones, with a more complicated procedure for using the server.
(0.15 pt) It is known that the secondary structures of the RNA-binding Hfq proteins contain a specific structural motif: an N-terminal alpha helix followed by several beta strand regions. Using the PSIPRED protein secondary structure prediction program (www.psipred.net), predict the secondary structure of the Hfq protein from Mesorhizobium loti (accession NP_102205).
How many helices and beta strand regions are predicted? What are their lengths? Is the predicted secondary structure consistent with the motif described above?
(0.15 pt) Some RNA molecules have alternative secondary structures with close values of free energy. For the sequence given below

ACAGGUUCGCCUGUGUUGCGAACCUGCGGGUUCG
predict the folding, using the program mfold (www.bioinfo.rpi.edu/applications/mfold). What is the free energy of the lowest-energy conformation? What is the free energy of structure 2? What are the main secondary structure elements in these structures?
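When comparing the secondary structure elements of two predicted foldings, it can help to list the stems explicitly. The sketch below assumes the structure has been written in dot-bracket notation, where "(" and ")" mark paired bases and "." an unpaired base. This is an assumption for illustration: mfold may present its results in other forms, such as ct files or drawings. The function names and the toy structure are hypothetical, and the toy structure is not a prediction for the sequence above.

```python
def base_pairs(dot_bracket):
    """Return a sorted list of 0-based (i, j) base pairs from a dot-bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % (i + 1))
            pairs.append((stack.pop(), i))
        elif ch != ".":
            raise ValueError("unexpected character %r" % ch)
    if stack:
        raise ValueError("unbalanced '(' at position %d" % (stack[-1] + 1))
    return sorted(pairs)


def stems(pairs):
    """Group directly stacked pairs (i, j), (i+1, j-1), ... into stems."""
    result = []
    for i, j in pairs:  # pairs are sorted by their 5' position
        if result and result[-1][-1] == (i - 1, j + 1):
            result[-1].append((i, j))   # continues the current stem
        else:
            result.append([(i, j)])     # a loop or bulge starts a new stem
    return result


toy_structure = "((((...))))..((...))"
stem_lengths = [len(stem) for stem in stems(base_pairs(toy_structure))]
print(stem_lengths)  # two hairpins, with stems of 4 and 2 base pairs
```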