Difference between revisions of "Texas Xenopus Genome Project/Species Identification"
From Marcotte Lab
Selected X. tropicalis mRNAs (8 RefSeq entries):
- NM_001004837: unnamed, predicted gene MGC69309 (NCBI: http://www.ncbi.nlm.nih.gov/nuccore/52345577, XenBase: http://www.xenbase.org/gene/showgene.do?method=displayGeneSummary&geneId=5903347)
- NM_001007499: paired-like homeodomain 1 (pitx-1) (NCBI: http://www.ncbi.nlm.nih.gov/nuccore/55926079, XenBase: http://www.xenbase.org/gene/showgene.do?method=displayGeneSummary&geneId=485440)
- NM_001011405: homeobox A5 (hoxa5) (NCBI: http://www.ncbi.nlm.nih.gov/nuccore/58332665, XenBase: http://www.xenbase.org/gene/showgene.do?method=displayGeneSummary&geneId=486060)
- NM_001035121
- NM_001113032
- NM_001127429
- NM_001129937
- NM_001142220
Revision as of 12:09, 9 December 2009
Target gene
- pfas (phosphoribosylformylglycinamidine synthase). PREDICTED from human Entrez gene.
Selection procedure
- Download X. tropicalis mRNA sequences from XenBase (Nov. 27, 2009 version).
- xdata:ID/XENTR_mRNA.xenbase20091127.fasta.gz 17 MB, gzipped.
- Download CHORI-219 sequences (from NCBI GenBank).
- xdata:ID/XENLA_CH219.fasta.gz 6.5 MB, gzipped. (CHORI-219 sequences: 29 BAC sequences from the X. laevis genome)
- Run BLAT (version 3.4, with default options) against the known CHORI BAC sequences.
- xdata:ID/XENTR_mRNA.XENLA_CH219.blat_pslx.gz 1.2 MB, gzipped.
blat XENLA_CH219.fasta XENTR_mRNA.xenbase20091127.fasta XENTR_mRNA.XENLA_CH219.blat_pslx -out=pslx
- Parse the two BLAT output files using the following criteria.
- From the X. tropicalis mRNAs, only RefSeq sequences (names starting with 'NM_') are considered.
- Select X. tropicalis mRNA sequences which hit both CHORI-219 (a match must be at least 200 bp long to count as a 'hit'). I only consider the 10 CHORI-219 BACs that we already know are available ('74I8','204L9','197E3','71P23','36I4','35I18','262A22','20I13','206K7','166K18').
- Survey each hit block and discard any block shorter than 200 bp. 42 hit blocks from 8 mRNAs are selected.
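The selection criteria above can be sketched in Python as follows. This is a minimal illustration, not the script actually used for this page. It assumes the standard 21-column, tab-separated PSL layout written by BLAT (the two extra pslx sequence columns are ignored). The helper `bac_id` and its rule that the BAC ID is the text after the last '-' in the target name are hypothetical, because the naming of the BAC records in XENLA_CH219.fasta is not stated here.

```python
from collections import defaultdict

# BAC IDs copied from the selection criteria above.
AVAILABLE_BACS = {'74I8', '204L9', '197E3', '71P23', '36I4',
                  '35I18', '262A22', '20I13', '206K7', '166K18'}
MIN_BLOCK_BP = 200


def bac_id(t_name):
    """Hypothetical rule: the BAC ID is the text after the last '-'."""
    return t_name.rsplit('-', 1)[-1]


def select_hit_blocks(psl_lines, bacs=AVAILABLE_BACS, min_bp=MIN_BLOCK_BP):
    """Return {mRNA name: [(BAC name, qStart, tStart, block size), ...]}."""
    selected = defaultdict(list)
    for line in psl_lines:
        fields = line.rstrip('\n').split('\t')
        # Skip PSL header lines and anything that is not a full record.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        q_name, t_name = fields[9], fields[13]
        # Criterion 1: RefSeq mRNAs only. Criterion 2: available BACs only.
        if not q_name.startswith('NM_') or bac_id(t_name) not in bacs:
            continue
        # blockSizes, qStarts and tStarts are comma-terminated lists.
        sizes = [int(x) for x in fields[18].split(',') if x]
        q_starts = [int(x) for x in fields[19].split(',') if x]
        t_starts = [int(x) for x in fields[20].split(',') if x]
        # Criterion 3: keep only hit blocks of at least min_bp.
        for size, q_start, t_start in zip(sizes, q_starts, t_starts):
            if size >= min_bp:
                selected[q_name].append((t_name, q_start, t_start, size))
    return dict(selected)
```

To try it on the BLAT output listed above, pass an open text handle, for example one returned by `gzip.open('XENTR_mRNA.XENLA_CH219.blat_pslx.gz', 'rt')`. The 200 bp threshold is applied per block here; whether the original analysis also applied it to the total match length is not clear from the description.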
- Run MUSCLE (version 4.0, with default options) for multiple sequence alignment. Interestingly, the sequence from CHORI-216 differs somewhat from both the XENTR_mRNA and the CHORI-219 fragments.
$ mus4 -i XENTR_CHORI.fasta -o XENTR_CHORI.muscle
XENLA_CH219-20I1   1 + ttattt----------------------gtgccctggatacccctggaactatagcagggtgac 42
XENTR_NM_0011422   1 + ttattt----------------------gtgccctgggtacccctggaactatagcggggtgac 42
XENTR_CH216-2E23   1 + tcaccccaaatccccccctaactggccttcaggctgggcccccttag-ctcataacaaggttac 63
                       *.*...                      .....****...***.*.**...***.*..***.**

XENLA_CH219-20I1  43 + tgttaccccaatgtttctatatatctgtaaccttgttattagct-aagggggcccagtctgaag 105
XENTR_NM_0011422  43 + tgttaccccaatgtttctatatatctgtaaccttgttatgggct-aagggggcccagcctgaag 105
XENTR_CH216-2E23  64 + agatatatagaaacattggggtaacagtcaccccgctatagttccaggggtacccagggc---- 123
                       .*.**.....*....*.....**.*.**.***..*.*** .... *.***..***** ..****

XENLA_CH219-20I1 106 + gtcagttagggggagatttggggtgagggcttatttg-----taccctgggtacccctggaact 164
XENTR_NM_0011422 106 + gccagttagggggggatttggggtgagtgcttatttg-----tgccctgggtacccctggaact 164
XENTR_CH216-2E23 124 + -acaaataagcactcaccccaaatcatcccctaactggccttcaggctgggcccc-cttagccc 185
                       * **..**.*... .*.......*.*. .*.**..**     ....*****..*****....*.

XENLA_CH219-20I1 165 + atagcagggtgactgttaccccaatgtttctatatatctgtaaccttgttatgagctaa-gggg 227
XENTR_NM_0011422 165 + atagcagggtgactgttaccccaatgtttctatatatctgtaaccttgttatgggctaa-gggg 227
XENTR_CH216-2E23 186 + ataacaaggttacagatatatagaaacattggggtaacagtcaccccgctatagttccaggggt 249
                       ***.**.***.**.*.**.....*....*.....**.*.**.***..*.***......* ***.

XENLA_CH219-20I1 228 + gcccagtctgaaggccagttagggggagatatggggtgagtgtttatttgtgccctggttaccc 291
XENTR_NM_0011422 228 + gcccagcctgaaggccagttagggggggatttggggtgagtgcttatttgtgccctgggtaccc 291
XENTR_CH216-2E23 250 + acccagggca---------------caaataagcact----------------------caccc 276
                       .***** ...***************...**..*...****** *************** .****

XENLA_CH219-20I1 292 + ctggaactatagcagggtgac 312(341)
XENTR_NM_0011422 292 + ctggaactatagcagggtgac 312(341)
XENTR_CH216-2E23 277 + c---------------aaatc 282(341)
                       ****************....*
- Run MUSCLE again, with only the XENTR_mRNA and CHORI-219 sequences.
XENLA_CH219-20I1   1 + ttatttgtgccctggatacccctggaactatagcagggtgactgttaccccaatgtttctatat 64
XENTR_NM_0011422   1 + ttatttgtgccctgggtacccctggaactatagcggggtgactgttaccccaatgtttctatat 64
                       *************** ****************** *****************************

XENLA_CH219-20I1  65 + atctgtaaccttgttattagctaagggggcccagtctgaaggtcagttagggggagatttgggg 128
XENTR_NM_0011422  65 + atctgtaaccttgttatgggctaagggggcccagcctgaaggccagttagggggggatttgggg 128
                       *****************  *************** ******* *********** *********

XENLA_CH219-20I1 129 + tgagggcttatttgtaccctgggtacccctggaactatagcagggtgactgttaccccaatgtt 192
XENTR_NM_0011422 129 + tgagtgcttatttgtgccctgggtacccctggaactatagcagggtgactgttaccccaatgtt 192
                       **** ********** ************************************************

XENLA_CH219-20I1 193 + tctatatatctgtaaccttgttatgagctaagggggcccagtctgaaggccagttagggggaga 256
XENTR_NM_0011422 193 + tctatatatctgtaaccttgttatgggctaagggggcccagcctgaaggccagttaggggggga 256
                       ************************* *************** ******************* **

XENLA_CH219-20I1 257 + tatggggtgagtgtttatttgtgccctggttacccctggaactatagcagggtgac 312(1)
XENTR_NM_0011422 257 + tttggggtgagtgcttatttgtgccctgggtacccctggaactatagcagggtgac 312(1)
                       * *********** *************** **************************
- However, it turns out that these sequences are highly repetitive (repeat unit of ~135 bp). Compare the 1st, 3rd and 5th lines (or the 2nd and 4th lines) of each sequence.
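One way to check for such a repeat without reading the alignment by eye is to compare a sequence with shifted copies of itself and look for the shift with the highest identity. The sketch below is only an illustration of that idea and was not part of the original analysis. The function names and the `min_shift` default are hypothetical, and no attempt is made to handle indels inside the repeat unit.

```python
def shift_identity(seq, shift):
    """Fraction of positions i where seq[i] equals seq[i + shift]."""
    n = len(seq) - shift
    if n <= 0:
        raise ValueError("shift must be smaller than the sequence length")
    same = sum(1 for i in range(n) if seq[i] == seq[i + shift])
    return same / n


def best_repeat_period(seq, min_shift=20, max_shift=None):
    """Return (shift, identity) for the shift with the highest self-identity.

    Ties are resolved in favour of the smallest shift, so a perfect repeat
    reports its basic unit rather than a multiple of it.
    """
    seq = seq.lower()
    if max_shift is None:
        max_shift = len(seq) // 2
    scores = [(shift_identity(seq, s), -s)
              for s in range(min_shift, max_shift + 1)]
    identity, neg_shift = max(scores)
    return -neg_shift, identity
```

Applied to the ungapped 312 bp CHORI-219 fragment, a clear identity peak at a shift near 135 would support the repeat unit noted above. Very small shifts are excluded through `min_shift` because low-complexity stretches can otherwise dominate the score.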