The frequent words with mismatches problem
One way to solve the Frequent Words with Mismatches problem is to generate all 4^k k-mers Pattern, compute ApproximatePatternCount(Text, Pattern, d) for each k-mer Pattern, and then find k-mers with the maximum number of approximate occurrences. This is an inefficient approach in practice, since many of the 4^k k-mers should not be considered because neither they nor their mutated versions (with up to d mismatches) appear in Text.
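The brute-force approach described above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the snake_case function names are hypothetical stand-ins (only ApproximatePatternCount is named in the text), and the alphabet is assumed to be A, C, G, T.

```python
from itertools import product


def hamming_distance(p, q):
    """Number of positions at which the equal-length strings p and q differ."""
    return sum(a != b for a, b in zip(p, q))


def approximate_pattern_count(text, pattern, d):
    """Count the windows of text that match pattern with at most d mismatches."""
    k = len(pattern)
    return sum(
        hamming_distance(text[i:i + k], pattern) <= d
        for i in range(len(text) - k + 1)
    )


def frequent_words_with_mismatches_brute_force(text, k, d):
    """Score every one of the 4^k possible k-mers and keep the top scorers."""
    best_count, best_patterns = 0, []
    for letters in product("ACGT", repeat=k):  # assumes a DNA alphabet
        pattern = "".join(letters)
        count = approximate_pattern_count(text, pattern, d)
        if count > best_count:
            best_count, best_patterns = count, [pattern]
        elif count == best_count and count > 0:
            best_patterns.append(pattern)
    return best_patterns
```

The loop visits all 4^k patterns no matter what Text contains, which is the source of the inefficiency: for k = 6 that is already 4096 calls to the counting function, and the number grows fourfold with each extra letter.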
Genome= GCAAAATGGAGCAGGATCAGCAAAATGGAAAATAAATGGAGGATCAAAATAAATGGAGGAGGAAAATGGAGGAAAATAAATGGATCAGGAAAATGCAGCAGGATCATCATCAGGAGCAGGATCAAAATTCAGGAGCAGGAGGATCAGCATCAGGAGGATCAGCAGGAAAATGCAGGAGGAGGAGGAAAATTCAAAATGGAGGAGGAGGAGCATCAGCAGCATCAGGAGGAGGATCAGCAGCAGGAGGAGGAGGAGGAAAATGGAGGAGGAGCAGGAGGAGCATCAGGAGGATCAGGAGCATCAGCAAAATTCAAAATGGAGGAAAATGCAGGAAAATGGAGCAGGAAAATAAATTCATCAAAATGCAGGAGGA
k= 6
d= 2
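For an input like the one above, one common way to avoid scoring all 4^k k-mers is to count only those k-mers that lie within d mismatches of some window of Text. The sketch below takes that route. It is a swapped-in technique offered as an assumption, not necessarily the method the source intends, and the names neighbors and frequent_words_with_mismatches are hypothetical.

```python
from collections import Counter
from itertools import combinations, product


def neighbors(pattern, d, alphabet="ACGT"):
    """Return the set of strings within Hamming distance d of pattern."""
    result = {pattern}
    for m in range(1, min(d, len(pattern)) + 1):
        # Pick exactly m positions to change ...
        for positions in combinations(range(len(pattern)), m):
            # ... and at each one, try every letter other than the original.
            choices = [[c for c in alphabet if c != pattern[i]] for i in positions]
            for letters in product(*choices):
                chars = list(pattern)
                for i, c in zip(positions, letters):
                    chars[i] = c
                result.add("".join(chars))
    return result


def frequent_words_with_mismatches(text, k, d):
    """Most frequent k-mers in text, allowing up to d mismatches per occurrence."""
    counts = Counter()
    for i in range(len(text) - k + 1):
        # Each window adds one approximate occurrence to every k-mer near it.
        counts.update(neighbors(text[i:i + k], d))
    if not counts:
        return []
    best = max(counts.values())
    return sorted(p for p, c in counts.items() if c == best)
```

With the values given above, the call would be frequent_words_with_mismatches(Genome, 6, 2), where Genome holds the string shown. Only k-mers that are close to some window of Text ever enter the counter, so patterns that cannot occur even approximately are never examined.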
You can see that ACTAT is a most frequent 5-mer of ACAACTATGCATACTATCGGGAACTATCCT, and ATA is a most frequent 3-mer of CGATATATCCATAG.
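These two claims concern exact occurrences (no mismatches) and can be checked with a short count of every k-mer window. The helper name most_frequent_kmers is a hypothetical one chosen for this sketch.

```python
from collections import Counter


def most_frequent_kmers(text, k):
    """Return (sorted list of most frequent k-mers, their exact count)."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    if not counts:
        return [], 0
    best = max(counts.values())
    return sorted(p for p, c in counts.items() if c == best), best


print(most_frequent_kmers("ACAACTATGCATACTATCGGGAACTATCCT", 5))
print(most_frequent_kmers("CGATATATCCATAG", 3))
```

In both strings the reported k-mer occurs three times, counting overlapping occurrences such as the two copies of ATA in ATATA.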