Abstract
Large collections of genomic information have accumulated in recent years, and latent in them is potentially significant knowledge that medicine and the pharmaceutical industry could exploit. The approach taken here to distilling such knowledge is to detect strings in DNA sequences that appear frequently, either within a given sequence (e.g., for a particular patient) or across sequences (e.g., from different patients sharing a particular medical diagnosis). Strings that occur very frequently are called motifs. We present basic theory and algorithms for finding very frequent and common strings. Strings that are maximally frequent are of particular interest and, having discovered such motifs, we briefly show how to mine association rules with an existing rough-sets-based technique. Further work and applications are in progress.
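To make the two notions of frequency concrete, the following is a minimal sketch and not the algorithm developed in the article. It counts substrings by brute force. The function names, the length bounds, the thresholds `min_count` and `min_support`, and the reading of "maximal" as "not contained in a longer frequent string" are all assumptions introduced here for illustration.

```python
from collections import Counter


def substring_counts(seq, min_len, max_len):
    """Count overlapping occurrences of every substring with length in [min_len, max_len]."""
    counts = Counter()
    for k in range(min_len, max_len + 1):
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def frequent_strings(seq, min_len, max_len, min_count):
    """Strings occurring at least min_count times within one sequence (hypothetical threshold)."""
    counts = substring_counts(seq, min_len, max_len)
    return {s: c for s, c in counts.items() if c >= min_count}


def common_strings(seqs, min_len, max_len, min_support):
    """Strings present in at least min_support of the given sequences (hypothetical threshold)."""
    support = Counter()
    for seq in seqs:
        # Each sequence contributes at most once per string.
        support.update(set(substring_counts(seq, min_len, max_len)))
    return {s: c for s, c in support.items() if c >= min_support}


def maximal(strings):
    """Keep strings that are not a proper substring of another string in the collection.

    This is an assumed reading of 'maximally frequent'; the article's definition may differ.
    """
    return {s for s in strings if not any(s != t and s in t for t in strings)}


if __name__ == "__main__":
    within = frequent_strings("ACGACGTACG", 2, 3, 3)
    print(within)
    print(maximal(within))
    print(common_strings(["ACGT", "TACG", "GGGG"], 3, 3, 2))
```

Enumerating every substring costs time proportional to the sequence length times the number of lengths considered, so this sketch is only suitable for short inputs; an index structure such as a suffix tree or suffix array would be the usual choice at genomic scale.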
