This method can be applied to microarray data to extract genes related to a specific biological mechanism (e.g., biomarkers for cancers). Primary applications include gene discovery.
About
Summary: This invention is a new computational algorithm for feature selection and classification that can be used to facilitate subsequent processing of the data. Specifically, it can be applied to microarray data to extract genes related to a specific biological mechanism (e.g., biomarkers for cancers).

Overview: Many current techniques for predicting and classifying cancer use DNA microarray and gene expression data, which allow large quantities of genetic material to be tested. Using these data, cells can be analyzed, genetic information can be extracted, and phenotypes can be determined. Current filter methods used in microarray data analysis focus on the mean difference or the variance of the data. The methods disclosed in this invention focus instead on the difference between statistical distributions; mean difference can be considered a special case of these methods. The method also allows for further improvements using Bayesian Neural Network (BNN) or Gaussian process modelling.

Applications: Potential primary applications include gene discovery, cancer classification, cancer diagnosis, drug discovery, and other diagnostic tools.

How it works: This technology is a filter method that focuses on the difference between statistical distributions. No prior distribution is assumed for the data - the distribution is calculated from the data itself. In general, the method involves (1) determining a statistical distribution function for a first dataset; (2) determining a statistical distribution function for a second dataset; (3) identifying a characterizing feature of the first and second datasets; (4) determining a probability of distinction based on the first and second statistical distribution functions, where the probability of distinction is the probability that the characterizing feature can be used to distinguish the first dataset from the second dataset; and (5) identifying a subject of the first dataset based on the probability of distinction.
Once the subject has been identified, the first dataset can be further analyzed using processing techniques appropriate to the subject.

Benefits: This method provides a probability of importance for each gene, which existing filter methods do not offer. Combined with biological knowledge, this probability gives biologists a better picture of these genes.

Why it is better: The main advantage of this invention is the per-gene probability of importance noted under Benefits, which is not available in existing filter methods. The secondary advantage is that it can identify genes that existing methods miss - genes that differ in statistical distribution but show little difference in mean.

Other Applications: This method can be applied to many other areas where feature selection is needed, such as an analysis that requires selecting a few pixels from images or video for classification (e.g., to detect a human face for surveillance purposes).
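
As a rough illustration of the filter idea described under "How it works", the sketch below scores each gene by how far apart the two empirical distributions are. The listing does not say how the probability of distinction is computed, so the score used here (the total variation distance between two histogram estimates) is an assumed stand-in and not the invention's formula. The function names, bin count, gene names, and synthetic data are all hypothetical.

```python
import random
from statistics import mean


def histogram(values, lo, hi, n_bins):
    """Empirical probability mass of `values` over n_bins equal bins on [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        # Clamp so that v == hi lands in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def distinction_score(first, second, n_bins=10):
    """Assumed stand-in score: total variation distance between the two
    histogram estimates. 0 means identical histograms, 1 means no overlap."""
    lo = min(min(first), min(second))
    hi = max(max(first), max(second))
    if lo == hi:
        return 0.0  # every value is identical, so nothing distinguishes the sets
    p = histogram(first, lo, hi, n_bins)
    q = histogram(second, lo, hi, n_bins)
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Synthetic expression levels for three hypothetical genes in two sample groups.
rng = random.Random(0)
n_samples = 500


def draw(mu, sigma):
    return [rng.gauss(mu, sigma) for _ in range(n_samples)]


genes = {
    "shifted_mean": (draw(0, 1), draw(2, 1)),
    "same_mean_wider_spread": (draw(0, 1), draw(0, 3)),
    "no_difference": (draw(0, 1), draw(0, 1)),
}

scores = {name: distinction_score(a, b) for name, (a, b) in genes.items()}
mean_gaps = {name: abs(mean(a) - mean(b)) for name, (a, b) in genes.items()}

# Rank genes by the distribution-based score, highest first.
for name in sorted(scores, key=scores.get, reverse=True):
    print(f"{name}: score={scores[name]:.2f}, mean gap={mean_gaps[name]:.2f}")
```

In this toy setup, the gene whose two groups share a mean but differ in spread has a small mean gap, yet its distribution-based score should sit well above that of the gene with no real difference. That is the kind of gene the listing says mean-difference filters can miss. A histogram is only one way to estimate a distribution from data; a kernel density estimate or another estimator could be substituted.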