Within a mixed population of cells, distinct cell types have distinct programs of transcription. Likewise, cells from distinct phases of the cell cycle exhibit phase-specific transcriptional patterns. When transcription levels are measured from a population of cells in a typical experiment, such as by using DNA microarrays, the measured transcription actually represents the weighted average of these many independent transcriptional programs.
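
As a minimal sketch of the weighted-average idea above, the following Python snippet mixes two cell-type expression profiles by their proportions. The function name `mix_expression` and all numbers are hypothetical, chosen only for illustration; they are not taken from real microarray data.

```python
def mix_expression(profiles, proportions):
    """Return population-level expression as the proportion-weighted
    average of per-cell-type expression profiles (one value per gene)."""
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    n_genes = len(profiles[0])
    return [
        sum(w * profile[g] for w, profile in zip(proportions, profiles))
        for g in range(n_genes)
    ]


# Hypothetical expression of four genes in two pure cell types.
type_a = [10.0, 2.0, 0.0, 4.0]
type_b = [2.0, 8.0, 6.0, 4.0]

# A population that is 75% type A and 25% type B.
population = mix_expression([type_a, type_b], [0.75, 0.25])
print(population)  # [8.0, 3.5, 1.5, 4.0]
```

Deconvolution runs this calculation in reverse: given `population` and the pure profiles, it estimates the proportions.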

Expression deconvolution is a method we have developed to estimate the proportions of different cells or cell types in a cell population by deconvoluting, or deconstructing, the DNA microarray data of the entire cell population into a weighted combination of expression from the distinct cell types. In essence, we treat specific transcriptional patterns in DNA microarray data as cell-type-specific markers and then look for the relative proportions of these markers.

The program Deconvolute is a Java application that performs these calculations. The program operates on two files of microarray data:

A first set of microarray data is read into the program to act as basis experiments, representing the cell-type specific expression patterns. For example:
(1) To deconvolute yeast microarray data into contributions from cells in different phases of the cell cycle, the basis experiments we selected were expression data from synchronized yeast cells in the G1, S, G2, M, and M/G1 phases of the cell cycle.
(2) To deconvolute expression data from a tissue biopsy in order to find proportions of distinct cell types, the basis experiments would be expression data from homogeneous cell populations representing the distinct cell types.

A second set of microarray data, in the same format, is read into the program to be analyzed. Each column in this second file, representing one microarray experiment, is fit to find the linear combination of the basis experiments that best models the cell population data. Specifically, mixtures of normalized basis vectors are evaluated with a simulated annealing algorithm to find the mixture giving the maximum Pearson correlation coefficient with the normalized target vector. The optimal weights of the basis vectors are interpreted as the proportions of the corresponding cells in the population.
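
The fitting step can be sketched roughly as follows. This is an illustrative Python sketch, not the code of the Deconvolute program. It assumes that "normalized" means mean-centered and scaled to unit variance, that weights are non-negative and sum to 1, and that the temperature follows a geometric cooling schedule. The function names, step size, temperatures, and toy data are all hypothetical.

```python
import math
import random


def normalize(v):
    """Mean-center and scale to unit variance (an assumed normalization)."""
    mean = sum(v) / len(v)
    sd = math.sqrt(sum((x - mean) ** 2 for x in v) / len(v))
    if sd == 0:
        raise ValueError("cannot normalize a constant vector")
    return [(x - mean) / sd for x in v]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def mix(basis, weights):
    """Weighted sum of basis vectors, gene by gene."""
    return [sum(w * b[g] for w, b in zip(weights, basis))
            for g in range(len(basis[0]))]


def deconvolve(basis, target, steps=20000, t_start=0.05, t_end=1e-5,
               step_size=0.1, seed=None):
    """Search for mixture weights maximizing the Pearson correlation
    with the target, using a simple simulated annealing loop."""
    rng = random.Random(seed)
    basis = [normalize(b) for b in basis]
    target = normalize(target)
    k = len(basis)
    weights = [1.0 / k] * k                      # start from an even mixture
    score = pearson(mix(basis, weights), target)
    best_w, best_s = weights[:], score
    for i in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (i / max(1, steps - 1))
        cand = weights[:]
        j = rng.randrange(k)                     # perturb one weight
        cand[j] = max(0.0, cand[j] + rng.uniform(-step_size, step_size))
        total = sum(cand)
        if total == 0:
            continue
        cand = [w / total for w in cand]         # keep weights summing to 1
        cand_s = pearson(mix(basis, cand), target)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature falls.
        if cand_s >= score or rng.random() < math.exp((cand_s - score) / temp):
            weights, score = cand, cand_s
            if score > best_s:
                best_w, best_s = weights[:], score
    return best_w, best_s


# Toy basis experiments (hypothetical values for 8 genes, 3 cell types).
basis = [
    [9, 8, 1, 1, 2, 1, 1, 2],
    [1, 2, 9, 8, 1, 1, 2, 1],
    [1, 1, 2, 1, 9, 8, 1, 2],
]
true_weights = [0.6, 0.3, 0.1]
target = mix([normalize(b) for b in basis], true_weights)

weights, final_score = deconvolve(basis, target, seed=0)
print([round(w, 2) for w in weights], round(final_score, 4))
```

Because the Pearson correlation ignores overall scale, the sum-to-1 constraint is what makes the weights interpretable as proportions in this sketch.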

Note that simulated annealing is a stochastic algorithm and is not guaranteed to find the optimal value every time it is run (although it works quite well). We typically perform the deconvolution several times to ensure that the results are consistent across runs and represent the best mixture of basis sets. The outcome of different runs can be compared via the Final Score, which is the Pearson correlation coefficient between the mixture of basis sets and the actual population data. This correlation coefficient ranges from -1 to 1, where 1 represents the maximum positive correlation (and a perfect fit to the data).
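
One possible way to organize such repeated runs is sketched below in Python. The helper `compare_runs` and the stand-in `toy_run` are hypothetical. `toy_run` only imitates a stochastic deconvolution run that returns a pair of weights and a Final Score; in practice it would be replaced by an actual run of the deconvolution.

```python
import random
import statistics


def compare_runs(run_once, n_runs=5):
    """Call a stochastic fit several times and summarize the results.

    run_once(seed) must return (weights, final_score)."""
    results = [run_once(seed) for seed in range(n_runs)]
    scores = [score for _, score in results]
    best_weights, best_score = max(results, key=lambda r: r[1])
    # Per-weight standard deviation across runs: small values suggest
    # that the runs agree on the proportions.
    spread = [
        statistics.pstdev([w[i] for w, _ in results])
        for i in range(len(best_weights))
    ]
    return {
        "best_weights": best_weights,
        "best_score": best_score,
        "scores": scores,
        "weight_spread": spread,
    }


def toy_run(seed):
    """Stand-in for one deconvolution run (NOT a real fit): returns
    weights near a made-up optimum of 0.7/0.3 and a score that drops
    as the weights move away from it."""
    rng = random.Random(seed)
    w = 0.7 + rng.uniform(-0.02, 0.02)
    return [w, 1.0 - w], 1.0 - (w - 0.7) ** 2


summary = compare_runs(toy_run, n_runs=5)
print(summary["best_score"], summary["weight_spread"])
```

The run with the highest Final Score is kept as the reported mixture, and a large spread in the weights across runs would suggest repeating the deconvolution more times.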