Simulation of Tumor Data from Single Cell Sequencing
Haoyun Lei
Computational Biology
We obtained a set of single-cell data files, each containing about 10k bins (9934 rows) of
information such as chromosome, chromosome position, gene number, and the copy number of
different genes. The files cover different kinds of tumors from different patients.
Figure 1. An example of the information contained in one single-cell sequencing file.
The goal is to infer what is happening at the cellular level in tumors from mixed measurements
that average signals over many cells. Here, we used the copy number of different genes as the
parameter and tried to simulate a matrix of tumors.
Since the files are plain text ('txt') and the copy number is in the second-to-last column, the
first step is to extract the copy number of every gene and put the values in a matrix in which
each row is a cell and each column is the copy number of a gene. Some files contain incomplete
or irregular information (for example, not all files contain the full 10k bins, or the copy
number of some genes is missing in some cells), so we decided to ignore those files and used the
majority that have complete information for the simulation.
Figure 2. Before filtering by the length of each row, we found that some genes are missing in some
cells (there should be 10k genes in each cell, but some cells have only 1k genes). Also, some cells
do have 10k genes but some of those genes lack a copy number. These genes always appear in the last
row of the cell, so we took the data from the first row to the second-to-last row to avoid this
(example not shown). In addition, to make sure the filtering is correct, I checked the length of
the output array.
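The extraction and filtering steps described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the function name `extract_copy_numbers`, the whitespace-separated layout, the absence of a header row, and the toy 4-row "files" are all assumptions made for the example.

```python
def extract_copy_numbers(text, expected_rows):
    """Return the copy-number column of one cell file, or None if incomplete.

    Assumptions (not from the report): columns are whitespace-separated,
    there is no header row, and the copy number is the second-to-last field.
    """
    rows = [line.split() for line in text.strip().splitlines()]
    if len(rows) != expected_rows:
        return None  # file does not hold the full set of bins
    kept = rows[:-1]  # drop the last row, whose copy number may be missing
    try:
        return [float(fields[-2]) for fields in kept]
    except (IndexError, ValueError):
        return None  # a row is too short or its value is not numeric


# Toy stand-ins for file contents: one complete file, one truncated file.
complete = "chr1 100 1 2 x\nchr1 200 2 3 x\nchr1 300 3 2 x\nchr1 400 4 NA x\n"
truncated = "chr1 100 1 2 x\nchr1 200 2 3 x\n"

cells = [extract_copy_numbers(t, expected_rows=4) for t in (complete, truncated)]
cells = [c for c in cells if c is not None]  # keep only the complete files

# Sanity check on the output length, as in the report.
assert all(len(c) == 3 for c in cells)
```

For the real data, `expected_rows` would be 9934, so each kept cell would contribute 9933 values.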
To make the simulation more convenient, I converted the output into a numpy array, randomly
picked 3 cells as a matrix, and transposed it. This gave a new 2D array in which each row is the
copy number of a gene and each column is a different cell.
Figure 3. The output of the modified 2D CELL array.
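A small sketch of the selection and transpose step, assuming the filtered data are already in a cells-by-genes numpy array. The array here is a random toy stand-in (5 cells, 8 genes), and the names `all_cells` and `cell_matrix` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility

# Toy stand-in for the filtered data: 5 cells x 8 genes
# (in the report each cell has 9933 genes).
all_cells = rng.integers(1, 5, size=(5, 8)).astype(float)

# Pick 3 distinct cells at random, then transpose to genes x cells.
chosen = rng.choice(all_cells.shape[0], size=3, replace=False)
cell_matrix = all_cells[chosen].T

print(cell_matrix.shape)  # (8, 3)
```

Sampling without replacement keeps the three columns distinct. Whether the original selection allowed repeats is not stated in the report.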
Then I started to simulate the matrix of tumors. For now, we simulate 100 tumors, and each tumor
should have 10k rows of gene copy numbers (9933 rows once the last row is dropped). So the size
of the final tumor matrix is 9933 × 100. Matrix multiplication obeys the dot-product rule as
follows:
Figure 4. The dot-product rule for matrix multiplication.
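For reference, the shape rule can be written out explicitly. The symbols A, B, m, n and p are generic labels introduced here, not notation from the report.

```latex
\[
\underbrace{A}_{m \times n}\,\underbrace{B}_{n \times p}
= \underbrace{AB}_{m \times p},
\qquad
(AB)_{ij} = \sum_{k=1}^{n} A_{ik}\, B_{kj}.
\]
```

With m = 9933 genes, n = 3 cells and p = 100 tumors, the product has the 9933 × 100 size quoted above.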
Knowing this, I created a weight matrix W of size 3 × 100 and a noise matrix N of size 9933 × 3.
Note that the entries of W are positive, drawn from a uniform distribution, and normalized so
that every column sums to 1. The entries of N follow a normal distribution.
……
Figure 5. Example of W and N.
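One way to build matrices with these properties in numpy is sketched below. The uniform range (0 to 1), the zero mean, and the noise standard deviation `sigma` are assumed values, since the report does not give them.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, only for reproducibility

n_genes, n_cells, n_tumors = 9933, 3, 100

# Weights: uniform draws on [0, 1), each column rescaled to sum to 1.
W = rng.uniform(0.0, 1.0, size=(n_cells, n_tumors))
W = W / W.sum(axis=0, keepdims=True)

# Noise: zero-mean normal draws; sigma is an assumed value.
sigma = 0.1
N = rng.normal(0.0, sigma, size=(n_genes, n_cells))

print(W.shape, N.shape)  # (3, 100) (9933, 3)
```

Each column of W can then be read as the mixing fractions of the three cells in one simulated tumor.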
With the gene matrix, W, and N, we were able to simulate a matrix of tumors. Furthermore, I
tidied the format of the tumor matrix with pandas.
……
Figure 6. Partial view of the simulated tumor matrix.
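The report does not spell out how the three matrices are combined. Since N has the same shape as the gene-by-cell matrix, the sketch below assumes the noise is added to the cell matrix before mixing, that is T = (C + N) W. The names `C`, `T` and `tumors`, the row and column labels, and the random stand-in data are all hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)  # fixed seed, only for reproducibility
n_genes, n_cells, n_tumors = 9933, 3, 100

# Stand-in gene-by-cell copy numbers (real values would come from the files).
C = rng.integers(1, 5, size=(n_genes, n_cells)).astype(float)

# Column-normalized uniform weights and normal noise (sigma = 0.1 assumed).
W = rng.uniform(size=(n_cells, n_tumors))
W /= W.sum(axis=0, keepdims=True)
N = rng.normal(0.0, 0.1, size=(n_genes, n_cells))

# Assumed noise model: perturb the cells, then mix them with W.
T = (C + N) @ W

# Label rows and columns so the matrix is easier to read.
tumors = pd.DataFrame(
    T,
    index=[f"gene_{i}" for i in range(n_genes)],
    columns=[f"tumor_{j}" for j in range(n_tumors)],
)
print(tumors.shape)  # (9933, 100)
```

If the noise were instead meant to act on the tumors directly, N would need shape 9933 × 100 and the model would become T = C W + N.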
Future work:
With the TUMOR matrix and the CELL matrix, we can estimate a weight matrix W’ and compare W
with W’ to find the highest level of noise the system can tolerate.
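As a rough illustration of what such a comparison could look like, the sketch below estimates W’ by ordinary least squares and reports the mean absolute difference from W at a few noise levels. Least squares is a choice made for this example, not a method named in the report. The noise model T = (C + N) W, the sigma values, the reduced gene count, and the helper `recovery_error` are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed, only for reproducibility
n_genes, n_cells, n_tumors = 2000, 3, 100  # fewer genes than the report, for speed

# Stand-in cell matrix and column-normalized weights.
C = rng.integers(1, 5, size=(n_genes, n_cells)).astype(float)
W = rng.uniform(size=(n_cells, n_tumors))
W /= W.sum(axis=0, keepdims=True)


def recovery_error(sigma):
    """Mean absolute difference between W and its least-squares estimate."""
    N = rng.normal(0.0, sigma, size=(n_genes, n_cells))
    T = (C + N) @ W  # assumed noise model
    W_est, *_ = np.linalg.lstsq(C, T, rcond=None)  # solves C @ W_est ~ T
    return float(np.abs(W_est - W).mean())


errors = {sigma: recovery_error(sigma) for sigma in (0.0, 0.5, 2.0)}
for sigma, err in errors.items():
    print(f"sigma={sigma}: mean |W' - W| = {err:.2e}")
```

Plain least squares does not force the estimated weights to be non-negative or to sum to 1. A constrained solver may be a better fit for real data.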