Simulation of Tumor Data from Single Cell Sequencing
Haoyun Lei
Computational Biology

We obtained a set of single-cell data files, each containing 10k bins (9934 rows) of information such as chromosome, chromosome position, gene number, and the copy number of each gene, for several kinds of tumor from different patients.

Figure 1. An example of the information contained in one single-cell sequencing file.

The goal is to infer what is happening at the cellular level in tumors from mixed measurements that average signals over many cells. Here, we used the copy number of each gene as the parameter and simulated a matrix of tumors.

The files are plain text (‘txt’) and the copy number is in the second-to-last column, so the first step is to extract the copy number of every gene and put the values in a matrix in which each row is a cell and each column is the copy number of one gene. Some files contain incomplete or irregular information (for example, not every file contains all 10k bins, and the copy number of some genes is missing in some cells). We decided to ignore these files and used the majority that have complete information for the simulation.

Figure 2. Before filtering by row length, we found that some genes are missing in some cells: each cell should have 10k genes, but some cells have only 1k. Other cells do have 10k genes, but some of those genes lack a copy number. These genes always appear in the last row of the cell, so we used the data from the first row to the second-to-last row to avoid them (example not shown). In addition, to make sure the filtering is correct, I checked the length of the output array.
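The extraction and filtering step described above can be sketched as follows. This is only an illustration under stated assumptions, not the original script: it assumes the files are whitespace-delimited with no header line, and the names `extract_copy_numbers`, `build_cell_matrix`, and `EXPECTED_ROWS` are hypothetical. Each file is passed in as a list of text lines.

```python
import numpy as np

EXPECTED_ROWS = 9934  # rows per complete file, as reported in the text


def extract_copy_numbers(lines, expected_rows=EXPECTED_ROWS):
    """Return one cell's copy-number vector, or None if the file is unusable.

    Assumes whitespace-separated columns with no header line (an assumption,
    the real file layout may differ).
    """
    rows = [line.split() for line in lines if line.strip()]
    if len(rows) != expected_rows:
        return None  # incomplete file (for example only ~1k rows): skip it
    try:
        # The copy number sits in the second-to-last column. The last row is
        # dropped because missing values tended to appear there.
        values = [float(fields[-2]) for fields in rows[:-1]]
    except (IndexError, ValueError):
        return None  # irregular row: skip the whole file
    return np.array(values)


def build_cell_matrix(files, expected_rows=EXPECTED_ROWS):
    """Stack the usable cells into a (cells x genes) matrix."""
    vectors = (extract_copy_numbers(lines, expected_rows) for lines in files)
    kept = [v for v in vectors if v is not None]
    if not kept:
        return np.empty((0, expected_rows - 1))
    matrix = np.vstack(kept)
    # Sanity check on the output length, as done in the report.
    assert matrix.shape[1] == expected_rows - 1
    return matrix
```

In practice each list of lines would come from reading one file, for example with `open(path).readlines()`. Files that fail either check are skipped rather than repaired, which matches the decision to keep only complete cells.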
To make the simulation more convenient, I converted the output into a NumPy array, randomly picked 3 cells as a matrix, and transposed it. This gave a new 2D array in which each row is the copy number of one gene and each column is a cell.

Figure 3. The output of the modified 2D CELL array.

Then I started to simulate the matrix of tumors. For now, we simulate 100 tumors, and each tumor has about 10k rows of gene copy numbers (9933 after dropping the last row). The final matrix of tumors therefore has size 9933 × 100. This follows from the rule of matrix multiplication: the product of an m × k matrix and a k × n matrix is an m × n matrix.

Figure 4. Rule of matrix multiplication.

Knowing this, I created a weight matrix W of size 3 × 100 and a noise matrix N of size 9933 × 3. The entries of W are positive, drawn from a uniform distribution, and normalized so that every column sums to 1. The entries of N are drawn from a normal distribution.

……

Figure 5. Example of W and N.

With the matrix of genes, W, and N, we were able to simulate a matrix of tumors. Furthermore, I tidied the format of the tumor matrix with pandas.

……

Figure 6. Partial information from the simulated tumor matrix.

Future work: with the TUMOR matrix and the CELL matrix, we can estimate a weight matrix W’ and compare W and W’ to find the highest level of noise the system can bear.
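The simulation step can be sketched in NumPy as below. The report does not state how N enters the calculation. Because N has the same shape as the CELL matrix (9933 × 3), this sketch assumes the noise is added to the cell profiles before mixing, that is TUMOR = (CELL + N) · W. That formula, the function name `simulate_tumors`, and the parameters `noise_sd` and `seed` are assumptions made for illustration.

```python
import numpy as np


def simulate_tumors(cell_matrix, n_cells=3, n_tumors=100, noise_sd=0.1, seed=0):
    """Mix randomly chosen cells into simulated tumors.

    cell_matrix has one row per cell and one column per gene.
    Returns (tumors, cells, W, N).
    """
    rng = np.random.default_rng(seed)

    # Randomly pick n_cells distinct cells and transpose to genes x cells.
    picked = rng.choice(cell_matrix.shape[0], size=n_cells, replace=False)
    cells = cell_matrix[picked].T
    n_genes = cells.shape[0]

    # Uniform weights, normalized so that every column sums to 1.
    W = rng.uniform(size=(n_cells, n_tumors))
    W /= W.sum(axis=0, keepdims=True)

    # Normally distributed noise with the same shape as the cell matrix.
    N = rng.normal(loc=0.0, scale=noise_sd, size=(n_genes, n_cells))

    # Assumed mixing model: (genes x cells) @ (cells x tumors) = genes x tumors.
    tumors = (cells + N) @ W
    return tumors, cells, W, N
```

With a CELL matrix of 9933 genes and the defaults above, `tumors` has shape 9933 × 100. Wrapping the result in a `pandas.DataFrame` gives labeled rows and columns for inspection, and setting `noise_sd=0` reproduces the noise-free mixture, which is a convenient baseline when comparing W with an estimated W’.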