Methods in Imaging Chromosomes

Aalok Shah

1 Introduction

Experimental evidence indicates that understanding the physical organization and positioning of chromosomes may give insight into gene expression. Specifically, it is believed that certain locations in the nucleus have a higher propensity for active genes [8]. Moreover, it is believed that the relative positioning of genes within the chromosome may also play a role in gene expression [8]. Thus, an exploration of the geometry of chromosomes should lead to a deeper understanding of how genes are activated.

To achieve this understanding, we consider a simpler set of chromosomes than those of the human body: polytene chromosomes in the fruit fly, Drosophila. These are ideal because of the availability of data and the prior work done on Drosophila. In order to understand these polytene chromosomes, a geometrical representation must be developed, and we must be able to simulate an image in order to compare the representation with available image data. The latter is the primary focus of this paper.

2 Background

2.1 The Fourier Transform

By the Projection-Slice Theorem, the Fourier Transform of a projection of an object is equal to a slice through the Fourier Transform of that object. That is, the Fourier Transform of a projection is equivalent to the Fourier Transform of the experimental image. For instance, consider a 3-dimensional image, given by the function ρ(x, y, z). This image, or function, can be reconstructed by summing the Fourier Transforms of slices along the z-direction and then back transforming the total function.

Figure 1: Sample X-ray image slice of Drosophila. The band pattern along the chromosome represents the allele pattern.

As described in [4] and [8], we shall use a filamentary model to represent chromosomes in the image. This means that the distribution ρ(x, y, z) is parameterized only by arc length along a space curve r(s), where s is the arc length parameter. Mathematically, this implies that

$$
\rho(\mathbf{x}) = \int_0^L \lambda(s)\,\delta[\mathbf{x} - \mathbf{r}(s)]\,ds, \tag{1}
$$

where x is the standard position vector and λ(s) is the aforementioned filamentary density function. Using this, the Fourier Transform becomes:

$$
F(\mathbf{k}) = \frac{1}{2\pi}\int_{\mathbb{R}^3} \rho(\mathbf{x})\exp(i\mathbf{k}\cdot\mathbf{x})\,d\mathbf{x} = \frac{1}{2\pi}\int_0^L \lambda(s)\exp[i\mathbf{k}\cdot\mathbf{r}(s)]\,ds. \tag{2}
$$

In [4], Hausrath and Goriely call this the Frenet-Fourier Transform. If the curvature and torsion (denoted by κ(s) and τ(s)) are known for the space curve, [4] shows that we can simultaneously calculate the Frenet-Fourier Transform and the space curve r(s) by coupling this equation with the Frenet-Serret formulas:

$$
\begin{aligned}
F'[\mathbf{k}](s) &= \lambda(s)\exp(i\mathbf{k}\cdot\mathbf{r}) \\
\mathbf{r}'(s) &= \mathbf{t} \\
\mathbf{t}'(s) &= \kappa(s)\,\mathbf{n} \\
\mathbf{n}'(s) &= -\kappa(s)\,\mathbf{t} + \tau(s)\,\mathbf{b} \\
\mathbf{b}'(s) &= -\tau(s)\,\mathbf{n}.
\end{aligned}
\tag{3}
$$

Here, {t, n, b} are respectively the tangent, normal, and binormal vectors to the curve. Initial values for {r, t, n, b} must be given, and the initial value for the Fourier Transform can be taken to be zero. Thus, we now have both an integral formulation and an ODE formulation for computing the Fourier Transform of the image using the filamentary model. Note that for both formulations, computations must be done on a grid that spans values of k in all directions.

2.2 Filamentary Model

Figure 2: Filamentary tube model for polytene chromosomes. The figure on the left is the space curve through the chromosomes, and the figure in the middle is a radially symmetric tube centered at that space curve. The last image is the same tube sliced in order to show the radial symmetry.

In this model, chromosomes are tubes centered at the space curve r(s). In [4], this is modeled with a Gaussian filament density. The density function λ(s) is radially symmetric about this curve, as shown in Figure 2. Looking closely at Figure 1, one notes the banded structure of the chromosome alleles. The pattern changes as a function of arc length, which supports the filamentary model. Let the function describing the band pattern be α(s). Also, let the intra-tube variance be β(s).
In [4], Hausrath and Goriely derive the density function to be:

$$
\lambda(s) = \frac{\alpha(s)}{2\beta(s)}\exp\left[\frac{(\mathbf{k}\cdot\mathbf{n})^2 + (\mathbf{k}\cdot\mathbf{b})^2}{-4\beta(s)}\right]. \tag{4}
$$

Note that the exponent in this density function vanishes if and only if k is parallel to t, the direction of the space curve. In this direction, λ(s) is maximized.

Because the function α(s) describes the band pattern in the chromosome image, it is especially valuable to further applications of this problem. This function allows us to correlate specific alleles with physical location, since it is parameterized by arc length and the position vector r(s) is also known. In 1935, Dr. C.B. Bridges drew by hand the equivalent of α(s), and we would like to have a more accurate representation on a larger scale.

Figure 3: Bridges' Map for a piece of a polytene chromosome. Although hand-drawn, the map is remarkably accurate, and his drawings are still used today.

2.3 Parameter Estimation

Previous work by Hausrath et al. [8] used algorithms to compute κ(s), τ(s), and α(s). Essentially, ridge lines are computed by interpolating through the intensity of the image, and points along the space curve are chosen so as to ensure connectivity of the ridge lines. α(s) is determined from the intensity profile using an interpolation scheme based on neighboring points. Curvature and torsion profiles are estimated by scanning through a range of values and choosing the helical arc that best fits each small segment of the chromosome data while satisfying smoothness constraints. β(s) is taken to be constant along helical arcs.

The curve is taken to have piecewise constant curvature and torsion. For this case, there exists an exact solution to the Frenet-Serret formulas, and the space curve is a combination of helical arcs. Since the Frenet frame can be computed exactly, one only needs to evaluate the integral in (2) to compute the Fourier Transform.
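The piecewise-helical construction above can be illustrated for a single arc. The following is a minimal sketch, not the implementation used in [4] or [8]: it assumes constant α and β on the arc, places the helix axis along z (a hypothetical choice of initial frame), and evaluates (2) with a plain trapezoidal rule. The names `helix_frame`, `filament_density`, and `frenet_fourier_arc` are illustrative only.

```python
import cmath
import math


def dot(u, v):
    """Euclidean dot product of two 3-vectors given as tuples."""
    return sum(ui * vi for ui, vi in zip(u, v))


def helix_frame(s, kappa, tau):
    """Exact position and Frenet frame on a helix with constant kappa > 0 and tau.

    Assumed (hypothetical) placement: the helix axis is the z-axis and
    r(0) = (kappa / w**2, 0, 0), where w = sqrt(kappa**2 + tau**2).
    """
    w = math.hypot(kappa, tau)
    a, c = kappa / w**2, tau / w**2  # helix radius and rise per radian
    cs, sn = math.cos(w * s), math.sin(w * s)
    r = (a * cs, a * sn, c * w * s)
    t = (-a * w * sn, a * w * cs, c * w)
    n = (-cs, -sn, 0.0)
    b = (c * w * sn, -c * w * cs, a * w)
    return r, t, n, b


def filament_density(k, n, b, alpha, beta):
    """Gaussian filament density of equation (4) for given alpha and beta."""
    return alpha / (2 * beta) * math.exp(-(dot(k, n) ** 2 + dot(k, b) ** 2) / (4 * beta))


def frenet_fourier_arc(k, kappa, tau, length, alpha, beta, n_steps=2000):
    """Trapezoidal approximation of integral (2) over one helical arc.

    alpha and beta are held constant along the arc (an assumption of this sketch).
    """
    h = length / n_steps
    total = 0j
    for i in range(n_steps + 1):
        r, t, n, b = helix_frame(i * h, kappa, tau)
        g = filament_density(k, n, b, alpha, beta) * cmath.exp(1j * dot(k, r))
        weight = 0.5 if i in (0, n_steps) else 1.0  # trapezoid end-point weights
        total += weight * g
    return total * h / (2 * math.pi)
```

For example, `frenet_fourier_arc((0.3, 0.2, 0.5), 1.5, 0.7, 2.0, alpha=1.0, beta=0.5)` returns one complex sample of the transform; in practice the call would be repeated over a grid of k values and over all arcs.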
Furthermore, one can compute the Fourier Transform for each arc separately and then sum the results, since the transform is linear; this computation is therefore embarrassingly parallel.

3 Numerical Integration

3.1 Integral Formulation

As noted previously, there are two ways to formulate the problem of solving for the Frenet-Fourier transform of the image, given by (2). The first method is to compute the Frenet frame for the desired curve and then to numerically integrate (2). This method is especially useful when the curve has piecewise constant curvature, because the Frenet frame can be computed exactly. Thus, for this case, the only numerical error involved in computing the Fourier transform comes from the numerical integration.

Figure 4: Sample helical arcs taken from chromosome. Segments are small to ensure accuracy.

We shall consider three approaches to numerically integrating (2). The first approach is a simple quadrature rule. In this method, we first split the integral into n subintervals Δ_i of a predetermined length, denoted by h. Then, within each subinterval, the integrand is fitted by a polynomial, denoted by p_i(s), and (2) is approximated by the sum over i of the integrals of p_i(s) over Δ_i. That is, if the polynomial is of degree q, then

$$
F[\mathbf{k}] = \sum_i \int_{\Delta_i} \lambda(s)\,e^{i\mathbf{k}\cdot\mathbf{r}(s)}\,ds = \sum_i \int_{\Delta_i} \left(p_i(s) + O(h^q)\right)ds, \tag{5}
$$

where O(h^q) is "Big O" notation for the error in the polynomial fit. Thus, the error in the numerical integration is of the same order, since we are summing over 1/h intervals and the integration adds a factor of h. However, this also assumes that the integrand can be sampled anywhere, and in practice, λ(s) is sampled at fixed points determined from the image. Therefore, the error in this method increases if the step size becomes too small, because λ(s) must then be linearly interpolated between samples.

The second approach considered involves treating the integral as an ordinary differential equation (ODE).
If the integrand is denoted by g(s), then this amounts to solving the ODE

$$
\frac{dF}{ds} = g(s). \tag{6}
$$

Since the right-hand side of this ODE depends only on the independent variable s, any solver can be easily implemented. Fourier integrals can be highly oscillatory, and so we used a simple stiff ODE solver to account for this. Any other method can be used; in fact, various Runge-Kutta methods would be equivalent to using the various quadrature rules described above. However, since stiff ODE solvers incorporate future information in their numerical estimation, they are especially suited for ODEs that can change rapidly. We considered a linear multistep 2nd-order BDF method as our solver. For more information, please see [1].

For transforms that are relatively simple and do not vary on different time scales, the previous approaches are suitable. However, this is not the case for most image data, and it is therefore prudent to try another approach. Since we are computing a Fourier-type integral, it is perhaps more efficient to consider a method that takes this into account. That is why the Clenshaw-Curtis method is the last approach considered in our applications. This method is especially suited for Fourier integrals, and for this reason it is the method of choice. From a simple change of variables, we know that

$$
\int_{-1}^{1} g(s)\,ds = \int_0^{\pi} g(\cos\theta)\sin\theta\,d\theta. \tag{7}
$$

Equation (7) motivates the Clenshaw-Curtis method. Using a cosine transform, we know that g(cos θ) = a_0/2 + Σ_{k≥1} a_k cos(kθ) (since g(cos θ) is an even function in θ, the sine transform vanishes). The Discrete Cosine Transform (DCT) can then be used to calculate a_k, which implies that

$$
\int_0^{\pi} g(\cos\theta)\sin\theta\,d\theta = a_0 + \sum_{k=1}^{\infty} \frac{2a_{2k}}{1 - 4k^2}. \tag{8}
$$

Therefore, it remains to solve for the coefficients a_{2k}. Using symmetry and the DCT, these can be calculated as follows, where N, taken to be even, is the number of terms used to approximate the infinite sum:

$$
a_{2k} = \frac{2}{N}\left[\frac{g(1) + g(-1)}{2} + g(0)(-1)^k + \sum_{j=1}^{N/2-1}\left[g\!\left(\cos\frac{j\pi}{N}\right) + g\!\left(-\cos\frac{j\pi}{N}\right)\right]\cos\frac{2jk\pi}{N}\right]. \tag{9}
$$

In practice, N is usually taken to be a power of 2, because algorithms to compute the Fourier Transform are usually optimized for such N. If a large number of oscillations is expected, then it is important to use a larger value for N. Also, Trefethen has shown that this approach fares well when compared to Gauss quadrature (a popular polynomial technique), especially if the integrand is not necessarily analytic [7].

The integrand g(s) depends on the Frenet frame of the polytene curve, which can be computed exactly. When implementing this scheme, it is important to first compute the Frenet frame at all necessary points in order to speed up calculations. The error of this method can vary depending on the integrand and its modes of oscillation. Local error has been estimated for a curve that is a circle (constant curvature and no torsion). For such a curve, for N = 32, the local error E_N was within 0.03. Here, E_N = |F_N − F_{2N}|. Figure 5 shows a log-log plot of the local error against the frequency N.

Figure 5: Log-log plot of local error vs. frequency N using the Clenshaw-Curtis method, with linear fit y = −0.61x − 1.7. Although the rate of convergence is not particularly fast, only small values of N are needed for accurate results.

3.2 ODE Formulation

If the curvature and torsion of the polytene curve are not piecewise constant, then it is difficult to compute an exact solution for the Frenet frame. In this scenario, it is better to couple the Frenet-Serret ODE system with the differential equation form of the Fourier transform, as in (3). The mode of oscillations can vary significantly depending on the choice of wave number k, and so a variable step size numerical solver should work well. A popular choice is the Dormand-Prince Runge-Kutta 4-5 method.
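The coupled system (3) can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the solver used for the results in this paper: to stay dependency-free it substitutes a fixed-step classical 4th-order Runge-Kutta scheme for an adaptive Dormand-Prince solver, it treats `lam`, `kappa`, and `tau` as user-supplied callables of s, and it follows (3) as written (without the 1/2π factor of (2)). The function names and the flat 13-entry state layout are hypothetical.

```python
import cmath
import math


def rhs(s, y, wave, kappa, tau, lam):
    """Right-hand side of system (3); y = [F, r(3), t(3), n(3), b(3)]."""
    r, t, n, b = y[1:4], y[4:7], y[7:10], y[10:13]
    kap, tor = kappa(s), tau(s)
    phase = sum(ki * ri for ki, ri in zip(wave, r))
    dF = lam(s) * cmath.exp(1j * phase)                   # F' = lambda exp(i k.r)
    dr = list(t)                                          # r' = t
    dt = [kap * ni for ni in n]                           # t' = kappa n
    dn = [-kap * ti + tor * bi for ti, bi in zip(t, b)]   # n' = -kappa t + tau b
    db = [-tor * ni for ni in n]                          # b' = -tau n
    return [dF] + dr + dt + dn + db


def rk4_frenet_fourier(wave, kappa, tau, lam, y0, length, n_steps):
    """Integrate system (3) from s = 0 to s = length with fixed-step RK4.

    Returns the final state; entry 0 is the transform value F[k](length).
    """
    h = length / n_steps
    y = list(y0)
    for i in range(n_steps):
        s = i * h
        k1 = rhs(s, y, wave, kappa, tau, lam)
        y2 = [yi + 0.5 * h * d for yi, d in zip(y, k1)]
        k2 = rhs(s + 0.5 * h, y2, wave, kappa, tau, lam)
        y3 = [yi + 0.5 * h * d for yi, d in zip(y, k2)]
        k3 = rhs(s + 0.5 * h, y3, wave, kappa, tau, lam)
        y4 = [yi + h * d for yi, d in zip(y, k3)]
        k4 = rhs(s + h, y4, wave, kappa, tau, lam)
        y = [yi + h / 6.0 * (d1 + 2 * d2 + 2 * d3 + d4)
             for yi, d1, d2, d3, d4 in zip(y, k1, k2, k3, k4)]
    return y
```

A call needs an initial state with F = 0 and an orthonormal starting frame, for instance `y0 = [0j, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1]` for a curve starting at the origin with t, n, b along the coordinate axes. For rapidly oscillating integrands the fixed step would need to be small, which is the motivation for an adaptive method.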
The Dormand-Prince method uses the difference between a 4th-order and a 5th-order calculation to determine the step size. In Matlab, it can be implemented using the popular function ode45. If a fixed step size solver is desired, a 4th-order Runge-Kutta solver can be used, although, depending on accuracy constraints, a restrictive step size may be necessary for rapidly oscillating functions.

4 Preliminary Results

Figure 6: Simulated vs. real images for various slices of chromosome.

Figure 7: Polytene map of α(s) and the corresponding helical arc. Calculations were performed on a piece at low resolution, yet the results are still quite good.

As can be seen in Figure 6, preliminary results are promising (the difference in brightness can be attributed to a difference in scale), although there are still some issues. Occasionally, the initial algorithm used to determine curvature profiles and the general path was inaccurate. This is seen in the third image from the top in Figure 6. In the future, a curvature-based algorithm that determines the direction of the next neighboring point should improve upon this. Also, at various points, there is a ringing phenomenon in which the brightness intensifies. This data set did not use the Clenshaw-Curtis method and instead used a simple quadrature method, which may have contributed to the ringing phenomenon. The way the simulated image is normalized may also have contributed to this problem.

Figure 7 shows the distribution of α(s) for a small section. Although taken at a low resolution (a 100 x 100 pixel image), the variation in the data is still relatively large, which allows for easier identification of allele patterns.

5 Parameter Optimization Procedure

Simulating an image so that it can be compared to the original is a good verification tool for the Gaussian filament model. However, the simulated image should also be usable to improve upon initial estimates of parameters such as α(s).
Since one of the primary goals is an accurate representation of α(s), an algorithm that causes the simulated image to converge to the real image would be very helpful. Also, in other applications, curvature may be an important parameter, as it could potentially be related to gene expression. Thus, a general way to improve upon initial estimates would lead to more accurate results without the need for more data.

Suppose there is only one constant parameter that needs to be adjusted. A simple approach to improving the initial estimate is to slightly decrease the parameter and to slightly increase it. Simulate the image for each perturbed value; whichever simulation has the lower error (measured by the root-mean-square deviation, RMSD) should be the better approximation. This brute force algorithm would work fairly well for such a simple case. However, if the parameters that need to be modified are functions defined on the grid, then the number of potential directions of change grows exponentially. Therefore, it is important to find a way to determine the optimal direction in parameter space.

Let S(x, y) be the simulated image, and let I(x, y) be the real image data. Let the local squared error be:

$$
R(x, y) = \left[I(x, y) - S(x, y)\right]^2. \tag{10}
$$

This can also be defined along arc length, since the rest of the image is irrelevant:

$$
R(s) = R(x(s), y(s)) = \left[I(x(s), y(s)) - S(x(s), y(s))\right]^2. \tag{11}
$$

In truth, R(s) = R(s; f(s)), where f(s) is a vector of parameters that need to be optimized. For instance, if one were to try to optimize all the parameters, then f(s) = (α(s), β(s), κ(s), τ(s)). Since we want R(s; f(s)) to be minimized, we would like to find f(s) such that ∂R/∂f = 0. Assuming that ∂f/∂s is non-vanishing so that the inverse function theorem applies, we get that:

$$
\frac{\partial s}{\partial f} = \left[\left(\frac{\partial f}{\partial s}\right)^{T}\frac{\partial f}{\partial s}\right]^{-1}\frac{\partial f}{\partial s}.
$$

Note that this formulation is essentially taking the pseudo-inverse of ∂f/∂s. Using gradient descent, we get the PDE:

$$
\frac{df}{dt} = -\frac{\partial R}{\partial s}\left\|\frac{\partial f}{\partial s}\right\|^{-2}\frac{\partial f}{\partial s}. \tag{12}
$$

Also, one could consider the full variational setting. Suppose f(s) = α(s). Let E[α] be the full error functional. That is,

$$
E[\alpha] = \int_0^L R(s; \alpha(s))\,ds. \tag{13}
$$

Then by the Euler-Lagrange equations, we know that E[α] is minimized when

$$
\frac{\partial R}{\partial \alpha} - \frac{d}{ds}\frac{\partial R}{\partial \alpha'} = 0. \tag{14}
$$

Treating R and α' as functions of s along the curve, so that ∂R/∂α' = R'(s)/α''(s), the second term can be simplified by the quotient rule to:

$$
\frac{d}{ds}\left[\frac{R'(s)}{\alpha''(s)}\right] = \frac{\alpha''(s)R''(s) - R'(s)\alpha'''(s)}{[\alpha''(s)]^2}. \tag{15}
$$

Therefore, using gradient descent, we have a new PDE that minimizes the error functional as a whole by incorporating local variation within the image. This PDE is:

$$
\frac{d\alpha}{dt} = -\frac{\partial R}{\partial s}\left|\frac{\partial \alpha}{\partial s}\right|^{-2}\frac{\partial \alpha}{\partial s} + \frac{\alpha''(s)R''(s) - R'(s)\alpha'''(s)}{[\alpha''(s)]^2}. \tag{16}
$$

Both of these PDEs should be usable to improve upon initial estimates of important parameters. To improve these parameters, one could take one time step in the PDE and redo the simulation of the image with the new estimates, then repeat the process until satisfied. Although (16) should be more precise than (12), it is also much more complicated to discretize. However, (16) has the distinct advantage that if ∂α/∂s vanishes, the second term will offset this and keep the algorithm from failing. Numerical tests are still needed to test this theory.

6 Discussion

In this paper, we have outlined methods used to represent polytene chromosomes from image data. This work will then be used to examine the relationship between geometry and gene expression. Tests can now be conducted to determine if, for instance, gene expression is more prevalent in the center of the nuclei than at the periphery. Also, the relationship between curvature and gene expression can be examined, and we can determine whether allele-to-allele interactions play a role. In addition to this statistical analysis, there is more work to be done. Improvements can be made in the parameter estimation process.
Since the image data is not particularly noisy, and high resolution images have fine edges, edge detection algorithms can aid in determining the curvature profiles and the points through which the space curve must pass. Also, we need to determine a more accurate way to estimate the variance within the tube: the parameter β. Finally, a stable numerical scheme needs to be developed in order to implement the parameter optimization procedure. This would allow for accurate calculations that would help answer these important questions in chromosome geometry.

References

[1] U.M. Ascher and L.R. Petzold. Computer Methods for Ordinary Differential Equations and Differential-Algebraic Equations. Society for Industrial and Applied Mathematics, 1998.
[2] C.W. Clenshaw and A.R. Curtis. A method for numerical integration on an automatic computer. Numerische Mathematik, 2:197–205, 1960.
[3] Andy Hausrath and Alain Goriely. Continuous representations of proteins: Construction of coordinate models from curvature profiles. Journal of Structural Biology, November 2006.
[4] Andy Hausrath and Alain Goriely. The Fourier transforms of curves and filaments and their application to low-resolution protein crystallography. Journal of Applied Crystallography, 42, 2009.
[5] T. Havie. On a modification of the Clenshaw-Curtis quadrature formula. Informationsbehandling, 9(1219):338–350, 1969.
[6] Tom Misteli. Self-organization in the genome. PNAS, 106:6885–6886, April 2009.
[7] Lloyd N. Trefethen. Is Gauss quadrature better than Clenshaw-Curtis? SIAM Review, 50(1):67–87, 2008.
[8] Livia Zarnescu, Alain Goriely, Gio Bosco, Andy Hausrath, and Aalok Shah. The genome in three dimensions: simulation of confocal images of polytene nuclei, 2010.

7 Acknowledgements

Much of this work was done in collaboration with Dr. Andrew Hausrath of the Department of Biochemistry at the University of Arizona. This work is part of an ongoing project that Dr. Hausrath is currently working on.