Re: [AMBER] Error in pca analysis

From: Daniel Roe <daniel.r.roe.gmail.com>
Date: Thu, 17 Jun 2021 11:06:12 -0400

Hi,

It looks like in your 'pca3.in' file you're missing the 'trajin'
statements, i.e. there is nothing for the 'projection' commands to
work with.

-Dan

On Thu, Jun 17, 2021 at 7:23 AM Sruthi Sudhakar
<sruthisudhakarraji.gmail.com> wrote:
>
> Thanks Dr. Anselm for adding the comments. I had some more follow up
> questions on the approach one , which I had directly mailed Dan. But for
> beginners like me who might face similar issue, I would like to post the
> question here. And I would really appreciate it if someone would just
> verify if my script is right or not. Because it has been really a
> tough task to find the loophole in this since I feel my input is right.
> I have tried approach one, where I stripped off the nucleic acid part in
> both (box info not removed) and used a common prmtop file. But in that
> case, I do face some errors in the modes of the second trajectory. The
> method I used is
> 1. load both the stripped trajectories, rms fit to the first frame, and
> write out the average file;
> 2. the second stage loaded the 2 stripped trajectories, rms fit the
> trajectories to the average, calculated covariance matrix and wrote out a
> fit nc file (500000 frames - the sum of both trajectories) and used the
> diagmatrix command;
> 3. In the third stage where all the analysis were done I had faced the
> error. In the third stage, the fit trajectory with 500000 frames was loaded
> and I used the projection command to get the PC 1 and 2 for the
> trajectories separately by dividing the frames which gave odd results. The
> KL-PC1 (according to the script below, all inputs are attached too) was not
> formed at all. T1.dat read 250000 frames but the T2.dat had 500000
> coordinates. Projection of T2 was supposed to begin from the 250001th frame
> and end in the last frame. The histogram of T1 gave a Gaussian distribution
> but the histogram of T2 was not giving a Gaussian distribution. T2.dat
> contained all values of mode 1 and mode 2 as zero for the first 250000
> frames.
>
> Could you give any insights regarding why this might be happening? I
> suspect it is some error in the script below (input.txt), since, in
> independent analysis, only the loaded frames appeared in the modes, unlike
> the T2 of this case. Also, would you recommend combining the trajectories
> much before fitting the trajectory, where instead of loading 2 stripped
> trajectories in the first stage and then creating a combined fit
> trajectory, we could directly create a single trajectory in the very first
> stage?
> Regards,
> Sruthi Sudhakar
>
> parm parmsg.prmtop
> trajin protsg.nc
> trajin protdg.nc
> # Fit to the first frame to remove global rotation/translation
> rms first :1-1362.CA
> # Write out the restart
> average gaccAvg1.rst7 restart
>
> pca2.in
> parm parmsg.prmtop
> trajin protsg.nc
> trajin protdg.nc
> # Read in average structure, tag as [avg]
> reference gaccAvg1.rst7 [avg]
> # RMS-fit coordinates to average
> rms ref [avg] :1-1362.CA
> # Calculate coordinate covariance matrix
> matrix covar :1-1362.CA name gaccCovar
> # Write out the fit trajectory
> trajout fit1.nc
> # Diagonalize coordinate covariance matrix for eigenmodes (only 2)
> diagmatrix gaccCovar out evecs1.dat vecs 20 nmwiz nmwizvecs 5 nmwizfile
> prot.nmd nmwizmask :1-1362.CA
>
> Pca3.in
> parm parmsg.prmtoptrajin fit1.nc
> # Read in modes
> readdata evecs1.dat name MyModes
> # Now create separate PC projections for each trajectory
> projection T1 modes MyModes beg 1 end 2 :1-1362.CA start 1 stop 250000 out
> T1.dat
> projection T2 modes MyModes beg 1 end 2 :1-1362.CA start 250001 stop 500000
> out T2.dat
> # Calculate Kullback-Leibler Divergence vs time for PC histogramskde T1:1
> kldiv T2:1 klout KL-PC1.dat bins 200 name SG-DG-1
> kde T1:2 kldiv T2:2 klout KL-PC2.dat bins 200 name SG-DG-2
> kde T1:1 out kde1-PC1.dat bins 200 name KDE1-1
> kde T2:1 out kde2-PC1.dat bins 200 name KDE2-1hist T1:1 bins 200 out
> T1-hist1.dat normint name HIST1-1
> hist T1:2 bins 200 out T1-hist2.dat normint name HIST1-2
> hist T2:1 bins 200 out T2-hist1.dat normint name HIST2-1
> hist T2:2 bins 200 out T2-hist2.dat normint name HIST2-2
>
>
>
> On Thu, Jun 17, 2021 at 2:52 PM Dr. Anselm Horn <anselm.horn.fau.de> wrote:
>
> > Hi,
> >
> > many thanks to Dan for his explaining comments regarding the PCA analysis!
> >
> > Since this issue (different systems) bothered me for many years and I
> > had many discussions about that I'd like to add my thoughts here.
> >
> > I see some disadvantages of approach 2, i.e. performing isolated PCA
> > analyses on the different systems. As Dan pointed out, the underlying
> > eigenvectors will certainly not be the same. Although a pairwise
> > comparison (e.g. via RMSIP) between the eigenvectors of the two systems
> > is possible, in praxi there may arise problems:
> > 1) the eigenvectors will have a different size, if you include all atoms
> > of the full systems, or the vector entries may represent different
> > atoms. Thus, a comparison may not be that straightforward.
> > 2) from the pairwise comparison between the eigenvectors of the two
> > systems you deduce a correspondence between them. Maybe in real systems
> > the most important eigenvectors can clearly be matched, but in principle
> > the matching may not be that distinct. And this problem may arise even
> > if you restrict your analysis to atoms common on both system (e.g.
> > backbone or CA).
> > 3) The ease of application of isolated PCA-analysis may seduce the user
> > to omit that matching step and just proceed to the well-known
> > PCA1:PCA2-histogram plots.
> >
> > That said, I totally agree with Dan that you may use approach 2 in a
> > very elegant way to obtain histogram plots of a similar pattern, if you
> > put an extra efford into the comparison step and be aware that the basis
> > eigenvectors resemble a similar basis, i.e. describe a similar overall
> > motion, but are not the same.
> >
> > It's my feeling that approach 1, where you restrict you analysis to a
> > subset of atoms occuring in all systems investigated (backbone or CA),
> > is more "save" and thus should be considered first. In that way, you are
> > sure that the basis of all your PCA histograms is the same, and
> > differences and equalities in the plot between different systems can
> > directly be compared and discussed.
> > However, using approach 2 in addition can provide further insights into
> > the systems' dynamics.
> > And I also agree with Dan that visual inspection might be very helpful.
> >
> > Best regards,
> >
> > Anselm
> >
> >
> >
> > On 06/16/2021 03:18 PM, Daniel Roe wrote:
> > > Hi,
> > >
> > > I know I just responded to you directly but I'll post the reply here
> > > as well in case people search for it in the future.
> > >
> > > So when you have different systems and you want to perform principal
> > > component analysis there are two ways you can go. You can either 1)
> > > modify one or both of the systems so you have a common core (i.e.
> > > remove the parts of each system via 'strip' which don't match the
> > > other) and perform a combined PCA on that, or 2) perform PCA on each
> > > system separately and then compare the resulting eigenvectors via
> > > something like RMSIP (root mean square inner product).
> > >
> > > The advantage of approach 1 is the usual advantage of the combined PCA
> > > approach - you're guaranteed that each eigenvector is the same for the
> > > separate systems. This works best when the differences between the
> > > systems are not too extreme, but if the systems are too different this
> > > should show up as e.g. non-overlapping PC projection histograms. The
> > > disadvantage is that if you have important contributions to the
> > > overall motion from parts that you've removed this of course will not
> > > show up in your final analysis.
> > >
> > > The advantage of approach 2 is that you can perform PCA on each full
> > > system, so you won't miss any motion. The disadvantage is that you're
> > > not guaranteed that the eigenvectors from each analysis will match
> > > (and it's likely they won't), i.e. PC 1 in system A will not be the
> > > same as PC 1 in system B. This is where you would use RMSIP to compare
> > > the different PCs to see which (if any) match.
> > >
> > > My feeling is that both approaches should be tried. Approach 2 will
> > > give you a sense for whether the motions in the systems are similar
> > > enough for approach 1 to make sense. Also, really make use of the
> > > 'modes trajout' analysis to generate pseudo trajectories (and also the
> > > 'nmwiz' keyword for the 'diagmatrix' command for use with the VMD
> > > nmwiz plugin) for the first 2-3 PCs for each system to get a sense for
> > > what the motions actually look like. In my opinion there's really no
> > > substitute for visualization when it comes to PCA.
> > >
> > > Hope this helps!
> > >
> > > -Dan
> > >
> > > On Sat, Jun 12, 2021 at 5:16 PM Sruthi Sudhakar
> > > <sruthisudhakarraji.gmail.com> wrote:
> > >>
> > >> This method of breaking the analysis into stages actually worked and I
> > was
> > >> able to complete the analysis overcoming the memory issues.
> > >> An additional query: Incase we are using 2 different trajectories with
> > >> different number of residues (1530 and 1526) to do a combined trajectory
> > >> analysis, would it be enough to load gamd2.nc (1530 residues) and
> > >> gamd3.nc (1526
> > >> residues) together? Each has 250000 frames. And in that case do we need
> > to
> > >> fit into a single trajectory (fit.nc) or separately create 2 fit
> > >> trajectory? I am doubtful about the third stage where we load the fit
> > >> trajectory, should we load both the parm files in that case? Also would
> > the
> > >> difference in the total number of residues create any problem?
> > >>
> > >> My sample script following your guidelines would be
> > >> pca1.in
> > >> parm noionsa.prmtop [ag]
> > >> parm noionsb.prmtop [bg]
> > >> trajin gamd2.nc parm [ag]
> > >> trajin gamd3.nc parm [bg]
> > >> # Fit to the first frame to remove global rotation/translation
> > >> rms first :1-1362.CA
> > >> # Write out the restart
> > >> average gaccAvg.rst7 restart
> > >>
> > >> pca2.in
> > >> parm noionsa.prmtop [ag]
> > >> parm noionsb.prmtop [bg]
> > >> trajin gamd2.nc parm [ag]
> > >> trajin gamd3.nc parm [bg]
> > >> # Read in average structure, tag as [avg]
> > >> reference gaccAvg.rst7 [avg]
> > >> # RMS-fit coordinates to average
> > >> rms ref [avg] :1-1362.CA
> > >> # Calculate coordinate covariance matrix
> > >> matrix covar :1-1362.CA name gaccCovar
> > >> # Write out the fit trajectory
> > >> trajout fit.nc
> > >> # Diagonalize coordinate covariance matrix for eigenmodes (only 2)
> > >> diagmatrix gaccCovar out evecs.dat vecs 20 nmwiz nmwizvecs 5
> > >> nmwizfile prot.nmd nmwizmask :1-1362.CA
> > >>
> > >> pca3.in
> > >>
> > >> parm noionsa.prmtop
> > >> parm noionsb.prmtop
> > >> trajin fit.nc
> > >> # Read in modes
> > >> readdata evecs.dat name MyModes
> > >> # Now create separate PC projections for each trajectory
> > >> projection T1 modes MyModes beg 1 end 20 :1-1362.CA start 1 stop
> > 250000 out
> > >> T1.dat
> > >> projection T2 modes MyModes beg 1 end 20 :1-1362.CA start 250001 stop
> > >> 500000 out T2.dat
> > >> # Calculate Kullback-Leibler Divergence vs time for PC histograms
> > >> kde T1:1 kldiv T2:1 klout KL-PC1.dat bins 200 name SG-DG-1
> > >> kde T1:2 kldiv T2:2 klout KL-PC2.dat bins 200 name SG-DG-2
> > >> kde T1:1 out kde1-PC1.dat bins 200 name KDE1-1
> > >> kde T2:1 out kde2-PC1.dat bins 200 name KDE2-1
> > >> hist T1:1 bins 200 out T1-hist1.dat normint name HIST1-1
> > >> hist T1:2 bins 200 out T1-hist2.dat normint name HIST1-2
> > >> hist T2:1 bins 200 out T2-hist1.dat normint name HIST2-1
> > >> hist T2:2 bins 200 out T2-hist2.dat normint name HIST2-2
> > >>
> > >> Kindly suggest any corrections required.
> > >> Regards,
> > >> Sruthi Sudhakar
> > >>
> > >>
> > >> On Sat, Jun 12, 2021 at 12:11 PM Sruthi Sudhakar <
> > >> sruthisudhakarraji.gmail.com> wrote:
> > >>
> > >>> This method actually worked and I was able to complete the analysis. An
> > >>> additional query: Incase we are using 2 different trajectories to do a
> > >>> combined trajectory analysis, would it be enough to load gamd2.nc and
> > >>> gamd3.nc together? And in that case do we need to fit into a single
> > >>> trajectory (fit.nc) or separately create 2 fit trajectory?
> > >>>
> > >>> Regards,
> > >>> Sruthi Sudhakar
> > >>>
> > >>>
> > >>> On Wed, Jun 2, 2021 at 11:32 PM Daniel Roe <daniel.r.roe.gmail.com>
> > wrote:
> > >>>
> > >>>> Hi,
> > >>>>
> > >>>> First, if you haven't already I really recommend that you look at some
> > >>>> literature to familiarize yourself with what principal component
> > >>>> analysis actually does. It's important to understand what each part of
> > >>>> the analysis is for. I like the old "essential dynamics" papers, but a
> > >>>> very rough explanation is given in the cpptraj PCA tutorial:
> > >>>>
> > >>>>
> > https://amberhub.chpc.utah.edu/introduction-to-principal-component-analysis/
> > >>>>
> > >>>> Also, read the cpptraj manual so you know what each command is
> > actually
> > >>>> doing.
> > >>>>
> > >>>> That said, I now realize my examples in the manual and the tutorial
> > >>>> both rely on having the trajectories stored in memory. So I will give
> > >>>> a rough outline of how to do it all on-disk.
> > >>>>
> > >>>> 1) Create average coordinates.
> > >>>>
> > >>>> parm noions.prmtop
> > >>>> trajin gamd2.nc
> > >>>> # Fit to the first frame to remove global rotation/translation
> > >>>> rms first :1-1530&!.H=
> > >>>> # Write out the restart
> > >>>> average gaccAvg.rst7 restart
> > >>>>
> > >>>> 2) Rms fit your trajectory to average coordinates, calculate the
> > >>>> covariance matrix, write out the fit trajectory, diagonalize the
> > >>>> matrix and write out the "modes" data (i.e. eigenvectors and
> > >>>> eigenvalues).
> > >>>>
> > >>>> parm noions.prmtop
> > >>>> trajin gamd2.nc
> > >>>> # Read in average structure, tag as [avg]
> > >>>> reference gaccAvg.rst7 [avg]
> > >>>> # RMS-fit coordinates to average
> > >>>> rms ref [avg] :1-1530&!.H=
> > >>>> # Calculate coordinate covariance matrix
> > >>>> matrix covar :1-1530&!.H= name gaccCovar
> > >>>> # Write out the fit trajectory
> > >>>> trajout fit.nc
> > >>>> # Diagonalize coordinate covariance matrix for eigenmodes (only 2)
> > >>>> diagmatrix gaccCovar out evecs.dat vecs 2 nmwiz nmwizvecs 2 nmwizfile
> > >>>> cas.nmd nmwizmask :1-1530&!.H=
> > >>>>
> > >>>>
> > >>>> 3) Read in the "modes" data, calculate principal component projections
> > >>>> from the fit trajectory, do the Kullback-Leibler divergence analysis.
> > >>>>
> > >>>> parm noions.prmtop
> > >>>> trajin fit.nc
> > >>>> # Read in modes
> > >>>> readdata evecs.dat name MyModes
> > >>>> # Now create separate PC projections for each trajectory
> > >>>> projection T1 modes MyModes beg 1 end 2 :1-1530&!.H= start 1 stop
> > >>>> 50000 out T1.dat
> > >>>> projection T2 modes MyModes beg 1 end 2 :1-1530&!.H= start 50001 stop
> > >>>> 100000 out T2.dat
> > >>>> <other projection commands>
> > >>>> # Calculate Kullback-Leibler Divergence vs time for PC histograms
> > >>>> kde T1:1 kldiv T2:1 klout KL-PC.agr bins 400 name AMD-MREMD-1
> > >>>> <other kde commands and analyses>
> > >>>>
> > >>>> Hope this helps,
> > >>>>
> > >>>> -Dan
> > >>>>
> > >>>> On Tue, Jun 1, 2021 at 3:23 PM Sruthi Sudhakar
> > >>>> <sruthisudhakarraji.gmail.com> wrote:
> > >>>>>
> > >>>>> The available memory in the beginning of cpptraj is shown as 32 Gb
> > and
> > >>>>> estimated memory usage is 82gb. I am really confused about this
> > memory
> > >>>>> allotment statistics. I am doing the process in a disk with more than
> > >>>> 3TB
> > >>>>> space. I am not well versed with this technicality. Could someone
> > please
> > >>>>> exaplain how to overcome the issue in this principal component
> > analysis
> > >>>>> part? I did understand that we have to separate the analysis into 3
> > >>>> phases
> > >>>>> but not clear as to how the inputs should be changed. Kindly advise.
> > >>>>>
> > >>>>> On Tue, 1 Jun 2021 at 7:21 PM, Sruthi Sudhakar <
> > >>>> sruthisudhakarraji.gmail.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>>>
> > >>>>>> Thank you for the reply. Since I am doing this for the first time, I
> > >>>>>> wanted to know if I am supposed to create 3 separate inputs to run
> > in
> > >>>>>> cpptraj to do the methodology you suggested.
> > >>>>>>
> > >>>>>> Regards,
> > >>>>>> Sruthi
> > >>>>>>
> > >>>>>> On Tue, 1 Jun 2021 at 6:33 PM, Daniel Roe <daniel.r.roe.gmail.com>
> > >>>> wrote:
> > >>>>>>
> > >>>>>>> Hi,
> > >>>>>>>
> > >>>>>>> You are likely running out of memory. This is why your problems
> > >>>> during
> > >>>>>>> clustering went away when you reduced the number of input frames
> > by a
> > >>>>>>> factor of 5. The solution is to do everything on disk. So instead
> > of
> > >>>>>>> loading all coordinates into memory, separate the principal
> > component
> > >>>>>>> analysis into three separate phases:
> > >>>>>>>
> > >>>>>>> 1) Create average coordinates.
> > >>>>>>> 2) Rms fit your trajectory to average coordinates, calculate the
> > >>>>>>> covariance matrix, write out the fit trajectory, diagonalize the
> > >>>>>>> matrix and write out the "modes" data (i.e. eigenvectors and
> > >>>>>>> eigenvalues).
> > >>>>>>> 3) Read in the "modes" data, calculate principal component
> > >>>> projections
> > >>>>>>> from the fit trajectory, do the Kullback-Leibler divergence
> > analysis.
> > >>>>>>>
> > >>>>>>> This way, the most memory you need is to store the covariance
> > matrix,
> > >>>>>>> modes, and other data of that type. Hope this helps,
> > >>>>>>>
> > >>>>>>> -Dan
> > >>>>>>>
> > >>>>>>> PS - Note that there appears to be a small error in the input you
> > >>>>>>> posted (in pca.in). The 'nmwiz' keyword should be part of the
> > >>>>>>> diagmatrix command, not on a separate line.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Mon, May 31, 2021 at 1:22 PM Sruthi Sudhakar
> > >>>>>>> <sruthisudhakarraji.gmail.com> wrote:
> > >>>>>>>>
> > >>>>>>>> Dear all,
> > >>>>>>>>
> > >>>>>>>> I have been doing pca analysis on an accelerated MD trajectory of
> > >>>> 500ns
> > >>>>>>>> (250,000 frames). I have attached the input file I have used for
> > >>>> the
> > >>>>>>> study.
> > >>>>>>>> The job stops at the createcrd stage. Basically, the job gets
> > >>>> killed at
> > >>>>>>>> 30%. The same happened during the cluster analysis reading every
> > >>>> frames.
> > >>>>>>>> The clustering error was solved when I changed the input to read
> > >>>> every
> > >>>>>>> 5th
> > >>>>>>>> frame. Now since this is repeating in pca analysis, kindly help
> > >>>>>>> regarding
> > >>>>>>>> the same.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Regards,
> > >>>>>>>> Sruthi Sudhakar
> > >>>>>>>> _______________________________________________
> > >>>>>>>> AMBER mailing list
> > >>>>>>>> AMBER.ambermd.org
> > >>>>>>>> http://lists.ambermd.org/mailman/listinfo/amber
> > >>>>>>>
> > >>>>>>> _______________________________________________
> > >>>>>>> AMBER mailing list
> > >>>>>>> AMBER.ambermd.org
> > >>>>>>> http://lists.ambermd.org/mailman/listinfo/amber
> > >>>>>>>
> > >>>>>>
> > >>>>> _______________________________________________
> > >>>>> AMBER mailing list
> > >>>>> AMBER.ambermd.org
> > >>>>> http://lists.ambermd.org/mailman/listinfo/amber
> > >>>>
> > >>>> _______________________________________________
> > >>>> AMBER mailing list
> > >>>> AMBER.ambermd.org
> > >>>> http://lists.ambermd.org/mailman/listinfo/amber
> > >>>>
> > >>>
> > >> _______________________________________________
> > >> AMBER mailing list
> > >> AMBER.ambermd.org
> > >> http://lists.ambermd.org/mailman/listinfo/amber
> > >
> > > _______________________________________________
> > > AMBER mailing list
> > > AMBER.ambermd.org
> > > http://lists.ambermd.org/mailman/listinfo/amber
> > >
> >
> >
> > _______________________________________________
> > AMBER mailing list
> > AMBER.ambermd.org
> > http://lists.ambermd.org/mailman/listinfo/amber
> >
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jun 17 2021 - 08:30:02 PDT
Custom Search