Re: [AMBER] Error in pca analysis from Sruthi Sudhakar on 2021-06-17 (Amber Archive Jun 2021)

From: Sruthi Sudhakar <sruthisudhakarraji.gmail.com>
Date: Thu, 17 Jun 2021 21:36:40 +0530

Trajin is given Actually it was a typo. trajin fit1.nc was loaded and like
I said above T1.dat has 250000 frames but T2.dat has given the error. I had
attached the files in the mail I had directly send you. Is there any other
error in the pca1.in, pca2.in and pca3.in. If not I am not sure what is
happening to the T2.dat. Because almost all the kde and hist commands seems
to give odd results.
Regards,
Sruthi Sudhakar

On Thu, Jun 17, 2021 at 8:36 PM Daniel Roe <daniel.r.roe.gmail.com> wrote:

> Hi,
>
> It looks like in your 'pca3.in' file you're missing the 'trajin'
> statements, i.e. there is nothing for the 'projection' commands to
> work with.
>
> -Dan
>
> On Thu, Jun 17, 2021 at 7:23 AM Sruthi Sudhakar
> <sruthisudhakarraji.gmail.com> wrote:
> >
> > Thanks Dr. Anselm for adding the comments. I had some more follow up
> > questions on the approach one , which I had directly mailed Dan. But for
> > beginners like me who might face similar issue, I would like to post the
> > question here. And I would really appreciate it if someone would just
> > verify if my script is right or not. Because it has been really a
> > tough task to find the loophole in this since I feel my input is right.
> > I have tried approach one, where I stripped off the nucleic acid part in
> > both (box info not removed) and used a common prmtop file. But in that
> > case, I do face some errors in the modes of the second trajectory. The
> > method I used is
> > 1. load both the stripped trajectories, rms fit to the first frame, and
> > write out the average file;
> > 2. the second stage loaded the 2 stripped trajectories, rms fit the
> > trajectories to the average, calculated covariance matrix and wrote out a
> > fit nc file (500000 frames - the sum of both trajectories) and used the
> > diagmatrix command;
> > 3. In the third stage where all the analysis were done I had faced the
> > error. In the third stage, the fit trajectory with 500000 frames was
> loaded
> > and I used the projection command to get the PC 1 and 2 for the
> > trajectories separately by dividing the frames which gave odd results.
> The
> > KL-PC1 (according to the script below, all inputs are attached too) was
> not
> > formed at all. T1.dat read 250000 frames but the T2.dat had 500000
> > coordinates. Projection of T2 was supposed to begin from the 250001th
> frame
> > and end in the last frame. The histogram of T1 gave a Gaussian
> distribution
> > but the histogram of T2 was not giving a Gaussian distribution. T2.dat
> > contained all values of mode 1 and mode 2 as zero for the first 250000
> > frames.
> >
> > Could you give any insights regarding why this might be happening? I
> > suspect it is some error in the script below (input.txt), since, in
> > independent analysis, only the loaded frames appeared in the modes,
> unlike
> > the T2 of this case. Also, would you recommend combining the trajectories
> > much before fitting the trajectory, where instead of loading 2 stripped
> > trajectories in the first stage and then creating a combined fit
> > trajectory, we could directly create a single trajectory in the very
> first
> > stage?
> > Regards,
> > Sruthi Sudhakar
> >
> > parm parmsg.prmtop
> > trajin protsg.nc
> > trajin protdg.nc
> > # Fit to the first frame to remove global rotation/translation
> > rms first :1-1362.CA
> > # Write out the restart
> > average gaccAvg1.rst7 restart
> >
> > pca2.in
> > parm parmsg.prmtop
> > trajin protsg.nc
> > trajin protdg.nc
> > # Read in average structure, tag as [avg]
> > reference gaccAvg1.rst7 [avg]
> > # RMS-fit coordinates to average
> > rms ref [avg] :1-1362.CA
> > # Calculate coordinate covariance matrix
> > matrix covar :1-1362.CA name gaccCovar
> > # Write out the fit trajectory
> > trajout fit1.nc
> > # Diagonalize coordinate covariance matrix for eigenmodes (only 2)
> > diagmatrix gaccCovar out evecs1.dat vecs 20 nmwiz nmwizvecs 5 nmwizfile
> > prot.nmd nmwizmask :1-1362.CA
> >
> > Pca3.in
> > parm parmsg.prmtoptrajin fit1.nc
> > # Read in modes
> > readdata evecs1.dat name MyModes
> > # Now create separate PC projections for each trajectory
> > projection T1 modes MyModes beg 1 end 2 :1-1362.CA start 1 stop 250000
> out
> > T1.dat
> > projection T2 modes MyModes beg 1 end 2 :1-1362.CA start 250001 stop
> 500000
> > out T2.dat
> > # Calculate Kullback-Leibler Divergence vs time for PC histogramskde T1:1
> > kldiv T2:1 klout KL-PC1.dat bins 200 name SG-DG-1
> > kde T1:2 kldiv T2:2 klout KL-PC2.dat bins 200 name SG-DG-2
> > kde T1:1 out kde1-PC1.dat bins 200 name KDE1-1
> > kde T2:1 out kde2-PC1.dat bins 200 name KDE2-1hist T1:1 bins 200 out
> > T1-hist1.dat normint name HIST1-1
> > hist T1:2 bins 200 out T1-hist2.dat normint name HIST1-2
> > hist T2:1 bins 200 out T2-hist1.dat normint name HIST2-1
> > hist T2:2 bins 200 out T2-hist2.dat normint name HIST2-2
> >
> >
> >
> > On Thu, Jun 17, 2021 at 2:52 PM Dr. Anselm Horn <anselm.horn.fau.de>
> wrote:
> >
> > > Hi,
> > >
> > > many thanks to Dan for his explaining comments regarding the PCA
> analysis!
> > >
> > > Since this issue (different systems) bothered me for many years and I
> > > had many discussions about that I'd like to add my thoughts here.
> > >
> > > I see some disadvantages of approach 2, i.e. performing isolated PCA
> > > analyses on the different systems. As Dan pointed out, the underlying
> > > eigenvectors will certainly not be the same. Although a pairwise
> > > comparison (e.g. via RMSIP) between the eigenvectors of the two systems
> > > is possible, in praxi there may arise problems:
> > > 1) the eigenvectors will have a different size, if you include all
> atoms
> > > of the full systems, or the vector entries may represent different
> > > atoms. Thus, a comparison may not be that straightforward.
> > > 2) from the pairwise comparison between the eigenvectors of the two
> > > systems you deduce a correspondence between them. Maybe in real systems
> > > the most important eigenvectors can clearly be matched, but in
> principle
> > > the matching may not be that distinct. And this problem may arise even
> > > if you restrict your analysis to atoms common on both system (e.g.
> > > backbone or CA).
> > > 3) The ease of application of isolated PCA-analysis may seduce the user
> > > to omit that matching step and just proceed to the well-known
> > > PCA1:PCA2-histogram plots.
> > >
> > > That said, I totally agree with Dan that you may use approach 2 in a
> > > very elegant way to obtain histogram plots of a similar pattern, if you
> > > put an extra efford into the comparison step and be aware that the
> basis
> > > eigenvectors resemble a similar basis, i.e. describe a similar overall
> > > motion, but are not the same.
> > >
> > > It's my feeling that approach 1, where you restrict you analysis to a
> > > subset of atoms occuring in all systems investigated (backbone or CA),
> > > is more "save" and thus should be considered first. In that way, you
> are
> > > sure that the basis of all your PCA histograms is the same, and
> > > differences and equalities in the plot between different systems can
> > > directly be compared and discussed.
> > > However, using approach 2 in addition can provide further insights into
> > > the systems' dynamics.
> > > And I also agree with Dan that visual inspection might be very helpful.
> > >
> > > Best regards,
> > >
> > > Anselm
> > >
> > >
> > >
> > > On 06/16/2021 03:18 PM, Daniel Roe wrote:
> > > > Hi,
> > > >
> > > > I know I just responded to you directly but I'll post the reply here
> > > > as well in case people search for it in the future.
> > > >
> > > > So when you have different systems and you want to perform principal
> > > > component analysis there are two ways you can go. You can either 1)
> > > > modify one or both of the systems so you have a common core (i.e.
> > > > remove the parts of each system via 'strip' which don't match the
> > > > other) and perform a combined PCA on that, or 2) perform PCA on each
> > > > system separately and then compare the resulting eigenvectors via
> > > > something like RMSIP (root mean square inner product).
> > > >
> > > > The advantage of approach 1 is the usual advantage of the combined
> PCA
> > > > approach - you're guaranteed that each eigenvector is the same for
> the
> > > > separate systems. This works best when the differences between the
> > > > systems are not too extreme, but if the systems are too different
> this
> > > > should show up as e.g. non-overlapping PC projection histograms. The
> > > > disadvantage is that if you have important contributions to the
> > > > overall motion from parts that you've removed this of course will not
> > > > show up in your final analysis.
> > > >
> > > > The advantage of approach 2 is that you can perform PCA on each full
> > > > system, so you won't miss any motion. The disadvantage is that you're
> > > > not guaranteed that the eigenvectors from each analysis will match
> > > > (and it's likely they won't), i.e. PC 1 in system A will not be the
> > > > same as PC 1 in system B. This is where you would use RMSIP to
> compare
> > > > the different PCs to see which (if any) match.
> > > >
> > > > My feeling is that both approaches should be tried. Approach 2 will
> > > > give you a sense for whether the motions in the systems are similar
> > > > enough for approach 1 to make sense. Also, really make use of the
> > > > 'modes trajout' analysis to generate pseudo trajectories (and also
> the
> > > > 'nmwiz' keyword for the 'diagmatrix' command for use with the VMD
> > > > nmwiz plugin) for the first 2-3 PCs for each system to get a sense
> for
> > > > what the motions actually look like. In my opinion there's really no
> > > > substitute for visualization when it comes to PCA.
> > > >
> > > > Hope this helps!
> > > >
> > > > -Dan
> > > >
> > > > On Sat, Jun 12, 2021 at 5:16 PM Sruthi Sudhakar
> > > > <sruthisudhakarraji.gmail.com> wrote:
> > > >>
> > > >> This method of breaking the analysis into stages actually worked
> and I
> > > was
> > > >> able to complete the analysis overcoming the memory issues.
> > > >> An additional query: Incase we are using 2 different trajectories
> with
> > > >> different number of residues (1530 and 1526) to do a combined
> trajectory
> > > >> analysis, would it be enough to load gamd2.nc (1530 residues) and
> > > >> gamd3.nc (1526
> > > >> residues) together? Each has 250000 frames. And in that case do we
> need
> > > to
> > > >> fit into a single trajectory (fit.nc) or separately create 2 fit
> > > >> trajectory? I am doubtful about the third stage where we load the
> fit
> > > >> trajectory, should we load both the parm files in that case? Also
> would
> > > the
> > > >> difference in the total number of residues create any problem?
> > > >>
> > > >> My sample script following your guidelines would be
> > > >> pca1.in
> > > >> parm noionsa.prmtop [ag]
> > > >> parm noionsb.prmtop [bg]
> > > >> trajin gamd2.nc parm [ag]
> > > >> trajin gamd3.nc parm [bg]
> > > >> # Fit to the first frame to remove global rotation/translation
> > > >> rms first :1-1362.CA
> > > >> # Write out the restart
> > > >> average gaccAvg.rst7 restart
> > > >>
> > > >> pca2.in
> > > >> parm noionsa.prmtop [ag]
> > > >> parm noionsb.prmtop [bg]
> > > >> trajin gamd2.nc parm [ag]
> > > >> trajin gamd3.nc parm [bg]
> > > >> # Read in average structure, tag as [avg]
> > > >> reference gaccAvg.rst7 [avg]
> > > >> # RMS-fit coordinates to average
> > > >> rms ref [avg] :1-1362.CA
> > > >> # Calculate coordinate covariance matrix
> > > >> matrix covar :1-1362.CA name gaccCovar
> > > >> # Write out the fit trajectory
> > > >> trajout fit.nc
> > > >> # Diagonalize coordinate covariance matrix for eigenmodes (only 2)
> > > >> diagmatrix gaccCovar out evecs.dat vecs 20 nmwiz nmwizvecs 5
> > > >> nmwizfile prot.nmd nmwizmask :1-1362.CA
> > > >>
> > > >> pca3.in
> > > >>
> > > >> parm noionsa.prmtop
> > > >> parm noionsb.prmtop
> > > >> trajin fit.nc
> > > >> # Read in modes
> > > >> readdata evecs.dat name MyModes
> > > >> # Now create separate PC projections for each trajectory
> > > >> projection T1 modes MyModes beg 1 end 20 :1-1362.CA start 1 stop
> > > 250000 out
> > > >> T1.dat
> > > >> projection T2 modes MyModes beg 1 end 20 :1-1362.CA start 250001
> stop
> > > >> 500000 out T2.dat
> > > >> # Calculate Kullback-Leibler Divergence vs time for PC histograms
> > > >> kde T1:1 kldiv T2:1 klout KL-PC1.dat bins 200 name SG-DG-1
> > > >> kde T1:2 kldiv T2:2 klout KL-PC2.dat bins 200 name SG-DG-2
> > > >> kde T1:1 out kde1-PC1.dat bins 200 name KDE1-1
> > > >> kde T2:1 out kde2-PC1.dat bins 200 name KDE2-1
> > > >> hist T1:1 bins 200 out T1-hist1.dat normint name HIST1-1
> > > >> hist T1:2 bins 200 out T1-hist2.dat normint name HIST1-2
> > > >> hist T2:1 bins 200 out T2-hist1.dat normint name HIST2-1
> > > >> hist T2:2 bins 200 out T2-hist2.dat normint name HIST2-2
> > > >>
> > > >> Kindly suggest any corrections required.
> > > >> Regards,
> > > >> Sruthi Sudhakar
> > > >>
> > > >>
> > > >> On Sat, Jun 12, 2021 at 12:11 PM Sruthi Sudhakar <
> > > >> sruthisudhakarraji.gmail.com> wrote:
> > > >>
> > > >>> This method actually worked and I was able to complete the
> analysis. An
> > > >>> additional query: Incase we are using 2 different trajectories to
> do a
> > > >>> combined trajectory analysis, would it be enough to load gamd2.nc
> and
> > > >>> gamd3.nc together? And in that case do we need to fit into a
> single
> > > >>> trajectory (fit.nc) or separately create 2 fit trajectory?
> > > >>>
> > > >>> Regards,
> > > >>> Sruthi Sudhakar
> > > >>>
> > > >>>
> > > >>> On Wed, Jun 2, 2021 at 11:32 PM Daniel Roe <daniel.r.roe.gmail.com
> >
> > > wrote:
> > > >>>
> > > >>>> Hi,
> > > >>>>
> > > >>>> First, if you haven't already I really recommend that you look at
> some
> > > >>>> literature to familiarize yourself with what principal component
> > > >>>> analysis actually does. It's important to understand what each
> part of
> > > >>>> the analysis is for. I like the old "essential dynamics" papers,
> but a
> > > >>>> very rough explanation is given in the cpptraj PCA tutorial:
> > > >>>>
> > > >>>>
> > >
> https://amberhub.chpc.utah.edu/introduction-to-principal-component-analysis/
> > > >>>>
> > > >>>> Also, read the cpptraj manual so you know what each command is
> > > actually
> > > >>>> doing.
> > > >>>>
> > > >>>> That said, I now realize my examples in the manual and the
> tutorial
> > > >>>> both rely on having the trajectories stored in memory. So I will
> give
> > > >>>> a rough outline of how to do it all on-disk.
> > > >>>>
> > > >>>> 1) Create average coordinates.
> > > >>>>
> > > >>>> parm noions.prmtop
> > > >>>> trajin gamd2.nc
> > > >>>> # Fit to the first frame to remove global rotation/translation
> > > >>>> rms first :1-1530&!.H=
> > > >>>> # Write out the restart
> > > >>>> average gaccAvg.rst7 restart
> > > >>>>
> > > >>>> 2) Rms fit your trajectory to average coordinates, calculate the
> > > >>>> covariance matrix, write out the fit trajectory, diagonalize the
> > > >>>> matrix and write out the "modes" data (i.e. eigenvectors and
> > > >>>> eigenvalues).
> > > >>>>
> > > >>>> parm noions.prmtop
> > > >>>> trajin gamd2.nc
> > > >>>> # Read in average structure, tag as [avg]
> > > >>>> reference gaccAvg.rst7 [avg]
> > > >>>> # RMS-fit coordinates to average
> > > >>>> rms ref [avg] :1-1530&!.H=
> > > >>>> # Calculate coordinate covariance matrix
> > > >>>> matrix covar :1-1530&!.H= name gaccCovar
> > > >>>> # Write out the fit trajectory
> > > >>>> trajout fit.nc
> > > >>>> # Diagonalize coordinate covariance matrix for eigenmodes (only 2)
> > > >>>> diagmatrix gaccCovar out evecs.dat vecs 2 nmwiz nmwizvecs 2
> nmwizfile
> > > >>>> cas.nmd nmwizmask :1-1530&!.H=
> > > >>>>
> > > >>>>
> > > >>>> 3) Read in the "modes" data, calculate principal component
> projections
> > > >>>> from the fit trajectory, do the Kullback-Leibler divergence
> analysis.
> > > >>>>
> > > >>>> parm noions.prmtop
> > > >>>> trajin fit.nc
> > > >>>> # Read in modes
> > > >>>> readdata evecs.dat name MyModes
> > > >>>> # Now create separate PC projections for each trajectory
> > > >>>> projection T1 modes MyModes beg 1 end 2 :1-1530&!.H= start 1 stop
> > > >>>> 50000 out T1.dat
> > > >>>> projection T2 modes MyModes beg 1 end 2 :1-1530&!.H= start 50001
> stop
> > > >>>> 100000 out T2.dat
> > > >>>> <other projection commands>
> > > >>>> # Calculate Kullback-Leibler Divergence vs time for PC histograms
> > > >>>> kde T1:1 kldiv T2:1 klout KL-PC.agr bins 400 name AMD-MREMD-1
> > > >>>> <other kde commands and analyses>
> > > >>>>
> > > >>>> Hope this helps,
> > > >>>>
> > > >>>> -Dan
> > > >>>>
> > > >>>> On Tue, Jun 1, 2021 at 3:23 PM Sruthi Sudhakar
> > > >>>> <sruthisudhakarraji.gmail.com> wrote:
> > > >>>>>
> > > >>>>> The available memory in the beginning of cpptraj is shown as 32
> Gb
> > > and
> > > >>>>> estimated memory usage is 82gb. I am really confused about this
> > > memory
> > > >>>>> allotment statistics. I am doing the process in a disk with more
> than
> > > >>>> 3TB
> > > >>>>> space. I am not well versed with this technicality. Could someone
> > > please
> > > >>>>> exaplain how to overcome the issue in this principal component
> > > analysis
> > > >>>>> part? I did understand that we have to separate the analysis
> into 3
> > > >>>> phases
> > > >>>>> but not clear as to how the inputs should be changed. Kindly
> advise.
> > > >>>>>
> > > >>>>> On Tue, 1 Jun 2021 at 7:21 PM, Sruthi Sudhakar <
> > > >>>> sruthisudhakarraji.gmail.com>
> > > >>>>> wrote:
> > > >>>>>
> > > >>>>>>
> > > >>>>>> Thank you for the reply. Since I am doing this for the first
> time, I
> > > >>>>>> wanted to know if I am supposed to create 3 separate inputs to
> run
> > > in
> > > >>>>>> cpptraj to do the methodology you suggested.
> > > >>>>>>
> > > >>>>>> Regards,
> > > >>>>>> Sruthi
> > > >>>>>>
> > > >>>>>> On Tue, 1 Jun 2021 at 6:33 PM, Daniel Roe <
> daniel.r.roe.gmail.com>
> > > >>>> wrote:
> > > >>>>>>
> > > >>>>>>> Hi,
> > > >>>>>>>
> > > >>>>>>> You are likely running out of memory. This is why your problems
> > > >>>> during
> > > >>>>>>> clustering went away when you reduced the number of input
> frames
> > > by a
> > > >>>>>>> factor of 5. The solution is to do everything on disk. So
> instead
> > > of
> > > >>>>>>> loading all coordinates into memory, separate the principal
> > > component
> > > >>>>>>> analysis into three separate phases:
> > > >>>>>>>
> > > >>>>>>> 1) Create average coordinates.
> > > >>>>>>> 2) Rms fit your trajectory to average coordinates, calculate
> the
> > > >>>>>>> covariance matrix, write out the fit trajectory, diagonalize
> the
> > > >>>>>>> matrix and write out the "modes" data (i.e. eigenvectors and
> > > >>>>>>> eigenvalues).
> > > >>>>>>> 3) Read in the "modes" data, calculate principal component
> > > >>>> projections
> > > >>>>>>> from the fit trajectory, do the Kullback-Leibler divergence
> > > analysis.
> > > >>>>>>>
> > > >>>>>>> This way, the most memory you need is to store the covariance
> > > matrix,
> > > >>>>>>> modes, and other data of that type. Hope this helps,
> > > >>>>>>>
> > > >>>>>>> -Dan
> > > >>>>>>>
> > > >>>>>>> PS - Note that there appears to be a small error in the input
> you
> > > >>>>>>> posted (in pca.in). The 'nmwiz' keyword should be part of the
> > > >>>>>>> diagmatrix command, not on a separate line.
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> On Mon, May 31, 2021 at 1:22 PM Sruthi Sudhakar
> > > >>>>>>> <sruthisudhakarraji.gmail.com> wrote:
> > > >>>>>>>>
> > > >>>>>>>> Dear all,
> > > >>>>>>>>
> > > >>>>>>>> I have been doing pca analysis on an accelerated MD
> trajectory of
> > > >>>> 500ns
> > > >>>>>>>> (250,000 frames). I have attached the input file I have used
> for
> > > >>>> the
> > > >>>>>>> study.
> > > >>>>>>>> The job stops at the createcrd stage. Basically, the job gets
> > > >>>> killed at
> > > >>>>>>>> 30%. The same happened during the cluster analysis reading
> every
> > > >>>> frames.
> > > >>>>>>>> The clustering error was solved when I changed the input to
> read
> > > >>>> every
> > > >>>>>>> 5th
> > > >>>>>>>> frame. Now since this is repeating in pca analysis, kindly
> help
> > > >>>>>>> regarding
> > > >>>>>>>> the same.
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> Regards,
> > > >>>>>>>> Sruthi Sudhakar
> > > >>>>>>>> _______________________________________________
> > > >>>>>>>> AMBER mailing list
> > > >>>>>>>> AMBER.ambermd.org
> > > >>>>>>>> http://lists.ambermd.org/mailman/listinfo/amber
> > > >>>>>>>
> > > >>>>>>> _______________________________________________
> > > >>>>>>> AMBER mailing list
> > > >>>>>>> AMBER.ambermd.org
> > > >>>>>>> http://lists.ambermd.org/mailman/listinfo/amber
> > > >>>>>>>
> > > >>>>>>
> > > >>>>> _______________________________________________
> > > >>>>> AMBER mailing list
> > > >>>>> AMBER.ambermd.org
> > > >>>>> http://lists.ambermd.org/mailman/listinfo/amber
> > > >>>>
> > > >>>> _______________________________________________
> > > >>>> AMBER mailing list
> > > >>>> AMBER.ambermd.org
> > > >>>> http://lists.ambermd.org/mailman/listinfo/amber
> > > >>>>
> > > >>>
> > > >> _______________________________________________
> > > >> AMBER mailing list
> > > >> AMBER.ambermd.org
> > > >> http://lists.ambermd.org/mailman/listinfo/amber
> > > >
> > > > _______________________________________________
> > > > AMBER mailing list
> > > > AMBER.ambermd.org
> > > > http://lists.ambermd.org/mailman/listinfo/amber
> > > >
> > >
> > >
> > > _______________________________________________
> > > AMBER mailing list
> > > AMBER.ambermd.org
> > > http://lists.ambermd.org/mailman/listinfo/amber
> > >
> > _______________________________________________
> > AMBER mailing list
> > AMBER.ambermd.org
> > http://lists.ambermd.org/mailman/listinfo/amber
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jun 17 2021 - 09:30:02 PDT