Re: [AMBER] Error in pca analysis

From: Daniel Roe <daniel.r.roe.gmail.com>
Date: Wed, 16 Jun 2021 09:18:39 -0400

Hi,

I know I just responded to you directly but I'll post the reply here
as well in case people search for it in the future.

So when you have different systems and you want to perform principal
component analysis there are two ways you can go. You can either 1)
modify one or both of the systems so you have a common core (i.e.
remove the parts of each system via 'strip' which don't match the
other) and perform a combined PCA on that, or 2) perform PCA on each
system separately and then compare the resulting eigenvectors via
something like RMSIP (root mean square inner product).

The advantage of approach 1 is the usual advantage of the combined PCA
approach - you're guaranteed that each eigenvector is the same for the
separate systems. This works best when the differences between the
systems are not too extreme, but if the systems are too different this
should show up as e.g. non-overlapping PC projection histograms. The
disadvantage is that if you have important contributions to the
overall motion from parts that you've removed this of course will not
show up in your final analysis.

The advantage of approach 2 is that you can perform PCA on each full
system, so you won't miss any motion. The disadvantage is that you're
not guaranteed that the eigenvectors from each analysis will match
(and it's likely they won't), i.e. PC 1 in system A will not be the
same as PC 1 in system B. This is where you would use RMSIP to compare
the different PCs to see which (if any) match.

My feeling is that both approaches should be tried. Approach 2 will
give you a sense for whether the motions in the systems are similar
enough for approach 1 to make sense. Also, really make use of the
'modes trajout' analysis to generate pseudo trajectories (and also the
'nmwiz' keyword for the 'diagmatrix' command for use with the VMD
nmwiz plugin) for the first 2-3 PCs for each system to get a sense for
what the motions actually look like. In my opinion there's really no
substitute for visualization when it comes to PCA.

Hope this helps!

-Dan

On Sat, Jun 12, 2021 at 5:16 PM Sruthi Sudhakar
<sruthisudhakarraji.gmail.com> wrote:
>
> This method of breaking the analysis into stages actually worked and I was
> able to complete the analysis overcoming the memory issues.
> An additional query: Incase we are using 2 different trajectories with
> different number of residues (1530 and 1526) to do a combined trajectory
> analysis, would it be enough to load gamd2.nc (1530 residues) and
> gamd3.nc (1526
> residues) together? Each has 250000 frames. And in that case do we need to
> fit into a single trajectory (fit.nc) or separately create 2 fit
> trajectory? I am doubtful about the third stage where we load the fit
> trajectory, should we load both the parm files in that case? Also would the
> difference in the total number of residues create any problem?
>
> My sample script following your guidelines would be
> pca1.in
> parm noionsa.prmtop [ag]
> parm noionsb.prmtop [bg]
> trajin gamd2.nc parm [ag]
> trajin gamd3.nc parm [bg]
> # Fit to the first frame to remove global rotation/translation
> rms first :1-1362.CA
> # Write out the restart
> average gaccAvg.rst7 restart
>
> pca2.in
> parm noionsa.prmtop [ag]
> parm noionsb.prmtop [bg]
> trajin gamd2.nc parm [ag]
> trajin gamd3.nc parm [bg]
> # Read in average structure, tag as [avg]
> reference gaccAvg.rst7 [avg]
> # RMS-fit coordinates to average
> rms ref [avg] :1-1362.CA
> # Calculate coordinate covariance matrix
> matrix covar :1-1362.CA name gaccCovar
> # Write out the fit trajectory
> trajout fit.nc
> # Diagonalize coordinate covariance matrix for eigenmodes (only 2)
> diagmatrix gaccCovar out evecs.dat vecs 20 nmwiz nmwizvecs 5
> nmwizfile prot.nmd nmwizmask :1-1362.CA
>
> pca3.in
>
> parm noionsa.prmtop
> parm noionsb.prmtop
> trajin fit.nc
> # Read in modes
> readdata evecs.dat name MyModes
> # Now create separate PC projections for each trajectory
> projection T1 modes MyModes beg 1 end 20 :1-1362.CA start 1 stop 250000 out
> T1.dat
> projection T2 modes MyModes beg 1 end 20 :1-1362.CA start 250001 stop
> 500000 out T2.dat
> # Calculate Kullback-Leibler Divergence vs time for PC histograms
> kde T1:1 kldiv T2:1 klout KL-PC1.dat bins 200 name SG-DG-1
> kde T1:2 kldiv T2:2 klout KL-PC2.dat bins 200 name SG-DG-2
> kde T1:1 out kde1-PC1.dat bins 200 name KDE1-1
> kde T2:1 out kde2-PC1.dat bins 200 name KDE2-1
> hist T1:1 bins 200 out T1-hist1.dat normint name HIST1-1
> hist T1:2 bins 200 out T1-hist2.dat normint name HIST1-2
> hist T2:1 bins 200 out T2-hist1.dat normint name HIST2-1
> hist T2:2 bins 200 out T2-hist2.dat normint name HIST2-2
>
> Kindly suggest any corrections required.
> Regards,
> Sruthi Sudhakar
>
>
> On Sat, Jun 12, 2021 at 12:11 PM Sruthi Sudhakar <
> sruthisudhakarraji.gmail.com> wrote:
>
> > This method actually worked and I was able to complete the analysis. An
> > additional query: Incase we are using 2 different trajectories to do a
> > combined trajectory analysis, would it be enough to load gamd2.nc and
> > gamd3.nc together? And in that case do we need to fit into a single
> > trajectory (fit.nc) or separately create 2 fit trajectory?
> >
> > Regards,
> > Sruthi Sudhakar
> >
> >
> > On Wed, Jun 2, 2021 at 11:32 PM Daniel Roe <daniel.r.roe.gmail.com> wrote:
> >
> >> Hi,
> >>
> >> First, if you haven't already I really recommend that you look at some
> >> literature to familiarize yourself with what principal component
> >> analysis actually does. It's important to understand what each part of
> >> the analysis is for. I like the old "essential dynamics" papers, but a
> >> very rough explanation is given in the cpptraj PCA tutorial:
> >>
> >> https://amberhub.chpc.utah.edu/introduction-to-principal-component-analysis/
> >>
> >> Also, read the cpptraj manual so you know what each command is actually
> >> doing.
> >>
> >> That said, I now realize my examples in the manual and the tutorial
> >> both rely on having the trajectories stored in memory. So I will give
> >> a rough outline of how to do it all on-disk.
> >>
> >> 1) Create average coordinates.
> >>
> >> parm noions.prmtop
> >> trajin gamd2.nc
> >> # Fit to the first frame to remove global rotation/translation
> >> rms first :1-1530&!.H=
> >> # Write out the restart
> >> average gaccAvg.rst7 restart
> >>
> >> 2) Rms fit your trajectory to average coordinates, calculate the
> >> covariance matrix, write out the fit trajectory, diagonalize the
> >> matrix and write out the "modes" data (i.e. eigenvectors and
> >> eigenvalues).
> >>
> >> parm noions.prmtop
> >> trajin gamd2.nc
> >> # Read in average structure, tag as [avg]
> >> reference gaccAvg.rst7 [avg]
> >> # RMS-fit coordinates to average
> >> rms ref [avg] :1-1530&!.H=
> >> # Calculate coordinate covariance matrix
> >> matrix covar :1-1530&!.H= name gaccCovar
> >> # Write out the fit trajectory
> >> trajout fit.nc
> >> # Diagonalize coordinate covariance matrix for eigenmodes (only 2)
> >> diagmatrix gaccCovar out evecs.dat vecs 2 nmwiz nmwizvecs 2 nmwizfile
> >> cas.nmd nmwizmask :1-1530&!.H=
> >>
> >>
> >> 3) Read in the "modes" data, calculate principal component projections
> >> from the fit trajectory, do the Kullback-Leibler divergence analysis.
> >>
> >> parm noions.prmtop
> >> trajin fit.nc
> >> # Read in modes
> >> readdata evecs.dat name MyModes
> >> # Now create separate PC projections for each trajectory
> >> projection T1 modes MyModes beg 1 end 2 :1-1530&!.H= start 1 stop
> >> 50000 out T1.dat
> >> projection T2 modes MyModes beg 1 end 2 :1-1530&!.H= start 50001 stop
> >> 100000 out T2.dat
> >> <other projection commands>
> >> # Calculate Kullback-Leibler Divergence vs time for PC histograms
> >> kde T1:1 kldiv T2:1 klout KL-PC.agr bins 400 name AMD-MREMD-1
> >> <other kde commands and analyses>
> >>
> >> Hope this helps,
> >>
> >> -Dan
> >>
> >> On Tue, Jun 1, 2021 at 3:23 PM Sruthi Sudhakar
> >> <sruthisudhakarraji.gmail.com> wrote:
> >> >
> >> > The available memory in the beginning of cpptraj is shown as 32 Gb and
> >> > estimated memory usage is 82gb. I am really confused about this memory
> >> > allotment statistics. I am doing the process in a disk with more than
> >> 3TB
> >> > space. I am not well versed with this technicality. Could someone please
> >> > exaplain how to overcome the issue in this principal component analysis
> >> > part? I did understand that we have to separate the analysis into 3
> >> phases
> >> > but not clear as to how the inputs should be changed. Kindly advise.
> >> >
> >> > On Tue, 1 Jun 2021 at 7:21 PM, Sruthi Sudhakar <
> >> sruthisudhakarraji.gmail.com>
> >> > wrote:
> >> >
> >> > >
> >> > > Thank you for the reply. Since I am doing this for the first time, I
> >> > > wanted to know if I am supposed to create 3 separate inputs to run in
> >> > > cpptraj to do the methodology you suggested.
> >> > >
> >> > > Regards,
> >> > > Sruthi
> >> > >
> >> > > On Tue, 1 Jun 2021 at 6:33 PM, Daniel Roe <daniel.r.roe.gmail.com>
> >> wrote:
> >> > >
> >> > >> Hi,
> >> > >>
> >> > >> You are likely running out of memory. This is why your problems
> >> during
> >> > >> clustering went away when you reduced the number of input frames by a
> >> > >> factor of 5. The solution is to do everything on disk. So instead of
> >> > >> loading all coordinates into memory, separate the principal component
> >> > >> analysis into three separate phases:
> >> > >>
> >> > >> 1) Create average coordinates.
> >> > >> 2) Rms fit your trajectory to average coordinates, calculate the
> >> > >> covariance matrix, write out the fit trajectory, diagonalize the
> >> > >> matrix and write out the "modes" data (i.e. eigenvectors and
> >> > >> eigenvalues).
> >> > >> 3) Read in the "modes" data, calculate principal component
> >> projections
> >> > >> from the fit trajectory, do the Kullback-Leibler divergence analysis.
> >> > >>
> >> > >> This way, the most memory you need is to store the covariance matrix,
> >> > >> modes, and other data of that type. Hope this helps,
> >> > >>
> >> > >> -Dan
> >> > >>
> >> > >> PS - Note that there appears to be a small error in the input you
> >> > >> posted (in pca.in). The 'nmwiz' keyword should be part of the
> >> > >> diagmatrix command, not on a separate line.
> >> > >>
> >> > >>
> >> > >> On Mon, May 31, 2021 at 1:22 PM Sruthi Sudhakar
> >> > >> <sruthisudhakarraji.gmail.com> wrote:
> >> > >> >
> >> > >> > Dear all,
> >> > >> >
> >> > >> > I have been doing pca analysis on an accelerated MD trajectory of
> >> 500ns
> >> > >> > (250,000 frames). I have attached the input file I have used for
> >> the
> >> > >> study.
> >> > >> > The job stops at the createcrd stage. Basically, the job gets
> >> killed at
> >> > >> > 30%. The same happened during the cluster analysis reading every
> >> frames.
> >> > >> > The clustering error was solved when I changed the input to read
> >> every
> >> > >> 5th
> >> > >> > frame. Now since this is repeating in pca analysis, kindly help
> >> > >> regarding
> >> > >> > the same.
> >> > >> >
> >> > >> >
> >> > >> > Regards,
> >> > >> > Sruthi Sudhakar
> >> > >> > _______________________________________________
> >> > >> > AMBER mailing list
> >> > >> > AMBER.ambermd.org
> >> > >> > http://lists.ambermd.org/mailman/listinfo/amber
> >> > >>
> >> > >> _______________________________________________
> >> > >> AMBER mailing list
> >> > >> AMBER.ambermd.org
> >> > >> http://lists.ambermd.org/mailman/listinfo/amber
> >> > >>
> >> > >
> >> > _______________________________________________
> >> > AMBER mailing list
> >> > AMBER.ambermd.org
> >> > http://lists.ambermd.org/mailman/listinfo/amber
> >>
> >> _______________________________________________
> >> AMBER mailing list
> >> AMBER.ambermd.org
> >> http://lists.ambermd.org/mailman/listinfo/amber
> >>
> >
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Jun 16 2021 - 06:30:03 PDT
Custom Search