Re: [AMBER] MMPBSA.py decomposition CSV output -- proposal for improvement

From: Jason Swails <jason.swails.gmail.com>
Date: Tue, 23 Jul 2013 12:51:59 -0400

On Tue, Jul 23, 2013 at 11:31 AM, Jan-Philip Gehrcke <jgehrcke.googlemail.com> wrote:

> Hello (especially to Jason :-)),
>
> this is an example of the current MMPBSA.py's CSV decomp output:
>
> | Run on Tue Jul 23 13:25:14 2013
> | GB non-polar solvation energies calculated with gbsa=2
> idecomp = 2: Per-residue decomp adding 1-4 interactions to EEL and VDW.
> Energy Decomposition Analysis (All units kcal/mol): Generalized Born solvent
> DELTAS:
> Total Energy Decomposition:
> Residue,Location,Internal,,,van der Waals,,,Electrostatic,,,Polar Solvation,,,Non-Polar Solv.,,,TOTAL,,
> ,,Avg.,Std. Dev.,Std. Err. of Mean,Avg.,Std. Dev.,Std. Err. of Mean,Avg.,Std. Dev.,Std. Err. of Mean,Avg.,Std. Dev.,Std. Err. of Mean,Avg.,Std. Dev.,Std. Err. of Mean
> THR 1,R THR 1,0.0,0.0,0.0,-1.7556639999999986,0.6806056766616089,0.04304528253381578,-173.6301999999999,12.528931685663462,0.7923992155069864,174.901216,12.943159858756875,0.8185973054666996,-0.4197939264000002,0.1184941850467571,0.00749423028466436,-0.9044419264000001,0.9120885180014503,0.05768554289144142
>
> I stripped everything after the first data line.
>
> I think we agree that the main purpose of the CSV output is
> machine-readability. And I think most people would also agree that
> machine-readability is not defined by "viewable with Excel". While there
> is not a real CSV standard, RFC 4180 [1] says in section 2 ("This
> section documents the format that seems to be followed by most
> implementations"):
>
> " There maybe an optional header line appearing as the first line
> of the file with the same format as normal record lines. This
> header will contain names corresponding to the fields in the file
> and should contain the same number of fields as the records in
> the rest of the file [...]."
>
> The CSV output of the current implementation of MMPBSA.py violates this
> 'rule'. It starts with a few comment lines and then has two header lines
> (with ambiguous names containing spaces). I agree that this looks okay
> when opened with LibreOffice or Excel. But with respect to existing CSV
> reader implementations in Python, numpy, pandas, R, Matlab, ..., we
> should not violate the above 'rule'.
>
> If you would like to keep things as they are, then it looks like one set
> of "Avg.,Std. Dev.,Std. Err. of Mean" is missing in the second header line.
>
> I would, however, vote for the following changes:
>
> - Remove the comment lines or at least label them consistently (e.g.
> starting with '#')
> - Write one header line with unambiguous column names (without spaces),
> such as:
>
> residue
> location
> internal_mean
> internal_stddev
> internal_stderr
> vdw_mean
> vdw_stddev
> vdw_stderr
> estatic_mean
> estatic_stddev
> estatic_stderr
> psolv_mean
> psolv_stddev
> psolv_stderr
> npsolv_mean
> npsolv_stddev
> npsolv_stderr
> total_mean
> total_stddev
> total_stderr
>
>
> Doing so would e.g. allow the following with Python/pandas:
>
> data = pandas.read_csv('FINAL_DECOMP_MMPBSA')
> print data.sort('total_mean').head()
>
>
This sounds good to me (the same change would also make sense for the
energy-dump CSV files). I'll add this to my to-do list and try to get to
it (although if you give me a patch, I'll certainly apply that ;)).
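
For illustration, here is a rough sketch of what the proposed single-header
layout might look like when written and read back with Python's standard csv
module. This is not the MMPBSA.py implementation: the helper names
write_decomp_csv and read_decomp_csv are hypothetical, the column names are
the ones suggested in the quoted message, and the second record ("GLY 2")
contains made-up values that exist only to show sorting.

```python
import csv
import io

# Column names as proposed in the quoted message: one header line, no spaces.
TERMS = ["internal", "vdw", "estatic", "psolv", "npsolv", "total"]
STATS = ["mean", "stddev", "stderr"]
HEADER = ["residue", "location"] + [f"{t}_{s}" for t in TERMS for s in STATS]


def write_decomp_csv(rows, handle):
    """Write a single header line followed by one record per residue."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(HEADER)
    for row in rows:
        if len(row) != len(HEADER):
            raise ValueError(f"expected {len(HEADER)} fields, got {len(row)}")
        writer.writerow(row)


def read_decomp_csv(handle):
    """Read records back as dicts, converting the numeric columns to float."""
    records = []
    for rec in csv.DictReader(handle):
        for key in HEADER[2:]:
            rec[key] = float(rec[key])
        records.append(rec)
    return records


# First data line from the quoted output.
THR_ROW = ["THR 1", "R THR 1", 0.0, 0.0, 0.0,
           -1.7556639999999986, 0.6806056766616089, 0.04304528253381578,
           -173.6301999999999, 12.528931685663462, 0.7923992155069864,
           174.901216, 12.943159858756875, 0.8185973054666996,
           -0.4197939264000002, 0.1184941850467571, 0.00749423028466436,
           -0.9044419264000001, 0.9120885180014503, 0.05768554289144142]

# Hypothetical second record (made-up values, for the sorting demo only).
GLY_ROW = ["GLY 2", "R GLY 2"] + [0.0] * 15 + [-2.5, 0.5, 0.05]

buffer = io.StringIO()
write_decomp_csv([THR_ROW, GLY_ROW], buffer)
buffer.seek(0)

# Rank residues by their mean total contribution, most negative first.
ranked = sorted(read_decomp_csv(buffer), key=lambda rec: rec["total_mean"])
print([rec["residue"] for rec in ranked])  # ['GLY 2', 'THR 1']
```

A file in this layout should also load directly with pandas.read_csv. Note
that the DataFrame.sort call in the quoted message was removed in later
pandas releases; DataFrame.sort_values('total_mean') is the current
equivalent.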

pandas intrigues me, and I've been looking for an excuse to look into it
(but scipy has been enough for me so far).

Thanks,
Jason

-- 
Jason M. Swails
BioMaPS,
Rutgers University
Postdoctoral Researcher
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Jul 23 2013 - 10:00:03 PDT