[AMBER] MMPBSA.py decomposition CSV output -- proposal for improvement

From: Jan-Philip Gehrcke <jgehrcke.googlemail.com>
Date: Tue, 23 Jul 2013 17:31:51 +0200

Hello (especially to Jason :-)),

this is an example of the CSV decomposition output that MMPBSA.py
currently writes:

| Run on Tue Jul 23 13:25:14 2013
| GB non-polar solvation energies calculated with gbsa=2
idecomp = 2: Per-residue decomp adding 1-4 interactions to EEL and VDW.
Energy Decomposition Analysis (All units kcal/mol): Generalized Born solvent
DELTAS:
Total Energy Decomposition:
Residue,Location,Internal,,,van der Waals,,,Electrostatic,,,Polar Solvation,,,Non-Polar Solv.,,,TOTAL,,
,,Avg.,Std. Dev.,Std. Err. of Mean,Avg.,Std. Dev.,Std. Err. of Mean,Avg.,Std. Dev.,Std. Err. of Mean,Avg.,Std. Dev.,Std. Err. of Mean,Avg.,Std. Dev.,Std. Err. of Mean
THR 1,R THR 1,0.0,0.0,0.0,-1.7556639999999986,0.6806056766616089,0.04304528253381578,-173.6301999999999,12.528931685663462,0.7923992155069864,174.901216,12.943159858756875,0.8185973054666996,-0.4197939264000002,0.1184941850467571,0.00749423028466436,-0.9044419264000001,0.9120885180014503,0.05768554289144142

I stripped everything after the first data line.

I think we agree that the main purpose of the CSV output is
machine-readability. And I think most people would also agree that
machine-readability is not defined by "viewable with Excel". While there
is not a real CSV standard, RFC 4180 [1] says in section 2 ("This
section documents the format that seems to be followed by most
implementations"):

      " There maybe an optional header line appearing as the first line
        of the file with the same format as normal record lines. This
        header will contain names corresponding to the fields in the file
        and should contain the same number of fields as the records in
        the rest of the file [...]."

The CSV output of the current implementation of MMPBSA.py violates this
'rule'. It starts with a few comment lines and then has two header lines
(with ambiguous names containing spaces). I agree that this looks okay
when opened with LibreOffice or Excel. But existing CSV readers in
Python, numpy, pandas, R, Matlab, ... cannot parse it without special
handling, so we should not violate the above 'rule'.

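Even if you would like to keep things as they are, note that one set of
"Avg.,Std. Dev.,Std. Err. of Mean" appears to be missing from the second
header line: the first header line names six energy terms, but the
second one provides only five such sets.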
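
To make the mismatch concrete, here is a minimal sketch that feeds the two
header lines and the data line from the example above to Python's csv module
and counts the fields per line. The strings are copied from the output shown
above with the mail client's line wraps removed; the single space in
"R THR 1" is my assumption about the original spacing and does not affect
the counts.

```python
import csv
import io

# First header line, second header line, and first data record as shown in
# the example output (line wraps from the mail client removed).
header_1 = ("Residue,Location,Internal,,,van der Waals,,,Electrostatic,,,"
            "Polar Solvation,,,Non-Polar Solv.,,,TOTAL,,")
# The second header line contains five "Avg./Std. Dev./Std. Err." sets.
header_2 = ",," + ",".join(["Avg.,Std. Dev.,Std. Err. of Mean"] * 5)
record = ("THR 1,R THR 1,0.0,0.0,0.0,"
          "-1.7556639999999986,0.6806056766616089,0.04304528253381578,"
          "-173.6301999999999,12.528931685663462,0.7923992155069864,"
          "174.901216,12.943159858756875,0.8185973054666996,"
          "-0.4197939264000002,0.1184941850467571,0.00749423028466436,"
          "-0.9044419264000001,0.9120885180014503,0.05768554289144142")

# Parse the three lines as CSV and count the fields in each of them.
rows = list(csv.reader(io.StringIO("\n".join([header_1, header_2, record]))))
field_counts = [len(row) for row in rows]
print(field_counts)
```

The three counts come out as 20, 17 and 20, i.e. the second header line is
three fields short of the first header line and of the data record.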

I would, however, vote for the following changes:

- Remove the comment lines or at least label them consistently (e.g.
starting with '#')
- Write one header line with unambiguous column names (without spaces),
such as:

residue
location
internal_mean
internal_stddev
internal_stderr
vdw_mean
vdw_stddev
vdw_stderr
estatic_mean
estatic_stddev
estatic_stderr
psolv_mean
psolv_stddev
psolv_stderr
npsolv_mean
npsolv_stddev
npsolv_stderr
total_mean
total_stddev
total_stderr
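
As a rough sketch of how such a file could be written and read back with
nothing but Python's csv module: the names TERMS, STATS, COLUMNS and
write_decomp_csv below are hypothetical and only serve as illustration, they
are not taken from the MMPBSA.py code. The record uses the values of the data
line above, rounded for brevity.

```python
import csv
import io

# Hypothetical short names for the six energy terms and three statistics.
TERMS = ["internal", "vdw", "estatic", "psolv", "npsolv", "total"]
STATS = ["mean", "stddev", "stderr"]
COLUMNS = ["residue", "location"] + [f"{t}_{s}" for t in TERMS for s in STATS]


def write_decomp_csv(stream, records, comments=()):
    """Write '#'-prefixed comment lines, one header line, then the records."""
    for line in comments:
        stream.write(f"# {line}\n")
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(records)


# Values of the first data record from the example, rounded for brevity.
record = ["THR 1", "R THR 1", 0.0, 0.0, 0.0, -1.756, 0.681, 0.043,
          -173.630, 12.529, 0.792, 174.901, 12.943, 0.819,
          -0.420, 0.118, 0.007, -0.904, 0.912, 0.058]

buffer = io.StringIO()
write_decomp_csv(buffer, [record], comments=["Run on Tue Jul 23 13:25:14 2013"])

# Read it back: drop the comment lines, then let DictReader use the header.
lines = [l for l in buffer.getvalue().splitlines() if not l.startswith("#")]
rows = list(csv.DictReader(lines))
print(rows[0]["residue"], rows[0]["total_mean"])
```

With a single header line, every value can be addressed by its column name,
and the consistently labeled comment lines are trivial to filter out.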


Doing so would e.g. allow the following with Python/pandas:

     import pandas
     data = pandas.read_csv('FINAL_DECOMP_MMPBSA')
     # sort_values() replaces DataFrame.sort(), which newer pandas removed
     print(data.sort_values('total_mean').head())


Cheers,

Jan-Philip


[1] http://tools.ietf.org/html/rfc4180

Received on Tue Jul 23 2013 - 09:00:03 PDT