mirror of
https://github.com/PDB-REDO/libcifpp.git
synced 2026-06-04 13:54:25 +08:00
401 lines
14 KiB
ReStructuredText
401 lines
14 KiB
ReStructuredText
Basic usage
|
|
===========
|
|
|
|
This library, *libcifpp*, is a generic *CIF* library with some specific additions to work with *mmCIF* files. The main focus of this library is to make sure that files read or written are valid. That is, they are syntactically valid *and* their content is valid with respect to a CIF dictionary, if such a dictionary is available and specified.
|
|
|
|
Reading a file is as simple as:
|
|
|
|
.. code-block:: cpp
|
|
|
|
#include <cif++.hpp>
|
|
|
|
cif::file f("/path/to/file.cif");
|
|
|
|
The file may also be compressed using *gzip* which is detected automatically.
|
|
|
|
Writing out the file again is also simple, to write out the terminal you can do:
|
|
|
|
.. code-block:: cpp
|
|
|
|
std::cout << f;
|
|
|
|
// or
|
|
f.save(std::cout);
|
|
|
|
// or write a compressed file using gzip compression:
|
|
f.save("/tmp/f.cif.gz");
|
|
|
|
CIF files contain one or more datablocks. To print out the names of all datablocks in our file:
|
|
|
|
.. code-block:: cpp
|
|
|
|
for (auto &db : f)
|
|
std::cout << db.name() << '\n';
|
|
|
|
Most often *libcifpp* is used to read in structure files in mmCIF format. These files only contain one datablock and so you can safely use code like this:
|
|
|
|
.. code-block:: cpp
|
|
|
|
// get a reference to the first datablock in f
|
|
auto &db = f.front();
|
|
|
|
But if you know the name of the datablock, this also works:
|
|
|
|
.. code-block:: cpp
|
|
|
|
// get a reference to the datablock name '1CBS'
|
|
auto &db = f["1CBS"];
|
|
|
|
Now, each datablock contains categories. To print out all their names:
|
|
|
|
.. code-block:: cpp
|
|
|
|
for (auto &cat : db)
|
|
std::cout << cat.name() << '\n';
|
|
|
|
But you probably know what category you need to use, so lets fetch it by name:
|
|
|
|
.. _atom_site-label:
|
|
.. code-block:: cpp
|
|
|
|
// get a reference to the atom_site category in db
|
|
auto &atom_site = db["atom_site"];
|
|
|
|
// and make sure there's some data in it:
|
|
assert(not atom_site.empty());
|
|
|
|
.. note::
|
|
|
|
Note that we omit the leading underscore in the name of the category here.
|
|
|
|
Categories contain rows of data and each row has fields or items. Referencing a row in a category results in a :cpp:class:`cif::row_handle` object which you can use to request or manipulate item data.
|
|
|
|
.. code-block:: cpp
|
|
|
|
// Get the first row in atom_site
|
|
auto rh = atom_site.front();
|
|
|
|
// Get the label_atom_id value from this row handle as a std::string
|
|
std::string atom_id = rh["label_atom_id"].as<std::string>();
|
|
|
|
// Get the x, y and z coordinates using structered binding
|
|
const auto &[x, y, z] = rh.get<float,float,float>("Cartn_x", "Cartn_y", "Cartn_z");
|
|
|
|
// Assign a new value to the x coordinate or our atom
|
|
rh["Cartn_x"] = x + 1;
|
|
|
|
Querying
|
|
--------
|
|
|
|
Walking over the rows in a category is often not very useful. More often you are interested in specific rows in a category. The function :cpp:func:`cif::category::find` and friends are here to help.
|
|
|
|
What these functions have in common is that they return data based on a query implemented by :cpp:class:`cif::condition`. These condition objects are built in code using regular C++ syntax. The most basic example of a query is:
|
|
|
|
.. code-block:: cpp
|
|
|
|
cif::condition c = cif::key("id") == 1;
|
|
|
|
Here the condition is that all rows returned should have a value of 1 in there item named *id*. Likewise you can use other data types and even combine those. Oh, and I said we use regular C++ syntax for conditions, so you may as well use other operators to compare values:
|
|
|
|
.. code-block:: cpp
|
|
|
|
// condition for C-alpha atoms having an occupancy less than 1.0
|
|
cif::condition c = cif::key("occupancy") < 1.0f and cif::key("label_atom_id") == "CA";
|
|
|
|
Using the namespace *cif::literals* that code becomes a little less verbose:
|
|
|
|
.. code-block:: cpp
|
|
|
|
using namespace cif::literals;
|
|
cif::condition c = "occupancy"_key < 1.0f and "label_atom_id"_key == "CA";
|
|
|
|
Conditions can also be combined:
|
|
|
|
.. code-block:: cpp
|
|
|
|
cif::condition c = "occupancy"_key < 1.0f and "label_atom_id"_key == "CA";
|
|
|
|
// extend the condition by requiring the compound ID to be unequal to PRO
|
|
c = std::move(c) and "label_comp_id"_key != "PRO";
|
|
|
|
.. note::
|
|
|
|
Note the use of std::move here.
|
|
|
|
Using queries constructed in this way is simple:
|
|
|
|
.. code-block:: cpp
|
|
|
|
cif::condition c = ...
|
|
auto result = atom_site.find(std::move(c));
|
|
|
|
// or construct a condition inline:
|
|
auto result = atom_site.find("label_atom_id"_key == "CA");
|
|
|
|
In the example above the result is a range of :cpp:class:`cif::row_handle` objects. Often, using individual field values is more useful:
|
|
|
|
.. code-block:: cpp
|
|
|
|
// Requesting a single item:
|
|
for (auto id : atom_site.find<std::string>("label_atom_id"_key == "CA", "id"))
|
|
std::cout << "ID for CA: " << id << '\n';
|
|
|
|
// Requesting multiple items:
|
|
for (const auto &[id, x, y, z] : atom_site.find<std::string,float,float,float>("label_atom_id"_key == "CA",
|
|
"id", "Cartn_x", "Cartn_y", "Cartn_z"))
|
|
{
|
|
std::cout << "Atom " << id << " is at [" << x << ", " << y << ", " z << "]\n";
|
|
}
|
|
|
|
Returning a complete set if often not required, if you only want to have the first you can use :cpp:func:`cif::category::find_first` as shown here:
|
|
|
|
.. code-block:: cpp
|
|
|
|
// return the ID item for the first C-alpha atom
|
|
std::string v1 = atom_site.find_first<std::string>("label_atom_id"_key == "CA", "id");
|
|
|
|
// If you're not sure the row exists, use std::optional
|
|
auto v2 = atom_site.find_first<std::optional<std::string>>("label_atom_id"_key == "CA", "id");
|
|
if (v2.has_value())
|
|
...
|
|
|
|
There are cases when you really need exactly one result. The :cpp:func:`cif::category::find1` can be used in that case, it will throw an exception if the query does not result in exactly one row.
|
|
|
|
NULL and ANY
|
|
------------
|
|
|
|
Sometimes items may be empty. The trouble is a bit that empty comes in two flavors: unknown and null. Null in *CIF* parlance means the item should not contain a value since it makes no sense in this case, the value stored in the file is a single dot character: ``'.'``. E.g. *atom_site* records may have a NULL value for label_seq_id for atoms that are part of a *non-polymer*.
|
|
|
|
The other empty value is indicated by a question mark character: ``'?'``. This means the value is simply unknown.
|
|
|
|
Both these are NULL in *libcifpp* conditions and can be searched for using :cpp:var:`cif::null`.
|
|
|
|
So you can search for:
|
|
|
|
.. code-block:: cpp
|
|
|
|
cif::condition c = "label_seq_id"_key == cif::null;
|
|
|
|
You might also want to look for a certain value and don't care in which item it is stored, in that case you can use :cpp:var:`cif::any`.
|
|
|
|
.. code-block:: cpp
|
|
|
|
cif::condition c = cif::any == "foo";
|
|
|
|
And in linked record you might have the items that have a value in both parent and child or both should be NULL. For that, you can request the value to return by find to be of type std::optional and then use that value to build the query. An example to explain this, let's find the location of the atom that is referenced as the first atom in a struct_conn record:
|
|
|
|
.. code-block:: cpp
|
|
|
|
// Take references to the two categories we need
|
|
auto struct_conn = db["struct_conn"];
|
|
auto atom_site = db["atom_site"];
|
|
|
|
// Loop over all rows in struct_conn taking only the values we need
|
|
// Note that the label_seq_id is returned as a std::optional<int>
|
|
// That means it may contain an integer or may be empty
|
|
for (const auto &[asym1, seqid1, authseqid1, atomid1] :
|
|
struct_conn.rows<std::string,std::optional<int>,std::string,std::string,std::string>(
|
|
"ptnr1_label_asym_id", "ptnr1_label_seq_id", "ptnr1_auth_seq_id", "ptnr1_label_atom_id"
|
|
))
|
|
{
|
|
// Find the location of the first atom
|
|
cif::point p1 = atom_site.find1<float,float,float>(
|
|
"label_asym_id"_key == asym1 and "label_seq_id"_key == seqid1 and "auth_seq_id"_key == authseqid1 and "label_atom_id"_key == atomid1,
|
|
"cartn_x", "cartn_y", "cartn_z");
|
|
}
|
|
|
|
|
|
Validation
|
|
----------
|
|
|
|
CIF files can have a dictionary attached. And based on such a dictionary a :cpp:class:`cif::validator` object can be constructed which in turn can be used to validate the content of the file.
|
|
|
|
A simple case:
|
|
|
|
.. code-block:: cpp
|
|
|
|
#include <cif++.hpp>
|
|
|
|
cif::file f("1cbs.cif.gz");
|
|
f.load_dictionary("mmcif_pdbx.dic");
|
|
|
|
if (not f.is_valid())
|
|
std::cout << "This file is not valid\n";
|
|
|
|
If you want to know why it is not valid, you should set the global variable :cpp:var:`cif::VERBOSE` to something higer than zero. Depending on the value more or less diagnostic output is sent to std::cerr.
|
|
|
|
In the case above we load a dictionary based on its name. You can of course also load dictionaries based on a specific file, that's a bit more work:
|
|
|
|
.. code-block:: cpp
|
|
|
|
std::filesystem::ifstream dictFile("/tmp/my-dictionary.dic");
|
|
auto &validator = cif::parse_dictionary("my-dictionary", dictFile);
|
|
|
|
cif::file f("1cbs.cif.gz");
|
|
|
|
// assign the validator
|
|
f.set_validator(&validator);
|
|
|
|
// alternatively, load it by name
|
|
f.load_dictionary("my-dictionary");
|
|
|
|
if (not f.is_valid())
|
|
std::cout << "This file is not valid\n";
|
|
|
|
Creating your own dictionary is a lot of work, especially if you are only extending an existing dictionary with a couple of new categories or items. So, what you can do is extend a loaded validator like this (code taken from DSSP):
|
|
|
|
.. code-block:: cpp
|
|
|
|
// db is a cif::datablock reference containing an mmCIF file with DSSP annotations
|
|
auto &validator = const_cast<cif::validator &>(*db.get_validator());
|
|
if (validator.get_validator_for_category("dssp_struct_summary") == nullptr)
|
|
{
|
|
auto dssp_extension = cif::load_resource("dssp-extension.dic");
|
|
if (dssp_extension)
|
|
cif::extend_dictionary(validator, *dssp_extension);
|
|
}
|
|
|
|
.. note::
|
|
|
|
In the example above we're loading the data using :doc:`/resources`. See the documentation on that for more information.
|
|
|
|
If a validator has been assigned to a file, assignments to items are checked for valid data. So the following code will throw an exception (see: :ref:`_atom_site-label`):
|
|
|
|
.. code-block:: cpp
|
|
|
|
auto rh = atom_site.front();
|
|
rh["Cartn_x"] = "foo";
|
|
|
|
Linking
|
|
-------
|
|
|
|
Based on information recorded in dictionary files (see :ref:`Validation`) you can locate linked records in parent or child categories.
|
|
|
|
To make this example not too complex, lets assume the following example file:
|
|
|
|
.. code-block:: cif
|
|
|
|
data_test
|
|
loop_
|
|
_cat_1.id
|
|
_cat_1.name
|
|
_cat_1.desc
|
|
1 aap Aap
|
|
2 noot Noot
|
|
3 mies Mies
|
|
|
|
loop_
|
|
_cat_2.id
|
|
_cat_2.name
|
|
_cat_2.num
|
|
_cat_2.desc
|
|
1 aap 1 'Een dier'
|
|
2 aap 2 'Een andere aap'
|
|
3 noot 1 'walnoot bijvoorbeeld'
|
|
|
|
And we have a dictionary containing the following link definition:
|
|
|
|
.. code-block:: cif
|
|
|
|
loop_
|
|
_pdbx_item_linked_group_list.parent_category_id
|
|
_pdbx_item_linked_group_list.link_group_id
|
|
_pdbx_item_linked_group_list.parent_name
|
|
_pdbx_item_linked_group_list.child_name
|
|
_pdbx_item_linked_group_list.child_category_id
|
|
cat_1 1 '_cat_1.name' '_cat_2.name' cat_2
|
|
|
|
So, there are links between *cat_1* and *cat_2* based on the value in items named *name*. Using this information, we can now locate children and parents:
|
|
|
|
.. code-block:: cpp
|
|
|
|
// Assuming the file was loaded in f:
|
|
auto &cat1 = f.front()["cat_1"];
|
|
auto &cat2 = f.front()["cat_2"];
|
|
auto &cat3 = f.front()["cat_3"];
|
|
|
|
// Loop over all ape's in cat2
|
|
for (auto r : cat1.get_children(cat1.find1("name"_key == "aap"), cat2))
|
|
std::cout << r.get<std::string>("desc") << '\n';
|
|
|
|
Updating a value in an item in a parent category will update the corresponding value in all related children:
|
|
|
|
.. code-block:: cpp
|
|
|
|
auto r1 = cat1.find1("id"_key == 1);
|
|
r1["name"] = "aapje";
|
|
|
|
auto rs1 = cat2.find("name"_key == "aapje");
|
|
assert(rs1.size() == 2);
|
|
|
|
However, changing a value in a child record will not update the parent. This may result in an invalid file since you may then have a child that has no parent:
|
|
|
|
.. code-block:: cpp
|
|
|
|
auto r2 = cat2.find1("id"_key == 3);
|
|
r2["name"] = "wim";
|
|
|
|
assert(f.is_valid() == false);
|
|
|
|
So you have to fix this yourself by inserting a new item in cat1 with the new value.
|
|
|
|
.. _splitting-rows:
|
|
Another situation is when you change a value in a parent and updating children might introduce a situation where you need to split a child. To give an example, consider this:
|
|
|
|
.. code-block:: cif
|
|
|
|
data_test
|
|
loop_
|
|
_cat_1.id
|
|
_cat_1.name
|
|
_cat_1.desc
|
|
1 aap Aap
|
|
2 noot Noot
|
|
3 mies Mies
|
|
|
|
loop_
|
|
_cat_2.id
|
|
_cat_2.name
|
|
_cat_2.num
|
|
_cat_2.desc
|
|
1 aap 1 'Een dier'
|
|
2 aap 2 'Een andere aap'
|
|
3 noot 1 'walnoot bijvoorbeeld'
|
|
|
|
loop_
|
|
_cat_3.id
|
|
_cat_3.name
|
|
_cat_3.num
|
|
1 aap 1
|
|
2 aap 2
|
|
|
|
And we have a dictionary containing the following link definition (reversed compared to the previous example):
|
|
|
|
.. code-block:: cif
|
|
|
|
loop_
|
|
_pdbx_item_linked_group_list.parent_category_id
|
|
_pdbx_item_linked_group_list.link_group_id
|
|
_pdbx_item_linked_group_list.parent_name
|
|
_pdbx_item_linked_group_list.child_name
|
|
_pdbx_item_linked_group_list.child_category_id
|
|
cat_2 1 '_cat_2.name' '_cat_1.name' cat_1
|
|
cat_3 1 '_cat_3.name' '_cat_2.name' cat_2
|
|
cat_3 1 '_cat_3.num' '_cat_2.num' cat_2
|
|
|
|
So *cat3* is a parent of *cat2* and *cat2* is a parent of *cat1*. Now, if you change the *name* value of the first row of *cat3* to 'aapje', the corresponding row in *cat2* is updated as well. But when you update *cat2* you have to update *cat1* too. And simply changing the name field in row 1 of *cat1* is wrong. The default behaviour in libcifpp is to split the record in *cat1* and have a new child with the new name whereas the other remains as is.
|
|
|
|
The new *cat1* will thus be like:
|
|
|
|
.. code-block:: cif
|
|
|
|
loop_
|
|
_cat_1.id
|
|
_cat_1.name
|
|
_cat_1.desc
|
|
1 aapje Aap
|
|
2 noot Noot
|
|
3 mies Mies
|
|
5 aap Aap
|
|
|