Files
libcifpp/docs/basics.rst
Maarten L. Hekkelman 724cddb481 Merged develop manually
2025-04-09 09:20:53 +02:00

401 lines
14 KiB
ReStructuredText

Basic usage
===========
This library, *libcifpp*, is a generic *CIF* library with some specific additions to work with *mmCIF* files. The main focus of this library is to make sure that files read or written are valid. That is, they are syntactically valid *and* their content is valid with respect to a CIF dictionary, if such a dictionary is available and specified.
Reading a file is as simple as:
.. code-block:: cpp
#include <cif++.hpp>
cif::file f("/path/to/file.cif");
The file may also be compressed using *gzip* which is detected automatically.
Writing out the file again is also simple, to write out the terminal you can do:
.. code-block:: cpp
std::cout << f;
// or
f.save(std::cout);
// or write a compressed file using gzip compression:
f.save("/tmp/f.cif.gz");
CIF files contain one or more datablocks. To print out the names of all datablocks in our file:
.. code-block:: cpp
for (auto &db : f)
std::cout << db.name() << '\n';
Most often *libcifpp* is used to read in structure files in mmCIF format. These files only contain one datablock and so you can safely use code like this:
.. code-block:: cpp
// get a reference to the first datablock in f
auto &db = f.front();
But if you know the name of the datablock, this also works:
.. code-block:: cpp
// get a reference to the datablock name '1CBS'
auto &db = f["1CBS"];
Now, each datablock contains categories. To print out all their names:
.. code-block:: cpp
for (auto &cat : db)
std::cout << cat.name() << '\n';
But you probably know what category you need to use, so lets fetch it by name:
.. _atom_site-label:
.. code-block:: cpp
// get a reference to the atom_site category in db
auto &atom_site = db["atom_site"];
// and make sure there's some data in it:
assert(not atom_site.empty());
.. note::
Note that we omit the leading underscore in the name of the category here.
Categories contain rows of data and each row has fields or items. Referencing a row in a category results in a :cpp:class:`cif::row_handle` object which you can use to request or manipulate item data.
.. code-block:: cpp
// Get the first row in atom_site
auto rh = atom_site.front();
// Get the label_atom_id value from this row handle as a std::string
std::string atom_id = rh["label_atom_id"].as<std::string>();
// Get the x, y and z coordinates using structered binding
const auto &[x, y, z] = rh.get<float,float,float>("Cartn_x", "Cartn_y", "Cartn_z");
// Assign a new value to the x coordinate or our atom
rh["Cartn_x"] = x + 1;
Querying
--------
Walking over the rows in a category is often not very useful. More often you are interested in specific rows in a category. The function :cpp:func:`cif::category::find` and friends are here to help.
What these functions have in common is that they return data based on a query implemented by :cpp:class:`cif::condition`. These condition objects are built in code using regular C++ syntax. The most basic example of a query is:
.. code-block:: cpp
cif::condition c = cif::key("id") == 1;
Here the condition is that all rows returned should have a value of 1 in there item named *id*. Likewise you can use other data types and even combine those. Oh, and I said we use regular C++ syntax for conditions, so you may as well use other operators to compare values:
.. code-block:: cpp
// condition for C-alpha atoms having an occupancy less than 1.0
cif::condition c = cif::key("occupancy") < 1.0f and cif::key("label_atom_id") == "CA";
Using the namespace *cif::literals* that code becomes a little less verbose:
.. code-block:: cpp
using namespace cif::literals;
cif::condition c = "occupancy"_key < 1.0f and "label_atom_id"_key == "CA";
Conditions can also be combined:
.. code-block:: cpp
cif::condition c = "occupancy"_key < 1.0f and "label_atom_id"_key == "CA";
// extend the condition by requiring the compound ID to be unequal to PRO
c = std::move(c) and "label_comp_id"_key != "PRO";
.. note::
Note the use of std::move here.
Using queries constructed in this way is simple:
.. code-block:: cpp
cif::condition c = ...
auto result = atom_site.find(std::move(c));
// or construct a condition inline:
auto result = atom_site.find("label_atom_id"_key == "CA");
In the example above the result is a range of :cpp:class:`cif::row_handle` objects. Often, using individual field values is more useful:
.. code-block:: cpp
// Requesting a single item:
for (auto id : atom_site.find<std::string>("label_atom_id"_key == "CA", "id"))
std::cout << "ID for CA: " << id << '\n';
// Requesting multiple items:
for (const auto &[id, x, y, z] : atom_site.find<std::string,float,float,float>("label_atom_id"_key == "CA",
"id", "Cartn_x", "Cartn_y", "Cartn_z"))
{
std::cout << "Atom " << id << " is at [" << x << ", " << y << ", " z << "]\n";
}
Returning a complete set if often not required, if you only want to have the first you can use :cpp:func:`cif::category::find_first` as shown here:
.. code-block:: cpp
// return the ID item for the first C-alpha atom
std::string v1 = atom_site.find_first<std::string>("label_atom_id"_key == "CA", "id");
// If you're not sure the row exists, use std::optional
auto v2 = atom_site.find_first<std::optional<std::string>>("label_atom_id"_key == "CA", "id");
if (v2.has_value())
...
There are cases when you really need exactly one result. The :cpp:func:`cif::category::find1` can be used in that case, it will throw an exception if the query does not result in exactly one row.
NULL and ANY
------------
Sometimes items may be empty. The trouble is a bit that empty comes in two flavors: unknown and null. Null in *CIF* parlance means the item should not contain a value since it makes no sense in this case, the value stored in the file is a single dot character: ``'.'``. E.g. *atom_site* records may have a NULL value for label_seq_id for atoms that are part of a *non-polymer*.
The other empty value is indicated by a question mark character: ``'?'``. This means the value is simply unknown.
Both these are NULL in *libcifpp* conditions and can be searched for using :cpp:var:`cif::null`.
So you can search for:
.. code-block:: cpp
cif::condition c = "label_seq_id"_key == cif::null;
You might also want to look for a certain value and don't care in which item it is stored, in that case you can use :cpp:var:`cif::any`.
.. code-block:: cpp
cif::condition c = cif::any == "foo";
And in linked record you might have the items that have a value in both parent and child or both should be NULL. For that, you can request the value to return by find to be of type std::optional and then use that value to build the query. An example to explain this, let's find the location of the atom that is referenced as the first atom in a struct_conn record:
.. code-block:: cpp
// Take references to the two categories we need
auto struct_conn = db["struct_conn"];
auto atom_site = db["atom_site"];
// Loop over all rows in struct_conn taking only the values we need
// Note that the label_seq_id is returned as a std::optional<int>
// That means it may contain an integer or may be empty
for (const auto &[asym1, seqid1, authseqid1, atomid1] :
struct_conn.rows<std::string,std::optional<int>,std::string,std::string,std::string>(
"ptnr1_label_asym_id", "ptnr1_label_seq_id", "ptnr1_auth_seq_id", "ptnr1_label_atom_id"
))
{
// Find the location of the first atom
cif::point p1 = atom_site.find1<float,float,float>(
"label_asym_id"_key == asym1 and "label_seq_id"_key == seqid1 and "auth_seq_id"_key == authseqid1 and "label_atom_id"_key == atomid1,
"cartn_x", "cartn_y", "cartn_z");
}
Validation
----------
CIF files can have a dictionary attached. And based on such a dictionary a :cpp:class:`cif::validator` object can be constructed which in turn can be used to validate the content of the file.
A simple case:
.. code-block:: cpp
#include <cif++.hpp>
cif::file f("1cbs.cif.gz");
f.load_dictionary("mmcif_pdbx.dic");
if (not f.is_valid())
std::cout << "This file is not valid\n";
If you want to know why it is not valid, you should set the global variable :cpp:var:`cif::VERBOSE` to something higer than zero. Depending on the value more or less diagnostic output is sent to std::cerr.
In the case above we load a dictionary based on its name. You can of course also load dictionaries based on a specific file, that's a bit more work:
.. code-block:: cpp
std::filesystem::ifstream dictFile("/tmp/my-dictionary.dic");
auto &validator = cif::parse_dictionary("my-dictionary", dictFile);
cif::file f("1cbs.cif.gz");
// assign the validator
f.set_validator(&validator);
// alternatively, load it by name
f.load_dictionary("my-dictionary");
if (not f.is_valid())
std::cout << "This file is not valid\n";
Creating your own dictionary is a lot of work, especially if you are only extending an existing dictionary with a couple of new categories or items. So, what you can do is extend a loaded validator like this (code taken from DSSP):
.. code-block:: cpp
// db is a cif::datablock reference containing an mmCIF file with DSSP annotations
auto &validator = const_cast<cif::validator &>(*db.get_validator());
if (validator.get_validator_for_category("dssp_struct_summary") == nullptr)
{
auto dssp_extension = cif::load_resource("dssp-extension.dic");
if (dssp_extension)
cif::extend_dictionary(validator, *dssp_extension);
}
.. note::
In the example above we're loading the data using :doc:`/resources`. See the documentation on that for more information.
If a validator has been assigned to a file, assignments to items are checked for valid data. So the following code will throw an exception (see: :ref:`_atom_site-label`):
.. code-block:: cpp
auto rh = atom_site.front();
rh["Cartn_x"] = "foo";
Linking
-------
Based on information recorded in dictionary files (see :ref:`Validation`) you can locate linked records in parent or child categories.
To make this example not too complex, lets assume the following example file:
.. code-block:: cif
data_test
loop_
_cat_1.id
_cat_1.name
_cat_1.desc
1 aap Aap
2 noot Noot
3 mies Mies
loop_
_cat_2.id
_cat_2.name
_cat_2.num
_cat_2.desc
1 aap 1 'Een dier'
2 aap 2 'Een andere aap'
3 noot 1 'walnoot bijvoorbeeld'
And we have a dictionary containing the following link definition:
.. code-block:: cif
loop_
_pdbx_item_linked_group_list.parent_category_id
_pdbx_item_linked_group_list.link_group_id
_pdbx_item_linked_group_list.parent_name
_pdbx_item_linked_group_list.child_name
_pdbx_item_linked_group_list.child_category_id
cat_1 1 '_cat_1.name' '_cat_2.name' cat_2
So, there are links between *cat_1* and *cat_2* based on the value in items named *name*. Using this information, we can now locate children and parents:
.. code-block:: cpp
// Assuming the file was loaded in f:
auto &cat1 = f.front()["cat_1"];
auto &cat2 = f.front()["cat_2"];
auto &cat3 = f.front()["cat_3"];
// Loop over all ape's in cat2
for (auto r : cat1.get_children(cat1.find1("name"_key == "aap"), cat2))
std::cout << r.get<std::string>("desc") << '\n';
Updating a value in an item in a parent category will update the corresponding value in all related children:
.. code-block:: cpp
auto r1 = cat1.find1("id"_key == 1);
r1["name"] = "aapje";
auto rs1 = cat2.find("name"_key == "aapje");
assert(rs1.size() == 2);
However, changing a value in a child record will not update the parent. This may result in an invalid file since you may then have a child that has no parent:
.. code-block:: cpp
auto r2 = cat2.find1("id"_key == 3);
r2["name"] = "wim";
assert(f.is_valid() == false);
So you have to fix this yourself by inserting a new item in cat1 with the new value.
.. _splitting-rows:
Another situation is when you change a value in a parent and updating children might introduce a situation where you need to split a child. To give an example, consider this:
.. code-block:: cif
data_test
loop_
_cat_1.id
_cat_1.name
_cat_1.desc
1 aap Aap
2 noot Noot
3 mies Mies
loop_
_cat_2.id
_cat_2.name
_cat_2.num
_cat_2.desc
1 aap 1 'Een dier'
2 aap 2 'Een andere aap'
3 noot 1 'walnoot bijvoorbeeld'
loop_
_cat_3.id
_cat_3.name
_cat_3.num
1 aap 1
2 aap 2
And we have a dictionary containing the following link definition (reversed compared to the previous example):
.. code-block:: cif
loop_
_pdbx_item_linked_group_list.parent_category_id
_pdbx_item_linked_group_list.link_group_id
_pdbx_item_linked_group_list.parent_name
_pdbx_item_linked_group_list.child_name
_pdbx_item_linked_group_list.child_category_id
cat_2 1 '_cat_2.name' '_cat_1.name' cat_1
cat_3 1 '_cat_3.name' '_cat_2.name' cat_2
cat_3 1 '_cat_3.num' '_cat_2.num' cat_2
So *cat3* is a parent of *cat2* and *cat2* is a parent of *cat1*. Now, if you change the *name* value of the first row of *cat3* to 'aapje', the corresponding row in *cat2* is updated as well. But when you update *cat2* you have to update *cat1* too. And simply changing the name field in row 1 of *cat1* is wrong. The default behaviour in libcifpp is to split the record in *cat1* and have a new child with the new name whereas the other remains as is.
The new *cat1* will thus be like:
.. code-block:: cif
loop_
_cat_1.id
_cat_1.name
_cat_1.desc
1 aapje Aap
2 noot Noot
3 mies Mies
5 aap Aap