1.1. Example 1: Grab metadata from a database

In this example, we will learn how to retrieve the metadata of each file in a database. The metadata contains objective information (e.g. recording date, duration) as well as subjective information such as tags.

1.1.1. List database IDs

First, let’s import and initialize DBHandler to list the IDs of the databases in example_db.

[7]:
from pydtk.db import DBHandler

db_id_handler = DBHandler(
    db_class='database_id',
    db_host='./example_db'
)
db_id_handler.read()
db_id_handler.df
[7]:
database_id df_name _creation_time _uuid _id
0 default db_0ffc6dbe_meta 1.621305e+09 c21f969b5f03d33d43e04f8f136e7682 3cce85d6b78011eb8fc30242ac110002

You can see that example_db contains a database with the ID default.
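
If you only need the IDs themselves, you can pick them out of the database_id column. The following is a minimal sketch that assumes .df behaves like an ordinary pandas DataFrame, as the tabular output above suggests:

# List the available database IDs as plain strings
# (assumes `.df` is a regular pandas DataFrame, as the output above suggests)
database_ids = db_id_handler.df['database_id'].tolist()
print(database_ids)  # for example_db this should print ['default']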

1.1.2. Get metadata from a database

Now, let’s initialize another DBHandler to retrieve the metadata of the contents of the database default.

[8]:
from pydtk.db import V4DBHandler as DBHandler

db_handler = DBHandler(
    db_class='meta',
    db_host='./example_db',
    database_id='default',
    base_dir_path='../test'
)
db_handler.read()

Note that you need to call the function read() to load the data from the DB.

You can visualize the metadata by accessing the property .df as follows:

[9]:
db_handler.df
[9]:
Description Record ID File path Contents Tags data_type end_timestamp content_type start_timestamp database_id sub_record_id _creation_time _uuid _id
0 Description sample /opt/pydtk/test/records/sample/data/records.bag {'/points_concat_downsampled': {'msg_type': 's... NaN raw_data 1.550126e+09 application/rosbag 1.550126e+09 NaN NaN 1.621305e+09 adca5faea1d2012b809688628c8adcfc 3ca3cc9cb78011eb8fc30242ac110002
1 json file test /opt/pydtk/test/records/json_model_test/json_t... {'test': {'tags': ['test1', 'test2']}} NaN test NaN application/json NaN json datbase NaN 1.621305e+09 1a2e2cb364f2d4f43d133719c11d1867 3ca455d6b78011eb8fc30242ac110002
2 Forecast test /opt/pydtk/test/records/forecast_model_test/fo... {'forecast': {'tags': ['test1', 'test2']}} NaN forecast NaN text/csv NaN NaN NaN 1.621305e+09 be7a0ce377de8a4f164dbd019cacb7a2 3ca4ba76b78011eb8fc30242ac110002
3 Description. 016_00000000030000000015_1095_01 /opt/pydtk/test/records/annotation_model_test/... {'risk_annotation': {'tags': ['risk_score', 's... NaN annotation 1.484629e+09 text/csv 1.484629e+09 NaN NaN 1.621305e+09 9d78d143650bec29f293f35142f5528c 3ca5274ab78011eb8fc30242ac110002
Property .df is read-only.
If you want to access the actual metadata, please use .data instead, which returns a list of dicts containing the metadata.
Keep in mind that metadata is stored on a per-file basis, so each row above corresponds to one file.
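
Since .df is a regular (read-only) DataFrame, you can also inspect it with ordinary pandas operations. The snippet below is a small sketch that assumes pandas semantics and uses only the column names shown in the table above:

# Quick inspection with ordinary pandas operations
# (assumes `.df` supports standard DataFrame methods; column names taken from the table above)
df = db_handler.df
print(df['data_type'].value_counts())                      # number of files per data type
print(df[df['content_type'] == 'text/csv']['File path'])   # paths of the CSV files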

1.1.3. Access to metadata

When you use the metadata retrieved from the DB, you will often want to access it as a dict.
In that case, you can access .data, which returns a list of dicts containing the metadata.
[24]:
db_handler.read()
db_handler.data
[24]:
[{'description': 'Description',
  'record_id': 'sample',
  'data_type': 'raw_data',
  'path': '/opt/pydtk/test/records/sample/data/records.bag',
  'start_timestamp': 1550125637.22,
  'end_timestamp': 1550125637.53,
  'content_type': 'application/rosbag',
  'contents': {'/points_concat_downsampled': {'msg_type': 'sensor_msgs/PointCloud2',
    'msg_md5sum': '1158d486dd51d683ce2f1be655c3c181',
    'count': 4,
    'frequency': 10.0,
    'tags': ['lidar', 'downsampled']}},
  '_uuid': 'adca5faea1d2012b809688628c8adcfc',
  '_creation_time': 1621304696.823646,
  '_id': '3ca3cc9cb78011eb8fc30242ac110002'},
 {'description': 'json file',
  'database_id': 'json datbase',
  'record_id': 'test',
  'data_type': 'test',
  'path': '/opt/pydtk/test/records/json_model_test/json_test.json',
  'content_type': 'application/json',
  'contents': {'test': {'tags': ['test1', 'test2']}},
  '_uuid': '1a2e2cb364f2d4f43d133719c11d1867',
  '_creation_time': 1621304696.824871,
  '_id': '3ca455d6b78011eb8fc30242ac110002'},
 {'description': 'Forecast',
  'record_id': 'test',
  'data_type': 'forecast',
  'path': '/opt/pydtk/test/records/forecast_model_test/forecast_test.csv',
  'content_type': 'text/csv',
  'contents': {'forecast': {'tags': ['test1', 'test2']}},
  '_uuid': 'be7a0ce377de8a4f164dbd019cacb7a2',
  '_creation_time': 1621304696.8263,
  '_id': '3ca4ba76b78011eb8fc30242ac110002'},
 {'description': 'Description.',
  'record_id': '016_00000000030000000015_1095_01',
  'data_type': 'annotation',
  'path': '/opt/pydtk/test/records/annotation_model_test/annotation_test.csv',
  'start_timestamp': 1484628818.02,
  'end_timestamp': 1484628823.98,
  'content_type': 'text/csv',
  'contents': {'risk_annotation': {'tags': ['risk_score',
     'scene_description',
     'risk_factor']}},
  '_uuid': '9d78d143650bec29f293f35142f5528c',
  '_creation_time': 1621304696.827451,
  '_id': '3ca5274ab78011eb8fc30242ac110002'}]
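
Since each dict in .data follows the structure shown above, you can, for example, collect the tags attached to every file. This is a small sketch based only on the keys that appear in the output above:

# Collect the tags attached to each file
# (based on the `contents` structure shown in the output above)
for metadata in db_handler.data:
    tags = [
        tag
        for content in metadata.get('contents', {}).values()
        for tag in content.get('tags', [])
    ]
    print(metadata['path'], tags)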

You can also retrieve metadata one-by-one by treating DBHandler as an iterator.

[26]:
for metadata in db_handler:
    print(metadata)
{'description': 'Description', 'record_id': 'sample', 'data_type': 'raw_data', 'path': '/opt/pydtk/test/records/sample/data/records.bag', 'start_timestamp': 1550125637.22, 'end_timestamp': 1550125637.53, 'content_type': 'application/rosbag', 'contents': {'/points_concat_downsampled': {'msg_type': 'sensor_msgs/PointCloud2', 'msg_md5sum': '1158d486dd51d683ce2f1be655c3c181', 'count': 4, 'frequency': 10.0, 'tags': ['lidar', 'downsampled']}}, '_id': '3ca3cc9cb78011eb8fc30242ac110002'}
{'description': 'json file', 'database_id': 'json datbase', 'record_id': 'test', 'data_type': 'test', 'path': '/opt/pydtk/test/records/json_model_test/json_test.json', 'content_type': 'application/json', 'contents': {'test': {'tags': ['test1', 'test2']}}, '_id': '3ca455d6b78011eb8fc30242ac110002'}
{'description': 'Forecast', 'record_id': 'test', 'data_type': 'forecast', 'path': '/opt/pydtk/test/records/forecast_model_test/forecast_test.csv', 'content_type': 'text/csv', 'contents': {'forecast': {'tags': ['test1', 'test2']}}, '_id': '3ca4ba76b78011eb8fc30242ac110002'}
{'description': 'Description.', 'record_id': '016_00000000030000000015_1095_01', 'data_type': 'annotation', 'path': '/opt/pydtk/test/records/annotation_model_test/annotation_test.csv', 'start_timestamp': 1484628818.02, 'end_timestamp': 1484628823.98, 'content_type': 'text/csv', 'contents': {'risk_annotation': {'tags': ['risk_score', 'scene_description', 'risk_factor']}}, '_id': '3ca5274ab78011eb8fc30242ac110002'}
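
The iterator form is convenient when you want to process entries one at a time instead of materializing the whole list, for example to group file paths by record. The sketch below uses only keys that appear in the metadata shown above:

# Group file paths by record_id while iterating
# (uses only keys that appear in the metadata shown above)
paths_by_record = {}
for metadata in db_handler:
    paths_by_record.setdefault(metadata['record_id'], []).append(metadata['path'])
print(paths_by_record)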

1.1.4. Search metadata

When you handle a very large dataset, the metadata contains a huge amount of information and, as a result, it takes a long time to load all of it.
If you only need a limited scope (e.g. the metadata of files tagged ‘camera’ and ‘front’), it is costly to load the whole dataset and then search the loaded dataframe.
Therefore, the toolkit lets you execute a query before loading the database, limiting the items to load.

The current DBHandler (V4) supports both DB-native queries and PQL.

[20]:
# Initialize DB-handler
db_handler = DBHandler(
    db_class='meta',
    db_host='./example_db',
    database_id='default',
    base_dir_path='../test',
    read_on_init=False,
    orient='contents'
)

# Filter records by the timestamps
db_handler.read(pql='start_timestamp > 1500000000')
print('# of metadata: {}'.format(len(db_handler.df)))
db_handler.read(pql='start_timestamp > 1500000000 and start_timestamp < 1700000000')
print('# of metadata: {}'.format(len(db_handler.df)))

# Filter records by `record_id` with regular expressions
db_handler.read(pql='record_id == regex("test.*")')
print('# of metadata: {}'.format(len(db_handler.df)))

# Read metadata containing a specific key
db_handler.read(pql='"contents./points_concat_downsampled" == exists(True)')
print('# of metadata: {}'.format(len(db_handler.df)))

# You can also use DB-native queries (Tinymongo is used in this case)
db_handler.read(query={'start_timestamp': {'$gt': 1500000000}})
print('# of metadata: {}'.format(len(db_handler.df)))
db_handler.read(query={'$and': [{'start_timestamp': {'$gt': 1500000000}}, {'start_timestamp': {'$lt': 1700000000}}]})
print('# of metadata: {}'.format(len(db_handler.df)))
db_handler.read(query={'record_id': {'$regex': 'test.*'}})
print('# of metadata: {}'.format(len(db_handler.df)))
db_handler.read(query={'contents./points_concat_downsampled': {'$exists': True}})
print('# of metadata: {}'.format(len(db_handler.df)))
# of metadata: 1
# of metadata: 1
# of metadata: 2
# of metadata: 1
# of metadata: 1
# of metadata: 1
# of metadata: 2
# of metadata: 1
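
Conditions can also be combined. For instance, the two query styles above can each express “records whose record_id matches test.* and whose content_type is text/csv”, which, judging from the table shown earlier, should match only the forecast CSV record. This is a sketch that assumes string equality works in PQL in the same way as the comparisons above; the DB-native form reuses the $regex operator already shown:

# Combine conditions: record_id matching "test.*" AND content_type == "text/csv"
# PQL form (assumes string equality is supported like the comparisons shown above)
db_handler.read(pql='record_id == regex("test.*") and content_type == "text/csv"')
print('# of metadata: {}'.format(len(db_handler.df)))

# Equivalent DB-native query
db_handler.read(query={'$and': [{'record_id': {'$regex': 'test.*'}},
                                {'content_type': 'text/csv'}]})
print('# of metadata: {}'.format(len(db_handler.df)))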