1.1. Example 1: Grab metadata from a database

In this example, we will learn how to retrieve the metadata of each file in a database. The metadata contains objective information (e.g. recording date, duration) as well as subjective information such as tags.

1.1.1. List database IDs

First, let’s import and initialize DBHandler to list the IDs of the databases in example_db.

[7]:
from pydtk.db import DBHandler

db_id_handler = DBHandler(
    db_class='database_id',
    db_host='./example_db'
)
db_id_handler.read()
db_id_handler.df
[7]:
database_id df_name _creation_time _uuid _id
0 default db_0ffc6dbe_meta 1.621305e+09 c21f969b5f03d33d43e04f8f136e7682 3cce85d6b78011eb8fc30242ac110002

You can see that example_db contains a database with the ID default.
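
If you only need the IDs themselves, you can pick them out of the database_id column. The following is a minimal sketch that assumes .df behaves like an ordinary pandas DataFrame, as the tabular output above suggests:

# List the available database IDs as plain strings
# (assumes `.df` is a regular pandas DataFrame, as the output above suggests)
database_ids = db_id_handler.df['database_id'].tolist()
print(database_ids)  # for example_db this should print ['default']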

1.1.2. Get metadata from a database

Now, let’s initialize another DBHandler to retrieve the metadata of the contents of the database default.

[8]:
from pydtk.db import V4DBHandler as DBHandler

db_handler = DBHandler(
    db_class='meta',
    db_host='./example_db',
    database_id='default',
    base_dir_path='../test'
)
db_handler.read()

Note that you need to call the function read() to load the data from the DB.

You can visualize the metadata by accessing the property .df as follows:

[9]:
db_handler.df
[9]:
Description Record ID File path Contents Tags data_type end_timestamp content_type start_timestamp database_id sub_record_id _creation_time _uuid _id
0 Description sample /opt/pydtk/test/records/sample/data/records.bag {'/points_concat_downsampled': {'msg_type': 's... NaN raw_data 1.550126e+09 application/rosbag 1.550126e+09 NaN NaN 1.621305e+09 adca5faea1d2012b809688628c8adcfc 3ca3cc9cb78011eb8fc30242ac110002
1 json file test /opt/pydtk/test/records/json_model_test/json_t... {'test': {'tags': ['test1', 'test2']}} NaN test NaN application/json NaN json datbase NaN 1.621305e+09 1a2e2cb364f2d4f43d133719c11d1867 3ca455d6b78011eb8fc30242ac110002
2 Forecast test /opt/pydtk/test/records/forecast_model_test/fo... {'forecast': {'tags': ['test1', 'test2']}} NaN forecast NaN text/csv NaN NaN NaN 1.621305e+09 be7a0ce377de8a4f164dbd019cacb7a2 3ca4ba76b78011eb8fc30242ac110002
3 Description. 016_00000000030000000015_1095_01 /opt/pydtk/test/records/annotation_model_test/... {'risk_annotation': {'tags': ['risk_score', 's... NaN annotation 1.484629e+09 text/csv 1.484629e+09 NaN NaN 1.621305e+09 9d78d143650bec29f293f35142f5528c 3ca5274ab78011eb8fc30242ac110002
Property .df is read-only.
If you want to access the actual metadata, please use .data instead, which returns a list of dicts containing the metadata.
Keep in mind that metadata is stored on a per-file basis, so each row above corresponds to one file.
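
Since .df is a regular (read-only) DataFrame, you can also inspect it with ordinary pandas operations. The snippet below is a small sketch that assumes pandas semantics and uses only the column names shown in the table above:

# Quick inspection with ordinary pandas operations
# (assumes `.df` supports standard DataFrame methods; column names taken from the table above)
df = db_handler.df
print(df['data_type'].value_counts())                      # number of files per data type
print(df[df['content_type'] == 'text/csv']['File path'])   # paths of the CSV files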

1.1.3. Access to metadata

When you use the metadata retrieved from the DB, you will often want to access it as a dict.
In that case, you can access .data, which returns a list of dicts containing the metadata.
[24]:
db_handler.read()
db_handler.data
[24]:
[{'description': 'Description',
  'record_id': 'sample',
  'data_type': 'raw_data',
  'path': '/opt/pydtk/test/records/sample/data/records.bag',
  'start_timestamp': 1550125637.22,
  'end_timestamp': 1550125637.53,
  'content_type': 'application/rosbag',
  'contents': {'/points_concat_downsampled': {'msg_type': 'sensor_msgs/PointCloud2',
    'msg_md5sum': '1158d486dd51d683ce2f1be655c3c181',
    'count': 4,
    'frequency': 10.0,
    'tags': ['lidar', 'downsampled']}},
  '_uuid': 'adca5faea1d2012b809688628c8adcfc',
  '_creation_time': 1621304696.823646,
  '_id': '3ca3cc9cb78011eb8fc30242ac110002'},
 {'description': 'json file',
  'database_id': 'json datbase',
  'record_id': 'test',
  'data_type': 'test',
  'path': '/opt/pydtk/test/records/json_model_test/json_test.json',
  'content_type': 'application/json',
  'contents': {'test': {'tags': ['test1', 'test2']}},
  '_uuid': '1a2e2cb364f2d4f43d133719c11d1867',
  '_creation_time': 1621304696.824871,
  '_id': '3ca455d6b78011eb8fc30242ac110002'},
 {'description': 'Forecast',
  'record_id': 'test',
  'data_type': 'forecast',
  'path': '/opt/pydtk/test/records/forecast_model_test/forecast_test.csv',
  'content_type': 'text/csv',
  'contents': {'forecast': {'tags': ['test1', 'test2']}},
  '_uuid': 'be7a0ce377de8a4f164dbd019cacb7a2',
  '_creation_time': 1621304696.8263,
  '_id': '3ca4ba76b78011eb8fc30242ac110002'},
 {'description': 'Description.',
  'record_id': '016_00000000030000000015_1095_01',
  'data_type': 'annotation',
  'path': '/opt/pydtk/test/records/annotation_model_test/annotation_test.csv',
  'start_timestamp': 1484628818.02,
  'end_timestamp': 1484628823.98,
  'content_type': 'text/csv',
  'contents': {'risk_annotation': {'tags': ['risk_score',
     'scene_description',
     'risk_factor']}},
  '_uuid': '9d78d143650bec29f293f35142f5528c',
  '_creation_time': 1621304696.827451,
  '_id': '3ca5274ab78011eb8fc30242ac110002'}]
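
Since each dict in .data follows the structure shown above, you can, for example, collect the tags attached to every file. This is a small sketch based only on the keys that appear in the output above:

# Collect the tags attached to each file
# (based on the `contents` structure shown in the output above)
for metadata in db_handler.data:
    tags = [
        tag
        for content in metadata.get('contents', {}).values()
        for tag in content.get('tags', [])
    ]
    print(metadata['path'], tags)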

You can also retrieve metadata one-by-one by treating DBHandler as an iterator.

[26]:
for metadata in db_handler:
    print(metadata)
{'description': 'Description', 'record_id': 'sample', 'data_type': 'raw_data', 'path': '/opt/pydtk/test/records/sample/data/records.bag', 'start_timestamp': 1550125637.22, 'end_timestamp': 1550125637.53, 'content_type': 'application/rosbag', 'contents': {'/points_concat_downsampled': {'msg_type': 'sensor_msgs/PointCloud2', 'msg_md5sum': '1158d486dd51d683ce2f1be655c3c181', 'count': 4, 'frequency': 10.0, 'tags': ['lidar', 'downsampled']}}, '_id': '3ca3cc9cb78011eb8fc30242ac110002'}
{'description': 'json file', 'database_id': 'json datbase', 'record_id': 'test', 'data_type': 'test', 'path': '/opt/pydtk/test/records/json_model_test/json_test.json', 'content_type': 'application/json', 'contents': {'test': {'tags': ['test1', 'test2']}}, '_id': '3ca455d6b78011eb8fc30242ac110002'}
{'description': 'Forecast', 'record_id': 'test', 'data_type': 'forecast', 'path': '/opt/pydtk/test/records/forecast_model_test/forecast_test.csv', 'content_type': 'text/csv', 'contents': {'forecast': {'tags': ['test1', 'test2']}}, '_id': '3ca4ba76b78011eb8fc30242ac110002'}
{'description': 'Description.', 'record_id': '016_00000000030000000015_1095_01', 'data_type': 'annotation', 'path': '/opt/pydtk/test/records/annotation_model_test/annotation_test.csv', 'start_timestamp': 1484628818.02, 'end_timestamp': 1484628823.98, 'content_type': 'text/csv', 'contents': {'risk_annotation': {'tags': ['risk_score', 'scene_description', 'risk_factor']}}, '_id': '3ca5274ab78011eb8fc30242ac110002'}
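
The iterator form is convenient when you want to process entries one at a time instead of materializing the whole list, for example to group file paths by record. The sketch below uses only keys that appear in the metadata shown above:

# Group file paths by record_id while iterating
# (uses only keys that appear in the metadata shown above)
paths_by_record = {}
for metadata in db_handler:
    paths_by_record.setdefault(metadata['record_id'], []).append(metadata['path'])
print(paths_by_record)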

1.1.4. Search metadata

When you handle a very large dataset, the metadata contains a huge amount of information and, as a result, it takes a long time to load all of it.
If you only need a limited scope (e.g. the metadata of files tagged ‘camera’ and ‘front’), it is costly to load the whole dataset and then search the loaded dataframe.
Therefore, the toolkit lets you execute a query before loading the database, limiting the items to load.

The current DBHandler (V4) supports both DB-native queries and PQL.

[20]:
# Initialize DB-handler
db_handler = DBHandler(
    db_class='meta',
    db_host='./example_db',
    database_id='default',
    base_dir_path='../test',
    read_on_init=False,
    orient='contents'
)

# Filter records by the timestamps
db_handler.read(pql='start_timestamp > 1500000000')
print('# of metadata: {}'.format(len(db_handler.df)))
db_handler.read(pql='start_timestamp > 1500000000 and start_timestamp < 1700000000')
print('# of metadata: {}'.format(len(db_handler.df)))

# Filter records by `record_id` with regular expressions
db_handler.read(pql='record_id == regex("test.*")')
print('# of metadata: {}'.format(len(db_handler.df)))

# Read metadata containing a specific key
db_handler.read(pql='"contents./points_concat_downsampled" == exists(True)')
print('# of metadata: {}'.format(len(db_handler.df)))

# You can also use DB-native queries (Tinymongo is used in this case)
db_handler.read(query={'start_timestamp': {'$gt': 1500000000}})
print('# of metadata: {}'.format(len(db_handler.df)))
db_handler.read(query={'$and': [{'start_timestamp': {'$gt': 1500000000}}, {'start_timestamp': {'$lt': 1700000000}}]})
print('# of metadata: {}'.format(len(db_handler.df)))
db_handler.read(query={'record_id': {'$regex': 'test.*'}})
print('# of metadata: {}'.format(len(db_handler.df)))
db_handler.read(query={'contents./points_concat_downsampled': {'$exists': True}})
print('# of metadata: {}'.format(len(db_handler.df)))
# of metadata: 1
# of metadata: 1
# of metadata: 2
# of metadata: 1
# of metadata: 1
# of metadata: 1
# of metadata: 2
# of metadata: 1
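
Conditions can also be combined. For instance, the two query styles above can each express “records whose record_id matches test.* and whose content_type is text/csv”, which, judging from the table shown earlier, should match only the forecast CSV record. This is a sketch that assumes string equality works in PQL in the same way as the comparisons above; the DB-native form reuses the $regex operator already shown:

# Combine conditions: record_id matching "test.*" AND content_type == "text/csv"
# PQL form (assumes string equality is supported like the comparisons shown above)
db_handler.read(pql='record_id == regex("test.*") and content_type == "text/csv"')
print('# of metadata: {}'.format(len(db_handler.df)))

# Equivalent DB-native query
db_handler.read(query={'$and': [{'record_id': {'$regex': 'test.*'}},
                                {'content_type': 'text/csv'}]})
print('# of metadata: {}'.format(len(db_handler.df)))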