1.1. Example 1: Grab metadata from a database
In this example, we will learn how to get the metadata of each file in a database. The metadata contains objective information (e.g. recording date, duration) as well as subjective information such as tags.
1.1.1. List database IDs
First, let’s import and initialize DBHandler to list the IDs of the databases in example_db.
[7]:
from pydtk.db import DBHandler
db_id_handler = DBHandler(
    db_class='database_id',
    db_host='./example_db'
)
db_id_handler.read()
db_id_handler.df
[7]:
| | database_id | df_name | _creation_time | _uuid | _id |
|---|---|---|---|---|---|
| 0 | default | db_0ffc6dbe_meta | 1.621305e+09 | c21f969b5f03d33d43e04f8f136e7682 | 3cce85d6b78011eb8fc30242ac110002 |
You can see that example_db contains a database with the ID default.
1.1.2. Get metadata from a database
Now, let’s initialize another DBHandler to retrieve the metadata of the contents of the database default.
[8]:
from pydtk.db import V4DBHandler as DBHandler
db_handler = DBHandler(
    db_class='meta',
    db_host='./example_db',
    database_id='default',
    base_dir_path='../test'
)
db_handler.read()
Note that you need to call the read() method to read data from the DB.
You can view the metadata by accessing the .df property as follows:
[9]:
db_handler.df
[9]:
| | Description | Record ID | File path | Contents | Tags | data_type | end_timestamp | content_type | start_timestamp | database_id | sub_record_id | _creation_time | _uuid | _id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Description | sample | /opt/pydtk/test/records/sample/data/records.bag | {'/points_concat_downsampled': {'msg_type': 's... | NaN | raw_data | 1.550126e+09 | application/rosbag | 1.550126e+09 | NaN | NaN | 1.621305e+09 | adca5faea1d2012b809688628c8adcfc | 3ca3cc9cb78011eb8fc30242ac110002 |
| 1 | json file | test | /opt/pydtk/test/records/json_model_test/json_t... | {'test': {'tags': ['test1', 'test2']}} | NaN | test | NaN | application/json | NaN | json datbase | NaN | 1.621305e+09 | 1a2e2cb364f2d4f43d133719c11d1867 | 3ca455d6b78011eb8fc30242ac110002 |
| 2 | Forecast | test | /opt/pydtk/test/records/forecast_model_test/fo... | {'forecast': {'tags': ['test1', 'test2']}} | NaN | forecast | NaN | text/csv | NaN | NaN | NaN | 1.621305e+09 | be7a0ce377de8a4f164dbd019cacb7a2 | 3ca4ba76b78011eb8fc30242ac110002 |
| 3 | Description. | 016_00000000030000000015_1095_01 | /opt/pydtk/test/records/annotation_model_test/... | {'risk_annotation': {'tags': ['risk_score', 's... | NaN | annotation | 1.484629e+09 | text/csv | 1.484629e+09 | NaN | NaN | 1.621305e+09 | 9d78d143650bec29f293f35142f5528c | 3ca5274ab78011eb8fc30242ac110002 |
The .df property is read-only. To access the actual metadata, use .data instead, which returns a list of dicts containing the metadata.
Keep in mind that metadata is stored per file, so each row above corresponds to one file.
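Because metadata is stored per file, several entries can share the same record ID (the two files with record ID test above are an example). The following is a minimal sketch of grouping file paths by record, assuming each entry is a plain dict with record_id and path keys like the ones this example database contains. The helper name group_paths_by_record is hypothetical and not part of the toolkit, and the sample list stands in for real handler output.

```python
from collections import defaultdict

# Sample entries shaped like the per-file metadata in this example database.
# Only the two fields needed for the illustration are kept.
records = [
    {"record_id": "sample",
     "path": "/opt/pydtk/test/records/sample/data/records.bag"},
    {"record_id": "test",
     "path": "/opt/pydtk/test/records/json_model_test/json_test.json"},
    {"record_id": "test",
     "path": "/opt/pydtk/test/records/forecast_model_test/forecast_test.csv"},
]


def group_paths_by_record(entries):
    """Map each record ID to the list of file paths that belong to it."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["record_id"]].append(entry["path"])
    return dict(grouped)


paths_by_record = group_paths_by_record(records)
for record_id, paths in paths_by_record.items():
    print(record_id, len(paths))
```

With real data, the same helper could be applied to the list of dicts that the handler exposes instead of the hand-written sample list.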
1.1.3. Access to metadata
When you use the metadata retrieved from the DB, you will often want to access it as a dict.
In that case, use .data, which returns a list of dicts containing the metadata.
[24]:
db_handler.read()
db_handler.data
[24]:
[{'description': 'Description',
'record_id': 'sample',
'data_type': 'raw_data',
'path': '/opt/pydtk/test/records/sample/data/records.bag',
'start_timestamp': 1550125637.22,
'end_timestamp': 1550125637.53,
'content_type': 'application/rosbag',
'contents': {'/points_concat_downsampled': {'msg_type': 'sensor_msgs/PointCloud2',
'msg_md5sum': '1158d486dd51d683ce2f1be655c3c181',
'count': 4,
'frequency': 10.0,
'tags': ['lidar', 'downsampled']}},
'_uuid': 'adca5faea1d2012b809688628c8adcfc',
'_creation_time': 1621304696.823646,
'_id': '3ca3cc9cb78011eb8fc30242ac110002'},
{'description': 'json file',
'database_id': 'json datbase',
'record_id': 'test',
'data_type': 'test',
'path': '/opt/pydtk/test/records/json_model_test/json_test.json',
'content_type': 'application/json',
'contents': {'test': {'tags': ['test1', 'test2']}},
'_uuid': '1a2e2cb364f2d4f43d133719c11d1867',
'_creation_time': 1621304696.824871,
'_id': '3ca455d6b78011eb8fc30242ac110002'},
{'description': 'Forecast',
'record_id': 'test',
'data_type': 'forecast',
'path': '/opt/pydtk/test/records/forecast_model_test/forecast_test.csv',
'content_type': 'text/csv',
'contents': {'forecast': {'tags': ['test1', 'test2']}},
'_uuid': 'be7a0ce377de8a4f164dbd019cacb7a2',
'_creation_time': 1621304696.8263,
'_id': '3ca4ba76b78011eb8fc30242ac110002'},
{'description': 'Description.',
'record_id': '016_00000000030000000015_1095_01',
'data_type': 'annotation',
'path': '/opt/pydtk/test/records/annotation_model_test/annotation_test.csv',
'start_timestamp': 1484628818.02,
'end_timestamp': 1484628823.98,
'content_type': 'text/csv',
'contents': {'risk_annotation': {'tags': ['risk_score',
'scene_description',
'risk_factor']}},
'_uuid': '9d78d143650bec29f293f35142f5528c',
'_creation_time': 1621304696.827451,
'_id': '3ca5274ab78011eb8fc30242ac110002'}]
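Since the result is an ordinary list of dicts, it can be processed with plain Python. As an illustration, the sketch below collects every tag found under the contents key. It assumes the nesting shown in the output above (contents maps a content name to a dict that may hold a tags list); collect_tags is a hypothetical helper, and the sample list is a trimmed copy of that output rather than live handler data.

```python
def collect_tags(entries):
    """Return the sorted set of tags found under every entry's 'contents'."""
    tags = set()
    for entry in entries:
        # Entries without 'contents', or contents without 'tags', are skipped.
        for content in entry.get("contents", {}).values():
            tags.update(content.get("tags", []))
    return sorted(tags)


# Trimmed copy of two entries from the output above.
metadata = [
    {"record_id": "sample",
     "contents": {"/points_concat_downsampled": {"tags": ["lidar", "downsampled"]}}},
    {"record_id": "test",
     "contents": {"test": {"tags": ["test1", "test2"]}}},
]

all_tags = collect_tags(metadata)
print(all_tags)  # ['downsampled', 'lidar', 'test1', 'test2']
```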
You can also retrieve metadata one by one by treating DBHandler as an iterator.
[26]:
for metadata in db_handler:
    print(metadata)
{'description': 'Description', 'record_id': 'sample', 'data_type': 'raw_data', 'path': '/opt/pydtk/test/records/sample/data/records.bag', 'start_timestamp': 1550125637.22, 'end_timestamp': 1550125637.53, 'content_type': 'application/rosbag', 'contents': {'/points_concat_downsampled': {'msg_type': 'sensor_msgs/PointCloud2', 'msg_md5sum': '1158d486dd51d683ce2f1be655c3c181', 'count': 4, 'frequency': 10.0, 'tags': ['lidar', 'downsampled']}}, '_id': '3ca3cc9cb78011eb8fc30242ac110002'}
{'description': 'json file', 'database_id': 'json datbase', 'record_id': 'test', 'data_type': 'test', 'path': '/opt/pydtk/test/records/json_model_test/json_test.json', 'content_type': 'application/json', 'contents': {'test': {'tags': ['test1', 'test2']}}, '_id': '3ca455d6b78011eb8fc30242ac110002'}
{'description': 'Forecast', 'record_id': 'test', 'data_type': 'forecast', 'path': '/opt/pydtk/test/records/forecast_model_test/forecast_test.csv', 'content_type': 'text/csv', 'contents': {'forecast': {'tags': ['test1', 'test2']}}, '_id': '3ca4ba76b78011eb8fc30242ac110002'}
{'description': 'Description.', 'record_id': '016_00000000030000000015_1095_01', 'data_type': 'annotation', 'path': '/opt/pydtk/test/records/annotation_model_test/annotation_test.csv', 'start_timestamp': 1484628818.02, 'end_timestamp': 1484628823.98, 'content_type': 'text/csv', 'contents': {'risk_annotation': {'tags': ['risk_score', 'scene_description', 'risk_factor']}}, '_id': '3ca5274ab78011eb8fc30242ac110002'}
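Processing entries one at a time is convenient for per-file computations. As a sketch, the snippet below derives a duration from start_timestamp and end_timestamp, assuming both are in seconds and that some entries lack them (as the json and forecast files do above). The helper duration_seconds is hypothetical, and the loop runs over a hand-written list in place of the handler.

```python
def duration_seconds(entry):
    """Return end - start in seconds, or None when a timestamp is missing."""
    start = entry.get("start_timestamp")
    end = entry.get("end_timestamp")
    if start is None or end is None:
        return None
    return end - start


# Trimmed copies of entries printed above; the 'test' entry has no timestamps.
entries = [
    {"record_id": "sample",
     "start_timestamp": 1550125637.22, "end_timestamp": 1550125637.53},
    {"record_id": "test"},
    {"record_id": "016_00000000030000000015_1095_01",
     "start_timestamp": 1484628818.02, "end_timestamp": 1484628823.98},
]

durations = [duration_seconds(entry) for entry in entries]
for entry, duration in zip(entries, durations):
    label = "n/a" if duration is None else f"{duration:.2f} s"
    print(entry["record_id"], label)
```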
1.1.4. Search metadata
When you handle a very large dataset, the metadata contains a huge amount of information, so loading all of it takes a long time.
If you only need a limited scope (e.g. metadata of files tagged ‘camera’ and ‘front’), it is costly to load the whole dataset and then search the loaded dataframe.
Therefore, the toolkit provides a way to execute a query before loading the database, which limits the items to load.
The current DBHandler (V4) supports both DB-native queries and PQL.
[20]:
# Initialize DB-handler
db_handler = DBHandler(
    db_class='meta',
    db_host='./example_db',
    database_id='default',
    base_dir_path='../test',
    read_on_init=False,
    orient='contents'
)
# Filter records by the timestamps
db_handler.read(pql='start_timestamp > 1500000000')
print('# of metadata: {}'.format(len(db_handler.df)))
db_handler.read(pql='start_timestamp > 1500000000 and start_timestamp < 1700000000')
print('# of metadata: {}'.format(len(db_handler.df)))
# Filter records by `record_id` with regular expressions
db_handler.read(pql='record_id == regex("test.*")')
print('# of metadata: {}'.format(len(db_handler.df)))
# Read metadata containing a specific key
db_handler.read(pql='"contents./points_concat_downsampled" == exists(True)')
print('# of metadata: {}'.format(len(db_handler.df)))
# You can also use DB-native queries (Tinymongo is used in this case)
db_handler.read(query={'start_timestamp': {'$gt': 1500000000}})
print('# of metadata: {}'.format(len(db_handler.df)))
db_handler.read(query={'$and': [{'start_timestamp': {'$gt': 1500000000}}, {'start_timestamp': {'$lt': 1700000000}}]})
print('# of metadata: {}'.format(len(db_handler.df)))
db_handler.read(query={'record_id': {'$regex': 'test.*'}})
print('# of metadata: {}'.format(len(db_handler.df)))
db_handler.read(query={'contents./points_concat_downsampled': {'$exists': True}})
print('# of metadata: {}'.format(len(db_handler.df)))
# of metadata: 1
# of metadata: 1
# of metadata: 2
# of metadata: 1
# of metadata: 1
# of metadata: 1
# of metadata: 2
# of metadata: 1
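To make the meaning of these filters concrete, here is a plain-Python sketch that applies equivalent conditions to an in-memory list of entries. It is not the toolkit's query engine: it filters after everything is loaded, which is exactly the cost the queries above avoid, and it only illustrates what each condition selects. The helper names are hypothetical, the sample list is a trimmed copy of the four entries in this example database, and treating the regular expression as an unanchored search is an assumption.

```python
import re

# Trimmed copies of the four entries in the example database.
metadata = [
    {"record_id": "sample", "start_timestamp": 1550125637.22,
     "contents": {"/points_concat_downsampled": {"tags": ["lidar", "downsampled"]}}},
    {"record_id": "test",
     "contents": {"test": {"tags": ["test1", "test2"]}}},
    {"record_id": "test",
     "contents": {"forecast": {"tags": ["test1", "test2"]}}},
    {"record_id": "016_00000000030000000015_1095_01", "start_timestamp": 1484628818.02,
     "contents": {"risk_annotation": {"tags": ["risk_score", "scene_description", "risk_factor"]}}},
]


def started_after(entries, threshold):
    """Keep entries that have a start_timestamp greater than the threshold."""
    return [e for e in entries
            if e.get("start_timestamp") is not None and e["start_timestamp"] > threshold]


def record_id_matches(entries, pattern):
    """Keep entries whose record_id contains a match for the pattern."""
    compiled = re.compile(pattern)
    return [e for e in entries if compiled.search(e["record_id"])]


def has_content(entries, key):
    """Keep entries whose 'contents' dict has the given key."""
    return [e for e in entries if key in e.get("contents", {})]


n_after = len(started_after(metadata, 1500000000))
n_regex = len(record_id_matches(metadata, "test.*"))
n_exists = len(has_content(metadata, "/points_concat_downsampled"))
print(n_after, n_regex, n_exists)  # 1 2 1
```

The three counts agree with the counts printed by the corresponding queries above.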