How to reduce the size of the COCO2017 dataset
Time: 2025-01-04 14:36:53
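### Ways to reduce the size of the COCO2017 dataset
#### Selecting specific categories
Keep only the images and annotations for the categories you care about and drop everything else. This is done by selecting the images that contain at least one of the specified categories and writing an updated JSON annotation file that references only those images[^1].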
```python
import json
from collections import defaultdict


def filter_categories(json_file, output_file, target_categories):
    """Keep only images and annotations that belong to the target categories.

    target_categories maps category name -> COCO category id.
    """
    with open(json_file) as f:
        data = json.load(f)

    target_ids = set(target_categories.values())

    # Group the wanted annotations by image in a single pass, instead of
    # rescanning the full annotation list for every image.
    anns_by_image = defaultdict(list)
    for ann in data['annotations']:
        if ann['category_id'] in target_ids:
            anns_by_image[ann['image_id']].append(ann)

    # Original image and annotation ids are kept, so every
    # annotation['image_id'] still matches an entry in 'images'.
    new_images = [img for img in data['images'] if img['id'] in anns_by_image]
    new_annotations = [ann for img in new_images for ann in anns_by_image[img['id']]]

    filtered_data = {
        'info': data.get('info', {}),
        'licenses': data.get('licenses', []),
        # Reuse the original category entries so 'supercategory' is preserved.
        'categories': [c for c in data['categories'] if c['id'] in target_ids],
        'images': new_images,
        'annotations': new_annotations,
    }
    with open(output_file, 'w') as outfile:
        json.dump(filtered_data, outfile)


# "person" has category id 1 in COCO 2017.
target_categories = {"person": 1}
filter_categories("instances_train2017.json", "filtered_instances_train2017.json", target_categories)
```
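The `target_categories` mapping above needs COCO category ids, which are not contiguous in the 2017 annotations, so hard-coding them is error-prone. Below is a minimal sketch of building the mapping from the `categories` list of an annotation file instead. The helper name `build_target_categories` is hypothetical, and the short in-memory `categories` list is only a stand-in for the list loaded from the real file.

```python
def build_target_categories(categories, names):
    """Map each requested category name to its id.

    `categories` is the 'categories' list from a COCO annotation file.
    Raises KeyError if a requested name is not present.
    """
    name_to_id = {c['name']: c['id'] for c in categories}
    missing = [n for n in names if n not in name_to_id]
    if missing:
        raise KeyError(f"unknown category names: {missing}")
    return {n: name_to_id[n] for n in names}


# Small stand-in for data['categories'] (assumed sample, not the full list).
categories = [
    {'supercategory': 'person', 'id': 1, 'name': 'person'},
    {'supercategory': 'vehicle', 'id': 2, 'name': 'bicycle'},
    {'supercategory': 'vehicle', 'id': 3, 'name': 'car'},
]
target_categories = build_target_categories(categories, ['person', 'car'])
print(target_categories)  # {'person': 1, 'car': 3}
```

The resulting dictionary has the same name-to-id shape that `filter_categories` expects.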
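#### Sampling the data
Randomly sample a fixed fraction of the images to form a new subset. This is useful when the original dataset is too large for the available hardware. Python's `random.sample()` function is enough for the job[^3].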
```python
import json
from random import sample


def downsample_dataset(image_ids, ratio=0.5):
    """Return a random subset containing `ratio` of the given image ids."""
    sampled_num = int(len(image_ids) * ratio)
    return sample(image_ids, sampled_num)


with open('annotations/instances_val2017.json', 'r') as file:
    coco_json = json.load(file)

all_image_ids = [item['id'] for item in coco_json['images']]
# A set makes the membership tests below O(1).
down_sampled_ids = set(downsample_dataset(all_image_ids, ratio=0.1))

# Copy the other keys (info, licenses, categories) unchanged and filter only
# the two lists that refer to images. Filtering every list by 'image_id'
# would raise KeyError on entries that lack that field, such as images.
new_coco_dict = dict(coco_json)
new_coco_dict['images'] = [item for item in coco_json['images'] if item['id'] in down_sampled_ids]
new_coco_dict['annotations'] = [item for item in coco_json['annotations'] if item['image_id'] in down_sampled_ids]

with open('smaller_set.json', 'w') as out_file:
    json.dump(new_coco_dict, out_file)
```
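A reduced JSON file shrinks only the annotations; the disk footprint drops once the images it no longer references are left behind. Below is a minimal sketch of copying just the referenced images into a new folder. It assumes each `images` entry carries a `file_name` field relative to the image directory, as in the COCO 2017 annotation files. The function name `copy_subset_images` is hypothetical.

```python
import json
import shutil
from pathlib import Path


def copy_subset_images(annotation_file, src_dir, dst_dir):
    """Copy the images listed in `annotation_file` from src_dir to dst_dir.

    Returns the file names that were listed but not found in src_dir.
    """
    with open(annotation_file) as f:
        images = json.load(f)['images']

    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)

    missing = []
    for img in images:
        src = src_dir / img['file_name']
        if src.is_file():
            # copy2 also preserves file timestamps.
            shutil.copy2(src, dst_dir / img['file_name'])
        else:
            missing.append(img['file_name'])
    return missing
```

A call might look like `copy_subset_images('smaller_set.json', 'val2017', 'val2017_subset')`, where both directory names are placeholders for your own layout. Returning the missing names, rather than raising, lets a partially downloaded dataset be handled in one pass.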
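#### Compressing the stored files
Package the image folder and its annotation files into a compressed archive such as ZIP or TAR.GZ. Because JPEG data is already compressed, the archive mainly shrinks the JSON annotation files. To reduce the images themselves, re-encode the JPEGs with a lower quality setting, which cuts their size further without a noticeable loss in visual quality[^2].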
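A minimal sketch of the TAR.GZ packaging step, using Python's standard `tarfile` module. The function name `pack_dataset` and the paths in the usage note are hypothetical. JPEG re-encoding is not shown, since it needs a third-party imaging library.

```python
import tarfile
from pathlib import Path


def pack_dataset(paths, archive_path):
    """Write the given files or directories into a gzip-compressed tar archive.

    Returns the size of the archive in bytes.
    """
    archive_path = Path(archive_path)
    with tarfile.open(archive_path, 'w:gz') as tar:
        for p in map(Path, paths):
            # arcname keeps only the last path component inside the archive;
            # directories are added recursively.
            tar.add(p, arcname=p.name)
    return archive_path.stat().st_size
```

For example, `pack_dataset(['images_subset', 'smaller_set.json'], 'coco_subset.tar.gz')` would bundle an image folder and its annotation file into one archive, assuming both paths exist.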