[튜토리얼7] 이미지 데이터 불러오기

이번 튜토리얼에서는 tf.data를 이용해서 꽃 이미지를 불러와 사용해보겠습니다.

이 예시에서 사용된 데이터셋은 이미지의 디렉토리로 배포되며, 디렉토리당 하나의 이미지 클래스로 구성되어있습니다.

import warnings
warnings.simplefilter('ignore')

import tensorflow as tf

import numpy as np
import matplotlib.pyplot as plt
import os

tf.__version__

train_data_gen = image_generator.flow_from_directory(directory=str(data_dir),
                                                     batch_size=BATCH_SIZE,
                                                     shuffle=True,
                                                     target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                     classes = list(CLASS_NAMES))

배치(batch)를 확인합니다.

def show_batch(image_batch, label_batch):
    plt.figure(figsize=(10,10))
    for n in range(25):
        ax = plt.subplot(5,5,n+1)
        plt.imshow(image_batch[n])
        plt.title(CLASS_NAMES[label_batch[n]==1][0].title())
        plt.axis('off')

image_batch, label_batch = next(train_data_gen)
show_batch(image_batch, label_batch)

2.2 `tf.data`로 이미지 불러오기

위의 keras.preprocessing 방법은 편리하지만 다음과 같은 세 가지 단점이 있습니다.

성능을 확인해보면 느립니다.
세밀한 조정이 힘듭니다.
다른 텐서플로우 데이터 타입과 잘 통합되지 않습니다.

파일들을 tf.data.Dataset으로 불러오려면 먼저 파일 경로들의 데이터셋을 만들어야 합니다.

list_ds = tf.data.Dataset.list_files(str(data_dir/'*/*'))

for f in list_ds.take(5):
    print(f.numpy())

파일 경로를 통해 (image_data, label) 쌍으로 변환하는 텐서플로우 함수를 만듭니다.

def get_label(file_path):
  # 경로를 경로 구성요소 목록으로 변환합니다
  parts = tf.strings.split(file_path, os.path.sep)
  # 끝에서 두 번째 요소는 클래스 디렉터리입니다.
  return parts[-2] == CLASS_NAMES

def decode_img(img):
  # 압축된 문자열을 3D uint8 텐서로 변환합니다
  img = tf.image.decode_jpeg(img, channels=3)
  # `convert_image_dtype`은0~1 사이의 float 값으로 변환해줍니다.
  img = tf.image.convert_image_dtype(img, tf.float32)
  # 이미지를 원하는 크기로 조정합니다.
  return tf.image.resize(img, [IMG_WIDTH, IMG_HEIGHT])

def process_path(file_path):
    label = get_label(file_path)
    # 파일에서 raw 데이터를 문자열로 불러옵니다
    img = tf.io.read_file(file_path)
    img = decode_img(img)
    return img, label

Dataset.map을 이용해서 image, label 쌍으로 이루어진 데이터셋을 만듭니다.

AUTOTUNE = tf.data.experimental.AUTOTUNE

# 여러 이미지를 병렬로 불러오고 처리하도록 `num_parallel_calls` 설정합니다
labeled_ds = list_ds.map(process_path, num_parallel_calls=AUTOTUNE)

for image, label in labeled_ds.take(1):
    print("Image shape: ", image.numpy().shape)
    print("Label: ", label.numpy())

3. 기본적인 학습 방법

이 데이터 세트로 모델을 교육하려면 다음과 같은 데이터를 사용하려고 할 것입니다:

잘 섞인 데이터
배치(batch)가 이루어진 데이터
가능한 한 빨리 사용할 수 있는 배치 데이터

tf.data api를 이용하면 손쉽게 위 특성들을 가진 데이터를 사용할 수 있습니다.

def prepare_for_training(ds, cache=True, shuffle_buffer_size=1000):
    # 이는 데이터를 한 번만 불러오고, 이를 메모리에 저장하는 작은 데이터셋입니다.
    # 메모리에 맞지 않는 데이터셋의 전처리 작업을 캐시하려면 `.cache(filename)`를 사용하세요.
    if cache:
        if isinstance(cache, str):
            ds = ds.cache(cache)
        else:
            ds = ds.cache()

    ds = ds.shuffle(buffer_size=shuffle_buffer_size)

    # 계속 반복합니다.
    ds = ds.repeat()

    ds = ds.batch(BATCH_SIZE)

    # `prefetch`는 모델을 훈련하는 동안 
    # 데이터셋이 백그라운드에서 배치들을 가져올 수 있도록 해줍니다.
    ds = ds.prefetch(buffer_size=AUTOTUNE)

    return ds

train_ds = prepare_for_training(labeled_ds)

image_batch, label_batch = next(iter(train_ds))

show_batch(image_batch.numpy(), label_batch.numpy())

4. 성능 비교

불러온 데이터셋의 성능을 확인해봅시다:

import time
default_timeit_steps = 1000

def timeit(ds, steps=default_timeit_steps):
    start = time.time()
    it = iter(ds)
    for i in range(steps):
        batch = next(it)
        if i%10 == 0:
            print('.',end='')
    print()
    end = time.time()

    duration = end-start
    print("{} batches: {} s".format(steps, duration))
    print("{:0.5f} Images/s".format(BATCH_SIZE*steps/duration))

두 데이터 생성기의 속도를 비교해 봅시다.

# `keras.preprocessing`
timeit(train_data_gen)

# `tf.data`
timeit(train_ds)

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

[튜토리얼7] 이미지 데이터 불러오기

목차

1. 이미지 검색

2. 이미지 불러오기

2.1 `keras.preprocessing`으로 이미지 불러오기

2.2 `tf.data`로 이미지 불러오기

3. 기본적인 학습 방법

4. 성능 비교

Copyright 2019 The TensorFlow Authors.

[튜토리얼7] 이미지 데이터 불러오기

목차

1. 이미지 검색

2. 이미지 불러오기

2.1 keras.preprocessing으로 이미지 불러오기

2.2 tf.data로 이미지 불러오기

3. 기본적인 학습 방법

4. 성능 비교

Copyright 2019 The TensorFlow Authors.

2.1 `keras.preprocessing`으로 이미지 불러오기

2.2 `tf.data`로 이미지 불러오기