# Load and preprocess images
This tutorial shows how to load and preprocess an image dataset in three ways. First, you will use high-level Keras preprocessing utilities and layers to read a directory of images on disk. Next, you will write your own input pipeline from scratch using tf.data. Finally, you will download a dataset from the large catalog available in TensorFlow Datasets.
```
import numpy as np
import os
import PIL
import PIL.Image
import tensorflow as tf
import tensorflow_datasets as tfds
```
## Download the flowers dataset
This tutorial uses a dataset of several thousand photos of flowers. The flowers dataset contains 5 sub-directories, one per class:
```
flowers_photos/
daisy/
dandelion/
roses/
sunflowers/
tulips/
```
```
import pathlib
dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file(origin=dataset_url,
fname='flower_photos',
untar=True)
data_dir = pathlib.Path(data_dir)
# After downloading (218MB), you should now have a copy of the flower photos available. There are 3,670 total images:
image_count = len(list(data_dir.glob('*/*.jpg')))
print(image_count)
```
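Each class sub-directory contains JPEG images of that flower. You can open one of them, for example from the `roses` directory, to take a quick look:
```
roses = list(data_dir.glob('roses/*'))
PIL.Image.open(str(roses[0]))
```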
## Load using tf.keras.preprocessing
Let's load these images off disk using tf.keras.preprocessing.image_dataset_from_directory.
### Create a dataset
Define some parameters for the loader:
```
batch_size = 32
img_height = 180
img_width = 180
```
It's good practice to use a validation split when developing your model. You will use 80% of the images for training and 20% for validation.
```
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
data_dir,
validation_split=0.2,
subset="training",
seed=123,
image_size=(img_height, img_width),
batch_size=batch_size)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
data_dir,
validation_split=0.2,
subset="validation",
seed=123,
image_size=(img_height, img_width),
batch_size=batch_size)
```
You can find the class names in the `class_names` attribute on these datasets.
```
class_names = train_ds.class_names
print(class_names)
```
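As a quick check, you can retrieve one batch and inspect its shape: the `image_batch` is a tensor of shape `(32, 180, 180, 3)` (a batch of 32 images of `180x180x3`), and the `labels_batch` is a tensor of shape `(32,)` with one integer label per image:
```
for image_batch, labels_batch in train_ds:
  print(image_batch.shape)   # (32, 180, 180, 3)
  print(labels_batch.shape)  # (32,)
  break
```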
### Standardize the data
The RGB channel values are in the [0, 255] range. This is not ideal for a neural network; in general you should seek to make your input values small. Here, you will standardize values to be in the [0, 1] range by using the tf.keras.layers.experimental.preprocessing.Rescaling layer.
```
normalization_layer = tf.keras.layers.experimental.preprocessing.Rescaling(1./255)
```
There are two ways to use this layer. The first is to apply it to the dataset by calling `Dataset.map`:
```
normalized_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))
image_batch, labels_batch = next(iter(normalized_ds))
first_image = image_batch[0]
# Notice the pixel values are now in `[0, 1]`.
print(np.min(first_image), np.max(first_image))
```
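The second option is to include the layer inside your model definition, which simplifies deployment because the rescaling then ships with the model. The model trained in the next section takes this approach; a minimal sketch:
```
# Sketch: rescaling applied as the first layer of the model instead of via Dataset.map.
model_sketch = tf.keras.Sequential([
  tf.keras.layers.experimental.preprocessing.Rescaling(1./255),
  tf.keras.layers.Conv2D(32, 3, activation='relu'),
  # ... remaining layers of the model
])
```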
### Configure the dataset for performance
Let's make sure to use buffered prefetching so you can yield data from disk without having I/O become blocking. These are two important methods you should use when loading data:
- `.cache()` keeps the images in memory after they're loaded off disk during the first epoch. This will ensure the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache.
- `.prefetch()` overlaps data preprocessing and model execution while training.
Interested readers can learn more about both methods, as well as how to cache data to disk, in the data performance guide.
```
AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
```
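If the dataset is too large to cache in memory, `.cache()` can instead be given a file path to create the on-disk cache mentioned above. A minimal sketch (the file name is illustrative, and this variant is not used in the rest of the tutorial):
```
# On-disk cache: elements are written to this file on the first pass over the data
# and read back from it on later passes. The path is just an example.
train_ds_on_disk = train_ds.cache('/tmp/flower_photos_train.cache')
```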
### Train a model
For completeness, this section shows how to train a simple model using the datasets you have just prepared. This model has not been tuned in any way; the goal is to show the mechanics using the datasets you just created. To learn more about image classification, visit the image classification tutorial.
```
num_classes = 5
model = tf.keras.Sequential([
tf.keras.layers.experimental.preprocessing.Rescaling(1./255),
tf.keras.layers.Conv2D(32, 3, activation='relu'),
tf.keras.layers.MaxPooling2D(),
tf.keras.layers.Conv2D(32, 3, activation='relu'),
tf.keras.layers.MaxPooling2D(),
tf.keras.layers.Conv2D(32, 3, activation='relu'),
tf.keras.layers.MaxPooling2D(),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dense(num_classes)
])
```
```
model.compile(
optimizer='adam',
loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy'])
```
```
model.fit(
train_ds,
validation_data=val_ds,
epochs=3
)
```
You may notice the validation accuracy is low compared to the training accuracy, indicating your model is overfitting. You can learn more about overfitting and how to reduce it in this tutorial.
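Common ways to reduce overfitting include dropout and random data augmentation. As a minimal, untuned sketch (the specific layers and rates here are just one possible choice, not the tutorial's prescription), the model above could be extended like this:
```
# Sketch: the same architecture with a simple augmentation layer and dropout added.
model_regularized = tf.keras.Sequential([
  tf.keras.layers.experimental.preprocessing.Rescaling(1./255),
  tf.keras.layers.experimental.preprocessing.RandomFlip('horizontal'),  # random horizontal flips during training
  tf.keras.layers.Conv2D(32, 3, activation='relu'),
  tf.keras.layers.MaxPooling2D(),
  tf.keras.layers.Conv2D(32, 3, activation='relu'),
  tf.keras.layers.MaxPooling2D(),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dropout(0.2),  # randomly zeroes 20% of activations during training
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(num_classes)
])
```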
## Using tf.data for finer control
The above tf.keras.preprocessing utilities are a convenient way to create a tf.data.Dataset from a directory of images. For finer grain control, you can write your own input pipeline using tf.data. This section shows how to do just that, beginning with the file paths from the TGZ file you downloaded earlier.
```
list_ds = tf.data.Dataset.list_files(str(data_dir/'*/*'), shuffle=False)
list_ds = list_ds.shuffle(image_count, reshuffle_each_iteration=False)
```
```
class_names = np.array(sorted([item.name for item in data_dir.glob('*') if item.name != "LICENSE.txt"]))
print(class_names)
```
Split the dataset into train and validation:
```
val_size = int(image_count * 0.2)
train_ds = list_ds.skip(val_size)
val_ds = list_ds.take(val_size)
```
You can see the length of each dataset as follows:
```
print(tf.data.experimental.cardinality(train_ds).numpy())
print(tf.data.experimental.cardinality(val_ds).numpy())
```
Write a short function that converts a file path to an `(img, label)` pair:
```
def get_label(file_path):
  # Convert the path to a list of path components.
  parts = tf.strings.split(file_path, os.path.sep)
  # The second to last component is the class directory.
  one_hot = parts[-2] == class_names
  # Integer encode the label.
  return tf.argmax(one_hot)
```
```
def decode_img(img):
  # Convert the compressed string to a 3D uint8 tensor.
  img = tf.io.decode_jpeg(img, channels=3)
  # Resize the image to the desired size.
  return tf.image.resize(img, [img_height, img_width])
```
```
def process_path(file_path):
  label = get_label(file_path)
  # Load the raw data from the file as a string.
  img = tf.io.read_file(file_path)
  img = decode_img(img)
  return img, label
```
Use `Dataset.map` to create a dataset of `(image, label)` pairs:
```
# Set `num_parallel_calls` so multiple images are loaded/processed in parallel.
train_ds = train_ds.map(process_path, num_parallel_calls=AUTOTUNE)
val_ds = val_ds.map(process_path, num_parallel_calls=AUTOTUNE)
```
```
for image, label in train_ds.take(1):
  print("Image shape: ", image.numpy().shape)
  print("Label: ", label.numpy())
```
### Configure dataset for performance
To train a model with this dataset, you will want the data:
- To be well shuffled.
- To be batched.
- Batches to be available as soon as possible.
These features can be added using the tf.data API. For more details, see the Input Pipeline Performance guide.
```
def configure_for_performance(ds):
  ds = ds.cache()
  ds = ds.shuffle(buffer_size=1000)
  ds = ds.batch(batch_size)
  ds = ds.prefetch(buffer_size=AUTOTUNE)
  return ds

train_ds = configure_for_performance(train_ds)
val_ds = configure_for_performance(val_ds)
```
### Continue training the model
You have now manually built a tf.data.Dataset similar to the one created by keras.preprocessing above. You can continue training the model with it. As before, you will train for just a few epochs to keep the running time short.
```
model.fit(
train_ds,
validation_data=val_ds,
epochs=3
)
```
## Using TensorFlow Datasets
So far, this tutorial has focused on loading data off disk. You can also find a dataset to use by exploring the large catalog of easy-to-download datasets at TensorFlow Datasets. As you have previously loaded the Flowers dataset off disk, let's see how to import it with TensorFlow Datasets.
```
(train_ds, val_ds, test_ds), metadata = tfds.load(
'tf_flowers',
split=['train[:80%]', 'train[80%:90%]', 'train[90%:]'],
with_info=True,
as_supervised=True,
)
```
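With `as_supervised=True`, each element of these datasets is an `(image, label)` pair, and the `metadata` object can convert the integer label back into a class name:
```
get_label_name = metadata.features['label'].int2str

image, label = next(iter(train_ds))
print(image.shape)            # the raw images come in varying sizes
print(get_label_name(label))  # e.g. 'dandelion'
```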
The flowers dataset has five classes.
```
num_classes = metadata.features['label'].num_classes
print(num_classes)
```
As before, remember to batch, shuffle, and configure each dataset for performance.
```
train_ds = configure_for_performance(train_ds)
val_ds = configure_for_performance(val_ds)
test_ds = configure_for_performance(test_ds)
```
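If you also want to feed these splits to the model from earlier, keep in mind that the raw `tf_flowers` images vary in size, so a resize step has to come before batching. A minimal sketch of that ordering, applied to a freshly loaded split because the datasets above are already batched:
```
# Sketch: resize first, then batch, so variable-sized images can be stacked into batches.
resize = lambda image, label: (tf.image.resize(image, [img_height, img_width]), label)

train_ds_resized = tfds.load('tf_flowers', split='train[:80%]', as_supervised=True)
train_ds_resized = configure_for_performance(
    train_ds_resized.map(resize, num_parallel_calls=AUTOTUNE))
```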
You can find a complete example of working with the flowers dataset and TensorFlow Datasets by visiting the Data augmentation tutorial.
## Next steps
This tutorial showed two ways of loading images off disk. First, you learned how to load and preprocess an image dataset using Keras preprocessing layers and utilities. Next, you learned how to write an input pipeline from scratch using tf.data. Finally, you learned how to download a dataset from TensorFlow Datasets. As a next step, you can learn how to add data augmentation by visiting this tutorial. To learn more about tf.data, you can visit the tf.data: Build TensorFlow input pipelines guide.
Code link: https://codechina.csdn.net/csdn_codechina/enterprise_technology/-/blob/master/load_preprocess_images.ipynb
## tf.data.Dataset.as_numpy_iterator
Returns an iterator which converts all elements of the dataset to numpy.
```
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for element in dataset.as_numpy_iterator():
  print(element)
```
## tf.data.Dataset
The `tf.data.Dataset` API supports writing descriptive and efficient input pipelines. `Dataset` usage follows a common pattern:
1. Create a source dataset from your input data.
2. Apply dataset transformations to preprocess the data.
3. Iterate over the dataset and process the elements.
Iteration happens in a streaming fashion, so the full dataset does not need to fit into memory.
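For example, a very large range dataset can be iterated without ever materializing all of its elements:
```
# Elements are produced on demand, so this dataset is never fully held in memory.
big_dataset = tf.data.Dataset.range(10**9)
print(next(iter(big_dataset)).numpy())  # 0
```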
The simplest way to create a dataset is from a Python list:
```
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for element in dataset:
  print(element)
```
Once you have a dataset, you can apply transformations to prepare the data for your model:
```
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
dataset = dataset.map(lambda x: x*2)
list(dataset.as_numpy_iterator())
```
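Transformations can also be chained, which is how a typical pipeline combines the three steps above:
```
# Source -> transformations -> iteration, chained into one pipeline.
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5])
dataset = dataset.map(lambda x: x * 2).shuffle(5).batch(2)
for batch in dataset:
  print(batch.numpy())
```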