TF 2/Keras text vectorization reshape dimensions

mpbci0fu · asked on 2023-01-17

I was following the TensorFlow tutorial on text classification (link), and the example runs fine, but when I try to apply the same steps to a different dataset I constantly get an error that I'm unable to debug.
In the tutorial they downloaded the data and used a different data loader, so that might be one of the issues; the other thing I suspect is the vectorize_text function, where dimensions get expanded, but I've tried almost everything I can imagine without success. The CSV file I'm using contains 2 columns: one with text data, the other with a multiclass label.
From the error below, it seems that TextVectorization outputs a tensor of shape (batch_size, 250), while the model needs something like (batch_size, 250, 1), I guess?
Below is the code I used:

from sklearn.model_selection import train_test_split
import tensorflow as tf
import re
import numpy as np
import pandas as pd
import string

# load and split data
df = pd.read_csv('train.csv', index_col=[0])
X_train, X_test, y_train, y_test = train_test_split(df[['text']], pd.get_dummies(df['target']).values, test_size=0.2, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=1)

# convert to tf dataset
raw_train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train))
raw_val_ds = tf.data.Dataset.from_tensor_slices((X_val, y_val))
raw_test_ds = tf.data.Dataset.from_tensor_slices((X_test, y_test))

# text cleanup
def custom_standardization(input_data):
  new_line_replace = tf.strings.regex_replace(input_data, '\n', ' ')
  non_alphanum_replace = tf.strings.regex_replace(new_line_replace, '[^a-zA-Z0-9_ ]', '')
  stripped = tf.strings.strip(non_alphanum_replace)
  lowercase = tf.strings.lower(stripped)
  
  return tf.strings.regex_replace(lowercase,
                                  '[%s]' % re.escape(string.punctuation),
                                  '')
# creating layer for text vectorization
max_features = 10000
sequence_length = 250

vectorize_layer = tf.keras.layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length)

train_text = raw_train_ds.map(lambda x, y: x)
vectorize_layer.adapt(train_text)

def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

train_ds = train_ds.cache().prefetch(buffer_size=tf.data.AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=tf.data.AUTOTUNE)
test_ds = test_ds.cache().prefetch(buffer_size=tf.data.AUTOTUNE)

model = tf.keras.Sequential([
  vectorize_layer,
  tf.keras.layers.Embedding(max_features + 1, 16),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.GlobalAveragePooling1D(),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(4)
])

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

epochs = 10
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs)

And this is the error I'm getting:
Epoch 1/100
WARNING:tensorflow:Model was constructed with shape (None,) for input KerasTensor(type_spec=TensorSpec(shape=(None,), dtype=tf.string, name='text_vectorization_2_input'), name='text_vectorization_2_input', description="created by layer 'text_vectorization_2_input'"), but it was called on an input with incompatible shape (None, 250).

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
in
      1 epochs = 100
----> 2 history = model.fit(
      3     train_ds,
      4     validation_data=val_ds,
      5     epochs=epochs)

c:\Users\panto\anaconda3\lib\site-packages\keras\utils\traceback_utils.py in error_handler(*args, **kwargs)
     65   except Exception as e:  # pylint: disable=broad-except
     66     filtered_tb = _process_traceback_frames(e.traceback)
---> 67     raise e.with_traceback(filtered_tb) from None
     68   finally:
     69     del filtered_tb

c:\Users\panto\anaconda3\lib\site-packages\tensorflow\python\framework\func_graph.py in autograph_handler(*args, **kwargs)
   1145   except Exception as e:  # pylint:disable=broad-except
   1146     if hasattr(e, "ag_error_metadata"):
-> 1147       raise e.ag_error_metadata.to_exception(e)
   1148   else:
   1149     raise

ValueError: in user code:
...
When using TextVectorization to tokenize strings, the input rank must be 1 or the last shape dimension must be 1. Received: inputs.shape=(None, 250) with rank=2

Call arguments received:
  • inputs=tf.Tensor(shape=(None, 250), dtype=string)
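
To illustrate the shape contract the error is describing, here is a minimal standalone sketch (toy vocabulary and strings, not my real data). One reading of the traceback is that vectorization is applied twice: the dataset is already mapped through vectorize_text, yet vectorize_layer is also the first layer of the model, so at fit time the layer receives an already-vectorized (None, 250) tensor instead of the rank-1 batch of raw strings it was built for.

import tensorflow as tf

# Toy layer just to show the accepted input shapes; vocabulary and
# strings are made up for illustration.
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=10000, output_mode='int', output_sequence_length=250)
vectorize_layer.adapt(tf.constant(['such a lovely day', 'hello world']))

ok = vectorize_layer(tf.constant(['not so great', 'hello world']))
print(ok.shape)   # (2, 250): a rank-1 batch of raw strings is accepted

ok2 = vectorize_layer(tf.constant([['not so great'], ['hello world']]))
print(ok2.shape)  # (2, 250): rank-2 input is fine when the last dim is 1

# Feeding a (batch, 250) tensor -- e.g. the layer's own output -- trips
# the rank check, which is what happens when the dataset is mapped
# through vectorize_text and the model also starts with vectorize_layer.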

Edit:
A sample of the data looks like this:
| text | target |
| ------------ | ------------ |
| 'such a lovely day' | 'a' |
| 'not so great' | 'b' |
| 'hello world' | 'c' |
...and so on; 4 classes in total.
Instead of using the vectorize_text function, I've moved the expand_dims call into custom_standardization, and it works now:

def custom_standardization(input_data):
  new_line_replace = tf.strings.regex_replace(input_data, '\n', ' ')
  non_alphanum_replace = tf.strings.regex_replace(new_line_replace, '[^a-zA-Z0-9_ ]', '')
  stripped = tf.strings.strip(non_alphanum_replace)
  lowercase = tf.strings.lower(stripped)
  
  return tf.expand_dims(tf.strings.regex_replace(lowercase,
                                  '[%s]' % re.escape(string.punctuation),
                                  ''), -1)

The new issue is that the target doesn't match: I get the error below even though I one-hot encoded the labels.

ValueError: Shapes (4, 1) and (None, 4) are incompatible
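
One possible cause (a guess, assuming the pipeline is otherwise unchanged): from_tensor_slices yields one unbatched example at a time, so each (4,)-shaped one-hot label ends up compared against the model's (None, 4) output. A sketch with an explicit batch step, reusing the raw_*_ds datasets from above (batch size 32 is an arbitrary choice):

batch_size = 32  # arbitrary; the point is that the dataset is batched at all

# Labels now arrive as (batch_size, 4), matching the (None, 4) logits
train_ds = raw_train_ds.batch(batch_size).cache().prefetch(tf.data.AUTOTUNE)
val_ds = raw_val_ds.batch(batch_size).cache().prefetch(tf.data.AUTOTUNE)
test_ds = raw_test_ds.batch(batch_size).cache().prefetch(tf.data.AUTOTUNE)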

zfycwa2u1#

Try changing 'categorical_crossentropy' to 'sparse_categorical_crossentropy', as described here.
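
For what that change implies in code, here is a minimal sketch: sparse_categorical_crossentropy expects integer class indices rather than one-hot vectors, so the pd.get_dummies step would be replaced with integer codes. The cat.codes encoding, the batch size, and from_logits=True (the final Dense(4) has no softmax) are my assumptions, not something from the question.

import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split

df = pd.read_csv('train.csv', index_col=[0])

# Integer class indices ('a' -> 0, 'b' -> 1, ...) instead of pd.get_dummies
y = df['target'].astype('category').cat.codes.values.astype('int64')

X_train, X_test, y_train, y_test = train_test_split(
    df[['text']], y, test_size=0.2, random_state=1)

train_ds = (tf.data.Dataset
            .from_tensor_slices((X_train['text'].values, y_train))
            .batch(32))

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=10000, output_mode='int', output_sequence_length=250)
vectorize_layer.adapt(train_ds.map(lambda text, label: text))

model = tf.keras.Sequential([
    vectorize_layer,
    tf.keras.layers.Embedding(10001, 16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(4),
])

# Sparse loss consumes integer labels directly; from_logits=True because
# the final Dense layer has no softmax activation.
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])

model.fit(train_ds, epochs=10)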
