NameError: name 'pd' is not defined when loading a pickle file - but pandas is imported [duplicate]

Asked 2023-05-15 by lf5gs5x2

This question already has an answer here:

Can't use pickled function inside a pytest test (1 answer)
Closed 4 hours ago
I'm trying to set up a simple pipeline with Apache Airflow, running an example locally with Docker. Part of what the pipeline has to do is load a pickled sklearn model and use it to transform a pandas DataFrame. When I load that model and try to use it, I get the simplest of errors:

NameError: name 'pd' is not defined

So the first thing I did was go to the top of the file and import pandas... but pandas is already there.
Below I'm transcribing my script and the relevant environment files.
A simplified version of my Airflow task:

import dill
import pandas as pd

model_file = 'models/the_model.pkl'

def task_run_model(**context):

    # Load the pre-trained models from the .pkl files
    with open(model_file, 'rb') as f:
        model = dill.load(f)

    # Test models
    file_name = "train_set.csv"
    time_series_df = pd.read_csv(file_name)
    train_features_df = model.transform(time_series_df)

    return train_features_df

The error from the stack trace:

[2023-05-14, 19:21:10 UTC] {taskinstance.py:1847} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/operators/python.py", line 181, in execute
    return_value = self.execute_callable()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/operators/python.py", line 198, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/opt/airflow/dags/my_tasks/transformation.py", line 15, in task_run_model
    train_features_df = model.transform(time_series_df)
  File "/Users/<USER_NAME>/Repos/algorithms/Projects/xxxxxxxxx/model.py", line 19, in transform
NameError: name 'pd' is not defined
[2023-05-14, 19:21:10 UTC] {taskinstance.py:1368} INFO - Marking task as FAILED. dag_id=feature_creation, task_id=create_features, execution_date=20230514T192058, start_date=20230514T192110, end_date=20230514T192110
[2023-05-14, 19:21:10 UTC] {standard_task_runner.py:104} ERROR - Failed to execute job 4 for task create_features (name 'pd' is not defined; 248)
[2023-05-14, 19:21:10 UTC] {local_task_job_runner.py:232} INFO - Task exited with return code 1
[2023-05-14, 19:21:10 UTC] {taskinstance.py:2674} INFO - 0 downstream tasks scheduled from follow-on schedule check

Of course I don't have access to /Users/<USER_NAME>/Repos/algorithms/Projects/xxxxxxxxx/model.py; all I was given is the .pkl file.
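For what it's worth, one way to peek at what the serialized object references without having the original source is to disassemble the raw pickle stream. This is just an exploratory sketch using the standard-library pickletools (the path is the model file from my task):

import pickletools

# Dump the pickle opcodes; the global references in the disassembly show
# which module-level names (e.g. 'pd') the serialized transform refers to.
with open('models/the_model.pkl', 'rb') as f:
    pickletools.dis(f.read())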

The only thing I know about the environment the model was built in is that it used these dependencies:

pandas           : 1.3.5
numpy            : 1.21.6
dateutil         : 2.8.2
scipy            : 1.10.1

This is my environment:
docker-compose.yml

---
version: '3.4'

x-common:
  &common
  build:
    context: .
    dockerfile: Dockerfile
  user: "${AIRFLOW_UID}:0"
  env_file: 
    - .env
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
    - ./models:/opt/airflow/models
    - ./tests:/opt/airflow/tests
    - /var/run/docker.sock:/var/run/docker.sock

x-depends-on:
  &depends-on
  depends_on:
    postgres:
      condition: service_healthy
    airflow-init:
      condition: service_completed_successfully

services:
  postgres:
    image: postgres:13
    container_name: postgres
    ports:
      - "5434:5432"
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5
    env_file:
      - .env

  scheduler:
    <<: *common
    <<: *depends-on
    container_name: pipeline-scheduler
    command: scheduler
    restart: on-failure
    ports:
      - "8793:8793"

  webserver:
    <<: *common
    <<: *depends-on
    container_name: pipeline-webserver
    restart: always
    command: webserver
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 30s
      timeout: 30s
      retries: 5
  
  airflow-init:
    <<: *common
    container_name: pipeline-init
    entrypoint: /bin/bash
    command:
      - -c
      - |
        mkdir -p /sources/logs /sources/dags /sources/plugins /sources/models
        chown -R "${AIRFLOW_UID}:0" /sources/{logs,dags,plugins,models}
        exec /entrypoint airflow version

Dockerfile

FROM apache/airflow:latest-python3.8
USER root
RUN apt-get update && \
    apt-get clean && \
    apt-get install vim-tiny -y && \
    apt-get autoremove -yqq --purge && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
USER airflow
ENV PYTHONPATH "${PYTHONPATH}:${AIRFLOW_HOME}"
COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt

requirements.txt

pip==22.3.1
scikit-learn==1.1.3
numpy==1.21.6
scipy==1.10.1
pandas==1.3.5
dill==0.3.6
python-dateutil==2.8.2

.env

# Meta-Database
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
POSTGRES_DB=airflow

# Airflow Core
AIRFLOW__CORE__FERNET_KEY=UKMzEm3yIuFYEq1y3-2FxPNWSVwRASpahmQ9kQfEr8E=
AIRFLOW__CORE__EXECUTOR=LocalExecutor
AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=True
AIRFLOW__CORE__LOAD_EXAMPLES=False
AIRFLOW_UID=0

# Backend DB
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow
AIRFLOW__DATABASE__LOAD_DEFAULT_CONNECTIONS=False

# Airflow Init
_AIRFLOW_DB_UPGRADE=True
_AIRFLOW_WWW_USER_CREATE=True
_AIRFLOW_WWW_USER_USERNAME=airflow
_AIRFLOW_WWW_USER_PASSWORD=airflow

To build all of this I'm running:

docker compose up -d

The pickle does work locally if I run it from PyCharm:

import pandas as pd
import dill

model_file = 'models/the_model.pkl'

train_dataset_file = 'datasets/train.csv'
test_dataset_file = 'datasets/test.csv'

# Load the pre-trained model from the .pkl files
with open(model_file, 'rb') as f:
    model = dill.load(f)

# Load the datasets
train_df = pd.read_csv(train_dataset_file)
test_df = pd.read_csv(test_dataset_file)

# Test model
train_features_df: pd.DataFrame = model.transform(train_df)
test_features_df: pd.DataFrame = model.transform(test_df)

print(train_features_df, test_features_df)

Locally I'm on Python 3.8 with pandas==1.5.3 & dill==0.3.6 (and yes, the first thing I tried was upgrading pandas to 1.5.3 in requirements.txt, but I get the same error).

Answer from tjjdgumg:

https://stackoverflow.com/a/65318623/10972050 <- answer
Basically, the pickled function resolves pd through the __main__ module of the process that loads it, so inject pandas there before loading:

import dill
import pandas as pd
import __main__

# Make pandas visible in __main__ under the name the pickled function expects
__main__.pd = pd

with open('pandize.pkl', 'rb') as f:
    p = dill.load(f)

p(1)
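Applied to the task from the question, it would look roughly like this (an untested sketch; the essential part is assigning pd onto __main__ before dill.load):

import dill
import pandas as pd
import __main__

model_file = 'models/the_model.pkl'

def task_run_model(**context):
    # The pickled transform looks 'pd' up through __main__, so expose
    # pandas there before deserializing the model.
    __main__.pd = pd

    with open(model_file, 'rb') as f:
        model = dill.load(f)

    time_series_df = pd.read_csv("train_set.csv")
    return model.transform(time_series_df)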
