This question already has an answer here:
Can't use pickled function inside a pytest test (1 answer)
Closed 4 hours ago.
I'm trying to build a simple pipeline with Apache Airflow, running an example locally with Docker. Among other things, the pipeline needs to load a pickled sklearn model and use it to transform a pandas DataFrame. When I load that model and try to use it, I get the most basic of errors:
NameError: name 'pd' is not defined
So the first thing I did was go to the top and import pandas... but pandas is already imported there.
I'm transcribing my script and the relevant environment files here.
A simplified version of my Airflow task:
import dill
import pandas as pd
model_file = 'models/the_model.pkl'
def task_run_model(**context):
    # Load the pre-trained models from the .pkl files
    with open(model_file, 'rb') as f:
        model = dill.load(f)
    # Test models
    file_name = "train_set.csv"
    time_series_df = pd.read_csv(file_name)
    train_features_df = model.transform(time_series_df)
    return train_features_df
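For context, the callable is wired into the DAG with a plain PythonOperator (that is what the airflow/operators/python.py frames in the traceback below come from). A rough sketch of the wiring, with dag_id and task_id taken from the log and everything else assumed:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from my_tasks.transformation import task_run_model

with DAG(
    dag_id="feature_creation",        # from the log
    start_date=datetime(2023, 5, 1),  # assumed
    schedule_interval=None,           # assumed
    catchup=False,
) as dag:
    create_features = PythonOperator(  # task_id from the log
        task_id="create_features",
        python_callable=task_run_model,
    )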
The error's stack trace:
[2023-05-14, 19:21:10 UTC] {taskinstance.py:1847} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/operators/python.py", line 181, in execute
    return_value = self.execute_callable()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/operators/python.py", line 198, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/opt/airflow/dags/my_tasks/transformation.py", line 15, in task_run_model
    train_features_df = model.transform(time_series_df)
  File "/Users/<USER_NAME>/Repos/algorithms/Projects/xxxxxxxxx/model.py", line 19, in transform
NameError: name 'pd' is not defined
[2023-05-14, 19:21:10 UTC] {taskinstance.py:1368} INFO - Marking task as FAILED. dag_id=feature_creation, task_id=create_features, execution_date=20230514T192058, start_date=20230514T192110, end_date=20230514T192110
[2023-05-14, 19:21:10 UTC] {standard_task_runner.py:104} ERROR - Failed to execute job 4 for task create_features (name 'pd' is not defined; 248)
[2023-05-14, 19:21:10 UTC] {local_task_job_runner.py:232} INFO - Task exited with return code 1
[2023-05-14, 19:21:10 UTC] {taskinstance.py:2674} INFO - 0 downstream tasks scheduled from follow-on schedule check
Of course, I don't have access to /Users/<USER_NAME>/Repos/algorithms/Projects/xxxxxxxxx/model.py; all I was given is the .pkl file.
The only thing I know about the environment the model was built in is that it used the following dependencies:
pandas : 1.3.5
numpy : 1.21.6
dateutil : 2.8.2
scipy : 1.10.1
This is my environment:
docker-compose.yml
---
version: '3.4'

x-common:
  &common
  build:
    context: .
    dockerfile: Dockerfile
  user: "${AIRFLOW_UID}:0"
  env_file:
    - .env
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
    - ./models:/opt/airflow/models
    - ./tests:/opt/airflow/tests
    - /var/run/docker.sock:/var/run/docker.sock

x-depends-on:
  &depends-on
  depends_on:
    postgres:
      condition: service_healthy
    airflow-init:
      condition: service_completed_successfully

services:
  postgres:
    image: postgres:13
    container_name: postgres
    ports:
      - "5434:5432"
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5
    env_file:
      - .env

  scheduler:
    <<: *common
    <<: *depends-on
    container_name: pipeline-scheduler
    command: scheduler
    restart: on-failure
    ports:
      - "8793:8793"

  webserver:
    <<: *common
    <<: *depends-on
    container_name: pipeline-webserver
    restart: always
    command: webserver
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 30s
      timeout: 30s
      retries: 5

  airflow-init:
    <<: *common
    container_name: pipeline-init
    entrypoint: /bin/bash
    command:
      - -c
      - |
        mkdir -p /sources/logs /sources/dags /sources/plugins /sources/models
        chown -R "${AIRFLOW_UID}:0" /sources/{logs,dags,plugins,models}
        exec /entrypoint airflow version
Dockerfile
FROM apache/airflow:latest-python3.8
USER root
RUN apt-get update && \
    apt-get clean && \
    apt-get install vim-tiny -y && \
    apt-get autoremove -yqq --purge && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
USER airflow
ENV PYTHONPATH "${PYTHONPATH}:${AIRFLOW_HOME}"
COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt
requirements.txt
pip==22.3.1
scikit-learn==1.1.3
numpy==1.21.6
scipy==1.10.1
pandas==1.3.5
dill==0.3.6
python-dateutil==2.8.2
.env
# Meta-Database
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
POSTGRES_DB=airflow
# Airflow Core
AIRFLOW__CORE__FERNET_KEY=UKMzEm3yIuFYEq1y3-2FxPNWSVwRASpahmQ9kQfEr8E=
AIRFLOW__CORE__EXECUTOR=LocalExecutor
AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=True
AIRFLOW__CORE__LOAD_EXAMPLES=False
AIRFLOW_UID=0
# Backend DB
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow
AIRFLOW__DATABASE__LOAD_DEFAULT_CONNECTIONS=False
# Airflow Init
_AIRFLOW_DB_UPGRADE=True
_AIRFLOW_WWW_USER_CREATE=True
_AIRFLOW_WWW_USER_USERNAME=airflow
_AIRFLOW_WWW_USER_PASSWORD=airflow
To build all of this I'm running:
docker compose up -d
The pickle does work when I run it locally in PyCharm:
import pandas as pd
import dill
model_file = 'models/the_model.pkl'
train_dataset_file = 'datasets/train.csv'
test_dataset_file = 'datasets/test.csv'
# Load the pre-trained model from the .pkl files
with open(model_file, 'rb') as f:
    model = dill.load(f)
# Load the datasets
train_df = pd.read_csv(train_dataset_file)
test_df = pd.read_csv(test_dataset_file)
# Test model
train_features_df: pd.DataFrame = model.transform(train_df)
test_features_df: pd.DataFrame = model.transform(test_df)
print(train_features_df, test_features_df)
Locally I'm on Python 3.8 with pandas==1.5.3 & dill==0.3.6 (and yes, the first thing I tried was upgrading pandas to 1.5.3 in requirements.txt, with the same result).
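For what it's worth, dill's own detect helper shows which globals the pickled transform resolves at call time; run where the load succeeds, something like this should list 'pd' among them (a diagnostic sketch, assuming dill.detect.globalvars handles the bound method):

import dill
import dill.detect

with open('models/the_model.pkl', 'rb') as f:
    model = dill.load(f)

# Names the pickled method looks up in its global namespace;
# 'pd' appearing here is the dependency that breaks in Airflow.
print(dill.detect.globalvars(model.transform))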
1 Answer
This answer explains it: https://stackoverflow.com/a/65318623/10972050
Basically, dill serializes objects that were defined in the __main__ module by value, and when they are unpickled, their methods resolve globals like pd against the __main__ of the loading process. Your model was pickled in a session where pd existed in that namespace; inside the Airflow task runner it doesn't, so model.transform raises NameError no matter what your own file imports.
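Since you only have the .pkl file, the workaround has to be on the load side: plant the missing module into __main__ before unpickling so the lookup succeeds. A minimal sketch (the __main__.pd assignment is the point; the rest matches your task):

import __main__
import dill
import pandas as pd

# dill maps the pickled methods' globals to this process's __main__,
# so planting 'pd' there lets the lookup inside transform succeed.
__main__.pd = pd

model_file = 'models/the_model.pkl'
with open(model_file, 'rb') as f:
    model = dill.load(f)

If whoever exports the model can re-dump it, dill.dump(model, f, recurse=True) (or setting dill.settings['recurse'] = True before dumping) should capture the referenced globals inside the pickle itself, so the loading side no longer needs this patch.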