pandas / Python: is there a matplotlib function to plot data points in different colors depending on a condition?

vaqhlq81  posted on 2023-03-16 in Python

I am trying to implement anomaly detection using z-scores. I want to plot all data points in blue and only the anomalies in red.

import numpy as np
# random data points to calculate z-score
data = [1, 2, 3, 4, 100]

mean = np.mean(data) 

sd = np.std(data)

threshold = 2

outliers = []

for i in data: 
    z = (i-mean)/sd # calculate z-score
    if abs(z) > threshold:  # identify outliers
        outliers.append(i)

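Can anyone help me use a matplotlib scatter plot to show the normal data [1, 2, 3, 4] in blue and the outlier [100] in red, in the same figure?
Note: I am very, very new to Python, so I would really appreciate your help.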
The expected plot

b09cbbtk  #1

import pandas as pd
import matplotlib.pyplot as plt
time = pd.date_range('2022-01-01', '2022-01-12', freq='D')
data = pd.Series([5, 3, 7, 9, 2, 1, 8, 4, 6, 130, 3, 5])
df = pd.DataFrame({'time': time, 'data': data})

# Calculate the interquartile range (IQR)
q1, q3 = df['data'].quantile([0.25, 0.75])
iqr = q3 - q1

# Define a threshold for outlier detection
outlier_threshold = 1.5 * iqr

# Find the outliers
outliers = df[(df['data'] < q1 - outlier_threshold) | (df['data'] > q3 + outlier_threshold)]

# Plot the data, marking the outliers in red
plt.plot(df['time'], df['data'], color='blue')
plt.plot(outliers['time'], outliers['data'], 'ro')
plt.show()

To draw a scatter plot instead:

import pandas as pd
import matplotlib.pyplot as plt

time = pd.date_range('2022-01-01', '2022-01-12', freq='D')
data = pd.Series([5, 3, 7, 9, 2, 1, 8, 4, 6, 130, 3, 5])
df = pd.DataFrame({'time': time, 'data': data})
q1, q3 = df['data'].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_threshold = 1.5 * iqr
outliers = df[(df['data'] < q1 - outlier_threshold) | (df['data'] > q3 + outlier_threshold)]
plt.scatter(df['time'], df['data'], color='blue')
plt.scatter(outliers['time'], outliers['data'], color='red')
plt.show()

If you want both the line and the points:

import pandas as pd
import matplotlib.pyplot as plt

time = pd.date_range('2022-01-01', '2022-01-12', freq='D')
data = pd.Series([5, 3, 7, 9, 2, 1, 8, 4, 6, 130, 3, 5])
df = pd.DataFrame({'time': time, 'data': data})
q1, q3 = df['data'].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_threshold = 1.5 * iqr
outliers = df[(df['data'] < q1 - outlier_threshold) | (df['data'] > q3 + outlier_threshold)]
plt.scatter(df['time'], df['data'], color='blue')
plt.plot(df['time'], df['data'], color='blue')
plt.scatter(outliers['time'], outliers['data'], color='red')
plt.show()
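
The same IQR rule can also be applied directly to the data from the question. The following is only a minimal sketch under a few assumptions: the helper names `iqr_outlier_mask` and `plot_by_mask` are made up for illustration, the x-axis is simply the position of each value, and every point is drawn once with a per-point color list instead of plotting the outliers a second time.

```python
import pandas as pd
import matplotlib.pyplot as plt


def iqr_outlier_mask(values, k=1.5):
    """Return a boolean array that is True where a value lies outside
    [q1 - k*iqr, q3 + k*iqr]. Hypothetical helper, not a library API."""
    s = pd.Series(values, dtype=float)
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return ((s < q1 - k * iqr) | (s > q3 + k * iqr)).to_numpy()


def plot_by_mask(values, mask):
    """Draw every point once: red where mask is True, blue elsewhere."""
    colors = ["red" if flagged else "blue" for flagged in mask]
    fig, ax = plt.subplots()
    ax.scatter(range(len(values)), values, c=colors)
    return ax


values = [1, 2, 3, 4, 100]  # data from the question
mask = iqr_outlier_mask(values)
```

Calling `plot_by_mask(values, mask)` followed by `plt.show()` should display four blue points and one red point, since only 100 falls outside the IQR fences for this data.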

qyswt5oh  #2

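One way to do this with minimal changes to your existing code looks like the following. Note that I used the median instead of the mean here. Hope this is what you are looking for...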

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # Matplotlib for plotting

# random data points to calculate z-score
data = [1, 2, 3, 4, 100]
# one year-end date per data point (newer pandas versions prefer freq='YE')
date = pd.date_range('2018-01-01', '2023-01-01', freq='Y').tolist()
median = np.median(data)  # median used here instead of the mean
sd = np.std(data)
threshold = 2
outliers = []
for i in data:
    z = (i - median) / sd  # z-score measured from the median
    if abs(z) > threshold:  # identify outliers
        outliers.append(i)

plt.scatter(date, data, marker='o')  # plot all points (default color is blue)
for item in outliers:
    # re-plot each outlier in red on top; data.index finds its first position
    plt.scatter(date[data.index(item)], item, c='red', zorder=5)
plt.xticks(rotation=45)
plt.show()
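
For reference, here is a vectorized sketch of the same idea. It also illustrates why the median matters for this particular data set: with the mean and NumPy's default (population) standard deviation, the z-score of 100 comes out just under 2, so the test `abs(z) > 2` flags nothing. The function name `zscore_mask` and its `center` parameter are illustrative assumptions, not part of any library.

```python
import numpy as np


def zscore_mask(values, threshold=2.0, center="mean"):
    """Boolean mask of the points whose |z| exceeds the threshold.

    center="mean" is the classic z-score; center="median" mirrors the
    variant used in this answer. Both divide by np.std, which is the
    population standard deviation by default.
    """
    arr = np.asarray(values, dtype=float)
    mid = np.median(arr) if center == "median" else np.mean(arr)
    z = (arr - mid) / np.std(arr)
    return np.abs(z) > threshold


data = [1, 2, 3, 4, 100]
mean_mask = zscore_mask(data, center="mean")      # flags nothing for this data
median_mask = zscore_mask(data, center="median")  # flags only the value 100
```

With `arr = np.asarray(data)`, `arr[median_mask]` selects the points to draw in red and `arr[~median_mask]` the points to draw in blue, which avoids looking up positions with `data.index`.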
