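So, I'm currently stuck on a bug.
I'm working with a huge dataset that contains the following information:
Multiple samples for many wells, where each sample is labelled with the well's unique well ID number, the radium contamination level, and the sample date.
For example: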
Well ID: AT091
Radium Level: 44.9
Sample Date: 3/18/2015
Well ID: AT091
Radium Level: 50.2
Sample Date: 2/18/2015
Well ID: AT091
Radium Level: 33.7 PCI/L
Sample Date: 7/28/2020
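I've been asked to write a Python script that filters the original dataset and creates a new Excel sheet according to the following rules:
For each well, if the well was sampled only once in a given year, keep that sample. For each well, if the well was sampled multiple times in a given year, keep only the sample with the highest contamination level for that year.
For example, if a well was sampled three times: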
Well ID: AT091
Radium Level: 44.9
Sample Date: 3/18/2015
Well ID: AT091
Radium Level: 50.2
Sample Date: 2/18/2015
Well ID: AT091
Radium Level: 33.7 PCI/L
Sample Date: 7/28/2020
The code should update the spreadsheet with the following:
Well ID: AT091
Radium Level: 50.2
Sample Date: 2/18/2015
Well ID: AT091
Radium Level: 33.7 PCI/L
Sample Date: 7/28/2020
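One way to pin down the expected output above is a small pandas sketch. This is only an illustration under stated assumptions: the column names "Well ID", "Radium Level" and "Sample Date" are taken from the example records, the function name keep_highest_per_well_year is hypothetical, and values such as "33.7 PCI/L" are assumed to be a number followed by an optional unit.

```python
import pandas as pd


def keep_highest_per_well_year(df):
    """Keep, for each (well, year), the row with the highest radium level."""
    out = df.copy()
    # Unparseable dates become NaT and are dropped below.
    out["Sample Date"] = pd.to_datetime(out["Sample Date"], errors="coerce")
    # Assumption: a level may carry a trailing unit such as " PCI/L";
    # pull out the leading number and convert it to float.
    out["Radium Level"] = pd.to_numeric(
        out["Radium Level"].astype(str).str.extract(r"([-+]?\d*\.?\d+)")[0],
        errors="coerce",
    )
    out = out.dropna(subset=["Sample Date", "Radium Level"])
    out["Year"] = out["Sample Date"].dt.year
    # idxmax returns the index label of the highest reading in each group.
    idx = out.groupby(["Well ID", "Year"])["Radium Level"].idxmax()
    return (
        out.loc[idx]
        .drop(columns="Year")
        .sort_values(["Well ID", "Sample Date"])
        .reset_index(drop=True)
    )


example = pd.DataFrame(
    {
        "Well ID": ["AT091", "AT091", "AT091"],
        "Radium Level": [44.9, 50.2, "33.7 PCI/L"],
        "Sample Date": ["3/18/2015", "2/18/2015", "7/28/2020"],
    }
)
result = keep_highest_per_well_year(example)
```

A well sampled once in a year is kept automatically, because its single reading is also the maximum for that (well, year) group.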
Here is the code I wrote:
import pandas as pd

def wells_sampled_once_per_year(well_numbers, formatted_dates, concentration):
    well_count = {}
    max_contamination = {}
    # Count samples and track the highest concentration per (well, year)
    for well, date, conc in zip(well_numbers, formatted_dates, concentration):
        if date is None:
            continue
        try:
            year = pd.to_datetime(date).year
        except AttributeError:
            continue
        well_year = (well, year)
        if well_year in well_count:
            well_count[well_year] += 1
            max_contamination[well_year] = max(max_contamination[well_year], conc)
        else:
            well_count[well_year] = 1
            max_contamination[well_year] = conc
    # Keep only the samples whose (well, year) was seen exactly once
    sampled_once_per_year = [
        (well, date, conc, max_contamination[(well, pd.to_datetime(date).year)])
        for well, date, conc in zip(well_numbers, formatted_dates, concentration)
        if well_count[(well, pd.to_datetime(date).year)] == 1
    ]
    return sorted(sampled_once_per_year)

def wells_sampled_multiple_times_per_year(well_numbers, formatted_dates, concentration):
    well_count = {}
    max_contamination = {}
    for well, date, conc in zip(well_numbers, formatted_dates, concentration):
        if date is None:
            continue
        try:
            year = pd.to_datetime(date).year
        except AttributeError:
            continue
        well_year = (well, year)
        if well_year in well_count:
            well_count[well_year] += 1
            if conc > max_contamination[well_year]:
                max_contamination[well_year] = conc
        else:
            well_count[well_year] = 1
            max_contamination[well_year] = conc
    sampled_multiple_times_per_year = [
        (well, date, conc, max_contamination[(well, pd.to_datetime(date).year)])
        for well, date, conc in zip(well_numbers, formatted_dates, concentration)
        if well_count[(well, pd.to_datetime(date).year)] > 1 and conc == max_contamination[(well, pd.to_datetime(date).year)]
    ]
    # Remove duplicates from the list
    sampled_multiple_times_per_year = list(set(sampled_multiple_times_per_year))
    return sorted(sampled_multiple_times_per_year)
1 Answer
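After the for loop, max_contamination contains almost all the information you need, except the date. To simplify building the return value, I added the date inside the loop, i.e. change the last five lines of the loop so that they also record the date of the highest reading (and sort the result if needed).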
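The answer's own replacement lines are not reproduced here, so the following is only a minimal sketch of the idea it describes: store the date together with the maximum inside the loop. The function name highest_sample_per_well_year is hypothetical, and the sketch assumes the concentrations are already numeric (a value such as "33.7 PCI/L" would need converting first).

```python
import pandas as pd


def highest_sample_per_well_year(well_numbers, formatted_dates, concentration):
    # Maps (well, year) -> (concentration, date) of the highest reading so far.
    max_contamination = {}
    for well, date, conc in zip(well_numbers, formatted_dates, concentration):
        parsed = pd.to_datetime(date, errors="coerce")
        if pd.isna(parsed):
            # Skip missing or unparseable dates.
            continue
        well_year = (well, parsed.year)
        best = max_contamination.get(well_year)
        if best is None or conc > best[0]:
            max_contamination[well_year] = (conc, date)
    # One row per (well, year), ordered by well and then by year.
    return [
        (well, date, conc)
        for (well, _year), (conc, date) in sorted(
            max_contamination.items(), key=lambda item: item[0]
        )
    ]


rows = highest_sample_per_well_year(
    ["AT091", "AT091", "AT091", "AT091"],
    ["3/18/2015", "2/18/2015", "7/28/2020", None],
    [44.9, 50.2, 33.7, 12.0],
)
```

Because a single reading is trivially the maximum for its (well, year), this one pass covers both the sampled-once and sampled-multiple-times cases, so the two separate functions and the set-based de-duplication are no longer needed.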