pandas string indexing

Posted by z18hc3ub on 2023-04-04 in Other

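I'm trying to get an output based on the following process, which is best explained with an example.
For example, for the SMILES string
C(N)(N)CC(N)C, [0,1,2,0,0,1,0]
is the output I'm trying to get.
It counts the branches (denoted by parentheses). For the example above, it counts the first (N) as 1, then the second as 2. Once the count reaches an atom that is not in a branch (that is, outside the parentheses), the count resets. It continues with 0 until counting starts again and is reset again. The problem is that I'm not getting the expected output. Below are my output, the expected output, and my code. Thanks.
Also, I need to make sure that cases like CC(CC(C)) are not indexed incorrectly. It should not count the extra parenthesis, should not reset, and should not keep counting up. That SMILES string should give the output [0 0 1 1 1].
Another example: CC(CCC)CCCC [0 0 1 1 1 0 0 0 0]
For nested parentheses, I will rerun this process and start counting from 1.
This is what I get: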

SMILES                             branch_count
0  C(N)(N)CC(N)C  [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
1            CCC                                [0, 0, 0]
2          C1CC1                          [0, 0, 0, 0, 0]
3      C1CC1(C)C              [0, 0, 0, 0, 0, 0, 1, 0, 0]
4         CC(C)C                       [0, 0, 0, 1, 0, 0]

But the first row should actually be [0, 1, 2, 0, 0, 1, 0], as in the example above.


gwbalxhn1#

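Your loop treats the brackets as characters, so every opening and closing bracket is counted as an atom. You should use .isalpha() to check whether a character is a letter. You should also keep a separate index (mine is n) that decides whether a character gets a number written for it. In your original code the brackets and digits were also replaced with 0/1, which gave you extra atoms that you don't want. Read my comments for extra explanation, and run this code yourself to make sure it is correct (although I have checked it several times).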
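To see where the extra entries come from, here is a small check on the first SMILES string from the question. It is only an illustration, and the variable names are made up for this example.

```python
smile = "C(N)(N)CC(N)C"

# len() counts every character, including '(' and ')'
total_chars = len(smile)

# Counting only letters gives one slot per atom symbol
atom_chars = sum(ch.isalpha() for ch in smile)

print(total_chars, atom_chars)  # prints: 13 7
```

The 13 matches the length of the list in the question's first output row, while 7 is the length of the expected list.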

import pandas as pd

# All changes in function
def get_branch_count(smile):
    # Initialize variables
    n = 0 # Index of the next atom slot; this makes sure only letters are written, so brackets and other non-alphabetical characters are not included
    length_smile = 0
    for char in smile:
        if char.isalpha():
            length_smile += 1
    branch_count = [0] * length_smile
    bracket_count = 0
    bracket_together = 0 # Use this variable for when the brackets are next to each other for less confusing code
    current_count = 0
    # Loop through each character in the smile
    for i, c in enumerate(smile):
        if c == '(':
            bracket_count += 1
        
        # Continue after the IF statement because the letters are now inside of the brackets
        elif bracket_count >= 1 and c.isalpha():
            current_count = bracket_count
            branch_count[n] = current_count
            n += 1
        # This is to check if there are consecutive branches
        elif c == ')':
            # Guard the lookahead so a trailing ')' does not raise an IndexError
            if i + 1 >= len(smile) or smile[i+1] != '(':
                bracket_count = 0
            
            
        # If the character is not surrounded by brackets and if it is alphabetical
        elif c.isalpha() and bracket_count == 0:
            current_count = 0
            branch_count[n] = current_count # Do this inside of each IF statement for the alphabetical chars so that it doesn't include the brackets
            n += 1
            
    return branch_count

def collect_branch_count(smile_list):
    rows = []

    for smile in smile_list:
        branch_count = get_branch_count(smile)
        data = {"branch_count": branch_count}

        row = {"SMILES": smile}
        for key, value in data.items():
            row[key] = value
        rows.append(row)

    df = pd.DataFrame(rows)
    return df

smile_list = ["C(N)(N)CC(N)C", "CCC", "C1CC1", "C1CC1(C)C", "CC(C)C"]
df = collect_branch_count(smile_list)
print(df)

As you can see, I changed a few things:

  • Instead of using branch_count = [0] * len(smile), I changed it to:
```python
# This is to make sure that there are no extra entries (for example for the brackets and other non-alphabetical characters).
length_smile = 0
for char in smile:
    if char.isalpha():
        length_smile += 1
branch_count = [0] * length_smile
```

7eumitmz2#

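Here is my solution.
First, I replace every C1 with C so that each atom is counted as a single letter. Then I count the open brackets. If exactly one bracket is open, a new group starts. When I hit a closing bracket, I check whether the next character is an opening bracket, to detect consecutive groups. If it is not, I reset the counter to 0.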

import pandas as pd

def smile_grouping(s):
    s = s.replace('C1', 'C')
    open_brackets = 0
    group_counter = 0

    res = []
    for i, letter in enumerate(s):
        if letter == '(':
            open_brackets += 1
            if open_brackets == 1:
                group_counter += 1
        elif letter == ')':
            open_brackets -= 1
        else:
            res.append(group_counter)

        if open_brackets == 0:
            if i+1<len(s) and s[i+1] != '(':
                group_counter = 0
    return res

This is the result:

df = pd.DataFrame(
    {'smile':[
        "C(N)(N)CC(N)C",
        "CCC",
        "C1CC1",
        "C1CC1(C)C",
        "CC(C)C",
        "C(N)(N)(N)CC(N)C",
        "C((N)(N)N)CC(N)C",
        "CC(CCC)CCCC",
        "CC(CC(C))"
    ]})
df['branch_count'] = df['smile'].apply(smile_grouping)
>>> df
              smile                 branch_count
0     C(N)(N)CC(N)C        [0, 1, 2, 0, 0, 1, 0]
1               CCC                    [0, 0, 0]
2             C1CC1                    [0, 0, 0]
3         C1CC1(C)C              [0, 0, 0, 1, 0]
4            CC(C)C                 [0, 0, 1, 0]
5  C(N)(N)(N)CC(N)C     [0, 1, 2, 3, 0, 0, 1, 0]
6  C((N)(N)N)CC(N)C     [0, 1, 1, 1, 0, 0, 1, 0]
7       CC(CCC)CCCC  [0, 0, 1, 1, 1, 0, 0, 0, 0]
8         CC(CC(C))              [0, 0, 1, 1, 1]
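The replace('C1', 'C') step only covers ring-closure digits that follow a carbon labelled 1. As a minimal sketch, assuming every ring-closure digit should be ignored, the same loop can skip non-letter characters instead. The name branch_groups is hypothetical and not part of the answer above. Two-letter element symbols such as Cl or Br would still be counted as two letters, and square-bracket atoms are not handled.

```python
def branch_groups(smiles):
    """Label each letter with its top-level branch index (0 = main chain)."""
    depth = 0   # current parenthesis nesting depth
    group = 0   # running index of consecutive top-level branches
    labels = []
    for i, ch in enumerate(smiles):
        if ch == '(':
            depth += 1
            if depth == 1:
                group += 1          # a new top-level branch starts
        elif ch == ')':
            depth -= 1
        elif ch.isalpha():
            labels.append(group)    # digits and other symbols are skipped

        # Back on the main chain and the next character does not open
        # another branch: reset the counter
        if depth == 0 and i + 1 < len(smiles) and smiles[i + 1] != '(':
            group = 0
    return labels


print(branch_groups("C(N)(N)CC(N)C"))  # [0, 1, 2, 0, 0, 1, 0]
print(branch_groups("N1CC1(C)C"))      # [0, 0, 0, 1, 0]
```

It can be applied to a DataFrame column in the same way, for example with df['smile'].apply(branch_groups).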
