pandas: Indexing of strings

Posted by z18hc3ub on 2023-04-04 in Other

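I am trying to produce an output based on the following process, which is best explained with an example.
For example, for the SMILES
C(N)(N)CC(N)C, [0, 1, 2, 0, 0, 1, 0]
is the output I am trying to get.
It counts the branches (denoted by parentheses). So for the example above, it counts the first (N) as 1, then the second as 2. Once the count reaches an unbranched atom (one not inside parentheses), the count resets. It then continues with 0s until counting starts and resets again. The problem is that I am not getting the expected output. Below are my output, the expected output, and my code. Thanks.
Also, I need to make sure that cases like CC(CC(C)) are not indexed incorrectly. It should not count extra entries, should not reset, and should not keep incrementing. That SMILES should give [0 0 1 1 1].
Another example: CC(CCC)CCCC [0 0 1 1 1 0 0 0 0]
For nested parentheses, I will rerun this process and start counting from 1.
I get this: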

SMILES                             branch_count
0  C(N)(N)CC(N)C  [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
1            CCC                                [0, 0, 0]
2          C1CC1                          [0, 0, 0, 0, 0]
3      C1CC1(C)C              [0, 0, 0, 0, 0, 0, 1, 0, 0]
4         CC(C)C                       [0, 0, 0, 1, 0, 0]

But the first row, for example, should actually be:
[0, 1, 2, 0, 0, 1, 0]

pandas

Source: https://stackoverflow.com/questions/75910879/indexing-of-strings

2 Answers

gwbalxhn1#

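The loop treats the parentheses as characters, so your code counts every opening and closing parenthesis as an atom. You should use .isalpha() to check whether a character is a letter. You should also keep a separate counter (mine is n) that tracks which characters get replaced by a number. In your faulty code, for example, parentheses and digits were also replaced with 0/1, which means you ended up with extra atoms that you did not want. Read my comments for further explanation, and run the code below in your own environment to make sure it is correct (although I have checked it several times).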
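
A small sketch of why the list ends up too long when it is sized with len(): parentheses are characters too, so each one gets its own slot. The variable names below are illustrative and are not taken from the original code.

```python
smiles = "C(N)(N)CC(N)C"

# len() counts every character, including the parentheses
total_chars = len(smiles)

# Counting only alphabetic characters gives one slot per atom letter
atom_letters = sum(ch.isalpha() for ch in smiles)

print(total_chars, atom_letters)  # 13 7
```

The 13 matches the length of the first list in the question's output, while the expected list has 7 entries.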

import pandas as pd

# All changes in function
def get_branch_count(smile):
    # Initialize variables
    n = 0  # Index of the next atom slot in branch_count, so brackets and digits are not counted
    length_smile = 0
    for char in smile:
        if char.isalpha():
            length_smile += 1
    branch_count = [0] * length_smile
    bracket_count = 0
    bracket_together = 0 # Use this variable for when the brackets are next to each other for less confusing code
    current_count = 0
    # Loop through each character in the smile
    for i, c in enumerate(smile):
        if c == '(':
            bracket_count += 1
        
        # Continue after the IF statement because the letters are now inside of the brackets
        elif bracket_count >= 1 and c.isalpha():
            current_count = bracket_count
            branch_count[n] = current_count
            n += 1
        # This is to check if there are consecutive branches
        elif c ==')':
            # Guard against an IndexError when ')' is the last character
            if i + 1 >= len(smile) or smile[i + 1] != '(':
                bracket_count = 0
            
            
        # If the character is not surrounded by brackets and if it is alphabetical
        elif c.isalpha() and bracket_count == 0:
            current_count = 0
            branch_count[n] = current_count # Do this inside of each IF statement for the alphabetical chars so that it doesn't include the brackets
            n += 1
            
    return branch_count

def collect_branch_count(smile_list):
    rows = []

    for smile in smile_list:
        branch_count = get_branch_count(smile)
        data = {"branch_count": branch_count}

        row = {"SMILES": smile}
        for key, value in data.items():
            row[key] = value
        rows.append(row)

    df = pd.DataFrame(rows)
    return df

smile_list = ["C(N)(N)CC(N)C", "CCC", "C1CC1", "C1CC1(C)C", "CC(C)C"]
df = collect_branch_count(smile_list)
print(df)

As you can see, I changed a few things:

Instead of doing branch_count = [0] * len(smile), I changed it to:

```python
# This makes sure there are no extra entries (for example, for the brackets and other non-alphabetical characters).
length_smile = 0
for char in smile:
    if char.isalpha():
        length_smile += 1
branch_count = [0] * length_smile
```

Answered 2023-04-04

7eumitmz2#

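Here is my solution.
First, I replace every C1 with C so that each atom is counted as a single letter. Then I count the opening parentheses. When the number of open parentheses becomes exactly one, a new group starts. When a closing parenthesis brings that number back to zero, I check whether the next character is an opening parenthesis, to detect consecutive groups. If it is not, I reset the counter to 0.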

import pandas as pd

def smile_grouping(s):
    s = s.replace('C1', 'C')
    open_brackets = 0
    group_counter = 0

    res = []
    for i, letter in enumerate(s):
        if letter == '(':
            open_brackets += 1
            if open_brackets == 1:
                group_counter += 1
        elif letter == ')':
            open_brackets -= 1
        else:
            res.append(group_counter)

        if open_brackets == 0:
            if i+1<len(s) and s[i+1] != '(':
                group_counter = 0
    return res

This is the result:

df = pd.DataFrame(
    {'smile':[
        "C(N)(N)CC(N)C",
        "CCC",
        "C1CC1",
        "C1CC1(C)C",
        "CC(C)C",
        "C(N)(N)(N)CC(N)C",
        "C((N)(N)N)CC(N)C",
        "CC(CCC)CCCC",
        "CC(CC(C))"
    ]})
df['branch_count'] = df['smile'].apply(smile_grouping)
>>> df
              smile                 branch_count
0     C(N)(N)CC(N)C        [0, 1, 2, 0, 0, 1, 0]
1               CCC                    [0, 0, 0]
2             C1CC1                    [0, 0, 0]
3         C1CC1(C)C              [0, 0, 0, 1, 0]
4            CC(C)C                 [0, 0, 1, 0]
5  C(N)(N)(N)CC(N)C     [0, 1, 2, 3, 0, 0, 1, 0]
6  C((N)(N)N)CC(N)C     [0, 1, 1, 1, 0, 0, 1, 0]
7       CC(CCC)CCCC  [0, 0, 1, 1, 1, 0, 0, 0, 0]
8         CC(CC(C))              [0, 0, 1, 1, 1]
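
The s.replace('C1', 'C') step above only removes the ring-closure digit 1 when it directly follows a C. A possible generalization, sketched here under the assumption that every character that is neither a letter nor a parenthesis (such as any ring-closure digit) should simply be skipped, is to append a count only for alphabetic characters. The name branch_counts is hypothetical, and multi-letter element symbols such as Cl would still be counted as two letters.

```python
def branch_counts(smiles: str) -> list:
    """Per-letter branch index; digits and other symbols are skipped (assumption)."""
    depth = 0    # number of parentheses currently open
    group = 0    # index of the current consecutive top-level branch
    counts = []
    for i, ch in enumerate(smiles):
        if ch == "(":
            depth += 1
            if depth == 1:
                group += 1      # a new top-level branch starts
        elif ch == ")":
            depth -= 1
        elif ch.isalpha():
            counts.append(group)
        # Outside all branches: reset unless the next character
        # opens another, consecutive branch
        if depth == 0 and i + 1 < len(smiles) and smiles[i + 1] != "(":
            group = 0
    return counts


print(branch_counts("C1CC1(C)C"))  # [0, 0, 0, 1, 0]
print(branch_counts("C2CC2"))      # [0, 0, 0]
```

With this variant a ring labelled 2, as in C2CC2, no longer adds extra entries, while the examples from the question give the same lists as in the table above.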

Answered 2023-04-04
