pandas: How to preserve spaces in the "name" part of name/value pairs when importing JSON-formatted data (Python)?

Posted by r1wp621o on 2024-01-04 in Python

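I have been looking for help importing JSON-formatted data from a URL (I am a novice when it comes to JSON) and received a great answer to that question.
However, I have run into a complication. Some of my property names contain spaces. For example, "Property1" and several other property names from my previous question may actually be "Property1_word1 Property1_word2". The current solution keeps only the first word of the property name. I could live with that at first, but now I need all the words. I would appreciate any hints; I have not found any yet.
Edit (all the information is provided here so there is no need to refer to the previous post):
I want to import data from a website. First, I save the website's content (below) as a file. In my previous question, every property name consisted of a single word. Now I am dealing with property names made up of multiple words. Below is an example in which the names of Property1, Property4 and Property8 contain multiple words.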

{
    "payload": {
        "allShortcutsEnabled": false,
        "fileTree": {
            "": {
                "items": [
                    {
                        "name": "thing",
                        "path": "thing",
                        "contentType": "directory"
                    },
                    {
                        "name": ".repurlignore",
                        "path": ".repurlignore",
                        "contentType": "file"
                    },
                    {
                        "name": "README.md",
                        "path": "README.md",
                        "contentType": "file"
                    },
                    {
                        "name": "thing2",
                        "path": "thing2",
                        "contentType": "file"
                    },
                    {
                        "name": "thing3",
                        "path": "thing3",
                        "contentType": "file"
                    },
                    {
                        "name": "thing4",
                        "path": "thing4",
                        "contentType": "file"
                    },
                    {
                        "name": "thing5",
                        "path": "thing5",
                        "contentType": "file"
                    },
                    {
                        "name": "thing6",
                        "path": "thing6",
                        "contentType": "file"
                    },
                    {
                        "name": "thing7",
                        "path": "thing7",
                        "contentType": "file"
                    },
                    {
                        "name": "thing8",
                        "path": "thing8",
                        "contentType": "file"
                    },
                    {
                        "name": "thing9",
                        "path": "thing9",
                        "contentType": "file"
                    },
                    {
                        "name": "thing10",
                        "path": "thing10",
                        "contentType": "file"
                    },
                    {
                        "name": "thing11",
                        "path": "thing11",
                        "contentType": "file"
                    }
                ],
                "totalCount": 500
            }
        },
        "fileTreeProcessingTime": 5.262188,
        "foldersToFetch": [],
        "reducedMotionEnabled": null,
        "repo": {
            "id": 1234567,
            "defaultBranch": "main",
            "name": "repository",
            "ownerLogin": "contributor",
            "currentUserCanPush": false,
            "isFork": false,
            "isEmpty": false,
            "createdAt": "2023-10-31",
            "ownerAvatar": "https://avatars.repurlusercontent.com/u/98765432?v=1",
            "public": true,
            "private": false,
            "isOrgOwned": false
        },
        "symbolsExpanded": false,
        "treeExpanded": true,
        "refInfo": {
            "name": "main",
            "listCacheKey": "v0:13579",
            "canEdit": false,
            "refType": "branch",
            "currentOid": "identifier"
        },
        "path": "thing2",
        "currentUser": null,
        "blob": {
            "rawLines": [
                "        C_1H_4   Methane                  ",
                "            5.00000        Property1_word1 Property1_word2                              ",
                "             20.00000        Property2                     ",
                "           500.66500        Property3                              ",
                "           100.00000        Property4_word1 Property4_word2                                           ",
                "         -4453.98887        Property5                                      ",
                "           100.48200        Property6                                   ",
                "            59.75258        Property7                                         ",
                "             5.33645        Property8_word1 Property8_word2                                         ",
                "             0.00000        Property9         ",
                "           645.07777        Property10                                       ",
                "             0.00000        Property11                           ",
                "             0.00000        Property12                           ",
                "             0.00000        Property13                             ",
                "             0.00000        Property14                             ",
                "             0.00000        Property15                             ",
                "             0.00000        Property16                             ",
                "             0.00000        Property17                   ",
                "             0.00000        Property18                            ",
                "             0.00000        Property19                   ",
                "             0.00000        Property20                             ",
                "             0.00000        Property21                   ",
                "             0.00000        Property22                             ",
                "             0.00000        Property23                   ",
                "             0.00000        Property24                    ",
                "             0.00000        Property25                    ",
                "             0.57876        Property26                                           ",
                "             4.00000        Property27                                               ",
                "             0.00000        Property28                    ",
                "             0.00000        Property29               ",
                "             0.00000        Property30                  ",
                "             0.00000        Property31            ",
                "             0.00000        Property32                  ",
                "             1.00000        Property33                         ",
                "             0.00000        Property34                       ",
                "            26.00000        Property35                             ",
                "             1.44571        Property36                               ",
                "             1.08756        Property37                            ",
                "             0.00000        Property38                          ",
                "             0.00000        Property39                        ",
                "             0.00000        Property40                        ",
                "             6.00000        Property41                       ",
                "             9.00000        Property42                                         ",
                "             0.00000        Property43                                         "
            ],
            "stylingDirectives": [
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                []
            ],
            "csv": null,
            "csvError": null,
            "dependabotInfo": {
                "showConfigurationBanner": false,
                "configFilePath": null,
                "networkDependabotPath": "/contributor/repository/network/updates",
                "dismissConfigurationNoticePath": "/settings/dismiss-notice/dependabot_configuration_notice",
                "configurationNoticeDismissed": null,
                "repoAlertsPath": "/contributor/repository/security/dependabot",
                "repoSecurityAndAnalysisPath": "/contributor/repository/settings/security_analysis",
                "repoOwnerIsOrg": false,
                "currentUserCanAdminRepo": false
            },
            "displayName": "thing2",
            "displayUrl": "https://repurl.com/contributor/repository/blob/main/thing2?raw=true",
            "headerInfo": {
                "blobSize": "3.37 KB",
                "deleteInfo": {
                    "deleteTooltip": "You must be signed in to make or propose changes"
                },
                "editInfo": {
                    "editTooltip": "XXX"
                },
                "ghDesktopPath": "https://desktop.repurl.com",
                "repurlLfsPath": null,
                "onBranch": true,
                "shortPath": "5678",
                "siteNavLoginPath": "/login?return_to=identifier",
                "isCSV": false,
                "isRichtext": false,
                "toc": null,
                "lineInfo": {
                    "truncatedLoc": "33",
                    "truncatedSloc": "33"
                },
                "mode": "executable file"
            },
            "image": false,
            "isCodeownersFile": null,
            "isPlain": false,
            "isValidLegacyIssueTemplate": false,
            "issueTemplateHelpUrl": "https://docs.repurl.com/articles/about-issue",
            "issueTemplate": null,
            "discussionTemplate": null,
            "language": null,
            "languageID": null,
            "large": false,
            "loggedIn": false,
            "newDiscussionPath": "/contributor/repository/issues/new",
            "newIssuePath": "/contributor/repository/issues/new",
            "planSupportInfo": {
                "repoOption1": null,
                "repoOption2": null,
                "requestFullPath": "/contributor/repository/blob/main/thing2",
                "repoOption4": null,
                "repoOption5": null,
                "repoOption6": null,
                "repoOption7": null
            },
            "repoOption8": {
                "repoOption9": "/settings/dismiss-notice/repoOption10",
                "releasePath": "/contributor/repository/releases/new=true",
                "repoOption11": false,
                "repoOption12": false
            },
            "rawBlobUrl": "https://repurl.com/contributor/repository/raw/main/thing2",
            "repoOption13": false,
            "richText": null,
            "renderedFileInfo": null,
            "shortPath": null,
            "tabSize": 8,
            "topBannersInfo": {
                "overridingGlobalFundingFile": false,
                "universalPath": null,
                "repoOwner": "contributor",
                "repoName": "repository",
                "repoOption14": false,
                "citationHelpUrl": "https://docs.repurl.com/en/repurl/archiving/about",
                "repoOption15": false,
                "repoOption16": null
            },
            "truncated": false,
            "viewable": true,
            "workflowRedirectUrl": null,
            "symbols": {
                "timedOut": false,
                "notAnalyzed": true,
                "symbols": []
            }
        },
        "collabInfo": null,
        "collabMod": false,
        "wtsdf_signifier": {
            "/contributor/repository/branches": {
                "post": "identifier"
            },
            "/repos/preferences": {
                "post": "identifier"
            }
        }
    },
    "title": "repository/thing2 at main \\u0000 contributor/repository"
}

Below is the code that handles single-word property names (the whitespace-stripping commands import only the first word of a multi-word name):

import json
import pandas as pd

f = open("yourJson.json", "r")
data = json.load(f)
f.close()

# Get what we want to extract from the json
to_extract = data["payload"]["blob"]["rawLines"]

# Remove useless whitespace
stripped = [e.strip() for e in to_extract]
trimmed = [" ".join(e.split()) for e in stripped]

# Transform the list of string to a dict
as_dict = {e.split(' ')[0]: e.split(' ')[1] for e in trimmed}

# Load the dict with pandas
df = pd.DataFrame(as_dict.items(), columns=['Value', 'Property'])


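I have tried various solutions (for example, not stripping the whitespace, or specifying the exact property names associated with the data I need), but I am so lost with JSON that the errors make no sense to me.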

Answer #1 (goucqfw6)

Let's reduce your example to two lines of data.

to_extract = [
    "        C_1H_4   Methane                  ",
    "            5.00000        Property1_word1 Property1_word2                              ",
]
stripped = [e.strip() for e in to_extract]
trimmed = [" ".join(e.split()) for e in stripped]
print(f"{trimmed=}")

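This gives us the cleaned-up data:
trimmed=['C_1H_4 Methane', '5.00000 Property1_word1 Property1_word2']
In the next part of your code, you split the strings in this list and build a dictionary. Let's see what we get here: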

for e in trimmed:
    print(e.split(' '))


The resulting lists look like this:

['C_1H_4', 'Methane']
['5.00000', 'Property1_word1', 'Property1_word2']


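As you can see, the second string was split into a list with 3 parts, and the third part (index 2) is lost in your code. You could join these parts together again, but there is a simpler way. The split method has a maxsplit parameter that we can use to perform only one split.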

for e in trimmed:
    print(e.split(' ', 1))


Both lists now have exactly 2 entries.

['C_1H_4', 'Methane']
['5.00000', 'Property1_word1 Property1_word2']
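
A side note on why the trimmed step matters for this approach: with an explicit ' ' separator, maxsplit=1 splits at the first single space, so on a line that still contains runs of spaces the padding ends up in the second part. The following is only a minimal sketch using one raw line from the question; the variable names are made up for illustration.

```python
# One raw line copied from the question's "rawLines" list.
raw = "            5.00000        Property1_word1 Property1_word2          "

# Explicit ' ' separator: only the first single space is consumed,
# so the rest of the run of spaces stays attached to the name.
with_space = raw.strip().split(" ", 1)

# sep=None: any run of whitespace counts as a single separator.
with_none = raw.strip().split(None, 1)

print(with_space)
print(with_none)
```

Passing None as the separator therefore makes the separate whitespace normalization unnecessary, as long as the line is stripped first.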


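So you only need to change your old code from
as_dict = {e.split(' ')[0]: e.split(' ')[1] for e in trimmed}
to
as_dict = {e.split(' ')[0]: e.split(' ', 1)[1] for e in trimmed}
Additionally: I don't like that we call split twice, and splitting and then re-joining the strings to build trimmed also seems needlessly cumbersome.
We can drop the intermediate stripped and trimmed lists and boil all of this down to:
as_dict = dict(line.strip().split(None, 1) for line in to_extract)
The result is: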
{'C_1H_4': 'Methane', '5.00000': 'Property1_word1 Property1_word2'}
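
One caveat that goes slightly beyond the question: a dict keyed by the numeric value keeps only one entry per distinct value, and the sample data contains many properties with the value 0.00000. The sketch below is one possible way around that, assuming every row should be kept: it builds a list of (value, name) pairs instead of a dict and hands that to pandas. The three sample lines are taken from the question; the rows name is hypothetical.

```python
import pandas as pd

# Three raw lines copied from the question's "rawLines" list.
to_extract = [
    "            5.00000        Property1_word1 Property1_word2          ",
    "             0.00000        Property9         ",
    "             0.00000        Property11                           ",
]

# Dict keyed by value: the two 0.00000 lines collapse into a single entry.
as_dict = dict(line.strip().split(None, 1) for line in to_extract)
print(len(as_dict))

# List of (value, name) pairs: every line survives as its own row.
rows = [tuple(line.strip().split(None, 1)) for line in to_extract]
df = pd.DataFrame(rows, columns=["Value", "Property"])
print(df)
```

The values stay strings here; whether to convert them to numbers depends on how lines such as the "C_1H_4   Methane" header should be treated.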

Answer #2 (8cdiaqws)

You can use spaces in JSON keys, if that is your question; it is not invalid.

{
    "My name is": "Efe"
}
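
To illustrate the point, here is a minimal sketch showing that keys containing spaces survive both json.loads and conversion to a DataFrame. The second key is a hypothetical addition modeled on the question's property names, not part of the original example.

```python
import json

import pandas as pd

# The second key/value pair is made up for illustration.
text = '{"My name is": "Efe", "Property1_word1 Property1_word2": 5.0}'
record = json.loads(text)

# The keys keep their spaces after parsing.
print(list(record))

# pandas uses the keys unchanged as column labels.
df = pd.DataFrame([record])
print(df["My name is"].iloc[0])
```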

Also, if you want to remove unwanted whitespace from a string, you can use this:

mystring = " Hello "
mystring = mystring.strip()

#'Hello'


It would be easier to see the problem and the code if you edited all the material into one question without referring to the old one.
