Reading and parsing a .txt file with a C program: how to handle empty fields

pjngdqdw  posted on 2023-08-03  in  Other

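I am trying to parse some data from a .txt file whose format is not easy to handle. The delimiter is the space character. The file contains one field of variable length, which is the fifth column (counting from the left). So I parse the data from the left up to the fourth column, and then parse from the right until I reach the variable-length field. This works fine. My main problem is that some fields are occasionally empty, see the third row, column 3. Because of that, my code cannot parse the data accurately, and not all lines in the output file are parsed correctly. Is it possible to skip the empty fields so that sscanf can recognize them? It would be great if someone could give me a hint on how to parse the data correctly. Code on onlinegdb: https://onlinegdb.com/8rOBlIfMhU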



#include <stdio.h>
#include <string.h>

#define BUF 1500

// reverse a string
char *strrev(char *str)
{
      char *p1, *p2;

      if (! str || ! *str)
            return str;
      for (p1 = str, p2 = str + strlen(str) - 1; p2 > p1; ++p1, --p2)
      {
            *p1 ^= *p2;
            *p2 ^= *p1;
            *p1 ^= *p2;
      }
      return str;
}

int main()
{
    FILE *ptr=fopen("/tmp/abc123", "w");
    fputs(
   "10000   07/01/1986   68391610   68391610   OPTIMUM MANUFACTURING INC             OMFGA          7952    10                 10396      3       3         3990       3990     39       399    03/12/1986   OMFGA                     Q          A         R       -2.56250         1000         .             2.75000        2.37500        .             .             .       C             C           3680       2    30/01/1986      .          .            .         .              .                 .       .          .             .          .             .             .        1.00000     1.00             .            .            .        .        .      1        1        9       2       0.013809     0.013800     0.011061     0.011046     0.014954\n"
   "12781   30/11/1970   84857L10   50558810   LACLEDE GAS CO                        LG            21080    11                     0      1       1         4925       2741     27       274             .                             N          A         R       25.00000         3500         .            25.00000       24.00000        .             .             .      0.041667      0.041667     4141       .             .      .          .            .         .              .                 .       .          .             .          .             .             .        4.00000     4.0              .            .            .        .        .      .        .        .       .       0.016698     0.016439     0.021276     0.020949     0.014779\n"
   "13901   27/05/1955   02209S10              PHILIP MORRIS & CO LTD                              21398    11                     0      1       1         2110       2111     21       211             .                             N          A         R       42.00000         4400       40.87500       42.00000       40.87500        .             .             .      0.030675      0.030675      2887       .             .      .          .            .         .              .                 .       .          .             .          .             .             .        2479.29   576.000            .            .            .        .        .      .        .        .       .       0.001626     0.001543     0.001477     0.001381      .\n"    
   "13901   31/05/1955   02209S10              PHILIP MORRIS & CO LTD                              21398    11                     0      1       1         2110       2111     21       211             .                             N          A         R       41.37500         5600       42.12500       42.12500       41.00000        .             .             .     -0.014881     -0.014881      2887       .             .      .          .            .         .              .                 .       .          .             .          .             .             .        2479.29   576.000            .            .            .        .        .      .        .        .       .       0.000496     0.000165    -0.000448    -0.000851      .\n"    
   "13901   01/06/1955   02209S10              PHILIP MORRIS INC                                   21398    11                     0      1       1         2110       2111     21       211    01/07/1962                             N          A         R       40.00000        11300       40.87500       40.87500       40.00000        .             .             .     -0.033233     -0.033233      2887       2    29/12/1955      .          .            .         .              .                 .       .          .             .          .             .             .        2479.29   576.000            .            .            .        .        .      .        .        .       .       0.001683     0.001476    -0.000496    -0.000724      .\n"      
   "13901   02/06/1955   02209S10              PHILIP MORRIS INC                                   21398    11                     0      1       1         2110       2111     21       211             .                             N          A         R       39.87500         9600       40.00000       40.12500       39.87500        .             .             .     -0.003125     -0.003125      2887       .             .      .          .            .         .              .                 .       .          .             .          .             .             .        2479.29   576.000            .            .            .        .        .      .        .        .       .       0.003036     0.002973     0.002027     0.001912      .\n"      
   "13901   03/06/1955   02209S10              PHILIP MORRIS INC                                   21398    11                     0      1       1         2110       2111     21       211             .                             N          A         R       40.12500         5500       40.00000       40.62500       40.00000        .             .             .      0.006270      0.006270      2887       .             .      .          .            .         .              .                 .       .          .             .          .             .             .        2479.29   576.000            .            .            .        .        .      .        .        .       .       0.006440     0.006420     0.004233     0.004141      .\n"      
  ,ptr);
    fclose(ptr);

    FILE *fp, *fpp;
    fp=fopen("/tmp/abc123","r");
        char puffer[BUF];
        char a[1000],b[1000],c[1000],d[1000],e[1000],f[1000],g[1000],h[1000],i[1000],j[1000],k[1000],l[1000],m[1000],n[1000],o[1000],p[1000],q[1000],r[1000],s[1000],tt[1000],u[1000],v[1000]; // a->PERMNO; b->date; c->CUSIP; d->NCUSIP; e->COMNAM; f->DIVAMT; g->CFACPR
        char w[1000],x[1000],y[1000],z[1000],aa[1000],ab[1000],ac[1000],ad[1000],ae[1000],af[1000],ag[1000],ah[1000],ai[1000],aj[1000],ak[1000],al[1000],am[1000],an[1000],ao[1000],ap[1000];
        char aq[1000],ar[1000],as[1000],at[1000],au[1000],av[1000],aw[1000],ax[1000],ay[1000],az[1000],ba[1000],bb[1000],bc[1000],bd[1000],be[1000],bf[1000],bg[1000],bh[1000],bi[1000],bj[1000],bk[1000],bl[1000] ;
      
    fpp=fopen("output.txt","w");

    if(fpp==NULL)
    {
        printf("file could not be opened\n");
        return 1;
    }
    
  while(fgets(puffer, BUF, fp) != NULL)
    {
        int n1,n2;
        char t[1000];
        //parse first four columns from left side
        if( 4==sscanf(puffer,"%s%s%s%s%n",a,b,c,d,&n1) )
        //parse 57 columns from the right side
        if( 57 ==sscanf(strrev(strcpy(t,puffer)),"%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%n",bl,bk,bj,bi,bh,bg,bf,be,bd,bc,bb,ba,az,ay,ax,aw,av,au,at,as,ar,aq,ap,ao,an,am,al,ak,aj,ai,ah,ag,af,ae,ad,ac,ab,aa,z,y,x,w,v,u,tt,s,r,q,p,o,n,m,l,k,j,i,h,g,f,&n2));
        //parse the variable field, is simply what is left in the middle.
        if( 1==sscanf(puffer+n1+1,"%[^\n]",e) )
        e[strlen(e)-n2]=0,a,b,c,d,e;
                strrev(f), strrev(g),strrev(h), strrev(i), strrev(j), strrev(k),
                strrev(l),strrev(m), strrev(n), strrev(o), strrev(p), strrev(q);
                strrev(r), strrev(s), strrev(tt), strrev(u), strrev(v), strrev(w),
                strrev(x), strrev(y), strrev(z), strrev(aa), strrev(ab), strrev(ac),
                strrev(ad), strrev(ae), strrev(af), strrev(ag), strrev(ah), strrev(ai),
                strrev(aj), strrev(ak),strrev(al), strrev(am), strrev(an), strrev(ao),
                strrev(ap), strrev(aq), strrev(ar), strrev(as), strrev(at), strrev(au),
                strrev(av), strrev(aw), strrev(ax), strrev(ay), strrev(az), strrev(ba);
                strrev(bb), strrev(bc), strrev(bd), strrev(be), strrev(bf), strrev(bg);
                strrev(bh), strrev(bi);
        // print first 5 columns in the console
         printf("%s %s %s %s %s\n",a, b, c, d, e);
        // print all parsed columns in output.txt file
         fprintf(fpp,"%s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s; %s\n",a,b,c,d,e,f, g, h, i, j, k, l, m ,n, o,p,q,r,s,tt,u,v,w,x,y,z,aa,ab,ac,ad,ae,af,ag,ah,ai,aj,ak,al,am,an,ao,ap,aq,ar,as,at,au,av,aw,ax,ay,az,ba,bb,bc,bd,be,bf,bg,bh,bi);
       
    }
     
    fclose(ptr);
    return 0;
}


zsbz8rwp  #1

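The problem described in the question is easy to solve if all columns have a fixed length (including padding). This appears to be the case for all columns except columns 30/31. Since you have stated that this inconsistency in the input is unintentional, I will restrict my answer to the first 10 columns, because these are the columns you ask about in the question.
Your input data appears to have the following format:
Every line in the file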

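  • starts with column #1, which consists of 5 characters, followed by 3 spaces,
  • followed by column #2, which consists of 10 characters, followed by 3 spaces,
  • followed by column #3, which consists of 8 characters, followed by 3 spaces,
  • followed by column #4, which consists of 8 characters (which may also be spaces), followed by 3 spaces,
  • followed by column #5, which consists of 25 characters (which may also be spaces), followed by 13 spaces,
  • followed by column #6, which consists of 5 characters (which may also be spaces), followed by 9 spaces,
  • followed by column #7, which consists of 5 characters (which may also be spaces), followed by 4 spaces,
  • followed by column #8, which consists of 2 characters, followed by 17 spaces,
  • followed by column #9, which consists of 5 characters (which may also be spaces), followed by 6 spaces,
  • followed by column #10, which consists of 1 character, followed by 7 spaces,
  • followed by further input, which I will ignore for the reason stated above.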

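If you have more input data than you posted, the rules above may not be correct. In the data that you did not show us, a field could be wider, so that there is less space padding between two columns. For the purpose of this demonstration, however, I will assume that the rules above hold. You may have to adjust them to match your actual input data.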
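If you tell the program the length of the data in each column and the number of padding spaces that follow it, the program can determine which characters belong to which column. That way, it can find and read empty columns that consist only of spaces.
I do not recommend using the %s conversion format specifier with sscanf, because it skips any leading whitespace characters. These spaces must not be skipped, because the number of space characters has to be examined in order to find the empty fields. Instead, I recommend processing the characters individually.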
Here is an example that uses the first 10 columns of the input data:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NUM_COLUMNS 10
#define MAX_COLUMN_LENGTH 200

//NOTE: This number should not be set too high, otherwise a
//stack overflow may occur. On non-embedded platforms, a
//size of up to 500,000 should generally be safe.
#define MAX_LINE_LENGTH 8000

int main( void )
{
    //local variable declarations
    FILE *fp;
    char line[MAX_LINE_LENGTH+1];

    //define the column lengths (data length and padding length)
    const struct column_length
    {
        int data;
        int padding;

    } column_lengths[NUM_COLUMNS] =
    {
        {  5,  3 },
        { 10,  3 },
        {  8,  3 },
        {  8,  3 },
        { 25, 13 },
        {  5,  9 },
        {  5,  4 },
        {  2, 17 },
        {  5,  6 },
        {  1,  0 }
    };

    //open temporary file
    fp = tmpfile();
    if ( fp == NULL )
    {
        fprintf( stderr, "Error opening temporary file!\n" );
        exit( EXIT_FAILURE );
    }

    //write data to temporary file
    fputs(
       "10000   07/01/1986   68391610   68391610   OPTIMUM MANUFACTURING INC             OMFGA          7952    10                 10396      3\n"
       "12781   30/11/1970   84857L10   50558810   LACLEDE GAS CO                        LG            21080    11                     0      1\n"
       "13901   27/05/1955   02209S10              PHILIP MORRIS & CO LTD                              21398    11                     0      1\n"
       "13901   31/05/1955   02209S10              PHILIP MORRIS & CO LTD                              21398    11                     0      1\n"
       "13901   01/06/1955   02209S10              PHILIP MORRIS INC                                   21398    11                     0      1\n"
       "13901   02/06/1955   02209S10              PHILIP MORRIS INC                                   21398    11                     0      1\n"
       "13901   03/06/1955   02209S10              PHILIP MORRIS INC                                   21398    11                     0      1\n",
       fp
    );

    //seek back to the beginning of the temporary file
    rewind( fp );

    //process one row per loop iteration
    for ( int row = 0; fgets(line,sizeof line,fp) != NULL; row++ )
    {
        //2D array for reading in all fields of a row
        char fields[NUM_COLUMNS][MAX_COLUMN_LENGTH+1];

        //this pointer will always point to the next
        //character of the line to process
        const char *p = line;

        printf( "Processing row #%d:\n", row );

        //verify that an entire line was read
        if ( strchr(line,'\n') == NULL && !feof(fp) )
        {
            fprintf(
                stderr,
                "Error processing row #%d:\n"
                "Buffer was too small to read the entire line.\n"
                "The macro constant MAX_LINE_LENGTH may have to be increased.\n",
                row
            );
            exit( EXIT_FAILURE );
        }

        //process one column per loop iteration
        for ( int col = 0; col < NUM_COLUMNS; col++ )
        {
            //verify that buffer size is large enough
            if ( column_lengths[col].data >= (int)sizeof fields[0] )
            {
                fprintf(
                    stderr,
                    "Error processing column #%d on row #%d: Buffer is too small!\n"
                    "The macro constant MAX_COLUMN_LENGTH must be increased.\n",
                    col, row
                );
                exit( EXIT_FAILURE );
            }

            //extract data of the field
            for ( int i = 0; i < column_lengths[col].data; i++ )
            {
                if ( *p == '\0' || *p == '\n' )
                {
                    fprintf(
                        stderr,
                        "Error processing column #%d on row #%d:\n"
                        "Unexpected end of input encountered while reading the data area of a field!\n",
                        col, row
                    );
                    exit( EXIT_FAILURE );
                }

                fields[col][i] = *p;

                p++;
            }

            //add terminating null character
            fields[col][column_lengths[col].data] = '\0';

            //print the extracted field
            printf( "  Field #%d: \"%s\"\n", col, fields[col] );

            //skip padding and verify that it consists only of spaces
            for ( int i = 0; i < column_lengths[col].padding; i++ )
            {
                if ( *p != ' ' )
                {
                    if ( *p == '\0' || *p == '\n' )
                    {
                        fprintf(
                            stderr,
                            "Error processing column #%d on row #%d:\n"
                            "Unexpected end of input encountered while skipping the padding area of a field!\n",
                            col, row
                        );
                    }
                    else
                    {
                        fprintf(
                            stderr,
                            "Error processing column #%d on row #%d:\n"
                            "Non-space padding character encountered!\n",
                            col, row
                        );
                    }

                    exit( EXIT_FAILURE );
                }

                p++;
            }
        }

        printf( "\n" );
    }

    fclose( fp );
}

This program has the following output:

Processing row #0:
  Field #0: "10000"
  Field #1: "07/01/1986"
  Field #2: "68391610"
  Field #3: "68391610"
  Field #4: "OPTIMUM MANUFACTURING INC"
  Field #5: "OMFGA"
  Field #6: " 7952"
  Field #7: "10"
  Field #8: "10396"
  Field #9: "3"

Processing row #1:
  Field #0: "12781"
  Field #1: "30/11/1970"
  Field #2: "84857L10"
  Field #3: "50558810"
  Field #4: "LACLEDE GAS CO           "
  Field #5: "LG   "
  Field #6: "21080"
  Field #7: "11"
  Field #8: "    0"
  Field #9: "1"

Processing row #2:
  Field #0: "13901"
  Field #1: "27/05/1955"
  Field #2: "02209S10"
  Field #3: "        "
  Field #4: "PHILIP MORRIS & CO LTD   "
  Field #5: "     "
  Field #6: "21398"
  Field #7: "11"
  Field #8: "    0"
  Field #9: "1"

Processing row #3:
  Field #0: "13901"
  Field #1: "31/05/1955"
  Field #2: "02209S10"
  Field #3: "        "
  Field #4: "PHILIP MORRIS & CO LTD   "
  Field #5: "     "
  Field #6: "21398"
  Field #7: "11"
  Field #8: "    0"
  Field #9: "1"

Processing row #4:
  Field #0: "13901"
  Field #1: "01/06/1955"
  Field #2: "02209S10"
  Field #3: "        "
  Field #4: "PHILIP MORRIS INC        "
  Field #5: "     "
  Field #6: "21398"
  Field #7: "11"
  Field #8: "    0"
  Field #9: "1"

Processing row #5:
  Field #0: "13901"
  Field #1: "02/06/1955"
  Field #2: "02209S10"
  Field #3: "        "
  Field #4: "PHILIP MORRIS INC        "
  Field #5: "     "
  Field #6: "21398"
  Field #7: "11"
  Field #8: "    0"
  Field #9: "1"

Processing row #6:
  Field #0: "13901"
  Field #1: "03/06/1955"
  Field #2: "02209S10"
  Field #3: "        "
  Field #4: "PHILIP MORRIS INC        "
  Field #5: "     "
  Field #6: "21398"
  Field #7: "11"
  Field #8: "    0"
  Field #9: "1"


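As you can see, the empty fields are read correctly.
If desired, the fields can be processed further afterwards, for example by removing all leading and trailing space characters and by using the function strtol to convert the numbers to int values.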
