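JavaScript: how to find and remove a starting string from an Arabic string that has diacritics, while keeping the original diacritics of the rest of the string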

yrwegjxp, posted on 2023-06-20 in Java

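The aim is to find and remove a starting string/characters/word from an Arabic string when we do not know whether it has diacritics, **while keeping any and all diacritics of the remaining string (if any)**.
There are many answers on StackOverflow for removing the first/starting string/characters from an English string, but I could not find an existing solution on StackOverflow for this problem that keeps the Arabic string in its original form.
If the original string is normalized before processing (diacritics, tanween, etc. removed), the string that is returned is the remainder of the normalized string, not the remainder of the original string.
Assume the original string can be in any of the following forms (i.e. the same string with different diacritics):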

1.“ا لسل ا م علي م ورحمة ا ل له"

2.“ا لس ل ا م علي م ورحمة ا ل له"

3.“ا لس ل ا م ع لي م ور حمة ا ل له"

4.“ا لس ل ا م ع ل ي م و ر ح م ة ا ل له"

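Now, suppose we want to remove the first/starting characters “السلام” only if the string starts with those characters (which it does), and return the remaining "original" string with its original diacritics.
Of course, we search for the characters “السلام” without diacritics, because we do not know how the original string is formatted with diacritics.
So, in this case, the remainder returned for each string must be: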

1.“علي م ورحمة ال له"

2.“علي م ورحمة ال له"

3.“ع لي م ور حمة ال له"

4.“ع ل ي م و ر ح م ة ال له"

The following code works for an English string (there are many other solutions) but not for Arabic strings as described above.

function removeStartWord(string,word) {
if (string.startsWith(word)) string=string.slice(word.length);
return string;
}

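The code above slices the starting characters off the original string based on their length, which works fine for English text.
For Arabic strings we do not know which diacritics the original string carries, so the length of the string/characters we are looking for inside the original string is different and unknown.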
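To make the length problem concrete, here is a small sketch. The strings are written with \u escapes so the code points are unambiguous, and the variable names are only illustrative. The bare word has 6 UTF-16 code units while one fully vocalized spelling has 10, so startsWith() fails and slice(word.length) would cut in the wrong place.

```javascript
// "السلام" without diacritics: alef, lam, seen, lam, alef, meem
const bare = '\u0627\u0644\u0633\u0644\u0627\u0645';
// The same word vocalized: seen carries fatha + shadda, lam carries fatha, meem carries damma
const vocalized = '\u0627\u0644\u0633\u064E\u0651\u0644\u064E\u0627\u0645\u064F';

const lengths = { bare: bare.length, vocalized: vocalized.length };
const prefixMatches = vocalized.startsWith(bare);

console.log(lengths);       // { bare: 6, vocalized: 10 }
console.log(prefixMatches); // false
```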

Edit: added a sample image for better illustration.

The following image of a table gives further examples:


avwztpqn1#

To keep the discussion traceable I have added a new answer. Please try it!

function removeStartWord(string, word) {
  const alphabeticString =  string.replace(/[^a-zA-Zء-ي0-9/]+/g, '');
  if(!alphabeticString.startsWith(word)) return string;
  const letters = [...word];
  let cleanString = '';
  string.split('').forEach((_letter) => {
    if(letters.indexOf(_letter) > -1) {
      delete letters[letters.indexOf(_letter)]
    }else{
      cleanString += _letter;
    }
  });
  return cleanString.replace(/[^a-zA-Zء-ي0-9/\s]*/i, '');
}

const sampleData = `السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله`;

console.log('sampleData ...', sampleData);
console.log(
  "removeStartWord(sampleData, 'السلام') ...",
  removeStartWord(sampleData, 'السلام')
);
console.log(
  "removeStartWord(sampleData, 'الس') ...",
  removeStartWord(sampleData, 'الس')
);
console.log(
  "removeStartWord(sampleData, 'السلام ') ...",
  removeStartWord(sampleData, 'السلام ')
);
console.log(
  "removeStartWord(sampleData, ' السلام') ...",
  removeStartWord(sampleData, ' السلام')
);

qjp7pelc2#

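I have come up with the following possible solution.
The solution is split into 2 parts. First, the function startsWithAr() is used to "partially" mimic the JavaScript startsWith() method, but for Arabic strings.
However, instead of returning true or false, it returns the index in the Source String just after the characters we are looking for (i.e. the length of the string found in the Source String, including its Tashkeel (diacritics), if any). If the characters of the specified string are not found at the start of the string, it returns -1.
Using the startsWithAr() function, the second part then creates a function, removeStartString(), that removes the characters of the specified string with the slice() method if they are found at the start of the Source String.
This approach not only keeps the Tashkeel (diacritics) of the rest of the Source String, it also allows searching for and removing strings that have Tahmeez.
The function ignores Tashkeel (diacritics) and Tahmeez in both the Source String and the Look-For String, and returns the remainder of the Source String intact, with its original Tashkeel (diacritics), after removing the specified starting characters from the start of the Source String.
This way the function can handle all Unicode characters in the Arabic script rather than being restricted to a defined range, because characters of any other language pass through without special handling.
We can also easily improve it by matching "ه" with "ة", so that the string "السيدة" is removed even when it is written "السيده", by adding .replace(/[ة]/g,'ه') to the 2 .replace() lines.
I have listed separate test cases below for the startsWithAr() function and the removeStartString() function.
The two functions can be combined into one if required.
Please improve as needed; any suggestions are welcome.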
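The letter-folding idea (hamza carriers, waw with hamza, and the suggested ta marbuta extension) can be sketched in isolation. The helper below is hypothetical: foldLetters is not part of this answer's code, and the set of folded characters is an assumption chosen for illustration, written with \u escapes.

```javascript
// Hypothetical helper, not part of the original answer: folds letter variants
// so that spellings differing only in hamza or ta marbuta compare equal.
function foldLetters(text) {
  return text
    .replace(/[\u0622\u0623\u0625\u0671]/g, '\u0627') // alef with madda / hamza / wasla -> bare alef
    .replace(/\u0624/g, '\u0648')                     // waw with hamza -> waw
    .replace(/\u0629/g, '\u0647');                    // ta marbuta -> heh
}

const withTaMarbuta = '\u0627\u0644\u0633\u064A\u062F\u0629'; // "السيدة"
const withHeh = '\u0627\u0644\u0633\u064A\u062F\u0647';       // "السيده"

console.log(foldLetters(withTaMarbuta) === foldLetters(withHeh)); // true
```

Folding only the copies used for comparison, never the Source String itself, is what lets the remainder keep its original spelling.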

Part 1: startsWithAr()

//=====================================================================
// startsWithAr() function
// Purpose:
// Determines whether an Arabic string (the "Source String") begins with the characters
// of a specified string (the "Look-For String").
// Return the position (index) after the Look-For String if found, else return -1 if not found.
// Ignores Tashkeel (diacritics) and Tahmeez in both the Source and Look-For Strings.
// The returned position index is zero based.
// By knowing the position (index) after the Look-For String, one can remove the
// starting string using the slice() method while maintaining the remainder of the Source String with
// its original tashkeel (diacritics) unchanged.
//
// Parameters:
// str     : The Source String to search in.
// lookFor : The characters to be searched for at the start of this string.
//=====================================================================
function startsWithAr(str,lookFor) {
let indexLookFor=0, tshk=/[ؐ-ًؕ-ٖٓ-ٟۖ-ٰٰۭـ]/, w=/[ؤ]/g,hamz=/[آأإٱٲٳٵ]/g;
lookFor=lookFor.replace(hamz,'ا').replace(w,'و').replace(/[ؐ-ًؕ-ٖٓ-ٟۖ-ٰٰۭـ]/g,''); // normalize the lookFor string
for (let indexStr=0; indexStr<str.length;indexStr++) {
while(indexStr<str.length&&tshk.test(str[indexStr]))++indexStr; // skip tashkeel & increase index
if (indexStr>=str.length) return -1; // Source String ended (e.g. on trailing tashkeel) before the Look-For String was fully matched
if (lookFor[indexLookFor]!==str[indexStr].replace(hamz,'ا').replace(w,'و')) return-1; // no match, so exit -1
indexLookFor++;                               // match found so next char in lookFor String
    if (indexLookFor>=lookFor.length) {       // if end of Look-For String then WE FOUND IT
      indexStr+=1;                            // point after source char
      while(tshk.test(str[indexStr])&&indexStr<str.length)++indexStr; // skip tashkeel after Source String if any
    return indexStr;      // return index in Source String after lookFor string and after any tashkeel
    }
}
return-1; // not found end of string reached
}
//=========================================
// test cases for startsWithAr() function
//=========================================
var r =0; // test tracking flag
r |= test("السلام عَلَيَكُمُ ورحمة الله","السلام",6);  // find the start letters 'السلام'
r |= test("الْسًّلامُ عَلَيَكُمُ ورحمة الله","السلام",10); // find the start letters 'السلام'
r |= test("الْسًّلامُ عَلَيَكُمُ وَرَحَمَةَ الله","السَّلام",10); // find the start letters 'السَّلام'
r |= test("ألْسًّلامُ عَلَيَكُمُ وَرَحَمَةَ الله","السَّلام",10); // find the start letters 'السَّلام'
r |= test("السؤال هو التالي","السوال",6);      // find the start letters 'السوال'
r |= test("السيد/علي","السيد",5);           // find the start letters 'السيد'
r |= test("السيد/علي","ف",-1);           // 'ف' is not at the start, so expect -1
r |= test(" السيد"," ",1);               // find the start letter ' ' (space)
r |= test("المجد لنا","ال",2);             // find the start letters 'ال'
r |= test("المجد لنا","ا",1);              // find the start letter  'ا'
r |= test("ألمجد لنا","ال",2);             // find the start letters 'ال'
r |= test("إلمجد لنا","ال",2);             // find the start letters 'ال'
r |= test("إلمجد لنا","ألْ",2);             // find the start letters 'ألْ'
r |= test("إلْمَجد لَنا","ألْ",3);             // find the start letters 'ألْ'
r |= test("","ا",-1);                  // empty Source String
r |= test("","",-1);                  // empty Source String and Look-For String

if (r==0) console.log("✅ All startsWithAr() test cases passed");

//-----------------------------------
function test(str,lookfor,should) {
  let result= startsWithAr(str,lookfor);
  if (result !== should) {console.log(`
  ${str} Output   :${result}
  ${str} Should be:${should}
  `);return 1;}
  }

Part 2: removeStartString()

//=====================================================================
// removeStartString() function
// Purpose:
// Determines whether an Arabic string (the "Source String") begins with the characters
// of a specified string (the "Look-For String").
// If found the Look-For String is removed and the remainder of the Source String is returned
// with its original Tashkeel (diacritics);
// If no match then return original Source String.
//
// Ignores Tashkeel (diacritics) and Tahmeez in both the Source and Look-For Strings.
// The function uses the startsWithAr() function to determine the index after the matched
// starting string/characters.
//
// Parameters:
// str     : The Source String to search in.
// toRemove: The characters to be searched for and removed if at the start of this string.
//=====================================================================
function removeStartString(str,toRemove) {
let index=startsWithAr(str,toRemove);
if (index>-1) str=str.slice(index);
return str;
}

//=========================================
// test cases for removeStartString() function
//=========================================
var r =0; // test tracking flag
r |= test2("السلام عَلَيَكُمُ ورحمة الله","السلام"," عَلَيَكُمُ ورحمة الله");  // remove the start letters 'السلام'
r |= test2("ألْسًّلامُ عَلَيَكُمُ ورحمة الله","السلام"," عَلَيَكُمُ ورحمة الله");  // remove the start letters 'ألْسًّلامُ'
r |= test2("السلام عَلَيَكُمُ ورحمة الله","ألْسًّلامُ"," عَلَيَكُمُ ورحمة الله");  // remove the start letters 'ألْسًّلامُ'
r |= test2(" السلام عَلَيَكُمُ ورحمة الله"," ألْسًّلامُ"," عَلَيَكُمُ ورحمة الله");// remove the start letters 'ألْسًّلامُ '
r |= test2("السلام عَلَيَكُمُ ورحمة الله","ال","سلام عَلَيَكُمُ ورحمة الله"); // remove the start letters 'ال'
r |= test2("أَهْلًا وَسَهلًا","ا","هْلًا وَسَهلًا");             // remove the start letter 'ا'
r |= test2("أَهْلًا وَسَهلًا"," ","أَهْلًا وَسَهلًا");                // remove the start letter ' '
r |= test2("أَهْلًا وَسَهلًا","","أَهْلًا وَسَهلًا");             // remove the start letter ''
r |= test2("أَهْلًا وَسَهلًا","إلى","أَهْلًا وَسَهلًا");           // remove the start letters 'إلى'

if (r==0) console.log("✅ All removeStartString() test cases passed");

//-----------------------------------
function test2(str,toRemove,should) {
  let result= removeStartString(str,toRemove);
  if (result !== should) {console.log(`
  ${str} Output   :${result}
  ${str} Should be:${should}
  `);return 1;}
  }
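As a cross-check on the index-walking idea, the same effect can be sketched with a generated regular expression: strip combining marks from the search text, then allow any run of marks (\p{M}*) after each letter. This is an alternative sketch, not the method used in this answer. removeStartIgnoringMarks and escapeRegExp are hypothetical names, and tatweel and hamza folding are deliberately left out.

```javascript
// Escape characters that are special inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical sketch: remove `toRemove` from the start of `str`, ignoring
// combining marks (Unicode category M) in both strings.
function removeStartIgnoringMarks(str, toRemove) {
  const letters = [...toRemove.replace(/\p{M}/gu, '')];
  if (letters.length === 0) return str; // nothing to remove
  // Pattern shape: letter, marks*, letter, marks*, ...
  const pattern = letters.map((ch) => escapeRegExp(ch) + '\\p{M}*').join('');
  return str.replace(new RegExp('^' + pattern, 'u'), '');
}

// Vocalized "السلام عليكم", written with escapes to keep the code points explicit.
const source = '\u0627\u0644\u0633\u064E\u0651\u0644\u064E\u0627\u0645\u064F \u0639\u064E\u0644\u064E\u064A\u0652\u0643\u064F\u0645';
const rest = removeStartIgnoringMarks(source, '\u0627\u0644\u0633\u0644\u0627\u0645');

console.log(rest === source.slice(10)); // true: the remainder keeps its space and diacritics
```

Because the marks after the last matched letter are consumed too, the remainder never starts with an orphaned diacritic.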

rsl1atfo3#

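Making use of regex Unicode property escapes might already be good enough. Note that JavaScript does not accept the bare shorthand \p{Arabic} for a *Unicode script*; a script has to be spelled out as \p{Script=Arabic}.
A category-based pattern such as /^[\p{L}\p{M}]+\p{Z}+/gmu, used with replace, already does what the OP asked for ...

  • "find and remove the first starting word from an Arabic string that has diac…"

The pattern ... ^[\p{L}\p{M}]+\p{Z}+ ... reads like this:

  • ^ ... starting at the beginning of a new line ...
  • [ ... ]+ ... match one or more characters from the listed character classes, which are ...
  • \p{L} ... any kind of letter from any language,
  • \p{M} ... or a character intended to be combined with another character (e.g. accents, diacritics, enclosing boxes, etc.),
  • followed by \p{Z}+ ... at least one of any kind of whitespace or invisible separator.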
console.log(`السلام عليكم ورحمة الله
السَلام عليكمُ ورحمةُ الله
السَلامُ عَليكمُ ورَحمةُ الله
السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله`.replace(/^[\p{L}\p{M}]+\p{Z}+/gmu, ''));
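A small sketch of the same pattern restricted to Arabic letters, assuming an engine with ES2018 Unicode property escapes. removeFirstArabicWord is a hypothetical name. The common Arabic diacritics carry the script value Inherited rather than Arabic, so \p{M} is kept in the class alongside \p{Script=Arabic}.

```javascript
// Hypothetical helper: drops the first Arabic word (letters plus combining
// marks) and the separator(s) that follow it, at the start of a single line.
function removeFirstArabicWord(line) {
  return line.replace(/^[\p{Script=Arabic}\p{M}]+\p{Z}+/u, '');
}

// Vocalized "السلام عليكم", written with escapes to keep the code points explicit.
const greeting = '\u0627\u0644\u0633\u064E\u0651\u0644\u064E\u0627\u0645\u064F \u0639\u064E\u0644\u064E\u064A\u0652\u0643\u064F\u0645';
const remainder = removeFirstArabicWord(greeting);

console.log(remainder === greeting.slice(11)); // true: only the first word and the space are gone
console.log(removeFirstArabicWord('hello world') === 'hello world'); // true: Latin text is untouched
```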

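Edit

Now that it is clear what the OP actually wants, the approach above still stands; it is just taken to the next level by using a replacer function and additional comparison logic based on an Intl.Collator object, which compares Arabic text by base letters.
By passing (in addition to the 'ar' locale) an option with base sensitivity, the collator is initialized in its least strict mode. Thus, when two similar (but not identical) strings such as 'السلام' and 'السَّلَامُ' are compared via the collator's compare method, they are considered equal even though the latter carries (a lot of) diacritics.
Proof / example ...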

const baseLetterCollator = new Intl.Collator('ar', { sensitivity: 'base' } );

console.log(
  "('السلام عليكم ورحمة الله' === 'السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله') ?..",
  ('السلام عليكم ورحمة الله' === 'السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله')
);
console.log('\n');

console.log(`new Intl.Collator()
  .compare('السلام عليكم ورحمة الله' ,'السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله') === 0

  ?..`,
  new Intl.Collator()
    .compare('السلام عليكم ورحمة الله' ,'السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله') === 0
);
console.log(`new Intl.Collator('ar', { sensitivity: 'base' } )
  .compare('السلام عليكم ورحمة الله' ,'السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله') === 0

  ?..`,
  new Intl.Collator('ar', { sensitivity: 'base' } )
    .compare('السلام عليكم ورحمة الله' ,'السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله') === 0
);

Based on all of the above ... the final solution

function removeFirstMatchingWordFromEveryNewLine(search, multilineString) {
  const baseLetterCollator
    // - [ar]abic
    // - base sensitivity
    //   ... only strings that differ in base letters compare as unequal.
    = new Intl.Collator('ar', { sensitivity: 'base' } );

  const replacer = word => {
    return (baseLetterCollator.compare(search, word.trim()) === 0)
      ? ''    // - remove the matching word (whitespace included).
      : word; // - keep the word since there was no match. 
  }
  const regXFirstLineWord = /^[\p{L}\p{M}]+\p{Z}+/gmu;

  search = String(search).trim();

  return String(multilineString).replace(regXFirstLineWord, replacer);  
}
const sampleData = `السلام عليكم ورحمة الله
السَلام عليكمُ ورحمةُ الله
أهلا ومرحبا
السَلامُ عَليكمُ ورَحمةُ الله
السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله`;

console.log('sampleData ...', sampleData);
console.log(
  "removeFirstMatchingWordFromEveryNewLine('السلام', sampleData) ...",
  removeFirstMatchingWordFromEveryNewLine('السلام', sampleData)
);

yzuktlbb4#

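Since the requirements change(d) and information comes in piece by piece, ...

  • "[...] the answer removes the first matching word together with the space after the word. But the string we are looking for may not necessarily be followed by a space (i.e. it is not a standalone word). For example, removing the characters “ا ل س يد” from the sentence “ا ل س يد/مح س ن ا ل ي ا ت ع ي” should return just “/مح س ن ا ل ي ا ت ع ي”. - Mohsen Alyafei"

I will start from a clean slate as well.
The combined approach, doing a locale-aware compare via Intl.Collator while a regex based on Unicode property escapes matches any Arabic word regardless of combining characters such as accents and diacritics, can no longer be used once the task is to find/match any kind of string (here at the beginning of a new line).
On the other hand, any approach that simply iterates over a string and compares two strings character by character will fail as well.
Example code says more than words ... let's see ...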

console.log(`
  ... remember ...
  new Intl.Collator('ar', { sensitivity: 'base' } )
    .compare('السَّلَامُ' ,'السلام') === 0

  ?..`, new Intl.Collator('ar', { sensitivity: 'base' } )
    .compare('السَّلَامُ' ,'السلام') === 0, `

  ... but ...
  new Intl.Collator('ar')
    .compare('السَّلَامُ' ,'السلام') === 0

  ?..`, new Intl.Collator('ar')
    .compare('السَّلَامُ' ,'السلام') === 0
);
console.log('\n... explanation ...\n\n');

console.log("'السلام'.length ...", 'السلام'.length);
console.log("'السَّلَامُ'.length ...", 'السَّلَامُ'.length);

console.log("'السلام'.split('') ...", 'السلام'.split(''));
console.log("'السَّلَامُ'.split('') ...", 'السَّلَامُ'.split(''));

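Fortunately, ECMAScript's Internationalization API, Intl, can help here as well. There is Intl.Segmenter, which helps break a string into comparable segments. For the OP's use case it is sufficient to run it at the default granularity level, 'grapheme', which appears to amount to *splitting into locale-comparable letters* ...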

console.log(`[
  ...new Intl.Segmenter('ar', { granularity: 'grapheme' }).segment('السلام')
]
.map(({ segment }) => segment) ...`, [

  ...new Intl.Segmenter('ar', { granularity: 'grapheme' }).segment('السلام')
  ]
  .map(({ segment }) => segment)
);
console.log(`[
  ...new Intl.Segmenter('ar').segment('السَّلَامُ')
]
.map(({ segment }) => segment) ...`, [

    ...new Intl.Segmenter('ar').segment('السَّلَامُ')
  ]
  .map(({ segment }) => segment)
);

So the final step is to combine the Intl.Segmenter introduced above with the by now familiar Intl.Collator ...

function removeEveryMatchingNewLineStart(search, multilineString) {
  const letterSegmenter
    // - [ar]abic
    // - default grapheme granularity (locale comparable letters).
    = new Intl.Segmenter('ar'/*, { granularity: 'grapheme' }*/);

  const letterCollator
    // - [ar]abic
    // - base sensitivity
    //   ... Non-zero comparator result value for strings only
    //   that for a base letter comparison are considered unequal.
    = new Intl.Collator('ar', { sensitivity: 'base' } );

  const getLocaleComparableLetterList = str =>
    [...letterSegmenter.segment(str)].map(({ segment }) => segment);

  function replaceLineStartByBoundComparableLetters(line) {
    const searchLetters = this;
    let lineLetters = getLocaleComparableLetterList(line);

    if (searchLetters.every((searchLetter, idx/*, arr*/) =>
      (letterCollator.compare(searchLetter, lineLetters[idx]) === 0)
    )) {
      lineLetters = lineLetters.slice(searchLetters.length);

      let leadingBlanks = '';
      while (lineLetters[0] === ' ') {
        leadingBlanks = leadingBlanks + lineLetters.shift();
      }
      line = `${ lineLetters.join('') }${ leadingBlanks }`;

      // // due to keeping/restoring leading whitespace sequences ...
      // // ... all the above additional computation instead of ...
      // // ... a simple ...
      // line = lineLetters.slice(searchLetters.length).join('')
    }
    return line;
  }
  return String(multilineString)
    .split(/(\n)/)
    .map(
      replaceLineStartByBoundComparableLetters.bind(
        getLocaleComparableLetterList(String(search))
      )
    )
    .join('');
}
const sampleData = `السلام عليكم ورحمة الله
السَلام عليكمُ ورحمةُ الله
أهلا ومرحبا
السَلامُ عَليكمُ ورَحمةُ الله
السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله`;

console.log('sampleData ...', sampleData);
console.log(
  "removeEveryMatchingNewLineStart('السلام', sampleData) ...",
  removeEveryMatchingNewLineStart('السلام', sampleData)
);
console.log(
  "removeEveryMatchingNewLineStart('الس', sampleData) ...",
  removeEveryMatchingNewLineStart('الس', sampleData)
);
console.log(
  "removeEveryMatchingNewLineStart('السلام ', sampleData) ...",
  removeEveryMatchingNewLineStart('السلام ', sampleData)
);
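For a single line, a shorter variant can be sketched by using the index that Intl.Segmenter reports for each segment and slicing the original string there, which leaves any whitespace after the match exactly where it was. This is a sketch under assumptions, not this answer's code: removeLineStart is a hypothetical name, and it relies on the runtime's collation data treating the diacritics as ignorable at base sensitivity.

```javascript
// Hypothetical single-line variant: compare grapheme by grapheme, then slice
// the ORIGINAL string at the index of the first unmatched segment.
function removeLineStart(search, line) {
  const segmenter = new Intl.Segmenter('ar', { granularity: 'grapheme' });
  const collator = new Intl.Collator('ar', { sensitivity: 'base' });

  const searchLetters = [...segmenter.segment(search)].map((s) => s.segment);
  const lineSegments = [...segmenter.segment(line)];

  if (searchLetters.length === 0 || searchLetters.length > lineSegments.length) {
    return line; // nothing to remove, or the line is shorter than the search text
  }
  const matches = searchLetters.every(
    (letter, idx) => collator.compare(letter, lineSegments[idx].segment) === 0
  );
  if (!matches) return line;

  // `index` is the code-unit offset of a segment within the input string.
  const next = lineSegments[searchLetters.length];
  return next ? line.slice(next.index) : '';
}

// Vocalized "السلام عليكم", written with escapes to keep the code points explicit.
const vocalizedLine = '\u0627\u0644\u0633\u064E\u0651\u0644\u064E\u0627\u0645\u064F \u0639\u064E\u0644\u064E\u064A\u0652\u0643\u064F\u0645';
const remainingLine = removeLineStart('\u0627\u0644\u0633\u0644\u0627\u0645', vocalizedLine);

console.log(remainingLine === vocalizedLine.slice(10)); // true
```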

mm5n2pyu5#

I don't see anything wrong with your code, but here is another approach:

function removeStartWord(string, word) {
  return string.split(' ').filter((_word, index) => index !== 0 || _word.replace(/[^a-zA-Zء-ي]+/g, '') !== word).join(' ');
}

const sampleData = `السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله`;

console.log('sampleData ...', sampleData);
console.log(
  "removeStartWord(sampleData, 'السلام') ...",
  removeStartWord(sampleData, 'السلام')
);
console.log(
  "removeStartWord(sampleData, 'الس') ...",
  removeStartWord(sampleData, 'الس')
);
console.log(
  "removeStartWord(sampleData, 'السلام ') ...",
  removeStartWord(sampleData, 'السلام ')
);
console.log(
  "removeStartWord(sampleData, ' السلام') ...",
  removeStartWord(sampleData, ' السلام')
);

6rqinv9w6#

I have created an npm package that solves this problem; all you have to do is

npm install arabic-utils

In your code you can then do this:

import ArabicString from "arabic-utils";

console.log(ArabicString("السَّلَامُ عَلَيْكُمُ").remove("السلام")) // " عَلَيْكُمُ"

If you want to make sure the text is present only at the start of the string, you can do this

if (ArabicString(originalText).startsWith(stringToRemove)) {
  // Your code here
}

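Also keep in mind that the package does not normalize the token string, so you may need to strip the diacritics from it before using the startsWith or remove methods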

const normalizedToken = ArabicString("السَّلَامُ").removeDiacritics(); // => "السلام"

Package repository: https://github.com/justgo97/arabic-utils
If for some reason you do not want to use a package, this is the code that is needed.
Suppose we have these two strings

const inputText = "السَّلَامُ عَلَيْكُمُ";
const textToRemove = "السلام";

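We will need a function that removes diacritics, since we use it to get the backbone (the bare letters) of the original text. This is what I personally use, but you can use a different approach as needed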

const commonArabicDiacritics = {
  kasra: { value: " ِ" }, //  ِ Arabic kasra - Garshuni: i
  shadda: { value: " ّ" }, //  ّ Arabic shadda - Garshuni
  sukun: { value: " ْ" }, //  ْ Arabic sukun

  fathatan: { value: " ً" }, //  ً Arabic fathatan - Garshuni: an
  kasratan: { value: " ٍ" }, //  ٍ Arabic kasratan - Garshuni: in
  dammatan: { value: " ٌ" }, //  ٌ Arabic dammatan - Garshuni: un
  fatha: { value: " َ" }, //  َ Arabic fatha - Garshuni: a
  damma: { value: " ُ" }, //  ُ Arabic damma - Garshuni: u
};

export const arabicSymbolsArray = Object.values(commonArabicDiacritics).map(
  (symbol) => symbol.value.trim()
);

function removeDiacritics(arabicText){
  return arabicText
    .split("")
    .filter((char) => !arabicSymbolsArray.includes(char))
    .join("");
}

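Then we will need a function that splits the original string into an array of letters, each with its corresponding diacritics, so that we can match it against the length of the string we stripped the diacritics from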

const validArabicLetters = [
  "ا",
  "أ",
  "إ",
  "آ",
  "ب",
  "ت",
  "ث",
  "ج",
  "ح",
  "خ",
  "د",
  "ذ",
  "ر",
  "ز",
  "س",
  "ش",
  "ص",
  "ض",
  "ط",
  "ظ",
  "ع",
  "غ",
  "ف",
  "ق",
  "ك",
  "ل",
  "م",
  "ن",
  "ه",
  "و",
  "ي",
  "ى",
  "ة",
  "ء",
  "ؤ",
  "ئ",
];

function isStringEmpty(str) {
  return str.trim().length === 0;
}

function splitArabicLetters(arabicText) {
  const result = [];

  for (const char of arabicText) {
    if (!isStringEmpty(char) && !validArabicLetters.includes(char)) {
      result[result.length - 1] += char;
    } else {
      result.push(char);
    }
  }

  return result;
}
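If maintaining a fixed letter list is undesirable, the same grouping can be sketched with Unicode property escapes: each cluster is one non-mark character followed by any combining marks. splitIntoClusters is a hypothetical name, and this variant treats every character of category M as a diacritic, which is broader than the eight marks listed earlier.

```javascript
// Hypothetical alternative to a hard-coded letter list.
function splitIntoClusters(text) {
  // \P{M}    = one character that is not a combining mark
  // \p{M}*   = the marks attached to it
  // ^\p{M}+  = stray marks at the very start of the text, kept as their own item
  return text.match(/^\p{M}+|\P{M}\p{M}*/gu) || [];
}

// Vocalized "السلام", written with escapes to keep the code points explicit.
const vocalizedWord = '\u0627\u0644\u0633\u064E\u0651\u0644\u064E\u0627\u0645\u064F';
const clusters = splitIntoClusters(vocalizedWord);

console.log(clusters.length);                    // 6
console.log(clusters.join('') === vocalizedWord); // true
```

Since joining the clusters gives back the input unchanged, removing the first N clusters and joining the rest keeps the remaining diacritics intact.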

Now that we have these functions, we can create the removeFromStart function

function removeFromStart(arabicText, textToRemove) {
  const normalizedText = removeDiacritics(arabicText);

  // Check if the text to remove exists at the start
  if (!normalizedText.startsWith(textToRemove)) {
    // If not found, return the original string
    return arabicText;
  }

  // Find the starting index of the text to remove in the normalized text
  const startIdx = normalizedText.indexOf(textToRemove);

  // Split the original Arabic text into separate letters
  const textSeparated = splitArabicLetters(arabicText);

  // Remove the specified text from the array using splice
  textSeparated.splice(startIdx, textToRemove.length);

  // Join the modified array elements to get the resulting string
  const result = textSeparated.join("");

  // Return the modified string
  return result;
}

That's it:


console.log(removeFromStart(inputText, textToRemove)) // " عَلَيْكُمُ"
