搜索框关键词优先匹配算法
时间:2011-03-10 来源:jannyz
结合jquery的autocomplete的应用,通常为根据输入关键词进行匹配查找给出相关的suggestion选项列表,对于一个关键词(或将用户输入看作一个词进行完全匹配)的查找比较简单,但类似google搜索引擎的搜索输入框则要求多关键词的组合匹配(对于关键词的取舍还要进行语义分析,这部分暂且不说)。
对于多个关键词(以“a b c d”为例,全匹配度M=4)匹配优先级及相应的正则表达式应依次为:
- 匹配所有[匹配度m=4]关键词(忽略空字符,连续且保持原有顺序) :a\s*b\s*c\s*d
- 匹配所有[m=4]关键词(允许不连续,但保持原有顺序) :a.*b.*c.*d
- 匹配所有[m=4]关键词(不要求顺序):(该步暂未实现,寻求好的正则表达)
- 从"a b c d"中去掉一个关键词(从后往前删),即匹配其中[m=3]关键词,得到关键词组合"a b c","a b d","a c d","b c d"共4组,然后对这四组按全匹配度M=3分别做1~3步骤;
- 然后从前面4组中分别按步骤4的方法得到[m=2]的关键词组合,应为"a b","a c","b c";"a b","a d","b d";"a c","a d","c d";"b c","b d","c d",然后distinct一下得到共6组[m=2]的关键词组合,同样按全匹配度M=2分别做1~3步骤;
- 若m>1则继续按步骤5依次对递减的关键词组合进行匹配;
- 最后匹配[m=1]的关键词:a|b|c|d
C#中的算法实现如下:
1. 在"public static class DataFactory"类中添加通用方法:
public static IEnumerable<TSource> FilterList<TSource>
(
IEnumerable<TSource> source
, String term
, int? maxCount
, IEqualityComparer<TSource> comparer
, Func<TSource, Regex, bool> regexMatch
)
{
if (!maxCount.HasValue)
{
maxCount = int.MaxValue;
}
//简单提取关键词
term = Regex.Replace(term, @"(\*|\^|\\|\$|\.|\+|\?|\=|\!|\:|\||\/|\(|\)|\[|\]|\{|\})", @"\$1");
term = term.ToUpper(System.Threading.Thread.CurrentThread.CurrentUICulture);
term = Regex.Replace(term, " +", "|");
term = term.Trim(new char[] { '|' });
//按关键词组合的优先匹配规则生成匹配正则表达式列表
int cnt = term.Split('|').Length;
List<string> regList = new List<string>();
List<string> buildWords = new List<string>(new string[] { term });
do
{
List<string> tempBuildWords = new List<string>();
for (int widx = 0; widx < buildWords.Count; widx++)
{
string tempTerm = buildWords[widx];
regList.AddRange(new string[]{
tempTerm.Replace("|", "\\s*"), //步骤1
tempTerm.Replace("|", ".*") //步骤2
});
string[] words = tempTerm.Split('|');
for (int i = words.Length - 1; i >= 0; i--)
{
tempBuildWords.Add(("|" + tempTerm + "|").Replace("|" + words[i] + "|", "|").Trim(new char[] { '|' }));
}
}
buildWords = tempBuildWords.Distinct<string>().ToList<string>();
cnt--;
} while (cnt > 1);
//match a keyword at least
regList.Add(term);
//按优先级从高到低匹配返回最终结果
List<TSource> matchList = new List<TSource>();
foreach (string regKey in regList)
{
Regex regex = new Regex(regKey, RegexOptions.IgnoreCase);
int targetCount = maxCount.Value - matchList.Count<TSource>();
foreach (var item in source.Except<TSource>(matchList, comparer))
{
if (regexMatch(item, regex))
{
matchList.Add(item);
if (--targetCount == 0)
{
return matchList;
}
}
}
}
return matchList;
}
2. 在实体工厂类中添加实例方法:
public class ProductFactory : BaseFactory, IFactory
{
private List<Product> linerProductList = new List<Product>();
//……
public IEnumerable<Product> FilterFromAllPESProducts(String term)
{
return FilterFromAllPESProducts(term, null);
}
public IEnumerable<Product> FilterFromAllPESProducts(String term, int? maxCount)
{
ProductEqualityComparer comparer = new ProductEqualityComparer();
var pesProducts = this.linerProductList.Where(p => p.PESID.HasValue).Distinct(comparer);
//调用静态搜索方法
return DataFactory.FilterList<Product>(pesProducts, term, maxCount, comparer, ProductRegexMatch);
}
//设定具体实例的匹配规则
public bool ProductRegexMatch(Product item, Regex regex)
{
return regex.IsMatch(item.Name);
}
}
/// <summary>
/// Equality comparer for product
/// Used to remove duplicate products
/// TODO: revise later
/// </summary>
public class ProductEqualityComparer : IEqualityComparer<Product>
{
#region IEqualityComparer<Product> Members
bool IEqualityComparer<Product>.Equals(Product x, Product y)
{
return x.ID == y.ID;
}
int IEqualityComparer<Product>.GetHashCode(Product obj)
{
return obj.ID.GetHashCode();
}
#endregion
}
3. 应用层或业务逻辑层调用实体工厂类中的搜索方法:
[OperationContract]
[AspNetCacheProfile("ClientCacheProfile")]
[WebGet(ResponseFormat = WebMessageFormat.Json)]
public IEnumerable<product> FilterProducts(String term)
{
IEnumerable<product> products = null;
products = DataFactory.GetProductFactory(Context.Current).FilterFromAllPESProducts(term, Constant.MaxItemOfAutoComplete);
if (products == null || products.Count() == 0)
{
products = new List<product>();
(products as List<product>).Add(new Product()
{
ID = 0,
PESID = 0,
Name = Resources.String.Filter_NoProgramMatch
});
}
return products;
}
至此,搜索出来的结果基本能满足一般项目的需求。其中步骤3的匹配规则未实现,待后续补充……
相关阅读 更多 +










