搜索框关键词优先匹配算法
时间:2011-03-10 来源:jannyz
结合jquery的autocomplete的应用,通常为根据输入关键词进行匹配查找给出相关的suggestion选项列表,对于一个关键词(或将用户输入看作一个词进行完全匹配)的查找比较简单,但类似google搜索引擎的搜索输入框则要求多关键词的组合匹配(对于关键词的取舍还要进行语义分析,这部分暂且不说)。
对于多个关键词(以“a b c d”为例,全匹配度M=4)匹配优先级及相应的正则表达式应依次为:
- 匹配所有[匹配度m=4]关键词(忽略空字符,连续且保持原有顺序) :a\s*b\s*c\s*d
- 匹配所有[m=4]关键词(允许不连续,但保持原有顺序) :a.*b.*c.*d
- 匹配所有[m=4]关键词(不要求顺序):(该步暂未实现,寻求好的正则表达)
- 从"a b c d"中去掉一个关键词(从后往前删),即匹配其中[m=3]关键词,得到关键词组合"a b c","a b d","a c d","b c d"共4组,然后对这四组按全匹配度M=3分别做1~3步骤;
- 然后从前面4组中分别按步骤4的方法得到[m=2]的关键词组合,应为"a b","a c","b c";"a b","a d","b d";"a c","a d","c d";"b c","b d","c d",然后distinct一下得到共6组[m=2]的关键词组合,同样按全匹配度M=2分别做1~3步骤;
- 若m>1则继续按步骤5依次对递减的关键词组合进行匹配;
- 最后匹配[m=1]的关键词:a|b|c|d
C#中的算法实现如下:
1. 在"public static class DataFactory"类中添加通用方法:
public static IEnumerable<TSource> FilterList<TSource> ( IEnumerable<TSource> source , String term , int? maxCount , IEqualityComparer<TSource> comparer , Func<TSource, Regex, bool> regexMatch ) { if (!maxCount.HasValue) { maxCount = int.MaxValue; } //简单提取关键词 term = Regex.Replace(term, @"(\*|\^|\\|\$|\.|\+|\?|\=|\!|\:|\||\/|\(|\)|\[|\]|\{|\})", @"\$1"); term = term.ToUpper(System.Threading.Thread.CurrentThread.CurrentUICulture); term = Regex.Replace(term, " +", "|"); term = term.Trim(new char[] { '|' }); //按关键词组合的优先匹配规则生成匹配正则表达式列表 int cnt = term.Split('|').Length; List<string> regList = new List<string>(); List<string> buildWords = new List<string>(new string[] { term }); do { List<string> tempBuildWords = new List<string>(); for (int widx = 0; widx < buildWords.Count; widx++) { string tempTerm = buildWords[widx]; regList.AddRange(new string[]{ tempTerm.Replace("|", "\\s*"), //步骤1 tempTerm.Replace("|", ".*") //步骤2 }); string[] words = tempTerm.Split('|'); for (int i = words.Length - 1; i >= 0; i--) { tempBuildWords.Add(("|" + tempTerm + "|").Replace("|" + words[i] + "|", "|").Trim(new char[] { '|' })); } } buildWords = tempBuildWords.Distinct<string>().ToList<string>(); cnt--; } while (cnt > 1); //match a keyword at least regList.Add(term); //按优先级从高到低匹配返回最终结果 List<TSource> matchList = new List<TSource>(); foreach (string regKey in regList) { Regex regex = new Regex(regKey, RegexOptions.IgnoreCase); int targetCount = maxCount.Value - matchList.Count<TSource>(); foreach (var item in source.Except<TSource>(matchList, comparer)) { if (regexMatch(item, regex)) { matchList.Add(item); if (--targetCount == 0) { return matchList; } } } } return matchList; }
2. 在实体工厂类中添加实例方法:
public class ProductFactory : BaseFactory, IFactory { private List<Product> linerProductList = new List<Product>(); //…… public IEnumerable<Product> FilterFromAllPESProducts(String term) { return FilterFromAllPESProducts(term, null); } public IEnumerable<Product> FilterFromAllPESProducts(String term, int? maxCount) { ProductEqualityComparer comparer = new ProductEqualityComparer(); var pesProducts = this.linerProductList.Where(p => p.PESID.HasValue).Distinct(comparer); //调用静态搜索方法 return DataFactory.FilterList<Product>(pesProducts, term, maxCount, comparer, ProductRegexMatch); } //设定具体实例的匹配规则 public bool ProductRegexMatch(Product item, Regex regex) { return regex.IsMatch(item.Name); } } /// <summary> /// Equality comparer for product /// Used to remove duplicate products /// TODO: revise later /// </summary> public class ProductEqualityComparer : IEqualityComparer<Product> { #region IEqualityComparer<Product> Members bool IEqualityComparer<Product>.Equals(Product x, Product y) { return x.ID == y.ID; } int IEqualityComparer<Product>.GetHashCode(Product obj) { return obj.ID.GetHashCode(); } #endregion }
3. 应用层或业务逻辑层调用实体工厂类中的搜索方法:
[OperationContract] [AspNetCacheProfile("ClientCacheProfile")] [WebGet(ResponseFormat = WebMessageFormat.Json)] public IEnumerable<product> FilterProducts(String term) { IEnumerable<product> products = null; products = DataFactory.GetProductFactory(Context.Current).FilterFromAllPESProducts(term, Constant.MaxItemOfAutoComplete); if (products == null || products.Count() == 0) { products = new List<product>(); (products as List<product>).Add(new Product() { ID = 0, PESID = 0, Name = Resources.String.Filter_NoProgramMatch }); } return products; }
至此,搜索出来的结果基本能满足一般项目的需求。其中步骤3的匹配规则未实现,待后续补充……
相关阅读 更多 +