且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

SQL 模糊匹配

更新时间:2023-09-03 18:46:40

一个相当快速的领域特定解决方案可能是使用 SOUNDEX 和 2 个字符串之间的数字距离来计算字符串相似度.只有当您拥有大量产品代码时,这才会真正有用.

A rather quick domain specific solution may be to calculate a string similarity using SOUNDEX and a numeric distance between 2 strings. This will only really help when you have a lot of product codes.

使用像下面这样的简单 UDF,您可以从字符串中提取数字字符,这样您就可以从CLC 2200npk"中获取 2200 个字符,从CLC 1100"中获取 1100 个字符,因此您现在可以根据 SOUNDEX 输出确定接近度每个输入以及每个输入的数字分量的接近度.

Using a simple UDF like below you can extract the numeric chars from a string so that you can then get 2200 out of 'CLC 2200npk' and 1100 out of 'CLC 1100' so you can now determine closeness based on the SOUNDEX output of each input as well as closeness of the numeric component of each input.

CREATE Function [dbo].[ExtractNumeric](@input VARCHAR(1000))
RETURNS INT
AS
BEGIN
    WHILE PATINDEX('%[^0-9]%', @input) > 0
    BEGIN
        SET @input = STUFF(@input, PATINDEX('%[^0-9]%', @input), 1, '')
    END
    IF @input = '' OR @input IS NULL
        SET @input = '0'
    RETURN CAST(@input AS INT)
END
GO

就通用算法而言,有几种可能会根据数据集大小和性能要求帮助您取得不同程度的成功.(两个链接都有可用的 TSQL 实现)

As far as general purpose algorithms go there are a couple which might help you with varying degrees of success depending on data set size and performance requirements. (both links have TSQL implementations available)

  • Double Metaphone - 这种算法会给你一个比 soundex 更好的匹配,代价是虽然速度很快,但对于拼写纠正来说确实很有用.
  • Levenshtein 距离 - 这将计算如何将一个字符串转换为另一个字符串需要多次按键,例如从CLC 2200npk"到CLC 2200"是 3,而从CLC 2200npk"到CLC 1100"是 5.
  • Double Metaphone - This algo will give you a better match than soundex at the cost of speed it is really good for spelling correction though.
  • Levenshtein Distance - This will calculate how many keypresses it would take to turn one string into another for instance to get from 'CLC 2200npk' to 'CLC 2200' is 3, while from 'CLC 2200npk' to 'CLC 1100' is 5.

这里是一篇有趣的文章,它同时应用了这两种算法可能会给你一些想法.

Here is an interesting article which applies both algos together which may give you a few ideas.

希望其中的一些帮助有点小.

Well hopefully some of that helps a little.

这里是一个更快的部分 Levenshtein 距离实现(阅读帖子它不会返回与正常结果完全相同的结果).在我的 125000 行测试表上,它在 6 秒内运行,而我链接到的第一个表则为 60 秒.

Here is a much faster partial Levenshtein Distance implementation (read the post it wont return exact same results as the normal one). On my test table of 125000 rows it runs in 6 seconds compared to 60 seconds for the first one I linked to.