更新时间:2023-09-03 18:46:40
一个相当快速的领域特定解决方案可能是使用 SOUNDEX 和 2 个字符串之间的数字距离来计算字符串相似度.只有当您拥有大量产品代码时,这才会真正有用.
A rather quick domain specific solution may be to calculate a string similarity using SOUNDEX and a numeric distance between 2 strings. This will only really help when you have a lot of product codes.
使用像下面这样的简单 UDF,您可以从字符串中提取数字字符,这样您就可以从CLC 2200npk"中获取 2200 个字符,从CLC 1100"中获取 1100 个字符,因此您现在可以根据 SOUNDEX 输出确定接近度每个输入以及每个输入的数字分量的接近度.
Using a simple UDF like below you can extract the numeric chars from a string so that you can then get 2200 out of 'CLC 2200npk' and 1100 out of 'CLC 1100' so you can now determine closeness based on the SOUNDEX output of each input as well as closeness of the numeric component of each input.
CREATE Function [dbo].[ExtractNumeric](@input VARCHAR(1000))
RETURNS INT
AS
BEGIN
WHILE PATINDEX('%[^0-9]%', @input) > 0
BEGIN
SET @input = STUFF(@input, PATINDEX('%[^0-9]%', @input), 1, '')
END
IF @input = '' OR @input IS NULL
SET @input = '0'
RETURN CAST(@input AS INT)
END
GO
就通用算法而言,有几种可能会根据数据集大小和性能要求帮助您取得不同程度的成功.(两个链接都有可用的 TSQL 实现)
As far as general purpose algorithms go there are a couple which might help you with varying degrees of success depending on data set size and performance requirements. (both links have TSQL implementations available)
这里是一篇有趣的文章,它同时应用了这两种算法可能会给你一些想法.
Here is an interesting article which applies both algos together which may give you a few ideas.
希望其中的一些帮助有点小.
Well hopefully some of that helps a little.
这里是一个更快的部分 Levenshtein 距离实现(阅读帖子它不会返回与正常结果完全相同的结果).在我的 125000 行测试表上,它在 6 秒内运行,而我链接到的第一个表则为 60 秒.
Here is a much faster partial Levenshtein Distance implementation (read the post it wont return exact same results as the normal one). On my test table of 125000 rows it runs in 6 seconds compared to 60 seconds for the first one I linked to.