T-SQL을 사용한 퍼지 매칭
개인 데이터 등이 포함된 테이블이 있습니다.많은 열이 있지만 여기서 한 번 주목해야 할 부분은 다음과 같습니다.addressindex
,lastname
★★★★★★★★★★★★★★★★★」firstname
서 ''는addressindex
아파트 문까지 드릴로 뚫린 고유 주소입니다. '두 명 '아래처럼'이 두 명 lastname
는 ★★★★★★★★★★★★★★★★★★★.firstnames
중복될 가능성이 가장 큽니다.
이 복제품들을 나열할 방법이 필요합니다.
tabledata:
personid 1
firstname "Carl"
lastname "Anderson"
addressindex 1
personid 2
firstname "Carl Peter"
lastname "Anderson"
addressindex 1
모든 열에 정확히 일치해야 하는 경우 이 방법을 알고 있지만 위의 예에서 나온 결과를 사용하여 다음과 같은 트릭을 수행하려면 퍼지 매치가 필요합니다.
Row personid addressindex lastname firstname
1 2 1 Anderson Carl Peter
2 1 1 Anderson Carl
.....
어떻게 하면 이 문제를 잘 해결할 수 있을까요?
SQL Server에서 제공하는 퍼지 매칭은 매우 투박하다는 것을 알게 되었습니다.Levenshtein 거리 알고리즘과 가중치를 사용한 CLR 기능은 정말 운이 좋았습니다.이 알고리즘을 사용하여 GetSimilarityScore라는 UDF를 만들었습니다.이 UDF는 2개의 스트링을 사용하여 0.0에서 1.0 사이의 점수를 반환합니다.1.0에 가까울수록 좋습니다.그런 다음 임계값 >=0.8 정도로 쿼리하여 가장 가능성이 높은 일치를 가져옵니다.다음과 같은 경우:
if object_id('tempdb..#similar') is not null drop table #similar
select a.id, (
select top 1 x.id
from MyTable x
where x.id <> a.id
order by dbo.GetSimilarityScore(a.MyField, x.MyField) desc
) as MostSimilarId
into #similar
from MyTable a
select *, dbo.GetSimilarityScore(a.MyField, c.MyField)
from MyTable a
join #similar b on a.id = b.id
join MyTable c on b.MostSimilarId = c.id
정말 큰 테이블에서는 하지 마세요.느린 과정입니다.
CLR UDF는 다음과 같습니다.
''' <summary>
''' Compute the distance between two strings.
''' </summary>
''' <param name="s1">The first of the two strings.</param>
''' <param name="s2">The second of the two strings.</param>
''' <returns>The Levenshtein cost.</returns>
<Microsoft.SqlServer.Server.SqlFunction()> _
Public Shared Function ComputeLevenstheinDistance(ByVal string1 As SqlString, ByVal string2 As SqlString) As SqlInt32
If string1.IsNull OrElse string2.IsNull Then Return SqlInt32.Null
Dim s1 As String = string1.Value
Dim s2 As String = string2.Value
Dim n As Integer = s1.Length
Dim m As Integer = s2.Length
Dim d As Integer(,) = New Integer(n, m) {}
' Step 1
If n = 0 Then Return m
If m = 0 Then Return n
' Step 2
For i As Integer = 0 To n
d(i, 0) = i
Next
For j As Integer = 0 To m
d(0, j) = j
Next
' Step 3
For i As Integer = 1 To n
'Step 4
For j As Integer = 1 To m
' Step 5
Dim cost As Integer = If((s2(j - 1) = s1(i - 1)), 0, 1)
' Step 6
d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
Next
Next
' Step 7
Return d(n, m)
End Function
''' <summary>
''' Returns a score between 0.0-1.0 indicating how closely two strings match. 1.0 is a 100%
''' T-SQL equality match, and the score goes down from there towards 0.0 for less similar strings.
''' </summary>
<Microsoft.SqlServer.Server.SqlFunction()> _
Public Shared Function GetSimilarityScore(string1 As SqlString, string2 As SqlString) As SqlDouble
If string1.IsNull OrElse string2.IsNull Then Return SqlInt32.Null
Dim s1 As String = string1.Value.ToUpper().TrimEnd(" "c)
Dim s2 As String = string2.Value.ToUpper().TrimEnd(" "c)
If s1 = s2 Then Return 1.0F ' At this point, T-SQL would consider them the same, so I will too
Dim flatLevScore As Double = InternalGetSimilarityScore(s1, s2)
Dim letterS1 As String = GetLetterSimilarityString(s1)
Dim letterS2 As String = GetLetterSimilarityString(s2)
Dim letterScore As Double = InternalGetSimilarityScore(letterS1, letterS2)
'Dim wordS1 As String = GetWordSimilarityString(s1)
'Dim wordS2 As String = GetWordSimilarityString(s2)
'Dim wordScore As Double = InternalGetSimilarityScore(wordS1, wordS2)
If flatLevScore = 1.0F AndAlso letterScore = 1.0F Then Return 1.0F
If flatLevScore = 0.0F AndAlso letterScore = 0.0F Then Return 0.0F
' Return weighted result
Return (flatLevScore * 0.2F) + (letterScore * 0.8F)
End Function
Private Shared Function InternalGetSimilarityScore(s1 As String, s2 As String) As Double
Dim dist As SqlInt32 = ComputeLevenstheinDistance(s1, s2)
Dim maxLen As Integer = If(s1.Length > s2.Length, s1.Length, s2.Length)
If maxLen = 0 Then Return 1.0F
Return 1.0F - Convert.ToDouble(dist.Value) / Convert.ToDouble(maxLen)
End Function
''' <summary>
''' Sorts all the alpha numeric characters in the string in alphabetical order
''' and removes everything else.
''' </summary>
Private Shared Function GetLetterSimilarityString(s1 As String) As String
Dim allChars = If(s1, "").ToUpper().ToCharArray()
Array.Sort(allChars)
Dim result As New StringBuilder()
For Each ch As Char In allChars
If Char.IsLetterOrDigit(ch) Then
result.Append(ch)
End If
Next
Return result.ToString()
End Function
''' <summary>
''' Removes all non-alpha numeric characters and then sorts
''' the words in alphabetical order.
''' </summary>
Private Shared Function GetWordSimilarityString(s1 As String) As String
Dim words As New List(Of String)()
Dim curWord As StringBuilder = Nothing
For Each ch As Char In If(s1, "").ToUpper()
If Char.IsLetterOrDigit(ch) Then
If curWord Is Nothing Then
curWord = New StringBuilder()
End If
curWord.Append(ch)
Else
If curWord IsNot Nothing Then
words.Add(curWord.ToString())
curWord = Nothing
End If
End If
Next
If curWord IsNot Nothing Then
words.Add(curWord.ToString())
End If
words.Sort(StringComparer.OrdinalIgnoreCase)
Return String.Join(" ", words.ToArray())
End Function
여기에 있는 다른 유용한 정보 외에도 일반적으로 SOUNDEX보다 더 낫다고 여겨지는 이중 메타포네 음성 알고리즘을 사용하는 것을 고려해 보는 것이 좋습니다.
Tim Feiffer는 기사 Double Metaphone Sounds Great Convert the C++ Double Metaphone 알고리즘을 T-SQL(원래는 SQL Mag에서, 그 후 SQL Server Pro에서)에서 SQL에서의 구현에 대해 자세히 설명합니다.
이름 철자가 약간 틀리면 칼 대 칼 대 칼과 같은 이름을 일치시키는 데 도움이 됩니다.칼.
업데이트: 실제 다운로드 가능한 코드가 사라진 것 같습니다만, 여기 github repo에서 발견된 구현이 있습니다.원래 코드를 복제한 것으로 보입니다.
SQL Server 전체 텍스트 인덱싱을 사용하면 검색 및 반환을 수행할 수 있으며 단어뿐만 아니라 철자 오류도 있을 수 있습니다.
Master Data Services의 첫 출시 이후 SOUNDEX가 구현하는 것보다 더 고급 퍼지 논리 알고리즘에 액세스할 수 있게 되었습니다.따라서 MDS가 설치되어 있는 경우 mdq 스키마(MDS 데이터베이스)에서 유사성()이라는 함수를 찾을 수 있습니다.
동작의 상세한 것에 대하여는, http://blog.hoegaerden.be/2011/02/05/finding-similar-strings-with-fuzzy-logic-functions-built-into-mds/ 를 참조해 주세요.
저는 개인적으로 Jaro-Winkler 알고리즘의 CLR 실장을 사용하고 있습니다.이 알고리즘은 약 15자를 넘는 문자열로 조금 고생하고 이메일 주소의 매칭을 좋아하지 않지만, 그 이외는 꽤 좋습니다.전체 실장 가이드는 여기에서 찾을 수 있습니다.
다음은 t-sql의 샘플입니다.
SELECT *
FROM myTable1 as t1
INNER JOIN myTable2 as t2
ON dbo.StringDistance(t1.MyField, t2.MyField) > 0.85
Soundex에 비해 이 접근법의 장점은 오타를 더 쉽게 처리할 수 있다는 것입니다. 오타가 다른 소리를 낼 경우 SOUNDEX와 올바르게 일치하지 않기 때문입니다.
예를 들어, "type" 대신 "ytpe"를 입력하면 SOUNDEX에서 올바른 일치를 알 수 없습니다.SOUNDEX('ytpe') 및 SOUNDEX('type')에 대해 반환되는 코드는 다음과 같습니다.
Ytpe Y310
T100 타입
여기서 Jaro-Winkler String Distance('ytpe' 'type')를 사용하면 0.91(6)을 얻을 수 있습니다.이것은 좋은 일치를 의미합니다.1은 정확히 일치하므로 0.91이 매우 가깝습니다.
어떤 이유로든 CLR 함수를 사용할 수 없는 경우 SSIS 패키지(퍼지 변환 룩업 사용)를 통해 데이터를 실행해 보십시오.상세한 내용은 여기를 참조하십시오.
코드 파트 2 Redfilter 응답 링크의 경우:
참고 자료:
https://github.com/mb16/geocoderNet/blob/master/build/sql/doubleMetaphone.sql
BEGIN
SET @MP1 = @MP1 + 'N'
SET @MP2 = @MP2 + 'KN'
END
ELSE
BEGIN
SET @MP1 = @MP1 + 'KN'
SET @MP2 = @MP2 + 'KN'
END
END
SET @CurrentPosition = @CurrentPosition + 2
END
--'tagliaro'
ELSE IF (dbo.fnStringAt((@CurrentPosition + 1),@Word,'LI')=1) AND
(dbo.fnSlavoGermanic(@Word)=0)
BEGIN
SET @MP1 = @MP1 + 'KL'
SET @MP2 = @MP2 + 'L'
SET @CurrentPosition = @CurrentPosition + 2
END
-- -ges-,-gep-,-gel-, -gie- at beginning
-- This call to fnStringAt() is the 'worst case' in number of values passed. A UDF that used DEFAULT values instead of
-- a multi-valued argument would require ten DEFAULT arguments for EP, EB, EL, etc. (assuming the first was not defined with a DEFAULT).
ELSE IF ((@CurrentPosition = 1)
AND ((SUBSTRING(@Word,@CurrentPosition + 1,1) = 'Y')
OR (dbo.fnStringAt((@CurrentPosition +
1),@Word,'ES,EP,EB,EL,EY,IB,IL,IN,IE,EI,ER')=1)) )
BEGIN
SET @MP1 = @MP1 + 'K'
SET @MP2 = @MP2 + 'J'
SET @CurrentPosition = @CurrentPosition + 2
END
-- -ger-, -gy-
ELSE IF (((dbo.fnStringAt((@CurrentPosition + 1), @Word, 'ER')=1) OR
(SUBSTRING(@Word,@CurrentPosition + 1,1) = 'Y'))
AND (dbo.fnStringAt(1, @Word, 'DANGER,RANGER,MANGER')=0)
AND (dbo.fnStringAt((@CurrentPosition - 1), @Word,
'E,I,RGY,OGY')=0) )
BEGIN
SET @MP1 = @MP1 + 'K'
SET @MP2 = @MP2 + 'J'
SET @CurrentPosition = @CurrentPosition + 2
END
-- italian e.g, 'biaggi'
ELSE IF (dbo.fnStringAt((@CurrentPosition + 1),@Word,'E,I,Y')=1) OR
(dbo.fnStringAt((@CurrentPosition - 1),@Word,'AGGI,OGGI')=1)
BEGIN
--obvious germanic
IF ((dbo.fnStringAt(1,@Word,'VAN ,VON ,SCH')=1)
OR (dbo.fnStringAt((@CurrentPosition + 1),@Word,'ET')=1))
BEGIN
SET @MP1 = @MP1 + 'K'
SET @MP2 = @MP2 + 'K'
END
ELSE
BEGIN
--always soft if french ending
IF (dbo.fnStringAt((@CurrentPosition + 1),@Word,'IER ')=1)
BEGIN
SET @MP1 = @MP1 + 'J'
SET @MP2 = @MP2 + 'J'
END
ELSE
BEGIN
SET @MP1 = @MP1 + 'J'
SET @MP2 = @MP2 + 'K'
END
END
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE
BEGIN
IF (SUBSTRING(@Word,@CurrentPosition + 1,1) = 'G')
BEGIN
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE
BEGIN
SET @CurrentPosition = @CurrentPosition + 1
END
SET @MP1 = @MP1 + 'K'
SET @MP2 = @MP2 + 'K'
END
END
ELSE IF @CurrentChar = 'H'
BEGIN
--only keep if first & before vowel or btw. 2 vowels
IF (((@CurrentPosition = 1) OR
(dbo.fnIsVowel(SUBSTRING(@Word,@CurrentPosition - 1,1))=1))
AND (dbo.fnIsVowel(SUBSTRING(@Word,@CurrentPosition + 1,1))=1))
BEGIN
SET @MP1 = @MP1 + 'H'
SET @MP2 = @MP2 + 'H'
SET @CurrentPosition = @CurrentPosition + 2
END
--also takes care of 'HH'
ELSE
BEGIN
SET @CurrentPosition = @CurrentPosition + 1
END
END
ELSE IF @CurrentChar = 'J'
BEGIN
--obvious spanish, 'jose', 'san jacinto'
IF (dbo.fnStringAt(@CurrentPosition,@Word,'JOSE')=1) OR
(dbo.fnStringAt(1,@Word,'SAN ')=1)
BEGIN
IF (((@CurrentPosition = 1) AND (SUBSTRING(@Word,@CurrentPosition
+ 4,1) = ' ')) OR (dbo.fnStringAt(1,@Word,'SAN ')=1) )
BEGIN
SET @MP1 = @MP1 + 'H'
SET @MP2 = @MP2 + 'H'
END
ELSE
BEGIN
SET @MP1 = @MP1 + 'J'
SET @MP2 = @MP2 + 'H'
END
SET @CurrentPosition = @CurrentPosition + 1
END
ELSE IF ((@CurrentPosition = 1) AND
(dbo.fnStringAt(@CurrentPosition,@Word,'JOSE')=0))
BEGIN
SET @MP1 = @MP1 + 'J'
--Yankelovich/Jankelowicz
SET @MP2 = @MP2 + 'A'
--it could happen!
IF (SUBSTRING(@Word,@CurrentPosition + 1,1) = 'J')
BEGIN
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE
BEGIN
SET @CurrentPosition = @CurrentPosition + 1
END
END
ELSE
BEGIN
--spanish pron. of e.g. 'bajador'
IF( (dbo.fnIsVowel(SUBSTRING(@Word,@CurrentPosition - 1,1))=1)
AND (dbo.fnSlavoGermanic(@Word)=0)
AND ((SUBSTRING(@Word,@CurrentPosition + 1,1) = 'A') OR
(SUBSTRING(@Word,@CurrentPosition + 1,1) = 'O')))
BEGIN
SET @MP1 = @MP1 + 'J'
SET @MP2 = @MP2 + 'H'
END
ELSE
BEGIN
IF (@CurrentPosition = @WordLength)
BEGIN
SET @MP1 = @MP1 + 'J'
SET @MP2 = @MP2 + ''
END
ELSE
BEGIN
IF ((dbo.fnStringAt((@CurrentPosition + 1), @Word,
'L,T,K,S,N,M,B,Z')=0)
AND (dbo.fnStringAt((@CurrentPosition - 1), @Word,
'S,K,L')=0))
BEGIN
SET @MP1 = @MP1 + 'J'
SET @MP2 = @MP2 + 'J'
END
END
END
--it could happen!
IF (SUBSTRING(@Word,@CurrentPosition + 1,1) = 'J')
BEGIN
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE
BEGIN
SET @CurrentPosition = @CurrentPosition + 1
END
END
END
ELSE IF @CurrentChar = 'K'
BEGIN
IF (SUBSTRING(@Word,@CurrentPosition + 1,1) = 'K')
BEGIN
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE
BEGIN
SET @CurrentPosition = @CurrentPosition + 1
END
SET @MP1 = @MP1 + 'K'
SET @MP2 = @MP2 + 'K'
END
ELSE IF @CurrentChar = 'L'
BEGIN
IF (SUBSTRING(@Word,@CurrentPosition + 1,1) = 'L')
BEGIN
--spanish e.g. 'cabrillo', 'gallegos'
IF (((@CurrentPosition = (@WordLength - 3))
AND (dbo.fnStringAt((@CurrentPosition -
1),@Word,'ILLO,ILLA,ALLE')=1))
OR (((dbo.fnStringAt((@WordLength - 1),@Word,'AS,OS')=1)
OR (dbo.fnStringAt(@WordLength,@Word,'A,O')=1))
AND (dbo.fnStringAt((@CurrentPosition -
1),@Word,'ALLE')=1)) )
BEGIN
SET @MP1 = @MP1 + 'L'
SET @MP2 = @MP2 + ''
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE
BEGIN
SET @CurrentPosition = @CurrentPosition + 2
SET @MP1 = @MP1 + 'L'
SET @MP2 = @MP2 + 'L'
END
END
ELSE
BEGIN
SET @CurrentPosition = @CurrentPosition + 1
SET @MP1 = @MP1 + 'L'
SET @MP2 = @MP2 + 'L'
END
END
ELSE IF @CurrentChar = 'M'
BEGIN
--'dumb','thumb'
IF (((dbo.fnStringAt((@CurrentPosition - 1), @Word,'UMB')=1)
AND (((@CurrentPosition + 1) = @WordLength) OR
(dbo.fnStringAt((@CurrentPosition + 2),@Word,'ER')=1)))
OR (SUBSTRING(@Word,@CurrentPosition + 1,1) = 'M') )
BEGIN
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE
BEGIN
SET @CurrentPosition = @CurrentPosition + 1
END
SET @MP1 = @MP1 + 'M'
SET @MP2 = @MP2 + 'M'
END
ELSE IF @CurrentChar = 'N'
BEGIN
IF (SUBSTRING(@Word,@CurrentPosition + 1,1) = 'N')
BEGIN
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE
BEGIN
SET @CurrentPosition = @CurrentPosition + 1
END
SET @MP1 = @MP1 + 'N'
SET @MP2 = @MP2 + 'N'
END
ELSE IF @CurrentChar = 'Ñ'
BEGIN
SET @CurrentPosition = @CurrentPosition + 1
SET @MP1 = @MP1 + 'N'
SET @MP2 = @MP2 + 'N'
END
ELSE IF @CurrentChar = 'P'
BEGIN
--What about Michelle Pfeiffer, star of Grease 2? Price-Pfister?, Pfizer?
--Don't just look for an 'F' next, what about 'topflight', helpful, campfire, leapfrog, stepfather
--Sorry, Mark Knopfler, I don't know how to help you
IF (SUBSTRING(@Word,@CurrentPosition + 1,1) = 'H')
OR ((@CurrentPosition = 1) AND
(SUBSTRING(@Word,@CurrentPosition + 1,1) = 'F') AND
(dbo.fnIsVowel(SUBSTRING(@Word,@CurrentPosition+2,1))=1))
BEGIN
SET @MP1 = @MP1 + 'F'
SET @MP2 = @MP2 + 'F'
SET @CurrentPosition = @CurrentPosition + 2
END
--also account for "campbell", "raspberry"
ELSE
BEGIN
IF (dbo.fnStringAt((@CurrentPosition + 1),@Word, 'P,B')=1)
BEGIN
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE
BEGIN
SET @CurrentPosition = @CurrentPosition + 1
END
SET @MP1 = @MP1 + 'P'
SET @MP2 = @MP2 + 'P'
END
END
ELSE IF @CurrentChar = 'Q'
BEGIN
IF (SUBSTRING(@Word,@CurrentPosition + 1,1) = 'Q')
BEGIN
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE
BEGIN
SET @CurrentPosition = @CurrentPosition + 1
END
SET @MP1 = @MP1 + 'K'
SET @MP2 = @MP2 + 'K'
END
ELSE IF @CurrentChar = 'R'
BEGIN
--QQ: Will SQL short circuit eval? Otherwise, I could try to read before string begins here...
--french e.g. 'rogier', but exclude 'hochmeier'
IF ((@CurrentPosition = @WordLength)
AND (dbo.fnSlavoGermanic(@Word)=0)
AND (dbo.fnStringAt((@CurrentPosition - 2), @Word, 'IE')=1)
AND (dbo.fnStringAt((@CurrentPosition - 4), @Word,
'ME,MA')=0))
BEGIN
SET @MP1 = @MP1 + ''
SET @MP2 = @MP2 + 'R'
END
ELSE
BEGIN
SET @MP1 = @MP1 + 'R'
SET @MP2 = @MP2 + 'R'
END
IF (SUBSTRING(@Word,@CurrentPosition + 1,1) = 'R')
BEGIN
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE
BEGIN
SET @CurrentPosition = @CurrentPosition + 1
END
END
ELSE IF @CurrentChar = 'S'
BEGIN
--special cases 'island', 'isle', 'carlisle', 'carlysle'
IF (dbo.fnStringAt((@CurrentPosition - 1), @Word, 'ISL,YSL')=1)
BEGIN
SET @CurrentPosition = @CurrentPosition + 1
END
--special case 'sugar-'
ELSE IF ((@CurrentPosition = 1) AND (dbo.fnStringAt(@CurrentPosition,
@Word, 'SUGAR')=1))
BEGIN
SET @MP1 = @MP1 + 'X'
SET @MP2 = @MP2 + 'S'
SET @CurrentPosition = @CurrentPosition + 1
END
ELSE IF (dbo.fnStringAt(@CurrentPosition, @Word, 'SH')=1)
BEGIN
--germanic
IF (dbo.fnStringAt((@CurrentPosition + 1), @Word,
'HEIM,HOEK,HOLM,HOLZ')=1)
BEGIN
SET @MP1 = @MP1 + 'S'
SET @MP2 = @MP2 + 'S'
END
ELSE
BEGIN
SET @MP1 = @MP1 + 'X'
SET @MP2 = @MP2 + 'X'
END
SET @CurrentPosition = @CurrentPosition + 2
END
--italian & armenian
ELSE IF (dbo.fnStringAt(@CurrentPosition, @Word, 'SIO,SIA')=1) OR
(dbo.fnStringAt(@CurrentPosition, @Word, 'SIAN')=1)
BEGIN
IF (dbo.fnSlavoGermanic(@Word)=0)
BEGIN
SET @MP1 = @MP1 + 'S'
SET @MP2 = @MP2 + 'X'
END
ELSE
BEGIN
SET @MP1 = @MP1 + 'S'
SET @MP2 = @MP2 + 'S'
END
SET @CurrentPosition = @CurrentPosition + 3
END
--german & anglicisations, e.g. 'smith' match 'schmidt', 'snider' match 'schneider'
--also, -sz- in slavic language altho in hungarian it is pronounced 's'
ELSE IF (((@CurrentPosition = 1)
AND (dbo.fnStringAt((@CurrentPosition + 1), @Word, 'M,N,L,W')=1))
OR (dbo.fnStringAt((@CurrentPosition + 1), @Word, 'Z')=1))
BEGIN
SET @MP1 = @MP1 + 'S'
SET @MP2 = @MP2 + 'X'
IF (dbo.fnStringAt((@CurrentPosition + 1), @Word, 'Z')=1)
BEGIN
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE
BEGIN
SET @CurrentPosition = @CurrentPosition + 1
END
END
ELSE IF (dbo.fnStringAt(@CurrentPosition, @Word, 'SC')=1)
BEGIN
--Schlesinger's rule
IF (SUBSTRING(@Word,@CurrentPosition + 2,1) = 'H')
BEGIN
--dutch origin, e.g. 'school', 'schooner'
IF (dbo.fnStringAt((@CurrentPosition + 3), @Word,
'OO,ER,EN,UY,ED,EM')=1)
BEGIN
--'schermerhorn', 'schenker'
IF (dbo.fnStringAt((@CurrentPosition + 3), @Word, 'ER,EN')=1)
BEGIN
SET @MP1 = @MP1 + 'X'
SET @MP2 = @MP2 + 'SK'
END
ELSE
BEGIN
SET @MP1 = @MP1 + 'SK'
SET @MP2 = @MP2 + 'SK'
END
SET @CurrentPosition = @CurrentPosition + 3
END
ELSE
BEGIN
IF ((@CurrentPosition = 1) AND
(dbo.fnIsVowel(SUBSTRING(@Word,3,1))=0) AND (SUBSTRING(@Word,3,1) <> 'W'))
BEGIN
SET @MP1 = @MP1 + 'X'
SET @MP2 = @MP2 + 'S'
END
ELSE
BEGIN
SET @MP1 = @MP1 + 'X'
SET @MP2 = @MP2 + 'X'
END
SET @CurrentPosition = @CurrentPosition + 3
END
END
ELSE IF (dbo.fnStringAt((@CurrentPosition + 2), @Word, 'I,E,Y')=1)
BEGIN
SET @MP1 = @MP1 + 'S'
SET @MP2 = @MP2 + 'S'
SET @CurrentPosition = @CurrentPosition + 3
END
ELSE
BEGIN
SET @MP1 = @MP1 + 'SK'
SET @MP2 = @MP2 + 'SK'
SET @CurrentPosition = @CurrentPosition + 3
END
END
ELSE
BEGIN
--french e.g. 'resnais', 'artois'
IF ((@CurrentPosition = @WordLength) AND
(dbo.fnStringAt((@CurrentPosition - 2), @Word, 'AI,OI')=1))
BEGIN
SET @MP1 = @MP1 + ''
SET @MP2 = @MP2 + 'S'
END
ELSE
BEGIN
SET @MP1 = @MP1 + 'S'
SET @MP2 = @MP2 + 'S'
END
IF (dbo.fnStringAt((@CurrentPosition + 1), @Word, 'S,Z')=1)
BEGIN
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE
BEGIN
SET @CurrentPosition = @CurrentPosition + 1
END
END
END
ELSE IF @CurrentChar = 'T'
BEGIN
IF (dbo.fnStringAt(@CurrentPosition, @Word, 'TION,TIA,TCH')=1)
BEGIN
SET @MP1 = @MP1 + 'X'
SET @MP2 = @MP2 + 'X'
SET @CurrentPosition = @CurrentPosition + 3
END
ELSE IF (dbo.fnStringAt(@CurrentPosition, @Word, 'TH,TTH')=1)
BEGIN
--special case 'thomas', 'thames' or germanic
IF (dbo.fnStringAt((@CurrentPosition + 2), @Word, 'OM,AM')=1)
OR (dbo.fnStringAt(1, @Word, 'VAN ,VON ,SCH')=1)
BEGIN
SET @MP1 = @MP1 + 'T'
SET @MP2 = @MP2 + 'T'
END
ELSE
BEGIN
SET @MP1 = @MP1 + '0'
SET @MP2 = @MP2 + 'T'
END
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE
BEGIN
IF (dbo.fnStringAt((@CurrentPosition + 1), @Word, 'T,D')=1)
BEGIN
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE
BEGIN
SET @CurrentPosition = @CurrentPosition + 1
END
SET @MP1 = @MP1 + 'T'
SET @MP2 = @MP2 + 'T'
END
END
ELSE IF @CurrentChar = 'V'
BEGIN
IF (SUBSTRING(@Word,@CurrentPosition + 1,1) = 'V')
BEGIN
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE
BEGIN
SET @CurrentPosition = @CurrentPosition + 1
END
SET @MP1 = @MP1 + 'F'
SET @MP2 = @MP2 + 'F'
END
ELSE IF @CurrentChar = 'W'
BEGIN
--can also be in middle of word
IF (dbo.fnStringAt(@CurrentPosition, @Word, 'WR')=1)
BEGIN
SET @MP1 = @MP1 + 'R'
SET @MP2 = @MP2 + 'R'
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE IF ((@CurrentPosition = 1)
AND ((dbo.fnIsVowel(SUBSTRING(@Word,@CurrentPosition + 1,1))=1)
OR (dbo.fnStringAt(@CurrentPosition, @Word, 'WH')=1)))
BEGIN
--Wasserman should match Vasserman
IF (dbo.fnIsVowel(SUBSTRING(@Word,@CurrentPosition + 1,1))=1)
BEGIN
SET @MP1 = @MP1 + 'A'
SET @MP2 = @MP2 + 'F'
END
ELSE
BEGIN
--need Uomo to match Womo
SET @MP1 = @MP1 + 'A'
SET @MP2 = @MP2 + 'A'
END
SET @CurrentPosition = @CurrentPosition + 1
END
--Arnow should match Arnoff
ELSE IF (((@CurrentPosition = @WordLength) AND
(dbo.fnIsVowel(SUBSTRING(@Word,@CurrentPosition - 1,1))=1))
OR (dbo.fnStringAt((@CurrentPosition - 1), @Word,
'EWSKI,EWSKY,OWSKI,OWSKY')=1)
OR (dbo.fnStringAt(1, @Word, 'SCH')=1))
BEGIN
SET @MP1 = @MP1 + ''
SET @MP2 = @MP2 + 'F'
SET @CurrentPosition = @CurrentPosition + 1
END
--polish e.g. 'filipowicz'
ELSE IF (dbo.fnStringAt(@CurrentPosition, @Word, 'WICZ,WITZ')=1)
BEGIN
SET @MP1 = @MP1 + 'TS'
SET @MP2 = @MP2 + 'FX'
SET @CurrentPosition = @CurrentPosition + 4
END
-- skip it
ELSE
BEGIN
SET @CurrentPosition = @CurrentPosition + 1
END
END
ELSE IF @CurrentChar = 'X'
BEGIN
--french e.g. breaux
IF (NOT((@CurrentPosition = @WordLength)
AND ((dbo.fnStringAt((@CurrentPosition - 3), @Word, 'IAU,EAU')=1)
OR (dbo.fnStringAt((@CurrentPosition - 2), @Word,
'AU,OU')=1))) )
BEGIN
SET @MP1 = @MP1 + 'KS'
SET @MP2 = @MP2 + 'KS'
END
IF (dbo.fnStringAt((@CurrentPosition + 1), @Word, 'C,X')=1)
BEGIN
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE
BEGIN
SET @CurrentPosition = @CurrentPosition + 1
END
END
ELSE IF @CurrentChar = 'Z'
BEGIN
--chinese pinyin e.g. 'zhao'
IF (SUBSTRING(@Word,@CurrentPosition + 1,1) = 'H')
BEGIN
SET @MP1 = @MP1 + 'J'
SET @MP2 = @MP2 + 'J'
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE
BEGIN
IF ((dbo.fnStringAt((@CurrentPosition + 1), @Word, 'ZO,ZI,ZA')=1)
OR ((dbo.fnSlavoGermanic(@Word)=1) AND ((@CurrentPosition >
1) AND SUBSTRING(@Word,@CurrentPosition - 1,1) <> 'T')))
BEGIN
SET @MP1 = @MP1 + 'S'
SET @MP2 = @MP2 + 'TS'
END
ELSE
BEGIN
SET @MP1 = @MP1 + 'S'
SET @MP2 = @MP2 + 'S'
END
IF (SUBSTRING(@Word,@CurrentPosition + 1,1) = 'Z')
BEGIN
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE
BEGIN
SET @CurrentPosition = @CurrentPosition + 1
END
END
END
ELSE
BEGIN
SET @CurrentPosition = @CurrentPosition + 1
END
END
--only give back 4 char metaphone
IF (LEN(@MP1) > 4)
BEGIN
SET @MP1 = LEFT(@MP1, 4)
END
IF (LEN(@MP2) > 4)
BEGIN
SET @MP2 = LEFT(@MP2, 4)
END
IF @MP2 = @MP1
BEGIN
SET @MP2 = ''
END
INSERT @DMP(Metaphone1,Metaphone2) VALUES( @MP1, @MP2 )
RETURN
END
GO;
------------------------------------------------------------------------
IF OBJECT_ID('fnDoubleMetaphoneScalar') IS NOT NULL BEGIN DROP FUNCTION
fnDoubleMetaphoneScalar END
GO;
CREATE FUNCTION fnDoubleMetaphoneScalar( @MetaphoneType int, @Word varchar(50) )
RETURNS char(4)
AS
BEGIN
RETURN (SELECT CASE @MetaphoneType WHEN 1 THEN Metaphone1
WHEN 2 THEN Metaphone2 END FROM fnDoubleMetaphoneTable( @Word ))
END
Red Filter 코드를 두 부분으로 붙여 링크 부패를 방지하다
참고 자료:
https://github.com/mb16/geocoderNet/blob/master/build/sql/doubleMetaphone.sql 파트 1:
--WEB LISTING 1: Double Metaphone Script
-------------------------------------
IF OBJECT_ID('fnIsVowel') IS NOT NULL BEGIN DROP FUNCTION fnIsVowel END
GO;
CREATE FUNCTION fnIsVowel( @c char(1) )
RETURNS bit
AS
BEGIN
IF (@c = 'A') OR (@c = 'E') OR (@c = 'I') OR (@c = 'O') OR (@c = 'U') OR (@c = 'Y')
BEGIN
RETURN 1
END
--'ELSE' would worry SQL Server, it wants RETURN last in a scalar function
RETURN 0
END
GO;
-----------------------------------------------
IF OBJECT_ID('fnSlavoGermanic') IS NOT NULL BEGIN DROP FUNCTION fnSlavoGermanic
END
GO;
CREATE FUNCTION fnSlavoGermanic( @Word char(50) )
RETURNS bit
AS
BEGIN
--Catch NULL also...
IF (CHARINDEX('W',@Word) > 0) OR (CHARINDEX('K',@Word) > 0) OR
(CHARINDEX('CZ',@Word) > 0)
--'WITZ' test is in original Lawrence Philips C++ code, but appears to be a subset of the first test for 'W'
-- OR (CHARINDEX('WITZ',@Word) > 0)
BEGIN
RETURN 1
END
--ELSE
RETURN 0
END
GO;
---------------------------------------------------------------------------------------------------------------------------------
----------------------
--Lawrence Philips calls for a length argument, but this has two drawbacks:
--1. All target strings must be of the same length
--2. It presents an opportunity for subtle bugs, ie fnStringAt( 1, 7, 'Search me please', 'Search' ) returns 0 (no matter what is in the searched string)
--So I've eliminated the argument and fnStringAt checks the length of each target as it executes
--DEFAULTS suck with UDFs. Have to specify DEFAULT in caller - why bother?
IF OBJECT_ID('fnStringAtDef') IS NOT NULL BEGIN DROP FUNCTION fnStringAtDef END
GO;
CREATE FUNCTION fnStringAtDef( @Start int, @StringToSearch varchar(50),
@Target1 varchar(50),
@Target2 varchar(50) = NULL,
@Target3 varchar(50) = NULL,
@Target4 varchar(50) = NULL,
@Target5 varchar(50) = NULL,
@Target6 varchar(50) = NULL )
RETURNS bit
AS
BEGIN
IF CHARINDEX(@Target1,@StringToSearch,@Start) > 0 RETURN 1
--2 Styles, test each optional argument for NULL, nesting further tests
--or just take advantage of CHARINDEX behavior with a NULL arg (unless 65 compatibility - code check before CREATE FUNCTION?
--Style 1:
--IF @Target2 IS NOT NULL
--BEGIN
-- IF CHARINDEX(@Target2,@StringToSearch,@Start) > 0 RETURN 1
-- (etc.)
--END
--Style 2:
IF CHARINDEX(@Target2,@StringToSearch,@Start) > 0 RETURN 1
IF CHARINDEX(@Target3,@StringToSearch,@Start) > 0 RETURN 1
IF CHARINDEX(@Target4,@StringToSearch,@Start) > 0 RETURN 1
IF CHARINDEX(@Target5,@StringToSearch,@Start) > 0 RETURN 1
IF CHARINDEX(@Target6,@StringToSearch,@Start) > 0 RETURN 1
RETURN 0
END
GO;
-------------------------------------------------------------------------------------------------
IF OBJECT_ID('fnStringAt') IS NOT NULL BEGIN DROP FUNCTION fnStringAt END
GO;
CREATE FUNCTION fnStringAt( @Start int, @StringToSearch varchar(50), @TargetStrings
varchar(2000) )
RETURNS bit
AS
BEGIN
DECLARE @SingleTarget varchar(50)
DECLARE @CurrentStart int
DECLARE @CurrentLength int
--Eliminate special cases
--Trailing space is needed to check for end of word in some cases, so always append comma
--loop tests should fairly quickly ignore ',,' termination
SET @TargetStrings = @TargetStrings + ','
SET @CurrentStart = 1
--Include terminating comma so spaces don't get truncated
SET @CurrentLength = (CHARINDEX(',',@TargetStrings,@CurrentStart) -
@CurrentStart) + 1
SET @SingleTarget = SUBSTRING(@TargetStrings,@CurrentStart,@CurrentLength)
WHILE LEN(@SingleTarget) > 1
BEGIN
IF SUBSTRING(@StringToSearch,@Start,LEN(@SingleTarget)-1) =
LEFT(@SingleTarget,LEN(@SingleTarget)-1)
BEGIN
RETURN 1
END
SET @CurrentStart = (@CurrentStart + @CurrentLength)
SET @CurrentLength = (CHARINDEX(',',@TargetStrings,@CurrentStart) -
@CurrentStart) + 1
IF NOT @CurrentLength > 1 --getting trailing comma
BEGIN
BREAK
END
SET @SingleTarget =
SUBSTRING(@TargetStrings,@CurrentStart,@CurrentLength)
END
RETURN 0
END
GO;
------------------------------------------------------------------------
IF OBJECT_ID('fnDoubleMetaphoneTable') IS NOT NULL BEGIN DROP FUNCTION
fnDoubleMetaphoneTable END
GO;
CREATE FUNCTION fnDoubleMetaphoneTable( @Word varchar(50) )
RETURNS @DMP TABLE ( Metaphone1 char(4), Metaphone2 char(4) )
AS
BEGIN
DECLARE @MP1 varchar(4), @MP2 varchar(4)
SET @MP1 = ''
SET @MP2 = ''
DECLARE @CurrentPosition int, @WordLength int, @CurrentChar char(1)
SET @CurrentPosition = 1
SET @WordLength = LEN(@Word)
IF @WordLength < 1
BEGIN
RETURN
END
--ensure case insensitivity
SET @Word = UPPER(@Word)
IF dbo.fnStringAt(1, @Word, 'GN,KN,PN,WR,PS') = 1
BEGIN
SET @CurrentPosition = @CurrentPosition + 1
END
IF 'X' = LEFT(@Word,1)
BEGIN
SET @MP1 = @MP1 + 'S'
SET @MP2 = @MP2 + 'S'
SET @CurrentPosition = @CurrentPosition + 1
END
WHILE (4 > LEN(RTRIM(@MP1))) OR (4 > LEN(RTRIM(@MP2)))
BEGIN
IF @CurrentPosition > @WordLength
BEGIN
BREAK
END
SET @CurrentChar = SUBSTRING(@Word,@CurrentPosition,1)
IF @CurrentChar IN('A','E','I','O','U','Y')
BEGIN
IF @CurrentPosition = 1
BEGIN
SET @MP1 = @MP1 + 'A'
SET @MP2 = @MP2 + 'A'
END
SET @CurrentPosition = @CurrentPosition + 1
END
ELSE IF @CurrentChar = 'B'
BEGIN
SET @MP1 = @MP1 + 'P'
SET @MP2 = @MP2 + 'P'
IF 'B' = SUBSTRING(@Word,@CurrentPosition + 1,1)
BEGIN
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE
BEGIN
SET @CurrentPosition = @CurrentPosition + 1
END
END
ELSE IF @CurrentChar = 'Ç'
BEGIN
SET @MP1 = @MP1 + 'S'
SET @MP2 = @MP2 + 'S'
SET @CurrentPosition = @CurrentPosition + 1
END
ELSE IF @CurrentChar = 'C'
BEGIN
--various germanic
IF (@CurrentPosition > 2)
AND (dbo.fnIsVowel(SUBSTRING(@Word,@CurrentPosition-2,1))=0)
AND (dbo.fnStringAt(@CurrentPosition-1,@Word,'ACH') = 1)
AND ((SUBSTRING(@Word,@CurrentPosition+2,1) <> 'I')
AND ((SUBSTRING(@Word,@CurrentPosition+2,1) <> 'E') OR
(dbo.fnStringAt(@CurrentPosition-2,@Word,'BACHER,MACHER')=1)))
BEGIN
SET @MP1 = @MP1 + 'K'
SET @MP2 = @MP2 + 'K'
SET @CurrentPosition = @CurrentPosition + 2
END
-- 'caesar'
ELSE IF (@CurrentPosition = 1) AND
(dbo.fnStringAt(@CurrentPosition,@Word,'CAESAR') = 1)
BEGIN
SET @MP1 = @MP1 + 'S'
SET @MP2 = @MP2 + 'S'
SET @CurrentPosition = @CurrentPosition + 2
END
-- 'chianti'
ELSE IF dbo.fnStringAt(@CurrentPosition,@Word,'CHIA') = 1
BEGIN
SET @MP1 = @MP1 + 'K'
SET @MP2 = @MP2 + 'K'
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE IF dbo.fnStringAt(@CurrentPosition,@Word,'CH') = 1
BEGIN
-- Find 'michael'
IF (@CurrentPosition > 1) AND
(dbo.fnStringAt(@CurrentPosition,@Word,'CHAE') = 1)
BEGIN
--First instance of alternate encoding
SET @MP1 = @MP1 + 'K'
SET @MP2 = @MP2 + 'X'
SET @CurrentPosition = @CurrentPosition + 2
END
--greek roots e.g. 'chemistry', 'chorus'
ELSE IF (@CurrentPosition = 1) AND (dbo.fnStringAt(2, @Word,
'HARAC,HARIS,HOR,HYM,HIA,HEM') = 1) AND (dbo.fnStringAt(1,@Word,'CHORE') = 0)
BEGIN
SET @MP1 = @MP1 + 'K'
SET @MP2 = @MP2 + 'K'
SET @CurrentPosition = @CurrentPosition + 2
END
--germanic, greek, or otherwise 'ch' for 'kh' sound
ELSE IF ((dbo.fnStringAt(1,@Word,'VAN ,VON ,SCH')=1) OR
(dbo.fnStringAt(@CurrentPosition-
2,@Word,'ORCHES,ARCHIT,ORCHID')=1) OR
(dbo.fnStringAt(@CurrentPosition+2,@Word,'T,S')=1) OR
(((dbo.fnStringAt(@CurrentPosition-1,@Word,'A,O,U,E')=1)
OR
(@CurrentPosition = 1))
AND
(dbo.fnStringAt(@CurrentPosition+2,@Word,'L,R,N,M,B,H,F,V,W, ')=1)))
BEGIN
SET @MP1 = @MP1 + 'K'
SET @MP2 = @MP2 + 'K'
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE
BEGIN
--is this a given?
IF (@CurrentPosition > 1)
BEGIN
IF (dbo.fnStringAt(1,@Word,'MC') = 1)
BEGIN
--eg McHugh
SET @MP1 = @MP1 + 'K'
SET @MP2 = @MP2 + 'K'
END
ELSE
BEGIN
--Alternate encoding
SET @MP1 = @MP1 + 'X'
SET @MP2 = @MP2 + 'K'
END
END
ELSE
BEGIN
SET @MP1 = @MP1 + 'X'
SET @MP2 = @MP2 + 'X'
END
SET @CurrentPosition = @CurrentPosition + 2
END
END
--e.g, 'czerny'
ELSE IF (dbo.fnStringAt(@CurrentPosition,@Word,'CZ')=1) AND
(dbo.fnStringAt((@CurrentPosition - 2),@Word,'WICZ')=0)
BEGIN
SET @MP1 = @MP1 + 'S'
SET @MP2 = @MP2 + 'X'
SET @CurrentPosition = @CurrentPosition + 2
END
--e.g., 'focaccia'
ELSE IF(dbo.fnStringAt((@CurrentPosition + 1),@Word,'CIA')=1)
BEGIN
SET @MP1 = @MP1 + 'X'
SET @MP2 = @MP2 + 'X'
SET @CurrentPosition = @CurrentPosition + 3
END
--double 'C', but not if e.g. 'McClellan'
ELSE IF(dbo.fnStringAt(@CurrentPosition,@Word,'CC')=1) AND NOT
((@CurrentPosition = 2) AND (LEFT(@Word,1) = 'M'))
--'bellocchio' but not 'bacchus'
IF (dbo.fnStringAt((@CurrentPosition + 2),@Word,'I,E,H')=1) AND
(dbo.fnStringAt((@CurrentPosition + 2),@Word,'HU')=0)
BEGIN
--'accident', 'accede' 'succeed'
IF (((@CurrentPosition = 2) AND
(SUBSTRING(@Word,@CurrentPosition - 1,1) = 'A'))
OR (dbo.fnStringAt((@CurrentPosition -
1),@Word,'UCCEE,UCCES')=1))
BEGIN
SET @MP1 = @MP1 + 'KS'
SET @MP2 = @MP2 + 'KS'
END
--'bacci', 'bertucci', other italian
ELSE
BEGIN
SET @MP1 = @MP1 + 'X'
SET @MP2 = @MP2 + 'X'
END
SET @CurrentPosition = @CurrentPosition + 3
END
--Pierce's rule
ELSE
BEGIN
SET @MP1 = @MP1 + 'K'
SET @MP2 = @MP2 + 'K'
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE IF (dbo.fnStringAt(@CurrentPosition,@Word,'CK,CG,CQ')=1)
BEGIN
SET @MP1 = @MP1 + 'K'
SET @MP2 = @MP2 + 'K'
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE IF (dbo.fnStringAt(@CurrentPosition,@Word,'CI,CE,CY')=1)
BEGIN
--italian vs. english
IF (dbo.fnStringAt(@CurrentPosition,@Word,'CIO,CIE,CIA')=1)
BEGIN
SET @MP1 = @MP1 + 'S'
SET @MP2 = @MP2 + 'X'
END
ELSE
BEGIN
SET @MP1 = @MP1 + 'S'
SET @MP2 = @MP2 + 'S'
END
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE
BEGIN
SET @MP1 = @MP1 + 'K'
SET @MP2 = @MP2 + 'K'
--name sent in 'mac caffrey', 'mac gregor
IF (dbo.fnStringAt((@CurrentPosition + 1),@Word,' C, Q, G')=1)
BEGIN
SET @CurrentPosition = @CurrentPosition + 3
END
ELSE
BEGIN
IF (dbo.fnStringAt((@CurrentPosition + 1),@Word,'C,K,Q')=1)
AND (dbo.fnStringAt((@CurrentPosition + 1), 2, 'CE,CI')=0)
BEGIN
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE
BEGIN
SET @CurrentPosition = @CurrentPosition + 1
END
END
END
END
ELSE IF @CurrentChar = 'D'
BEGIN
IF (dbo.fnStringAt(@CurrentPosition, @Word, 'DG')=1)
BEGIN
IF (dbo.fnStringAt((@CurrentPosition + 2),@Word,'I,E,Y')=1)
BEGIN
--e.g. 'edge'
SET @MP1 = @MP1 + 'J'
SET @MP2 = @MP2 + 'J'
SET @CurrentPosition = @CurrentPosition + 3
END
ELSE
BEGIN
--e.g. 'edgar'
SET @MP1 = @MP1 + 'TK'
SET @MP2 = @MP2 + 'TK'
SET @CurrentPosition = @CurrentPosition + 2
END
END
ELSE IF (dbo.fnStringAt(@CurrentPosition,@Word,'DT,DD')=1)
BEGIN
SET @MP1 = @MP1 + 'T'
SET @MP2 = @MP2 + 'T'
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE
BEGIN
SET @MP1 = @MP1 + 'T'
SET @MP2 = @MP2 + 'T'
SET @CurrentPosition = @CurrentPosition + 1
END
END
ELSE IF @CurrentChar = 'F'
BEGIN
IF (SUBSTRING(@Word,@CurrentPosition + 1,1) = 'F')
BEGIN
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE
BEGIN
SET @CurrentPosition = @CurrentPosition + 1
END
SET @MP1 = @MP1 + 'F'
SET @MP2 = @MP2 + 'F'
END
ELSE IF @CurrentChar = 'G'
BEGIN
IF (SUBSTRING(@Word,@CurrentPosition + 1,1) = 'H')
BEGIN
IF (@CurrentPosition > 1) AND
(dbo.fnIsVowel(SUBSTRING(@Word,@CurrentPosition - 1,1)) = 0)
BEGIN
SET @MP1 = @MP1 + 'K'
SET @MP2 = @MP2 + 'K'
SET @CurrentPosition = @CurrentPosition + 2
END
--'ghislane', ghiradelli
ELSE IF (@CurrentPosition = 1)
BEGIN
IF (SUBSTRING(@Word,@CurrentPosition + 2,1) = 'I')
BEGIN
SET @MP1 = @MP1 + 'J'
SET @MP2 = @MP2 + 'J'
END
ELSE
BEGIN
SET @MP1 = @MP1 + 'K'
SET @MP2 = @MP2 + 'K'
END
SET @CurrentPosition = @CurrentPosition + 2
END
--Parker's rule (with some further refinements) - e.g., 'hugh'
ELSE IF (((@CurrentPosition > 2) AND (dbo.fnStringAt((@CurrentPosition
- 2),@Word,'B,H,D')=1) )
--e.g., 'bough'
OR ((@CurrentPosition > 3) AND (dbo.fnStringAt((@CurrentPosition
- 3),@Word,'B,H,D')=1) )
--e.g., 'broughton'
OR ((@CurrentPosition > 4) AND (dbo.fnStringAt((@CurrentPosition
- 4),@Word,'B,H')=1) ) )
BEGIN
SET @CurrentPosition = @CurrentPosition + 2
END
ELSE
BEGIN
--e.g., 'laugh', 'McLaughlin', 'cough', 'gough', 'rough', 'tough'
IF ((@CurrentPosition > 3)
AND (SUBSTRING(@Word,@CurrentPosition - 1,1) = 'U')
AND (dbo.fnStringAt((@CurrentPosition -
3),@Word,'C,G,L,R,T')=1) )
BEGIN
SET @MP1 = @MP1 + 'F'
SET @MP2 = @MP2 + 'F'
END
ELSE
BEGIN
IF ((@CurrentPosition > 1) AND
SUBSTRING(@Word,@CurrentPosition - 1,1) <> 'I')
BEGIN
SET @MP1 = @MP1 + 'K'
SET @MP2 = @MP2 + 'K'
END
END
SET @CurrentPosition = @CurrentPosition + 2
END
END
ELSE IF (SUBSTRING(@Word,@CurrentPosition + 1,1) = 'N')
BEGIN
IF ((@CurrentPosition = 2) AND (dbo.fnIsVowel(LEFT(@Word,1))=1) AND
(dbo.fnSlavoGermanic(@Word)=0))
BEGIN
SET @MP1 = @MP1 + 'KN'
SET @MP2 = @MP2 + 'N'
END
ELSE
BEGIN
--not e.g. 'cagney'
IF ((dbo.fnStringAt((@CurrentPosition + 2),@Word,'EY')=0)
AND (SUBSTRING(@Word,@CurrentPosition + 1,1) <>
'Y') AND (dbo.fnSlavoGermanic(@Word)=0))
SQL Server에서 SOUNDEX 및 관련 Difference 기능을 사용하여 유사한 이름을 찾을 수 있습니다.MSDN에 대한 참조는 다음과 같습니다.
중복 제거에 관해서는 스트링이 분할되어 일치하는 것이 첫 번째 컷입니다.워크로드 감소 및/또는 더 나은 결과를 얻기 위해 활용할 수 있는 데이터에 대해 알려진 항목이 있는 경우 항상 이러한 항목을 활용하는 것이 좋습니다.대부분의 경우 중복 제거를 위해 수동 작업을 완전히 제거하는 것은 불가능하지만, 가능한 한 많은 정보를 자동으로 포착한 후 "불확실성 사례" 보고서를 생성하면 훨씬 쉽게 작업을 수행할 수 있습니다.
이름 매칭에 대해서: SOUNDEX는 매칭 품질에 좋지 않고, 특히 목표와 너무 멀리 떨어져 있기 때문에 작업 타입에 좋지 않습니다.이중 메타포네 결과와 Levenshtein 거리를 조합하여 이름 매칭을 수행하는 것이 좋습니다.적절한 바이어스를 사용하면 이 기능은 매우 잘 작동하며, 알려진 항목을 정리한 후 두 번째 패스로 사용할 수 있습니다.
SSIS 패키지를 사용하여 Fuzzy Lookup and Grouping 변환(http://msdn.microsoft.com/en-us/library/ms345128(SQL.90).aspx)을 참조하는 것도 고려할 수 있습니다.
SQL 전체 텍스트 검색(http://msdn.microsoft.com/en-us/library/cc879300.aspx)을 사용하는 것도 가능하지만 특정 문제 영역에는 적합하지 않을 수 있습니다.
이렇게 하다
create table person(
personid int identity(1,1) primary key,
firstname varchar(20),
lastname varchar(20),
addressindex int,
sound varchar(10)
)
나중에 트리거를 만듭니다.
create trigger trigoninsert for dbo.person
on insert
as
declare @personid int;
select @personid=personid from inserted;
update person
set sound=soundex(firstname) where personid=@personid;
이제 제가 할 수 있는 것은 이렇게 생긴 프로시저를 만드는 것입니다.
create procedure getfuzzi(@personid int)
as
declare @sound varchar(10);
set @sound=(select sound from person where personid=@personid;
select personid,firstname,lastname,addressindex from person
where sound=@sound
특정 개인 ID에 대해 에서 제공한 이름과 거의 일치하는 모든 이름이 반환됩니다.
언급URL : https://stackoverflow.com/questions/921978/fuzzy-matching-using-t-sql
'programing' 카테고리의 다른 글
Xcode: 타깃과 스킴이란 쉬운 언어로 무엇입니까? (0) | 2023.04.18 |
---|---|
python에서 작성 날짜별로 정렬된 디렉토리 목록을 가져오려면 어떻게 해야 합니까? (0) | 2023.04.18 |
Excel VBA에서 진행 표시줄을 만들려면 어떻게 해야 합니까? (0) | 2023.04.18 |
NSString을 NSNumber로 변환하는 방법 (0) | 2023.04.13 |
ERROR ITMS-9000: "용장 바이너리 업로드.열차 '1.0'의 빌드 버전이 '1.0'인 바이너리 업로드가 이미 존재합니다." (0) | 2023.04.13 |