Boyer–Moore 字符搜索算法

Boyer–Moore 字符搜索算法

因为字符比较是从右往左比较的,所以第一层循环 needle.length + 1 <= i < haystack.length

1
2
3
4
5
6
7
8
9
10
11
12
13
                  start=needle.length - 1                                             end=haystack.length - 1
+ +
| |
| |
v-------------------------------------------------------------------v->
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| t | h | i | s | | i | s | | a | | s | i | m | p | l | e | | e | x | a | m | p | l | e |
+-------------------------------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
+---------------------------+
| e | x | a | m | p | l | e |
+---+---+---+---+---+---+---+

第二层循环中i变量表示坏字符的位置、j表示搜索坏字符开始位置

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
                          i
+
|
v
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| t | h | i | s | | i | s | | a | | s | i | m | p | l | e | | e | x | a | m | p | l | e |
+-------------------------------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
+---------------------------+
| e | x | a | m | p | l | e |
+---+---+---+---+---+---+-+-+
^
|
+
j

i,j指针向左搜索,如果完全匹配直接返回i即可

1
2
3
4
5
for (j = needle.length - 1; needle[j] == haystack[i]; --i, --j) {
if (j == 0) {
return i;
}
}

坏字符规则和好字符规则

坏字符:从右向左第一个不匹配的字符
好字符:从坏字符下一个字符直到最后的字符

可以认为字符移动有两种策略:坏字符对齐和好字符对齐,然后选择字符移动距离大的策略即可

wiki实现

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
/**
* Returns the index within this string of the first occurrence of the
* specified substring. If it is not a substring, return -1.
*
* There is no Galil because it only generates one match.
*
* @param haystack The string to be scanned
* @param needle The target string to search
* @return The start index of the substring
*/
public static int indexOf(char[] haystack, char[] needle) {
if (needle.length == 0) {
return 0;
}
int charTable[] = makeCharTable(needle);
int offsetTable[] = makeOffsetTable(needle);
for (int i = needle.length - 1, j; i < haystack.length;) {
for (j = needle.length - 1; needle[j] == haystack[i]; --i, --j) {
if (j == 0) {
return i;
}
}
// i += needle.length - j; // For naive method
i += Math.max(offsetTable[needle.length - 1 - j], charTable[haystack[i]]);
}
return -1;
}

/**
* Makes the jump table based on the mismatched character information.
*/
private static int[] makeCharTable(char[] needle) {
final int ALPHABET_SIZE = Character.MAX_VALUE + 1; // 65536
int[] table = new int[ALPHABET_SIZE];
for (int i = 0; i < table.length; ++i) {
table[i] = needle.length;
}
for (int i = 0; i < needle.length - 2; ++i) {
table[needle[i]] = needle.length - 1 - i;
}
return table;
}

/**
* Makes the jump table based on the scan offset which mismatch occurs.
* (bad character rule).
*/
private static int[] makeOffsetTable(char[] needle) {
int[] table = new int[needle.length];
int lastPrefixPosition = needle.length;
for (int i = needle.length; i > 0; --i) {
if (isPrefix(needle, i)) {
lastPrefixPosition = i;
}
table[needle.length - i] = lastPrefixPosition - i + needle.length;
}
for (int i = 0; i < needle.length - 1; ++i) {
int slen = suffixLength(needle, i);
table[slen] = needle.length - 1 - i + slen;
}
return table;
}

/**
* Is needle[p:end] a prefix of needle?
*/
private static boolean isPrefix(char[] needle, int p) {
for (int i = p, j = 0; i < needle.length; ++i, ++j) {
if (needle[i] != needle[j]) {
return false;
}
}
return true;
}

/**
* Returns the maximum length of the substring ends at p and is a suffix.
* (good suffix rule)
*/
private static int suffixLength(char[] needle, int p) {
int len = 0;
for (int i = p, j = needle.length - 1;
i >= 0 && needle[i] == needle[j]; --i, --j) {
len += 1;
}
return len;
}
LeetCode 最长回文子串算法

LeetCode 最长回文子串算法

Manacher 算法 容易理解,实现起来也没什么大坑,复杂度还是 O(n)的, 花半个小时实现下很有意思

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
class Solution {
public String longestPalindrome(String s) {
StringBuilder sb = new StringBuilder(2 + 2 * s.length());
sb.append("^");
for (int i = 0; i < s.length(); i++) {
sb.append("#").append(s.charAt(i));
}
sb.append("#$");

String s2 = sb.toString();

int maxStart = 1, max = 0, rC = 1, rR = 1;
int[] p = new int[s2.length()];

for (int i = 1; i < s2.length() - 1; i++ ) {

p[i] = i < rR ? Math.min(p[2 * rC - i], rR - i) : 0;

while (s2.charAt(i+p[i]+1) == s2.charAt(i - p[i] - 1)) {
p[i] = p[i]+1;
}

if (i + p[i] > rR) {
rC = i;
rR = i + p[i];
}

if (p[i] > max) {
maxStart = i - p[i];
max = p[i];
}
}


return s2.substring(maxStart,maxStart + 2*max+1).replace("#", "");
}
}

再谈最长公共子串

序言

这次遇到贝壳花名的需求,需要使用最长公共子串对花名做校验。这种算法在面试题中算是必会题,各种四层循环,三层循环,两层循环的代码在我脑中闪过,但是今天就是要带你实现不一样的最长公共子串!

教科书式实现

使用动态规划,两层循环,使用二维数组存储状态,时间复杂度O(n^2^),空间复杂度O(n^2^)或O(2n)

一张图解释原理:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
                  先 横 向 处 理
+--------------------------->

e a b c b c f
+ +---+---+---+---+---+---+---+
| a | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
纵 | +---------------------------+
向 | b | 0 | 0 | 2 | 0 | 1 | 0 | 0 |
累 | +---------------------------+
加 | c | 0 | 0 | 0 | 3 | 0 | 2 | 0 |
| +---------------------------+
| d | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| +---------------------------+
| e | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
v +---+---+---+---+---+---+---+
e a b c b c f

优化空间复杂度至O(1)

从上图可以发现,在纵向累加时实际只需要左上方的计数器即可, O(n^2^)的空间白白被浪费了,最优的空间复杂度应该是O(1)。那么该如何处理呢?

一张图解释原理:

                                   e   a   b   c   b   c   f
    +---+                        +---+---+---+---+---+---+---+
  a | 0 |                            | 1 | 0 | 0 | 0 | 0 | 0 | a
    +-------+                        +-----------------------+
  b | 0 | 0 |                            | 2 | 0 | 1 | 0 | 0 | b
    +-----------+                        --------------------+
  c | 0 | 0 | 0 |                            | 3 | 0 | 2 | 0 | c
    +---------------+                        ----------------+
  d | 0 | 0 | 0 | 0 |                            | 0 | 0 | 0 | d
    +-------------------+                        +-----------+
  e | 1 | 0 | 0 | 0 | 0 |                            | 0 | 0 | e
    +---+---+---+---+---+-------+                    +---+---+
      e   a   b   c   b   c   f

答案就是沿着等长对着线处理。

有意思的代码抽象

大家可以根据上面思路写一下,一般会把算法分成两部分:处理长方形的左下部分和处理长方形的右上部分,两部分都是双层循环,时间复杂度和空间负载度已经变为了O(n^2^) ,O(1)。

肯定有人已经发现自己的代码处理左下角的代码和处理右上角的代码不能复用,一个是从中间向左下角处理,一个是从中间向右上角处理,明明很类似,但是就是没发合并。

那么有没有方法把这两部分处理抽象成公共代码呢?不卖关子了,直接上图:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
                                                         +
e |
+--+
e a b c b c f a | 1|
+---+---+---+---+---+---+---+ +-----+
| 1 | 0 | 0 | 0 | 0 | 0 | a b | 0| 2|
+-----------------------+ +--------+
| 2 | 0 | 1 | 0 | 0 | b c | 0| 0| 3|
--------------------+ 翻 折 +-----------+
| 3 | 0 | 2 | 0 | c +------------> b | 0| 1| 0| 0|
----------------+ +--------------+
| 0 | 0 | 0 | d c | 0| 0| 2| 0| 0|
+-----------+ +--------------+
| 0 | 0 | e f | 0| 0| 0| 0| 0|
+---+---+ +--------------+
a b c d e

如果你想使用公共代码同时实现处理左下角和右上角是不可能的了。所以你需要把右上角的三角翻折,然后你就得到了两个三角:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
                                      +
e |
+--+
a | 1|
+---+ +-----+
a | 0 | b | 0| 2|
+-------+ +--------+
b | 0 | 0 | c | 0| 0| 3|
+-----------+ +-----------+
c | 0 | 0 | 0 | b | 0| 1| 0| 0|
+---------------+ +--------------+
d | 0 | 0 | 0 | 0 | c | 0| 0| 2| 0| 0|
+-------------------+ +--------------+
e | 1 | 0 | 0 | 0 | 0 | f | 0| 0| 0| 0| 0|
+---+---+---+---+---+------ +--------------+
e a b c b c f a b c d e

这样就变成了处理两遍左下角了,代码也可以完美复用!!!

最终实现

我的完整思考过程已经分析完毕,这样沿对着线处理还有一个小小的优点:提前结束搜索。这一点大家可以自行思考,这里不做过多解释。

直接干货上场:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
public class Solution {
/**
* @param A: A string
* @param B: A string
* @return: the length of the longest common substring.
*/
public int longestCommonSubstring(String A, String B) {
// write your code here
char[] achars = A.toCharArray(), bchars = B.toCharArray();
return getMaxLength(bchars, achars, getMaxLength(achars, bchars, 0, 0), 1);
}

private static int getMaxLength(char[] s1, char[] s2, int maxLength, int startIndex) {

for (int start = startIndex; start < s1.length; start++) {
int upper = Math.min(s2.length, s1.length - start);
// if (upper <= maxLength) break; //提前结束搜索
for (int currentLineLength = 0, x = 0, y = start; x < upper; x++, y++) {
if (s1[y] == s2[x])
maxLength = Math.max(maxLength, ++currentLineLength);
else {
// if (upper - x - 1 <= maxLength) break; //提前结束搜索
currentLineLength = 0;
};
}
}
return maxLength;
}
}

结尾

怎么样,经历这次优化过程是否感觉自己对最长公共子串的认识又更深了一步呢?虽然不能保证是首创(也可能是首创?),但是这次一步一步真切思考优化直到获得成果让我无比兴奋。

说了这么多,我就是要给我们贝壳招聘开发组打个广告>_>,期待更多爱思考优秀的同学加入!

![](/Users/sage/Desktop/屏幕快照 2018-11-10 13.33.20.png)