期望最大化(EM)算法matlab实现

2015-10-03 Tech matlab, 数据挖掘, 机器学习, 算法 Comments Word Count: 17k(words) Read Count: 1:03(minutes)

一、前言

看了吴军博士的《数学之美》感觉受益颇多，好书！
第二版第27章讲的是期望最大化(EM)算法，看完就感觉如醍醐灌顶，而且原理上也不难，趁国庆的机会，把它实现了，加深一下理解。

EM算法 的范畴很广，之前几章的 最大熵算法、隐含马尔科夫模型的训练算法 都包括在内。
用它做一些文本的自收敛分类效果很好，而且运算复杂度也不高，是一个很不错的选择。
当然，关于文本的分类，前几章也有介绍一些其它的算法，也可以去参考一下。

二、研究思路

数据初始化
1.1 随机生成5个中心点
1.2 以这5个中心点为圆心，RADIUS为半径内，随机生成1000个点
EM算法主体
2.1 在图中，另外再取5个随机的初始点
2.2 对于所有5000个点，计算它相对5个初始点的距离，保存到 5000 * 5 的矩阵中 (Expectation)
2.3 把所有5000个点按照 相对5个初始点的距离 归成5类 (Maximization)
2.4 计算5类中，所有点的坐标的平均值，得到5个新的点，作为下次迭代的5个初始点
2.5 从2.2开始迭代，直到收敛
2.6 把n次迭代得到的点连成线
迭代2，计算不同随机初始点的收敛结果

三、代码

matlab代码如下：
(ps: matlab语法不熟，100多行写了一整个下午，早知道算法用c++或者python实现，matlab仅用于作图了，不过也算是练一下matlab吧，这次的代码真心觉得写的好丑)

% global parameter which can be altered easily to run a different result
WIDTH = 10000;
HEIGHT = 10000;
RADIUS = 6000;
ITERATE_COUNT = 10;

% generate the initial points
center_points = [];
all_points = [];
color = ['b', 'm', 'g', 'y', 'c'];
path = ['*', 'o', '^'];
for i = 1:5
    center_point = [randi(WIDTH) + RADIUS, randi(HEIGHT) + RADIUS];
    center_points(i, :) = center_point;
    for j = 1:1000
        while (true)
            point = [randi(WIDTH + RADIUS * 2), randi(HEIGHT + RADIUS * 2)];
            if (sqrt((point(1) - center_point(1))^2 + (point(2) - center_point(2))^2) <= RADIUS)
                if (length(all_points) == 0)
                    all_points(1, :) = point;
                    break;
                end
                all_points(length(all_points(:,1)) + 1, :) = point;
                break;
            end
        end
        plot(point(1), point(2), [color(i), '.']);
        hold on;
    end
end

% calculate three paths with three independent initial start points
for m = 1:3
    % init the start points
    res = [];
    for i = 1:5
        point = [randi(WIDTH), randi(HEIGHT)];
        res(1, i, :) = point;
    end

    % iterate
    distance = [];
    for i = 1:ITERATE_COUNT
        s = size(res);
        % cur_center is a 3-dimension matrix
        % the length of first dimension is 1
        cur_center = res(s(1), :, :);
        for j = 1:length(all_points);
            for k = 1: s(2);
                point_1 = all_points(j, :);
                point_2 = [cur_center(1, k, 1), cur_center(1, k, 2)];
                distance(j, k) = sqrt((point_1(1) - point_2(1))^2 + (point_1(2) - point_2(2))^2);
            end
        end

        % assign each point to the new defined category
        assign = zeros(length(distance), 1);
        for j = 1:s(2)
            temp = distance(:, j);
            smallest = sort(temp);
            smallest = smallest(1:10);
            for l=1:length(smallest)
                for k = 1:length(temp)
                    if (smallest(l) == temp(k))
                        assign(k) = j;
                    end
                end
            end
        end
        for j = 1:length(distance)
            if (assign(j) ~= 0)
                continue;
            end
            [value, index] = min(distance(j, :));
            assign(j) = index;
        end

        % calculate the total value of x, y for each category
        temp2 = zeros(s(2), 3);
        for j = 1:length(assign)
            temp2(assign(j), 3) = temp2(assign(j), 3) + 1;
            temp2(assign(j), 1) = temp2(assign(j), 1) + all_points(j, 1);
            temp2(assign(j), 2) = temp2(assign(j), 2) + all_points(j, 2);
        end

        % get the new local center
        for j = 1:s(2)
            res(s(1) + 1, j, :) = [temp2(j, 1) / temp2(j, 3), temp2(j, 2) / temp2(j, 3)];
        end
    end

    % plot the local center changing path
    s = size(res);
    for i = 1:s(2)
        x = res(1:s(1), i, 1);
        y = res(1:s(1), i, 2);
        plot(x, y, ['r', path(m), '-']);
        hold on;
    end
end

% save the picture
title(['期望最大化算法 半径', num2str(RADIUS)]);
saveas(gcf, ['em_', num2str(RADIUS), '.jpg'])
hold off;

四、实验结果

半径2000

半径4000

半径6000

五、总结

分类的效果还是比较显著的

聚类结果总是能收敛的
圆形点阵比较分散的时候，无论随机起始点在哪里，总是能收敛到相同的位置
当圆形点阵比较接近的时候，可能会出现若干个局部极点，随机的起始点的位置一定程度上可以影响最后收敛的位置
一般迭代规模若干次就能够收敛

最后再贴上一张《数学之美》中很有比喻意义原图，可以非常形象的诠释这个算法的原理：

本文链接： http://xidui.github.io/2015/10/03/%E6%9C%9F%E6%9C%9B%E6%9C%80%E5%A4%A7%E5%8C%96-EM-%E7%AE%97%E6%B3%95matlab%E5%AE%9E%E7%8E%B0/

版权声明： 本博客所有文章除特别声明外，均采用 CC BY 4.0 CN协议许可协议。转载请注明出处！

xiduiSoftware Engineer

个人简介。