数据挖掘中决策树算法实现——Bash

关于决策树，几乎是数据挖掘分类算法中最先介绍到的。
决策树，顾名思义就是用来做决定的树，一个分支就是一个决策过程。

每个决策过程中涉及一个数据的属性，而且只涉及一个。然后递归地，贪心地直到满足决策条件(即可以得到明确的决策结果)。

决策树的实现首先要有一些先验（已经知道结果的历史）数据做训练，通过分析训练数据得到每个属性对结果的影响的大小，这里我们通过一种叫做信息增益的理论去描述它，期间也涉及到熵的概念。也可参考文章信息增益与熵.

下面我们结合实例说一下决策树实现过程中的上述关键概念：

假设我们有如下数据：

age job house credit class
1 0 0 1 0
1 0 0 2 0
1 1 0 2 1
1 1 1 1 1
1 0 0 1 0
2 0 0 1 0
2 0 0 2 0
2 1 1 2 1
2 0 1 3 1
2 0 1 3 1
3 0 1 3 1
3 0 1 2 1
3 1 0 2 1
3 1 0 3 1
3 0 0 1 0
(一)
我们首先要通过计算找到哪个属性的所有属性值能更好地表达class字段的不同。通过计算，我们发现house的属性值最能表现class字段的不同。这个衡量标准其实就是信息增益。计算方法是：首先计算全部数据的熵，然后除class之外的其他属性逐个遍历，找到熵最小的那个属性(house)，然后将全部数据的熵减去按照house属性划分数据之后的数据的熵。

这个值如果满足条件假如(>0.1)，我们认为数据应该按照这个节点进行易做图，也就是说这个属性(house)构成了我们的一次决策过程。

(二)
然后
在按照house易做图的每个数据集上，针对其他属性(house除外)进行与(一)相同的过程，直到信息增益不足以满足数据易做图的条件。

这样，我们就得到了一个关于属性数据划分的一棵树。可以作为class字段未知的数据的决策依据。

二、决策树代码实现：

具体计算代码如下：---假设上述数据我们保存为descision.dat文件，以及需要bash4.0及以上支持运行。

Bash代码
#!/home/admin/bin/bash_bin/bash_4

input=$1;

if [ -z $input ]; then
    echo "please input the traning file";
    exit 1;
fi

## pre calculate the log2 value for the later calculate operation
declare -a log2;
logi=0;
records=$(cat $input | wc -l);
for i in `awk -v n=$records 'BEGIN{for(i=1;i<n;i++) print log(i)/log(2);}'`
do
    ((logi+=1));
    log2[$logi]=$i;
done

## function for calculating the entropy for the given distribution of the class
function getEntropy {
    local input=`echo $1`;
    if [[ $input == *" "* ]]; then
        local current_entropy=0;
        local sum=0;
        local i;
        for i in $input
        do
            ((sum+=$i));
            current_entropy=$(awk -v n=$i -v l=${log2[$i]} -v o=$current_entropy 'BEGIN{print n*l+o}');
        done
        current_entropy=$(awk -v n=$current_entropy -v b=$sum -v l=${log2[$sum]} 'BEGIN{print n/b*-1+l;}')
        eval $2=$current_entropy;
    else
        eval $2=0;
    fi
}

### the header title of the input data
declare -A header_info;
header=$(head -1 $input);
headers=(${header//,/ })
length=${#headers[@]};
for((i=0;i<length;i++))
do
    attr=${headers[$i]};
    header_info[$attr]=$i;
done

### the data content of the input data
data=${input}_dat;
sed -n '2,$p' $input > $data

## use an array to store the information of a descision tree
## the node structure is {child,slibling,parent,attr,attr_value,leaf,class}
## the root is a virtual node with none used attribute
## only the leaf node has class flag and the "leaf,class" is meaningfull
## the descision_tree
declare -a descision_tree;

## the root node with no child\slibing and anythings else
descision_tree[0]="0:0:0:N:N:0:0";

## use recursive algrithm to build the tree
## so we need a trace_stack to record the call level infomation
declare -a trace_stack;

## push the root node into the stack
trace_stack[0]=0;
stack_deep=1;

## begin to build the tree until the trace_stack is empty
while [ $stack_deep -ne 0 ]
do
    ((stack_deep-=1));
    current_node_index=${trace_stack[$stack_deep]};
    current_node=${descision_tree[$current_node_index]};
    current_node_struct=(${current_node//:/ });

    ## select the current data set
    ## get used attr and their values
    attrs=${current_node_struct[3]};
    attrv=${current_node_struct[4]};

    declare -a grepstra=();

    if [ $attrs != "N" ];then
        attr=(${attrs//,/ });
        attrvs=(${attrv//,/ });
        attrc=${#attr[@]};
        for((i=0;i<attrc;i++))
        do
            a=${attr[$i]};
            index=${header_info[$a]};
            grepstra[$index]=${attrvs[$i]};
        done
    fi

    for((i=0;i<length;i++))
    do
        if [ -z ${grepstra[$i]} ]; then
            grepstra[$i]=".*";
        fi
    done
    grepstrt=${grepstra[*]};
    grepstr=${grepstrt// /,};
    grep $grepstr $data > current_node_data

    ## calculate the entropy before split the records
    entropy=0;
    input=`c

补充：软件开发 , Java ,

数据挖掘中 决策树算法实现——Bash

数据挖掘中决策树算法实现——Bash