Tuesday, March 6, 2012

A possible improvement for BigTop's Hadoop integration test using Groovy scripts

In my previous blog post, I wrote a small Hadoop MR program that sums up the values for each key. The test runs in five steps:

1. Run a program that generates random key/value pairs and saves them to a file as input.
2. Copy the input file from the local directory to HDFS.
3. Run a simple Mapper/Reducer to calculate the sum for each key.
4. Copy the Hadoop output to a local directory.
5. Read the contents of the input and output directories, calculate the sum for each key, and compare the sums.
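
The comparison in step 5 can be sketched roughly as follows. This is only an illustration, assuming each line has the shape "key value" separated by whitespace; the class name KeySumCheck and the sample lines are made up for the example and are not taken from the actual program.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: sums values per key from lines shaped like "key<whitespace>value".
class KeySumCheck {

    // Parse each non-blank line and accumulate its value under its key.
    static Map<String, Double> sumByKey(List<String> lines) {
        Map<String, Double> sums = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = trimmed.split("\\s+");
            sums.merge(parts[0], Double.parseDouble(parts[1]), Double::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // Stand-in for the generated input file (step 1).
        List<String> input = Arrays.asList("a 1.5", "b 2.0", "a 3.5", "b 4.0");
        // Stand-in for the reducer output copied back from HDFS (step 4).
        List<String> output = Arrays.asList("a\t5.0", "b\t6.0");

        Map<String, Double> expected = sumByKey(input);
        Map<String, Double> actual = sumByKey(output);
        System.out.println(expected.equals(actual) ? "sums match" : "sums differ");
    }
}
```

Running the same summing routine over both the input and the output keeps the check independent of how many reducer output files there are.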

As I keep learning and contemplating ways to integrate this test program into BigTop's smoke tests, I have discovered that the Groovy script

bigtop/bigtop-tests/test-artifacts/hadoop/src/main/groovy/org/apache/bigtop/itest/hadoopexamples/TestHadoopExamples.groovy

stores the shell commands used to test a Hadoop installation:
  static Map examples =
    [
        pi                :'20 10',
        wordcount         :"$EXAMPLES/text $EXAMPLES_OUT/wordcount",
        multifilewc       :"$EXAMPLES/text $EXAMPLES_OUT/multifilewc",
//        aggregatewordcount:"$EXAMPLES/text $EXAMPLES_OUT/aggregatewordcount 5 textinputformat",
//        aggregatewordhist :"$EXAMPLES/text $EXAMPLES_OUT/aggregatewordhist 5 textinputformat",
        grep              :"$EXAMPLES/text $EXAMPLES_OUT/grep '[Cc]uriouser'",
        sleep             :"-m 10 -r 10",
        secondarysort     :"$EXAMPLES/ints $EXAMPLES_OUT/secondarysort",
        randomtextwriter  :"-Dtest.randomtextwrite.total_bytes=1073741824 $EXAMPLES_OUT/randomtextwriter"
    ];

  private String testName;
  private String testJar;
  private String testArgs;

  @Parameters
  public static Map<String, Object[]> generateTests() {
    Map<String, Object[]> res = [:];
    examples.each { k, v -> res[k] = [k.toString(), v.toString()] as Object[]; }
    return res;
  }

  public TestHadoopExamples(String name, String args) {
    testName = name;
    testArgs = args;
    testJar = HADOOP_EXAMPLES_JAR;
  }

  @Test
  void testMRExample() {
    sh.exec("$hadoop jar $testJar $testName $HADOOP_OPTIONS $testArgs");

    assertTrue("Example $testName failed", 
               sh.getRet() == 0);
  } 


This is a very clear way to run Hadoop tests! On the other hand, storing the commands in Groovy scripts could be a challenge for system administrators.
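
For readers who do not know Groovy, the pattern in the script (a map from example name to arguments, one command line per entry, pass when the return code is 0) might look roughly like this in plain Java. It is a sketch under assumptions: ExampleRunner is a made-up name, and "echo" stands in for the real "hadoop jar" launcher so the snippet runs without a cluster.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical Java analogue of the name -> arguments map plus the exit-code check.
class ExampleRunner {

    // Run a command line through "sh -c" and return its exit code.
    static int run(String commandLine) {
        try {
            Process process = new ProcessBuilder("sh", "-c", commandLine)
                    .inheritIO() // let the command write straight to our console
                    .start();
            return process.waitFor();
        } catch (Exception e) {
            throw new IllegalStateException("Could not run: " + commandLine, e);
        }
    }

    public static void main(String[] args) {
        // Placeholder launcher (assumption); on a cluster this would be the hadoop command.
        String launcher = "echo";
        Map<String, String> examples = new LinkedHashMap<>();
        examples.put("pi", "20 10");
        examples.put("sleep", "-m 10 -r 10");

        // One command line per entry; an example passes when the exit code is 0.
        for (Map.Entry<String, String> example : examples.entrySet()) {
            int ret = run(launcher + " " + example.getKey() + " " + example.getValue());
            System.out.println(example.getKey() + (ret == 0 ? " passed" : " failed"));
        }
    }
}
```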

Another interesting file is bigtop/bigtop-tests/test-execution/smokes/hadoop/target/testConfCluster.xml:



...
  <test> <!-- TESTED -->
      <description>touchz: touching many files </description>
      <test-commands>
        <command>-fs NAMENODE -touchz file0 file1 file2</command>
        <command>-fs NAMENODE -du file*</command>
      </test-commands>
      <cleanup-commands>
        <command>-fs NAMENODE -rm file*</command>
      </cleanup-commands>
      <comparators>
        <comparator>
          <type>TokenComparator</type>
          <expected-output>Found 3 items</expected-output>
        </comparator>
        <comparator>
          <type>RegexpComparator</type>
          <expected-output>^0( |\t)*hdfs://\w+[.a-z]*:[0-9]*/user/[a-z]*/file0</expected-output>
          <expected-output>^0( |\t)*hdfs://\w+[.a-z]*:[0-9]*/user/[a-z]*/file1</expected-output>
          <expected-output>^0( |\t)*hdfs://\w+[.a-z]*:[0-9]*/user/[a-z]*/file2</expected-output>
        </comparator>
      </comparators>
    </test>
...



This is a very flexible way to specify a command and its corresponding expected output. We can now compare the output with the expected result using different Comparator classes. Cool. I like this approach better.
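
To make the comparator idea concrete, here is a minimal sketch of what pluggable comparators could look like. The interface and class names below are invented for illustration; they are not the Hadoop CLI test classes, whose exact behavior may differ.

```java
import java.util.regex.Pattern;

// Hypothetical comparator contract (an assumption, not the Hadoop class hierarchy).
interface OutputComparator {
    boolean compare(String actual, String expected);
}

// Passes when the expected text appears anywhere in the actual output.
class SubstringMatch implements OutputComparator {
    public boolean compare(String actual, String expected) {
        return actual.contains(expected);
    }
}

// Passes when at least one line of the actual output matches the expected regex.
class RegexLineMatch implements OutputComparator {
    public boolean compare(String actual, String expected) {
        Pattern pattern = Pattern.compile(expected);
        for (String line : actual.split("\n")) {
            if (pattern.matcher(line).find()) {
                return true;
            }
        }
        return false;
    }
}

class ComparatorDemo {
    public static void main(String[] args) {
        // Made-up output resembling the touchz test above.
        String actual = "Found 3 items\n0   hdfs://host:9000/user/lei/file0";
        OutputComparator substring = new SubstringMatch();
        OutputComparator regex = new RegexLineMatch();
        System.out.println(substring.compare(actual, "Found 3 items"));
        System.out.println(regex.compare(actual,
                "^0( |\\t)*hdfs://\\w+[.a-z]*:[0-9]*/user/[a-z]*/file0"));
    }
}
```

Because every comparator shares one method, the XML only needs to name a class, and the runner can instantiate it without knowing how the comparison works.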

Back to my small Hadoop MR program that sums values by key: I am NOT able to write down the expected output as a string, since all the data is dynamically generated. I think we could use some improvement here.

Here is one idea: instead of specifying the expected result as a string, we specify a shell command that produces it. Any fixed 'expected output string' can still be expressed as a command like "echo 'expected output string'", and a command can also capture dynamically generated data.
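
A rough sketch of that idea, assuming a POSIX "sh" is available: both the test command and the compare-to command are executed, and their captured outputs are compared. The class and method names are hypothetical, and "printf" merely stands in for a real Hadoop job.

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: the expected value is produced by running a second command.
class CommandVsCommand {

    // Run a command line through "sh -c" and return its trimmed standard output.
    static String capture(String commandLine) {
        try {
            Process process = new ProcessBuilder("sh", "-c", commandLine)
                    .redirectError(ProcessBuilder.Redirect.INHERIT) // keep stderr out of the result
                    .start();
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (InputStream out = process.getInputStream()) {
                byte[] chunk = new byte[4096];
                int n;
                while ((n = out.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
            }
            process.waitFor();
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8).trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not run: " + commandLine, e);
        }
    }

    // True when the test command's output contains the compare-to command's output.
    static boolean outputContains(String testCommand, String compareToCommand) {
        return capture(testCommand).contains(capture(compareToCommand));
    }

    public static void main(String[] args) {
        // A fixed expectation is just an echo; a dynamic one could be any pipeline.
        System.out.println(outputContains(
                "printf 'Estimated value of Pi is 3.68\\n'",
                "echo 'Pi is 3.68'"));
    }
}
```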

Take a look at the new improved version of testConfCluster.xml:


 
<?xml version="1.0" encoding="ISO-8859-1"?>
<bigtop-itest-suite>
 <bigtop-itest-suite-test>
  <test-name>Calculate summation in MR</test-name>
  <test-desc>Here is simple MR test to calculate sum</test-desc>
  <test-pre-integration-test>
  </test-pre-integration-test>
  <test-integration-test>
            <command-set>
            <command>hadoop jar ./target/LeiBigTop-1.1.jar com.lei.bigtop.hadoop.calsum.CalSum ./data ./output</command>
            <command-comparator-type>com.lei.bigtop.hadoop.integration.test.ExtactComparatorIgnoreWhiteSpace</command-comparator-type>
            <command-comparator-compare-to><![CDATA[ cat ./output/* ]]></command-comparator-compare-to>
            </command-set>
  </test-integration-test>
  <test-post-integration-test>
  </test-post-integration-test>
        </bigtop-itest-suite-test>

        <bigtop-itest-suite-test>
            <test-name>calculate pi</test-name>
            <test-desc>calculate pi using hadoop MR</test-desc>
            <test-pre-integration-test>
            </test-pre-integration-test>
            <test-integration-test>
                <command-set>
                <command>hadoop jar $HADOOP_HOME/hadoop-examples-0.*.0.jar pi 5 5</command>
                <command-comparator-type>org.apache.hadoop.cli.util.SubstringComparator</command-comparator-type>
                <command-comparator-compare-to><![CDATA[echo "Pi is 3.68"]]></command-comparator-compare-to>
                </command-set>
            </test-integration-test>
            <test-post-integration-test>
            </test-post-integration-test>
        </bigtop-itest-suite-test>


        <bigtop-itest-suite-test>
            <test-name>count word in MR</test-name>
            <test-desc>count word in Hadoop MR</test-desc>
            <test-pre-integration-test>
                <command-set><command>rm -rf ./wordcount</command></command-set>
                <command-set><command>rm -rf ./wordcount_out</command></command-set>
                <command-set><command>mkdir ./wordcount</command></command-set>
                <command-set><command><![CDATA[curl http://www.meetup.com/HandsOnProgrammingEvents/events/53837022/ | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | sed 's/&nbsp//g' | sed 's/^[ \t]*//;s/[ \t]*$//'  | sed '/^$/d' | sed '/"http[^"]*"/d' > ./wordcount/content]]></command></command-set>
                <command-set><command>hadoop fs -mkdir /wordcount</command></command-set>
                <command-set><command>hadoop fs -put ./wordcount/* /wordcount</command></command-set>
            </test-pre-integration-test>
                <test-integration-test>
                    <command-set><command>hadoop jar $HADOOP_HOME/hadoop-examples-0.*.0.jar wordcount /wordcount /wordcount_out</command></command-set>
                    <command-set><command>mkdir ./wordcount_out</command></command-set>
                    <command-set><command>hadoop fs -get /wordcount_out/* ./wordcount_out</command></command-set>
                    <command-set><command>hadoop fs -rmr /wordcount</command></command-set>
                    <command-set><command>hadoop fs -rmr /wordcount_out/</command></command-set>
                </test-integration-test>
                <test-post-integration-test>
                    <command-set>
                    <command>cat ./wordcount_out/* | grep  Roman | sed 's/[^0-9.]*\([0-9.]*\).*/\1/'</command>
                    <command-comparator-type>com.lei.bigtop.hadoop.integration.test.ExtactComparatorIgnoreWhiteSpace</command-comparator-type>
                    <command-comparator-compare-to><![CDATA[cat wordcount/* | grep -c Roman]]></command-comparator-compare-to>
                    </command-set>
                </test-post-integration-test>
        </bigtop-itest-suite-test>

</bigtop-itest-suite> 


Everything inside the <command></command> tag is a shell command. The <command-comparator-type></command-comparator-type> tag specifies the comparator class, and <command-comparator-compare-to></command-comparator-compare-to> lets the tester write a shell command whose output the result is compared against.
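
Reading those three tags out of each command-set can be sketched with the standard Java DOM parser. This is not the actual implementation, just an illustration of the schema; CommandSetReader and the tiny XML string in main are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader for the command-set elements described above.
class CommandSetReader {

    // Trimmed text of the first descendant element with this tag, or null if absent.
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    // Each entry is {command, comparator type, compare-to command}; missing parts are null.
    static List<String[]> read(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList sets = doc.getElementsByTagName("command-set");
            List<String[]> result = new ArrayList<>();
            for (int i = 0; i < sets.getLength(); i++) {
                Element set = (Element) sets.item(i);
                result.add(new String[] {
                    childText(set, "command"),
                    childText(set, "command-comparator-type"),
                    childText(set, "command-comparator-compare-to")
                });
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse test suite XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<suite><command-set>"
                + "<command>echo hello</command>"
                + "<command-comparator-compare-to><![CDATA[ echo hello ]]></command-comparator-compare-to>"
                + "</command-set></suite>";
        for (String[] set : read(xml)) {
            System.out.println(set[0] + " | " + set[1] + " | " + set[2]);
        }
    }
}
```

A command-set without a comparator simply yields null for the last two fields, which matches the setup and cleanup commands in the XML above that only need to run successfully.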

I have written a small mixed Java/Groovy program to implement the XML schema and run mixed Hadoop tests. Given the XML file above, here is the output:

 
$ mvn clean integration-test
[INFO] Scanning for projects...
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building LeiBigTop 1.1
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] --- maven-clean-plugin:2.4.1:clean (default-clean) @ LeiBigTop ---
[INFO] Deleting /home/lei//workspace/LeiBigTop/target
[INFO] 
[INFO] --- maven-resources-plugin:2.4.3:resources (default-resources) @ LeiBigTop ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /home/lei//workspace/LeiBigTop/src/main/resources
[INFO] 
[INFO] --- maven-compiler-plugin:2.3.2:compile (default-compile) @ LeiBigTop ---
[INFO] Compiling 51 source files to /home/lei//workspace/LeiBigTop/target/classes
[INFO] 
[INFO] --- gmaven-plugin:1.0:generateStubs (default) @ LeiBigTop ---
[INFO]  Generated 5 Java stubs
[INFO] 
[INFO] --- gmaven-plugin:1.0:compile (default) @ LeiBigTop ---
[INFO]  Compiled 6 Groovy classes
[INFO] 
[INFO] --- gmaven-plugin:1.0:generateTestStubs (default) @ LeiBigTop ---
[INFO]  Generated 5 Java stubs
[INFO] 
[INFO] --- gmaven-plugin:1.0:testCompile (default) @ LeiBigTop ---
[INFO]  Compiled 6 Groovy classes
[INFO] 
[INFO] --- maven-resources-plugin:2.4.3:testResources (default-testResources) @ LeiBigTop ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /home/lei//workspace/LeiBigTop/src/test/resources
[INFO] 
[INFO] --- maven-compiler-plugin:2.3.2:testCompile (default-testCompile) @ LeiBigTop ---
[INFO] Nothing to compile - all classes are up to date
[INFO] 
[INFO] --- maven-surefire-plugin:2.7.2:test (default-test) @ LeiBigTop ---
[INFO] Surefire report directory: /home/lei//workspace/LeiBigTop/target/surefire-reports

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running com.lei.bigtop.hadoop.test.RunHadoopTest
Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
There are no tests to run.

Results :

Tests run: 0, Failures: 0, Errors: 0, Skipped: 0

[INFO] 
[INFO] --- maven-jar-plugin:2.3.1:jar (default-jar) @ LeiBigTop ---
[INFO] Building jar: /home/lei//workspace/LeiBigTop/target/LeiBigTop-1.1.jar
[INFO] 
[INFO] >>> exec-maven-plugin:1.2.1:java (default) @ LeiBigTop >>>
[INFO] 
[INFO] <<< exec-maven-plugin:1.2.1:java (default) @ LeiBigTop <<<
[INFO] 
[INFO] --- exec-maven-plugin:1.2.1:java (default) @ LeiBigTop ---
Run test suite in XML file [./bigtop-testcases.xml]
Run test case name [Calculate summation in MR]
Run test case description [Calculate summation in MR]
Command line [hadoop jar ./target/LeiBigTop-1.1.jar com.lei.bigtop.hadoop.calsum.CalSum ./data ./output]
ComparatorClass - com.lei.bigtop.hadoop.integration.test.ExtactComparatorIgnoreWhiteSpace
CommandComparator -  cat ./output/* 
CommandComparator line [ cat ./output/* ]
CommandComparator return code is 0 Output is [a 883.0, b 1185.0, c 1614.0, d 1213.0, e 806.0, f 1226.0, g 898.0, h 1071.0, i 1064.0, j 886.0, k 1267.0, l 1269.0, m 1377.0]
  SUCCESS! 
  actual output - [a 883.0 b 1185.0 c 1614.0 d 1213.0 e 806.0 f 1226.0 g 898.0 h 1071.0 i 1064.0 j 886.0 k 1267.0 l 1269.0 m 1377.0 ] 
  expected -[a 883.0, b 1185.0, c 1614.0, d 1213.0, e 806.0, f 1226.0, g 898.0, h 1071.0, i 1064.0, j 886.0, k 1267.0, l 1269.0, m 1377.0]  
  compare class - com.lei.bigtop.hadoop.integration.test.ExtactComparatorIgnoreWhiteSpace
Run test case name [calculate pi]
Run test case description [calculate pi]
Command line [hadoop jar $HADOOP_HOME/hadoop-examples-0.*.0.jar pi 5 5]
ComparatorClass - org.apache.hadoop.cli.util.SubstringComparator
CommandComparator - echo "Pi is 3.68"
CommandComparator line [echo "Pi is 3.68"]
CommandComparator return code is 0 Output is [Pi is 3.68]
  SUCCESS! 
  actual output - [Number of Maps  = 5, Samples per Map = 5, Wrote input for Map #0, Wrote input for Map #1, Wrote input for Map #2, Wrote input for Map #3, Wrote input for Map #4, Starting Job, Job Finished in 19.29 seconds, Estimated value of Pi is 3.68000000000000000000] 
  expected -[Pi is 3.68]  
  compare class - org.apache.hadoop.cli.util.SubstringComparator
Run test case name [count word in MR]
Run test case description [count word in MR]
Command line [rm -rf ./wordcount]
SUCCESS! return code is 0 Output is []
Command line [rm -rf ./wordcount_out]
SUCCESS! return code is 0 Output is []
Command line [mkdir ./wordcount]
SUCCESS! return code is 0 Output is []
Command line [curl http://www.meetup.com/HandsOnProgrammingEvents/events/53837022/ | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | sed 's/&nbsp//g' | sed 's/^[ \t]*//;s/[ \t]*$//'  | sed '/^$/d' | sed '/"http[^"]*"/d' > ./wordcount/content]
SUCCESS! return code is 0 Output is []
Command line [hadoop fs -mkdir /wordcount]
SUCCESS! return code is 0 Output is []
Command line [hadoop fs -put ./wordcount/* /wordcount]
SUCCESS! return code is 0 Output is []
Command line [hadoop jar $HADOOP_HOME/hadoop-examples-0.*.0.jar wordcount /wordcount /wordcount_out]
SUCCESS! return code is 0 Output is []
Command line [mkdir ./wordcount_out]
SUCCESS! return code is 0 Output is []
Command line [hadoop fs -get /wordcount_out/* ./wordcount_out]
SUCCESS! return code is 0 Output is []
Command line [hadoop fs -rmr /wordcount]
SUCCESS! return code is 0 Output is [Deleted hdfs://lei.hadoop.local:9000/wordcount]
Command line [hadoop fs -rmr /wordcount_out/]
SUCCESS! return code is 0 Output is [Deleted hdfs://lei.hadoop.local:9000/wordcount_out]
Command line [cat ./wordcount_out/* | grep  Roman | sed 's/[^0-9.]*\([0-9.]*\).*/\1/']
ComparatorClass - com.lei.bigtop.hadoop.integration.test.ExtactComparatorIgnoreWhiteSpace
CommandComparator - cat wordcount/* | grep -c Roman
CommandComparator line [cat wordcount/* | grep -c Roman]
CommandComparator return code is 0 Output is [4]
  SUCCESS! 
  actual output - [4] 
  expected -[4]  
  compare class - com.lei.bigtop.hadoop.integration.test.ExtactComparatorIgnoreWhiteSpace
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1:08.489s
[INFO] Finished at: Tue Mar 06 10:07:27 PST 2012
[INFO] Final Memory: 19M/265M
[INFO] ------------------------------------------------------------------------


There are 3 test cases. The first is my small Hadoop MR program; the sums per key are: a 883.0 b 1185.0 c 1614.0 d 1213.0 e 806.0 f 1226.0 g 898.0 h 1071.0 i 1064.0 j 886.0 k 1267.0 l 1269.0 m 1377.0, and they match the expected values. The second runs the command "hadoop jar $HADOOP_HOME/hadoop-examples-0.*.0.jar pi 5 5" to estimate the value of Pi, which comes out as 3.68. The last one is a long list of commands: it downloads our meetup page http://www.meetup.com/HandsOnProgrammingEvents/events/53837022/, runs a word count on it, and finds that the name Roman is referenced 4 times both in the downloaded content and in the output of Hadoop wordcount.
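
The whitespace-insensitive comparison used in the first and last test cases could behave roughly like the sketch below. The real comparator's source is not shown here, so this is a guess at the behavior: trim both strings, collapse every run of whitespace to a single space, then require exact equality. The class name is made up.

```java
// Hypothetical whitespace-insensitive exact match (an assumption about the behavior).
class WhitespaceInsensitive {

    // Trim the text and collapse every run of whitespace into one space.
    static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    // Exact equality after normalizing both sides.
    static boolean sameIgnoringWhitespace(String actual, String expected) {
        return normalize(actual).equals(normalize(expected));
    }

    public static void main(String[] args) {
        // Tabs, newlines and trailing spaces should not cause a mismatch.
        String actual = "a\t883.0\nb\t1185.0 ";
        String expected = "a 883.0 b 1185.0";
        System.out.println(sameIgnoringWhitespace(actual, expected));
    }
}
```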
   
Stay tuned. I will publish the implementation details in a day or so. Enjoy the journey.


PS: If you want the source, drop me an email and I can show you how to get it.

