Monday, February 25, 2008

More test coverage and (rspec, scalacheck, and hunit)

I added more test coverage to botlist, basically adding tests by language. I don't have much to add to this entry except to say: here are some examples of how I used these test frameworks in botlist. Links to the actual libraries are at the bottom of the post.

RSpec


http://openbotlist.googlecode.com/svn/trunk/openbotlist/tests/integration/ruby/rspec

# create mock tests

include_class 'org.spirit.bean.impl.BotListCoreUsers' unless defined? BotListCoreUsers
include_class 'org.spirit.util.BotListUniqueId' unless defined? BotListUniqueId
include_class 'org.acegisecurity.providers.encoding.Md5PasswordEncoder' unless defined? Md5PasswordEncoder
include_class 'org.spirit.bean.impl.BotListProfileSettings' unless defined? BotListProfileSettings
include_class 'org.spirit.contract.BotListContractManager' unless defined? BotListContractManager

include_class 'java.text.SimpleDateFormat' unless defined? SimpleDateFormat
include_class "java.util.Calendar" unless defined? Calendar

include_class "org.spirit.contract.BotListCoreUsersContract"
include_class "org.spirit.bean.impl.BotListEntityLinks"

describe "Creating simple mock objects=" do

before(:each) do
@ac = $context
@rad_controller = @ac.getBean("radController")
@cur_sess_id = rand(1000000)
end

it "Should create the entity links" do
dao = @rad_controller.entityLinksDao
mock_link = BotListEntityLinks.new
mock_link.mainUrl = "http://www.google1.com"
mock_link.fullName = "bot_tester"
mock_link.urlTitle = "The Google"
mock_link.keywords = "google cool man yea"
mock_link.urlDescription = "google is the best yea man"
mock_link.rating = 0
dao.createLink(mock_link)
end

end

ScalaCheck


http://openbotlist.googlecode.com/svn/trunk/openbotlist/tests/integration/scala/spirit/tests


package org.spirit.check.tests

import org.spirit.lift.agents.model._

import org.scalacheck._
import org.scalacheck.Test._
import org.scalacheck.Gen._
import org.scalacheck.Arbitrary._
import org.scalacheck.Prop._

object ExampleTests {
  def runTests() = {
    val prop_ConcatLists = property((l1: List[Int], l2: List[Int]) =>
      l1.size + l2.size == (l1 ::: l2).size)

    Test.check(prop_ConcatLists)
  }
}
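
For comparison, a second property in the same style. This one is my addition rather than botlist test code, but it uses the same property/Test.check API shown above:

val prop_RevRev = property((l: List[Int]) => l.reverse.reverse == l)
Test.check(prop_RevRev)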

HUnit



--
-- Test Stop Words

module Tests.Unit.TestStopWords where

import Test.HUnit

import Data.SpiderNet.Bayes
import Data.List

stopWordsDb = "../../var/lib/spiderdb/lexicon/stopwords/stopwords.tdb"

foo :: Int -> (Int, Int)
foo x = (1, x)

runTestStopWords = do
  stopwords <- readStopWords stopWordsDb
  let datatest = [ "the", "school", "is", "over", "there", "canada", "chicken" ]
      rmwords = datatest \\ stopwords
  putStrLn $ show rmwords
  putStrLn $ "Stop Word Density=" ++
    show (stopWordDensity "the school is over there canada chicken dogs" stopwords)

test1 :: Test
test1 = TestCase (assertEqual "for (foo 3)," (1,2) (foo 3))

allTests :: Test
allTests = TestList [ TestLabel "stop words" test1 ]
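
To actually execute the suite, HUnit provides runTestTT. A minimal runner (my addition, not part of the botlist source) looks like this:

main :: IO ()
main = do
    counts <- runTestTT allTests
    print counts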

Resources


http://rspec.info/
http://code.google.com/p/scalacheck/
http://hunit.sourceforge.net/

----

Sunday, February 24, 2008

Adding rspecs with jruby and for the spring framework

I will have to discuss this in future blog entries, but here is an RSpec helper script for setting up RSpec and Spring (with JRuby):


###
### Author: Berlin Brown
### spec_helper.rb
### Date: 2/22/2008
### Description: RSpec, JRuby helper for setting up
### spring rspec tests for botlist
###
lib_path = File.expand_path("#{File.dirname(__FILE__)}/../../lib")
$LOAD_PATH.unshift lib_path unless $LOAD_PATH.include?(lib_path)

require 'spec'
require 'java'
include_class('java.lang.String') { 'JString' }
include_class('java.lang.System') { 'JSystem' }

# Have to manually find the spring config files
spring_config = File.expand_path("#{File.dirname(__FILE__)}/../../../../WEB-INF/botlistings-servlet.xml")
spring_util_config = File.expand_path("#{File.dirname(__FILE__)}/../../../../WEB-INF/spring-botlist-util.xml")

puts spring_config

# The logs will get output here
JSystem.getProperties().put("catalina.base", "util")

# Ensure correct java type String for spring config filename array
# spring_conf_arr = Java::JavaClass.for_name("java.lang.String").new_array(2)

string_class = Java::JavaClass.for_name("java.lang.String")
spring_conf_arr = string_class.new_array(2)
spring_conf_arr[0] = Java.primitive_to_java("file://" + spring_config)
spring_conf_arr[1] = Java.primitive_to_java("file://" + spring_util_config)

#
# Spring specific includes
include_class "org.springframework.context.support.FileSystemXmlApplicationContext"
# Set the application context. Use this in the rspec tests.
$context = FileSystemXmlApplicationContext.new(spring_conf_arr)

# End of File
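
And here is a hypothetical spec that consumes the helper. The bean name comes from the botlist examples elsewhere on this blog, but this exact spec is mine, for illustration only:

require File.expand_path("#{File.dirname(__FILE__)}/spec_helper")

describe "Spring application context" do
  it "should expose the spring beans through the global $context" do
    dao = $context.getBean("userVisitLogDaoBean")
    dao.should_not be_nil
  end
end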

Friday, February 22, 2008

Botlist Hackathon - Adding test/build server and Lisp web frontend

It is going to be a fun weekend. I am building a simple build server and jobs that will create daily builds. Also, I plan on creating more test coverage for the botlist system.

Fun, fun, fun.

Botlist is built with many languages, so it will be interesting to build a complete test suite.

In case you are interested, here are the languages actively being used, in order of use (leaving out HTML, XML, Bash, and other misc items).

1. Ruby/JRuby - Main web front-end (see botspiritcompany.com).
2. Python - Misc scripting tasks, web scraping, etc.
2b. Python/Django - New web front-end.
3. Java/J2EE/SpringMVC - Part of the main web front-end.
4. Haskell - Text processing back-end.
5. Erlang - Web scraping, IRC bot.
6. Lisp - New web front-end.

Other notables.

A. Perl - for misc scripting tasks
B. Factor - used for a web test framework; I started it but didn't get to work on it further. Still a powerful language for such tasks, probably better than some of those above.

Updated:
Actually, if you really want to get generic:

1. Three imperative languages are used; it is easy to swap between Python, Ruby, and Java code.

2. Haskell is the most unique; it is a functional language, but pretty different from Lisp or Erlang.

3. Erlang is pretty unique also.

4. Ditto for Lisp.

Really, there were four environments used.

TDD in one sentence: "Only ever write code to fix a failing test."

EOM

Thursday, February 21, 2008

Great application

A great application for persisting key/value data. It seems to have just been released.

(Memcachedb)


http://code.google.com/p/memcachedb/wiki/Performance

Tuesday, February 19, 2008

More fun this week, Analysis for wikipedia data

Analysis for WEX:

http://blog.freebase.com/?p=108

"Growing at approximately 1,700 articles a day, Wikipedia is a significant repository of human knowledge. With its focus and depth, Wikipedia has emerged as a public good of information, fueling a small industry of computer science research. And though Wikipedia contains a wealth of collective knowledge, due to is idiosyncratic markup and semi-structured design, developers wishing to utilize this resource each incur significant start-up costs simply handling, parsing and decoding the raw corpus."

Semantic web indexing

Here is a good article on semantic web indexing.

"The most surprising figure here is probably the abundance of FOAF namespaced arcs, this appears to be largely due to the FOAF data automatically generated by services such as Live Journal (http://livejournal.com/) which, of the documents indexed so far account for 89% of the documents using the FOAF namespace."

FOAF Semantic indexing from w3

Monday, February 18, 2008

Makings of a simple web scraper in Erlang

This code parses a web page and tokenizes the content. The code uses Joe's www_tools library, and I was trying to get the rfc4627 code to parse Unicode documents; that particular code is a work in progress. Ultimately, I would like to be able to use this code to crawl FOAF documents.

Simple Driver Code (uses url.erl and disk_cache).

%%
%% Simple Statistic Analysis of social networking sites
%% Author: Berlin Brown
%% Date: 2/12/2008
%%

-module(socialstats).

-export([start_social/0]).

-import(url, [test/0, raw_get_url/2, start_cache/1, stop_cache/0]).
-import(rfc4627, [unicode_decode/1]).
-import(html_analyze, [disk_cache_analyze/1]).

-define(SocialURL, "http://botnode.com/").

start_social() ->
    io:format("*** Running social statistics~n"),
    %% First, setup the URL disk cache
    url:start_cache("db_cache/socialstats.dc"),
    case url:raw_get_url(?SocialURL, 60000) of
        {ok, Data} ->
            io:format("Data found from URL, storing=~s~n", [?SocialURL]),
            disk_cache:store(?SocialURL, Data),
            %% val = list_to_binary(xmerl_ucs:from_utf8([Data])),
            %% val = rfc4627:unicode_decode(Data),
            {ok, Data};
        {error, What} ->
            io:format("ERR:~p ~n", [What]),
            {error, What}
    end,
    %% Analyze the disk cache
    case disk_cache:fetch(?SocialURL) of
        {ok, Bin} ->
            io:format("Data found from disk cache, fetching=~s~n", [?SocialURL]),
            Toks = html_tokenise:disk_cache2toks(?SocialURL),
            io:format("Data found from disk cache, fetching=~p~n", [Toks]),
            {ok, Bin};
        {error, Err} ->
            io:format("ERR:~p ~n", [Err]),
            {error, Err}
    end,
    %% Stop the disk cache
    url:stop_cache(),
    io:format("*** Done [!]~n").

%% End of File
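
To try it out, a hypothetical erl shell session (assuming the www_tools modules are compiled and on the code path):

1> c(socialstats).
{ok,socialstats}
2> socialstats:start_social().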


This, in turn, calls Joe Armstrong's TCP/IP code for retrieving the document.


raw_get_url(URL, Timeout) ->
    case url_parse:parse(URL) of
        {error, Why} ->
            {error, {badURL, URL}};
        {http, HostName, Port, File} ->
            get_http(HostName, Port, File, ["Host: ", HostName], Timeout);
        {file, Location} ->
            get_file(Location)
    end.

raw_get_url(URL, Timeout, {IP, Port}) ->
    get_http(IP, Port, URL, [], Timeout).

get_file(Location) ->
    file:read_file(Location).

get_http(IP, Port, URL, Opts, Timeout) ->
    %% io:format("ip = ~p, port = ~p, url = ~p~n", [ IP, Port, URL ]),
    Cmd = ["GET ", URL, " HTTP/1.1\r\n", Opts, "\r\n\r\n"],
    io:format("Cmd=~p\n", [Cmd]),
    io:format("url_server: fetching ~p ~p ~p~n", [IP, Port, URL]),
    case catch
        gen_tcp:connect(IP, Port,
                        [binary, {packet, raw}, {nodelay, true}, {active, true}]) of
        {'EXIT', Why} ->
            %% io:format("Socket exit:~p~n", [Why]),
            {error, {socket_exit, Why}};
        {error, Why} ->
            %% io:format("Socket error:~p~n", [Why]),
            {error, {socket_error, Why}};
        {ok, Socket} ->
            %% io:format("Socket = ~p~n", [Socket]),
            gen_tcp:send(Socket, Cmd),
            receive_data(Socket, Timeout, list_to_binary([]))
    end.

receive_data(Socket, Timeout, Bin) ->
    receive
        {tcp, Socket, B} ->
            %% io:format(".", []),
            receive_data(Socket, Timeout, concat_binary([Bin, B]));
        {tcp_closed, Socket} ->
            Data0 = binary_to_list(Bin),
            %% io:fwrite("Socket closed: ~p~n", [Data0]),
            {Data1, Info} = get_header(Data0, []),
            Bin1 = list_to_binary(Data1),
            {ok, Bin1};
        Other ->
            %% io:fwrite("Other: ~p~n", [Other]),
            {error, {socket, Other}}
    after Timeout ->
        {error, timeout}
    end.

Botnode wiki re-released (language wiki site)

I have been wanting to create a multi-language programming wiki for a while now. This is it. Basically, the botnode wiki will contain a collection of code snippets in various languages (Haskell, Erlang, Scala, etc.).

http://www.botnode.com/botwiki/index.php?title=Main_Page

Sunday, February 17, 2008

Apparently the junglerl www_tools has issues

I am guessing that the www_tools Erlang library doesn't send a valid HTTP request, because I can't even get a valid response from a simple lighttpd-based page. Sigh, I guess I have to fix it.

In any case, here is the code I am testing.

-module(socialstats).

-export([start_social/0]).

-import(url, [test/0, raw_get_url/2]).

start_social() ->
    io:format("*** Running social statistics~n"),
    case url:raw_get_url("http://botnode.com", 80) of
        {ok, Data} ->
            io:format("Data: ~p ~n", [Data]),
            {ok, Data};
        {error, What} ->
            io:format("ERR:~p ~n", [What]),
            {error, What}
    end,
    io:format("*** Done [!]~n").

%% End of File

Saturday, February 16, 2008

Wikipedia Definition: Erlang programming language

http://en.wikipedia.org/wiki/Erlang_(programming_language)

"Erlang is a general-purpose concurrent programming language and runtime system. The sequential subset of Erlang is a functional language, with strict evaluation, single assignment, and dynamic typing. For concurrency it follows the Actor model. It was designed by Ericsson to support distributed, fault-tolerant, soft-real-time, non-stop applications. It supports hot swapping so code can be changed without stopping a system. [1] Erlang was originally a proprietary language within Ericsson, but was released as open source in 1998. The Ericsson implementation primarily runs interpreted virtual machine code, but it also includes a native code compiler (not supported on all platforms), developed by the High-Performance Erlang Project (HiPE) at Uppsala University. It also now supports interpretation via escript as of r11b-4."

Paul Graham and Design

"Here it is: I like to find (a) simple solutions (b) to overlooked problems (c) that actually need to be solved, and (d) deliver them as informally as possible, (e) starting with a very crude version 1, then (f) iterating rapidly."

So true, so true.

Thursday, February 14, 2008

Blogspam - SQLite Article on Atomic transactions

http://www.sqlite.org/atomiccommit.html

"An important feature of transactional databases like SQLite is "atomic commit". Atomic commit means that either all database changes within a single transaction occur or none of them occur. With atomic commit, it is as if many different writes to different sections of the database file occur instantaneously and simultaneously. Real hardware serializes writes to mass storage, and writing a single sector takes a finite amount of time. So it is impossible to truly write many different sectors of a database file simultaneously and/or instantaneously. But the atomic commit logic within SQLite makes it appear as if the changes for a transaction are all written instantaneously and simultaneously."

Find the truth

Because there is no other path.

Broken Saints Series Review

I posted this to Amazon; a review of the Broken Saints series.

I don't even know what to write. I just finished watching the entire thing and am going: awesome, awesome, awesome, awesome. Amazing. If you can dream up the perfect story that combines young, old, technology, religion, good, and bad, and put it together, Broken Saints will be 1000 times better than anything you could come up with.

It is part Cyberpunk, part religious tale, part storytelling. Truly, truly amazed.

In terms of anime or other things that are considered different or strange:

Broken Saints is better than Akira, probably better than some of the Ghost in the Shell series. It doesn't really compare to any Hollywood stories, but it beats the story of Lord of the Rings.

Good job. I was lucky that I was able to experience this.

Anybody who gives this a bad review probably didn't watch most of it, has a really low IQ, or is flat out crazy. You can easily ignore the bad ratings; I almost listened to them and would have missed out on a great series.

Wednesday, February 13, 2008

Joy compared with other functional programming languages

http://www.latrobe.edu.au/philosophy/phimvt/joy/j08cnt.html

"Joy is a functional programming language which is not based on the application of functions to arguments but on the composition of functions. This paper compares and contrasts Joy with the theoretical basis of other functional formalisms and the programming languages based on them. One group comprises the lambda calculus and the programming languages Lisp, ML and Miranda. Another comprises combinatory logic and the language FP by Backus. A third comprises Cartesian closed categories. The paper concludes that Joy is significantly different from any of these formalisms and programming languages."

Tuesday, February 12, 2008

ANN: Major Release: Botlist 0.5 Valentine Release, would you like some cake?

This is a big release; it won't be visible on the web frontend, but botlist is morphing into the creation that I envisioned. Here is where we are and where we are going:

(1) Find information from RSS feeds (almost complete, but functional)

(2) Find interesting articles from raw online content (getting there, part of the Valentine release)

(3) Extract content from the web and convert the raw information into machine readable format, semantic web (the future of botlist)
And don't forget to visit botlist to see the new updates. The spirits of the bots are alive.

http://www.botspiritcompany.com

Look out for: Reuters, semantic web and Calais

http://opencalais.com/

"What is Calais?

We want to make all the world's content more accessible, interoperable and valuable. Some call it Web 2.0, Web 3.0, the semantic web or the Giant Global Graph - we call our piece of it Calais.

The core of Calais is our web service. We're working to make this service more accessible by developing sample applications, supporting developers and offering bounties for specific capabilities.

For more information - please visit our FAQ."

I just heard about this link through reddit; this is the kind of system that botlist could be.

Monday, February 11, 2008

Python script; check running process

If you launch a long-running process, sometimes you don't want to relaunch the script while the other process is still running. There are bash-oriented ways of checking for this, but those can get complicated, so I wanted to use a more robust language. Here is a script that checks a PID file, checks and greps the 'ps aux' output for a particular name, and returns a 0 exit code if the process is not running.

"""
Berlin Brown
Date: 2/2/2008
Copyright: Public Domain

Utility for checking if process is running.

Versions:
Should work with python 2.4+

Use case includes:
* If PID file found, read the contents
* If PID file found or not found, also check the 'ps aux' status of the script
to make sure that the script is not running.

Additional FAQ:
* What if the PID file gets created but does not get removed?
+ In this scenario, we need to issue a 'force' command. But also,
check the running process with the 'ps aux' command.

Script/App Exit Codes:
0 - Pass, sucess
1 - catchall for general errors
3 - Used for botlist purposes

References:
http://docs.python.org/lib/node536.html
"""


__author__ = "Berlin Brown"
__version__ = "0.1"
__copyright__ = "Copyright (c) 2006-2008 Berlin Brown"
__license__ = "Public Domain"

import sys
import os
from subprocess import Popen, call, PIPE

PROC_SCRIPT_NAME = "check_process.py"

SUCCESS_EXIT_CODE=0
ERR_PS_SCRIPT_RUNNING=3
ERR_PID_SCRIPT_RUNNING=4

def check_ps_cmd(script_name):
try:
p1 = Popen(["ps", "aux"], stdout=PIPE)
p2 = Popen(["grep", script_name], stdin=p1.stdout, stdout=PIPE)
output = p2.communicate()[0]
return output
except Exception, e:
print >>sys.stderr, "Execution failed:", e
return None

def is_pid_running(full_cmd):
try:
p = Popen(full_cmd, shell=True, stdout=PIPE)
output = p.communicate()[0]
if output:
# if something is there then we can return true
return True
except Exception, e:
print >>sys.stderr, "Execution failed:", e
return False
# Final exit
return False

def find_std_output(std_output, script_name):
# split the ps aux output and check the parameters
data = std_output.split()
# first, ignore the current script and eliminate
for i in data:
if i.find(PROC_SCRIPT_NAME) > 0:
return False
# Begin search again, for the target script name
for i in data:
if i.find(script_name) > 0:
return True
return False

def is_script_running(script_name):
res = False
std_output = check_ps_cmd(script_name)
if std_output:
std_output = std_output.split('\n')
for curline in std_output:
res = find_std_output(curline, script_name)
return res

def launch_process(full_cmd):
try:
retcode = call(full_cmd, shell=True)
if retcode < 0:
print >>sys.stderr, "Child was terminated by signal", -retcode
else:
print >>sys.stderr, "Child returned", retcode
return retcode
except OSError, e:
print >>sys.stderr, "Execution failed:", e
return -1

def main(args):
if len(args) < 3:
return -1
else:
# Arg - ID:1 = PID file to read
pid_file = args[1]
script_name = args[2]
try:
f = open(pid_file)
data = f.readline()
pid = data.strip()
cmd = "ps -p %s --no-heading" % pid
res = is_pid_running(cmd)

if res:
return ERR_PID_SCRIPT_RUNNING
else:
# It isn't running, that is good.
return SUCCESS_EXIT_CODE

except Exception, e:
# Something happened with the file
print e
print "Checking process list for command"
res = is_script_running(script_name)
# If script is running and file not found
# exit, otherwise success
if res == True:
return ERR_PS_SCRIPT_RUNNING
else:
return SUCCESS_EXIT_CODE

return -1

if __name__ == '__main__':
res = main(sys.argv)
sys.exit(res)
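
A hypothetical wrapper showing how the exit code could gate a relaunch; the paths and the crawler.py name are made up for illustration:

python check_process.py /var/run/crawler.pid crawler.py
if [ $? -eq 0 ]; then
    # Not running; safe to launch and record the new PID
    python crawler.py &
    echo $! > /var/run/crawler.pid
fi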

Saturday, February 9, 2008

My FOAF profile at livejournal

Here is my FOAF/RDF profile from livejournal. I would also like to scan FOAF repositories with botlist; that is a future enhancement.

http://berlinbrown.livejournal.com/data/foaf

Why do I only post code snippets without much explanation?

The first reason is that I eat, drink, and breathe code: code of many different paradigms and idioms. When I try to explain a topic, I just can't help throwing the code out there. You may not even know it, but I am selective about the code that I throw at you. For those expecting a detailed analysis of the examples, you won't find that here. I hope the code snippets are useful; I post them because I can't find similar examples out on the web. Most of them are practical, procedural examples, as opposed to explorations of the language itself. For example, I posted an entry on XML processing in Scala. It introduced a couple of concepts that you may not have seen elsewhere: working with existing Java code, simple code from a model class to XML, and simple liftweb responses.

Enjoy.

Scala and Lift snippet: taste of XML with Scala and Lift for simple XML over HTTP RPC protocol

The botlist application is a distributed system. Bots/agents run in the background on some remote machine and send payloads to a web front-end server; in this case, a J2EE server (botlist). Here is some of the code that makes that happen.

On the receiving end; a liftweb based application running on Tomcat:

The method remote_agent_req is associated with the remote_agent_req URI. It returns an XML response when a GET request is encountered. The remote_agent_send function is used to process POST requests from the stand-alone client.


import java.util.Random
import org.springframework.context.{ApplicationContext => AC}
import org.spirit.dao.impl.{BotListUserVisitLogDAOImpl => LogDAO}
import org.spirit.dao.impl.{BotListSessionRequestLogDAOImpl => SessDAO}
import org.spirit.bean.impl.{BotListUserVisitLog => Log}
import org.spirit.bean.impl.{BotListSessionRequestLog => Sess}
import net.liftweb.http._
import net.liftweb.http.S._
import net.liftweb.http.S
import scala.xml.{NodeSeq, Text, Group}
import net.liftweb.util.Helpers._
import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse, HttpSession}

import org.spirit.lift.agents._

class RemoteAgents(val request: RequestState, val httpRequest: HttpServletRequest) extends SimpleController {

  def remote_agent_req: XmlResponse = {
    // Cast to the user visit log bean (defined in the spring configuration)
    val log_obj = AgentUtil.getAC(httpRequest).getBean("userVisitLogDaoBean")
    val log_dao = log_obj.asInstanceOf[LogDAO]
    val sess_obj = AgentUtil.getAC(httpRequest).getBean("sessionRequestLogDaoBean")
    val sess_dao = sess_obj.asInstanceOf[SessDAO]
    val uniq_id = AgentUtil.buildRequestSession(sess_dao, httpRequest, "request_auth", "true")
    AgentUtil.auditLogPage(log_dao, httpRequest, "remote_agent")

    XmlResponse(
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" >
        <agentmsg>
          <botid>serverbot</botid>
          <message>Hello my name is serverbot. Would you like some cake?</message>
          <status>200</status>
          <requestid>{ Text(uniq_id) }</requestid>
          <majorvers>0</majorvers>
          <minorvers>0</minorvers>
        </agentmsg>
      </rdf:RDF>)
  } // End of Method Request

  def remote_agent_send: XmlResponse = {
    var payload = ""
    S.param("types_payload").map { u => payload = u }

    Console.println(payload.toString)
    if (S.post_?) XmlResponse(
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" >
        <message>Enjoy your cake</message>
      </rdf:RDF>)
    else AgentUtil.invalidXMLResponse
  } // End of Method Send

}


The client is a simple main application that connects to that URL and sends the XML payload request.


package org.spirit.spiderremote

import scala.xml._
import java.sql.DriverManager
import org.spirit.loadtest.LoadTestManager
import scala.collection.jcl.{HashMap, Map}

import org.spirit.spiderremote.model.AgentMessage

object SpiderRemote {

  def main(args: Array[String]): Unit = {
    Console.println("spider remote")
    Class.forName("org.sqlite.JDBC")
    val conn = DriverManager.getConnection("jdbc:sqlite:../../../var/lib/spiderdb/contentdb/spider_queue.db")
    val stat = conn.createStatement()
    Console.println(conn)
    val rs = stat.executeQuery("select * from remotequeue")

    while (rs.next()) {
      Console.println("name = " + rs.getString("url"))
    }
    // Connect to the server
    LoadTestManager.verifySystemDirs
    val res_from_req = LoadTestManager.connectURL("http://127.0.0.1:8080/botlist/lift/pipes/types/remote_agent_req", false)
    val serv_agent_doc = XML.loadString(res_from_req(1))
    val b = serv_agent_doc \\ "message"
    Console.println(b)

    // Post to the server
    val m = new java.util.HashMap[String, String]()
    val map = new HashMap[String, String](m)
    val types_msg_payload = new AgentMessage {
      val message = "I enjoyed my cake."
      val status = 200
      val agentName = "botspiderremote"
      val messageReqId = "123"
    }

    map("types_payload") = types_msg_payload.toXML.toString
    val res_from_snd = LoadTestManager.postData(m, "http://127.0.0.1:8080/botlist/lift/pipes/types/remote_agent_send", false)
    Console.println(res_from_snd(1))
    Console.println("done")
  }
}


Because my blog is low-activity, I don't plan on explaining any further.

Friday, February 8, 2008

Web page DNA

I am developing a type of web page DNA. If you look at any webpage, what are some of its characteristics, things that might stick out? A human can easily tell if a page is interesting or not, but how would a bot do it?

1. For example, botlist may extract the following information from a page:

linktype: () views: 23 links: 4 images: 6 para: 7 chars: 8 proctime: 10 objid:123sdfsdf

2. Some other interesting things might include last-modified date or host name for example.

3. Keywords and description are always important.
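
As a rough sketch of how a bot might start computing such a fingerprint (this is mine, not botlist code, and a crude regex-based count at that):

import re

def page_dna(html):
    """Crude web page fingerprint: counts of links, images, paragraphs, chars."""
    return {
        "links":  len(re.findall(r"<a\s", html, re.I)),
        "images": len(re.findall(r"<img\s", html, re.I)),
        "para":   len(re.findall(r"<p[\s>]", html, re.I)),
        "chars":  len(html),
    }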

Thursday, February 7, 2008

Botlist, the only medium-sized web technology project where one programming language was not enough

Here are the programming language technologies that are used with botlist. If you are interested in the source, it is all freely available.

http://code.google.com/p/openbotlist/

http://www.botspiritcompany.com/botlist/

Web Front End:
Java - bean classes/some view logic (pojos used with hibernate)
JRuby - business logic, database connectivity
Spring Framework - J2EE framework
Hibernate - ORM framework
Scala/Lift - business logic, XML-HTTP api
(Future additions):
Python Django - additional web front end
Lisp web server - additional web front end

Back End:
Python - web crawling
Haskell - Text mining analysis
Scala - Remote APIs

Said what I have been thinking: session state is evil

I am surprised I missed this on the blogosphere; David tells the truth: session state is evil. If you have worked with low-level HTTP applications, really getting at HTTP, then you know this is true. If you write basic ASP pages and save variables, you might not have to deal with this issue.

http://davidvancouvering.blogspot.com/2007/09/session-state-is-evil.html

Saving variables on the server side is an easy thing to do. But once you start to get millions of users and rely on the application server to maintain state for each user, it gets tricky. And trusting the application server may not be the best idea. I will let you read the article and let you decide.

Wednesday, February 6, 2008

Accessing the spring framework from LiftWeb

One of the benefits (if you can figure it out) of working with the JVM languages is the ability to integrate technologies. One of the problems is how to do so. The botlist web application is built on JRuby and Spring. I am now building future functionality with Scala/Lift and Spring.

If you have worked with Spring, you know the ApplicationContext provides the link between the servlet world and the Spring world. In this Lift example, I extract the application context through the HTTP servlet request and the session instance.

def getAC(request: HttpServletRequest) = {
  val sess = request.getSession
  val sc = sess.getServletContext

  // Cast to the application context
  val acobj = sc.getAttribute("org.springframework.web.servlet.FrameworkServlet.CONTEXT.botlistings")
  acobj.asInstanceOf[AC]
}


After getting the application context, it is fairly straightforward to access the Spring bean objects.


// Cast to the user visit log bean (defined in the spring configuration)
val log_obj = AgentUtil.getAC(httpRequest).getBean("userVisitLogDaoBean")
val log_dao = log_obj.asInstanceOf[LogDAO]
AgentUtil.auditLogPage(log_dao, httpRequest, "remote_agent")



val link = new Log()
link.setRequestUri(request.getRequestURI)
link.setRequestPage(curPage)
link.setHost(request.getHeader("host"))
link.setReferer(request.getHeader("referer"))
link.setRemoteHost(request.getRemoteAddr())
link.setUserAgent(request.getHeader("user-agent"))
dao.createVisitLog(link)


That is the core of linking Lift to Spring. Here is the full example.

RemoteAgents.scala

package org.spirit.lift.agents

import java.util.Random
import org.springframework.context.{ApplicationContext => AC}
import org.spirit.dao.impl.{BotListUserVisitLogDAOImpl => LogDAO}
import org.spirit.bean.impl.{BotListUserVisitLog => Log}
import net.liftweb.http._
import net.liftweb.http.S._
import net.liftweb.http.S
import scala.xml.{NodeSeq, Text, Group}
import net.liftweb.util.Helpers._
import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse, HttpSession}

object AgentUtil {

  def uniqueMsgId(clientip: String): String = {
    val r = new Random()
    val rand_long = r.nextLong
    hexEncode(md5((clientip + rand_long).getBytes))
  }

  def auditLogPage(dao: LogDAO, request: HttpServletRequest, curPage: String) = {
    val link = new Log()
    link.setRequestUri(request.getRequestURI)
    link.setRequestPage(curPage)
    link.setHost(request.getHeader("host"))
    link.setReferer(request.getHeader("referer"))
    link.setRemoteHost(request.getRemoteAddr())
    link.setUserAgent(request.getHeader("user-agent"))
    dao.createVisitLog(link)
  }

  def getAC(request: HttpServletRequest) = {
    val sess = request.getSession
    val sc = sess.getServletContext

    // Cast to the application context
    val acobj = sc.getAttribute("org.springframework.web.servlet.FrameworkServlet.CONTEXT.botlistings")
    acobj.asInstanceOf[AC]
  }
}

/**
 * Example request:
 * http://localhost:8080/botlist/lift/pipes/agents/remote_agent
 */
class RemoteAgents(val request: RequestState, val httpRequest: HttpServletRequest) extends SimpleController {

  def remote_agent: XmlResponse = {
    // Cast to the user visit log bean (defined in the spring configuration)
    val log_obj = AgentUtil.getAC(httpRequest).getBean("userVisitLogDaoBean")
    val log_dao = log_obj.asInstanceOf[LogDAO]
    AgentUtil.auditLogPage(log_dao, httpRequest, "remote_agent")

    XmlResponse(
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
               xmlns:botmsg="http://xmlns.com/botmsg/0.1/" >
        <botmsg:agentmsg>
          <botmsg:botid>serverbot</botmsg:botid>
          <botmsg:message>Hello my name is serverbot, go ahead with your request</botmsg:message>
          <botmsg:status>200</botmsg:status>
          <botmsg:requestid>{ Text(AgentUtil.uniqueMsgId(httpRequest.getRemoteAddr)) }</botmsg:requestid>
          <botmsg:majorvers>0</botmsg:majorvers>
          <botmsg:minorvers>0</botmsg:minorvers>
        </botmsg:agentmsg>
      </rdf:RDF>)
  } // End of Method
}


Boot.scala

package bootstrap.liftweb

import net.liftweb.http._
import net.liftweb.util.{Helpers, Can, Full, Empty, Failure, Log}
import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse, HttpSession}
import scala.collection.immutable.TreeMap
import Helpers._

import org.spirit.lift.agents._

class Boot {

  def boot {
    LiftServlet.addToPackages("org.spirit.lift.agents")
    val dispatcher: LiftServlet.DispatchPf = {
      // if it's a web service, pass it to the web services invoker
      case RequestMatcher(r, ParsePath("lift" :: "pipes" :: "agents" :: c :: _, _, _), _, _) => invokeAgents(r, c)
    }
    LiftServlet.addDispatchBefore(dispatcher)
  }

  private def invokeAgents(request: RequestState, methodName: String)(req: HttpServletRequest): Can[ResponseIt] =
    createInvoker(methodName, new RemoteAgents(request, req)).flatMap(_() match {
      case Full(ret: ResponseIt) => Full(ret)
      case _ => Empty
    })
} // End of Class

Person month calculations on an opensource project

I was browsing the web and came upon the botlist project on koders.com. Koders.com archives the source of various projects and adds the source to their search engine. It also collects interesting statistics on a particular project. Here are the botlist numbers:

Development Cost: $135,685
Lines of code: 27,137
Person months (PM): 27.14
Labor Cost/Month: $5000

Here is a larger project (JBoss):

Development Cost: $8,479,025
Lines of code: 1,695,805
Person months (PM): 1695.81
Labor Cost/Month: $5000

Here is a question: what does it take to develop a useful opensource (or possibly commercial) project? I am going to use some arbitrary numbers for the sake of argument. And yes, the number of lines of code is a bad metric to use, but there is a big difference between 10 lines of code, 100,000 lines of code, and a million lines of code.

I want to create a project which will end up with 200,000 lines of code. It is a generic widget server, developed in Java, Python, or C#.

According to koders.com:

200,000 / 1000 lines of code = 200 person months.

It would take one person 200 months to develop this project. Say you get the interest of the community and now have 10 developers; together it will take 20 months, one year and 8 months, to create this project. At $5000 a month, it will cost $1,000,000 to pay those developers: $100,000 per developer, which over 20 months works out to about $60,000 a year.
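
A quick sketch of that arithmetic, using the koders.com-style assumptions above:

loc = 200000
person_months = loc / 1000                     # 200 PM at 1000 LOC per person-month
developers = 10
calendar_months = person_months / developers   # 20 months, about 1 year and 8 months
total_cost = person_months * 5000              # $1,000,000 at $5000 per month
per_dev_yearly = (total_cost / developers) / (calendar_months / 12.0)  # roughly $60,000 a year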

That is all.

ANN: Setup guide for botlist web application front-end

http://code.google.com/p/openbotlist/wiki/QuickStart

The botlist J2EE web frontend might be considered a medium-sized web application. Make sure that you have a J2EE servlet container; Tomcat 5.5+ is recommended but not required. A MySQL database server is required (expect a Postgres configuration in the future). The Java build tool Ant is also required for building the project.

Test Environment and Recommended Configuration

* Mysql Ver 14.12 Distrib 5.0.51a, for Win32 (ia32) (db server)
* Ant 1.7.0 (java build tool)
* Tomcat 5.5.26 (application server)
* Java SDK java version "1.5.0_11" (java compiler), 1.6 recommended
* Operating systems: WinXP and Ubuntu Linux 7.10
* All other libraries are provided in the subversion source or download

Check out source from subversion

As of 2/2/2008

Checking out the botlist source is the recommended way to build and run the application. In the future, regular releases and snapshots will be available; for now, you should retrieve the latest source code.

I extracted tomcat to my home directory for development.

tomcat_home = ~/projects/tomcat/tomcat5526

cd ~/projects/tomcat/tomcat5526/webapps

svn co http://openbotlist.googlecode.com/svn/trunk/openbotlist

mv openbotlist botlist

The project is called openbotlist; it is best to change the directory name to just botlist because of URI references to 'botlist' in the web application.

Run mysqld and setup the database

Start the mysql daemon and create the databases.

* mysqld (leave the daemon running)
* Open a new shell environment and cd to the tomcat botlist directory
* example: cd ~/projects/tomcat/tomcat5526/webapps/botlist
* cd db
* mysql -uroot (enter the mysql shell)
* source create_database.sql; (create the databases)
* source create_tables.sql; (create the tables)
* source insert_link_groups.sql; (additional step to setup the link group table)

Example output:

Query OK, 1 row affected (0.00 sec)

Query OK, 1 row affected (0.00 sec)

Query OK, 1 row affected (0.00 sec)

Query OK, 1 row affected (0.00 sec)

mysql> source insert_link_groups.sql;

At this point, you have created the MySQL database.

Build the project

To build the project, you simply need to enter the botlist web app directory and invoke the ant command.

* cd the botlist web application directory
* example: cd ~/projects/tomcat/tomcat5526/webapps/botlist
* ant
* ant tomcat.deploy (this will copy the java class files to the WEB-INF dir)

Example output:

$ ant
Buildfile: build.xml

prepare:
[mkdir] Created dir: c:\projects\tools\home\projects\tomcat\tomcat5526\webapps\botlist\build
[mkdir] Created dir: c:\projects\tools\home\projects\tomcat\tomcat5526\webapps\botlist\build\classes
...
...

Web application database configuration

To configure the application, you need to set the database parameters including username and password.

cp example_botlist_config.properties botlist_config.properties

Edit the botlist_config.properties file.

botlist.db.url=jdbc:mysql:///openbotlist_development
botlist.username=USER
botlist.password=PASSWORD

Run tomcat and navigate to the botlist site

At this point, launch the tomcat server.

* example: cd ~/projects/tomcat/tomcat5526/bin
* ./startup.sh
* navigate your browser to: http://127.0.0.1:8080/botlist/

Tuesday, February 5, 2008

Ping the semantic web, datasets

Interesting, look at all of the RDF datasets that are out there:

http://pingthesemanticweb.com/stats/namespaces.php

http://xmlns.com/foaf/0.1/ 900,799
http://blogs.yandex.ru/schema/foaf/ 581,133
http://www.w3.org/2003/01/geo/wgs84_pos# 145,758
http://rdfs.org/sioc/ns# 80,097
http://rdfs.org/sioc/types#

Monday, February 4, 2008

Business Model: The web is junk, why you could do hosted semantic web services

Web is junk

One thing that has always amazed me: how does Google make all of that money through web advertising? Let me ask that question another way: why do people pay Google so much for web advertising? Recently, I heard a Google talk from a head Google advertising consultant on the billions of dollars in advertising revenue. It was a good presentation on the relationships between search terms and text-based advertising links. But there was one question that I didn't find an answer to. Does web advertising work? Do clicks turn into a real return on investment? I have been on the web since 1997, and I haven't ever intentionally clicked on an advertising link. Maybe once or twice; but after spending hours a day online for some ten years, I can honestly say I don't have any interest in clicking on Google's or any other search engine's advertising. From Google's perspective, it really doesn't matter. If they receive $100 a month from a customer, Google has already completed their transaction. Google may place that particular ad at the right place, at the right time. Who knows? In print media, advertising fits: there is a picture caption of a product, possibly with price and contact information, and related products are positioned next to each other. That works for print, not so much for web media. In any case, Google is one of the hottest technology companies in the history of the world. One of their revenue sources is web advertising. They make a lot of money; I don't.

The web is junk/noise. By and large, there isn't a whole lot of relevant, organized information on the web. Wikipedia, Reddit, and Digg are useful sites with a dense amount of relevant information, and there are a dozen, a hundred, maybe hundreds of sites with relevant bits of information. The other millions and millions of sites are mostly filled with junk. The text mining phrase is noise, and even beyond noise is spam. There is a lot of noise out there. Take Wikipedia, which is a valuable source of information. Wikipedia is great, but they could have gone further and used RDF metadata to organize the information. Some research projects are using manual and automated approaches to convert Wikipedia data into RDF dumps and OWL ontologies (see semantic web). It is unfortunate that the major players haven't pushed for these WWW extensions. Imagine that you are interested in parsing a web document to extract valuable information. It is doable but not straightforward. How would you extract the creation date? The key proper nouns? Topic information? The TABLE, SPAN, and DIV HTML tags provide the layout structure for the browser to render, but they don't describe what the document is about or whether relevant information is available. If you are familiar with RSS, the early RDF Site Summary version provided a format for describing when a page is added to a blog, with title, date, and description information:

<?xml version="1.0"?>

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.org/rss/1.0/">

  <channel rdf:about="http://www.xml.com/xml/news.rss">
    <description>
      XML.com features a rich mix of information and services
      for the XML community.
    </description>
    <image rdf:resource="http://xml.com/universal/images/xml_tiny.gif" />
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="http://xml.com/pub/2000/08/09/xslt/xslt.html" />
        <rdf:li rdf:resource="http://xml.com/pub/2000/08/09/rdfdb/index.html" />
      </rdf:Seq>
    </items>
    <textinput rdf:resource="http://search.xml.com" />
  </channel>

  <image rdf:about="http://xml.com/universal/images/xml_tiny.gif">
  </image>

  <item rdf:about="http://xml.com/pub/2000/08/09/xslt/xslt.html">
    <description>
      Processing document inclusions with general XML tools can be
      problematic. This article proposes a way of preserving inclusion
      information through SAX-based processing.
    </description>
  </item>

  <item rdf:about="http://xml.com/pub/2000/08/09/rdfdb/index.html">
    <description>
      Tool and API support for the Resource Description Framework
      is slowly coming of age. Edd Dumbill takes a look at RDFDB,
      one of the most exciting new RDF toolkits.
    </description>
  </item>

  <textinput rdf:about="http://search.xml.com">
    <description>Search XML.com's XML collection</description>
    <name>s</name>
  </textinput>
</rdf:RDF>


Host the data

Imagine setting up a data hosting service. Host various data formats like RDF. Give users at least 10 gigs. Charge a light fee, such as $10-20 a month. This is where it gets a little complicated: as opposed to throwing HTML at the users, you could host their RDF and then output visual data tools: graphs, charts, simple web interfaces. Geocities used to be a free, shared web hosting service; now you are doing an RDF hosting service.

Sunday, February 3, 2008

Haskell-snippet: Split with regex

The first listing shows Perl code for splitting a string with the delimiter "::|".

Listing 2.3.2008.1:

#!/usr/bin/perl
# Simple example, show regex split usage
print "Running\n";
$string = "file:///home/baby ::| test1::| test2";
$string2 = "file:///home/baby , test1, test2";
my @data = split /\s*::\|\s*/, $string;
print "----\n";
print join("", @data);
print "\n----\n";
print "Done\n";


The second listing below shows a Haskell regex approach for performing the same operation:

Listing 2.3.2008.2:

import Text.Regex (splitRegex, mkRegex)

csv :: String
csv = "abc ::| 123 ::|"

main :: IO ()
main = do
    let csv_lst = splitRegex (mkRegex "\\s*(::\\|)+\\s*") csv
        linkUrlField = csv_lst !! 0
        -- ... (remaining fields elided)
    putStrLn linkUrlField


End of code snippet.

Haskell snippet; CRUD operations with haskell hsql and hsql-sqlite3

The source listing below is not complicated, showing a basic create/read unit test (minus the update/delete) against a simple sqlite3 database. You may have some trouble setting up hsql, especially because it seems that the module is not being maintained. The code is still useful and viable, but for now you are going to have issues building the module. The build failure will probably be resolved pretty soon, as I see updates to that particular code.

Ensure that you are running the latest ghc. Tested with ghc 6.8.2

Download hsql-1.7 (or greater)
http://hackage.haskell.org/packages/archive/hsql/1.7/hsql-1.7.tar.gz

Change the hsql.cabal to what is shown in the listing; the Rank2Types and DeriveDataTypeable extensions were added. This will not build with previous versions of ghc (at least, I had to build it on 6.8.2).


name:            hsql
version:         1.7
license:         BSD3
author:          Krasimir Angelov
category:        Database
description:     Simple library for database access from Haskell.
exposed-modules:
        Database.HSQL,
        Database.HSQL.Types
build-depends:   base, old-time
extensions:      ForeignFunctionInterface,
                 TypeSynonymInstances, CPP, Rank2Types, DeriveDataTypeable


Do the runhaskell Setup.lhs configure (then build and then install).
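
Spelled out, the standard Cabal steps are:

runhaskell Setup.lhs configure
runhaskell Setup.lhs build
runhaskell Setup.lhs install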

Get the latest version of hsql-sqlite3:

http://hackage.haskell.org/cgi-bin/hackage-scripts/package/hsql-sqlite3-1.7

hsql-sqlite3 also has a cabal script that won't run with the current version of ghc and cabal; I don't know if this is the definitive way to resolve the build errors, but making these changes resolved the build failures.

Add the following to hsql-sqlite3.cabal
extra-libraries: sqlite3

There was a lot of script config code in Setup.lhs; I removed all of it and just added these five lines.

Setup.lhs:

#!/usr/bin/runghc

\begin{code}
import Distribution.Simple
main = defaultMain
\end{code}


I then ran runhaskell Setup.lhs configure (then build, and install).

The example

The Haskell source listing shows the code for creating the filesystem sqlite3 database called simple.db along with the other create/read hsql operations. (As a prerequisite, create the tmp directory in the current working path.)


module Tests.Data.TestBasicHSQL where

import IO
import Database.HSQL as Hsql
import Database.HSQL.SQLite3 as Hsql

simpleDB = "tmp/simple.db"

sqlCreate = "create table if not exists simpletable(mydata)"
sqlInsert = "insert into simpletable values('dogs and cats')"
sqlSelect = "select mydata from simpletable"

--
-- Get Rows routine from David at davblog48
getRows :: Statement -> IO [[String]]
getRows stmt = do
    let fieldtypes = map (\(a, b, c) -> a) $ getFieldsTypes stmt
    rowdata <- collectRows (\s -> mapM (getFieldValue s) fieldtypes) stmt
    return rowdata

runTestBasicHSQL = do
    putStrLn "Test HSQL"
    tryconn <- try $ Hsql.connect simpleDB ReadWriteMode
    conn <- case tryconn of
              Left _     -> error "Invalid Database Path"
              Right conn -> return conn

    -- Run a simple create query
    stmt <- Hsql.query conn sqlCreate
    Hsql.closeStatement stmt
    stmt <- Hsql.query conn sqlInsert
    Hsql.closeStatement stmt
    stmt <- Hsql.query conn sqlSelect
    rows <- getRows stmt
    putStrLn $ "Length rows=" ++ show (length rows)
    mapM_ (\val -> putStrLn $ show val) rows
    Hsql.closeStatement stmt
    Hsql.disconnect conn
-- End of File

Friday, February 1, 2008

Haskell HSQL/SQLite with ghc 6.8 setup is ...a little...messed up?

I am doing my research on the intertubes, and it looks like database access with ghc/haskell is not that high on the list of priorities. I am sure that somewhere in that haskell source is working code; it is just a matter of getting it to work with the most recent stuff, like Cabal 1.2+, GHC 6.8+, and, gasp, Sqlite. I didn't even think about messing with postgres/sqlite.

This is the difference between opensource and commercial software. It is not that there are sometimes issues with the code; it is an issue of doing the right thing and/or making people happy. Most opensource projects do it the right way. Huh? Ideally, you don't want to mix the base haskell system with the database drivers, and GHC has done just that. Java, e.g., may include a whole mess of garbage that you don't normally need; in theory, it doesn't make sense.

But if you are a lazy developer, sometimes being able to just run your database code is a lot easier, even though, theoretically, the base compiler should be separate from external libraries.

It would be nice if I could just write my SQL code and go about my day, but I can understand the decoupling of components. The only drawback is that things that would normally be important in modern development, like database access, get left behind. Shrug.

I will keep you up to date.

Some of my research.

"> something seemed to have changed in cabal
> (compiling hsql-1.7 with ghc version 6.8.0.20070921)
>
> HSQL/MySQL$ runhaskell Setup.lhs configure -p -O
>
> Setup.lhs:8:33:
> Module
> `Distribution.Simple.Utils'
> does not export
> `rawSystemVerbose'
>
> (same: http://hackage.haskell.org/packages/archive/hsql-mysql/1.7/log )
>
> is there a workaround?

The api that the Setup.lhs is using has changed, it's not just rawSystemVerbose.
The return types are now IO () not IO ExitCode and verbosity is now a proper
Verbosity type, not an Int. The Setup.lhs can be considerably simplified by
using configurations and other new features.

So basically it needs updating for the newer cabal. It may also need to be
updated for the fact that many modules from the base package have been split off.

I've attached an example updated Setup.lhs and hsql-mysql.cabal files.

Don was suggesting we start a wiki page with advice to package maintainers on
what updates are common for the transition to ghc-6.8 and cabal-1.2. This is a
certainly a good idea. By trying a range of packages it should also tell us if
there are any minor cabal changes that we could do to make the transition
smoother (eg adding back rawSystemVerbose with a deprecation warning)."