A Blog about Software Development. Software development is the act of developing a software product.
Thursday, January 31, 2008
Web specification for botist front ends
The botlist backend and web frontend can operate independently. The backend sends data to the frontend and the web frontend is used to display that information.
http://www.botspiritcompany.com/botlist/
In terms of specification. The primary goal is to display URL information from the botlist repository.
For example.
Specification 1:
1. Display links in a listed format
2. Have a link to the host name
3. Allow for user submissions
That is the idea.
Why have a specification? Well, because I believe in programming language agnostic oriented development. The web front end could be implemented with the Spring framework, Lisp, or Django.
Actually, I plan on adding a Lisp/Django component in the future.
Wednesday, January 30, 2008
Haskell Snippet: looking at foldl (also a java version)
"It takes the second argument and the first item of the list and applies the function to them, then feeds the function with this result and the second argument and so on. See scanl for intermediate results."
The type definition is:
Foldl: (a -> b -> a) -> a -> [b] -> a
Lets say that out loud; I can call the foldl function with the following parameters; the first parameter passed to foldl is a generic function that has two inputs a and b. This input function returns a type of a. The second parameter has a type of a, the first, initial item or the result of the previous operation. The third parameter is an array of type b.
The listing below shows how you might use foldl.
let a = foldl (\x y -> (++) x (show y)) "-->" [77, 88, 99]
To make it more complicated and unreadable, I defined the first parameter function input as a lambda function with inputs 'x' and 'y':
(\x y -> (++) x (show y))
Here are some other variations of calling foldl:
-- foldl :: (a -> b -> a) -> a -> [b] -> a
-- Fold given an input array [77,88,99] = b
-- b = Array of Integer type
-- a = String type
-- Output = -->778899
let a = foldl (\x y -> (++) x (show y)) "-->" [77, 88, 99]
-- Other functions, operators
-- (++) :: [a] -> [a] -> [a]
-- In our case; x ++ (show y) where y = the integer from [b]
-- Other variations:
let lst = [77, 88, 99]
-- Pass an empty list as 'a'
b = foldl (\x y -> x ++ show y) [] lst
-- What happens if want to find the max given an array of ints
c = foldl max 0 [1, 2, 3, 4,99, 5]
-- Look at it differently with a lambda call
d = foldl (\x y -> x `max` y) 0 [1, 2, 3, 4, 99, 5]
-- How would I take the sum of a list of values (w/o using sum)
e = foldl (\x y -> x + y) 0 [1, 2, 3]
-- Output as you would expect is e = 6
The key is to use the type definition as your guide:
Foldl: (a -> b -> a) -> a -> [b] -> a
foldl with inputs function (a -> b -> a) and input a and input array [b] returns a.
How would this example look in java?
At first, I was going to give a simpler java example that outputs the same result as in haskell, but instead I ended up with this more generic java function. Try this, visualize the result of running the first haskell example shown in the listing above:
Output=-->778899
Ok, define how you would solve that result in an imperative language (like Java) with the simplest possible implementation. You might end up with this (pseudo code):
public String foldl(String init_a, int b[]) {
StringBuffer buf = init_a + b[0]
for (int i = 0; i < b.length(); i++) {
buf.append(b[i]
}
return buf.toString();
}
Something like that, now here is a generic example in Java:
// Date: 1/29/2008
// Example mimic of foldl function in haskell in java.
// foldl :: (a -> b -> a) -> a -> [b] -> a
public class A {
public interface FuncCall {
public Object exec_AtoBtoA(Object o1, Object o2);
}
public static Object exFoldl(Object initA, Object b[],
FuncCall func) {
Object a_res = func.exec_AtoBtoA(initA, b[0]);
for (int i = 1; i < b.length; i++) {
a_res = func.exec_AtoBtoA(a_res, b[i]);
}
return a_res;
}
public static void main(String [] args) {
System.out.println("running");
// -- How would I take the sum of a list of values (w/o using sum)
// let a = foldl (\x y -> (++) x (show y)) "-->" [77, 88, 99]
Object b[] = { 77, 88, 99 };
Object initA = "-->";
Object finalRes = exFoldl(initA, b, new FuncCall () {
public Object exec_AtoBtoA(Object x, Object y) {
String lambdaRes = (x.toString() + y.toString());
return lambdaRes;
}
});
System.out.println("Output=" + finalRes);
Object b2[] = { 1, 2, 3 };
Integer initA2 = 0;
finalRes = exFoldl(initA2, b2, new FuncCall () {
public Object exec_AtoBtoA(Object x, Object y) {
Integer lambdaRes = Integer.parseInt(x.toString()) +
Integer.parseInt(y.toString());
return lambdaRes;
}
});
System.out.println("Output=" + finalRes);
/*
Output after running example:
Output=-->778899
Output=6
*/
}
}
That is all there is to it to understanding foldl and other prelude haskell functions. Read the definition from the haskell documentation to get an idea of the inputs and output. Then pass the correct inputs.
For example, some other useful functions are zip and unzip. Zip has the following type:
zip :: [a] -> [b] -> [(a, b)]
Create a list of tuples from two input lists.
Prelude> let a = [1, 2, 3]
Prelude> let b = ["a", "b", "c"]
Prelude> zip a b
[(1,"a"),(2,"b"),(3,"c")]
Prelude> :t
Prelude> :t zip
zip :: [a] -> [b] -> [(a, b)]
Prelude> unzip (zip a b)
([1,2,3],["a","b","c"])
Prelude>
Purpose of botlist
http://botspiritcompany.com/botlist/
A lot of web applications provide you with information based on your input. Google and yahoo work in this manner. You pull information from them based on some query. Botlist hopes to operate based on a push architecture. Much in the way that other news aggregation services works. Information is piped through to the botlist web application and the top articles and links are displayed to the user. The other links that aren't as popular will still be available through search.
Pretty simple, yet a lot of work. The back end crawler is built with a combination of haskell and python. Haskell for the content analysis and python for the crawler.
Tuesday, January 29, 2008
Paul Graham's Arc has been released
http://paulgraham.com/arc0.html
"We're releasing a version of Arc today, along with a site about it at arclanguage.org. This site will seem very familiar to users of Hacker News. It's mostly the same code, with a few colors and messages changed."
Monday, January 28, 2008
Haskell Snippet: read CSV file, marshall into type
data PageURLFieldInfo = PageURLFieldInfo {
linkUrlField :: String,
aUrlField :: Integer,
blockquoteUrlField :: Integer,
divUrlField :: Integer,
h1UrlField :: Integer,
imgUrlField :: Integer,
pUrlField :: Integer,
strongUrlField :: Integer,
tableUrlField :: Integer
}
--
-- The info content file contains html document information.
-- It may not exist but should, also contains URL info.
readInfoContentFile :: String -> IO PageURLFieldInfo
readInfoContentFile extr_file = do
let extr_n = (length ".extract")
extr_path = take ((length extr_file) - extr_n) extr_file
info_file = extr_path ++ ".info"
-- Extract the file, in CSV format.
-- URL::|a::|b::|blockquote::|div::|h1::|h2::|i::|img::|p::|span::|strong::|table
csvtry <- try $ readFile info_file
-- Handler error
info <- case csvtry of
Left _ -> return defaultPageFieldInfo
Right csv -> do let csv_lst = splitRegex (mkRegex "\\s*[::|]+\\s*") csv
return PageURLFieldInfo {
linkUrlField = csv_lst !! 0,
aUrlField = read (csv_lst !! 1) :: Integer,
blockquoteUrlField = read (csv_lst !! 2) :: Integer,
divUrlField = read (csv_lst !! 3) :: Integer,
h1UrlField = read (csv_lst !! 4) :: Integer,
imgUrlField = read (csv_lst !! 5) :: Integer,
pUrlField = read (csv_lst !! 6) :: Integer,
strongUrlField = read (csv_lst !! 7) :: Integer,
tableUrlField = read (csv_lst !! 8) :: Integer
}
return info
-- End of File
Sunday, January 27, 2008
Haskell Snippet; tokenize and clean a file
Example Input:
A AAND ABLE ABOUT ABOVE
ACQUIRE ACROSS AFOREMENTIONED AFTER AGAIN
AGAINST AGO AGREE AGREEABLE AHEAD
AKIN ALBEIT ALCUNE ALGHOUGH ALGUNOS
ALL
Example output:
aand
able
about
above
accordance
according
acquire
across
aforementioned
after
again
import qualified Data.Set as Set
import Data.List
wordTokens :: String -> [String]
wordTokens content = tokens
where maxwordlen = 100
lowercase str = map toLower str
alltokens = splitRegex (mkRegex "\\s*[ \t\n]+\\s*") (lowercase content)
tokens = filter (\x -> length x > 1 && length x < maxwordlen) alltokens
--
--
-- Given an unclean content set; tolower, filter by length, get unique tokens,
-- tokenize, join the list back together with a token on each line.
-- @see intersperse ',' "abcde" == "a,b,c,d,e"
tokenizeInput :: String -> IO String
tokenizeInput content = return $ concat . intersperse "\n" $ unify
where tokens = wordTokens content
unify = Set.toList . Set.fromList $ tokens
Stop Word Database
The tool was used to create a STOP WORD database, a database of words that are important to any particular document, words like "the", "than", "if".
http://openbotlist.googlecode.com/svn/trunk/botlistprojects/botspider/spider/var/lib/spiderdb/lexicon/stopwords/stopwords.tdb
"It doesnt matter about the drugs and rock and roll, sex " - Haskell bayes and other classifications
That was an example test case that I used against my bayes classification application. Bayes and other probability algorithms are used in most of the spam filtering software that is out in the net. I use various probability routines to classify the news articles. I begin with Toby's python code on the subject and ported is examples to haskell. So far, the results have been pretty satisfactory. The naives bayes classification is dismal, but maybe I am looking at the numbers wrong. In classic form, I will throw code at you and will give a lollipop to the mind reader that can figure out what is going on.
The output of the test case:
In my first test case, I extracted a couple of articles from the web with these subjects; politics, business, entertainment. A human can easily tell what is going on. A software program is going to have a lot of difficultly for several reasons, but two main ones. One, there are too many STOP words and too many ambiguous phrases in my sample training set. The feature word global could easily apply to entertainment, politics, or business. Which one does the computer decide upon? In any case, the first listing below shows the input data set.
let phrases =
[
"It doesnt matter about the drugs and rock and roll, sex",
"I agree citi, global market money business business driving force",
"ron paul likes constitution war international freedom america",
"viagra drugs levitra bigger",
"Movies are fun and too enjoy",
"war america not good"
]
And the output:
Running Tests
At: Sun Jan 27 05:08:37 EST 2008 t:1201428517
Test Train Bayes
793
["train/business.train","train/entertainment.train","train/politics.train"]
---- train/business.train ----
. [It doesnt matter about the drugs and rock and roll, sex ]
Bayes Probability=3.989729693042545e-7
Fisher Probability=0.836128249211062
. [I agree citi, global market money business business driving force ]
Bayes Probability=4.850155473781951e-5
Fisher Probability=0.9541001591541463
. [ron paul likes constitution war international freedom america ]
Bayes Probability=3.780639186633039e-4
Fisher Probability=0.6089235797669089
. [viagra drugs levitra bigger ]
Bayes Probability=2.419609079445145e-2
Fisher Probability=0.6980297367583733
. [Movies are fun and too enjoy ]
Bayes Probability=1.9649303466097898e-4
Fisher Probability=0.4827094766567699
. [war america not good ]
Bayes Probability=1.2098045397225725e-2
Fisher Probability=0.5440441299962482
---- train/entertainment.train ----
. [It doesnt matter about the drugs and rock and roll, sex ]
Bayes Probability=2.0531041666810224e-7
Fisher Probability=0.6270914273866588
. [I agree citi, global market money business business driving force ]
Bayes Probability=2.339809268600252e-5
Fisher Probability=0.4542155835210187
. [ron paul likes constitution war international freedom america ]
Bayes Probability=1.8718474148802017e-4
Fisher Probability=0.6089235797669089
. [viagra drugs levitra bigger ]
Bayes Probability=1.197982345523329e-2
Fisher Probability=0.6980297367583733
. [Movies are fun and too enjoy ]
Bayes Probability=1.0245769451548432e-4
Fisher Probability=0.8278707842798139
. [war america not good ]
Bayes Probability=5.989911727616645e-3
Fisher Probability=0.5440441299962482
---- train/politics.train ----
. [It doesnt matter about the drugs and rock and roll, sex ]
Bayes Probability=4.237119541681478e-7
Fisher Probability=0.4319928795655645
. [I agree citi, global market money business business driving force ]
Bayes Probability=5.141422998108449e-5
Fisher Probability=0.4542155835210187
. [ron paul likes constitution war international freedom america ]
Bayes Probability=4.1625450234461715e-4
Fisher Probability=0.8928739462341001
. [viagra drugs levitra bigger ]
Bayes Probability=2.632408575031526e-2
Fisher Probability=0.6980297367583733
. [Movies are fun and too enjoy ]
Bayes Probability=2.131121584070195e-4
Fisher Probability=0.46619446182733354
. [war america not good ]
Bayes Probability=1.3240857503152584e-2
Fisher Probability=0.7855657341350474
Done
Like I said before, the bayes probability numbers are all over the place. It could be an issue with that I didn't extract stop words from the training set. Or could be the training set is too small or both.
The interesting part is that the fisher probability algorithm figured out the correct classification in each instance.
It doesnt matter about the drugs and rock and roll, sex
I wanted to associate this term with entertainment. There was a 0.6270914273866588 that is entertainment and a 45% "I agree citi, global market money business business driving force" chance that this phrase is an entertainment document. Subsequently, it matched in all of the other instances as well. 100% success rate for the fisher probability.
The next listing shows the test case.
module Tests.Data.TestTrainBayes where
import Monad (liftM)
import System.Directory (getDirectoryContents)
import List (isPrefixOf, isSuffixOf)
import Data.SpiderNet.Bayes
trainDir = "train"
runTestTrainBayes :: IO ()
runTestTrainBayes = do
putStrLn "Test Train Bayes"
-- Process only files with 'train' extension
files <- getDirectoryContents trainDir
let trainfiles = filter (isSuffixOf ".train") files
trainpaths = map (\x -> trainDir ++ "/" ++ x) trainfiles
lst_content <- liftM (zip trainpaths) $ mapM readFile trainpaths
-- Print a count of the training set size
let info = buildTrainSet lst_content []
putStrLn $ show (length info)
putStrLn $ show (categories info)
let phrases =
[
"It doesnt matter about the drugs and rock and roll, sex",
"I agree citi, global market money business business driving force",
"ron paul likes constitution war international freedom america",
"viagra drugs levitra bigger",
"Movies are fun and too enjoy",
"war america not good"
]
-- process the following input phrases
mapM_ (\cat -> do
putStrLn $ "---- " ++ cat ++ " ----"
mapM_ (\phrase -> do
putStrLn $ " . [" ++ phrase ++ " ]"
putStrLn $ " Bayes Probability=" ++
(show ((bayesProb info (wordTokens phrase) cat 1.0)))
putStrLn $ " Fisher Probability=" ++
(show ((fisherProb info (wordTokens phrase) cat)))
) phrases
) (categories info)
Source
This module implements the classification source; step through the example to get an understanding of the code. Start with the test case and then move to the bayesProb and fisherProb functions.
http://openbotlist.googlecode.com/svn/trunk/botlistprojects/botspider/spider/lib/haskell/src/Data/SpiderNet/Bayes.hs
fisherProb :: [WordCatInfo] -> [String] -> String -> Double
fisherProb features tokens cat = invchi
where initp = 1.0
weight = 1.0
p = foldl (\prb f -> (prb * (weightedProb features f cat weight))) initp tokens
fscore = (negate 2) * (log p)
invchi = invChi2 fscore ((genericLength tokens) * 2)
And here is an example input document, for training the business class:
Look no further than the European Central Bank, which was notably absent when the Fed made its emergency rate cut amid falling global stocks on Tuesday. In testimony Wednesday before the European Parliament, ECB President Jean-Claude Trichet came about as close as a member of the brotherhood ever will to calling out a fellow central banker: "In demanding times of significant market correction and turbulences, it is the responsibility of the central bank to solidly anchor inflation expectations to avoid additional volatility in already highly volatile markets
Economics is the social science that studies the production, distribution, and consumption of goods and services. The term economics comes from the Greek for oikos (house) and nomos (custom or law), hence "rules of the house(hold)."[1]
Although discussions about production and distribution have a long history, economics in its modern sense as a separate discipline is conventionally dated from the publication of Adam Smith's The Wealth of Nations in 1776.[7] In this work Smith describes the subject in these practical and exacting terms:
Political economy, considered as a branch of the science of a statesman or legislator, proposes two distinct objects: first, to supply a plentiful revenue or product for the people, or, more properly, to enable them to provide such a revenue or subsistence for themselves; and secondly, to supply the state or commonwealth with a revenue sufficient for the public services. It proposes to enrich both the people and the sovereign.
Smith referred to the subject as 'political economy', but that term was gradually replaced in general usage by 'economics' after 1870.
In economics, a business (also called firm or enterprise) is a legally recognized organizational entity existing within an economically free country designed to provide goods and/or services to consumers. Businesses are predominate in capitalist economies, where most are privately owned and typically formed to earn profit to increase the wealth of their owners. The owners and operators of a business have as one of their main objectives the receipt or generation of a financial return in exchange for their work and their acceptance of risk. Notable exceptions to this rule include cooperative businesses and government institutions. This model of business functioning is contrasted with socialistic systems, which involve either government, public, or worker ownership of most sizable businesses.
Saturday, January 26, 2008
What is autonomiccomputing?
"What is autonomic computing? It is the ability of systems to be more self-managing. The term autonomic comes from the autonomic nervous system, which controls many organs and muscles in the human body. Usually, we are unaware of its workings because it functions in an involuntary, reflexive manner -- for example, we don't notice when our heart beats faster or our blood vessels change size in response to temperature, posture, food intake, stressful experiences and other changes to which we're exposed. And, by the way, our autonomic nervous system is always working."
Alan Ganek, VP Autonomic Computing, IBM
Haskell Snippet; cat data in files
import System.Directory (getDirectoryContents)
import List (isPrefixOf, isSuffixOf)
import Data.SpiderNet.Bayes
trainDir = "train"
wordTokens :: String -> [String]
wordTokens content = tokens
where maxwordlen = 100
lowercase str = map toLower str
alltokens = splitRegex (mkRegex "\\s*[ \t\n]+\\s*") (lowercase content)
tokens = filter (\x -> length x > 1 && length x < maxwordlen) alltokens
runTestTrainBayes :: IO ()
runTestTrainBayes = do
putStrLn "Test Train Bayes"
files <- getDirectoryContents trainDir
let trainfiles = filter (isSuffixOf ".train") files
trainpaths = map (\x -> trainDir ++ "/" ++ x) trainfiles
d <- concatMap lines `fmap` mapM readFile trainpaths
let z = wordTokens (concat d)
putStrLn $ show z
Update: Simpler Example
runTestTrainBayes :: IO ()
runTestTrainBayes = do
putStrLn "Test Train Bayes"
files <- getDirectoryContents trainDir
let trainfiles = filter (isSuffixOf ".train") files
trainpaths = map (\x -> trainDir ++ "/" ++ x) trainfiles
d <- mapM readFile trainpaths
putStrLn $ show (length d)
Update: The snippet below shows how to associate the content of the file with the filename
files <- getDirectoryContents trainDir
let trainfiles = filter (isSuffixOf ".train") files
trainpaths = map (\x -> trainDir ++ "/" ++ x) trainfiles
d <- liftM (zip trainpaths) $ mapM readFile trainpaths
Friday, January 25, 2008
2007 Review of Books
"This year I only read 70 books, down from over 120 last year. I guess that's not too bad considering this was the Year of Other People -- I still beat my general goal of 52. Once again, here are the books that struck me as completely worth reading. This year, though, I've intermixed them with the other books I read:"
http://www.aaronsw.com/weblog/books2007
Aaron Swartz is well read; he said he read 70 books and 120 the previous year. I can't top that. But I do read a lot of technical books and like him only interested in government in politics. So there are only two categories. And normally technical is typically software/programming related.
Here is my list; I didn't completely read all of them, "DNA String algorithms" doesn't exactly make for sit down type of reading material.
Technical
Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology
Programming Erlang: Software for a Concurrent World by Joe Armstrong
The Haskell School of Expression: Learning Functional Programming through Multimedia by Paul Hudak
Haskell: The Craft of Functional Programming (2nd Edition) by Simon Thompson
Amazon Hacks: 100 Industrial-Strength Tips & Tools (Hacks) by Paul Bausch
Programming Perl (3rd Edition) by Larry Wall, Tom Christiansen, and Jon Orwant
Programming Ruby: The Pragmatic Programmers' Guide, Second Edition by Dave Thomas, Chad Fowler, and Andy Hunt
Text Mining Application Programming
Collective Intelligence
Structure and Interpretation of Computer Programs - 2nd Edition
Core J2EE Patterns: Best Practices and Design Strategies (2nd Edition) (Core Series) by Deepak Alur, Dan Malks, and John Crupi
Professional Java Development with the Spring Framework by Rod Johnson, Juergen Hoeller, Alef Arendsen, and Thomas Risberg (Paperback - Jul 8, 2005)
The Elements of Technical Writing (Elements of Series) by Gary Blake and Robert W. Bly (Paperback - Dec 19, 2000)
Scala Programming (released on PDF)
Freakonomics [Revised and Expanded]: A Rogue Economist Explores the Hidden Side of Everything by Steven D. Levitt and Stephen J. Dubner (Hardcover - Oct 17, 2006)
Non Fiction
The Constitution in Exile - Napolitano
Legacy of Ashes: The History of the CIA by Tim Weiner
Dead Certain: The Presidency of George W. Bush by Robert Draper
The World Is Flat: A Brief History of the Twenty-first Century by Thomas L. Friedman
Against All Enemies: Inside America's War on Terror--What Really Happened
Machine Learning in future releases of the Lucene project
http://www.infoq.com/news/2008/01/lucene-23-mahout
"That will be a separate project, but may be beneficial to Lucene users. There are currently some patches in JIRA for Lucene that implement ML algorithms. The goal of this project is to provide commercial quality, large scale machine learning (ML) algorithms built on Hadoop under an Apache license. I have seen a fair amount of interest already, and hope to have this project underway in the coming month."
http://ml-site.grantingersoll.com/index.php?title=Incubator_proposal
Would be interesting if the team could introduce the following algorithms.
* 1.3.1 Naive Bayes
* 1.3.2 Neural Networks
* 1.3.3 Support Vector Machines
* 1.3.4 Logistic Regression
* 1.3.5 Locally Weighted Linear Regression
* 1.3.6 k-Means
* 1.3.7 Principal Components Analysis
* 1.3.8 Independent Component Analysis
* 1.3.9 Expectation Maximization
* 1.3.10 Gaussian Discriminative Analysis
Wednesday, January 23, 2008
Haskell is a very expressive language
Let me reword that so the previous sentence makes sense:
I use Haskell to accomplish complicated programming tasks; the expressive nature of the language provides an effective conduit to make that happen.
One more time:
Haskell is readable, fast, and expressive. I like it.
I like pizza
Tacos are also very appetizing.
Tuesday, January 22, 2008
New real world haskell book is coming out
I am a rare breed and love picking up technical books. Some say they are too behind the technology trends. That is true, but if you need a book on text parsing or artificial intelligence, then a 30 year old book is as good as a 2 year old one. The real world haskell book is one that I am looking forward to; I have already picked up two haskell books and this will only enhance the collection.
Simple comparison of iteration with Python and use of Map in Haskell
Perform the invchi2 calculation in python
example_data = [
(4.3, 6),
(2.2, 60),
(60, 2.2),
(0.3, 4),
(32.123, 20),
(12.4, 5),
(0.04, 3),
(0, 0),
(1, 1)
]
def invchi(chi, df):
m = chi / 2.0
sum = term = math.exp(-m)
for i in range(1, df//2):
term *= m / i
sum += term
return min(sum, 1.0)
if __name__ == '__main__':
print "Running InvChi Test"
for chi, df in example_data:
print "[ chi=%s df=%s ] res=%s" % (chi, df, invchi(chi, df))
print "Done"
Similar code in Haskell
exampleData :: [(Double, Double)]
exampleData = [
(4.3, 6),
(2.2, 60),
(60, 2.2),
(0.3, 4),
(32.123, 20),
(12.4, 5),
(0.04, 3),
(0, 0),
(1, 1)]
-- Inverted Chi2 formula
invChi :: Double -> Double -> Double
invChi chi df = minimum([snd newsum, 1.0])
where m = chi / 2.0
initsum = exp (-m)
trm = exp (-m)
maxrg = fromIntegral (floor (df / 2.0)) :: Double
-- Return a tuple with current sum and term, given these inputs
newsum = foldl (\(trm,sm) elm -> ((trm*(m/elm)), sm+trm))
(trm,initsum) [1..maxrg]
runBayesTest = do
putStrLn "Bayes Test: Inv Chi"
mapM_ (\x -> putStrLn $ (show x) ++ " res=" ++ (show (invChi (fst x) (snd x)))) exampleData
putStrLn "Done Bayes"
Monday, January 21, 2008
Haskell Snippet, start of bayes probability library
Libraries used:
import System.Environment
import qualified Data.Map as Map
import qualified Data.Set as Set
import Data.List
import Text.Regex (splitRegex, mkRegex)
Type definitions:
type WordCat = (String, String)
type WordCatInfo = (WordCat, Int)
type WordInfo = (String, Int)
Utilities for finding the word frequency in a document:
--
-- | Find word frequency given an input list using "Data.Map" utilities.
-- With (Map.empty :: Map.Map String Int), set k = String and a = Int
-- Map.empty :: Map k a
-- foldl' is a strict version of foldl = foldl': (a -> b -> a) -> a -> [b] -> a
-- Also see: updmap nm key = Map.insertWith (+) key 1 nm
-- (Original code from John Goerzen's wordFreq)
wordFreq :: [String] -> [WordInfo]
wordFreq inlst = Map.toList $ foldl' updateMap (Map.empty :: Map.Map String Int) inlst
where updateMap freqmap word = case (Map.lookup word freqmap) of
Nothing -> (Map.insert word 1 freqmap)
Just x -> (Map.insert word $! x + 1) freqmap
--
-- | Word Category Frequency, modified version of wordFreq to
-- handle Word Category type.
wordCatFreq :: [WordCat] -> [WordCatInfo]
wordCatFreq inlst = Map.toList $ foldl'
updateMap (Map.empty :: Map.Map WordCat Int) inlst
where updateMap freqmap wordcat = case (Map.lookup wordcat freqmap) of
Nothing -> (Map.insert wordcat 1 freqmap)
Just x -> (Map.insert wordcat $! x + 1) freqmap
-- | Pretty print the word/count tuple and output a string.
formatWordFreq :: WordInfo -> String
formatWordFreq tupl = fst tupl ++ " " ++ (show $ snd tupl)
formatWordCat :: WordCatInfo -> String
formatWordCat tupl = frmtcat (fst tupl) ++ " " ++ (show $ snd tupl)
where frmtcat infotupl = (fst infotupl) ++ ", " ++ (snd infotupl)
Utilities for calculating the fisher probability:
wordFreqSort :: [String] -> [(String, Int)]
wordFreqSort inlst = sortBy freqSort . wordFreq $ inlst
--
-- | bayes classification train
trainClassify :: String -> String -> [WordCatInfo]
trainClassify content cat = let tokens = splitRegex (mkRegex "\\s*[ \t\n]+\\s*") content
wordcats = [(tok, cat) | tok <- tokens]
in wordCatFreq wordcats
--
-- | Return only the tokens in a category.
tokensCat :: [WordCatInfo] -> String -> [WordCatInfo]
tokensCat tokens cat = let getTokCat row = snd (fst row)
tokbycat = filter (\x -> ((getTokCat x) == cat)) tokens
in tokbycat
tokensByFeature :: [WordCatInfo] -> String -> String -> [WordCatInfo]
tokensByFeature tokens tok cat = filter (\x -> ((fst x) == (tok, cat))) tokens
--
-- | Count of number of features in a particular category
-- Extract the first tuple to get the WordCat type and then the
-- second tuple to get the category.
catCount :: [WordCatInfo] -> String -> Integer
catCount tokens cat = genericLength $ tokensCat tokens cat
-- Find the distinct categories
categories :: [WordCatInfo] -> [String]
categories tokens = let getTokCat row = snd (fst row)
allcats = Set.toList . Set.fromList $ [ getTokCat x | x <- tokens ]
in allcats
featureCount :: [WordCatInfo] -> String -> String -> Integer
featureCount tokens tok cat = genericLength $ tokensByFeature tokens tok cat
--
-- | Feature probality, count in this category over total in category
featureProb :: [WordCatInfo] -> String -> String -> Double
featureProb features tok cat = let fct = featureCount features tok cat
catct = catCount features cat
in (fromIntegral fct) / (fromIntegral catct)
--
-- | Calcuate the category probability
categoryProb :: [WordCatInfo] -> String -> String -> Double
categoryProb features tok cat = initfprob / freqsum
where initfprob = featureProb features tok cat
freqsum = sum [ (featureProb features tok x) | x <- categories features ]
weightedProb :: [WordCatInfo] -> String -> String -> Double -> Double
weightedProb features tok cat weight = ((weight*ap)+(totals*initprob))/(weight+totals)
where initprob = categoryProb features tok cat
ap = 0.5
totals = fromIntegral $ sum [ (featureCount features tok x) | x <- categories features ]
-- Inverted Chi2 formula
invChi2 :: Double -> Double -> Double
invChi2 chi df = minimum([snd newsum, 1.0])
where m = chi / 2.0
initsum = exp (-m)
trm = exp (-m)
maxrg = fromIntegral (floor (df / 2.0)) :: Double
-- Return a tuple with current sum and term, given these inputs
newsum = foldl (\(trm,sm) elm -> ((trm*(m/elm)), sm+trm))
(trm,initsum) [1..maxrg]
fisherProb :: [WordCatInfo] -> [String] -> String -> Double
fisherProb features tokens cat = invchi
where initw = 1.0
p = foldl (\prb f -> (prb * (weightedProb features f cat initw))) 1.0 tokens
fscore = (-2) * (log p)
invchi = invChi2 fscore ((genericLength features) * 2)
Some example test cases:
simpleTest1 :: IO ()
simpleTest1 = do
content <- readFile badfile
let tokens = splitRegex (mkRegex "\\s*[ \t\n]+\\s*") content
wordfreq = wordFreqSort tokens
mapM_ (\x -> (putStrLn $ formatWordFreq x)) wordfreq
putStrLn $ "Number of tokens found: " ++ (show . length $ wordfreq)
simpleTest2 :: IO ()
simpleTest2 = do
let badfreq = trainClassify "viagra is bad cialis is good" "bad"
goodfreq = trainClassify "I like to run with foxes they cool" "good"
allfreq = badfreq ++ goodfreq
mapM_ (\x -> (putStrLn $ formatWordCat x)) allfreq
simpleTest3 :: IO ()
simpleTest3 = do
let aa = [(("1", "aa") :: (String, String), -1), (("2", "aa"), -1), (("3", "bb"), -1)]
tokensAA = tokensCat aa "aa"
countAA = catCount aa "aa"
c = featureProb aa "1" "aa"
putStrLn $ "-->" ++ (show countAA) ++ " // " ++ (show tokensAA) ++ " // " ++ (show c)
simpleTest4 :: IO ()
simpleTest4 = do
let aa = [(("dogs dogs", "good") :: (String, String), 3),
(("viagra", "bad") :: (String, String), 5),
(("fox", "good") :: (String, String), 2),
(("dogs", "good"), 4),
(("3", "bad"), 5)]
bb = categories aa
tokensAA = tokensByFeature aa "dogs" "good"
c = featureProb aa "dogs" "good"
d = catCount aa "good"
x = categoryProb aa "xdogs" "good"
z = weightedProb aa "dogs" "good" 1.0
putStrLn $ "-->" ++ (show d) ++ "//" ++ (show bb) ++ "//" ++ (show z)
simpleTest5 :: IO ()
simpleTest5 = do
let aa = [(("dogs dogs", "good") :: (String, String), 3),
(("viagra", "bad") :: (String, String), 5),
(("fox", "good") :: (String, String), 2),
(("dogs", "good"), 4),
(("3", "bad"), 5)]
testdata = [ "xdog" ]
bb = fisherProb aa testdata "bad"
putStrLn $ "-->" ++ show bb
Why Rails Sucks, 100% magic and 0% design
A quote from MacieJ.
"First, I completely agree with what Slava said. Now to expand that a bit
with my own thoughts:
Rails is 100% magic with 0% design. It sports all the great quality and
consistency you've come to expect from PHP, except with loads more magic.
There's no overarching design or scheme of things, it's just a bucket of
tools with some glue poured in. This has a couple of important
consequences:
- There's no reasoning about Rails -- more familiarity won't give you
better chances of figuring out something new because that's what follows
from the design, only because that's how it usually ends up being
implemented and because you have memorised more things, so your guesses
are better. In essence, being better in Rails means being better at
grepping the internet.
- There's no thought given to the general problem to solve, it's just
improved PHP + "ActiveRecord, lol". This means Rails doesn't have
solutions that are particularly good or scalable or make sense, only
hacks that happened to solve someone's specific problem at a time and
were implementable in 5 minutes or less. Rails is heaps better than PHP,
but it's still only good for what PHP is good, and that's not writing
webapps. This permeates Rails all the way down: it's got hacky modules
that only solve particular problems, those modules have hacky functions
that only solve particular problems and those functions only do hacky
things that solved someone's particular problem.
Some examples:
* AR's find family of functions. It's a horrible hack, for instance,
they support the :group clause, which has semantics ("return a collection
of groups of things") incompatible with find's base semantics ("return a
collection of things"). Rails answer? It implicitly relies on MySQL's
retarded interpretation of SQL and the fact that given a table with two
columns, id and colour, it will silently interpret "SELECT * FROM table
GROUP BY colour" as "SELECT FIRST(id), colour FROM table GROUP BY
colour". End result? A valid combination of clauses in AR will generate
incorrect SQL Postgres will (correctly) choke on.
* AR's find again, it supports :join (which documentation hilariously
describes as "rarely needed"), except that it doesn't return real objects
then, but make-believe fake readonly objects that will only have some
attributes filled in (because they have no backing with physical rows),
but will *still* have the base type and all the methods of the class you
queried on! So if you go with that and try to call one of the methods
that depend on unfilled attributes, you die horribly.
- Reading and, in general, understanding Rails is horribly difficult,
since it's no design and layers upon layers of magic. Very often you will
find 5-7 layers of functions delegating the work deeper and deeper in,
until you arrive to a completely undocumented internal function that in
turn splits the work to three other, unrelated internal functions. Given
that each of those 10 functions takes a hash called "options", each layer
altering it subtly, but none having the complete picture, and that 9
times out of 10 that hash is not documented on any level, figuring out
what your choices are is pure hell. It's made even more fun by the fact
that different parts of Rails are accessible at various points of
handling the request, and you can't just jump in and poke things from the
console, since it won't have 99% of the things that only magically spring
to life once a request is live.
- As a rule, there's no quality assurance, and the average quality is
very low. Because it's such a patchwork, it will only have parts of
things implemented, some other (not necessarily overlapping) parts
documented, and a whole load of missing things just dangling and waiting
for you to run into them. For example, the docs for
text_field_with_auto_complete discuss how you can use options to make it
complete only parts of entries (which was exactly what I needed, to show
"foo (123)" in the popup, but only insert "foo" in the text field). What
it doesn't tell you, however, is that none of the stock completion-
generating methods will prepare such splittable data for you, and you're
supposed to copy their code and hack it on your own instead. It took me
half a day to finally understand that it wasn't me being stupid and
unable to find it, but it was actually Rails documenting interfaces that
_weren't there_.
- And to stress the first point again, Rails never concerns itself with
the big-picture problem of "writing webapps". It only thinks as big as
"outputting HTML strings" and "querying the DB for a list of things".
This means the important, actually hard stuff like handling the stateless
nature of HTTP, or sanitising and escaping the user input is just not
adressed at all, and you only learn about them when one day you discover
84 possible XSS injection points (actual number from a Rails app I'm
somewhat famililar with).
I'm a huge fan of innovative frameworks like Weblocks, because they
actually stopped and thought about the whole thing for a moment, and try
things that will address the problem as a whole. And even if some
specific things will turn out to be harder to do that by just churning
out raw HTML, and there will be abstraction leaks, it's still better
because they try out new things to find a solution that has a chance
work. Rails doesn't, because its entire culture seems to be fundamentally
incapable of thinking more than 5 minutes into the future.
Cheers,
Maciej"
Sunday, January 20, 2008
One word programming language summary revisted.
1. Haskell: General Purpose Language
2. Python: Twisted network framework, scripting, easy to read, easy to use
3. Ruby: Ruby on Rails, dynamic, easy to use
4. Scala: Improvement over the java language, work with existing java libraries
5. Java: Lots of libraries, lots of community, commercial involvement
5. Factor: New and emerging, fast, smart people working on the project.
6. C: Desktop development, fast, low-level development.
7. Erlang: Parallel development, server development
Saturday, January 19, 2008
Simple Word Frequency Functions in Haskell
import System.Environment
import qualified Data.Map as Map
import Data.List
import Text.Regex (splitRegex, mkRegex)
--
-- | Find word frequency given an input list using "Data.Map" utilities.
-- With (Map.empty :: Map.Map String Int), set k = String and a = Int
-- Map.empty :: Map k a
-- foldl' is a strict version of foldl = foldl': (a -> b -> a) -> a -> [b] -> a
-- (Original code from John Goerzen's wordFreq)
wordFreq :: [String] -> [(String, Int)]
wordFreq inlst = Map.toList $ foldl' updateMap (Map.empty :: Map.Map String Int) inlst
where updateMap freqmap word = case (Map.lookup word freqmap) of
Nothing -> (Map.insert word 1 freqmap)
Just x -> (Map.insert word $! x + 1) freqmap
-- | Pretty print the word/count tuple and output a string.
formatWordFreq :: (String, Int) -> String
formatWordFreq tupl = fst tupl ++ " " ++ (show $ snd tupl)
-- Given an input list of word tokens, find the word frequency and sort the values.
-- sortBy :: (a -> a -> Ordering) -> [a] -> [a]
wordFreqSort :: [String] -> [(String, Int)]
wordFreqSort inlst = sortBy freqSort . wordFreq $ inlst
Usage:
let tokens = splitRegex (mkRegex "\\s*[ \t\n]+\\s*") content
wordfreq = wordFreqSort tokens
mapM_ (\x -> (putStrLn $ formatWordFreq x)) wordfreq
The utility code was used against a spam document and output the following:
*** Content Analysis
Viagra 28
Cialis 11
to 10
with 7
Here 6
Order 5
the 5
Articles 4
Click 4
Viagra. 4
Friday, January 18, 2008
Haskell Function calls and Parentheses
As you may have noticed, passing input to a function in haskell is simple because of the statically type nature of the language. A function can have a complicated input and output return, but at least you are given the parameters of that particular function. In python you can pass anything to a function; you can pass a function, a integer constant, an instance of a class and yet the function really only expects an boolean.
[2.] Examples
putStrLn :: String -> IO ()
For example, putStrLn expects a string as an input and will return unit in the IO Monad.
>putStrLn "abc"
Will print abc to the console.
Prelude> putStrLn "abc" ++ "123"
:1:0:
Couldn't match expected type `[a]' against inferred type `IO ()'
In the first argument of `(++)', namely `putStrLn "abc"'
In the expression: putStrLn "abc" ++ "123"
In the definition of `it': it = putStrLn "abc" ++ "123"
Prelude>
The following example gave an error. Why? We gave putStrLn three arguments as opposed to the one. Here are the extra arguments, "++" and "123".
To fix the issue:
putStrLn ("abc" ++ "23")
Or even:
(putStrLn ("abc" ++ "23"))
[3.] Dollar Sign, syntatic sugar
We can also use the syntatic sugar, the dollar sign function to alleviate us from having to wrap the arguments in parentheses. Here is the type:
Prelude] :type ($)
($) :: (a -> b) -> a -> b
Pass ($) The infix function has a first input parameter (a -> b), a function and then the 'a' argument which will output b. The following code demonstrates three different calls to putStrLn, but the final output is the same.
Prelude] putStrLn "abc"
abc
Prelude] putStrLn $ "abc"
abc
Prelude] (putStrLn "abc")
abc
Prelude]
Here is another working example:
putStrLn $ "abc: " ++ show(3) ++ show(4)
If we return to the ($) definition and putStrLn.
($) :: (a -> b) -> a -> b
putStrLn represents this part of the input (a -> b)
'a' represents the String input and b represents the output of putStrLn String which is the IO () monad.
For this aspect of the call, "abc: " ++ show(3) ++ show(4);
The (++) operator has the following type definition:
Prelude> :t (++)
(++) :: [a] -> [a] -> [a]
Where [a] can be represented by [Char].
Here are another set of working examples.
Prelude> putStrLn $ "abc: " ++ (show(3) ++ show(4))
abc: 34
Prelude> putStrLn $ ("abc: " ++ (show(3) ++ show(4)))
abc: 34
Prelude> (putStrLn $ ("abc: " ++ (show(3) ++ show(4))))
abc: 34
Prelude> (putStrLn $ ("abc: " ++ (show(3) ++ show(4))))
abc: 34
Prelude> (putStrLn ("abc: " ++ (show(3) ++ show(4))))
abc: 34
Those examples are a little confusing because the (++) operator calls are resulting to one "string" input. For example:
let z = "abc: " ++ show(3) ++ show(4)
Is of type String.
[4.] Function composition
The infix dot operator shown below takes one function, another function and an input and returns the output c.
Prelude] :t (.)
(.) :: (b -> c) -> (a -> b) -> a -> c
If we used the putStrLn function, and show function; what combinations could we come up with?
Note, show is defined by the following signature.
Prelude] :t show
show :: (Show a) => a -> String
Prelude] putStrLn . show $ 5
5
Prelude]
I will use the following notation to denote the operation invoked above:
### Function: **putStrLn** ||| Type: ( String -> IO () ) ###
infix dot-operator and then:
### Function: **show** ||| Type: ( a -> String ) ###
infix dollar sign operator and then the value of 5.
Prelude] putStrLn . show $ 5
5
Prelude]
The following calls also worked
Prelude> putStrLn $ show 5
5
Prelude> putStrLn (show $ 5)
5
Prelude> putStrLn $ (show $ 5)
5
Prelude> (putStrLn $ (show $ 5))
5
If you are given a list and need to print each element on its own line, you can use the following snippet.
mapM_ (\x -> (putStrLn ("Token: " ++ x ++ " Len: " ++ (show (length x))))) ["ab", "1231", "ddd"]
Do two things to make function composition look easier, think backwards and consider that putStrLn, show and ($) are operations with only one input and one output. As you can see, with haskell are many different ways to do something as simple as invoke a function.
edited: Apparently, I got so carried away, I didn't acknowledge that you can just call the function without any operators. E.g; [ putStrLn $ show 5 ] Sorry if that simple fact wasn't made clear.
Thursday, January 17, 2008
One or so word summary of the programming languages I work with
1. Haskell: Functional, cool [most projects]
2. Python: Dynamic, semi-fast, indented blocks [most projects]
3. Ruby: Dynamic, semi-slow [some projects]
4. Scala: Functional, better java, actors [some projects]
5. Java: Static, lots of libraries, ugly, verbose [most projects]
5. Factor: Cool, new, fast [a couple of projects]
6. C: Cool, sloppy [a couple of projects]
7. Erlang: Parallel [very few projects]
(Less used):
7. C++: scary (I don't use C++)
8. Common Lisp (sbcl): fast, simple, dynamic
9. Scheme: simple, dynamic
(Useful):
10. Bash: ok I guess
(Want to learn):
11. Perl6
12. Relearn Lisp
First look at web/page content analysis
The link below contains the data that generated the chart above.
http://openbotlist.googlecode.com/svn/trunk/openbotlist/docs/media/content/chart_content.txt
Wednesday, January 16, 2008
Python Idiom; remove stop words
BOTLIST_STOP_WORDS = [ "abc", "123" ]
keywords = "kjskldjfkljskdfksdkl abc lkjsdfksdklfd 123123"
def filterStopwords(self, keywords):
""" Find all of the stopwords and remove them"""
res = []
keywords_list = keywords.lower().split()
res = list(set(keywords_list).difference(set(BOTLIST_STOP_WORDS)))
# Return the new keyword string
return " ".join(res)
In that example, the stop words will be removed from the string.
Tuesday, January 15, 2008
What am I listening to? Bach, Brandenburg concertos; 4, 5, 6
This particular rendition features a pipe, string ensemble that is a mix of slow and spirited; but never dry or boring. This particular set seems to be more oriented towards strings than 1, 2, 3. I could be wrong about that though.
So far, Bach has impressed me the most.
http://en.wikipedia.org/wiki/Brandenburg_concertos
Joining theinfo.org; large datasets
http://theinfo.org/
Sunday, January 13, 2008
Interesting Person of the day: Hunter Thompson
http://en.wikipedia.org/wiki/Hunter_S._Thompson
Hunter Stockton Thompson (July 18, 1937 – February 20, 2005) was an American journalist and author, famous for his novel Fear and Loathing in Las Vegas. He is credited as the creator of Gonzo journalism, a style of reporting where reporters involve themselves in the action to such a degree that they become the central figures of their stories.
[ANN] Initial release of Spider Queue Database
instance Binary SpiderQueue where
put dbq = do
BinaryPut.putWord16be (magicNumberA dbq)
BinaryPut.putWord16be (magicNumberB dbq)
BinaryPut.putWord16be (majorVers dbq)
BinaryPut.putWord16be (minorVers dbq)
BinaryPut.putWord32be (queueSize dbq)
-- @see mapM: Monad m => (a -> m b) -> [a] -> m [b]
(mapM_ put (queue dbq))
get = do
magicnumbera <- BinaryGet.getWord16be
magicnumberb <- BinaryGet.getWord16be
major <- BinaryGet.getWord16be
minor <- BinaryGet.getWord16be
len <- BinaryGet.getWord32be
-- *******************************
-- Get the remaining byte string data,
-- So that we can use lazy bytestring to load to load the
-- the data types.
-- Also: queueData <- forM [1..len] (const (get :: Get QueueObject))
-- *******************************
queueData <- replicateM (fromIntegral len) (get :: Get QueueObject)
return (SpiderQueue {magicNumberA=magicnumbera,
magicNumberB=magicnumberb,
majorVers=major,
minorVers=minor,
queueSize=len,
queue=queueData
})
References
http://openbotlist.googlecode.com/svn/trunk/botlistprojects/botspider/spider/lib/haskell/src/Data/SpiderQueue/Queue.hs
From Tim Sweeney - Re: Hundred Year Language
http://lambda-the-ultimate.org/classic/message6475.html#6501
All the content in this post is a quote about various languages out there.
"C# is a great illustration of the superficiality of popular language progress. It is the most polished and fine-tuned mainstream language ever created, yet fails to come to grips with basic mathematics type and theory. You still have absurdities like 2^64=0, 7/2=3, simple expressions like x/y and a[x] which the compiler accepts yet aren't actually typesafe in any reasonable definition of the term, arrays have serious typing flaws, and the basic confusion between things and references to things remains.
Right now, the open language lineages are:
- LISP: This is the apex of untyped programming with nested scoping and metadata manipulation. It's at the end of the line, not because of lack of potential, but it's so simple and powerful than any improvements to it can be implemented in LISP itself without affecting the core language. In 100 years, after the last C and C++ programmers are long gone, there will still be LISP enthusiasts. But don't expect large-scale software development to happen this way.
- C, C++, Java, C#: The problem nowadays isn't with implementation but with the underlying flaws in the paradigm. We can continue to go the route of adding tweaks as with C#, but they're asymptotically approaching a limit that's not far from the current state of affairs.
- Python, Perl, etc: Languages with significant improvements in usability for certain audiences, and are generally a mix of untyped LISP features with typed C/Java-style features. The primary innovations I see here are in the syntax and libraries, for example Python's pioneering of "pseudocode that runs" and Perl's first-class regular expressions and text parsing capabilities. Future languages will certainly inheret these features, even as the underlying paradigm is changed.
- ML, Haskell, simply-typed functional languages with type deduction and, in general, programs consisting of a whole bunch of top-level declarations. The elegance of these languages is based on Hindley-Milner and with most of the neded extensions of the language (subtyping, dependent types), that breaks, and the resulting declarations and error messages become quite a mess. These languages will certainly remain useful to their current fans, but I see no hope for this style of programming to scale up to mainstream development.
So which lineage will be the basis of languages 100 years from now? I think the syntax has long since been determined, and will fall very much in the middle ground between Haskell, Pascal, and C++ -- and not very much outside that triangle.
Regarding the type system and semantics, I think the basic principles in such mathematical-style code as "f(x:int):int -> if x<=0 then 1 else x*f(x-1)" will not change one iota in 100 years, but that the current notions of objects, references, member functions, exceptions, however-many-bit integers, integer being "the" type of 7, and so on will seem terribly archaic. In this area, I don't think any current lineages will stand.
However, I do think we'll converge on a type system that is not far outside of the past 20 years' research into dependent types, intersection and union types, unification, constraint logic, and proofs-as-programs. These type systems essentially have the expressive power of mathematical logic itself, which is fundamental and not a product of invention or whim.
Alex Peake - Re: Hundred Year Language blueArrow
4/11/2003; 6:44:05 PM (reads: 2820, responses: 0)
IMHO (what counts is Productivity)...
"...most important change...around well-defined type systems..."
Type systems are about bug prevention and early detection, and bugs of the "type mismatch" variety only. Has anyone really studied the cost of early detection vs. productivity loss from not using dynamically typed languages?
"LISP...don't expect large-scale software development to happen this way..."
Why? Why are many fairly inadequate languages more successful than good, powerful languages? It's about 1) Money -- money begets money and Microsoft has a vested interest in maintaining their status-quo of followers, as have the Sun/IBM/Oracle gang as a way of competing with Microsoft. They are the ones with marketing might. They are the ones who communicate the loudest. They are the ones that most "choosers of programming languages" hear. And they are "safe", and
2) the great majority of "choosers of programming languages" are either incapable or uninterested in actually evaluating programming languages.
"...ML, Haskell, (Erlang,) simply-typed functional languages..."
I think the real elegance, and productivity, comes from the declarative style and the pattern matching.
BTW, Smalltalk was conspicuously absent - a wonderfully productive dynamically typed, reflective,... development environment!
Nay, I say, the future is LISP (well Scheme maybe) and Smalltalk (if only someone could afford to Market either). "
Friday, January 11, 2008
Redefining functional composition in haskell
In math 101, we learned the following:
given the value of 'x', pass 'x' to the function g and then the result of g of x to the function f to get a result.
f(g(x))
Let's say we are doing the following calculation:
y = g(x) = 2 * x
z = f(y) = 3 * y
z is our result.
If we provide an input of 4, we have the following.
y = 2 * 4 = 8
z = 3 * 2 * 4 = 24
z = 3 * g(x where x = 4) = 24
In Haskell
How would you do this in haskell (in prelude):
Open ghci:
First, here is the function definition for function composition:
(.) :: (b -> c) -> (a -> b) -> a -> c
Prelude> let ggg x = 2 * x
Prelude> let fff y = 3 * y
Prelude> (fff . ggg) 4
24 (our result)
Or ((fff . ggg) 4)
Now, let's define this this in a new haskell source file:
module Main where
-- x -> y
ggg :: Integer -> Integer
ggg x = 2 * x
-- y -> z
fff :: Integer -> Integer
fff y = 3 * y
-- x -> z
compose :: Integer -> Integer
compose x = (fff . ggg) x
main = do
putStrLn $ "Z Result should be 24=" ++ show (compose 4)
Thursday, January 10, 2008
More haskell simple networking client code and connecting to RabbitMQ
Key code snippet:
The following uses the Data.Binary.Get monad to receive frame data back from the rabbit mq server. Get a frame type, channel, size of the payload and then the payload byte data.
instance Binary AMQPFrame where
get = do
frameType <- getWord8
chan <- getWord16be
sz <- getWord32be
bytes <- BinaryGet.getLazyByteString (fromIntegral sz)
chw <- getWord8
return (AMQPFrame { frameType=frameType,
channel=chan,
size=sz,
payload=bytes,
ch=chw
})
Download
http://haskellnotebook.googlecode.com/svn/trunk/haskell/simplenetwork
You can see the python example, just imagine porting the python library to haskell. Syntactically, with haskell we can use some of the code verbatim. For example, look at the METHOD_MAP_NAME in the python code and compare that with the pattern matching in the haskell source.
http://barryp.org/software/py-amqplib/
Wednesday, January 9, 2008
More on the AMQP (RabbitMQ) haskell client (and an example)
The "Data.Binary" module is used here to have fully control over the size and endianess of all data sent across the network. There are several utilities for writing shorts, byte words, longs, 64 bit longs, etc.
Here is a data structure defined with the Binary.Put monad. Our task is write this data to the socket connection.
data AMQPData = AMQPData {
amqpHeaderA :: [Word8],
amqpHeaderB :: Word32
}
instance Binary AMQPData where
put amq = do
BinaryPut.putByteString (pack (amqpHeaderA amq))
BinaryPut.putWord32be (amqpHeaderB amq)
amqInstance :: IO AMQPData
amqInstance = return (AMQPData { amqpHeaderA = (unpack (C.pack amqpHeadA)),
amqpHeaderB = amqpHeadB
})
amq <- amqInstance
let bs_amq = encode amq
-- Through the use of lazy hPut, write out the data to the socket handle
LazyC.hPut h bs_amq
Download
You can download the haskell source for this entry at the following URL including a simple java server to mimic a AMQP node.
http://haskellnotebook.googlecode.com/svn/trunk/haskell/simplenetwork
Resources
(1) http://www.haskell.org/ghc/docs/6.8.2/html/libraries/network/Network.html
(2) http://hackage.haskell.org/packages/archive/binary/0.4.1/doc/html/Data-Binary.html
(3) http://hackage.haskell.org/packages/archive/bytestring/0.9.0.1/doc/html/Data-ByteString.html
Tuesday, January 8, 2008
Understanding monads in haskell from a reddit user (808140)
A quote from reddit, 808140:
"The first step to understanding "monads" is realizing that the name is scary but the concept is simple. See, the term itself comes from math and was so-named because the people that found a use for it in programming were math people. Despite what some people seem to think, though, you don't need to be a math person to understand them.
Sunday, January 6, 2008
Updated botnode.com site
http://www.botnode.com/
Friday, January 4, 2008
Link Blog Specification (use of Emacs/Scala/Lift/Antlr)
Overview:
A link blog contains blog entries which in turn contain a list of links. This tool will aid in the creation of a link blog service. The emacs IDE is a powerful development environment that also includes a lisp based plugin system. We can build a Text over HTTP protocol that interfaces with an emacs plugin. A user can then use the emacs plugin to add entries to the blog.
Good (readable) publication on practical support vector machines.
Saurav Sahay
"Automatic Text categorization using machine learning methods like Support Vector Machines (SVM) have tremendous potential for effectively organizing electronic resources. Human categorization is very costly and time-consuming, thus limiting its application for large or rapidly changing collections. SVM is a comparatively new technique with a very solid mathematical foundation for solving a variety of ‘learning from examples’ problem and gives high performance in practical applications."
Resources
http://www-static.cc.gatech.edu/~ssahay/sauravsahay7001-2.pdf
Keywords
Reuters 21578, SVM
Thursday, January 3, 2008
Lisp like s-expressions in haskell
Prelude> ((+) ((*) 1 2) 10)
12
Prelude> (1 * 2) + 10
12
Prelude Text.Printf> (printf "%d\n" 5)
5
Prelude Text.Printf> printf "%d %d\n" 1 2
1 2
Prelude Text.Printf> ((printf "%d %d\n" 1) 2)
1 2
Of course, you have to match up the arguments in haskell accordingly; not all functions take a just a list like you might find in common lisp.
Wednesday, January 2, 2008
Haskell Notes: Basic List operation
http://www.haskell.org/ghc/docs/latest/html/libraries/base/Data-List.html
Haskell: Difference between "<-" left arrow and 'let' expressions
Saizan on freenode does a better job of explaining it...
When asked; What are the main differences between left-arrow and binding a variable with let.
"the <- arrow "extracts" the value from an IO computation, while using let you'd just define another name for the computation as a whole"
"so in let x = getChar, x :: IO Char, but in do x <- getChar; ... x :: Char"
Sick of Internet Journalism and response to the Ron Paul Evolution Video
I am still waiting on where he said he (Ron Paul) doesn't believe in "EVOLUTION". Big difference from not siding with EVOLUTIONARY "THEORY" in which you aren't 100% on the entire details on how human beings arrived here today.
I am a life long person of science. Lets understand evolution as fact and push everything else away. But I can understand how a Christian can take issue with an some established origins of life; how some out of the 6 billion people can be confused about the origins of life.
If you can give me the video where Ron Paul says the following "I DO NOT BELIEVE IN EVOLUTION AT ALLL, WILL NEVER BELIEVE EVOLUTION" or "I AM A CREATIONIST FROM THE DAY I WAS BORN TILL NOW"
Yes, I saw the youtube video mentioned earlier, but you can play all kinds of "Bill Clinton wordisms" with that particular Ron Paul blurb.
Also, here is the difference between Evolution and Evolutionary theory.
http://en.wikipedia.org/wiki/Evolution_as_theory_and_fact
F**king "Redditards"; We want freedom for all but we can't give this man his personal beliefs. Seems like hypocrisy to me. Beliefs that don't have anything to do with the office he is running for.
"BUT HE DOESN'T accept a Darwin's or Gould theory of evolution, he can't reason on complex issues."
Cut that bull-shit. If a former Air force surgeon with a medical degree from Duke Medical school can't reason complex situations over a fucking-tard living off dad's Oil money (Bush) than god help us. Call your representatives, asking to introduce a bill that all persons running for the presidency must have a PHD in Astrophysics.
In the mean-time, lets go with these requirements:
U.S. citizen, 35 years of age, 14 years resident of U.S. Must have an absolute majority in the ELECTORAL COLLEGE – 270/538.
Personally, I am not really THAT big a Ron Paul supporter. He has interesting ideas that might possibly work. So do the other candidates. I am just tired of this internet journalism where we take 30 second sound bites and ignore a person's 70 years of life experience.
That is fine, be an media idiot, personally, I would like to hear criticism or favoritism of Ron Paul's plans for the country. Will his plan to eliminate the IRS get passed? Will it work? Is it a long term strategy? Will it rejuvenate the economy?
Early linux source code 0.1, working on modern compiler.
Linux kernel to compile with GCC-4.x, allowing it to run on emulators such as QEMU and Bochs. After applying his series of small patches, Abdel explains that the 0.01 kernel can be built on a system running the 2.6 Linux kernel. He added that he's successfully ported bach-3.2,
portions of coreutils-6.9, dietlibc-0.31 (instead of glibc), bin86-0.16.17, make-3.81, ncurses-2.0.7, and vim-7.1 all to run on his modified 0.01 kernel." -- Quote from blog entry.
CouchDB and living the dream (2006/6)
I don't know the true origins of CouchDB, but just looking back at his blog, it look likes like it got started in 2006/6.
http://damienkatz.net/2006/06/
Congratulations.
Reading binary files with haskell, simple snippet
http://hackage.haskell.org/packages/archive/binary/0.4.1/doc/html/Data-Binary.htmlDownload and review this haskell source to see what I ended up with. I defined a DatabaseFile type which includes two magic number words (unsigned shorts of 16 bits each) and then populated the fields from the binary file.
http://openbotlist.googlecode.com/svn/trunk/botlistprojects/botspider/spider/tools/misc/dbreader/haskell
instance Binary SpiderDatabase where
put _ = do BinaryPut.putWord16le 0
get = do
magicnumbera <- BinaryGet.getWord16be
magicnumberb <- BinaryGet.getWord16be
return (SpiderDatabase {magicNumberA=magicnumbera,
magicNumberB=magicnumberb
})
Tuesday, January 1, 2008
Best and worst of 2007
Best of the Best:
Reddit, Factor, Haskell
Best Website:
Reddit.com
Best Movie:
3:10 to Yuma
Best Food:
Mexican Food (love the tacos!)
Best new project, you probably haven't heard about:
Factor
Another best language, that is gaining a lot of traction:
Haskell (honorable mention for Scala)
Best Project:
Lucene/Nutch/Hadoop
Best Printer:
Brother HL2040, works with linux (at least so far)
Best Politician:
Ron Paul
Worst Product:
Firefox2-3 on Linux. (Actually, flash plugins have a lot to do with the trouble on linux, but still).
(Honorable mention for HP printers)
Worst developer discussion forum:
Joel on Software, http://discuss.joelonsoftware.com/?joel
when you get a bunch of VB developers together, this is what happens.
Worst websites:
Twitter.com, Facebook.com, MySpace.com
Worst political websites:
Slate.com, Huffingtonpost, DailyKos,
Simple look at the java bytecode through python.
Some useful python libraries:
http://docs.python.org/lib/module-array.html
"struct -- Interpret strings as packed binary data"
http://docs.python.org/lib/module-struct.html
Source:
http://jvmnotebook.googlecode.com/svn/trunk/misc/bytecode/
Another note, the java bytecode class is big-endian format and most of us work on little endian machines these days, just something to keep in mind.
The best way to grok monads is to not try to understand what they are abstractly first. Instead, familiarize yourself with several common monads and their uses. This will allow you to better see what is being abstracted.
A lot of people who are new to monads try to understand the abstraction before understanding the specific cases. Some people (usually people with math backgrounds) can do this, but for most people, the best way to understand something abstract and general is to start with specific examples.
You can learn what a monad is intuitively very quickly this way, because there are a bunch of things that are all monads but that all do very different things.
The most basic monad is the exception monad (in Haskell we call it
Maybe
if there's only one exception, andEither
if there are many). This is not voodoo. It is no different than throwing an exception in a language you're familiar with, there's nothing new or strange about it. Basically, the code:evaluates
f x
wheref
is some function that can throw an exception, checks to see if an exception was thrown, and if it none was, it passes the output off x
into the functiong
, which can also throw an exception. Again, ifg
did not throw an exception, it passes the value intoh
, which itself can throw an exception. Here's an example of what this might be doing. Let's sayf
,g
, andh
are all mathematical functions, all of which at some point divide by their argument. Obviously, if the argument is zero, there will be an error, so each checks first if the argument is zero and if it is, each throws an exception, and if not, does the calculation and returns the value. Essentially, this code is the same as what one might write ash(g(f(x)))
in another language, but the use of the>>=
just makes it clear that we are checking for exceptions.The reason we go to all this hassle, instead of just having exception throwing built into the language the way you might in Java or most any other imperative language, is because in functional languages functions cannot have side effects. That is, they are completely determined by what they return, and cannot do anything else. They cannot throw exceptions, or print to the screen, or set a global variable, or whatever. This allows something we call referential transparency: if you see
f(5)
in purely functional code, you should be able to evaluate it once, and take the value it returns and replace every other instance off(5)
in the whole program and have the program run exactly the same way. This is obviously not possible iff
, say, prints to the screen (it would only print once, instead of 20 times or whatever). It is also not possible iff
throws an exception (what value did it evaluate to? What would you replace other instances off(5)
with?)So how do exceptions work, then? The same way they work in a language like C, which lacks the ability to throw exceptions as a built in part of the language: they encode the exception as part of the return value. So in our example,
f
returns just some number or an error value signaling that an exception was thrown. We call this error valueNothing
.So we create a datatype, which we call
Maybe
, that extends any type (in our example,Integer
) so that it includes the valueNothing
. We say that a variable of typeMaybe Integer
can either beNothing
, signifying that an exception was thrown or an error occurred, or it can be just a number, which we indicate by writingJust 5
orJust 10
(theJust
is there to remind us that we're dealing with aMaybe
type and not just anInteger
).Then the
>>=
operator is defined thusly. For a functionf
and maybe typem
, ifm
looks likeJust x
thenm >>= f
is the same asf x
. That is, the value contained in theMaybe
type is just passed intof
(other languages might writef(x)
).If, however,
m
isNothing
, thenf
is never called and the whole thing just evaluates toNothing
. Since the>>=
operator is left associative, just like division or subtraction, we can look at our example this way:first evaluates
f x
. Maybe it returnsNothing
; then by left associativity,(Referential transparency allows us to reason about our code algebraically, which is why it is good). What if
f x
didn't throw an exception? Let's say it returnsJust 5
:Now we evaluate
g 5
. It too can return a value, or it can returnNothing
. You can probably see already that if it returnsNothing
, the whole expression will evaluate toNothing
. If it doesn't, whatever it returns will be passed on toh
, which also has the ability to returnNothing
. In this way, we have propagated the error -- which is why functional programmers often talk about exceptions being propagated rather than thrown.Now, there is one other function that is useful, which I'll call
unit
(in Haskell, we call it something else, but its name can confuse imperative programmers).unit
just takes a normal value and makes it into aMaybe
value by putting aJust
in front of it:So we could equivalently write the code we wrote before as
You can think of
unit
as lifting a normal value into a value that can also throw an exception.Now, if you've read this far, you've probably noticed that I haven't mentioned monads. Well,
Maybe
is a monad, a specific example of one that you can use today.unit
and>>=
are the monad operators (the latter operator is called "bind"). Many different data structures all define unit and bind in completely different ways. All that makes a monad, really, is that you can define two functions likeunit
andbind
on them.A monad is always a container type, which is to say that like
Maybe
it extends a simple type likeInteger
orString
by adding some extra information in some way (in our example, we added the possibility of beingNothing
, but other monads add other things).For example, it is possible to write out messages in a purely functional way using a monad. The idea is, instead of just returning a value, you return a value and the string that you want to write to the screen. Why do it that way, instead of just writing out to the screen? Because that way, your functions don't have side-effects, you preserve referential transparency, and you -- and more importantly, your compiler -- can reason about your code algebraically, allowing it to simplify your code the way you would an equation.
I won't go into the gory details regarding this monad (we call it
Writer
), but the basic idea is this:unit
takes a normal value and puts it in a data structure which includes an empty string. Our bind operator evaluates a function, which returns that same kind of data structure, except maybe that structure's string isn't empty (in case you haven't already guessed, the string in question is the string the function "writes" to the screen). It concatenates the two strings and continues on. An example:Assume
f 5
evaluates to 3 and writes"f was called\n"
(here++
is the concatenation operator):Now
g
returns 10 and writes out"g was called\n"
:And
h
returns 5 and doesn't write out anything:So here, bind is again passing the value contained in the
Writer
data structure into the next function, but at the same time it concatenating the messages that each returns.You can see that these functions are referentially transparent: earlier I said that if
f(5)
wrote something to the screen, it would not be referentially transparent because you could not evaluate it once and replace every occurence off(5)
with that. But iff
uses theWriter
monad we just defined, then it is referentially transparent, because its output is contained in its return value.There are many other monads. The more monads you learn, the more it will become clear what they all have in common; but in the meantime, you can use them productively as you learn them to write purely functional code if you enjoy that sort of thing.
(For the curious: in Haskell,
unit
is calledreturn
. It is nothing likereturn
in C, which is why I did not call it that in this post. It is calledunit
in category theory, however, so it's not like I just made the name up.)