
Tuesday, February 08, 2011

My First MySQL Function

My fancy new cloud server uses MySQL 5.0.91, which includes support for stored functions and procedures. I realize this release is old hat to the anointed, but it's new to me.

Anyway, I spent most of today writing my very first MySQL STORED FUNCTION.

While writing the function, I realized that I had become addicted to a debugger. Apparently, MySQL doesn't tell you what is wrong with a stored program. It simply fails if there is anything wrong with the syntax.

Finally, I decided to write the function by running each line of code through the SQL query window until I had a routine that worked.

Anyway, the goal of my first STORED FUNCTION is to encrypt and decrypt the email addresses from the registration form. The registration form already encrypted the email addresses, but I never wrote a decrypt program.

Since I will be releasing some new features, I decided I should create a password reset program, and for that it would be helpful to be able to decrypt the email addresses.

In this new program, I will store the encrypted email addresses in the table Email_X, which has the following definition; email_id is a primary key with AUTO_INCREMENT:

CREATE TABLE IF NOT EXISTS Email_X (
  email_id int(10) unsigned NOT NULL auto_increment,
  email_bin varbinary(255) NOT NULL,
  decrypt_cnt int(11) NOT NULL default '0',
  block_yn char(1) NOT NULL default 'N',
  PRIMARY KEY  (email_id)
) AUTO_INCREMENT=1 ;

getEmailId() takes an email address, converts it to lower case, and runs AES_Encrypt with the unimaginative password "password". I convert to lower case to make searching for an email address easier: "Bob@example.com" encrypts differently than "bob@example.com", and I want to find all occurrences of Bob's email address.

I will store the email_id in the user table. This makes it easy to find all the accounts owned by the same email address, because they will have the same number.

I could find all the accounts for bob@example.com with the SQL Query:

SELECT * 
  FROM User_Table 
 WHERE email_id = getEmailId('bob@example.com');

DROP FUNCTION IF EXISTS getEmailId;

DELIMITER $$

CREATE FUNCTION getEmailId(str VARCHAR(255))
  RETURNS INT
BEGIN
  DECLARE rv INT DEFAULT 0;
  DECLARE encrypted VARBINARY(255) DEFAULT NULL;
  IF (str = '') THEN
    SET rv = 0;
  ELSE
     -- Lower-case the address so differently cased copies encrypt the same way.
     SET encrypted = AES_Encrypt(Lower(str),'password');
     -- Look for an existing entry; rv stays 0 if nothing is found.
     SELECT email_id INTO rv FROM Email_X WHERE email_bin = encrypted;
     IF rv = 0 THEN
        -- First time we have seen this address: store it and return the new id.
        INSERT INTO Email_X (email_bin) VALUES (encrypted);
        SET rv = LAST_INSERT_ID();
     END IF;
  END IF;
  RETURN rv;
END$$

DELIMITER ;

I can now call this function with an email address. It will look to see if the address is already in Email_X. If so, it returns the existing id. If not, it creates a new entry in Email_X with an encrypted string in place of the email address.

To decrypt the email addresses, I created the following program:

DROP FUNCTION IF EXISTS getEmailAddr;

DELIMITER $$

CREATE FUNCTION getEmailAddr(eid INT)
  RETURNS VARCHAR(255)
BEGIN
  DECLARE rv VARCHAR(255) DEFAULT NULL;
  DECLARE blockMe CHAR(1) DEFAULT 'N';
  IF eid > 0 THEN
     SELECT AES_Decrypt(email_bin, 'password'), block_yn INTO rv, blockMe  FROM Email_X 
      WHERE email_id = eid; 
     IF blockMe = 'N' THEN 
       UPDATE Email_X SET decrypt_cnt = decrypt_cnt + 1 
        WHERE email_id = eid;
     ELSE 
       SET rv = '';
     END IF;
  END IF;
  RETURN rv;
END$$

DELIMITER ;

getEmailAddr() takes an email_id and decrypts the stored binary string. The function updates a counter every time someone accesses an address. I also added the block_yn feature: if a person requests that their email address be removed from the system, I can block it at the encryption level.

The next step in this program is to create a send mail program.

The send mail program will decrypt the email address and record the transaction. The number of decrypts should equal the number of emails sent. If it does not, then I know I have a hacker going at the database.
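
Here is a rough sketch, in PHP, of how the send mail program might tie the pieces together. Only getEmailAddr() comes from above; the user_id column, the Mail_Log table, and the sendUserMail() name are placeholders I made up for illustration.

<?php
// Sketch only: Mail_Log and sendUserMail() are hypothetical names.
// getEmailAddr() does the decrypt and bumps decrypt_cnt inside the database.
function sendUserMail($user_id, $subject, $body)
{
    $user_id = (int) $user_id;
    $result  = mysql_query("SELECT getEmailAddr(email_id) FROM User_Table WHERE user_id = $user_id");
    $row     = $result ? mysql_fetch_row($result) : false;
    $email   = $row ? $row[0] : '';
    if ($email == '') {
        return false; // unknown user or blocked address
    }
    // Record the send so the mail count can be reconciled against decrypt_cnt.
    mysql_query("INSERT INTO Mail_Log (user_id, sent_on) VALUES ($user_id, NOW())");
    return mail($email, $subject, $body);
}
?>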

Sunday, February 06, 2011

Magic Quotes and Lesson in Security

PHP is a kludgy language. It is common for a PHP page to take raw data from web input, translate the data into an SQL string, and execute an SQL query. A programmer might have a page with the URL "page.php?id=5". The program page.php might have the code:

mysql_query('SELECT * FROM PageData WHERE id = '.$id);

$id holds the value from the URL string. If the id is 5, PHP sends the SQL command "SELECT * FROM PageData WHERE id = 5" to the database. Since the page is sending straight text to MySQL, hackers have learned that they can often inject SQL code into PHP strings. For example, one might call the page with "page.php?id=5; SELECT password FROM SecurityTable WHERE user=1".

The variable $id now contains the string "5; SELECT password FROM SecurityTable WHERE user=1", and the call to mysql_query() ends up sending two SQL commands to the database engine.

When looking at the logs for a web site, you will occasionally find hackers sending commands to pages in an attempt to inject SQL strings into PHP pages. Hackers will even run scripts that systematically attack every variable sent to a page to see whether it is properly validated. This type of attack is called SQL injection.

The key to this SQL injection script is the single quote character "'", which breaks out of one SQL command and starts another.

To prevent injection scripts, programmers must validate all input from the web. Validating a string involves changing the quote character into something that won't be wrongly interpreted by SQL. PHP provides a function called addslashes(), which escapes quotes with a backslash: addslashes() translates "'" into "\'", which the database correctly interprets as a literal quote and not a break in the SQL statement.
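
A quick example of what addslashes() does to a string headed into a query (the Users table here is just for illustration):

<?php
$name = "O'Brien";
echo addslashes($name); // prints O\'Brien
// The escaped value can now sit safely inside a quoted SQL string:
$sql = "SELECT * FROM Users WHERE last_name = '" . addslashes($name) . "'";
// SELECT * FROM Users WHERE last_name = 'O\'Brien'
?>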

Early PHP manuals instructed programmers to run addslashes() on all code from the web.

But even one sloppy piece of code in a large program created a security hole.

Tired of being hacked, web hosts came up with a clever idea called "Magic Quotes".

Magic quotes automatically runs addslashes() on all data coming into the web site. There were two big mistakes in this decision.

The first was that addslashes() escapes slashes as well as quotes. If you run addslashes() on the same data twice, you start getting extra slashes. The call addslashes("'") returns "\'". addslashes(addslashes("'")) returns "\\\'". Running the function a third time returns "\\\\\\\'".
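
The effect is easy to demonstrate:

<?php
echo addslashes("'");                         // \'
echo addslashes(addslashes("'"));             // \\\'
echo addslashes(addslashes(addslashes("'"))); // \\\\\\\'
?>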

This is actually an interesting application of the reflective paradox.

All of the code written to PHP best practices prior to Magic Quotes started producing extra slashes. Programmers had to convert their code to something like addslashes(stripslashes($data)).
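
A common defensive pattern from that era looked something like this sketch (the cleanInput() name is mine):

<?php
// Normalize input so it gets unescaped exactly once, whether or not
// magic_quotes_gpc is turned on for this host.
function cleanInput($value)
{
    if (get_magic_quotes_gpc()) {
        $value = stripslashes($value); // undo the automatic escaping
    }
    return $value; // escape it yourself, once, right before the query
}

$id = cleanInput($_GET['id']);
?>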

The second major flaw of Magic Quotes is that people around the world want to write programs in their native language using their native character sets.

Different character sets around the world use different delimiters for text, but addslashes() and Magic Quotes assume everything was written in a Latin-based character set.

One does not know what characters need to be escaped until after establishing the connection to the database. Each database has its own preferred mechanism for validating data. MySQL offers an awkward function called mysql_real_escape_string().
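
For completeness, this is roughly how the MySQL call gets used (host, user, and password are placeholders). The awkward part is that it needs a live connection, because the escaping rules depend on the connection's character set:

<?php
$link = mysql_connect('localhost', 'dbuser', 'secret');
mysql_select_db('mydb', $link);

// Escape the raw web input using the rules of the current connection.
$id  = mysql_real_escape_string($_GET['id'], $link);
$sql = "SELECT * FROM PageData WHERE id = '" . $id . "'";
mysql_query($sql, $link);
?>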

Magic Quotes turned out to be a fiasco. Programmers who've written their own validation procedures want the feature off, while a small number of programs have grown dependent on it.

Now here is the interesting security problem: Any code written with the assumption that magic quotes was running becomes a security hole when magic quotes is off.

So, Magic Quotes created an interesting example in which a well intentioned security mechanism made programs less secure.

Monday, May 10, 2010

SEO v Localization

I've been working on the localization for a new site.

The design I want to use allows the user to select a language. The program would then display the pages for the rest of the session in the user's chosen language.

I like the way user-specific localization works.

The problem is that the googlebot balks at websites that show different content for different users. When the googlebot balks at a site, the site doesn't get included in the search engines.

Despite the fact that I think customized websites provide a superior user experience, I decided to shelve the design I prefer and create a segregated web site with different directories for each language. There will be an "/en" directory with English, an "/fr" directory for French, an "/fa" directory for Persian, and so on.

I dislike how I have to shelve dynamic design to keep favor with the search engines.

Anyway, for the last couple days, I've spent time trolling the net to see how other web sites handle the localization challenge. To my disappointment, I've discovered most companies seem to be going with monolingual designs.

Even companies that pride themselves on hiring bilingual customer service reps seem to go with monolingual websites.

I thought the opposite would happen. I thought that content hungry web designers would be jamming out poorly translated multilingual pages as search engines are likely to count poorly translated multilingual pages as original content.

As for my specific effort, I am discovering that several of the programs I use have a poor implementation of UTF8 that's causing encoding errors. Perhaps web designers have found the goal of localization too big a hassle and have simply taken to the easier route of designing a different site for each language.

Sunday, June 08, 2008

The Java Shakes

I have a problem. I want to include graphics in my current research project. The Adobe CS3 suite and Flash animation would be the ideal tools for the job. Unfortunately, they are out of the budget for the project.

Anyway, a week ago, I decided to see if I could do the job with Java Applets. So, I dusted off all of my old Java books and have been trolling web sites for current best practices.

Most of the examples I found were dated from the 1990s. Several programs out there actually crash my browser (not a good sign). Sadly, the graphics in most of the programs I've found lack quality and seem to be resource hogs.

Yesterday, I went to a local bookstore. I was considering buying a new book because I wanted to find current best practices. Do they have a better solution for reducing flickering than resource intensive double buffering?

I was shocked. Just a few years ago, Java books filled a full bookcase. Today, the selection of Java books fills only half a shelf.

I was extremely disappointed with the books I found.

Java books tend to be full of prose lauding the greatness of the language. They are generally weak on showing what can be done with the language. Conversely, books on languages like Perl admit up front that the language is ugly, but are full of examples showing what can be done with it.

I am wondering if the Java language is already in its sunset years.

I admit, I have never used Java in a professional situation. Every time I've evaluated Java, I ended up using other technologies that were better suited for the task at hand.

IMHO, the great fault of Java programs is that Java programmers tend to move the complexity of programs from the methods into the object model. This method of thinking works okay for desktop applications, but does not work well on the web.

The JFreeChart project has a library for producing nice looking charts. Unfortunately, the library is designed for producing charts, not for displaying them. None of its preformatted charts do what I want. Making JFreeChart into a tool that displays graphs the way I want would take a massive re-engineering effort.

Oddly, the best graphics program I ever used was the original PostScript language. It took a little effort to get used to the reverse Polish notation, but it was possible to sit at a computer and write stuff with decent looking output.

Wednesday, April 04, 2007

Long Object

I am behind on everything. Anyway, I wanted to write a program to calculate the check sum for an ISBN-13 code. As book lovers know, in January of this year ISBNs went from 10 digits to 13 digits. It would make sense to write the program as an object. With an object, you would store the ISBN and extract either the 10 or 13 digit version of the number.
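
For reference, the ISBN-13 check digit itself only takes a few lines. This is a minimal sketch, not a replacement for a full ISBN object:

<?php
// ISBN-13 check digit: weight the first twelve digits 1,3,1,3,... and pick
// the digit that brings the total to a multiple of 10.
function isbn13CheckDigit($first12)
{
    $sum = 0;
    for ($i = 0; $i < 12; $i++) {
        $sum += ((int) $first12[$i]) * (($i % 2) ? 3 : 1);
    }
    return (10 - ($sum % 10)) % 10;
}

echo isbn13CheckDigit('978030640615'); // 7, giving the full ISBN 9780306406157
?>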

The site lastflood.com has a nicely written PHP object for ISBNs. It is an example of very well written code; however, it strikes me that this well written code is over 1500 lines!

The size of the code doesn't matter much for compiled languages like Java and C++. The problem with a scripting language like PHP is that the code has to be parsed and compiled on each call. In a typical year, this object would be called and compiled several hundred million times. I am left scratching my head wondering if I should write "well written" code or go for speed. The object would be running on a crowded shared server.

I am really caught in a quandary: should I write good code or should I write fast code? How big of a hit do I take for using a 1500 line object as opposed to a short procedure that takes maybe 80 lines?

The fact that there seems to be a dichotomy between good and fast strikes me as wrong.

Back to my rant on Unicode. An ISBN is simply a number. You can represent a 13 digit number with a 44 bit binary string. If I were using a 32 bit Unicode encoding to store the ISBNs, I would need 416 bits to store each ISBN. When I store the ISBN as a string in a database, the size of the key is almost ten times the size I really need. This does not matter for the majority of applications, but once you get into analyzing data, the inefficiencies stack up. For example, if you were working on a database that recorded cross-referencing in books, you could very quickly end up with lists of millions of ISBNs.

Yes, I know that the 1500 line program is well written. It is better than anything I ever write, but my brain keeps screaming the word "efficiency" at me! The ISBN is just one tiny element that I have on my pages. If I wrote everything in good style, my programs would be 80,000+ lines of code in size.

I neither want to write nor maintain code that is that long. What I want is code that lets me express the actual logic needed for a task in the cleanest, most efficient manner possible. In other words: I am a dinosaur headed for extinction.

Tuesday, March 27, 2007

Diverse Societies

Rephrasing the long post: I think we are better off having a diversity of character sets than pretending that we have one character set that encompasses the diversity of all possible languages.

What programmers need is a multidimensional system that allows for a short, readable character set for interfacing with the computer. We can then use translation tables for communicating with other cultures.

There is a place for efforts like UTF8 that incorporate a translation table at the character level itself; however, I think there is more fundamental diversity when we allow people to develop systems optimized for their culture than when we try to create monolithic systems that try to encapsulate the full diversity of all human language.

Diversity of Characters

It looks like I will be spending the day trying to coordinate character sets for my database conversion. Programmers (and computer users, for that matter) should be aware of the encoding they use for their writing. The encoding is the process that maps the characters you see on your computer screen into the bits stored on the computer.

Joel on Software has an interesting read on Unicode. Joel seems to have an extremely low opinion of American programmers in general and of PHP in particular. (PHP is maintained by the Zend corporation in Israel.)

I happen to have a positive view of both PHP and American programmers.

If anything, I am more apt to question the people trying to stuff Unicode down our throat, than to question the people who are in the trenches trying to make programs work.

Back to Unicode. In the early days of computer programming, processing and storage capacity were expensive. Programmers encoded data in ALL CAPS because computer space was too valuable to waste on inconsequential details like case.

In the early days, computers were so expensive that if a country wanted to have a computer for its own character set, the natural choice would be to design machines and software from the ground up. It was really not until the '80s that the price of computer capacity dropped to the point where people could start thinking of cheap machines with the capacity to process the complexity of different languages.

The natural impulse of computer science was to handle the diversity of languages through parallel evolution of different operating systems and character sets. This was accompanied by the coevolution of technologies to translate between the different operating systems.


As a student of languages and linguistics, I was actually hoping that different language groups in the world would end up developing their own approaches to software. I was hoping that the parallel evolution of computer science in different language groups would lead to a diversity of operating systems.

Of course, there were powerful interests who wanted to see one operating system dominate the entire world.

Since the existence of different character sets was leading to parallel evolution of operating systems in different cultures, powerful multinationals wanting to dominate the world had to act. They did so by stuffing down our collective gullet a new standard called Unicode.

The goal of the Unicode effort was to stomp out this natural evolution of computer science by encoding the diversity of the known languages of the world into a single character set. The first hope was that this could be done with a 16 bit number. That was too small to encode all of the subtleties of Chinese and other symbolic languages. There was some hope that 32 bits would suffice. Each character you typed on the screen would be a number between 0 and 4294967295. With a 32 bit character set, the code space is some 33818640 times larger than the 127 values ASCII needs to record an English writer's characters.

A good writing application doesn't just record the written word. An application might also record revision history and the like. An implementation of a universal character set means a great deal of wasted space for English writers.

A 32bit character set is so overbearing that no one really wants to use it. The current groupthink is to push an idea called UTF8.

UTF8 uses single bytes for the ASCII characters and multi-byte sequences for everything else. You can directly translate ASCII to UTF8; letters from other alphabets just take more bytes.

Unfortunately, the existence of variable length characters is problematic for the many language and database applications that assume the binary representation of every character is the same size. By adopting UTF8, you actually end up precluding the use of primitive fixed length databases, which is sad because primitive fixed length databases are fast and easy to program.
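
You can see the variable widths directly, since PHP's strlen() counts bytes (this assumes the script itself is saved as UTF-8):

<?php
echo strlen('hello');  // 5  - plain ASCII stays one byte per letter
echo strlen('héllo');  // 6  - the é becomes a two-byte sequence
echo strlen('привет'); // 12 - each Cyrillic letter takes two bytes
?>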

Since UTF8 gives precedence to English by making our letters the small ones at the top of the chart, I actually see UTF8 as more imperialistic than the paradigm where Americans used ASCII and other linguistic groups evolved their own character sets.

In some ways, I see the debate over character sets as a reflection of the overall debate between the classical liberal world view and progressive world view. The classical liberal view would have Americans continuing to pursue the development of operating systems and character sets that best allow the expression of what Americans want to accomplish while people in other linguistic traditions develop character sets and operating systems that best express their desires.

Parallel evolution leads to greater diversity.

Since we are interested in communicating with the world, there would be a natural coevolution of schemas for translating ideas between cultures.

The classical liberal approach to the diversity of languages would be to allow for the parallel evolution of different ideas and character sets. The progressive approach is to try to create a single universal character set and to force everyone to use that one universal character set.

BTW, you may notice that writers favoring Unicode often take a very condescending attitude to traditional coding techniques.

My thoughts on this issue are that programmers should store information at a cardinality that best matches the data. For example, if you are making shoes, you might have 5 colors, 10 sizes and 4 widths. There are only 200 permutations of this shoe description. Ideally, the character set in your shoe database would not require much more space than what is needed to express these 200 permutations. The wasted space may not look like a lot when you are talking about 1 or 2 shoe orders, but when you are talking about a database recording millions of shoes, the inefficiencies add up.
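
To make the cardinality point concrete, here is a toy sketch; the attribute indexes and function names are made up:

<?php
// 5 colors x 10 sizes x 4 widths = 200 combinations, so the whole shoe
// description fits in one small integer (a single byte would do).
function packShoe($color, $size, $width)   // each argument is a 0-based index
{
    return ($color * 10 + $size) * 4 + $width;   // 0..199
}

function unpackShoe($code)
{
    $width = $code % 4;
    $rest  = ($code - $width) / 4;
    return array('color' => (int) ($rest / 10), 'size' => $rest % 10, 'width' => $width);
}
?>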

Storing this data in ASCII format is already inefficient. Storing data on shoe orders in Unicode multiplies that inefficiency by 4. When ordering one pair of shoes, the fact that you wasted some space doesn't really matter. When you start talking about hundreds of millions of shoes, the space starts to matter.

I am not completely dismissive of Unicode. The shoe company may want to sell its shoes in every country. The sales department is likely to want a database that contains the name of their shoes (along with sales text) in every language (including Klingon for the big push at the Star Trek convention). A database might encode the attributes of the shoe in ASCII, and the names of the shoes in Unicode.

Having a mix of character sets is both more efficient and allows for greater diversity than trying to force one universal character set at the operating system level.

In most cases, the cardinality of the information you are collecting is quite low, while the quantity of items you are recording is large. For example, in DNA analysis you have just four nucleotides to encode, and human chromosome #1 has roughly 250 million base pairs. If you are doing DNA analysis, you will want to encode the nucleotides with the smallest symbol possible so that you can analyze the complexity of the DNA string.

In the analysis of things like protein folding, you add the complexity of space and time to your analysis. Much of the really interesting computer science these days pushes the limits of information theory.

Even though I've been condemned in life to work on less interesting programs, my sentiments lie with those programmers pushing computer science to its limits. The design of data should be driven by the structure of the data under analysis and not by the anti-American sentiments of the "progressives" in the sociology department of the university.

IMHO, real diversity comes by allowing the free evolution of different approaches to the problems of the world. The grand schemes that are supposed to force diversity upon us tend to be inefficient and become overbearing. Forced conformity does not create real diversity.

Joel on Software smugly notes at the end of his article that his company stores everything in two byte UCS-2. I think the better approach is to store data in the most compressed format possible and to have translation tables that let you expand as needed. Joel's programming style may be appropriate for small web publishing firms that are trying to reach a universal audience. However, it is not appropriate for the interesting programming questions that involve tons of data and computer capacity.

Wednesday, March 14, 2007

Best Practices

The reason I am being so weird at the moment (see last post) is the nature of the project I am working on. I am currently moving a collection of community web sites from one host to another. The sites average 20,000 to 30,000 page views a day. At current market rates, the sites would have to have about 300,000 page views per day to generate enough revenue for one minimum wage job. IMHO, the sites are still valuable because they provide a community service: they give people in the community who would otherwise not be heard a venue to list their sites.

My original marketing plan for Community Color was to pound out a quick prototype site hosted on a discount web server, get market feedback, then write a "real" application in Java. PHP is the ideal prototyping language. It is quick, dirty and easy to support. So the first iteration of the program was a quick and dirty prototype. For a variety of reasons, I wrote the prototype as a collection of procedures, and not with object oriented programming techniques. I can feel people trembling ... the horror, the horror.

Writing Object Oriented style code is considered best practices.

I can feel the intellectual elite of the programming world turning up their collective noses at the idea of someone intentionally designing a web site with old style programming techniques. The horror, the horror.

Anyway, during my current move, I am upgrading from PHP 4 to PHP 5. During the move, I thought I would upgrade from procedures to objects.

Guess what? I am finding that I like the procedural style code better.

Best practices dictates that all computer programs must be written in object form.

I find myself wanting to tell the world why I don't think object oriented programming is the right solution for all applications. In my opinion, oop is not the right solution for web based programs written in a scripted language.

IMHO, the sentiment that object oriented programming is the best solution for all programming problems is an unwarranted absolutist statement. For that matter, as I look back on the evolution of programming, I think that the overemphasis of object oriented design is a cause for many buggy programs that exist today.

I believe that the programming world is best served by a diversity of programming languages and programming styles.

Since programming techniques have an effect on society at large, I think this is a topic that people outside the programming world would find interesting.

I am being weird because I am thinking about how one would make my case to the world.

PHP is a programming language that is immediately accessible to new programmers. A run of the mill, internet savvy computer user can learn PHP 5 and start using it to add functionality to a web site within hours. It is a programming language that I would encourage casual internet users to learn.

I think it would be fun to present the case for procedural v. oo programming in a structure that encourages internet users to learn PHP and HTML.

Several years ago, I pounded out a quick and dirty syllabus for a class on PHP. This was a quick seven part tutorial for a general audience. The course was written for HTML 2.0 (old style HTML). I was thinking of writing a new tutorial for PHP 5 and DHTML.

Friday, March 09, 2007

Moving a Domain

I am currently in the process of moving the Community Color web sites to a new host. When I started a community directory, my idea was simply to pound out a quick and dirty prototype in PHP. If the program had tapped into a good market, I would have redesigned it in C++ or Java and moved to a dedicated server. So, moving the site to a new discount web host is an admission that the idea was a failure.

I should note, I started this set of programs up in Missoula. People in Missoula are so supportive of their community. I also really liked the University of Montana; it appears to be a good school with level-headed people. The experiment I started with Missoula.WS was a success. I moved back to Salt Lake to watch the Olympics, and someone else I knew wanted to do a Missoula directory. The general reaction to the Salt Lake directory is simply: "You aren't Mormon. LEAVE!"

Utah has a strong counter culture. I am just not that good of a counter-culture type person. I actually see all of the groups in Utah as part of the Utah culture.

My little idealistic world view is that everyone who lives in an area is part of that community. We should develop the good parts of the community and find ways to overcome the bad things. No group should be dominating everything. When one group starts dominating, a community devolves into action/reaction mode, and the worst of people surfaces.

The goal of Community Color is to include everyone. In a community like Missoula where most people simply love being in Missoula, the idea of a community directory flies. In a fractured community, like Salt Lake, the idea falls flat. Of course, a community that is fractured by ideology is in greater need of things that encourage communication between the factions.

When you analyze the link structure of blogs and web sites in Utah, you will generally find that people only link to their group.

Just today, Natalie R. Collins put up a post about how she is excluded from Mormon link lists. Her site, of course, is part of a nexus of ex-Mormon and anti-polygamy sites. If you put up an intentionally pro-Mormon or an intentionally anti-Mormon site, you will get a very large number of links. Bemoaning the fact that you don't get all links is ludicrous.

The community color directories have several intentional biases. There is a bias toward small independent organizations that are within defined cultural centers of the state.

A business in Salt Lake City would get a higher slot than one in WVC. I allow that bias because I believe that the established cultural centers are important, and that sprawl is one of the major problems facing the Mountain West. The primary reason that I have not moved to Moab is that I don't want to be part of the sprawl that is consuming that little piece of paradise.

The small town is something very near and dear to the Western heart. The problem is that we can't all live in such places. When we try to, the small town we love sprawls.

I tried to figure out how to move to Grand Junction. My sights are currently set on moving to Denver. Although I've lived in Utah most of my life, enough people in my lineage were born in Denver to claim that I am a third generation Coloradan.

After moving from Missoula, I never received any feedback on what people want in a community site. This move to a new discount host is an admission that the project failed. Most ideas fail. That's the way science and the free market work.

I am not really upset at failure. It is the fact that I haven't received the feedback on a better direction to follow that's got me feeling gloomy.

Anyway, one of the reasons I've been pounding out mindfarting "anti-progressive" posts is that I've been trying to avoid the fact that I need to get this domain moving project completed before the end of March. I started the project in December. So, I will write about PHP programming for the next several blog posts.

Monday, July 03, 2006

Programming Updates

Buttercup Collection (image)

I spent the last 12 hours updating the picture page to DHTML Strict. As with most projects, there was about an hour of programming and 11 hours of trying to get CSS to display the image the same way in different browsers. CSS was a good idea; it was just a horrible, horrible, horrible implementation. This would have been only a 30 minute job if the bastards who wrote CSS had left the align=center option. Instead, CSS was written by elitist wanks who decided that centering an image was too petty and bourgeois for a revolutionary formatting language like CSS. My final solution was to measure the sizes of all of the images and calculate the centering myself.

I also got around to adding alt tags to all of the pictures. The reason I didn't add the alt tags in the first place is that I had tight bandwidth constraints. I was scared that Google would index all of the images and send the site more traffic than I was prepared to pay for. Fortunately, my web host has increased my bandwidth quota, and I have an upgrade path to more bandwidth.

Hopefully the new DHTML picture page and Context Browse page will bring more traffic.