
UTF-8 encoding question

Status
Not open for further replies.
Really strange problem that I'm having:

I have different pages, some with Japanese and some with French characters (never mixed, always one or the other).

Some parts of the pages display the characters correctly, and in other parts some of the characters are replaced with ??? or with question marks in diamonds.

I have this in the head (which is actually a php included file):
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Which I thought was enough.

I think that because it's an included file, there's an issue with the server default encoding taking over and messing up half of the characters...

I'm just not sure whether it is a server encoding issue, or why only half of the characters are affected.

Has anyone experienced anything like this?
 
Are you sure you haven't inserted incorrectly encoded data into the database, or have the database encoding set differently from the output encoding?

One common issue I see a lot is people who have written their content in Microsoft Word, with strange characters magically appearing in place of what Microshaft call smart quotes.
 
The text isn't from a database.
In places it's an echoed variable, in other places it's straight text.

The problem seems to happen when the page in question gets its header from an included file:

include "header.php";

Even though the included header has:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

I think that my charset is being ignored and that the server is using its own. Really strange. I can't override it with the things I've tried in my .htaccess, or with the php.ini that I created in the directory (I can't change the real php.ini on the server as it's a shared hosting plan).

Really odd. I'm waiting for a reply from support
 
Yep, this is exactly what's happening (what I mentioned above) - really odd.

主題を拾い読みしなさい

^ this is some Japanese.

If I put it in the header.php file and then load the page which includes the header, it shows perfectly.

If I put it in the index.php (which includes the header.php file <- this includes the encoding information) then the Japanese displays as ?????

So I need to either get the server default to be UTF-8 or load my headers sooner (before the file include)
 
Perhaps you're not saving the main file in UTF-8 format from the text editor, or whatever you used to make it? If you save it in ASCII or another format, it could turn out like you posted.
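
One way to test this suggestion from PHP itself is to check whether the bytes in a file are valid UTF-8 and, if not, convert them. The sketch below assumes the mbstring extension is available on the host and that any non-UTF-8 text is ISO-8859-1 (a reasonable guess for French or German, not for Japanese). The helper name `to_utf8` is made up for illustration.

```php
<?php
// Hypothetical helper: return the text as UTF-8.
// Assumption: anything that is not already valid UTF-8 is ISO-8859-1.
// Japanese files saved as Shift_JIS or EUC-JP would need that encoding
// passed in as $assumedSource instead.
function to_utf8(string $bytes, string $assumedSource = 'ISO-8859-1'): string
{
    // Valid UTF-8 is returned untouched so it is never double-encoded.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    return mb_convert_encoding($bytes, 'UTF-8', $assumedSource);
}

// "\xE9t\xE9" is the word "été" as ISO-8859-1 bytes.
$fixed = to_utf8("\xE9t\xE9");
var_dump(mb_check_encoding($fixed, 'UTF-8'));
```

Running the same check on the contents of each language file (for example via `file_get_contents()`) would show which ones were not saved as UTF-8.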
 
Thanks for your help Edwin and Skinner - appreciate it.

Edwin, I don't think that the issue is how I'm saving it, because if I use the same file editor and save in the same way I can get a success or a failure depending on whether the Japanese goes in the header (where the encoding is specified) or in the index (where the encoding info is PHP-included in a header).

So I'm sure it's that the included file's encoding instruction is ignored if the Japanese (or French for that matter) is located in the index file (and not in the header).



I made some progress with my host. They gave me a php.ini file on my server. I see the following in the file:

; As of 4.0b4, PHP always outputs a character encoding by default in
; the Content-type: header. To disable sending of the charset, simply
; set it to be empty.
;
; PHP's built-in default is text/html
default_mimetype = "text/html"
;default_charset = "iso-8859-1"

so the default_charset is commented out with the preceding ";"
If I remove the ";" my French and German pages output correctly, but the Japanese goes even crazier.

I tried making the default_charset = "UTF-8"
but the French, German and Japanese are all displayed with errors.

There may be more things I need to change in the php.ini, but this issue seems to be a bigger problem in general. I'm not sure if it's limited to my host's setup, or if it's a PHP problem that encoding info isn't transferred when the page header is part of a PHP include.
 
Have you tried editing the actual HTTP header that Apache sends for the page, rather than the meta tag in the include?

header('Content-type: text/html; charset=utf-8');
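
A charset sent in the HTTP Content-Type header takes precedence over the one in a meta tag, which would explain a meta tag appearing to be ignored. As a rough sketch, the header call could sit at the very top of index.php, before any output and before the include. Setting `default_charset` at runtime is shown as an alternative to editing php.ini; whether the host allows it is an assumption.

```php
<?php
// Assumption: this is the very first thing in index.php, with no
// whitespace or output before the opening tag.

// Runtime equivalent of default_charset = "UTF-8" in php.ini.
ini_set('default_charset', 'UTF-8');

// Send the charset explicitly; headers can only be set before output starts.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// include "header.php";  // the include from the thread would follow here
```

This only tells the browser how to decode the page. The files themselves still have to be saved as UTF-8 for the bytes to match.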
 
Skinner & Edwin - thanks for your help - I've now kinda fixed the problem (just have other issues :))

Edwin you were right about how the file was saved. I was assuming that a language pack that came with some software would have the French/German/Japanese files saved as UTF-8 - but they weren't :rolleyes:

Also, the php.ini settings on my server had to be tweaked.


Was wondering if you knew about preg_replace and safely displaying Japanese characters on a page.

$translated = preg_replace("/[^a-z âàéèëÉÂÀËçÇ \d]/i", " ", $input);


I'm using code like this to only allow a-z, digits, French accented characters and spaces, with all other characters translated to a space.

1. Is this a safe way to go about this? Or can quotes and other characters still get through?

2. What do I need to put in there to allow Japanese characters (Kanji I think)? And will it be safe, or open up the risk of quotes and other code getting through?
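
One possible answer to question 2, as a sketch rather than a definitive filter: with the `u` modifier PCRE treats the pattern and the input as UTF-8, and Unicode script properties such as `\p{Han}`, `\p{Hiragana}` and `\p{Katakana}` can be added to the allowed set. Without `u`, the accented letters in the class are matched byte by byte, which can mangle multibyte text. The function name below is made up for illustration.

```php
<?php
// Hypothetical helper: replace everything except a-z, digits, spaces,
// the listed French letters and Japanese script characters with a space.
function keep_allowed_chars(string $input): string
{
    // i = case-insensitive, u = pattern and subject are UTF-8.
    // "ー" (the long-vowel mark) is listed explicitly because it is not
    // covered by the Hiragana or Katakana script properties.
    $pattern = '/[^a-z\dâàéèëç \p{Han}\p{Hiragana}\p{Katakana}ー]/iu';
    $result = preg_replace($pattern, ' ', $input);

    // With the u modifier, preg_replace returns null on invalid UTF-8 input.
    return $result === null ? '' : $result;
}

echo keep_allowed_chars('主題を拾い読みしなさい'), "\n";
echo keep_allowed_chars('a"b<c>'), "\n";
```

Quotes, angle brackets and backslashes are not in the allowed set, so they are still replaced by spaces. The value should still be escaped for whatever context it is output into.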


Really appreciate your help. Skinner, you always try to answer my coding questions, and Edwin, you spurred me to double-check the language file encoding on this. Thanks again.
 
You would be better using graphics made with GD or ImageMagick to display the Kanji.

You can make quotes etc. safe using the htmlentities() function, which converts < into &lt;, > into &gt; and " into &quot;, stuff like that. Stops nasties being added in.

There are also utf8_encode and utf8_decode, which convert between ISO-8859-1 and UTF-8 and may help too.

My RegEx sucks, but I'm pretty sure the i after the closing / is what makes it case insensitive; without it you're only going to match lowercase letters :)
 
I saw this character list on another website of what you need to allow:

ÀàÁáÂâÃãÄäÅåÆæÇçÈèÉéÊêËëÌìÍíÎîÏïÐðÑñÒòÓóÔôÕõÖöØøÙùÚúÛûÜüÝýÞþœŒ
 
Thanks skinner.

I've been working with addslashes instead of preg_replace() to put a string as a page title.

<title><?php $pagetitle = addslashes($pagetitle); echo $pagetitle; ?></title>

This outputs the Japanese without any issue now that my earlier encoding problem is fixed. I wondered if addslashes is enough in this case?

I think that the worst that can happen is that the title may appear as
title text \"some word with a\" quote
if there are single or double quotes used (which I can live with)

I tried htmlentities and mysql_real_escape_string but both gave errors in the title when using Japanese. But addslashes seemed fine - as long as it's safe.

These are all functions that I've used a few times in websites but I'd never really carried out a full security check.
 
addslashes will stop most injection methods; mysql_real_escape_string is just a more SQL-geared method.

Even if you ran all 3 and a replacement, you can't say it's safe, as someone will find a way if they really want to, so just take basic steps :)

I personally use htmlentities only where I allow HTML input from anyone other than me; I use real escape or addslashes for non-HTML input.
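
For a value echoed into a page title, it may be worth noting that addslashes only puts backslashes in front of quotes, backslashes and NUL bytes, so it leaves < and > untouched, and that mysql_real_escape_string needs an open MySQL connection (and was removed in PHP 7). The errors seen with htmlentities on Japanese text are consistent with PHP versions before 5.4, where the function assumed ISO-8859-1 unless told otherwise. A minimal sketch of output escaping with the charset passed explicitly, using a made-up helper name and sample value:

```php
<?php
// Hypothetical helper for escaping a value before echoing it into HTML.
function html_escape(string $text): string
{
    // ENT_QUOTES escapes both double and single quotes; 'UTF-8' tells PHP
    // how the bytes are encoded so multibyte characters pass through intact.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

$pagetitle = '主題 "quoted" <script>';   // sample value, not from the thread
echo '<title>' . html_escape($pagetitle) . '</title>', "\n";
```

htmlspecialchars only touches the characters that are special in HTML, so the Japanese text is output as-is rather than being turned into entities.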
 