DoudouLinux: The computer they prefer!
DansGuardian is the DoudouLinux web content filter, which prevents children from visiting “naughty” sites or pages. DansGuardian has two components: URL blacklisting and real-time content analysis. Only the second one needs translating, because it relies on lists of words and expressions that are known to be either “naughty” or, on the contrary, safe. Each word or expression carries a score that raises or lowers the total score of a page. The page is declared “naughty” as soon as its score reaches 50, the limit set in the DansGuardian configuration. This is the recommended trigger level for small children.
NB: the DansGuardian error page must be translated too, but its contents have been moved to PO files, along with the predefined DansGuardian messages that used to be stored in plain text files. Please have a look at Transifex for the PO files.
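The scoring idea can be illustrated with a minimal Python sketch. It is not DansGuardian's actual code: the words and weights are invented for the example, and it assumes each listed phrase is counted once per page, which may differ from the real filter.

```python
# Hypothetical weights, for illustration only; the real values live in the
# phrase list files described in this article.
WEIGHTS = {"casino": 30, "poker": 25, "homework": -20}
THRESHOLD = 50  # the "naughtiness" limit quoted in the text


def page_score(text, weights=WEIGHTS):
    """Add up the weight of every listed phrase found in the text."""
    lowered = text.lower()
    return sum(weight for phrase, weight in weights.items() if phrase in lowered)


def is_blocked(text, weights=WEIGHTS, threshold=THRESHOLD):
    """A page is rejected as soon as its total reaches the threshold."""
    return page_score(text, weights) >= threshold


print(page_score("Casino and poker night"))   # 30 + 25
print(is_blocked("Poker tips for homework"))  # 25 - 20 stays under the limit
```

Negative weights model the "safe" phrases: they pull the total down so that a page mentioning a sensitive word in a harmless context is less likely to be rejected.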
Files of words and expressions are located in the directory lang/trunk/apps/system/dansguardian/lists/phraselists/ of the lang SVN tree. They are copied at build time into /etc/dansguardian/lists/phraselists/. Files are placed in sub-directories that represent the list category:
    $ ls /etc/dansguardian/lists/phraselists/
    badwords        forums          gore          malware    personals    secretsocieties  violence
    chat            gambling        idtheft       music      pornography  sport            warezhacking
    conspiracy      games           illegaldrugs  news       proxies      translation      weapons
    domainsforsale  goodphrases     intolerance   nudism     rta          travel           webmail
    drugadvocacy    googlesearches  legaldrugs    peer2peer  safelabel    upstreamfilter
Files in sub-directories are named banned_lang or weighted_lang, where lang is the name of your language, except for English, whose files are simply named banned and weighted:
    $ ls /etc/dansguardian/lists/phraselists/pornography/
    banned             weighted_danish  weighted_italian    weighted_portuguese
    banned_portuguese  weighted_dutch   weighted_japanese   weighted_russian
    weighted           weighted_french  weighted_malay      weighted_spanish
    weighted_chinese   weighted_german  weighted_norwegian
Banned files contain phrases that automatically trigger page rejection, i.e. the remaining contents are not evaluated. Weighted files contain phrases that are each associated with a naughtiness value. Their content is quite simple:
    #listcategory: "Pornography (Russian)"
    <проститутки><50> #prostitutes
    < фото><5> #photo
    <бюст><40> #bust
    < анал ><40> #anal
    <анальный><40> #anal
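A weighted file in this format can be read with a few lines of Python. The sketch below is only an illustration: the function name `parse_weighted` is made up, and it assumes one phrase and one weight per line, as in the sample. Real DansGuardian files may use richer forms that this parser simply skips.

```python
import re

# One phrase and one weight per line, e.g. "< фото><5> #photo".
ENTRY = re.compile(r"^<([^<>]+)><(-?\d+)>")


def parse_weighted(lines):
    """Return (phrase, weight) pairs, skipping comments and blank lines."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = ENTRY.match(line)
        if match:
            # Spaces inside the brackets are significant, so keep them as-is.
            entries.append((match.group(1), int(match.group(2))))
    return entries


sample = [
    '#listcategory: "Pornography (Russian)"',
    "< фото><5> #photo",
    "<бюст><40> #bust",
]
print(parse_weighted(sample))
```

The phrase is deliberately not stripped, because the leading or trailing space is part of the matching rule.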
Each line contains a word or expression followed by its weight. Comments are introduced by the number sign “#”. Beware that spaces before and/or after a word specify whether the word may be part of a longer word, its beginning, its end or the whole word. This is very important for words like anal, which is also the beginning of analysis, analogue, etc. The rules are the following:
spaces | example | effect |
---|---|---|
no space | <abcd> | matches any word containing abcd |
space before the word | < abcd> | matches any word starting with abcd |
space after the word | <abcd > | matches any word ending with abcd |
two spaces | < abcd > | matches the exact word abcd |
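One way to picture these rules is a plain substring search on text whose words are separated by single spaces. The Python sketch below works under that assumption. The `normalize` and `phrase_matches` helpers are hypothetical, and DansGuardian's own text normalisation may differ.

```python
import re


def normalize(text):
    """Lower-case the text and join its words with single spaces, padded at both ends."""
    words = re.findall(r"\w+", text.lower())
    return " " + " ".join(words) + " "


def phrase_matches(phrase, text):
    """Plain substring search; spaces in the phrase act as word boundaries."""
    return phrase in normalize(text)


print(phrase_matches("port", "an important day"))  # no space: inside a word
print(phrase_matches(" port", "a portal"))         # space before: start of a word
print(phrase_matches("port ", "sport!"))           # space after: end of a word
print(phrase_matches(" port ", "sport"))           # two spaces: whole word only
```

Because the padding adds a space at each end of the text, the first and last words of a page follow the same rules as the others.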
Please note that non-English weighted files are very light compared to the English one (which has no language name in the file name). Their size is the following:
    $ ls -Ssh1 --hide 'banned*' /etc/dansguardian/lists/phraselists/pornography/
    total 152K
     80K weighted
     16K weighted_portuguese
     12K weighted_italian
    8,0K weighted_japanese
    4,0K weighted_french
    4,0K weighted_spanish
    4,0K weighted_danish
    4,0K weighted_russian
    4,0K weighted_german
    4,0K weighted_malay
    4,0K weighted_dutch
    4,0K weighted_chinese
    4,0K weighted_norwegian
Conversely, only the Portuguese banned file has any real content. Overall, this means that the translation work for DansGuardian is huge… However, you do not have to translate the whole English file: it may be replaced by a simpler but exhaustive list of words and expressions in your language.
The DansGuardian files that are shipped within DoudouLinux use different encodings depending on the language. For example, French uses a Latin-specific encoding while Russian uses a Cyrillic-specific one. This is a real difficulty for us because each file has to be edited with the correct encoding for its language. For this reason, the files in the DoudouLinux “lang/” tree have already been converted to UTF-8. This means that all files must now be edited in UTF-8, whatever your language is.
Another issue is that DansGuardian does not try to guess the encoding of the web page that you are requesting. Instead it considers its content as a binary stream and performs byte comparison with word lists. This means that we should provide a weighted file for each possible encoding of each language… Again the “lang/” tree has been designed to host UTF-8 files only. Files in additional encodings are automatically generated at CD build time to reduce the translation effort.
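The following Python lines show why a byte comparison needs one list per encoding. The sample page text is invented for the example. The same word produces different bytes in UTF-8 and in cp1251, so a UTF-8 list never matches a cp1251 page.

```python
word = "фото"  # "photo"

utf8_bytes = word.encode("utf-8")
cp1251_bytes = word.encode("cp1251")
print(len(utf8_bytes), len(cp1251_bytes))  # two bytes per letter versus one

# A page served in cp1251 contains the cp1251 bytes, not the UTF-8 ones.
page = "Семейное фото".encode("cp1251")  # "family photo"

print(utf8_bytes in page)    # False: the UTF-8 list misses the word
print(cp1251_bytes in page)  # True: the cp1251 list finds it
```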
So if you need to add encodings for your language, you have to edit the file lang/trunk/apps/system/dansguardian/lists/weightedphraselist. It contains the list of files to be loaded by DansGuardian. The trick is that if the CD build script does not find a file from this list, it tries to generate it from another one, after guessing which file it derives from and which encoding is requested. These additional files must be named this way: weighted_LANGUAGE-ENCODING. For example, mentioning the nonexistent file weighted_russian-cp1251 will make the CD build script convert the file weighted_russian from UTF-8 to cp1251. Of course the resulting file is named weighted_russian-cp1251!
Here is a small sample of the file weightedphraselist:
    .Include</etc/dansguardian/lists/phraselists/pornography/weighted_spanish> #ALPHA#
    .Include</etc/dansguardian/lists/phraselists/pornography/weighted_russian> #BETA#
    .Include</etc/dansguardian/lists/phraselists/pornography/weighted_russian-cp1251>
    .Include</etc/dansguardian/lists/phraselists/pornography/weighted_russian-koi8>
    .Include</etc/dansguardian/lists/phraselists/nudism/weighted>
Here you can see that the UTF-8 Russian word list will be automatically converted into a cp1251-encoded file and a koi8-encoded file at build time.
If you take a look at the Russian phrase list file, you will see comments like these:
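The CD build script itself is not shown here, so the following Python sketch only illustrates the naming convention. Both function names are hypothetical. It splits a weighted_LANGUAGE-ENCODING name into its source file and encoding, then re-encodes UTF-8 content.

```python
def split_derived_name(name):
    """Split 'weighted_russian-cp1251' into ('weighted_russian', 'cp1251')."""
    source, sep, encoding = name.rpartition("-")
    if not sep:
        return name, None  # no encoding suffix: an ordinary UTF-8 file
    return source, encoding


def convert(utf8_data, encoding):
    """Re-encode UTF-8 bytes; ASCII comments and brackets survive unchanged."""
    return utf8_data.decode("utf-8").encode(encoding)


source, encoding = split_derived_name("weighted_russian-cp1251")
print(source, encoding)
print(convert("< фото><5> #photo".encode("utf-8"), encoding))
```

The encoding label taken from the file name may need mapping to the converter's own codec name. In Python, for instance, KOI8-R is registered as `koi8_r` rather than `koi8`.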
    < секс чат ><50> #sex chat
    < cекс чат ><120> #sex chat (first 'c' is latin)
    < секс форум ><50> #sex forum
    < cекс форум ><120> #sex forum (first 'c' is latin)
We apparently have the same phrases twice. They look identical but are not, because the letter С in the Cyrillic alphabet does not have the same numerical code as the letter C in the Latin alphabet. As a result, it is possible to write a naughty word by mixing letters from different alphabets, thus bypassing all content tests… This is why this kind of practice is given a very high naughtiness score!
So if your language uses an alphabet that shares letters with another alphabet (visually, not in pronunciation), you should certainly add the naughty words rewritten with mixed alphabets and give these fake spellings a very high naughtiness score.
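Writing mixed-alphabet spellings by hand is error-prone because the letters look the same on screen. As a rough aid, the Python sketch below generates them automatically. The look-alike table is a partial, assumed one for Cyrillic and Latin, written with Unicode escapes so that the two alphabets cannot be confused in the source. The function name is made up for the example.

```python
from itertools import product

# Cyrillic letter -> visually similar Latin letter (partial, assumed table).
LOOKALIKES = {
    "\u0430": "a",  # Cyrillic a
    "\u0435": "e",  # Cyrillic ie
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0441": "c",  # Cyrillic es
    "\u0443": "y",  # Cyrillic u
    "\u0445": "x",  # Cyrillic ha
}


def mixed_variants(word):
    """Yield every spelling of `word` that swaps in at least one Latin look-alike."""
    choices = [(ch, LOOKALIKES[ch]) if ch in LOOKALIKES else (ch,) for ch in word]
    for combo in product(*choices):
        variant = "".join(combo)
        if variant != word:
            yield variant


# "\u0444\u043e\u0442\u043e" is the Cyrillic word for "photo".
for variant in mixed_variants("\u0444\u043e\u0442\u043e"):
    print(variant, [hex(ord(ch)) for ch in variant])
```

Each generated spelling can then be added to the weighted file with a high score. Note that the number of variants doubles with every replaceable letter, so long words may need a shorter selection.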
Copyright © DoudouLinux.org team - All texts from this site are published under the license Creative Commons BY-SA