EU-Funded Anti-majority Artificial Intelligence Watchdogs

A study funded by the European Union to create artificially intelligent anti-majority watchdogs. Here is an excerpt:
Since this document may otherwise disappear from the net altogether I'm posting a copy here.

PS: As you read this document you will notice that nowhere do they cop to the fact that they are using the technique of "profiling" to "discriminate" between "racist" and "anti-racist" text. Guys like "Godless Capitalist" like to point to government-funded technology like this as evidence that "the cognitive elite" will crush any attempt by separatist movements to assert their fundamental human right of freedom of association/self-determination. With such self-deceptive hypocrisy (no, Razib, hypocrisy isn't simply saying you believe something and then not following it; it's preaching something you don't practice, as Godless "Capitalist" does when he uses civil rights, immigration liberalization and "fair" housing laws to gain access to others' territory/property) there can be little doubt that "the cognitive elite" has all the intellectual integrity of brie on a hot summer patio table.

PRINCIP Project
Contract No 2119/27571
Project Deliverable D2.2
Linguistic Features of Racist Documents
January 14th, 2003

Maggie Gibbon1,
Edel Greevy1, Heinz Lechleiter1, Patrick Martin1
Jean-Michel Daube2, Natalia Grabar2, François Rastier2
Monique Slodzian3, Mathieu Valette3
Armin Burkhardt3, Reinhardt Hopfer3
Roswitha Bayer4, Petra Drigalla4, Thomas Kässner4, Nico Lausch4, Franz Stuchlik4, Nico Winkel4

1 Dublin City University (DCU)
2 Institut National des Langues et Civilisations Orientales (INaLCO)
3 Universität Otto-von-Guericke Magdeburg (OVGU)
4
ADI Informatik-Akademie gGmbH (IAMD)

Restricted document

The work of preparing this document was carried out with financial support from the European Union.

Introduction

After a brief re-statement of the aims of the Princip project and the hypotheses underlying its implementation, this paper presents the linguistic features of textual corpora collected from Internet resources. The results we present are not final and will be adjusted and tuned after further analysis during WP2. As explained in Deliverable 2.1, distinctive, racist-specific features will be used as clues for the detection of racist content. These clues will then be characterised by quantitative values (presence, absolute or weighted frequencies). Detection of these clues in Internet documents will be performed with modules and tools such as those presented in Deliverable 3.1.

The Aims of Princip

The main aims of the Princip project lie in the detection of racist content on the Internet. Racist content has been addressed from various angles in previous research. A major part of this research is aimed at the analysis of racism as a social and psychological phenomenon, mainly in an attempt to explain its origin and development. Approaches here include social-cognitive approaches (e.g. Hamilton and Trolier), social identity theory (e.g. Tajfel, Turner, Giles), psychoanalytical approaches (e.g. Ottomeyer, Adorno, Fromm, Horkheimer, Simmel, Reich), political economy (Nikolinakos, Miles), post-modern studies (Hall, Westwood), and Marxist and neo-Marxist theories (Miles, Guillaumin, Taguieff), most of them with varying degrees of interest in the linguistic expression of racist attitudes. The main body of linguistic work on racist language has concentrated on discourse analysis of majority groups within European countries and the USA, with the aim of uncovering tacit or concealed racist attitudes.
This has mainly been achieved by interviewing members of majority groups and subjecting the resulting text to discourse analysis (prototypical example: Teun van Dijk, 1987, 1993¹, with many followers), or by studying existing political texts (M. Souchard, 1997) in order to detect recurrent linguistic units and themes. The Princip project differs from previous linguistic research in the type and source of the texts it studies, in the method it uses and in the aims it pursues. The project deals with (mainly) openly racist attitudes as expressed on the Internet. There has been research into web-based racist language before, but it has generally been limited in scope, using typical pieces of text from particular websites². The Princip project deals with large amounts of text published on the Internet and uses corpus linguistics methods as its primary research tool. The aim of the linguistic studies of the Princip project is to enable an automated multi-agent system to detect racist content without recourse to human input during on-line running. There are a number of assumptions which inform the linguistic analysis of racist language in the framework adopted by Princip:
Levels of Analysis

Different types of analysis focus on different features of the racist corpus. These are linguistic features, features relative to the structure and organisation of the document, and features relative to its location. Linguistic features include the following: character, morpheme, word, collocation, isotopy, POS tag, POS-tagged word, lemma, POS-tagged lemma, sentence, paragraph, document, URL. Features relative to the structure of the document are mainly HTML tags and their attributes. Location features concern the URL and IP address of the document.

Features can be simple or complex. Simple features correspond to units such as character, morpheme, word, POS tag, sentence, paragraph, document, URL. Complex features correspond to combinations of simple units: collocation, co-occurrence, isotopy, tagged word, word in a given HTML tag. Features can be detected in a full-text document and/or in other formats (HTML or tagged formats).

Despite clear linguistic differences between the three languages, their racist discourses and the tools used by each partner, we have obtained similar linguistic features. Some of the features are common to all three languages (words, collocations) while others are mostly present in only one or two languages. As stated above, the more implicit nature of French racism, for example, calls for specific methodological approaches (and consequently specific tools). In the remainder of the document we present each type of linguistic feature, give examples where available, explain how the feature can be detected in the document and briefly discuss its efficiency for the detection of racist content.

1: Character

The character level corresponds to single-character units, such as punctuation, figures, symbols and runes. Characters can be detected in the original text or isolated after HTML-to-full-text conversion or after tokenisation of the document.
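As an illustration, character-level clues of this kind can be counted with a few lines of code. This is a minimal sketch, not the project's actual detection module; the feature names and the helper are invented for the example.

```python
from collections import Counter

def character_clues(text):
    """Count a few hypothetical character-level clues: exclamation
    marks, dollar signs, and tokens in which '$' stands in for a
    letter (as in 'holocau$t')."""
    counts = Counter(text)
    dollar_words = [tok for tok in text.split()
                    if "$" in tok and any(c.isalpha() for c in tok)]
    return {
        "exclamation_marks": counts["!"],
        "dollar_signs": counts["$"],
        "dollar_words": len(dollar_words),
    }

print(character_clues("les juif$ !!! et le holocau$t !"))
# {'exclamation_marks': 4, 'dollar_signs': 2, 'dollar_words': 2}
```

In a real system these raw counts would be normalised by document length before being used as quantitative clue values.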
The analysis already conducted shows that punctuation can be characteristic of a given topic, such as the very frequent use of exclamation marks (!) in French racist documents, or the use of the dollar symbol '$' in certain words, like holocau$t and juif$, in French revisionist and racist documents.

2: N-gram of characters

Detection of linguistically motivated n-grams of characters can be considered close to the detection of morphemes, or to the detection of any type of substring. Depending on the different types of morphemes, different approaches can be used. Roughly speaking, two kinds of approach can be used for the analysis of words and their morphemes in a document: statistical (stemming methods) or simple matching of expected morphemes. Stemming and lemmatisation methods (presented below) can both be considered. The suitable method for each language has to be chosen according to the precision and recall it offers and its computational cost. Depending on the possible positions of the usual morphemes in the words of the studied languages, we distinguish roots, prefixes and suffixes.

Prefix Analysis

A prefix is the first element of a word used in inflection or in word derivation.
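To illustrate the detection step (and the over-generation problem discussed in this section), a naive initial-substring search can be sketched as follows. The prefix and vocabulary lists are reduced examples, not the project's data.

```python
# Naive wildcard-style prefix search: returns every word beginning with
# the substring, including false positives where the substring is not
# actually a morphological prefix (e.g. 'de-' matching 'dead', 'deal').
def words_with_prefix(words, prefix):
    return sorted(w for w in set(words)
                  if w.startswith(prefix) and len(w) > len(prefix))

vocab = ["dead", "deal", "destroy", "deport", "unwanted", "united", "racist"]
print(words_with_prefix(vocab, "de"))  # ['dead', 'deal', 'deport', 'destroy']
print(words_with_prefix(vocab, "un"))  # ['united', 'unwanted']
```

This is exactly why underspecified initial substrings have to be evaluated by hand (or validated against a morphological lexicon) before being used as clues.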
The analysis of English corpora allowed us to isolate the following prefixes (presented in alphabetical order): ab-, ad-, ag-, al-, anti-, be-, com-, con-, counter-, de-, di-, ex-, im-, mal-, man-, non-, ob-, per-, pre-, pro-, re-, se-, sub-, sup-, ultra-, un-.

The tools used on the English corpora (Concordance and WordSmith) have no way of identifying whether these strings are actually being used as prefixes or whether they are simply bigrams within larger morphemes. For this reason, if the initial substring is not long and/or precise enough, the words returned by wildcard searches beginning with this substring can correspond to an erroneous grouping of words. For instance, the result for the initial substring de- includes: dead, deal, destroy. Such underspecified initial substrings have to be evaluated before being used as clues.

The contrastive analysis of the racist and anti-racist English corpora shows that the number of types of words beginning with the highlighted prefixes is 8,669 and 8,526 respectively. There are slightly more types in the racist corpus, but these results are too similar to be useful. When we take a closer look at the frequency of these types within the racist and anti-racist corpora, we see that the frequency of tokens is always higher in the anti-racist corpus. In the appendices we present more detailed information about the distributions of initial substrings.

The analysis of French corpora allowed us to isolate prefixes such as ex-, im- and non- as racist ones, but if we examine each website separately, we observe that only one prefix (ex-) is permanently over-represented in the racist corpora.

Suffix Analysis

A suffix is a non-independent element at the end of a word that is used in inflection or word formation. We have analysed three suffixes in the English corpora: -ist, -ism and -tion. The following table shows that the suffixes -ist and -ism are more specific to the racist corpus, while the suffix -tion is more present in the anti-racist corpus.
Table 2.1: The distribution of some suffixes in the English language corpora

The next table presents more detailed information about these suffixes: the number of occurrences (total number of times they are used in the corpus), the number of types (number of different individual words which contain these suffixes) and the number of hapaxes (number of words which contain these suffixes and are used only once in the corpus):
Table 2.2: Types of suffixation in the English language corpora

Suffix analysis can allow the detection of some grammatical categories with stable formal marks, such as adverbs with -ally and -ement endings (in English and French respectively). Adverbs are one important feature of racist discourse.

In the French corpora, we have isolated the following suffixes as antiracist ones:

And "racist" suffixes are:

and the sometimes pejorative -ards (politicards). We consider
affixed words to be over-represented in the French antiracist sub-corpora because French antiracists use more compound words (historically Latin words) than racists, who give priority to short words. For example, the following chart shows the frequency distribution of short words (three- to five-letter words) in the whole corpus. Each bar represents a set of documents extracted from one specific website (the first bar corresponds to an antiracist website called Droits humains, the second to Hommes et Migrations, and so on). The first nine bars represent antiracist websites and the following nine bars racist websites. The chart shows a general deficit of short words in the antiracist sub-corpus and a surplus in the racist one.

Chart 2.1: Frequency distribution of short words (three- to five-letter words) in the French corpus

Root Analysis

The root conveys the core lexical signification of a word or of a family of words with close meanings. Roots can be seen as statistically obtained stems. Detection of roots is worthy of interest when one wants to group the inflectional and/or derivational variants of a given word. For instance, the inflectional family of the word islam in French is: islam and islams. Its derivational family is much larger: islam, islamique, islamiste, islamiser, islamisation, etc. Each member of this family can have one or more inflectional variants. The global weight of an entire morphological family in the document analysed is potentially more important than the weight of any one of its members.

In the German corpus, keywords (like Abstammung, Ausländer, Fremde, fremd, Front, Jude, Kultur, Mensch, Nation, national, Rasse, System, Volk, etc.) are frequently used in compounds. Since such derivations are used in equal measure by racists and anti-racists, a detailed morpheme analysis has to be conducted to determine the differences in language usage between the two groups. A simple analysis of affixation (prefixes and suffixes) is in many cases not sufficient.
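The idea of weighting a whole morphological family rather than a single member can be sketched as follows, with a naive substring match standing in for real stemming (an illustrative assumption, not the project's morpheme analyser).

```python
from collections import Counter

def family_weight(tokens, root):
    """Group all tokens built on `root` and sum their frequencies,
    approximating the global weight of the morphological family."""
    counts = Counter(t.lower() for t in tokens)
    members = {w: n for w, n in counts.items() if root in w}
    return members, sum(members.values())

tokens = "islam islamique islamisation islam islamiste laïcité".split()
members, weight = family_weight(tokens, "islam")
print(weight)            # 5 -- the family outweighs any single member
print(sorted(members))   # ['islam', 'islamique', 'islamisation', 'islamiste']
```

A production system would use a proper stemmer or lemmatiser here, since bare substring matching over-generates in the same way as the prefix wildcard searches discussed above.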
(For example, the significant morphemes for the word "Jude" are the following: ab+stamm, amerikanisch, arbeit+s, bank, beruf, blut, europäisch, führ+ung+s, gesinnung+s, terrorist, judäo, stämmig, kontingent, kontroll, loge, muster, reform, ver, zentral+rat+s, west, ober, aus+rott+ung+s+these, mächtig+en, chef, führ+er, minderheit, exklusiv, zigeuner, mit, which appear in compounds like: Abstammungsjuden, amerikanischjüdische, Arbeitsjude, Bankjuden, Berufsjude, Blutsjuden, europäischjüdischen, Führungsjuden, Führungsjudentum, Gesinnungsjuden, gesinnungsjüdische, Judenterroristen, judäoamerikanische, jüdischstämmiger, Kontingentjude, Kontrolljude, Logenjudentum, Musterjuden, reformjüdische, verjudet, Zentralratsjuden, Westjuden, Judenausrottungsthese, Judenmächtigen, Judenchef, Judenführers, Minderheitsjuden, jüdisch-exklusive, Zigeunerjuden, Mitjuden.) See Appendix 3 for further examples of significant racist morphemes.

3: String

A string is any token delimited by blank characters in the text that can then be isolated. Note that a tokenised document contains more isolated strings than a standard full-text document (mainly due to the separation of punctuation). Note also that, depending on the techniques and tools used during on-line running, strings and substrings can be matched with the same or with different modules.

Linguistically and lexically motivated strings, or words, can correspond to different grammatical categories: nouns, verbs, adjectives, adverbs, determiners, prepositions, pronouns, etc. Each of these categories conveys a very specific meaning and plays a specific role in the document, its organisation, its argumentation, etc.

The most frequently used features are largely consistent with the distribution of words in standard English usage (e.g. a, of, the, to, and, for, is, that, in, are). By itself, this information does not indicate whether these lexical features are predictive of racist discourse. However, their comparison with the anti-racist corpus and with standard English usage (the British National Corpus) indicates which lexical units are likely to be robust and useful indices of racist content (see appendix 1 for more detail). Hence, in the English corpora there are 509 words which are used 10% or more frequently (consistently) in racist texts than in anti-racist texts. The linguistic and social categories to which these words belong are consistent with theoretical models of racist speech: there are many nominalisations of ethnic, racial or national groups (whites, Jews, Americans) and, more interestingly, many words strongly associated with the type of over-emphasised argumentation discourse (even, course, ever) which is typical of self-conscious minority belief-holders.

Graph 3.1: Consistency/frequency comparison of lexis in English language racist and anti-racist corpora

The French corpora present similar examples, with words such as rien, assez, grand and jamais. These two types of words are 30% more likely to be found in English racist texts than in anti-racist texts (see appendices 1 & 2 for more detail).

Determiner Usage

Essentialising
and reductive logic is typical of stereotyping language, such as is used heavily in racist discourse. With regard to the use of determiners, this manifests itself in the choice of the definite singular the in preference to the indefinite a in conjunction with members of target groups, as the former is more consistent with the argument that the subgroup has singular and unchanging characteristics and that the degree of difference between members of the subgroup, with regard to their defining (negative) characteristics, is nugatory. Hence the recurrence of expressions such as the jew which, somewhat counter-intuitively, are used to refer to unspecified, undetermined, mythic examples of Judaism. This last point is borne out by the fact that the Jew (with the singularising determiner) is a more frequently used expression than the Jews. This preference for the singular over the plural for identity-marking nouns does not hold when reference is being made to the in-group. To illustrate these considerations, we present the following table, which contains raw frequencies of occurrences of the and a and their relative distributions (1L means 1 word to the left).
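A minimal sketch of such a 1L count is given below; the helper and the toy token list are hypothetical, standing in for the project's corpora.

```python
def determiner_counts(tokens, noun):
    """Count 'the' versus 'a'/'an' occurring one word to the left (1L)
    of `noun` -- a simplified version of the comparison described above."""
    low = [t.lower() for t in tokens]
    counts = {"the": 0, "a": 0}
    for i in range(1, len(low)):
        if low[i] == noun and low[i - 1] in ("the", "a", "an"):
            counts["the" if low[i - 1] == "the" else "a"] += 1
    return counts

sample = "the jew controls a jew while the jew decides".split()
print(determiner_counts(sample, "jew"))  # {'the': 2, 'a': 1}
```

The same window logic generalises to any 1L/1R collocation count by varying the offset and the word list being matched.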
Table 3.1: Comparison of word frequency at 1L to the and a in racist and anti-racist English corpora

wh- and qu-word Usage

The wh- words in English and the qu- words in French, which are broadly equivalent, are characteristic of racist speech. The wh- English
words, such as what, when, where, who
and why, can correspond to the interrogative, relative or indirect
pronouns. The following figure shows that 58.51% of all wh- words
are to be found in the racist corpus. The greatest disparity is in the
use of why and what. These words are not often used as
interrogatives but as indirect pronouns. Because of the phrase structures
typically associated with indirect pronouns, our analysis of the usage
of wh- is centred on the collocation of nouns to the left and verbs
to the right of each term. This demonstrates prototypical racist use
of indirect pronouns.

Graph 3.2: Overview of consistency of usage of common wh- interrogatives in racist and anti-racist English corpora

The "over-emphatic presence" of wh- words in racist discourse is not sufficiently high to merit criterion status. However, as with many other cases where certain types of speech are over-represented in racist discourse, it allows us to be more confident about the clues we discern from its pattern of usage within racist texts. We present here the list of words which appear one place to the left of who (mainly nouns): Jew, Gentiles, Non-whites, Parasites, Any, Female, Colored, Nigger, Niggers, Christ, Enemy, Foreigners, Goyim, Homosexuals, Jewess, Mexicans, Millions, Powers, Southerners; and the words which appear one place to the right of who (verbs): Control, Created, Built, Love, Opposed, Raped, Support, Bear, Enters, Fail, Dwell, Pay, Settled, Wishes, Advocated, Appear, Buy, Conquered, Feed. More detailed case studies of the interrogative usage of who, what, when and why are presented in appendix 1.

In French corpora,
qu- is highly represented by the conjunction/relative pronoun que, as we can see in the following chart, which shows the frequency distribution of que in the whole corpus. As before, each bar represents a set of documents extracted from one specific website. The first nine bars represent antiracist websites and the following nine bars racist websites. The chart shows a very significant surplus of que in seven of the nine racist websites.

Graph 3.3:
"que" distribution in the whole French corpus

The nearby thematics (i.e. the most frequent words in close context) of que:

Distance   Corpus   Excerpt   Word
61.25      34667    4074      QUE
39.24      13039    1573      NOUS
16.70      40088    2617      EST
15.04      16896    1219      NE

The thematic research shows relations between words. One can observe that ne comes in fourth position, after nous and est. We can consider that que fits into a "ne … que" structure (for instance: "Les initiatives, propositions et conférences internationales ne sont que pertes de temps et tentatives vaines", i.e. "International initiatives, proposals and conferences are nothing but wastes of time and vain attempts").

Graph 3.4: In blue
ne; in red, que (que distribution in the whole corpus)

Argumentation Words and Structures; Hedging and Facticity

Analysis of racist discourse reveals language use typical of minority belief-holders with conversionary zeal, in that it exhibits a disproportionate use of absolute truth claims. Minority social groups and belief-holders conceptualise truth as something which has been repressed or 'concealed' through socialisation processes and global information control at the hands of their chosen out-group. This paranoiac worldview leads them to return to fundamental principles of truth and falsehood. The tendency to claim ownership of the truth (to arrest the demonisation of their belief community and of themselves) is evidenced by the disproportionate use of words such as certain, fact, truth, knowledge, etc. (in English) and angeblich, aufdecken, behaupten, Fakten, Tatsachen, wahr, Wahrheit, etc. (in German). Examples of noun phrases specific to the racist content in the German corpus are: Nationaler Aktivist, Ruhm und Ehre, nationaler Widerstand, nationale Opposition, nationale Kräfte, frei sozial und national, mit kameradschaftlichem Gruß, Tag der nationalen Arbeit, unser Kampf, Deutsche Volksgemeinschaft.

There are various sociocognitive and psycholinguistic reasons for this: the individuals are reconciling their socially vilified belief system by placing it in a rational context through the use of standard rhetorical and persuasive language. Because their beliefs are not socially accepted as truth, they must explicitly represent their beliefs as such; whereas non-racist discourse participants, when speaking of the same matters, have no need to announce the truthfulness of what they are about to say, since it is part of a socially sanctioned and validated belief system.

Because racists speak from within a minority belief system, addressing itself to those who hold different beliefs and assumptions about the issue of race politics, there is a greater tendency within racist discourse to hedge, or palliate, statements which the author knows are socially non-normative. Although this seems to contradict the evidence of a greater frequency of strong truth claims, the two processes actually complement each other: the expression of doubt is tacit (through the use of words such as perhaps, maybe, almost, quite, nearly) and, in fact, many hedge words allow the author to encourage a reader to entertain provisional worldviews (often racist ideas) and explanations of social phenomena which would be easily dismissed if phrased as strong truth claims.

Graph 3.5:
Distribution of truth-claim lexis in the English language corpus

The presence and frequency of argumentation lexis in the English language corpus, and the syntactic patterns which form around such words, are represented in appendix 1.

4: Local Grammar and Collocation

Collocation is a restrictive kind of local grammar which allows us to extend our criteria beyond basic word detection. In this instance, where a great deal of the lexis is shared by opposing discourses, it is also highly useful for disambiguating between different types of context-dependent word usage without requiring structural analysis of the texts. Detection of collocations can be done by separate detection of each word followed by ordering and reconstitution of the collocation, or by detection of the complete collocation. In both cases, inflectional and derivational variants can be taken into account.

When performing collocation analysis, one must specify two things: the headword and the distance parameters within which collocated words are to be indexed. These parameters may greatly affect results. Words which appear immediately to the right and left of the keyword are the most instructive; as the parameters expand, the list of collocates becomes more like the tokenised list for the whole corpus (as it becomes a more sizeable fraction of it). The most useful information about racist discourse is to be found in close proximity to the chosen keywords (one word to the left and one word to the right of the keyword). Collocations often correspond to noun phrases, but also to verbal and adjectival phrases, as well as to any neighbourhood of words. One specific type of collocation is the proper noun.

Noun Phrases

The analysis of the frequency and distribution patterns of noun phrases containing specific words such as white, race, black, Jew, etc., but also grammatical words such as we, our, they, their, etc., across the corpus may be helpful in formulating clues: our civilization, our race, white pride, their beliefs, white american genocide, white flight, white genocide OR suicide, white ghettoisation, white knights. Examples of noun phrases specific to the racist content in the German corpus are: Nationaler Aktivist, Ruhm und Ehre.

Verb Phrases

Collocation analysis of verbs enables us to create another type of paradigm of the syntactic structure and lexical company in which racist keywords are typically found. For example, the verbs that racists use in conjunction with words such as truth, fact, course and knowledge are completely different from those used in anti-racist discourse. A more elaborate and complex type of verbal phrase can correspond to an entire sentence, like: To be born white is a privilege; To be born white is an honour and a privilege.

Proper Nouns

Names of authors and racist organisations fall under the category of proper nouns. One approach anti-racist organisations adopt in order to fight hate is the exposure of hate groups and racist organisations, the objective being to educate and inform people about the activities and beliefs of racists. This would account for the high frequency of recurring names of authors and racist organisations within the anti-racist corpus. One reason for the low frequency in the racist corpus is that in many cases racist authors do not believe they deserve the label 'racist' and therefore do not associate themselves with hate groups or racist organisations, despite the fact that their views may tally with the views of such groups. An organisation such as GRECE (in France), which is known to inspire racist ideology, is mentioned more in anti-racist websites.
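The mention-frequency comparison described here can be sketched as a simple per-corpus count (toy token lists standing in for the real corpora, which are of course far larger).

```python
from collections import Counter

def mention_comparison(names, corpus_a, corpus_b):
    """For each proper noun, return its raw frequency in two corpora,
    e.g. (anti-racist count, racist count)."""
    ca = Counter(w.lower() for w in corpus_a)
    cb = Counter(w.lower() for w in corpus_b)
    return {name: (ca[name.lower()], cb[name.lower()]) for name in names}

antiracist = "the group GRECE and GRECE again".split()
racist = "GRECE once".split()
print(mention_comparison(["GRECE"], antiracist, racist))  # {'GRECE': (2, 1)}
```

Multi-word names (e.g. full author names) would need phrase matching rather than single-token counting, but the comparison logic is the same.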
In the same way, authors such as Alex Curtis, David Duke, David Irving and Don Black, who are either founders of racist organisations or prominent figures within them, are mentioned more in the English language anti-racist corpus. However, other authors, such as Elisha Strom, Kevin Alfred Strom, Elena Haskins, H. Millard, Per Lennart Aae, Martin Laus, Holger Apfel, Günter Battke and Jürgen Gerg, who are not so well known in the public domain, have a higher frequency in the racist corpus. Someone like Guillaume Faye, a theoretician of racism, is mentioned more in the racist corpus. See appendix 1 for further information on collocation in English.

5: Isotopy

An isotopy is a set of words (or other linguistic features, such as morphemes) and/or collocations with strong semantic relationships, which can be considered a group of synonyms. This is a complex type of feature: detecting an isotopy depends on the separate detection of each element it contains. The presence of all elements of an isotopy is not compulsory, as each element refers to the overarching isotopy. Isotopies are built during the linguistic analysis of the corpora (and/or result from adapting existing synonym resources, like WordNet and its European language counterparts). Isotopies often correspond to recurring themes of racist discourse: the collective character of the enemy (group, organisation, network), danger, destruction, drugs, etc. Being specific to racist discourse, they do not always correspond to the normal associations between words; if existing synonym resources are used, they then have to be adapted.

An example of a derogatory isotopy (people as animals) in a short French text is formed by the terms "Puces" (i.e. "flea market", in quotation marks), vermines ("vermin"), à 4 pattes ("on all fours") and some grammatical constructions (the uncountable plural).

6: POS tag

Part-of-speech (POS) tags reflect the grammatical level of the language: lexical units are replaced by their grammatical labels: noun, verb, adjective, adverb, determiner, pronoun, etc. This allows us to analyse at the meta-linguistic level. POS tags are available only in the tagged document, and the matching of POS tags is then performed by the normal string-matching module. The performance and robustness of the POS tagging tool have to be tested on the documents we are working with, because these documents often present very particular vocabulary and syntactic structures.

Our linguistic studies show that in the English racist corpus there is a greater usage of adjectives in specific syntactic slots (particularly before key nouns). The French corpora seem to present a more frequent usage of verbs in the racist corpus, and a more frequent usage of nouns (and adjectives) in the anti-racist corpus. Some syntactic patterns (noun phrases, noun-noun constructs, the types of verbs occurring near nouns, the adjectives used to describe nouns) also seem to be of interest in characterising racist discourse.

POS tags can
also be analysed as substrings. In this case, we aim at rough grammatical categories (V-, ADJ, N-, etc.) without considering the morpho-syntactic features of words. Their detection is then performed with substring detection techniques.

7: POS-tagged Word

The combination of lexical and POS clues provides a kind of disambiguation for words which can belong to different POS categories. In French, for instance, the word juif can be either a noun or an adjective. In the following example, this word is characterised by its POS tag: juif/SBC:m:s, where SBC denotes the noun category and m and s are its morpho-syntactic features: masculine gender and singular number. POS tagging thus gives the grammatical characterisation of the word analysed. Detection of tagged words is performed with string or substring matching modules, and can be done only on POS-tagged documents. For this type of linguistic feature and clue, the performance and robustness of POS tools have to be tested. In the French racist and revisionist corpora the word juif seems to be used mostly as a noun, and as an adjective in anti-racist documents. These findings have emerged from human linguistic analysis, and we have doubts as to the precision software tools can offer in the analysis of such fine grammatical distinctions.

The other possible
use of combining POS tags and lexical clues emerges from the preliminary results obtained on the English corpora, where this combination seems to offer a very promising perspective. One example is the presence of ADJ + truth, fact or knowledge, or the combination ADJ + ADJ, in a document.

8: Lemmas

When a word form has no inflectional marks left (gender, number, case, tense, mood, etc.), it is called a lemma. The lemma also corresponds to the word forms used as dictionary entries: the singular (nominative, masculine) form for nouns and adjectives, the infinitive form for verbs. The matching of lemmas is performed with string or substring matching modules. Reducing inflected words to their lemmas allows one to group all their occurrences, which has a direct influence on the frequencies of these words. Of course, differences of tense, number or other inflectional forms can also be considered meaningful for the characterisation of racist and revisionist content.

The calculation of lemmas is not always a successful operation. Hence we have to compare the results obtained with POS tagging, lemmatising tools and stemming algorithms, and then choose the best approach for each language processed.

9: Tagged Lemma

The combination
of a lemma and its POS tag basically allows the disambiguation of lemmas which can belong to several grammatical categories. The matching of lemmas and their POS categories is performed with string or substring matching modules.

10: HTML tags

HTML tags carry information about the structure of documents, their layout and the links they have to other documents (hyperlinks). The detection of isolated, meaningful HTML tags can be done with matching modules. String matching allows the detection of complete HTML tags (H1, H2, H3, TR, TD, UL, OL, BR, HR); substring matching allows the detection of subtags (H-, T-, -L). In a more complex way, one can analyse a given HTML tag and its attribute values. For instance, a meaningful colour combination in German racist documents is black, white and red; the same seems to be pertinent for some French racist documents. If a complete HTML tree is required, one has to use an HTML parsing tool. This remains optional because of its cost in time and effort.

Hyperlink Analysis

The English corpus shows that the presence and frequency of hyperlinks is much lower in the racist corpus (5,819) than in the anti-racist one (22,056). English racist web pages are generally minimalist in appearance: dynamic HTML, style sheets, JavaScript and similar dynamic features are kept to a minimum. Hyperlinks may be more common in the anti-racist corpus because many of the anti-racist pages were found on domains that impose structure on their web pages (e.g. online newspapers and anti-racist organisations) and that take accessibility and user-friendliness into account. They want to make it easy for readers to find or link to other material; as a result there are many menus and hyperlinks to other documents, regardless of whether these documents are related. Racists, on the other hand, do not provide links in such abundance and may prefer to keep readers tuned into their own articles. There are, apparently, variations between languages: some German racist web sites are of a high technical standard and make use of multimedia effects (music and video clips).
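The tagged-word and tagged-lemma clues of sections 7-9 above reduce to plain string or prefix matching over form/TAG pairs. A minimal sketch, assuming tokens already carry tags in the juif/SBC:m:s format described in section 7 (the sample phrase and helper name are our own illustrative choices, not the project's modules):

```python
# Tagged tokens pair a word form with its POS tag and
# morpho-syntactic features, e.g. "juif/ADJ:m:s". The SBC/ADJ
# tag names follow the juif/SBC:m:s example in the text; the
# sample phrase below is purely illustrative.
TAGGED = "le/DET:m:s peuple/SBC:m:s juif/ADJ:m:s".split()

def match_tagged(tokens, form, tag_prefix):
    """Return the tokens whose word form is `form` and whose
    POS tag starts with `tag_prefix` (substring/prefix matching,
    as the deliverable describes)."""
    hits = []
    for tok in tokens:
        word, _, tag = tok.partition("/")
        if word == form and tag.startswith(tag_prefix):
            hits.append(tok)
    return hits

print(match_tagged(TAGGED, "juif", "ADJ"))  # adjectival usage found
print(match_tagged(TAGGED, "juif", "SBC"))  # no noun usage in this sample
```

The same prefix test over the tag field serves both the tagged-word clue (section 7) and the tagged-lemma clue (section 9), provided the tokens hold lemmas rather than surface forms.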
It is not uncommon to find links to anti-racist web sites; they serve the purpose of knowing one's adversaries. Anti-racist websites, on the other hand, refrain from linking to racist web sites in order to avoid unwanted advertising of racist material. First results seem to show the existence of web rings (e.g. Portalseiten, portal pages), where for the most part racist web pages link to other racist pages and anti-racist pages link to anti-racist pages. The French corpora and material show that racist web rings present a more complex structure than anti-racist ones.

11: Sentence

The bulk of linguistic studies are performed at the word level, while a complete meaning is created at the level of the sentence. To deal with sentences, one first has to detect them with a specific linguistic module. Information about sentences can be used in various ways: detection of co-occurrences and collocations within the same sentence, statistics on the sentence, or the sentence treated as a complex lexical collocation. Matching of collocations and co-occurrences in the same sentence is performed with matching modules; statistical analysis of sentences is performed with counting, averaging and weighting modules. The paragraph can be considered the basic unit for the POS tagging module, since this module is time-consuming. When the sentence itself appears as a complex lexical collocation, as in “I hate niggers.”, its detection
is done with string matching modules.

12: Paragraph

The paragraph level is more directly attainable than the sentence level, for instance through analysis of the HTML code of the document. It can be taken into account in the same way as the sentence level: detection of co-occurrences and collocations within the same paragraph, and statistics on the paragraph. Detection of paragraphs can be done by analysing the HTML code (detecting certain meaningful HTML tags) or after HTML-to-full-text conversion; one can then use matching and statistical modules. The paragraph can be considered the basic unit for the POS tagging module, since this module is time-consuming. First results obtained on the French corpora show that the average number of paragraphs in racist documents is lower (110.78) than in anti-racist (137.7). And the average number of words per paragraph is also lower in racist documents (10.1) than in anti-racist (14.99). In English corpora, the results are not the same: the average number of words per paragraph in the racist corpus (83.89) is almost twice as big as the anti-racist (48.83).

13: Document

The document level corresponds to the entire document and all the information it contains. During on-line operation, the aim will be to verify whether clues from our knowledge bases are present in documents newly collected from the web, and then to decide whether a document contains racist or revisionist content. Linguistic, matching and statistical modules will process the entire document, except perhaps POS tagging which, being time-consuming, could be applied to a single sentence or paragraph. The other way to exploit the information contained in a document is to produce general statistics at the document level: the number of words, characters (bytes), paragraphs, etc in the document, and their average values. First results from the analysis of the English corpora are presented in Figure 14.1, which shows the size of the English corpora in terms of the number of documents and words. For sub-corpora of comparable size in terms of the number of words, the number of files may vary considerably: the anti-racist corpus contains one third more files than the racist one, which also indicates that racist documents are longer than anti-racist ones (see Appendix 1 for more detail on the statistical differences in the English corpora).
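The paragraph- and document-level statistics described in sections 12 and 13 come down to splitting a document on paragraph-marking HTML tags and counting words. A rough sketch in Python; the choice of boundary tags and the function name are our own illustrative assumptions, not the project's actual modules:

```python
import re
from statistics import mean

def paragraph_stats(html):
    """Return (number of paragraphs, mean words per paragraph)
    for an HTML document, treating <p>, <br>, <div> and <li>
    openings as paragraph boundaries (an assumed tag set)."""
    chunks = re.split(r"(?i)<\s*(?:p|br|div|li)\b[^>]*>", html)
    # Strip any remaining tags, tokenise on whitespace,
    # and keep only non-empty paragraphs.
    paras = [re.sub(r"<[^>]+>", " ", c).split() for c in chunks]
    paras = [p for p in paras if p]
    return len(paras), mean(len(p) for p in paras)

doc = "<html><p>one two three</p><p>four five</p></html>"
print(paragraph_stats(doc))  # (2, 2.5)
```

Running such a counter over each corpus and averaging per document would yield figures directly comparable to the per-language averages quoted above.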
Fig. 14.1 – Overview of English corpus size

In the French corpora, the average number of words in racist documents is lower (1,121) than in anti-racist documents (2,064); racists thus seem to produce shorter documents. A similar tendency can be observed in the German corpus (racist = 935; anti-racist = 1,230; revisionist = 2,784; anti-revisionist = 1,199).

Hapaxes within the English Racist Corpus

A hapax is a word which is used only once within a document (or a corpus, or a paragraph, according to the level considered). The number of hapaxes varies according to the linguistic processing performed: a full-text document will contain more hapaxes than a lemmatised or stemmed one. In the English-language corpora, there are 16,958 hapaxes specific to the racist corpus (when compared with the anti-racist corpus), out of a total of 37,061 types in the corpus. Most of these words are used too infrequently to be of use (16,130 appear in fewer than 5 files). However, 102 hapaxes each appear in more than 10 separate texts and between them comprise 1,376 instances of usable low-level criteria for categorising the documents. The hapaxes share many of the features commonly discerned in critical readings of racist discourse: fear of the multiplicity of the ethnic out-group (multiply, takeover, teeming); the kind of nominalisation associated with mythic stereotyping (Jewess, goy, mestizo); and the attribution of negative and essentialist characteristics to such out-groups (insanity, wickedness, superstition). At present, the applicability of these unique lexical markers depends on further testing to find ways of disambiguating their use in racist discourse from that in non-racist discourse (their non-usage in anti-racist discourse having been established).

14: URL

The URL concerns the location at which a document was found on the web. The detection and analysis of a URL inside a document is performed through analysis of the HTML structure and hyperlinks, and through the matching modules. The URL provides useful information in that certain domains are known to contain primarily racist material (e.g. Stormfront.org, sos-racaille.org, aaargh.com). Documents located at these domains can then be considered racist or revisionist with very little verification of other clues; in other words, the confidence rating of the domain-URL clue can be considered relatively high. The domain
URL can also be used to compute the IP number of the provider, since some providers clearly specialise in hosting racist and revisionist sites.

15: Conclusion

The classification of linguistic features presented in this Deliverable aims at grouping and organising the clues found during the linguistic research for the knowledge base. It reflects the linguistic levels and units of documents which seem to be common to the three languages. When new types of clues emerge during further research, they will be incorporated into the existing linguistic knowledge base. In building the corpora, we have tried to adopt methods as open and comprehensive as possible (described in Deliverable 1.1). But, on the one hand, the general search engines which were used do not index all web pages; and, on the other, racist and revisionist content evolves and changes with political, ideo

Comments:

2
Posted by Desmond Jones on Wed, 20 Sep 2006 19:10 | # Will this be part of the transhumanism computer model? 3
Posted by Laban on Wed, 20 Sep 2006 21:24 | # There seems to be more of this document - any chance of getting the rest up? 4
Posted by Boris on Thu, 21 Sep 2006 00:34 | # From now on I will no longer consider myself a separat…, I would instead become pro-separation. I have personally never used the n word as I believe the message can be brought across without ‘names’. BTW whatever happened to words cannot hurt me, but stones and sticks will? Ask the Lebanese if they’d mind being called goyim or would they rather get carpet bombed? 5
Posted by Nick Tamiroff on Thu, 21 Sep 2006 04:32 | # Christ, this is scary - George Orwell has to be laughing in his grave. I just printed this crap out, and will try to digest it later, when I'm totally inebriated. In the meantime, let me say - Nigger, Nigger, Nigger!!! Faggot, Faggot, Faggot!!! Raghead, raghead, raghead!!! Liberal, Liberal, Liberal!!! Illegal alien, Illegal alien, Illegal alien!! It may soon be the last time I'm allowed to speak such words, as we further fuck up the First Amendment - that's part of the Constitution of our REPUBLIC (for you shit-heads who think this country was founded as a democracy) 6
Posted by Rnl on Thu, 21 Sep 2006 07:24 | # First results obtained on the French corpora show that the average number of paragraphs in racist documents is lower (110.78) than in anti-racist (137.7). And the average number of words per paragraph is also lower in racist documents (10.1) than in anti-racist (14.99). In English corpora, the results are not the same: the average number of words per paragraph in the racist corpus (83.89) is almost twice as big as the anti-racist (48.83). Translation: Our statistics on “racist paragraphing” have turned out to be completely worthless, which is depressing, since we spent so many hours collecting them. A skeptic could have warned us that the chances were remarkably small that the size of “racist paragraphs” would be significantly different from the size of “anti-racist paragraphs,” but we’re not skeptics, just sinister idiots with too much time on our hands. Adverbs are one important feature of racist discourse. Which isn’t surprising, since they appear in a large percentage of sentences. More significantly: Analysis of racist discourse reveals language use that is typical of minority belief-holders with conversionary zeal in that it boasts a disproportionate use of absolute truth claims. Minority social groups and belief holders conceptualise truth as something which has been repressed or ‘concealed’ through socialisation processes and global information control at the hands of their chosen out-group. This paranoiac worldview leads them to return to fundamental principles of truth and falsehood. The tendency to claim ownership of the truth (to arrest the demonisation of their belief community and selves) is evidenced by the disproportionate use of words such as certain, fact, truth, knowledge, etc. In other words, these “racists” make regular appeals to evidence. 
Appeals to evidence are here conceptualized as a “return to fundamental principles of truth and falsehood.” On this point the report’s authors are (to use three markers of “racist discourse” in two words) certainly correct. Racialism does, in fact, often appeal to facts and truth, since racialists are convinced, rightly or wrongly, that facts and truth are on our side. “It is widely assumed within the mainstream media that there are no socially significant differences among the various races. In fact, that assumption is false, and here is the evidence ...” “It is often stated in the mainstream media that Islam is a religion of peace. In fact, that claim is false, and here is the truth ...” Both of those sentences, and any longer argument that elaborated on them, would rank high in the textual features that the authors have identified as symptoms of “racist discourse.” The authors of the report are, to use their own language, speaking from the perspective of a dominant discourse. They are attempting to pathologize dissent from this dominant discourse—whose dominance, as they casually note earlier, is often enforced by law—by identifying a distinct form of “racist discourse” which exists in contrast to anti-racist discourse, the privileged discourse in their analysis. That this “racist discourse,” in contrast to anti-racist discourse, regularly makes truth claims becomes evidence of its “racism,” because the privileged discourse, as their analysis has apparently demonstrated, makes such claims less frequently. “Racist discourse” is therefore marked by “a disproportionate use of absolute truth claims”—disproportionate, that is, in comparison with anti-racist discourse, not in comparison with (say) a physics textbook or the Summa Theologica. Thus the fewer truth claims that anti-racists make, the more “disproportionate” every truth claim in a “racist text” becomes. That’s both bizarre and stupid. 
Of course any minority discourse attempting to attack a dominant discourse will, if its advocates have confidence in its validity, inevitably speak in exactly the language the authors have identified as evidence of paranoiac “racist” speech. (Did anyone spot the signs of “racist discourse” in the preceding sentence? There were two in the first three words alone. It must be a real challenge to write anti-racist sentences.) Because of the fact that Racists speak from within a minority belief system, addressing itself to those who hold different beliefs and assumptions about the issue of race politics, there is a greater tendency within racist discourse to hedge, or palliate, statements which the author knows are socially non-normative. I’ll take a wild guess that VNN wasn’t in their corpus of WN websites. 7
Posted by Kenelm Digby on Thu, 21 Sep 2006 12:57 | # Honestly, I couldn't give a sh*t. 8
Posted by James Bowery on Thu, 21 Sep 2006 14:46 | # The thing that is most frightening about this isn't the technical expertise demonstrated (it isn't sophisticated), nor even the fact that they are profiling “racists”. The thing that is disturbing is the use of funding by the European Union to attempt to filter for “racist attitudes”. Aside from the base hypocrisy of using profiling to discriminate between “racists” and “anti-racists”, they are saying that “attitudes” which would have characterized the vast majority of the people who opposed Hitler during WW II are legitimately targeted by the government. 9
Posted by Guessedworker on Thu, 21 Sep 2006 15:00 | # The legitimacy of the government is derived from the consent of the people. There is nothing legitimate in this Marxist hate campaign against the expressed interests of the majority. It is this fundamental absence of legitimacy which entitles the present majority to take back its land, its language and its legal rights at any time and by any means, before or after the passage into minority status in its own homelands. 10
Posted by proofreader on Thu, 21 Sep 2006 15:49 | # The real plan is to monitor private e-mails (presumably of a “racist” nature, i.e. opposition to the EU regime) with the software they're developing based on these corpora. Scary, were it not for the obvious fact of the inanity of the work published. 12
Posted by Rnl on Sat, 23 Sep 2006 23:13 | # James Bowery wrote: Aside from the base hypocrisy of using profiling to discriminate between “racists” and “anti-racists”, they are saying that “attitudes” which would have characterized the vast majority of people who opposed Hitler during WW II, are legitimately targeted by the government. Churchill himself, the most prominent of the anti-nazis, shared this “racism,” as Auster pointed out several days ago:
14
Posted by Rnl on Sun, 24 Sep 2006 05:31 | # proofreader wrote: The real plan is to monitor private e-mails (presumably of a “racist” nature, i.e. opposition to the EU regime) with the software they're developing based on these corpora. Scary, were it not for the obvious fact of the inanity of the work published. Unless the authors of the report are remarkably dull, which I suppose is possible, they can't seriously be planning to turn their research loose on “racist” e-mail communication. If they wanted to detect racialist _content_ in e-mail or other electronic texts, they would concentrate on nouns. The task would not be complex. Does a text contain certain keywords like race, immigration, nationalism, black, white, IQ, crime, Muslim, mestizo, etc? If it does, the chances are good that the writer is expressing impermissible racialist beliefs. Most of the data they have accumulated would be useless for that purpose. Searching for a large cohort of adverbs or the “interrogative usage of who” would not distinguish racialist texts from normal discourse. They claim it would, but it wouldn't. And there is no possibility whatever that they could distinguish the “implicit rhetorical construction[s] (euphemism, antiphrasis)” in racialist e-mail from implicit rhetorical constructions in non-racialist e-mail. Any software designed on the data in their report would be just as likely to criminalize participants in a bridge-players listserv. They don't distinguish racialist texts by analyzing their content because they want to promote the idea that something called “racist discourse” can be identified on the basis of formal features alone. That's a sign of people who fear evidence, so they prefer not to discuss it. 15
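The keyword test Rnl sketches in this comment amounts to checking a text's vocabulary against a term list; a minimal illustration (the keyword set is a hypothetical subset of the terms he names, and the helper name is ours):

```python
# Flag a text if it mentions any term from a keyword list,
# as Rnl's noun-based sketch suggests. Purely illustrative;
# real content analysis would need far more care.
KEYWORDS = {"race", "immigration", "nationalism", "iq", "crime"}

def mentions_keywords(text):
    """Return the sorted keywords present in `text`, after
    lowercasing and stripping common punctuation."""
    words = {w.strip(".,;:!?\"'").lower() for w in text.split()}
    return sorted(words & KEYWORDS)

print(mentions_keywords("Immigration and IQ were discussed."))
# ['immigration', 'iq']
```

As Rnl notes, such a filter detects subject matter, not stance: it would fire equally on texts arguing either side of the topics listed.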
Posted by James Bowery on Sun, 24 Sep 2006 17:23 | # Rnl writes: The researchers began with a collection of “racist texts” culled from WN websites. So far so good. Identifying these texts as “racist” was not problematic, nor should we expect that it would be. It was problematic only in that the word “racist” has multiple senses—ambiguity that is profitably exploited in “anti-racist” propaganda. In other words, if you research race differences of any kind, you can easily be tarred as someone who would sail to Africa, chase down a hapless native, throw a net over him, tie him up, shackle him, throw him in the hold of a diseased slave ship and whip him into submission if he ever even showed the slightest inclination to resist your will. Both the researcher and the slaver are “racist”. It's easy, from their perspective, to identify a “racist” text. That's more like it. They can't define “racist”, otherwise they might not be able to “justify” chasing down, throwing a net over, shackling, etc. a researcher of race differences. But they know a “racist” when they see one. You need only look at its location on the Internet. Again… good so far as it goes. and its subject matter. They intend to detect new locations that are “racist”. This is the entire point of their having focused on discriminating between “anti-racist” and “racist” texts in their training corpora. Both have the same subject matter. Any policeman equipped with a dictionary could do the same. A policeman equipped with a dictionary would have the following to work from (Wordnet): rac·ism: 1. The belief that race accounts for differences in human character or ability and that a particular race is superior to others. So the policeman sees some guy researching race differences. He cannot conclude that the researcher is guilty of “racism” using sense 1, but what about sense 2? He looks up “discrimination”: dis·crim·i·na·tion: 1. The act of discriminating.
16
Posted by James Bowery on Sun, 24 Sep 2006 18:01 | # Leaving aside the totalitarian impulse percolating throughout this report, the authors' method is seriously flawed, as I noted above. It's foolish to suggest that a distinct “racist discourse” can be identified by its departures from the linguistic features that characterize an anti-racist corpus culled from anti-racist websites. Don't conflate the burden of proof required to convict with the more nuanced features used by everyone to make practical discriminations, day to day, hour to hour, minute to minute and second to second. This is something neurons do. But it's quite possible that a police officer or a judge wouldn't agree. A judge needn't be so ridiculous (although the history of the Judiciary here clearly shows they go above and beyond the call of duty striving to best their peers for the title of most ridiculous) in order to wreak terrible havoc using this sort of filter. Basically, all he has to do is approve the means by which law enforcement officers and prosecutors find the suspects whom they then bring before him. For example, even though racial profiling would render much police work far more effective, courts throw out most cases where racial profiling was used to bring the suspect before the bar. The argument here is not that profiling fails to establish guilt beyond a reasonable doubt, which is certainly true, but that profiling itself is an illegitimate form of perception (even though all perception is profiling at some level). All the courts have to do is say that it is legitimate to use this kind of profiling (not calling it “profiling” of course) to discriminate (not calling it “discriminate” of course) “racists” (not defining exactly the sense of “racist” of course) when dredging for suspects.
Once the suspects have been identified, the system can leave it up to the mushy “I know it when I see it and truth is no defense anyway.” mentality of the politicos populating the law enforcement and judiciary to do the rest of the dirty work. 17
Posted by Rnl on Mon, 25 Sep 2006 05:59 | # I probably shouldn’t waste further time on this report, but I’m impressed by its sinister stupidity. Since racist and anti-racist materials share some similarities (for example certain keywords), we have to analyse not only a racist corpus, but also an anti-racist one and, certainly, general language corpus and then to contrast linguistic features obtained from these different corpora; Now they don’t in fact analyze a “general language corpus.” Instead they contrast their “racist corpus” with an “anti-racist corpus.” On the basis of this contrast they describe the linguistic features that allegedly characterize what they call “racist discourse.” They don’t contrast their “racist corpus” with a collection of science-fiction novels or a corpus culled from discussions of Christian theology. Their “racist discourse” exists as a distinct discourse only insofar as their web-based “racist corpus” differs linguistically from their web-based “anti-racist corpus.” They have a practical reason for this choice, namely that “racist and anti-racist materials share some similarities (for example certain keywords).” They mean that racialists and anti-racialists are likely to discuss similar subjects. Filtering only for racialist content could therefore also detect anti-racialist texts. A keyword like “immigration” could detect both those who strongly approve of non-White immigration and those who strongly disapprove. So they must, if they hope to detect racialist content while avoiding anti-racialist content, devise some system—a system not focused directly on content (e.g. on nouns from the semantic field “race”)—to distinguish good discussions of racial matters (“anti-racism”) from bad discussions of racial matters (“racism”). Although it is highly unlikely that a detection system based on the “clues for the detection of racist content” they have assembled (e.g. 
an abundance of adverbs) would work, we can at least see its practical purpose from their Stalinist perspective. But their method is moronically wrong if they want, as they say they do, to describe a linguistically distinct “racist discourse.” It would be convincing only if Jehovah or Odin descended to earth and officially declared the linguistic practices of anti-racist websites to be representative of normal language use. Absent some authoritative declaration to that effect, their description of the linguistic features of “racist discourse” is worthless. Anti-racialist websites list names and addresses of their opponents in order to encourage violence against them. The authors of this report are surely aware of that fact. They chose to lie about it. in many cases racist authors do not believe they deserve the label ‘racist’ ... Which should require, if the Stalinists responsible for this report had any intellectual integrity, defining “racism” and “racist,” since, as they have just acknowledged, there is a dispute about the meaning of the terms. There are three categories of “racist” in their report: (i) avowed “racists”; (ii) secret “racists” who unsuccessfully try to conceal their “racism”; (iii) unwitting “racists” who don’t believe they are “racists” but really are. Evidently the writings of all three have been tossed indiscriminately into the “racist corpus,” with no attempt by the report’s authors to describe their common beliefs. 18
Posted by Rnl on Wed, 31 Jan 2007 16:38 | # Dinesh the Dhimmi By Serge Trifkovic [...] D’Souza uses “Islamophobia” with the implicit assumption that the term’s meaning is well familiar to his readers. For the uninitiated it is nevertheless necessary to spell out its formal, legally tested definition, however. It is provided by the European Monitoring Centre on Racism and Xenophobia (EUMC), a lavishly-funded organ of the European Union. Based in Vienna, this body diligently tracks the instances of “Islamophobia” all over the Old Continent and summarizes them in its reports. The Monitoring Center’s definition of Islamophobia includes eight salient features: 1. Islam is seen as a monolithic bloc, static and unresponsive to change. 2. Islam is seen as separate and “other.” 3. Islam is seen as inferior to the West, barbaric, irrational, primitive and sexist. 4. Islam is seen as violent, aggressive, supportive of terrorism and engaged in a clash of civilizations. 5. Islam is seen as a political ideology. 6. Criticisms made of the West by Islam are rejected out of hand. 7. Hostility towards Islam is used to justify discriminatory practices towards Muslims and exclusion of Muslims from mainstream society. 8. Anti-Muslim hostility is seen as natural or normal. This definition is obviously intended to preclude any possibility of meaningful discussion of Islam. The implication that Islamophobia thus defined demands legal sanction is a regular feature of the Race Relations Industry output. It also routinely refers to “institutional Islamophobia” as an inherent social and cultural sickness of most Western societies that needs to be rooted out by education, re-education, and legislation. In reality, of course, all eight proscribed statements are to some extent true. http://frontpagemagazine.com/Articles/ReadArticle.asp?ID=26585 19
Posted by Elena Haskins on Wed, 31 May 2023 02:36 | # Thank you so much for this article re: the Princip Project. I have referred various persons to this article so they can understand the type of scrutiny White “Gentile” Racialists endure. All Best,
Posted by Guessedworker on Wed, 20 Sep 2006 18:14 | #
Straight out of the sick philosophy of unique white evil ... only European-native peoples are capable of the dreaded Jewish-liberal sin.
Presumably, lower-order legal folk of the race-traitorous, Hebraic or merely vibrant type will be required, at public expense, to sift through the tens of thousands of instances of the Jewish-liberal sin, and pick out the choicest morsels for repressive measures.
I’d better post the appeal for my defence fund now.