Subject: Re: [ANNOUNCEMENT] Test: tesseract-ocr-4.0.0-0.4
To: cygwin@cygwin.com
References: <announce.16d441c3-aa21-cdc5-949d-966a4acdb94d@gmail.com> <625b4126-96cc-cd4e-c309-b0f9a1eb4895@weilnetz.de> <13b7b33e-e62f-0ce9-e313-0c0fa73051fe@gmail.com>
From: Stefan Weil <sw@weilnetz.de>
Message-ID: <dc956285-c28f-478c-999d-44a990fde238@weilnetz.de>
Date: Thu, 9 Aug 2018 11:19:13 +0200
In-Reply-To: <13b7b33e-e62f-0ce9-e313-0c0fa73051fe@gmail.com>

On 09.08.2018 at 10:19, Marco Atzeri wrote:
> My understanding is that the trained data "tessdata, tessdata_fast,
> tessdata_best" come from the same training data as version 3
> 
> https://github.com/tesseract-ocr/langdata
> 
> It is not as if the languages' raw data should have changed.
> 
> Regards
> Marco

https://github.com/tesseract-ocr/langdata applies to Tesseract 3.05.x
and earlier versions.

Tesseract 4.0.0 still supports the old traineddata format, but adds new
(and typically better) traineddata based on neural networks. There is
currently no langdata available for the new traineddata.

tessdata_best contains only the new traineddata.

tessdata_fast also contains only new traineddata, but its models are
faster and less accurate.

tessdata still contains old traineddata for most languages and, in
addition, new traineddata derived from tessdata_best that use integer
instead of float models (which makes them faster).
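
For anyone scripting a package build, the three repositories described
above can be summarised as a small lookup table. This is only a sketch
in Python: the class name, the field names and the helper function are
made up for illustration and are not part of Tesseract, and the flags
simply restate the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TessdataRepo:
    """Illustrative record for one traineddata repository (hypothetical type)."""
    name: str
    has_old_models: bool   # old traineddata format (Tesseract 3.05.x and earlier)
    has_new_models: bool   # new traineddata based on neural networks
    note: str


# Flags restate the description in this mail; they are not read from the repos.
REPOS = [
    TessdataRepo("tessdata_best", False, True,
                 "new traineddata only"),
    TessdataRepo("tessdata_fast", False, True,
                 "new traineddata only; faster and less accurate"),
    TessdataRepo("tessdata", True, True,
                 "old traineddata for most languages plus integer "
                 "versions of the tessdata_best models"),
]


def repos_with_old_models(repos=REPOS):
    """Return the names of repositories that still ship old traineddata."""
    return [repo.name for repo in repos if repo.has_old_models]


if __name__ == "__main__":
    for repo in REPOS:
        print(f"{repo.name}: {repo.note}")
    print("old format available in:", repos_with_old_models())
```

A packaging script could use such a table to decide which repository
to pull from when the old format is still required.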

tessdata_best, tessdata_fast and tessdata contain traineddata not only
for many languages but also for "scripts"; see for example
https://github.com/tesseract-ocr/tessdata/tree/master/script. Those
models support all languages that use the same script, so
https://github.com/tesseract-ocr/tessdata/blob/master/script/Latin.traineddata
supports all languages written in Latin characters (English, French,
Spanish, Italian, German, Danish, ...). A selection of those script
models would be useful for Cygwin, too.
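
To make the naming scheme of the script models concrete, here is a
small Python sketch that builds the GitHub URL of a script model from
a repository name and a script name. The function name is hypothetical.
Only the tessdata/Latin URL is taken from this mail; that tessdata_best
and tessdata_fast use the same script/ layout, and any script name
other than Latin, are assumptions that should be checked against the
repositories.

```python
BASE_URL = "https://github.com/tesseract-ocr"

# Repository names mentioned in this mail.
KNOWN_REPOS = ("tessdata", "tessdata_best", "tessdata_fast")


def script_model_url(script, repo="tessdata", branch="master"):
    """Build the GitHub page URL of a script model (hypothetical helper).

    Assumes every repository keeps script models under script/<Name>.traineddata,
    as tessdata does for Latin.
    """
    if repo not in KNOWN_REPOS:
        raise ValueError(f"unknown repository: {repo}")
    return f"{BASE_URL}/{repo}/blob/{branch}/script/{script}.traineddata"


if __name__ == "__main__":
    print(script_model_url("Latin"))
```

Called with the defaults, script_model_url("Latin") reproduces the URL
quoted above.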

Regards,
Stefan
